<![CDATA[Kentik Blog Posts]]><![CDATA[Read the latest industry updates and hot topics in enterprise-scale network monitoring from the Kentik blog. Stay in the know on the latest with Kentik.]]>https://www.kentik.com/blog/RSS for NodeMon, 17 Feb 2025 15:24:18 GMT<![CDATA[Evaluating Cloud Gateways for Cost and Performance]]><![CDATA[Cloud networking costs can escalate due to inefficient routing and limited visibility. Kentik’s cloud visibility and analytics solution helps engineers optimize transit, reduce costs, and improve performance by analyzing AWS Transit Gateways and exploring alternatives like direct peering, storage endpoints, and AWS CloudWAN.]]>https://www.kentik.com/blog/evaluating-cloud-gateways-for-cost-and-performancehttps://www.kentik.com/blog/evaluating-cloud-gateways-for-cost-and-performance<![CDATA[Phil Gervasi]]>Thu, 13 Feb 2025 05:00:00 GMT<p>As many enterprises are painfully aware, cloud networking costs can quickly spiral out of control. It can be a struggle to see how data moves across cloud environments, leading to unexpected expenses and inefficient resource usage. Understanding where traffic is flowing, how much data is moving, and whether there are more cost-effective routing options is crucial for maintaining both network performance and financial efficiency.</p> <p>Kentik provides a powerful suite of data analysis, exploration, and reporting tools designed to give engineers deeper insights into their cloud networking activity. By using Kentik’s capabilities, engineers can identify inefficiencies and control costs more effectively by optimizing cloud transit.</p> <div as="Promo"></div> <h2 id="managing-cloud-transit-efficiently">Managing cloud transit efficiently</h2> <p>Cloud service providers offer Transit Gateways to simplify connectivity between VPCs and on-premises networks. While these gateways consolidate connectivity, they also introduce significant costs – especially when processing large amounts of data. This is why understanding how and where these gateways are used can help us optimize our cloud architecture.</p> <p>Kentik ingests and processes a variety of <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/" title="Kentik Blog: Best Practices for Enriching Network Telemetry to Support Network Observability">network telemetry</a>, including <a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/" title="Kentipedia: What are VPC Flow Logs?">cloud flow logs</a> and metrics, which we can then use to analyze heavily used <a href="https://www.kentik.com/kentipedia/aws-transit-gateway-explained/" title="Kentipedia: AWS Transit Gateway Explained">transit gateways</a>, identify high-traffic patterns, and explore alternative connectivity options.</p> <h2 id="analyzing-transit-gateway-traffic-in-kentik">Analyzing Transit Gateway traffic in Kentik</h2> <p>Kentik’s Data Explorer provides the ability to explore and find detailed insights from AWS and Azure traffic patterns. To evaluate heavily used Transit Gateways, you can run queries sorting key traffic dimensions.</p> <p>For example, we can build a query to identify unnecessarily high-cost data transfers. 
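</p> <p>To make that concrete, here is a minimal sketch of the same idea expressed in code, using pandas over flow records already exported from Kentik. The field names and the $0.02/GB processing rate are illustrative assumptions, not Kentik’s actual schema or your actual AWS pricing.</p> <pre><code># Sketch: flag flows that traverse a Transit Gateway even though source
# and destination sit in the same VPC -- traffic that incurs TGW
# data-processing charges it doesn't need to.
# Field names are hypothetical, not Kentik's schema.
import pandas as pd

flows = pd.DataFrame([
    {"src_vpc": "vpc-a", "dst_vpc": "vpc-a",
     "src_interface_type": "transit_gateway", "bytes": 9_400_000_000},
    {"src_vpc": "vpc-a", "dst_vpc": "vpc-b",
     "src_interface_type": "transit_gateway", "bytes": 2_100_000_000},
    {"src_vpc": "vpc-b", "dst_vpc": "vpc-b",
     "src_interface_type": "eni", "bytes": 5_000_000_000},
])

# Intra-VPC traffic that nonetheless crossed a TGW
wasteful = flows[(flows["src_vpc"] == flows["dst_vpc"])
                 & (flows["src_interface_type"] == "transit_gateway")]

gb = wasteful["bytes"].sum() / 1e9
print(f"Intra-VPC bytes via TGW: {gb:.1f} GB")
print(f"Avoidable processing cost at an assumed $0.02/GB: ${gb * 0.02:,.2f}")
</code></pre> <p>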
In the image below, notice that we can filter our data to help pinpoint traffic traversing our Transit Gateways.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Pb5yxwV3IJGiz4nd3OZ81/83bd984695062ba1b643039eccc6340a/transit-gateways-1.png" style="max-width: 800px;" class="image center" thumbnail withFrame="shadowBall" alt="Transit Gatewats showing internal traffic" /> <p>Here, we can see something concerning. There is significant intra-VPC traffic, or in other words, internal traffic, using a Transit Gateway. This incurs cost, can impact performance, and is unnecessary because the traffic is only internal.</p> <p>Notice that the Transit Gateway is listed under Source Interface Type and identified by name under Source ENI Entity Name. Exploring our cloud data in this way can help us make informed decisions to optimize cloud network activity.</p> <p>Adjusting our filters allows us to gain more insight. The following image provides a visual breakdown of our applied dimensions and filters, helping us understand which traffic is ingressing or egressing, where it’s headed, and which applications are generating traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1YM2nHGZsRyi0Rr2n05VdH/839d4d7737b4408c8542f1b15db0cfdb/transit-gateways-filtering.png" style="max-width: 360px;" class="image right no-shadow" alt="Filtering options" /> <p>In this way, we can use these filters and dimensions to pinpoint which Transit Gateways are handling the most traffic, flagging potential areas where cost optimization strategies could be applied.</p> <h2 id="transit-gateway-flow-logs">Transit Gateway flow logs</h2> <p>In some cases, VPC flow logs may not be the right choice for assessing traffic across a Transit Gateway. These logs can contain additional information specific to transit gateways, but they may also lack information specific to VPCs. So, they make a great addition to VPC flow logs when monitoring overall network health.</p> <p>They can also provide additional packet loss information for each flow to help troubleshoot difficult problems like large MTU drops, TTL expired, no route, and blackhole routes. These issues can be challenging to track down, so Transit Gateway flow logs can quickly help isolate the problem and associated workloads!</p> <h2 id="alternatives-to-transit-gateways">Alternatives to Transit Gateways</h2> <p>Once we identify high-traffic transit gateways, we can take steps to optimize cloud connectivity and reduce costs.</p> <p>First, we can consider direct peering. If large volumes of data are moving between VPCs frequently, peering may be a more cost-effective option than routing traffic through a Transit Gateway.</p> <p>Next, we can use storage endpoints. For workloads that frequently access cloud storage services, using dedicated storage endpoints instead of routing through a TGW can definitely reduce costs.</p> <p>Another alternative is to utilize AWS CloudWAN. While AWS CloudWAN is a Transit Gateway under the hood, the significant difference is that it is owned and managed by AWS. Unfortunately, that means Transit Gateway flow logs are unavailable to CloudWAN customers. However, Kentik uses the AWS Network Manager API to collect CloudWan metadata and enrich standard VPC flow logs to provide the same actionable insights you get when using Transit Gateways.</p> <p>Fourth, we should monitor data processing trends. 
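</p> <p>As a sketch of what trend monitoring can look like in code, the snippet below rolls synthetic daily Transit Gateway volumes into weekly sums and flags sustained growth. The data and the 20% week-over-week threshold are arbitrary illustrations.</p> <pre><code># Sketch: watch TGW data-processing volume for sustained growth.
# Synthetic data; the 20% week-over-week threshold is arbitrary.
import pandas as pd

idx = pd.date_range("2025-01-06", periods=35, freq="D")  # five full weeks
daily_gb = pd.Series([40 + i * 1.5 for i in range(35)], index=idx)

weekly = daily_gb.resample("W").sum()
growth = weekly.pct_change().dropna()

for week, pct in growth.items():
    flag = "  <-- investigate" if pct > 0.20 else ""
    print(f"{week.date()}  {weekly[week]:7.1f} GB  {pct:+.1%}{flag}")
</code></pre> <p>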
Kentik’s historical traffic analysis allows us to track usage trends over time and adjust cloud architecture accordingly.</p> <p>Lastly, we need to be alerted when utilization is high. In Kentik, we can set up automated alerts when TGW usage spikes unexpectedly or, as in the cases above, when traffic is unnecessarily traversing paths it shouldn’t be. This provides us with proactive investigation and mitigation capabilities.</p> <p>With rising cloud networking costs, engineers need powerful tools to understand and optimize traffic patterns. Kentik’s advanced data analysis and reporting capabilities give us the visibility we need to manage AWS Transit Gateways more effectively. By analyzing traffic with Data Explorer, we can identify high-traffic gateways, explore alternative routing strategies, and ultimately reduce unnecessary cloud spending.</p><![CDATA[Celebrating a Year of Impact: Kentik Cares Giving Program]]><![CDATA[In 2024, the Kentik Cares Giving Program achieved incredible success! We've collectively donated $14,904 to causes close to our hearts. From local shelters to global relief efforts, we’ve made a real impact. Curious to see how we did it and what’s next? Read on to learn more about our first year of giving back -- and the positive change we’re creating together!]]>https://www.kentik.com/blog/celebrating-a-year-of-impact-kentik-cares-giving-programhttps://www.kentik.com/blog/celebrating-a-year-of-impact-kentik-cares-giving-program<![CDATA[Mary LeBlanc]]>Tue, 11 Feb 2025 05:00:00 GMT<p>As we reflect on 2024, we’re thrilled to celebrate another successful year of our Kentik Cares Giving Program. Launched in December 2017 as a seasonal initiative around the holidays — a time when many of us think about giving back — the program has since grown alongside Kentik, becoming a year-round effort in 2024. This shift made 2024 our most successful year yet, with increased participation and a bigger impact than ever before.</p> <h2 id="transparency-and-tracking">Transparency and tracking</h2> <p>The Kentik Cares program encourages employees to donate to causes they care about, with Kentik matching up to $200 per employee. To help track our progress, we created the Kentik Cares Impact Dashboard, which allowed employees to easily monitor donations and company matches throughout the year. This tool highlighted the difference we made together, showing the tangible impact of our collective efforts. Thanks to the generosity of our team, we’ve <strong>donated a total of $14,904</strong> to various charitable organizations!</p> <img src="//images.ctfassets.net/6yom6slo28h2/3b1nF89EIQTXCQ9c0DLLMH/4d19fb936d8062d37c0cdc9c37a1c4d2/kentik-cares-2024.jpg" style="max-width: 600px;" class="image center" alt="Kentik Cares 2024" /> <h2 id="spreading-the-support">Spreading the support</h2> <p>Together, we supported nearly 30 organizations around the world—nearly one for each U.S. state where a Kentikian lives! Some of the organizations we supported include:</p> <ul> <li><a href="https://www.sandiegohabitat.org/">San Diego Habitat for Humanity</a></li> <li><a href="https://wck.org/">World Central Kitchen</a></li> <li><a href="https://www.aclu.org/">ACLU</a></li> <li><a href="https://renfrewlibrary.ca/">Renfrew Public Library</a></li> <li><a href="https://www.catf.us/">Clean Air Task Force</a></li> </ul> <p>We want to extend a heartfelt thank you to every employee who contributed. Your generosity helped make the Kentik Cares Giving Program a resounding success. 
Together, we’ve proven that even small acts of kindness, multiplied by the power of community, can create real change.</p> <h2 id="looking-ahead">Looking ahead</h2> <p>What started as an end-of-year tradition has now evolved into a year-round opportunity to reflect our shared values. We’re incredibly proud of our accomplishments in 2024 and are excited to carry that momentum into the new year. For 2025, we’re setting a new goal: $20,000 in collective donations. Let’s continue creating positive change and strengthening the communities we serve.</p> <p>Thank you to everyone who participated in the Kentik Cares Giving Program. Together, we made a meaningful difference!</p><![CDATA[Data-Driven AI: The Key to Network Observability]]><![CDATA[AI has the power to optimize network performance, detect anomalies, and improve capacity planning, but its effectiveness depends on high-quality data. Without a strong data foundation, AI remains a theoretical tool rather than delivering real operational value.]]>https://www.kentik.com/blog/data-driven-ai-the-key-to-network-observabilityhttps://www.kentik.com/blog/data-driven-ai-the-key-to-network-observability<![CDATA[Phil Gervasi]]>Thu, 30 Jan 2025 05:00:00 GMT<p>AI has captured the imagination of many industries, promising transformative insights and automation. In the realm of network operations, AI has the potential to revolutionize performance optimization, anomaly detection, capacity planning, and the ability to glean meaningful insight from network telemetry. However, realizing these benefits isn’t as simple as flipping a switch. AI thrives on high-quality data, and the reality is that most organizations underestimate the complexities involved in achieving this foundational requirement.</p> <p>Without sufficient high-quality data, AI becomes an experiment or a theoretical exercise rather than a practical tool. Understanding the data challenges of AI is essential for network admins, cloud engineers, and IT managers to deploy solutions that deliver real value.</p> <h2 id="the-importance-of-data-to-ai">The importance of data to AI</h2> <p>An AI system will only be as good as the data we feed it. Poor quality or insufficient data leads to inaccurate predictions, unreliable results, and ineffective automation. To be of any real value, AI systems need vast amounts of data to learn patterns and make reliable predictions. The nature of networking, in particular, poses some challenges to an effective AI implementation.</p> <p>First, networks are complex, dynamic systems with diverse telemetry sources. Network telemetry includes a range of data, such as flows, metrics, routing tables, contextual metadata, etc. This results in two problems: dealing with a massive volume of data and wildly different data types.</p> <p>Second, this raw data is often riddled with noise, inconsistencies, errors, and missing values. Cleaning this data—removing outliers, filling missing values, and ensuring consistency—is critical for AI to work effectively. In fact, data processing and preparation are often the bulk of the activity involved in implementing an AI solution.</p> <p>Next, AI insights need to be based on up-to-date information. Stale data can result in irrelevant decisions, especially in dynamic network environments. 
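</p> <p>To make the data-preparation challenges above concrete, here is a minimal sketch of the kind of cleanup a polled interface-utilization series needs before a model ever sees it: gaps filled and outliers handled. The series is synthetic, and the median/MAD rule is just one common choice among many.</p> <pre><code># Sketch: basic telemetry cleaning before any model sees the data --
# interpolate missing polls, then replace outliers using a robust
# median/MAD rule. Synthetic series; tune thresholds per metric.
import pandas as pd

idx = pd.date_range("2025-01-01", periods=12, freq="5min")
util = pd.Series([42, 44, None, 43, 41, 999, 45, None, 44, 46, 43, 42],
                 index=idx, name="if_util_pct")

util = util.interpolate()                    # fill missing polling intervals
median = util.median()
mad = (util - median).abs().median()         # median absolute deviation
outliers = (util - median).abs() > 5 * mad   # the 999 spike gets caught here
cleaned = util.mask(outliers, median)

print(cleaned.round(1))
</code></pre> <p>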
Handling real-time telemetry data in an AI workflow is a challenge for even skilled data scientists and engineers, so careful attention must be given to how telemetry data is ingested.</p> <p>Lastly, data must be efficiently stored, processed, and accessible to AI models in real or near-real times. A poorly designed database can prevent efficient queries and produce inaccurate results. High latency between an AI system and relevant databases can slow a workflow and adversely affect network operations.</p> <p>Addressing these requirements is not trivial in the least. It requires a sophisticated infrastructure and approach to data management. On top of this are the common challenges network operations teams already face, including data silos, volume overload, and a fragmented ecosystem.</p> <div as="Promo"></div> <h2 id="how-kentik-addresses-the-data-challenge">How Kentik addresses the data challenge</h2> <p>Kentik’s platform is purpose-built to solve the data challenges that hinder effective AI deployment in network operations. By focusing on the foundational elements of data ingestion, storage, cleaning, and processing, Kentik enables AI to deliver actionable insights with accuracy and relevance.</p> <p><strong>Kentik’s approach to data can be broken down into several components.</strong></p> <p>Kentik ingests a significant variety of network telemetry data, including flow records (NetFlow, sFlow, IPFIX), SNMP metrics, streaming telemetry, cloud flow logs, BGP routing tables, and enriched metadata. This breadth ensures that AI models have access to the diverse data required to understand the full context of network behavior.</p> <p>The platform was designed from the beginning to handle the massive scale of modern networks. Kentik’s advanced storage architecture retains detailed telemetry data over time, enabling longitudinal analysis and access to rich historical data.</p> <p>A showstopper for many organizations interested in implementing an AI solution is data processing. Kentik automatically processes raw telemetry to eliminate noise and inconsistencies. This includes deduplication, normalization across data types, and enrichment with additional contexts, such as geolocation and device-level and organization-level metadata.</p> <p>Kentik’s real-time analytics engine processes incoming data as it’s generated, providing the low-latency insights required for day-to-day network operations. This ensures that AI-driven alerts and recommendations are always based on the latest network information.</p> <p>Because Kentik already ingests, cleans, and processes vast volumes of network telemetry, it’s perfectly positioned to support AI models. Rather than starting from scratch, organizations can leverage Kentik’s pre-built infrastructure and automated intelligence functionality to accelerate their own AI initiatives.</p> <h2 id="kentik-journeys">Kentik Journeys</h2> <p><a href="https://www.kentik.com/solutions/kentik-ai/">Journeys</a> provides an AI-augmented user experience built on this foundation of data. Since Kentik already ingests, processes, and updates vast amounts of network telemetry data, engineers and IT managers using Journeys can gain insights from data more efficiently and perform advanced analyses to achieve a meaningful understanding without architecting, building, and maintaining sophisticated data repositories.</p> <p>By solving the most complex parts of the data challenge, Kentik enables network operators to focus on AI’s outcomes rather than the intricacies of data preparation. 
For network engineers and IT professionals, this means that Kentik doesn’t just make AI possible; it makes AI practical and effective for network operations.</p><![CDATA[TikTok Emerges from Shutdown Without Bytedance’s US CDN]]><![CDATA[Kentik's Doug Madory looks into this weekend's 14-hour outage of popular video sharing service TikTok, which was slated to be banned from the US per recent legislation. While TikTok came back, it is notably no longer being served by parent company Bytedance’s US CDN. We delve into the traffic statistics in this blog post.]]>https://www.kentik.com/blog/tiktok-emerges-from-shutdown-without-bytedances-us-cdnhttps://www.kentik.com/blog/tiktok-emerges-from-shutdown-without-bytedances-us-cdn<![CDATA[Doug Madory]]>Tue, 21 Jan 2025 05:00:00 GMT<p>On Saturday evening, TikTok ceased operating in the United States as a law banning the service went into effect. However, 14 hours later, the popular app, used by 170 million users in the US, resumed service — but notably without the support of parent company Bytedance’s US CDN.</p> <h2 id="background">Background</h2> <p>The Chinese-owned video sharing service became a political target in recent years, beginning with President Trump’s <a href="https://www.npr.org/2020/08/06/900019185/trump-signs-executive-order-that-will-effectively-ban-use-of-tiktok-in-the-u-s">call to ban the service</a> in the final months of his first administration. Despite TikTok <a href="https://www.nytimes.com/2020/09/13/technology/tiktok-microsoft-oracle-bytedance.html">signing a deal</a> to store US data with US database and cloud computing company Oracle, TikTok’s <a href="https://www.cbsnews.com/news/why-is-tiktok-being-banned-supreme-court-congress/">purported links</a> to the Chinese government led Congress to pass a law banning it from the United States if it wasn’t sold to a US company.</p> <p>On Thursday, the <a href="https://abcnews.go.com/Politics/biden-administration-leave-trump-implement-tiktok-ban/story?id=117753133">Biden administration announced</a> that it would not enforce the ban, essentially leaving it up to the incoming Trump administration to handle. And on Friday, the US Supreme Court <a href="https://www.pbs.org/newshour/politics/supreme-court-upholds-tiktok-ban-if-not-sold-by-chinese-trump-has-promised-a-solution">voted unanimously</a> to uphold the ban set to take effect on Sunday, January 19 — awkwardly, one day before President Trump was to take office.</p> <p>In the waning hours of Saturday, January 18, TikTok went dark.</p> <h2 id="the-outage">The outage</h2> <p>TikTok is one of hundreds of services analyzed by Kentik’s <a href="https://www.kentik.com/product/subscriber-intelligence/">OTT Service Tracking</a>. Whether it’s <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-the-tyson-paul-fight-on-netflix/">Netflix’s Mike Tyson fight</a>or <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday-aug2024/">Microsoft’s Patch Tuesday</a>, this unique functionality gives users the ability to explore many dimensions of how traffic for a particular service is sourced and delivered.</p> <p>Based on our aggregate NetFlow data from our OTT Service Tracking customers, we observed traffic for TikTok ceased at 03:30 UTC on Sunday, January 19 (or 10:30pm ET on Saturday, January 18). 
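</p> <p>Detecting a cutoff like this in aggregate flow data is conceptually simple: compare each interval to a trailing baseline and flag a collapse. A toy sketch, with synthetic numbers and an arbitrary threshold:</p> <pre><code># Toy sketch: flag when a service's traffic collapses relative to a
# trailing baseline. Numbers are synthetic Gbit/s per 30-minute bin.
bps = [9.1, 9.4, 9.2, 9.0, 9.3, 0.4, 0.1, 0.1]

WINDOW, THRESHOLD = 4, 0.20
for i in range(WINDOW, len(bps)):
    baseline = sum(bps[i - WINDOW:i]) / WINDOW
    if bps[i] < baseline * THRESHOLD:
        print(f"bin {i}: {bps[i]} Gbit/s vs {baseline:.1f} baseline -- possible outage")
        break
</code></pre> <p>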
The <a href="https://x.com/DougMadory/status/1880951684064760284">dropoff in traffic</a>, broken down by source CDN, is depicted below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5oW2W2TIIIND3GD7loYJui/082855305d3094e114803f386124465a/TikTok_Shutdown_3day_linechart.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik portal showing TikTok shutdown, 3 days" /> <h2 id="the-return">The return</h2> <p>Only 14 hours after TikTok went down, the <a href="https://x.com/TikTokPolicy/status/1881030712188346459">company announced</a> that it was in the process of restoring service, thanking the incoming president for “providing the necessary clarity and assurance to our service providers that they will face no penalties providing TikTok.”</p> <p>At 17:30 UTC (12:30pm ET), users <a href="https://x.com/sevikasbabymama/status/1881030565299941703">began reporting</a> that they were able to access the service again, and we started seeing TikTok traffic begin to flow — only now with a change to the composition of traffic sources: no traffic from Bytedance’s US CDN.</p> <p>The graph below shows TikTok traffic to the US over the past week by source CDN. Prior to the outage, Bytedance’s US CDN (blue line) was clearly the largest source of TikTok traffic in bits/sec. After the outage, it was delivering none, with Akamai (green), Fastly (orange) and CDN77 (red) taking up the slack.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4RBAP0qPG53HmSKaePOFY3/4b71a229620b7a29f29178ff5ab9f292/TikTok_Shutdown_7day_linechart.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik portal showing TikTok shutdown, 7 days" /> <p>And we weren’t the <a href="https://x.com/AkiAndFam/status/1881381016222441534">only ones to notice</a> that Bytedance’s US CDN is no longer sending traffic. Below is a broader view at the dropoff in traffic from AS396986 to the US in the past week — each line represents a different US provider (unique destination AS).</p> <p><img src="https://images.ctfassets.net/6yom6slo28h2/5FLFbhE9zCpLrsd2xBtRHf/73fb87ccf8c1d449229e4f3112036374/us-internet-traffic-bytedance.png " style="max-width: 800px;" class="image center" thumbnail withFrame alt="US Internet Traffic from Bytedance in the Kentik portal" /></p> <h2 id="conclusion">Conclusion</h2> <p>At the time of this writing, if you search for TikTok on the Apple App Store in the US, you get a message that reads, ”TikTok and other ByteDance apps are not available in the country or region you’re in.” So, the future of TikTok is still very much up in the air in the US.</p> <p>Like the App Store message, the loss of ByteDance’s US CDN as a source of TikTok traffic may simply be another measure of precaution taken by TikTok to avoid incurring the penalties from the ban.</p> <p>Several <a href="https://www.tiktok.com/@jazzygrimdo/video/7462113378908179758">TikTok users</a> have reported that the service behaves differently following the restoration. While it is hard to empirically measure the behavior of its algorithm, it is definitely the case that the service is hosted differently.</p><![CDATA[Democratizing Access to Network Telemetry with Kentik Journeys]]><![CDATA[In this post, discover how Kentik Journeys integrates large language models to revolutionize network observability. 
By enabling anyone in IT to query and analyze network telemetry in plain language, regardless of technical expertise, Kentik breaks down silos and democratizes access to critical insights simplifying complex workflows, enhancing collaboration, and ensuring secure, real-time access to your network data. ]]>https://www.kentik.com/blog/democratizing-access-to-network-telemetry-with-kentik-journeyshttps://www.kentik.com/blog/democratizing-access-to-network-telemetry-with-kentik-journeys<![CDATA[Phil Gervasi]]>Thu, 16 Jan 2025 05:00:00 GMT<p>Modern networks generate an enormous variety and volume of telemetry data, which has become increasingly difficult to query, especially in real time. Network observability, including the use of data analysis methods, emerged to solve this problem, and we saw platforms such as Kentik bring these components — network visibility and data analysis — together in a single platform.</p> <p>But we’re not done yet. IT silos still exist, and disparate teams still struggle to understand why an application is performing poorly over the network. For years, the dream was that the security, network, systems, application, and business teams would be able to work closely together to understand their digital environment, solve problems, and make collaborative decisions.</p> <p>Network observability can help us get closer, especially if we make it easy for anyone to access information.</p> <p>At Kentik, we believe a network observability platform must enable engineers to ask any question about their network. That means the platform must allow engineers to filter, parse, and visualize data exactly how they need to get the answers they’re looking for.</p> <p>However, this requires a good understanding of networking, including terminology, jargon, vendor-specific knowledge, and familiarity with device syntax. Doing this at scale would require a solid understanding of SQL, Python, common data analysis libraries, etc.</p> <p>The problem here is that the ability to ask any question of your network is often limited to highly skilled network specialists. And unfortunately, that still can mean dealing with siloes.</p> <div as="Promo"></div> <h2 id="llms-democratize-information">LLMs democratize information</h2> <p>This is where LLMs come in.</p> <p>Large language models understand the natural language we input into a prompt, inasmuch as transformer models and semantic similarity allow an LLM to “understand.” And because the largest LLMs have been trained on enormous bodies of training text, including network vendor documentation, blog posts, r/networking threads, and pages and pages of code, they can synthesize a response in surprisingly sophisticated ways.</p> <p>This means anyone can use an LLM to generate accurate code, which is currently one of the most common uses for LLMs in IT. But what if we incorporate that ability into a workflow that runs a query for us? That can allow almost anyone in our IT organization to mine through massive datasets much faster and easier, regardless of their skill level, job function, or team.</p> <p>In this way, an entry-level network admin can ask sophisticated security questions, which would require a significant amount of clue-chaining by a more senior engineer, who is likely an engineer on an entirely different team. 
Answering a question like that traditionally means logging into devices to run <em>show</em> commands, running simple scripts to pull data from devices, and writing complex scripts to analyze telemetry already in a database.</p> <p>Kentik already makes this easier by ingesting and processing various network telemetry and providing the data exploration tools to query those databases. However, running the right queries still requires knowledge of terminology and technical concepts.</p> <p>Integrating an LLM into this workflow changes that, elevating network observability to the next level. It makes “asking any question about your network” much more literal and something anyone in the IT organization can do.</p> <h2 id="a-real-world-example">A real-world example</h2> <p>Consider a level 1 engineer investigating a possible compliance incident: application traffic may have egressed a VPC in AWS US-EAST-1 within the last 24 hours, destined for an embargoed country.</p> <p>In this scenario, the engineer would need an understanding of AWS networking, terminology, and commands to filter traffic in the cloud, as well as the knowledge of what to look for (and the appropriate syntax) for the routing and security devices involved. Then there’s the added complexity of constraining the search to the last 24 hours.</p> <p>The first step for many is to escalate the ticket, which would likely delay the resolution. However, with an LLM interface to a data analysis workflow, the level 1 engineer could simply ask the system in plain language whether any traffic was leaving the VPC heading for the embargoed countries in question. The system would accomplish the query without anyone needing to know AWS, Cisco routing commands, which embargoed countries map to what IP addresses, and so on.</p> <p>The workflow would begin with an LLM interface and look something like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/55soBT9KnWP1ecLShlg4cX/e287838025919fcaf9216a28ae9d0778/workflow-with-llm.png" style="max-width: 800px;" class="image center" alt="Workflow showing LLM" /> <p>Instead of having the LLM synthesize the response, our workflow could also return the query results in a structured format so we can create a visualization or alert. Below is an example output from <a href="https://www.kentik.com/solutions/kentik-ai/">Kentik Journeys</a> showing how this scenario would work.</p> <p>First, we simply ask the system what we want to know in the LLM prompt window.</p> <img src="//images.ctfassets.net/6yom6slo28h2/EhgtpdAUhnO3G2VgB7htf/20ca9943cffe39335d6e5e4fe77cb74b/llm-prompt.png" style="max-width: 800px;" class="image center" alt="Prompt example" /> <p>Then, the system, powered by an LLM that interprets the prompt, generates the appropriate query (in this case, in SQL, with output omitted for brevity). Notice that the LLM understood the desired timeframe, the need for geographical identifiers (embargoed countries), etc.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1vxPcHPbq2Bxa9TccCti4Z/9d7cdd24dd8a595e5bf37c902728c038/feature-llm-workflow.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="LLM example code" /> <p>In our workflow, the system runs that query against our network telemetry database.
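</p> <p>Stripped to its skeleton, that loop looks something like the sketch below. This is an illustration, not Kentik’s implementation: <code>llm_complete()</code> is a stand-in for any LLM client, and the table schema and country codes are hypothetical.</p> <pre><code># Skeleton of the prompt -> SQL -> database -> answer loop.
# llm_complete() is a stand-in for a real LLM call; a production system
# would send the schema plus the user's question and get SQL back.
import sqlite3

def llm_complete(question: str) -> str:
    return ("SELECT dst_country, SUM(bytes) AS total_bytes "
            "FROM flows WHERE region = 'us-east-1' "
            "AND dst_country IN ('CU', 'IR', 'KP', 'SY') "
            "GROUP BY dst_country")

def answer(question: str, db: sqlite3.Connection):
    sql = llm_complete(question)       # 1. LLM turns the question into SQL
    rows = db.execute(sql).fetchall()  # 2. run it against the telemetry DB
    return sql, rows                   # 3. rows go back to the LLM or a chart

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE flows (region TEXT, dst_country TEXT, bytes INTEGER)")
db.execute("INSERT INTO flows VALUES ('us-east-1', 'IR', 123456)")

sql, rows = answer("Any traffic leaving US-EAST-1 for embargoed countries?", db)
print(sql)
print(rows)
</code></pre> <p>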
It returns the result to the LLM to synthesize a natural-language response, or we can receive the response in a structured format to create a visualization, as in the image below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5wC97sMlyraFlgKm0oM9nr/9d382d4e1d6e3b58d579174b65c3a5cc/llm-visualization.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Visualization using LLM in the Kentik platform" /> <p>Here, we can see the traffic for the last day leaving US-EAST-1 and destined for any embargoed country. The level 1 engineer simply asked the system in natural language without needing to know commands, syntax, SQL, Python, etc.</p> <h2 id="ensuring-accuracy">Ensuring accuracy</h2> <p>Because we’re using the LLM to generate the query, we can ensure the accuracy of the data as long as it has been properly cleaned and processed. In this example, the prompt initializes a workflow that queries hard metrics without using a vector database, which could introduce potential hallucinations due to inaccurate semantic similarity.</p> <p>Of course, semantic similarity can still be a concern because the LLM has to interpret the prompt in the first place. In our example, we used words like “traffic,” “AWS US-EAST-1,” “destined,” and so on, which guide the LLM to the intent of the prompt. If the words used in the prompt are ambiguous or, at worst, completely wrong, we could generate incorrect results. However, this is still a huge step forward in making access to network telemetry data easy for almost anyone in IT operations.</p> <h2 id="keeping-things-safe">Keeping things safe</h2> <p>LLMs don’t inherently understand what is an appropriate prompt and what isn’t. Therefore, we put guardrails on the system to ignore irrelevant, inappropriate, or prompts deemed outside the scope of a particular end user. In the image below, you can see the response that Kentik Journeys returned for an inappropriate prompt.</p> <img src="//images.ctfassets.net/6yom6slo28h2/MCPyLyNReAmMJUKyxN6sX/9d361d8c7f98e320773a7cb74ddcfea6/bad-prompt.png" style="max-width: 800px;" class="image center" alt="Example of an inappropriate prompt" /> <p>Other guardrails would instruct the LLM and backend workflow to limit queries to only your own data and not the private data of other organizations. We can also obfuscate data in transit and use enterprise agreements with LLM providers rather than publicly available services.</p> <p>These practices, along with traditional data loss prevention mechanisms, policies, and practices, ensure that using an LLM to democratize network telemetry data can safely revolutionize network observability without exposing our organization to new security risks.</p> <h2 id="the-next-evolution-of-network-observability">The next evolution of network observability</h2> <p>At Kentik, one way we define network observability is the ability to ask any question about your network. Integrating LLM technology into a network observability workflow, as Kentik has done with Journeys, is another significant step forward in making that a reality for anyone in IT, regardless of skill level or job role.</p> <p>Since Journeys is a built-in function of our platform and not a separately licensed feature, you can get started immediately if you’re already using Kentik in your organization. 
If not, <a href="https://www.kentik.com/get-started/">sign up for a free trial today</a> and see how easy it is to interrogate your network data.</p><![CDATA[Anatomy of an OTT Traffic Surge: Netflix Rumbles Into Wrestling]]><![CDATA[On Monday, Netflix debuted professional wrestling as its latest foray into live event streaming. The airing of WWE's Monday Night Raw followed Netflix's broadcasts of a heavily-promoted boxing match featuring a 58-year-old Mike Tyson and two NFL games on Christmas Day. In this post, we look into the traffic statistics of how these programs were delivered.]]>https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-netflix-rumbles-into-wrestlinghttps://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-netflix-rumbles-into-wrestling<![CDATA[Doug Madory]]>Wed, 08 Jan 2025 05:00:00 GMT<p>On Monday evening, Netflix made its latest stride into live streaming with its <a href="https://apnews.com/article/wwe-netflix-monday-night-raw-50a271fd0091e192c7c4889345b4437f">exclusive broadcast</a> of WWE’s Monday Night Raw. With this latest event, Netflix continues its charge into live programming in recent months following its broadcasts of the NFL on Christmas Day and the Mike Tyson versus Logan Paul <del>spectacle</del> fight.</p> <p>The following post is part of our <em>Anatomy of an OTT Traffic Surge</em> series and the latest in our ongoing coverage of streaming giant Netflix. Here, we compare the three recent <a href="https://en.wikipedia.org/wiki/Over-the-top_media_service">OTT events</a> and how we saw the traffic delivered.</p> <h2 id="ott-service-tracking">OTT Service Tracking</h2> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/">Call of Duty update</a> or the first-ever exclusively live-streamed <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-peacock-streamed-nfl-playoff-game/">NFL playoff game</a>, these OTT traffic events can put a lot of load on a network and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. Knowing the source and destination IPs of the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p>Kentik <a href="https://www.kentik.com/resources/kentik-true-origin/">True Origin</a> is the engine that powers OTT Service Tracking workflow. True Origin detects and analyzes the DNA of over 1000 categorized OTT services delivered by 79 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p> <div as="Promo"></div> <h2 id="a-redemption-arc">A redemption arc</h2> <p>Having those explanations out of the way, let’s talk about the content delivery. 
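</p> <p>First, though, here is the core idea behind that DNS-plus-NetFlow association, reduced to a toy example. Real classification in True Origin is far more involved than a dictionary lookup, and the IPs, labels, and numbers below are made up.</p> <pre><code># Toy sketch of OTT attribution: DNS answers tell us which CDN IPs are
# serving a given service; flow records are then attributed by matching
# their source IPs and summed per service and connectivity type.
dns_answers = {
    "198.51.100.7": ("Netflix", "Embedded cache"),
    "203.0.113.25": ("Netflix", "Peering"),
    "192.0.2.99":   ("OtherService", "Transit"),
}

flows = [  # (src_ip, bits_per_second)
    ("198.51.100.7", 8_000_000_000),
    ("203.0.113.25", 1_500_000_000),
    ("192.0.2.99",     400_000_000),
]

totals = {}
for src_ip, bps in flows:
    key = dns_answers.get(src_ip, ("Unknown", "Unknown"))
    totals[key] = totals.get(key, 0) + bps

for (service, conn_type), bps in sorted(totals.items()):
    print(f"{service:13s} {conn_type:15s} {bps / 1e9:4.1f} Gbit/s")
</code></pre> <p>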
Perhaps the most memorable aspect of last year’s Tyson-Paul fight was Netflix’s difficulties in delivering the broadcast, which was plagued by <a href="https://www.espn.com/boxing/story/_/id/42418997/netflix-experiences-streaming-delays-leading-tyson-paul-fight">buffering and disconnections</a> for many viewers, including myself.</p> <p>The graphic below comes from <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-the-tyson-paul-fight-on-netflix/">my post from November</a> following the bout. Those jagged plotlines during the surge of traffic were unusual and appeared to indicate some of the delivery problems that Netflix experienced.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5FMkFaXdaJ2SLGEQcTFcb7/71c4ca06b5ee008d577bcfd3eb7aec36/netflix-embedded-cache-jagged.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Netflix traffic by connectivity type" /> <p>The big question coming out of the event was whether Netflix could shore up its infrastructure to avoid a repeat of these delivery problems on Christmas Day when the streaming service would, for the first time ever, air two NFL games. Fortunately for football fans, those games were streamed without a hitch.</p> <p>The surge of traffic from the Christmas Day NFL games is illustrated below, where it is broken down by source connectivity type (e.g., embedded cache, peering, transit). There is even a small dip in the middle as viewers tuned away during the break between the two games. Based on data from our OTT customers, 89.9% of the traffic (bits/sec) of Netflix traffic during the NFL games came from embedded caches.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4Q3ojlTfcu0lrevNWOhEM4/7d86b314a71cdae2cef895943a392e17/traffic-surge-christmas-nfl.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Christmas Day internet traffic surge" /> <p>So, did Netflix resolve its delivery issues between mid-November and Christmas? Maybe. One thing that is very important to note is that the viewership from the Tyson-Paul fight was stratospheric — significantly higher than the NFL games.</p> <p>Netflix claimed to have achieved <a href="https://about.netflix.com/en/news/jake-paul-vs-mike-tyson-over-108-million-live-global-viewers">108 million live viewers globally</a>, including “65 million concurrent streams (globally), with 38 million concurrent streams in the US.” Perhaps those figures should come with an asterisk if many of those viewers were unable to watch the program like I was. Regardless, the Tyson-Paul fight was a spectacle that drew many millions of onlookers around the world.</p> <p>And while the NFL is the top sports attraction in the US, it garnered a relatively smaller audience than the fight. A mere 30 million viewers caught the games on Christmas Day, according to <a href="https://about.netflix.com/en/news/nfl-christmas-day-games-on-netflix-average-over-30-million-global-viewers">Netflix’s published figures</a>. That jives with our data where we observed the Tyson-Paul fight lead to more than double the traffic during the NFL games.</p> <h2 id="from-the-top-rope-here-comes-netflix">From the top rope, here comes Netflix</h2> <p>That leads us to this week’s development: Monday Night Raw on Netflix.</p> <p>Unlike the NFL games which occurred midday, the pro wrestling flagship program aired in primetime amid the peak of household streaming. 
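</p> <p>As an aside, the connectivity-type percentages cited throughout this post fall out of a very simple calculation once flows are classified; here is a sketch with synthetic peak numbers:</p> <pre><code># Sketch: per-connectivity-type share of an event's traffic peak.
# Synthetic peak values in Gbit/s by source connectivity type.
peak = {"Embedded cache": 26.4, "Private peering": 2.4, "Transit": 1.2}

total = sum(peak.values())
for conn_type, gbps in sorted(peak.items(), key=lambda kv: -kv[1]):
    print(f"{conn_type:16s} {gbps:5.1f} Gbit/s  {gbps / total:6.1%}")
</code></pre> <p>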
Regardless, the surge of Netflix traffic is easy to spot in the illustration below, which shares the 15-hour time range as the previous traffic graph (15:00 - 06:00 UTC).</p> <img src="//images.ctfassets.net/6yom6slo28h2/7eHs2Wz5ddVtb6TZwTmcUK/3d106f89430a0e63eef2f7977baa54fa/traffic-surge-monday-night-raw.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Monday Night Raw Netflix traffic" /> <p>Like the NFL games, the WWE event was primarily delivered through embedded cache (87.9%). However, based on our traffic data, Monday Night Raw clocked in at less traffic, 28% less in bits/sec, to be specific.</p> <p>The lower level of traffic might be explained by the fact that the wrestling program aired on a Monday evening while the NFL games aired on Christmas Day when many people were home from work with the TV on.</p> <img src="//images.ctfassets.net/6yom6slo28h2/49O9YGYJLr3d5CCM1OCAUJ/26ebd06e3ae1430b92adbd713d7321c0/holiday2024traffic.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Comparison of Christmas Day to Monday Night Raw" /> <p>Still, it remains an open question as to whether Netflix can, <em>even today</em>, deliver a problem-free stream of an event on the scale of the Tyson-Paul fight with two to three times the viewers of an NFL game. If Netflix keeps doing more live events, perhaps we’ll again bump up against those limits.</p> <h2 id="conclusion">Conclusion</h2> <p>Our OTT Service Tracking workflow allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like the ones mentioned in this post can have impacts in all three areas. OTT Service Tracking is the key to understanding and responding when they occur. Learn more about the application of Kentik for <a href="https://www.kentik.com/product/subscriber-intelligence/">subscriber intelligence</a>.</p> <p>Ready to improve over-the-top service tracking for your own networks? <a href="https://www.kentik.com/get-demo/">Get a personalized demo</a>.</p><![CDATA[AWS re:Invent 2024: A Week of Innovation and Excitement]]><![CDATA[AWS re:Invent 2024 brought together a record-breaking 80,000 attendees in Las Vegas to explore the latest innovations in cloud computing, from generative AI to sustainability. In this post, Justin Ryburn shares his key takeaways from the event, highlighting AWS’s vision for the future and the vibrant energy of the conference.]]>https://www.kentik.com/blog/aws-re-invent-2024-a-week-of-innovation-and-excitementhttps://www.kentik.com/blog/aws-re-invent-2024-a-week-of-innovation-and-excitement<![CDATA[Justin Ryburn]]>Mon, 23 Dec 2024 05:00:00 GMT<h2 id="introduction">Introduction</h2> <p>AWS re:Invent 2024, the annual cloud computing conference hosted by Amazon Web Services (AWS), concluded last week in Las Vegas. This year’s event was marked by a flurry of announcements showcasing AWS’s commitment to innovation and its vision for the future of cloud computing. From generative AI to data analytics, security, and sustainability, re:Invent 2024 had something for everyone. And there were plenty of people.</p> <p>Everywhere you went at re:Invent, there was a sea of people. The organizers reportedly had 80,000 people pick up badges, making this the largest re:Invent in history. 
Kentik is a proud sponsor of re:Invent, so I was fortunate enough to attend. This blog reviews some of my key takeaways from the event.</p> <h2 id="venue">Venue</h2> <p>AWS re:Invent 2024 was held at multiple venues, including the Venetian, Palazzo, Mandalay Bay, Wynn, and MGM Grand. This year’s event was an exhibition of innovation and excitement.</p> <p>From the bustling expo halls filled with cutting-edge technologies from its partner ecosystem to the insightful keynote sessions delivered by AWS leaders, re:Invent 2024 offered a comprehensive look at the future of cloud computing. The vibrant atmosphere of Las Vegas provided the backdrop for this week-long technology event, with attendees networking, learning, and exploring the latest developments.</p> <h2 id="announcements">Announcements</h2> <p>This will not be a shock to anyone reading this, but one of the major themes of this year’s conference was generative AI. AWS unveiled a range of new tools and services designed to make it easier for developers to build and deploy AI-powered applications. Announcements included <a href="https://www.aboutamazon.com/news/aws/amazon-nova-artificial-intelligence-bedrock-aws">Amazon Nova foundation models</a>, availability of <a href="https://www.aboutamazon.com/news/aws/aws-reinvent-2024-keynote-live-news-updates?p=aws-trainium2-instances-now-generally-available">Trainium2 instances</a>, the launch of <a href="https://www.aboutamazon.com/news/aws/aws-reinvent-2024-keynote-live-news-updates?p=trainium3-chips-designed-for-high-performance-needs-of-next-frontier-of-generative-ai-workloads">Trainium3</a>, new capabilities for a generative AI search engine called <a href="https://www.aboutamazon.com/news/aws/aws-reinvent-2024-keynote-live-news-updates?p=customers-scale-use-of-amazon-q-business-as-new-innovations-transform-how-employees-work">Amazon Q</a>, and more. Amazon does a nice job of summarizing their announcements, so be sure to check out <a href="https://www.aboutamazon.com/news/aws/aws-reinvent-2024-keynote-live-news-updates">this page</a> for all the latest. One of my personal favorites is the new <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch-NetworkFlowMonitor-What-is-NetworkFlowMonitor.html">CloudWatch Network Flow Monitor feature</a>. This allows AWS customers to get an idea of how the AWS underlay (backbone) network is performing.</p> <h2 id="keynotes-and-talks">Keynotes and talks</h2> <p>The conference kicked off with a series of keynote presentations by AWS CEO Matt Garman and other key executives. These keynotes highlighted the latest advancements in generative AI, data analytics, security, and sustainability.</p> <ul> <li><strong>Generative AI:</strong> The rapid evolution of generative AI was a significant focus of the keynotes. AWS showcased new tools and services, such as Amazon Bedrock, that empower developers to build innovative AI-powered applications. 
This product was featured in several sessions with customers to help them understand how they use Bedrock to solve real business problems.</li> <li><strong>Data analytics and machine learning:</strong> AWS announced significant enhancements to Amazon SageMaker and Amazon DataZone, making it easier for data scientists and analysts to derive insights from data.</li> <li><strong>Security and compliance:</strong> The company emphasized its commitment to security with new services like Amazon Detective and expanded compliance certifications.</li> <li><strong>Sustainability:</strong> AWS underscored its dedication to sustainable practices, outlining its goal of powering its global infrastructure with 100% renewable energy by 2025.</li> </ul> <p>In addition to the keynotes, hundreds of technical sessions and workshops were held throughout the week. These sessions covered a wide range of topics, from building serverless applications to deploying machine learning models at scale. Attendees had the opportunity to learn from AWS experts and industry leaders and connect with fellow developers and IT professionals.</p> <h2 id="conclusion">Conclusion</h2> <p>Overall, AWS re:Invent 2024 was a resounding success. The conference showcased the incredible breadth and depth of AWS’s offerings and demonstrated the company’s commitment to innovation and customer success. With a focus on generative AI, data analytics, security, sustainability, and a wide range of other areas, AWS is well-positioned to continue its leadership in the cloud computing market. AWS has all the <a href="https://youtube.com/playlist?list=PL2yQDdvlhXf_ZsP25dGLTNbrVSphM2JDl&#x26;si=geIaAURaa_4pcKeB">recorded sessions on YouTube</a>, so be sure to check out anything that catches your eye.</p> <hr> <p>This article originally appeared on Justin Ryburn’s personal blog. Kentik is grateful that we have so many incredible writers on our team who are willing to share their work with us. You can find the <a href="https://ryburn.org/2024/12/12/aws-reinvent-2024-a-week-of-innovation-and-excitement/">original post here</a>.</p><![CDATA[Comparing Azure NSG and VNet Flow Logs]]><![CDATA[Azure VNet flow logs significantly improve network observability in Azure. Compared to NSG flow logs, VNet flow logs provide broader traffic visibility, enhanced encryption status monitoring, and simplified logging at the virtual network level enabling advanced traffic analysis and a more comprehensive solution for modern cloud network management.]]>https://www.kentik.com/blog/comparing-azure-nsg-and-vnet-flow-logshttps://www.kentik.com/blog/comparing-azure-nsg-and-vnet-flow-logs<![CDATA[Phil Gervasi]]>Fri, 20 Dec 2024 05:00:00 GMT<p>First introduced in early 2024, Azure Virtual Network (VNet) flow logs expand upon NSG flow logs, improving network visibility in Azure. Before VNet flow logs, NSG flow logs were the primary method to capture network traffic information. However, NSG flow logs lacked visibility of several key resources and were limited in scope and application. VNet flow logs fill in many of these visibility gaps and enhance network observability in Azure virtual networks.</p> <h2 id="expanded-capabilities">Expanded capabilities</h2> <p>VNet flow logs both simplify and expand on the breadth of traffic monitoring by logging traffic at the virtual network level. 
This means that network traffic flowing through the workloads in a virtual network is logged, in contrast to NSG flow logs, which capture traffic flowing only through a specific network security group or NSG.</p> <p>First, VNet flow logs simplify network monitoring by eliminating the need to enable multiple-level flow logging, which is necessary for NSG flow logging. NSG logs require configuring NSGs at both the subnet and network interface levels and skips important information when no NSG is applied.</p> <p>In contrast, VNet flow logs provide a broader scope of visibility by enabling logging at the virtual network, subnet, or even network interface level, eliminating the need for NSGs to be attached to specific resources at the network security group level only.</p> <p>Next, VNet flow logs provide more enhanced traffic analysis by identifying traffic allowed or denied by both NSG rules and the Azure Virtual Network Manager security admin rules. NSG flow logs capture information only at the network security group level.</p> <p>Lastly, VNet flow logs also provide a form of encryption status monitoring, meaning that by using VNet flow logs, we can evaluate the encryption status of network traffic, especially in scenarios utilizing virtual network encryption.</p> <p>In this short video, I give a quick overview of how VNet flow logs differ from and improve upon NSG flow logs (and you can find more quick overviews like this in our <a href="https://www.kentik.com/video/kentik-bytes/" title="Kentik Bytes video collection">Kentik Bytes video collection</a>):</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/vg1ua82sii?web_component=true&amp;seo=true&amp;videoFoam=false" title="Comparing Azure NSG and VNet Flow Logs Video" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/player.js" async=""></script> </div> </div> <h2 id="how-vnet-flow-logs-work">How VNet flow logs work</h2> <p>VNet flow logs operate at Layer 4 and record network flows through a virtual network. They are collected at one-minute intervals and generally do not impact resource performance.</p> <p>Like NSG flow logs, VNet flow logs are stored in JSON format, which isn’t new for Azure but is an important distinction from other cloud flow records like AWS VPC flow logs. JSON organizes log information into key-value pairs, providing a clear and consistent structure. That uniformity simplifies data parsing and analysis and enables more efficient extraction of relevant information, which is critical for network observability and integration with various third-party tools.</p> <p>VNet flow logs include information such as source and destination IP, source and destination port, protocol, and the network interface identifier. 
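</p> <p>As a sketch of what consuming these records looks like, here is a short parser over a simplified VNet flow log record. The structure and tuple field order shown are illustrative; they vary by schema version, so verify against the current Azure documentation before relying on them.</p> <pre><code># Sketch: pulling flow tuples out of a (simplified) VNet flow log record.
# Field order is illustrative; confirm against the schema version in the
# Azure docs. "X"/"NX" for encrypted/unencrypted is also an assumption.
import json

record = json.loads("""
{
  "flowRecords": {
    "flows": [{
      "flowGroups": [{
        "rule": "DefaultRule_AllowInternetOutBound",
        "flowTuples": [
          "1735689600000,10.0.0.4,203.0.113.9,51234,443,6,O,B,NX,12,3400,10,9100"
        ]
      }]
    }]
  }
}
""")

FIELDS = ["ts_ms", "src_ip", "dst_ip", "src_port", "dst_port", "protocol",
          "direction", "flow_state", "encryption", "pkts_out", "bytes_out",
          "pkts_in", "bytes_in"]

for flow in record["flowRecords"]["flows"]:
    for group in flow["flowGroups"]:
        for tup in group["flowTuples"]:
            parsed = dict(zip(FIELDS, tup.split(",")))
            print(parsed["src_ip"], "->", parsed["dst_ip"],
                  "encryption:", parsed["encryption"])
</code></pre> <p>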
They also include traffic direction, the flow state, the encryption state, and throughput information, a significant improvement over NSG flow logs.</p> <p>After flow records are generated, they are sent to Azure Storage, where they can be accessed and exported to visualization tools such as Kentik.</p> <h2 id="uses-for-vnet-flow-logs">Uses for VNet flow logs</h2> <p>VNet flow logs have several specific uses for monitoring cloud traffic.</p> <p>First, they’re used for general-purpose network monitoring, meaning they are used to identify unknown or undesired traffic, monitor traffic levels, and understand application behavior over the network.</p> <p>Beyond that, VNet flow logs are used for usage optimization, such as identifying the top talkers in your cloud network, analyzing cross-region traffic, and forecasting capacity needs. Because VNet flow logs can be enabled at multiple levels, we can use them to monitor various services, such as Azure Firewall, VPN gateways, or ExpressRoute gateways.</p> <p>(NSG flow logs are limited to monitoring traffic through resources with associated NSGs, which means they provide an incomplete picture of cloud network traffic.)</p> <p>From a security perspective, these logs are used for compliance, such as ensuring network isolation, verifying encryption standards, and adhering to organizational access rules. In fact, they can also be used for security analysis tasks, such as analyzing network flows from compromised IPs, detecting intrusions, recognizing suspicious traffic patterns, and so on. When used along with an SIEM or IDS tool, VNet flow logs can help provide advanced threat detection.</p> <p>NSG flow logs don’t provide information regarding the encryption status of network traffic, making this use case unique to VNet flow logs.</p> <p>Also, from a security perspective, VNet flow logs provide more enhanced traffic analysis by identifying traffic allowed or denied by both NSG rules and the Azure Virtual Network Manager security admin rules. NSG flow logs, on the other hand, capture traffic decisions only based on NSG rules, lacking insights into other security configurations.</p> <h2 id="comparing-nsg-and-vnet-flow-logs-side-by-side">Comparing NSG and VNet flow logs side-by-side</h2> <p>Because VNet flow logs fill many of the visibility gaps of NSG flow logs, it’s helpful to compare supported features and visibility side-by-side.</p> <table> <thead> <tr> <th>Scope</th> <th>NSG flow logs</th> <th>VNet flow logs</th> </tr> </thead> <tbody> <tr> <td>Identifying virtual network encryption</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Azure Application Gateway</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>ExpressRoute Gateway</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>VPN Gateway</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Virtual machine scale sets</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Bytes and packets in stateless flows</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Azure API management</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Azure Virtual Network Manager</td> <td>No</td> <td>Yes</td> </tr> </tbody> </table> <h2 id="kentik-provides-comprehensive-azure-observability">Kentik provides comprehensive Azure observability</h2> <p>Azure VNet flow logs mark a significant advancement in network observability for Azure environments. They address many of the limitations of NSG Flow Logs and offer a comprehensive solution for monitoring and analyzing network traffic. 
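</p> <p>If you want to generate these logs in your own subscription, the sketch below shows roughly what enabling a VNet flow log looks like with the Azure Python SDK. Treat the details as assumptions to verify against the current <code>azure-mgmt-network</code> documentation: passing a virtual network ID as the flow log target requires a recent API version, and the Network Watcher names shown are just Azure’s usual defaults.</p> <pre><code># Rough sketch: enabling a VNet flow log via the Azure Python SDK.
# Verify the model shape and API-version support against current docs;
# the resource IDs and names below are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient
from azure.mgmt.network.models import FlowLog

SUB = "00000000-0000-0000-0000-000000000000"
client = NetworkManagementClient(DefaultAzureCredential(), SUB)

vnet_id = (f"/subscriptions/{SUB}/resourceGroups/my-rg"
           "/providers/Microsoft.Network/virtualNetworks/my-vnet")
storage_id = (f"/subscriptions/{SUB}/resourceGroups/my-rg"
              "/providers/Microsoft.Storage/storageAccounts/myflowlogs")

poller = client.flow_logs.begin_create_or_update(
    resource_group_name="NetworkWatcherRG",
    network_watcher_name="NetworkWatcher_eastus",
    flow_log_name="my-vnet-flow-log",
    parameters=FlowLog(
        location="eastus",
        target_resource_id=vnet_id,  # a VNet, not an NSG
        storage_id=storage_id,
        enabled=True,
    ),
)
print(poller.result().provisioning_state)
</code></pre>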
<a href="https://www.kentik.com/solutions/microsoft-azure/">Kentik leverages VNet flow logs</a> to capture traffic at the virtual network level, extending visibility to a wide range of Azure resources. By using VNet flow logs, Kentik provides unparalleled insight into network behavior and performance.</p> <h2 id="related-resources-about-cloud-flow-logs">Related resources about cloud flow logs</h2> <p>For more on cloud flow logs, see our post “<a href="https://www.kentik.com/blog/understanding-the-differences-between-flow-logs-on-aws-and-azure/" title="Kentik Blog: Understanding the Differences Between Flow Logs on AWS and Azure">Understanding the Differences Between Flow Logs on AWS and Azure</a>” and our Kentipedia article, “<a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/" title="Kentipedia: What are VPC Flow Logs">What are VPC Flow Logs?</a>”</p><![CDATA[Observer Bias: A Look Back at 2024]]><![CDATA[Leon Adato takes a lighthearted and personal look back at 2024, reflecting on the standout moments and trends that caught his eye throughout the year. In his signature off-the-cuff style, Leon shares his unique perspective on the past year's highlights.]]>https://www.kentik.com/blog/observer-bias-a-look-back-at-2024https://www.kentik.com/blog/observer-bias-a-look-back-at-2024<![CDATA[Leon Adato]]>Fri, 20 Dec 2024 05:00:00 GMT<p>It’s December, which means it’s time for (if you’re American) the overwhelming glut of holiday cheer, Hallmark Channel movies, heavy food, Mariah Carey singing <em>that song</em>, and end-of-year retrospectives.</p> <p>The good news is that in this blog, I will not tell you how to make my grandmother’s baklavah, nor at any time will I state that all I want is <a href="https://www.youtube.com/watch?v=FXTi-KDK11k">Hugh</a>. No, not <a href="https://www.youtube.com/watch?v=U8XGephn-gI">him</a>, either.</p> <p>Nope, it’s time for me to sign down in front of a crackling menorah, open a bottle of Manischewitz, slather some apple sauce on my latkes, and wax nostalgic for the things that stand out for me during good old 2024.</p> <p>This is going to be <em>nothing</em> like my colleague Doug Madory’s <a href="https://www.kentik.com/blog/a-year-in-analysis-2024/">The Year in Internet Analysis: 2024</a>. 
Meaning it’s going to be an off-the-cuff and from-the-gut list, rather than a carefully researched and deeply meaningful report.</p> <p>Look, I <em>aim</em> to please, but some days I’m a lousy shot.</p> <h2 id="kentik-in-2024">Kentik in 2024</h2> <p>I’m going to start with the things from Kentik that shine brightly in my memory – at the top of my list are ten amazing episodes of <a href="https://www.youtube.com/watch?v=Njetq2evYbA&#x26;list=PLLr1vqeAl9QNMvS3uHyb3YjrxsWrAGMDx">What’s New at Kentik</a>.</p> <p>Next, and staying on the theme of Kentik videos, we added four new entries to our “just for fun” series that are worth a look (and maybe even a chuckle) if you’re stuck at your desk during a slow December code freeze:</p> <ul> <li><a href="https://www.youtube.com/watch?v=NcYZizMaxJo&#x26;list=PLLr1vqeAl9QNLOQ3DNnQzhjakJBn9BuID&#x26;index=1&#x26;pp=iAQB">The Cloud Cost Slasher</a></li> <li><a href="https://www.youtube.com/watch?v=VdJpnAhRKKc&#x26;list=PLLr1vqeAl9QNLOQ3DNnQzhjakJBn9BuID&#x26;index=2&#x26;pp=iAQB">Kentik (Observabilium) Daily</a></li> <li><a href="https://www.youtube.com/watch?v=OtIg9iKcJzQ&#x26;list=PLLr1vqeAl9QNLOQ3DNnQzhjakJBn9BuID&#x26;index=3&#x26;pp=iAQB">The Kentik Eclipse</a></li> <li><a href="https://www.youtube.com/watch?v=2nn9MI-FUm0&#x26;list=PLLr1vqeAl9QNLOQ3DNnQzhjakJBn9BuID&#x26;index=4&#x26;pp=iAQB">Answer Any Question About Your Network</a></li> </ul> <p>We also launched a new podcast series, <a href="https://www.kentik.com/telemetrynow/s02-e11/">Telemetry News Now</a>, which is technically a sub-series under the <a href="https://www.kentik.com/telemetrynow/">Telemetry Now</a> umbrella. Still, in my mind, it stands apart in tone and topic.</p> <p>In March, Kentik launched NMS, one of the first new network monitoring solutions to hit the market in years.</p> <p>It’s no small thing for anyone – an individual or company of any size – to launch a new product, application, or solution. It’s even more noteworthy for a group that already has a solid application in the market to launch something new – not just a new module or feature update but a wholly new product. Finally, it’s especially significant when that completely new product moves into a market space that is seen as congested, if not archaic.</p> <p>It’s important to understand why Kentik took the time, effort, money, and engineering cycles to build and release NMS. Admittedly, this is <a href="https://www.kentik.com/blog/setting-sail-with-kentik-nms-unified-network-telemetry/">a question for a different blog</a>, but it’s an important question nonetheless. For this retrospective, it’s enough to note that it happened and was necessary for a variety of reasons.</p> <h2 id="the-it-world-beyond-the-kentik-window">The (IT) world beyond the (Kentik) window</h2> <p>I’m not going to sugarcoat it: 2024 saw more than its fair share of hacks, outages, and even shot-ourselves-in-the-foot “oopsies” that immediately hit the top of our news feed.</p> <p>But I’m not going to dwell on them because, if you’ve worked in IT for more than 15 minutes, you know those things can happen to anybody (and sometimes <em>everybody</em>), at any moment, and not always for the reasons people and pundits are swift to speculate about. Instead, I’m going to mention moments that had a special or personal impact on me, my friends, and my colleagues in tech:</p> <p>One of the promotions for Kentik NMS was a video that promoted the idea of “eclipsing” other tools. OK, that’s cute, but why?
Well, I know it was eight months ago (at the time I’m writing this), but there was, in fact, an eclipse – a big one, <a href="https://www.youtube.com/watch?v=Zto3TcfB_oI">right over my house</a>.</p> <p>Certainly, one of the unignorable trends of the year was AI’s steady progression. While I <a href="https://www.kentik.com/telemetrynow/s01-e27/">make no secret of my skepticism</a> about much of the hype surrounding LLMs and AI, it’s an unimpeachable truth that this moment and movement within tech is influencing and even reshaping how we work and think about that work.</p> <p>However, the best moments of 2024 were those when we as a community stepped back from the hype, the breathless (and uncritical) excitement, and the unrestrained push forward, and began to build the systems, controls, and boundaries that are the hallmarks of every mature and truly useful technology—ways to monitor and manage everything from heat and energy to access and trust.</p> <p>It’s hard to review 2024 without also mentioning the tectonic changes within social media and their impact on our communities. I’m not going to name specific platforms or attempt to detail the full litany of changes. It’s enough to say that the landscape of when, where, and how we share ourselves online looks very different in the waning days of the year, and that has, in turn, changed our experience of community.</p> <p>One of the things I do over on <a href="https://adatosystems.com/">my personal website</a> is share jobs that come across my desk, slacks, discords, emails, and more. This gives me a window into the job market, which I also blog about fairly frequently over there. And let’s face it: the job market was a big part of the 2024 news cycle. But overall, what I can say is that the job market wasn’t “bad” (I was posting between 75 and 200 new jobs every week, all year long), but rather that it was “slow.” This means that landing a new role has taken everyone – from newbies fresh out of boot camp to grizzled veterans with years of experience and piles of privilege – much longer than they expected.</p> <p>While my heart goes out to everyone who is currently looking, my advice is “stick with it.” The jobs are there; they just require a bit more persistence to land.</p> <p>Which leads to one of the sweetest moments of 2024 – Sesame Street’s Elmo checking in on everyone.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1x17kOzTGzdouycqmOCRJU/3f3e543221988b649d672bd1998a151f/elmo-stoop.jpg" style="max-width: 500px;" class="image center" alt="Elmo checking in" /> <p>This one post racked up millions of views and tens of thousands of reposts, quotes, likes, and bookmarks. It was a case of the right person (or Muppet) asking the right question at the right time.</p> <h2 id="the-view-from-inside">The view from inside</h2> <p>Over the last two days, as I’ve sat writing the bulk of this post, I’ve come back to the question of which moments of 2024 resonated most for me, personally.
Here are two that stand out:</p> <p>While it was satisfying to see the subscriber count to my weekly email listing triple, it was truly rewarding every time a person asked to be <em>removed</em> from the email because they’d landed a job and no longer needed it.</p> <p>Traveling to (and speaking at) CodeMash Sandusky, OHNUG Toledo, NANOG 90 Charlotte, MINUG Detroit, NYNUG Saratoga Springs, DevOpsDays Seattle, DevOpsDays Montreal, DevOpsDays Houston, PyOhio Cleveland, SRE Day London and getting to be part of all those amazing communities.</p> <p>With those final thoughts, I invite you to reflect on 2024 and see what moments shine brightest in your personal rearview mirror.</p><![CDATA[A Year in Analysis: 2024]]><![CDATA[In this post, Doug Madory reviews the highlights of his wide-ranging internet analysis from the past year, which included covering the state of BGP (leaks and the state of RPKI adoption), submarine cables (both cuts), major outages, and how geopolitics has shaped the internet in 2024.]]>https://www.kentik.com/blog/a-year-in-analysis-2024https://www.kentik.com/blog/a-year-in-analysis-2024<![CDATA[Doug Madory]]>Wed, 18 Dec 2024 05:00:00 GMT<p>It’s the end of another eventful year on the internet and time for another end-of-year recap of the analysis we’ve published since our <a href="https://www.kentik.com/blog/a-year-in-internet-analysis-2023/">last annual recap</a>. This year, we’ve organized the pieces into a few categories, including BGP analysis, major internet outages, submarine cables, and geopolitics. We hope you find this annual summary insightful.</p> <div as="Promo"></div> <h2 id="border-gateway-protocol-bgp-analysis">Border Gateway Protocol (BGP) analysis</h2> <p>This year, I continued my <a href="https://www.kentik.com/blog/author/job-snijders/">ongoing collaboration</a> with internet routing expert <a href="https://www.fastly.com/blog/author/job-snijders">Job Snijders</a> of Fastly, looking at the state of RPKI ROV (Route Origin Validation) adoption. In May, we <a href="https://www.kentik.com/blog/rpki-rov-deployment-reaches-major-milestone/">covered a major milestone</a> as the majority of routes in the IPv4 routing table now have Route Origin Authorizations (ROAs) — IPv6 achieved this milestone last year.</p> <p>Now that the majority of both IPv4 and IPv6 routes in the global routing table have ROAs, we can update our <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">traffic statistics</a>, which show that we’re approaching three-quarters of all traffic (measured in bits/sec, pictured below) heading to routes with ROAs and thus eligible for the protection that RPKI ROV provides. I’m not sure for how much longer we can maintain this steady rate of adoption, but it has been remarkable thus far.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7jpoO9AbYLtjueIXLmBvue/08dd473ea4f4c9566f007d224393218b/internet-traffic-rpki-202410.png" style="max-width: 650px;" class="image center" alt="Internet traffic volume by RPKI evaluation" /> <p>Job and I also got into the weeds by delving deeper into the cryptographic operations of RPKI to explain why ROAs have “effective expirations” that are <a href="https://www.kentik.com/blog/times-up-how-rpki-roas-perpetually-are-about-to-expire/">perpetually about to expire</a>. 
As colorfully illustrated below, the effective expiration of a ROA is heavily dependent on which RIR publishes it due to differences in software implementations.</p> <img src="//images.ctfassets.net/6yom6slo28h2/iAYNeqaW07lfiXR8aNkZS/9931ac9563fd674d1ef246022ea60084/when-do-roas-expire-several-days.png" style="max-width: 700px;" class="image center" alt="ROA expiration over several days" /> <p>Additionally, this year, the US Federal Communications Commission made its first moves towards improving BGP routing security by publishing proposed rules that would require the nine largest telecoms in the US to deploy RPKI ROV. In a post from July, <a href="https://www.kentik.com/blog/dissecting-the-fccs-proposal-to-improve-bgp-security/">I reviewed</a> where each of these companies stands with respect to this proposed new requirement and the debate around mandating the use of RPKI ROV.</p> <p>Lastly, a new challenge that I believe we face in the routing security community is that we may become victims of our own success. Specifically, routing hygiene has improved to the point that headline-grabbing BGP mishaps are becoming rarer, potentially taking wind out of the sails of various advocacy efforts like <a href="https://manrs.org/">MANRS</a>. The improvements are, of course, good news and have many contributing causes (increased adoption of RPKI ROV, greater levels of IRR-based route filtering, greater awareness of routing issues, peerlock), but there are numerous unsolved issues around BGP security that will need continued attention.</p> <p>The fact is that <a href="https://www.kentik.com/analysis/BGP-Routing-Leak-Leads-to-Spike-of-Misdirected-Traffic/">route leaks continue</a> to occur, even if the potential disruption they could cause has often been mitigated. I began a new blog series in October entitled <em>Beyond Their Intended Scope</em> to shed some light on the BGP mishaps that may have escaped the attention of the community but are worthy of analysis. The <a href="https://www.kentik.com/blog/beyond-their-intended-scope-uzing-into-russia/">first edition</a> covered a leak that redirected internet traffic through Russia and Central Asia as a result of a path error leak by Uztelecom, the incumbent service provider of Uzbekistan.</p> <h2 id="major-internet-outages">Major internet outages</h2> <p>The <a href="https://www.kentik.com/blog/digging-into-the-orange-espana-hack/">first outage analysis piece I wrote</a> this year covered the outage caused by RPKI ROV due to the compromise of Orange España’s RIPE NCC account. Spain’s second-largest mobile operator suffered a national outage as a result of a hacker’s unprecedented use of RPKI as a tool for denial of service. The hacker was able to log into Orange España’s RIPE NCC portal using the password “ripeadmin,” found in a public leak of stolen credentials. Oops!</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1ZxVCBqRJpVkOcOgJ7VVgZ/53937882a503d0351fdecfee92f2bac0/orange-espana-outage-impact-adjacents.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Orange Espana outage" /> <div class="caption" style="margin-top: -35px;">Drop in propagation of an Orange España BGP route during the outage</div> <p>I concluded that piece by saying that while RPKI was employed as a central instrument of this attack, it should not be construed as the cause of the outage any more than we would blame a router if an adversary were to get ahold of the login credentials and start disabling interfaces.
In my opinion, the silver lining of the outage was that it showed that RPKI-invalid routes really do get rejected at a significant rate. Otherwise, the outage could not have occurred.</p> <p>In a somber post, I also <a href="https://www.kentik.com/blog/hurricane-helene-devastates-network-connectivity-in-parts-of-the-south/">covered the effects of Hurricane Helene</a>, which cost the lives of over 200 people in the Southeast United States. I combined our traffic data with internet measurement data from Georgia Tech’s IODA tool to survey the disruption of internet connectivity in the region.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1FqdyDCnrbeOz3nwRB2P5W/3b7a0d338660955ed7abff2a645a1ec8/hurricane-helene-impacts-georgia1.png" style="max-width: 600px;" class="image center" alt="Hurricane Helene impacts in Georgia" /> <div class="caption" style="margin-top: -35px;">Internet service providers impacted by Hurricane Helene</div> <h2 id="submarine-cable-failures">Submarine cable failures</h2> <p>It has been another eventful year in the world of submarine cables. The two highlights of our coverage this year were our collaborations with WIRED magazine and The New York Times.</p> <p>In the spring, we teamed up with WIRED to investigate the cause of the submarine cable cuts in the Red Sea in February. As you may recall, in response to the ongoing war in Gaza, the Houthi-controlled Yemeni government began firing missiles and armed drones at ships transiting the nearby <a href="https://en.wikipedia.org/wiki/Bab-el-Mandeb">Bab al-Mandab</a> strait that they believed had an affiliation with Israel, Britain, or the United States.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4eQtCXWUtLy1obEj6JvVTN/991699c5afa942eab9b754112363f738/wired-ghost-ship.jpg" style="max-width: 600px;" class="image center" alt="WIRED graphic: Ghost Ship" /> <div class="caption" style="margin-top: -35px;">Graphic from the <a href="https://www.wired.com/story/houthi-internet-cables-ship-anchor-path/">WIRED article</a> about a “Ghost Ship” and the “Gate of Tears”</div> <p>As I wrote in <a href="https://www.kentik.com/blog/what-caused-the-red-sea-submarine-cable-cuts/">my blog post</a>, on February 24, three submarine cables were cut in the Red Sea (Seacom/TGN-EA, EIG, and AAE-1), disrupting internet traffic for service providers from East Africa to Southeast Asia. Initial speculation about the cause of the cuts focused on the purported threats against submarine cables posted in a Telegram channel months earlier.</p> <p>Before long, a more realistic theory emerged from the submarine cable industry. Days before the three subsea cable failures, a Belize-flagged, United Kingdom-owned cargo ship was struck by missiles fired from Yemen. The crew dropped anchor and abandoned the crippled ship, the MV Rubymar. Afterward, the Rubymar began to drift, dragging its anchor — one of the top causes of submarine cable cuts, according to the International Cable Protection Committee.
On March 2, the derelict vessel finally sank, taking with it more than 41,000 tons of fertilizer.</p> <p><a href="https://www.wired.com/story/houthi-internet-cables-ship-anchor-path/">WIRED combined our data</a> with AIS location tracking data and satellite imagery to estimate the location of the Rubymar and place it near the location of the cable cuts.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3hw4dFqB9MeJgobMAhhl6q/781f362560e2fac567892957acfc9094/wired-location-tracking.png" style="max-width: 700px;" class="image center" alt="WIRED graphic: Location tracking" /> <p>The following month, after the submarine cable cuts in the Red Sea, another multi-cable incident took place on the opposite side of the continent of Africa. On March 14, an undersea landslide struck the Trou Sans Fond (French for “bottomless pit”) Canyon off the coast of Abidjan, Côte d’Ivoire, severing multiple cables.</p> <img src="//images.ctfassets.net/6yom6slo28h2/62wzpEpNWMPcUSoyJcYK6c/6d80336e1f8468e3f03c13cd6a385a16/nyt-trou-sans-fond.jpg" style="max-width: 700px;" class="image center" alt="New York Times graphic of the Trou Sans Fond Canyon" /> <div class="caption" style="margin-top: -35px;">New York Times graphic of the Trou Sans Fond Canyon</div> <p>I worked with the investigative team at the New York Times as they researched the incident, from its impact to the crews tasked with repairing severed undersea cables at great depths. The <a href="https://www.nytimes.com/interactive/2024/11/30/world/africa/subsea-cables.html?unlocked_article_code=1.d04.7zuy.NnIkq9B5y1dD&#x26;smid=nytcore-ios-share&#x26;referringSource=articleShare">resulting piece</a> was a beautifully written and presented story of the outages and repairs that took place this year. It’s really worth your time if you haven’t read it.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3aMcHJeKeMM6d3purI3hNc/5e71f603671a53459490a68c3e74314b/nyt-graphic-kentik-data.png" style="max-width: 700px;" class="image center" alt="New York Times graphic based on Kentik data" /> <div class="caption" style="margin-top: -35px;">New York Times graphic based on Kentik data</div> <h2 id="geopolitics">Geopolitics</h2> <p>In August, I wrote a <a href="https://www.kentik.com/blog/afghanistan-internet-presses-on-three-years-after-us-departure/">follow-up piece</a> to my <a href="https://www.kentik.com/blog/whats-next-for-the-internet-in-afghanistan/">2021 post</a> written in the aftermath of the US withdrawal from Afghanistan.
This led me to get connected with the organizers of the newly formed AFNOG (<a href="https://www.linkedin.com/company/afgnog/posts/?feedView=all">Afghanistan Network Operators Group</a>), who continue to work to develop internet service in the war-battered nation despite an array of unique challenges, from austere terrain to international sanctions.</p> <table border="0" cellpadding="10" style="margin-left: auto; margin-right: auto;"> <tbody> <tr> <td>&nbsp;</td> <td colspan="2" align="center"><strong>2021</strong></td> <td colspan="2" align="center"><strong>2024</strong></td> </tr> <tr> <td>&nbsp;</td> <td align="center"><strong>Routes</strong></td> <td align="center"><strong>IPs</strong></td> <td align="center"><strong>Routes</strong></td> <td align="center"><strong>IPs</strong></td> </tr> <tr> <td><strong>IPv4</strong></td> <td align="center">468</td> <td align="center">252,928</td> <td align="center">528</td> <td align="center">137,984</td> </tr> <tr> <td><strong>IPv6</strong></td> <td align="center">5</td> <td align="center">2.37E+30</td> <td align="center">12</td> <td align="center">1.58E+30</td> </tr> </tbody> </table> <div class="caption" style="margin-top: -15px;">From a BGP standpoint, the topology of Afghanistan has changed in subtle ways.</div> <p>I was honored to be included in the <a href="https://www.linkedin.com/feed/update/urn:li:activity:7271154800166162433/">opening ceremony</a> of this year’s AFNOG2 and served as a moderator for a <a href="https://www.linkedin.com/posts/afgnog_afnog2-afnog2-afnog2-activity-7271523805993308160-QaWb">fascinating discussion</a> on navigating the country’s transit and peering environment.</p> <h2 id="over-the-top-anatomies">Over-the-top anatomies</h2> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/">OTT Service Tracking</a> (part of Kentik Service Provider Analytics) combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/">Call of Duty update</a> or the first-ever exclusively live-streamed <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-peacock-streamed-nfl-playoff-game/">NFL playoff game</a>, these OTT traffic events can put a lot of load on a network and understanding them is necessary to keep a network operating at an optimal level.</p> <img src="//images.ctfassets.net/6yom6slo28h2/69prvpmDRJPQLWycWjzPni/c10c847d21907f21da883f9ed21c895b/internet-traffic-gaming-platforms.png" style="max-width: 750px;" class="image center" alt="Internet traffic from gaming platforms showing Fortnite Remix" /> <div class="caption" style="margin-top: -35px;">Traffic surge due to a live event hosted within Fortnite</div> <p>To demonstrate some of the unique capabilities of the OTT Service Tracker, I occasionally publish blog posts as part of a series called Anatomy of an OTT Traffic Surge. 
This year I covered:</p> <ul> <li>The <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-the-tyson-paul-fight-on-netflix/">challenges Netflix faced</a> during its live event streaming the highly anticipated, if somewhat absurd, boxing match between the 58-year-old former heavyweight champion Mike Tyson and social media star Jake Paul.</li> <li>The <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-the-fortnite-chapter-2-remix-update/">surge of traffic</a> triggered by the release of Fortnite’s Chapter 2 Remix featuring rapper Eminem and <a href="https://www.nbcnews.com/sports/olympics/snoop-dogg-olympics-nbc-correspondent-paris-games-rcna164284">special NBC Olympics correspondent</a> Snoop Dogg.</li> <li>A <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday-aug2024/">dissection of the delivery</a> of one of Microsoft’s Patch Tuesday software updates.</li> <li>The first-ever <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-peacock-streamed-nfl-playoff-game/">exclusively live-streamed NFL playoff game</a> delivered while refraining from making any references to pop superstar Taylor Swift or her sizzling romance with nine-time Pro Bowler Travis Kelce.</li> </ul> <p>In aggregate, as depicted below, the top individual OTT services we see across providers include Netflix and YouTube in the top spots, followed by social media stalwarts Instagram, Facebook, and <a href="https://www.usatoday.com/story/tech/2024/12/16/tiktok-getting-banned-why-when-trump/77022873007/">soon-to-be-banned</a>(?) TikTok.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7m4yT59zPBcBb3fJ52N5Gw/abadd57a1ee3568f67f8880f097beef1/ott-services.png" style="max-width: 750px;" class="image center" alt="Top OTT services" /> <div class="caption" style="margin-top: -35px;">Top individual OTT services by traffic volume (bits/sec)</div> <h2 id="cloud-latency-map">Cloud Latency Map</h2> <p>In October, we released the Kentik <a href="https://clm.kentik.com/">Cloud Latency Map</a>, a free public tool that employs <a href="https://www.kentik.com/product/synthetic-monitoring/">Kentik’s synthetics capabilities</a> to allow users to explore the latencies measured between over 100 different cloud regions located around the world.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6kPcc8mHuIvodNwOTn88Ug/77207e1669754cd57152439baf689caf/cloud-latency-map.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud Latency Map" /> <p>As I said in my <a href="https://www.kentik.com/blog/announcing-the-cloud-latency-map/">write-up</a> announcing the release, the Map is the latest expression of Kentik’s dedication to extending <a href="https://www.kentik.com/product/multi-cloud-observability/">network observability to the cloud</a>.</p> <p>Broadly speaking, the Map can assist someone trying to determine if there is a connectivity issue impacting particular cloud regions.
Additionally, since the public clouds rely on the same physical infrastructure as the rest of the global internet, the Map can often pick up on the latency impacts of failures of core infrastructure, such as the loss of a major submarine cable.</p> <p>Since its release, I have posted a <a href="https://x.com/DougMadory/status/1851278407998316974">handful</a> of interesting observations found in the Map, including the <a href="https://www.linkedin.com/posts/dougmadory_appears-that-amazon-web-services-awss-activity-7266848570585104386-3PQX">temporary rerouting of traffic</a> between AWS’s and Azure’s cloud regions in São Paulo, Brazil, as well as a <a href="https://www.linkedin.com/posts/dougmadory_interesting-connectivity-change-detected-activity-7262101293274390528-2Ym0/">significant change in connectivity</a> between AWS in China and Google in both Europe and Asia.</p> <p>As the Map illustrates, it’s a volatile internet out there — even for rarefied traffic between the hyperscalers. Take a look around and see what you can find. The data is updated every hour, and we have a list of features we’d like to add to it in the future.</p> <h2 id="podcasts">Podcasts</h2> <p>If anything, 2024 might be the year of podcasts! I haven’t included this as a section in previous <a href="https://www.kentik.com/blog/a-year-in-internet-analysis-2023/">year-end write-ups</a>, but this year was a real boon for audiophiles.</p> <p>I was the subject of an episode of networking mainstays <a href="https://artofnetworkengineering.com/2024/02/14/ep-139-doug-madory-the-man-who-can-see-the-internet/">The Art of Network Engineering</a> back in February and <a href="https://podcast.impostersyndrome.network/2016832/episodes/14589553-doug-madory">Imposter Syndrome Network</a> in April. While at <a href="https://2024.apricot.net/">APRICOT 2024</a> in Bangkok, Thailand, I recorded a conversation with APNIC’s podcast <a href="https://blog.apnic.net/2024/05/16/podcast-measuring-rpki-and-bgp-with-oregon-routeviews/">PING</a> about BGP, RPKI, and Oregon’s Routeviews project. I got deeper into BGP security with <a href="https://pod.chaoslever.com/going-deeper-into-bgp-with-doug-madory/">Chaos Lever</a> and Russ White’s <a href="https://rule11.tech/hedge-217/">Hedge</a> podcast.</p> <p>In June, I was interviewed on Bitporto, the podcast of the Italian IXP <a href="https://www.namex.it/">NAMEX</a>. Versions are available in <a href="https://podcasts.apple.com/us/podcast/the-art-of-global-monitoring-with-doug-madory/id1772944059?i=1000676600885">English</a>, and, in a first, I was also dubbed into <a href="https://podcasts.apple.com/us/podcast/larte-del-global-monitoring-con-doug-madory-versione-ai/id1772944059?i=1000676601095">Italian</a>.</p> <p>And, of course, I make the occasional appearance on Kentik’s Telemetry Now podcast hosted by my colleague Phil Gervasi. In four episodes this year, I discussed <a href="https://www.kentik.com/telemetrynow/s02-e02/">submarine cables</a> and <a href="https://www.kentik.com/telemetrynow/s02-e06/">RPKI adoption</a>.
I also helped interview <a href="https://www.kentik.com/telemetrynow/s02-e12/">Andres Azpurua</a> about internet censorship in Venezuela and outgoing president (and my former colleague at Dyn) <a href="https://www.kentik.com/telemetrynow/s02-e07/">Andrew Sullivan</a> on how the Internet Society helps maintain an open Internet.</p> <h2 id="conclusion">Conclusion</h2> <p>To watch my webinar covering these and other topics, please check out the link below:</p> <div as="WistiaVideo" videoId="mdcsdvndus"></div> <p>And if you enjoy reading these annual recaps, please check out our annual rollups from <a href="https://www.kentik.com/blog/a-year-in-internet-analysis-2022/">2022</a> and <a href="https://www.kentik.com/blog/a-year-in-internet-analysis-2023/">2023</a>, as well as our list of the top outages from <a href="https://www.kentik.com/blog/the-10-top-network-outages-of-2021/">2021</a>.</p><![CDATA[AutoCon2: The Network of the Future is Automated]]><![CDATA[AutoCon2 gathered network automation professionals in Denver for three days of insights, workshops, and discussions on the future of automation. Justin Ryburn, Kentik Field CTO, shares his key takeaways, highlights, and why this conference stands out as a top destination for the network automation community.]]>https://www.kentik.com/blog/autocon2-the-network-of-the-future-is-automatedhttps://www.kentik.com/blog/autocon2-the-network-of-the-future-is-automated<![CDATA[Justin Ryburn]]>Tue, 17 Dec 2024 05:00:00 GMT<h2 id="introduction">Introduction</h2> <p><a href="https://networkautomation.forum/autocon2">AutoCon2</a>, a premier event put on by the <a href="https://networkautomation.forum/">Network Automation Forum (NAF)</a> and dedicated to network automation, recently took place in Denver, Colorado. This three-day conference brought together industry experts, network engineers, and automation enthusiasts to explore the latest trends, challenges, and best practices in network automation. With a packed agenda featuring insightful talks, hands-on workshops, and engaging discussions, AutoCon2 offered a comprehensive look into the future of network automation. This conference is quickly becoming one of my favorite events of the year. Kentik is a proud sponsor of NAF, so I got to attend and will share my thoughts in this blog.</p> <h2 id="autocon2-key-themes-and-highlights">AutoCon2 key themes and highlights</h2> <p>One of the most prominent themes at AutoCon2 was the increasing importance of network automation in today’s digital age. As networks become more complex and dynamic, manual configuration and management are no longer sustainable. Network automation tools, techniques, and frameworks are essential for streamlining operations, reducing errors, and improving overall network efficiency.</p> <p>Several key highlights from the conference included:</p> <ul> <li><strong>Source of truth (SoT):</strong> SoT emerged as a critical concept, emphasizing the need for a single, reliable source of information about the network. Attendees discussed the benefits of using SoT tools like NetBox, OpsMill, and Nautobot to maintain accurate and up-to-date network documentation.</li> <li><strong>Orchestration:</strong> Orchestration frameworks are becoming key to each organization’s automation success. There are both open source solutions like Ansible and commercial solutions like <a href="https://www.itential.com/">Itential</a>.
These were widely discussed, highlighting their role in automating complex network deployments and configurations.</li> <li><strong>Observability:</strong> Observability tools were showcased, demonstrating their ability to monitor network performance, detect anomalies, and proactively address issues. You cannot automate without input data about what is going on in the network. <a href="https://www.linkedin.com/in/scottrobohn/">Scott Robohn</a> has made this point several times in his <a href="https://packetpushers.net/podcast/total-network-operations/">Total Network Operations (TNOps)</a> podcast. Just like with orchestration, there are both open source and commercial tools like Kentik that were highlighted at AutoCon2.</li> <li><strong>AI and machine learning:</strong> It’s 2024, so of course, the event explored AI and ML. Snark and sarcasm aside, AI and ML are emerging technologies with the potential to revolutionize network automation. Attendees discussed how these technologies can be used for predictive analytics, anomaly detection, and automated troubleshooting.</li> </ul> <h2 id="autocon2-hands-on-workshops">AutoCon2 hands-on workshops</h2> <p>AutoCon2 offered a variety of hands-on workshops that allowed attendees to gain practical experience with network automation tools and techniques. These workshops covered topics such as Ansible, Python scripting, integration with a source of truth (SoT), network observability, and network troubleshooting using large language models (LLMs). I only wish I could have been in more than one place at a time. The workshops were broken into half-day sessions on Monday and Tuesday, and in each block, four workshops ran in parallel. I heard only great things about all of the workshops. It was my pleasure to work with <a href="https://www.linkedin.com/in/phillipgervasi/">Phil Gervasi</a>, <a href="https://www.linkedin.com/in/smeuse/">Steve Meuse</a>, and <a href="https://www.linkedin.com/in/krygeris/">Mike Krygeris</a> to put on a workshop on network observability using Kentik for the practical exercises.</p> <h2 id="autocon2-community">AutoCon2 community</h2> <p>The conference also fostered a strong sense of community among network automation professionals. One of the key benefits of attending an event like this is learning from other practitioners about what is working for them in their organization. Attendees had the opportunity to connect with like-minded individuals, share experiences, and collaborate on projects. Social events, networking sessions, and Birds of a Feather lunches further enhanced the community atmosphere.</p> <h2 id="naffy">Naffy</h2> <img src="//images.ctfassets.net/6yom6slo28h2/VFDuf1tacIukMHSHb2tl6/259faf2d4dfa5f9ae832cfea31222399/chris-grundemann-autocon2.jpg" style="max-width: 300px;" class="image right" alt="Naffy, NAF mascot" /> <p>Every great organization has a great logo <em>and</em> a great mascot. Network Automation Forum (NAF) launched a new cat mascot named Naffy, who quickly became popular at the event. Here is a picture of <a href="https://www.linkedin.com/in/cgrundemann/">Chris Grundemann</a> announcing the mascot on the main stage.</p> <h2 id="the-future-of-network-automation">The future of network automation</h2> <p>As the networking industry continues to evolve, network automation will play an increasingly vital role. AutoCon2 provided a glimpse into the future of network automation, showcasing the latest trends and innovations.
Key takeaways from the conference include:</p> <ul> <li><strong>Embrace automation:</strong> Network automation is no longer optional; it is essential for staying competitive in today’s digital landscape. The good news is you can start small, automating easy tasks, and mature from there.</li> <li><strong>Prioritize SoT:</strong> A well-maintained SoT is crucial for efficient network operations and automation. The first step for a lot of organizations is building that SoT.</li> <li><strong>Leverage orchestration:</strong> Orchestration frameworks can significantly streamline network deployments and configurations. Orchestration essentially allows you to offer your network automation as a service that users within your organization can leverage.</li> <li><strong>Invest in observability:</strong> Robust observability tools are essential for monitoring network health and performance. Similar to SoT, this is a great place to get started.</li> <li><strong>Explore AI and ML:</strong> AI and ML have the potential to transform network automation and troubleshooting, enabling more intelligent and proactive network management.</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>AutoCon2 was a resounding success, providing a valuable platform for network automation professionals to learn, collaborate, and network. The conference highlighted the importance of network automation in today’s digital age and showcased the latest trends and technologies shaping the future of networking. By embracing automation, prioritizing SoT, leveraging orchestration, investing in observability, and exploring AI and ML, network engineers can drive innovation, improve efficiency, and ensure the reliability of their networks. The team at NAF will be cleaning up and posting the videos on YouTube as time allows. Keep an eye on their <a href="https://www.youtube.com/@NetworkAutomationForum">YouTube feed</a> if you missed the event, and start planning for AutoCon3 in Europe in May!</p> <hr> <p>This article originally appeared on Justin Ryburn’s personal blog. Kentik is grateful that we have so many incredible writers on our team willing to share their work with us. You can find the <a href="https://ryburn.org/2024/11/26/autocon2-the-network-of-the-future-is-automated/">original post here</a>.</p><![CDATA[Get Growing at Kentik: Matt Cavanaugh]]><![CDATA[At Kentik, our people are the heart of everything we do, and we’re proud to celebrate their growth. In this post, we spotlight Matt Cavanaugh and his inspiring journey from Customer Success Manager to Vice President of Customer Success. ]]>https://www.kentik.com/blog/get-growing-at-kentik-matt-cavanaughhttps://www.kentik.com/blog/get-growing-at-kentik-matt-cavanaugh<![CDATA[Kentik People Operations Team]]>Mon, 16 Dec 2024 05:00:00 GMT<h2 id="celebrating-our-people-the-heart-of-kentik">Celebrating our people: The heart of Kentik</h2> <p>At Kentik, our people are at the core of everything we do. We take pride in having a diverse, talented team that brings their authentic selves to work every day, living out our value of “Be You.” We know that our team members’ work is just one part of who they are, and we celebrate their dedication, growth, and contributions in both their professional and personal lives.</p> <div as="WistiaVideo" videoId="jky7zy7kdb"></div> <h2 id="from-customer-success-manager-to-vp-matt-cavanaughs-journey">From Customer Success Manager to VP: Matt Cavanaugh’s journey</h2> <p>Today, we’re excited to spotlight Matt Cavanaugh, whose career at Kentik has been nothing short of inspiring.
Matt began his journey at Kentik as our very first Customer Success Manager, where he quickly established himself as a customer-first leader. Over the years, he has progressed through roles, taking on more responsibility and making a lasting impact on both our customers and our company.</p> <p>Today, Matt is our Vice President of Customer Success. He leads a team located around the globe that is dedicated to creating exceptional experiences for our clients. His story perfectly exemplifies the growth and opportunity we nurture here at Kentik.</p> <h2 id="getting-to-know-matt">Getting to know Matt</h2> <p>To celebrate Matt, we asked him a few questions to give you a peek into his life at Kentik and beyond:</p> <p><strong>Q. Matt, what first attracted you to Kentik?</strong><br> The culture! Kentik has always been a place where everyone is approachable and willing to listen to all ideas! I love working at a company where everyone has the ability to affect change.</p> <p><strong>Q. What’s the best piece of career advice you’ve ever received?</strong><br> Focus on building relationships, they’re the only thing you get to keep for life.</p> <p><strong>Q. What motivates you to do your best work?</strong><br> The people around me. I love who I work with and that motivates me to do better.</p> <p><strong>Q. What’s your favorite thing about living in Montana?</strong><br> I can hop on my dirt bike in my driveway and be on high mountain logging roads in 12 minutes.</p> <p><strong>Q. What would you be doing right now if money and time were no object?</strong><br> Riding a horse in Mongolia.</p> <p>Matt’s career at Kentik embodies the very spirit of our company — a commitment to innovation, leading with respect and enjoying the journey. We’re proud to have him on our team and look forward to all the great things yet to come!</p> <p>At Kentik, we’re not just building a company — we’re creating a community where everyone can thrive. Interested in joining Matt and the rest of the team? We’re hiring! Check out our <a href="/careers/">careers page</a> for open positions.</p><![CDATA[Faster Network Troubleshooting with Kentik AI]]><![CDATA[When network issues strike, every second matters. Latency or packet loss can frustrate users and hurt revenue. Learn how Kentik AI uses natural language to speed up troubleshooting and isolate problems quickly.]]>https://www.kentik.com/blog/faster-network-troubleshooting-with-kentik-aihttps://www.kentik.com/blog/faster-network-troubleshooting-with-kentik-ai<![CDATA[Eric Hian-Cheong]]>Tue, 10 Dec 2024 05:00:00 GMT<p>When things go wrong, it is often a race against time to figure out what’s happening and get it fixed. Whether resulting in latency or completely unreachable services, issues can have harmful effects on the business as users grow frustrated and abandon their transactions to go elsewhere. Observability tools have gotten very good at detecting when these things happen, but many times, especially where the dynamic complexities of the network are involved, isolating the root cause requires a network engineer to roll up their sleeves and start digging into complex network data and logs – a time-consuming process when time is of the essence.</p> <p><a href="https://www.kentik.com/solutions/kentik-ai/">Kentik Journeys</a> provides a new solution in the network engineer’s toolbox to reduce the time and effort it takes to troubleshoot network issues when they occur by allowing engineers to use natural language and AI-augmented analysis to dig into network data. 
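</p> <p>To give you a flavor of what this looks like in practice, these are the kinds of plain-English prompts used in the walkthrough below, each one building on the results of the prompt before it:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">show me traffic from my Cisco SD-WAN network in the last 6 hours
filter this to postgresql
add sites and devices
add destination interfaces
what is the outbound bitrate for these WAN interfaces?
show general Cisco SD-WAN metrics
show latency</code></pre></div> <p>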
Journeys then allows you to systematically ask follow-up questions, continuing to probe the data as you follow the proverbial trail of breadcrumbs toward the root cause. And since application delivery relies on many network and network-adjacent components and services, we’ve designed Journeys to work across the entire Kentik product surface.</p> <p>Let’s take a look at how it works in a real situation.</p> <div as="WistiaVideo" videoId="y4mtgjb875"></div> <p>In this scenario, users have reported an application performance issue. Specifically, the connection to the application breaks intermittently, and it’s getting in the way of getting work done. Additionally, we know that the location is connected to on-prem data centers and the public cloud by a Cisco SD-WAN, which the application’s local PostgreSQL mechanism uses to connect to the resources it needs.</p> <p>Let’s see how Journeys helps us identify the cause quickly.</p> <h2 id="natural-language-troubleshooting-with-journeys">Natural language troubleshooting with Journeys</h2> <h3 id="step-1">Step 1</h3> <p>First, we select “New Journey” to start the process. Because we intend to keep this entire conversation to refer back to later, we’ll give it a more useful name, in this case, “SD-WAN Troubleshooting.”</p> <p>Journeys supports any queries related to flow data, which would typically be visible in Data Explorer, as well as metrics from SNMP and streaming telemetry, which we’d typically see in Kentik NMS Metrics Explorer.</p> <p>Let’s start troubleshooting by querying traffic from our Cisco SD-WAN network segment over the last 6 hours.</p> <img src="//images.ctfassets.net/6yom6slo28h2/164Swpqz2JGUEy20RJVlyA/0e1671d2792be526031d85b519d343d5/cisco-traffic-sd-wan.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI: Traffic from Cisco SD-WAN" /> <p>This natural language prompt is sent to a large language model to interpret before sending back a valid Data Explorer query for Kentik to run. As you can see, a new graph and table are generated, showing traffic grouped by application.</p> <p>In the output above, we can see the applications traversing our SD-WAN devices, which is a good start, but the application we’re interested in likely has low-volume traffic, and it is not visible in the top applications.</p> <h3 id="step-2">Step 2</h3> <p>Let’s filter this traffic just on the PostgreSQL application by typing “Filter this to postgresql.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/4TiMQzCYl6ApBFl05WQCq3/6856f83988a1b61d6ac718f81b9b8f8a/filter-postgresql.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI: Filter for postgresql" /> <p>Journeys interprets this to add a new filter on top of the existing query, and we now only see PostgreSQL traffic presented.</p> <p><strong>Tip</strong>: You can also skip this step and just ask for PostgreSQL traffic right off the bat by asking something like “show me postgresql application traffic in my Cisco SD-WAN network in the last 6 hours.”</p> <h3 id="step-3">Step 3</h3> <p>We know that the PostgreSQL traffic is problematic on certain sites, so we want to add site and device dimensions to this view.
This is simply done by asking Journeys to “add sites and devices.” <img src="//images.ctfassets.net/6yom6slo28h2/39FUsy4lKOmGT0xRbKLCGI/a53ffb1d53dd4d5490a1216f2c5f3f46/add-sites-devices.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI: Add sites and devices" /></p> <p>We can now see that the majority of the traffic is going between Sites 1 and 5 using the devices <code class="language-text">cedge-01</code> and <code class="language-text">cedge-05</code>.</p> <h3 id="step-4">Step 4</h3> <p>Next, let’s also include destination interfaces in the query to see which interfaces this traffic traverses.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3MpWwdHtoU0DRfAXMqdTZL/2dea7f27545845e543565611e322aee7/add-destination-interfaces.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI: Add destination interfaces" /> <p>The results show that the <code class="language-text">cedge-01</code> device periodically switches the outgoing traffic between two WAN interfaces. This might explain why users were experiencing intermittent application connection problems. But we don’t yet know what might have caused it. SD-WAN technology dynamically routes traffic based on conditions like interface utilization, packet loss, latency, and jitter. So, let’s look at some of those metrics to see why this rerouting might occur.</p> <p>Let’s start by asking about the outbound bitrate for these WAN interfaces.</p> <h3 id="step-5">Step 5</h3> <img src="//images.ctfassets.net/6yom6slo28h2/7opUA7wceUlG6sTg8xVEYw/3e0bd0cbd70beacdf5c8450459050ee0/wan-outbound-bitrate.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI: Outbound bitrate" /> <p>The query results show traffic patterns on these links, but with a capacity of 6 Mbps and traffic below 2 Mbps, congestion isn’t the issue.</p> <p><strong>Note</strong>: Until now, we’ve been looking at results from Data Explorer queries. This one is showing us metrics from Metrics Explorer instead. Journeys allows us to seamlessly query data from across Kentik, keeping relevant filters and devices in context.</p> <p>Since it doesn’t look like congestion is the issue, let’s check in on the device’s health by asking to see general metrics on the device.</p> <h3 id="step-6">Step 6</h3> <img src="//images.ctfassets.net/6yom6slo28h2/660wkEfDqeYVynglKplKQb/67f60d649e796a7576fcb0d358d83ef2/general-sd-wan-metrics.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI: Show general Cisco SD-WAN metrics" /> <p>The results include a chart with packet loss metrics on the silver link towards Site 4, which isn’t our focus. However, the table also reveals metrics for silver and gold links to Site 5.
We can see a lot of traffic on these links, but the silver link also shows a higher average jitter and latency than the gold link.</p> <p>Let’s look at this a little more closely.</p> <h3 id="step-7">Step 7</h3> <img src="//images.ctfassets.net/6yom6slo28h2/6KxNU0Il73y6kRb5UUWfca/4f197cc79f72ea9f9af93e4152b52891/show-latency.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI: Show latency" /> <p>The results show increased latency on this link, which periodically jumps to about 300 milliseconds, likely causing the PostgreSQL traffic to reroute.</p> <p>Now, because these graphs are time-normalized, we can quickly compare this latency against the traffic switching pattern we looked at earlier:</p> <img src="//images.ctfassets.net/6yom6slo28h2/7GA5SUlBln6iAkyKh8qGtq/c8dd314e387c39f1467848fae3a1e75d/latency-with-destination-interfaces.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI: Add destination interfaces" /> <p>We can see that traffic goes through the silver link when the latency is normal. But when the latency on the silver link increases, the traffic is routed through the gold link.</p> <p>This makes sense as it is the expected behavior based on our routing policies on the Cisco SD-WAN controller. With Kentik, we’ve confirmed that the SD-WAN routing policies function correctly, but our connectivity service provider needs to address the periodic link quality degradation.</p> <h2 id="faster-root-cause-analysis-using-natural-language">Faster root cause analysis using natural language</h2> <p>Being able to troubleshoot a network problem, especially one that stems from intermittent network activity, means analyzing data about devices, application flows, user behavior, service provider information, and more. In other words, to get to the answer, it takes a lot of time and effort to mine through data, look for clues, ask questions, and draw conclusions. It’s an iterative process that isn’t always linear.</p> <p>Journeys brings the power of AI and natural language to this process, eliminating the need to manually work through menus and filters and mine charts for clues. It also makes it easier for occasional users to start diagnosing issues in Kentik, and it leaves an easy-to-reference record of what you’ve already done if you need to change your line of thinking or direction.</p> <p>Want to give Journeys a try? You can <a href="https://www.kentik.com/get-started/">try Kentik for free for 30 days</a>.</p><![CDATA[Understanding the Differences Between Flow Logs on AWS and Azure]]><![CDATA[AWS VPC flow logs and Azure NSG flow logs offer network traffic visibility with different scopes and formats, but both are essential for multi-cloud network management and security. Unified network observability solutions analyze both in one place to provide comprehensive insights across clouds.]]>https://www.kentik.com/blog/understanding-the-differences-between-flow-logs-on-aws-and-azurehttps://www.kentik.com/blog/understanding-the-differences-between-flow-logs-on-aws-and-azure<![CDATA[Phil Gervasi]]>Wed, 04 Dec 2024 05:00:00 GMT<p>Visibility into our AWS and Azure cloud environments requires collecting AWS VPC flow logs and Azure NSG flow logs. These logs provide a granular view of network traffic, offering insights into connections, bandwidth usage, security incidents, and more.
And while these two services are functionally similar, their implementation, features, format, and use cases are different.</p> <h2 id="aws-vpc-flow-logs">AWS VPC flow logs</h2> <p>AWS VPC flow logs capture information about the IP traffic going to and from network interfaces within a VPC. They’re typically used to monitor and troubleshoot network connectivity and ensure compliance with security policies.</p> <p><a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/" title="Kentipedia: What are VPC Flow Logs?">VPC flow logs</a> operate at the VPC level, subnet level, or individual ENI (elastic network interface) level. This means engineers can use VPC flow logs to capture traffic across multiple layers of the AWS network hierarchy. Logs capture inbound and outbound traffic at the packet level so you can get information about packet and byte count, source and destination addresses, ports, and protocols.</p> <p>This is in contrast to Azure NSG flow logs, which monitor only the traffic filtered by NSG security rules. That means AWS VPC flow logs are broader in scope and more granular in depth.</p> <p>VPC flow logs are stored in Amazon CloudWatch Logs (if you’re using CloudWatch) or Amazon S3, which allows for relatively easy integration with other AWS services like Athena for analytics or AWS Glue for data processing.</p> <p>The structure of VPC flow logs is interesting. They’re text-based and consist of useful fields such as:</p> <ul> <li>Version</li> <li>Account ID</li> <li>Interface ID</li> <li>Source and destination IP</li> <li>Protocol</li> <li>Source and destination port</li> <li>Action (ACCEPT/REJECT)</li> <li>Log status (OK, NODATA)</li> </ul> <p>We can use these logs for a variety of cases beyond mapping traffic by source, destination, port, and protocol. For example, we can use them to identify connectivity issues between EC2 instances, map latency and bottlenecks, detect unauthorized access attempts, verify that our traffic adheres to regulatory compliance requirements, and so on. However, one thing to note is that VPC flow logs are somewhat static and aren’t highly extensible beyond this.</p> <p>Below you can see an example of a raw flow record from the AWS website showing that SSH traffic from 172.31.16.139 to the network interface with private IP address 172.31.16.21 and ID eni-1235b8ca123456789 in account 123456789010 was allowed.</p> <code style="padding: 10px; background-color: #f8f8f8;"> 2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK </code> <h2 id="azure-nsg-flow-logs">Azure NSG flow logs</h2> <p>Azure Network Security Group (NSG) flow logs monitor traffic flowing through NSGs, which control access to Azure virtual networks. NSG flow logs provide Layer 4 traffic visibility and (naturally) integrate tightly with Azure’s security and analytics ecosystem.</p> <p>NSG flow logs operate at the NSG level, which means they capture traffic for resources protected by the network security group, such as VMs and subnets. Therefore, use cases for NSG flow logs often have to do with security.
<h2 id="azure-nsg-flow-logs">Azure NSG flow logs</h2> <p>Azure Network Security Group (NSG) flow logs monitor traffic flowing through NSGs, which control access to Azure virtual networks. NSG flow logs provide layer 4 traffic visibility and (naturally) integrate tightly with Azure’s security and analytics ecosystem.</p> <p>NSG flow logs operate at the NSG level, which means they capture traffic for resources protected by the network security group, such as VMs and subnets. Therefore, use cases for NSG flow logs often have to do with security. Additionally, like VPC flow logs, NSG flow logs offer enriched data, but they deliver it in JSON format, another crucial difference between the two log types.</p> <p>In the output below from Azure’s website, we can see a version 2 NSG flow record in a structured JSON format:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">{
  "records": [
    {
      "time": "2018-11-13T12:00:35.3899262Z",
      "systemId": "66aa66aa-bb77-cc88-dd99-00ee00ee00ee",
      "category": "NetworkSecurityGroupFlowEvent",
      "resourceId": "/SUBSCRIPTIONS/aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e/RESOURCEGROUPS/FABRIKAMRG/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/FABRIAKMVM1-NSG",
      "operationName": "NetworkSecurityGroupFlowEvents",
      "properties": {
        "Version": 2,
        "flows": [
          {
            "rule": "DefaultRule_DenyAllInBound",
            "flows": [
              {
                "mac": "000D3AF87856",
                "flowTuples": [
                  "1542110402,192.0.2.190,10.5.16.4,28746,443,U,I,D,B,,,,",
                  "1542110424,203.0.113.10,10.5.16.4,56509,59336,T,I,D,B,,,,",
                  "1542110432,198.51.100.8,10.5.16.4,48495,8088,T,I,D,B,,,,"
                ]
              }
            ]
          },
          {
            "rule": "DefaultRule_AllowInternetOutBound",
            "flows": [
              {
                "mac": "000D3AF87856",
                "flowTuples": [
                  "1542110377,10.5.16.4,203.0.113.118,59831,443,T,O,A,B,,,,",
                  "1542110379,10.5.16.4,203.0.113.117,59932,443,T,O,A,E,1,66,1,66",
                  "1542110379,10.5.16.4,203.0.113.115,44931,443,T,O,A,C,30,16978,24,14008",
                  "1542110406,10.5.16.4,198.51.100.225,59929,443,T,O,A,E,15,8489,12,7054"
                ]
              }
            ]
          }
        ]
      }
    },
    {
      "time": "2018-11-13T12:01:35.3918317Z",
      "systemId": "66aa66aa-bb77-cc88-dd99-00ee00ee00ee",
      "category": "NetworkSecurityGroupFlowEvent",
      "resourceId": "/SUBSCRIPTIONS/aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e/RESOURCEGROUPS/FABRIKAMRG/PROVIDERS/MICROSOFT.NETWORK/NETWORKSECURITYGROUPS/FABRIAKMVM1-NSG",
      "operationName": "NetworkSecurityGroupFlowEvents",
      "properties": {
        "Version": 2,
        "flows": [
          {
            "rule": "DefaultRule_DenyAllInBound",
            "flows": [
              {
                "mac": "000D3AF87856",
                "flowTuples": [
                  "1542110437,125.64.94.197,10.5.16.4,59752,18264,T,I,D,B,,,,",
                  "1542110475,80.211.72.221,10.5.16.4,37433,8088,T,I,D,B,,,,",
                  "1542110487,46.101.199.124,10.5.16.4,60577,8088,T,I,D,B,,,,",
                  "1542110490,176.119.4.30,10.5.16.4,57067,52801,T,I,D,B,,,,"
                ]
              }
            ]
          }
        ]
      }
    }
  ]
}</code></pre></div> <p>As you can see above, some of the fields in an NSG flow log include:</p> <ul> <li>Source and destination IP</li> <li>Source and destination port</li> <li>Protocol</li> <li>Traffic flow direction (Inbound/Outbound)</li> <li>Traffic flow state</li> <li>Rule name (denoting the NSG rule responsible for allowing or denying traffic)</li> </ul> <p>This may seem very similar to AWS VPC flow logs at first, but an important distinction is that the structured format of NSG flow logs is extensible and works well for both automation and advanced analytics. That, along with NSG flow logs’ focus on the actual security rules protecting subnets, means we can use them for more than just source and destination traffic analysis. We can also use them for more advanced security use cases, such as detecting rule misconfigurations or potential attacks, understanding traffic patterns to optimize routing and security rules, and even forensic analysis, such as investigating incidents by analyzing historical flows.</p>
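<p>Each entry in the <code>flowTuples</code> array packs its fields positionally. As a rough illustration, here’s a small Python sketch that unpacks a version 2 flow tuple like the ones above, following the field order in Azure’s NSG flow log documentation:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Unpack a version 2 NSG flow log "flowTuple" string into a dict.
# Version 2 tuples carry 13 positional fields; the four packet/byte
# counters at the end are empty for flows in the begin (B) state.
TUPLE_FIELDS = [
    "timestamp", "src_ip", "dst_ip", "src_port", "dst_port",
    "protocol",    # T = TCP, U = UDP
    "direction",   # I = inbound, O = outbound
    "decision",    # A = allowed, D = denied
    "state",       # B = begin, C = continuing, E = end
    "packets_src_to_dst", "bytes_src_to_dst",
    "packets_dst_to_src", "bytes_dst_to_src",
]

def parse_flow_tuple(tuple_str):
    """Map a comma-separated flow tuple onto its documented field names."""
    return dict(zip(TUPLE_FIELDS, tuple_str.split(",")))

flow = parse_flow_tuple("1542110379,10.5.16.4,203.0.113.117,59932,443,T,O,A,E,1,66,1,66")
print(flow["src_ip"], flow["dst_ip"], flow["decision"], flow["state"])
# 10.5.16.4 203.0.113.117 A E</code></pre></div>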
<p>Azure utilizes its own set of tools to store and analyze NSG flow logs. Logs are stored in Azure Storage Accounts and can be analyzed with Azure Monitor, Azure Sentinel, or external tools like Splunk.</p> <div as="Promo"></div> <h2 id="the-right-tool-for-the-job">The right tool for the job</h2> <p>Visibility into cloud environments is critical for effective network management, security, and optimization. AWS VPC flow logs and Azure NSG flow logs offer invaluable insights into network traffic. Still, their differences in structure, capabilities, and integration with their respective ecosystems highlight the importance of choosing the right tool for your specific needs.</p> <p>In today’s <a href="https://www.kentik.com/kentipedia/multicloud-networking/" title="Kentipedia: Multicloud Networking - Definitions, Benefits, and Challenges">multi-cloud environments</a>, where AWS and Azure often coexist, organizations require a unified observability solution capable of seamlessly handling both log formats. That way, teams can centralize data analysis, correlate traffic patterns, and gain comprehensive visibility across cloud platforms.</p><![CDATA[Troubleshooting Cloud Traffic Inefficiencies with Kentik AI]]><![CDATA[Balancing cost efficiency and high performance in cloud networks is a constant challenge, especially when misconfigurations or inefficient routing lead to inflated costs or degraded performance. Learn how Kentik Journeys simplifies traffic analysis, helping cloud engineers identify inefficiencies like unnecessary Transit Gateway routing.]]>https://www.kentik.com/blog/troubleshooting-cloud-traffic-inefficiencies-with-kentik-aihttps://www.kentik.com/blog/troubleshooting-cloud-traffic-inefficiencies-with-kentik-ai<![CDATA[Eric Hian-Cheong]]>Tue, 03 Dec 2024 05:00:00 GMT<h2 id="introduction">Introduction</h2> <p>Cloud engineers often face the challenge of balancing cost efficiency with high performance in their networks. Misconfigurations or inefficiencies in traffic routing, particularly in complex and dynamic cloud environments, can lead to inflated costs, degraded performance, or both. Diagnosing these issues requires deep visibility into network traffic flows and patterns.</p> <p>However, traditional workflows often require manually crafting queries, applying filters, and iterating multiple times to understand the data you are working with and get the insights you need. It can be slow, painstaking work. <a href="https://www.kentik.com/solutions/kentik-ai/">Kentik Journeys</a> alleviates this, allowing you to combine the power of AI and natural language with rich network and cloud traffic data to understand patterns and information faster.</p> <p>In this post, we’ll look at how Kentik Journeys can help you optimize cloud traffic and avoid unnecessary costs by identifying traffic routed through a Transit Gateway (TGW) due to asymmetric routing. We’ll also take a look at how Connectivity Checker now seamlessly works with Journeys, along with its new AI Summaries.</p> <div as="WistiaVideo" videoId="cfna1phctc"></div> <h2 id="step-1-identifying-traffic-over-transit-gateways">Step 1: Identifying traffic over Transit Gateways</h2> <p>In this scenario, we want to identify traffic that goes over an AWS Transit Gateway unnecessarily and could be avoided. In a typical case, traffic that flows between VPCs in the same Availability Zone can also be routed using VPC Peering, which is a lower-cost solution than Transit Gateway.</p> <p>Let’s start a new Journey to investigate this further. 
First, let’s look at the traffic inside the same Availability Zones going through a Transit Gateway by asking, <strong>“Show me AWS intra-zone traffic going over the transit gateway in the last week.”</strong></p> <p>Kentik Journeys uses AI to generate a tailored Data Explorer query and filter.</p> <img src="//images.ctfassets.net/6yom6slo28h2/55fe3rg7H7fFzx4nX6d0ex/63912a0d4e4520481e3b4a6403d6ab9c/step1-identify-traffic.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI - Identify Traffic" /> <h2 id="step-2-analyzing-traffic-spikes">Step 2: Analyzing traffic spikes</h2> <p>Next, we can see that traffic occurs in two zones, with a significant spike in <strong>us-east-2a</strong> on November 17. To understand this spike better, let’s refine the view to include IP addresses and instances involved in the traffic.</p> <p>This gives us granular visibility into the endpoints driving the spike, setting the stage for deeper analysis.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1C28374ySyUVpiawNloT5e/d261d298092b71278859e62b20cd2cee/step2-analyze-traffic-spikes.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI - Analyze Spikes" /> <img src="//images.ctfassets.net/6yom6slo28h2/3tdAtVoOlGqzixlekUeuli/768cfb6f2bb3071f65bedd03c8fc10b8/step2-include-ips.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI - Include IPs and Instances" /> <h2 id="step-3-focusing-on-specific-traffic">Step 3: Focusing on specific traffic</h2> <p>The traffic spike reveals a pattern: The majority of traffic flows between IP addresses 10.67.200.223 and 10.68.200.128. To isolate this flow, we’ll ask, <strong>“Filter this for traffic between 10.67.200.223 and 10.68.200.128 in both directions.”</strong></p> <p>The resulting filter confirms that most traffic flows in only one direction. This asymmetry is unusual and suggests a potential misconfiguration or routing issue that needs further investigation.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2gVjiZgfMjerbQvzYomDc/1144623f7afa5a3aefdba24094129e71/step3-specific-traffic.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI - Focus on specific traffic" /> <h2 id="step-4-investigating-applications">Step 4: Investigating applications</h2> <p>Understanding the applications involved can provide additional context. Let’s adjust the query to remove zones and add <strong>application details</strong> and <strong>destination ports</strong>. This reveals the specific services or protocols driving the traffic between the two instances.</p> <p>This step helps pinpoint application-level dependencies that might contribute to the routing inefficiencies.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6ezCym8H7ATOWHP9VvSJ6H/f3a9b2b95b34db2bac1d2b960b5c548e/step4-investigate-apps.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI - Application details" /> <h2 id="step-5-verifying-connectivity">Step 5: Verifying connectivity</h2> <p>The one-sided traffic pattern suggests potential connectivity issues. To verify this, we’ll use Kentik’s Cloud Connectivity Checker to test bidirectional connectivity between <strong>10.67.200.223</strong> and <strong>10.68.200.128</strong> on <strong>TCP port 3020</strong>.</p> <p>The Connectivity Checker examines metadata from AWS, including routing tables, security groups, and network access lists. The results show traffic from 10.67.200.223 to 10.68.200.128 is routed through the TGW, while the return path uses VPC peering. This asymmetric routing explains the incomplete traffic visibility.</p>
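<p>To give a feel for the kind of AWS metadata involved in a check like this (our own illustration, not how Connectivity Checker is implemented), here’s a short boto3 sketch that asks which Transit Gateway route matches a destination prefix. The route table ID is a placeholder, and a complete path check would also walk VPC route tables, security groups, and network ACLs in both directions:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Sketch: ask AWS which Transit Gateway route matches a destination prefix.
# The route table ID is a placeholder; a full bidirectional check would
# also inspect VPC route tables, security groups, and network ACLs.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

resp = ec2.search_transit_gateway_routes(
    TransitGatewayRouteTableId="tgw-rtb-0123456789abcdef0",  # placeholder ID
    Filters=[{"Name": "route-search.exact-match", "Values": ["10.68.200.0/24"]}],
)

for route in resp["Routes"]:
    attachments = route.get("TransitGatewayAttachments", [])
    print(route["DestinationCidrBlock"], route["State"],
          [a.get("ResourceType") for a in attachments])</code></pre></div> <p>Running the equivalent lookup for each direction and comparing the answers (a TGW attachment one way, a peering route the other) is exactly the kind of asymmetry this check surfaces.</p>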
<p>Note the new Kentik AI Summary that is now available with Connectivity Checker. It provides additional context and information about the connections between these two addresses.</p> <img src="//images.ctfassets.net/6yom6slo28h2/kwBELc09DZtMTJgcHJe5U/6ab6ab294c3e5e4f7ec464a5becde84a/step5-verify-connectivity.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI - Check for Connectivity Issues" /> <h2 id="step-6-confirming-bidirectional-traffic">Step 6: Confirming bidirectional traffic</h2> <p>We can confirm bidirectional traffic between the two IPs by asking Journeys to remove the Transit Gateway filter. Additionally, eliminating application and port dimensions simplifies the view, making the routing paths easier to analyze.</p> <p>This confirms that traffic flows correctly in both directions but takes different paths, highlighting the need to optimize the route and send the traffic through VPC Peering to minimize cost.</p> <img src="//images.ctfassets.net/6yom6slo28h2/63S4W8bgFvUcBvKnRKx0Ps/c1626d0c1324a332bf8f0477120c0ea7/step5-confirm-bidirectional-traffic.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik AI - Confirm bidirectional traffic" /> <div as="Promo"></div> <h2 id="conclusion">Conclusion</h2> <p>Inefficient routing and traffic planning can create performance issues anywhere in a network. In the cloud, though, quickly identifying and isolating inefficient routing is even more crucial because it also generates metered costs.</p> <p>This example demonstrates the power of Kentik Journeys in troubleshooting cloud network inefficiencies, leveraging the data you already collect with Kentik in a new and fast way. Kentik helped quickly develop the specific insight needed to optimize this traffic (or inform the cloud team of what they need to do). All that is left to do is create an appropriate route in the AWS account to leverage existing VPC peering to reduce costs and improve performance.</p> <p>With AI-driven insights and natural language queries, you can quickly isolate problems, optimize traffic paths, and enhance your cloud environment’s cost efficiency and performance.</p> <p>Start your <a href="https://www.kentik.com/get-started/">30-day free trial</a> of Kentik today and experience seamless cloud troubleshooting firsthand.</p><![CDATA[Introducing the Kentik Network Jamz Mixtape]]><![CDATA[Cassette tapes are back…sort of! Revisit the joys (and frustrations) of these iconic rectangles of the past with the Kentik Network Jamz Mixtape Bluetooth Speaker -- all the retro vibes, none of the tangled ribbons. Find out how to snag one and join the fun at AWS re:Invent or by sharing your best network-themed song title!]]>https://www.kentik.com/blog/introducing-the-kentik-network-jamz-mixtapehttps://www.kentik.com/blog/introducing-the-kentik-network-jamz-mixtape<![CDATA[David Hallinan]]>Tue, 26 Nov 2024 05:00:00 GMT<p>Remember cassette tapes? Sure, you do.</p> <p>No, not the shiny circles. Those were CDs. Not the big, black circles, either. Those were vinyl records. I can see why you’d be confused between the two; they both spin around, after all.</p> <p>I’m talking about <em>cassette tapes</em>. They debuted in the years <em>between</em> the revolving formats of music listening. And no, not 8-track tapes. 
Those came first, but we don’t talk about them anymore.</p> <p>Cassette tapes were those little plastic rectangles that played your favorite songs. Sometimes, you’d buy them in the music shop, complete with album artwork and a little folded piece of paper that, if you were lucky, had the lyrics to the entire album written in a font so small your eyes would water trying to read them. They had a little ribbon inside that, as it played, would inevitably fall off the track and become a rat’s nest of shiny lace that you’d spend valuable time re-spooling with a number 2 pencil. Remember how fun that was?</p> <p>You could even <em>record over the music</em> that came on them. Say you wanted your own copy of that top-10 single that you’d tune into KNTK every day at 3 pm to hear; with a cassette tape, you could have your own personal copy in much lower fidelity with static crackling that would only worsen as the tape deteriorated.</p> <p>Yeah, now you remember! Cassette tapes!</p> <p>Well, I’m beyond pleased to inform you that cassette tapes are back and better than ever.</p> <p>Introducing the <strong>Kentik Network Jamz Mixtape Bluetooth Speaker</strong>. (It’s a mouthful, I know.)</p> <img src="//images.ctfassets.net/6yom6slo28h2/18GWyPQ9Cy8MYsTlUP0PRX/713572f87cf5fdf40c2288f0128db02b/network-jamz-mixtape.jpg" style="max-width: 700px;" class="image center simple" alt="Network Jamz Mixtape" /> <p>We took everything you love about cassette tapes (the shape) and removed everything you hated about them (everything but the shape).</p> <p>This glorious collectible comes equipped with everything you need to relive those glory days of 1993. Picture it now: The dulcet tones of Kurt Loder’s voice echo through your dimly lit bedroom as a revolving fan gently pulls your Wreckx-N-Effect poster from the wall. The guy who played saxophone on Arsenio Hall has just been sworn into office, and there’s a new episode of Ren and Stimpy on tonight. All is right in the world.</p> <p>Now, you can go back in time without the help of a Delorean and a mason jar full of Plutonium. Through painstaking research and testing, we’ve crafted a playlist of music that speaks to the true nature of network engineering.</p> <p>Featuring your favorite network jams, such as:</p> <ul> <li>Yellow Submarine Cable</li> <li>Sweet Cloud O’ Mine</li> <li>Oops… I Did IoT Again</li> <li>Virtualization Insanity</li> <li>I WAN You To WAN Me</li> <li>And so much more!</li> </ul> <p>(Editor’s note: This is just a speaker. The above songs do not appear on the Kentik Network Jamz speaker. Also, they don’t exist.)</p> <p>So, bust out your Cross Colours, grab your 2-liter Crystal Pepsi, and get ready to rock.</p> <p>You can get your hands on your very own Kentik Network Jamz Mixtape Bluetooth speaker by doing one of the following:</p> <ul> <li> <p>Visit the Kentik booth (<strong>Booth #1781</strong>) at AWS re:Invent from December 2 to 6, 2024.</p> </li> <li> <p>Create your own <strong>amazing network song title</strong> that you think should be featured on our next mixtape and post it to <a href="https://twitter.com/kentikinc">Twitter</a>, <a href="https://www.linkedin.com/company/kentik/">LinkedIn</a>, or <a href="https://bsky.app/profile/kentik.bsky.social">BlueSky</a>. Make sure to tag Kentik on your post and use the hashtag #KentikNetworkJamz. We’ll pick the ten best song titles and send the winners a Kentik Jamz speaker.</p> </li> </ul> <p>Rad, we’re Audi 5000. 
Peace!</p><![CDATA[Anatomy of an OTT Traffic Surge: The Tyson-Paul Fight on Netflix]]><![CDATA[On November 15, Netflix made another venture into the business of live event streaming with the highly-anticipated, if somewhat absurd, boxing match between the 58-year-old former heavyweight champion Mike Tyson and social media star Jake Paul. The five-hour broadcast also included competitive undercard fights including a bloody rematch between Katie Taylor and Amanda Serrano. Doug Madory looks at how Netflix delivered the fight using Kentik’s OTT Service Tracking.]]>https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-the-tyson-paul-fight-on-netflixhttps://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-the-tyson-paul-fight-on-netflix<![CDATA[Doug Madory]]>Thu, 21 Nov 2024 16:00:00 GMT<p>On Friday night, Netflix streamed its <a href="https://www.netflix.com/tudum/articles/jake-paul-vs-mike-tyson-live-release-date-news">much-hyped boxing match</a> between social media star Jake Paul and retired boxing great “Iron” Mike Tyson. The problematic delivery of the fight became almost <a href="https://www.bloomberg.com/news/articles/2024-11-16/downdetector-shows-netflix-having-issues-before-paul-tyson-fight">as big a story</a> as the match itself — which Paul won by decision after 10 rounds.</p> <p>Many viewers (including myself) endured buffering and disconnections to watch the bout, leading many to conclude that Netflix’s streaming architecture was not ready for such a blockbuster event. In an <a href="https://www.instagram.com/p/DCcr9OPSL4u/">Instagram post on Saturday</a>, Netflix conceded that the fight “had our buffering systems on the ropes.”</p> <p>Let’s analyze this traffic surge using Kentik’s OTT Service Tracking…</p> <h2 id="ott-service-tracking">OTT Service Tracking</h2> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/" title="Learn more about the Kentik Service Provider Analytics product">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/">Call of Duty update</a> or the first-ever exclusively live-streamed <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-peacock-streamed-nfl-playoff-game/">NFL playoff game</a>, these OTT traffic events can put a lot of load on a network, and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. Knowing the source and destination IPs of the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p>Kentik <a href="https://www.kentik.com/resources/kentik-true-origin/">True Origin</a> is the engine that powers the OTT Service Tracking workflow. 
True Origin detects and analyzes the DNA of over 1000 categorized OTT services delivered by 79 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p> <h2 id="netflixs-punch-out">Netflix’s Punch Out</h2> <p>So, how big was the event? Do we actually know? On Saturday, Netflix reported an audience of <a href="https://about.netflix.com/en/news/60-million-households-tuned-in-live-for-jake-paul-vs-mike-tyson">60 million households worldwide</a>. While we only have a slice of the overall traffic, we can use it to gauge the increase in viewership.</p> <p>Kentik customers using OTT Service Tracking observed the following statistics, illustrated below. When measured in bits/sec, traffic surged to almost three times the normal peak (unique destination IPs were up 2x), a figure that has been echoed in my private discussions with multiple providers. Unlike every previous episode of <em>Anatomy of an OTT Traffic Surge</em>, here the content was delivered from a single source CDN: Netflix itself.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6FXkIyTxliwNSyI01EYp8N/a779abf349f1922cccebb33e5d0c7b2c/netflix-cdn-paul-v-tyson.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Netflix traffic by source CDN" /> <p>Netflix operates its own fleet of cache servers as part of its <a href="https://openconnect.netflix.com/en/">Open Connect</a> content delivery network. These Open Connect appliances show up in Kentik’s OTT Service Tracking as embedded caches, which is the primary way programming is delivered to Netflix viewers. It was no different for the Tyson fight on Friday.</p> <p>When broken down by Connectivity Type (below), Kentik customers received the traffic from a variety of sources including Embedded Cache (69.6%), Private Peering (23.9%), Transit (3.3%), and Public Peering (IXP) (3.0%).</p> <img src="//images.ctfassets.net/6yom6slo28h2/4dmMN6ra8KQPDqy5BzpdEe/9ac34d112df3fd1b1d8762eac48cc7ed/netflix-cdn-connectivity-type.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Netflix traffic by connectivity type" /> <p>So, what caused the buffering issues that many experienced on Friday night? Streaming media expert Dan Rayburn <a href="https://www.linkedin.com/feed/update/urn:li:activity:7264426234862874625/">posted a helpful list of myths and conjectures</a> that some vendors pushed to explain the challenges that Netflix faced.</p> <p>Bottom line: diagnosing any delivery failures is complicated. People watching movies on Netflix seemed to be fine, while those watching the fight experienced buffering. We’d need to isolate each type of Netflix traffic to fully investigate the live streaming problems.</p> <p>John van Oppen of Ziply Fiber <a href="https://www.linkedin.com/feed/update/urn:li:activity:7263602889363718144/">also wrote on LinkedIn</a> from an engineer’s perspective about what might have caused the buffering problems — while making clear he received “near zero reports of issues” from Ziply’s customers. 
John raised a couple of points worth considering.</p> <p>The first was that, because the content was delivered from a single CDN, there were “fewer paths available for traffic if links filled.” As mentioned earlier, this is the first <em>Anatomy of an OTT Surge</em> post that featured an OTT event delivered by a single CDN.</p> <p>He added that Ziply saw almost 2.5 times normal Netflix traffic levels (not too far from our estimate) — a manageable amount due to the extra capacity they built into their peering connections established at “redundant pairs of locations.” Alternatively, running transit and peering “hot” to save costs could lead to problems when there is a flood of traffic.</p> <p>Let’s take one last look at the data. If we zoom into a normal period of peak Netflix traffic (pictured below), we can see the smooth rise and fall of content being delivered, primarily by Netflix Open Connect appliances (i.e., Embedded Cache).</p> <img src="//images.ctfassets.net/6yom6slo28h2/44FAofRfiW3V0st6J1vurF/8839b21b33abfcfe3856c7212581ff03/netflix-embedded-cache.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Netflix traffic by connectivity type (Wednesday)" /> <p>Contrast the view above with the one below, which corresponds to the delivery of Netflix traffic across all of our OTT customers around the time of the fight. The key aspect to me is the steep rise of Embedded Cache traffic (blue) just before 01:00 UTC on Nov 16 (which was 8pm ET) as millions of people start watching the program.</p> <p>At a certain point, the graph becomes jagged, suggesting traffic isn’t being delivered as expected. Private Peering (green) also becomes jagged, and Transit (yellow) begins to rise to partially supply the content not being satisfied by caching or peering.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5FMkFaXdaJ2SLGEQcTFcb7/71c4ca06b5ee008d577bcfd3eb7aec36/netflix-embedded-cache-jagged.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Netflix traffic by connectivity type (Friday)" /> <p>To really identify the culprit, we’d need to use the OTT Service Tracking to look for saturated links with Netflix — not something I can do with aggregate data here, unfortunately.</p> <h2 id="conclusion">Conclusion</h2> <p>Previously, my colleague Greg Villain <a href="https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update/">described enhancements</a> to our OTT Service Tracking workflow, which allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like Netflix’s Tyson-Paul fight can have impacts in all three areas. OTT Service Tracking is the key to understanding and responding when they occur. Learn more about the application of Kentik for <a href="https://www.kentik.com/product/subscriber-intelligence/" title="Subscriber Intelligence Use Cases for Kentik">subscriber intelligence</a>.</p> <p>Ready to improve <a href="https://www.kentik.com/kentipedia/ott-services/" title="Kentipedia: OTT Services (Over-the-Top Services)">over-the-top service</a> tracking for your own networks? 
<a href="#demo_dialog" title="Request your Kentik demo">Get a personalized demo</a>.</p><![CDATA[Unlocking Peak Performance with Kentik’s Azure Network Observability Tools]]><![CDATA[In today’s multi-cloud landscape, maintaining smooth and reliable connectivity requires complete visibility into cloud networks. With Kentik, network and cloud engineers gain the tools to monitor, visualize, and optimize Azure traffic flows, from ExpressRoute circuits to application performance, ensuring efficient and proactive operations.]]>https://www.kentik.com/blog/unlocking-peak-performance-with-kentiks-azure-network-observability-toolshttps://www.kentik.com/blog/unlocking-peak-performance-with-kentiks-azure-network-observability-tools<![CDATA[Phil Gervasi]]>Thu, 21 Nov 2024 05:00:00 GMT<p>Deep visibility into your public cloud networks is a cornerstone of modern, effective network management. For network and cloud engineers alike, understanding what’s happening within cloud environments such as Azure is not just a necessity – it’s a competitive advantage. Kentik offers <a href="https://www.kentik.com/solutions/microsoft-azure/">unparalleled visibility into public cloud infrastructures</a>, empowering you to filter, parse, and analyze telemetry data exactly how you need it.</p> <h2 id="the-power-of-cloud-visibility-with-kentik">The power of cloud visibility with Kentik</h2> <p>Kentik’s unified platform enables deep insight into your cloud and on-prem environments, including complex topologies within Azure. NSG flow logs, cloud metrics, and important metadata are ingested into the system so you can understand traffic among regions crossing numerous cloud network constructs. Moreover, Kentik allows you to interactively drill down into this telemetry, visualize traffic flows, and easily configure alerts and reports based on what you discover.</p> <div as="WistiaVideo" videoId="q3978bzynh"></div> <p>With Kentik, you gain access to data that helps answer operational questions such as:</p> <ul> <li>What are the primary traffic sources and destinations across our Azure resources?</li> <li>How are ExpressRoute circuits performing, and what kind of traffic are they handling?</li> <li>Which security policies are being used, and which aren’t?</li> </ul> <p>These capabilities provide network and cloud engineers the clarity needed to troubleshoot issues, optimize performance, and enforce security policies across cloud environments.</p> <div as="Promo"></div> <h2 id="exploring-azure-network-topologies-with-kentik-map">Exploring Azure network topologies with Kentik Map</h2> <p>To understand Kentik’s capabilities, let’s first look at the Kentik Map. This feature provides a high-level view of multi-cloud environments, showcasing the different clouds connected to your infrastructure. Here, you can visualize your Azure topology to see connections globally and drill down for detailed insights into traffic flow and resource utilization.</p> <p>The image below shows an overlay of our Azure connections on a global map. Drilling down into a specific connection is only a matter of clicking the link connecting the sites to expand the details pane on the right. 
In this case, we’re looking at total traffic between US-WEST-1 and AP-SOUTHEAST-1.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5DZIto8GdQZVm1MbXP8UKn/6ec72b514be60382d0c509603b069f77/azure-kentik-map.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure in the Kentik Map" /> <p>This is a great overview, but to go deeper, we can view our Azure topology in an interactive layout where many elements are clickable. From the Kentik Map’s topology view, we can easily drill down into a specific cloud network component, on-prem router, VNet, etc.</p> <p>Notice in the image below that we’re looking at traffic flowing from our East US region to VNet Connection Connect-ER-direct-VNET-2023, and then over various connections and ExpressRoute circuits to our on-prem router.</p> <img src="//images.ctfassets.net/6yom6slo28h2/44W3m70QfF1ARmCuRjO2LW/dd02aea85c2a30c1afc186ec99e90d24/azure-topology.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure - network topology view" /> <p>We can also click on many of these elements, such as the individual VNets, ExpressRoute circuits, and even our on-prem router. For example, take a look at the next screenshot to see that when we select the ExpressRoute circuit POV-ER-01-VNET-Direct, we get valuable information about traffic flow, as well as details about the tenant ID, location, provider, traffic, cloud metrics, and peering.</p> <p>This is essential data for engineers monitoring the performance and stability of these dedicated private connections and identifying potential bottlenecks or misconfigurations.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4LuwPLtHoWXZvsx00tDnqd/214e04d973638de9f6c3cc0a7ad0f0c2/azure-topology-details.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure - network topology details" /> <p>The Kentik Map is a powerful tool for visualizing traffic flows and drilling down into details about our network cloud elements, but to parse network and cloud telemetry further, we can use Data Explorer.</p> <h2 id="deep-dive-with-data-explorer">Deep dive with Data Explorer</h2> <p><a href="https://www.kentik.com/blog/unleashing-the-power-of-kentik-data-explorer-for-cloud-engineers/">Kentik’s Data Explorer</a> is where you can truly explore data and deeply analyze specific resources and cloud traffic within Azure. Returning to our ExpressRoute example, imagine you need to investigate egress traffic for a particular circuit. Kentik makes it easy to drill down into exactly what you want to see. The image below shows Azure as our data source and several dimensions to filter our data, including source region, VNet, application, firewall rule, ExpressRoute circuit, and so on.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6MOjhCfpTCJ3De1qOC7RM1/ce583a7d345fdca848ca4c369cd44b8a/azure-data-explorer.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure in Data Explorer" /> <p>Using the built-in filtering options is great, but Kentik understands that exploring data sometimes demands a more complex filtering workflow. What we can do now is create additional filters to fine-tune our query further. 
Notice in the following image that we’ve added filters to narrow down our search to only <em>outbound</em> traffic and a specific ExpressRoute.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Lfs8F9yRWLcuILrwrjbwm/f6e5d67c59571f9cc7c0127ab038ebfd/azure-data-explorer-filters.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure in Data Explorer with filtering" /> <h2 id="configuring-alerts-and-reports-for-proactive-monitoring">Configuring alerts and reports for proactive monitoring</h2> <img src="//images.ctfassets.net/6yom6slo28h2/4B55VW6nkNPQH1ZxNJfxru/faa1a8b753e0c3ba143a794ddcb819d0/azure-alerts.png" style="max-width: 260px;" class="image right" alt="Azure network alerts" /> <p>Another of Kentik’s important features is the ability to create alerts and reports directly from any query in Data Explorer. That means once you’ve narrowed down your data to a specific ExpressRoute circuit, outbound traffic, and applications, you can configure real-time alerts that integrate with your ticketing system or set up scheduled reports to keep your team informed.</p> <p>For example, suppose you’ve identified an increase in outbound traffic from your East US region with just a few mouse clicks. You can then set an alert to trigger when egress traffic exceeds a certain threshold, ensuring proactive monitoring of potential issues. This is especially useful for cloud cost management, since high volumes of egress traffic can lead to higher cloud costs.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6NuuVhLgoT5vHEVnrXnSjR/dee5cfb35e9668fed37631ad4585fbb7/azure-alerting-policy.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure alert policy" /> <h2 id="bringing-visibility-to-azure-networks">Bringing visibility to Azure networks</h2> <p>In today’s multi-cloud world, complete visibility into cloud networks is crucial for efficient operations. Kentik empowers network and cloud engineers with the tools to explore, visualize, and monitor Azure traffic flows, ensuring smooth and reliable connectivity.</p> <p>Whether you’re tracking ExpressRoute circuits, observing application traffic, or setting up alerts for proactive monitoring, Kentik’s platform provides the flexibility and depth needed to optimize cloud infrastructure.</p><![CDATA[The Network Also Needs to be Observable, Part 4: Telemetry Data Platform]]><![CDATA[In order to answer any question, network observability requires a broad range of telemetry data. It takes a capable platform to make the data useful. In this 4th part in the network observability series, Avi Freedman describes requirements for the data telemetry platform.]]>https://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-4-telemetry-data-platformhttps://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-4-telemetry-data-platform<![CDATA[Avi Freedman]]>Wed, 20 Nov 2024 05:00:00 GMT<p>In the previous two blogs, I talked about the need to gather <a href="https://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources">telemetry data</a> from all of the devices and observation points in your network, and across the different types of telemetry available.</p> <p>The previous blog demonstrated practically how gathering more telemetry data directly enhances your ability to answer the questions that help you plan, run, and fix the network, by framing the kinds of questions you can and can’t answer with different types and sources of telemetry. 
If you have other great examples, please also feel free to start a conversation! I’m avi at kentik.com, and <a href="https://twitter.com/avifreedman">@avifreedman</a> on Twitter.</p> <h2 id="awesome-i-have-data-now-what">Awesome, I have data! Now what?</h2> <p><a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/">Network observability</a> deals with a wide diversity of high-speed, high-cardinality, real-time data and requires being able to ask questions both known and unknown. Doing this at network scale has been a hard problem for decades and requires a flexible, scalable, and high-performance "telemetry data platform."</p> <p>(When I speak of this sort of platform, I’m not speaking of New Relic’s well-known Telemetry Data Platform (TDP), specifically, but about telemetry platforms in general. It’s worth noting that Kentik and New Relic are partners: See our case study to learn more about <a href="https://www.kentik.com/resources/case-study-new-relic/" title="New Relic boosts network performance and digital experience with Kentik">how New Relic uses Kentik</a> for network performance and digital experience monitoring, and see <a href="https://newrelic.com/instant-observability/kentik-firehose-pack" title="Kentik Firehose Quickstart">New Relic’s Kentik Firehose Quickstart</a> to easily and quickly import network performance metrics from Kentik.)</p> <h2 id="requirements-for-a-telemetry-data-platform-that-enables-network-observability">Requirements for a telemetry data platform that enables network observability</h2> <p>This creates a set of requirements for data architectures to support network observability:</p> <h3 id="standard-data-bus-for-telemetry">Standard data bus for telemetry</h3> <p>While all that’s required is that data be able to be routed from and to different systems, in practice, most modern network observability data platforms will have a common data bus, often based on Kafka, to support real-time ingest, subscription/consumption, and export of telemetry data.</p> <h3 id="network-primitives-support">Network primitives support</h3> <p>Prefix, path, topology, underlay, and overlay are all concepts that are unique or have unique contexts in the network world. Without the ability to easily enrich, group, and query by prefix, network path, and under/overlay, it’s difficult for networkers to ask questions and reason about their infrastructure.</p> <h3 id="enriched-data">Enriched data</h3> <p>Whether it’s simpler enrichment (like adding IP geography or origin ASN to various network metrics) or completely real-time streaming metadata (like user and application orchestration mapping data and real-time routing association with traffic data), enrichment is key to being able to ask high-level questions of your network.</p>
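<p>As a toy illustration of this pattern (our sketch, not Kentik’s implementation), here’s a minimal Python example using the kafka-python client: it consumes raw flow records from one topic, tags each with an origin ASN from a lookup table, and republishes the enriched record. The topic names and the prefix-to-ASN table are hypothetical, and a real pipeline would use full BGP and geo data, batching, and schema management:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Toy enrichment stage on a Kafka data bus: consume raw flow records,
# add origin-ASN metadata, and republish them for downstream consumers.
# Topic names and the ASN lookup are placeholders for illustration only.
import json

from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

ASN_BY_PREFIX = {"203.0.113.": 64500, "198.51.100.": 64501}  # placeholder table

def origin_asn(ip):
    """Return the ASN whose (toy) prefix matches the address, else None."""
    for prefix, asn in ASN_BY_PREFIX.items():
        if ip.startswith(prefix):
            return asn
    return None

consumer = KafkaConsumer("flows.raw", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda raw: json.loads(raw))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda record: json.dumps(record).encode())

for message in consumer:
    flow = message.value
    flow["src_asn"] = origin_asn(flow.get("srcaddr", ""))
    producer.send("flows.enriched", flow)</code></pre></div>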
<h3 id="high-cardinality-storage">High-cardinality storage</h3> <p>One requirement of a modern network telemetry data platform is the ability to store at high resolution, at full cardinality. Network telemetry comes in dozens of types and hundreds of attributes, and each attribute can have millions of unique values. In modern, complex, distributed network and application infrastructures, it’s impossible to know in advance all the questions you may need to ask.</p> <div style="background-color: rgb(238, 242, 244); padding: 20px; width: 100%;"> <p>Generally, people building on their own data platforms go through multiple backends in their search. Some options we commonly see are:</p> <ul> <li>Backends focused on log telemetry, which can typically handle cardinality but are usually prohibitively expensive, being more oriented to less structured ASCII data.</li> <li>Time-series databases, which operations teams usually are already running. These usually fail quickly with network traffic data because of cardinality, and with routing or other network data because of a lack of support for network primitives.</li> <li>OLAP/analytics databases that rely on rollups or cubing of data, which can look fast, as long as you remove the ability to ask unanticipated questions.</li> <li>Streaming databases, which can be useful for learning patterns across the data, but also remove support for quickly asking new questions over historical data.</li> <li>Distributed columnar databases, which we see most (and use at Kentik) to provide high-resolution storage and querying.</li> </ul> </div> <img src="https://images.ctfassets.net/6yom6slo28h2/FzCmaNlBqZHsOBeKg3kQV/f89f39df227ce4594a73220c5d77e4ca/platform-diagram-revised1.png" class="image center no-shadow" style="max-width: 800px" /> <h3 id="immediate-query-results">Immediate query results</h3> <p>Users need to be able to ask complex questions and get answers in seconds, not hours, to support modern "trail of thought" diagnostic workflows — again, at high cardinality and resolution.</p> <h3 id="massive-multi-tenancy">Massive multi-tenancy</h3> <p>Operational data telemetry platforms need to be able to support dozens of concurrent queries, not just from interactive users via the UI or API but from other operational systems within the network stack and across an organization’s operational systems. Kentik built our own data storage and querying platform because current OSS big-data systems either require rollups or don’t handle multi-tenant large queries well — often blocking on ingest and losing data.</p> <h3 id="streaming-analytics">Streaming analytics</h3> <p>While it’s possible to simulate streaming analytics with fast enough data storage and querying (early on at Kentik, we did this), dedicated streaming analytics is the most common architecture for running a modern machine learning pipeline and thousands of real-time queries on streaming telemetry data, and it is found in most modern and greenfield network telemetry data platforms.</p> <h3 id="apis-and-integrations">APIs and integrations</h3> <p>Modern observability and telemetry data platforms must be open in spec and easy to integrate, ideally with built-in integrations to key cross-stack systems, to allow application, infrastructure, and security engineers to collaborate with a common operational picture. 
This typically means full APIs for provisioning and querying and a streaming API to feed other systems with the volume and richness of normalized and enriched network telemetry in real time.</p> <p>Over the next few months, we’ll host panels inviting builders working in and around network observability to share tips, tricks, and platforms they’ve used, and to connect to broader operational and business observability platforms.</p> <h2 id="conclusion">Conclusion</h2> <p>The good news is — with the right architecture and capabilities, you can create a data platform capable of supporting <a href="https://www.kentik.com/kentipedia/what-is-network-observability/" title="What is Network Observability">network observability</a> and integrating across the business’s data platforms.</p> <p>Going from nothing to a complete observability platform may sound like a huge lift. This is the business Kentik is in, and we’re happy to help, but for those charting their own path, we’re creating resources like this blog series that we hope will be helpful.</p> <hr> <h2 id="telemetry-data-platform-faqs">Telemetry data platform FAQs</h2> <h3 id="what-is-a-telemetry-data-platform">What is a telemetry data platform?</h3> <p>A telemetry data platform is a specialized system designed to collect, store, analyze, and present telemetry data, which is machine-generated data used for monitoring and diagnosing issues in various systems. In the context of network observability, these platforms aggregate data from numerous network devices and endpoints, processing large volumes of high-cardinality, real-time data.</p> <p>Telemetry data platforms provide critical insights into the health and performance of your network by identifying patterns, anomalies, and potential threats. They enable you to ask and answer both known and unanticipated questions about your network, transforming raw data into actionable intelligence.</p> <p>These platforms are often characterized by their ability to handle high-resolution, high-cardinality storage, fast query responses, and comprehensive data enrichment capabilities. They also typically include features for streaming analytics and integrations with other operational systems, creating a cohesive ecosystem for network monitoring and diagnostics.</p> <p>Telemetry data platforms are a fundamental component of a robust network observability strategy, offering the necessary tools to understand and optimize network operations.</p> <h3 id="what-is-high-cardinality">What is high-cardinality?</h3> <p>High-cardinality refers to a data set that contains many unique values or categories. It’s a term often used in the context of databases and telemetry data, particularly when discussing the challenges of managing and analyzing large amounts of data with numerous distinct elements.</p> <p>In a telemetry data platform, high cardinality comes into play when you have a vast number of unique source-destination pairs, different network paths, multiple applications, or a combination of these. For example, if you are monitoring network traffic data, each unique IP address, port number, protocol, or path would be considered a different category, creating high-cardinality data.</p>
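<p>To put rough numbers on this, consider how quickly just a few flow dimensions multiply into an enormous key space. The counts below are invented for illustration, but they show why a naive one-series-per-key store struggles:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Back-of-the-envelope: the potential unique-key space of a few flow
# dimensions. The per-dimension counts are hypothetical; real traffic
# realizes only a fraction of the combinations, but a naive store must
# still be prepared to track each distinct key it sees.
dimensions = {
    "src_ip": 50_000,    # distinct source addresses seen
    "dst_ip": 50_000,    # distinct destination addresses seen
    "dst_port": 5_000,   # distinct destination ports
    "protocol": 3,       # TCP, UDP, ICMP
    "interface": 200,    # router interfaces
}

keyspace = 1
for unique_values in dimensions.values():
    keyspace *= unique_values

print(f"potential unique series: {keyspace:,}")
# potential unique series: 7,500,000,000,000,000</code></pre></div>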
<p>Managing high-cardinality data is a complex task, but it’s essential for network observability. The ability to handle high-cardinality data allows you to maintain granular visibility into your network. It enables you to ask complex, detailed questions about network performance and behavior, down to the level of individual network elements.</p> <p>However, not all systems are capable of effectively storing and querying high-cardinality data. This can limit the scope and detail of your network observability. Therefore, when considering a telemetry data platform, it’s important to ensure that it can support high-cardinality data to provide the depth of insight needed for effective network management.</p><![CDATA[What’s New with Kentik AI: Enhanced Journeys for Cloud Observability, DDoS, Peering, and Faster Network Insights]]><![CDATA[Kentik Journeys is an AI-powered user experience that helps you investigate your network. It combines knowledge about your network with deep GenAI integration to help you answer network questions and solve problems faster than ever. Since launch, we’ve been innovating on Journeys’ capabilities and skills with customer feedback. Here’s a peek at what’s new. ]]>https://www.kentik.com/blog/journeys-empowering-network-insights-optimizing-performance-with-aihttps://www.kentik.com/blog/journeys-empowering-network-insights-optimizing-performance-with-ai<![CDATA[Christoph Pfister]]>Tue, 19 Nov 2024 05:00:00 GMT<p>Earlier this year, we launched <a href="/solutions/kentik-ai/">Journeys, a new AI-powered user experience</a> that unifies your network data and context with machine learning and natural language processing to provide an augmented, richer, and more streamlined way to quickly understand what’s going on in your network.</p> <p>Since we first announced Journeys, we’ve made substantial improvements to the types of questions you can ask and investigations you can perform. We’ve also dramatically improved performance and accuracy, leveraging the latest foundation model advancements. Finally, we’re now allowing our customers to choose which foundation model they want to use, based on personal preference or company policy. We currently support the latest versions of OpenAI GPT, Google Gemini, Anthropic Claude, and Meta Llama.</p> <p>Let’s dive into what’s new and how these updates can transform how you manage and visualize your network.</p> <h2 id="improved-cloud-traffic-support">Improved cloud traffic support</h2> <p>One of the biggest changes is a major improvement in how Journeys interprets questions about cloud traffic and information. You can now ask questions about traffic in, out of, and around your cloud services, such as how much traffic is egressing from your cloud overall, over specific gateways, or across specific subnets. We’re also working on bringing our Connectivity Checker product into Journeys for added cloud support, which is coming in the new year.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7q0bFbkXT3zyaOXnIXS5MX/a968ebeb903a1da5e1dab58e2b191386/journeys-cloud.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Journeys example: Cloud traffic" /> <h2 id="ask-big-picture-questions">Ask “big picture” questions</h2> <p>Of course, asking questions about the current state of the network is only part of what network operations are all about. In many organizations, it is equally important to be able to answer questions about how your network needs to change and evolve to keep up with business needs. In addition to operational queries, Journeys also now supports more open-ended questions to help users address broader, more strategic topics that help the business make critical investment decisions. 
For instance, you can now ask questions about how much transit you might save by peering with a particular AS at a particular internet exchange. We are actively investing in this area and would love your feedback on what else you’d like to see!</p> <img src="//images.ctfassets.net/6yom6slo28h2/5zGOZyzT2sf55Hdgvl1aUe/73027b8125260653a408fd97d0e27f96/journey-peering.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Journeys example: Peering" /> <h2 id="query-on-ddos-attack-alerts">Query on DDoS attack alerts</h2> <p>Proactively managing network security is essential for any digital organization. Journeys now supports DDoS alert queries, allowing you to quickly ask about DDoS threats to your network and filter on those alerts without leaving the Journeys interface. Easily drill into alerts with a single click when you need to learn more.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4t80fNNXEOwOvY1GON4RDj/e652cabc9a57a9459d40624926760a41/journey-ddos.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Journey example: DDoS alerting" /> <h2 id="expanded-knowledge-base-queries">Expanded Knowledge Base queries</h2> <p>Even veteran users of Kentik sometimes find questions about the platform that they don’t know how to answer. Journeys now supports natural language queries of the <a href="https://kb.kentik.com/">Kentik Knowledge Base</a> (KB), making it easier and faster than ever to get started with or learn about Kentik’s capabilities right in the platform. You can now ask questions about how to configure devices, what types of data are available in Kentik, how to use Kentik’s APIs, or any other information contained in the Kentik Knowledge Base. Need to know more? Ask a follow-up question, or just click on the reference links in Journeys and be taken straight to more information in the KB.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/19kBsxSXw3z06HImZypsK2/e63efc1558073df5024ab8ebefee0a52/journey-api-settings.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Journeys example: APIs" /> <h2 id="customizable-journeys-for-personalized-insights">Customizable Journeys for personalized insights</h2> <p>Every network is unique, and so are the questions users need answered. You can now ask Journeys to remember your preferences for future queries, like always providing or sorting on the 95th percentile average by default, or automatically displaying custom dimensions for the data your organization cares most about. These customized views enable users to skip manual data filtering, getting straight to the information that you need to run your business faster than ever.</p> <h2 id="greater-control-over-visibility-of-journeys-across-your-team">Greater control over visibility of Journeys across your team</h2> <img src="//images.ctfassets.net/6yom6slo28h2/6ZwRk0hpkKC0kUQcgMV7YM/37b84105477065349113510814215b73/journey-share.png" style="max-width: 300px;" class="image right" thumbnail alt="Journeys: Share with your team" /> <p>We all want a place to do work or figure something out before we want others to see what we’ve been up to. We’ve also added a new quality-of-life feature to help you control who can see which Journeys. Now, new Journeys created by you are unique to your user ID and are not visible to others in your organization by default. 
A new button will allow you to share your Journeys with your company when ready or for collaboration.</p> <h2 id="beyond-a-co-pilot--an-ai-augmented-network-operations-experience">Beyond a co-pilot — An AI-augmented network operations experience</h2> <p>Journeys is built to intuitively understand your questions and queries in natural language and respond with the right insights you are looking for. But Journeys isn’t just about understanding your questions—it’s about transforming how you interact with your network data. We are working to further expand Journeys’ capabilities to add more context, semantics, and deeper insights to your network data. This will enable you to uncover probable root causes of issues you are trying to investigate, identify patterns, and help your team make sense of the deluge of telemetry data that modern networks generate.</p> <p>Ready to see these new features in action? You can <a href="https://www.kentik.com/get-started/">try Kentik for free</a> for 30 days.</p><![CDATA[Kentik Named a Value Leader in EMA’s 2024 Radar Report for Network Operations Observability]]><![CDATA[We are excited to share that Kentik has been named a Value Leader in EMA’s 2024 Radar Report for Network Operations Observability. This recognition highlights our continued commitment to building an AI-powered, end-to-end observability platform for modern networks, helping network and cloud teams optimize their infrastructures for availability, performance, cost-efficiency, and security.]]>https://www.kentik.com/blog/kentik-named-a-value-leader-in-emas-2024-radar-report-for-network-operationshttps://www.kentik.com/blog/kentik-named-a-value-leader-in-emas-2024-radar-report-for-network-operations<![CDATA[Christoph Pfister]]>Thu, 07 Nov 2024 05:00:00 GMT<h2 id="the-changing-landscape-of-network-observability">The changing landscape of network observability</h2> <p>Today’s modern digital infrastructure is more complex than ever, and networks underpin it all. Network and cloud teams require unified insights and analytics across on-prem, multi-cloud, and internet infrastructures to support their operational and security workflows.</p> <p>EMA’s research underscores that while data diversity — including flow records, VPC Flow, SNMP, streaming telemetry, and synthetic traffic — is crucial, it’s only the first step. Teams need solutions that transform this data into real-time, actionable insights. Predictive analytics, powered by AI and machine learning, are essential for anticipating issues, automating aspects of netops, and responding with agility. This is precisely where Kentik excels.</p> <img src="//images.ctfassets.net/6yom6slo28h2/65Qt5fGhUsFWLlH7imZqli/a9798bb89123b8a6a2538652aedeef78/ema-radar-report-bubble-chart1.png" style="max-width: 550px;" class="image center" thumbnail alt="EMA Radar Report chart" /> <h2 id="an-achievement-for-kentik-and-our-customers">An achievement for Kentik and our customers</h2> <p>EMA recognizes Kentik’s ability to meet these evolving demands, positioning us as a key enabler for enterprises and service providers seeking to stay ahead in today’s modern network landscape. 
Built with scalability and flexibility at its core, Kentik’s platform excels at collecting, enriching, and analyzing diverse data streams to deliver unmatched observability.</p> <p>Our SaaS-based model simplifies deployment and maintenance, while our <a href="https://www.kentik.com/product/global-agents/">global synthetic vantage point network</a> enhances visibility across the internet and cloud ecosystems. EMA’s rigorous evaluation process — including customer interviews, vendor surveys, and live product demos — affirmed Kentik’s strengths across the board, with outstanding scores in eight critical categories:</p> <ul> <li>Proof of Concept (POC) Support</li> <li>Professional Services Requirements</li> <li>Customer Training</li> <li>Administrative Overhead</li> <li>Product Update Impact</li> <li>Resiliency</li> <li>Product Integrations</li> <li>Active Controls</li> </ul> <p>Shamus McGillicuddy, EMA’s VP of Research, Network Infrastructure and Operations, highlighted our commitment to leading with customer-driven innovation:</p> <p>“Kentik has a long history of being an independent network observability company with a singular mission — to make life better for those connecting our world. I believe they are demonstrating success with that mission, and Kentik is poised with the right priorities — including network AI — to continue its culture of innovation for network and cloud teams.”</p> <h2 id="driving-innovation-for-network-operations">Driving innovation for network operations</h2> <p>Kentik’s Value Leader placement wouldn’t have been possible without our relentless focus on innovation. Over the past year, we’ve introduced key advancements that directly address the evolving needs of network and cloud teams:</p> <ul> <li><a href="https://www.kentik.com/solutions/kentik-ai/"><strong>Kentik AI</strong></a>: We’re pushing the boundaries of observability with Kentik AI, leveraging generative AI alongside machine learning (ML) to deliver deeper, faster insights that accelerate and simplify investigating complex networks.</li> <li><a href="https://www.kentik.com/product/network-monitoring-system/"><strong>Kentik NMS</strong></a>: We launched Kentik NMS, the first modern and AI-assisted network monitoring system – bringing Kentik’s unbounded exploration capabilities to network monitoring and unifying SNMP, streaming telemetry, VPC Flow, eBPF, and synthetic telemetry into one integrated platform offering.</li> <li><a href="https://www.kentik.com/product/multi-cloud-observability/"><strong>Multi-cloud traffic optimization</strong></a>: With hybrid and multi-cloud operations becoming the norm, we’ve significantly invested in expanding our platform’s multi-cloud optimization capabilities. Our latest innovation – the <a href="https://www.kentik.com/blog/announcing-the-cloud-latency-map/">Cloud Latency Map</a> – is one tool in a series of new developments helping network and cloud teams <a href="https://www.kentik.com/go/optimize-cloud/">optimize cloud performance, cost, and security without compromise</a>.</li> </ul> <h2 id="thank-you">Thank you</h2> <p>At Kentik, we see this recognition as both an achievement and a call to action. To our customers and partners: thank you for trusting our platform to monitor, secure, optimize, and run your networks.</p> <p>As the network landscape continues to evolve, so will our platform. 
We’re committed to ensuring our customers stay ahead of the curve, driving business success in an increasingly interconnected world.</p> <p>For more details, read our <a href="/press-releases/value-leader-2024-ema-radar-report-network-operations-observability/">press release</a> and <a href="https://www.kentik.com/go/analyst-report/ema-radar-report-network-operations-observability/">download the report</a>.</p><![CDATA[Anatomy of an OTT Traffic Surge: The Fortnite Chapter 2 Remix Update]]><![CDATA[On Saturday, November 2, the wildly popular video game Fortnite released its latest game update: Fortnite Chapter 2 Remix. The result was a surge of traffic as gaming platforms around the world downloaded the latest update for the seven-year-old game. Doug Madory looks at how the resulting traffic surge can be analyzed using Kentik’s OTT Service Tracking.]]>https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-the-fortnite-chapter-2-remix-updatehttps://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-the-fortnite-chapter-2-remix-update<![CDATA[Doug Madory]]>Wed, 06 Nov 2024 16:00:00 GMT<p>Some of the internet’s biggest traffic surges these days involve the global distribution of software updates for popular video games. The latest such event occurred on Saturday when Epic Games released its latest update for Fortnite, one of the world’s most popular games. The <a href="https://gamerant.com/fortnite-chapter-2-remix-dates-skins-trailer/">Fortnite Chapter 2 Remix</a> update caused a massive spike in traffic levels for many service providers.</p> <p>Let’s analyze this traffic surge using Kentik’s OTT Service Tracking…</p> <h2 id="ott-service-tracking">OTT Service Tracking</h2> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/" title="Learn more about the Kentik Service Provider Analytics product">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/">Call of Duty update</a> or the first-ever exclusively live-streamed <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-peacock-streamed-nfl-playoff-game/">NFL playoff game</a>, these OTT traffic events can put a lot of load on a network, and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. Knowing the source and destination IPs of the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p>Kentik <a href="https://www.kentik.com/resources/kentik-true-origin/">True Origin</a> is the engine that powers the OTT Service Tracking workflow. 
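</p> <p>Conceptually, the DNS-to-flow association can be sketched in a few lines of Python. In this toy example, recently observed DNS answers map a server IP back to the hostname a subscriber resolved, and that hostname is used to label each flow with an OTT service. The record shapes and the <code class="language-text">classify_host()</code> helper are illustrative assumptions for the sketch, not Kentik’s implementation.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
from collections import defaultdict

# Hypothetical inputs: DNS answers observed on the network (server IP
# mapped to the hostname a subscriber resolved) and flow records from
# routers. The shapes are invented for illustration.
dns_answers = {
    "203.0.113.10": "downloads.epicgames.example.net",
    "203.0.113.20": "assets1.xboxlive.example.com",
}

flows = [
    {"src_ip": "203.0.113.10", "bytes": 52_000_000},
    {"src_ip": "203.0.113.20", "bytes": 31_000_000},
    {"src_ip": "198.51.100.7", "bytes": 4_000_000},
]

def classify_host(hostname):
    """Toy mapping from a resolved hostname to an OTT service label."""
    if "xboxlive" in hostname:
        return "Xbox Live"
    if "epicgames" in hostname:
        return "Epic Games"
    return "Other"

# Label each flow with the service its server IP was resolved for,
# then total bytes per service.
bytes_by_service = defaultdict(int)
for flow in flows:
    hostname = dns_answers.get(flow["src_ip"], "")
    bytes_by_service[classify_host(hostname)] += flow["bytes"]

print(dict(bytes_by_service))
# {'Epic Games': 52000000, 'Xbox Live': 31000000, 'Other': 4000000}
</pre> <p>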
True Origin detects and analyzes the DNA of over 540 categorized OTT services and providers and more than 50 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p> <h2 id="the-nitty-griddy">The Nitty “Griddy”</h2> <p>Kentik customers using OTT Service Tracking observed the following statistics, illustrated below. When measured in bits/sec, 56% of the traffic during the surge was delivered over the Xbox Live platform, 28.4% through PlayStation, and 14.4% sent directly from Epic Games, the publisher of Fortnite.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5UsBBkTmt3ONxQDBJfjN7R/caad0a008306982bd45bfcca2255f8a7/fortnite-update-by-ott-service.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Fortnite update by OTT service" /> <p>When broken down by Connectivity Type (below), Kentik customers received the Snoop Dogg-themed update from a variety of sources including Private Peering (72.8%, both free and paid), Transit (12.8%), IXP (11.3%) and Embedded Cache (1.9%).</p> <img src="//images.ctfassets.net/6yom6slo28h2/2Hl3orE5w6w4cEzkNqF5s8/6be3277435da001ca00535635d38b1b7/fortnite-update-by-connectivity-type.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Fortnite update by connectivity type" /> <p>The update caused a traffic peak that was over 15 times what was observed at the same time the previous day. It was delivered via a variety of content providers including Fastly (41.7%), Akamai (22.1%), Amazon (10%), and Cloudflare (8.9%), with Edgecast and Qwilt delivering the remaining bits/sec.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2ZpMg1BDlFli3vh3Vz66Uh/c4d690f653565e90569f2ad0a075ca73/fortnite-update-by-source-cdn.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Fortnite update by source CDN" /> <p>In addition to source CDN and connectivity type, users of Kentik’s OTT Service Tracking are also able to break down traffic volumes by subscribers, specific router interfaces and customer locations.</p> <h2 id="how-does-ott-service-tracking-help">How does OTT Service Tracking help?</h2> <p>Previously, my colleague Greg Villain <a href="https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update/">described enhancements</a> to our OTT Service Tracking workflow which allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like this Fortnite update can have impacts in all three areas. OTT Service Tracking is the key to understanding and responding when they occur. Learn more about the application of Kentik for <a href="https://www.kentik.com/product/subscriber-intelligence/" title="Subscriber Intelligence Use Cases for Kentik">subscriber intelligence</a>.</p> <p>Ready to improve <a href="https://www.kentik.com/kentipedia/ott-services/" title="Kentipedia: OTT Services (Over-the-Top Services)">over-the-top service</a> tracking for your own networks? <a href="#demo_dialog" title="Request your Kentik demo">Get a personalized demo</a>.</p><![CDATA[The Network Also Needs to be Observable, Part 3: Network Telemetry Types]]><![CDATA[In part 3 of the network observability series, Kentik CEO Avi Freedman discusses the different categories of telemetry data. 
Avi shows how a complete network observability solution can answer an exponentially broader range of questions. ]]>https://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-3-network-telemetry-typeshttps://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-3-network-telemetry-types<![CDATA[Avi Freedman]]>Tue, 05 Nov 2024 05:00:00 GMT<p>In <a href="/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources/">part 2 of this series</a>, I talked about the range of network devices and observation points that generate <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-4-telemetry-data-platform">telemetry data</a>. Over time, this range has expanded, and networks are more diverse than ever. All of our operational concerns (planning, running, and fixing) need to be coordinated across the complete variety of the networks that affect our traffic.</p> <p>In this post, I discuss the telemetry data itself. Telemetry is the key to seeing, and seeing is the first step in the practice of observability.</p> <p>The wonderful thing about network telemetry is that there are so many types, which also creates the challenge of starting on the <a href="https://www.kentik.com/blog/category/network-observability/">network observability</a> journey!</p> <p>Historically, many systems have taken one or two types of telemetry to answer a more limited set of questions. However, with modern data systems and techniques, it’s possible to take a broader set of telemetry, which opens up an even wider set of use cases and questions that can be answered.</p> <h2 id="telemetry-types">Telemetry types</h2> <h3 id="traffic-telemetry">Traffic telemetry</h3> <p style="border-left: 10px solid #53b8e2; padding-left: 15px;">Information on the flow of traffic across networks &mdash; NetFlow, sFlow, and IPFIX in the classic sense or equivalent, but in a modern sense including cloud VPC Flow Logs, and traffic data in JSON and other interchange formats, service meshes, and services and security proxies.</p> <p>Wire data can also help as a type of traffic data, but we see almost all cloud-focused customers using traffic summaries (flow) because of the difficulty of scaling packet observation in distributed networks.</p> <p>However you get it, traffic is the key “what is” that shows you what users and applications are up to and how they’re interacting with the network! (A minimal sketch of this kind of flow rollup follows the question lists below.)</p> <div style="padding-left: 40px;"> <div style="margin-top: 30px;"><p><strong>Questions you can answer with traffic telemetry:</strong></p></div> <ul> <li>Is this spike/congestion an attack, a misconfig, a distributed system dynamic, or something else?</li> <li>Am I under attack? Am I attacking others? What are the sources? How can I mitigate?</li> <li>What will break if I add filters/change policy?</li> <li>Who did this IP address talk to? What did it do? What did those destinations then do?</li> <li>Why is my bandwidth bill so high?</li> <li>What can I do to localize traffic?</li> <li>How much does this customer cost me?</li> <li>What clouds am I sending the most traffic to?</li> <li>How much traffic is each of my departments using so I can bill them properly?</li> </ul> <div style="margin-top: 30px;"><p><strong>Questions you can’t answer with traffic alone:</strong></p></div> <ul> <li>Is this drop in traffic to Google due to network, application, or other performance issues?</li> <li>What users and applications are consuming my network bandwidth? 
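</li> </ul> </div> <p>As noted above, here is a minimal sketch of the kind of rollup traffic telemetry enables: answering “What are my top talkers?” by grouping flow records by source IP and summing bytes. The record shape is an illustrative assumption, not a particular vendor’s export format.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
from collections import Counter

# Toy flow records: the fields mirror classic NetFlow/IPFIX attributes,
# but the shape is invented for illustration.
flows = [
    {"src": "10.0.0.5", "dst": "192.0.2.9", "dport": 443, "bytes": 900_000},
    {"src": "10.0.0.7", "dst": "192.0.2.9", "dport": 443, "bytes": 400_000},
    {"src": "10.0.0.5", "dst": "198.51.100.3", "dport": 53, "bytes": 12_000},
]

# "What are my top talkers?" -- group by source IP and sum the bytes.
top_talkers = Counter()
for flow in flows:
    top_talkers[flow["src"]] += flow["bytes"]

for src, total in top_talkers.most_common():
    print(src, total)
# 10.0.0.5 912000
# 10.0.0.7 400000
</pre>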
<h3 id="device-telemetry">Device telemetry</h3> <p style="border-left: 10px solid #53b8e2; padding-left: 15px;">Typically expressed as metrics, exposing state of the physical and logical network elements.</p> <p>This typically covers high level stats about both the control and forwarding planes, though usually not the deep telemetry on the traffic flowing across the network. Historically this was CLI, then became majority SNMP, evolved to add API access, and more recent energy has been around streaming telemetry.</p> <div style="padding-left: 40px;"> <div style="margin-top: 30px;"><p><strong>Questions you can answer with device telemetry alone:</strong></p></div> <ul> <li>Is the device running out of memory, overheating, or otherwise showing it might stop working altogether?</li> <li>What are my interface level statistics and usage, now and historically?</li> <li>What are my optical power and optics temperature levels?</li> <li>How much traffic is passing over each LSP?</li> <li>What version of software am I running on my network device?</li> <li>Is my interface down or up?</li> </ul> <div style="margin-top: 30px;"><p><strong>Questions you can’t answer with device telemetry alone:</strong></p></div> <ul> <li>Why is this interface full?</li> <li>Are these interface errors causing performance problems?</li> <li>What are my top talkers?</li> <li>What is the traffic going through this interface or device?</li> <li>What applications and users are passing through a specific interface?</li> <li>Was my spike in traffic an attack?</li> </ul> </div> <h3 id="events">Events</h3> <p style="border-left: 10px solid #53b8e2; padding-left: 15px;">Updates about events occurring from the point of view of the network elements, including alerts such as threshold violations of temperature, CPU, optics, wireless/radio interfaces, or other element health, config changes, and routing session state. Typically such notifications are sent via syslog or SNMP trap.</p> <div style="padding-left: 40px;"> <div style="margin-top: 30px;"><p><strong>Questions you can answer with events alone:</strong></p></div> <ul> <li>Did someone make a change they shouldn’t have?</li> <li>Was the interface shut down or did it flap?</li> <li>Who and what change was made?</li> <li>Did a process crash on my device?</li> <li>How long has my device been operational?</li> <li>Is my network having trouble with routing stability?</li> <li>Is someone attempting to log in to my device?</li> </ul> <div style="margin-top: 30px;"><p><strong>Questions you can’t answer with events alone:</strong></p></div> <ul> <li>Did config changes cause a customer-visible problem?</li> <li>How did the change affect my network traffic and performance?</li> </ul> </div> <h3 id="synthetics">Synthetics</h3> <p style="border-left: 10px solid #53b8e2; padding-left: 15px;">Measurements from testing using “synthetic” traffic &mdash; apart from actual user traffic. 
While synthetics can be triggered or collected via device telemetry interfaces, they are actually a broader category spanning client and server endpoints, network elements, and internet-wide locations performing network and application-layer testing.</p> <div style="padding-left: 40px;"> <div style="margin-top: 30px;"><p><strong>Questions you can answer with synthetics alone:</strong></p></div> <ul> <li>Are there specific links experiencing packet loss, potentially when tested above current traffic levels?</li> <li>What is my performance to specific endpoints, between data centers or from on-prem to cloud?</li> <li>What path is my traffic taking to reach a destination and what is the performance?</li> </ul> <div style="margin-top: 30px;"><p><strong>Questions you can’t answer with synthetics alone:</strong></p></div> <ul> <li>Were any applications/users affected by this bad performance test result?</li> <li>What is causing the performance problem?</li> <li>Did a change in the network cause a performance problem?</li> <li>Were there important customers, users, or application traffic affected by the performance issue?</li> </ul> </div> <h3 id="routing">Routing</h3> <p style="border-left: 10px solid #53b8e2; padding-left: 15px;">Dynamic updates and/or routing table state specifically for the routes or paths, determined and propagated by the network elements.</p> <p>This information tells you (modulo bugs) how traffic or packets will flow through the network under different conditions. Broadly this includes inter-domain (BGP), intra-domain (OSPF, IS-IS, RIP, BGP) and even switching (ARP and CAM) updates and tables. Routing is generally observed by participating in listen-only routing sessions, or for BGP, via BMP. <em>Note: Routing tables can also be thought of as composed in the observability data layer from streams of updates.</em></p> <div style="padding-left: 40px;"> <div style="margin-top: 30px;"><p><strong>Questions you can answer with routing alone:</strong></p></div> <ul> <li>Is someone hijacking my network reachability?</li> <li>What path is my traffic taking through my network?</li> <li>What AS_PATH is my traffic taking to reach an endpoint?</li> <li>What interface is my traffic egressing to a destination AS? 
</li> <li>Am I announcing my customers’ networks properly?</li> </ul> <div style="margin-top: 30px;"><p><strong>Questions you can’t answer with routing alone:</strong></p></div> <ul> <li>Do the networks I see instability on represent any application or user traffic for me?</li> <li>Is routing instability causing performance problems?</li> </ul> </div> <h3 id="configuration-data">Configuration data</h3> <p style="border-left: 10px solid #53b8e2; padding-left: 15px;">The (typically static) configuration data representing the operating intent for all configurable network elements such as addresses, IDs, ACLs, topology info, location data, even device details such as hardware and software versions.</p> <div style="padding-left: 40px;"> <div style="margin-top: 30px;"><p><strong>Questions you can answer with configuration data alone:</strong></p></div> <ul> <li>Did I make an obvious mistake with a configuration change?</li> </ul> <div style="margin-top: 30px;"><p><strong>Questions you can’t answer with configuration data alone:</strong></p></div> <ul> <li>Did I block traffic that shouldn’t have been blocked?</li> </ul> </div> <h3 id="business-and-operational-metadata">Business and operational metadata</h3> <p style="border-left: 10px solid #53b8e2; padding-left: 15px;">Often called “layer 8,” or the data about the use of the network that’s beyond strictly network scope, the business and operational context about what the network is used for is a critical source of telemetry for network observability.</p> <p>There are a wide variety of metadata types to tap into, often already available on data buses. Examples include application orchestration from Kubernetes, VMware, and controllers; user association from IPAM, NAC, and RADIUS; threat intelligence curated by security groups; SaaS and cloud identity mapping; customer or department identification; and “business criticality” metadata including customer size or application criticality to business operations.</p> <div style="padding-left: 40px;"> <div style="margin-top: 30px;"><p><strong>How metadata lets you ask better questions:</strong></p></div> <ul> <li>“Who did this IP address attack?” becomes “What users or hosts were attacked from this IP address? Was this IP address part of my production infrastructure, or was it one of my users or customers?”</li> <li>“What IP addresses were affected?” becomes “What hostname, applications, users, and customers were affected?”</li> <li>“What IP or subnet is using a certain amount of bandwidth?” becomes “What department, user, or customer can I bill for this usage?”</li> </ul> </div> <h3 id="dns">DNS</h3> <p style="border-left: 10px solid #53b8e2; padding-left: 15px;">DNS query streams are also a very useful type of network telemetry, whether observed from the DNS service logs, or from the network traffic on the hosts running DNS servers. In addition, DNS telemetry can be helpful to put traffic and other telemetry types in context.</p> <p>For example, if apps or sites are using cloud infrastructure, flow without DNS may not be able to “see” them distinctly, but adding DNS to traffic data can help you to peer better into your traffic to those properties.</p> <div style="padding-left: 40px;"> <div style="margin-top: 30px;"><p><strong>Questions you can answer with DNS alone:</strong></p></div> <ul> <li>Am I getting or returning DNS failures (NXDOMAIN, SERVFAIL, or otherwise)?</li> <li>What are my frequently queried domain names and talkative DNS clients?</li> <li>What is my DNS request load for my servers? 
</li> </ul> <div style="margin-top: 30px;"><p><strong>Questions you can’t answer with DNS alone:</strong></p></div> <ul> <li>Is slowness in DNS response causing pages to load slowly?</li> </ul> </div> <h3 id="application-data-sources">Application data sources</h3> <p>One explicit note that’s critical to modern network observability is that some of the richest, most granular, and most valuable real-time data to shine light on the network comes from application-layer sources. Most application-layer traffic data has performance instrumentation simply not available from high-speed silicon-accelerated network elements. While network and application observability teams still have work to do to obtain common telemetry, terminology, workflows, and platform interoperability, we see this unification as an active effort across our customer base.</p> <div style="padding-left: 40px;"> <div style="margin-top: 30px;"><p><strong>Questions you can answer with application telemetry alone:</strong></p></div> <ul> <li>Did I return a slow response to a user? How often? When?</li> <li>Am I returning errored responses to users or other application components?</li> </ul> <div style="margin-top: 30px;"><p><strong>Questions you can’t answer with application telemetry alone:</strong></p></div> <ul> <li>Were slow responses due to application or network problems? Which and where?</li> </ul> </div> <h2 id="putting-it-all-together">Putting it all together</h2> <p>Gathering network telemetry data is the key to being able to ask questions, and is the first step in the practice of observability.</p> <p>As I’ve tried to lay out in this blog, a wider and varied set of telemetry types can answer many more questions — and this makes your network more observable! Many common questions require two or more telemetry types to answer, and generally, adding combinations of <a href="https://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources" title="See part 2 of this series, The Network Also Needs to be Observable, Part 2: Network Telemetry Sources">network telemetry types</a> gives you exponentially better ability to ask questions. Which is what network observability is about.</p> <p>Now that we see the need for lots of different network telemetry, from all the key network elements and types, how do we create a practical solution that is capable of handling all this data?</p> <p>That will be the subject of my next blog in this series — the <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-4-telemetry-data-platform/">Telemetry Data Platform</a>.</p><![CDATA[NANOG 92 in Toronto: A Focus on IPv6]]><![CDATA[NANOG 92 has wrapped up, and Kentik’s Field CTO, Justin Ryburn, is here to recap the event. From Kentik-led sessions to IPv6 adoption, real-time routing analysis, and networking for AI data centers, learn about everything that made NANOG 92 an event to remember.]]>https://www.kentik.com/blog/nanog-92-in-toronto-a-focus-on-ipv6https://www.kentik.com/blog/nanog-92-in-toronto-a-focus-on-ipv6<![CDATA[Justin Ryburn]]>Fri, 01 Nov 2024 04:00:00 GMT<h2 id="introduction">Introduction</h2> <p>The North American Network Operators’ Group (<a href="https://nanog.org/">NANOG</a>) 92 conference, held in Toronto from October 21–23, 2024, provided a platform for internet professionals to discuss key advancements in networking technology and infrastructure. 
Known for drawing a diverse mix of engineers, operators, and academics, this conference focused on critical aspects of internet connectivity, routing technologies, and network security. Kentik was a proud sponsor of NANOG 92, so I was able to attend. Here is an overview of the highlights from the agenda and what you missed if you couldn’t make it to Toronto for the event.</p> <h2 id="venue">Venue</h2> <p>NANOG 92 was held at the Westin Harbour Castle in Toronto, Ontario. The Westin Harbour Castle, a popular choice for large conferences, provided an ideal setting on Toronto’s waterfront, offering participants convenient amenities and proximity to the city’s vibrant downtown area. This location allowed attendees to connect, network, and enjoy a professional yet relaxed environment conducive to engaging discussions on critical networking topics. NANOG’s emphasis on community-driven events also offered an atmosphere where network engineers could share knowledge and innovative ideas without being interrupted by marketing-driven presentations.</p> <h2 id="keynote-sessions">Keynote sessions</h2> <p>One of the draws of any NANOG event is the keynote speakers who address the audience on Monday and Tuesday morning. NANOG 92 was no exception.</p> <h3 id="day-1-networking-for-ai-and-hpc-and-ultra-ethernet">Day 1: Networking for AI and HPC, and Ultra Ethernet</h3> <p><a href="https://www.linkedin.com/in/hugh-holbrook-a96912/">Hugh Holbrook</a>, VP of software engineering at Arista, kicked off Day 1 with a keynote titled <em>Networking for AI and HPC and Ultra Ethernet</em>. Holbrook started off by talking about the various challenges that distributed GPU clusters for AI bring to modern data center design. He then walked the audience through the work being done by the Ultra Ethernet Consortium to address these issues. I have sat through a number of networking talks around AI in the last couple of years, and this was one of the best I have seen.</p> <h3 id="day-2-whatever-happened-to-ipv6">Day 2: Whatever Happened to IPv6</h3> <p><a href="https://www.linkedin.com/in/geoff-huston-89182842/">Geoff Huston</a>, chief scientist at APNIC, kicked off Day 2 with a keynote titled <em>Whatever Happened to IPv6.</em> Huston addressed ongoing IPv6 adoption challenges despite the exhaustion of IPv4 addresses, underscoring the need for a robust strategy to encourage organizations to shift from legacy systems to IPv6. His insights resonated with many attendees, as IPv6 adoption remains a critical issue in the networking community.</p> <h2 id="technical-sessions">Technical sessions</h2> <p>As usual, the NANOG Program Committee (PC) did an excellent job of putting together talks, workshops, tutorials, and a hackathon. For those not aware, the program at a NANOG conference is vetted by industry experts to keep the content vendor-neutral and applicable to network operators. To that end, they set a theme for each NANOG conference, and for NANOG 92 that was IPv6. That being said, a wide range of topics were discussed, including DNS, RPKI, EVPN, and even a talk titled <em>Go Long! Sending Weird Signals Long Distances Over Existing Optical Infrastructure</em>. I will put in a plug for my own talk titled <em>BGP Flowspec Doesn’t Suck. We’re Just Using it Wrong,</em> and my colleague Doug Madory’s panel titled <em>Routing Security: Fostering Continuous Improvement</em>. 
The slides are posted on the <a href="https://nanog.org/events/nanog-92/agenda/">agenda page</a>, and the video recordings will be added soon.</p> <h2 id="getting-social">Getting social</h2> <p>Anyone who has been to a NANOG event is aware of what is called the <em>hallway track</em>. This is a term attendees use to describe the value they get out of discussions in the hallway between sessions. One of the main reasons for attending a NANOG event is to spend time talking to fellow network operators. It is a great way to bounce ideas off your peers and learn what they are doing. NANOG 92 provided many opportunities for this, including the NANOG Peering Coordination Forum, Monday Night Social, Beer ‘N Gear (my personal favorite), and Tuesday Night Social. The Monday Night Social was especially fun this time, hosted at <a href="https://nanog.org/events/nanog-92/socials/">The Rec Room</a>, allowing attendees to play games while socializing.</p> <h2 id="conclusion">Conclusion</h2> <p>NANOG 92 was a successful event that provided valuable insights into the latest trends and challenges in the networking industry. As the industry continues to evolve, it is essential for network operators to stay informed and adapt to new technologies and approaches. The next NANOG meeting will be held in Atlanta, Georgia, February 3-5, 2025. We look forward to seeing what new innovations and ideas will be presented at the next event.</p> <hr> <p>This article originally appeared on Justin Ryburn’s personal blog, and Kentik is grateful that we have so many incredible writers on our team who are willing to share their work with us. You can find the original post <a href="https://ryburn.org/2024/10/29/nanog-92-in-toronto-a-focus-on-ipv6/">here</a>.</p><![CDATA[The Cloud Conundrum: Optimizing Without Compromise]]><![CDATA[Balancing cost, performance, and security in cloud infrastructure is challenging, but cloud-mature companies are proving it’s possible to optimize without compromise -- here’s how they do it.]]>https://www.kentik.com/blog/the-cloud-network-conundrum-optimizing-without-compromisehttps://www.kentik.com/blog/the-cloud-network-conundrum-optimizing-without-compromise<![CDATA[Rosalind Whitley]]>Wed, 30 Oct 2024 04:00:00 GMT<p>Cloud technologies have revolutionized the way companies operate, innovate, and grow. Today, cloud computing serves as the backbone of digital strategy, allowing businesses to scale efficiently and stay competitive. But with this rapid growth in cloud utilization comes the challenge of optimizing cloud infrastructure and operations without making trade-offs between cost, performance, and security.</p> <p>In this post, we’ll explore the challenges organizations face when balancing these three priorities, why trade-offs are common, and how cloud-mature companies solve these problems. 
We’ll also examine the benefits of using network observability to optimize without compromise.</p> <h2 id="cloud-adoption-growth-and-complexity">Cloud adoption: Growth and complexity</h2> <p>Cloud utilization continues to soar, with enterprise spending on cloud infrastructure at an all-time high of $79 billion worldwide in Q2 2024 and growing 20% or more year-over-year.<sup><a href="https://www.srgresearch.com/articles/cloud-market-growth-stays-strong-in-q2-while-amazon-google-and-oracle-nudge-higher"><b>1</b></a></sup> Large enterprises, which continue to accelerate cloud adoption, aim to have roughly 60% of their IT environments in the cloud by 2025.<sup><a href="https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/projecting-the-global-value-of-cloud-3-trillion-is-up-for-grabs-for-companies-that-go-beyond-adoption"><b>2</b></a></sup> About half of enterprises use on-premises data centers alongside cloud, but three-quarters use two or more cloud vendors.<sup><a href="#3"><b>3</b></a></sup></p> <p>While the benefits of cloud infrastructure are undeniable, so too are the complexities, especially for companies using multi-cloud and hybrid environments. Each provider has unique pricing structures, service offerings, architectural best practices, and security measures, making it difficult for organizations to consistently manage, monitor, and optimize all aspects.</p> <h2 id="the-cloud-maturity-spectrum">The cloud maturity spectrum</h2> <p>Organizations sit on a spectrum of cloud maturity. Each stage comes with its own set of challenges and focuses, but as companies grow in cloud maturity, they become more focused on optimizing infrastructure versus operationalizing cloud or reactively firefighting issues.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4mFFNB7L37lpowo897Lf9i/2bf3da11927342321394a49a111b16dd/cloud-maturity-scale.png" style="max-width: 700px;" class="image center no-shadow" alt="Cloud maturity spectrum" /> <p>At the far left of the spectrum, organizations are navigating the cloud. They may limit cloud usage to simple storage solutions or basic workloads such as application frontends. At this stage, optimization isn’t a primary focus—getting up and running is.</p> <p>In the middle of the spectrum, companies have started using the cloud for more diverse workloads, with a variety of teams deploying to different service providers and developing more complex cloud architectures. They will begin to notice inefficiencies in cloud spending and may experience performance issues, but may not yet have the tools or expertise to address them comprehensively.</p> <p>At the far right of the spectrum, cloud-mature organizations are architected to utilize cloud for mission-critical operations. They use very large-scale, multi-cloud, or hybrid-cloud environments, and face continuous challenges in their mission to delight users while improving gross margins and maximizing profits. For medium and high-maturity companies, constantly fine-tuning cloud infrastructure and operations is central to business success.</p> <div as="Promo"></div> <h2 id="how-cloud-mature-companies-optimize">How cloud-mature companies optimize</h2> <p>Cloud-mature organizations have little room for inefficiency. They must strike the right balance between cost control, performance, and security. 
These three pillars form the foundation of their cloud strategy:</p> <ol> <li><strong>Cost efficiency</strong>: In large-scale and multi-cloud environments, costs can quickly spiral out of control, while efficiencies hide in plain sight. Cloud-mature organizations must keep spending in check, but avoid cuts that harm performance or weaken security.</li> <li><strong>Performance optimization</strong>: Poor performance—such as high latency and slow page loads—directly impacts user experience and business outcomes. Driving continuous improvement in cloud performance is crucial for companies that rely on seamless digital services.</li> <li><strong>Security</strong>: As enterprises continue to expand cloud surface area and utilize new commodity and managed cloud services, security vulnerabilities increase. These organizations need to keep cloud networks secure from data breaches and compliance risks while balancing performance and costs.</li> </ol> <p>The challenge lies in optimizing all three without compromise. More often than not, organizations end up making difficult trade-offs.</p> <h2 id="the-trade-offs-between-cost-performance-and-security">The trade-offs between cost, performance, and security</h2> <p>Balancing optimization of cost, performance, and security can feel like walking a tightrope. Companies often face the following trade-offs:</p> <h3 id="1-cost-vs-performance">1. Cost vs. performance</h3> <p>In an effort to reduce costs, organizations might scale back resources—such as choosing cheaper storage options or smaller instances. While this approach can save money, it can also lead to performance issues like slower processing speeds or higher latency.</p> <p><strong>The impact</strong>: Reduced performance hurts customer satisfaction, increases abandonment rates, and causes long-term revenue loss. This directly impacts a company’s bottom line.</p> <h3 id="2-performance-vs-security">2. Performance vs. security</h3> <p>Many companies prioritize performance and feature velocity by taking a continuous, agile approach to software development. However, these decisions can leave security as an afterthought, making systems more vulnerable. Alternatively, placing too much emphasis on security can result in slower growth and cumbersome, layered protocols that slow down network performance.</p> <p><strong>The impact</strong>: Sacrificing security for speed can lead to data breaches, non-compliance, and reputational damage, while over-prioritizing security can degrade user experiences, resulting in frustrated customers.</p> <h3 id="3-cost-vs-security">3. Cost vs. security</h3> <p>Organizations trying to rein in cloud spending may fail to comply with security best practices, believing them to be overbuilt or overemphasized. This decision can leave gaps in cloud defenses, exposing sensitive data to security threats.</p> <p><strong>The impact</strong>: Skimping on security may improve margins in the short term, but it can backfire in costly breaches, fines for non-compliance, and long-term reputational damage.</p> <h2 id="the-solution-cloud-optimization-without-compromise">The solution: Cloud optimization without compromise</h2> <p>The secret to achieving the right balance between cost, performance, and security lies in identifying realistic opportunities to positively impact one or more priorities without detracting from another. 
Some optimization opportunities are easier to implement than others, measured in either dollars or engineering hours, but virtually all of them require improving observability. A 2024 Gartner report notes that “For many enterprises, increasing costs associated with storing and analyzing observability data offsets the benefits they receive from it.”<sup><a href="#4"><b>4</b></a></sup> To realize real benefits, enterprise infrastructure teams must focus on the lowest-effort, highest-yield optimization opportunities and identify the most cost-effective ways to capture value from the huge volume of telemetry available today.</p> <p>Network optimization represents a critical, but commonly overlooked, opportunity to impact all three priorities—cost, performance, and security—without compromise. To optimize cloud networks, organizations need comprehensive visibility into cloud infrastructure that can quickly provide actionable insight into the traffic and metrics that impact each priority.</p> <h3 id="ai-powered-network-observability-a-game-changer">AI-powered network observability: A game changer</h3> <p>Kentik is trusted by leading organizations like <a href="https://www.kentik.com/go/optimize-cloud/?int_source=optimize-cloud">Zoom, Box, and Booking.com</a> to optimize cloud cost, performance, and security simultaneously. Here’s how Kentik’s network observability helps:</p> <h4 id="1-one-platform-for-total-visibility">1. One platform for total visibility</h4> <p><a href="https://www.kentik.com/product/multi-cloud-observability/?int_source=optimize-cloud">Kentik’s AI-powered network observability platform</a> provides unified visibility across all environments, across every public cloud account and on-premises data center. By centralizing traffic telemetry and performance metrics, organizations can holistically monitor their entire cloud ecosystem and spot inefficiencies, bottlenecks, and security vulnerabilities in real-time. No more piecing together data from different sources or relying on costly homegrown solutions.</p> <h4 id="2-immediate-answers">2. Immediate answers</h4> <p>Cloud environments are complex, and when performance or security issues arise, they can be time-consuming to investigate, whether as part of an optimization project or a response to an incident. Kentik’s platform provides immediate, data-driven answers to network questions. This means faster resolutions, more rewarding optimizations, and smoother operations—all of which contribute to enhanced user experiences.</p> <h4 id="3-cost-insights-that-drive-savings">3. Cost insights that drive savings</h4> <p>Without proper oversight, cloud costs can quickly spiral out of control. Kentik enables companies to keep an eye on transit and other connectivity costs across large-scale environments and multiple cloud providers like AWS, Azure, Google Cloud, and OCI. With expert network-layer insight, organizations can identify cost-saving opportunities that don’t sacrifice performance or security and achieve measurable savings in weeks rather than months.</p> <h4 id="4-simple-high-performance-querying">4. Simple, high-performance querying</h4> <p>Kentik’s user-friendly interface makes it easy for teams to get the insights they need from traffic flow data without the need to construct complex SQL queries. Natural language or GUI queries based on hundreds of data attributes enable fast querying and decision-making. 
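</p> <p>To make attribute-based querying concrete, here is a generic sketch in plain Python (purely illustrative, and not Kentik’s API) of the group-and-sum question such interfaces answer: group enriched flow records by one attribute and total the bytes per group.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
from collections import defaultdict

# Illustrative enriched flow records: real platforms attach many more
# attributes (application, cloud account, region, ASN, and so on).
flows = [
    {"app": "checkout", "cloud": "aws", "region": "us-east-1", "bytes": 70_000},
    {"app": "search", "cloud": "gcp", "region": "us-central1", "bytes": 55_000},
    {"app": "checkout", "cloud": "aws", "region": "eu-west-1", "bytes": 20_000},
]

def group_sum(records, dimension):
    """Group records by one attribute and total the bytes per group."""
    totals = defaultdict(int)
    for record in records:
        totals[record[dimension]] += record["bytes"]
    return dict(totals)

# "Which applications generate the most traffic?" in one call:
print(group_sum(flows, "app"))    # {'checkout': 90000, 'search': 55000}
print(group_sum(flows, "cloud"))  # {'aws': 90000, 'gcp': 55000}
</pre> <p>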
Out-of-the-box visualization tools further simplify the process, allowing organizations to see results clearly and act quickly.</p> <h4 id="5-automated-network-context">5. Automated network context</h4> <p>Unlike CSP-native monitoring tools and agnostic observability platforms, Kentik automatically enriches all telemetry with deep network context, turning IP addresses and obscure attributes into meaningful, human-readable internet, business, application, and security information. This gives organizations quick and actionable insights on how to optimize cost, performance, and security—without needing to build custom tooling or re-architect their infrastructure.</p> <h2 id="conclusion-no-more-trade-offs">Conclusion: No more trade-offs</h2> <p>In the fast-paced world of cloud computing, organizations can no longer afford to compromise between cost, performance, and security. By partnering with Kentik to bring network optimization into your cloud operations strategy, you have the power to address all three simultaneously. Stop making tough trade-offs and leaving money on the table. Optimize cloud networks smarter, faster, and more securely with Kentik. Visit our <a href="https://www.kentik.com/go/optimize-cloud/?int_source=optimize-cloud">Cloud Resources Hub</a> to learn more.</p> <hr> <ol> <li> <p><a href="https://www.srgresearch.com/articles/cloud-market-growth-stays-strong-in-q2-while-amazon-google-and-oracle-nudge-higher">Synergy Research: Cloud Market Growth Stays Strong in Q2 While Amazon, Google and Oracle Nudge Higher</a>, 2024</p> </li> <li> <p><a href="https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/projecting-the-global-value-of-cloud-3-trillion-is-up-for-grabs-for-companies-that-go-beyond-adoption">McKinsey: $3 trillion is up for grabs in the cloud</a></p> </li> <li> <p><a name="3"></a>IDC Cloud Pulse Survey, 2023</p> </li> <li> <p><a name="4"></a>Gartner: Prepare for the Future of Observability, 2024</p> </li> </ol><![CDATA[Announcing the Cloud Latency Map]]><![CDATA[Today, we’re excited to announce the launch of Kentik’s Cloud Latency Map, a public service that uses Kentik Synthetics to continuously measure latency between the regions of the biggest cloud providers.]]>https://www.kentik.com/blog/announcing-the-cloud-latency-maphttps://www.kentik.com/blog/announcing-the-cloud-latency-map<![CDATA[Doug Madory]]>Tue, 29 Oct 2024 04:00:00 GMT<p>The <a href="https://clm.kentik.com">Cloud Latency Map</a> is the latest expression of Kentik’s dedication to extending <a href="https://www.kentik.com/product/multi-cloud-observability/">network observability to the cloud</a>. The Map is powered by a small army of software agents hosted in cloud regions around the world. These agents are capable of performing a range of monitoring functions and form the basis of <a href="/product/synthetic-monitoring/">Kentik Synthetics</a>.</p> <h2 id="what-is-it">What is it?</h2> <p>The <a href="https://clm.kentik.com">Cloud Latency Map</a> is a <strong>free public tool</strong> that allows users to explore the latencies measured between over 100 different cloud regions worldwide. 
It can be used to compare latencies over common routes or identify recent changes in observed latencies between specified public clouds, cloud regions, or large geographic areas.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1X5lfpqCrjZEUFXMmsYXGX/e109b6cb1a6f9bba741f41cdd15d30ed/cloud-latency-map-overview.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud Latency Map" /> <p>In cloud parlance, a “cloud region” refers to a geographic location where cloud services are hosted. A cloud region typically consists of multiple distinct data centers within a metropolitan area. In addition to hosting our agents in all of the regions of the big three public clouds (AWS, Azure, Google Cloud), we also host agents in the clouds of IBM and Oracle.</p> <p>Broadly speaking, the tool can assist someone trying to determine if there is a connectivity issue impacting particular cloud regions. Additionally, since the public clouds rely on the same physical infrastructure as the rest of the global internet, the Map can often pick up on the latency impacts of failures of core infrastructure, such as the loss of a major submarine cable.</p> <h2 id="how-do-you-use-it">How do you use it?</h2> <p>There are two key components of the Cloud Latency Map: Latency Comparisons and Latency Changes. Before we get into these, let’s briefly discuss latency.</p> <p>Typically, the primary factor contributing to latency between two faraway locations is distance. There is a theoretical upper limit on how fast data can travel down a fiber optic cable due to the speed of light. As a result, there is a minimum latency that can be achieved between any two locations. (Light travels through fiber at roughly 200,000 km per second, about two-thirds of its speed in a vacuum, so a 10,000 km path imposes a floor of roughly 50ms one way, or 100ms round trip.) However, suboptimal routing can increase the distance traveled and thus increase the latency. Sustained excessive latency may cause problems for internet-based applications, but spikes in latency can often be indicative of other connectivity problems.</p> <p>In the first component, <strong>Latency Comparisons</strong>, we compare latencies between any two of eight cities common to the three major public clouds (AWS, Azure, Google Cloud). This provides an “apples-to-apples” comparison, as traffic has to traverse the same geographic distance between the two cities, and can therefore highlight instances of persistent suboptimal routing. Within this view, there are two options: “Intra-cloud” and “Inter-cloud.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/30qSGcBc8sE6fD3LVmRhDF/d40f6aeb574e8d6693eb83d26e99b75f/cloud-latency-map-latency-comparisons.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud Latency Map: Latency Comparisons" /> <p><em>Intra</em>-cloud latencies are those measured between regions of the same cloud, while <em>Inter</em>-cloud latencies are based on measurements between different clouds. As one might expect, the Inter-cloud latencies tend to be higher as they require a traffic hand-off between two clouds governed by two separate networking teams which may employ differing interconnection strategies.</p> <p>Lastly, the Latency Comparison section has a subsection which lists the largest latency differences observed between common cities for both the Intra-cloud and Inter-cloud measurements. 
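</p> <p>Here is a minimal sketch of one way to compute such a difference for a single city pair, using invented latency values (the exact aggregation the Map uses is not specified here):</p> <pre style="background-color: #f8f8f8; padding: 20px;">
# Hypothetical median latencies (ms) from Sydney to Tokyo, keyed by the
# (source cloud, destination cloud) combination measured. The values
# are invented for illustration only.
sydney_to_tokyo = {
    ("aws", "aws"): 108.9,
    ("azure", "azure"): 109.4,
    ("azure", "aws"): 131.2,
    ("gcp", "aws"): 134.1,
}

# The difference is simply the largest minus the smallest latency
# observed along the route.
largest = max(sydney_to_tokyo.values())
smallest = min(sydney_to_tokyo.values())
print("Sydney to Tokyo difference:", round(largest - smallest, 1), "ms")
# Sydney to Tokyo difference: 25.2 ms
</pre> <p>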
In other words, these differences are calculated by subtracting the smallest latency from the largest latency observed along a route.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5p89q00nRwoyanBJOy8prS/f2bb68385b70aed1d2acbae8f7500a7f/cloud-latency-map-intra-inter-cloud.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud Latency Map: Largest Latency Differences" /> <p>In the second component, <strong>Latency Changes</strong>, the Map can be used to explore measurements experiencing latency changes among the over 10,000 individual measurement series in the dataset. The user can choose to restrict the query by selecting a particular cloud, geographic region, or city as the source and destination. This allows the user to find any measurements experiencing changes to a particular city or cloud in the past seven days.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2accfwNntOwssVd3gBbZij/9446e5f06f1452f51a5c6555c494e4c5/cloud-latency-map-latency-changes.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud Latency Map: Latency Changes" /> <div class="caption" style="margin-top: -35px;">The Map showing latencies experiencing the largest changes from AWS to Azure in the past week.</div> <p>Let’s say there is a report of a fiber cut in a particular city. One can use the Map to check whether any of the latency measurements to or from that city have experienced changes in recent days as a way of identifying a disruption. Any spikes in latency may have been caused by the infrastructure failure.</p> <p>In the example above, latencies from Azure’s <code class="language-text">southafricanorth</code> region in Johannesburg, South Africa, occasionally jump by 100ms or more when going to Asia. We see this behavior in all of the cloud regions in South Africa and believe this is caused by traffic to Asia getting routed up the west coast of Africa versus a more direct route on the east.</p> <h2 id="recent-observations">Recent observations</h2> <p>So, what are some examples of interesting observations we’ve seen on the Map lately?</p> <p>While there haven’t been any dramatic events like submarine cable cuts in recent days, there are always interesting changes appearing on the Map. Let’s take a look at latency changes that popped up involving AWS’s <code class="language-text">cn-north-1</code> region in Beijing, China.</p> <p>At around 17:00 UTC on October 22, 2024, the Map reported increases in latencies to this region from a variety of Google Cloud locations in Europe. It takes two to tango, so we don’t know from this data alone which cloud made the change that caused all of these independent measurements to increase at the same time.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6rhTDinRXL11n02lXySAxi/dda4812cec5e34ac959051a726fa3601/cloud-latency-map-google-cloud.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud Latency Map: Google Cloud - Europe" /> <p>At the same time, the Map showed decreased latencies from Google Cloud locations in Asia. In fact, the measurements from Asia appeared to be bimodal, with the plot bouncing between two discrete populations of latencies. 
After the change, the bimodality went away, and the latencies appeared lower and more stable.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1ZcKYLFam0OiaIpvHPFry7/d9c5c6a371ce911cf03aa287ad6b0306/cloud-latency-map-google-cloud-asia.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud Latency Map: Google Cloud - Asia" /> <p>In the Latency Comparison section, many of the latencies exhibit a tight distribution and are quite stable, but the Sydney to Tokyo route has something worth analyzing. The path from Sydney to Tokyo requires the traversal of multiple submarine cables, the combination of which can greatly influence the overall latency.</p> <p>Both Inter-cloud and Intra-cloud latencies along this route hover just under 110ms, except for the Azure→AWS and GCP→AWS routes, which exhibit latencies of 131ms and 134ms, respectively. In other words, traffic from GCP and Azure in Sydney to AWS in Tokyo experiences latency more than 20ms higher than other cloud region combinations — including from AWS.</p> <p>This is likely caused by how AWS in Tokyo accepts traffic from outside of its cloud. Traffic from non-AWS regions in Sydney is likely traversing a different combination of submarine cables to reach AWS in Tokyo.</p> <img src="//images.ctfassets.net/6yom6slo28h2/11jMrfSQY0ng6x5FoU7WxD/b1d1325b779fd3a0c889157214c1cff5/cloud-latency-map-tokyo.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud Latency Map:" /> <h2 id="conclusion">Conclusion</h2> <p>Try it out! It’s free!</p> <p>As the Map illustrates, it’s a volatile internet out there — even for rarefied traffic between the hyperscalers. Take a look around and see what you can find. The data is updated every hour, and we have a list of features we’d like to add to it in the future.</p> <p>If you’d like to tailor this monitoring capability for your own organization, request a demo of Kentik’s <a href="https://www.kentik.com/product/synthetic-monitoring/">synthetic monitoring</a>.</p><![CDATA[Beyond Their Intended Scope: Uzing into Russia]]><![CDATA[The first installment of our new blog series, *Beyond Their Intended Scope*, covers BGP mishaps that may have escaped the community's attention but are worthy of analysis. In this post, we review a recent BGP leak that redirected internet traffic through Russia and Central Asia as a result of a path error leak by Uztelecom, the incumbent service provider of Uzbekistan.]]>https://www.kentik.com/blog/beyond-their-intended-scope-uzing-into-russiahttps://www.kentik.com/blog/beyond-their-intended-scope-uzing-into-russia<![CDATA[Doug Madory]]>Wed, 23 Oct 2024 04:00:00 GMT<p>Welcome to a new blog series entitled <em>Beyond Their Intended Scope</em>, which intends to shed some light on BGP mishaps that may have escaped the attention of the community but are worthy of analysis.</p> <p>In the first installment, let’s take a look at a recent BGP leak that redirected internet traffic through Russia and Central Asia as a result of a path error leak by Uztelecom, the incumbent service provider of Uzbekistan.</p> <p>It may come as a surprise to many that route leaks continue to occur with some regularity. The difference today is that routing hygiene has improved to such a point that these leaks are often contained to the country or region where they originated, limiting the disruption. 
Despite this, they are still important to study as they can help us better understand what is and isn’t working in routing security.</p> <h2 id="what-are-bgp-leaks">What are BGP leaks?</h2> <p>“A route leak is the propagation of routing announcement(s) <em>beyond their intended scope</em>.”</p> <p>That was the overarching definition of a BGP route leak introduced by <a href="https://datatracker.ietf.org/doc/html/rfc7908">RFC7908</a> in 2016. Border Gateway Protocol (BGP) enables the internet to function by providing a mechanism by which autonomous systems (e.g., telecoms, companies, universities, etc.) exchange information on how to forward packets based on their destination IP addresses.</p> <p>In this context, the term “route,” when used as a noun, is shorthand for the prefix (range of IP addresses), AS_PATH, and other associated information relating to packet delivery. When routes are circulated farther than where they are supposed to go, traffic can be misdirected, or even disrupted, as happens numerous times per year.</p> <div as="Promo"></div> <p>RFC7908 went on to define a taxonomy for BGP leaks by enumerating six common scenarios, half of which appear in the two leaks covered in this post. In my writing on route leaks, I like to group them into two broad categories: mis-originations and path errors. As I described in my blog post last year, <a href="https://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidents/">A Brief History of the Internet’s Biggest BGP Incidents</a>, this distinction is useful because the two types of error require different mitigation strategies.</p> <ul> <li>A mis-origination occurs when an AS originates (announces with its ASN as the origin) a new advertisement of a route to an IP address block over which it does not possess legitimate control, consequently soliciting traffic destined to those IP addresses.</li> <li>An AS path error occurs when an AS inserts itself as an illegitimate intermediary into the forwarding path of traffic bound for a different destination.</li> </ul> <p>With those definitions out of the way, let’s get into the details of this particular incident.</p> <h2 id="what-happened">What happened?</h2> <p>Beginning at 06:25 UTC on September 26, 2024, Uztelecom (AS28910), incumbent of the former Soviet republic of Uzbekistan, began leaking routes from its peers through one of its transit providers, Rostelecom (AS12389), Russia’s state telecom. <a href="https://x.com/Qrator_Radar/status/1839202143993213100">First reported</a> by our friends over at <a href="https://qrator.net">Qrator</a>, the incident involved the leak of over 3,000 routes, lasting about 40 minutes and misdirecting traffic from a dozen countries.</p> <table> <tbody> <tr> <td>Mis-origination or path error?</td> <td>Path error</td> </tr> <tr> <td>How many leaked routes?</td> <td>3144 (371 seen by 50+ Routeviews peers)</td> </tr> <tr> <td>New more-specifics?</td> <td>None</td> </tr> </tbody> </table> <p>In this post, we’ll take a closer look at what happened, the impact on traffic, as well as what can be done to prevent these types of incidents in the future.</p> <h2 id="impacts">Impacts</h2> <h3 id="bgp">BGP</h3> <p>Let’s start at the individual BGP message level and work out from there.</p> <p>In the two BGP messages below collected from <a href="https://www.routeviews.org/routeviews/">Routeviews</a>, we can see AS14041 (University Corporation for Atmospheric Research) in the United States initially accepting an Amazon route from its transit provider Arelion (AS1299). 
During the leak, AS14041 began selecting the leaked route <em>with a longer AS path</em> from a peer, Hurricane Electric (AS6939). The leaked route is likely preferable because of a localpref setting which would prefer <em>sending traffic for free</em> through a peer, regardless of the AS path length, over <em>paying to send traffic</em> through a transit provider.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
TIME: 09/26/24 06:00:12
FROM: 192.43.217.142 AS14041
ORIGIN: IGP
ASPATH: <b>14041 1299 174 16509</b>
NEXT_HOP: 192.43.217.142
COMMUNITY: 1299:25000 14041:102
ANNOUNCE
185.2.49.0/24
</pre> <p>BGP announcement containing the AS28910 leak:</p> <pre style="background-color: #f8f8f8; padding: 20px;">
TIME: 09/26/24 06:28:49.293030
FROM: 192.43.217.142 AS14041
ORIGIN: IGP
ASPATH: <b>14041 6939 12389 28910 8359 16509</b>
NEXT_HOP: 192.43.217.142
COMMUNITY: 14041:400
ANNOUNCE
185.2.49.0/24
</pre> <p>Kentik’s BGP visualization, shown below, illustrates how routes from hundreds of BGP vantage points changed for this Amazon prefix during the leak. While the lower portion of the visualization shows a pruned ball-n-stick AS-level diagram, the upper graph depicts the ASes observed upstream of Amazon’s ASN (AS16509) for this route by count of BGP vantage points. The evidence of the leak is marked in red boxes.</p> <p>During the leak, VEON (AS3216) suddenly emerges as a popular upstream, as it is a transit provider for leaker Uztelecom (AS28910). Traffic traveling along the leaked route could either be misdirected through Russia (AS12389) and Uzbekistan (AS28910) or simply go undelivered due to congestion or excessive latency.</p> <img src="//images.ctfassets.net/6yom6slo28h2/20tv8P7dzVjjYEk50HA4Ck/c61696a5e5cc93c0716c6943457cf471/uztelecom-amazon.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP Monitoring: VEON" /> <p>Another prefix sucked into the leak (162.158.84.0/24) belonged to Cloudflare. As depicted in the upper graph of the visualization below, this prefix is normally present in the tables of just over half of our BGP sources — likely a regional route with intentionally limited propagation to steer traffic in only this part of the world.</p> <p>As was the case in the previous example, AS3216 appeared as a new popular upstream of Cloudflare (AS13335) for 162.158.84.0/24 as AS28910 began leaking the route through AS12389. Leaks are especially impactful to routes with limited propagation like this one because, for much of the internet, there is no alternative version for the leaked route to contend against, making it an automatic winner in the BGP selection algorithm.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5nmUzhGndwofO34nNPFgUc/ecdfad42aec6ea1f4db6105a3e84490f/uztelecom-cloudflare.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP Monitoring: Cloudflare" /> <p>Hosting provider xTom also had routes impacted by this leak, including 103.135.234.0/24 shown below. Like the previous example, the leak appears in the upper graph as a new popular upstream to xTom — Russian fixed-line operator Transtelecom (AS20485), another network providing transit (albeit indirectly) to AS28910. 
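</p> <p>Why does a route with a <em>longer</em> AS path keep winning? Because BGP compares local preference before it ever looks at path length. The sketch below shows that slice of the best-path decision; the localpref values are invented for illustration, and real routers apply many more tie-breakers (origin, MED, router ID, and so on).</p> <pre style="background-color: #f8f8f8; padding: 20px;">
# Simplified slice of BGP best-path selection: higher local preference
# wins before AS path length is considered. The localpref values here
# are invented for illustration.
routes = [
    {"via": "transit (Arelion)", "local_pref": 80, "as_path_len": 3},
    {"via": "peer (Hurricane Electric)", "local_pref": 200, "as_path_len": 5},
]

def preference(route):
    # Compare by local_pref first (higher is better), then by AS path
    # length (shorter is better).
    return (route["local_pref"], -route["as_path_len"])

best = max(routes, key=preference)
print("Selected:", best["via"])
# Selected: peer (Hurricane Electric) -- despite its longer AS path
</pre> <p>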
<p>Kentik’s BGP visualization, shown below, illustrates how routes from hundreds of BGP vantage points changed for this Amazon prefix during the leak. While the lower portion of the visualization shows a pruned ball-and-stick AS-level diagram, the upper graph depicts the ASes observed upstream of Amazon’s ASN (AS16509) for this route by count of BGP vantage points. The evidence of the leak is marked in red boxes.</p> <p>During the leak, VEON (AS3216) suddenly emerges as a popular upstream, as it is a transit provider for leaker Uztelecom (AS28910). Traffic traveling along the leaked route could either be misdirected through Russia (AS12389) and Uzbekistan (AS28910) or simply go undelivered due to congestion or excessive latency.</p> <img src="//images.ctfassets.net/6yom6slo28h2/20tv8P7dzVjjYEk50HA4Ck/c61696a5e5cc93c0716c6943457cf471/uztelecom-amazon.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP Monitoring: VEON" /> <p>Another prefix sucked into the leak (162.158.84.0/24) belonged to Cloudflare. As depicted in the upper graph of the visualization below, this prefix is normally present in the tables of just over half of our BGP sources — likely a regional route with intentionally limited propagation to steer traffic in only this part of the world.</p> <p>As was the case in the previous example, AS3216 appeared as a new popular upstream of Cloudflare (AS13335) for 162.158.84.0/24 as AS28910 began leaking the route through AS12389. Leaks are especially impactful to routes with limited propagation like this one because, for much of the internet, there is no alternative version for the leaked route to contend against, making it an automatic winner in the BGP selection algorithm.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5nmUzhGndwofO34nNPFgUc/ecdfad42aec6ea1f4db6105a3e84490f/uztelecom-cloudflare.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP Monitoring: Cloudflare" /> <p>Hosting provider xTom also had routes impacted by this leak, including 103.135.234.0/24, shown below. As in the previous examples, the leak appears in the upper graph as the emergence of a new popular upstream of xTom: Russian fixed-line operator Transtelecom (AS20485), another network providing transit (albeit indirectly) to AS28910. Despite its AS path length, the leaked version of the route becomes quite popular during the 40-minute leak.</p> <img src="//images.ctfassets.net/6yom6slo28h2/16FSuF6nVhz2O2dLLBmlou/42c7998c5d3c8742fda237fe82072c69/uztelecom-xtom.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP Monitoring: xTom" /> <h3 id="traffic">Traffic</h3> <p>One of the unique things Kentik can do is take a BGP incident and analyze its impact on internet traffic. Using Kentik’s unique aggregated NetFlow data, we can get a sense of how much traffic was misdirected versus dropped entirely.</p> <p>During ingest, Kentik annotates each NetFlow record with the AS path of the destination IP from the perspective of the router producing the NetFlow. This allows us to make a query that retrieves NetFlow records annotated with AS paths that match the leak.</p>
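<p>Conceptually, that query looks something like the sketch below. The record fields and helper names are invented for illustration (Kentik’s actual pipeline and query interface differ), but the core idea is the same: match flows whose annotated AS path contains the leaked adjacency, then exclude the traffic expected to use it.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
# Toy version of the query described below: find flow records whose
# destination AS path contains the leaked adjacency 12389 -> 28910,
# excluding traffic destined for Uzbekistan (the expected case).
# Record fields are hypothetical stand-ins for annotated NetFlow.

def contains_adjacency(as_path, left, right):
    return any(a == left and b == right
               for a, b in zip(as_path, as_path[1:]))

def leaked_flows(flows):
    for flow in flows:
        if flow["dst_country"] == "UZ":
            continue  # the majority case, and to be expected
        if contains_adjacency(flow["dst_as_path"], 12389, 28910):
            yield flow

flows = [
    {"dst_country": "UZ", "dst_as_path": [12389, 28910], "bits": 9000},
    {"dst_country": "HK",  # misdirected during the leak
     "dst_as_path": [3216, 12389, 28910, 8359, 16509], "bits": 4200},
]

for f in leaked_flows(flows):
    print(f["dst_country"], f["bits"])  # prints: HK 4200
</pre>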
<p>Below is the result of a query which retrieved aggregate NetFlow directed along AS paths containing “12389 28910” (Rostelecom to Uztelecom) — excluding traffic destined for Uzbekistan, since that’s the majority case and to be expected.</p> <p>What results is a fascinating shift in the makeup of the traffic along 12389_28910 during the leak. The impact is clearly evident as the traffic goes from being solely destined for Tajikistan and Turkmenistan to also being destined for countries outside of Central Asia: Hong Kong, the Netherlands, Afghanistan, Russia, Japan, and the United States.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Sj5vNeMZJc5TVN2JFeSqT/e518fce527c886c1e4b022e9850bccf4/uztelecom-traffic.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Internet traffic along 12389_28910" /> <h2 id="call-to-action">Call to action</h2> <p>Years ago, large routing leaks like these might have been the cause of <a href="https://www.theregister.com/2019/06/24/verizon_bgp_misconfiguration_cloudflare/">widespread internet disruption</a>. Not so much anymore.</p> <p>Humans are still (for the time being) configuring routers and, being human, are prone to the occasional mistake. What has changed is that the global routing system has become better at containing the inevitable goof-ups. Route hygiene has improved due to efforts like <a href="https://www.manrs.org/">MANRS</a> and the hard work of network engineers around the world.</p> <p>In September 2024, the White House Office of the National Cyber Director released the <a href="https://www.whitehouse.gov/wp-content/uploads/2024/09/Roadmap-to-Enhancing-Internet-Routing-Security.pdf">Roadmap to Enhancing Internet Routing Security</a>, which aims to address BGP vulnerabilities. The report cited my <a href="https://www.kentik.com/blog/rpki-rov-deployment-reaches-major-milestone/">recent collaboration</a> with Job Snijders of Fastly on RPKI ROV adoption, a technology that aims to reduce the disruption caused by things like BGP route leaks.</p> <p>That progress is largely at the macro level, but there are many individual networks that have not yet deployed RPKI, and to them I would say the following:</p> <ul> <li>Creating ROAs will help to protect your inbound traffic by asserting to the rest of the internet which origin is the legitimate one.</li> <li>Conversely, rejecting RPKI-invalids helps to protect your outbound traffic by rejecting leaked routes that might misdirect that traffic or lead to a disruption.</li> </ul> <p>By continuing to reduce the impact of BGP leaks, we can focus on the harder problems left to be solved in routing security, such as the “determined adversary” scenario witnessed in last year’s <a href="https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services/">attacks on cryptocurrency services</a>. In that realm, there is still much work to be done.</p><![CDATA[The Network Also Needs to be Observable, Part 2: Network Telemetry Sources]]><![CDATA[In part 2 of the network observability series, we tackle the first key to the input needed for network observability -- the networks and network elements from which we gather telemetry data.]]>https://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sourceshttps://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources<![CDATA[Avi Freedman]]>Thu, 10 Oct 2024 05:00:00 GMT<h2 id="introduction">Introduction</h2> <p><a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/">In part 1 of this series</a>, I talked about the importance of network observability as our customers define it — using advances in data platforms and machine learning to supply answers to critical questions and enable teams to take critical action to keep application traffic flowing.</p> <p>Most of the history of network operations has been supported by monitoring tools: mostly standalone, closed systems, seeing one or a couple of network element and telemetry types, and generally on-prem and one- or few-node, without modern, open-data architectures.</p> <p>Networkers running enterprise and critical service provider infrastructure need infrastructure-savvy analogs of the same observability principles and practices being deployed by DevOps groups. We see these DevOps teams unifying logs, metrics, and traces into systems that can answer critical questions to support great operations and improved revenue flow.</p> <p>We see the network observability platforms, teams, and tool-builders needing:</p> <ul> <li>Telemetry input from all critical networks and forwarding elements</li> <li>The key telemetry types to shine a light on network activity and health</li> <li>The critical context that enables teams to ask questions about users, applications, and customers (and not just IP addresses and ports)</li> </ul> <p>In part 2 of this series, to continue diving into what’s needed to make the network observable, we tackle the first key to the input needed for network observability — which networks and network elements to get telemetry from.</p> <h2 id="consider-the-range-of-network-telemetry-sources-and-observation-points">Consider the range of network telemetry sources and observation points</h2> <p>To achieve <a href="https://www.kentik.com/blog/monitoring-vs-observability-understanding-the-role-of-each">observability</a> in modern networks, it is key to gather the state of all of the networks your application traffic traverses — overlay and underlay, physical and virtual, as well as the ones you run and the ones you don’t.</p> <p>The breadth of <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-3-network-telemetry-types">network telemetry</a> sources we see in modern networks includes the components of network types such as:</p> <ul> <li><strong>Cloud infrastructure</strong>: Elements specific to the cloud such as service meshes, transit and ingress gateways</li> <li><strong>Data center</strong>: Leaf and spine switches, top of rack, modular, fixed and stackable. API gateways for digital services.</li>
<li><strong>Internet and broadband infrastructure</strong>: The internet itself that connects the clouds, applications, and users. Access and transit networks, edge and exchange points, CDNs.</li> <li><strong>4G, 5G</strong>: Including evolved packet core (v)EPC, Multi-access edge computing (MEC), optical transport switches (ONT/OLT), Radio Access Network (RAN)</li> <li><strong>IoT</strong>: IoT endpoints, gateways and industrial switches for consumer, smart city, and corporate</li> <li><strong>Campus</strong>: Ethernet switches, layer 2 and 3 switches, hubs and network extenders. Wireless access points and controllers.</li> <li><strong>Traditional WAN</strong>: WAN access switches, integrated services routers, cloud access routers</li> <li><strong>SD-WAN</strong>: Access gateways, uCPE, vCPE, and composed SD-WAN services including their cloud overlays</li> <li><strong>Service provider backbone</strong>: Edge and core routers, transport switches, optical switches, DC interconnects</li> <li><strong>MSO</strong>: Cable Access Platforms (CAP) and CMTS, Optical Distribution Network (ODN), Broadband Network Gateway (BNG), and the virtualized versions of these</li> </ul> <p>It’s also critical to think about the forwarding and control elements and observation points:</p> <ul> <li><strong>Network devices</strong>: Physical and virtual routers, switches, wireless access points, application delivery controllers, and a myriad of possible on-prem or cloud-based devices</li> <li><strong>Endpoints</strong>: Both eyeball and server/service endpoints, including physical, virtual, and overlay/tunnel interfaces</li> <li><strong>Controllers</strong>: Software-defined network controllers, orchestrators, and path computation applications that program network configuration</li> <li><strong>TAPs, SPAN, NPB</strong>: Access points in the network that provide port mirroring, tapping, or packet brokering</li> <li><strong>L4-7 network elements</strong>: Web appliances, content delivery networks, and application delivery controllers that generate traffic, or route, shape, and control it</li> <li><strong>Firewalls and other security appliances and services</strong>: As physical and logical (VM, VNF, CNF) gateways, policy enforcement, and telemetry sources, the security layer is both part of the network and key to full-stack debugging of operational issues</li> <li><strong>Application layer</strong>: ADCs, load balancers, and service meshes</li> </ul> <div class="pullquote right" style="max-width: 330px;">Most companies can’t yet see a unified view across these networks and key elements in one place.</div> <p>There’s probably nothing on these lists that comes as any surprise, other than the fact that most companies can’t yet see a unified view across these networks and key elements in one place.</p> <p>This highlights one of the big challenges of making the network observable. Our networks have been built up with a wide range of devices — from multiple vendors, old and new, physical and virtual — all working together.
Network observability must include most, or all, of these to be capable of answering the questions critical to keeping application and user traffic fast and available.</p> <p>The good news is that it’s possible — with modern data platforms and an inclusive, upfront design — to get started, add value, and iterate/repeat towards complete coverage.</p> <h2 id="conclusion">Conclusion</h2> <p>In the past, it may have been okay for the network to consist of interconnected islands, each with its own network monitoring tools. With the shift to DevOps and application-driven everything, we simply can’t work in this fragmented way anymore. All of our operational concerns (planning, running, and fixing) need to be coordinated across the complete variety of the networks that affect our traffic.</p> <p>In my next blog, the third in this series, I will discuss the types of <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-3-network-telemetry-types">network telemetry data</a> that are generated across this wide range of network types and devices.</p><![CDATA[Hurricane Helene Devastates Network Connectivity in Parts of the South]]><![CDATA[In this post, we dig into the impacts from Hurricane Helene which came ashore late last month wreaking destruction and severe flooding in the Southeastern United States. Using Kentik's traffic data as well as Georgia Tech's IODA, we detail the impacts in three of the hardest-hit states: Georgia, South Carolina, and North Carolina.]]>https://www.kentik.com/blog/hurricane-helene-devastates-network-connectivity-in-parts-of-the-southhttps://www.kentik.com/blog/hurricane-helene-devastates-network-connectivity-in-parts-of-the-south<![CDATA[Doug Madory]]>Tue, 08 Oct 2024 04:00:00 GMT<p>Hurricane Helene came ashore late last month wreaking destruction and severe flooding in its path. As of this writing, over <a href="https://www.axios.com/2024/10/04/hurricane-helene-deadliest-us-storms-death-toll">200 people tragically lost their lives</a>, and countless others have been displaced from their homes (including one Kentik employee).</p> <p>As the hard-hit regions continue to recover from the storm’s devastation, let’s take a moment to review what we’re seeing in terms of the impacts on internet connectivity in three of the hardest-hit states: Georgia, South Carolina, and North Carolina.</p> <img src="//images.ctfassets.net/6yom6slo28h2/28eU180sCYQPaABlWRdJcY/008ecfdecc18c19a055ced1956c805fe/hurricane-helene.jpg" style="max-width: 600px;" class="image center" alt="Radar map of Hurricane Helene" /> <h2 id="state-level-impacts">State-level impacts</h2> <p>Based on Kentik’s aggregate NetFlow data, traffic volumes to numerous providers in affected states experienced similar changes before and after the arrival of Helene. As the storm approached on September 26, providers handled a surge of traffic as residents of the southeast used the internet to closely follow the latest news about the storm.</p> <p>Following the arrival of Helene, drops in traffic were seen around the region as the storm washed out roads and caused power outages. The immediate decline in traffic can also be attributed to people focusing on addressing their immediate needs that don’t require internet service.</p> <h3 id="georgia">Georgia</h3> <p>In Georgia, there were service impacts around the state.
The graphic below illustrates traffic volume to three Georgia-based service providers: Clearwave Fiber (AS400511), ATC Broadband (AS11240), and the Brantley Telephone Company (AS394473).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1FqdyDCnrbeOz3nwRB2P5W/3b7a0d338660955ed7abff2a645a1ec8/hurricane-helene-impacts-georgia1.png" style="max-width: 600px;" class="image center" alt="Hurricane Helene impacts in Georgia" /> <p>Seen below, the impact on Alma-based ATC Broadband is also well <a href="https://ioda.inetintel.cc.gatech.edu/asn/11240?from=1727222410&#x26;until=1727913610">illustrated</a> in <a href="https://ioda.inetintel.cc.gatech.edu">IODA</a>, the public outage analysis tool from our friends at Georgia Tech. In this graphic, the green line (representing BGP routes) is stable while the blue line (representing active measurements into AS11240) drops out early on September 27 before slowly recovering over the following days.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3kAMS2dKb7zBpHDHxeX6rm/1ee59b39a4385ddd16947177cdb9882b/atc-broadband1.png" thumbnail style="max-width: 800px;" class="image center" alt="Impact on ATC Broadband" /> <p>US broadband giant Comcast has three service regions within Georgia alone, as defined in their published <a href="https://lir.comcast.net/geodata">geolocation data</a>. These three regional networks experienced different levels of disruption based on their proximity to the storm’s path and the resilience of each region’s infrastructure. Their Atlanta Regional Network, the largest by traffic volume, showed little impact from Helene. In contrast, the Savannah Regional Network showed lower traffic levels beginning on September 27 before recovering as power was restored and Georgians in this part of the state began resuming normal life. It was the Augusta Regional Network that took the hardest blow from Helene and, as of this writing, has not yet fully recovered.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5a88In5s7EvBY012cSvg3v/79682c9eff0efbce7e08694ed050dade/hurricane-helene-impacts-georgia-comcast1.png" style="max-width: 600px;" class="image center" alt="Impacts on Comcast in Georgia" /> <p>IODA also picked up disruptions to a variety of other providers around the state including Northland Cable Television (<a href="https://ioda.inetintel.cc.gatech.edu/asn/40285?from=1727222420&#x26;until=1728302400">AS40285</a>), the Pembroke Telephone Company (<a href="https://ioda.inetintel.cc.gatech.edu/asn/15313?from=1727222420&#x26;until=1728302400">AS15313</a>), and the Glenwood Telephone Company (<a href="https://ioda.inetintel.cc.gatech.edu/asn/397118?from=1727222420&#x26;until=1728302400">AS397118</a>) — all located in the southern half of the state.</p> <h3 id="south-carolina">South Carolina</h3> <p>As Helene continued moving north after Georgia, South Carolina was the next state to feel its wrath.</p> <p>In the graphic below, we can observe the drop in traffic levels for three South Carolinian providers upon the arrival of Helene.
Based on Kentik traffic data, Lexington-based Carolina Connect (AS397068), broadband provider Breezeline (AS11776), and Piedmont Rural Telephone (AS19212) all experienced significant drops in traffic as a result of the storm.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2NSnNW4AdAxoqOjnBgNpI1/69b00c03a76728d6ecd3b5cd25ed0bb1/hurricane-helene-impacts-south-carolina.png" style="max-width: 600px;" class="image center" alt="Hurricane Helene impacts on South Carolina" /> <p>Using BGP and active measurement instead of traffic, IODA gives an alternative view of the disruptions in South Carolina. Below is its <a href="https://ioda.inetintel.cc.gatech.edu/asn/10279?from=1727222420&#x26;until=1728302400">graphic</a> for West Carolina Communications (AS10279). As was the case for ATC Broadband in Georgia, West Carolina Communications’ BGP routes stayed up (green line), while active measurement (i.e., responding pings) dropped precipitously on September 27 as last mile infrastructure became unreachable.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3nT30EhbLpLpwfQRD28C77/959efa7e78ad93f4b656ed94c9bdc94b/wccl.png" style="max-width: 800px;" class="image center" thumbnail alt="West Carolina Communications" /> <p>The drop in active measurement suggests that the drops in traffic volume weren’t simply based on changes in user behavior but were due to technical failures caused by the hurricane. Some other disruptions that were visible in IODA include TruVista (both <a href="https://ioda.inetintel.cc.gatech.edu/asn/20222?from=1727222420&#x26;until=1728302400">AS20222</a> and <a href="https://ioda.inetintel.cc.gatech.edu/asn/21898?from=1727222420&#x26;until=1728302400">AS21898</a>) based in Tifton, Rock Hill Telephone (<a href="https://ioda.inetintel.cc.gatech.edu/asn/14615?from=1727222420&#x26;until=1728302400">AS14615</a>), and the Fujifilm manufacturing plant (<a href="https://ioda.inetintel.cc.gatech.edu/asn/32186?from=1727222420&#x26;until=1728302400">AS32186</a>) in Greenwood, SC.</p> <h3 id="north-carolina">North Carolina</h3> <p>Western North Carolina has been arguably the hardest-hit region. Below is a graphic based on traffic to the cities of Asheville, Hickory, Lenoir, and Morganton (subject to geolocation accuracy). As seen in previous charts, there was a peak of traffic on September 26, followed by a dramatic drop-off and slow recovery. By October 3, traffic levels were only at 50% of where they were prior to the storm.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/khY8WjOebOynuZZ2tUYqA/fcbf9fdb68b6df06f6c13af3e17b7dfd/kentik-western-north-carolina1.png" style="max-width: 800px;" class="image center" thumbnail alt="Internet traffic to Western North Carolina" /> <p>In perhaps a bit of irony, the <a href="https://ncdc.noaa.gov/">National Climatic Data Center (NCDC)</a>, based in Asheville, North Carolina, suffered an outage due to Helene. The NCDC is responsible for housing the world’s largest active archive of weather data and went offline at 14:43 UTC on September 27 before coming back online four days later. A Kentik BGP visualization of an NCDC route (192.153.129.0/24) is shown below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4DJHOAPMeR9CLwyRbXLprX/9ac1386eaab6071e7c9ed67536c3bdec/ncdc-asheville.png" style="max-width: 800px;" class="image center" alt="National Climatic Data Center in the Kentik platform" /> <p>Pictured below are the traffic plots for three service providers based in North Carolina.
Based in Hendersonville, Morris Broadband (AS53488) was completely knocked offline for almost five days. Wilkes Communications (AS22191) and Skyline Telephone (AS23118) also experienced significant drops in traffic and are still in the process of recovering.</p> <img src="//images.ctfassets.net/6yom6slo28h2/53uWwr89QtNiRfXJYjtxRf/a39bdb22986f8cd1e9002985b5af12e3/hurricane-helene-impacts-north-carolina1.png" style="max-width: 600px;" class="image center" alt="Hurricane Helene impacts on North Carolina" /> <p>The <a href="https://ioda.inetintel.cc.gatech.edu/asn/53488?from=1727222420&#x26;until=1727913620">IODA dashboard</a> below gives another view of the Morris Broadband outage. AS53488’s BGP routes were withdrawn (green line) during three separate periods, while active measurement (blue line) showed a lack of responsiveness even when those routes were being announced. This suggests that while their address space returned to the global routing table, the last-mile connectivity was still down and unreachable.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4vZd91x67A65SWur0JaAgf/f07147bd6c81375a16ffe41cd7dbeeb2/morris-broadband1.png" style="max-width: 800px;" class="image center" thumbnail alt="Morris Broadband connectivity" /> <p>Other significant impacts in Western North Carolina that were captured in IODA include Pyranah Communications (<a href="https://ioda.inetintel.cc.gatech.edu/asn/40063?from=1727222420&#x26;until=1728302400">AS40063</a>), BalsamWest (<a href="https://ioda.inetintel.cc.gatech.edu/asn/14430?from=1727222420&#x26;until=1728302400">AS14430</a>), and Skyrunner (<a href="https://ioda.inetintel.cc.gatech.edu/asn/53274?from=1727222420&#x26;until=1728302400">AS53274</a>).</p> <p>Finally, <a href="https://bgp.tools/as/81">AS81</a> operates the <a href="https://www.mcnc.org/">North Carolina Research and Education Network</a> (NCREN), which saw three of its routes withdrawn at 16:22 UTC on September 27: 152.18.0.0/16 (University of North Carolina at Asheville), 152.30.0.0/16 and 152.30.10.0/24 (Western Carolina University in Cullowhee).</p> <h2 id="conclusion">Conclusion</h2> <p>The devastation from Hurricane Helene was widespread and profound. In its aftermath lay washed-out roads, downed power lines, and disconnected communities. In this post, we have documented this disruption in terms of withdrawn BGP routes, drops in traffic levels, and non-responding ICMP pings, but none of these metrics can capture the toll on the people living in these places.</p> <p>Along with basic necessities like food, water, and shelter, the ability to communicate is vital during a crisis like this. The loss of communication capabilities can hamper rescue and recovery efforts. Additionally, the inability to reach loved ones during a time of crisis like this can be stressful and traumatic. All of this underscores the importance of the work the internet industry does, whether in <a href="https://time.com/6222111/ukraine-internet-russia-reclaimed-territory/">war-torn Ukraine</a> or the hurricane-ravaged Blue Ridge Mountains of North Carolina.</p> <p>Please follow <a href="https://www.nbcnews.com/news/us-news/help-victims-hurricane-helene-rcna173627">this link</a> for a list of charities and nonprofit groups that are accepting donations to help those affected by Hurricane Helene.</p><![CDATA[Unleashing the Power of Kentik Data Explorer for Cloud Engineers]]><![CDATA[Kentik Data Explorer is a powerful tool designed for engineers managing complex environments.
It provides comprehensive visibility across various cloud platforms by ingesting and enriching telemetry data from sources like AWS, Google Cloud, and Azure. With the ability to explore data through granular filters and dimensions, engineers can quickly analyze cloud performance, detect security threats, and control costs, both in real time and historically.]]>https://www.kentik.com/blog/unleashing-the-power-of-kentik-data-explorer-for-cloud-engineershttps://www.kentik.com/blog/unleashing-the-power-of-kentik-data-explorer-for-cloud-engineers<![CDATA[Phil Gervasi]]>Tue, 01 Oct 2024 04:00:00 GMT<p>Cloud infrastructure, whether a single cloud, hybrid, or multi-cloud environment, can be a complex beast to manage. For cloud engineers, visibility across distributed environments and the massive amounts of telemetry data they generate is not just a nice-to-have — it’s a need-to-have. Without a good way to explore, analyze, and act on the data, cloud performance issues can go unnoticed, security threats can slip through, and costs can skyrocket.</p> <p><a href="https://www.kentik.com/resources/video-data-explorer-in-kentik-portal/">Kentik Data Explorer</a> is a powerful tool for cloud engineers looking to make sense of the enormous volume and variety of telemetry data that flows through their infrastructure. Data Explorer is how engineers can explore the entire underlying database of telemetry in the system, in real time and historically. That includes a diverse range of cloud and cloud-related data sources and relevant metadata such as application and security tags, geo-location, user/customer IDs, circuit IDs, DNS information, and more.</p> <div as="WistiaVideo" videoId="f4vxtvn2ic"></div> <p>In effect, Data Explorer allows cloud engineers to slice and dice their data using Dimensions, the categorical filtering options that enable granular visibility into the entire database. This means that on one screen, you can parse data from multiple sources like your SD-WAN, AWS, and on-prem data center and track application activity across the entire thing.</p> <p>In addition to Dimensions, Data Explorer also has a filtering wizard that allows you to get even more granular, chaining filters to include or exclude whatever is essential to you in whatever time frame is relevant to you.</p> <h2 id="ingesting-and-enriching-diverse-telemetry-data">Ingesting and enriching diverse telemetry data</h2> <p>Kentik ingests a variety of important cloud telemetry such as AWS and Google VPC Flow logs, Azure NSG Flow logs, the various cloud firewall logs, cloud metrics, more specific types like Transit Gateway and VNET Flow logs, AKS, EKS, and GKE metrics, and so on.</p> <p>However, it’s not just about collecting raw telemetry. Kentik enriches this data with multiple types of metadata that significantly enhance visibility by adding context.</p> <p>In the screenshot below, you can see a simple example of the output of multi-cloud traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7yRtdXBMsObKpRfMO5jTz8/fb7ccfcc88497cf7122e7a715223dc54/multicloud-path-visualization.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Multi-cloud path visualization" /> <p>This is a great start, but the likelihood is that you’ll want to go deeper because you’re looking into something very specific.
To do that, we start to adjust our Dimensions and filters, including adding metadata.</p> <p>Just a few examples of metadata include:</p> <ul> <li>Cloud provider tags</li> <li>Geo-IDs</li> <li>User and customer IDs</li> <li>DNS information</li> <li>Pod names and process IDs</li> <li>Security and application tags</li> <li>Custom metadata</li> </ul> <p>That last bullet is pretty interesting. Kentik enables you to define your own metadata, allowing you to tag and categorize telemetry based on your specific requirements. This means you can make your queries more specific and contextually relevant.</p> <p>So, with the added relevant metadata and all the other telemetry from on-prem resources, we get a more comprehensive view of cloud network traffic and inter-cloud communications.</p> <div as="Promo"></div> <h2 id="exploring-data-through-dimensions">Exploring data through dimensions</h2> <p>A Dimension is essentially any property of the data that you can use to organize, filter, and drill down into specific subsets of information. This is what makes Kentik Data Explorer such a versatile tool for cloud engineers.</p> <p>Here are just a few of the Dimensions Data Explorer has for cloud telemetry:</p> <ul> <li>Cloud provider (AWS, GCP, Azure)</li> <li>Regions (us-east-1, eu-west-1)</li> <li>VPC IDs</li> <li>Instance types (e.g., EC2, VM instances)</li> <li>Service tags (AWS Lambda, S3 buckets, GCP Cloud Functions)</li> <li>Applications and protocols (HTTP, HTTPS, DNS)</li> <li>Network interfaces (ENI, VNIC)</li> <li>Firewall Action</li> <li>AWS Transit Gateway</li> </ul> <p>Look at the screenshot below, where you can choose Dimensions for AWS. Remember that this is just a portion of the AWS Dimensions and doesn’t even include the ones available for Google and Azure.</p> <img src="//images.ctfassets.net/6yom6slo28h2/qYZvFTeJBMOpV1E4VDfkX/4155a4aab65e8b4b619d4b1ac66bf641/aws-dimensions.png" style="max-width: 800px;" class="image center" thumbnail alt="AWS filtering dimensions" /> <p>A cornerstone of network observability is being able to ask any question of your network. Using these Dimensions along with the other advanced filtering workflows in the system, cloud engineers can build granular queries and zero in on particular pieces of information, or in other words, ask any question about their cloud network.</p> <p>For instance, an engineer could filter for all the traffic generated by EC2 instances in the US-West-2 region, narrow it down to a particular VPC, and then filter by application type (e.g., web traffic over HTTPS).</p>
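<p>If you think of each Dimension as a column on a record, a query like that is conceptually just a chain of predicates. The sketch below is a rough, hypothetical illustration of the idea; the field names are invented, and it is not Kentik’s query API.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
# Conceptual sketch of chaining Dimension-style filters over records.
# Field names (provider, region, vpc_id, app) are hypothetical; in the
# Kentik portal you would select the equivalent Dimensions instead.

FILTERS = [
    lambda r: r["provider"] == "aws",
    lambda r: r["region"] == "us-west-2",
    lambda r: r["vpc_id"] == "vpc-0a1b2c3d",  # a particular VPC
    lambda r: r["app"] == "https",            # web traffic over HTTPS
]

def apply_filters(records, filters):
    return [r for r in records if all(f(r) for f in filters)]

records = [
    {"provider": "aws", "region": "us-west-2",
     "vpc_id": "vpc-0a1b2c3d", "app": "https", "bytes": 123456},
    {"provider": "gcp", "region": "us-central1",
     "vpc_id": "-", "app": "dns", "bytes": 789},
]

for r in apply_filters(records, FILTERS):
    print(r["vpc_id"], r["bytes"])  # only the matching AWS record
</pre>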
<h2 id="granular-filtering-and-workflow-customization">Granular filtering and workflow customization</h2> <p>I mentioned advanced filtering workflows, which is weird to say because the Dimensions menu is already pretty granular. What I mean by that is the ability in Data Explorer to create filtering workflows by layering multiple filters across different Dimensions, metrics, and timeframes. This way, you can extract precisely the dataset you need for your given scenario.</p> <p>In the screenshot below, I selected AWS, Azure, and Google Cloud as my data sources, but this can also be any combination of data sources in the cloud and on-prem.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6AUWw8c5omWy2RURTIwBJG/fce0f716f7ae93493ed6f11d4ce42acb/granular-filtering-options.png" style="max-width: 800px;" class="image center" thumbnail alt="Cloud filtering options" /> <p>For example, take a cloud engineer troubleshooting a spike in traffic within a multi-cloud environment. They can build a filter workflow that:</p> <ul> <li>Filters by AWS, Azure, and GCP simultaneously</li> <li>Drills down into specific regions where the traffic spike occurred</li> <li>Focuses on a particular type of resource (e.g., compute instances or containerized applications)</li> <li>Filters traffic generated by a particular application, such as a customer-facing web service</li> <li>Zooms into traffic flows from specific geographic locations</li> </ul> <p>Then, once these filters are applied, you can adjust timeframes, call out specific metrics (e.g., throughput, latency, packet loss), include or exclude particular elements, and choose from a variety of visualizations (heatmaps, time series charts, Sankey diagrams, bar graphs) to make the data easier to digest.</p> <p>Take a look at the following screenshot. You can see an example of the layering of additional filters on top of our Dimensions. I have a couple of exclusions at the top, including a specific IP, a specific project ID, and an application. This relatively simple filter allows you to see some of the options, but in a large production environment, filters can be much more complex.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2J3lvpAmtry7vCMmkhGnsp/747b8f4af968007b380159876c412106/additional-filtering-options.png" style="max-width: 800px;" class="image center" thumbnail alt="Filtering for even more granular information" /> <h2 id="real-time-exploration-and-historical-comparison">Real-time exploration and historical comparison</h2> <p>An essential feature of Data Explorer is the ability to compare datasets across timeframes. Cloud engineers often must determine whether an anomaly or spike is part of a recurring issue or something new. Data Explorer allows engineers to filter a dataset in real time and instantly compare it to the same dataset from a previous timeframe.</p> <p>For example, suppose an engineer notices unusually high latency in a specific VPC. In that case, they can filter the data to examine only traffic in that VPC, identify the impacted services, and then compare the latency metrics from the past hour against a similar timeframe the previous day, week, or month. This temporal comparison helps cloud engineers diagnose whether the issue is transient, cyclical, or persistent.</p> <p>You can make this comparison across all Dimensions, so you can filter by geo-location, cloud provider, and instance type, and still compare performance metrics from different timeframes. The ability to do this in real time is vital for cloud operations, where rapid decision-making can prevent potential outages or performance degradation.</p> <p>In the last screenshot below, you can see our multi-cloud traffic along with the query settings on the right. I have the toggle enabled for “Compare Over Previous Period” and the timeframe set for comparing the last hour to the previous week. In the actual graph, the dotted line (which you can click on in the platform) represents the historical data we’re comparing the last hour to.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5WBJIJy3cOb3d7o899y10W/58a33c8a7bef3843940a2b228b67b135/multicloud-path-visualization-chart.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Multi-cloud visualization chart view" />
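<p>Mechanically, a comparison like this amounts to running the same filtered query over two time windows and lining up the results. Here is a rough sketch of the idea; the sample data and numbers are invented for illustration.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
# Conceptual sketch of "compare over previous period": run the same
# query for the current window and for the same window one week back,
# then report the change. Timestamps and values are made-up samples.

WEEK = 7 * 24 * 3600  # seconds

def window_total(series, start, end):
    # series: list of (unix_ts, value) samples from a filtered query
    return sum(v for ts, v in series if ts >= start and end > ts)

def compare_over_previous_period(series, start, end):
    now = window_total(series, start, end)
    then = window_total(series, start - WEEK, end - WEEK)
    pct = 100.0 * (now - then) / then if then else float("inf")
    return now, then, pct

samples = [(1727740800, 42.0), (1727744400, 55.0),   # this week
           (1727136000, 40.0), (1727139600, 41.0)]   # same hours last week

now, then, pct = compare_over_previous_period(samples, 1727740800, 1727748000)
print("current=%.1f previous=%.1f change=%+.1f%%" % (now, then, pct))
</pre>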
<p>Cloud engineers need a tool that gives them complete visibility into their infrastructure. Kentik Data Explorer provides this visibility by ingesting various telemetry data, enriching it with relevant metadata, and enabling engineers to explore it through Dimensions and granular filters.</p> <p>By supporting real-time exploration and comparison across timeframes, Kentik Data Explorer is more than just a monitoring tool—it’s a comprehensive platform for data-driven cloud management. Whether managing a single cloud, hybrid, or multi-cloud environment, cloud engineers can leverage Kentik Data Explorer to gain deeper insights, troubleshoot issues faster, and ultimately deliver better cloud performance.</p><![CDATA[The Network Also Needs to be Observable: Part 1 in a Series on Network Observability]]><![CDATA[The goal of network observability is to answer any question about your network infrastructure and to have support from your observability stack to get those answers quickly, flexibly, proactively, and interactively. In this post, Kentik CEO Avi Freedman gives his thoughts on the past, present, and future of network observability.]]>https://www.kentik.com/blog/the-network-also-needs-to-be-observablehttps://www.kentik.com/blog/the-network-also-needs-to-be-observable<![CDATA[Avi Freedman]]>Wed, 25 Sep 2024 05:00:00 GMT<h2 id="the-network-is-the-key">The network is the key</h2> <p>The 21st century has made it abundantly clear that networking infrastructure is critical to connect people, applications, and the economy and distributed workforce that make the world go.</p> <p>At the same time, networks and IT infrastructure overall are becoming more diverse, dynamic, and interdependent. The internet is now the critical glue that connects traditional and cloud infrastructure. And the distributed workforce and online-focused lives we’re living have driven growing adoption of SASE, CDNs, and other methods of delivering service to the edge.</p> <h2 id="the-move-to-observability">The move to observability</h2> <p>The last five years have seen a major move to observability from the systems and application side.
There are a number of definitions, but observability in the DevOps world has been about using diverse telemetry to know the internal states of systems over time (generally focused around metrics, logs, and traces), and providing answers to the unbounded questions needed to run modern applications.</p> <p>Since Kentik’s launch, we’ve been at ground zero for a parallel and exciting move towards network observability, and we are eager to continue partnering to move the industry forward.</p> <p>Observability for the network looks at different telemetry and with a networking spin, but is based on the same principles — answering the questions you need to run the network infrastructure that drives the digital world.</p> <p>How do we define network observability?</p> <div style="font-family: 'Source Serif Pro', serif; margin-bottom: 20px;"> <span style="font-family: 'Source Serif Pro', serif; font-size: 24px; font-weight: 600;">net•work ob•servʹa•bilʹi•ty</span>: &nbsp;<span style="font-family: 'Source Serif Pro', serif; font-size: 24px; font-weight: 500;"><em>The ability to answer any question about your network.</em></span></div> <p>The goal is to answer any question of your network infrastructure — quickly and easily:</p> <ul> <li>Across any kind of network (cloud or on-prem)</li> <li>Across any kind of network element (physical, virtual, or cloud service)</li> <li>Whether isolated at the network level or with application and business context</li> </ul> <p>… and to have support from your observability stack to get those answers quickly and flexibly, and both proactively and interactively.</p> <p>The goal is to free up ops team time to architect, build, and develop for increased orchestration, automation, uptime, and performance!</p> <h2 id="the-three-keys-to-network-observability">The three keys to network observability</h2> <p>Our most successful customers, whose observability journeys we’ve learned from, have invested in three key areas: <strong>telemetry</strong>, <strong>data platform</strong>, and <strong>action</strong>.</p> <p>I’ll talk more about each of these areas this month in subsequent blog posts.</p> <h2 id="telemetry-requirements-to-support-network-observability">Telemetry requirements to support network observability</h2> <p>In order to see and reason about the network, it’s critical to gather telemetry:</p> <ul> <li>From all networks (cloud, data center, WAN, SD-WAN, internet, mobile, branch, and edge)</li> <li>Of all telemetry types, including flow/traffic, device metrics, performance tests, configuration, routing, and provisioning/orchestration</li> <li>From all types of network elements, physical and virtual, forwarding and appliances, and dedicated or cloud-native</li> </ul> <p>Without a complete picture of the state and activity of all your networks, you’re missing key capabilities to ask the questions and take the actions needed to ensure great traffic delivery.</p> <h2 id="data-platform-requirements-for-network-observability">Data platform requirements for network observability</h2> <p>To take telemetry and support asking questions, knowing about issues, and driving the actions needed to run infrastructure, there are common patterns and requirements for underlying data platforms:</p> <ul> <li>Sending telemetry live to the system, usually via a streaming message bus</li> <li>Enriching network telemetry with context such as user, application, customer, threat, and physical location, live at ingest time to match the real-time orchestration that continually changes these context streams</li>
<li>Supporting network storage and query primitives such as path and prefix, underlay and overlay, and joining with routing and other types of information not found outside of the networking world</li> <li>High resolution storage and querying to support asking questions that were not planned in advance — requiring preserving the high cardinality (number of unique values) found in network data, such as IP addressing and port information</li> <li>Supporting open integrations across the ingest layer (telemetry, context, and provisioning APIs); query layer APIs; and outbound interactive and streaming push APIs to send unified telemetry, insights, and action triggers to other observability and action platforms</li> <li>Real-time learning, typically including feature extraction, baselining, and algorithmic and more advanced ML techniques, to surface insights before users know what questions to ask, or before they ask them</li> </ul>
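<p>To make the enrichment bullet concrete, here is a toy version of what ingest-time enrichment can look like: each raw record is joined, live, against whatever context is current at that moment. The lookup tables and field names below are hypothetical; real deployments pull this context from orchestration, CRM, threat, and geolocation feeds that change continually.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
# Toy ingest-time enrichment: annotate each raw flow record with
# business and application context before it is stored. The lookup
# tables below stand in for continually updated context streams.

APP_BY_PORT = {443: "web-frontend", 5432: "postgres"}   # hypothetical
CUSTOMER_BY_PREFIX = {"203.0.113.": "acme-corp"}        # hypothetical
SITE_BY_DEVICE = {"edge-router-1": "ashburn-dc"}        # hypothetical

def enrich(flow):
    enriched = dict(flow)  # keep all the raw fields
    enriched["app"] = APP_BY_PORT.get(flow["dst_port"], "unknown")
    enriched["customer"] = next(
        (name for prefix, name in CUSTOMER_BY_PREFIX.items()
         if flow["src_ip"].startswith(prefix)), "unknown")
    enriched["site"] = SITE_BY_DEVICE.get(flow["device"], "unknown")
    return enriched

raw = {"src_ip": "203.0.113.7", "dst_port": 443,
       "device": "edge-router-1", "bytes": 1500}
print(enrich(raw))  # raw fields plus app, customer, and site context
</pre>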
<p>Depending on the scale of the architecture, planning, engineering, and operations teams, it can also be important that the underlying data platforms are:</p> <ul> <li>Fast, providing answers to old and new types of questions at the speed of thought</li> <li>Multi-tenant, with appropriate security and integrity, and maintaining speed while many users and automated endpoints are asking questions</li> </ul> <h2 id="actions-that-network-observability-platforms-must-assist-with">Actions that network observability platforms must assist with</h2> <p>For network observability, the goal of asking questions is to understand and take action. Across the networks we work with, teams say they are looking to be able to:</p> <ul> <li>Answer questions in guided interactive ways (using and filtering/zooming in on maps, dashboards, and other defined views)</li> <li>Answer questions in unbounded ways, not pre-defined — and to zoom in to any granularity as needed</li> <li>Have insights (questions and their answers) proactively surfaced and presented to the users, ideally with suggested action</li> <li>Use workflows that automate human drudgery and increase efficiency of routine tasks like traffic engineering, bill auditing, cost reduction, and performance optimization</li> <li>Integrate with chatops and workflow tools like Slack, Teams, PagerDuty, and ServiceNow</li> <li>Flexibly integrate with orchestration and automation platforms to drive automatic remediation and scaling</li> </ul> <h2 id="traditional-network-monitoring-leaves-question-gaps">Traditional network monitoring leaves question gaps</h2> <p>How is network observability different from the hundreds of existing network monitoring and management tools and platforms that have been around for many years?</p> <p>Historic tools have been standalone, closed systems, generally on-prem and one- or few-node, without modern open data architectures.</p> <p>With limited enrichment, granularity, and retention, they’ve also generally focused on the kind of rollups and pre-defined queries that have driven the move towards observability. Often vendor-specific, they generally don’t understand cloud or orchestration at all, or at most view them as separate kinds of networks.</p> <p>These systems have also been geared toward deep network experts, and as infrastructure layers converge and ops teams need infrastructure and application visibility, these older, more closed and limited systems have not found a place in greenfield observability and monitoring stacks.</p> <h2 id="devops-observability-is-great-but-cant-answer-many-network-questions">DevOps observability is great, but can’t answer many network questions</h2> <p>DevOps observability platforms have been a driver over the last few years at unifying a wide set of telemetry — traditional APM instrumentation with traditional logging, as well as metrics and the more recent waves of innovation in distributed tracing. Many of the platforms (though not all) can also deal in part or whole with the kind of cardinality seen in network data.</p> <p>But viewed from a “can I ask these questions about the network?” lens, there are still some gaps in how easily the leading DevOps platforms take in network telemetry.</p> <p>And more critically, gaps in understanding of network primitives like prefix, path, underlay, and overlay, and gaps in the kinds of workflows that network professionals engage in to plan, build, operate, debug, scale, and automate their infrastructures.</p> <p>This all makes sense — even network observability platforms like Kentik that take application-layer data as telemetry don’t have the kind of workflows that developers and app operations teams need to ask questions requiring deep application context.</p> <p>My view is — better together!</p> <p>At Kentik, we’re super excited about helping bridge the DevOps/NetOps gap.</p> <p>Watch this blog over the next month for a series of announcements about how we’ll be feeding unified, enriched network telemetry to a wide range of observability platforms, and some exciting work to drive network-focused views in leading DevOps and App Observability platforms — and the reverse, in Kentik.</p> <h2 id="conclusion">Conclusion</h2> <p>Networkers need the same observability principles, tooling, and platforms that those up the stack have been building towards, but with a network-savvy bent.</p> <p>The legacy network tools aren’t architected for modern infrastructure, and the more modern DevOps-focused platforms still lack network savvy, especially around what happens when packets leave eth0.</p> <p>Network teams practicing observability in architecture and action are already driving better performance, reliability, security, remediation, and growth. As a passionate network, data, and ops nerd, I’m beyond excited about what these emerging practices mean for the industry over the next decade and beyond.</p> <p>It’s possible to get there, whether building yourself, working with a vendor, or both. At Kentik, we’re here as a resource wherever you are in your observability journey.</p> <p><a href="/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources/">Read the next post</a> in this series.</p><![CDATA[Why I Joined the Kentik Team]]><![CDATA[Jezzibell Gilmore joins Kentik as general manager, service provider, bringing her expertise to expand the company’s global service provider business.
Read why she made the move to Kentik in her own words.]]>https://www.kentik.com/blog/jezzibell-gilmore-why-i-joined-the-kentik-teamhttps://www.kentik.com/blog/jezzibell-gilmore-why-i-joined-the-kentik-team<![CDATA[Jezzibell Gilmore]]>Wed, 25 Sep 2024 04:00:00 GMT<p>I couldn’t be more thrilled about the opportunity to contribute to Kentik, a company that is redefining how global service providers and sophisticated enterprises manage and optimize their network business.</p> <p>Throughout my 25+ year career in technology, I’ve had the privilege of working in roles that blend innovation, leadership, and strategic partnerships throughout the service provider ecosystem. As the chief commercial officer of telecommunications services trading platform Connectbase, I brought together hundreds of service providers. I was a part of the team at Akamai that set up thousands of peering relationships and partnerships worldwide. When GTT was a new entity in the industry, I helped it through multiple acquisitions to become the global brand it was. As far back as the 90s, I was part of the legendary IP network and co-location business at AS6461, which many of you knew as AboveNet, and now as Zayo. The culmination of these experiences provided me with key insights into service providers’ motivations and challenges, which now allow me to better serve these service providers at Kentik.</p> <p>Among all of my experiences in the telecommunications industry, I am most proud of what we accomplished at PacketFabric, where we built a groundbreaking platform for on-demand, private, and secure connectivity. It was an incredible experience as co-founder and chief commercial officer to see PacketFabric achieve industry recognition, including the 2020 Fierce Telecom Innovation Award for Cloud Services, and to be named a “Cool Vendor” by Gartner. These milestones are a testament to the importance of creating solutions that address real-world challenges in the service provider space.</p> <p>A key lesson from my PacketFabric experience is the importance of an agile, focused, and hyper-aligned team. That did not mean everyone on the team always agreed with each other. In fact, because of our diverse backgrounds, we rarely started with similar approaches. However, as a team, we always worked hard to find the best path forward for the company, by being real, communicating with purpose, and constantly checking in with each other. It was the committed team mentality that allowed the early PacketFabric team to achieve those highly regarded industry recognitions and commercial success.</p> <p>At Kentik, I’m not just joining technology visionaries and leaders, but a whole team of heroes in every discipline, focused on deliberate actions and continuous delivery of meaningful and substantial innovations in network observability and intelligence to our customers.</p> <p>Now, more than ever, service providers recognize network observability’s immense value as they navigate the rapidly evolving digital landscape. Real-time, actionable insights into network performance, cost, and security have never been more critical. Kentik’s platform is built to handle the complexities of today’s market by integrating both generative and traditional AI to deliver deep insights that help service providers operate and scale their networks more efficiently than ever before.
I know this is a game-changer for our customers, and I’m looking forward to collaborating with them to unlock the full potential of their networks and business.</p> <p>Digital infrastructure is the foundation of global economic and technical growth, and it is personally motivating to contribute each and every day towards a greater good. I am grateful for the trust from Avi and Justin. But even more, I am grateful to the whole Kentik team for the opportunity to belong again, and to help my beloved service provider community level up!</p><![CDATA[Unlocking Network Insights: Bringing Context to Cloud Visibility]]><![CDATA[In today’s complex cloud environments, traditional network visibility tools fail to provide the necessary context to understand and troubleshoot application performance issues. In this post, we delve into how network observability bridges this gap by combining diverse telemetry data and enriching it with contextual information.]]>https://www.kentik.com/blog/unlocking-network-insights-bringing-context-to-cloud-visibilityhttps://www.kentik.com/blog/unlocking-network-insights-bringing-context-to-cloud-visibility<![CDATA[Phil Gervasi]]>Tue, 24 Sep 2024 04:00:00 GMT<p>Years ago, the network consisted of a bare metal server down the hall, a few ethernet cables, an unmanaged switch, and the computer sitting on your desk. We could almost literally see how an uptick in interface errors on a specific port directly affected application performance.</p> <p>Today, things have become just a little more complex (understatement of the year). Tracking down the cause of a problem could mean looking at firewall logs, poring over Syslog messages, calling service providers, scrolling through page after page of AWS VPC flow logs, untangling SD-WAN policies, and googling what an ENI even is in the first place.</p> <p>What we care about, though, isn’t <em>really</em> the interface stat, the flow log, the cloud metric, the Syslog message, or whatever telemetry you’re looking at. Yes, we care about it, but only because it helps us figure out and understand why our application is slow or why VoIP calls are choppy.</p> <p>What we care about is almost always an application or a service. That’s the whole point of this system, and that, my friends, is the <em>context</em> for all the telemetry we ingest and analyze.</p> <h2 id="traditional-visibility-is-lacking">Traditional visibility is lacking</h2> <p>The problem with traditional network operations is that in the old days, visibility tools focused on only one form of telemetry at a time, leaving us to piece together how a flow log, an interface stat, a config change, a routing update, and everything else all fit together. However, traditional visibility was missing the context, or in other words, it lacked a method to connect all of that telemetry programmatically under what we actually cared about – the application.</p> <p>I remember spending hours in the evening poring over logs, metrics, and device configs, trying to figure out why users were <a href="https://www.kentik.com/solutions/improve-cloud-performance/">experiencing application performance problems</a>. And in the end, I usually figured it out.
But it took hours or days, a small team of experts, and too many WebEx calls.</p> <p>That may have worked in the days when most applications we used lived on our local computers, but today, almost all the apps I use daily are some form of SaaS running over the internet.</p> <p>Traditional visibility methods just don’t work in the cloud era.</p> <h2 id="network-observability-adds-context-to-the-cloud">Network observability adds context to the cloud</h2> <p>Network observability has emerged in recent years as a new and critical discipline among engineers, and it’s all about adding context to the vast volume and variety of data ingested from various parts of the network.</p> <p>Understanding a modern application’s behavior requires more than just surface-level metrics. Because the network has grown in scope, it demands a significant variety and volume of data to capture the nuances of its interactions within the system accurately.</p> <p>This is a big difference between legacy visibility and network observability. Legacy visibility tools tend to be point solutions that are missing context, whereas network observability is interested in analyzing the various types of telemetry <em>together</em>.</p> <p>For example, interface errors on a single router could hose app performance for an entire geographic region. However, finding that root cause is incredibly difficult when application delivery relies on many moving parts.</p> <p>Instead, we can tie disparate telemetry data points together in one system with an application tag or label so we can filter more productively. Instead of querying for a metric or device we think could be the issue, we can query for an application and see everything related to it in one place. This dramatically reduces clue-chaining and, therefore, our mean time to resolution.</p> <p>This means <a href="https://www.kentik.com/blog/simplifying-multi-cloud-visibility/">network observability looks at visibility</a> more as a data analysis endeavor than simply the practice of ingesting metrics and putting them on colorful graphs. Under the hood, there’s much more going on than merely pulling in and presenting data. There’s programmatic data analysis to add the context we need and derive meaning from the telemetry we collect.</p> <h2 id="how-does-network-observability-add-context">How does network observability add context?</h2> <p>Network observability starts with ingesting a significant volume and variety of telemetry from anything involved with application delivery. Then, that data is processed to ensure accuracy, find outliers, discover patterns, and organize it into groups.</p> <p>Organizing data can take many forms, and for those with a math background, it could conjure thoughts of clustering algorithms, linear regression, time series models, and so on. Additional technical and business metadata is added to enrich the dataset during this process. Often, this is labeled but not quantitative data: an application ID, a customer name, a container pod name, a project identifier, a geographic region, etc.</p> <p>With an enriched and labeled dataset, we can know which metric is tied to which application, at what time, in which VPC, going over which Transit Gateway, and so on. In practice, the various telemetry will often relate to many activities and not just one application flow; network observability ties data points together to understand how the network impacts specific application delivery.</p>
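<p>A tiny example makes the payoff obvious: once every record carries an application tag, one query pulls flow logs, device metrics, and synthetic tests for the same app into one place. The record shapes below are invented examples, not Kentik’s schema.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
from collections import defaultdict

# Toy illustration: heterogeneous telemetry records that all carry an
# application tag can be grouped and queried by that tag directly.
# Record shapes are invented examples for illustration only.

telemetry = [
    {"kind": "vpc_flow_log", "app": "checkout", "bytes": 48210},
    {"kind": "interface_metric", "app": "checkout", "errors": 17},
    {"kind": "syslog", "app": "billing", "msg": "link flap"},
    {"kind": "synthetic_test", "app": "checkout", "latency_ms": 31},
]

by_app = defaultdict(list)
for record in telemetry:
    by_app[record["app"]].append(record)

# Everything related to the app we actually care about, in one place:
for record in by_app["checkout"]:
    print(record["kind"], record)
</pre>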
<h2 id="kentik-puts-context-into-the-cloud">Kentik puts context into the cloud</h2> <p>Kentik ingests information from many sources into one system. Then, each Kentik customer has a database of network telemetry that’s cleaned, transformed, normalized, and enriched with contextually relevant data from their business and third-party data, such as cloud provider metrics and the global routing table itself.</p> <p>Remember that working with the Kentik portal means you have all of that data in one place at your fingertips. Filtering for telemetry related to an application can get you data from the relevant cloud providers, network devices, security appliances, and so on, all in a single query. To get more precise, that query can be refined to see specific metrics like dropped packets, network latency, page load times, or anything else that’s important to you.</p> <p>And really, that’s the core of Kentik. Whether you’re a network engineer, SRE, or cloud engineer, the data is all there and fast and easy to explore. Start with what matters to you – perhaps an application tag – and start filtering based on that.</p> <div as="Promo"></div> <h2 id="a-programmatic-approach-to-network-observability">A programmatic approach to network observability</h2> <p>However, Kentik also approaches network observability programmatically. Since the data is stored in a single unified data repository, we can do exciting things with it. For example, straightforward ML algorithms can be applied to identify deviations from standard patterns or, in other words, detect anomalies.</p> <p>That’s not necessarily groundbreaking, especially because some anomalies are irrelevant if they occur just once and don’t impact anything. However, since we can put this into context, we know, for example, that our line of business app performs poorly once latency on a particular link goes over 50ms. Even if latency never hits the 50ms mark that would throw an alert, Kentik can still see the trend ticking upward over time and notify you that there’s a <em>potential</em> problem on its way. Instead of relying on a hardcoded threshold, Kentik has enough contextual information to know something <em>potentially</em> harmful is brewing.</p>
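<p>The mechanics behind that kind of alert can be sketched with something as simple as a rolling baseline and a deviation score. This is a deliberately simplified stand-in, not Kentik’s actual detection pipeline, but it shows how a measurement can trip an alert without any hardcoded threshold.</p> <pre style="background-color: #f8f8f8; padding: 20px;">
import statistics

# Minimal anomaly check: compare the newest sample against a rolling
# baseline (mean and standard deviation of recent history). Real
# baselining also handles seasonality, trend, and much more.

def is_anomalous(history, sample, sigmas=3.0):
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return sample != mean
    return abs(sample - mean) > sigmas * stdev

latency_ms = [22, 24, 21, 23, 25, 22, 24, 23]  # recent samples
print(is_anomalous(latency_ms, 24))  # False: within normal variation
print(is_anomalous(latency_ms, 48))  # True: far outside the baseline
</pre>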
<p>OK, one last example. Since we’re analyzing data points in relation to other data points, Kentik can often identify the possible source of the problem. If there’s an unexpected increase in traffic on a link, Kentik could quickly identify and alert on what specific device or IP address is causing the problem.</p> <p>In the image below, notice that the platform automatically detected and reported on an unusual increase in traffic on one link on a Juniper device. If you look closely, you’ll see that we’re talking about only a few MB, but it’s still unusual and therefore something the system alerted us to.</p> <p>And even more helpful is the notification at the bottom that identifies the likely cause of the traffic increase.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7psHX2lrjeiWSrsTyxC7av/3ad51ecf486bb329562613f1dfbb632f/traffic-anomaly-detail.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Traffic anomaly with alert notification" /> <h2 id="in-conclusion">In conclusion</h2> <p>Understanding the vast volume and variety of telemetry we ingest from our systems must be done in context, usually the context of an application. That’s the whole point of these complex systems we manage. However, this requires not just the collection of data but also the intelligent integration and analysis of diverse telemetry sources.</p> <p>Kentik empowers engineers by providing the tools necessary to ingest, enrich, and analyze large volumes of network data with a comprehensive view – a view in the context of what we all care about. In that way, we can make data-driven decisions to ensure reliable and performant application delivery even when we consider networks we don’t own or manage.</p><![CDATA[Optimizing Cloud Networks: The Strategic Approach to Eliminating Suboptimal Routing]]><![CDATA[In this post, we look at optimizing cloud network routing to avoid suboptimal paths that increase latency, round-trip times, or costs. To mitigate this, we can adjust routing policies, distribute resources strategically, use AWS Direct Connect, and leverage observability tools to monitor performance and costs, enabling informed decisions that balance performance with budget.]]>https://www.kentik.com/blog/optimizing-cloud-networks-strategic-approach-to-eliminating-suboptimal-routinghttps://www.kentik.com/blog/optimizing-cloud-networks-strategic-approach-to-eliminating-suboptimal-routing<![CDATA[Phil Gervasi]]>Thu, 12 Sep 2024 04:00:00 GMT<p>Cloud networking has become a foundational element of modern IT infrastructure, so the need for efficient, reliable, optimal, and cost-effective routing is more critical than ever. The thing is, words like “efficient” and “optimal” can mean very different things depending on who you talk to.</p> <p>Routing algorithms usually try to figure out the best path according to some type of cost. But what if what’s important to us differs from what our favorite routing protocol uses to determine the best path?</p> <p>Suboptimal routing, for example, occurs when packets take longer or more expensive paths through the network than necessary. This can have various implications, such as increased latency, higher round-trip times, and inflated transit costs, especially when factoring in services like AWS Transit Gateway.
But what’s suboptimal for one organization may be just fine for another.</p> <div as="Promo"></div> <h2 id="what-does-suboptimal-mean-to-you">What does suboptimal mean to you?</h2> <p>Suboptimal routing occurs when network traffic takes a less-than-ideal path through the network, leading to inefficiencies. This can manifest in several ways.</p> <p>Suboptimal could refer to the number of hops between a source and destination. A path with many hops means many devices, each of which needs to process each packet. This could lead to an increase in latency, round-trip time, and so on, all of which can adversely affect application performance.</p> <p>However, suboptimal could also refer to a path with a very small number of hops but slow links. In this case, though the shortest physical path between source and destination may seem best, if its links are all T1s, sending traffic over a longer distance with more hops but high-speed links may be much better.</p> <p>These are routing choices network engineers make all the time. But what about cost optimization?</p> <p>In cloud networking, suboptimal routing could refer to a path that causes traffic between a source and destination to cross cloud regions or availability zones unnecessarily. This may happen inadvertently because it’s the best path in terms of latency, but in terms of cost, it could be very (unnecessarily) expensive.</p> <h2 id="a-balancing-act">A balancing act</h2> <p>Sometimes we want to engineer traffic using a different calculation than the one our routing algorithms use. Instead of path cost in terms of latency, delay, hops, and so on, consider “path price.”</p> <p>Ultimately, the traffic needs to get where we want it to go, but there are many interesting ways to route traffic in the cloud. For instance, you might want to create a Transit Gateway and attach it to all your VPCs and use that to route all your traffic between them. This would work great, but then you get the AWS bill….</p> <p>For most organizations, it really is more of a balancing act of finding the sweet spot of acceptable application performance and staying within budget. There are always exceptions, but most organizations can tolerate an extra 10 milliseconds of latency if it saves a boatload of money.</p> <p>AWS charges a premium for data that crosses regional boundaries or Availability Zones (a common occurrence in container networking) and for data that enters or exits through Transit Gateways, both of which can quickly escalate costs. But maybe those costs are worth it because the end-to-end latency is much lower for a mission-critical application.</p> <p>We really need to start weighing what’s important. Is saving every possible penny in transit cost the most important thing, or is maintaining only the very best application performance the most important thing, regardless of cost?</p> <h2 id="mitigating-suboptimal-routing">Mitigating suboptimal routing</h2> <p>Regardless of how you define suboptimal routing, we need to take a proactive approach to network design and monitoring to address it.</p> <p>The internet is dynamic, so one important mitigation step is to adjust BGP routing policies and other traffic engineering mechanisms to prioritize performance metrics like latency and throughput where possible. This may involve some tweaking, but creating custom route priorities that align with your goals is possible.</p> <p>For AWS networking, use Direct Connect and PrivateLink, especially for critical workloads. This reduces the number of hops and gives you more direct control over the traffic path, lowering cost and improving performance.</p>
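<p>Whichever mitigations you choose, the underlying selection logic is the balancing act described above: the cheapest path that still meets the application’s latency budget. Here’s a toy sketch of that “path price” calculation in Python; the paths, latencies, and prices are entirely made up and exist only to make the tradeoff explicit.</p> <pre><code># Hypothetical candidate paths for the same flow; the numbers are illustrative only
paths = [
    {"name": "tgw-inter-region", "latency_ms": 12, "price_per_gb": 0.04},
    {"name": "direct-connect",   "latency_ms": 18, "price_per_gb": 0.02},
    {"name": "vpc-peering",      "latency_ms": 22, "price_per_gb": 0.01},
]

LATENCY_BUDGET_MS = 25  # what the application can actually tolerate

# "Path price": the cheapest path that still meets the latency budget
eligible = [p for p in paths if LATENCY_BUDGET_MS >= p["latency_ms"]]
best = min(eligible, key=lambda p: p["price_per_gb"])
print(f"Route via {best['name']}: {best['latency_ms']} ms at ${best['price_per_gb']:.2f}/GB")</code></pre>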
<p>Another mitigation is to distribute the most-accessed resources over a larger geographic region so that they’re closer to end users. This would minimize the need for long-haul routing over multiple ASNs, which is usually more costly. In AWS, this can be done through strategically located VPCs or multi-region deployments.</p> <p>Lastly, and this may go without saying, you need to monitor network performance and cost to know that routing is suboptimal. This is where <a href="https://www.kentik.com/product/multi-cloud-observability/">cloud network observability</a> comes in. Your monitoring platform should allow you to analyze an application’s path alongside performance metrics for each hop and transit cost information for that routing decision. With that historical and real-time information correlated, you can quickly understand why your cloud cost is what it is. Set your team up with the right telemetry, capable tools for analysis, and a strategic understanding of what “suboptimal” design means for an application, initiative, or company, and they can identify opportunities to optimize and make informed decisions on how to act.</p> <p>Kentik provides a unified solution for <a href="https://www.kentik.com/resources/demo-kentik-network-observability-for-hybrid-clouds/">understanding hybrid-cloud, multi-cloud, data-center, and campus networks</a> to effectively troubleshoot, interrogate all network telemetry, harden policy, and optimize software infrastructure.</p><![CDATA[Meet Our Kentik Pets!]]><![CDATA[At Kentik, one of the many benefits of our fully remote work environment is having our beloved pets by our side. In this post, we highlight some of our Kentik pets and celebrate the joy they bring to our workdays.]]>https://www.kentik.com/blog/meet-our-kentik-petshttps://www.kentik.com/blog/meet-our-kentik-pets<![CDATA[Kentik People Operations Team]]>Mon, 09 Sep 2024 04:00:00 GMT<h2 id="remote-working-at-kentik">Remote working at Kentik</h2> <p>One of the greatest perks of working at Kentik is our fully remote work environment — especially for pet owners. Our employees love knowing their furry friends are always close by. As we covered in our previous blog post, <a href="/blog/kentik-pets-making-fully-remote-work-a-paw-some-experience/">Making Fully Remote Work a Paw-some Experience</a>, research shows that having pets around while working from home offers significant benefits. It’s all about striking that perfect balance between work and play, which is something we truly value at Kentik!</p> <p>Since our pets play such an important role in our daily lives, we thought they deserved a proper introduction. We’re thrilled to present a special photo collage featuring some of our beloved Kentik pets (and more collages to come) — mischievous cats, loyal dogs, and even happy horses and goats! These adorable faces bring immense joy to our team. Their companionship adds a unique spark to our work environment, and we’re incredibly grateful for their presence.</p> <h2 id="meet-some-of-our-pets">Meet some of our pets!</h2> <p>So, take a moment to enjoy these adorable photos and see just how much we cherish these four-legged team members.
They make our work-from-home experience a bit more fun and a lot more heartwarming!</p> <img src="//images.ctfassets.net/6yom6slo28h2/4JG4SWPOxb4nXT4dFBiWiN/24c34356ab46406c026d81c8f2288f74/pets-collage.jpg" style="max-width: 800px;" class="image center" alt="Kentik pets collage" /><![CDATA[Understanding Network Traffic Blockages in AWS]]><![CDATA[In this post, explore the challenges of diagnosing network traffic blockages in AWS due to the complex and dynamic nature of cloud networks. Learn how Kentik addresses these issues by integrating AWS flow data, metrics, and security policies into a single view, allowing engineers to quickly identify the source of blockages, enhancing visibility and speeding up the resolution process.]]>https://www.kentik.com/blog/understanding-network-traffic-blockages-in-awshttps://www.kentik.com/blog/understanding-network-traffic-blockages-in-aws<![CDATA[Phil Gervasi]]>Tue, 03 Sep 2024 04:00:00 GMT<p>The complexity of cloud networks can sometimes lead to unexpected traffic blockages or denials, which affect application availability and performance and frustrate engineering teams working at the application layer. Platform, networking, and reliability teams are often tasked with identifying and resolving these issues quickly to unblock teams and keep the packets flowing. Because of the scale and nature of cloud networking components, this can be challenging.</p> <h2 id="the-challenge-of-identifying-traffic-blockages-in-aws">The challenge of identifying traffic blockages in AWS</h2> <p>AWS networks are built on a combination of VPCs, subnets, various internet gateways, routing tables, security groups, and ACLs. In many ways, this isn’t so different from traditional networking, except that our only visibility in the cloud is from what cloud providers give us. Engineers feel this lack of visibility acutely because cloud workloads change constantly, spinning up and down as new applications are built and demand fluctuates. While cloud networking components offer flexibility and control, they also introduce complexity when troubleshooting network traffic issues.</p> <div as="WistiaVideo" videoId="iry9zwmjvd"></div> <p>For example, a misconfigured security group or an ACL that’s too restrictive could inadvertently block legitimate traffic. That’s difficult enough to pinpoint in traditional networking, but finding the root cause of these blockages in the cloud is even more tedious, especially in large-scale environments.</p> <p>In a traditional network, we have the ability to log into devices, run show commands, look at firewall logs, and so on. However, many conventional network monitoring tools fall short in cloud environments due to their lack of visibility into cloud-native components and inability to correlate different data types.</p> <p>To solve this, <a href="https://www.kentik.com/solutions/amazon-web-services/">Kentik offers a comprehensive solution</a> that leverages cloud flow data, metrics, and metadata to provide context and a clear picture of where and why network traffic is blocked or denied.</p> <div as="Promo"></div> <h2 id="leveraging-flow-data-for-insight">Leveraging flow data for insight</h2> <p>Flow data is an essential component in understanding network traffic. In AWS, VPC flow logs record information about the IP traffic going to and from network interfaces within VPCs.</p>
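<p>To give a sense of the raw material, here’s a minimal sketch that parses default-format (version 2) VPC flow log records and surfaces the denied flows. The records are fabricated, and this hand-rolled parsing is just for illustration; it isn’t how Kentik ingests logs.</p> <pre><code># Two fabricated default-format VPC flow log records
records = [
    "2 123456789010 eni-0a1b2c3d 10.0.0.5 10.0.1.7 49152 443 6 10 8400 1693742400 1693742460 ACCEPT OK",
    "2 123456789010 eni-0a1b2c3d 10.0.2.9 10.0.1.7 52011 5432 6 3 180 1693742400 1693742460 REJECT OK",
]

FIELDS = ("version account_id interface_id srcaddr dstaddr srcport dstport "
          "protocol packets bytes start end action log_status").split()

def parse(line):
    return dict(zip(FIELDS, line.split()))

# Surface only the denied flows, the starting point for a blockage investigation
for rec in map(parse, records):
    if rec["action"] == "REJECT":
        print(f'{rec["srcaddr"]}:{rec["srcport"]} to {rec["dstaddr"]}:{rec["dstport"]} '
              f'denied on {rec["interface_id"]}')</code></pre>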
<p>However, these logs can become overwhelming due to the sheer volume of data generated, especially in dynamic and large-scale cloud environments.</p> <p>Kentik ingests flow data from AWS VPC flow logs and enhances it with enriched metadata, providing meaningful insights that go beyond raw log data. Kentik was built to operate at an enormous scale and accommodate the amount of data generated by some of the largest organizations in the world.</p> <p>Then Kentik puts that flow data into context by adding information such as application tags, geographic identifiers, hostnames, ENI/IP address tags, and so on. By filtering and analyzing this enriched flow data, Kentik can identify patterns that indicate traffic blockages, such as a sudden drop in traffic volume, an increase in denied connections, or specific IP addresses or ports repeatedly being blocked.</p> <p>In the image below from Data Explorer, you can see a very simple filter for VPC, application, firewall action, and flow direction. Filtering for specific dimensions allows us to immediately focus on what’s important to us; in this case, all blocked traffic is denoted in AWS by a REJECT firewall action.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4AC1Sc3t5dPYQgX8aRVKvd/397dc5b2802397f996a410dc63868706/data-explorer-aws-reject.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Data Explorer filter showing blocked traffic" /> <p>Kentik’s advanced analysis platform also correlates flow data with other metrics, helping pinpoint the blockage’s exact location. This can include identifying whether the blockage is due to a security group rule, a routing issue, or perhaps a problem with DNS.</p> <h2 id="analyzing-metrics-and-metadata">Analyzing metrics and metadata</h2> <p>While flow data provides a wealth of information, it’s crucial to analyze metrics and metadata in conjunction with flow data to truly understand where and why traffic is being blocked.</p> <p>Kentik collects a wide range of cloud metrics from AWS, including latency, packet loss, and throughput. By correlating these metrics with flow data, Kentik can help identify performance-related blockages.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4020sl01gYnPUefgCMvELp/3734482540cfbd75df93b2cecbf8e745/aws-mesh-hover.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Public cloud mesh - hover for details" /> <h2 id="correlating-firewall-policies-and-security-groups">Correlating firewall policies and security groups</h2> <p>AWS security groups and ACLs are essential components for controlling network traffic within your cloud environment. Still, they can also be a source of traffic blockages, especially if they’re misconfigured or too restrictive, which can easily happen when workloads are quickly productionalized.</p> <p>Kentik’s platform allows for the correlation of flow data with firewall policies and security group rules. By mapping traffic patterns to the associated security policies, Kentik can quickly identify which rules are causing traffic to be blocked.</p> <p>For example, Kentik can analyze flow data and correlate it with the security group rules applied to a specific instance or subnet. If traffic is being denied, Kentik can filter for the specific rule within the security group that is causing the denial, allowing for quick remediation. This correlation extends beyond just identifying the rule; Kentik also provides insights into why the rule blocks traffic, such as determining whether the traffic is coming from an unexpected source or using a non-standard port.</p>
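<p>You can complement this kind of correlation with a quick audit of your own. As a minimal sketch, assuming boto3 and configured AWS credentials, the following flags ingress rules open to the world — the mirror image of the overly restrictive rules that cause unexpected REJECTs:</p> <pre><code>import boto3  # assumes AWS credentials and a default region are configured

ec2 = boto3.client("ec2")

# Flag ingress rules open to the entire internet across all security groups
for sg in ec2.describe_security_groups()["SecurityGroups"]:
    for rule in sg.get("IpPermissions", []):
        for ip_range in rule.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                print(f'{sg["GroupId"]} ({sg["GroupName"]}): '
                      f'ports {rule.get("FromPort", "all")}-{rule.get("ToPort", "all")} '
                      f'open to 0.0.0.0/0')</code></pre>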
<p>Using the Connectivity Checker (below), we can trace network traffic in AWS and see where traffic is blocked and by which security group. This same data can be used to audit security groups to determine when a policy is more permissive than intended.</p> <img src="//images.ctfassets.net/6yom6slo28h2/52zURFdrCONrTtscAxYyeC/ee9683aaac675dc23f9a9d4a071362c3/connectivity-checker-detail.png" style="max-width: 650px;" class="image center" thumbnail withFrame alt="Detail showing where traffic is blocked" /> <p>Keep in mind that with Kentik, this can be completely self-service: a network, platform, or SRE team doesn’t have to be the one to fix the issue. Since Kentik is built on a single database, an engineer of almost any type can use the Kentik Map, Connectivity Checker, or Data Explorer to pinpoint the problem. This means an engineer less comfortable with traditional networking can just as easily find the root cause of a problem.</p> <hr> <p>In the world of AWS networking, identifying and resolving traffic blockages requires more than just basic monitoring tools. Kentik offers a powerful solution that integrates flow data, metrics, metadata, and security policies into a single platform. This approach allows network and cloud engineers to quickly identify where and why traffic is being blocked, ensuring that cloud networks remain secure, performant, and reliable.</p><![CDATA[Afghanistan Internet Presses On, Three Years After US Departure]]><![CDATA[Kentik’s Director of Internet Analysis uses BGP to see how the internet of Afghanistan has changed in the three years following the US military’s departure.]]>https://www.kentik.com/blog/afghanistan-internet-presses-on-three-years-after-us-departurehttps://www.kentik.com/blog/afghanistan-internet-presses-on-three-years-after-us-departure<![CDATA[Doug Madory]]>Thu, 29 Aug 2024 07:00:00 GMT<p>Three years ago, the US military withdrew its forces from Afghanistan, ending its involvement in what had become the country’s longest armed conflict.
As the departure ceded the country to the Taliban, a sense of dread arose for many Afghans as well as for the future of the fragile progress the rugged country had achieved over the previous two decades.</p> <p>At the time, I authored a blog post entitled <a href="https://www.kentik.com/blog/whats-next-for-the-internet-in-afghanistan/">What’s next for the internet in Afghanistan?</a> In that piece, I analyzed the remarkable growth the Afghan Internet had achieved in the previous decade by comparing the country’s BGP topologies in 2011 and 2021: “Fast forward 10 years and note the internet in Afghanistan has grown substantially – from six domestic ASNs to 40!”</p> <img src="//images.ctfassets.net/6yom6slo28h2/53kSretWqd8fnoH1CXNRcr/04832c5597565cd86af1bb31bf7089d7/afghanistan-internet-use.png" style="max-width: 600px;" class="image center" thumbnail alt="Individuals using the internet in Afghanistan" /> <div class="caption" style="margin-top: -30px;">Data from the <a href="https://data.worldbank.org/indicator/IT.NET.USER.ZS?locations=AF">World Bank</a>.</div> <p>As for internet access, there was justified concern for what would happen under Taliban rule:</p> <ul> <li>Would mobile and fixed line internet services be disrupted or disconnected?</li> <li>Would international sanctions against the Taliban cripple the fledgling telecommunications sector?</li> <li>Would the departure of foreign investment lead to degradation of service?</li> </ul> <p>In this post, I dig into what has happened to the internet in Afghanistan in the past three years.</p> <h2 id="internet-under-taliban-20">Internet under Taliban 2.0</h2> <p>Since their ouster in 2001 following the terrorist attacks on September 11th, the Taliban had come to <a href="https://extremism.gwu.edu/evolution-in-taliban-media-strategy">embrace technology</a>. A group that had once <a href="http://www.rawa.org/internet.htm">banned the internet</a> began using <a href="https://www.tandfonline.com/doi/full/10.1080/19434472.2014.986496">social media</a> to push their worldview.
The Taliban <a href="https://www.nytimes.com/2023/06/17/world/asia/taliban-whatsapp-afghanistan.html">came to rely on WhatsApp</a> to coordinate their military takeover of the country and later to operate their government.</p> <p>Since taking over, the Taliban fighters <a href="https://www.vice.com/en/article/taliban-bureaucrats-hate-working-online-all-day-miss-the-days-of-jihad/">have become</a> “bored of the day-to-day drudgery of running” the country, with some finding themselves “spending all their time on Twitter.”</p> <p>Writing for the Internet Society’s Pulse blog in February this year, Robbie Mitchell <a href="https://pulse.internetsociety.org/blog/afghanistan-developing-and-maintaining-internet-resilience-in-the-face-of-conflict">reviewed</a> the state of affairs for internet resilience in Afghanistan:</p> <blockquote> <p>…Afghanistan which, contrary to first thought, has not experienced a prolonged government-enforced national shutdown or regional shutdown to the extent of other war-torn countries.</p> <p>This has been mainly due to years of development in the country, which introduced, expanded, and strengthened critical Internet infrastructure, including a <a href="https://blog.apnic.net/2022/05/20/ixp-seeks-to-sustainably-lower-the-cost-of-internet-in-afghanistan/">fiber optic backbone and an Internet Exchange Point</a>.</p> </blockquote> <p>Despite our worst fears, the internet in Afghanistan has largely stayed online, owing much to the development done in the years preceding 2021.</p> <p>That is not to say there are no threats to digital rights in Afghanistan. Localized internet shutdowns have occurred under the Taliban beginning with a blackout in September 2021 in the <a href="https://twitter.com/SkyYaldaHakim/status/1431879692676521987?s=20">Panjshir valley</a>, one of the last remaining strongholds of anti-Taliban resistance. Most recently, internet service was cut in Kabul during the <a href="https://www.afintl.com/en/202407168740">Shia Ashura holiday ceremonies</a> in July.</p> <p>In a <a href="https://www.chathamhouse.org/2024/08/internet-under-attack/03-internet-resilience-afghanistan">recent report</a> covering the Afghanistan internet, Chatham House reported:</p> <blockquote> <p>While the new regime is taking steps to sharpen technical and regulatory measures to control the internet, personal devices are also just arbitrarily ‘seized’ at checkpoints, while individuals are subject to police ‘swiping through their apps’.</p> </blockquote> <p>This description was echoed by an Afghan contact who added that “there are several cases of detaining and investigating people for their social media comments or posts, recently a TikToker was beaten by the morality police.”</p> <h2 id="bgp-analysis">BGP analysis</h2> <p>From a BGP standpoint, the topology of Afghanistan has changed in subtle ways. The main players are still operating there. As would be expected, the companies connected to the US government presence are no longer there, while some new ASes have emerged. The analysis below excludes satellite-based service due to the difficulty of verifying which ranges are still utilized in-country.</p> <p>The number of BGP routes representing the internet of Afghanistan grew over the past three years; however, those routes represented a significantly smaller amount of IP space. The count of IPv4 routes went from 468 to 528 in this time period, but the number of unique IPs those routes represented dropped 45%, from 252,928 to 137,984.</p>
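<p>For readers who want to reproduce this kind of measurement, counting the unique addresses behind a set of routes is straightforward with Python’s standard library. The prefixes below are hypothetical; the key detail is that overlapping announcements (a /24 inside a /22, say) must be collapsed before counting, or the total will be inflated.</p> <pre><code>import ipaddress

# Hypothetical IPv4 prefixes as they might appear in a routing table
prefixes = ["203.0.113.0/24", "198.51.100.0/22", "198.51.100.0/24"]

networks = [ipaddress.ip_network(p) for p in prefixes]

# collapse_addresses merges overlapping and adjacent networks before we count
unique_ips = sum(net.num_addresses for net in ipaddress.collapse_addresses(networks))
print(f"{len(prefixes)} routes cover {unique_ips} unique IPv4 addresses")  # 1280, not 1536</code></pre>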
<p>The number of IPv6 routes in Afghanistan went from 5 to 12, but the amount of unique IP space those routes represented dropped by 33%, from one astronomical number (roughly 2.37×10<sup>30</sup> addresses) to another (1.58×10<sup>30</sup>), as depicted in the table below.</p> <table border="0" cellpadding="10" style="margin-left: auto; margin-right: auto;"> <tbody> <tr> <td>&nbsp;</td> <td colspan="2" align="center"><strong>2021</strong></td> <td colspan="2" align="center"><strong>2024</strong></td> </tr> <tr> <td>&nbsp;</td> <td align="center"><strong>Routes</strong></td> <td align="center"><strong>IPs</strong></td> <td align="center"><strong>Routes</strong></td> <td align="center"><strong>IPs</strong></td> </tr> <tr> <td><strong>IPv4</strong></td> <td align="center">468</td> <td align="center">252,928</td> <td align="center">528</td> <td align="center">137,984</td> </tr> <tr> <td><strong>IPv6</strong></td> <td align="center">5</td> <td align="center">2.37E+30</td> <td align="center">12</td> <td align="center">1.58E+30</td> </tr> </tbody> </table> <p>While some Afghan ranges like <a href="https://stat.ripe.net/widget/routing-history#w.resource=121.127.32.0/24">121.127.32.0/24</a> are now being used outside of the country, many ranges are simply no longer routed, such as <a href="https://stat.ripe.net/widget/routing-history#w.resource=43.249.40.0/22">43.249.40.0/22</a> and <a href="https://stat.ripe.net/widget/routing-history#w.resource=128.41.128.0/17">128.41.128.0/17</a>.</p> <p><strong>AFTEL (AS55330)</strong></p> <p>In August 2021, AFTEL (<a href="https://www.afghantelecom.af/">Afghan Telecom</a>) announced 66 prefixes, representing over 54,000 unique IPv4 addresses. Three years later, AFTEL announced 91 prefixes, representing 23,552 unique IPv4 addresses. Some address ranges simply moved to other Afghan ISPs. For example, 149.54.5.0/24 is now being originated by AS132471 (MTN Afghanistan).</p> <p>To reach the global internet, AFTEL now uses three transit providers: <a href="https://elcat.kg/en/">ElCat</a> (AS8449) of Kyrgyzstan to the north, and <a href="https://ptcl.com.pk/">PTCL</a> (AS17557) and <a href="https://transworld-home.com/">Transworld</a> (AS38193) of Pakistan to the south. AFTEL had also been getting transit from TIC of Iran (AS49666) until March 2024 and Kazakhtelecom (AS9198) until July 2024. <img src="//images.ctfassets.net/6yom6slo28h2/4WBFQqGgOeX9V34scBD35m/772b2d7cc998d5de17d645faf3c960b7/afghanistan-providers.png" style="max-width: 700px;" class="image center" thumbnail alt="Global Transit Providers for AS55330" /></p> <div class="caption" style="margin-top: -30px;">Transit for AFTEL (AS55330) based on metrics from <a href="https://www.kentik.com/resources/kentik-market-intelligence/">KMI</a>. </div> <p>In 2010, my colleague Jim Cowie at Renesys <a href="https://web.archive.org/web/20100921125216/http://www.renesys.com/blog/2010/09/iran-exporting-the-internet-pa.shtml">published the first evidence</a> of Iran providing IP transit services into neighboring Iraq and Afghanistan. The service in Afghanistan was part of a larger influence campaign <a href="https://www.npr.org/2012/07/31/157637456/how-u-s-iran-tensions-are-tied-to-afghanistan">Iran was waging</a> to win over the population of the western province of <a href="https://en.wikipedia.org/wiki/Herat_Province">Herat</a>.
At present, no Iranian telecom is providing IP transit services into Afghanistan.</p> <p><strong>MTN Afghanistan (AS132471)</strong></p> <p><a href="https://www.mtn.com.af/">MTN</a> is still operating in Afghanistan despite an announcement in 2020 that the South Africa-based telecom would be departing the region. In 2022, the Afghan subsidiary was <a href="https://developingtelecoms.com/telecom-business/operator-news/14180-mtn-exits-afghanistan-with-sale-to-m1.html">sold for $35 million</a> to Lebanese <a href="https://en.wikipedia.org/wiki/M1_Group">investor group M1</a>, but the sale is still in process.</p> <p>AS132471’s footprint in the routing table increased slightly in the past three years, from 29 to 32 routes.</p> <p><strong>Etisalat Afghanistan (AS131284)</strong></p> <p>The Afghan subsidiary of UAE-based Etisalat continues to operate in Afghanistan. AS131284 went from originating 32 routes in 2021 to 44 in 2024, going from 10,752 to 11,264 unique IP addresses in three years.</p> <p><strong>Afghan Wireless (AS38742, AS138322)</strong></p> <p><a href="https://afghan-wireless.com/">Afghan Wireless</a> uses two ASNs to operate its network, and each has experienced growth in the past three years. AS38742 went from 23 routes representing 9,728 unique IPs in 2021 to 71 routes representing 20,224 IP addresses. AS138322 went from 30 to 32 routes, representing 10,496 and 16,384 IP addresses, respectively.</p> <p><strong>Roshan (AS45178)</strong></p> <p><a href="https://roshan.af/home">Roshan</a> is a mobile operator that went from originating 29 routes representing 6,912 IP addresses in 2021 to 23 routes representing 5,888 IP addresses.</p> <p><strong>Notable departures and entrants</strong></p> <p>As the US military withdrew its forces from Afghanistan, the internet routes used to support operations in the country were also pulled. AS5800 was the main ASN used for Department of Defense communications in Afghanistan, and it went from originating nearly 100 routes to zero by the end of 2021.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4DDpTuYogkr7PLloiimigT/80dfff56d9406b8c919cfca4433b7893/iafghanistan-tweet-2021.png" style="max-width: 550px;" class="image center" alt="Doug Madory Tweet regarding Afghanistan internet" /> <p>Along with AS5800, the DoD stopped announcing routes (143.74.253.0/24, 144.104.103.0/24, 146.16.128.0/20, 150.196.176.0/20) from AS1501 and AS1541.</p> <p>Afghan companies supporting the US DoD also began disappearing from the routing table, including RANA Technologies (AS55732) and Sniperhill Internet services (AS140082). Sniperhill’s Afghan IP range (103.148.71.0/24), previously announced from the US, is now announced by Afghan provider Cyber Net ISP (AS136986).</p> <p>Neda Telecommunications (AS55745 and AS133174) has also significantly reduced its Afghan footprint.
AS133174 disappeared from the global routing table in <a href="https://stat.ripe.net/widget/routing-history#w.resource=133174">June of 2024</a>, and much of the address space previously originated by AS55745 is now being used (presumably leased) outside of Afghanistan, including the following IP ranges: 117.55.199.0/24, 117.55.202.0/24, 117.55.203.0/24, 117.55.206.0/23, 117.55.206.0/24, and 117.55.207.0/24.</p> <p>Finally, the American government-funded media organization <a href="https://www.rferl.org/">Radio Free Europe / Radio Liberty</a> (AS202756) also stopped operating in Afghanistan.</p> <p>Since August 2021, a few new ASNs have appeared, routing internet traffic from Afghanistan.</p> <ul> <li>Speed Light ICT Services Company (AS150155) first appeared in the global routing table in <a href="https://stat.ripe.net/widget/routing-history#w.resource=150155">August 2022</a>.</li> <li>Hindukush Bridge ICT Services Company (AS149173), using IP addresses previously originated by Instatelecom Limited (AS55424), first appeared in <a href="https://stat.ripe.net/widget/routing-history#w.resource=149173">December 2021</a>.</li> <li><a href="https://hewadaziz.com/">Hewad Aziz Services Company</a> (AS149771) first appeared in <a href="https://stat.ripe.net/widget/routing-history#w.resource=149771&#x26;w.starttime=2021-07-08T00:00:00&#x26;w.endtime=2024-08-27T00:00:00">May 2022</a>.</li> </ul> <h2 id="looking-to-the-future">Looking to the future</h2> <p>In December, the <a href="https://blog.apnic.net/2024/01/15/beyond-borders-a-recap-of-afghanistans-school-on-internet-governance-2023/">Afghanistan School on Internet Governance (AFSIG)</a> held its first event in three years. Organized by the Internet Society, AFSIG has a mission to “promote Internet knowledge through training, mentoring, community engagement, and collaboration.” The return of the event was especially notable for its female participation: 29.4% of the 60 attendees, as well as the head of the organizing committee, were women.</p> <p>Additionally, the inaugural Afghanistan Network Operators Group (<a href="https://www.facebook.com/people/AFNOG-Afghanistan-Network-Operators-Group/61552773417335/?mibextid=LQQJ4d">AFNOG 1</a>) was held on December 6, 2023, and AFNOG 2 is scheduled to take place in November this year.</p> <p>Despite the many challenges, the internet community in Afghanistan is pressing on.</p> <p><em>Thanks to <a href="https://icannwiki.org/Mohibullah_Utmankhil">Mohibullah Utmankhil</a> for his input on this article.</em></p><![CDATA[Anatomy of an OTT Traffic Surge: Microsoft Patch Tuesday]]><![CDATA[Last Tuesday, August 13 was the second Tuesday of the month, and for anyone running a network or working in IT, you know what that means: another Microsoft Patch Tuesday.
Doug Madory looks at how the resulting traffic surge can be analyzed using Kentik’s OTT Service Tracking.]]>https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday-aug2024https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday-aug2024<![CDATA[Doug Madory]]>Tue, 20 Aug 2024 16:00:00 GMT<p>Last Tuesday, August 13 was the second Tuesday of the month, and according to my friend, <a href="https://krebsonsecurity.com/2024/08/six-0-days-lead-microsofts-august-2024-patch-push/">security researcher Brian Krebs</a>, Tuesday’s set of patches included “updates to fix at least 90 security vulnerabilities in Windows and related software, including a <em>whopping six zero-day flaws</em> that are already being actively exploited by attackers.” (emphasis in original)</p> <p>It is also a traffic surge that can be analyzed using Kentik’s OTT Service Tracking.</p> <h2 id="ott-service-tracking">OTT Service Tracking</h2> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/" title="Learn more about the Kentik Service Provider Analytics product">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/">Call of Duty update</a> or the first-ever exclusively live-streamed <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-peacock-streamed-nfl-playoff-game/">NFL playoff game</a>, these OTT traffic events can put a lot of load on a network, and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. Knowing the source and destination IPs of the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p>Kentik <a href="https://www.kentik.com/resources/kentik-true-origin/">True Origin</a> is the engine that powers the OTT Service Tracking workflow. True Origin detects and analyzes the DNA of over 540 categorized OTT services and providers and more than 50 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p>
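<p>Conceptually, the heart of that DNS-plus-NetFlow association is a join. The sketch below is a deliberately tiny, hypothetical version of it; Kentik’s True Origin operates at vastly larger scale with many more signals, but the core idea of attributing flow bytes to services via observed DNS answers looks like this.</p> <pre><code># Hypothetical DNS answers observed shortly before the flows (IP to hostname)
dns_answers = {
    "203.0.113.10": "update.example-cdn.net",
    "203.0.113.20": "video.example-stream.tv",
}
service_by_host = {  # hostname to OTT service mapping (also hypothetical)
    "update.example-cdn.net": "Software Updates",
    "video.example-stream.tv": "Video Streaming",
}
flows = [  # NetFlow-style records: (source IP, bytes delivered)
    ("203.0.113.10", 9200000000),
    ("203.0.113.20", 1100000000),
]

# Attribute flow bytes to OTT services via the DNS answers
bytes_by_service = {}
for src_ip, nbytes in flows:
    service = service_by_host.get(dns_answers.get(src_ip, ""), "Unknown")
    bytes_by_service[service] = bytes_by_service.get(service, 0) + nbytes

for service, nbytes in bytes_by_service.items():
    print(f"{service}: {nbytes / 1e9:.1f} GB")</code></pre>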
<h2 id="microsoft-patch-tuesday">Microsoft Patch Tuesday</h2> <p>Again, according to <a href="https://krebsonsecurity.com/2024/08/six-0-days-lead-microsofts-august-2024-patch-push/">Krebs</a>:</p> <blockquote> <p>This month’s bundle of update joy from Redmond includes patches for security holes in Office, .NET, Visual Studio, Azure, Co-Pilot, Microsoft Dynamics, Teams, Secure Boot, and of course Windows itself. Of the six zero-day weaknesses Microsoft addressed this month, half are local privilege escalation vulnerabilities — meaning they are primarily useful for attackers when combined with other flaws or access.</p> </blockquote> <p>Kentik customers using OTT Service Tracking observed the following statistics, illustrated below. Microsoft Update traffic experienced a peak that was almost 4.5 times that of the previous day. The update traffic was delivered via a variety of content providers, including Fastly (47.5%), Akamai (22.8%), Edgio/Limelight (15.1%), and Qwilt (13.7%).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1nmI1FSf9yTWcGHV6vUSqo/b769366b6ebfa29edcd29b6d7d3fbb11/anatomy-of-traffic-surge-cdn.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Microsoft Patch Tuesday/Windows Update traffic analysis with Kentik" /> <p>When broken down by Connectivity Type (below), Kentik customers received Microsoft’s latest round of patches and updates from a variety of sources, including Private Peering (58.5%, both free and paid), Transit (31.1%), IXP (7.0%), and Embedded Cache (3.5%).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5ayvBEZ5aM0L0UOM4nr5eZ/38c396d48a5d65de453878ffefe0dbae/anatomy-of-traffic-surge-connectivity-type.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Microsoft Patch Tuesday/Windows Update OTT traffic analysis by source" /> <p>In addition to source CDN and connectivity type, users of Kentik’s OTT Service Tracking are also able to break down traffic volumes by subscribers, specific router interfaces, and customer locations.</p> <h2 id="how-does-ott-service-tracking-help">How does OTT Service Tracking help?</h2> <p>Previously, my colleague Greg Villain <a href="https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update/">described enhancements</a> to our OTT Service Tracking workflow, which allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like Microsoft’s Patch Tuesday can have impacts in all three areas. OTT Service Tracking is the key to understanding and responding when they occur. Learn more about the application of Kentik for <a href="https://www.kentik.com/product/subscriber-intelligence/" title="Subscriber Intelligence Use Cases for Kentik">subscriber intelligence</a>.</p> <p>Ready to improve <a href="https://www.kentik.com/kentipedia/ott-services/" title="Kentipedia: OTT Services (Over-the-Top Services)">over-the-top service</a> tracking for your own networks? <a href="#demo_dialog" title="Request your Kentik demo">Get a personalized demo</a>.</p><![CDATA[Understanding and Controlling AWS Transit Gateway Costs with Kentik]]><![CDATA[AWS Transit Gateway costs are multifaceted and can get out of control quickly. In this post, discover how Kentik can help you understand and control the network traffic driving AWS Transit Gateway costs. Learn how Kentik can help you understand traffic patterns, optimize data flows, and keep your Transit Gateway costs in check.]]>https://www.kentik.com/blog/understanding-and-controlling-aws-transit-gateway-costs-with-kentikhttps://www.kentik.com/blog/understanding-and-controlling-aws-transit-gateway-costs-with-kentik<![CDATA[Phil Gervasi]]>Thu, 15 Aug 2024 04:00:00 GMT<p>As organizations increasingly rely on cloud environments like AWS, managing costs effectively becomes just as critical as keeping the packets flowing.
One significant cost area engineers need to pay attention to is the <a href="https://www.kentik.com/kentipedia/aws-transit-gateway-explained/" title="Kentipedia: AWS Transit Gateway - Everything You Need to Know">AWS Transit Gateway</a>.</p> <p>While the Transit Gateway offers a powerful and scalable way to connect VPCs, on-premises data centers, and other networks, the costs associated with its usage can quickly balloon, especially when network traffic isn’t well-understood or optimized. This is where Kentik can play a vital role in identifying the network traffic driving Transit Gateway costs, enabling businesses to optimize their network architecture and reduce expenses.</p> <div as="WistiaVideo" videoId="v97969tlpu"></div> <h2 id="where-aws-transit-gateway-cost-comes-from">Where AWS Transit Gateway cost comes from</h2> <p>TGWs are a standard way to connect VPCs and on-prem networks through a central hub. This simplifies network management and reduces the complexity of multiple VPN connections. The problem is that the cost structure of the TGW is multifaceted, including charges for each GB of network data processed, charges for each attachment used to connect VPCs and other network components, and also hourly costs per attachment.</p> <p>The <a href="https://aws.amazon.com/transit-gateway/pricing/">AWS Transit Gateway pricing website</a> provides an example that can help. Imagine a TGW deployed in the US East region with a VPC attached. To send traffic from an EC2 instance in that VPC to another VPC, you’d create a route via your TGW.</p> <p>After sending 1 GB of data, the breakdown looks something like this:</p> <ol> <li>First, we have the TGW hourly charge. For US East, it’s $0.05 per VPC attachment per hour, which, for our example, comes to $0.10 per hour since we have two VPC attachments total.</li> <li>Second, we have the TGW data processing charge. Since 1 GB was sent from an EC2 instance in a VPC attached to a TGW, you’ll incur a data processing charge of $0.02.</li> <li>Next, we have the TGW data processing charge across peering attachments. In this example, 1 GB was sent from an EC2 instance in a VPC attached to a Transit Gateway in the US East over a peering attachment to a different Transit Gateway in the US West.<br> <br> For the cross-peering attachment, the total traffic-related charges will be $0.04. This figure comes from $0.02 for the first TGW data processing and an additional $0.02 for outbound inter-region data transfer charges. Since the traffic inbound from the US West is an inbound inter-region data transfer, there aren’t any charges on that side.</li> </ol> <p>In the example AWS provides us on their website, we can see how a simple data transfer over time and at scale, including multiple regions, multiple attachments, and large amounts of data, can result in huge fees.</p>
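<p>A quick back-of-the-envelope calculation shows how these per-unit charges compound. The sketch below uses the US East rates from the example above; the traffic volumes are hypothetical, and you should check current AWS pricing for your region before relying on any of these numbers.</p> <pre><code># US East rates cited in the example above (illustrative; verify current pricing)
ATTACHMENT_HOURLY = 0.05   # $ per VPC attachment per hour
DATA_PROCESSING = 0.02     # $ per GB processed by the TGW
INTER_REGION_XFER = 0.02   # $ per GB transferred out across regions

attachments = 2
gb_processed = 50000       # GB pushed through the TGW this month (hypothetical)
gb_inter_region = 10000    # portion of that crossing a peering attachment
hours = 730                # roughly one month

cost = (attachments * ATTACHMENT_HOURLY * hours
        + gb_processed * DATA_PROCESSING
        + gb_inter_region * INTER_REGION_XFER)
print(f"Estimated monthly TGW bill: ${cost:,.2f}")  # $1,273.00 for this scenario</code></pre>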
<p>It goes without saying that if you want to control cloud costs, understanding the traffic that passes through the TGW is extremely important. Without visibility into this traffic, you could unknowingly (and very easily) incur high charges due to inefficient routing, unnecessary data transfer between VPCs, or other suboptimal network behaviors.</p> <h2 id="how-kentik-helps-monitor-aws-transit-gateway-costs">How Kentik helps monitor AWS Transit Gateway costs</h2> <p>Kentik is designed to help organizations understand, manage, and optimize their networks, both on-premises and in the cloud. This includes robust support for AWS environments, including detailed insights into traffic passing through <a href="https://www.kentik.com/kentipedia/aws-transit-gateway-explained/" title="Kentipedia - AWS Transit Gateway: Everything You Need to Know">AWS Transit Gateways</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/TzVm9PMvAkLhKGdpin8BS/200a3263019aa5a6312fa910d8d658be/aws-transit-gateway-sankey.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="AWS Transit Gateway details in the Kentik platform" /> <p>AWS VPC Flow Logs and cloud metrics provide the bulk of telemetry data for detailed visibility into the traffic flowing through your AWS environment down to a specific TGW. Then, by adding relevant metadata to these logs, such as application IDs, security tags, and project or department identifiers, we can start to make sense of flows and metrics that tell us things like VPC, port, protocol, source IP, and destination IP.</p> <p>In this way, Kentik can understand quite a bit about the traffic going over a busy TGW – volume, application type, patterns in seasonality, which attachments are used for a particular flow, unusual behavior, etc. These data points are correlated to provide context to a TGW cost calculation and help an engineer understand which components of your infrastructure drive the most traffic through the TGW and, thus, where the most significant costs are coming from.</p> <h2 id="identifying-high-cost-traffic-patterns">Identifying high-cost traffic patterns</h2> <p>Once all the data is visualized, Kentik allows you to drill down into specific traffic you’ve identified as contributing to high TGW costs. You may find that certain VPCs are communicating more than necessary or that a DNS problem is causing an unexpected traffic volume between your on-prem data center and your cloud environment.</p> <p>Kentik’s advanced analytics is all about finding meaningful insights in the data so you can take action. For example, you might discover that a routing problem is causing specific applications to inefficiently route traffic through the TGW when they could use a more direct or less expensive route.</p> <p><strong>This is the point of network observability—the ability to answer any question about your network. In this case, it’s a question of cost, which is a function of TGW traffic.</strong></p> <p>For instance, in Kentik, you can filter for specific types of traffic patterns that imply some sort of growing problem. With traditional visibility tools, alerts don’t fire until a threshold is reached, or in other words, until something bad has already happened. Instead, Kentik analyzes a trend over time and sees traffic creeping up on a particular TGW. Though not hitting any thresholds yet, it’s a pattern that’s unusual and potentially bad over time.</p> <p>As another example, inter-VPC traffic is prevalent and, in many cases, unavoidable due to how modern distributed applications work. With Kentik, we can filter cloud telemetry to see traffic from a specific VPC to another specific VPC and filter for things like application, TGW, attachments, department tags, and so on to understand what’s driving our TGW cost.</p> <h2 id="traffic-attribution-and-cost-allocation">Traffic attribution and cost allocation</h2> <p>One of the challenges in cloud environments is attributing costs to specific applications, departments, or business units. Kentik’s ability to tag and categorize traffic by various dimensions, such as VPCs, tags in AWS, or specific services, allows you to map network traffic to the respective cost centers within your organization.</p>
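<p>A toy roll-up makes the chargeback idea concrete. The team tags and traffic volumes below are invented, and the rate is the data-processing charge from the pricing example earlier; the point is simply that once flow records carry a cost-center tag, allocation is a one-pass aggregation.</p> <pre><code># Hypothetical enriched flow records: GB through the TGW, tagged by team
flows = [
    {"team": "payments",  "gb": 1200},
    {"team": "analytics", "gb": 3400},
    {"team": "payments",  "gb": 800},
    {"team": "untagged",  "gb": 600},
]

DATA_PROCESSING = 0.02  # $/GB, from the pricing example earlier

# Roll up data-processing charges per cost center for a chargeback report
costs = {}
for f in flows:
    costs[f["team"]] = costs.get(f["team"], 0) + f["gb"] * DATA_PROCESSING

for team, dollars in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${dollars:,.2f}")</code></pre>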
<p>This traffic attribution is especially useful when you need to understand who’s responsible for driving TGW costs. For instance, if a particular business unit generates significant traffic between VPCs across regions, Kentik can help you identify this and attribute the associated costs accordingly. This visibility allows you to implement accurate chargeback models or take corrective actions to optimize traffic flows.</p> <div as="Promo"></div> <h2 id="optimizing-network-traffic-to-reduce-costs">Optimizing network traffic to reduce costs</h2> <p>After identifying the traffic patterns and sources driving AWS Transit Gateway costs, the next step is optimization. Kentik observes traffic over time and, therefore, can identify traffic patterns behind unnecessary data transfers.</p> <p>An example could be observing a trend in cross-VPC traffic, which is driving TGW network data processing charges. With this information, an engineer could consolidate certain VPCs to minimize this kind of cross-VPC traffic or possibly use a Direct Connect for large volumes of data transfers, thereby reducing TGW charges. Since Kentik also monitors traffic in real time, when you implement these optimizations, you can immediately see the impact on your TGW costs.</p> <h2 id="proactive-cost-management-with-alerting-and-reporting">Proactive cost management with alerting and reporting</h2> <p>An essential part of controlling cloud costs is the alerting capability that can notify you of traffic patterns likely to result in increased AWS TGW charges. Remember that Kentik was designed for the engineer, so any time a certain type of traffic is identified, or a pattern is recognized, an alert can be set up based on the filters used to get that data. Then, with whatever static or dynamic thresholds work for your organization, you can proactively manage and mitigate costs before they blow up your cloud bill.</p> <p>Lastly, accurate and usable reporting tools are crucial for generating detailed reports on the traffic you’ve identified as driving TGW costs. Whether these reports show traffic trends over time or identify specific cost drivers, they are especially valuable for periodic reviews of your cloud network’s performance and cost efficiency.</p> <p>So, to conclude, though AWS Transit Gateway provides a powerful way to connect and manage your cloud and on-premises networks, the costs associated with its use can become problematic if not carefully monitored. Kentik offers a powerful solution for identifying and understanding the network traffic driving these costs. You can effectively manage and reduce your AWS Transit Gateway expenses by leveraging Kentik’s capabilities to ingest cloud telemetry, visualize traffic, identify cost-driving patterns, attribute costs, and optimize traffic flows.</p><![CDATA[Understanding the Deficiencies of AWS CloudWatch for Cloud Visibility]]><![CDATA[While CloudWatch offers basic monitoring and log aggregation, it lacks the contextual depth, multi-cloud integration, and cost efficiency required by modern IT operations.
In this post, learn how Kentik delivers more detailed insights, faster queries, and more cost-effective coverage across various cloud and on-premises resources.]]>https://www.kentik.com/blog/understanding-the-deficiencies-of-aws-cloudwatch-for-cloud-visibilityhttps://www.kentik.com/blog/understanding-the-deficiencies-of-aws-cloudwatch-for-cloud-visibility<![CDATA[Phil Gervasi]]>Thu, 08 Aug 2024 04:00:00 GMT<p>The complexity of modern networking requires effective monitoring and observability, especially when much of the network we rely on is in the public cloud. This applies to the security, performance, and cost-efficiency of cloud infrastructures.</p> <p>AWS CloudWatch, a native solution offered by Amazon Web Services, is a popular tool for <a href="https://www.kentik.com/kentipedia/enterprise-cloud-best-practices-challenges-solutions/" title="Kentipedia: Enterprise Cloud Monitoring: 5 Best Practices, Challenges, and Solutions">monitoring cloud resources</a>. Though it may be sufficient for some organizations, for many, it lacks the data and functionality to provide contextually relevant network observability, especially in a multi-cloud environment.</p> <p>AWS CloudWatch is designed to provide a broad overview of cloud resource performance, making it appropriate for basic monitoring tasks. With CloudWatch, we can visualize and analyze primary cloud performance and operational data. This includes collecting and storing logs and application and infrastructure metrics. It also provides basic dashboards, alarms, logs, and metrics correlation.</p> <p>However, this broad scope often comes at the expense of the focus and contextually relevant visibility of a full-fledged network observability solution.</p> <div as="Promo"></div> <h2 id="limitations-of-cloudwatch">Limitations of CloudWatch</h2> <p><strong>First, CloudWatch aggregates data, which can make in-depth analysis and troubleshooting much more cumbersome, especially for network engineers.</strong></p> <p>The challenge is that as companies mature in their cloud adoption, they end up with hundreds of accounts spread over dozens of regions. Looking at the data account by account is not an effective way to quickly understand the root cause of a network issue that is impacting application performance.</p> <p>For example, CloudWatch allows viewing data on a per-account and per-region basis, but it lacks the capability to seamlessly integrate and analyze data across multiple accounts and regions simultaneously, not to mention multiple public clouds, which are completely out of CloudWatch’s scope.</p> <p>It’s possible to put data from other data sources into CloudWatch, but it comes with added complexity and expense if egressing data from other clouds. This isn’t an effective way to manage cloud visibility.</p> <p>A common complaint among CloudWatch users is that it simply isn’t intuitive for monitoring metrics. Whether that’s because CloudWatch uses a very basic UI or because all metrics data is dumped into one bucket and highly aggregated (or more likely both), it’s difficult to parse and analyze metrics for specific networking use cases. To make things worse, CloudWatch can take several minutes to run a query, which is an eternity for engineers troubleshooting a problem in real time.</p> <p>CloudWatch seems to have been built for more generic use cases and isn’t as useful for network or network security engineers trying to understand logs and metrics in context, that is, in terms of an application, workload ID, geographic region, and so on.
What’s missing here is the context we find in modern network observability platforms, namely, the addition of relevant enrichment data to the overall dataset to put raw logs and metrics into the context of what’s important to a network operator. This could be application or workload tags, geographic identifiers, DNS information, and so on.</p> <p>This isn’t necessarily a dealbreaker for some. Still, it points out that CloudWatch was built as a generic metrics and log aggregator, not an actual observability tool for accommodating specific networking use cases.</p> <p>For example, most metrics are tied to an interface, such as bytes in and out. This may be enough for some narrow use cases like an engineer interested in getting an alert if a <a href="https://www.kentik.com/kentipedia/nat-gateway/" title="Kentipedia - NAT Gateways: A Guide to Cost Management and Cloud Performance">NAT gateway</a> is overloaded. However, there’s no context attached to these metrics, so it’s difficult to understand the path view of traffic to and from that gateway, possible routing issues causing the spike, what specific application traffic is overwhelming the interface, and so on.</p> <p><strong>Second, CloudWatch is generally built with AWS telemetry in mind.</strong></p> <p>Most organizations are operating multi-cloud environments, whether that’s multiple public cloud services like AWS and Azure, a single cloud service and a SaaS provider, or a mixture of public cloud and on-premises resources. Often this is by design over time, though sometimes it happens due to shadow IT.</p> <p><strong>As an organization grows, CloudWatch can quickly get very expensive.</strong></p> <p>When we add up the total amount of flow data and cloud metrics, the amount of information stored by AWS is significant. There’s a cost associated with that, which really can’t be avoided. However, CloudWatch also charges to query that data with tools like Athena.</p> <p>For an organization fully invested in the public cloud, such as AWS, querying cloud telemetry is part of the average daily network, cloud, and security operations. This means normal IT operations can become cost-prohibitive. Additionally, because CloudWatch is limited to only AWS logs and metrics, there is also a cost to fill in the gaps with other tools to provide the missing visibility. This means the cost of licensing and storage and the operational cost of running more visibility tools.</p> <h2 id="kentik-puts-everything-into-context">Kentik puts everything into context</h2> <p>It’s important to remember that Kentik is a network <em>observability</em> platform, not a network visibility platform. That means Kentik provides much more than visibility into logs and metrics on colorful graphs. Instead, it ingests information to put it into <em>context</em>.</p> <p>Alongside AWS VPC flow logs and cloud metrics, Kentik also ingests telemetry from other cloud and SaaS providers, campus and data center networks, the public internet, as well as a variety of metadata such as application and security tags, IPAM and DNS information, geographic identifiers, and so on. All of this information is analyzed together, and in this way, an engineer can understand <em>why</em> a metric is what it is and why it’s vital in the first place.</p> <p>For example, if an engineer suspects a NAT gateway is overloaded, they can look into this network element with CloudWatch.</p>
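<p>For reference, that kind of CloudWatch lookup might look something like the sketch below using the AWS SDK; the gateway ID is a placeholder, and the metric names follow AWS’s documented NAT gateway metrics. What comes back illustrates the point: byte counts per interval, with no application context attached.</p> <pre><code>from datetime import datetime, timedelta, timezone

import boto3  # assumes AWS credentials and a default region are configured

cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)

# Pull hourly outbound byte counts for one NAT gateway (ID is a placeholder)
resp = cw.get_metric_statistics(
    Namespace="AWS/NATGateway",
    MetricName="BytesOutToDestination",
    Dimensions=[{"Name": "NatGatewayId", "Value": "nat-0123456789abcdef0"}],
    StartTime=end - timedelta(days=1),
    EndTime=end,
    Period=3600,
    Statistics=["Sum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Sum'] / 1e9:.2f} GB out")</code></pre>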
<p>For example, if an engineer suspects a NAT gateway is overloaded, they can look into this network element with CloudWatch. However, with Kentik there would be context to help them figure out what applications or services are being affected, what caused the issue, and, therefore, the best way to go about fixing the problem.</p> <p>CloudWatch metrics are highly aggregated, and a bytes-in and bytes-out count, though helpful for figuring out which interface is being overutilized, doesn’t tell us the specifics of the application traffic going through that gateway.</p> <p>On the other hand, Kentik can quickly pivot between cloud metrics and flow data and beyond, looking at the relevant routes, DNS servers involved, firewall rules, security tags, and so on. Kentik is designed to be consumed by a network operations team focused on troubleshooting real problems, analyzing historical data, and understanding application traffic end-to-end. To that end, queries in Kentik take seconds rather than the minutes they can take in CloudWatch.</p> <h2 id="multi-cloud-is-becoming-the-norm">Multi-cloud is becoming the norm</h2> <p>Today, most organizations are multi-cloud, so understanding application traffic usually means understanding how multiple public clouds communicate with each other over the internet.</p> <p>As a simple example, imagine a line-of-business web application. The front end is hosted in one public cloud, while specific backend components are hosted in a different one. Certainly, this is not ideal, but technical and/or business constraints frequently create these situations.</p> <p>To troubleshoot an application performance issue, we’d need telemetry from both public cloud providers (AWS and Azure, in this example), as well as information about the connectivity over the internet between these instances. CloudWatch, though a decent log and metrics aggregator, wouldn’t help us piece this puzzle together across clouds.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1s0uuoy9mHxqTJ3nRhvKIT/bbbc270456542604b7b8bcff1578e587/data-explorer-sankey.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Data Explorer Sankey with AWS and Azure" /> <p>Kentik’s global mesh of synthetic test agents, as well as privately deployed test agents, measures network performance over the public internet, including the paths between public clouds. For most IT teams, this requires additional tools and licensing, but Kentik integrates all of this telemetry into a single platform.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7nFU8SisMkzV7BYCd9i5Zo/5b6bd021103d922e2bf6f68e7d7398b2/state-of-the-internet-monitoring-saas.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Synthetic mesh showing AWS and Google Cloud" /> <h2 id="cost-efficiency">Cost efficiency</h2> <p>First, though using AWS Athena with S3 for aggregating VPC flow records is an improvement over traditional CloudWatch, this method is costly. Each query incurs a cost for every read, even if querying the data with your own mechanism. This alone has priced some organizations out of CloudWatch, leaving them looking for an alternative.</p> <p>Kentik solves this problem by using a read-only account in the customer’s AWS environment to ingest logs stored in a separate S3 bucket, along with AWS API endpoints to collect metrics and metadata. Then, Kentik can query that data at no cost other than the Kentik licensing itself, which is not based on the number of queries made.</p>
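<p>Conceptually, the read-only ingestion model looks something like the sketch below. The bucket name and prefix are hypothetical, and this stands in for what is in reality a managed pipeline; the point is that everything is read in place, inside AWS, through scoped credentials:</p> <pre><code>import gzip

import boto3

# Hypothetical bucket and prefix; assumes credentials for a read-only role.
BUCKET, PREFIX = "example-vpc-flow-logs", "AWSLogs/"

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=10)

for obj in resp.get("Contents", []):
    # VPC flow logs land in S3 as gzipped, space-separated text files.
    body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
    header, *records = gzip.decompress(body).decode("utf-8").splitlines()
    for line in records:
        record = dict(zip(header.split(), line.split()))
        # Each record is now a dict like {"srcaddr": ..., "dstaddr": ...}
        # ready to hand off to the analytics pipeline.
</code></pre>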
<p>AWS still charges for generating the logs, which is unavoidable without doing your own instrumentation on your instances, but an IT team can avoid any additional cost for individual queries by using Kentik. And because all those logs never leave AWS, there’s no cost incurred for egress traffic.</p> <p>Second, because Kentik ingests telemetry from <em>multiple</em> public clouds, as well as on-premises resources, SaaS providers, and third-party telemetry such as public DNS metrics and global routing information, switching to Kentik means IT teams no longer need to juggle many observability tools for one environment.</p> <p>This also benefits the democratization of information among teams because Kentik’s intuitive interface allows various teams within an organization to access and utilize network data independently, reducing reliance on one team for all cloud operations. The cost efficiency gained by streamlining IT operations can’t be overstated.</p> <p>Kentik largely eliminates the complexity and access control limitations of CloudWatch, even compared with a combined CloudWatch and Athena solution. With Kentik, security, networking, and cloud teams all have the same information and access to the same data, though presented in dashboards that make the most sense for them.</p> <h2 id="conclusion">Conclusion</h2> <p>CloudWatch is an effective generic log aggregator for AWS telemetry, but it doesn’t meet the demands of today’s IT operations teams. Slow and expensive queries, aggregated data that is difficult to parse, a generic user interface, and enormous operating costs make CloudWatch a deficient built-in visibility solution for your cloud environment.</p> <p>Thankfully, AWS is open to third parties solving some of these problems. <a href="https://www.kentik.com/solutions/amazon-web-services/">Kentik integrates with AWS</a> and <a href="https://www.kentik.com/product/multi-cloud-observability/">all the major public clouds</a> and was built specifically for engineers analyzing data and fixing issues in complex, multi-cloud environments. Cheaper, faster, and both more granular in depth and more expansive in scope, Kentik is a solution designed for modern, real-world IT operations. To learn more about Kentik as a complement to CloudWatch, see “<a href="https://www.kentik.com/kentipedia/cloudwatch-alternatives-multicloud-network-observability/" title="CloudWatch Alternatives: Enhancing Multicloud Network Observability">CloudWatch Alternatives: Enhancing Multicloud Network Observability.</a>”</p><![CDATA[Infrastructure Modernization Means a Multi-cloud Future ]]><![CDATA[Today, 84% of enterprises have reportedly embraced a multi-cloud strategy, lured in by the promise of improved agility, resilience, and innovation. But that multi-cloud adoption comes with challenges. 
Read on to learn more about how a multi-cloud strategy paired with observability can simplify that complexity.]]>https://www.kentik.com/blog/infrastructure-modernization-means-a-multi-cloud-futurehttps://www.kentik.com/blog/infrastructure-modernization-means-a-multi-cloud-future<![CDATA[Josh Mayfield]]>Thu, 25 Jul 2024 04:00:00 GMT<p>Let’s just face the obvious: The future is multi-cloud.</p> <p>According to the latest Gartner report, “Predicts 2024—Multicloud and Sustainability Drive Modernization,” organizations are increasingly adopting multi-cloud strategies to optimize their infrastructure and ensure agility, resilience, and cost efficiency. This shift towards multi-cloud is not just a trend but a strategic move that promises a (hopefully) brighter future. But what does this mean for enterprises, and how can they navigate this complex landscape?</p> <h2 id="multi-cloud-the-new-normal">Multi-cloud: The new normal</h2> <p>Embracing a <a href="https://www.kentik.com/kentipedia/multicloud-networking/" title="Kentipedia: Multicloud Networking: Definitions, Benefits, and Challenges">multi-cloud strategy</a> offers a myriad of benefits, including enhanced agility, improved resilience, and greater innovation. This shift, as highlighted by the Gartner report, is driven by the need to avoid vendor lock-in, enhance disaster recovery capabilities, and optimize costs by leveraging the best services each cloud provider offers. However, managing multiple cloud environments can be a Herculean task, fraught with data integration, security, and visibility challenges.</p> <p>This complexity isn’t just about juggling different technologies. It’s also about managing relationships with multiple vendors, each with its own billing systems, support processes, and service levels. It’s like running a restaurant where every dish comes from a different kitchen with its own head chef and unique recipe.</p> <h3 id="the-challenges-of-multi-cloud">The challenges of multi-cloud</h3> <p><strong>Data integration and management</strong>: Integrating data across different cloud platforms can be messy. Each provider has its own set of APIs, data formats, and services, making it difficult to achieve seamless data flow. Native tools often fall short of providing a unified view across disparate environments, creating blind spots and inefficiencies. Gartner notes that traditional monitoring tools, often the first choice, fail to meet the comprehensive needs of a multi-cloud environment.</p> <p><strong>Security and compliance</strong>: Ensuring consistent <a href="https://www.kentik.com/kentipedia/cloud-security-policy-management/" title="Cloud Security Policy Management: Definitions, Benefits, and Challenges">security policies across multiple clouds</a> is a nightmare. Different clouds have different security protocols, and maintaining compliance with industry regulations across these diverse environments is daunting. Native tools often lack the advanced features needed for comprehensive security and compliance management in a multi-cloud setup.</p> <p><strong>Visibility and monitoring</strong>: Keeping an eye on performance, usage, and costs across multiple clouds is difficult. Traditional monitoring tools and even cloud-native tools from the providers themselves can fall short, leading to blind spots and inefficiencies. 
Native tools often don’t offer the depth or breadth required for effective multi-cloud monitoring, leaving enterprises without a clear view of the granular metrics and data inside each cloud environment.</p> <h3 id="practical-implications-for-enterprises">Practical implications for enterprises</h3> <p>Adopting a multi-cloud strategy isn’t just about technology; it’s about transforming how your business operates. Let’s lay out some practical implications:</p> <p><strong>Enhanced agility</strong>: Multi-cloud environments enable you to adapt quickly to changing business needs. Whether scaling up resources during peak times or experimenting with new services, the flexibility of multi-cloud allows you to respond faster to the demands of the market and your customers.</p> <p><strong>Improved resilience</strong>: By distributing your workloads across multiple clouds, you reduce the risk of downtime. If one provider experiences an outage, your services can continue running on another, ensuring business continuity and the ability to meet SLAs confidently.</p> <p><strong>Greater innovation</strong>: Access to diverse tools and services across different cloud platforms fosters innovation. You can leverage each provider’s unique strengths to develop cutting-edge solutions that drive your business forward.</p> <h3 id="the-subtle-pitfalls-of-native-tools">The subtle pitfalls of native tools</h3> <p>Many enterprises initially rely on cloud-native tools provided by their cloud service providers (CSPs). While these tools are tailored to their specific environments, they often lack the capability to integrate and manage multi-cloud setups seamlessly. Here’s why relying solely on native tools can be problematic:</p> <p><strong>Limited cross-platform integration</strong>: Native tools are designed to work best within their own ecosystems. When you bring another cloud provider into the mix, things can get tricky. Imagine trying to fit a square peg in a round hole—it’s not impossible, but it’s not ideal. For example, Gartner’s research shows that CSP-native tools often do not support the broad integration needs of multi-cloud environments.</p> <p><strong>Inconsistent security policies</strong>: Each CSP has its own security protocols, and aligning these across multiple platforms can be daunting. Native tools rarely offer the flexibility to enforce consistent security policies across all environments.</p> <p><strong>Fragmented visibility</strong>: Monitoring tools from different CSPs might not provide a unified view of your entire infrastructure. This fragmentation can lead to blind spots and make it difficult to understand your system’s health and performance comprehensively.</p> <h2 id="evidence-and-insights-from-industry-leaders">Evidence and insights from industry leaders</h2> <p>The shift towards multi-cloud is not just a trend but a strategic move backed by substantial research and real-world applications. For instance, a study by Flexera found that 84% of enterprises have a multi-cloud strategy, with 58% planning to use hybrid cloud. This shift is driven by the need to optimize costs, improve business agility, and leverage the best-of-breed services from different providers.</p> <p>Moreover, Gartner’s analysis indicates that enterprises are increasingly seeking tools that offer better integration, visibility, and security than what native tools can provide. 
The need for third-party solutions to bridge the gaps left by CSP-native tools is becoming more critical.</p> <h3 id="the-role-of-observability">The role of observability</h3> <p>Observability tools are becoming essential in managing the complexities of multi-cloud environments. These tools provide comprehensive insights into your cloud infrastructure’s performance, security, and costs, enabling you to make informed decisions. Let’s take a look at some of the key benefits of observability.</p> <p><strong>Comprehensive insights</strong>: Observability tools provide a holistic view of your entire cloud environment, enabling you to monitor performance, detect anomalies, and optimize resources. Unlike native tools, these solutions offer a unified view across all your clouds, ensuring you don’t miss any critical data points.</p> <p><strong>Enhanced security</strong>: Observability helps you enforce consistent security policies across all your clouds. These tools integrate with leading security solutions to provide continuous monitoring and threat detection, reducing the risk of breaches and compliance issues.</p> <p><strong>Cost optimization</strong>: Observability tools help you track usage and spending across multiple clouds, identify underutilized resources, and optimize your cloud spending. This ensures you’re not overspending on cloud resources and can allocate your budget more efficiently.</p> <h3 id="real-world-examples">Real-world examples</h3> <p>Several companies across various industries have successfully implemented multi-cloud strategies, leveraging advanced observability tools to overcome the inherent challenges. One global technology firm faced significant challenges with its legacy network monitoring solutions. These tools were not scalable and did not provide visibility into its hybrid cloud infrastructure. Adopting an advanced network observability platform allowed it to consolidate its network tools, reduce costs, and improve efficiency, ensuring a future-ready network.</p> <p>Another major enterprise in the energy sector struggled with segregated and rapidly changing infrastructure driven by evolving company strategies and regulatory pressures. An advanced observability platform enabled it to streamline its network operations, enforce consistent security policies, and gain comprehensive visibility across its multi-cloud environment, significantly enhancing its operational efficiency and agility—timely for an industry going through global disruption.</p> <h2 id="kentiks-role-in-multi-cloud-modernization">Kentik’s role in multi-cloud modernization</h2> <p>The Kentik network observability platform offers comprehensive visibility and control, <a href="https://www.kentik.com/product/multi-cloud-observability/">integrating telemetry data from all your cloud environments</a> into a single pane of glass. You can monitor performance, detect anomalies, measure traffic, and gain insights across your entire infrastructure without jumping between different tools. 
Our platform ensures your security policies are consistently enforced and helps you <a href="https://www.kentik.com/blog/lower-cloud-costs-with-kentik/">optimize costs</a> by tracking usage and spending across multiple clouds.</p> <p>Check out this <a href="https://www.kentik.com/resources/navigating-the-complexities-of-hybrid-and-multi-cloud-environments/">recent demo</a> to learn more about Kentik’s multi-cloud observability.</p> <div as="WistiaVideo" videoId="xogyx13gzl"></div><![CDATA[Calm Down, Cloud's Not THAT Different!]]><![CDATA[Are you considering the switch from network engineer to cloud engineer? This post won’t teach you everything needed to become a cloud expert, but hopefully, it will create a level of comfort and familiarity such that you can start your journey to the cloud from here.]]>https://www.kentik.com/blog/calm-down-clouds-not-that-differenthttps://www.kentik.com/blog/calm-down-clouds-not-that-different<![CDATA[Leon Adato]]>Thu, 18 Jul 2024 04:00:00 GMT<p>I was at a tech conference recently – <a href="https://nanog.org/events/nanog-91/">one built by (and for) networking professionals</a> – chatting with another attendee about the usual mix of news, tech challenges, and hobbies when the conversation took a sudden and sobering turn:</p> <p>“I guess I have a lot more time for side quests these days. I got laid off a couple months back and can’t seem to land even a solid conversation, let alone a job offer. I thought it would be easier than this.”</p> <p>Those of you who catch my writing <a href="https://adatosystems.com">on my personal blog</a> will know that job hunting has been a common topic for me. Everything from <a href="https://www.adatosystems.com/2024/02/13/whats-a-resume-for/">how (and why) to fix up your resume</a>, to creating a worthwhile collection of <a href="https://www.adatosystems.com/2024/01/29/do-you-have-any-questions-for-me/">interview questions</a> to simply sharing the job listings that come across my email, Slack, Discord, and social channels <a href="https://www.adatosystems.com/about-the-weekly-job-listing-post/">each week</a>. So, when the topic came up, I didn’t shy away from it.</p> <p>We spent some time discussing the mechanics of resumes, online profiles, and networking (the people kind, not the route-and-switch kind).</p> <h2 id="the-cloudy-elephant-in-the-room">The cloudy elephant in the room</h2> <p>By this point, a decent-sized crowd had gathered around. This was partially because we were camped out in front of the coffee station, but also because <a href="https://en.wiktionary.org/wiki/hallway_track">the hallway track</a> is definitely a thing. And that’s when the conversation pivoted to the elephant in the ballroom.</p> <p>The elephant in question was brought to this event by the inimitable <a href="https://www.duckbillgroup.com/about/">Corey Quinn</a> as part of his keynote speech: <a href="https://www.youtube.com/watch?v=mJshV8koylo">TCP Terminates on the Floor: The Ebbing Tide of Networking Expertise</a>.</p> <p>After pointing out a brain-meltingly stupid architectural decision he’d seen, Corey commented, “The real problem is that the companies that are responsible for doing these ridiculous stunts all needed a network engineer on the team when there very clearly wasn’t one. Because those job postings don’t exist anymore. 
They don’t really call the skill set ‘network engineering’ in a lot of these companies; they call them SREs or DevOps or special engineers or god knows what.”</p> <p>Toward the end of the talk, Corey pivoted to describe the challenges that companies which are (or wish to be) cloud-native experience today, and how someone with a background in so-called “traditional” networking would have an advantage: “These are things that most of you in this room already know how to do, which puts you in a rarified group. These are all things you can take to a cloud-native company, who will value what you’re bringing to the table if you talk about it a little bit differently.”</p> <p>And that’s where Corey lost me (only a little, but it’s relevant to this blog). To offer career options, he suggested “meeting developers further up the stack” and “start moving into the application space,” which sounded—both to me and to other attendees—suspiciously like the oft-repeated call to “learn how to code” and “become a developer.”</p> <p>That isn’t to say it’s fine to eschew learning code altogether. As I’ve written <a href="https://dzone.com/articles/develop-a-sense-of-code">before</a>, everyone in tech should develop a “sense of code” – the ability to look at an existing piece of code and understand what it’s trying to accomplish. That’s not the same thing as being able to fix a bug, optimize, or create the code. But it’s nevertheless important. It also strikes closer to what I believe Corey was actually getting at: moving up the stack to find a comfortable middle ground where we can meet our application developer colleagues.</p> <h2 id="pivoting-into-a-dead-end">Pivoting into a dead end</h2> <p>On the other hand, clinging to the past – whether we’re talking about old relationships or old technology – doesn’t help anyone. As <a href="https://www.linkedin.com/in/evelyn-osman/">Evelyn Osman</a>, platform engineering manager at AutoScout24, pointed out to me, “One of the first rules of cloud migrations is ‘don’t fire your network guys, yet.’”</p> <p>I’ll be honest, the “yet” part of her comment really bothered me. When I asked for more information, she elaborated that after a company goes through a so-called “digital transformation,” “the data center teams are either retasked as DevOps or shown the door. SysAdmins transition into these new roles easily, while data center/network engineers do not so much. It all comes down to OSI: They’re too far down to readily contribute post-migration. The end result is they’re critical to making the migration happen. We don’t need that deep networking knowledge in the cloud as IaaS simplifies it all.”</p> <p>But, like Corey, Evelyn offered a way forward for network folks: “IMHO, what may be beneficial is to start looking at how services are expected to communicate, help figure out that dynamic security nightmare. Go deeper into CDN and learn how to really tune things there, and dabble with serverless to see how to optimize at the edge because dev teams are shit at it.”</p> <p>This brings me to the crux of this post: Cloud isn’t that hard. It’s not even significantly different from what we already know as networking experts. It’s made up of techniques and technologies we’re intimately familiar with, but it seems inscrutable due to jargon and (possibly) the world’s worst UI. And once you make it past that hurdle, the stuff that’s actually new will be far more accessible and approachable. 
The barrier to entry for network veterans will have been lowered.</p> <h2 id="keep-calm-and-reload-in-5">Keep calm and reload in 5</h2> <p>Before we start to dig deeper into what cloud networking really is, and whether it’s new technology or just the same sh…tuff with different names, and how to go about learning it either way, I want to make a few things absolutely clear:</p> <p>If you’ve been working with traditional networking for any length of time, you already have all the knowledge, skills, and insights needed to understand cloud network architecture. Once we peel back the jargon and the shamefully poor UI, it’s all just routing tables, forwarding, ACLs, and tunnels—the same as any other network.</p> <p>But to break all of that down, we have to re-educate ourselves.</p> <h2 id="line-up-for-cloud-class">Line up for cloud class</h2> <p>If you’ve gotten to this point in the blog and are shouting, “ZOMG, THIS IS TOTALLY ME! What can I do or read or watch to get myself over the cloud education hump?!?” I’m not going to keep you in suspense. Here are some resources that my friends and I have found to be a Rosetta Stone, translating old-school networking knowledge into modern-day cloud architecture jargon:</p> <p>Let’s start with a two-part blog series from <a href="https://www.linkedin.com/in/matthnick/">Nick Matthews</a>, which is both free and written for network folks: <a href="https://aws.amazon.com/blogs/apn/amazon-vpc-for-on-premises-network-engineers-part-one/">Part 1</a> | <a href="https://aws.amazon.com/blogs/apn/amazon-vpc-for-on-premises-network-engineers-part-two/">Part 2</a>.</p> <p>Also free is a talk by Shean Leigon, a friend and former USAF buddy of my colleague <a href="https://www.kentik.com/blog/author/doug-madory/">Doug Madory</a>. He speaks about his transition from network engineer to cloud engineer in <a href="https://artofnetworkengineering.com/2023/03/01/ep-114-things-are-getting-cloudy/">this podcast</a>.</p> <p>Staying with resources specifically designed to help networking folks, there’s Tim McConnaughy’s book <a href="https://www.amazon.com/Hybrid-Cloud-Handbook-AWS-Traditional/dp/B0BW2ZKNB6"><em>The Hybrid Cloud Handbook for AWS: AWS Cloud Networking for Traditional Network Engineers</em></a>. The reason I’m including it is right in the title.</p> <p>The last resource I want to share (for now) is less specific to (or for) network professionals and more geared toward learning cloud essentials: the <a href="https://d1.awsstatic.com/training-and-certification/ramp-up_guides/Ramp-Up_Guide_Architect.pdf">AWS Ramp-Up Guide</a>. It’s a PDF list of over 70 courses and labs, some of which have a cost, but many of which are free.</p> <h2 id="for-those-who-want-a-cloud-hall-pass">For those who want a cloud hall pass</h2> <p>I expect to lose some readers at this point because the previous section is what they needed most. I can accept that. My hope is that you will stick around, and those who clicked away eventually find their way back so we can talk about how the cloud - which ostensibly uses the same networking technologies and concepts as its on-premises predecessor - could have come to feel so foreign.</p> <h2 id="cloud-class-is-in-session">Cloud class is in session</h2> <p>This blog isn’t meant to teach everything needed to immediately pivot to a role as a cloud engineer (network or otherwise). 
Instead, I’m going to share three conventions used in cloud architecture and describe their similarity to standard networking concepts to make the point that a lot of what happens in cloud architecture is stuff you already know.</p> <p>Doing so will create a level of comfort and familiarity so you can start your journey to the cloud from here.</p> <h3 id="lesson-1-its-called-cloud-because-it-obscures-your-vision">Lesson 1: It’s called “cloud” because it obscures your vision</h3> <p>(Not really. The name is derived from the term “TAMO Cloud.”)</p> <p>It’s not that cloud providers intentionally make their interfaces confusing; it’s that humans aren’t the target market. The primary “user” of the cloud is actually a program. What I mean is that the expectation is that someone will be writing a program that requests cloud resources and manages the lifecycle of those resources – whether that’s seconds in the life of a microservice, hours of ephemeral containers, or the long operations of a persistent system.</p> <p>In that context, everything – from processors to storage (and, yes, networking) – exists and comes to life as a line of code in a program. While not exactly an afterthought, the UI built for <a href="https://www.youtube.com/watch?v=LAlqp0_a0tE">giant ugly bags of mostly water</a> to use is definitely <em>not</em> the cloud provider’s primary concern.</p> <p>Adding to the obfuscation is that every interaction you will have with cloud networking architecture will be at layer 3 or higher. You’ll never be able to see or affect layers 1 or 2, so stop looking for them. Everything starts at layer 3.</p> <p>Finally, I want you to start thinking like an MSP (managed service provider). If you’ve worked in tech for any length of time, you already understand the MSP mindset because you’ve either worked <em>with</em> them or <em>for</em> them (or both). MSPs have “service offerings”—packaged bundles of technology that have been standardized and operationalized so that the MSP can quickly roll out the service for a new customer.</p> <p>Of course, those service offerings connect back to hardware, configurations, and such. But much of the nuts and bolts have been curtained off from the paying customer. The customer is <em>paying</em> for the privilege of being shielded from the technical minutiae of routing, Active Directory, IP addressing, firewall rules, and such.</p> <p>Cloud is doing the same thing. So, as you look at each screen in AWS, GCP, Azure, or the rest, remind yourself that <em>you</em> are, in fact, sitting in the MSP customer seat. You need to learn how to see past the GUI.</p>
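<p>To see what sitting in that customer seat actually looks like, here’s a minimal sketch of a network being born as lines of code. It uses AWS’s Python SDK (boto3); the region and CIDR blocks are just example inputs, and it assumes credentials are already configured:</p> <pre><code>import boto3

# A VPC and subnet created the way the cloud expects: programmatically.
# Region and CIDR values are examples; assumes AWS credentials are set up.
ec2 = boto3.client("ec2", region_name="us-east-1")

vpc_id = ec2.create_vpc(CidrBlock="10.1.0.0/16")["Vpc"]["VpcId"]
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.1.0.0/20")["Subnet"]["SubnetId"]

print(f"Created {vpc_id} containing subnet {subnet_id}")
</code></pre> <p>The console screens you’ll wrestle with are, in effect, just one rendering of these same API calls.</p>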
<h3 id="lesson-2-cloud-is-actually-fabric">Lesson 2: Cloud is actually fabric</h3> <p><a href="https://www.linkedin.com/in/tmcconnaughy/">Tim McConnaughy</a> put it best in his book <a href="https://www.amazon.com/Hybrid-Cloud-Handbook-AWS-Traditional/dp/B0BW2ZKNB6"><em>The Hybrid Cloud Handbook for AWS</em></a>: “Broadly speaking, a cloud fabric is an arbitrary network topology based on creating tunnels over an underlying physical topology. … Basic connectivity [in the cloud] is established between an ocean of network devices, and on top of that is orchestrated this arbitrary fabric that serves as the customer-facing ‘cloud network.’”</p> <p>The “fabric” in question isn’t based on a single protocol like MPLS but may include any (or all) tunneling protocols, including IPsec, GRE, VXLAN, and more.</p> <p>But I need to refer back to the point I made earlier: Your interactions with this network fabric start at layer 3. So, much of what we network professionals understand about fabric (the hardware layer of the mesh) is completely obscured from view, making it hard to identify that this is what’s happening under the hood.</p> <h3 id="lesson-3-vpc-is-just-a-router">Lesson 3: “VPC” is just a router</h3> <p>Or, more accurately, a routed domain.</p> <p>Again, the issue is the interface—you just see a screen that lists out IP ranges, and unless you visualize what’s happening in your head, it can feel very disconnected and random.</p> <p>What you see is a screen where you divide your VPC (virtual private cloud) into one or more subnets of the CIDR block you picked when you initially set up the VPC. So if your initial selection was 10.1.0.0/16, you might see the VPC contain:</p> <ul> <li>10.1.0.0/20</li> <li>10.1.16.0/20</li> <li>10.1.32.0/20</li> <li>and so on.</li> </ul> <p>Think of those subnets as individual LANs, with something sitting in the middle of them all, and you’ll be on solid ground: that something is a router. Not a real flesh-and-bloo… I mean silicon-and-copper router, but the programmatic equivalent of one. This router is doing all the same things a router would do, including DNS, DHCP, and even ACLs.</p> <p>Once you can visualize that device in the middle of everything, the options and commands make infinitely more sense.</p>
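<p>You can sanity-check that mental model with tooling you already trust. Here’s a quick sketch using Python’s standard <code>ipaddress</code> module to confirm that the example subnets above are carved out of the VPC’s /16:</p> <pre><code>import ipaddress

vpc = ipaddress.ip_network("10.1.0.0/16")

for cidr in ("10.1.0.0/20", "10.1.16.0/20", "10.1.32.0/20"):
    subnet = ipaddress.ip_network(cidr)
    # Each subnet is one "LAN" hanging off the VPC's virtual router.
    print(cidr, "is inside", vpc, "->", subnet.subnet_of(vpc),
          "| addresses:", subnet.num_addresses)
</code></pre> <p>(One cloud quirk worth knowing: AWS reserves five addresses in every subnet, but the subnetting math is otherwise exactly what you already know.)</p>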
<h2 id="the-next-step-is-always-the-most-important-one">The <em>next</em> step is always the most important one</h2> <p>It’s time to review and reflect. If you’ve made it to this point in the blog, you know that I’ve offered both educational resources and a short explainer that introduces cloud network architecture from a “traditional” network point of view.</p> <p>What comes next is largely up to you. You might jump in with both feet, picking up some free courses, books, and blogs, and begin to marinate in all things cloud. Or you may dip your toes in, asking folks you already know and trust about their experiences and getting a more personalized sense of which aspects of cloud platforms make sense to (and for) you.</p> <p>Or you can stay where you are, educationally speaking. I don’t say that as a slight on your judgment, or to suggest that doing nothing is some kind of IT booby prize. There are many solid, valid reasons to hold off diving into a new technology. When Steve Jobs said, “<a href="https://youtu.be/BRTOlPdyPYU?t=89">Everybody in this country should learn to program a computer</a>” in 1995, he was wrong. Likewise, anyone who says, “Everyone should learn cloud architecture today” is equally wrong.</p> <p>That said, you need to decide what your next step will be. I hope this post and the other blogs, videos, and training from Kentik will be helpful as you forge your path forward, regardless of what direction you choose.</p> <p>Check out a <a href="https://www.kentik.com/resources/demo-kentik-network-observability-for-hybrid-clouds/">recent demo here</a> to learn how Kentik provides network observability for hybrid clouds.</p><![CDATA[How Kentik Lowers Cloud Costs]]><![CDATA[Migrating to public clouds initially promised cost savings, but effective management now requires a strategic approach to monitoring traffic. Kentik provides comprehensive visibility across multi-cloud environments, enabling detailed traffic analysis and custom billing reports to optimize cloud spending and make informed decisions.]]>https://www.kentik.com/blog/lower-cloud-costs-with-kentikhttps://www.kentik.com/blog/lower-cloud-costs-with-kentik<![CDATA[Phil Gervasi]]>Wed, 17 Jul 2024 04:00:00 GMT<p>Remember when the public cloud was brand new? In those days, it seemed like overnight, everyone’s strategy became to lift and shift all their on-prem workloads to public cloud and save tons of money. At that time, that just meant migrating VMs or spinning up new ones in AWS or Azure. Well, those days are pretty much gone, and we’ve learned the hard way that putting most or all our workloads in the public cloud does not always mean cost savings.</p> <p>The thing is, there’s still an incredible benefit to hosting workloads in AWS, Azure, Google Cloud, Oracle Cloud, etc. So today, rather than lift and shift, we have to be more strategic in what we host in the cloud, mainly because many of our workloads are distributed among multiple cloud regions and providers.</p> <p>Cloud billing is generally tied to the amount of traffic exiting a cloud, so the problem we face today isn’t necessarily deciding between on-prem hardware and hosted services but figuring out what traffic is going where after those workloads are up and running.</p> <p>In other words, to understand and control our cloud costs, we have to be diligent about monitoring how workloads communicate with each other across clouds and with our resources on-premises.</p> <div as="Promo"></div> <h2 id="managing-cloud-cost">Managing cloud cost</h2> <p>Most organizations have IT budgets with line items for software and hardware vendors, including recurring licensing and public cloud costs. A normal part of IT management today is establishing clear budget parameters, monitoring resources (especially idle resources), setting up spending alerts, and analyzing cost anomalies to prevent financial pitfalls.</p> <p>But what data can we actually use to determine which cloud resources are idle, which workloads are talking to which workloads, and what specific traffic is egressing the cloud and, therefore, incurring costs?</p> <p><em>The best guess just isn’t good enough, so we need to look at the data.</em></p> <p>Kentik provides comprehensive visibility across clouds and on-prem networks, giving engineers a powerful tool to explore data in several key areas. Let’s examine how Kentik helps organizations manage their cloud costs.</p> <h2 id="comprehensive-visibility">Comprehensive visibility</h2> <p>First, Kentik provides a <a href="https://www.kentik.com/product/multi-cloud-observability/">consolidated view of multi-cloud environments</a>, offering insights into usage and costs across AWS, Azure, Google Cloud, and Oracle Cloud from a single dashboard. 
Kentik monitors traffic flow between cloud resources and on-premises infrastructure, helping identify high-cost data transfers and inefficient routing.</p> <p>The graphic below shows a high-level overview of active traffic among our public clouds and on-premises locations. The Kentik Map is helpful as an overview, but most elements on the screen are clickable, meaning that you can drill down into what you want to know right from here.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5WB0XxoAN7ASChCkfmJKlj/2de3146c4210aa1e7c052f67701d5209/public-cloud-overview.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Overview of public cloud and on-prem locations in the Kentik Map" /> <p>To focus on just one cloud, we can view the topology for that provider. Here, active traffic traverses various elements such as Direct Connects, Direct Connect Gateways, and Transit Gateways.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6DH6qOcWZFWrSc47gHz5gd/b04da4530c4056e0dc8ed1dfc40ffbcd/aws-topology.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Drill into AWS topology" /> <p>We can also open the Details pane for our Transit Gateway to get a high-level snapshot of overall traffic, which we can filter in various ways. This gets us started, but we need to dig deeper when analyzing current cloud bills and trying to understand where AWS’s numbers are coming from.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2F0rvEhF8pBzwKaNmyZ2Gv/2f15c34ae99a449b94043d58ee5eff74/transit-gateway-details.png" style="max-width: 400px;" class="image center" thumbnail withFrame alt="View of the details pane and filtering" /> <h2 id="analyzing-traffic-to-understand-the-cost">Analyzing traffic to understand the cost</h2> <h3 id="data-explorer">Data Explorer</h3> <p>From here, we can pivot to Data Explorer, allowing us to filter for specific data and clearly understand the traffic exiting the cloud.</p> <p>Data Explorer allows an engineer to create simple or very elaborate filters to interrogate the entire underlying data in the system, including telemetry from on-premises networks and the public cloud. For example, considering most cloud costs are incurred when traffic traverses a network load balancer, <a href="https://www.kentik.com/kentipedia/nat-gateway/" title="Kentipedia - NAT Gateways: A Guide to Cost Management and Cloud Performance">NAT gateway</a>, or ultimately some sort of internet gateway, we can set up a query that tells us what traffic is leaving our public cloud region and how much.</p> <p>Here, we see a filter for all AWS traffic, including the destination gateway, traffic path, average traffic in bits, and max traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3obpnio71Tu9oMCsOPsg4T/1f364c93df647e59a1cb7a09947067a1/data-explorer-aws.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Filtering cloud network data in Data Explorer" /> <p>We’ll need to get more specific, so we can add another filter option for the internet gateway and a traffic profile to clearly identify the direction of the traffic. Notice in the image below that we can see specific traffic the system identified as “cloud to outside,” which we can use as part of our filter.</p> <p>Also, it’s very common for traffic to leave a public cloud only to go right back into the same cloud without technically going to the public internet. 
Typically, that traffic doesn’t incur a cost, so to be accurate in our cloud cost monitoring, we need to exclude that traffic, which would appear as a duplicate flow in the data.</p> <p>In the graphic below, we can see the results with the expanded filter, including the removal of duplicate flows. Our metrics output is measured specifically in bits outbound (how cloud providers bill their customers) in the timeframe we care about, usually a bill cycle, and in our example, the last 30 days.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3L7kzii8fsu7h3nh7SS6A5/34e8afa94b8b337146f8c2557b3c7f19/data-explorer-aws-bits-outbound-arrows.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="AWS traffic with filters applied" /> <p>Of course, this is only an example, so you can add a specific VPC, cloud region, availability zone, or whatever is vital to your organization’s workflow.</p> <p>For example, with the AWS Transit Gateway, we’re charged for each connection (attachment) to the Transit Gateway per hour, plus the amount of traffic that flows through it.</p> <p>In this next image, we’re filtering specifically for the destination gateway type of “internet,” which again shows us total traffic egressing our cloud instance.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4TVTE4tvcRQ6aUY2MbmTLB/36f000769fae5f1c923b3fc3f9138b8a/top-eni-entity-arrows.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Filter for destination gateway internet" /> <p>Using the above information, we simply multiply the total bytes by our rate. For the US East region, that would be $0.05 per attachment per hour, plus $0.02 per GB of data processed by the Transit Gateway.</p>
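<p>As a back-of-the-envelope sketch, that math looks like the following. The attachment count and byte total here are made-up inputs; substitute the figures from your own query:</p> <pre><code># Example AWS Transit Gateway cost math for US East, using the rates above.
ATTACHMENT_PER_HOUR = 0.05  # USD per attachment per hour
PER_GB_PROCESSED = 0.02     # USD per GB processed

def tgw_cost(attachments, bytes_processed, hours=730):
    """Estimate a month of Transit Gateway charges (730 hours ~ 1 month)."""
    gb = bytes_processed / 1e9  # cloud billing uses decimal gigabytes
    return attachments * ATTACHMENT_PER_HOUR * hours + gb * PER_GB_PROCESSED

# Hypothetical inputs: 4 attachments, 12 TB processed in the billing cycle.
print(f"${tgw_cost(4, 12e12):,.2f}")  # $386.00
</code></pre>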
<p>We can look at this data over a specific timeframe to check our billing or determine our current cost during a billing cycle. Looked at over time, we can also assess seasonality, trends, and anomalies in the traffic that directly relate to our AWS cost.</p> <p>Because this method uses cloud flow logs, we can also identify applications and custom tags like customer ID, department ID, or project ID. This way, we can create billing reports right in the Kentik platform for department allocations, etc.</p> <p>We can examine the source, destination, protocol, specific applications, and more to understand the main drivers of our cloud bill.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2NkGJQdr2T0poH4XZJ1jgP/b7800f3691dc2dfe931c9134f5afda6c/data-explorer-aws-main-drivers.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud cost billing cycle" /> <h3 id="metrics-explorer">Metrics Explorer</h3> <p>Suppose we’re concerned with just identifying total traffic per interface. In that case, we can also use Metrics Explorer to filter for the Direct Connect, virtual interface, and even circuits that we care about. This way, we can figure out how much traffic is going across each circuit and understand whether traffic is being load-balanced across them the way we want, according to the cost of each circuit.</p> <p>The next screenshot shows a breakdown of total outbound traffic from each Transit Gateway over the last two weeks, as well as the attachment, account, and region.</p> <img src="//images.ctfassets.net/6yom6slo28h2/REgjItkx05TzN5jLKeV0Y/51cc84e944a9c42724386cca7f8da33f/metrics-explorer-aws.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Metrics explorer view of cloud costs" /> <h2 id="monitoring-idle-resources">Monitoring idle resources</h2> <p>In addition to knowing how much and what kind of traffic is egressing our cloud instances, we also need to know which resources we’re paying for that aren’t doing anything. With Kentik, we can identify idle resources and generate reports that an IT team can use to decommission or re-allocate resources.</p> <p>An easy way to do this is to filter for Logging Status to find NODATA or SKIPDATA message updates.</p> <p>NODATA means no data has been received on that ENI. Seeing the NODATA message tells us that an active ENI attached to an instance isn’t sending or receiving any data. Especially if we see this message over an extended period of time in our query, we can infer that this is an idle resource that could be reallocated or shut down.</p> <p>In Data Explorer, we can select Logging Status, Observing VPC and Region ID, and the flow log Account ID. We’ll change our metrics to flow data and select our timeframe of the last 30 days.</p> <p>Notice that we have numerous active resources that aren’t sending or receiving anything other than the NODATA update message in our time frame.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6rMHqDNKzAzgJ0npjI64JJ/91ac2204ac66dfef6393d09d714a65f2/logging-status.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Logging status view for idle resources" />
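<p>To picture the idle-ENI check as logic, here’s a small sketch. It assumes flow log records have already been parsed into dictionaries; the field names and threshold are illustrative:</p> <pre><code>from collections import Counter

def find_idle_enis(records, min_updates=10):
    """Flag ENIs that reported only NODATA over the query window."""
    # Records are dicts like {"interface_id": ..., "log_status": ...};
    # the field names here are illustrative, not a fixed schema.
    active = {r["interface_id"] for r in records if r["log_status"] == "OK"}
    nodata = Counter(
        r["interface_id"] for r in records if r["log_status"] == "NODATA"
    )
    return [
        eni for eni, count in nodata.items()
        if count >= min_updates and eni not in active
    ]
</code></pre>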
<p>In effect, we’re auditing data movement in and out of our public cloud instances to manage data transfers effectively, understand where our cloud bill is coming from, and, therefore, minimize cloud costs.</p> <h2 id="managing-cloud-costs-means-understanding-the-data">Managing cloud costs means understanding the data</h2> <p>Just like managing any budget requires understanding where our money is going, managing public cloud costs requires knowing the nature of the traffic moving through our cloud instances, and specifically through the various types of internet gateways.</p> <p>When using a metered service like the public cloud, every egress byte represents an expense. Determining where those expenses are coming from, why they are occurring, and how they trend over time allows us to manage our cloud costs using actual data, not just a best guess.</p><![CDATA[Dissecting the FCC’s Proposal to Improve BGP Security]]><![CDATA[The FCC recently published a proposal to require nine major US internet service providers to deploy RPKI Route Origin Validation (ROV). In this post, we’ll look at the proposal and where the nine service providers currently stand with respect to ROV.]]>https://www.kentik.com/blog/dissecting-the-fccs-proposal-to-improve-bgp-securityhttps://www.kentik.com/blog/dissecting-the-fccs-proposal-to-improve-bgp-security<![CDATA[Doug Madory]]>Wed, 10 Jul 2024 04:00:00 GMT<p>As part of its ongoing efforts to improve Border Gateway Protocol (BGP) routing security, the US Federal Communications Commission (FCC) <a href="https://www.fcc.gov/document/fcc-proposes-internet-routing-security-reporting-requirements-0">proposed a rule</a> last month that it hoped would “increase the security of the information routed across the internet and promote national security.”</p> <p>Specifically, the FCC is seeking comment on a proposal that would require nine major US internet service providers to draft “BGP Routing Security Risk Management Plans” (BGP Plans) and confidentially report them to the federal agency.</p> <p>These “BGP Plans” would primarily cover two areas:</p> <ol> <li>Describe the specific efforts made to create and maintain Route Origin Authorizations (ROAs) for at least 90% of the routes under its control. [¶ 37, 54]</li> <li>Describe the extent to which it has implemented ROV filtering at its interconnection points. [¶ 50]</li> </ol> <p>In other words, each provider would report its progress deploying RPKI Route Origin Validation (ROV).</p> <div as="Promo"></div> <p>In this blog post, we’ll take a look at each of these nine service providers and see where they stand with respect to ROA creation using Kentik’s unique data sets.</p> <h2 id="background">Background</h2> <p>In her remarks on the proposal, Chairwoman Jessica Rosenworcel summed up the importance of securing internet routing like this:</p> <blockquote> …we all rely on BGP. Every one of us, every day. That is true if you are running a small business and using connections to engage with customers and suppliers, banking online, having a telemedicine session with a healthcare provider, helping the kids with their digital age schoolwork, staying in touch with family, or keeping up to date on the news. BGP is in the background, helping connect our critical infrastructure, support emergency services, keep the financial sector running, shore up manufacturing, and more. </blockquote> <p>In February 2022, the FCC began its work on this important topic by <a href="https://www.fcc.gov/document/fcc-launches-inquiry-internet-routing-vulnerabilities">launching an inquiry</a> seeking comment on the best ways to help address BGP security.</p> <p>In July 2023, the FCC hosted a <a href="https://www.fcc.gov/news-events/events/2023/07/bgp-security-workshop">workshop on BGP security</a> in which multiple presenters cited Kentik’s analysis on RPKI ROV adoption. Tony Tauber, Engineering Fellow at Comcast, <a href="https://www.youtube.com/watch?v=VQhoNX2Q0aM&#x26;t=3973s">cited our analysis</a> to argue that traffic data (i.e., NetFlow) suggests that we’re farther along in RPKI ROV adoption than the raw counts of BGP routes might suggest. 
Nimrod Levy of AT&#x26;T <a href="https://www.youtube.com/watch?v=VQhoNX2Q0aM&#x26;t=4332s">cited our observation</a> that a route that is evaluated as RPKI-invalid will have its propagation reduced by 50% or more.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1X3MNjf1HeMidx6W3BYFyM/16915b831e9cc95c0057365b744da6b6/bgp-security-workshop-slide.jpg" style="max-width: 700px;" class="image center" thumbnail alt="Slide from FCC BGP security workshop" /> <div class="caption" style="margin-top: -35px">Slide by Doug Montgomery (NIST) at the BGP Security Workshop (pie chart from <a href="https://www.kentik.com/blog/exploring-the-latest-rpki-rov-adoption-numbers/">this post</a>)</div> <p>And last summer, Chairwoman Rosenworcel and Jen Easterly, Director of the Cybersecurity and Infrastructure Security Agency (CISA), co-published a call to action on BGP security entitled, <a href="https://www.fcc.gov/news-events/notes/2023/08/02/most-important-part-internet-youve-probably-never-heard">The Most Important Part of the Internet You’ve Probably Never Heard Of</a>. Although, if <em>you’re</em> reading this post, I’m going to guess <em>you’ve</em> probably heard of it. 😀</p> <p>The <a href="https://docs.fcc.gov/public/attachments/FCC-24-62A1.pdf">full FCC proposal</a> cites analysis from Kentik on multiple occasions throughout the document. Paragraph 4 cites my post, <a href="https://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidents/">A Brief History of the Internet’s Biggest BGP Incidents</a>, to help describe the vulnerabilities present in our current internet routing system. Paragraph 11 cites our post, <a href="https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services/">What can be learned from recent BGP hijacks targeting cryptocurrency services?</a>, which covered how BGP hijacks were successfully employed by hackers against cryptocurrency services. Finally, paragraph 18 cites <a href="https://www.kentik.com/blog/exploring-the-latest-rpki-rov-adoption-numbers/">Exploring the Latest RPKI ROV Adoption Numbers</a>, which was part of my continuing collaboration with <a href="https://www.fastly.com/blog/job-snijders/">Job Snijders of Fastly</a> to measure RPKI ROV adoption.</p> <iframe src="https://fast.wistia.net/embed/iframe/qpfh2174t5?seo=false&videoFoam=true" title="Breaking the 50% Barrier: an RPKI ROV Discussion with Job Snijders Audio" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen width="100%" height="218px"></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async></script> <h2 id="internet-society-and-global-cyber-alliance-response">Internet Society and Global Cyber Alliance response</h2> <p>In reaction to the proposed FCC rule, the Internet Society joined with the Global Cyber Alliance (<a href="https://globalcyberalliance.org/achieving-greater-heights-manrs/">the former and current homes</a> of the <a href="https://manrs.org/">MANRS</a> secure routing initiative) to publish an <a href="https://www.internetsociety.org/wp-content/uploads/2024/04/2024-FCC-Ex-Parte-re-BGP.pdf">Ex Parte response</a> pushing back on any attempt to regulate BGP.</p> <p>They argue that codifying a technical security mechanism would be counterproductive. 
The first piece of this argument states that “the United States industry is already addressing routing security challenges,” and to support this claim, the paper relies on my collaboration with Job on both <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">progress on ROA creation</a> and the <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">rejection of RPKI-invalids</a>.</p> <p>The Ex Parte goes on to argue that regulation could prove burdensome for smaller service providers and could limit the ability of service providers to quickly adapt to emerging security standards: if those standards change in the future, the paper argues, codified rules could hinder providers from keeping pace. Job Snijders raised similar points in his <a href="https://www.fcc.gov/ecfs/document/10510264496934/1">2022 FCC filing</a>.</p> <h2 id="the-nine">The nine</h2> <p>The proposal focuses on providers of “broadband internet access service (BIAS).” Let’s see where these companies presently stand in terms of both BGP routes and traffic statistics based on Kentik’s aggregate NetFlow data.</p> <p>For each of the nine service providers below, I discuss the ASNs we’re using for the analysis, the percentage of routes originated by those ASNs that have ROAs, and, finally, the proportion of traffic going to those routes versus routes without ROAs.</p> <h3 id="att">AT&#x26;T</h3> <p><strong>ASNs: 7018, 20057</strong></p> <p>While AT&#x26;T has over 100 ASNs, Kentik’s traffic statistics reveal that 98% of traffic to AT&#x26;T goes to AS7018 and AS20057. In order to simplify our analysis, we’ll just use those two ASNs. At the beginning of July 2024, we observed AS7018 originating 2995 prefixes in total. Of those, 960 were RPKI-valid, while 2034 were RPKI-unknown. AS20057 originates 1537 prefixes, and none of them have ROAs.</p> <p>Kentik observes that <strong>58.1%</strong> of traffic in bits/sec to AT&#x26;T is to RPKI-valid routes, while the rest goes to routes without ROAs.</p> <ul> <li>Routes with ROAs: 21.2%</li> <li>Proportion of traffic: 58.1%</li> </ul> <h3 id="altice">Altice</h3> <p><strong>ASNs: 19108 and 6128</strong></p> <p>Altice is a European telecommunications conglomerate. For this analysis, we’ll use its US networks, AS19108 (the former Suddenlink) and AS6128 (Cablevision). As of this writing, 68 of the 95 routes originated by AS6128 are RPKI-valid, while only 25 prefixes are RPKI-valid out of the 694 routes originated by AS19108.</p> <p>Based on Kentik’s statistics, virtually all traffic destined for AS6128 was to routes with ROAs, while all of the traffic to AS19108 was to routes without ROAs. Therefore, the overall proportion of Altice traffic going to routes with ROAs mostly reflects how traffic splits between these two ASNs in our aggregate NetFlow. AS6128 also originates a few RPKI-invalid routes, such as <a href="https://rpki-validator.ripe.net/ui/66.152.159.0%2F24?validate-bgp=true">66.152.159.0/24</a>.</p> <ul> <li>Routes with ROAs: 11.8%</li> <li>Proportion of traffic: 60.6%</li> </ul>
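<p>The route-level percentages in these summaries are simple to reproduce. As a quick arithmetic check using the AT&#x26;T and Altice counts above:</p> <pre><code>def roa_coverage(valid, total):
    """Percentage of originated routes covered by ROAs."""
    return round(100 * valid / total, 1)

# AT&#x26;T: AS7018 has ROAs for 960 of 2995 routes; AS20057 for 0 of 1537.
print(roa_coverage(960, 2995 + 1537))   # 21.2

# Altice: AS6128 has ROAs for 68 of 95 routes; AS19108 for 25 of 694.
print(roa_coverage(68 + 25, 95 + 694))  # 11.8
</code></pre>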
<h3 id="charterspectrum">Charter/Spectrum</h3> <p><strong>ASNs: 20115, 20001, 11427, 10796, 33363 and 20+ more</strong></p> <p>Charter, which goes by the trade name of Spectrum, is composed of several acquisitions, including <a href="https://www.reuters.com/article/business/charter-communications-completes-purchase-of-time-warner-cable-idUSKCN0Y92BR/">Time Warner Cable</a>. Whether counting cumulative BGP routes or aggregate traffic volume, Charter presently scores very high in terms of RPKI ROV deployment.</p> <ul> <li>Routes with ROAs: 92.6%</li> <li>Proportion of traffic: 99.9%</li> </ul> <h3 id="comcast">Comcast</h3> <p><strong>ASNs 7922 and 20+ more</strong></p> <p>Like AT&#x26;T, Comcast uses numerous ASNs. We’ve included the top 29 Comcast ASNs based on traffic in this analysis. By either metric (BGP routes or traffic), Comcast scores very high in terms of RPKI ROV preparedness.</p> <p>The Comcast ASNs in our analysis originated 4583 prefixes, 4362 (93.5%) of which had ROAs. In addition, we see 99.4% of traffic to these Comcast ASNs going to these routes with ROAs.</p> <ul> <li>Routes with ROAs: 93.5%</li> <li>Proportion of traffic: 99.4%</li> </ul> <h3 id="cox">Cox</h3> <p><strong>ASN: 22773</strong></p> <p>AS22773 originates 4,845 prefixes, 96.7% of which are RPKI-valid. AS22773 also announces a handful of RPKI-invalid routes, such as <a href="https://rpki-validator.ripe.net/ui/50.118.213.0%2F24?validate-bgp=true">50.118.213.0/24</a>. Based on Kentik’s statistics, nearly all traffic to Cox goes to RPKI-valid routes.</p> <ul> <li>Routes with ROAs: 96.7%</li> <li>Proportion of traffic: 99.8%</li> </ul> <h3 id="lumen">Lumen</h3> <p><strong>ASNs: 3356, 209, 3561, 22561</strong></p> <p>Lumen might not be a household name outside of places like Seattle,* but it plays an important role for the global internet. It is unlike nearly any other service provider on this list because of its additional role as one of the world’s largest transit providers.</p> <p>Composed of the networks of acquired telecommunications companies, including Qwest (AS209), Level 3 (AS3356), and Savvis (AS3561), Lumen is one of the largest telecommunications providers in the United States.</p> <p>According to Kentik’s statistics, most traffic to Lumen is destined for the networks of AS209 and AS3356, and nearly none of the prefixes originated by Lumen’s ASNs have ROAs. However, it should be emphasized that AS209 and AS3356 reject RPKI-invalid routes, providing a critical benefit to their large international customer base and the greater internet. Now, they need to begin creating ROAs.</p> <ul> <li>Routes with ROAs: 3.9%</li> <li>Proportion of traffic: 0.0%</li> </ul> <p><em>*Lumen Field is the home stadium of the Seattle Seahawks of the NFL, Seattle Sounders FC of the MLS, and Seattle Reign FC of the National Women’s Soccer League.</em></p> <h3 id="t-mobile">T-Mobile</h3> <p><strong>ASN: 21928</strong></p> <p>As far as RPKI ROV preparedness goes, T-Mobile is clearly the best of the US mobile providers. This is likely due to the fact that T-Mobile’s business is entirely providing mobile service, while the other US operators provide a variety of other network services, some of which were gained through prior acquisitions.</p> <p>Regardless, 99.2% of the 528 routes originated by AS21928 have ROAs, and virtually all traffic to T-Mobile is destined for IP space covered by ROAs.</p> <ul> <li>Routes with ROAs: 99.2%</li> <li>Proportion of traffic: 100%</li> </ul> <h3 id="tds">TDS</h3> <p><strong>ASNs: 4181, 6614</strong></p> <p>In this analysis, we used two ASNs to represent TDS: AS4181 (TDS) and AS6614 (US Cellular). These two networks are polar opposites when it comes to RPKI ROV deployment.</p> <p>AS4181 has ROAs for virtually all of the 1200+ routes it originates, and the proportion of traffic to AS4181 routes with ROAs is 100%. 
Conversely, AS6614 has ROAs for none of the 53 routes it originates.</p> <p>Since US Cellular is in the <a href="https://investors.uscellular.com/news/news-details/2024/UScellular-and-TDS-Announce-Sale-of-Wireless-Operations-and-Select-Spectrum-Assets-to-T-Mobile-for-Approximately-4.4-Billion-in-Cash-and-Assumed-Debt/default.aspx">process of being sold</a> to T-Mobile (listed above), AS6614 will soon be part of that network, and it will be the acquirer’s responsibility to deploy RPKI ROV there. The FCC proposal explicitly mentioned that US Cellular should be counted as part of TDS, despite the pending sale. If we removed AS6614 from the analysis, TDS would have nearly perfect coverage as measured by both BGP routes and traffic stats.</p> <ul> <li>Routes with ROAs: 95.6%</li> <li>Proportion of traffic: 82.5%</li> </ul> <h3 id="verizon">Verizon</h3> <p><strong>ASNs: 701, 6167, 22394</strong></p> <p>For this analysis, we used AS701, AS6167, and AS22394 to represent Verizon’s mobile service in the US. AS702 and AS703 are also part of Verizon, but they provide service in EMEA and Asia-Pacific, respectively, and are therefore out of scope.</p> <p>Based on Kentik’s aggregate NetFlow data, AS701 is the biggest destination of traffic to Verizon in the US. A clear majority (819 out of 1,044) of the routes originated by AS701 have ROAs, and 76.3% of the traffic to AS701 goes to routes with ROAs.</p> <p>A little under half of the routes originated by AS6167 have ROAs, and only about a sixth of those originated by AS22394 do.</p> <ul> <li>Routes with ROAs: 56%</li> <li>Proportion of traffic: 64.1%</li> </ul> <h2 id="summary-of-analysis">Summary of analysis</h2> <p>Based on the analysis described above, only five of the nine BIAS providers named in the FCC proposal would currently pass the 90% metric of routes with ROAs. Of course, this metric could be gamed: a provider could de-aggregate (i.e., split into more routes) address space covered by ROAs and aggregate (i.e., consolidate into fewer routes) address space not covered by ROAs. Splitting one large ROA-covered prefix into many smaller announcements, for example, inflates the count of “valid” routes without improving anything about the underlying address space.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/66jWBzGJIshcXfY8mcBJ9D/58ad2a5dc7c26cc2600ef3cdce30d2c9/traffic-bias-providers-bgp.png" style="max-width: 600px;" class="image center" alt="Traffic to BIAS providers by RPKI evaluation (BGP routes)" /> <p>Regardless, the other four providers have some work to do here to receive a passing grade.</p> <p>Of course, the BGP plans the FCC is asking for are “risk management” plans. One way to focus on this risk is to not treat each BGP route equally but instead focus on <em>where traffic is going.</em> Some routes simply don’t carry much traffic at all and therefore pose less risk.</p> <p>Kentik’s aggregate NetFlow enables us to uniquely operate in this dimension. Below is the same set of providers named in the proposal when traffic volume (bits/sec) is used as the metric instead of counts of BGP routes. This metric would also be more difficult for a provider to game than counts of BGP routes. 
On the other hand, it would not capture the importance of routes that carry critical but low-traffic services such as DNS.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5ZHwHDJQxjsWCOPfqRD10a/5ac335ae738633538e85ff2581ee100e/traffic-bias-providers-bps.png" style="max-width: 600px;" class="image center" alt="Traffic to BIAS providers by RPKI evaluation (traffic in bps)" /> <p>TDS has a lower percentage of “valid traffic” simply due to the inclusion of US Cellular, which has no ROAs for its IP address space. Verizon, Altice, and AT&#x26;T all appear much more favorable when traffic is analyzed, as opposed to counts of BGP routes. These providers may have already conducted some risk management and focused their RPKI deployment efforts on the more important routes. Lumen scores low on both metrics.</p> <h2 id="conclusion">Conclusion</h2> <p>It is unclear if requiring the deployment of RPKI ROV is the right move. Such a requirement may not be necessary and may have unintended consequences. As the Ex Parte response from ISOC and GCA points out, the US government (including the US Department of Defense) is light-years behind most major US service providers in terms of RPKI ROV deployment. Perhaps the FCC could begin with the networks of the US federal government before requiring the same of private industry.</p> <p>It’s also important to appreciate how we were able to make significant progress on deploying this technology. A couple of months ago, we <a href="https://www.kentik.com/blog/rpki-rov-deployment-reaches-major-milestone/">broke the 50% barrier</a> for IPv4 routes with ROAs. This global feat was accomplished without government mandates. It took years of technical discussions, debate, and advocacy until it became “socially unacceptable” within the internet community to not have a plan for RPKI ROV. Peer pressure and, frankly, shame can be powerful motivators.</p> <p>Lastly, it will be interesting to see how the recent Supreme Court <a href="https://www.scotusblog.com/2024/06/supreme-court-strikes-down-chevron-curtailing-power-of-federal-agencies/">decision to overturn</a> the long-standing <a href="https://www.law.cornell.edu/wex/chevron_deference">Chevron deference</a> in <a href="https://www.scotusblog.com/case-files/cases/loper-bright-enterprises-v-raimondo/">Loper Bright Enterprises v. Raimondo</a> affects the FCC’s ability to develop rules such as this one. If the development of such a rule requires an act of Congress, it may be unlikely to ever happen.</p> <p>Thankfully, the internet community has shown that progress is possible without regulatory requirements. Good thing, because securing BGP routing is going to take a lot more than just the widespread deployment of RPKI ROV, as great as it is.</p> <p>We need to continue to work together to develop and advocate for additional mechanisms of route hygiene, and there are many ways the FCC can play a vital role in those efforts. These include providing grants to those developing the software needed to enable routing security, or collecting and disseminating adoption metrics like the ones presented in this blog post.</p> <p>If you wish to contribute your opinion or expertise on this topic, you can do so by submitting a <a href="https://www.fcc.gov/ecfs/filings/standard">filing here</a>.</p><![CDATA[Demystifying Internet Measurement: A Profile of Doug Madory]]><![CDATA[Dubbed “The man who can see the internet” by the Washington Post, Doug Madory has made significant contributions to the field of internet measurement. 
In this post, we explore how internet measurement works and what secrets it can uncover.]]>https://www.kentik.com/blog/demystifying-internet-measurement-a-profile-of-doug-madoryhttps://www.kentik.com/blog/demystifying-internet-measurement-a-profile-of-doug-madory<![CDATA[Phil Gervasi]]>Tue, 02 Jul 2024 04:00:00 GMT<p>Internet measurement is all about understanding the intricacies of internet activity and performance on a global scale. It involves analyzing data from various sources, which often means detecting outages, identifying security threats, and understanding how geopolitical forces impact the internet.</p> <h2 id="the-importance-of-internet-measurement">The importance of internet measurement</h2> <p>Because the internet is the defining technology of our time—impacting businesses, individuals, and society at large—understanding its performance and reliability is crucial. Industry experts have advanced the field significantly, and though often operating behind the scenes, they have provided invaluable insight into what’s happening with the internet at scale.</p> <p>This is important for improving availability, performance, and security, but it’s also been critical for business leaders and policymakers to understand the relationship between the internet and society itself.</p> <p><a href="https://www.linkedin.com/in/dougmadory">Doug Madory</a>, the director of internet analysis at Kentik, is a key figure in the field of internet measurement. His educational background is in electrical engineering and network engineering, and he’s dedicated his career to understanding and analyzing global internet connectivity for well over a decade. His journey began as an officer in the United States Air Force, where he worked in a telecommunications role. After his military service, Doug started his civilian career at Renesys, specializing in BGP analysis for service providers and telecom companies.</p> <div as="WistiaVideo" videoId="ktous2qht5"></div> <h2 id="a-collaborative-effort">A collaborative effort</h2> <p>Over recent decades, collaboration between industry experts and academia has pushed the field forward. Organizations like the <a href="https://www.caida.org/">Center for Applied Internet Data Analysis</a>, or CAIDA, at the University of California, San Diego; the Georgia Institute of Technology; and the <a href="https://www.iij.ad.jp/en/">Internet Initiative Japan</a> are at the forefront of this research, contributing to better tools and methodologies for internet measurement.</p> <p>At Kentik, Doug works with these and other organizations to analyze global internet connectivity, outages, and security issues. This has led to notable contributions, such as detecting and analyzing major incidents like the 2021 AWS US-East-1 outage. In December 2014, he discovered North Korea’s first social media site, which briefly allowed global access before being shut down. This incident underscored the importance of vigilance and awareness in internet measurement while providing a glimpse into the sometimes surprising aspects of the field.</p> <p>Doug’s work often intersects with geopolitical issues because today’s societies and governments rely on the internet. In that light, he’s written extensively about internet shutdowns in regions such as Northern Africa and Southeast Asia, offering crucial insights into how these events impact connectivity and access to information. 
His analysis of internet connectivity in Ukraine during the ongoing conflict has been particularly noteworthy, highlighting how geopolitical events affect the global internet landscape.</p> <h2 id="finding-answers-with-the-data">Finding answers with the data</h2> <p>Internet measurement relies primarily on BGP data, synthetic testing, and flow data such as NetFlow. The <a href="https://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidents/">Border Gateway Protocol, or BGP</a>, is the routing protocol the internet runs on, making it an important source of information about how traffic is moving globally. Analyzing publicly available BGP data provides insights into how internet traffic moves from nation to nation and among public clouds. It highlights the efforts of countries to suppress information, re-route traffic, or otherwise manipulate internet traffic in their geographic region.</p> <div as="Promo"></div> <p>Synthetic testing measures the internet’s performance by sending artificial test traffic between test agents deployed in strategic locations around the world. Metrics from these tests provide information like packet loss and network latency along paths between service providers, across submarine telecommunication cables, and so on.</p> <p>Flow data, such as NetFlow, provides information about traffic flowing through network devices. In this way, engineers working in internet measurement can gain insight into specific traffic volume, traffic type, and traffic patterns. This is crucial for understanding how application traffic is delivered over the internet; it also plays a role in detecting security threats against publicly accessible networks and in gauging access to information worldwide.</p> <p>For example, using flow data and BGP information, Doug was able to detect and track a significant drop in traffic in and out of Kenya immediately after recent violent protests. With this data, he determined that the cause was likely a deliberate attempt by the Kenyan government to stifle internet access for its citizens.</p> <h2 id="the-man-who-can-see-the-internet">The man who can see the internet</h2> <p>Dubbed “<a href="https://www.washingtonpost.com/news/the-switch/wp/2014/08/06/the-man-who-can-see-the-internet/">The man who can see the internet</a>” by the Washington Post, Doug has made significant contributions to the field of internet measurement through his expertise and dedication. His work, along with that of others in academia and industry, helps ensure that the internet remains a reliable and secure tool for communication and commerce. By understanding and analyzing the complex web of global internet connectivity, Doug and his colleagues in the field help maintain the integrity of the global internet infrastructure.</p> <p>For more insights from Doug Madory, follow his <a href="https://www.kentik.com/blog/author/doug-madory/">blog posts</a> on the Kentik website and listen to his appearances on the <a href="https://www.kentik.com/telemetrynow/">Telemetry Now</a> podcast. Also, make sure to watch <a href="https://www.youtube.com/watch?v=sBzr-nQGe1U">LinkedIn Live</a>, where Doug was interviewed about his experiences working in internet measurement.</p><![CDATA[Exploring NetFlow with Kentik: A Simple Explainer]]><![CDATA[Testing NetFlow used to require time, expertise, and lab equipment. 
Using Kentik and the new Kappa agent, it can be done in minutes with nothing more than a spare Linux machine.]]>https://www.kentik.com/blog/exploring-netflow-with-kentik-a-simple-explainerhttps://www.kentik.com/blog/exploring-netflow-with-kentik-a-simple-explainer<![CDATA[Leon Adato]]>Thu, 27 Jun 2024 04:00:00 GMT<p>The technology commonly referred to as “NetFlow,” by which I mean everything from Cisco’s eponymous protocol to JFlow, sFlow, and even VPC Flow Logs on AWS, is possibly one of the most essential techniques that fall under the umbrella of observability overall, and <a href="https://www.kentik.com/blog/network-observability-beyond-metrics-and-logs/">network observability</a> in particular.</p> <p>The challenge is that, while many can appreciate its importance, <a href="https://www.kentik.com/kentipedia/what-is-netflow-overview/" title="Kentipedia: What is NetFlow? An Overview of the NetFlow Protocol">NetFlow</a> is a technology that’s been frustratingly difficult to try out or test in a lab or sandbox environment without a significant amount of time, expertise, and preparation.</p> <p>At least, that’s always been my experience in the past.</p> <p>It turns out that the <a href="https://kb.kentik.com/v0/Bd05.htm#Bd05-About_Kappa">Kentik Kappa</a> agent allows for the easy setup and collection of NetFlow metrics on everything from Kubernetes clusters to bare metal Linux boxes. So today I want to show you how, in a few steps, you can set up a lab box, along with a free <a href="https://www.kentik.com/go/get-started/">Kentik account</a>, and see how NetFlow is both different from and vastly superior to the network metrics you may be used to seeing in other monitoring and observability solutions.</p> <div as="WistiaVideo" videoId="gr3k0rh1nn"></div> <h2 id="step-1-get-a-kentik-account">Step 1: Get a Kentik account</h2> <img src="//images.ctfassets.net/6yom6slo28h2/2ORaT16Bki8Iu41XNM0Xn0/6d7861f81d2493bdc1ee6c0c12641a73/authentication.png" style="max-width: 280px;" class="image right" alt="Automation in dropdown" /> <p>The good news is this is an easy step, and you may have already done it!</p> <p>On the off chance you don’t have a Kentik account yet, head over to the <a href="https://www.kentik.com/go/get-started/">sign-up page</a>, fill in a few fields, and you’ll be up and running in no time.</p> <p>There’s only one other essential step. Once you’re logged into your fresh new Kentik account, go to the account menu (the “head” in the upper right corner) and click “Authentication.”</p> <p>On that screen you’ll see an API Token. Copy it and put it somewhere for safekeeping because you’ll need it further down in this process.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4ZyWDIhOG8C4dDSojVSqfU/2c595035587b8929251c97a57c606da2/profile-authentication.png" style="max-width: 600px;" class="image center" thumbnail withFrame alt="Copy API token" /> <h2 id="step-2-get-a-linux-box-to-test-on">Step 2: Get a Linux box to test on</h2> <p>Next, you’ll need a machine to test this out on. Honestly, it doesn’t matter (to me) if this is your personal machine, a box sitting under your desk, a virtual machine running in <a href="https://www.virtualbox.org/">VirtualBox</a> or <a href="https://blogs.vmware.com/workstation/2024/05/vmware-workstation-pro-now-available-free-for-personal-use.html">VMware</a>, a Docker container, or something else.</p> <p>In the example below, I’m using Ubuntu 22.04 running in VirtualBox. 
But you’re welcome to salt to taste.</p> <h2 id="step-3-download-install-and-configure-the-kentik-kappa-agent">Step 3: Download, install, and configure the Kentik Kappa agent</h2> <p>The Kappa agent for Kentik can be found in this repository: <a href="https://packagecloud.io/kentik/kappa">https://packagecloud.io/kentik/kappa</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/784ODoLrglkUUkHGw2CzsQ/3ddb2f0cade8cd660304c9e79c587753/kappa-repository.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kappa agent" /> <p>Find the link that matches your OS and version, and click the “Install” button on the right. While there are commands to install via the package manager, my experience is that people log into systems with accounts that may or may not have the correct level of built-in access (read: not root) to get the job done correctly. Therefore, I recommend using the wget command to pull down the file, and then using either sudo rpm -ivh (for .rpm files) or sudo dpkg -i (for .deb files) to do the actual installation.</p> <img src="//images.ctfassets.net/6yom6slo28h2/43sJKB4HBYcSvAaTEvwm3Q/32ec0175fb7e3d5a21c82f45da19ca07/command.png" thumbnail withFrame style="max-width: 800px;" class="image center" alt="Command screen" /> <p>Once installed, you need to configure the agent so it connects this machine with your Kentik account. Using your favorite editor, open up /etc/default/kappa. It should look like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2imN7258xu74WQKYjxDS73/da55ad846aa4aa233ce429025ea27093/config-test.png" style="max-width: 400px;" class="image center" alt="Config file" /> <ul> <li>For KENTIK_EMAIL=, enter the email you used to sign up for your Kentik account.</li> <li>After KENTIK_TOKEN=, paste the API token I asked you to remember in step 1.</li> <li>Believe it or not, KENTIK_DEVICE= is going to matter a lot. You can put anything you want (it doesn’t have to match the actual hostname), but you’ll need to keep track of whatever you put here for later in the process.</li> <li>For KENTIK_REGION=, use either “US” or “EU” (without the quotes).</li> </ul>
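<p>Filled in, the file is just those four key/value pairs. Here’s a sketch with placeholder values (substitute your own email, API token, and device name):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">KENTIK_EMAIL=you@example.com
KENTIK_TOKEN=paste-your-api-token-here
KENTIK_DEVICE=netflow-lab-01
KENTIK_REGION=US</code></pre></div>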
<p>When done, your config file in the editor should look something like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/36WD8zp3pmfb2XbgstcLKX/38ec27647df81c6434b93bd0b9a90f8f/config-file.png" style="max-width: 400px;" class="image center" alt="Config file completed" /> <p>One last thing: Before we move to the next step, make a note of the device IP address.</p> <p>While we’ll need to come back to this system to restart the Kappa agent, we’re putting that step on hold for just a second and moving on to…</p> <h2 id="step-4-add-the-device-to-kentik">Step 4: Add the device to Kentik</h2> <p>Back in your Kentik portal, click the hamburger menu in the upper left corner, then Settings, and then choose Network Devices. This will take you to a screen where you can add your first network device.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4Lzm0A2uGUS1UkaVLf7ZwF/b15c27c8f42435d6e5f03ea93e9c84e7/add-device.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Add a device" /> <p>Clicking the friendly blue Add Device button in the upper right area of the screen will reveal a popup with several tabs or steps.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1N3MApVuwb8XgI9gZ96b4M/2279754e75dfccc3c5aba6d56ba765eb/device-general.png" style="max-width: 600px;" class="image center" thumbnail withFrame alt="Add a device - General tab" /> <ul> <li>Starting on the General tab, put in the name of the device. Remember that this has to match the name you put in the configuration file in step 3!</li> <li>Change the type from NetFlow-Enabled Router to Kentik Host Agent (kprobe).</li> <li>Finally (for this section), select a billing plan. For this test, the Free Trial Plan should be fine.</li> <li>The description, site, and label elements are optional.</li> </ul> <p><strong>Do not</strong> click Add Device. Instead, click Flow from the tabs at the top.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4iquj9M3Wk61Trj3Ceed5C/d9548273a289d68050659e0b297e0edb/device-flow.png" style="max-width: 600px;" class="image center" thumbnail withFrame alt="Add a device - Flow tab" /> <p>On this screen you’ll add the IP address (as noted in step 3) and set the sample rate. Our recommendation is that 10 is a good initial setting, and you can adjust up (for less detail) or down (for more) once you have a sense of both the granularity you need as well as the cost impact.</p> <h2 id="step-3-the-missing-link">(Step 3: The missing link)</h2> <p>As mentioned, we have one last task back on the test machine itself: restarting the Kappa agent.</p> <p>The right way to do this is to both restart the agent and set it to automatically start after every reboot. The commands to do that are:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">sudo systemctl enable kappa-agg.service
sudo systemctl enable kappa-agent.service
sudo systemctl start kappa-agg.service
sudo systemctl start kappa-agent.service</code></pre></div> <p>Note that these commands assume systemd, so you’ll need to adjust them if your distribution uses a different service manager.</p> <h2 id="step-5-explore-your-netflow">Step 5: Explore your NetFlow</h2> <p>At this point, the Kappa agent should be sending NetFlow data to Kentik, such as it is. I mean, this <em>is</em> a test system so it’s probably not sending much in the way of network traffic. To see it, click the hamburger menu in the upper left corner and select Data Explorer.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Vp8stS5FWfSlL4km0hMhy/1973a92738c964ea79d49416200b905e/data-explorer-sankey.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="See your data in Kentik" /> <p>Don’t panic! I know your screen doesn’t look like this. <em>Yet.</em> You just have to tweak a few settings:</p> <ol> <li>Add two dimensions <ul> <li>Device</li> <li>Application</li> </ul> </li> <li>Change the Visualization Type to Sankey</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/3nLUZDIg1HhRIuLArMvfTG/69e3d4873471309018ce0a834ec83201/data-explorer-sankey2.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="See your data - completed" /> <h2 id="you-did-it-what-now">You did it! 
What now?</h2> <p>With that done (and possibly after waiting a bit for some actual data to flow), you should be ready to explore the awe-inspiring, eye-opening, insight-revealing world of NetFlow!</p> <p>What comes next is a bit of a “choose your own adventure” situation. You may want to set up various applications on this test system, and then see what NetFlow shows you about how it’s performing and where the application is connecting. Or you might be ready to add more devices (network, server, or otherwise) to your Kentik test account and see how the network observability data from all those devices combine to show a bigger picture of your infrastructure performance and stability.</p> <p>Or you might be ready to jump into this network observability thing feet-first, setting up a full (non-trial) Kentik account and adding your production devices.</p> <p>The choice is yours, and we here at Kentik eagerly await whatever lies in store for you next!</p><![CDATA[Network Observability: Beyond Metrics and Logs]]><![CDATA[The network is the metaphorical plumbing through which all your precious non-metaphorical observability and monitoring data flows. It holds secrets you'll never find anywhere else. And it's more important today than ever before.]]>https://www.kentik.com/blog/network-observability-beyond-metrics-and-logshttps://www.kentik.com/blog/network-observability-beyond-metrics-and-logs<![CDATA[Leon Adato]]>Tue, 25 Jun 2024 04:00:00 GMT<p>These days, when you want to discuss tracking the state, stability, and overall performance of your environment, from the bottom layer of infrastructure all the way up to the application, we don’t say “monitoring.” We use the word “observability.”</p> <p>Now I like “observability.” It’s got history. Power. Gravitas. <em>But</em>…</p> <h2 id="respect-your-elders">Respect your elders</h2> <p>When I started out doing this work – installing and sometimes even creating solutions to track how my stuff was doing, and later replicating the same work to track how an entire business’ stuff was doing – we didn’t have any of this fancy “observability” stuff.</p> <p>No, I’m not saying, “Back in my day, we had to smack two rocks together and listen for the SNMP responses.” I’m saying that when I started, we had no way to understand the real user experience. What we had were metrics and logs from discrete elements and subsystems, which we had to cobble together and then extrapolate the user’s experience from that information.</p> <p>And then, almost miraculously, application tracing came onto the scene. Finally, we had the thing we’d always wanted. Not by inference, not using a synthetic test. We had the user’s actual experience, in real time. It was glorious.</p> <p>And it completely blinded us to anything else. Yes, in that moment “observability” was born. But in many ways “monitoring” died. Or, perhaps more accurately, it was dismissed to a dim corner of the data center, to be referred to infrequently, and respected even less often.</p> <p>And this was, in my opinion, to our overall detriment. The truth is, you won’t fully understand the health and performance of your application, infrastructure, or overall environment until and unless you include metrics. 
And more specifically, until you understand the network side of it.</p> <p>And that’s the goal of this blog post: to share the real, modern, useful value of network monitoring and observability and show how that data can enhance and inform your other sources of insight.</p> <h2 id="what-network-observability-is-not">What network observability is <em>not</em></h2> <p>For many tech practitioners, the idea of monitoring the network begins and ends with ping.</p> <img src="//images.ctfassets.net/6yom6slo28h2/33FtaRjaizUYsd7OiNR00J/9dd2611bcfd959d20a9a52185bc4ea7e/ping.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Ping example" /> <p>Or worse, with PCAP and a long slog through Wireshark.</p> <img src="//images.ctfassets.net/6yom6slo28h2/N66I192zIcDq0htKAtYR2/27084dfa770b579c890f49a5a9d9db5d/wireshark.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Wireshark example" /> <p>To be honest, if it were true that this was the best that “network monitoring” had to offer, nobody – including me – would want to deal with it. But, after 35 years working in tech, and 25 of those years focused on monitoring and observability, I’m happy to report that it’s <em>not</em> true. Network monitoring and observability are way more than that.</p> <h2 id="the-difference-between-monitoring-and-observability">The difference between monitoring and observability</h2> <p>Before I go any further, I think it’s important to clarify what I mean when I say “monitoring” and “observability,” and if there are any meaningful differences between those terms (there are). While a lot of ink has been spilled (to say nothing of several flame wars on social media) on this topic, I believe I can offer a high-level gloss that will both clarify my meaning and avoid having amazing folks like Charity Majors show up at my house to slap me so hard I’ll have observability of my own butt without having to turn around.</p> <p>“Monitoring” solutions and topics concern themselves with (again, broadly speaking):</p> <ul> <li><strong>Known unknowns</strong>: We know something <em>could</em> break, we just don’t know when.</li> <li><strong>All cardinalities of events</strong>: No matter how common or unique something is, monitoring wants to know about it.</li> <li><strong>(Mostly) manual correlation of events</strong>: The data is so specific to a particular use case, application, or technology combination that understanding how events or alerts relate to each other is something humans have to be involved with.</li> <li><strong>Domain-specific signals</strong>: The things that indicate a problem in one technology or system (like networking) bear no resemblance to the things that would tell you a different tech or system is having an issue (like the database).</li> </ul> <p>Conversely, “observability” is focused on:</p> <ul> <li><strong>Unknown unknowns</strong>: There’s no idea what might go wrong, and no idea when (or if) it will ever happen.</li> <li><strong>High cardinality</strong>: The focus is on unique signals, whether it’s a single element or a combination of elements in a particular span of time.</li> <li><strong>Correlation is (mostly) baked-in</strong>: Because of the high quantity and velocity of data that has to be considered, a human can’t possibly reason about it, and it requires machines (ML, or that darling marketing buzzword “AI”) to really get the job done.</li> <li><strong>Understanding a problem based on the “golden” signals</strong> of latency, traffic, errors, and 
saturation.</li> </ul> <p>With those definitions out of the way, we can circle back to network observability.</p> <h2 id="getting-to-the-point">Getting to the point</h2> <p>In many blog posts, this section would be where I begin to build my case by first educating you on the history of the internet, or how networking works, or why an <a href="https://www.kentik.com/blog/the-benefits-and-drawbacks-of-snmp-and-streaming-telemetry/">old-but-beloved technology like SNMP</a> still has relevance and value.</p> <p>Forget that noise. I’m going to skip to the good stuff and get straight to the essence of the question: What <em>is</em> network observability, and why do you need it?</p> <p>Network observability goes beyond simply being able to see that your application is spiking.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1MgaGo7UeucpJYAynbEn8B/bf5abcd50d89cec30d0285625ccb567b/bandwidth.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Spikes in network traffic" /> <p>It’s also understanding the discrete data types that make up the spike, both in this moment and over time. Moreover, network observability is being able to tell where the traffic – both overall and those discrete data types – is coming from and going to. <img src="//images.ctfassets.net/6yom6slo28h2/3KSxPebp9RVxRKFqmAgIOS/038759af7ea35b021262eb7724e4b28b/netflow-export.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Sankey diagram showing where network traffic is going" /></p> <p>Because your response to spiked data is going to be significantly different if the majority of the spike is due to authentication traffic versus database reads or writes versus streaming data coming from Netflix to the PC used by Bob in Accounting.</p> <p>Speaking of where that traffic is going, network observability means knowing when your application is sending traffic in directions that are latency-inducing – or worse, <em>cost</em>-inducing!</p> <img src="//images.ctfassets.net/6yom6slo28h2/1J7P2NAC08VwiNljTsOhsN/56222ab06510d4dc8380f0ede675b361/sankey-fix-hairpin.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Sankey diagram showing hairpinning network traffic" /> <p>This chart shows a “hairpin”: traffic leaves New York on the left, routes to Chicago, and then…</p> <p>Comes right back to New York. Straight out, and straight back in. Why would something like this happen? It’s surprisingly easy to do without realizing it. Examples include an application server in New York where every user login authenticates against a server in Chicago. Or a process that includes an API call to a server in the other location. Or a dozen other common scenarios.</p> <p>Hairpinning traffic isn’t just inefficient. In a cloud-based environment, it also introduces extra cost, because you will pay for both egress and ingress of that traffic on both sides of the (in this case) New York-Chicago transfer.</p>
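<p>To put rough numbers on that, here’s a back-of-the-envelope sketch. The rates are purely illustrative – actual pricing varies by provider, region, and service – but the shape of the math doesn’t change:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">5 TB/month of hairpinned traffic = ~5,000 GB each way
5,000 GB x $0.02/GB egress (New York side) = $100
5,000 GB x $0.02/GB egress (Chicago side)  = $100
Total: ~$200/month for traffic that never needed to leave New York</code></pre></div>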
<p>The challenge isn’t in solving the problem – that’s as easy as adding a local device or replica to provide authentication, data, or whatever is needed – the real challenge is in identifying that traffic.</p> <p>Network observability means having visualizations that allow folks from all IT backgrounds, not just hard-core network engineers, to see that it’s happening.</p> <p>Speaking of cost, this chart might tell you <em>something</em> in your application is spiking…</p> <img src="//images.ctfassets.net/6yom6slo28h2/2hNIMD9JwaKPGivsgmb8Z0/95ae9d9ef3a73baf041b5965010f37fa/netflow-area.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Spikes in network traffic" /> <p>…but it doesn’t really give you the insight to identify whether this is bad or simply inconvenient.</p> <p>Network observability is having the means to understand which internal network(s) that traffic is coming from, and whether any (or all) of it is headed outbound, not to mention its final destination.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3sM82aWxOazcOJYVRxcneO/87d5a01c536fbea8ef55bfecd71b18d7/netflow-sankey.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Example of costly network traffic" /> <p>Because in the image above, <em>all</em> of that traffic is costing you money!</p> <p>As much as we throw around the term “cloud native,” the vast majority of us work in <a href="https://www.kentik.com/blog/the-importance-of-hybrid-cloud-visibility/">environments that are decidedly hybrid</a>: a mixture of on-prem, colo, cloud (and, in many cases, multiple cloud providers).</p> <p>Network observability is also the ability to show traffic in motion. <img src="//images.ctfassets.net/6yom6slo28h2/7Dv3sy7fne5fQURHEW2nyu/c0fd0461d2915a57e385acf8b39e95ac/aws-traffic.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud traffic in the Kentik platform" /></p> <p>So we can see how and when it’s getting from point A to B to Q to Z.</p> <p>“The internet” itself is a “multipath” environment. This means that the packets of a single transaction can each take decidedly different routes to the same destination.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2ZNcKf8geg3R4fbq9vCqVB/adbfad0c65178adb0c5800fcb4cfaf2e/multipath.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Multipath route showing hops" /> <p>Network observability includes being able to see each stop (or “hop”) along that multipath journey, for each packet, as well as the latency between those hops. This helps you identify why application response may be inconsistently sluggish.</p> <p>Moreover, network observability is the ability to see that multipath experience from multiple points of origination. 
This allows you to understand not just a single user’s experience, but the experience of the entire base of users across the footprint of your organization.</p> <img src="//images.ctfassets.net/6yom6slo28h2/GxLXqRhFQe5RnrHjd5Knn/15c8c0068e2fb0cda57fefc66a761115/traceroute-path-view.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Digital experience across user base" /> <p>Because the question we find ourselves asking isn’t normally “Why is my app slow?”, it’s “Why is my app slow from London, but not from Houston?”</p> <p>Unless you’re a long-time network engineer, you’ve probably never heard about IX (internet exchange) points, so let me explain:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3ZlFI6hq4Ni9n51UrzwbXd/ea03d60f96d170039d79f7b9a6e31a5f/ix-sources.png" style="max-width: 450px; margin-bottom: 0;" class="image center simple" alt="Internet exchange with sources" /> <p>An IX is a physical facility that hosts routers belonging to multiple private organizations – ISPs, content providers, telecom companies, and also private businesses (like yours, probably). By putting all those routers in one place and allowing them to interconnect, IXes facilitate the rapid exchange of data between disparate organizations. They effectively make the internet smaller and closer together.</p> <p>Picking the right IX to help route traffic to/from key partner networks can make the difference between a smooth and a sluggish experience for your users.</p> <p>But honestly, how could anyone know what the “right” IX is? It would require hours of analysis of current traffic patterns, then comparing those to the types of traffic hosted by various IXes, and then trying to select the correct one.</p> <p>And, to be fair, back in the day it <em>did</em> take hours, and was something nobody looked forward to doing.</p> <img src="//images.ctfassets.net/6yom6slo28h2/75YW9ZSKTo9WSFlFsTSGQE/cf3140a421592fb87421bc619e694869/peering-sankey.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Peering Sankey showing internet exchanges" /> <p>But network observability also includes having that type of information – both the understanding of your network traffic and the traffic (and saturation) of open IX points – and being able to quickly select the best one for your needs.</p> <h2 id="the-mostly-unnecessary-summary">The mostly unnecessary summary</h2> <p>Just as “observability” is more than just traces, network observability is more than metrics. It’s a unified collection of technologies and techniques, coupled with the ability to collect, normalize, and display all of that data in meaningful ways – effectively transforming data into information, which drives understanding and action.</p><![CDATA[NANOG 91: A Look into the Latest Trends in Networking]]><![CDATA[NANOG 91 gathered network professionals to discuss emerging trends and challenges in networking. The event featured keynotes and technical sessions focused on critical areas such as cloud-induced network complexities, the advent of network digital twins, IPv6 deployment, network automation, and BGP routing security. 
In this post, Justin Ryburn, Field CTO, summarizes his takeaways from the event.]]>https://www.kentik.com/blog/nanog-91-a-look-into-the-latest-trends-in-networkinghttps://www.kentik.com/blog/nanog-91-a-look-into-the-latest-trends-in-networking<![CDATA[Justin Ryburn]]>Mon, 24 Jun 2024 04:00:00 GMT<h2 id="introduction">Introduction</h2> <p>The North American Network Operators’ Group (<a href="https://www.nanog.org/">NANOG</a>) hosted its 91st meeting from June 10-12, 2024, in Kansas City, Missouri. This conference, essential for network professionals, brought together industry experts, engineers, and innovators to discuss the latest developments and challenges in networking. Kentik was a proud sponsor of NANOG 91, so I was able to attend. Here’s an overview of the highlights from the agenda and what you missed if you couldn’t make it.</p> <h2 id="venue">Venue</h2> <p>NANOG 91 was held at the Loews Kansas City Hotel. Since I’m based in St. Louis, the trip to Kansas City, Missouri, is an easy one for me, and I’d been looking forward to this event. The Loews itself has a great setup for a NANOG conference. There’s plenty of meeting space, an open-concept lobby bar, and lots of hotel rooms for attendees. Being downtown, it’s an easy walk to restaurants and other hotels. I did talk to several attendees who had challenges getting flights to and from the Kansas City airport. That’s the one downside to this location, but the upside is that NANOG 91 attracted many attendees from the Midwest who hadn’t been to a NANOG in many years due to travel restrictions from their employers.</p> <h2 id="keynote-sessions">Keynote sessions</h2> <h3 id="opening-remarks-and-keynote">Opening remarks and keynote</h3> <p>NANOG 91 <a href="https://www.youtube.com/watch?v=nTbC9jV1yks">kicked off with a welcome</a> from the Program Committee and the event host, setting the tone for an event focused on trends and best practices in networking as well as collaboration among peers. The keynote address, titled “<a href="https://www.youtube.com/watch?v=mJshV8koylo">TCP Terminates on the Floor: The Ebbing Tide of Networking Expertise</a>,” delivered by <a href="https://www.linkedin.com/in/coquinn/">Corey Quinn</a> from <a href="https://www.duckbillgroup.com/">The Duckbill Group</a>, offered a candid exploration of the unintended complexities introduced into networking by cloud adoption. Corey brought his unique blend of humor, expertise, and practical wisdom to NANOG 91, setting a thought-provoking tone for the conference. 
If you have never heard Corey present, you should definitely check this one out.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/mJshV8koylo" title="TCP Terminates on the Floor: The Ebbing Tide of Networking Expertise" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> </div> <h3 id="day-2-keynote-network-digital-twin">Day 2 keynote: Network Digital Twin</h3> <p>Day 2 kicked off with a keynote by <a href="https://www.linkedin.com/in/kireetikompella/">Kireeti Kompella</a> from Juniper Networks titled “<a href="https://www.youtube.com/watch?v=NnoiGgGPfUA">Network Digital Twin</a>.” If you haven’t heard Kompella present, you’re really missing out. He’s been a thought leader in networking for many years and was one of the inventors of the MPLS protocol. In this talk, Kompella explored the concept of a network digital twin (NDT) – a digital replica of a real network. He discussed the potential benefits of NDTs in the context of increasing network automation. The talk covered what NDTs are, how they could be built, and what they could be used for. The focus was on sparking discussion and collaboration, not on specific products or implementation plans.</p> <h2 id="technical-sessions">Technical sessions</h2> <p>NANOG 91 featured a wide range of other technical sessions covering various aspects of network engineering and operation. Here are a few interesting topics:</p> <ul> <li>IPv6 Deployment and Challenges (<a href="https://youtu.be/0QCQVX9rZ6s">example</a>)</li> <li>Network Automation and Orchestration (<a href="https://youtu.be/rJtoyYgITxs">example</a>)</li> <li>Building and Operating Cloud-Native Networks (<a href="https://youtu.be/zRILaf5JeTM">example</a>)</li> <li>BGP Routing Security and RPKI (<a href="https://youtu.be/69KxwFJQoyg">example</a>)</li> <li>Network Monitoring and Measurement (<a href="https://youtu.be/Z-BFUbPl-b0">example</a>)</li> </ul> <h2 id="networking-beyond-the-sessions">Networking beyond the sessions</h2> <p>In addition to the structured sessions, NANOG 91 provided ample opportunities for networking (the human kind). Whether it was during the evening receptions, lunch breaks, Beer N Gear, or impromptu discussions in the hallways, attendees exchanged ideas and fostered new professional relationships. Personally, I serve on <a href="https://www.nanog.org/about/who-we-are/committees/mentorship-committee/">NANOG’s Mentorship Committee</a>, where we focus on welcoming new attendees into the NANOG community as well as helping pair mentors and mentees together for career growth. Fostering these relationships is part of NANOG’s culture.</p> <h2 id="why-nanog-91-matters">Why NANOG 91 matters</h2> <p>NANOG 91 was a critical event for those involved in network operations. It brought together thought leaders and practitioners to share their knowledge, tackle pressing challenges, and explore the technologies driving the next generation of networks. 
As the world becomes more connected and data traffic continues to grow exponentially, discussions and innovations at events like NANOG are crucial for staying ahead in the ever-evolving landscape of network technology.</p> <p>For more details and to access presentation slides and recordings, visit the <a href="https://www.nanog.org/events/nanog-91/agenda/">NANOG 91 agenda page</a>.</p> <p>Whether you’re a network engineer looking to stay current or an organization aiming to future-proof your infrastructure, NANOG conferences offer invaluable insights and networking opportunities. Mark your calendars for <a href="https://www.nanog.org/events/nanog-92/">NANOG 92</a> (October 21-23, 2024 in Toronto, Canada), and plan now to be part of the conversation shaping the future of networking.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 40px; padding-bottom: 40px;"></div> <p><em>This article originally appeared on <a href="https://ryburn.org/2024/06/14/nanog-91-in-kansas-city-a-deep-dive-into-trends-in-networking/">Justin Ryburn’s personal blog</a>. Kentik is grateful that we have so many incredible writers on our team, and that they’re willing to share their work with us!</em></p><![CDATA[Engineering Intern Program at Kentik]]><![CDATA[At Kentik, our 12-week Engineering Intern Program gives aspiring tech talent hands-on experience, working on real projects with guidance from dedicated mentors. Beginning with an exciting in-person kickoff and transitioning into a remote program, interns build skills that shape future careers.]]>https://www.kentik.com/blog/engineering-intern-program-at-kentikhttps://www.kentik.com/blog/engineering-intern-program-at-kentik<![CDATA[Kentik People Operations Team]]>Fri, 14 Jun 2024 05:00:00 GMT<p>At Kentik, we strive to give everyone an opportunity to impact our business meaningfully. Our engineering internship program places each intern with a mentor and integrates them into real, production-destined projects.</p> <p>We want our interns to be able to walk away with skills and experience they can immediately apply to their careers after they finish school.</p> <h2 id="overview">Overview</h2> <p>Our program is a 12-week impactful learning experience. Interns work on production code, and are integrated with our team to solve real problems. We have a dedicated mentor for each intern, which gives them a chance to work directly with an engineer on our team and receive regular feedback on their work. 
They collaborate with product managers and designers as well.</p> <p>Kentik is a fully remote company — but we were able to have an in-person kickoff in Atlanta with all of the interns, mentors, and a few engineering leaders!</p> <img src="//images.ctfassets.net/6yom6slo28h2/3v8IsbE1jvsPBsR6N65m1V/19a04c8fafbc46f6594b215447faf35d/engineering-internship-buck.jpg" style="max-width: 500px;" class="image center" alt="Engineering intership" /> <img src="//images.ctfassets.net/6yom6slo28h2/49KLsJOLRTNn4bW6oqKds7/a4e5e4b75268ef9fe9c8a6acffc63d5b/engineering-internship-dinner.jpg" style="max-width: 500px;" class="image center" alt="Engineering intership" /> <h2 id="key-takeaways">Key takeaways</h2> <p>We were fortunate to have 5 wonderful interns during our Summer 2023 program!</p> <p>Here are some of their thoughts to the question: <strong>What are you going to take away from the internship, or what skills did you improve over the summer?</strong></p> <blockquote>“You will learn to work in a fast-paced online environment, and you will gain access to a community of knowledgeable people who are open to helping you grow. You will improve your technical skills, and you will probably realize you are capable of more than you imagined! Like most learning experiences, what you gain will depend on the effort you put in, but there is no shortage of opportunities here at Kentik.” <i>- Aditi</i></blockquote> <blockquote>“I got to research and implement a totally new technical tool during my internship, and I can now confidently add AI-based tool development to my developer toolbox. This is something I am absolutely going to pursue further in my future career, and I am grateful for the opportunity to have explored that technology at Kentik.” <i>- Jon</i></blockquote> <blockquote>“The key takeaway was developing a strong foundation for working in a fast-paced, agile professional development environment. This experience put the skills I had learned in the classroom and through independent study to the test in a collaborative, team-based setting. It also greatly challenged my ability to quickly learn and integrate new technologies and tools into my workflow effectively.” <i>- Alex</i></blockquote> <h2 id="from-interns-to-full-time-employees">From interns to full-time employees!</h2> <p>We are thrilled that two interns from our Summer 2023 program — Aditi and Alex — have joined our team as full-time software engineers on our product engineering team! Aditi joined in December 2023, and Alex joined in June 2024.</p><![CDATA[Simplifying Multi-cloud Visibility]]><![CDATA[Multi-cloud visibility is a challenge for most IT teams. It requires diverse telemetry and robust network observability to see your application traffic over networks you own, and networks you don’t. Kentik unifies telemetry from multiple cloud providers and the public internet into one place to give IT teams the ability to monitor and troubleshoot application performance across AWS, Azure, Google, and Oracle clouds, along with the public internet, for real-time and historical data analysis.]]>https://www.kentik.com/blog/simplifying-multi-cloud-visibilityhttps://www.kentik.com/blog/simplifying-multi-cloud-visibility<![CDATA[Phil Gervasi]]>Thu, 13 Jun 2024 04:00:00 GMT<p>The adoption of the public cloud has progressed to the point that we don’t typically talk anymore about lifting and shifting workloads from our on-premises data centers to our preferred public cloud. 
Instead, IT teams have taken a step back to understand what services make the most sense in which cloud provider, even if they end up using multiple public clouds simultaneously. <a href="https://www.kentik.com/kentipedia/multicloud-networking/" title="Kentipedia: Multicloud Networking">Multi-cloud environments</a> are so common today that many of the applications we use daily rely on multiple public cloud vendors talking to each other. Usually, that’s over the public internet.</p> <p>This is a huge advantage for IT teams trying to build a highly available and performant application delivery mechanism. However, it also makes the need for network observability that much more important. After all, if the applications we use every day depend on multiple public cloud providers and the public internet, how can we track down a performance problem when we don’t own most of the underlying infrastructure?</p> <p>Cloud vendors provide some visibility into their own environment, which is helpful. However, in a multi-cloud environment, IT teams often use multiple disparate tools to understand an application’s path and performance over the network. Compound that with having to figure out what’s happening on the public internet, and you end up with some pretty frustrated engineers.</p> <div as="WistiaVideo" videoId="tcf39kpnhv"></div> <h2 id="network-observability-puts-data-into-context">Network observability puts data into context</h2> <p>A pillar of network observability is unifying all of the telemetry we collect from various cloud and network sources into a single unified database. Metadata, routing tables, and synthetic testing results are also added to this database so that you can log into a single tool and filter for a single application, service, site, tag, or whatever is important to you.</p> <p>Network observability is all about context, or in other words, understanding how multiple public clouds work together to deliver an application.</p> <h2 id="see-all-your-networks-in-one-place">See all your networks in one place</h2> <p>Kentik is cloud vendor-agnostic, ingesting telemetry from the major public clouds and the public internet. We collect flow logs and metrics from AWS, Azure, Google, and Oracle, and we trace paths on the public internet to understand network performance between cloud regions, providers, and your on-prem resources.</p> <p>Multi-cloud environments rely on service providers, so we gather information from global routing tables and a worldwide mesh of synthetic tests to measure service provider availability and performance hop-by-hop.</p> <p>We also monitor all the major clouds and their specific regions from strategically located vantage points around the world, so we have information about how entire cloud regions are performing in addition to your own cloud logs. This comprehensive approach leaves no part of your network in the dark.</p> <p>From your own public cloud environments, we collect telemetry such as:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7xfD3JhEjwgxucxddSwxkj/592baab187344abca48aa96c533ea844/multi-cloud-visibility.png" style="max-width: 700px; margin-top: 10px; margin-bottom: 20px" class="image center simple" alt="Public Cloud Visibility" /> <p>We also enrich this with relevant metadata such as application and security tags, geo-id, DNS information, etc. 
You can learn about cloud traffic volume, types, paths, and metrics such as packet loss on transit gateways, latency between clouds, and so on.</p> <p>Let’s see what this actually looks like for an engineer logging into the Kentik Portal.</p> <h2 id="see-everything-in-one-place">See everything in one place</h2> <p>Notice the graphic below taken from the Kentik portal of the <a href="https://www.kentik.com/product/multi-cloud-observability/">Kentik map</a>, which includes all of our configured and discovered sites, both on-premises and public cloud. On one screen, you have a quick overview of all your active public clouds.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1WMK2VZl84kvAH7FEjEWe7/329de22253b1a140cf242bfa8257dc82/kentik-map.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik Map" /> <p>Kentik Cloud also provides an <a href="https://www.kentik.com/solutions/visualize-all-cloud-and-network-traffic/">overview of the metrics and stats you care about on a dynamic dashboard</a>. The image below shows some quick and basic information about our AWS, Azure, and Google environments on one screen. Dashboards are completely customizable, so instead of a breakdown of VPC and region, you can display important performance metrics, traffic volume, or whatever is important to your IT team.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Q0ALYpq6J8aj4ccWaf4Kf/776b5e96d44301cfa75aeb1fb1b59caf/kentik-cloud-multi-cloud.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="AWS, Azure, GCP dashboards" /> <h2 id="exploring-all-your-cloud-networking-data">Exploring all your cloud networking data</h2> <p>Whereas many visibility tools focus mainly on dashboards and graphs, Kentik ensures that the <a href="https://www.kentik.com/blog/how-hard-is-it-to-migrate-to-streaming-telemetry/">entire database of telemetry</a>, from on-premises, public cloud, SaaS providers, metadata, and so on, can be easily and quickly interrogated in real time. This is critical for network and cloud operators in the trenches solving real problems.</p> <p>To that end, most elements in the Kentik platform, whether that’s the Kentik Map, Cloud, or otherwise, are interactive, which means you can click on almost anything to get more information or pivot to another function, such as Data or Metrics Explorer. This makes it easy to go from an overview of all your cloud sites, for example, to specific application traffic between one data center and one VPC.</p> <p>Take a look at the graphic below from the Cloud Performance monitor. Here, you can see how traffic, which we can filter for application, tag, IP address, and so on, flows from specific subnets in AWS US-WEST-1, through the transit gateway, direct connection, etc., all the way to your on-premises router. 
Clicking on a connection, device, or subnet automatically opens the details pane on the right side of the screen, which gives us vital information and more clickable elements to drill down even further.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5O073204be4Yuqbb7v1yja/5567db51ab50a0a6073777711734045a/direct-connects.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloud Performance Monitor with Direct Connects" /> <p>When we start from the Kentik Map, hovering over a public cloud or a specific site on the map gives us more information and allows us to begin drilling down into the underlying data.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2rqYe9rxNd0Q22YL4Y30T2/74e22e69165bd0e0275761aaa53bf78c/kentik-map-hover.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik Map - Hover over a public cloud" /> <h2 id="filter-data-any-way-you-need">Filter data any way you need</h2> <p>To refine your search, you can select from a diverse set of filtering options. This is a powerful aspect of Kentik that sets it apart from legacy visibility solutions.</p> <p>The screenshot below shows a very simple filter set up to search for MySQL traffic going between AWS US-WEST-2 and our Azure North Central US region. This is relatively simple for demonstration purposes, but in production, you can be more granular to suit the needs of your own environment.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7Lm6Dsi3jZpDJRymPAoaPs/5a6da515fa352d87308497c151b08217/filtering-options.png" style="max-width: 650px;" class="image center" thumbnail withFrame alt="Filter options examples" /> <p>When you need access to everything in the underlying database, you can also filter using Data Explorer, which gives you a powerful way to filter on-premises and cloud traffic alongside any other data you want to see, with whatever visualization makes the most sense.</p> <p>In the last image below, notice that we’re filtering for source and destination cloud provider, IP address, firewall action, and a specific timeframe. We can add more filters to see specific applications, customer tags, different time ranges, etc. Here, we can see traffic going from Azure to AWS, the specific IP addresses, the firewall action, and the volume of traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1s0uuoy9mHxqTJ3nRhvKIT/bbbc270456542604b7b8bcff1578e587/data-explorer-sankey.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Data Explorer Sankey" /> <h2 id="seeing-the-network-in-between-public-clouds">Seeing the network in between public clouds</h2> <p>The network in between your public cloud instances is just as important as any other network your applications rely on. The problem is, the network in between public clouds is the public internet, which we don’t own or manage.</p> <p>Using several methods including <a href="https://www.kentik.com/blog/the-power-of-paris-traceroute-for-modern-load-balanced-networks/">Paris Traceroute</a>, Kentik is able to trace the path between two disparate public clouds, accommodating any load balancing that’s likely occurring. Those metrics allow us to see information like packet loss, latency, and jitter hop-by-hop as application traffic travels node to node, provider to provider, and ASN to ASN. 
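</p> <p>To make the technique concrete, here is a minimal sketch of Paris-traceroute-style probing in Python, assuming the scapy library and root privileges. The destination address and ports are placeholders, and this illustrates the general idea rather than Kentik’s implementation: because the UDP 5-tuple stays constant across probes, per-flow load balancers hash every probe onto the same path.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Sketch of Paris-traceroute-style probing. The flow 5-tuple is held
# constant so ECMP load balancers keep all probes on a single path.
# Requires scapy and root privileges; the destination is a placeholder.
from scapy.all import IP, UDP, sr1

DST = "192.0.2.80"           # placeholder destination address
SPORT, DPORT = 33452, 33434  # fixed ports = stable flow hash

for ttl in range(1, 21):
    probe = IP(dst=DST, ttl=ttl) / UDP(sport=SPORT, dport=DPORT)
    reply = sr1(probe, timeout=2, verbose=0)
    if reply is None:
        print(ttl, "*")        # hop did not answer
    else:
        print(ttl, reply.src)  # router that saw the TTL expire
        if reply.src == DST:
            break</code></pre></div> <p>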
This gives IT teams a better end-to-end understanding of network performance and how it affects application delivery even when much of the delivery mechanism is the internet itself.</p> <p>Notice in the image below we have a network mesh test between several AWS, Azure, and Google regions. In this case, we’ve deployed <a href="https://www.kentik.com/product/global-agents/">synthetic test agents</a> to these regions and configured them to test connection and network performance to each other periodically. We can also deploy these agents in an organization’s cloud instances to test connectivity and network performance directly to the appropriate VPC, VNET, etc.</p> <p>In the first image, we can see that over the last week there was trouble with the connection or performance between AWS US-WEST-1 and GCP US-WEST-1.</p> <img src="//images.ctfassets.net/6yom6slo28h2/prqxEQvEF3L43gE6tO92e/175658a5c1c6b2e1d19f628065ba69d3/aws-mesh.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Public cloud mesh" /> <p>If we hover over that box, we can see that there was packet loss in one direction that triggered a warning alert.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4020sl01gYnPUefgCMvELp/3734482540cfbd75df93b2cecbf8e745/aws-mesh-hover.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Public cloud mesh - hover for details" /> <p>We can drill into that specific issue by selecting View Details, which allows us to see these metrics over time as well as a path view of the hops between these two test agents. In the next screenshot, we can see, hop-by-hop, how traffic is moving between these two clouds including any problems like latency, etc. along the way.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Rkny0cQcAT4U1lsLlp48Z/fa941da0e681cbcb7d546d223620c2e0/sf-the-dalles.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Hop by hop traffic example" /> <p>The days of sitting in a cubicle from nine to five and accessing applications on servers literally down the hall in a campus data center are all but over. Today, most organizations are running multiple cloud providers, juggling multiple cloud visibility tools, and often struggling to piece it all together. Kentik ingests it all and puts it into context so that IT teams can see all their clouds in one place.</p><![CDATA[What Do Network Teams (Really) Care About in 2024?]]><![CDATA[The 2024 EMA Network Megatrends surveyed hundreds of IT professionals about their approach to managing, monitoring, and troubleshooting their networks. In this post, we examine the report's findings to learn the business and technology trends shaping network operations strategy.]]>https://www.kentik.com/blog/what-do-network-teams-really-care-about-in-2024https://www.kentik.com/blog/what-do-network-teams-really-care-about-in-2024<![CDATA[Josh Mayfield]]>Thu, 06 Jun 2024 05:00:00 GMT<p>The world of network management is always changing, full of new gadgets and old problems. The <a href="https://www.kentik.com/resources/ema-network-management-megatrends-2024/">EMA Network Management Megatrends</a> 2024 report sheds light on these trends, helping us understand where we stand and where we might be heading. 
Here’s a look at the key findings, why they matter, and how Kentik fits into the picture without making it all about us.</p> <div as="Promo"></div> <h2 id="the-comeback-of-network-operations">The comeback of network operations</h2> <p>One of the most cheerful findings from the report is that network operations are bouncing back. After years of giving themselves low marks, 42% of network operations groups now say they are fully successful, up from just 27% in 2022. The secret? AI tools, better alignment with cloud, SaaS, and SASE, and more integrated management toolsets.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1uc2ePXZOKJZqxLO16MnQi/dcaee31743ca6a20875d0bb26b485edd/netops-comeback.png" style="max-width: 520px;" class="image center" alt="Graph showing the success of network operations organizations over the last year" /> <p>Picture a large bank struggling with network visibility due to its hybrid cloud infrastructure. By using AI-enhanced observability tools, they could streamline their processes, drastically reducing downtime and improving overall network health. This shows a broader trend where organizations are using advanced tech to boost their network management.</p> <h2 id="tackling-operational-challenges">Tackling operational challenges</h2> <p>Despite the good news, network teams still face significant challenges. The report says only 29% of network alerts are actually useful, and manual errors cause nearly 30% of network issues. The sheer volume of alerts can overwhelm even the best teams, leading to missed critical problems and longer troubleshooting times.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5UldvjCw0I0TrZDgrJhQms/59bded0cd27c61957ee05acad6c4fdaa/manual-errors.png" style="max-width: 550px;" class="image center" alt="Graph showing percent of network problems caused by manual errors" /> <p>Imagine a global retailer with an extensive online presence. Their network team might get thousands of alerts every day, many of which are false alarms. By refining their alert management processes and adopting AI to filter out the noise, they can focus on real issues, cutting response times and preventing disruptions to their e-commerce platforms. This highlights how important good alert management is.</p> <h2 id="the-skills-gap-and-hiring-headaches">The skills gap and hiring headaches</h2> <p>Finding skilled network pros is still tough, with 41% of organizations saying it’s hard to hire and keep these people. Skills in network security and automation are particularly tough to find, posing risks to network resilience and security.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1q64xbbrAbAwext0IWbBe7/2ec8860085c69ca2ed6114a59f83a1a6/skills-gap.png" style="max-width: 750px;" class="image center" alt="Graph showing difficulty staffing network technology jobs" /> <p>Take a midsized company starting a digital transformation. They might struggle to find qualified network security experts. This gap can slow down their ability to secure their growing network, exposing them to potential cyber threats. Filling this skills gap is crucial for protecting digital assets and ensuring smooth network operations.</p> <p>And it’s not just midsize organizations feeling the labor pinch. We are experiencing a demographic shift that is leaving its mark on network engineering tradecraft, with certain functions of network monitoring becoming lost arts. 
This is driven by a desire for diverse skill sets, with breadth of expertise increasingly favored over deep artisanship in network data exploration. Leading solutions are codifying those artisans’ tradecraft into expert <a href="https://www.kentik.com/solutions/kentik-ai/">AI-assisted troubleshooting and investigations</a> to make room for the versatilists operating networks by wearing dozens of hats.</p> <h2 id="navigating-hybrid-and-multi-cloud-environments">Navigating hybrid and multi-cloud environments</h2> <p>With over 56% of network teams supporting multi-cloud environments, strong monitoring and troubleshooting are more critical than ever. The top priorities for these organizations are adopting end-to-end multi-cloud network fabrics and hybrid cloud connectivity.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4n6myQIdBOmXKaqF4Wymot/8686d7810a68c04fda6712d103152057/number-of-cloud-providers.png" style="max-width: 550px;" class="image center" alt="Graph showing the number of cloud providers organizations use" /> <p>Think of a media company that distributes a vast array of digital content, including news, movies, TV shows, and live sports events. To optimize delivery and ensure high availability, the company uses a multi-cloud strategy, leveraging AWS, Google Cloud, and Azure for different parts of its infrastructure — and remaining hybridized with its own data centers around the globe. How do you make sense of, let alone establish a standard of measure across, so much diversity?</p> <p>Despite the benefits, managing this multi-cloud environment introduces significant complexity and operational challenges. Today’s network solutions must advance from observability to inference, enabling more people to tap into <a href="https://www.kentik.com/product/multi-cloud-observability/">the meaning of their clouds and networks</a> regardless of how they shift, evolve, and morph into all new forms.</p> <h2 id="sase-adoption-and-its-challenges">SASE adoption and its challenges</h2> <p>SASE (secure access service edge) is catching on, with nearly 46% of organizations fully implementing these solutions. But managing security policies and ensuring visibility into SASE points of presence remains tough.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6LKaWePlsaa3PnCnaZTu5g/7ce03d2b8839912517544f276f30dd67/sase.png" style="max-width: 700px;" class="image center" alt="Graph showing engagement with SASE" /> <p>For a multinational company with remote employees across different regions, implementing SASE can enhance security and connectivity. However, managing multiple security policies and gaining visibility into the network’s performance can be daunting. Integrating observability tools that provide detailed insights into SASE infrastructure can simplify these tasks, ensuring a secure and efficient remote work setup.</p> <p>As SASE gains traction, SD-WAN monitoring remains paramount because managing multiple security policies and ensuring network visibility poses significant challenges.</p> <p>SD-WAN is critical because it provides the foundational connectivity layer that SASE builds upon. Effective SD-WAN implementation ensures reliable and optimized paths for data traffic, which is essential for maintaining performance and efficiency in a SASE environment. 
Monitoring <a href="https://www.kentik.com/blog/sd-wan-best-practices/">SD-WAN is crucial as it offers real-time insights into network performance</a>, helping identify and resolve issues promptly, thereby ensuring a seamless user experience.</p> <p>Integrating observability tools that provide detailed insights into both SD-WAN and SASE infrastructures is essential. These tools simplify security policies and enhance visibility across the network by continuously monitoring SD-WAN to ensure the underlying dependency for SASE deployments is secure, efficient, and capable of supporting the demands of a modern, remote workforce. In essence, <a href="https://www.kentik.com/solutions/investigate-security-incidents/">SD-WAN monitoring is indispensable for optimizing SASE performance</a> and maintaining a robust, secure network infrastructure.</p> <h2 id="the-role-of-aiml-in-network-management">The role of AI/ML in network management</h2> <p>AI and machine learning are becoming mainstream, with 64% of organizations using these features to boost security threat detection, automate problem fixing, and improve network troubleshooting. The impact of AI/ML on network management is huge, offering new levels of efficiency and accuracy.</p> <img src="//images.ctfassets.net/6yom6slo28h2/38hQsM235Rdqs8ACXiPI5y/e798d2700e5ee958022844d190399b9f/ai-ml-network-management.png" style="max-width: 650px;" class="image center" alt="Graph showing use of AI and ML features in network management" /> <p>And while it is commonplace to see “back of the house” AI/ML in observability solutions, leading providers like Kentik are now embedding <a href="https://www.kentik.com/blog/transforming-human-interaction-with-data-using-llms-and-genai/">generative AI into their platforms</a>, using simple, natural language to prompt a response based on the data and the intent of users’ questions.</p> <p>Through an interactive, conversational interface, the network source of truth becomes accessible and approachable for network and I&#x26;O teams pressed for time while wearing multiple hats of responsibility. Getting to the heart of a trouble ticket or failed mesh test is essential. And AI can help network teams move fast to support far-ranging stakeholders and domains across devices, data centers, clouds, containers, apps, and edges.</p> <h2 id="kentiks-perspective">Kentik’s perspective</h2> <p>Running networks is hard. Running networks in 2024 is really hard. At Kentik, we get the complexities and challenges highlighted in the EMA report. Our mission is to make life easier for network teams by giving them the unified workbench they need to handle these challenges. Our comprehensive observability and insights help organizations manage their networks confidently and precisely.</p> <p>Kentik reduces alert noise, allowing teams to focus on real issues. We support hybrid and multi-cloud environments, offering end-to-end visibility that simplifies network management across different infrastructures — and cuts the crazy costs, too. Plus, Kentik is user-friendly, helping bridge the skills gap by enabling today’s multi-hat-wearing network team and their varied needs, skills, and projects.</p> <p>In our latest innovations with Kentik AI, anyone can tap into the Kentik network’s source of truth by asking any question and solving any riddle with AI-assisted troubleshooting and investigations. Yes, even our chatbots get it.</p> <p>In a world where network operations are getting more complex and crucial, Kentik is obsessed with our 450+ worldwide customers’ success. 
As the EMA Network Management Megatrends report demonstrates, the shifting tides in network engineering and operations require easy, effective insights to support the experience economy.</p><![CDATA[Escaping Cloud Jail]]><![CDATA[Cloud costs are spiraling out of control at companies of all sizes. Here's how to not let your cloud infrastructure costs handcuff your business. ]]>https://www.kentik.com/blog/escaping-cloud-jailhttps://www.kentik.com/blog/escaping-cloud-jail<![CDATA[Avi Freedman]]>Tue, 04 Jun 2024 04:00:00 GMT<p>I hope you’re hugely successful in your business!</p> <p>But when you are, you may share the experience many of us have had — watching costs grow out of control. It’s usually the right decision to worry about those things post-traction (and tech scaling).</p> <p>But I want to talk a bit about architectural decisions up front to minimize some of the growth pain later.</p> <p>One big area of modern business cost relates to cloud usage, and it is often exacerbated by people getting trapped in “Cloud Jail.”</p> <h2 id="how-do-you-wind-up-in-cloud-jail">How do you wind up in Cloud Jail?</h2> <p>It’s a story as old as the public clouds. A startup gets a $500K credit to set up its infrastructure in the cloud. “Excellent,” they think, “that’ll last us at least a year, and then either we won’t get traction, or we’ll have enough revenue to cover the costs as they increase.”</p> <p>It starts out great; they’re probably spending around $20K monthly on infrastructure costs. But as they grow, so does their cloud bill. It becomes $50K — then $100K a month, then $250K. Suddenly, they’re out of credit as their bill continues to climb.</p> <p>Whether it’s the CFO, exec team, or the board, there’s usually an alarm point where people say, “Wait. What happened?! We were supposed to spend most of our expenses on people, and now you’re pouring it all into the cloud!”</p> <p>A team is formed, and the company scrambles as fast as possible to get those cloud costs down. They optimize their database-as-a-service use, buy spot or reserved instances, find and kill needless network data transfer, tune their instance types, and track down and delete the unused object and block storage, bringing their costs down to a more manageable $70K per month.</p> <p>The problem now, though, is that they’re still growing. If they grow to plan, they’ll be back to blowing past the water line in no time. And, to make things worse, the optimizations they put in place to bring down costs will only hold at the current levels of cloud usage, and as the company keeps growing, usage will soon climb to the next level. In the post-2021 world of suddenly not-so-free money, that can risk the company’s ability to operate efficiently at scale.</p> <div as="Promo"></div> <h2 id="a-cloud-jail-example">A Cloud Jail example</h2> <p>For example, once upon a time (in 2008), a company specializing in video encoding and streaming approached me with a staggering $300,000/month cloud bill that was only getting bigger monthly. This bill was eating into their margins, causing them to lose money as they grew. Together, we moved 500TB and 10 gigabits/sec of streaming from their public cloud provider to their own infrastructure.</p> <p>The result? Their bill dropped to under $100,000/month, including the salaries of staff who managed their physical infrastructure and routers. 
Over the next few years, that grew back to $250,000/month all-in as they scaled 5x, and if they had stayed in pure cloud, it would have easily been well over $1,000,000/month.</p> <p>After this roller coaster ride, they told me they wished they’d initially invested time in setting up a hybrid of cloud-based infrastructure and their own servers, or at least a multi-cloud system.</p> <p>They’d have avoided the trap that left them struggling to migrate as they scaled. In today’s lingo, they’d fallen into “Cloud Jail,” utterly dependent on their cloud provider and finding it difficult and expensive to break free.</p> <p><strong>The hooks</strong>: The industry leaders are exceptionally skilled at keeping customers hooked with features like user identity, authentication, queuing, email, notifications, and seamless databases. These lightweight services save time, but only if you use those platforms. Their magic lies in making it nearly impossible for customers to leave, despite escalating costs for storage and bandwidth.</p> <p>Then, the dreaded call from the board comes, questioning why your gross margin is less than 40% and why you’re spending more on infrastructure than developers. You try to explain that costs should’ve decreased as you grew, but that’s not what’s happening. In today’s competitive VC market, these explanations won’t suffice.</p> <p>But while hybrid and multi-cloud use is on the rise, very few companies that I see are moving completely away from public cloud infrastructure and services.</p> <h2 id="how-to-avoid-cloud-jail">How to avoid Cloud Jail</h2> <p>First, it’s essential to dive into infrastructure with a clear understanding that the cloud isn’t always the cheapest or most efficient option just because it’s “cloud.” Especially for always-on infrastructure.</p> <p>It’s crucial to have a plan in place for when you reach a scale where relying solely on the cloud becomes impractical due to cost or performance reasons. You should know how to run at least some of your own infrastructure and bring on early team members who have experience with such alternatives.</p> <p>By this, I don’t mean constructing a building and filling it with chillers and racks.</p> <p>Instead, at medium scale and beyond, I mean leasing colocation space in existing facilities managed by someone else and investing in servers and switching/routing gear. This approach is generally more cost-effective at scale, particularly for the non-bursting and steadily increasing workloads commonly found in many startup infrastructures.</p> <p>The sooner you consider these options, the better. If possible, start by running multi-cloud and then, after gaining initial traction, establish a small infrastructure connected to your cloud provider(s).</p> <p>You can spend under $5,000 a month on space, power, and bandwidth by running your own starter infrastructure. Although initial equipment purchases may range from $50,000 to the low hundreds of thousands, these costs on a multi-year basis are relatively low compared to cloud compute, storage, and bandwidth. It means you can afford staff to manage your infrastructure early on.</p> <p>Operating servers in dedicated infrastructures has also become more straightforward over the years. Most operations teams now treat servers as “cattle, not pets” and can flexibly deploy applications using configuration management systems or containerization and container orchestration systems. 
It’s not that hard for platform teams to run these netbooted and/or “kubernetified.”</p> <p>Hiring the right staff also makes a world of difference.</p> <p>A small team of three to five people can manage both cloud and dedicated infrastructure, and this same team can often run a system ten times larger than when they started. This scalability is invaluable. As soon as possible, hire an infrastructure team lead with a solid background in running hybrid systems, including both cloud and physical infrastructure. This expert will keep an eye on your growing costs and know when it’s time to make the right changes. Prioritizing this investment early on can make all the difference.</p> <h2 id="are-you-on-the-path-to-cloud-jail">Are you on the path to Cloud Jail?</h2> <p>You may be already careening towards cloud jail and want to know if there’s anything to do about it. I’m happy to say that it’s not too late. You can still pass “go” and collect your $200.</p> <p>I’d recommend that startups keep an eye on these indicators to gauge whether they’re approaching dangerous territory:</p> <ol> <li> <p>Calculate the portion of your bill associated with “always on” and “steady state” or consistently growing workloads. When these costs surpass the $100,000/month mark, you may be approaching the tipping point sooner than you think.</p> </li> <li> <p>Pay attention to the number of infrastructure services you purchase from your cloud provider(s) beyond basic compute, network, and storage. Consider services like authentication, load balancing, SQL, and NoSQL services. Are there alternative options available? Will the services you’re using now work well over a direct connection to your own infrastructure if and when the time comes, or might they trap you in a single-provider jail?</p> </li> <li> <p>Be on the lookout for network performance issues that your current provider(s) can’t or won’t address, such as packet loss and subpar throughput to specific geographies or internet providers. If CDNs and SD-WAN acceleration services can’t resolve these problems, that’s a red flag. For many SaaS and web companies, performance becomes the primary reason to run either multi-cloud or at least some dedicated infrastructures to which they can load-balance for performance.</p> </li> </ol> <h2 id="too-late-were-already-in-cloud-jail">Too late. We’re already in Cloud Jail!</h2> <p>Did you come across this article too late? Are you already shackled to your rising cloud infrastructure costs with no easy fix in sight?</p> <p>Fear not; there’s still hope!</p> <p>It’s never too late to start, though it can take anywhere from six to 12 months to start running hybrid infrastructure from zero — especially if you’re dealing with petabytes of data to move or a company experiencing rapid revenue growth.</p> <p>However, I’ve also seen it happen in just two to three months, albeit with a healthy dose of “exigent engineering.” Or, if your footprint is smaller or your need for control is lower, perhaps they’ll skip the private network/colocation and simply start by adding some dedicated servers or “bare metal cloud” into the mix.</p> <p>I’ve personally witnessed 30 web companies go through this kind of transition, and most of them have three to five core people handling the network and physical server administration. 
The fantastic news is that, as long as you have the runway, you can dig yourself out when public cloud fees begin to take their toll.</p> <p>And if you’re spending a lot, unsure if you can achieve significant gross margins with your current cloud usage, and struggling to recruit infrastructure gurus on staff, don’t lose hope.</p> <p>Feel free to ping me (Avi at Kentik dot io). The networking community is incredibly open, and people are generally happy to socialize and help.</p> <p>Attend NANOG, RIPE, APRICOT, or your local network nerding meetup or conference. Make connections and ask questions; you’ll usually find people who can help you analyze and plan your infrastructure.</p> <h2 id="remember-youre-not-alone">Remember, you’re not alone!</h2> <p>Now, it’s important to note that I’m not suggesting startups should avoid using the cloud initially — especially considering the credits available when you’re backed by venture capital.</p> <p>The cloud can be a fantastic, capital-efficient way to launch a business and handle fluctuating workloads. You just need to be aware of the breaking points.</p> <p>When your steady-state workloads are maxed out, and your cloud bill reaches hundreds of thousands per month and continues to grow by tens of thousands regularly, you may have already reached the tipping point. Before hitting that milestone, I recommend transitioning the steady-state load to mostly your own infrastructure.</p> <p>Often, people overlook the inefficiencies when it’s just $1 or 2 million annually.</p> <p>However, these seemingly small inefficiencies can sneak up on you and transform into an existential threat to your entire company. Your ability to make a profit, secure more funding, or even survive can hang in the balance. It’s at that moment when people wish they had considered the risks of Cloud Jail earlier in their startup journey. Don’t let your cloud infrastructure costs handcuff your business; be proactive and plan wisely to avoid falling into this trap.</p><![CDATA[Time’s Up! How RPKI ROAs Perpetually Are About to Expire 🙀]]><![CDATA[In RPKI, determining when exactly a ROA expires is not a simple question. In this post, BGP experts Doug Madory and Fastly’s Job Snijders discuss the difference between the expiration dates embedded inside ROAs and the much shorter effective expiration dates used by validators. Furthermore, we analyze how the behavior of effective expiration dates changes over time due to implementation differences in the chain of certificate authorities.]]>https://www.kentik.com/blog/times-up-how-rpki-roas-perpetually-are-about-to-expirehttps://www.kentik.com/blog/times-up-how-rpki-roas-perpetually-are-about-to-expire<![CDATA[Doug Madory, Job Snijders]]>Thu, 30 May 2024 04:00:00 GMT<p>In our <a href="https://www.kentik.com/blog/rpki-rov-deployment-reaches-major-milestone/">previous collaboration on RPKI</a>, we celebrated the latest milestone of RPKI ROV (Route Origin Validation) adoption: passing the 50% mark on IPv4 routes with Route Origin Authorizations (<a href="https://www.rfc-editor.org/rfc/rfc9582.html">ROA</a>). In this post, we will be digging deeper into the mechanics of RPKI to understand how the cryptographic chain contributes to the effective expiration date of a ROA.</p> <p>Within RPKI, the ROA is a cryptographically-signed record which stores the Autonomous System Number (ASN) authorized to originate an IP address range in BGP. 
Along with the ASN and one or more IP address prefixes, the ROA also contains an <a href="https://en.wikipedia.org/wiki/X.509">X.509 End-Entity certificate</a> which (among other things) states the <em>validity window:</em> the timestamps after and before which the ROA is valid.</p> <p>While the expiration dates of individual ROAs might be a year away, the <em>effective</em> expiration dates used by RPKI validators are typically only a few hours or days into the future. This is because these effective expiration dates are transitive, meaning they are set by the shortest expiration date of the links of the cryptographic chain.</p> <div as="WistiaVideo" videoId="qpfh2174t5" audio></div> <h2 id="how-does-this-work">How does this work?</h2> <p>To understand how this works, we need to dig into the “cryptographically-signed” part of the ROA mentioned at the beginning of this post.</p> <p>Using Job’s <a href="https://console.rpki-client.org/rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/e605f279-55f4-48ec-ba13-4845c0973a63/011df65e-19df-3014-8ee5-ce8d49880e37.roa.html">rpki-client console utility</a>, we can investigate the ROA for 151.101.8.0/22 which asserts AS54113 is authorized to originate this IPv4 prefix.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">asID: 54113 IP address blocks: 151.101.8.0/22 maxlen: 22</code></pre></div> <p></p> <p>Also, in that first block are our first dates relating to when this ROA is valid.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Signing time: Sat 11 May 2024 01:00:27 +0000 ROA not before: Sat 11 May 2024 01:00:27 +0000 ROA not after: Fri 09 Aug 2024 01:00:27 +0000</code></pre></div> <p></p> <p>So this ROA is valid until August 2024, provided all other elements in the chain also are valid until August 2024. 
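</p> <p>As a toy illustration of that dependence, the effective expiration a validator computes is simply the earliest expiration across every link of the chain. A minimal sketch, using hypothetical timestamps rather than values taken from this ROA:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from datetime import datetime, timezone

# Hypothetical expirations for the links of a certification chain
# (trust anchor, CRLs, manifests, intermediate certs, ROA EE cert).
chain_expirations = {
    "trust-anchor.cer": datetime(2025, 11, 3, tzinfo=timezone.utc),
    "intermediate.crl": datetime(2024, 5, 31, 14, 0, tzinfo=timezone.utc),
    "intermediate.mft": datetime(2024, 5, 31, 14, 0, tzinfo=timezone.utc),
    "roa-ee.cer": datetime(2024, 8, 9, 1, 0, 27, tzinfo=timezone.utc),
}

# The shortest-lived link governs the effective expiration.
effective = min(chain_expirations.values())
print("Signature path expires:", effective.isoformat())</code></pre></div> <p>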
That’s where the next <em>Signature path</em> section comes in.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Validation: OK Signature path: rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/e605f279-55f4-48ec-ba13-4845c0973a63/e605f279-55f4-48ec-ba13-4845c0973a63.crl rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/e605f279-55f4-48ec-ba13-4845c0973a63/e605f279-55f4-48ec-ba13-4845c0973a63.mft rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/e605f279-55f4-48ec-ba13-4845c0973a63.cer rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/871da40f-793a-4a45-a0a9-978148321a07.crl rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/871da40f-793a-4a45-a0a9-978148321a07.mft rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07.cer rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/5e4a23ea-e80a-403e-b08c-2171da2157d3.crl rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/5e4a23ea-e80a-403e-b08c-2171da2157d3.mft rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3.cer rsync://rpki.arin.net/repository/arin-rpki-ta/arin-rpki-ta.crl rsync://rpki.arin.net/repository/arin-rpki-ta/arin-rpki-ta.mft rsync://rpki.arin.net/repository/arin-rpki-ta.cer Signature path expires: Fri 31 May 2024 14:00:00 +0000 </code></pre></div> <p></p> <p>The above <em>Signature path</em> (also known as “<a href="https://en.wikipedia.org/wiki/Certification_path_validation_algorithm">Certification path</a>”) outlines the multi-step cryptographic signature validation process that it took to get from this ROA to the “Trust Anchor” (ARIN in this case). Each link in this chain has its own expiration date, the longest set well into the distant future (the year 2026!). But it is the shortest which governs the overall signature path expiration, and thus the effective expiration date of the ROA.</p> <p>There are three different types of files conveniently linked by the console utility: Certificate Revocation Lists (.crl), Manifests (.mft), and Certificates (.cer).</p> <div style="max-width: 750px; margin: 30px auto; padding: 20px; background-color: #f8f8f8; border: 1px solid #ebeff3; border-radius: 15px;" /> <p style="font-size: 18px;"><strong>Glossary</strong></p> <p style="font-size: 16px;"><em>Manifests</em> securely declare the contents of an RPKI repository and reference the current CRL and ROAs. A given Manifest is valid until the “next Update” time. When faced with multiple valid versions of a Manifest, RPKI Validators decide which version of the Manifest to use based on a monotonically increasing serial number inside the Manifest payload. </p> <p style="font-size: 16px;"><em>CRLs</em> (“Certificate Revocation Lists”) contain the list of serials of the Certificates that have been revoked by the issuing Certification Authority (CA) ahead of their scheduled expiration date. To take a ROA out of rotation, a CA would delist the ROA filename from the Manifest and add the ROA end-entity certificate’s serial number to the CRL. 
A CRL is valid until the embedded “next Update” time.</p> <p style="font-size: 16px;"><em>Certificates</em> are used to prove the validity of public keys. RPKI Certificates are defined using the <a href="https://en.wikipedia.org/wiki/X.509">X.509</a> standard. Each certificate contains its own validity window, a public key, pointers to the repository’s network location, and some additional metadata. RPKI validators use the public key to validate the Manifest, CRL, and ROAs at the repository location. In turn, the certificate’s contents are protected with a cryptographic signature from an issuer higher up in the chain. In the RPKI, the “root certificate” is known as the <em>Trust Anchor</em>. This is a self-signed certificate which can be validated using a <em>Trust Anchor Locator</em>.</p> </div> <p>By following the links, we can construct the following list of expirations on that signature path:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Signature path: Fri 31 May 2024 23:00:00 rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/e605f279-55f4-48ec-ba13-4845c0973a63/e605f279-55f4-48ec-ba13-4845c0973a63.crl Fri 31 May 2024 23:00:00 rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/e605f279-55f4-48ec-ba13-4845c0973a63/e605f279-55f4-48ec-ba13-4845c0973a63.mft Mon 13 Apr 2026 22:13:58 rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/e605f279-55f4-48ec-ba13-4845c0973a63.cer Sat 01 Jun 2024 13:00:00 rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/871da40f-793a-4a45-a0a9-978148321a07.crl Sat 01 Jun 2024 13:00:00 rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07/871da40f-793a-4a45-a0a9-978148321a07.mft Thu 25 Dec 2025 14:09:41 rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/871da40f-793a-4a45-a0a9-978148321a07.cer Fri 31 May 2024 14:00:00 rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/5e4a23ea-e80a-403e-b08c-2171da2157d3.crl Fri 31 May 2024 14:00:00 rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3/5e4a23ea-e80a-403e-b08c-2171da2157d3.mft Mon 04 May 2026 15:17:49 rsync://rpki.arin.net/repository/arin-rpki-ta/5e4a23ea-e80a-403e-b08c-2171da2157d3.cer Mon 30 Sep 2024 15:17:49 rsync://rpki.arin.net/repository/arin-rpki-ta/arin-rpki-ta.crl Mon 30 Sep 2024 15:17:49 rsync://rpki.arin.net/repository/arin-rpki-ta/arin-rpki-ta.mft Mon 03 Nov 2025 rsync://rpki.arin.net/repository/arin-rpki-ta.cer Signature path expires: Fri 31 May 2024 14:00:00 +0000</code></pre></div> <p></p> <h2 id="why-is-it-good-to-have-roas-perpetually-about-to-expire">Why is it good to have ROAs perpetually about to expire?</h2> <p>A lot of the elements in the above certification path appear to have relatively short validity windows, with only a few hours or a few days of validity remaining. These short effective expirations serve a purpose.</p> <p>They help to avoid the scenario where one of the links in the cryptographic chain suffers a distribution outage, i.e., the rsync or RRDP server goes offline, preventing the retrieval of fresh information. 
The result would be that ROAs remain stuck in their last state.</p> <p>If a misconfigured ROA had contributed to the outage, then it would require manual intervention to clear. In this scenario, the distribution outage would prevent use of the CRL to revoke a problematic ROA.</p> <p>With short expirations, the misconfigured ROA will eventually expire automatically, potentially clearing the internet forwarding-path to the ROA’s publication point.</p> <h2 id="reissuance-happens-well-before-expiration">Reissuance happens well before expiration</h2> <p>Issuers of ROAs, Manifests, CRLs, and Certificates do not idly wait around for the cryptographically signed product to expire before issuing a new version.</p> <p>As an example schedule, some issuers might re-sign their Manifest and CRL every hour with an expiry moment eight hours into the future, and in doing so assert the data is good for another eight hours. Frequent re-issuance helps overcome transient network issues between the ROA publication point and RPKI validators deployed in ISPs’ networks.</p> <p>RPKI validators will use locally cached versions of objects until they either become invalid or are replaced by successor objects from a successful synchronization with the publication point.</p> <p>This behavior is analogous to DNS time-to-live (TTL) settings. Short TTLs allow DNS operators to quickly redistribute traffic when the need arises, or to ensure that a DNS record is flushed from caches to prevent an out-of-date record from directing traffic.</p> <h2 id="analyzing-the-internets-effective-roa-expirations">Analyzing the internet’s effective ROA expirations</h2> <p>Using rpkiviews.org, we can take a recent snapshot of the roughly 528,000 ROAs currently in use. In CSV format, the contents of a snapshot look like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">ASN,IP Prefix,Max Length,Trust Anchor,Expires AS13335,1.0.0.0/24,24,apnic,1712499475 AS38803,1.0.4.0/24,24,apnic,1712532668 AS38803,1.0.4.0/22,22,apnic,1712532668 AS38803,1.0.5.0/24,24,apnic,1712532668 AS38803,1.0.6.0/24,24,apnic,1712532668 AS38803,1.0.7.0/24,24,apnic,1712532668 AS18144,1.0.64.0/18,18,apnic,1712358404 AS13335,1.1.1.0/24,24,apnic,1712499475 AS4134,1.1.4.0/22,22,apnic,1712508843</code></pre></div> <p></p> <p>The fifth and final column is the effective expiration date in <a href="https://en.wikipedia.org/wiki/Epoch_%28computing%29">epoch format</a>. If we group those timestamps into one-hour buckets and plot the counts over time, we arrive at the following graph for one snapshot, which reveals several peaks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1oo0CGTXUe3GDDZ0YIfS1u/7469dfce43dcf14d0c85fe8cb5598661/roa-expiration-snapshot.png" style="max-width: 700px;" class="image center" alt="Chart showing ROA expiration date snapshot" /> <p>As the annotations show, each peak of ROA expirations corresponds to a different RIR. This visualization captures the effects of the differences in the cryptographic chains employed by each RIR.</p> <p>But that was just one snapshot in time. 
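</p> <p>Reproducing the one-hour bucketing behind the histogram above is straightforward. A minimal sketch in Python, assuming the snapshot CSV shown earlier has been saved locally as roas.csv (a hypothetical filename):</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import csv
from collections import Counter
from datetime import datetime, timezone

# Count ROAs per one-hour bucket of effective expiration time.
# The "Expires" column holds epoch seconds, as in the CSV above.
buckets = Counter()
with open("roas.csv", newline="") as f:
    for row in csv.DictReader(f):
        epoch = int(row["Expires"])
        bucket = epoch - (epoch % 3600)  # round down to the hour
        buckets[bucket] += 1

for bucket, count in sorted(buckets.items()):
    ts = datetime.fromtimestamp(bucket, tz=timezone.utc)
    print(ts.isoformat(), count)</code></pre></div> <p>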
To understand how these effective expirations change through time, let’s take a look at the animation below:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/txyn7dq2mo" title="ROA expirations" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>As mentioned earlier, each peak corresponds to a different RIR, and the manner in which it evolves over time depends on the software used to manage the ROAs.</p> <p>Since it is hard to analyze a moving target, let’s look at a static plot of those effective expirations through time as the ROAs are renewed. In the graphs below, the x-axis is the time of the snapshot and the y-axis shows the effective expirations, colored by RIR.</p> <p>The graph below depicts how ROA effective expirations (y-axis) change through time (x-axis). Expirations are rounded down to the previous 15-minute mark. To aid in interpretation, we have marked two points in the chart (A and B). They both represent ROAs published by LACNIC (blue) that expire at 23:00 UTC on April 13, 2024 (y-axis). Point A represents 2,165 ROAs with that expiration, while point B represents 15,852 ROAs and is drawn darker to reflect the larger number of ROAs.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4kMtM6BI2bLUDuqYwzuCik/48af78fbfb711a00b67c300990d7477c/when-do-roas-expire.png" style="max-width: 700px;" class="image center" alt="Chart showing ROA expiration change over time" /> <table align="center" style="margin: 0 auto 30px; border-radius: 10px;"> <tbody> <tr> <td><br /></td> <td><strong>Point A</strong></td> <td><strong>Point B</strong></td> </tr> <tr> <td><strong>RIR</strong></td> <td>LACNIC (blue)</td> <td>LACNIC (blue)</td> </tr> <tr> <td><strong>Snapshot</strong></td> <td>2024-04-10 22:56:22</td> <td>2024-04-11 07:12:18</td> </tr> <tr> <td><strong>Expiration</strong></td> <td>2024-04-13 23:00:00</td> <td>2024-04-13 23:00:00</td> </tr> <tr> <td><strong>Count</strong></td> <td>2,165</td> <td>15,852</td> </tr> </tbody> </table> <p>If we redraw the graph over several days, we can visualize how effective expirations of ROAs change over time. Each RIR exhibits its own renewal behavior based on the different software in use.</p> <img src="//images.ctfassets.net/6yom6slo28h2/iAYNeqaW07lfiXR8aNkZS/9931ac9563fd674d1ef246022ea60084/when-do-roas-expire-several-days.png" style="max-width: 700px;" class="image center" alt="ROA expiration over several days" /> <p>Let’s analyze a couple of these separately.</p> <h3 id="arin">ARIN</h3> <p>When we isolate the effective expirations of the ROAs published by ARIN, we find two distinct behaviors. The first is a smaller (faint) population of expirations that are spread out from 8 to 24 hours in the future. 
In this group, the expirations are pushed out to 24 hours in the future when they approach 8 hours in the future.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4706ArR7h97sPYfwnKSqwm/f5ed06fc43af3d03e2babc17c0e528f0/when-do-roas-expire-arin.png" style="max-width: 700px;" class="image center" alt="Chart showing ROA expirations - ARIN" /> <p>The second consists of a larger population of expirations exhibiting a staircase shape. In this group, when expirations approach 24 hours in the future, they are all renewed with expirations that range from 24 to 48 hours in the future. The renewals continue as expirations approach 24 hours in the future, but never exceed the previous upper time limit, creating the stair shape. The upper time limit for the expiration resets every 48 hours.</p> <h3 id="ripe">RIPE</h3> <p>Unlike ARIN, RIPE’s effective expirations are updated to a time between 8 and 18 hours in the future as they get within 8 hours of current time. RIPE expirations are never more than 24 hours into the future. This creates a tighter distribution, illustrated in the graphic below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6GBo1wDt7daf6H9uoKzjtX/f169d9e080721734e4d58f8eeb3ee888/when-do-roas-expire-ripe.png" style="max-width: 700px;" class="image center" alt="Chart showing ROA expirations - RIPE" /> <h3 id="apnic">APNIC</h3> <p>The effective expirations of APNIC ROAs fall into two categories. A small number of expirations (the faint lower band) are spread out from 8 to 24 hours in the future. Like the lower faint band for ARIN, these expirations are pushed out to 24 hours in the future when they approach 8 hours in the future.</p> <p>Otherwise, the majority of the ROAs published by APNIC have the longest effective expirations of any RIR. They are at least five days in the future. As expirations reach five days out, they are updated to be six days out.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3dgrt3iMDz5h1jgtKVDdUl/2fc892a13838363d4cee77cc67f9ef5f/when-do-roas-expire-apnic.png" style="max-width: 700px;" class="image center" alt="Chart showing ROA expirations - APNIC" /> <h3 id="lacnic">LACNIC</h3> <p>In the first half of April, the effective expirations of LACNIC’s ROAs exhibited similar behavior to RIPE’s, but on April 15, <a href="https://www.lacnic.net/7148/2/lacnic/migration-to-lacnics-new-rpki-system">LACNIC changed</a> to use <a href="https://nlnetlabs.nl/projects/routing/krill/">Krill</a> as its RPKI management software. After April 15, the expirations began to resemble ARIN’s 48-hour staircase shape.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2S45QTawryslhda0kePKuS/8563562a136475c3d924ea5b10e95c73/when-do-roas-expire-lacnic.png" style="max-width: 700px;" class="image center" alt="Chart showing ROA expirations - LACNIC" /> <h3 id="afrinic">AFRINIC</h3> <p>As AFRINIC effective expirations approach 24 hours into the future from current time, they are renewed an additional 24 hours into the future. 
For most ROAs, this update occurs every day at midnight UTC.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/Ie89OE2oepwPEp6kiFb9d/27f681aa43040510a90b1899947ddf47/when-do-roas-expire-afrinic.png" style="max-width: 700px;" class="image center" alt="Chart showing ROA expirations - AFRINIC" /> <h2 id="conclusions">Conclusions</h2> <p>As you likely already know, RPKI ROV continues to be the best defense against accidental BGP hijacks and origination leaks that have been the source of <a href="https://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidents/">numerous disruptions</a>. Most recently, this technology celebrated a <a href="https://www.kentik.com/blog/rpki-rov-deployment-reaches-major-milestone/">major milestone</a> when the percentage of IPv4 routes in the global routing table with ROAs surpassed 50% on May 1, 2024 (IPv6 achieved this last year).</p> <p>ROV relies on a cryptographic chain to accurately convey the information contained in the ROAs to the validators which do the work of evaluating BGP announcements as they come in. As a result, there are two expirations for ROAs to be mindful of. There is the expiration set in the ROA itself, but there is also the expiration, as seen from the validator, something we call the <em>effective expiration</em>, derived from the shortest expiration along the chain. Both expiration types can be monitored with open source tools such as <a href="https://labs.ripe.net/author/massimo_candela/easy-bgp-monitoring-with-bgpalerter/">BGPAlerter</a>.</p> <p>These short effective expirations (often only hours away) are a feature, preventing validators from getting stuck with out-of-date information in the case of an outage. What is fascinating is the difference between how each RIR handles these expirations, ranging from just hours away (RIPE) to days away (APNIC).</p><![CDATA[The Importance of Hybrid Cloud Visibility]]><![CDATA[Hybrid cloud environments, combining on-premises resources and public cloud, are essential for competitive, agile, and scalable modern networks. However, they bring the challenge of observability, requiring a comprehensive monitoring solution to understand network traffic across different platforms. Kentik provides a unified platform that offers end-to-end visibility, crucial for maintaining high-performing and reliable hybrid cloud infrastructures. ]]>https://www.kentik.com/blog/the-importance-of-hybrid-cloud-visibilityhttps://www.kentik.com/blog/the-importance-of-hybrid-cloud-visibility<![CDATA[Phil Gervasi]]>Tue, 21 May 2024 04:00:00 GMT<p>Hybrid cloud environments are a common architecture for modern networks, and for good reason. A hybrid cloud architecture spanning on-premises resources and public cloud is pretty much required to stay competitive, agile, and scalable. It allows businesses to leverage both the stability of on-premises infrastructure and the scalability of public cloud platforms.</p> <p>While this combination provides a powerful infrastructure foundation, it also brings a significant challenge – observability. 
Understanding network traffic between on-premises data centers, public cloud platforms, and the public internet requires a comprehensive <a href="https://www.kentik.com/product/network-monitoring-system/" title="Learn more about Kentik NMS, the next-generation network monitoring system">monitoring solution</a>.</p> <h2 id="the-goal-of-hybrid-cloud-visibility">The goal of hybrid cloud visibility</h2> <p>The problem is that the network we’re concerned with today spans both the infrastructure we own and manage and the infrastructure we don’t. Modern application delivery relies on both the devices and services in our own campus data center as well as public cloud and internet service providers. If we’re going to deliver the seamless user experience people expect today, we need to understand network health end-to-end, and that’s hard to do when you don’t own much of the path.</p> <p>Traditional on-premises network visibility captured metrics like packet loss, interface errors, jitter, device information, and traffic information like flow, latency, round-trip time, etc. Once we started using providers like AWS and Azure, we spun up new visibility solutions to see many of these metrics in the cloud.</p> <p>This is all great, but we don’t operate in disparate silos anymore. Application performance relies on both on-prem and public cloud, so this information can’t be siloed. Therefore, <a href="https://www.kentik.com/kentipedia/hybrid-cloud-networking/" title="Kentipedia: Understanding Hybrid Cloud Networking: Architecture, Benefits, and Best Practices">hybrid cloud</a> visibility needs to understand how application performance is affected by <em>both</em> on-premises and cloud network activity. To be truly effective, it also needs to understand how the public internet in between also affects an application’s performance.</p> <p><em>The goal of hybrid cloud visibility is to understand these three areas in a single context so that an engineer can see them in one place</em>.</p> <h2 id="how-kentik-enables-visibility-across-hybrid-cloud">How Kentik enables visibility across hybrid cloud</h2> <p>Kentik provides a comprehensive solution for monitoring, analyzing, and optimizing complex networks on-premises, in the cloud, and public internet pathways.</p> <p>Kentik collects data from a wide range of sources, including on-premises data centers, campus networks, public clouds like <a href="https://www.kentik.com/solutions/amazon-web-services/" title="Learn more about AWS Observability with Kentik">AWS</a>, <a href="https://www.kentik.com/solutions/microsoft-azure/" title="Learn more about Azure Observability with Kentik">Azure</a>, <a href="https://www.kentik.com/solutions/google-cloud-platform/" title="Learn more about Google Cloud Observability with Kentik">Google</a>, and <a href="https://www.kentik.com/solutions/oracle-cloud-infrastructure-observability/" title="Learn more about Oracle Cloud Infrastructure Observability with Kentik">Oracle</a>, and the public internet. By consolidating this information into a single platform, Kentik delivers a <em>unified</em> view of network performance across a hybrid architecture.</p> <p>Let’s look at each area individually.</p> <p>First, on-prem network devices like switches, traditional routers and SD-WAN routers, firewalls, load balancers, and others still play a critical role in delivering applications. 
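</p> <p>As a flavor of the kind of device-level collection involved, here is a minimal sketch of polling a single interface counter over SNMP with the pysnmp library. The device address, community string, and interface index are placeholders, and this illustrates the mechanism in general rather than how Kentik itself gathers the data:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Minimal sketch: read one interface counter (ifInOctets) via SNMP.
# Device address, community string, and ifIndex are placeholders.
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

error_indication, error_status, _, var_binds = next(getCmd(
    SnmpEngine(),
    CommunityData("public"),
    UdpTransportTarget(("192.0.2.1", 161)),
    ContextData(),
    ObjectType(ObjectIdentity("1.3.6.1.2.1.2.2.1.10.1")),  # ifInOctets.1
))

if error_indication:
    print("SNMP error:", error_indication)
else:
    for name, value in var_binds:
        print(name.prettyPrint(), "=", value.prettyPrint())</code></pre></div> <p>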
Kentik uses protocols and mechanisms like SNMP, streaming telemetry, flow data, and metadata from network-adjacent sources like IPAM to understand how the on-prem network you own and manage affects application performance.</p> <p>Second, Kentik collects flow logs and metrics from the public cloud providers to see how traffic moves to and through your cloud environment. For example, in AWS, you can see traffic traverse transit gateways and Direct Connects, from VPC to VPC and back to your on-prem network. For Azure, you can see traffic through your ExpressRoutes, to your VNets, and, again, back to your on-prem network.</p> <p>Notice in the first graphic below that we can map traffic from our on-prem router through a Direct Connect and gateway to a specific transit gateway. On the right side of the image, we can see the details of the on-prem router when sending traffic to the cloud.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4x7zQ7Zv5SPeNqI8yYzw8o/e20a458130fc48eb8cb2617d228ab/on-prem-direct-connect-transit-gateway.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Graphs showing internet traffic from on-prem through Direct Connect to transit gateway" /> <div as="Testimonial" index="0" color="blue"></div> <p>In the next graphic, we can see traffic going into a region and then between individual VPCs. This can be filtered by application, security tag, protocol, or whatever is relevant to you. Almost everything is clickable, meaning you can select a link, a VPC, etc., and drill down even further to see specific traffic flows, cloud network health metrics, and other information, such as link status and the metadata about each resource.</p> <img src="//images.ctfassets.net/6yom6slo28h2/123EP5xDi1meJvTxtkvYtB/2d7cd76743b16c25fcc57bb89294a063/aws-region-vpcs.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Graphs showing internet traffic into a region and VPCs" /> <p>Next, as anyone who uses the internet knows, ISPs also play a critical role in application performance. Kentik monitors service providers by collecting information from routing tables and path traces from source to destination over the public internet. This way, you can see where latency or packet drops are occurring hop-by-hop between your branch office or data center and between cloud regions.</p> <p>You can also use the <a href="https://www.kentik.com/resources/video-data-explorer-in-kentik-portal/">Data Explorer</a> to drill down into specific traffic, filtering our underlying database in any relevant way. In this image, we can see traffic from our various on-prem resources destined for AWS. For this visualization, we chose to see the application name and VPC name, though the filtering options on the right side of the image allow almost any query.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3tf443pJWRFWXmCabobGGs/6d449c51a6d1111944921292200e22fa/data-explorer-sankey.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Internet traffic from on-prem to AWS in Data Explorer" /> <p>We also need to see traffic between regions since we rely on the public internet. The next image below shows the path between the AWS us-east-2 and eu-west-2 regions. 
Each node and link in the image can be expanded to show each hop’s specific loss, latency, and jitter over the public internet.</p> <img src="//images.ctfassets.net/6yom6slo28h2/e2sslUJ4kZmStFhsdTB9c/e5ef0f1b76bc97b65ee891e35a212510/aws-us-east-eu-west.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Path between the AWS us-east-2 and eu-west-2 regions" /> <p>Lastly, it’s also essential to remember that a host of network-adjacent services affect application performance. For example, DNS resolution time, timeouts, caching, load balancing, and even the location of particular DNS servers significantly impact application performance. Therefore, Kentik monitors the major public DNS services and an organization’s own DNS to provide that information alongside metrics, flow data, etc.</p> <p>In the image below of Kentik’s State of the Internet, we see health metrics and alerts for some of the world’s most popular DNS providers. We can also monitor private DNS services on-prem or hosted in the cloud.</p> <img src="//images.ctfassets.net/6yom6slo28h2/76tCJBux6EoljEWnnwFRQR/ef40c4aa4a47fb2348d05881f050c3a5/state-of-internet-dns.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="State of the Internet with DNS providers" /> <p>Suppose we see an issue with OpenDNS’s service in the network resource health data. In that case, we can drill down into the metrics to see that, in the last hour, there were several resolution-time spikes outside our dynamically created rolling standard deviation baseline.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2mp3PyJCtmvGDSEzkXyrWK/ecbe813ea96d9cf6b1f5b7c6ebb6d8cd/open-dns-drilldown.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Drill-down into DNS issue" /> <h2 id="in-conclusion">In conclusion</h2> <p>Kentik combines all of this information into a single platform so an engineer can understand application traffic end to end, from on-prem to the cloud, in context. Hybrid cloud visibility is crucial for maintaining a high-performing and reliable network infrastructure that spans on-prem, public cloud, and the public internet. Kentik’s unified approach to network visibility provides the comprehensive strategy we need to monitor an application’s journey from source to destination, even when we don’t own or manage every part of the network in between.</p><![CDATA[Driving Network Automation Innovation: Kentik and Red Hat Launch Integration]]><![CDATA[We’re excited to announce a new collaboration between Kentik and Red Hat. This partnership will enable organizations to enhance network monitoring and management by integrating network observability with open-source automation tools.]]>https://www.kentik.com/blog/driving-network-automation-innovation-kentik-and-red-hat-launch-integrationhttps://www.kentik.com/blog/driving-network-automation-innovation-kentik-and-red-hat-launch-integration<![CDATA[Justin Ryburn]]>Mon, 20 May 2024 04:00:00 GMT<h2 id="the-partnership-kentik-and-red-hat">The partnership: Kentik and Red Hat</h2> <p>At Kentik, we’ve always been committed to delivering cutting-edge solutions that enable businesses to gain deep insights into their network cost, security, and performance. Our platform has become synonymous with scalability, flexibility, and actionable insights.
Similarly, Red Hat is renowned for its dedication to open source innovation and fostering collaborative ecosystems.</p> <h2 id="ansible-eda-driving-automation-and-efficiency">Ansible EDA: Driving automation and efficiency</h2> <p>Today, we’re excited to <a href="https://www.redhat.com/en/blog/kentik-ansible-automation-platform-now-certified-red-hat?channel=/en/blog/channel/red-hat-ansible-automation">unveil our integration partnership with Red Hat</a>, centered around Ansible EDA (Event-Driven Ansible). Ansible EDA, available on GitHub, is a powerful framework designed to streamline network analytics workflows through automation and event-driven architectures.</p> <h2 id="key-benefits-of-integration">Key benefits of integration</h2> <p>This collaboration signifies our shared vision of democratizing network analytics and empowering organizations of all sizes to unlock the full potential of their data. By combining Kentik’s expertise in network observability with Red Hat’s commitment to open source excellence, we’re making it easier for our joint customers to build closed-loop automation systems.</p> <p>So, what does this integration mean for you?</p> <ul> <li><strong>Enhanced visibility</strong>: With Ansible EDA seamlessly integrated into the Kentik platform, users can now leverage sophisticated analytics capabilities to gain deeper insights into their network traffic, performance, and security posture.</li> <li><strong>Simplified operations</strong>: Automation lies at the heart of Ansible EDA, allowing organizations to automate mundane tasks and focus on strategic initiatives. By automating network analytics workflows, teams can streamline operations, reduce manual errors, and improve overall efficiency.</li> <li><strong>Scalability and flexibility</strong>: The combined power of Kentik and Red Hat enables organizations to effortlessly scale their network analytics infrastructure. Whether you’re managing a small network or a global infrastructure, our integrated solution can adapt to your evolving needs.</li> <li><strong>Community-driven innovation</strong>: As proponents of open source, we’re committed to fostering a vibrant community around Ansible EDA. By collaborating with Red Hat and the broader open source community, we’re accelerating innovation and driving positive change in the industry.</li> </ul> <h2 id="how-it-works">How it works</h2> <p>The Ansible EDA plugin for Kentik accepts notifications from the Kentik JSON notification channel and kicks off an Ansible playbook. This integration empowers Kentik to trigger third-party actions based on events it identifies, such as anomalies flagged by our Alerting system. These actions could include network adjustments, DDoS attack mitigations, or other solutions. The JSON payload containing pertinent data is sent to a designated webhook URL, where it is parsed by the Ansible EDA plugin to kick off the playbook. For a glimpse of the JSON payload format, refer to the <a href="https://kb.kentik.com/v4/Cb24.htm#Cb24-Sample_Alert_JSON">Sample Alert JSON</a> provided.</p> <h2 id="setting-it-all-up">Setting it all up</h2> <p>Now that we understand the value and how it works at a high level, let’s turn our attention to setting up the integration and getting it working.</p> <h3 id="install-the-collection">Install the collection</h3> <p>The first thing we need to do is install the collection from Ansible Galaxy so that we can leverage the plugin.
Using the Ansible Galaxy command line, we do the following:</p> <p><code class="language-text">ansible-galaxy collection install kentik.ansible_eda</code></p> <p><strong>NOTE:</strong> When you install a collection from Ansible Galaxy, it is not automatically upgraded when you upgrade your Ansible install. To upgrade the Kentik EDA plugin, run the following:</p> <p><code class="language-text">ansible-galaxy collection install kentik.ansible_eda --upgrade</code></p> <p>See <a href="https://docs.ansible.com/ansible/devel/user_guide/collections_using.html">using Ansible collections</a> for more details.</p> <h3 id="event-driven-ansible-webhook">Event-Driven Ansible webhook</h3> <p>Once you have the collection installed, you will need to develop a rulebook that uses the webhook from Kentik as the event source and defines the address and port to listen on, like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">- name: Listen for alerts using kentik_webhook
  hosts: all
  ### Define our source for events
  sources:
    - kentik_webhook:  # for local tests only. In production use kentik.ansible_eda.kentik_webhook
        host: 0.0.0.0
        port: 80
</code></pre></div> <p>In the rules section, you will need to configure the action you want the rulebook to take, like the following. Note that this will call a playbook named <em>example_playbook.yml</em> that contains your Ansible code for what you want to automate once an event is triggered.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">rules:
  - name: Print out the alert
    condition: event.i == 1
    ### Define the action we should take should the condition be met
    action:
      run_playbook:
        name: playbooks/example_playbook.yml
</code></pre></div> <p>For the full example YAML file, <a href="https://github.com/kentik/ansible_eda/blob/main/extensions/eda/rulebooks/kentik_alert_example_rule.yml">check out the playbook</a> in the GitHub repository.</p> <h3 id="how-to-set-up-the-kentik-json-notification">How to set up the Kentik JSON notification</h3> <p>Finally, we need to configure the Kentik alerting notification engine to send a webhook to our listener.</p> <p>To configure the notification, navigate to Menu >> Settings >> <a href="https://portal.kentik.com/v4/settings/notifications">Notifications</a>. From there, click on the blue Add Notification Channel button to bring up the following dialog window:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6OR83oTAIy7DO62M4QXgiC/06e00b3bbace0e65882d885ed6516ccb/add-notification-channel.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Red Hat Ansible - Add Notification Channel" /> <p>You will need to give your notification channel a name. The URL will need to be the endpoint that your Ansible EDA plugin is going to listen on. Notifications are tied to thresholds on an alert policy. Configuring those is beyond the scope of this blog, but you can read more about it in the Kentik Knowledge Base article <a href="https://kb.kentik.com/v4/Ga12.htm">Threshold Policy Settings</a>.</p> <p><strong>NOTE</strong>: This URL will need to be publicly accessible wherever your Ansible EDA is running so the Kentik SaaS platform can reach it.</p>
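<h3 id="an-example-playbook">An example playbook</h3> <p>With the collection installed, the rulebook listening, and the notification channel configured, the last piece is the playbook itself. Here is a minimal sketch of what a hypothetical <em>example_playbook.yml</em> might contain; it is an illustration rather than the published example. It assumes only that Event-Driven Ansible exposes the triggering event to the playbook through the <code class="language-text">ansible_eda.event</code> variable, and it simply prints the alert payload so you can verify the pipeline end to end before wiring in real remediation tasks.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">---
# Minimal sketch of a playbook called by the rulebook above.
# ansible_eda.event holds the event payload that Event-Driven Ansible
# passes along when it runs this playbook from a rulebook.
- name: Respond to a Kentik alert
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Print the alert payload received from Kentik
      ansible.builtin.debug:
        var: ansible_eda.event
</code></pre></div> <p>Once you can see the payload arriving, swap the debug task for whatever action fits your environment, such as pushing a configuration change to a router or opening a ticket.</p>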
<h2 id="conclusion">Conclusion</h2> <p>Stay tuned for more updates as we continue to innovate and deliver value to our customers through this groundbreaking partnership. Together, we’re shaping the future of networking.</p> <p>You can explore the Ansible EDA project on <a href="https://github.com/kentik/ansible_eda">GitHub</a> or download the collection from <a href="https://console.redhat.com/ansible/automation-hub/repo/published/kentik/ansible_eda/?version=1.0.6">Red Hat’s Ansible Automation Hub</a>.</p> <p>Here’s to a future of smarter, more resilient networks!</p><![CDATA[East Africa Struck by More Submarine Cable Woes]]><![CDATA[In this post, Doug Madory digs into the latest submarine cable failure impacting internet connectivity in Africa: the Seacom and EASSy cables suffered breaks on May 12 near the cable landing station in Maputo, Mozambique.]]>https://www.kentik.com/blog/east-africa-struck-by-more-submarine-cable-woeshttps://www.kentik.com/blog/east-africa-struck-by-more-submarine-cable-woes<![CDATA[Doug Madory]]>Thu, 16 May 2024 04:00:00 GMT<p>On Sunday, May 12, two submarine cables (<a href="https://www.submarinecablemap.com/submarine-cable/seacomtata-tgn-eurasia">Seacom</a> and <a href="https://www.submarinecablemap.com/submarine-cable/eastern-africa-submarine-system-eassy">EASSy</a>) suffered breaks along a stretch of coastline between the cable landing stations in <a href="https://www.submarinecablemap.com/landing-point/mtunzini-south-africa">Mtunzini, South Africa</a> and <a href="https://www.submarinecablemap.com/landing-point/maputo-mozambique">Maputo, Mozambique</a>. It was the latest in a historically bad run of submarine cable failures that have led to disruptions of internet service around the continent of Africa.</p> <p>“Keen observer of the submarine cable industry” and friend of the blog, Philippe Devaux, <a href="https://twitter.com/philBE2/status/1790051184578871524">created</a> this markup of a <a href="https://blog.telegeography.com/introducing-the-2024-africa-telecommunications-map">Telegeography map</a> depicting the location of the cable cuts.</p> <img src="//images.ctfassets.net/6yom6slo28h2/44eYj3EhfPIdtI1E4snAwW/37f250db9c381c1ce6686146c4316fab/east-africa-teleography-map.jpg" style="max-width: 800px;" class="image center" alt="Map of submarine cable cuts" /> <p>Impacts were observed from Kenya down the east coast of Africa to South Africa beginning just before 07:30 UTC (10:30 am East Africa time). Among the East African countries impacted, Tanzania experienced the <a href="https://twitter.com/DougMadory/status/1790037332067754232">most severe disruption</a>, according to Kentik’s aggregate NetFlow. This observation was corroborated by outside sources such as the <a href="https://ioda.inetintel.cc.gatech.edu/dashboard/?from=1715437854&#x26;until=1715524254">Internet Outage Detection and Analysis</a> project at Georgia Tech and Cloudflare’s <a href="https://blog.cloudflare.com/east-african-internet-connectivity-again-impacted-by-submarine-cable-cuts">Radar</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4gey21jXIpcgX2WcI9XSCF/a824d450661d105926b22f40d4eba85a/internet-outage-tanzania.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Internet outage in Tanzania" /> <p>We’ll get into more detail below, but let’s first review some history.</p> <h2 id="a-rough-few-months">A rough few months</h2> <p>If we start the timeline of this run of misfortune last summer, we can include the undersea landslide in the Congo Canyon last August that took out multiple cables along the west coast of Africa.
In a <a href="https://www.kentik.com/blog/dual-subsea-cable-cuts-disrupt-african-internet/">blog post</a>, I detailed the impacts and discussed the history of damage to submarine cables from natural hazards such as earthquakes.</p> <p>Next, on February 24, multiple submarine cables connecting Europe and Asia were cut in an incident in the Red Sea. Having been attacked by Houthis from Yemen, the MV Rubymar was abandoned and left to drift at sea. Ultimately, its anchor snagged and cut three cables in the sea inlet’s shallow waters. I published <a href="https://www.kentik.com/blog/what-caused-the-red-sea-submarine-cable-cuts/">my analysis</a> as part of an <a href="https://www.wired.com/story/houthi-internet-cables-ship-anchor-path/">investigation</a> led by WIRED magazine to dig into the incident.</p> <p>Finally, on March 14, four submarine cables (ACE, MainOne, WACS, and SAT-3) suffered failures off the coast of Côte d’Ivoire over a period of 5.5 hours. The suspected cause was an undersea landslide in the Trou sans Fond (“bottomless pit”), an undersea canyon running out from the coast. The last of the repairs was <a href="https://www.linkedin.com/posts/mainone-cable-company_activity-7194714117356584961-7Lm0?utm_source=share&#x26;utm_medium=member_desktop">finally completed</a> on May 10.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4olh2KaXrA8CyxuXOn9Fp2/f8c7d56a3bb470cd8a133b724dd4fd49/internet-outage-nigeria.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Internet outage in Nigeria" /> <div class="caption" style="margin-top: -35px">Impacts of the March 14 cable cuts on internet traffic to Nigeria</div> <h2 id="digging-into-the-impacts">Digging into the impacts</h2> <p>Vodacom Tanzania was one of the most impacted providers in East Africa. On May 15, they <a href="https://twitter.com/VodacomTanzania/status/1790656975673393620">announced</a> a full restoration of services, but BGP data reveals the scramble to re-establish connectivity immediately following the loss of the submarine cables on the 12th.</p> <p>While some Vodacom Tanzania routes were simply withdrawn at the time of the cable outages, others, such as 197.250.0.0/18 depicted below, were rerouted over Tigo (AS37035) in an attempt to obtain transit from Djibouti Telecom to the north. Vodacom Tanzania (AS36908) was primarily using Seacom’s IP transit service (AS37100), which relies on its eponymous subsea cable, as well as transit from the parent company’s network in South Africa (AS36994).</p> <img src="//images.ctfassets.net/6yom6slo28h2/5gKZaxKa0hiGXAZw1QqKrz/24f2163f75627797aad8f8edabc3a932/vodacom-reroute.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Vodacom Tanzania being rerouted" /> <p>Eventually, Tanzanian providers were able to reroute traffic over alternative cables until the cable repairs can take place.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5fE9W2VOiZnoCsp3HLWt4D/af55b6f9e12d8814d0993f0ccf0b5cf3/tanzanian-traffic-rerouted.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Graph showing traffic being rerouted" /> <p>Conversely, Safaricom of Kenya was largely able to weather the loss of the cables by reverting to other sources of transit.</p> <div as="Promo"></div> <p>The BGP analysis shows instability of WIOCC transit beginning around 07:28 UTC on May 12, until its complete loss over an hour later.
From a BGP perspective, the lost transit was replaced by China Telecom Global (AS23764), as depicted in this BGP visualization of a Safaricom route.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6AIXM9fI8naKz8ipNLNUZD/226b9e8e5ae2bd0f9e65de7b4c116b0f/safaricom.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP visualization of Safaricom" /> <h2 id="impacts-on-cloud-connectivity">Impacts on cloud connectivity</h2> <p>As with previous submarine cable disruptions, the latest cable cuts affected connectivity between public cloud regions. Kentik’s multi-cloud performance capability, which we use as the basis of the <a href="https://www.kentik.com/blog/cloud-observer-subsea-cable-maintenance-impacts-cloud-connectivity/">Cloud Observer</a> blog series, captured the impact.</p> <p>The most visible impact, illustrated in the screen captures below, was on measurements between cloud regions in South Africa and Asia. Latencies were observed to increase dramatically as traffic was redirected along longer geographic paths due to the loss of these cables.</p> <p>In the first graphic, measurements from Azure’s <code class="language-text">southafricanorth</code> region in Johannesburg, South Africa, to Google Cloud’s <code class="language-text">asia-south1</code> region in Mumbai, India, jumped by 74 ms.</p> <img src="//images.ctfassets.net/6yom6slo28h2/CKtxa9yqzM7AiMRJv9n6b/c9b024f6b117b7ac13a1b8f61239305e/internet-latency-johannesburg-mumbai.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Jump in latency from Johannesburg to Mumbai" /> <p>In the graphic below, AWS’s <code class="language-text">af-south-1</code> region in Cape Town, South Africa, experienced an increase of over 120 ms to Google Cloud’s <code class="language-text">asia-southeast1</code> region in Singapore.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6dJecfqGd7TBf4wCvOqBk3/aea633db31705ba863bd1d726f5631af/capetown-singapore-traffic.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Jump in latency from Cape Town to Singapore" /> <p>These examples once again prove that the cloud providers must rely on the same submarine cable infrastructure as everyone else and can be impacted by cable cuts in similar ways.</p> <h2 id="conclusions">Conclusions</h2> <p>In the past decade, the number of submarine cables serving the continent of Africa has nearly doubled, leading many internet infrastructure observers, such as yours truly, to believe that this abundance of cables contributed to a greater degree of resilience.</p> <p>However, in the past nine months, Africa has endured four separate cable incidents, each resulting in the failures of multiple submarine cables. This leads to some tough questions. Are environmental conditions contributing to greater underwater turbidity that could lead to more undersea landslides? Has increased commerce led to greater maritime traffic and, thus, a greater threat from ship anchors?</p> <p>We have done a lot in the past decade to keep local internet traffic local by encouraging domestic interconnectivity through internet exchanges, for example. For the primary hubs of Africa (Nigeria, South Africa, and Kenya), the amount of content that is served through local caches is enormous compared to where we were a decade ago, but it doesn’t appear to be enough. We still have a high degree of internet connectivity dependent on submarine cables despite the fact that much (most?)
content now gets served locally in many of these markets.</p> <p>We’ll need to learn the cause of this latest incident. While we are waiting, it is worth considering that WIOCC’s EASSy cable and the Seacom cable failed within minutes of each other — similar to the cable cuts in the Red Sea, which were caused by a ship anchor. The cable failures caused by undersea landslides (the Congo Canyon and Côte d’Ivoire’s Trou sans Fond) were spread out over multiple hours.</p> <p>According to a <a href="https://www.linkedin.com/posts/philippe-devaux-218423199_14may24-east-africa-eassy-seacom-subsea-activity-7196245232693182465-bfuX?utm_source=share&#x26;utm_medium=member_desktop">recent LinkedIn post</a> from Devaux, the cable repair ship <a href="https://www.marinetraffic.com/en/ais/details/ships/shipid:761048/mmsi:645400000/imo:8108676/vessel:LEON_THEVENIN">Leon Thevenin</a> has been dispatched from Cape Town, South Africa, to make the necessary repairs.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1P2vSxV9Sk3aU3RGprRh0F/d248858fca5f0e35f576faf74e43b5ca/cable-repair-ship.png" style="max-width: 600px;" class="image center" alt="Map with cable repair ship" /> <p>Hopefully, these are the last submarine cable breaks to disrupt African connectivity for a very long time.</p> <p>For additional discussion on the topic of the recent spate of submarine cable cuts affecting internet connectivity in Africa, tune into a recent episode of Kentik’s <a href="https://www.kentik.com/telemetrynow/">Telemetry Now</a> podcast hosted by my colleague Phil Gervasi.</p> <div as="WistiaVideo" videoId="mswbehusup" audio></div><![CDATA[Building Resilience: Modern Business Networks Need SaaS Monitoring]]><![CDATA[Traditional network monitoring systems can’t meet the dynamic demands of modern business networks. Modern NMSs built with SaaS-native architecture enable enterprises to deliver exceptional customer experiences, powered by scalable, high-performance, innovative monitoring.]]>https://www.kentik.com/blog/modern-saas-network-monitoring-systemhttps://www.kentik.com/blog/modern-saas-network-monitoring-system<![CDATA[Rosalind Whitley]]>Fri, 10 May 2024 04:00:00 GMT<p>As digital transformation continues to dictate market trends, an agile and robust network infrastructure is still pivotal for enterprises aiming to maintain competitive advantage. Traditional network monitoring systems (NMS), typically constrained by on-premises architectures, no longer meet the dynamic demands of modern networks and businesses. SaaS-delivered NMSs not only address the inherent limitations of traditional systems but also deliver scalability, performance, and continual innovation.</p> <h2 id="scalability-tailored-to-growth">Scalability, tailored to growth</h2> <p>SaaS models have revolutionized numerous software sectors, but network monitoring systems have not kept pace. While still the standard for network monitoring, on-premises solutions stifle infrastructure growth and flexibility by requiring significant upfront infrastructure investment and ongoing maintenance to scale reliably alongside the business. When a traditional NMS fails to keep pace with evolving workloads, this threatens application reliability, performance, and resilience. SaaS-delivered NMSs, particularly those with SaaS-native architecture, break this mold by providing near-infinite scalability.
This allows enterprises to expand monitoring activities seamlessly with business growth without the typical overheads associated with NMS physical infrastructure upgrades or replacements. That way, network monitoring scales dynamically and efficiently, with no waiting, whether the infrastructure team is dealing with increased traffic, more complex network architectures, or rapid geographic expansion.</p> <h2 id="optimizing-performance-to-drive-impact">Optimizing performance to drive impact</h2> <p>Moving to a SaaS-delivered NMS allows network and IT managers to leverage the full spectrum of SaaS benefits — from reduced total cost of ownership and decreased operational burdens to enhanced security features and compliance capabilities. This modernization is as much about transforming your network management to be more flexible, scalable, and aligned with modern business practices as it is about upgrading technology to leverage recent networking advancements such as streaming telemetry. All of these motions translate to a higher-quality user experience.</p> <p>An NMS should deliver fast, high-performance monitoring in the browser to ensure operations teams can access critical data and insights in real time, facilitating swift decision-making and operational responsiveness. Traditional systems often suffer from delays and performance bottlenecks, whereas modern monitoring apps with SaaS-native architecture ensure optimal efficiency and user experience by significantly reducing the time to resolve network issues, minimizing downtime and performance degradation.</p> <div as="Promo"></div> <h2 id="continuous-innovation">Continuous innovation</h2> <p>The SaaS model of software delivery is celebrated because it embraces rapid technological evolution. This starkly contrasts with most traditional NMS solutions, such as <a href="https://www.youtube.com/watch?v=OtIg9iKcJzQ">SolarWinds</a>, which fail to deliver modern features such as container workload support, containerized deployment, and streaming telemetry support. When they do release product updates, they require manual upgrades on-prem, creating a lag in integrating the latest features. Meanwhile, access to contemporary innovations such as AI-assisted queries and insights is completely off the table for users of these traditional on-prem solutions.</p> <p>SaaS vendors continuously update their platforms with the newest functionalities. Kentik’s NMS product is no exception and equips users with state-of-the-art tools for traffic analysis and insights, DDoS detection and mitigation, cloud and hybrid environment monitoring, and natural language query to help them operate the networks of today. A strong vendor commitment to innovation keeps your network monitoring system current and adaptable to future technological advancements.</p> <h2 id="a-strategic-shift-to-saas-nms">A strategic shift to SaaS NMS</h2> <p>Adopting a SaaS-delivered NMS represents a strategic investment in the future resilience and agility of your business. Once you’re ready to rethink how you monitor and manage your networks, moving away from outdated legacy systems towards a more flexible, scalable, and innovative SaaS model will enhance operational efficiency and position the business for future growth and technological changes.
In doing so, you can ensure that your network infrastructure—a critical asset in today’s digital landscape—is robust, responsive, and ready to meet the challenges of tomorrow.</p> <p>Watch this video to learn more about how a SaaS NMS can unify telemetry to observe complex modern networks.</p> <div as="WistiaVideo" videoId="d31q7pczqx" audio></div><![CDATA[NMS Migration Made Easy: Lessons Learned]]><![CDATA[In this post, I'm going to point out some of the opportunities you have in front of you not only because you now have Kentik NMS up and running, but more importantly, as a result of all the hard work you did to get here.]]>https://www.kentik.com/blog/nms-migration-made-easy-lessons-learnedhttps://www.kentik.com/blog/nms-migration-made-easy-lessons-learned<![CDATA[Leon Adato]]>Wed, 08 May 2024 04:00:00 GMT<p>The three <a href="https://www.kentik.com/blog/nms-migration-made-easy-gathering-information/">previous blog posts in this series</a> have focused on getting you from your old monitoring solution into Kentik NMS with a minimum of fuss, muss, or rework. But as I reviewed the content for this series, I realized that tucked away inside those posts (as well as the full <a href="https://www.kentik.com/go/ebook/nms-migration-guide/">NMS Migration Guide</a>) were nuggets of wisdom that applied not only to a migration but to the broader body of work we do as IT practitioners.</p> <div as="Promo"></div> <p>I felt it was worth taking a moment to highlight skills that aren’t just useful for a migration but are, in fact, things we should be doing as IT professionals in our daily work.</p> <p>In this post, I’m going to point out some of the opportunities you have in front of you, not only because you now have Kentik NMS up and running but, more importantly, because of all the hard work you did to get here.</p> <h2 id="revisiting-those-customers-consumers-and-constituents">Revisiting those customers, consumers, and constituents</h2> <p>You worked so hard to build those relationships, so don’t let them die on the vine now. Go back to them via Slack, email, Zoom, or by strolling up to their desk.</p> <p>Touch base with the customers – the people who pay the bills. It’s entirely likely these folks are (sadly) used to staff begging for money, only to disappear once the dollars have been dispensed, with nary a word about whether that investment bore fruit or not. Make sure they get a status update: The old system is not only shut down, it’s no longer sucking up budget dollars and staff time. Make the value Kentik NMS provides equally clear to them (more on that in a minute).</p> <p>Check in with the consumers to verify that Kentik NMS is handling everything the previous solution covered and find out what the old tool never did but they’d always wanted. Let this be the start of an ongoing feedback loop between you and the true beneficiaries of network observability.</p> <p>Use these regular (every three to six months) updates to do the following:</p> <ul> <li>Validate whether the monitors, reports, dashboards, and alerts still deliver the expected results.</li> <li>Tweak the ones that have drifted too far off the mark.</li> <li>Delete anything that’s no longer needed.</li> <li>Reserve time to discuss new needs and opportunities.</li> </ul> <p>Finally, this is a great time to consider taking a Network Observability Road Show to the constituents.
It’s hard to overstate the impact of a short presentation highlighting specific problems you’ve used monitoring and observability to solve, followed by a quick review of capabilities. This is less about singing Kentik’s praises specifically and more about sharing the results of network observability at a high level.</p> <p>Inevitably, some of the folks you speak with will realize that the technology they rely on may benefit from the capabilities you’re describing.</p> <h2 id="management-as-a-second-opinion">Management as a second opinion</h2> <p>You may recall how we reminded you that the beloved tools used by teams today often have some monitoring baked in, but that didn’t mean Kentik NMS should supplant them.</p> <p>That is 100% true. But it’s also true that those tools can pull very specific telemetry from their systems, and you would be remiss to dismiss it out of hand. Instead, look for ways to validate Kentik NMS’ data against those sources of truth. This isn’t to say that vendor-provided management tools are always right. If we’re being honest, neither is Kentik NMS. Instead, those tools are a validation point. If their results differ from Kentik’s, it’s an opportunity to investigate why. Your environment as a whole can only improve from the effort.</p> <h2 id="monitor-what-matters">Monitor what matters</h2> <p>And make monitoring matter.</p> <p>As part of this migration, you’ve spent a lot of time gathering and validating the inventory of devices, elements, sub-components, dashboards, reports, and alerts. Now is the time to review that list and ask the hard questions about what value these assets provide.</p> <p>Not the devices themselves. Obviously, those are valuable to the business. But do you really need to monitor ARP tables? Do you need to collect them every 60 seconds? Does the alert telling you when any server has over 60% CPU utilization provide the business with meaningful insight? How about that report sent to the entire DBA team (half of whom most likely have an email rule that sends it directly to the trash folder)?</p> <p>This is more than just Marie Kondo-ing your monitoring and throwing out anything that doesn’t spark joy. As mentioned earlier, each and every element in Kentik NMS represents a unit of work. It serves nobody to collect data nobody ever intends to use, to alert on events nobody responds to, or to send reports to folks with no interest in the information.</p> <h2 id="flex-your-language-skills">Flex your language skills</h2> <p>Finally, it’s possible that this project opened up a seat at the table with your leaders. Don’t dismiss the incredible opportunity this presents or shy away from it because you – a seasoned IT practitioner – are afraid of being sucked into a management role and losing your technical edge.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4M6gmXiGn9B9BqlItJOFyY/bcd651666ab75d303f8e1e1e3a110c2a/boardroom.jpg" style="max-width: 500px;" class="image center" alt="Boardroom" /> <p>Being bilingual means speaking two languages fluently, not losing the ability to speak one in favor of another.
Having developed fluency in the language of the business, you should use your newfound superpower for good (and not evil) in precise ways:</p> <ul> <li>Continue to frame the technical achievements monitoring and observability (to say nothing of Kentik NMS) have delivered in business terms, describing how they have helped reduce cost and remove risk (and bonus points if you’ve used it to help increase revenue).</li> <li>As you sit at the leadership table, listen carefully. You will undoubtedly hear about what the business cares about most. Pay attention to this information, as it’s a cheat code to get more positive attention, support, and (most importantly) budget. Knowing what is most critical to the business tells you (in terms you can relay to your team) where to focus your monitoring efforts.</li> <li>Take what you heard back to your team. Use your bilingual fluency to translate those business goals back into technical specifications. Leverage the company priorities you heard about to elevate projects that support them. But most of all, <em>share</em> what you heard with the team because, more often than not, there will be additional ideas on how to do things even more effectively.</li> </ul> <p>As Kenneth Blanchard said, “None of us is as smart as all of us.”</p> <p>The feedback loop that inevitably results – hearing what is important, monitoring it vigorously, and reporting back the positive results, which, in turn, will spark conversations where you hear about more business-critical initiatives – will ensure that the business continues to understand the value network observability brings to the table and therefore supports your efforts.</p> <h2 id="the-end-of-the-beginning">The end of the beginning</h2> <p>There are many more blog posts to come on NMS and the rest of the full suite of Kentik solutions. But for now, we’re setting aside the subject of migrations. That doesn’t mean you should set aside the skills you’ve learned along the way.</p> <p>It’s all too common to underestimate lessons and techniques learned in a specific context (e.g., migrating from one monitoring tool to Kentik NMS) and ignore how they apply in a wide range of other scenarios. Hopefully, this post gives you a reason to pause momentarily and consider the valuable tools you’re effectively throwing away.</p> <p>Regardless, we hope you find a few bits of wisdom you can add to your toolbox.</p> <p><a href="https://www.kentik.com/go/ebook/nms-migration-guide/">Download the complete NMS Migration Guide here</a>.</p><![CDATA[BGP Flowspec Doesn't Suck. We're Just Using it Wrong.]]><![CDATA[BGP Flowspec is one of the critical technologies gaining traction in DDoS mitigation strategies. In this article, Justin Ryburn outlines the adoption rates and best practices of this emerging and effective DDoS mitigation tool.]]>https://www.kentik.com/blog/bgp-flowspec-doesnt-suck-were-just-using-it-wronghttps://www.kentik.com/blog/bgp-flowspec-doesnt-suck-were-just-using-it-wrong<![CDATA[Justin Ryburn]]>Tue, 07 May 2024 04:00:00 GMT<h2 id="introduction">Introduction</h2> <p>DDoS attacks continue to plague the internet, posing a persistent threat to businesses of all sizes. As attackers evolve their tactics, our defenses must adapt to mitigate these threats effectively. A DDoS attack can cause significant issues for customers and can be very time-consuming for network operators to track down and stop. Those who have followed me for a while may be aware I am a big proponent of BGP Flowspec.
I’ve done <a href="https://youtu.be/ttDUoDf6xzM">public speaking</a> on the topic and even published a short book, <a href="https://www.juniper.net/documentation/en_US/day-one-books/DO_BGP_FLowspec.pdf">A Day One Guide</a>, detailing how to configure it on Juniper devices. I truly believe that BGP Flowspec can be a big help to operators in blocking these attacks. Like most things in IT, it does not come without drawbacks and must be implemented properly. In this blog, I’ll give a refresher on BGP Flowspec and why I believe more operators should test and adopt the technology.</p> <div as="WistiaVideo" videoId="1evv79okbk" audio></div> <h2 id="ddos-trends">DDoS trends</h2> <p>Recent industry reports from leading cybersecurity firms such as Akamai and Cloudflare provide a picture of the ongoing battle against DDoS attacks. According to <a href="https://www.akamai.com/blog/security/a-retrospective-on-ddos-trends-in-2023">Akamai’s retrospective on DDoS trends in 2023</a> and <a href="https://blog.cloudflare.com/ddos-threat-report-2023-q4">Cloudflare’s DDoS threat report for 2023 Q4</a>, the frequency and complexity of DDoS attacks continue to rise, driving more organizations towards cloud-based mitigation services while moving away from traditional appliance-based solutions.</p> <p>One of the cool things about working for a network observability company that analyzes the global BGP table is that we can see when an organization activates a cloud-based scrubbing service simply by watching the routing changes. In this visualization, you can see Intrado Life &#x26; Safety (ASN 36329), which uses AT&#x26;T (ASN 7018) as its upstream provider. When they are under attack, they swing their traffic over to Neustar (ASN 19905) to mitigate it.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5jht5vdmTzruIQgIBFmLft/9e6f442c0af652aeb3966a0826fe40cb/bgp-flowspec-route-viewer.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP table in the Kentik platform" /> <div as="Promo"></div> <h2 id="bgp-flowspec-a-powerful-tool">BGP Flowspec: A powerful tool</h2> <p>One of the key technologies gaining traction in DDoS mitigation strategies is BGP Flowspec. Before we delve into its adoption rates and best practices, let’s have a quick refresher on how BGP Flowspec works. BGP Flowspec adds a new NLRI (Network Layer Reachability Information) type that allows the operator to specify very detailed parameters for the type of attack they wish to mitigate. Here is a list of the possible parameters that can be specified:</p> <ul> <li>Type 1 - Destination prefix</li> <li>Type 2 - Source prefix</li> <li>Type 3 - IP protocol</li> <li>Type 4 - Port</li> <li>Type 5 - Destination port</li> <li>Type 6 - Source port</li> <li>Type 7 - ICMP type</li> <li>Type 8 - ICMP code</li> <li>Type 9 - TCP flags</li> <li>Type 10 - Packet length</li> <li>Type 11 - DSCP (Diffserv Code Point)</li> <li>Type 12 - Fragment</li> </ul> <p>Once you have signaled what traffic you want to match on, you must tell the router what to do with that traffic. This is done by attaching extended communities to the announced BGP NLRI.
Here is a table of those options:</p> <table> <thead> <tr> <th><strong>Type</strong></th> <th><strong>Extended Community</strong></th> <th><strong>Encoding</strong></th> </tr> </thead> <tbody> <tr> <td>0x8006</td> <td>traffic-rate</td> <td>2-byte as#, 4-byte float</td> </tr> <tr> <td>0x8007</td> <td>traffic-action</td> <td>bitmask</td> </tr> <tr> <td>0x8008</td> <td>redirect</td> <td>6-byte Route Target</td> </tr> <tr> <td>0x8009</td> <td>traffic-marking</td> <td>DSCP value</td> </tr> </tbody> </table> <p>For more details, check out the IETF’s <a href="https://datatracker.ietf.org/doc/html/rfc8955">RFC 8955</a>.</p> <h2 id="adoption-rates">Adoption rates</h2> <p>While BGP Flowspec holds immense potential for mitigating DDoS attacks, its adoption rates vary across organizations. It’s 2024, so of course, we are going to ask ChatGPT what it knows about BGP Flowspec adoption rates:</p> <img src="//images.ctfassets.net/6yom6slo28h2/H2XNSA42ynEgKuzDYxtXD/e65f97608be3750f49b0928cf1444658/chatgpt-bgp-flowspec.png" style="max-width: 700px;" class="image center" alt="Chat GPT results about Flowspec adoption" /> <p>All joking aside, it is hard to get good data on the adoption rates as most organizations just go about doing it quietly. There has not been much research on which networks have it enabled in their own networks. I talk to a lot of customers who are either using it or in the process of testing it in their lab before deployment. My limited data set would indicate that interest and adoption are growing. Still, adoption rates are lower than many had hoped for when the IETF ratified RFC 5575 (which RFC 8955 later replaced) years ago.</p> <h2 id="best-practices-for-safe-adoption">Best practices for safe adoption</h2> <p>Like any powerful technology, BGP Flowspec can have adverse outcomes if deployed incorrectly. BGP Flowspec as a protocol has left a negative impression in the minds of many network engineers due to some very public outages caused by its misuse. The two most famous were the <a href="https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage">2020 CenturyLink outage</a> and the <a href="https://blog.cloudflare.com/todays-outage-post-mortem-82515/">2013 Cloudflare outage</a>.</p> <p>However, this should not deter organizations from leveraging its capabilities in a careful and well-thought-out manner. To ensure safe adoption, it’s essential to follow best practices such as thorough testing, ongoing monitoring, and adjustment of deployment policies.</p> <p>I advise customers to carefully test forwarding performance with different BGP Flowspec rules deployed. Most line card ASICs that I am aware of have a finite amount of resources they can allocate to this filter. As the number of prefixes increases, those resources get used up, and the ASICs can no longer forward packets at line rate. Also, not all Flowspec rules are created equal. The more complex the matching criteria, the more resources the rule will consume on the ASIC.</p> <p>The next thing to be aware of is that the filtering resulting from BGP Flowspec rules is implemented as a forwarding-table filter, which means it applies to all the interfaces on the device. Some vendors allow configurations to exclude specific interfaces like backbone links, out-of-band management interfaces, etc. This may be something you want to consider so that you do not lose access to the device when the filter is applied.</p>
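<p>To make this concrete, here is a hedged sketch of what announcing a locally defined Flowspec rule can look like in practice, using Ansible to push configuration to a Juniper router. The host group, rule name, and addresses are hypothetical; the statements follow the Junos <code class="language-text">routing-options flow</code> hierarchy. The rule matches UDP traffic sourced from port 53 toward a single victim address and discards it, a typical mitigation for a DNS reflection attack.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">---
# Hypothetical mitigation playbook: define a Flowspec discard rule for
# DNS reflection traffic aimed at a single victim address. The router's
# BGP configuration then advertises the rule to its Flowspec peers.
- name: Push a BGP Flowspec mitigation rule
  hosts: edge_routers
  gather_facts: false
  connection: ansible.netcommon.netconf
  tasks:
    - name: Discard UDP source-port 53 traffic toward the victim
      junipernetworks.junos.junos_config:
        lines:
          - set routing-options flow route block-dns-reflection match destination 203.0.113.10/32
          - set routing-options flow route block-dns-reflection match protocol udp
          - set routing-options flow route block-dns-reflection match source-port 53
          - set routing-options flow route block-dns-reflection then discard
</code></pre></div> <p>Note that even a simple rule like this consumes filter resources on every interface of every router that accepts it, so keep the matching criteria only as complex as the attack requires.</p>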
<p>My final recommendation is to have tight control over what kind of BGP Flowspec rules you allow your system to advertise to your routers. The Cloudflare outage I mentioned earlier resulted from them advertising a rule that blocked packets between 99,971 and 99,985 bytes long. As you probably know, those packet sizes do not make sense, as most networks have a maximum packet size of ~9,000 bytes. This caused their Juniper line cards to panic and drop all traffic. The router should not have reacted that way, but Cloudflare should have also had checks in their automation platform to sanity-check that packet sizes were in a reasonable range.</p> <h2 id="staying-current">Staying current</h2> <p>Keeping abreast of the latest developments in BGP Flowspec is crucial for effective DDoS mitigation strategies. Stay informed about the latest updates from the Internet Engineering Task Force (IETF) regarding BGP Flowspec standards and implementations. BGP Flowspec is maintained by the <a href="https://datatracker.ietf.org/wg/idr/about/">Inter-Domain Routing (IDR) working group</a>, and you can follow their mailing list <a href="https://mailarchive.ietf.org/arch/browse/idr/">here</a>. Ultimately, BGP Flowspec offers a potent solution to combat DDoS attacks, and it’s up to us as operators to harness its power responsibly.</p> <p>In conclusion, while DDoS attacks continue to pose significant challenges, solutions like BGP Flowspec provide hope for organizations seeking robust defense mechanisms. By understanding its capabilities, adopting best practices, and staying informed about industry developments, we can collectively strengthen our resilience against the ever-evolving threat landscape of DDoS attacks. As I said earlier, I am a big proponent of BGP Flowspec. If you are considering deploying it and have questions, <a href="https://www.linkedin.com/in/justinryburn/">feel free to reach out to me</a>, and I will be happy to answer what I can.</p><![CDATA[NMS Migration Made Easy: Moving Forward with Kentik NMS]]><![CDATA[The time has come to switch to your new network monitoring solution. Remember, if an alert, report, ticket, or notification does not spark joy, get rid of it. We’ll cover how to spin up Kentik NMS and ensure you're ready to sunset your old solution for good.]]>https://www.kentik.com/blog/nms-migration-made-easy-moving-foward-with-kentik-nmshttps://www.kentik.com/blog/nms-migration-made-easy-moving-foward-with-kentik-nms<![CDATA[Leon Adato]]>Thu, 02 May 2024 04:00:00 GMT<p>I’m back again with another step in the journey from your old monitoring solution to Kentik NMS. Previously, I discussed the <a href="https://www.kentik.com/blog/nms-migration-made-easy-gathering-information/">information you need to gather</a> in preparation for the big move and <a href="https://www.kentik.com/blog/nms-migration-made-easy-get-stakeholders-aligned/">how you can get the rest of the organization on board</a>. In this post, I’m going to dig into what it looks like when Kentik NMS comes fully online and how you know when it’s safe to turn off the old technology.</p> <p>Of course, if you want to get the complete 33-page story all at once, you can always download the NMS Migration Guide.</p> <div as="Promo"></div> <p>But you’re already here, just one sentence away from network observability triumph.
You might as well read on and bask in the glory.</p> <h2 id="step-3-launch-the-alternative-kentik-duh">Step 3: Launch the alternative (Kentik. Duh.)</h2> <img src="//images.ctfassets.net/6yom6slo28h2/1iePIvpmmcTnJuSr4ruz6Z/61be99ecc2cfdc5bdc3611c18ba52701/nms-guide-wand.png" style="max-width: 350px;" class="image right" alt="Magic wand" /> <p>You got the inventory. You got the buy-in. Now it’s time to get to work.</p> <p>Installing NMS is reasonably trivial – it’s a single curl command or Docker run statement. So much so that we won’t chew up space here going over it.</p> <p>Instead, we will wave our magic wand and say you have Kentik installed and your devices (whether it’s a sample representation or your whole environment) loaded up and sending data.</p> <p>Now what? How do you validate that you’re collecting everything you need and are ready to move forward in the migration?</p> <p>For starters, plan to run in parallel for some time. How long that takes depends on your environment’s size, complexity, and organizational risk tolerance. Some folks run parallel for a week or two, and others keep it going much longer.</p> <p>Because SNMP and streaming telemetry create such a slight drag, pulling the same data into two systems is usually fine.</p> <p>What <em>can</em> be an issue are integrated tools like ticketing systems, email queues, and chat-ops. Nobody wants double tickets, emails, or pop-ups.</p> <p>Therefore, we recommend directing the output of one system or the other to a centralized log or even a shared email box. One system will still be responsible for cutting tickets, sending out notifications, etc. The other will output the information it <em>would</em> be sending to those targets to validate there’s been no loss of signal or fidelity (either by checking message by message or as a daily or weekly side-by-side comparison). Once apples-to-apples has been confirmed, the switch for <em>that</em> system (email, tickets, etc.) can be made, with the old system now writing to a log file for validation and Kentik NMS sending to the real destination.</p> <p>This phase of the process also presents opportunities and risks that many organizations fail to anticipate and are thus caught short.</p> <p>It’s a time of opportunity because, even though you want to validate like-for-like, there’s every chance that when you review specific messages, ticket types, and so on, you (collectively, as a team) will realize some of them are no longer needed. You should wholeheartedly embrace this opportunity to simplify. If an alert, report, ticket, or notification does not spark joy, get rid of it.</p> <p>That points to the inherent risk. As you validate those outbound messages and connections, it’s incredibly easy to fall into the trap of “fixing” them—adding functionality or elements they didn’t initially have. While some tweaking is unavoidable, try to minimize this as much as possible.
You have plenty on your plate with the migration itself.</p> <p>Feel free to keep a log of enhancement requests and bug fixes, and plan to return to that list once the migration is complete.</p> <p>Finally, this is a moment when you can leverage another aspect of Kentik’s platform: <a href="https://www.kentik.com/go/offer/journeys-early-access/">Query Assistant and Journeys</a>.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/bEBMKNY8fpdzz4a2kLut3/dd8896ff572bb65efc31f01622086827/journeys-traffic-by-country.png" style="max-width: 800px;" thumbnail withFrame class="image center" alt="Journeys AI" /> <p>We know it’s hard to get through a random 15 minutes without some marketing type popping out of your screen shouting about AI, LLMs, and Chat-jippity (yes, that <em>is</em> how you pronounce it).</p> <p>That said, Kentik has integrated a dedicated LLM that is custom-trained to respond to questions and language relating to network monitoring and observability. If you struggle to get Kentik NMS to show you the same data you are used to seeing in the old system, you have an “easy button” in the form of the Query Assistant and Journeys. Ask, and ye shall receive.</p> <p>Believe it or not, that’s all we have to say about this step. A lot of the work involved here is self-evident. So get to work, and know we’re here to help if needed.</p> <h2 id="step-4-wind-down-infrastructure">Step 4: Wind down infrastructure</h2> <p>In our hypothetical scenario, by this point, things are working well! Maybe even better than that! You’re comfortable enough with the data coming out of Kentik that all external integrations have been cut over, and your old solution is in the corner, muttering output into a glass of log files and wondering where they went wrong.</p> <p>There comes a point when everyone can agree it’s no longer necessary to maintain the facade that the old monitoring tool matters. Not only that, but the ongoing cost of upkeep—the license for the tool, the operating system, the database, and such—is a drag on the budget.</p> <p>Take one more backup of the data – in the case of SolarWinds, that’s the SQL database, but for other tools, it’s wherever your old monitoring solution kept its up-to-the-minute metrics. You should only need to back up some of the other items and elements we listed in Step 1, since you’ve replicated it all in Kentik NMS by this point.</p> <p>Once that’s done, maybe get the team together to raise a glass to mark the occasion—this liminal moment between the tired old monitoring tool that was good for its time but couldn’t keep up—and the bright new future where the necessary bases (like SNMP) are covered, and new capabilities (like streaming telemetry) are to be explored.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7uYH2ZMi1tOWzRLQQ4yXCH/087904088e3a70618b902373392e6859/nms-guide-money-trail.png" style="max-width: 350px;" class="image right" alt="Follow the money" /> <p>Okay, poetic faffing about aside, what do you need to actually do? For starters, “follow the money.”</p> <p>What we mean is, look at the elements of the old monitoring solution you continue to pay for – are there one or more servers in a rack somewhere still under maintenance? Indeed, there are payments for the tool itself.
Then, there are the previously mentioned OS and database licenses.</p> <p>Ensure you have a handle on all those things, as well as license keys, vendor IDs, etc., or else you may find yourself making an unwanted trip into the data center to restart a server where you thought you’d typed “shutdown /s” for the last time.</p> <p>Speaking of the data center, work with that team to understand the physical connections to your on-prem equipment. Reclaiming switch ports is serious business.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/63awRTAwkWdQ6s81MxnFkG/ca0ed59f6340589f7ef9cc82927f69fe/nms-guide-pirate-map.jpg" style="max-width: 450px; padding: 0;" class="image right" alt="Pirate map" /> <p>To be honest, untangling years of infrastructure investment shouldn’t take all that long, but without awareness and planning, it can drag on for way longer than necessary and distract you from the other essential tasks on your plate.</p> <p>Once this is done, you’re well and truly finished from a technical standpoint. But as Roz from “Monsters, Inc.” would remind you, don’t forget to do the paperwork. That means notifying your purchasing folks and letting them know they should not renew the contract with your old monitoring tool vendor. Then be a stand-up customer (not to mention a mensch): Pick up the phone and have a necessary (but understandably hard) conversation with your sales rep from the old vendor to let them know you won’t be continuing the contract. Sure, it will be uncomfortable, but it’s the right thing to do.</p> <h2 id="conclusion-life-love-and-network-observability">Conclusion: Life, love, and network observability</h2> <p>This may be the end of this post, but it represents the beginning of the rest of our journey. Network monitoring is an old topic – stretching back more than 30 years to the inception of SNMP itself. However, despite a full field of solutions, Kentik decided to build and launch NMS specifically because the older tools failed to keep up with the newer technologies, realities, and paradigms that make up modern networks.</p> <p>While the focus within this series has been to take what you had (primarily reliant on ping and SNMP) and move it over, the truth is that new technologies like streaming telemetry will provide you with vastly more meaningful insights and results.</p> <p>Equally important are capabilities within the broader Kentik platform – NetFlow is neither novel nor new, but its value is difficult to overstate. The same can be said for synthetics, VPC flow log insights, Kubernetes monitoring, DDoS detection and mitigation, botnet discovery, and all of the other benefits, large and small, that come as part of the Kentik platform.</p> <p>So it turns out that was just a little bit of a marketing-laden teaser. But if you’ve read this far, you’re probably on board for the whole ride.</p> <p>And we appreciate your willingness to see this through to the end. Here at Kentik, we know you have a lot of demands on your attention. We hope the time you’ve spent with us was valuable to you – that we’ve not only answered the questions you had when you began reading but have also raised new ideas you hadn’t even considered and shown you some new techniques along the way.</p> <h2 id="have-i-left-anything-out">Have I left anything out?</h2> <p>We’ve come to the end of our migration journey. Certainly, what you see above is the end of the <a href="https://www.kentik.com/go/ebook/nms-migration-guide/">migration guide</a>.
But I think I have one more bit of insight for the stalwart IT practitioner. If you’re up to it, tune in next week for a final thought.</p> <p>Until then, may your packets flow and all your routes be solidly deterministic.</p><![CDATA[RPKI ROV Deployment Reaches Major Milestone]]><![CDATA[In this blog post, BGP experts Doug Madory of Kentik and Job Snijders of Fastly review the latest RPKI ROV deployment metrics in light of a major milestone. ]]>https://www.kentik.com/blog/rpki-rov-deployment-reaches-major-milestonehttps://www.kentik.com/blog/rpki-rov-deployment-reaches-major-milestone<![CDATA[Doug Madory, Job Snijders]]>Wed, 01 May 2024 04:00:00 GMT<p>As of today, May 1, 2024, internet routing security passed an important milestone. For the first time in the history of RPKI (Resource Public Key Infrastructure), the majority of IPv4 routes in the global routing table are covered by Route Origin Authorizations (ROAs), according to the <a href="https://rpki-monitor.antd.nist.gov/ROV">NIST RPKI Monitor</a>. IPv6 crossed this milestone late last year.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/Y2VenxZWr6FFte4aHU90X/e5ca2f5dfa9f3fc7d8be1c4b24b6b93d/rpki-rov-analysis-2024.png" style="max-width: 600px;" class="image center" alt="RPKI-ROV Analysis on Unique Prefix-Origin Pairs (IPv4)" /> <p>In light of this milestone, let’s take the opportunity to update the figures for RPKI ROV (Route Origin Validation) adoption we’ve been publishing in recent years.</p> <p>As you may already know, RPKI ROV continues to be the best defense against accidental BGP hijacks and origination leaks. For ROV to do its job (rejecting RPKI-invalid routes), two steps must be taken:</p> <ol> <li>ROAs must be created.</li> <li>ASes must reject routes that aren’t consistent with the ROAs.</li> </ol> <p>The <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">first part of this analysis</a> began when we explored the first step of ROV: ROA creation. Two years ago at NANOG 84, Doug presented his analysis, which showed that we were, in fact, farther along in ROA creation than could be ascertained by analyzing BGP alone. Utilizing Kentik’s aggregate NetFlow, he showed that <em>the majority</em> of traffic (measured in bits/sec) was heading to routes with ROAs, despite only <em>one third</em> of BGP routes having ROAs.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/dLd27cJo8Ds?si=6wXr8QkRSIPDXHeI" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> </div> <div as="Promo"></div> <p>This discrepancy was due to the fact that major content providers and eyeball networks had completed RPKI deployments in recent years and account for a disproportionate share of internet traffic volume. Of course, traffic volume isn’t the only criterion for achievement — there is plenty of traffic that is critical, but not voluminous (e.g., DNS). 
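</p> <p>To make the traffic-weighted view concrete, here is a minimal sketch of the aggregation idea in Python. The flow records and field names are hypothetical stand-ins, not Kentik’s actual schema:</p> <pre><code>from collections import defaultdict

# Hypothetical flow records, each already annotated with an RPKI evaluation.
flows = [
    {"bytes": 1_200_000, "rpki_status": "valid"},
    {"bytes": 300_000, "rpki_status": "unknown"},
    {"bytes": 15_000, "rpki_status": "invalid"},
]

# Sum traffic volume per RPKI evaluation, then report each share.
totals = defaultdict(int)
for flow in flows:
    totals[flow["rpki_status"]] += flow["bytes"]

total_bytes = sum(totals.values())
for status, nbytes in sorted(totals.items()):
    print(f"{status}: {100 * nbytes / total_bytes:.1f}% of traffic")
</code></pre> <p>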
The idea was simply to provide another dimension to consider our progress in deploying RPKI ROV.</p> <p>To measure the second step of ROV (rejection of invalids), we <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">looked at the differences</a> in propagation based on a route’s RPKI evaluation. The conclusion at the time was that invalid routes could achieve a propagation no greater than 50% of the BGP sources in <a href="https://www.routeviews.org/routeviews/">Routeviews</a>, the public BGP repository from the University of Oregon, now managed by the <a href="https://nsrc.org">Network Startup Resource Center</a>. Oftentimes invalids are propagated far less than 50% — it all depends on the upstreams involved.</p> <p>The dramatic reduction in propagation of RPKI-invalid routes can be primarily attributed to the tier-1 backbone providers that reject invalids. These providers cast a long shadow with their outsized influence on internet routing. Regardless, the reduction in propagation is RPKI ROV doing its thing: suppressing problematic routes so they don’t cause disruption.</p> <div as="WistiaVideo" videoId="qpfh2174t5" audio></div> <h2 id="roa-route-origin-authorization-creation-update">ROA (Route Origin Authorization) creation update</h2> <p>As mentioned above, over <a href="https://rpki-monitor.antd.nist.gov/ROV">50% of IPv4 routes</a> in the global routing table now have ROAs and are evaluated as valid (<a href="https://rpki-monitor.antd.nist.gov/ROV/20240404.06/All/All/6">with IPv6 at 52%</a>). Let’s check what that means for Kentik’s aggregate NetFlow.</p> <p>According to our <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">analysis two years ago</a>, we had roughly one third of routes with ROAs and just over 50% of internet traffic as “valid” (traffic to routes evaluated as valid in bits/sec). Now, with over half of IPv4 routes with ROAs, our current aggregate NetFlow reveals a whopping 70.3% of internet traffic being valid!</p> <img src="//images.ctfassets.net/6yom6slo28h2/2660fb9slFcaSmIMJmD2ZX/d461f0b14a6290390670ce1b2e1b30f5/internet-traffic-rpki-202405.png" style="max-width: 650px;" class="image center" alt="Internet traffic volume by RPKI evaluation" /> <p>How much higher can this metric go? It remains to be seen. As depicted below in another NIST diagram, the upward slope of the percentage of routes with ROAs has held remarkably steady for the past four years. It stands to reason we will eventually see the slope flatten out as the number of easy wins begins to dwindle. However, it is important to recognize the progress made to date.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2HVt0oYPmtVkO46XacxXCX/6a660e0813965cc9e388ed1d5f1cbaf8/rpki-rov-history.png" style="max-width: 800px;" class="image center" alt="RPKI-ROV History of Unique Prefix-Origin Pairs" /> <h2 id="invalid-route-propagation-update">Invalid route propagation update</h2> <p>The aforementioned progress in the creation of ROAs is <em>useless</em> if networks are not rejecting RPKI-invalid BGP routes. 
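</p> <p>As a refresher on what “invalid” means here, below is a toy sketch of origin validation against a simplified ROA list. The prefixes and ASNs are illustrative examples, and real validators implement RFC 6811 with far more care:</p> <pre><code>import ipaddress

# Illustrative ROAs: (authorized prefix, maxLength, authorized origin ASN).
roas = [
    (ipaddress.ip_network("192.0.2.0/24"), 24, 64500),
]

def rov_state(prefix_str, origin_asn):
    """Classify a route as valid, invalid, or not-found (RFC 6811 style)."""
    prefix = ipaddress.ip_network(prefix_str)
    covered = False
    for roa_prefix, max_len, asn in roas:
        if prefix.version == roa_prefix.version and prefix.subnet_of(roa_prefix):
            covered = True  # some ROA covers this prefix
            if asn == origin_asn and max_len >= prefix.prefixlen:
                return "valid"
    return "invalid" if covered else "not-found"

print(rov_state("192.0.2.0/24", 64500))   # valid
print(rov_state("192.0.2.0/24", 64666))   # invalid (wrong origin)
print(rov_state("203.0.113.0/24", 64500)) # not-found (no covering ROA)
</code></pre> <p>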
So, the next step in understanding where we are with RPKI ROV adoption is to better understand the degree to which the internet rejects RPKI-invalid routes.</p> <p>Among the internet’s largest (i.e., transit-free) transit providers, all but a couple were rejecting RPKI-invalid routes when we published our post, <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">How much does RPKI ROV reduce the propagation of invalid routes?</a> As a result, we concluded that “the evaluation of a route as invalid reduces its propagation by anywhere between one half to two thirds.”</p> <p>Now, two years later, we can explore how this metric has evolved over this period of time. Using historical RPKI data made available via Job’s <a href="http://www.RPKIviews.org">RPKIviews</a> site and BGP routing data from <a href="http://routeviews.org">Routeviews</a>, we evaluated the IPv4 global routing table every month going back to the beginning of 2022 to determine how the propagation of RPKI-invalid routes has changed over time.</p> <p>Recall that in this methodology, we measure the propagation of a route by counting how many Routeviews vantage points have the route in their tables. More vantage points means greater propagation. For more explanation on this approach, see our <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">invalid route propagation analysis</a>.</p> <p>The graphic below shows the average number of Routeviews vantage points for each RPKI-invalid route over time. We only include routes seen by at least 10 vantage points to avoid internal routes shared with Routeviews vantage points. At the beginning of the plot, we identify 4,978 RPKI-invalid routes that were seen, on average, by 82.5 vantage points. In the last data point from April 1, 2024, we observe 4,211 RPKI-invalid routes seen, on average, by 62.5 vantage points. <em>Note, we used a well-known globally routed prefix (Google’s 8.8.8.0/24) as a control prefix to account for the effects of temporary changes in the count of Routeviews vantage points.</em></p> <img src="//images.ctfassets.net/6yom6slo28h2/1564S5ISl7zJOkxq5p4Yus/ba69bd2a8021c56bbc661045e49d21a3/average-propagation-rpki-invalids.png" style="max-width: 650px;" class="image center" alt="Average propagation of RPKI-invalids through time" /> <p>The main challenge to this type of analysis is that it is quite noisy. The set of persistently RPKI-invalid routes does not stay constant, and propagation is heavily influenced by which providers are transiting a route. Those challenges aside, the analysis above shows a 24% decline in the propagation of RPKI-invalid routes since the beginning of 2022.</p> <p>To explore this phenomenon further, we can take a look at the routing of intentionally RPKI-invalid routes over time and see that they also experience a similar decline in propagation.</p> <p>RIPE NCC announces numerous “<a href="https://ris.ripe.net/docs/routing-beacons/">Routing Beacons</a>” for measurement purposes. Among these are routes that are intentionally RPKI-invalid (and others that are RPKI-valid, as a control). Not to be outdone, Job also announces RPKI-invalid routes along with a control route from his network, <a href="https://bgp.tools/as/15562">AS15562</a>.</p> <p>Below is a graphic displaying the Routeviews vantage point count for each of these measurement routes over time. 
The plots corresponding to the RPKI-invalid routes appear in the lower portion of the graphs, in keeping with our observation that RPKI-invalid routes propagate significantly less.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4iHA4IC8b0Q35GJTvg7htK/bd02a3a97a3c52f54a68d124a82033fd/routeviews-graphs.png" style="max-width: 800px;" class="image center" alt="Graphs showing Routeviews vantage points" /> <p>The three plots in this graphic all show a noticeable decline in the number of vantage points observing the various RPKI-invalid routes. This decline matches the drop in the average number of vantage points observing any given RPKI-invalid route from earlier.</p> <p>There is one final observation to make based on this analysis. In the panel on the right (“Job’s Beacons”), there are two RPKI-invalid routes with slightly differing degrees of propagation.</p> <p>209.24.0.0/24 (green) has its ROA published via the ARIN Trust Anchor Locator (TAL), while the ROA for 194.32.71.0/24 (orange) is published via the RIPE TAL. A TAL is a file with a public key used by Relying Parties to retrieve RPKI data from a repository.</p> <p>The likely issue is that using the ARIN TAL requires agreeing to a <a href="https://www.arin.net/resources/manage/rpki/rpa.pdf">lengthy Relying Party Agreement</a>, which some providers refuse to do. As a result, ROAs published by ARIN are seen by slightly fewer networks that reject RPKI-invalid routes, decreasing the efficacy of RPKI for ARIN-managed IP space.</p> <p>ARIN’s strong indemnification clause comes from its worry about being sued due to something that happens as a result of the data it publishes in the RPKI. This obstacle to RPKI ROV adoption was covered in a 2019 academic article, <a href="https://papers.ssrn.com/sol3/Papers.cfm?abstract_id=3308619">Lowering Legal Barriers to RPKI Adoption</a> by professors Christopher S. Yoo and David A. Wishnick of the University of Pennsylvania.</p> <p>But alas, let’s get back to the progress we’re seeing in the rejection of RPKI-invalids.</p> <p>At the beginning of this section, we mentioned how all but two transit-free providers were rejecting RPKI-invalid routes. Well, the other milestone that occurred this past month is that that number dropped to just one as US telecom operator Zayo (AS6461) began rejecting RPKI-invalid routes from its customers.</p> <p>In 2022, Zayo <a href="https://mailman.nanog.org/pipermail/nanog/2022-August/220287.html">announced</a> that it had begun rejecting RPKI-invalids from its settlement-free peers. However, since nearly all of its big peers were already rejecting those routes, the impact was relatively minor.</p> <p>But on April 1, we saw AS6461 begin rejecting RPKI-invalids from customers for the first time. In the Kentik visualization below, RPKI-invalid route <a href="https://rpki-validator.ripe.net/ui/103.36.106.0%2F24?validate-bgp=true">103.36.106.0/24</a> stopped being transited by AS6461 at 16:24 UTC.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6vU81SQueWPh5Nmn0DdOj1/aeac04354f033dc1a096e957a9ac447f/rpji-invalid-as6461.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik platform showing RPKI-invalid route stopped" /> <p>The rollout of Zayo’s rejection of RPKI-invalids was done in phases, and a couple of weeks later we started seeing other parts of their global network rejecting RPKI-invalids. 
At 18:54 UTC on April 12, we observed AS6461 begin rejecting RIPE’s RPKI-invalid beacons, 93.175.147.0/24 and 2001:7fb:fd03::/48, for the first time.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7I4ijmD9NNRPNVRob9o4a/a149f22c62bb1d675459617d618de2fa/ripe-rpki-beacon.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="RIPE's RPKI-invalids rejected" /> <p>Once completed, we expect Zayo’s rejection of RPKI-invalid routes from its customer base to continue to lower the propagation of these problematic routes, reducing the risk of traffic disruption or misdirection due to many types of routing mishaps.</p> <p>And finally, for anyone still skeptical about the degree to which invalid routes are being rejected, may we direct your attention to the <a href="https://www.kentik.com/blog/digging-into-the-orange-espana-hack/">Orange España outage</a> in January. I summarized the incident in a blog post published the day after the hack.</p> <blockquote>Using a password found in a public leak of stolen credentials, a hacker was able to log into Orange España’s RIPE NCC portal using the password “ripeadmin.” Oops! Once in, this individual began altering Orange España’s RPKI configuration, rendering many of its BGP routes RPKI-invalid.</blockquote> <p>The wielding of RPKI as a tool for denial of service was only possible due to the pervasive extent to which ASNs reject RPKI-invalid routes.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5RUbKtcNvHVab00CmS87El/771c761714138e6f2798c65d595164c6/rpki-orange-espana.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Propagation as seen in Orange Espana chart" /> <div class="caption" style="margin-top: -35px;">Propagation for an Orange España route dropped to less than 20% during the attack.</div> <h2 id="conclusion-benefits-of-deploying-rpki">Conclusion: Benefits of deploying RPKI</h2> <p>In our <a href="https://www.kentik.com/blog/exploring-the-latest-rpki-rov-adoption-numbers/">blog post</a> from one year ago, we made the following bold prediction:</p> <blockquote> If we are to assume steady growth of the share of BGP routes with ROAs, it should become the majority case in about a year from now (May 2024). Mark your calendars! <img src="//images.ctfassets.net/6yom6slo28h2/7g0IkHcVNSYaDiy3ptCXET/04983302a6280835c6e7c8c3475f9627/rpki-rov-trend.png" style="max-width: 700px;" class="image center" alt="RPKI-ROV History of Unique Prefix-Origin Pairs - Trend" /> </blockquote> <p>In December, we polled fellow BGP nerds on <a href="https://twitter.com/DougMadory/status/1735707501750788500">Twitter/X</a> and <a href="https://www.linkedin.com/posts/dougmadory_ok-bgprpki-nerds-whats-your-prediction-activity-7142512375504535552-kaO3">LinkedIn</a> about when they believed we would hit this mark, and they were decidedly more pessimistic than the prediction above:</p> <img src="//images.ctfassets.net/6yom6slo28h2/oqEv9qUVzeNEKqKDvFY1Y/2fb7c8af4fcaf8c13201332613bf307d/li-poll.png" style="max-width: 500px;" class="image center" alt="Results of LinkedIn poll on BGP routes" /> <p>The progress detailed in this blog post was years in the making and involved the dedicated efforts of hundreds of engineers at dozens of companies. 
Improving the security of the global internet routing system is not a small task and will continue to be a years-long effort.</p> <p>Each of the two lines of analysis from this post should serve as motivation for additional networks to deploy RPKI ROV.</p> <ol> <li><strong>Reject RPKI-invalid BGP routes on eBGP sessions.</strong> Given that the majority of internet routes are covered by ROAs (including a supermajority of traffic), network operators should reject RPKI-invalid routes to avoid mistakenly egressing customer traffic towards mis-originated routes.</li> <li><strong>Create ROAs.</strong> And given the scale to which RPKI-invalid routes are suppressed, it would benefit resource holders to create ROAs for their address ranges to enable networks around the world to automatically reject mis-originated routes.</li> </ol> <p>Networks that do so enjoy immediate benefits!</p> <p>But RPKI ROV doesn’t solve all of the issues surrounding internet routing security. In fact, this is only an opening salvo towards addressing the various “determined adversary” scenarios best characterized by the <a href="https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services/">recent attacks against cryptocurrency services</a>. These attacks take advantage of existing weaknesses in internet security that we will need to work to limit by building off the progress made by routing security mechanisms like RPKI ROV.</p><![CDATA[Navigating the Complexities of Hybrid and Multi-cloud Environments]]><![CDATA[Today's evolving digital landscape requires both hybrid cloud and multi-cloud strategies to drive efficiency, innovation, and scalability. But this means more complexity and a unique set of challenges for network and cloud engineers, particularly when it comes to managing and gaining visibility across these environments. ]]>https://www.kentik.com/blog/navigating-the-complexities-of-hybrid-and-multi-cloud-environmentshttps://www.kentik.com/blog/navigating-the-complexities-of-hybrid-and-multi-cloud-environments<![CDATA[Phil Gervasi]]>Tue, 30 Apr 2024 04:00:00 GMT<p>General Eric K. Shinseki once said, “If you don’t like change, you’re going to like irrelevance a lot less.” For IT operations, the shifting strategies, technologies, and politics around public cloud exemplify this perfectly.</p> <p>In 2006, AWS launched Elastic Compute Cloud (EC2), marking the advent of the public cloud provider. After a time, many IT leaders were determined to have nothing on-premises and put everything in a public cloud. Whether that was for reducing cost, improving developer productivity, reducing operational burden, or improving the ability to scale, engineers had to re-tool, architectures had to change, and the bean counters had to shift from a purely CapEx model to OpEx.</p> <p>As resources began moving to the cloud, many organizations became hybrid cloud environments whether they wanted to or not, as some resources were migrated, others were not, and some were in process.</p> <p>However, many saw over time that some resources performed better and were cheaper to run on-premises in their own data centers. And so network engineers, sysadmins, and newly minted “cloud engineers” saw another shift to architectures running a mixture of public cloud and private data centers – what we now refer to as hybrid cloud.</p> <p>Container environments, SD-WAN, and the rise of remote work all certainly played a role, as did various other technical, financial, and political factors. 
Not long ago, we saw another shift to new designs with multiple cloud service providers, including SaaS.</p> <p>Today’s evolving digital landscape leverages both hybrid cloud and multi-cloud strategies to drive efficiency, innovation, and scalability. However, this has meant more complexity and a unique set of challenges for network and cloud engineers, particularly when it comes to managing and gaining visibility across these environments.</p> <h2 id="understanding-hybrid-and-multi-cloud-challenges">Understanding hybrid and multi-cloud challenges</h2> <p>Today’s <a href="https://www.kentik.com/product/multi-cloud-observability/">hybrid and multi-cloud environments</a> pose new challenges for IT teams, especially those that have both.</p> <p>First, connectivity is more complex. Each cloud service provider (CSP) has its own set of APIs, networking constructs, capabilities, and management tools. Engineers managing the connectivity to multiple clouds must somehow manage these disparate systems collectively.</p> <p>Second, because we’re dealing with different providers, there is potential for inconsistent security policies. Especially in today’s cybersecurity landscape, ensuring consistent <a href="https://www.kentik.com/kentipedia/cloud-security-policy-management/" title="Cloud Security Policy Management: Definitions, Benefits, and Challenges">security policies across all platforms</a> is absolutely critical. Each environment might have different security controls and compliance requirements, not to mention the mechanisms used to enforce policies may vary from CSP to CSP.</p> <p>Setting up effective and accurate security policies is difficult. Good governance in a public cloud environment is even harder because different teams are often deploying simultaneously. This is compounded further when trying to do that across multiple clouds and aligning public cloud with on-premises resources.</p> <p>Next, remember that we’re using resources on-premises, in the cloud, and from SaaS providers to deliver services, generally in the form of applications, to actual people. This makes performance monitoring vital to ensuring a great user experience.</p> <p>The problem is that the different CSPs offer different levels of network performance, reliability, and visibility. This may be due to their internal infrastructure, where their data centers are located in the world, and what internet service providers are involved. When workloads rely upon components that span multiple environments, monitoring all these factors to understand and troubleshoot performance issues is very difficult for engineers, mainly because each CSP tends to have its own visibility and monitoring tools that don’t work across all clouds.</p> <p>Visibility doesn’t just mean performance, though. Also, consider that CSPs charge their customers for several services, including the ingress or egress of data from their cloud. <a href="https://www.kentik.com/blog/cloud-cost-optimization-best-practices/">Managing costs in multi-cloud environments</a> can quickly become a disaster without proper visibility. Engineers need tools that provide insights into resource utilization and cost-efficiency across all platforms. Today, not only do engineers have technical alerts for their infrastructure, but business leaders also have billing alerts when it comes to cloud monitoring. 
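</p> <p>As a simple illustration, here is a minimal sketch of what such a billing alert can boil down to. The daily egress totals and the doubling threshold are hypothetical examples, not recommendations:</p> <pre><code># Hypothetical daily egress totals in GB; a real system would derive these
# from flow data rather than a hard-coded list.
daily_egress_gb = [410, 395, 402, 388, 1210]

history, latest = daily_egress_gb[:-1], daily_egress_gb[-1]
baseline = sum(history) / len(history)

# Flag the day when egress more than doubles versus the recent baseline.
if latest > 2 * baseline:
    print(f"ALERT: egress {latest} GB is {latest / baseline:.1f}x the {baseline:.0f} GB baseline")
</code></pre> <p>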
Surprise cloud bills are a problem for budget-conscious IT teams, and alerting on significant traffic changes will give decision makers the information they need long before the bill comes.</p> <p>Lastly, remember that even though our cloud resources are in someone else’s data center, often across the globe, data sovereignty and compliance remain a top priority. Adhering to legal and regulatory requirements across geographic and digital boundaries is complex, especially when data resides across multiple clouds and regions.</p> <h2 id="kentiks-role-in-enhancing-cloud-network-visibility">Kentik’s role in enhancing cloud network visibility</h2> <p>Comprehensive visibility into network operations across all environments is critical to address these challenges. This is where Kentik’s platform excels. Kentik is vendor-agnostic when it comes to CSPs, so it provides a unified network analytics and visibility platform designed to work across modern hybrid and multi-cloud architectures.</p> <div as="WistiaVideo" videoId="xogyx13gzl" audio></div> <div class="caption" style="margin-top: -20px;">Navigating the Complexities of Hybrid and Multi-cloud Environments with Phil Gervasi</div> <h3 id="comprehensive-data-collection">Comprehensive data collection</h3> <p>Comprehensive visibility means Kentik ingests data types from a wide variety of sources.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3iSnubdxG8jgDVU5YzmU8k/e7726ca1b9fe2a0fa330b55933acb5d4/comprehensive-data-collection1.png" style="max-width: 650px;" class="image center" alt="Chart showing types of network data that Kentik monitors" /> <p>From the CSPs, Kentik ingests:</p> <ul> <li>AWS VPC flow logs</li> <li>AWS cloud metrics</li> <li>Azure NSG flow logs</li> <li>Azure Firewall logs</li> <li>Azure cloud metrics</li> <li>Google Cloud VPC flow logs</li> <li>Google Cloud metrics</li> <li>OCI VCN flow logs</li> <li>OCI cloud metrics</li> </ul> <p>From on-premises infrastructure, Kentik ingests:</p> <ul> <li>Flow logs</li> <li>SNMP information</li> <li>Streaming telemetry</li> <li>IPAM information</li> <li>DNS information</li> <li>Application and security tag information</li> </ul> <p>From SaaS providers, Kentik monitors:</p> <ul> <li>Packet loss, latency, and jitter from hundreds of vantage points to each SaaS provider</li> <li>Path tracing between vantage points and individual points of presence</li> </ul> <p>For internet service providers, Kentik ingests:</p> <ul> <li>The global routing table</li> <li>Path tracing among agents and points of presence</li> <li>BGP information</li> </ul> <p>This comprehensive data collection gives engineers a unified view of network performance across all cloud regions, providers, containers, and environments, including campus networks, on-premises data centers, public clouds, SaaS providers, internet service providers, and the pathways connecting everything.</p> <h3 id="contextual-network-visibility">Contextual network visibility</h3> <p>All of this data lives in a single unified data repository (UDR), so it can be analyzed <em>in context</em>. This means that Kentik shows what’s happening and provides context on why it’s happening. For instance, Kentik can correlate traffic spikes in one part of the network with events in another. 
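</p> <p>As a toy illustration of what that correlation can look like, consider two hypothetical, time-aligned series; the values are invented purely to show the mechanics:</p> <pre><code>from statistics import correlation  # Python 3.10+

# Hypothetical series: egress from one VPC and request latency for a
# front end in another cloud, sampled over the same intervals.
vpc_egress_mbps = [120, 118, 125, 119, 480, 470, 122]
frontend_latency_ms = [35, 34, 36, 35, 210, 190, 37]

r = correlation(vpc_egress_mbps, frontend_latency_ms)
print(f"Pearson r = {r:.2f}")  # values near 1.0 suggest the spikes coincide
</code></pre> <p>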
This deep insight is crucial for proactive management and rapid troubleshooting.</p> <p>Kentik incorporates information from routing tables, DNS, security tags, and the application layer to help engineers understand the data as it relates to application performance. With this data, Kentik builds a clear visualization of these resources across on-premises data centers, containers, and clouds so that an engineer can clearly see, troubleshoot, and understand application traffic from end to end.</p> <p>Consider a typical problem in a multi-cloud environment. Imagine a Kubernetes environment in Google Cloud in which two container pods experience high latency communicating with each other. This, in turn, increases the delay in response to the request made by the web front-end service running in AWS. A delay experienced by the web front end normally means a longer page load or rendering time.</p> <p><em>In other words, we have a user experience degradation due to a problem in one cloud affecting the performance of resources in another cloud.</em></p> <h3 id="visibility-across-service-provider-networks">Visibility across service provider networks</h3> <p>Remember that the internet is the primary delivery mechanism for applications today, so monitoring ISPs is crucial for an overall hybrid and multi-cloud strategy.</p> <p>Kentik offers visibility into service provider networks, bridging the visibility gap between on-premises infrastructure and cloud resources. This is particularly beneficial for tracking the performance of ISP and cloud provider links, which are critical for end-to-end service delivery.</p> <p>This includes analyzing routing tables, BGP information, and the paths over the internet connecting our resources. With Kentik, we can see where a problem occurs with network performance between our end users and the public cloud resources they’re trying to consume.</p> <h2 id="mastering-the-complexity-of-multi-cloud-environments">Mastering the complexity of multi-cloud environments</h2> <p>Today, most organizations are hybrid, multi-cloud, or both. Architectures and our strategies for consuming new technologies keep changing, especially as we learn to manage them better.</p> <p>Network and cloud visibility have become more complex, but they’ve also become more critical. Kentik is the modern, vendor-agnostic solution to staying on top of change regardless of the architecture, vendor, or the latest initiative from the CIO’s office.</p><![CDATA[NMS Migration Made Easy: Get Stakeholders Aligned]]><![CDATA[Migrating network monitoring solutions is more than just a list of devices and technical requirements. It's also about ensuring that all stakeholders are aligned - from team members to customers and everyone in between.]]>https://www.kentik.com/blog/nms-migration-made-easy-get-stakeholders-alignedhttps://www.kentik.com/blog/nms-migration-made-easy-get-stakeholders-aligned<![CDATA[Leon Adato]]>Wed, 24 Apr 2024 04:00:00 GMT<p>As we mentioned last week, we are more than a little nuts about <a href="https://www.kentik.com/product/network-monitoring-system/">Kentik NMS</a>. It’s a monitoring solution that’s right-sized for today’s technology and use cases, with a nod to the past (SNMP) and an eye to the future (streaming telemetry).</p> <p>But technology migrations—whether we’re talking about a CRM, email, or monitoring—are never trivial. 
We wrote the NMS Migration Guide with that problem in mind, to help folks avoid some of the pitfalls and challenges of moving from their old solution into Kentik NMS.</p> <div as="Promo"></div> <p>That said, nobody’s time is infinite, so digesting a detailed ebook just isn’t in the cards for some folks. That’s why you’re reading part 2 of this series, where we take the major topics and concepts from the guide and provide them in a smaller and easier-to-consume package.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1eDo65ZMynTA1cqpUMPCGD/c1bad1bde90617131a1680f9d53eb849/nms-guide-ducks.jpg" style="max-width: 500px;" class="image center" alt="Get your ducks in a row" /> <p>The vast majority of us here at Kentik – from the CEO down – are folks who came up through the ranks of IT, even if our current jobs are less technical than past roles. So, we understand entirely the reluctance you may feel regarding the more political aspects of a systems migration. Still, we encourage you to work through this because, as a bit of old wisdom goes, most projects fail at <a href="https://en.wikipedia.org/wiki/Layer_8">layers 8, 9, or 10 of the OSI model</a>.</p> <h2 id="customers-consumers-and-constituents">Customers, consumers, and constituents</h2> <img src="//images.ctfassets.net/6yom6slo28h2/7AeovNY4snyuQpkSrO2YUY/e6ddea8fd640344c860314b70232f14a/customer-consumer-constituent.png" style="max-width: 500px;" class="image center" alt="Customers, consumers, and constituents definitions" /> <p>One of the first (and most pernicious) mental disconnects among IT practitioners is the misunderstanding of customers versus consumers (we’ll get to constituents in a minute). This is a problem vis-a-vis migrating to Kentik NMS because it causes folks to ask the wrong questions, fail to obtain authority, and ineffectively set expectations. So, let’s take a minute to thread this needle:</p> <ul> <li><strong>Customers</strong> are, broadly speaking, the person, people, department, or whatever that <em>pays</em> for something. They might not use it, but they are funding it.</li> <li>Conversely, <strong>consumers</strong> are the folks who use or benefit from the item, service, or whatever. To be explicit: Consumers (users) are not always customers (payers), and customers are not always consumers.</li> <li><strong>Constituents</strong> is a term I’m stretching to indicate people who care about the service or product, even if they neither pay for it nor use it.</li> </ul> <p>How do these three terms relate to monitoring and observability?</p> <p>Our experience at Kentik is that the <strong>customer</strong> is usually an executive sponsor or leader or sometimes a guiding organization like security or compliance. These folks know monitoring is needed in the organization, but neither consume the output nor particularly care (on a personal, visceral, experiential level) about its execution. They may care as individuals because they understand monitoring and observability are vital to the healthy operation of any business. But the semantics of <em>how</em> it’s done simply aren’t on their radar.</p> <p>Meanwhile, the <strong>consumers</strong> of monitoring and observability – which include the folks who design, install, and maintain Kentik in the environment as well as the teams who receive Kentik’s output (alerts, reports, dashboards, and more) – are almost always not the ones with purchasing authority.</p> <p>Finally, there are <strong>constituents</strong>. 
These folks use the applications and systems monitored by Kentik and supported by the consumers (typically Ops, DevOps, SREs, and such). They care about the health and performance of the systems they use to do their job but aren’t on the hook if they go down.</p> <p>We’ve spent time outlining this here because when you want to achieve stakeholder alignment, you must first understand who those stakeholders are and what kind of stake they have when migrating from your existing solution to Kentik NMS.</p> <p>With that done, we can begin to address how to get each of these groups on board and avoid the common pitfalls that cause migration efforts to fail.</p> <h2 id="monitoring-is-not-management">Monitoring is not management</h2> <p>The first, loudest, and longest complaint you’ll likely hear is that there are already solutions to monitor individual systems, and the company doesn’t need to “add another brick.” Almost always, the tools people talk about are vendor/platform-specific management tools with some monitoring aspects.</p> <img src="//images.ctfassets.net/6yom6slo28h2/57QkX6hrakYU3wGtK4uu5b/a694a28facd75edd2b2ec6b3a55af83a/nms-guide-brick.jpg" style="max-width: 500px;" class="image center" alt="Add another brick" /> <p>The problem is that one vendor’s tools can never adequately monitor other parts of the infrastructure, certainly not as comprehensively as something like Kentik. On the other hand, nothing will ever help manage equipment or applications as well as the purpose-built systems from the vendor itself.</p> <p>Ultimately, you’ll need to make clear to these folks that nobody is asking to remove or replace the management tools they rely on. However, using those tools as de facto monitoring and observability solutions is ineffective because they lack the range, flexibility, and interoperability to cover the overall corporate infrastructure the way purpose-built monitoring and observability tools can.</p> <h2 id="what-matters">What matters?</h2> <p>The next issue is a common fault of IT people—they confuse what matters in a tech context with what matters to the business at large. Just because most companies have become highly reliant on technical solutions to deliver products and services does not mean they are technology companies.</p> <p>A quote from tech pundit Bob Lewis holds as true today as it did when <a href="https://issurvivor.com/2004/02/02/the-theory-of-else/">he first said it back in 2004</a>: “There’s no such thing as an IT project. There are only business projects with an IT component.”</p> <p>Whether you can get behind this philosophy or not is beside the point. We promise it’s more accurate than not for the people running the business, making the decisions, and (most importantly) signing the checks for new monitoring solutions. No matter how different we wish it were, nobody ever won debate points describing the advantages of streaming telemetry versus SNMP to the CFO.</p> <p>You need to spend a few minutes learning about what matters to the business and then considering how to frame a monitoring migration in those terms.</p> <p>If that last sentence makes you want to throw your hands in the air and give up, don’t worry. We felt the same way, too. It’s not easy, and don’t let anyone try to tell you otherwise. But you <em>can</em> do it. 
Here are some helpful hints:</p> <ul> <li><strong>Discomfort over convenience</strong>: Technical folks often lead with how much better (meaning time-saving, easy to use, less brittle) a new system is versus the old. “The time we’ll save!” is the rallying cry. While we don’t want to paint all leaders with the same brush, saying, “I will have to work less hard on this,” is <em>not</em> a winning argument. On the other hand, “this thing is causing us pain [meaning loss of business], and here’s a solution” is incredibly convincing.</li> <li><strong>Identify critical systems</strong>: In any organization, some applications are seen as desperately important, and others are viewed as unessential. That may not be objectively true, but objective truth doesn’t matter here. You must know which business tools are mission-critical and attach your monitoring migration to those systems. It’s hard to say no when an IT practitioner shows how a new monitoring solution will maintain 30% more uptime on the critical application versus the old tool.</li> <li><strong>Understand the rhythm of the business</strong>: No matter how convincing your argument may be, you will get turned down if it’s the wrong time to ask. Whether it’s month-end closing, the (historically) worst quarter of the year, or the middle of an industry-wide downturn, you must be cognizant of the external business factors influencing your boss’s ability to say “yes.” You may not necessarily have to give up, but you must acknowledge the business landscape’s truth and account for it in your discussions.</li> <li><strong>Know whose baby you call ugly</strong>: Every application belongs to someone. The monitoring solution you want to replace was brought in-house by someone who (to more or less a degree) made their career on that success. If that employee is gone now, all the better. But in far more cases, they’re still here, and you really don’t want to give the impression you are calling their choice (or worse, their work) wrong. It may require some frank (and even uncomfortable) side conversations, but you’ll need to get past this hurdle or face a dragged-out period of bickering, backstabbing, whisper campaigns, or worse. And to be clear, this isn’t <em>just</em> for the monitoring tool you want to replace with Kentik NMS. It’s equally valid for what you want to monitor <em>with</em> Kentik NMS. Ensure the owner doesn’t interpret your arguments about improving application performance and uptime as an accusation that their baby is unstable.</li> <li><strong>There <em>are</em> no ugly babies</strong>: All applications and systems have issues, and it’s equally true that most perform their intended role well enough. When working with owners and teams, emphasize your desire to help make their stuff – and, by extension, them – look their best within the business and reduce the interruptions and unplanned work caused if things go sideways.</li> </ul> <p>Finally, <em>polite persistence pays</em>. Let’s say you did everything we suggested above and were still told “no.” Seasoned IT folks know a last, almost magical phrase to use:</p> <p>“What would you need to hear that would make you comfortable saying ‘yes’ to this project?”</p> <p>Most leaders want to approve projects and see things get done – especially when those things have the potential to increase the stability, performance, and velocity of the business. 
By asking what specific things drive a leader’s decision, you are demonstrating both an interest in and a willingness to understand the things that are important to them. It’s hard to overstate how powerful that can be.</p> <h2 id="become-an-it-polyglot">Become an IT polyglot</h2> <img src="//images.ctfassets.net/6yom6slo28h2/1wdxEOo1F25N6xxWowxR3X/3f6d23bdb5600f1a866b37467243c889/nms-guide-checklist.jpg" style="max-width: 500px;" class="image center" alt="NMS migration checklist" /> <p>What we’re getting at is that you need to become more fluent in the language of business. Many IT folks find this idea distasteful, if not downright disturbing. It’s as if we’re saying, “Eschew all your technical skills, let all your certifications lapse, vow never to touch another computer again, and become a Luddite hermit.”</p> <p>That’s simply not how languages work. Taking the time to learn Python (or French) doesn’t cause you to forget everything you knew about Java (or Klingon). Learning additional languages is usually an additive experience, where the new knowledge reinforces and enhances what you already know.</p> <p>So, learn business terms, concepts, and methods of relating ideas. Not because you need to “dumb it down.” That’s insulting and infantilizing to your colleagues who manage teams and run the company. Instead, think about standing on a corner in France and shouting at people walking by in English. You are not going to get much of a positive response. In that same vein, pushing technical arguments with increasing intensity at the business office isn’t going to magically make them more technical or care about your opinion more.</p> <p>So, what <em>is</em> this so-called language of business? It boils down to three things. You need to take your goal (migrating to Kentik NMS) and frame it as helping accomplish one of three basic things:</p> <ul> <li>Increase revenue</li> <li>Reduce cost</li> <li>Remove risk</li> </ul> <p>The good news is that monitoring and observability tools, in general, and Kentik NMS, in particular, are very good at numbers two and three. Creative IT practitioners can also find ways to make them fit into item number one.</p> <h2 id="whats-next">What’s next?</h2> <p>To be honest, if you’ve gotten this far—both in terms of reading the blog series and in the actual migration—the hard part is behind you. But that doesn’t mean there’s nothing of value left for me to tell you, for you to read, or for you and your team to do. So, I hope you’ll stick around for the third installment in our series.</p> <p>Or you could <a href="https://www.kentik.com/go/ebook/nms-migration-guide/">download the entire guide now</a> and skip the wait.</p><![CDATA[NMS Migration Made Easy: Gathering Information]]><![CDATA[Network monitoring tools have a lot of moving parts. Those parts end up getting stored in a wide range of locations, formats, and even the ways various capabilities are conceptualized. With that in mind, we're going to list out the information you should gather, the format(s) you should try to get it into, and why.]]>https://www.kentik.com/blog/nms-migration-made-easy-gathering-informationhttps://www.kentik.com/blog/nms-migration-made-easy-gathering-information<![CDATA[Leon Adato]]>Wed, 17 Apr 2024 04:00:00 GMT<p>Here at Kentik, we are (justifiably) excited about <a href="https://www.kentik.com/product/network-monitoring-system/">Kentik NMS</a>. 
It’s a monitoring solution that is right-sized for the job and built on a modern foundation — from concept to code — so that it can support your most valuable equipment. With the ability to collect data from traditional technology like SNMP and the latest techniques like streaming telemetry, Kentik NMS is positioned as the right solution for today’s network observability requirements.</p> <p>But NMS migration – making the move from whatever you have today – is a lot of work. We realize how big a job migrating systems can be, so we wrote an NMS migration guide to help make that process easier.</p> <div as="Promo"></div> <p>However, we recognize how precious your time is. Not everyone has time to read a comprehensive 33-page guide, so we thought we’d take some of its insights and knowledge and break it up into bite-size installments.</p> <p>Welcome to part 1: Gathering Information.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 0px; margin-top: 40px"></div> <h2 id="introduction">Introduction</h2> <p>Kentik NMS is a re-envisioning of traditional network monitoring tools, with all the familiar interfaces and options, but with a clear eye to both the current and future state of infrastructure, whether that’s on prem, in the cloud, in multiple clouds, or all of the above.</p> <p>This means Kentik NMS presents the opportunity to pivot away from older monitoring solutions – those with tired and ineffective interfaces, slow and inefficient architecture, limited scope, and decades-old code bases with all the technical debt that comes with them.</p> <p>You’re probably thinking, “Easier said than done!” Right? So are we. Kentik is a company built by engineers for engineers. We understand that a new solution simply <em>existing</em> doesn’t in and of itself solve the challenge of moving to the new system.</p> <p>Kentik has been in the monitoring and observability game for a long time – not just as a business but as individuals who work here. Many of us have decades of experience with systems migrations between databases, operating systems, development frameworks, data centers, monitoring solutions, and myriad technologies spanning decades of Moore’s Law iterations.</p> <p>That brings us to this post. The goal is to help you understand what steps are involved in a monitoring system migration – to help you validate your checklist and ensure no critical steps go unnoticed.</p> <p>It’s important to note that while this post specifically references SolarWinds, much of the information you’ll find here is relevant to just about every monitoring tool on the market. To be sure, the solution you’re using today will likely have peculiarities we won’t explicitly address (let’s face it – those “peculiarities” may be the reason you need to migrate in the first place!).</p> <p>We realize how big a job migrating systems can be. We’ve broken this post into steps to help you manage the work (not to mention the potential stress and anxiety) that large tasks like this tend to create. They are:</p> <ul> <li>Step 1: Gather inventory</li> <li>Step 2: Get stakeholders aligned</li> <li>Step 3: Launch the alternative (Kentik)</li> <li>Step 4: Wind down infrastructure</li> </ul> <p>Without a doubt, step one is the most extensive section and represents the most significant investment of time and effort. However, step 2 holds the most challenges for IT practitioners, who may not relish the political aspects of the job. 
Step 3 is the “moment of truth” section, where iteration, course correction, and education occur. Many people find step 4 the scariest one, as it holds such finality. Nevertheless, it’s an important step too.</p> <p>Whether you’re just at the point of considering NMS, currently kicking the tires, or have even taken the plunge as a paying customer on the cusp of flipping the switch, we invite you to pour another cup of your favorite beverage, settle into a comfortable chair, throw on a set of headphones, queue up your “music to kick ass to” playlist, and read on.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7yBaswtKyd6SWYwOJD5i3h/24350ad234503f94cac98b7c25c15aaa/girl-headphones.png" style="max-width: 500px;" class="image center" alt="Getting ready to read the NMS guide" /> <h2 id="step-1-gather-an-inventory">Step 1: Gather an inventory</h2> <p>“<em>Mise en place</em>” is important for more than just chefs in the kitchen. Any effective strategy begins with knowing what you have and where it is.</p> <p>Network monitoring tools (and SolarWinds is by no means unique in this regard) have many moving parts. Those parts end up being stored in a wide range of locations and formats, and the tools even differ in how various capabilities are conceptualized (is a report a unique thing, or is it just a different style of dashboard?).</p> <p>With that in mind, we will list the information you should gather, the format(s) you should try to get it into, and why. This list will be organized primarily with an emphasis on the things you’ll need for migrating to Kentik NMS but with a secondary focus on ensuring you have <em>all</em> your data available in case you need to go back and refer to it down the road when your current solution (and the systems it was running on) have been put to bed.</p> <h3 id="before-we-start-what-does-gather-mean">Before we start, what does “gather” mean?</h3> <p>Some people will begin scanning this list and become discouraged by the sheer volume of work it represents.</p> <p>I want to be clear that “gather” doesn’t mean you need to manually copy that information, print out reams of reports, or replicate the data into CSVs. You simply have to ensure that you <em>have</em> the data and can access it once the monitoring software has been turned off.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6c1MB6ZbQXoSEfc3iH4Qsp/4e810295df00e1e878f17e35879ec8c0/gather-info-table.png" style="max-width: 800px;" class="image center" alt="Database snapshot" /> <p>In the case of SolarWinds (presuming you have some basic SQL skills), that’s as simple as making a snapshot of the database. Everything you need is in that database and can be obtained with a query (or three). To be sure, sometimes running a report or using the SolarWinds SDK to extract the information while everything is still running is more manageable. But at the end of the day, the database is enough because Kentik has more than a few folks on staff who can give you a hand.</p> <h3 id="device-list">Device list</h3> <p>It doesn’t exactly require a blinding flash of insight to recognize that the most important thing you’ll need for a migration is a list of devices (or “nodes,” as SolarWinds prefers to call them). In this particular situation, it’s doubly important: first, because that’s the essence of what you’ll be monitoring in Kentik NMS. Second, nodes are the center of the SolarWinds architecture, the point from which all other references flow. 
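</p> <p>Since everything hangs off the node list, it is worth knowing that it can be pulled straight from the database. Here is a hedged sketch in Python; the DSN and credentials are placeholders, and while the table and column names follow common SolarWinds conventions, verify them against your own schema:</p> <pre><code>import pyodbc  # assumes an ODBC driver and DSN are already configured

# "solarwinds", "reporter", and "example" are placeholder connection details.
conn = pyodbc.connect("DSN=solarwinds;UID=reporter;PWD=example")

# Pull a basic node inventory; adjust names to match your actual schema.
for node_id, caption, ip in conn.execute(
    "SELECT NodeID, Caption, IP_Address FROM Nodes"
).fetchall():
    print(node_id, caption, ip)
</code></pre> <p>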
For these reasons, it’s crucial to spend more time collecting and validating this list than any other task.</p> <p><strong>What you’ll need for Kentik NMS:</strong></p> <p>Note that Kentik NMS requires very little information in terms of migration. We need only a set of IP addresses plus the SNMP credentials they use, and we’ll pick up everything else on the machine.</p> <p>However, as we said before, collecting more information <em>now</em> allows you to validate your migration and ensure you have the additional context and details for the times when NMS inevitably grows and expands.</p> <p>The smallest amount of information Kentik NMS requires is:</p> <ul> <li>A list of IP addresses for the devices you want to monitor</li> <li>One or more CIDR notated subnets where all the devices to be monitored reside (example: 192.168.1.0/24)</li> <li>SNMP information: Depending on the version the devices are running, that would either be: <ul> <li>SNMP version 2c <ul> <li>The read-only community string</li> </ul> </li> <li>SNMP version 3 <ul> <li>The username</li> <li>Authentication type and passphrase</li> <li>Privacy type and passphrase</li> </ul> </li> </ul> </li> </ul> <p>If you’re a minimalist, that’s all you will need!</p> <h3 id="per-device-resources-and-details">Per device resources and details</h3> <p>The (sad) truth is that many of us are <em>not</em> minimalists. Most of us have experienced the pain of deleting “unnecessary” information only to discover hours or even minutes later that we absolutely needed that data and could not get it back.</p> <p>For everyone else in this camp, here’s a reasonably comprehensive list of data elements you’ll want to ensure you have for each device/node in your current monitoring solution.</p> <p>Specific to SolarWinds, these are the essential items displayed on the Node Details page, which can be gathered via a series of reports, SDK commands, or (as mentioned earlier) database queries.</p> <ul> <li>IP address</li> <li>SNMP information <ul> <li>SNMP version 2c: <ul> <li>The read-only community string</li> </ul> </li> <li>SNMP version 3: <ul> <li>The username</li> <li>Authentication type and passphrase</li> <li>Privacy type and passphrase</li> </ul> </li> </ul> </li> <li>Device details (name, model, etc.)</li> <li>CPU count, speed, type, etc.</li> <li>RAM</li> <li>Hardware <ul> <li>Fan, power supply, temp, etc.</li> </ul> </li> <li>Maintenance schedule</li> <li>Polling (ping) interval</li> <li>Statistics (SNMP) interval</li> <li>Interfaces: For each NIC on the device, ensure you have its: <ul> <li>Name</li> <li>Type</li> <li>MAC address</li> <li>IP address</li> <li>Bandwidth rating</li> <li>Polling (ping) interval</li> <li>Statistics (SNMP) interval</li> </ul> </li> <li>Volumes: For each drive on the device, ensure you have its: <ul> <li>Name</li> <li>Type</li> <li>Capacity</li> <li>Polling (ping) interval</li> <li>Statistics (SNMP) interval</li> </ul> </li> </ul> <h2 id="per-device-custom-elements">Per device custom elements</h2> <p>Along with the device details and the elements that are tightly bound to that device (CPU, RAM, disks, interfaces, etc.), there are additional items you’ll want to make a note of because they reflect the data you’re currently collecting.</p> <p>That list includes:</p> <h3 id="api-poller-items">API poller items</h3> <p>While Kentik NMS doesn’t – at the time I’m writing this – support the collection of API-based telemetry, it is on the roadmap. 
Therefore, it’s worthwhile to save the critical information you’ll need to migrate these data collections when the time comes.</p> <p>You’ll need to note:</p> <ul> <li>The collection method (GET, POST) and the URL</li> <li>The headers and body elements</li> </ul> <p>Note that a single API poller in SolarWinds could comprise multiple GET or POST requests and be very complex. Make sure you’ve got the entire sequence!</p> <h3 id="custom-snmp-objects-oids">Custom SNMP objects (OIDs)</h3> <p>Every monitoring solution chooses which SNMP objects (OIDs) are collected as a standard for each device type. However, there are always additional items that are useful, and even necessary, in specific situations. Many monitoring tools allow you to specify those OIDs and assign them to various device types, providing additional insight into device status and operation. Examples include temperature sensors, power readings, and motion detectors.</p> <p>Since Kentik NMS supports this capability, you’ll want to make sure you’ve noted the following:</p> <ul> <li>The exact OID involved</li> <li>Any transformations that have been done to it (Celsius to Fahrenheit, for example)</li> <li>The devices this OID was assigned to</li> </ul> <p>It’s also important to note that sometimes you’ll find custom OIDs that don’t seem useful or essential on their own but – when combined – create a more complex value or insight. So, ultimately, it’s a good idea to note all the custom OIDs being collected just in case.</p> <h2 id="beyond-the-device">Beyond the device</h2> <img src="//images.ctfassets.net/6yom6slo28h2/1sNKpznYecw8L4Z2JpIooU/53d6455d42f16991302387d68083786b/game-show-guy.png" style="max-width: 500px;" class="image center" alt="But that's not all!" /> <p>As the 1970s advertising cliche goes, “But that’s not all!” Along with device-centric data, you’ll want to ensure you’ve captured and documented a range of other monitoring objects before closing the door on your current monitoring solution.</p> <p>While details on each of these appear below, the shopping list version is as follows:</p> <ul> <li>Custom properties</li> <li>Groups</li> <li>Discovery settings</li> <li>Users</li> <li>Polling engines</li> <li>SMTP servers</li> <li>Ticket system settings</li> <li>Dashboards</li> <li>Reports</li> <li>Alerts</li> </ul> <h3 id="custom-properties">Custom properties</h3> <p>Custom properties have become a pivotal aspect of SolarWinds, with the ability to add customizations to nodes, interfaces, volumes, applications, alerts, and more. Other monitoring solutions have their own spin on the idea. 
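</p> <p>Whatever your tool calls them, the shape of the data is simple. Here is a sketch of a single node record carrying purely illustrative custom fields:</p> <pre><code># One node record with example custom properties attached. Field names and
# values are invented for illustration; use whatever your organization defined.
node = {
    "caption": "core-sw-01",
    "ip_address": "192.0.2.10",
    "custom_properties": {
        "owner_email": "netops@example.com",
        "environment": "prod",
        "latitude": 41.49,
        "longitude": -81.69,
        "asset_tag": "A-10442",
    },
}
print(node["custom_properties"]["environment"])
</code></pre> <p>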
Still, the upshot is that these are custom fields, associated with an object, that let you extend and enhance your understanding of where that element is, what it does, and how to treat it.</p> <p>While they may be best remembered for examples like the eminently unusable “Manager’s Dog’s Name” custom property, they make more sense when you consider the ability to associate data points like:</p> <ul> <li>Email(s) for responsible individuals</li> <li>Environment (prod, test, QA)</li> <li>Latitude/longitude</li> <li>Asset tag</li> </ul> <p>In any case, you must take a minute to run reports or database queries to extract both the custom properties <em>and</em> their associations with the object(s) to which they belong.</p> <h3 id="groups">Groups</h3> <p>Most monitoring solutions allow you to group objects together, whether using the aforementioned custom properties or some other mechanism.</p> <p>Sometimes, the mechanism is part of the device itself (such as if it uses the “location” tag within the SNMP configuration).</p> <p>However, it’s essential to understand that mechanism and note each device, interface, and object’s various groupings within your monitoring solution.</p> <h3 id="discovery-settings">Discovery settings</h3> <p>In some environments, device discovery is a task done during installation/setup and then rarely revisited. In more volatile or fluid infrastructures, device discovery happens often. If your environment resembles the latter more than the former, you’ll want to export the discovery settings. These can include (but aren’t limited to):</p> <ul> <li>Standard SNMP community strings, usernames, secrets, etc.</li> <li>Subnets that are scanned</li> <li>Subnets that are skipped/avoided</li> <li>Individual systems that are skipped/avoided</li> <li>Discovery schedules – when and how various areas of the infrastructure are scanned.</li> </ul> <h3 id="users">Users</h3> <p>Because of the powerful insights monitoring solutions provide, most companies like to set up some form of role-based access to the system. Most modern, robust monitoring solutions will support this with various forms of authorization and even synchronization to corporate authentication systems like Active Directory or single sign-on.</p> <p>The point here is that it’s essential to get an export of all of the individual users, groups, and so on, along with the permissions assigned, so you can replicate them in Kentik NMS.</p> <h3 id="polling-engines">Polling engines</h3> <p><strong><em>(and nodes assigned to them)</em></strong></p> <p>While the term “polling engine” is SolarWinds-specific, the concept is a common one: Whether you call it a Managed Node, a collector, or something else, we’re talking about a device or piece of software that sits between the physical elements themselves (routers, servers, etc.) and the location of all the data being collected (database, SaaS destination, etc.).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/L5ZWnomfLkt9tKvXfBID1/ea5625ed229c8050c734a2b94e6ce23d/migration-guide-arch.png" style="max-width: 500px;" class="image center" alt="Architecture" /> <p>In a sufficiently large environment, you’ll have more objects than a single collector can handle, at which point you’ll need to deploy multiple collectors to handle the load. You’ll also have to determine which collector will be responsible for which endpoints.</p> <p>Sometimes, this decision is simple because it will follow geographic or logical architectural boundaries. 
Other times, assignments arise organically and somewhat haphazardly.</p> <p>Regardless, your task at this migration stage is to record all the collector systems in your environment and identify the objects assigned to each one.</p> <p>That’s not to say you’ll have to mimic this setup in Kentik NMS. Because of the advances Kentik has made to both the code base and the collection methods, NMS can often collect far more data in a far more efficient manner than tools saddled with a 10+-year-old code base.</p> <p>But despite this, and as we’ve emphasized several times already, it’s critical to know the original setup simply as a point of reference in the future.</p> <h3 id="smtp-servers">SMTP servers</h3> <p>This is a simple item, but people often overlook it because it tends to be a set-it-and-forget-it step that happens way back at the installation phase of a monitoring tool and rarely gets touched after that.</p> <p>Nevertheless, this section is a reminder to go into your platform settings and note the SMTP (email) server or servers used to send out notifications. You’ll want to note the name/IP of the server, the username it connects with (and track down the password that username is associated with because the platform probably won’t show you), the port it uses to connect, etc.</p> <h3 id="ticket-system-settings">Ticket system settings</h3> <p>Right up there with your email server (SMTP) information is making sure you note the connections to external ticket systems like Remedy, ServiceNow, etc.</p> <p>This information can vary greatly and is often specific to the ticketing platform and the connection between it and your instance of the current monitoring solution. Therefore, providing a laundry list of things like username, API token, etc., is largely guesswork.</p> <p>Instead, we’ll keep this section short and remind you to check your current monitoring solution for settings and make a note of them.</p> <p>In Kentik NMS, you’ll use that information to set up one (or more) notification channels using the built-in integrations or the generic webhook option.</p> <h3 id="dashboards">Dashboards</h3> <p>If we’re being honest, most IT practitioners have a love-hate relationship with dashboards. On the one hand, they’re excellent tools to get an “at-a-glance” sense of the overall health and performance of a collection of systems. On the other hand, they’re only good if A) you have one built, B) you know it exists and can find it, and C) someone looks at it when something goes wrong.</p> <p>The challenge is that A above tends to lead to an overabundance of dashboards being created ad hoc, which in turn leads to B and C—not knowing the dashboard you need exists because there are so many of them. This creates a deadly feedback loop where more dashboards are created simply because nobody knew a similar one existed.</p> <p>Philosophical issues aside, our point is to carefully consider each dashboard in your current system and ask yourself who needs it, for what purpose, and whether another dashboard in your inventory might not serve the same purpose.</p> <p>After that bit of soul-searching, documenting each dashboard often comes down to identifying which sub-elements (sometimes called “graphs,” “widgets,” or “panels” by various monitoring tool vendors) are part of the overall dashboard. 
Those sub-elements are almost always created with a query, which means what you’re <em>really</em> collecting are those queries and the displays they create.</p> <p>This has a positive effect because you’ll end up with a list of queries/displays, allowing you to identify overlaps, close matches, and duplicates and simplify your overall environment.</p> <p>The other positive outcome is that those queries are much easier to translate into Kentik NMS and turn into dashboard elements in the new system.</p> <h3 id="reports">Reports</h3> <p>This is one of the easier items to understand and check off your list.</p> <p>First, for SolarWinds, reports are kept in the \Program Files (x86)\SolarWinds\Orion\Reports directory on the primary polling engine.</p> <p>Even if your tool doesn’t store reports as files like that, many tools allow you to output the query behind the report. And, of course, a copy of the reports often gives you almost everything you need to know to set them up in the new system.</p> <h3 id="alerts">Alerts</h3> <p>For many people, alerts are one of the key reasons for having an observability solution. Therefore, it is critical to ensure that you both know what alerts existed in your previous system and can replicate those alerts in the new one.</p> <p>Other solutions may or may not have a system as simple as SolarWinds. On that platform, you go to Alert Manager, select the alert(s) you need, and then select “Export” from the menus. The result will be an XML file that contains all the relevant information.</p> <p>Creating alerts within Kentik NMS will be a different process. However, more than any other asset, alerts should not be created via a bulk import process. As we will delve into later in this series, alerts should be carefully considered, purpose-built, and created within Kentik NMS to fit entirely within the Kentik platform’s conceptual framework.</p> <h3 id="a-quick-word-about-query-based-elements">A quick word about query-based elements</h3> <p>Before proceeding to the next section, we encourage you to consider deeply the elements of your monitoring solution that use or rely on queries. In most systems, these include alerts, reports, and dashboards.</p> <p>Consider that your monitoring solution collects a significant volume of data, and every item mentioned above will query the database.</p> <p>Indeed, every time a report is run, it generates one (or more) queries to pull the relevant data.</p> <p>But equally, every refresh of a dashboard screen triggers multiple queries as each sub-element pulls and graphs the associated telemetry.</p> <p>However, alerts are perhaps the most misunderstood element regarding their impact. Every alert has a “trigger,” which is, in fact, a query. That query has to be run regularly—every half hour, every minute, or every 10 seconds—to determine whether any systems meet the trigger query parameters. Thus, if you have 200 alerts set to the (SolarWinds) default of running every minute, you are automatically running 200 database queries every minute.</p> <p>The result on a sufficiently large database is a hopelessly (and uselessly) slow and unresponsive system.</p> <p>Therefore, we encourage you to consider the inventory you just spent time collecting and look with a sober eye at what can be left behind.</p> <h2 id="stepping-back-and-looking-ahead">Stepping back and looking ahead</h2> <p>That wraps up part one of our Migration Made Easy series. 
If you’d like an even more detailed look at what goes into gathering the required information for an NMS migration, <a href="https://www.kentik.com/go/ebook/nms-migration-guide/" title="Get Kentik&#x27;s NMS Migration Guide here">download our entire NMS Migration Guide</a>. (Along with a handy checklist to make your transition smoother.) And if you’re considering moving from SolarWinds, you can learn more about why Kentik NMS is the modern, <a href="https://www.kentik.com/go/alternative/kentik-vs-solarwinds/" title="Learn why Kentik is the SolarWinds network performance monitoring alternative">ideal alternative to SolarWinds</a>.</p> <p>Until next time.</p><![CDATA[The Future of Cloud is AI: Google Cloud Next 2024]]><![CDATA[From infrastructure and chipsets to data solutions and development tools, it's clear that artificial intelligence will play a massive role in the future of the cloud. Read Justin Ryburn's latest recap to learn about many of the exciting unveilings at Google Cloud Next 2024.]]>https://www.kentik.com/blog/the-future-of-cloud-is-ai-google-cloud-next-2024https://www.kentik.com/blog/the-future-of-cloud-is-ai-google-cloud-next-2024<![CDATA[Justin Ryburn]]>Tue, 16 Apr 2024 04:00:00 GMT<h2 id="introduction">Introduction</h2> <p>The tech world was abuzz with excitement as Google Cloud Next 2024 unfolded, offering a glimpse into the future of cloud computing and digital innovation. Held at the Mandalay Bay Convention Center in Las Vegas, this conference serves as a platform for unveiling groundbreaking advancements, fostering collaboration, and inspiring businesses to reimagine their digital strategies. Let’s delve into the highlights and key takeaways from this monumental event.</p> <h2 id="a-new-cloud-venue">A new cloud venue</h2> <p>The event’s relocation from the Moscone Center in San Francisco to the Mandalay Bay in Las Vegas had a profound impact on its ambiance and energy. The vibrant atmosphere of Las Vegas infused the event with a new sense of liveliness and excitement. It’s fascinating to observe how the change in setting can influence an event’s overall tone and perception.</p> <p>Additionally, Google’s heightened marketing efforts surrounding this event seem to have contributed to the buzz and anticipation. This strategic move showcases the event’s significance and reflects the growing importance of such gatherings as platforms for major announcements and industry developments.</p> <h2 id="ai-takes-center-stage">AI takes center stage</h2> <p>As expected, AI was a significant focus at Next ’24. Google emphasized its commitment to a comprehensive AI platform, offering both its own solutions and extensibility for partner-developed tools. This multi-layered approach empowers businesses to leverage AI across various stages, from infrastructure and chipsets to data solutions and development tools.</p> <p>The focus wasn’t just on internal innovation; Google highlighted partnerships that expand the capabilities of its AI ecosystem. Attendees learned about new AI-powered features coming to Google Workspace, including the intriguing “Vids” – a video creation app designed specifically for work. If you have not checked out Google’s chatbot, <a href="https://gemini.google.com/">Gemini</a>, you really should. I’ve found it to give better-quality responses than OpenAI’s <a href="https://chat.openai.com/">ChatGPT</a> most of the time.</p> <h2 id="beyond-ai-a-cloud-for-all-needs">Beyond AI: A cloud for all needs</h2> <p>While AI dominated the headlines, Next ’24 wasn’t a one-trick pony. 
There were announcements across the cloud spectrum, including:</p> <ul> <li><strong>Enhanced security:</strong> New additions like the AI-powered security add-on for Google Drive and extended Data Loss Prevention (DLP) controls for Gmail showcase Google’s commitment to robust cloud security.</li> <li><strong>Distributed cloud developments:</strong> Google Distributed Cloud (GDC) received a significant upgrade, achieving key compliance certifications and integrating advanced threat prevention technology from Palo Alto Networks.</li> </ul> <h2 id="cross-cloud-network">Cross-cloud network</h2> <p>Google Cloud Next 2024 had some exciting announcements regarding Cross-Cloud Networking. Here’s a quick rundown:</p> <ul> <li><strong>Cross-Cloud Network announcement:</strong> A blog post titled “<a href="https://cloud.google.com/blog/products/networking/whats-new-for-networking-at-next24">What’s new for networking at Next ’24</a>” highlights Cross-Cloud Network as a key innovation that simplifies connecting and securing workloads across various cloud environments. It emphasizes the benefits of a service-centric approach, potentially lowering TCO by up to 40%.</li> <li><strong>Enhanced Private Service Connect:</strong> One of the announcements detailed Private Service Connect transitivity over Network Connectivity Center. Currently in preview, this improvement allows services in a spoke VPC (Virtual Private Cloud) to be readily accessible from other spoke VPCs.</li> <li><strong>Cloud NGFW Enterprise goes GA:</strong> Cloud NGFW Enterprise (formerly Cloud Firewall Plus) achieved Generally Available status. This offering provides network threat protection powered by Palo Alto Networks and network security posture controls for organization-wide perimeters and Zero Trust micro-segmentation.</li> </ul> <h2 id="in-closing">In closing</h2> <p>If you missed Next ’24, don’t worry! Google Cloud offers complimentary access to <a href="https://cloud.withgoogle.com/next/session-library#all">on-demand content</a> from the event. This allows you to delve deeper into specific announcements, keynotes, and sessions that pique your interest.</p> <p>Whether you’re an <a href="https://www.youtube.com/watch?v=2nn9MI-FUm0">AI enthusiast</a> or simply looking to <a href="https://www.kentik.com/product/multi-cloud-observability/">optimize your cloud development and deployment</a>, Google Cloud Next 2024 offers a glimpse into the future. With its focus on innovation and comprehensive solutions, Google Cloud is poised to be a significant player in the ever-evolving cloud landscape.</p><![CDATA[The (Mostly) Complete Guide to Installing Kentik NMS]]><![CDATA[NMS can do so much that one blog topic triggered an idea for another, and another, and here we are, six posts later and I haven't explained how to install NMS yet. The time for that post has come.]]>https://www.kentik.com/blog/the-mostly-complete-guide-to-installing-kentik-nmshttps://www.kentik.com/blog/the-mostly-complete-guide-to-installing-kentik-nms<![CDATA[Leon Adato]]>Wed, 10 Apr 2024 04:00:00 GMT<h2 id="introduction">Introduction</h2> <p>The saying “nobody likes a know-it-all” applies equally well to blog series – they’re really not terribly loveable. That goes double for a deep-dive technical blog series. We, who make our livelihood and career in tech, appreciate in-depth information and tutorials. 
But if a post has “part 8 of 33” in the title, it’s a safe bet that most folks will scroll right by because who wants to make that kind of commitment?</p> <p>I share this with you to explain that I never intended to create a blog series on using <a href="https://www.kentik.com/product/network-monitoring-system/" title="Learn more about Kentik NMS, the modern network monitoring system">Kentik NMS</a>. My goal was to share what I knew about Kentik’s newest addition to the platform, and to do so in a way that was focused and easy to consume in a reasonable amount of time.</p> <p>But there’s so dang much that NMS can do! <a href="https://www.kentik.com/blog/getting-started-with-kentik-nms/">One topic</a> triggered an idea for <a href="https://www.kentik.com/blog/using-kentik-network-monitoring-system-to-identify-network-outages/">another</a>, and <a href="https://www.kentik.com/blog/adding-multiple-custom-metrics-to-kentik-nms/">another</a>, and here we are, six posts later, and I haven’t really gone into the details of how to install NMS yet.</p> <p>I’ll admit the oversight – not starting with the NMS installation – was (slightly) intentional. I’m tired of slogging through 30 paragraphs covering “how to install” before I even know if the tool I’m reading about does anything I need or care about. So, I made the conscious decision to start by digging into the useful features and circle back to installation once I felt NMS had proven its worth.</p> <p>That time has come.</p> <div as="Promo"></div> <h2 id="observations-on-the-nms-architecture">Observations* on the NMS architecture</h2> <p><em>(*You see what I did there, right?)</em></p> <p>I’m not going to belabor the overall design of NMS with a bunch of “…color glossy photographs with circles and arrows and a paragraph on the back of each one explaining what each one was…” (hat tip to Arlo Guthrie) because it’s pretty simple:</p> <ol> <li><strong>The targets:</strong> This is the stuff you want to monitor – network gear and servers that sit on-premises, in the cloud, or both.</li> <li><strong>The Ranger Collector:</strong> A system – it can be physical, virtual, or nothing more than a host for a Docker container – that’s in the same logical network, so it’s able to receive data and pull metrics from the stuff you want to monitor.</li> <li><strong>The Kentik platform:</strong> This is the system – located remotely from you and your devices – to which the Ranger Collector sends data.</li> </ol> <p>Ok, maybe just <em>one</em> photograph:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2vbmJMcIG7PxhpvRrBtRQm/b5f3b48fc862b8123f855444c5e8482f/collectors-devices.png" style="max-width: 500px;" class="image center" alt="The targets, collector, and platform" /> <h2 id="measure-twice-cut-once">Measure twice, cut once</h2> <img src="//images.ctfassets.net/6yom6slo28h2/21MydoSlSNPc2UzyGMqWEr/ba5d59ef53cf22ae90659ed7c2f67030/measure-twice.jpg" style="max-width: 500px;" class="image center" alt="Measure twice, cut once" /> <p>Eagle-eyed readers will note that I did, in fact, briefly touch on how to install Kentik NMS in both the NMS migration guide and also the blog <a href="https://www.kentik.com/blog/getting-started-with-kentik-nms/">Getting Started With Kentik NMS</a>. This blog will go into far greater detail than those two, but I will still use bits from other blogs if they work well. 
Back in “Getting Started…” I wrote:</p> <div style="background-color: #fafafa; padding: 40px; border: 1px solid #ebeff3; border-radius: 12px; margin-bottom: 30px;"> <p>There’s nothing more frustrating than being ready to test out a new piece of technology and then finding out you’re not prepared. So before you head down to the “Installation and configuration” section, make sure you have the following things in hand:</p> <ol> <li>A system to install the Kentik NMS collector on. The collector is an agent that can be installed directly onto a Linux-based system or as a Docker container. Per the <a href="https://kb.kentik.com/v0/Bd11.htm#Bd11-NMS_Agent_Requirements">Kentik Knowledge Base</a> instructions, you’ll want a system with at least a single core and 4GB of RAM.</li> <li>Verify the system can access the required remote sites: <ul> <li>Docker Hub</li> <li>TCP 443 to grpc.api.kentik.com (or kentik.eu for Europe)</li> </ul> </li> <li>Verify the system can access the devices you want to monitor: <ul> <li>Ping (ICMP)</li> <li>SNMP (UDP port 161)</li> </ul> </li> <li>Check that you have the following information for the devices you want to monitor: <ul> <li>A list of IP addresses and/or one or more CIDR notated subnets (example: 192.168.1.0/24)</li> <li>The SNMP v2c read-only community string and/or SNMP version 3 username, authentication type and passphrase, privacy type and passphrase.</li> </ul> </li> <li>Make sure you have a Kentik account. If you don’t, head over to <a href="https://portal.kentik.com/login">https://portal.kentik.com/login</a> and get one set up. If you are just testing NMS, we recommend <i>not</i> using an existing production account.</li> </ol> <p>Once you’ve got all of your technical ducks in a row (which, to be honest, shouldn’t take that long), you’re ready to get started on this NMS adventure!<br> </p></div> <p>That about sums it up. To get NMS up and running, you just need:</p> <ol> <li>A system to install the collector on</li> <li>Systems to monitor</li> </ol> <p>And with that, you’re ready to get installing.</p> <h2 id="installing-kentik-nms">Installing Kentik NMS</h2> <p>As I mentioned earlier, there are two primary options for installing the Kentik NMS Ranger Collector: direct or Docker. I will cover both, but regardless of which one you plan to use, you’ll start in the <a href="https://portal.kentik.com">Kentik portal</a>. 
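</p> <p>Before you open the portal, it’s worth a quick pre-flight check from the system that will host the collector. Here’s a sketch using standard tools – the device IP and community string are placeholders for your own – that confirms the reachability requirements from the checklist above:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Can this host reach the Kentik ingest endpoint on TCP 443?
nc -vz grpc.api.kentik.com 443

# Can it ping a device you plan to monitor? (192.168.1.10 is a placeholder)
ping -c 3 192.168.1.10

# Can it query that device via SNMP? (1.3.6.1.2.1.1.1 is sysDescr, a safe, universal OID)
snmpwalk -v 2c -c &lt;community string> 192.168.1.10 1.3.6.1.2.1.1.1</code></pre></div> <p>If all three come back clean, you’re in good shape. 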
Click the “hamburger menu” (the three lines in the upper left corner), which shows the full portal menu:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2H9BLwhkzRfxNVf6P1Al0R/e348a1ec126b067f691865232f014e59/dropdown-devices-highlighted.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik main menu" /> <img src="//images.ctfassets.net/6yom6slo28h2/5LGJMgWE4vH1bthQyHvgmF/cd979d429946e36a48ff6496efff17ce/discover-devices-button.png" style="max-width: 160px; padding: 0;" class="image right" alt="Discover Devices button" /> <p>Click “Devices” and then click the friendly blue “Discover Devices” button in the upper right corner.</p> <p>There’s a quick question on whether you want full monitoring or “ping-only” – just to know if systems are up or down:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6i60CfIeo4y7gAZUHqf3uT/5935c2f8c83b319c2df88ec7f5f2f548/add-devices.png" style="max-width: 600px;" class="image center" alt="Add NMS devices screen" /> <p>The next screen allows you to install the collector, either as a Docker container:</p> <img src="//images.ctfassets.net/6yom6slo28h2/31t0BwbMvfYSYmVBRc5l4K/2cee5750a174f0852ee3a51c87d5cca9/installation-docker.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Docker installation" /> <p>Or on a full Linux system:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6c8nu6f4PGil0U1r7rNHs9/b38c96ea397e3b90c5d08ef35c77218e/installation-linux.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Linux installation" /> <h3 id="installing-direct">Installing direct</h3> <p>Let’s start with a “direct” installation on a regular Linux system. Again, this can be an actual bare-metal server or a VM on-site or in the cloud. The only requirement is that the machine you’re installing on can access the systems you want to monitor.</p> <p>Copy the command from the portal, SSH to the target system, and paste that command. (<strong>Note</strong>: You must have sudo permission to run this command.)</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5XgsEREKsRyo4ibY89Z8B8/6c86991c671e8b11d6d40c209e487c3f/copy-the-command1.png" style="max-width: 800px;" class="image center" thumbnail alt="Installing direct - copy command from portal" /> <p>The installer will complete and… well, in most cases, that’s pretty much it.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6ruUM7h25TXGX2mNMPwaQ5/b8c24145d852853dd6cb59fba20656f7/installer-complete.png" style="max-width: 800px;" class="image center" thumbnail alt="Installer completed" /> <p>No, really. That’s it. 
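</p> <p>If “that’s pretty much it” leaves you wanting proof of life, you can ask systemd about the agent. This is a hedged sketch – it assumes the service registered itself as kagent.service, the same unit name used in the troubleshooting section later in this post:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Confirm the agent service is installed and running
sudo systemctl status kagent.service

# Watch its first few minutes of log output
sudo journalctl -u kagent --since "5 minutes ago"</code></pre></div> <p>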
The next step involves getting things set up in the Kentik portal, so I will leave that aside for now.</p> <h3 id="installing-with-docker">Installing with Docker</h3> <p>This process starts out the same as with the Direct option – copy the command from the Kentik portal, SSH to the system hosting the Docker container, and paste.</p> <p><em>However</em>… there are a couple of implicit expectations that are worth stating out loud:</p> <ul> <li>The system you’re using already has Docker and all its necessary components installed.</li> <li>The user under which you’re installing has full permission to create, run, shut down, and update Docker containers.</li> </ul> <p>Presuming that’s the case, cut, paste, and run!</p> <img src="//images.ctfassets.net/6yom6slo28h2/1CuQxX2tE0rxiK6aZi7G5s/fe4569e09bd80ecc4b68b842bf333528/copy-the-command2.png" style="max-width: 800px;" class="image center" alt="Installing with Docker - copy the command" /> <p>Once the Docker run command completes, there’s not much to see:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5eoBLcmilYQKy6Nox1ffiH/fb1c7fabdbd3d185b762ca42d8234def/docker-command-completes.png" style="max-width: 800px;" class="image center" alt="Docker command completed" /> <h2 id="scanning-and-adding-devices-aka-the-fun-part">Scanning and adding devices (aka: the fun part)</h2> <p>Shortly after installing the Ranger Collector, you’ll see the agent name (or the name of the system the agent is installed on) show up in the “Select an Agent” area below the installation commands:</p> <img src="//images.ctfassets.net/6yom6slo28h2/TFL4T7HsP1u3w3hinaJsq/f8e0b2be3effc8d58f04b532ef0e75f2/select-agent.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Select an agent" /> <p>Go ahead and click “Use this Agent.” That will automatically authorize it and take you to the next screen, where you can specify the devices you want to monitor. There, you’ll enter an IP address, a comma-separated list of IPs, or a CIDR-notated range (example: 192.168.1.0/24).</p> <img src="//images.ctfassets.net/6yom6slo28h2/1HedagqbxXQHmKanhIAPGK/60ddea025214ee0e2c9e4f44c2c058ac/ip-range.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Enter IP address" /> <p><strong>Trick</strong>: You can mix and match, including individual IPs and CIDR ranges.</p> <p><strong>Another trick</strong>: if there are specific systems you want to ignore, list them with a minus (-) in front (for example, 192.168.1.0/24, -192.168.1.250 scans the whole subnet but skips that one host).</p> <p>Presuming this is your first time adding devices, you’ll probably have to click “Add New Credential.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/7hmJTYkX4jsq9jGXm2IQMQ/52e3ededb59b53c72bc871f4c4345146/snmp-v2.png" style="max-width: 450px;" class="image center" withFrame alt="Choose SNMP v2" /> <img src="//images.ctfassets.net/6yom6slo28h2/1BL6WZkNUex3YDCsua75HA/3a24cb93395603b6a5073fdd2dec04df/snmp-v3.png" style="max-width: 450px;" class="image center" withFrame alt="Choose SNMP v3" /> <p>Let’s get this out of the way: You will <em>never</em> select SNMP v1. 
Just don’t.</p> <p>That said, select SNMP v2c or v3, include the relevant credentials, give it a unique name, and click “Add Credential.”</p> <p>Then select it from the previous screen.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4UUtcTPsQmGtptRUvf6Mvo/9883786b575b682c4b2ffe358a349241/ip-range-credentials-discovery.png" style="max-width: 650px;" class="image center" thumbnail withFrame alt="Credentials plus Start Discovery" /> <p>At that point, click “Start Discovery” to kick off the real excitement.</p> <p>The collector will start pinging devices and ensuring they respond to SNMP. Once completed, you’ll see a list of devices. You can check/uncheck the ones you want to monitor and click “Add Devices.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/6kzCRFzDXAYV2PCkWcx14W/373814734e6a0670c285528281b1af74/discovery-complete.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Discovery complete" /> <h2 id="docker-for-the-distracted">Docker for the distracted</h2> <p>I don’t want to presume that every reader is already familiar – let alone comfortable – with Docker and its basic commands. I recognize that there are <em>a lot</em> of Docker tutorials on the internet. In fact, I’ve lost several weeks of my life looking for (and at) them. But I also recognize that not every reader of this blog manages swarms of containers. So here’s the absolute minimum information you’ll need to be able to maintain your Ranger collector Docker container.</p> <p>You can see which containers are running (along with their container IDs) with the following command:</p> <p><code class="language-text">docker ps</code></p> <img src="//images.ctfassets.net/6yom6slo28h2/2Lbbl49AWK7NE4QW1CtrhZ/7c28e01d5d6026898b477eae46a607ff/docker-ps.png" style="max-width: 800px;" class="image center" alt="See which containers are running" /> <p>You can see what a Docker container is doing with a command like this:</p> <p><code class="language-text">docker logs --follow &lt;container id></code></p> <img src="//images.ctfassets.net/6yom6slo28h2/3La3yvwsL7kX41Ex7fgHYy/21e3e5cb49f1724690c445d42c6cfb9b/docker-logs-follow.png" style="max-width: 800px;" class="image center" alt="See an output of a Docker container" /> <p>If you have issues with any of your containers, including the one created by the Kentik agent install command, you can easily stop a container with this command:</p> <p><code class="language-text">docker stop &lt;container id></code></p> <img src="//images.ctfassets.net/6yom6slo28h2/3W7sIiCPY8lcgQqHVuNCSA/8f3107932d2e6dee71844790455d76ab/docker-stop.png" style="max-width: 800px;" class="image center" alt="Command to stop a container" /> <p>Once a container is stopped, you probably will want to restart it. The problem is that it won’t show up using <code class="language-text">docker ps</code>. 
To see containers that are no longer running, use the -a (“all”) switch:</p> <p><code class="language-text">docker ps -a</code></p> <img src="//images.ctfassets.net/6yom6slo28h2/2kSNsM0u4AiNghv08SYn3P/ee947513256d652c45dfea6ab1fcb84b/docker-psa.png" style="max-width: 800px;" class="image center" alt="PS command" /> <p>And then, to start that container again, run:</p> <p><code class="language-text">docker start &lt;container id></code></p> <img src="//images.ctfassets.net/6yom6slo28h2/ioZGfjvmkYTcjrXsfFdDE/825eb81c74d54f2a0c42fe3a35f3119a/docker-start.png" style="max-width: 800px;" class="image center" alt="Restart a Docker container" /> <p>If something has gone horribly wrong, you can stop and then completely remove the container. (<strong>Warning</strong>: You’ll need to go back into the Kentik portal and re-run the original command to rebuild it, re-authenticate it, and add devices to be monitored by it.)</p> <p><code class="language-text">docker rm &lt;container id></code></p> <img src="//images.ctfassets.net/6yom6slo28h2/3snWpAivNYcrFfsIKQscHW/9980995250149abda6f168cab8374995/docker-rm.png" style="max-width: 400px;" class="image center" alt="Remove the container" /> <h2 id="variety-and-customization-is-the-spice-of-life">Variety (and customization) is the spice of life</h2> <p>If you’ve been following along in the blog series, you’ll know that you can add custom SNMP metrics along with the ones collected by default. To do that, you need to create certain files and make them discoverable by the NMS Ranger Collector agent when it starts up. Whether you’re using the direct or Docker version of the agent, do the following:</p> <ol> <li>Create the directory: <code class="language-text">/opt/kentik/components/ranger/local/config</code>. If you’re running the direct agent, everything up to “ranger” will already be there, but you’ll have to create local/config.</li> <li>In that directory, create three directories: <ul> <li>/profiles</li> <li>/reports</li> <li>/sources</li> </ul> </li> <li>Make the user:group “kentik:kentik” the owner of everything you just created and all the files and directories beneath it.</li> </ol> <p><code class="language-text">sudo chown -R kentik:kentik /opt/kentik/components/ranger/local/config</code></p> <p><strong>Note</strong>: You must monitor at least one device for /opt/kentik/components/ranger to exist.</p> <p><strong>Another note</strong>: If you add more files, you’ll probably need to re-issue that command.</p> <p><strong>Trick</strong>: You can also make this directory easier to get to by using the <a href="https://www.freecodecamp.org/news/symlink-tutorial-in-linux-how-to-create-and-remove-a-symbolic-link/">Linux “symbolic link” capability</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1eb3ZjcrkdZAuFryfhfUGh/6b7d12eed20da3c26f2a16da4638cb3c/symbolic-link.png" style="max-width: 800px;" class="image center" alt="Linux symbolic link capability" /> <p>This would change the chown command to:</p> <p><code class="language-text">sudo chown -R kentik:kentik /local_kentik</code></p> <p>You still need that directory for the Docker version of Kentik NMS, so do everything I described at the beginning of this section. But you’ll also need to tell Docker to mount it as a custom folder. 
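</p> <p>As an aside, all the directory work in this section condenses to three commands. Treat this as a sketch of the steps just described – the paths match a direct install, and the /local_kentik name is just the example from the symlink trick above:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Create the local config tree and its three subdirectories
sudo mkdir -p /opt/kentik/components/ranger/local/config/{profiles,reports,sources}

# Make kentik:kentik the owner of everything underneath it
sudo chown -R kentik:kentik /opt/kentik/components/ranger/local/config

# Optional: the symbolic link trick described above
sudo ln -s /opt/kentik/components/ranger/local/config /local_kentik</code></pre></div> <p>Now, back to mounting that folder in Docker. 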
To do that, we’ll start by looking at the “docker run” command that you used to install the container in the first place:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">docker run --name=kagent --detach --restart unless-stopped --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 --env K_API_ROOT=grpc.api.kentik.com:443 --mount source=kagent-data,target=/opt/kentik/ kentik/kagent:latest</code></pre></div> <p></p> <p>Adding the custom folder means including the line:</p> <p><code class="language-text">-v /opt/kentik/components/ranger/local/config</code></p> <p>…in that command, just before the image name. (Docker treats anything after the image name as arguments to the container rather than options, so the flag has to come first.) That would look like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">docker run --name=kagent --detach --restart unless-stopped --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 --env K_API_ROOT=grpc.api.kentik.com:443 --mount source=kagent-data,target=/opt/kentik/ -v /opt/kentik/components/ranger/local/config kentik/kagent:latest</code></pre></div> <p></p> <p>But wait! If you used the symlink trick from earlier, the command line becomes slightly easier to manage:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">docker run --name=kagent --detach --restart unless-stopped --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 --env K_API_ROOT=grpc.api.kentik.com:443 --mount source=kagent-data,target=/opt/kentik/ -v /local_kentik kentik/kagent:latest</code></pre></div> <p></p> <h2 id="troubleshooting-and-other-swear-words">Troubleshooting and other swear words</h2> <p>Working in tech, we become used to the fact that things rarely go right the first time. Only through careful consideration, iteration, and correction can we achieve the result we initially envisioned.</p> <p>This section is devoted to a few things that might not initially work as you hoped when setting up Kentik NMS.</p> <h3 id="peeking-behind-the-curtain">Peeking behind the curtain</h3> <p>The Kentik Ranger Collector agent runs pretty silently during installation and afterward, which is usually a good thing. Still, when you suspect something is going wrong, that silence can be anything from slightly unsettling to downright rage-inducing. The good news is that most of the information you need is in the Linux Journal. The Linux Journal records every outbound message, error, update, whine, sigh, and grumble that your Linux system experiences – especially when it concerns services that run through the systemctl utility.</p> <p>The command to peer inside the Journal is, appropriately enough, journalctl. But typing that by itself will likely yield a metric tonne of mostly irrelevant information. To see messages and output specific to Kentik NMS, you should use the command:</p> <p><code class="language-text">sudo journalctl -u kagent</code></p> <p>What this is saying is, “Show me the Journal, but filter for the following UNIT (hence the “-u”), which is either the name of a service or a pattern to match.” 
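</p> <p>One hedged aside: if journalctl comes back empty, the unit may be registered under a different name on your system. Standard systemd tooling (nothing Kentik-specific) can confirm what the service is actually called:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># List loaded services and look for the agent by name
systemctl list-units --type=service | grep -i kagent</code></pre></div> <p>Back to the journal output. 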
If that’s still too much information, try this:</p> <p><code class="language-text">sudo journalctl -u kagent --since "10 minutes ago"</code></p> <img src="//images.ctfassets.net/6yom6slo28h2/oomg2o2gnZx0RZ4pvKntL/8e03ab4ffcff04686bfb54d07ec9b541/journalctl.png" style="max-width: 800px;" class="image center" thumbnail alt="Linux Journal" /> <p>That asks for the Journal to be filtered down to messages about the Kentik agent (“kagent”), specifically those that have appeared in the last 10 minutes.</p> <p>If you want to see a running list of messages as they show up in the journal in real time, use this. The “-f” means “follow”:</p> <p><code class="language-text">sudo journalctl -f -u kagent</code></p> <p>Meanwhile, if you have a Docker container, you can see what’s happening with Docker’s logs:</p> <p><code class="language-text">docker logs --follow &lt;container id></code></p> <img src="//images.ctfassets.net/6yom6slo28h2/3La3yvwsL7kX41Ex7fgHYy/21e3e5cb49f1724690c445d42c6cfb9b/docker-logs-follow.png" style="max-width: 800px;" class="image center" alt="Docker logs" /> <h3 id="no-devices-are-discovered">No devices are discovered</h3> <p>If you’ve installed the NMS Ranger Collector agent, authenticated it into the platform, and run a scan, but no devices have been found, something obviously needs to be fixed. Here’s a short list of things that might have gone wrong:</p> <p><strong>You can’t reach the target devices.</strong></p> <p>This is hands-down the most common issue. Whether due to firewall issues or a simple routing oversight, it’s important to start by verifying that the machine on which you’re running the NMS Ranger Collector agent can talk to the devices you want to monitor.</p> <p>Start off by running ping and ensuring you’ve got a clean response.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6S6zMe1GrKXb2e360DyGms/4dbf65e08a34af758121cdb11d7395aa/ping.png" style="max-width: 500px;" class="image center" alt="Run ping" /> <p>Next, make sure you can reach the device via SNMP. Here are the essential steps:</p> <ul> <li>On the system running the Ranger Collector agent, go to the command line/terminal.</li> <li>Type “snmpwalk -V” (that’s a capital “V”) to verify that the SNMP tools are installed on this system. If not, install them.</li> <li>Next, do an SNMPWALK on the system object ID, which is present on all devices that are running SNMP:</li> </ul> <p><code class="language-text">snmpwalk -v 2c -c &lt;snmp community string> &lt;device IP address> 1.3.6.1.2.1.1.2</code></p> <p>If that works, you’ll see a response like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2koMjRHqomgXMrURF4I6FX/e99816dae59de630aa1b1581dfb3eccb/snmpwalk.png" style="max-width: 800px;" class="image center" alt="SNMPwalk" /> <p>At this point, you’ll know a few things:</p> <ul> <li>If you can’t ping the device, you have a routing or firewall issue.</li> <li>If you can ping but can’t get SNMP information, then: <ul> <li>The target device is refusing the SNMP request.</li> <li>Or it’s not running SNMP.</li> <li>Or you need to correct some piece of information, like the community string.</li> </ul> </li> </ul> <h3 id="custom-oids-arent-being-collected">Custom OIDs aren’t being collected</h3> <p>While this blog doesn’t get into it, a few other posts in this series delved deep into <a href="https://www.kentik.com/blog/adding-multiple-custom-metrics-to-kentik-nms/">getting custom metrics</a> and sending them to the Kentik platform. 
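</p> <p>Before walking through the checklist below, it can save time to confirm the YAML files even parse. Here’s a hedged one-liner – it assumes Python 3 with the PyYAML package installed, and the path is just the reports/temp.yml example from earlier in this series:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Exits silently if the file parses; prints a parse error (with a line number) if it doesn't
python3 -c 'import yaml; yaml.safe_load(open("/opt/kentik/components/ranger/local/config/reports/temp.yml"))'</code></pre></div> <p>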
If that’s not working, here’s a list of things to check:</p> <ul> <li>Check the YAML files in an editor that shows the type of whitespace you’re using. Mixing spaces with tabs will never end well for anyone.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/7yjepBvWvlwfwkFRmcAOzZ/51d026b6e06d1c653cc23c8d2fef81db/tabs-n-spaces.png" style="max-width: 350px;" class="image center" alt="Check whitespace" /> <ul> <li>Verify that all the files in /opt/kentik/components/ranger/local/config are owned by kentik:kentik.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/6tk2SgvNK8xwQPQVSqkxAa/49dc04ca327e233c6e18bc8917656e52/verify-config.png" style="max-width: 800px;" class="image center" alt="Verify config" /> <ul> <li>Verify that the custom files are the correct “kind” – profile, report, or source.</li> <li>Make sure the metadata name elements match up from file to file.</li> </ul> <h3 id="red-light-green-light-stopping-and-starting">Red light, green light (stopping and starting)</h3> <p>Sometimes, the Kentik agent needs a good swift kick in the process. To do that, you can use the systemctl utility:</p> <ul> <li><code class="language-text">sudo systemctl stop kagent.service</code></li> <li><code class="language-text">sudo systemctl start kagent.service</code></li> <li><code class="language-text">sudo systemctl restart kagent.service</code></li> </ul> <p>Meanwhile, as discussed in the section on Docker, sometimes you need to restart the container itself:</p> <ul> <li><code class="language-text">docker ps</code> To get a list of running containers</li> <li><code class="language-text">docker ps -a</code> To get a list of all containers (running or stopped)</li> <li><code class="language-text">docker stop &lt;container ID></code> To stop the container</li> <li><code class="language-text">docker start &lt;container ID></code> To start the container</li> <li><code class="language-text">docker restart &lt;container ID></code> To stop and start the container all at once</li> <li><code class="language-text">docker rm &lt;container ID></code> To delete the container <ul> <li>Note: You have to stop the container first</li> <li>Another note: All the systems monitored by this container will have to be re-added when you recreate it.</li> </ul> </li> </ul> <h2 id="the-end-of-the-beginning">The end of the beginning</h2> <p>If you’ve arrived here after reading the previous posts on <a href="https://www.kentik.com/blog/using-kentik-network-monitoring-system-to-identify-network-outages/">using Kentik NMS to troubleshoot</a>, <a href="https://www.kentik.com/blog/how-to-configure-kentik-nms-to-collect-custom-snmp-metrics/">adding a single custom metric</a>, <a href="https://www.kentik.com/blog/adding-multiple-custom-metrics-to-kentik-nms/">adding multiple custom metrics</a>, or <a href="https://www.kentik.com/blog/adjusting-data-before-sending-it-to-kentik-nms/">modifying custom metrics</a> before they’re sent to the Kentik platform, you now have everything you need to get the Kentik NMS Ranger Collector agent installed, running, and collecting monitoring data from your systems.</p> <p>On the other hand, if you arrived here fresh from the internet and this is your first encounter with Kentik NMS, I invite you to use the links in the paragraph above to explore further.</p><![CDATA[Adjusting Data Before Sending It to Kentik NMS]]><![CDATA[Leon takes a look at how to further modify SNMP metrics in Kentik NMS. Because sometimes, you want to modify data before displaying it in your network monitoring system. 
Ready? Let’s get mathy with it!]]>https://www.kentik.com/blog/adjusting-data-before-sending-it-to-kentik-nmshttps://www.kentik.com/blog/adjusting-data-before-sending-it-to-kentik-nms<![CDATA[Leon Adato]]>Tue, 02 Apr 2024 07:00:00 GMT<p>In my ongoing exploration of Kentik NMS, I continue to peel back not only the layers of what the product can do but also the layers of information I quietly glossed over in my original example, hoping nobody noticed.</p> <p>In this blog, I want to both admit to and correct one of the most glaring ones:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1KUAuvMTTM44fUD3qf7mMv/52ecc2bea2d313de1270fc203620480a/nms-metrics-temperature-1.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Metrics Explorer showing device temperature" /> <p>If that is the temperature of one of your devices, you should seek immediate assistance. I don’t want to alarm you, but that’s six times hotter than the surface of the sun.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4UbG11DllZ7GpgdHVhwtS1/35f3646e8dd36a36b22c6bcb80f130f1/hot-pc.png" style="max-width: 400px;" class="image center" thumbnail alt="A very hot PC" /> <p>In reality, the SNMP OID in question gives temperature in <em>mC (millicelsius, or thousandths of a degree Celsius)</em>, so all we really need to do is divide by 1,000. But this opens the door to plenty of other situations where it’s not only nice but necessary to adjust metrics before sending them to Kentik NMS.</p> <h2 id="starlark-for-the-easily-distracted"><strong>Starlark for the easily distracted</strong></h2> <p>Kentik comes with scripting capabilities courtesy of <a href="https://bazel.build/rules/language">Starlark</a> (formerly known as Skylark), a Python-like language created by Google.</p> <p>That last sentence will either set your mind at ease or send you running for the door, and I’m honestly not sure how I feel about it myself.</p> <p>But, back to the task at hand, Starlark will let you take the values that come in via an OID and then manipulate them.</p> <p>A script block, which goes in the <code class="language-text">reports</code> file, must define a function called <code class="language-text">process</code> with two parameters: the record and the index set. 
It typically looks like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">reports:
  /foo/bar/baz:
    script: !starlark |
      def process(n, indexes):
        (do stuff here)</code></pre></div> <p></p> <p>That’s really all you have to know for now.</p> <div as="Promo"></div> <h2 id="to-review-this-is-our-metric"><strong>To review, this is our metric</strong></h2> <p>If you missed <a href="https://www.kentik.com/blog/using-kentik-network-monitoring-system-to-identify-network-outages/">the original post</a> and don’t feel like going back and reading it, here are the essentials:</p> <ul> <li>Move to (or create if it doesn’t exist) the dedicated folder on the system where the Kentik agent (kagent) is running:</li> </ul> <p><code class="language-text">/opt/kentik/components/ranger/local/config</code></p> <ul> <li>In that directory, create directories for /sources, /reports, and /profiles</li> <li>Create three specific files:</li> </ul> <ol> <li>Under /sources, a file that lists the custom OID to be collected</li> <li>Under /reports, a file that associates the custom OID with the data category it will appear under within the Kentik portal</li> <li>Under /profiles, a file that describes a type of device (using the SNMP System Object ID) and the report(s) to be associated with that device type</li> </ol> <ul> <li>Make sure all of those directories (and the files beneath them) are owned by the Kentik user and group:</li> </ul> <p><code class="language-text">sudo chown -R kentik:kentik /opt/kentik/components/ranger/</code></p> <p>sources/linux.yml</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-linux
  kind: sources
sources:
  CPUTemp: !snmp
    value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
    interval: 60s</code></pre></div> <p></p> <p>reports/linux_temps_report.yml</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
        interval: 60s</code></pre></div> <p></p> <p>profiles/local-net-snmp.yml</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-net-snmp
  kind: profile
profile:
  match:
    sysobjectid:
      - 1.3.6.1.4.1.8072.*
  reports:
    - local-temp
  include:
    - device_name_ip</code></pre></div> <p></p> <p>As I showed earlier in this post, that gives you data that looks like this in Metrics Explorer:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1KUAuvMTTM44fUD3qf7mMv/52ecc2bea2d313de1270fc203620480a/nms-metrics-temperature-1.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Metrics Explorer showing device temperature" /> <p>Notice that my temperature readings are up around the 33,000 mark? 
We gotta do something about that.</p> <img src="//images.ctfassets.net/6yom6slo28h2/64JOHCjZvvRCBqgjMaQMUN/dab04727458f11196865d79c49630cc9/chill-pc.png" style="max-width: 400px;" class="image center" thumbnail alt="A chill PC" /> <h2 id="this-is-our-metric-on-starlark"><strong>This is our metric on Starlark</strong></h2> <p>First, we’ll do the simple math - dividing our output by 1000.</p> <ul> <li>sources/linux.yml - stays the same</li> <li>profiles/local-net-snmp.yml - stays the same</li> </ul> <p>Our new reports/linux_temps_report.yml file becomes:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    script: !starlark |
      def process(n, indexes):
        n['CPUTemp'].value = n['CPUTemp'].value//1000
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
        interval: 60s</code></pre></div> <p></p> <p>Let’s take a moment to unpack the changes to this file:</p> <ul> <li>Under the category /device/linux/temp, we’re going to declare a Starlark script</li> <li>That script defines – via the pipe character ( | ), which starts a multi-line block – a process function that receives <ul> <li>n, the record containing the data</li> <li>indexes, the index set for the record</li> </ul> </li> <li>It pulls the CPUTemp value from the record and re-assigns it, replacing the original value with that value divided by 1000 <ul> <li>To dig into the guts of Starlark for a moment, the two slashes (“//”) indicate “floored division” - which takes just the integer portion of the result.</li> </ul> </li> <li>The YAML file then goes on to identify the record itself, pulling the value from the OID 1.3.6.1.4.1 (and so on).</li> </ul> <p>I’m going to re-phrase because what the file does is actually backward from what is happening:</p> <p>The <code class="language-text">script:</code> block declares the process but doesn’t run it. It’s just setting the stage.</p> <p>The <code class="language-text">fields:</code> block is the part that identifies the data we’re pulling. Every time a machine returns temperature information (a record set), that process is run, replacing the original CPUTemp value with CPUTemp/1000.</p> <p>The result is an entirely different set of temperature values:</p> <img src="//images.ctfassets.net/6yom6slo28h2/yKBVoAt0dNBrx4q8cwZsG/30df41fa14ee902ee14f02cea6ca946c/nms-revised-temperature-metrics-2.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Metrics Explorer showing revised temperature metrics" /> <h2 id="when-you-need-a-dessert-topping-and-a-floor-wax"><strong>When you need a dessert topping AND a floor wax</strong></h2> <img src="//images.ctfassets.net/6yom6slo28h2/4kjlfNUQAZogfTJ7AkBst5/54e7a99e46389ccc121515190a683dfd/new-shimmer-metrics-3.png" style="max-width: 400px;" class="image center" thumbnail alt="New Shimmer meme" /> <p>Sometimes, you need to do the math but also store (and display) the original value. 
In that case, you just need one small change:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    script: !starlark |
      def process(n, indexes):
        n.append('CPUTempC', n['CPUTemp'].value//1000, metric=True)
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
        interval: 60s</code></pre></div> <p></p> <img src="//images.ctfassets.net/6yom6slo28h2/SlZ2uyTFgX5Kj8F75yBQ0/b08dd218e5eecf5d009621a9b5f79fa9/cpu-temp-celsius-4.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Updated CPU temperature metrics in Celsius" /> <h2 id="making-it-more-mathy"><strong>Making it more mathy!</strong></h2> <p>To build on the previous example, this is what it would look like if you wanted to take that Celsius result and convert it to Fahrenheit:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-tempF
  kind: reports
reports:
  /device/linux/tempF:
    script: !starlark |
      def process(n, indexes):
        n.append('CPUTempF', n['CPUTemp'].value//1000*9/5+32, metric=True)
    fields:
      CPUTemp: !snmp
        value: 1.3.6.1.4.1.2021.13.16.2.1.3.1
        metric: true
        interval: 60s</code></pre></div> <p></p> <p>(A quick sanity check on that formula: a raw reading of 33000 becomes 33000//1000 = 33, and 33*9/5+32 = 91.4°F.)</p> <img src="//images.ctfassets.net/6yom6slo28h2/2aRD5O0HXqqqgrD2AmqXi2/f85e71a31f30fe98f038a695f24eef2e/updated-cpu-temperature-metrics-farenheit-5.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Updated CPU temperature metrics in Fahrenheit" /> <h2 id="starlark-for-the-un-wholesomely-curious"><strong>Starlark for the un-wholesomely curious</strong></h2> <p>There’s a lot more to say about (and explore with) Starlark, but I want to leave you with just a few tidbits for now:</p> <ul> <li>Ranger will call the process function every time the report runs.</li> <li>For table-based reports, the process function will be called once for each row. This makes it possible to: <ul> <li>create new records</li> <li>maintain state across calls to process</li> <li>combine data from multiple table rows</li> </ul> </li> <li>Scripts can be included in the report (as shown in this blog), or referenced as an external file:</li> </ul> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">script: !external
  type: starlark
  file: test.star</code></pre></div> <p></p> <h2 id="building-it-up"><strong>Building it up</strong></h2> <p><a href="https://www.kentik.com/blog/adding-multiple-custom-metrics-to-kentik-nms/">In my most recent blog</a> on adding custom OIDs, I showed how to add a table of values instead of just a single item. 
The specific use case was providing temperatures for each of the CPUs in a system.</p> <p>The YAML files to do that looked like this:</p> <p>sources/linux.yml</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-linux
  kind: sources
sources:
  CPUTemp: !snmp
    table: 1.3.6.1.4.1.2021.13.16.2
    interval: 60s</code></pre></div> <p></p> <p>reports/temp.yml</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-temp
  kind: reports
reports:
  /device/linux/temp:
    fields:
      name: !snmp
        table: 1.3.6.1.4.1.2021.13.16.2
        value: 1.3.6.1.4.1.2021.13.16.2.1.2
        metric: false
      CPUTemp: !snmp
        table: 1.3.6.1.4.1.2021.13.16.2
        value: 1.3.6.1.4.1.2021.13.16.2.1.3
        metric: true
        interval: 60s</code></pre></div> <p></p> <p>profiles/local-net-snmp.yml</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-net-snmp
  kind: profile
profile:
  match:
    sysobjectid:
      - 1.3.6.1.4.1.8072.*
  reports:
    - local-temp
  include:
    - device_name_ip</code></pre></div> <p></p> <p>Incorporating what we’ve learned in this post, here are the changes. You’ll note that I’ve renamed a few things mostly to keep these new elements from conflicting with what we created before:</p> <p>linux_multitemp.yml</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: linux_multitemp
  kind: sources
sources:
  CPUTemp_Multi: !snmp
    table: 1.3.6.1.4.1.2021.13.16.2
    interval: 60s</code></pre></div> <p></p> <p>This is effectively the same as the linux_temp.yml I re-posted from the last post. But again, I renamed the file, the metadata name, and the source name to keep things a little separate from what we’ve done.</p> <p>linux_multitempsc_reports.yml</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-multitempC
  kind: reports
reports:
  /device/linux/multitempC:
    script: !starlark |
      def process(n, indexes):
        n['CPUTemp_Multi'].value = n['CPUTemp_Multi'].value//1000
    fields:
      CPUname: !snmp
        table: 1.3.6.1.4.1.2021.13.16.2
        value: 1.3.6.1.4.1.2021.13.16.2.1.2
        metric: false
      CPUTemp_Multi: !snmp
        table: 1.3.6.1.4.1.2021.13.16.2
        value: 1.3.6.1.4.1.2021.13.16.2.1.3
        metric: true
        interval: 60s</code></pre></div> <p></p> <p>The major change here is the addition of the <code class="language-text">script</code> block. The other changes are simply renames:</p> <p>local_net_snmp.yml</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1
metadata:
  name: local-net-snmp
  kind: profile
profile:
  match:
    sysobjectid:
      - 1.3.6.1.4.1.8072.*
  reports:
    - local-temp
    - local-multitempC
  include:
    - device_name_ip</code></pre></div> <p></p> <p>In this file, the only addition is <code class="language-text">local-multitempC</code> in the reports section.</p> <p>The result is a delightful blend of everything we’ve tested out so far. 
We have temperature values for each of the CPUs on a given system, and those values have been converted from millicelsius to Celsius.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7zfnwagpmGIXSTFMIOVLtF/0e8f58065819f96039aa378b2c7b7b27/temperature-metrics-microcelsius-to-celcius-6.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Updated temperature metrics - millicelsius to Celsius" /> <h2 id="why-summarize-when-we-both-know-im-not-done">Why summarize when we both know I’m not done?</h2> <p>This post, along with all those that have come before, again highlights the incredible flexibility and capability of Kentik NMS. But there are so many more things to show! How to ingest non-SNMP data, how to add totally new device types, and how to install the NMS in the first place.</p> <p>Wait… THAT HASN’T BEEN COVERED YET?!?!</p> <p>Oof. I’d better get started writing the next post.</p> <p>As always, I hope you’ll stick with me as we learn more about this together. If you’d like to get started now, start a <a href="#signup_dialog" title="Request a Free Trial of Kentik">free trial</a> or <a href="#demo_dialog" title="Request a Demo of Kentik">request a personalized demo</a>.</p><![CDATA[What Caused the Red Sea Submarine Cable Cuts?]]><![CDATA[In the latest collision between geopolitics and the physical internet, three major submarine cables in the Red Sea were cut last month likely as a result of attacks by Houthi militants in Yemen on passing merchant vessels. In this post, we review the situation and delve into some of the observable impacts of the subsea cable cuts.]]>https://www.kentik.com/blog/what-caused-the-red-sea-submarine-cable-cutshttps://www.kentik.com/blog/what-caused-the-red-sea-submarine-cable-cuts<![CDATA[Doug Madory]]>Sun, 31 Mar 2024 04:00:00 GMT<p>On February 24, three submarine cables were cut in the Red Sea (Seacom/TGN-EA, EIG, and AAE-1), disrupting internet traffic for service providers from East Africa to Southeast Asia. For a variety of reasons, the Red Sea has been a dangerous place for submarine cables over the years, but with a hostile party firing missiles at nearby seafaring vessels, it has recently become even more so.</p> <div as="WistiaVideo" videoId="mswbehusup" audio></div> <p>In this post, we will review the background of the unique situation in the Red Sea and go into the internet impacts and timings of the submarine cable cuts. This work has been done in collaboration with WIRED magazine, which is <a href="https://www.wired.com/story/houthi-internet-cables-ship-anchor-path">simultaneously publishing</a> its own investigation into the cable cuts.</p> <h2 id="background">Background</h2> <p>In response to the ongoing war in Gaza, the Houthi-controlled Yemeni government began firing missiles and armed drones at ships transiting the nearby <a href="https://en.wikipedia.org/wiki/Bab-el-Mandeb">Bab al-Mandab strait</a> that they believe have some affiliation with Israel, Britain, or the United States.</p> <p>In November, <a href="https://www.timesofisrael.com/missile-from-yemen-intercepted-over-red-sea-as-houthi-chief-vows-to-keep-up-attacks/">missiles were fired</a> from Yemen at Israel’s key Red Sea shipping port of Eilat. Shortly afterward, YemenNet suffered a <a href="https://twitter.com/DougMadory/status/1723016723459125292">multi-hour outage</a>, leading to speculation that the internet blackout was retaliation for the missile strike. 
I described this episode <a href="https://www.kentik.com/blog/cloud-observer-subsea-cable-maintenance-impacts-cloud-connectivity/">in the conclusion of a blog post</a> calling for greater transparency within the submarine cable industry, because, according to cable owner GCX, the YemenNet outage was caused by scheduled maintenance.</p> <p>In late December, a post in a Houthi-allied Telegram channel suggested that the submarine cables could also become targets of their retaliatory attacks on behalf of Gazans. However, after the threat was widely reported by the media, the Houthi-controlled Yemen Ministry of Telecommunications <a href="https://twitter.com/mtityemen/status/1739659835908432341">published a statement</a> disavowing any targeting of submarine cables:</p> <p>“The (Ministry of Telecommunications) disclaims what has been published by the social media and other media … with regard to the so-called threats against the Submarine Cables that cross Bab al-Mandeb in the Red Sea-Yemen.”</p> <div as="Promo"></div> <h2 id="a-dangerous-place-for-submarine-cables">A dangerous place for submarine cables</h2> <p>In June 2022, I <a href="https://www.kentik.com/blog/outage-in-egypt-impacted-aws-gcp-and-azure-interregional-connectivity/">published a blog post</a> discussing Egypt’s role as a chokepoint for global internet communications. However, this chokepoint extends through the neighboring Red Sea which offers its own set of risks.</p> <p>Cargo vessels in the Red Sea awaiting their turn to traverse the nearby Suez Canal will drop anchor in the sea inlet’s relatively shallow depths. The possibility of an anchor snagging one or more submarine cables is very real and has occurred in the past.</p> <p>In February 2012, <a href="https://www.dailymail.co.uk/news/article-2108868/Ships-anchor-accidentally-slices-internet-cable-cutting-access-African-countries.html">three submarine cables were cut</a> in the Red Sea due to a ship dragging its anchor. At the time, I <a href="https://web.archive.org/web/20140521135653/http://www.renesys.com/2012/02/east-african-cable-breaks/">wrote</a> about the impacts on internet connectivity in East Africa, focusing on how much connectivity <em>had survived</em> what would have previously been a cause for a complete communications blackout in the region.</p> <h2 id="what-happened-on-february-24">What happened on February 24?</h2> <p>Three submarine cables suffered cuts on the morning of Saturday, February 24: <a href="https://www.submarinecablemap.com/submarine-cable/europe-india-gateway-eig">EIG</a>, <a href="https://www.submarinecablemap.com/submarine-cable/seacomtata-tgn-eurasia">Seacom/TGN-EA</a>, and <a href="https://www.submarinecablemap.com/submarine-cable/asia-africa-europe-1-aae-1">AAE-1</a>. According to industry sources, EIG was already down at the time due to a previous cut which occurred in early December, so the operational impact on internet communications of a second cut was minimal.</p> <p>Seacom, to their great credit, continued its practice of being one of the most open and communicative submarine cables in the world by immediately <a href="https://www.itweb.co.za/article/seacom-confirms-cable-outage-in-red-sea/KPNG8v8NyDNM4mwD">confirming the damage</a> sustained to their cable. EIG and AAE-1 have not published anything similar.</p> <p>Initial speculation of the cause of the cuts focused on the purported threats against submarine cables posted in the Telegram channel from December. 
How Yemen would have pulled off such an undersea attack was left unexplained: underwater explosives, divers with cutting gear, a submersible?</p> <p>Before long, a more realistic theory emerged from the submarine cable industry. Days before the three subsea cable failures, a Belize-flagged, United Kingdom-owned cargo ship <a href="https://www.centcom.mil/MEDIA/PRESS-RELEASES/Press-Release-View/Article/3680410/feb-18-summary-of-red-sea-activities/">was struck by missiles</a> fired from Yemen. The crew dropped anchor and abandoned the crippled ship, the MV Rubymar. Afterwards, the Rubymar began to drift, dragging its anchor — one of the top causes of submarine cable cuts according to the <a href="https://www.iscpc.org/">International Cable Protection Committee</a>. On March 2, the derelict vessel finally <a href="https://apnews.com/article/yemen-houthi-rebels-rubymar-sinks-red-sea-fb64a490ce935756337ee3606e15d093">sank</a>, taking with it more than 41,000 tons of fertilizer.</p> <p>Though yet to be confirmed, the dragging of the Rubymar’s anchor remains the leading theory as to the cause of the submarine cable cuts on February 24. For their part, the Yemen Ministry of Telecommunications <a href="https://twitter.com/mtityemen/status/1762453092442706385">released a statement</a> denying involvement in cutting the cables.</p> <h2 id="operational-internet-impacts">Operational internet impacts</h2> <p>As mentioned earlier, the EIG was already out of commission, so we don’t expect to see an impact from its loss in Internet measurement data, but the losses of Seacom and AAE-1 were observable. In fact, due to the different geographies of these cables, I believe I have been able to infer the timings of each cable cut. There were two clear clusters of disruptions occurring at 09:46 UTC and again around 09:51 UTC — about five minutes apart.</p> <p>Tata is a part owner of the Seacom cable, and, during its lifetime, the cable has been Tata’s preferred way to provide international transit to its customers in East Africa. Disruptions in East Africa, often involving the loss of Tata transit, occurred at 09:46 UTC. Therefore, I believe that is when Seacom suffered its failure.</p> <p>Conversely, the second cluster of disruptions around 09:51 UTC primarily occurred along the path of AAE-1 from the Red Sea to Asia, although we also saw impacts in East Africa. Therefore, I believe AAE-1 suffered its failure several minutes after the Seacom cut.</p> <p>Let’s look at some impacts visible in various types of measurement data, beginning with some Kentik BGP visualizations.</p> <p>Below is a BGP visualization of the percentage of BGP sources that saw each upstream of Arusha Art (AS37143) over time for 41.222.60.0/23. 
The route wasn’t withdrawn, but AS37143 lost service from Tata (AS6453) at the estimated time of the Seacom cable cut.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4zYsJzjiZx24FjmnxT9MCD/e3fcbf7743103d6f5d76f5c12eb37a9e/bgp-monitor-tanzania.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP visualization of Arusha Art, Tanzania" /> <p>King Abdul Aziz City for Science &#x26; Technology (AS8895) was exclusively transited by Tata (AS6453) but was withdrawn at 09:46 UTC when Seacom went down.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1qmry2r0kkJxBW3mEVT7lV/711acd0e41ba0959327111ea57d504cb/bgp-monitor-king-abdul-aziz.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP visualization of King Abdul Aziz City for Science & Technology" /> <p>In the graphic below, we can see the shift of transit experienced by Djibouti Telecom with the loss of the AAE-1 cable. Looking upstream from AS30990 (Djibouti Telecom), the primary change is a loss of Cogent (AS174) replaced by AS6762 (Telecom Italia) until the Cogent service is restored through another cable hours later.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5NxizjtMe683CIo4dqzUj3/add5af671f6cbd2291a3fb13a25fa98e/bgp-monitor-djibouti.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP visualization of Djibouti Telecom" /> <p>The transit shift depicted by the BGP visualization above is also reflected in Kentik’s aggregate NetFlow, pictured below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5KWyU6GthOUGtB8iDo3uER/8d670d02d65c522eb3bad115c1738439/netflow-djibouti.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Djibouti Telecom as seen in NetFlow" /> <p>The graphic below illustrates the transit shift for a route originated by Etisalat in the UAE. When looking at the upstreams of AS8966 (Etisalat) for this route, transit from AS1299 (Arelion) and AS2914 (NTT) was lost at 09:51 UTC when AAE-1 suffered a cut. According to our BGP data, the loss was replaced by transit from AS6762 (Telecom Italia), AS7473 (Singtel) and expanded service from AS3356 (Lumen).</p> <img src="//images.ctfassets.net/6yom6slo28h2/CzBerkTEAdmzsjHWQpip6/98b3f467ca9ff79012a94279d4205eab/bgp-monitor-uae.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP visualization of Etisalat, UAE" /> <p>Impacts from the AAE-1 cable cut can be observed all the way out in Southeast Asia. Vietnamese incumbent VNPT (AS45899) had routes withdrawn at the time of the cable cut, such as the one illustrated below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7sT55yXTBEEuGlKfP0yzfj/66a459f6030afa3de45afcf797fe0928/bgp-monitor-vietnam.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP visualization of VNPT (Vietnam)" /> <p>And finally, here is an example of a route that appears to have been downstream of both severed cables. 102.213.16.0/23 (Equity Bank Tanzania) is originated by AS329242 and transited exclusively by Simbanet (AS37084). 
Below is an upstream view for AS37084, showing the loss of Tata (AS6453) service when the Seacom cable was cut, followed minutes later by the loss of Cogent (AS174) when AAE-1 went down.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4HiOjuPEzm8mbQJdIJT6sD/f37a98c028671325b0dbaa32b66ac6d7/bgp-monitor-tanzania-equity-bank.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP visualization of Equity Bank Tanzania" /> <p>Additionally, Georgia Tech’s IODA tool reported drops in active measurement to countries in East Africa, including <a href="https://ioda.inetintel.cc.gatech.edu/country/TZ?from=1708612033&#x26;until=1708957633">Tanzania</a>, <a href="https://ioda.inetintel.cc.gatech.edu/country/KE?from=1708612033&#x26;until=1708957633">Kenya</a>, <a href="https://ioda.inetintel.cc.gatech.edu/country/UG?from=1708612033&#x26;until=1708957633">Uganda</a>, and <a href="https://ioda.inetintel.cc.gatech.edu/country/MZ?from=1708612033&#x26;until=1708957633">Mozambique</a>, around 09:50 UTC, likely due to the loss of the Seacom cable.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3e1fhfZgwCi19vyzVRMeVo/04f5e6716a99817a0f1090526cc1a57e/internet-connectivity-tanzania.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Georgia Tech's IODA tool - Internet Connectivity Tanzania" /> <h2 id="conclusion">Conclusion</h2> <p>While the Red Sea has been a problem area for submarine cables for many years, the current situation is quite unique. Never before has a hostile actor repeatedly fired at vessels traversing a busy body of water filled with critical submarine cables.</p> <p>And we’re not out of the woods yet. Merchant ships in the Red Sea are <a href="https://twitter.com/CENTCOM/status/1771726107168960751">still being targeted</a>, and it is not out of the question that we could have another vessel, struck by a missile, inadvertently cut another submarine cable. The loss of another major submarine cable connecting Europe to Asia would be devastating — there are only a few of them in total.</p> <p>Lastly, spare a thought for the crews of the cable ships sailing into these dangerous waters to make the repairs necessary to keep international communications flowing. The Houthi Yemen Minister of Telecommunication recently <a href="https://twitter.com/AlnomeirMosfer/status/1764666830801506476">released a statement</a> emphasizing the requirement that cable ships obtain a permit in order to carry out repairs in the Yemeni territorial waters where his government continues targeting ships with missiles and armed drones.</p> <p>The permits, he adds, are “out of concern for (the ships’) safety.”</p><![CDATA[Transforming Human Interaction with Data Using Large Language Models and Generative AI]]><![CDATA[AI has been on a decades-long journey to revolutionize technology by emulating human intelligence in computers. Recently, AI has extended its influence to areas of practical use with natural language processing and large language models. Today, LLMs enable a natural, simplified, and enhanced way for people to interact with data, the lifeblood of our modern world. 
In this extensive post, learn the history of LLMs, how they operate, and how they facilitate interaction with information.]]>https://www.kentik.com/blog/transforming-human-interaction-with-data-using-llms-and-genaihttps://www.kentik.com/blog/transforming-human-interaction-with-data-using-llms-and-genai<![CDATA[Phil Gervasi]]>Wed, 27 Mar 2024 07:00:00 GMT<p>Since its inception, artificial intelligence has been on an iterative journey to revolutionize technology by emulating human intelligence in computers. Once thought about only by academics in university computer science departments, AI has extended its influence to various areas of practical use, one of the most visible today arguably being natural language processing (NLP) and large language models (LLMs).</p> <div as="WistiaVideo" videoId="ix44glg18r" audio></div> <p>Natural language processing is a field focused on the interaction between computers and humans using natural human language. NLP enables a very natural and, by many accounts, enhanced human interaction with data, allowing for insights beyond human analytical capacity. It also facilitates faster, more efficient data analysis, especially considering the enormous size of the datasets we use in various fields today.</p> <p>Large language models, or LLMs, are a subset of advanced NLP models capable of understanding and generating coherent, contextually relevant text. They represent a <em>probabilistic model</em> of natural language, similar to how texting applications predict the next word a human may input when typing. However, LLMs are much more sophisticated, generating a series of words, sentences, and entire paragraphs based on training from a vast body of text data.</p> <h2 id="ai-llms-and-network-monitoring">AI, LLMs, and network monitoring</h2> <p>Generative AI and LLMs have become useful in a variety of applications, including aerospace, medical research, engineering, manufacturing, computer science, and so on. IT operations and network monitoring in particular have benefited from generative AI’s democratization of information. Most recently, LLMs have provided a new service layer between engineers building and monitoring networks and the vast amount of telemetry data they produce.</p> <p>Built on a foundation of NLP, LLMs have even gone so far as to enable the democratization of information among anyone with access to a computer.</p> <h2 id="the-origin-and-evolution-of-large-language-models">The origin and evolution of large language models</h2> <p>NLP is currently a major focus in the evolution of AI. NLP research concerns how computers can understand, interpret, and generate human language, making it a helpful bridge between computers (or, more accurately, the data they contain) and actual humans. The power and excitement around NLP are based on its ability to enable computers to comprehend the nuances of human language, thus fostering a more intuitive and effective human-computer interaction.</p> <h2 id="early-development-of-natural-language-processing">Early development of natural language processing</h2> <p>The origins of NLP can be traced back to the late 1940s after World War II, with the advent of the very first computers. These early attempts in computer-human communication included machine translation, which involved the understanding of natural language instructions. 
Technically speaking, humans would input commands that both humans and computers could understand, making it a very early form of a natural language interface.</p> <p>By the 1950s, experiments were done with language translation, such as the <a href="https://aclanthology.org/www.mt-archive.info/00/AMTA-2004-Hutchins.pdf">Georgetown-IBM experiment of 1954</a>, in which computers automatically performed a crude translation of over 60 Russian sentences into English. At that time, much of the research around translation centered on a dictionary-based lookup. Therefore, work during this period focused on morphology, semantics, and syntax. As sophisticated as this may sound, no higher-level programming languages existed at this time, so much of the actual code was written in assembly language on very slow computers that were available only to a small number of academics and industry researchers.</p> <p>Beginning around the early to mid-1960s, a shift toward research in interpretation and generation occurred, dividing researchers into distinct areas of NLP research.</p> <p>The first, <em>symbolic NLP</em>, also known as “deterministic,” is a rule-based method for generating syntactically correct formal language. It was favored by many computer scientists with a linguistics background. The idea of symbolic NLP is to train a machine to learn a language as humans do, through rules of grammar, syntax, etc.</p> <p>Another area, known as <em>stochastic NLP</em>, focuses on statistical and probabilistic methods of NLP. This meant focusing more on word and text pattern recognition within and among texts and the ability to recognize actual text characters visually.</p> <p>In the 1970s, we began to see research that resembled how we view artificial intelligence today. Rather than simply recognizing word patterns or generating words probabilistically, researchers experimented with constructing and manipulating abstract and meaningful text. This involved inference knowledge and eventually led to model training, both of which are significant components of modern AI broadly, especially NLP and LLMs.</p> <p>This phase, sometimes called the <em>grammatico-logical phase</em>, was a move toward using logic for reasoning and knowledge representation. This led to <a href="https://plato.stanford.edu/entries/computational-linguistics/">computational linguistics</a> and <a href="https://cs.lmu.edu/~ray/notes/languagetheory/">computational theory of language</a>, most often attributed first to Noam Chomsky, and which can be traced back to symbolic NLP in that computational linguistics is concerned with the rules of grammar and syntax. By understanding these rules, the idea is that a computer can understand and potentially even generate knowledge and, in the context of NLP, better deal with a user’s beliefs and intentions.</p> <p><em>Discourse modeling</em> is yet another area that emerged during this time. Discourse modeling looks at the exchanges between humans and computers. It attempts to process communication concepts such as the need to change “you” in a speaker’s question to “me” in the computer’s answer.</p> <p>Through the 1980s and 1990s, research in NLP focused much more on probabilistic models and experimentation in accuracy. By 1993, probabilistic and statistical methods of handling natural language processing were the most common models in use. 
In the early 2000s, NLP research became much more focused on information extraction and text generation due mainly to the vast amounts of information available on the internet.</p> <p>This was also when statistical language processing became a major focus. Statistical language processing is most concerned with providing actual valuable results. This has led to familiar uses such as information extraction and text summarization. Statistical language processing also moved NLP into practical, everyday applications during this period. Though some of the first and most compelling practical uses for NLP were spell checking, grammar applications, and translation tools such as Google Translate, statistical language processing led to the rise of early chatbots, predictive sentence completion such as in texting applications, and so on.</p> <p>In the 2010s and up to today, research has focused on enhancing the ability to discern a human’s tone and intent and on reducing the number of parameters and hyperparameters needed to train an LLM, both of which will be discussed in a later section.</p> <p>This represents two goals: first, a push to make the interaction with a human more natural and accurate, and second, the reduction of both the financial cost and time needed to train LLM models.</p> <h2 id="models-used-by-llms">Models used by LLMs</h2> <p>Remember that modern LLMs employ a <em>probabilistic model</em> of natural language. A probabilistic model is a mathematical framework used to represent and analyze the probability of different outcomes in a process. This mathematical framework is essential for generating logical, context-aware, natural language text that captures the variety and even the uncertainty that exists in everyday human language.</p> <h2 id="n-gram-language-models">N-gram language models</h2> <p>An <em>n-gram model</em> is a probabilistic model used to predict the next item in a sequence, such as a word in a sentence. The term “n-gram” refers to a sequence of <em>n</em> items (such as words), and the model is built on the idea that the probability of a word in a text can be approximated by the context provided by the preceding words. Therefore, the n-gram model is used in language models to assign probabilities to sentences and sequences of words.</p> <p>In the context of predicting the probability of the next word in a sentence based on a body of training data (text), the n-gram formula is relatively straightforward.</p> <p>Given a sequence of words, the probability of a word <em>W<sub>n</sub></em> appearing after a sequence of <em>n</em> - 1 words can be estimated using the formula below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1gfu0aekO0c7C8UIjmantp/b5a0d74d400a172788c307fb220985d0/n-gram-formula.png" style="max-width: 700px;" class="image center" alt="n-gram language model formula" /> <p>In this formula, on the left we’re representing the conditional probability of the word <em>W<sub>n</sub></em> occurring given the preceding <em>n</em> - 1 words. On the right, we divide the count of the specific n-gram sequence in the training dataset by the count of the preceding <em>n</em> - 1 words. This becomes more complex as the value of <em>n</em> increases, mostly because the model needs to account for longer sequences of preceding words.</p> <p>The above formula represents the chain rule, which many researchers consider impractical because it’s unable to effectively keep track of all possible word histories.</p>
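<p>To make the counting concrete, here is a minimal Python sketch of the bigram case (<em>n</em> = 2). The tiny corpus, function name, and example words are ours, purely for illustration:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">from collections import Counter

def bigram_probability(tokens, w_prev, w_next):
    # P(w_next | w_prev) = Count(w_prev, w_next) / Count(w_prev)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens[:-1])  # count w_prev as a bigram start
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w_next)] / unigram_counts[w_prev]

tokens = "the quick brown fox jumps over the lazy dog".split()
print(bigram_probability(tokens, "the", "quick"))  # 0.5: "the" precedes "quick" once and "lazy" once
print(bigram_probability(tokens, "brown", "fox"))  # 1.0: "brown" is always followed by "fox"</code></pre></div>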
<p>The n-gram model isn’t typically used anymore in modern large language models, but it’s essential to understand this model as the foundation of how LLMs evolved.</p> <h3 id="what-is-an-n-gram">What is an n-gram?</h3> <p>An n-gram is a sequence of <em>n</em> consecutive items (typically words) in a text, with <em>n</em> being a number.</p> <p>For instance, take a look at the sentences below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5seQhtldPWP0hbsuXObqq0/2505de725f73e7e767bc1ea600623698/n-gram-quick-brown-fox.png" style="max-width: 500px;" class="image center simple" alt="n-gram example" /> <p>The n-gram model estimates the probability of each word based on the occurrence of its preceding <em>n</em> - 1 words. This is called probability estimation. In the example above, for the bigram model, the probability of the word “fox” following “brown” is calculated by looking at how often “brown fox” appears in the training data.</p> <p>The model operates under the <em>Markov assumption</em>, which suggests that predicting the next item in the sequence depends only on the previous <em>n</em> - 1 items and not on any earlier items. This makes the computational problem more manageable and limits the context that the model can consider.</p> <p>N-gram models are used to create statistical language models that can predict the likelihood of a sequence of words. This is fundamental in applications like speech recognition, text prediction, and machine translation. Because they can generate text by predicting the next word given a sequence of words and by analyzing the probabilities of entire word sequences, n-gram models can also suggest corrections in text.</p> <h3 id="using-tokenization-to-build-an-n-gram-model">Using tokenization to build an n-gram model</h3> <p>To build an n-gram model, we first collect a large body of text data. This text is then broken down into <em>tokens</em> (usually words). Ultimately, this process, called <em>tokenization</em>, splits text into individual manageable units of data the LLM can digest. Though these tokens are typically words in most NLP applications, they can also be characters, subwords, or phrases.</p> <p>Take the image below, for example.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3ZEieoITDvJ6DTakorzlEn/85cd630002f69c29d3d80902eea27df6/n-gram-sentence.png" style="max-width: 500px;" class="image center simple" alt="Example of a sentence to be broken down using tokenization" /> <p>In this example, word-level tokenization splits the text into individual words.</p> <p><em>Word level tokenization:</em></p> <img src="//images.ctfassets.net/6yom6slo28h2/79X57xQ4FN9NlZ0e6Jnyxj/f931ae002ef08c8d2e6ac736e9e770c2/n-gram-word-level-tokenization.png" style="max-width: 500px;" class="image center simple" alt="Example of n-gram word level tokenization" /> <p><em>We can also break the text down to individual characters:</em></p> <img src="//images.ctfassets.net/6yom6slo28h2/6DXd6sE7ZUSNBYjeIdLInA/145c8b199d294779f93c5cc3fca40f0f/n-gram-character-level-tokenization.png" style="max-width: 500px;" class="image center simple" alt="Example of n-gram character level tokenization" /> <p>Thus, the main goal of tokenization is to convert a text (like a sentence or a document) into a list of tokens that can be easily analyzed or processed by a computer. Once the text is tokenized into words, these tokens are used to form n-grams.</p>
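<p>As a rough sketch of that pipeline (a production tokenizer handles punctuation, casing, and subwords far more carefully), word-level tokenization and n-gram formation can be as simple as the following, with the sample text ours for illustration:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">def ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The quick brown fox jumps over the lazy dog"
tokens = text.lower().split()  # naive word-level tokenization
print(ngrams(tokens, 2))  # bigrams: ('the', 'quick'), ('quick', 'brown'), ...
print(ngrams(tokens, 3))  # trigrams: ('the', 'quick', 'brown'), ...</code></pre></div>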
<p>As the sketch above shows, in a bigram model, pairs of consecutive tokens are formed; in a trigram model, triples of consecutive tokens are created, and so on.</p> <p>Tokenization also involves decisions about handling special cases like punctuation, numbers, and special characters. For example, should “don’t” be treated as one token or split into “do” and “n’t”?</p> <p>Embeddings are the actual representations of these characters, words, and phrases as vectors in a continuous, high-dimensional space. During model training and subsequent tokenization, the optimal vector representation of each word is determined to understand how they relate to each other. This is one mechanism used by the model to determine what words logically go together or what words are more likely to follow from another word or phrase.</p> <p>In the image below from a <a href="https://vaclavkosar.com/ml/Tokenization-in-Machine-Learning-Explained">blog post by Vaclav Kosar</a>, you can see how the sentence “This is a input text.” is tokenized into words, with the resulting embeddings passed into the model.</p> <img src="//images.ctfassets.net/6yom6slo28h2/HX6zXtaFWNgAZYoCU2Bbi/c142dbec86c78edd6758baf6390e27be/this-is-a-input-text.png" style="max-width: 550px;" class="image center simple" alt="Example showing tokenization and embeddings" /> <h3 id="challenges-and-limitations-of-n-gram-models">Challenges and limitations of n-gram models</h3> <p>The main challenge with n-gram models is data sparsity. As <em>n</em> increases, the model requires a larger dataset to capture the probabilities of n-grams accurately. There is also a contextual limitation: n-grams provide a limited window of context (only <em>n</em> - 1 words), which often isn’t enough to fully understand the meaning of a sentence, and increasing <em>n</em> helps only at the cost of even greater data sparsity.</p> <p>Therefore, due to their limitations, n-gram models have been largely superseded by more advanced models like neural networks and transformers, which can better capture longer-range dependencies and contextual nuances.</p> <h2 id="a-shift-to-neural-network-models">A shift to neural network models</h2> <h3 id="from-traditional-nlp-to-neural-approaches">From traditional NLP to neural approaches</h3> <p>In recent years, the industry has shifted away from using statistical language models toward using neural network models.</p> <p>Neural network models are computational models inspired by the human brain. They consist of layers of interconnected nodes, or “neurons,” each of which performs simple computations. Though popular culture may conflate neural networks with how the actual human brain works, they are quite different. Neural networks can certainly produce sophisticated outputs, but they do not yet approach the complexity of how a human mind functions.</p> <p>In a neural network, the initial input layer receives input data, followed by hidden layers, which perform computations through “neurons” that are interconnected. The complexity of the network often depends on the number of hidden layers and the number of neurons in these layers. 
Finally, the output layer produces the final output of the network.</p> <p>In the image below, you can see the visual representation of the input layer, hidden layers, and the output layer, each with neurons performing individual computations within the model.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6DUtJ9wB2Wb9t8SZUlR7xw/65a39300f3c02052d26b388f228c8de0/neural-network-layers.png" style="max-width: 650px;" class="image center" alt="Neural network layers" /> <div class="caption" style="margin-top: -35px">Image from ibm.com <a href="https://www.ibm.com/topics/neural-networks">“What is a neural network?”</a></div> <h3 id="how-neural-networks-function">How neural networks function</h3> <p>After data is fed into the input layer, each neuron in the hidden layers processes the data, often using non-linear functions, to capture complex patterns. Each individual neuron, or node, can be looked at as its own linear regression model, composed of input data, weights, a threshold, and its output.</p> <p>The result is then passed to the next layer for the subsequent computation, then to the next layer, and so on, until the result reaches the output layer, which outputs some value. This could be a classification, a prediction, etc.</p> <p>Normally, this is done in only one direction from the input layer, through successive hidden layers, and then the output layer in a feed-forward design. However, there is some backward activity called <em>backpropagation</em>. Backpropagation is used to fine-tune the model’s parameters during the training process.</p> <p>Neural networks learn from data in the sense that if an end result is inaccurate in some way, the model can make adjustments autonomously to reach more accurate results. Backpropagation evaluates the accuracy of the final result from the output layer and then uses optimization algorithms like gradient descent to “go back” through the layers and adjust the model’s parameters to improve accuracy and performance.</p> <p>In the image below, we can see how backpropagation will use the results from the output layer to go backward in the model to make its adjustments.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2hVz33OLhMvc8YVXnVhL0O/a8e57b333cb83b6284648dabab3dfba4/neural-network-backpropagation.png" style="max-width: 675px;" class="image center" alt="Neural network layers showing backpropagation" /> <p>This process can be done many times during training until the model produces sufficiently accurate results and is considered “fit for service.”</p> <p>Parameters and hyperparameters are discussed in greater detail below.</p> <h3 id="neural-networks-in-nlp">Neural networks in NLP</h3> <p>In natural language processing, neural networks are used to create word embeddings (like Word2Vec or GloVe), where words are represented as dense vectors in a continuous vector space. This helps capture semantic relationships between words.</p> <p>Notice in the diagram below the relationships between and among words in a given dataset. 
In this visualization from chapter 6 of Jurafsky and Martin’s book, <em><a href="https://web.stanford.edu/~jurafsky/slp3/">Speech and Language Processing</a></em> (still in progress), we see a two-dimensional projection of embeddings for some words and phrases, showing that words with similar meanings are nearby in space.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2KtpVRPWERLDluYeIi8O9p/a95cba9ac756c566fcfb87e821c3e956/neural-network-enbeddings.png" style="max-width: 600px;" class="image center" alt="Neural network embeddings" /> <div class="caption" style="margin-top: -35px"><a href="https://web.stanford.edu/~jurafsky/slp3/">Image source</a></div> <p>Several neural networks have been used in recent years, each of which either improves on another or serves a different purpose.</p> <p><em>Recurrent neural networks (RNNs)</em> are designed to handle data sequences, making them suitable for text. They process text sequentially, maintaining an internal state that encodes information about the sequence processed so far. They’re used in tasks like language modeling, text generation, and speech recognition but struggle with long-range dependencies due to vanishing gradient problems.</p> <p><em>Long short-term memory (LSTM)</em> networks are a type of RNN and are particularly effective in capturing long-range dependencies in text, which is crucial for understanding context and meaning. LSTMs include a memory cell and gates to regulate the flow of information, addressing the vanishing gradient problem of standard RNNs. LSTMs are widely used in sequence prediction, language modeling, and machine translation.</p> <p><em>Convolutional neural networks (CNNs)</em> are primarily known for image processing, but CNNs are also used in NLP to identify patterns in sequences of words or characters.</p> <p><em>Gated recurrent units (GRUs)</em> streamline RNNs and the architecture of LSTMs for greater efficiency. They merge the forget and input gates of LSTMs into a single mechanism and, similar to LSTMs, are used in various sequence processing tasks.</p> <p><em>Transformers,</em> like those used in BERT or GPT, have revolutionized NLP and significantly improved NLP tasks’ performance over earlier models. Transformers rely on attention mechanisms, allowing the model to focus on different parts of the input sequence when producing each part of the output, making them highly effective for various complex tasks.</p> <h3 id="applications-in-nlp">Applications in NLP</h3> <p>Neural networks, especially transformers, are used for generating logical and contextually relevant text. They have significantly improved the quality of machine translation as well as sentiment analysis, or in other words, analyzing and categorizing human opinions expressed in text. Neural networks are also used extensively to extract structured information from unstructured text, like named entity recognition.</p> <h3 id="advantages-and-challenges">Advantages and challenges</h3> <p>Neural networks can capture complex patterns in language and are highly effective in handling large-scale data, one significant advantage over n-gram models. When properly trained, they work well to generalize new, unseen data.</p> <p>However, neural networks require substantial computational resources and often need large amounts of labeled data for training. 
They can also be seen as “black boxes,” making it challenging to understand how they make specific decisions or predictions.</p> <p>Nevertheless, their ability to learn from data and capture complex patterns makes them especially powerful for tasks involving human language. They are a significant step forward in the ability to process and understand natural language, providing the backbone for many modern LLMs.</p> <h2 id="transformers">Transformers</h2> <p>A <em>transformer</em> is an advanced neural network architecture introduced in the 2017 paper <a href="https://arxiv.org/abs/1706.03762">“Attention Is All You Need”</a> by Vaswani et al. This architecture is distinct from its predecessors due to its unique handling of sequential data, like text, and its reliance on the <em>attention mechanism</em>.</p> <p>A transformer is a versatile mechanism that has been successfully applied to a broad range of NLP tasks, including language modeling, machine translation, text summarization, sentiment analysis, question answering, and more.</p> <p>These models are typically complex and require significant computational resources, but their ability to understand and generate human language has made them the backbone of many advanced LLMs.</p> <h3 id="architecture">Architecture</h3> <h4 id="attention-mechanism">Attention mechanism</h4> <p>The core innovation in transformers is the <em>attention mechanism</em>, which enables the model to focus dynamically on different parts of the input sequence when making predictions. This mechanism is crucial for understanding context and relationships within the text.</p> <p>Attention allows the model to focus on the most relevant parts of the input for a given task, similar to how humans pay attention to specific aspects of what they see or hear. It assigns different levels of importance, or weights, to various parts of the input data. The model can then prioritize which parts to focus on based on these weights.</p> <p>Transformers use a type of attention called <em>self-attention</em>. Self-attention allows each position in the input sequence to attend to all positions in the previous layer of the model simultaneously. It then calculates scores that determine how much focus to put on other parts of the input for each word in the sequence. This involves computing a set of queries, keys, and values from the input vectors, typically using linear transformations.</p> <p>The attention scores are then used to create a weighted sum of the values, which becomes the output of the self-attention layer. The output is then used in subsequent layers of the transformer.</p> <p>Attention allows transformers to understand the context of words in a sentence, capturing both local and long-range dependencies in text. Multi-head attention runs multiple attention mechanisms (heads) simultaneously. Each head looks at the input from a different perspective, focusing on different parts of the input. Using multiple heads allows the model to identify the various types of relationships in the data.</p> <p>Many transformers use an encoder-decoder framework. The encoder processes the input (like text in one language), and the decoder generates the output (like translated text in another language).</p>
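<p>Whether it sits in an encoder or a decoder, each self-attention layer performs the same core computation described above. Here is a minimal single-head sketch in NumPy; the toy shapes and random weights are purely illustrative, and real implementations add masking, multiple heads, and learned projections:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project the inputs into queries, keys, and values
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Scores: how much each token attends to every other token
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Softmax over each row turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output: a weighted sum of the values
    return weights @ V

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 8))  # 4 tokens, 8-dimensional embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)</code></pre></div>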
<p>However, some models, like GPT, use only the decoder part for tasks like text generation.</p> <h4 id="parallel-processing">Parallel processing</h4> <p>Unlike earlier sequential models (like RNNs and LSTMs), transformers process the entire input sequence at once, not one element at a time. Transformers allow for more efficient parallel processing and faster training. Parallel processing is mainly about model performance and efficiency: multiple tasks, such as tokenization, data processing, and hyperparameter tuning, run simultaneously to reduce the operational cost of training the model.</p> <p>Parallel processing also allows for better handling of longer sequences, though attention mechanisms can be computationally intensive for especially long sequences. However, this can be addressed by using a layered structure.</p> <h4 id="layered-structure">Layered structure</h4> <p>A layered transformer model structure involves multiple stacked layers of self-attention and feed-forward networks, which contribute to the transformer’s ability to understand and generate language.</p> <p>Each layer in a transformer can learn different aspects of the data, so as information passes through successive layers, the model can build increasingly complex representations. For example, in NLP, lower layers might capture basic syntactic structures, while higher layers can understand more abstract concepts like context, sentiment, or semantic relationships.</p> <p>Each layer computes self-attention, allowing the model to refine its understanding of the relationships between words in a sentence. This results in rich contextual embeddings for words, where the entire sequence informs the representation of each word.</p> <p>Finally, the abstraction achieved through multiple layers allows transformers to handle new data the model has never seen. It allows previously learned patterns and relationships to be applied more effectively to new and sometimes completely different contexts.</p> <p>Below is a diagram of a transformer model architecture from Vaswani et al.’s paper.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5a7nfn5jfVR8RDHO9aygop/2c0cb2b330f7302300ce603a632a1378/transformer-model-architecture.png" style="max-width: 450px;" class="image center simple" alt="Transformer model architecture" /> <div class="caption" style="margin-top: -35px"><a href="https://arxiv.org/pdf/1706.03762.pdf">Image source</a></div> <p>Notice how the Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder (left) and decoder (right).</p> <h2 id="training-llms">Training LLMs</h2> <h3 id="data-preparation-and-preprocessing">Data preparation and preprocessing</h3> <p>First, before we can begin the data preparation process, we select a dataset. For LLMs, this involves choosing a large and diverse body of text relevant to the intended application. The dataset might include books, articles, websites, or, in the case of some popular LLMs, the public internet itself.</p> <p>Next, the data must be cleaned and preprocessed, usually involving tokenization, which, as mentioned in the earlier section on tokenization, is the process of converting text into tokens, typically individual words. 
The data then goes through normalization (like lowercasing), removing unnecessary characters, and sometimes subword tokenization (breaking words into smaller units).</p> <h3 id="model-parameters">Model parameters</h3> <p>After this initial data preparation and preprocessing stage, we initialize the <em>model parameters</em>. Parameters are the various settings and configurations that define the architecture and behavior of the model. For modern LLMs, the number of parameters in a model can be in the tens of billions, requiring significant computational power and resources.</p> <p>In machine learning, parameters are the variables that allow the model to learn the rules from the data. Parameters are not generally entered manually. Instead, during model training, which includes backpropagation, the algorithm adjusts parameters, making parameters critical to how a model learns and makes predictions.</p> <p>There is also a concept of parameter initialization, which provides the model with a small set of meaningful and relevant values at the outset of a run so that the model has a place to start. After each iteration, the algorithm can adjust those initial values based on the accuracy of each run.</p> <p>Because there can be many parameters in use, a successful model will need to adjust discrete values and find an appropriate overall combination; for example, modifying one value may make the model more efficient but sacrifice accuracy.</p> <p>Once the parameters’ best overall values and combinations are determined, the model has completed its training and is considered ready for making predictions on new and unseen data.</p> <p>Model parameters are crucial because they significantly influence the model’s performance, learning ability, and efficiency. Additionally, model parameters directly influence operational and financial costs, and considering there may be millions or billions of parameters that can be adjusted, this can be a computationally expensive process.</p> <h3 id="hyperparameters">Hyperparameters</h3> <p>Hyperparameters differ from parameters in that they do not learn their values from the underlying data. Instead, hyperparameters are manually entered by the user and remain static during model training.</p> <p>Hyperparameters are entered by a user (engineer) at the outset of training an LLM. They are often best guesses at first, adjusted based on trial and error. This brute-force method may not sound sophisticated, but it is quite common, especially at the outset of model training.</p> <p>Examples of hyperparameters are the learning rate used in neural networks, the number of branches in a decision tree, or the number of clusters in a clustering algorithm. These are specific values set before training, though they can be adjusted throughout training iterations.</p> <p>Because hyperparameters are fixed, the values of the parameters (described in the section above) are influenced and ultimately constrained by the values of hyperparameters.</p> <h3 id="parameter-tuning">Parameter tuning</h3> <p>Tuning a model can be a balancing act of finding the optimal set of these parameters. Fine-tuning these parameters is crucial for optimizing the model’s performance and efficiency to achieve a particular task.</p>
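<p>Under the hood, each of these automatic parameter adjustments is a small gradient step of the kind sketched below. This is a deliberately toy example in plain Python: real optimizers like Adam add momentum and per-parameter scaling, and the values here are made up for illustration:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">def sgd_step(params, grads, learning_rate=0.01):
    # Nudge each parameter against its gradient to reduce the loss
    return [p - learning_rate * g for p, g in zip(params, grads)]

params = [0.5, -1.2, 3.0]  # toy parameter values
grads = [0.1, -0.4, 2.0]   # gradients of the loss w.r.t. each parameter
print(sgd_step(params, grads))  # [0.499, -1.196, 2.98]</code></pre></div>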
<p>This tuning must be done in light of the nature of the input data and resource constraints.</p> <p>For example, larger models with more parameters require more computational power and memory, and overly large models might “overfit” the training data and perform poorly on unseen data. Increasing the number of layers in a neural network may provide marginally better results, but at a greater cost in time and money.</p> <p>Today, frameworks such as PyTorch and TensorFlow are used to abstract much of this activity to more easily build and train neural networks. Both machine learning libraries are open source and very popular both in academia and production environments.</p> <p><a href="https://github.com/pytorch/pytorch">PyTorch</a>, first released in 2016 by developers at Meta, integrates well with Python and is popular, particularly among researchers and academics. Today, PyTorch is governed by the <a href="https://pytorch.org/foundation">PyTorch Foundation</a>.</p> <p><a href="https://github.com/tensorflow/tensorflow">TensorFlow</a>, first released in 2015 by the <a href="https://research.google.com/teams/brain/?ref=harveynick.com">Google Brain Team</a>, is often used in production environments due to its ability to scale well, flexible architecture, and integration with a variety of programming languages.</p> <h3 id="challenges-for-transformer-models">Challenges for transformer models</h3> <p>Transformer models, especially large ones, are computationally intensive and require substantial training resources, often necessitating powerful GPUs or TPUs. Additionally, they typically need large datasets for training to perform optimally, which can be a limitation in data-scarce scenarios.</p> <p>Also, the complexity of these models can make them harder to interpret and understand compared to simpler models. Nevertheless, the transformer model’s ability to process data sequences in parallel, combined with the powerful attention mechanism, allows it to capture complex linguistic patterns and nuances, leading to significant advancements in various language-related tasks.</p> <h2 id="challenges-for-llms">Challenges for LLMs</h2> <h3 id="hallucinations">Hallucinations</h3> <p>A <em>hallucination</em> is when an LLM generates information or responses that are factually incorrect, nonsensical, biased, or not based on reality. The responses themselves may be grammatically correct and intelligible, so “hallucinations” in this context means incorrect or misleading information, not the collection of completely random letters and words.</p> <h3 id="why-llms-hallucinate">Why LLMs hallucinate</h3> <p>Hallucinations occur due to the inherent limitations and characteristics of how LLMs generate text and, ultimately, how they are trained. They can be caused by limitations of the data set, such as the absence of recent data on a subject area. Sometimes, this has no bearing on the user’s question or prompt, but in cases in which it does, the LLM is limited to old data, resulting in an inaccurate response.</p> <p>Data may also be limited because there is a general lack of information on a topic. For example, a brand-new technology will have very little literature on it to be added to an LLM’s data set. Therefore, the results, though presented with a confident tone, may be inaccurate or misleading.</p> <p>Also, the LLM will generate inaccurate or biased results if the data set used to train a model is biased or lacks diversity. 
For example, suppose the data set used for training is based solely on writing from one group of scholars without considering writings from scholars with opposing views. In that case, the LLM will confidently produce results biased toward the data on which it was trained.</p> <p>Another issue is model <em>overfitting</em> and <em>underfitting</em>, which is well understood in statistical analysis and machine learning. Overfitting the training data means the model is too specific for the data set on which it is trained. In this scenario, the model can operate accurately only on the particular data set. Underfitting occurs when the model is trained on too broad a data set, resulting in a general inability to operate on a specific data set or, in the case of an LLM, a corpus of data relating to a particular subject area.</p> <h3 id="solving-for-hallucinations">Solving for hallucinations</h3> <h4 id="training-data">Training data</h4> <p><em>Ensuring the LLM is trained on accurate and relevant data is the first step toward mitigating hallucinations.</em></p> <p>First, it’s critical that the body of training data is sufficiently large and diverse so that there is enough information to process a query in the first place. This also has the effect of mitigating bias in the results of the model.</p> <p>Also, data must be regularly validated so that false information can be identified before (and after) the model runs. This validation process is not trivial and requires filtering, data cleaning, data standardization and format normalization, and even manual checks, all of which contribute to ensuring the training data is accurate.</p> <p>Next, test runs can be performed on a small sample of the data to test the accuracy of the model’s results. This process is iterative and can be repeated until the results are satisfactory. For each iteration, changes can be made to the model, the dataset itself, and other LLM components to increase accuracy.</p> <p>Additionally, the data should be updated over time to reflect new and more current information. Updating the model with up-to-date data helps reduce the likelihood of it providing outdated, misleading, incorrect, or outright ridiculous information.</p> <h4 id="retrieval-augmented-generation">Retrieval-augmented generation</h4> <p>Currently, many in LLM research and industry have agreed upon RAG, or Retrieval-Augmented Generation, as a method to deal with hallucinations. RAG ensures that the LLM’s generated responses are grounded in real-world data. It does this by pulling information from reliable external sources or databases, decreasing the probability of the model generating nonsensical or incorrect facts.</p> <p>RAG uses a cross-validation process to ensure factual accuracy and consistency, which means the training data can be verified with the information the model received. However, this presupposes that the external data RAG uses is also of high quality, accurate, and reliable.</p> <h4 id="model-training-and-adjusting-parameters">Model training and adjusting parameters</h4> <p>As described in previous sections on neural networks and backpropagation, an LLM can adjust its parameters based on the results’ accuracy in the training process. 
And though the number of parameters can be in the billions, adjusting parameters is still more productive than retraining the model altogether since only a small number of parameter changes could potentially have significant effects on the output.</p> <p>Because the model doesn’t yet have any meaningful knowledge or the ability to generate relevant responses at the very beginning of training, parameters are initially set to random or best-guess values. Then, the model is exposed to a large dataset, which would be a vast body of diverse text in the case of LLMs.</p> <p>In supervised learning scenarios, this text comes with corresponding outputs that the model should ideally generate. During training, an input (such as a piece of text) is fed into the model, which makes a prediction or generates a response. This is called a <em>forward pass</em>.</p> <p>The model’s output is then compared to the expected output (from the training dataset), and the difference is calculated using a loss function. This function quantifies how far the model’s prediction deviates from the desired output. The model uses this loss value to perform backpropagation, which, as explained in an earlier section, is a critical process in neural network training.</p> <p>Remember that backpropagation involves using the results of the output layer to adjust parameters to achieve more accurate results. It does this by calculating the gradient of the loss function with respect to each parameter. These gradients are used to make the actual adjustments to the model’s parameters. Often, the model employs an optimization algorithm like Stochastic Gradient Descent (SGD) or its variants (like Adam) with the idea of adjusting each parameter in a direction that minimizes the loss.</p> <p>This process of making predictions, calculating loss, and adjusting parameters is repeated over many iterations and epochs (an epoch is one complete pass through the entire training dataset) to, again, minimize loss and produce a more accurate result.</p> <h4 id="temperature">Temperature</h4> <p><em>Temperature</em> is a parameter that influences the randomness, or what some would consider the creativity, of the responses generated by the model. It is an essential component of how LLMs generate text.</p> <p>When generating a text response, an LLM predicts the probability of each possible next word based on the given input and its training. This results in a probability distribution over the entire vocabulary at the LLM’s disposal. The model uses this to determine what word or sequence of words actually makes sense in a response to a prompt or input.</p> <p>Temperature is a hyperparameter, or in other words, a static value, which the model uses to convert the raw output scores for each word into probabilities. It’s one method we can use to increase or decrease randomness, creativity, and diversity in a generated text. A high-temperature value (greater than 1) makes the model more likely to choose less probable words, making the responses less predictable and sometimes more nonsensical or irrelevant.</p> <p>When the temperature is set to 1, the model follows the probability distribution as it is, without any adjustment. This is often considered a balanced choice, offering a mix of predictability and creativity, though in practice, many will default to a slightly lower value of 0.7, for example.</p> <p>Adjusting the temperature can be useful for different applications.</p>
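<p>To see the effect concretely, here is a minimal NumPy sketch of the mechanics; the formal softmax definition follows below, and the three candidate-word scores are made up for illustration:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">import numpy as np

def softmax_with_temperature(logits, T=1.0):
    # Divide the raw scores by T before applying softmax:
    # a T above 1 flattens the distribution, a T below 1 sharpens it
    z = np.asarray(logits) / T
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

logits = [2.0, 1.0, 0.1]  # hypothetical raw scores for three candidate words
print(softmax_with_temperature(logits, T=0.7))  # more peaked, more predictable
print(softmax_with_temperature(logits, T=1.5))  # flatter, more varied</code></pre></div>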
<p>For creative tasks like storytelling or poetry generation, a higher temperature might be preferred to introduce novelty and variation. For more factual or informative tasks, a lower temperature can help maintain accuracy and relevance.</p> <p>In the image below, the temperature setting in an LLM can be seen almost as a slider that is adjusted based on the desired output.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3LoGlCS9YU937jzVPBbRLG/48fd3ece8c04f1cdf4da478a6afdbe11/llm-temperature-setting.png" style="max-width: 600px;" class="image center" alt="Temperature scale" /> <p>This image is a good visualization of what occurs when we adjust the temperature hyperparameter. However, to achieve this change, we utilize the softmax function, which plays a critical role in how the temperature hyperparameter is actually applied in the LLM.</p> <p>LLMs use the softmax function to convert the raw output scores from a model into probability values. For example, in the formula below, <em>P(y<sub>i</sub>)</em> is the probability of the <em>i</em>-th event, <em>z<sub>i</sub></em> is the logit for the <em>i</em>-th event, and <em>K</em> is the total number of events.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1i3H0SkkIA39KOdrKfqAsB/c44f2615c31f256d577d614b31909a4d/softmax-function.png" style="max-width: 280px;" class="image center" alt="Softmax function" /> <p>To incorporate temperature into the function, we divide the logits <em>(z<sub>i</sub>)</em> by the temperature value we select <em>(T)</em> before applying softmax. The new formula is:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6Eo7azTaneUlHOd9tU7FN7/0f6dea077b737948f930310b73dde447/softmax-function-temperature.png" style="max-width: 280px;" class="image center" alt="Softmax function with temperature" /> <p>It’s important to note that temperature doesn’t change the model’s knowledge or understanding; it merely influences the style of text generation by allowing more or fewer words to be considered. Individual words are assigned a probability score based on the above function, and as <em>T</em> increases above 1, the probability distribution flattens, resulting in a more random selection of words. This also means that a model won’t produce more accurate information at a lower temperature; it will just express it more predictably.</p> <p>In that sense, temperature, as used in an LLM, is a way to control the trade-off between randomness and predictability in the generated text.</p> <h4 id="adjusting-hyperparameters">Adjusting hyperparameters</h4> <p>Hyperparameters are manually adjusted, which is their main defining characteristic. Adjusting hyperparameters in an LLM significantly impacts its training process and the quality of its output.</p> <p>Hyperparameters are the actual configuration settings used to structure the learning process and model architecture. They aren’t learned from the data but are set before the training process begins.</p> <p>For example, the <em>learning rate</em> is the amount by which parameters are adjusted during each iteration, which an engineer controls manually. This is a crucial hyperparameter in training neural networks because, though a higher learning rate might speed up learning, it can also overshoot the optimal values. Conversely, while setting a low learning rate may prevent overshooting, it can also make training inefficiently slow.</p> <p>The <em>batch size</em> refers to the number of training examples used in one iteration of model training. 
Manually selecting a larger batch size provides a more accurate estimate of the gradient but requires more memory and computational power. Smaller batch sizes can make the training process less stable but often lead to faster convergence.</p> <p>Another important hyperparameter is the <em>epoch</em> count – the number of times the entire training dataset is passed forward and backward through the neural network. Too few epochs can lead to underfitting, while too many can lead to overfitting.</p> <p>The manually configured model size is itself another crucial hyperparameter. Larger models with more layers and units can capture more complex patterns and have higher representational capacity. However, they also require more data and computational resources to train and are more prone to overfitting.</p> <p>There are a variety of other hyperparameters as well, such as <em>optimizer</em>, <em>sequence length</em>, <em>weight initialization</em>, and so on, each contributing to the model’s accuracy, efficiency, and ultimately the quality of the results the model generates.</p> <h4 id="instruction-tuning-and-prompt-engineering">Instruction tuning and prompt engineering</h4> <p>Instruction tuning is an approach to training LLMs that focuses on enhancing the model’s ability to follow and respond to user-given natural language instructions, commands, and queries. This process is distinct from general model training and is designed to improve the model’s performance in task-oriented scenarios.</p> <p>In essence, instruction tuning refines an LLM’s ability to understand and respond to a wide range of instructions, making the model more versatile and user-friendly for various applications.</p> <p>Instruction tuning involves fine-tuning the model on a dataset consisting of examples where instructions are paired with appropriate responses. These datasets are crafted to include a variety of tasks and instructions, from simple information retrieval to complex problem-solving tasks. While general training of an LLM involves learning from a vast body of text to understand language and generate relevant responses, instruction tuning is more targeted. It teaches the model to recognize and prioritize the user’s intent as expressed in the instruction.</p> <p>This specialized tuning makes the LLM more intuitive and effective in interacting with users. Users can give commands or ask questions in natural language, and the model is better equipped to understand and fulfill these specific requests.</p> <p>Part of instruction tuning can involve showing the model specific examples of how to carry out various tasks. This could include not just the instruction but also the steps or reasoning needed to arrive at the correct response. Instruction tuning also involves a continuous learning approach, in which the model is periodically updated with new examples and instructions to keep its performance in line with user expectations and evolving language use.</p> <p>Special attention is given in the instruction tuning process to training the model to handle sensitive topics or instructions that require ethical considerations. The aim is to guide the model towards safe and responsible responses.</p> <h4 id="user-feedback">User feedback</h4> <p>Like instruction tuning, incorporating user feedback into the model’s training cycle can help identify and correct hallucinations. This is often called RLHF, or reinforcement learning from human feedback. 
When actual human users point out inaccuracies, these instances can be used as learning opportunities for the model.</p><![CDATA[Adding Multiple Custom Metrics to Kentik NMS]]><![CDATA[Kentik NMS has the ability to collect multiple SNMP objects (OIDs). Whether they are multiple unrelated OIDs, or multiple elements within a related table, Leon Adato walks you through the steps to get the data out of your devices and into your Kentik portal.]]>https://www.kentik.com/blog/adding-multiple-custom-metrics-to-kentik-nmshttps://www.kentik.com/blog/adding-multiple-custom-metrics-to-kentik-nms<![CDATA[Leon Adato]]>Wed, 20 Mar 2024 04:00:00 GMT<p><a href="https://www.kentik.com/blog/using-kentik-network-monitoring-system-to-identify-network-outages/">In my last post</a>, I waxed poetic (or at least long-winded) on how to add a custom SNMP object ID (OID) into <a href="https://www.kentik.com/product/network-monitoring-system/">Kentik NMS</a>. Despite the fact that NMS collects a metric butt-tonne (that’s a highly technical form of measurement) of data, there are always custom elements needed by various folks for specific use cases.</p> <p>However, in sharing my example – collecting a custom OID for CPU temperature – I omitted a critical piece of context: Most modern systems have more than one CPU and, therefore, more than one temperature value.</p> <p>I left out that detail due to the need to clearly explain the mechanics of adding custom OIDs without overwhelming the audience with a bunch of additional complexities.</p> <p>With the basic process out of the way now, I felt it was important to circle back and talk about how you can add multiple custom <a href="https://www.kentik.com/kentipedia/snmp-monitoring/" title="Kentipedia: SNMP Monitoring: An Introduction and Practical Tutorial">SNMP metrics</a> at the same time. And when I say “multiple,” there are two different scenarios I’m going to cover:</p> <ol> <li>Collecting a single OID that returns multiple values</li> <li>Collecting several different, unrelated metrics in one configuration file</li> </ol> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p>Don’t feel like reading? 
Watch this demo video to see a walkthrough of adding multiple custom metrics to Kentik NMS.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/shhgo1mj98?seo=false&amp;videoFoam=false" title="Adding Multiple Custom Metrics to Kentik NMS" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <h2 id="a-quick-review-of-custom-oid-collection">A quick review of custom OID collection</h2> <p>Just to review the process for custom OIDs in Kentik NMS:</p> <ul> <li>Move to (or create if it doesn’t exist) the dedicated folder on the system where the Kentik agent (kagent) is running:</li> </ul> <p><code class="language-text">/opt/kentik/components/ranger/local/config</code></p> <ul> <li>In that directory, create directories for /sources, /reports, and /profiles</li> <li>Create three specific files: <ol> <li>Under /sources, a file that lists the custom OID to be collected</li> <li>Under /reports, a file that associates the custom OID with the data category it will appear under within the Kentik portal</li> <li>Under /profiles, a file that describes a type of device (Using the SNMP System Object ID) and the report(s) to be associated with that device type</li> </ol> </li> </ul> <p>For just one temperature setting, it would look like this:</p> <p><strong>sources/linux.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-linux kind: sources sources: temp: !snmp value: 1.3.6.1.4.1.2021.13.16.2.1.3.1 interval: 60s</code></pre></div> <p></p> <p><strong>reports/linux_temps_report.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-temp kind: reports reports: /device/linux/temp: fields: CPUTemp: !snmp value: 1.3.6.1.4.1.2021.13.16.2.1.3.1 metric: true interval: 60s</code></pre></div> <p></p> <p><strong>profiles/local-net-snmp.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-net-snmp kind: profile profile: match: sysobjectid: - 1.3.6.1.4.1.8072.* reports: - local-temp include: - device_name_ip</code></pre></div> <p></p> <div as="Promo"></div> <h2 id="collecting-multiple-unrelated-oids">Collecting multiple unrelated OIDs</h2> <p>Once you understand the process of collecting a single OID, adding others is pretty simple.</p> <p>For our example, we <em>also</em> want to collect icmpInEchos (the number of pings received) for Linux/net-snmp type devices. The OID for this is 1.3.6.1.2.1.5.8.0. 
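If you want to sanity-check the OID before committing it to a config file, a quick snmpget will confirm the device answers as expected. (The host <code class="language-text">192.168.1.20</code> and community string <code class="language-text">public</code> below are hypothetical placeholders – substitute your own device and credentials, and the counter value will of course differ.)</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">snmpget -v2c -c public 192.168.1.20 1.3.6.1.2.1.5.8.0

# Example response:
# ICMP-MIB::icmpInEchos.0 = Counter32: 1042</code></pre></div> <p></p>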
<p>Using the same files from above, I’d make the following modifications:</p> <p><strong>sources/linux.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-linux kind: sources sources: temp: !snmp value: 1.3.6.1.4.1.2021.13.16.2.1.3.1 interval: 60s icmpInEchos: !snmp value: 1.3.6.1.2.1.5.8.0 interval: 60s</code></pre></div> <p></p> <p><strong>reports/ping-count.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-ping-count kind: reports reports: /device/linux/pings: fields: ping-count: !snmp value: 1.3.6.1.2.1.5.8.0 metric: true interval: 60s</code></pre></div> <p></p> <p><strong>profiles/local-net-snmp.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-net-snmp kind: profile profile: match: sysobjectid: - 1.3.6.1.4.1.8072.* reports: - local-temp - local-ping-count include: - device_name_ip</code></pre></div> <p></p> <p>Let’s unpack some of the things you see there, and how they differ from the collection of a single OID:</p> <ul> <li>sources/linux.yml has two different sources: one for temp (temperature) and one for icmpInEchos.</li> <li>An entirely new file under /reports, named ping-count.yml, describes the OID to be collected and specifies that its data will appear under /device/linux/pings within the Kentik portal.</li> <li>Finally, profiles/local-net-snmp.yml (which was already present for the single temperature OID) has been modified to also associate the report named “local-ping-count”.</li> </ul> <p>I want to emphasize two other points: First, the file name doesn’t matter at all. The key is to make sure the “name: ” element within the YAML files is correct. Second, the directory structure is just a housekeeping mechanism. As long as the “kind: ” element within the YAML file is correct, you can have everything in the same folder if you prefer.</p> <p>The result is that you will now receive and can display data within the Kentik portal for both temperature and ICMP echoes received:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3tTmhQcfY3gLgenxhYggCZ/d8f1b1380fa6bac3cd376e579c751d3b/temperature-icmp.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik portal showing temperature and ICMP echoes" /> <img src="//images.ctfassets.net/6yom6slo28h2/2iW9BCuo3XSLHp1MqghUvr/c4b1a74d78cda3bd0db269dde440ebd5/ping-count.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik portal showing ping count" /> <h2 id="collecting-a-table-of-oids">Collecting a table of OIDs</h2> <h4 id="or-i-contain-multitudes">(or, “I Contain Multitudes”)</h4> <p>Some SNMP OIDs return a single value, like the icmpInEchos metric in our last example.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5MJjziVPlNSWSyPvf2kklZ/66fd3b2fb9d3b729a4e8cd1bedfae2d7/single-value-return.png" style="max-width: 800px;" class="image center" alt="SNMP OIDs returning single value" /> <p>But others are effectively the tip of an iceberg of metrics. 
Examples of this type of OID include CPU, temperature, fans, disks, and even the SNMP OID that displays running processes.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6pPQ74f98xRLa44auIrSxe/f982d3bcbaa3e7cdafe71fa8ac3e815f/tip-of-iceberg-metrics.png" style="max-width: 800px;" class="image center" alt="Example of OID including lots of metrics" /> <p>Which brings us back to the original OID – you’ll recall that in my previous post, the OID I used was 1.3.6.1.4.1.2021.13.16.2.1.3.1, which gave me one temperature stat. But if I used “.2” instead of “.1” at the end, I would see another temperature reading.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4hmUXgTh0F9j0EGCqlC8fF/5dc9084e3f8a48d3eae9ab1169a35f04/original-with-dot-2.png" style="max-width: 800px;" class="image center" alt="Different temperature reading" /> <p>In fact, I can do that for five different OIDs:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6XMuZyqeIbhv7AfhdCYuvD/c352c1922d6ee7294fb3d9c3babc886a/5-different-oids.png" style="max-width: 800px;" class="image center" alt="Example with 5 different OIDs" /> <p>This is because the humble little Raspberry Pi I’m monitoring in this example has four cores (stick with me; I’ll explain why it’s not five in a minute). While I’d like to monitor all of them, I have to recognize that not every device has the same number of cores, and therefore, I need something that will flexibly collect <em>whatever</em> number I have. Luckily, SNMP handles that more or less automatically. At the command line, an snmpwalk (instead of snmpget) accomplishes the same thing:</p> <img src="//images.ctfassets.net/6yom6slo28h2/x55fMNiTKZ3n8UcSSo6hP/522029b1c08cb710be604d748757d6a8/snmpwalk.png" style="max-width: 800px;" class="image center" alt="snmpwalk in the command line" /> <p>What’s missing is the names of each of these elements. 
While it may not be a big deal for CPUs, it’s far more important when I’m collecting the same data point for disks, interfaces, and the like.</p> <p><a href="https://oidref.com/">https://oidref.com/</a> tells me the names of the sensors can be found at the OID 1.3.6.1.4.1.2021.13.16.2.1.2.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7yKuEBBRkBrsaCg01Rbk3X/de84d5c1b64cd7722b9ca51154128fce/names-of-sensors.png" style="max-width: 800px;" class="image center" alt="Names of sensors" /> <p>From this I can see that the first OID is the aggregate temperature, and the next four OIDs are temperatures for each of the four cores in my Pi.</p> <p>With that information in hand, our goals are to:</p> <ol> <li>Collect all five temperature readings without having to call out each one explicitly.</li> <li>Associate the data values with labels.</li> </ol> <p>Here is what the YAML files look like:</p> <p><strong>sources/linux.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-linux kind: sources sources: CPUTemp: !snmp table: 1.3.6.1.4.1.2021.13.16.2 interval: 60s</code></pre></div> <p></p> <p><strong>reports/temp.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-temp kind: reports reports: /device/linux/temp: fields: name: !snmp table: 1.3.6.1.4.1.2021.13.16.2 value: 1.3.6.1.4.1.2021.13.16.2.1.2 metric: false CPUTemp: !snmp table: 1.3.6.1.4.1.2021.13.16.2 value: 1.3.6.1.4.1.2021.13.16.2.1.3 metric: true interval: 60s</code></pre></div> <p></p> <p><strong>profiles/local-net-snmp.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-net-snmp kind: profile profile: match: sysobjectid: - 1.3.6.1.4.1.8072.* reports: - local-temp include: - device_name_ip</code></pre></div> <p></p> <p>Once again, let’s unpack that:</p> <ul> <li>sources/linux.yml uses the OID 1.3.6.1.4.1.2021.13.16.2. Two things are notable: <ul> <li>This is a couple of levels “up” the OID chain from both the temperature (1.3.6.1.4.1.2021.13.16.2.1.3) and the labels (1.3.6.1.4.1.2021.13.16.2.1.2).</li> <li>The original example used a metric type of “value”. 
This one is using the “table” type.</li> </ul> </li> <li>/reports/temp.yml describes two fields instead of just one: <ul> <li>A “name” field which pulls two data sets: <ul> <li>the overall table from the OID ending at 16.2</li> <li>the actual values from the OID ending at 16.2.1.2</li> </ul> </li> <li>A “CPUTemp” field which pulls two data sets: <ul> <li>the overall table from the OID ending at 16.2</li> <li>the actual values from the OID ending at 16.2.1.3</li> </ul> </li> </ul> </li> </ul> <p>The result of this structure is that it will associate the labels from the 16.2.1.2 branch of the OID table with the temperature values in the 16.2.1.3 branch.</p> <p>Note that the profiles/local-net-snmp.yml is unchanged from our original example of collecting a single temperature value.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/43meXWbETs8U3aRwRnwgyI/4455c232cd04abda96b010a16440f68c/labels-associated-with-temps.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik portal" /> <h2 id="bonus-round-putting-it-all-together">BONUS ROUND: Putting it all together</h2> <p>By this point, you should be getting pretty comfortable with the idea, if not the technique, of adding custom SNMP OIDs to Kentik NMS. But in this last example, we’re going to include custom OIDs for every CPU temperature, along with icmpInEchos data. Here’s what the files look like:</p> <p><strong>sources/linux.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-linux kind: sources sources: icmpInEchos: !snmp value: 1.3.6.1.2.1.5.8.0 interval: 60s CPUTemp: !snmp table: 1.3.6.1.4.1.2021.13.16.2 interval: 60s</code></pre></div> <p></p> <p><strong>reports/temp.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-temp kind: reports reports: /device/linux/temp: fields: name: !snmp table: 1.3.6.1.4.1.2021.13.16.2 value: 1.3.6.1.4.1.2021.13.16.2.1.2 metric: false CPUTemp: !snmp table: 1.3.6.1.4.1.2021.13.16.2 value: 1.3.6.1.4.1.2021.13.16.2.1.3 metric: true interval: 60s</code></pre></div> <p></p> <p><strong>reports/ping-count.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-ping-count kind: reports reports: /device/linux/pings: fields: ping-count: !snmp value: 1.3.6.1.2.1.5.8.0 metric: true interval: 60s</code></pre></div> <p></p> <p><strong>profiles/local-net-snmp.yml</strong></p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-net-snmp kind: profile profile: match: sysobjectid: - 1.3.6.1.4.1.8072.* reports: - local-temp - local-ping-count include: - device_name_ip</code></pre></div> <p></p> <p>That’s right. 
If you’re looking closely, you’ll see that it’s mostly just the same files from our previous example, plus the inclusion of reports/ping-count.yml and the combining of both report names in local-net-snmp.yml.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/43meXWbETs8U3aRwRnwgyI/4455c232cd04abda96b010a16440f68c/labels-associated-with-temps.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik portal" /> <img src="//images.ctfassets.net/6yom6slo28h2/2iW9BCuo3XSLHp1MqghUvr/c4b1a74d78cda3bd0db269dde440ebd5/ping-count.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik portal showing ping count" /> <h2 id="the-mostly-unnecessary-conclusion">The mostly unnecessary conclusion</h2> <p>Two things should be clear at this point: Far from the shriveled, washed-up, has-been of a monitoring technique it’s often accused of being, SNMP continues to be a powerful, flexible, and valuable tool in your observability toolkit.</p> <p>Moreover, Kentik NMS is an equally powerful, flexible, and useful solution for collecting and displaying those metrics alongside other data types, providing you with complete insight into the health and stability of your network.</p> <p>As always, I hope you’ll stick with me as we learn more about this together. If you’d like to get started now, <a href="https://www.kentik.com/get-started/">sign up for a free trial</a>.</p><![CDATA[Using Kentik NMS to Identify Network Outages]]><![CDATA[Kentik NMS collects valuable network data and then transforms it into usable information -- information you can use to drive action.]]>https://www.kentik.com/blog/using-kentik-network-monitoring-system-to-identify-network-outageshttps://www.kentik.com/blog/using-kentik-network-monitoring-system-to-identify-network-outages<![CDATA[Leon Adato]]>Wed, 13 Mar 2024 04:00:00 GMT<p><a href="https://www.kentik.com/blog/setting-sail-with-kentik-nms-unified-network-telemetry/">I recently explored why Kentik</a> built and released an all-new <a href="https://www.kentik.com/product/network-monitoring-system/">network monitoring system (NMS)</a> that includes traditional and more modern telemetry collection techniques, such as APIs, OpenTelemetry, and Influx.</p> <p><a href="https://www.kentik.com/blog/getting-started-with-kentik-nms/">After that</a>, I briefly covered the steps to install Kentik NMS and start monitoring a few devices.</p> <p>What I left out and will cover in this post is what it might look like when you have everything installed and configured. Along the way, I’ll dig a little deeper into the various screens and features associated with Kentik NMS.</p> <div as="Promo"></div> <p>This raises the question, as eloquently put by <a href="https://youtu.be/5IsSpAOD6K8?t=48">The Talking Heads</a>, “Well, how did I get here?” Meaning: Where in this post do I explain how to install NMS?</p> <p>My answer, <em>for this moment only</em>, is, “I don’t care.” <a href="https://www.kentik.com/blog/getting-started-with-kentik-nms/">You can refer to the previous post</a> for a walkthrough of the installation, and many “how to install Kentik NMS” knowledge articles, blog posts, music videos*, and Broadway plays are either already available or will exist by the time you finish reading this post. But for this post, I’m not going to spend a single sentence explaining how NMS is installed. 
My focus is entirely on the benefit and value of Kentik NMS once it’s up and running.</p> <p><em>* There will absolutely NOT be a music video - the Kentik legal team. <br> ** OH HELL YES, WE ABSOLUTELY WILL!! - the Kentik creative marketing group (who do the final edit of blogs before they post)</em></p> <p>In case you’re more of a watcher than a reader, I’ve included a brief video version of this post below:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/zpbqrxtbxs?seo=false&amp;videoFoam=false" title="Getting to Work with Kentik NMS Video" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <h2 id="portrait-of-a-network-outage">Portrait of a network outage</h2> <p>And to think, your day had started so well. The sun was shining, the birds were singing, the coffee was fresh and hot, and you could feel the first flutters of hope – hope that you’d be able to get some good work done today, hope that you could take a chunk out of those important tasks, or hope that you could avoid the unplanned work of system outages.</p> <p>And then came the call.</p> <p>“The Application” was down. Nobody could get in. Nothing was working.</p> <p>Now, let’s be completely honest about this. “The Application” was not, in fact, down. The servers were responding, the application services were running, and so on. 
But, being equally honest, it was slow.</p> <p>As we all know, “slow” is the new “broken.” Even if it wasn’t <em>literally</em> down, it wasn’t fully accessible and responsive, which means it was <em>effectively</em> down.</p> <p>What differentiates today from all the dark days in the past is that today, you have Kentik NMS installed, configured, and collecting data – data that the Kentik platform transforms into usable information that you can use to drive action.</p> <p>Let’s look at “The Application” – at the data flowing across the wire:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5zJw7hcpDbZ0rtItcGvYeY/38c727e912e1ff963fd925f3f8a59279/application-down.png" style="max-width: 500px;" class="image center" alt="Application showing drop in data" /> <p>By any account, that’s pretty down-ish.</p> <p>The problem is that a count of the inbound and outbound data doesn’t tell us what’s wrong; it just tells us that something <em>is</em> wrong.</p> <p>Likewise, the information from so-called “higher level” tools – monitoring solutions that focus on traces and such – might tell us that the flow of data has slowed or even stopped, but there’s no indication why.</p> <p>This is why network monitoring <em>still</em> matters – both to you as a monitoring practitioner, engineer, and aficionado and to teams, departments, and businesses overall.</p> <h3 id="the-smoking-ping">The smoking ping</h3> <img src="//images.ctfassets.net/6yom6slo28h2/xMJ1y3wNN9DdEze2GZf0v/50ba6c54004f6f291573e8cf255ebb6d/smoking-ping-643w.png" style="max-width: 500px;" class="image center" alt="Drop in ICMP packets" /> <p>We can see, at exactly the same moment, a drop in the most basic metric of all: the ICMP packets received by the devices.</p> <p>Now, ICMP packets (also known as the good old “ping”) are still data, but when they’re affected equally and simultaneously with application-layer traffic, there’s a good chance the problem is network-based.</p> <h3 id="what-we-have-here">What we have here</h3> <p>What was the problem? I’ll leave it to your experience, history, and imagination to fill in the blanks. In my example above, I changed the duplex setting on one of the ports, forcing a duplex mismatch that caused every other packet (or so) to drop. But it could have been a busy network device, a misconfigured route, or even a bad cable.</p> <p>In terms of making the case for Kentik NMS, the upshot is that network errors still occur. Often. And application-centric tools are ill-equipped to identify them, let alone help you resolve them.</p> <h3 id="move-along-now-nothing-to-see-here">Move along now. Nothing to see here.</h3> <p>Almost as fast as it started, the problem is resolved. 
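(If you’re wondering how you might verify a fix like this from a Linux host, the negotiated speed and duplex are visible via ethtool – <code class="language-text">eth0</code> here is a hypothetical interface name, and the output lines are illustrative:)</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">sudo ethtool eth0 | grep -iE 'speed|duplex'

# Example output once both sides agree:
#   Speed: 1000Mb/s
#   Duplex: Full</code></pre></div> <p></p> <p>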
With the duplex mismatch reversed, pings are back up to normal:</p> <img src="//images.ctfassets.net/6yom6slo28h2/7JTTRvOHPIbLXWwJw7FvA9/d56c08c856c5895b3c7e7bf6f61504e5/pings-normal.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Pings show traffic is back to normal" /> <p>And the application traffic is back up with it:</p> <img src="//images.ctfassets.net/6yom6slo28h2/22NnQtdil4AxgdDYbIV393/4cbb1c84314c911321954c909708599b/application-up.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Application traffic normal" /> <p>You pour yourself a fresh cup of coffee, listen to the birds chirping outside the window, and settle into what continues to look like a great day.</p> <h2 id="time-for-a-quick-tour">Time for a quick tour</h2> <p>Now that I’ve given you a reason to want to look around, I thought I’d spend some time pointing out the highlights and features of Kentik NMS so you could see the full range of what’s possible.</p> <h3 id="the-main-nms-screen">The main NMS screen</h3> <p>We’ll start at the main Kentik screen, the one you see when you log into <a href="https://portal.kentik.com">https://portal.kentik.com</a>. From here, click the “hamburger” menu in the upper left corner and choose “Network Monitoring System.” That will drop you into the main dashboard.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7ETQu5dCSw6ajOH3iLiVwF/e4ba9b450f2cdef087d7450e56394ca3/nms-main-screen.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="NMS dashboard" /> <p>On the main screen, you’ll see:</p> <ul> <li>A geographic map showing the location of your devices</li> <li>A graph and a table showing the availability information for those devices</li> <li>An overview of the traffic (bandwidth) being passed by/through your infrastructure</li> <li>Any active alerts</li> <li>Tables with a sorted list of devices that have high bandwidth, CPU, or memory utilization</li> </ul> <h3 id="the-devices-list">The Devices list</h3> <p>Returning to the hamburger menu, we’ll revisit the “Devices” list, but now that we <em>have</em> devices, we’ll take a closer look.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5UGM8LYeIZ4zK6c43p56KG/e89b32bd42d26ec8ce1659151e2660eb/device-list.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="List of devices in the Kentik platform" /> <p>This page is exactly what it claims to be – a list of your devices. 
From this one screen, you can easily:</p> <ul> <li>Sort the list by clicking on the column headings.</li> <li>Search for specific devices using any data types shown on the screen.</li> <li>Filter the list using the categories in the left-hand column.</li> </ul> <p>There are also some drop-down elements worth noting:</p> <ul> <li>The “Group By” drop-down adds collapsible groupings to the list of devices.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/6JQBYMhi0g4vXBM4MGQsiX/06aff9ce308985842998b8db158abdd6/dropdown-devices.png" style="max-width: 200px;" class="image center" alt="Dropdown menu showing Group-by dimensions" /> <ul> <li>The “Actions” drop-down will export the displayed data to CSV or push it out to Metrics Explorer for deeper analysis.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/yq471A14Vb9sf2rec95g6/9f9f07524d4878f14f962209a216b49e/dropdown-actions.png" style="max-width: 300px;" class="image center" alt="Actions menu" /> <ul> <li>The “Customize” option in the upper right corner lets you add or remove data columns.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/1eMbuxrwiqbyXbjE5a5VPM/535ef7a0946de5edb1c613e229391fd9/customize-menu.png" style="max-width: 270px;" class="image center" alt="Customize columns menu" /> <p>The friendly blue “Discover Devices” button allows you to add new devices to newly added or existing collector instances.</p> <h3 id="the-interfaces-list">The Interfaces list</h3> <p>Remember all the cool stuff I just covered about devices? The following image looks similar, except it focuses on your network interfaces.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3zpkNTJyK4OQAYto0LNx1f/82a244cdfdc8d271a38fc54f45be9041/nms-interfaces.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Network interfaces" /> <h3 id="metrics-explorer">Metrics Explorer</h3> <img src="//images.ctfassets.net/6yom6slo28h2/20ZknXENYgM0PsVbqmv3sf/8e5e9643bb82a2986be5c0e6d3e1465e/metrics-explorer.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Metrics Explorer" /> <p>Metrics Explorer is, in many ways, identical to Kentik’s existing <a href="https://kb.kentik.com/v4/Db03.htm">Data Explorer</a> capability. It’s also incredibly robust and nuanced, so much so that it deserves and will get its own dedicated blog post.</p> <p>For now, I will give this incredibly brief overview just to keep this post moving along:</p> <p>First, all the real “action” (meaning how you interact with Metrics Explorer) happens in the right-hand column.</p> <p>Second, it’s important to remember that the entire point of the Metrics Explorer is to help you graphically build a query of the network observability data Kentik is collecting.</p> <p>With those two points out of the way, the right-hand column has five primary areas:</p> <ol> <li><strong>Measurement</strong> allows you to select which data elements are used and how. <ul> <li>The initial drop-down lets you select the broad category of telemetry from which your metrics will be drawn. For NMS, you will often select from either the /interfaces/ or the /device/ grouping.</li> <li>Metrics indicate the data elements that should be used for the graph.</li> <li>Group by Dimensions will create sub-groupings of that data on the graph. Absent any “group by,” you end up with a single set of data points across time. 
Grouping by name, location, etc., will create a more granular breakdown.</li> <li>Merge Series is a summary option that allows you to apply sum, min, max, or average functions to the data based on the groupings.</li> </ul> </li> <li><strong>Visualization options</strong>: This section controls how the data displays on the left. <ul> <li>Chart type: Line, bar, pie, table only, etc.</li> <li>Metric: The column that is used as the scale for the Y-axis.</li> <li>Aggregation: Whether the graph should map every data point, an average, a sum, etc.</li> <li>Sample size: When aggregating, all the data from a specific time period (from 1 to 60 minutes) will be combined.</li> <li>Series count: How many items from the full data set should be displayed in the graph.</li> <li>Transformation: Whether to treat the data points as they are or as counters.</li> </ul> </li> <li><strong>Time</strong>: The period of time from which to display data and whether to display time markings in UTC or “local” (the time of whatever computer is viewing the graph).</li> <li><strong>Filtering</strong>: This will let you add limitations to include data that matches (or does <em>not</em> match) specific criteria. <ul> <li>Dimension: Non-numeric columns such as location, name, vendor, or subnet.</li> <li>Metric: Numeric data.</li> </ul> </li> <li><strong>Table options</strong>: These set the options for the table of data that displays below the graph and lets you select how many rows and whether they’ll be aggregated by Last, Min, Max, Average, or P95 methods.</li> </ol> <h2 id="add-about-data-and-device-types">AD&#x26;D: About data and device types</h2> <p>After folks see how helpful Kentik can be, the next question is usually, “Will it cover <em>my</em> gear?” While a list of vendors isn’t the same as a comprehensive list of each make and model, this is a blog post, and nobody will take that kind of time. Meanwhile, the list below should still give you a good idea of what is available out of the box. From there, modifying existing profiles to include specific metrics or even completely new devices is relatively simple.</p> <p>As of the time of this writing, Kentik NMS automatically collects data from devices made by the following vendors. For the legal-eagles in the group who are sensitive about trademarks, capitalization, and such, please note that this is a dump directly out of the device-type directory:</p> <ul> <li>3com</li> <li>a10_networks</li> <li>accedian</li> <li>adva</li> <li>alteon</li> <li>apc</li> <li>arista</li> <li>aruba</li> <li>audiocodes</li> <li>avaya</li> <li>avocent</li> <li>broadforward</li> <li>brother</li> <li>calix</li> <li>canon</li> <li>cisco</li> <li>corero</li> <li>datacom</li> <li>dell</li> <li>elemental</li> <li>exagrid</li> <li>extreme</li> <li>f5</li> <li>fortinet</li> <li>fscom</li> <li>gigamon</li> <li>hp</li> <li>huawei</li> <li>ibm</li> <li>infoblox</li> <li>juniper</li> <li>lantronix</li> <li>meraki</li> <li>mikrotik</li> <li>netapp</li> <li>nokia</li> <li>nvidia</li> <li>opengear</li> <li>palo_alto</li> <li>pf_sense</li> <li>server_iron</li> <li>servertech</li> <li>sunbird</li> <li>ubiquiti</li> <li>velocloud</li> <li>vertiv</li> <li>vyos</li> </ul> <h2 id="the-mostly-unnecessary-summary">The mostly unnecessary summary</h2> <p>Of course, this is just the start of your Kentik NMS journey. 
There is so much more to the platform, from adding custom metrics and new devices to building comprehensive dashboards that contextualize data to creating alerts that convert monitoring information into action. I will be digging into all that and more in the coming weeks and months, even as Kentik NMS continues to grow and improve.</p> <p>I hope you’ll stick with me as we learn more about this together. If you’d like to get started now, <a href="https://www.kentik.com/get-started/">sign up for a free trial</a>.</p><![CDATA[How to Configure Kentik NMS to Collect Custom SNMP Metrics]]><![CDATA[Out of the gate, NMS collects an impressive array of metrics and telemetry. But there will always be bits that need to be added. This brings me to the topic of today’s blog post: How to configure NMS to collect a custom SNMP metric.]]>https://www.kentik.com/blog/how-to-configure-kentik-nms-to-collect-custom-snmp-metricshttps://www.kentik.com/blog/how-to-configure-kentik-nms-to-collect-custom-snmp-metrics<![CDATA[Leon Adato]]>Thu, 07 Mar 2024 05:00:00 GMT<p>The recent release of <a href="https://www.kentik.com/product/network-monitoring-system/">Kentik NMS</a> has impressed and excited a lot of folks, as evidenced by the volume of current Kentik customers kicking the tires of our newest capability, as well as folks who hadn’t dipped their toes in the warm and welcoming waters of Kentik’s platform until they heard about NMS.</p> <p>Out of the gate, NMS collects an impressive array of metrics and telemetry. But that doesn’t mean it knows about absolutely everything. No matter how diligently Kentik’s engineers work to incorporate devices and data points (both new and old), there will always be bits that need to be added.</p> <p>Not only is it impossible for any monitoring and observability solution to know about every possible data point, but making a tool collect “every” metric would cause it to be unreasonably slow.</p> <p>The goal, instead, is to collect all the telemetry commonly needed and provide the ability to extend the tool to collect other metrics specific to each company’s circumstances.</p> <p>This brings me to the topic of today’s blog post: How to configure NMS to collect a custom SNMP metric.</p> <h2 id="everything-was-going-great-until">Everything was going great until…</h2> <p>Imagine sitting at your desk, monitoring your little heart out with Kentik NMS. Even the Raspberry Pi boxes you’re using for small but essential tasks are showing up. Things are looking great.</p> <img src="//images.ctfassets.net/6yom6slo28h2/TP8nwGpJ94PQypVWht8oF/319ae263ac54da875cf52a57f71b6796/raspberrypi.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Raspberry Pi device" /> <p>Until you realize two things in quick succession:</p> <ol> <li>Those Raspberry Pis are warm enough to heat up a slice of… well, actual raspberry pie.</li> <li>Temperature stats aren’t showing up.</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/4YOwoxJxVfNqsrh7Y70dE/fda8afcf376b5ef08487b3f3405850be/raspberrypi-temperature.png" style="max-width: 700px;" class="image center" withFrame alt="Raspberry Pi temperature showing no data" /> <p>To be clear, I’m using temperature as a simple but common example. It could just as easily be toner status on a printer or a list of services running on a server, complete with CPU, RAM, and IO utilization for each service. 
What I’m about to explain is how to include <em>any</em> new SNMP metric, irrespective of data type or vendor.</p> <p>With those clarifications out of the way, let’s add some temperature stats to our view to see whether we should stock up on fire extinguishers.</p> <div as="Promo"></div> <h2 id="preparing-for-success">Preparing for success</h2> <p>Before we start making changes, I want to go over the information you need at your fingertips.</p> <p>First and foremost, you need to have the SNMP objects (OIDs) that get the data you want, and you should be certain the device responds to those objects in the way you expect.</p> <p>In my case, the OID I want is: <code class="language-text">1.3.6.1.4.1.2021.13.16.2.1.3.1</code></p> <p>There are lots of sources for OID information; one such source is <a href="https://oidref.com">https://oidref.com</a>.</p> <p>To validate that my device responds correctly, I can use the SNMPWalk utility to poll just that value:</p> <img src="//images.ctfassets.net/6yom6slo28h2/IDuZi2Nfmauk7vH53zm7e/e3a6074cf0f3b6f01f330d5526a9fb47/snmpwalk-utility.png" style="max-width: 800px; padding: 0" class="image center" alt="SNMPWalk Utility" /> <p>Now that we have our OID and we’ve confirmed it works on the device in question, our last step is to ensure we understand how this value is formatted. In this case, it’s in “milli-Celsius,” so 39166 is actually about 39.2 degrees Celsius (or roughly 102.5 degrees Fahrenheit).</p> <p>Finally, I have to understand the SNMP system object (sysobjectid) of the device to which I want to add my data. You can find that by going into Kentik’s portal, visiting the Devices page, and adding the SysObjectID column.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3OomsuhXWmf7EHWaIHXDtE/e35d1daea626aced950591ed399b6210/devices-add-column.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Device list showing additional column" /> <p>Or if you go to the details page for a specific device and view it in the left-hand column:</p> <img src="//images.ctfassets.net/6yom6slo28h2/40kYHAm3XOyiMdU9wU6Mqj/b5f58554850537b8fc08f44c2891ad68/raspberrypi-details-529w.png" style="max-width: 400px;" class="image center" alt="Raspberry Pi details" /> <p>Note that what I’ll be using for this example is <code class="language-text">1.3.6.1.4.1.8072.3.2.10</code></p> <p>Also note that this will affect any Linux-based system because Raspberry Pis don’t have their own unique system ID.</p> <p>Now we’re ready to get this value added and displayed in Kentik NMS!</p> <h2 id="this-is-the-lede">This is the lede</h2> <p>I’m not going to hide the important information behind a wall of step-by-step text. This section is the straightforward, simple, direct answer. But it lacks context and detail and, therefore, might not make much sense. That’s what the rest of this post is about. But for those who want to get right to the point:</p> <ol> <li>Customizations to Kentik NMS all go in <code class="language-text">/opt/kentik/components/ranger/local/config</code>. Whether you are adding a custom OID, overwriting an existing OID with a new source (not covered in this post), or adding a new device type (also not covered in this post), it all goes there. <ul> <li>This directory <em>might</em> already exist. 
If it doesn’t, go ahead and create it yourself.</li> </ul> </li> <li>In that directory, create three directories: <ol> <li>/profiles</li> <li>/reports</li> <li>/sources</li> </ol> </li> <li>In sources/, create linux-temps_source.yml and add the following information:</li> </ol> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-linux kind: sources sources: CPUtemp: !snmp value: 1.3.6.1.4.1.2021.13.16.2.1.3.1 interval: 60s</code></pre></div> <p></p> <ol start="4"> <li>In reports/, create linux_temps_report.yml and add the following information:</li> </ol> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-temp kind: reports reports: /device/linux/temp: fields: CPUTemp: !snmp value: 1.3.6.1.4.1.2021.13.16.2.1.3.1 metric: true interval: 60s</code></pre></div> <p></p> <ol start="5"> <li>In profiles/, create a file named local-net-snmp.yml and add the following information:</li> </ol> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-net-snmp kind: profile profile: match: sysobjectid: - 1.3.6.1.4.1.8072.* reports: - local-temp include: - device_name_ip</code></pre></div> <p></p> <ol start="6"> <li>Make the user:group “kentik:kentik” the owner of everything you just created and all the files and directories beneath it.</li> </ol> <p><code class="language-text">sudo chown -R kentik:kentik</code><br> <code class="language-text">/opt/kentik/components/ranger/local/config</code></p> <p><strong>Note</strong>: This is only necessary if you’re running the Kentik NMS agent on a regular Linux system whether it’s a VM or not. This isn’t necessary for Docker-based agents, but I’ll explicitly cover that in a later section.</p> <ol start="7"> <li>Restart the collector process (kagent):</li> </ol> <p><code class="language-text">sudo systemctl restart kagent.service</code></p> <p>Wait a polling cycle or two, and you’ll be able to see it in the Metrics Explorer:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2Sv8e5s9uTetx2QhhFDH3p/8f6d8e6b8529d3ce3116fb1df8146470/metrics-explorer-temp.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Metrics Explorer showing temperature" /> <h2 id="unpacking-kentik-nms">Unpacking Kentik NMS</h2> <p>The previous section presented a lot of information in a very tight package. It was probably just enough for folks who are already familiar with NMS and its internal structures. But if you’re newer to the platform, you may be looking for additional information, detail, or context. That’s what I plan to present in the rest of this post.</p> <p>Kentik NMS is, at its heart, a straightforward set of processes and directories. When you install it, all the essential files will be located in <code class="language-text">/opt/kentik/components/ranger/current</code>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7KrYPCp8AC8KHJC70A6MY6/a993d6fbc18c2a3f8e72e9dbf7da8aa4/install-location.png" style="max-width: 600px; padding: 0" class="image center" alt="Install location" /> <p>The LATEST.ZIP file contains all of the device profiles and information needed to collect data from those devices. The beauty of this system is that NMS works with LATEST.ZIP as-is, without unpacking or unzipping it. Every time you restart the Kentik agent (kagent), it checks for a newer version and downloads it if necessary. 
So you’re guaranteed to get all the latest updates and goodies without any special upgrade process.</p> <p><strong>Note</strong>: Sharp-eyed Linux-literate readers will notice that “current” is actually a symbolic link to the latest version. This is important because if you make changes here, you’ll find those changes inexplicably lost after the next update.</p> <p><strong>Upshot</strong>: Avoid future headaches. Don’t make changes in this directory.</p> <p>If you <em>did</em> unpack LATEST.ZIP (but, as I said, <em>don’t</em>), you’d find a specific directory structure underneath it.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2CNcW1KFcHzVcApUeOOaT5/b0c937366eb58ce54c0121e230d1c917/unpack-latest-zip-552w.png" style="max-width: 500px; padding: 0;" class="image center" alt="Directory structure" /> <p>The key directories there are Profiles, Reports, and Sources. Each one contains a set of YAML files defining an aspect of the collected data.</p> <p><strong>Important note</strong>: The names of the files aren’t important. What matters is the information you provide in the <code class="language-text">name:</code> element within each YAML file. That will allow you to connect or associate a profile to a report, a report to a source, and so on.</p> <h3 id="source-files">Source files</h3> <p>A source tells Kentik NMS about one or more OIDs to collect. Here’s an example:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-linux kind: sources sources: temp: !snmp value: 1.3.6.1.4.1.2021.13.16.2.1.3.1 interval: 60s</code></pre></div> <p></p> <p>This file can be understood as:</p> <ul> <li>A source named “local-linux”</li> <li>The type (or kind) of file is a “source” (there are others, which you’ll understand in a minute)</li> <li>The collection method is SNMP</li> <li>The SNMP object (OID) to collect is <code class="language-text">1.3.6.1.4.1.2021.13.16.2.1.3.1</code></li> <li>That value should be collected every 60 seconds</li> </ul> <h3 id="report-files">Report files</h3> <p>Files in the Report folder tell Kentik NMS how to display a specific OID within the Metrics Explorer. 
There are elements that repeat the things in the Source file, but – for reasons beyond the scope of this post – they’re necessary in both files.</p> <p>Here’s an example:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-temp kind: reports reports: /device/linux/temp: fields: CPUTemp: !snmp value: 1.3.6.1.4.1.2021.13.16.2.1.3.1 metric: true interval: 60s</code></pre></div> <p></p> <p>This information can be parsed as follows:</p> <ul> <li>The name of this report is local-temp</li> <li>The type (or “kind”) of file is – somewhat obviously – a report</li> <li>Within Metrics Explorer, the data being collected will show up under /device/linux/temp</li> <li>The data element that will be available in Metrics Explorer is “CPUTemp,” which is an SNMP data element <ul> <li>This element will contain the data collected by the SNMP OID <code class="language-text">1.3.6.1.4.1.2021.13.16.2.1.3.1</code></li> <li>Which is a metric rather than a table or some other type of data structure.</li> </ul> </li> <li>The data will be displayed in 60-second increments.</li> </ul> <h3 id="profile-files">Profile files</h3> <p>Profiles associate specific reports with the device types (as identified by their SNMP System Object ID, or sysobjectid) and also mention common data elements (like name or IP) that should be associated with the data.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">version: 1 metadata: name: local-net-snmp kind: profile profile: match: sysobjectid: - 1.3.6.1.4.1.8072.* reports: - local-temp include: - device_name_ip</code></pre></div> <p></p> <p>One more time, let’s parse this out:</p> <ul> <li>The name of the profile is local-net-snmp</li> <li>The type (or kind, there’s that word again) of the file is a profile</li> <li>This profile applies to anything with an SNMP SysObjectID starting with <code class="language-text">1.3.6.1.4.1.8072.*</code> (this means most Linux-type machines that run net-SNMP).</li> <li>Devices that match this profile will collect data found in the Report file with a name: element “local-temp.”</li> <li>The device name and IP data should be included along with the data in the local-temp report and associated source.</li> </ul> <h3 id="its-cool-to-be-kind">It’s cool to be kind</h3> <p>You may have noticed that the <code class="language-text">kind</code> element in all three files above identifies the file type and matches the file’s directory. If you have a nagging suspicion that the directory structure matches up with the Kind label, you are right.</p> <p>In fact, you don’t <em>actually</em> need the directories. You could put all your files in a single folder, and as long as the Kind: value was correct, everything would match up. We here at Kentik encourage you to use the three-directory approach because it makes organizing, tracking, and maintaining a large number of profiles much easier in the long run.</p> <h3 id="possession-is-910ths-of-the-law-and-1010ths-of-linux-permissions">Possession is 9/10ths of the law and 10/10ths of Linux permissions</h3> <p>Once all your files are in place, it’s important to ensure Kentik can access them. 
This comes down to giving ownership to both the “kentik” user and the “kentik” group.</p> <p>Remembering that all of your customizations will go in the folder <code class="language-text">/opt/kentik/components/ranger/local/config</code>, we need to make sure everything you just created will be owned by the kentik user and group. The command to give ownership would be:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">sudo chown -R kentik:kentik /opt/kentik/components/ranger/local/config</code></pre></div> <p></p> <p><strong>Note</strong>: This is only necessary if you’re running the Kentik NMS agent on a regular Linux system, whether it’s a VM or not. This isn’t necessary for Docker-based agents, but I’ll explicitly cover that in a later section.</p> <p>Finally, restart the collector process (kagent):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">sudo systemctl restart kagent.service</code></pre></div> <p></p> <p>Wait a polling cycle or two, and you’ll be able to see it in the Metrics Explorer:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2Sv8e5s9uTetx2QhhFDH3p/8f6d8e6b8529d3ce3116fb1df8146470/metrics-explorer-temp.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Metrics Explorer showing temperature" /> <h2 id="keeping-kentik-contained-in-docker">Keeping Kentik contained (in Docker)</h2> <p>Throughout this post, I’ve focused on the commands and options for the direct installation of the Kentik agent. If you’re running the containerized version, very little changes, but it’s still worth running through those differences for folks who prefer the Docker version of the Kentik NMS collector.</p> <h3 id="docker-for-the-easily-distracted">Docker for the easily distracted</h3> <p>Before we move on, I don’t want to presume that every reader is already familiar – let alone comfortable – with Docker and its basic commands. Here are a few that you might need.</p> <p>You can see which containers are running (along with their container IDs) with the following command:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">docker ps</code></pre></div> <p></p> <p>You can see an output of what a Docker container is doing with this command (you get the container ID with the docker ps command):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">docker logs --follow &lt;container id></code></pre></div> <p></p> <p>Finally, if you have issues with any containers, including the command to build or run the Kentik agent, you can easily stop and remove a container with these commands:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">docker stop &lt;container id> docker rm &lt;container id></code></pre></div> <p></p> <h3 id="getting-a-custom-folder-into-a-container">Getting a custom folder into a container</h3> <p>For the Docker version of Kentik NMS, you will need to mount your custom folder into the container and add that path to the Kentik agent command line. What is that custom folder, you ask? 
If you’ve been paying attention, you can probably already guess:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">/opt/kentik/components/ranger/local/config</code></pre></div> <p></p> <p>That’s right, it’s the same folder we’ve already been working with.</p> <p>Starting with the “docker run” command that you used to install the container in the first place:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">docker run --name=kagent --detach --restart unless-stopped --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 --env K_API_ROOT=grpc.api.kentik.com:443 --mount source=kagent-data,target=/opt/kentik/ kentik/kagent:latest</code></pre></div> <p></p> <p>You would simply add</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">-v /opt/kentik/components/ranger/local/config:/opt/kentik/components/ranger/local/config</code></pre></div> <p></p> <p>…to that command. Note that the flag has to go <em>before</em> the image name, since Docker treats anything after the image name as arguments passed to the container. (The repeated path simply maps the host folder to the same location inside the container.)</p> <p>The full command would look like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">docker run --name=kagent --detach --restart unless-stopped --pull=always --cap-add NET_RAW --env K_COMPANY_ID=1234567 --env K_API_ROOT=grpc.api.kentik.com:443 --mount source=kagent-data,target=/opt/kentik/ -v /opt/kentik/components/ranger/local/config:/opt/kentik/components/ranger/local/config kentik/kagent:latest</code></pre></div> <p></p> <p>That’s it! Everything else in this post still applies, and you don’t even need to run the chown command to ensure ownership of that directory.</p> <h2 id="troubleshooting-and-other-swear-words">Troubleshooting and other swear words</h2> <p>Despite our best efforts, careful planning, detailed analysis, and heartfelt prayers – even with all that, things sometimes go awry. As the philosopher John Bender said in the profoundly philosophical work <em>The Breakfast Club</em>:</p> <div class="pullquote center" style="text-align: center;">“Screws fall out all the time; the world is an imperfect place.”</div> <p>With that great truth firmly in mind, I wanted to offer some tools and techniques you can use to identify where things may have gone off the rails.</p> <h3 id="yaml-for-the-easily-distracted">YAML for the easily distracted</h3> <p>YAML originally stood for “yet another markup language,” which, like most acronyms, tells you exactly <em>nothing</em> about it. YAML is similar in many ways to XML or JSON, an insight that provides little comfort to many of us who have an emotionally complicated relationship with those other two systems.</p> <p>My personal trauma aside, YAML is great for configuration files because it’s highly structured. But for that same reason, it can be easy to bork something up because of a small (and hard-to-find) oversight. Here are the ones that might trip you up:</p> <ul> <li>Everything you do in a YAML file will be in the form of a key-value pair that follows the pattern: “key: value” <ul> <li>Some examples: <ul> <li>name: local-temp</li> <li>kind: profile</li> <li>interval: 60s</li> </ul> </li> </ul> </li> <li>Underscores, dashes, or spaces can separate words in keys.</li> <li>A key will always end with a colon (:)</li> <li>Indentation in the file matters! <ul> <li>You must have indents.</li> <li>There must be a certain number of spaces.</li> <li>Indents must use spaces. 
<h3 id="stop-start-restart-do-the-hokey-pokey">Stop, start, restart, do the Hokey Pokey</h3> <p>Sometimes, the Kentik agent needs a good swift kick in the… process. To do that, you can use the systemctl utility:</p> <ul> <li><code class="language-text">sudo systemctl stop kagent.service</code></li> <li><code class="language-text">sudo systemctl start kagent.service</code></li> <li><code class="language-text">sudo systemctl restart kagent.service</code></li> </ul> <h3 id="snooping-around-the-kentik-agents-diary-journal">Snooping around the Kentik Agent’s diary (journal)</h3> <p>The Linux journal isn’t some specialized magazine or email newsletter. It’s the onboard record of every outbound message, error, update, whine, sigh, and grumble that your Linux system experiences – especially when it concerns services that run through the systemctl utility.</p> <p>The command to peer inside the Journal is, appropriately enough, journalctl. But typing that by itself will likely yield a metric tonne of mostly irrelevant information. In order to see messages and output specific to Kentik NMS, you should use the command:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">sudo journalctl -u kagent</code></pre></div> <p></p> <p>If that list is overwhelming, you can limit it to recent messages:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">sudo journalctl -u kagent --since "10 minutes ago"</code></pre></div> <p></p> <p>And if you want to see the messages appearing in the Journal in real time, use this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">sudo journalctl -u kagent -f</code></pre></div> <p></p> <h2 id="the-mostly-unnecessary-summary">The (mostly) unnecessary summary</h2> <p>There is more – a whole lot more! – to explore with Kentik NMS, including ways to add SNMP table data, create profiles for completely new device types, and even add data that isn’t coming from SNMP in the first place.</p> <p>Even so, this post should get you moving ahead in collecting those bits of information you know are available on your devices but aren’t collected by default by Kentik NMS today.</p> <p>To get started with Kentik NMS, <a href="https://www.kentik.com/get-started/">sign up for a free 30-day trial</a>.</p><![CDATA[Using Telegraf to Feed API JSON Data into Kentik NMS]]><![CDATA[While many wifi access points and SD-WAN controllers have a rich data set available via their APIs, most do not support exporting this data via SNMP or streaming telemetry. In this post, Justin Ryburn walks you through configuring Telegraf to feed API JSON data into Kentik NMS.]]>https://www.kentik.com/blog/using-telegraf-to-feed-api-json-data-into-kentik-nmshttps://www.kentik.com/blog/using-telegraf-to-feed-api-json-data-into-kentik-nms<![CDATA[Justin Ryburn]]>Wed, 06 Mar 2024 05:00:00 GMT<p>Every once in a while, I like to get my hands dirty and play around with technology. Part of the reason is that I learn best by doing. I also love technology and need to prove to myself occasionally that “I’ve still got it” in my spare time.
This article is based on a <a href="https://ryburn.org/2024/02/25/using-telegraf-to-feed-json-data-from-an-api-into-influx/" title="Using Telegraf to Feed JSON Data from an API into Influx">post</a> I wrote for my personal tech blog.</p> <p><a href="https://www.kentik.com/blog/reinventing-network-monitoring-and-observability-with-kentik-ai/" title="Blog: Reinventing Network Monitoring and Observability with Kentik AI">Kentik recently launched NMS</a>, a metrics platform that ingests data in <a href="https://docs.influxdata.com/influxdb/cloud/reference/syntax/line-protocol/">Influx Line Protocol</a> format, stores it, and provides a great UI to visualize it. We built our own collector, called <a href="https://kb.kentik.com/v0/Bd11.htm" title="Kentik NMS Ranger Knowledgebase Article">Ranger</a>, for SNMP and streaming telemetry (<a href="https://openconfig.net/projects/gnmi/gnmi/" title="Learn more about gNMI">gNMI</a>) data, but we don’t yet support grabbing <a href="https://www.json.org/json-en.html">JSON</a> data via an API call. Reading through the <a href="https://docs.influxdata.com/telegraf/v1/">Telegraf documentation</a>, I realized it should be relatively easy to configure Telegraf to do that. It just so happens I have a Google Wifi setup in my home that exposes an API with some interesting data to observe.</p> <p>In this blog, I will document how I got this all working. This will allow Kentik users to get this data into NMS while we develop our API JSON data collection in our agent.</p> <p>If you’re fond of video tutorials, I made a short video version of this blog post that you can watch below:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/qstefs9lln?seo=true&amp;videoFoam=false" title="Using Telegraf to Feed API JSON Data into Kentik NMS Video" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <h2 id="docker-setup">Docker setup</h2> <p>I am a big fan of installing software using <a href="https://www.docker.com/">Docker</a> to keep my host clean as I play around with things. I also like to use <code class="language-text">docker-compose</code> instead of <code class="language-text">docker run</code> commands to upgrade the containers easily.</p>
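<p>For what it’s worth, that upgrade path with Compose is typically just a pull and a re-up. Assuming the service is named <code class="language-text">telegraf</code>, as in the file below, it would look something like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">docker-compose pull telegraf
docker-compose up -d telegraf</code></pre></div> <p></p>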
<p>For Telegraf, I used the following <code class="language-text">docker-compose.yml</code> file:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">---
version: '3.3'
services:
  telegraf:
    container_name: telegraf
    image: 'docker.io/telegraf:latest'
    environment:
      - KENTIK_API_ENDPOINT="https://grpc.api.kentik.com/kmetrics/v202207/metrics/api/v2/write?bucket=&amp;org=&amp;precision=ns"
      - KENTIK_API_TOKEN=(REDACTED)
      - KENTIK_API_EMAIL=(REDACTED)
    volumes:
      - '/home/jryburn/telegraf:/etc/telegraf'
    restart: unless-stopped
    network_mode: host</code></pre></div> <p></p> <h2 id="telegraf-configuration">Telegraf configuration</h2> <p>Once the container was configured, I built a <code class="language-text">telegraf.conf</code> file to collect the data from the API endpoint on the Google Wifi using the HTTP input with a JSON data format. Using the <code class="language-text">.tag</code> puts the data into the tag set when exported to Influx, and using the <code class="language-text">.field</code> puts the data into the field set when it is exported to Influx. Once all the data I want to collect is defined, I define the outputs. Setting up the Influx output is pretty straightforward: configure the HTTP output to use the Influx format. I had to define some custom header fields to authenticate the API call to Kentik’s Influx endpoint. I also output the data to a file to simplify troubleshooting, an entirely optional step.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Define the inputs that telegraf is going to collect
[global_tags]
  device_name = "basement-ap"
  location = (REDACTED)
  vendor = "Google"
  description = "Google Wifi"

[[inputs.http]]
  urls = ["http://192.168.86.1/api/v1/status"]
  data_format = "json_v2"
  # Exclude url and host items from tags
  tagexclude = ["url", "host"]
  [[inputs.http.json_v2]]
    measurement_name = "/system" # A string that will become the new measurement name
    [[inputs.http.json_v2.tag]]
      path = "wan.localIpAddress" # A string with valid GJSON path syntax
      type = "string" # A string specifying the type (int,uint,float,string,bool)
      rename = "device_ip" # A string with a new name for the tag key
    [[inputs.http.json_v2.tag]]
      path = "system.hardwareId" # A string with valid GJSON path syntax
      type = "string" # A string specifying the type (int,uint,float,string,bool)
      rename = "hardware-id" # A string with a new name for the tag key
    [[inputs.http.json_v2.tag]]
      path = "software.softwareVersion" # A string with valid GJSON path syntax
      type = "string" # A string specifying the type (int,uint,float,string,bool)
      rename = "software-version" # A string with a new name for the tag key
    [[inputs.http.json_v2.tag]]
      path = "system.modelId" # A string with valid GJSON path syntax
      type = "string" # A string specifying the type (int,uint,float,string,bool)
      rename = "model" # A string with a new name for the tag key
    [[inputs.http.json_v2.field]]
      path = "system.uptime" # A string with valid GJSON path syntax
      type = "int" # A string specifying the type (int,uint,float,string,bool)

#
# A plugin that stores metrics in a file
[[outputs.file]]
  ## Files to write to, "stdout" is a specially handled file.
  files = ["stdout", "/etc/telegraf/metrics.out"]
  data_format = "influx" # Data format to output.
  influx_sort_fields = false

# A plugin that can transmit metrics over HTTP
[[outputs.http]]
  ## URL is the address to send metrics to
  url = ${KENTIK_API_ENDPOINT} # Will need API email and token in the header
  data_format = "influx" # Data format to output.
  influx_sort_fields = false
  ## Additional HTTP headers
  [outputs.http.headers]
    ## Should be set manually to "application/json" for json data_format
    X-CH-Auth-Email = ${KENTIK_API_EMAIL} # Kentik user email address
    X-CH-Auth-API-Token = ${KENTIK_API_TOKEN} # Kentik API key
    Content-Type = "application/influx" # Make sure the http session uses influx</code></pre></div> <p></p> <p>The following is what the JSON payload looks like when I do a <code class="language-text">curl</code> command against my Google Wifi API. It should help you better understand the fields I configured Telegraf to look for.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">{
  "dns": {
    "mode": "automatic",
    "servers": [
      "192.168.1.254"
    ]
  },
  "setupState": "GWIFI_OOBE_COMPLETE",
  "software": {
    "blockingUpdate": 1,
    "softwareVersion": "14150.376.32",
    "updateChannel": "stable-channel",
    "updateNewVersion": "0.0.0.0",
    "updateProgress": 0.0,
    "updateRequired": false,
    "updateStatus": "idle"
  },
  "system": {
    "countryCode": "us",
    "groupRole": "root",
    "hardwareId": "GALE C2E-A2A-A3A-A4A-E5Q",
    "lan0Link": true,
    "ledAnimation": "CONNECTED",
    "ledIntensity": 83,
    "modelId": "ACjYe",
    "oobeDetailedStatus": "JOIN_AND_REGISTRATION_STAGE_DEVICE_ONLINE",
    "uptime": 794184
  },
  "vorlonInfo": {
    "migrationMode": "vorlon_all"
  },
  "wan": {
    "captivePortal": false,
    "ethernetLink": true,
    "gatewayIpAddress": "x.x.x.1",
    "invalidCredentials": false,
    "ipAddress": true,
    "ipMethod": "dhcp",
    "ipPrefixLength": 22,
    "leaseDurationSeconds": 600,
    "localIpAddress": "x.x.x.x",
    "nameServers": [
      "192.168.1.254"
    ],
    "online": true,
    "pppoeDetected": false,
    "vlanScanAttemptCount": 0,
    "vlanScanComplete": true
  }
}</code></pre></div> <p></p> <p>Once I start the Docker container, I tail the logs using <code class="language-text">docker logs -f telegraf</code> and see the Telegraf software loading up and starting to collect metrics.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">2024-02-25T17:15:44Z I! Starting Telegraf 1.29.4 brought to you by InfluxData the makers of InfluxDB
2024-02-25T17:15:44Z I! Available plugins: 241 inputs, 9 aggregators, 30 processors, 24 parsers, 60 outputs, 6 secret-stores
2024-02-25T17:15:44Z I! Loaded inputs: http
2024-02-25T17:15:44Z I! Loaded aggregators:
2024-02-25T17:15:44Z I! Loaded processors:
2024-02-25T17:15:44Z I! Loaded secretstores:
2024-02-25T17:15:44Z I! Loaded outputs: file http
2024-02-25T17:15:44Z I! Tags enabled: description=Google Wifi device_name=basement-ap host=docklands location=(REDACTED) vendor=Google
2024-02-25T17:15:44Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"docklands", Flush Interval:10s
/system,description=Google\ Wifi,device_ip=x.x.x.x,device_name=basement-ap,location=(REDACTED),model=ACjYe,os-version=14150.376.32,serial-number=GALE\ C2E-A2A-A3A-A4A-E5Q,vendor=Google uptime-sec=2801i 1708881351000000000</code></pre></div> <p></p> <p>Now I hop over to the Kentik UI, where I am sending the Influx data, and I can see I am collecting the data there as well.</p> <img src="//images.ctfassets.net/6yom6slo28h2/9QF4hKGe9WexlK6H74POM/8448e466718db28faad5f7adc0c8652c/ryburn-influx-kentik.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Sending Influx data to Kentik" /> <h2 id="next-steps">Next steps</h2> <p>With Telegraf set up to ingest JSON, I have opened the door to a critical new data type in Kentik NMS. While many wifi access points and SD-WAN controllers have a rich data set available via their APIs, the challenge is that they do not support exporting this data via SNMP or streaming telemetry. By configuring Telegraf to collect and export this data in Influx, I can graph and monitor the metrics available via those APIs in the same UI I use to <a href="https://www.kentik.com/kentipedia/snmp-monitoring/" title="Kentipedia: SNMP Monitoring: An Introduction and Practical Tutorial">monitor SNMP</a> and <a href="https://www.kentik.com/blog/no-you-havent-missed-the-streaming-telemetry-bandwagon-part-1/" title="Kentik Blog: No, You Haven&#x27;t Missed the Streaming Telemetry Bandwagon">streaming telemetry</a> data.</p> <p>Adding data from APIs to the context-enriched telemetry available in Kentik NMS will make it faster to debug unexpected issues. It also allows me to explore this rich dataset ad-hoc to help me make informed decisions. If you want to try this yourself, get started by <a href="#signup_dialog" title="Request a Free Trial of Kentik">signing up for a free Kentik trial</a> or <a href="#demo_dialog" title="Request a Demo of Kentik">requesting a personalized demo</a>.</p><![CDATA[How Hard Is It to Migrate to Streaming Telemetry?]]><![CDATA[Streaming telemetry is the future of network monitoring. Kentik NMS is a modern network observability solution that supports streaming telemetry as a primary monitoring mechanism, but it also works for engineers running SNMP on legacy devices they just can’t get rid of. This hybrid approach is necessary for network engineers managing networks in the real world, and it makes it easy to migrate from SNMP to a modern monitoring strategy built on streaming telemetry.]]>https://www.kentik.com/blog/how-hard-is-it-to-migrate-to-streaming-telemetryhttps://www.kentik.com/blog/how-hard-is-it-to-migrate-to-streaming-telemetry<![CDATA[Phil Gervasi]]>Thu, 29 Feb 2024 05:00:00 GMT<h2 id="a-plan-to-migrate-to-streaming-telemetry">A plan to migrate to streaming telemetry</h2> <p>It’s essential for engineers to develop a migration plan. As new network devices are rolled out slowly or en masse during the next major hardware refresh, we can expect many, if not all, new devices to support streaming telemetry. Network operations needs to be ready to support streaming telemetry from new devices, but it also needs to support SNMP from the legacy devices that haven’t been replaced yet.
Some legacy network devices that are custom-designed for specific tasks, such as operating in harsh environments, simply can’t be replaced.</p> <p>So, if you think about it, most organizations, whether they’re enterprise networks or service providers, are already in sort of a hybrid situation in which their newer network gear supports streaming telemetry today, while older devices still in production support only SNMP.</p> <p>This requires that network operations have the tools in place to ingest, analyze, alert, and report on both forms of telemetry. Otherwise, they are either missing metrics from production devices or are logging into multiple tools, likely homegrown, to monitor a hybrid infrastructure.</p> <div as="Promo"></div> <h2 id="a-hybrid-approach-to-network-monitoring">A hybrid approach to network monitoring</h2> <p>Therefore, a hybrid approach to monitoring is a necessity. Unfortunately, not all devices in most production networks support streaming telemetry, and an organization can rarely upgrade all their legacy devices in one (swift) fell swoop. There are budget constraints, staff limitations, and limited time to configure every device as a streaming telemetry publisher.</p> <p>But it doesn’t stop there. It’s a tough sell to a network engineer to manage multiple visibility tools, one for streaming telemetry and another for SNMP. This is why it’s so important to have a platform that presents metrics from both in the same dashboard.</p> <p>The shift to streaming has to happen. It solves many of the problems of SNMP and gives network operations a whole new world of faster polling, more accurate results, and fewer monitoring anomalies. So how can we get there?</p> <h2 id="all-network-telemetry-in-one-place">All network telemetry in one place</h2> <p><a href="https://www.kentik.com/product/network-monitoring-system/">Kentik NMS</a> unifies SNMP, streaming telemetry, and custom metrics in one interface, even down to a single dashboard. When an engineer drills into a network device, for example, the data they see about CPU utilization, memory, packet drops, CRC errors, and so on will come from whatever protocol that device supports. In this way, an engineer can look at any device in their network in one place, regardless of whether it uses SNMP or streaming telemetry.</p> <p>Legacy network devices are configured with the appropriate SNMP community information, and Kentik is configured to poll those devices just like a traditional network monitoring system. Newer network devices, configured as streaming telemetry publishers, send their information to Kentik NMS, configured as a subscriber.</p> <p>Data is normalized as it enters the system, meaning data from both telemetry types is combined, redundancies are removed, appropriate scale and range are calculated, and the data is structured so that it’s the same format across all fields. The telemetry type is entirely transparent to the network engineer navigating Kentik NMS.</p> <p>This gives network operations a straightforward way to look at any device in their network in one place, regardless of whether it’s a 15-year-old SCADA switch or a brand-new high-performance router in the data center. In one screen, an engineer can see the best of what both devices can support, whether a five-minute average from SNMP or high-fidelity metrics taken every two seconds via streaming telemetry.</p>
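<p>If you’re curious what that device-side subscription actually looks like, here’s a rough sketch using the open source <code class="language-text">gnmic</code> client. The address, credentials, and path are placeholders, and Kentik’s pre-built subscriber handles this for you, so treat it as an illustration rather than required setup:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Hypothetical example: stream a router's interface counters over gNMI,
# sampled every two seconds (placeholder address and credentials)
gnmic -a 10.0.0.1:57400 -u admin -p admin --insecure \
  subscribe --path "/interfaces/interface/state/counters" \
  --mode stream --stream-mode sample --sample-interval 2s</code></pre></div> <p></p>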
<h2 id="kentik-makes-migrating-to-streaming-telemetry-simple">Kentik makes migrating to streaming telemetry simple</h2> <p>Other than the high cost of purchasing new network devices or the risk and disruption of upgrading code, another hurdle many engineers face when moving to streaming telemetry is unfamiliarity with the mechanisms used to monitor devices via gRPC, gNMI, and API calls. For many, it’s a significant roadblock to moving away from SNMP, a protocol we’re all already familiar and comfortable with.</p> <p>So rather than googling wiki pages and experimenting with Grafana and Prometheus, Kentik provides the configuration documentation and a pre-built subscriber mechanism, so engineers have a low barrier to entry into the world of streaming telemetry. In other words, Kentik makes it very easy to migrate from SNMP to streaming telemetry, including handling the hybrid infrastructure during the transition.</p> <p>Kentik’s field engineering team works with network operations teams strapped for time and resources so that devices can be configured and onboarded into Kentik NMS. Additionally, the field engineering team works with engineers directly on an ongoing basis to optimize the NMS, ensure the data is accurate and useful, and ensure that reporting and alerts are working as intended.</p> <h2 id="an-nms-for-engineers-by-engineers">An NMS for engineers, by engineers</h2> <p>Streaming telemetry is the future of network monitoring for service providers and enterprise organizations alike, or in other words, anyone who manages network devices. Kentik’s approach accommodates those still leveraging SNMP and those who have shifted to a fully streaming telemetry monitoring strategy.</p> <p>But remember that network engineers founded Kentik. That means we understand the reality of change freezes, waiting for budgets to open up, and needing to run legacy devices to accommodate corner cases. We know migrating from one protocol to another is closely tied to the hardware and software in production, which is no easy task to change.</p> <p>Not only is Kentik NMS a modern solution for network infrastructure monitoring, but we also provide hybrid support for engineers managing networks in the real world.</p> <p>For more information about Kentik NMS and the Kentik field engineering team, visit <a href="https://www.kentik.com/nms">www.kentik.com/nms</a>.</p><![CDATA[Closing the Interconnection Data Gap: Integrating PeeringDB into Kentik Data Explorer]]><![CDATA[Kentik users can now correlate their traffic data with internet exchanges (IXes) and data centers worldwide, even ones they are not a member of – giving them instant answers for better informed peering decisions and interconnection strategies that reduce costs and improve performance.
]]>https://www.kentik.com/blog/closing-the-interconnection-data-gap-integrating-peeringdb-into-kentik-data-explorerhttps://www.kentik.com/blog/closing-the-interconnection-data-gap-integrating-peeringdb-into-kentik-data-explorer<![CDATA[Lauren Basile]]>Wed, 28 Feb 2024 05:00:00 GMT<p>Less than a year ago, Kentik became the <a href="https://www.kentik.com/blog/making-peering-easy-announcing-the-integration-of-peeringdb-and-kentik/">first network observability platform to integrate PeeringDB</a> – delivering better database usability, visibility into common footprint and traffic ratios between customer networks and other ASNs, and more.</p> <p>Building on this milestone, we’ve taken our integration a step further by integrating PeeringDB data into Kentik Data Explorer. This means Kentik customers can now use a set of PeeringDB-sourced filters to quickly and easily evaluate traffic between their network and networks that are members of an internet exchange (IX) or present at a data center, making data-driven peering decisions and interconnection analysis truly faster than ever before.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1A5aOCvRAFEQTlCAKLVRCe/eff3d6882527f8954e4f33dd17dd8949/peering-db-filter-set.png" style="max-width: 600px;" class="image center" thumbnail alt="PeeringDB filters in Kentik platform" /> <div class="caption" style="margin-top: -35px;">New set of source and destination PeeringDB filters available in Kentik Data Explorer, enabling easy evaluation for networks looking to optimize existing interconnects and/or deploy at new IXes and data centers.</div> <h2 id="the-problem">The problem</h2> <p>Peering managers, network planners, and edge strategy experts are always looking for ways to improve network performance, increase reliability, and reduce connectivity costs.</p> <p>But getting the data to support strategic decisions like peering partnerships is a monumental task fraught with tedious data gathering and inefficient processes that drain time, energy, and, at times, sanity. Spending valuable resources collecting data to support strategic decisions, only to track all of it in a spreadsheet, is an exercise in futility that many of us would not like to revisit.</p> <p>Let’s take a look at a seemingly simple scenario – figuring out whether it makes sense for a network to join an IX. How do you get an estimate of the volume of traffic the network can offload to peers on the exchange? This multi-step process consists of:</p> <ol> <li>Going through the list of exchange members</li> <li>Determining the traffic exchanged with each</li> <li>Checking their peering policy</li> <li>Guessing whether a session to an existing peer will move traffic around</li> <li>Making a guess of how much traffic would be moved from your paid connections.</li> </ol> <p>Without Kentik, this is painful and time-consuming, sometimes requiring peering teams to meet with engineering to manually correlate PeeringDB records with their own flow data set or use a homegrown script to help with said correlation.</p> <p>And this gets to the heart of today’s peering operations woes: analytical decisions are gated by an inordinate amount of difficulty in collecting and correlating data sets together. 
A lot of time is spent on spreadsheet manipulation, or the decisions are ultimately made based on hunches or arcane culture.</p> <div as="Promo"></div> <h2 id="closing-the-interconnection-data-gap">Closing the interconnection data gap</h2> <p>At Kentik, we’re making peering simpler and more intuitive by eliminating this gap between the peering community and the contextualized data they need to do their jobs efficiently. Our latest feature release is a meaningful and significant step in this direction – finally, peering coordinators can get instant answers to the questions that matter most.</p> <p>Let’s take that same IX scenario, but this time with Kentik. Figuring out whether it’s worth joining an IX is as simple as running a query (or two) in our platform using the newly available PeeringDB filters. In a matter of seconds, users can ask Kentik to: <em>Show me my top Source ASNs where said ASNs are registered in this specific IX, with an open peering policy.</em></p> <img src="//images.ctfassets.net/6yom6slo28h2/31lNBZNhY09S8nW3yirYCw/7eb45825c584268d0e44e2e8952f56a6/ams-ix-transit-traffic-arrow.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik query with Peering DB filters" /> <div class="caption" style="margin-top: -35px;">Kentik query using PeeringDB filters showing substantial transit traffic that could be offloaded to peering at AMS-IX.</div> <p>Hours of data gathering and correlation, reduced down to minutes with Kentik. Users can easily calculate the cost differential between the price of the port at that IX versus how much savings they would receive moving the transit traffic results to peering, which Kentik’s <a href="https://www.kentik.com/blog/why-and-how-to-track-connectivity-costs/">Connectivity Costs workflow</a> can assist with.</p> <h2 id="peeringdb-filters-in-action">PeeringDB filters in action</h2> <p>Screenshots are great, but let’s see this short product demo where we walk through this use case in the Kentik platform:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/1qkj3prdck" title="Evaluating Traffic at IXes and Data Centers with PeeringDB Filters in Kentik" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>The possibilities don’t stop there. 
With these new filters, Kentik users can query on both IXes and data centers, enabling them to:</p> <ul> <li><strong>Maximize the use of current IX ports and reduce transit costs</strong> by reclaiming transit traffic and converting it to IX peering traffic.</li> <li><strong>Increase network resiliency</strong> by identifying additional peering locations where you can meet your existing peers.</li> <li><strong>Find the internet exchanges and data centers</strong> where most of your traffic’s source and destination ASNs have a presence.</li> <li><strong>Evaluate which IX to deploy to next</strong> based on your traffic profile and how much assured traffic you can peer off, and estimate the port capacity you will need.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/78Trxgx4zerMV129EypGbi/972700f0e5fdf2a83de6213143eb205d/ashburn-ix-comparison.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Comparing multiple IXes in a region" /> <div class="caption" style="margin-top: -35px;">Easily compare multiple IXes in a specific region to see which one has the most traffic that could be converted to peering.</div> <p>With this latest update, network operators now have the crucial data and actionable insights needed to drastically streamline peering operations and create true data-backed cases to drive interconnection strategy.</p> <h2 id="onboarding">Onboarding</h2> <p>As the cherry on top, we also rolled out <a href="https://new.kentik.com/interface-classification-peeringdb-integration-3wMSVW">auto-classification capabilities</a> that automatically map customer IX interfaces to PeeringDB IXes and facilities during the Kentik onboarding process, reducing tedious and repetitive onboarding tasks. Win-win.</p> <h2 id="the-long-game">The long game</h2> <p>These updates are incredibly exciting, and even more significant developments are on the horizon as we continue to explore the powerful combination of flow traffic analytics, PeeringDB records, our connectivity costs workflow, and AI. For example, what if Kentik ran a report on all of your top Source or Destination ASNs every day and made informed, analytical auto-suggestions about which IXes or data centers you should go to next? Keep your eyes peeled, because this is where we are headed.</p> <h2 id="get-started">Get started</h2> <p><a href="https://www.kentik.com/get-demo/">Sign up for a demo</a> or join us at Peering Days in Poland from March 5-7, where our own Nina Bargisen will be speaking about how peering coordinators can use these multiple data sources to boost efficiency and streamline operations. If you’re already a Kentik customer, reach out to your customer success team to learn more.</p><![CDATA[Getting Started with Kentik NMS]]><![CDATA[Kentik NMS is here to provide better network performance monitoring and observability.
Read how to get started, from installation and configuration to a brief tour of what you can expect from Kentik’s new network monitoring solution.]]>https://www.kentik.com/blog/getting-started-with-kentik-nmshttps://www.kentik.com/blog/getting-started-with-kentik-nms<![CDATA[Leon Adato]]>Tue, 27 Feb 2024 05:00:00 GMT<p>In my <a href="https://www.kentik.com/blog/setting-sail-with-kentik-nms-unified-network-telemetry/">last post</a>, I explored the reasons why Kentik recently built and released an all-new network monitoring system (NMS) that includes traditional techniques like SNMP along with more modern methods of collecting telemetry like APIs, OpenTelemetry, and influx.</p> <p>For this article, we’ll jump right into getting set up in <a href="https://www.kentik.com/product/network-monitoring-system/">Kentik NMS</a>, from installation and configuration to a brief tour of what you can expect from Kentik’s new network monitoring solution.</p> <img src="//images.ctfassets.net/6yom6slo28h2/XStp6r9ae7J4wNCw6wrAt/b3e5cce7b003d2883cbe5d71403c8d52/meme-witness-the-power.png" style="max-width: 400px;" class="image center" alt="Network monitoring system (NMS) meme" /> <h2 id="before-you-begin-installation">Before you begin installation</h2> <p>There’s nothing more frustrating than being ready to test out a new piece of technology and then finding out you’re not prepared. So before you head down to the “Installation and Configuration” section, make sure you have the following things in hand:</p> <ol> <li>A system to install the Kentik NMS collector on. The collector is an agent that can be installed directly onto a Linux-based system or as a Docker container. Per the <a href="https://kb.kentik.com/v0/Bd11.htm#Bd11-NMS_Agent_Requirements">Kentik Knowledge Base</a> instructions, you’ll want a system with at least a single core and 4GB of RAM.</li> <li>Verify the system can access the required remote sites: <ul> <li>Docker Hub</li> <li>TCP 443 to grpc.api.kentik.com (or kentik.eu for Europe)</li> </ul> </li> <li>Verify the system can access the devices you want to monitor (a quick way to check this is shown below): <ul> <li>Ping (ICMP)</li> <li>SNMP (UDP port 161)</li> </ul> </li> <li>Check that you have the following information for the devices you want to monitor: <ul> <li>A list of IP addresses and/or one or more CIDR notated subnets (example: 192.168.1.0/24)</li> <li>The SNMP v2c read-only community string and/or SNMP version 3 username, authentication type and passphrase, privacy type, and passphrase</li> </ul> </li> <li>You have a Kentik account. If you are just testing NMS out, we’d recommend <em>not</em> using an existing production account. (Not because NMS is unsafe, but because accidentally triggering an event that the real Helpdesk picks up, and thinks it has to respond to, is no fun. Be nice to your support folks. They know where all your data is kept.) If you don’t have an account, head over to <a href="https://portal.kentik.com/login">https://portal.kentik.com/login</a> and get one set up.</li> </ol>
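<p>As an optional pre-flight for item 3, assuming the net-snmp command-line tools are installed on the collector system and your device uses an SNMP v2c community (shown here with the placeholder <code class="language-text">public</code> and a placeholder IP), you can confirm a device is reachable before discovery ever runs:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Confirm the device answers ping, then SNMP
ping -c 3 192.168.1.10
snmpwalk -v2c -c public 192.168.1.10 system</code></pre></div> <p></p> <p>If the walk returns system details, discovery should have no trouble reaching that device.</p>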
<p>Once you’ve got all of your technical ducks in a row (which, to be honest, shouldn’t take that long), you’re ready to get started on this NMS adventure!</p> <div as="Promo"></div> <h2 id="installation-and-configuration">Installation and Configuration</h2> <p>Whether you install the Kentik NMS collector on a Linux system (physical or virtual) or in a Docker container, you’ll start in the portal. Click the “hamburger menu” (the three lines in the upper left corner), which shows the full portal menu.</p> <img src="//images.ctfassets.net/6yom6slo28h2/lpt9TW6HjVy5vDfGbPpIS/9b4a27ddd2d2f30580bb21c09ef8d125/installation-and-config-menu.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik main menu" /> <img src="//images.ctfassets.net/6yom6slo28h2/5LGJMgWE4vH1bthQyHvgmF/cd979d429946e36a48ff6496efff17ce/discover-devices-button.png" style="max-width: 160px; padding: 0;" class="image right" alt="Discover Devices button" /> <p>Click “Devices” and then select the friendly blue “Discover Devices” button in the upper right corner.</p> <p>The next screen allows you to install the collector, either as a Docker container: <img src="//images.ctfassets.net/6yom6slo28h2/31t0BwbMvfYSYmVBRc5l4K/2cee5750a174f0852ee3a51c87d5cca9/installation-docker.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Docker installation" /></p> <p>or on a full Linux system.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6c8nu6f4PGil0U1r7rNHs9/b38c96ea397e3b90c5d08ef35c77218e/installation-linux.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Linux installation" /> <p>Shortly after doing that, you’ll see the agent name (or the name of the system the agent is installed on) show up in the “Select an Agent” area below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/TFL4T7HsP1u3w3hinaJsq/f8e0b2be3effc8d58f04b532ef0e75f2/select-agent.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Select an agent" /> <p>Go ahead and click “Use this Agent.”</p> <p>From the next screen, you’ll enter an IP address, a comma-separated list of IPs, or a CIDR-noted range (example: 192.168.1.0/24).</p> <img src="//images.ctfassets.net/6yom6slo28h2/1HedagqbxXQHmKanhIAPGK/60ddea025214ee0e2c9e4f44c2c058ac/ip-range.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Enter IP address" /> <p><strong>Trick</strong>: You can mix and match, including individual IPs and CIDR ranges (example: 192.168.1.0/24, 10.10.10.5).</p> <p><strong>Another trick</strong>: If there are specific systems you want to ignore, list them with a minus (-) in front (example: 192.168.1.0/24, -192.168.1.13).</p> <p>Presuming this is your first time adding devices, you’ll probably have to click “Add New Credential.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/7hmJTYkX4jsq9jGXm2IQMQ/52e3ededb59b53c72bc871f4c4345146/snmp-v2.png" style="max-width: 450px;" class="image center" withFrame alt="Choose SNMP v2" /> <img src="//images.ctfassets.net/6yom6slo28h2/1BL6WZkNUex3YDCsua75HA/3a24cb93395603b6a5073fdd2dec04df/snmp-v3.png" style="max-width: 450px;" class="image center" withFrame alt="Choose SNMP v3" /> <p>Let’s get this out of the way: You will <em>never</em> select SNMP v1. Just don’t.</p> <p>That said, choose SNMP v2c or v3, include the relevant credentials, give it a unique name, and click “Add Credential.”</p> <p>Then select it from the previous screen.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4UUtcTPsQmGtptRUvf6Mvo/9883786b575b682c4b2ffe358a349241/ip-range-credentials-discovery.png" style="max-width: 650px;" class="image center" thumbnail withFrame alt="Credentials plus Start Discovery" /> <p>At that point, click “Start Discovery” to kick off the real excitement.</p> <p>The collector will start pinging devices and ensuring they respond to SNMP. Once completed, you’ll see a list of devices.
You can check/uncheck the ones you want to monitor and click “Add Devices.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/6kzCRFzDXAYV2PCkWcx14W/373814734e6a0670c285528281b1af74/discovery-complete.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Discovery complete" /> <h2 id="a-brief-tour">A brief tour</h2> <h3 id="the-main-nms-screen">The main NMS screen</h3> <p>Now that we have some devices installed and are collecting data, let’s take a quick look around.</p> <p>Back up in the main Kentik menu, click “Network Monitoring System.” That will drop you into the main dashboard.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6DNEnZ1CyvGQ4Pqj1tsZqi/211e8096ef0a192f884852fbdfd506ab/nms-dashboard.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik Network Monitoring System dashboard" /> <p>On the main screen, you’ll see:</p> <ul> <li>A geographic map showing the location of your devices</li> <li>A graph and a table showing the availability information for those devices</li> <li>An overview of the traffic (bandwidth) being passed by/through your infrastructure</li> <li>Any active alerts</li> <li>Tables with a sorted list of devices that have high bandwidth, CPU, or memory utilization</li> </ul> <h3 id="the-devices-list">The Devices list</h3> <p>Returning to the hamburger menu, we’ll revisit the “Devices” list, but now that we <em>have</em> devices, we’ll take a closer look.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5tcrcIeuB4VMLvDRtSkjwX/22c7254f17aa8153e0e6d8b4cd8d4384/device-list.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Device List" /> <p>This page is exactly what it claims to be – a list of your devices. From this one screen, you have easy access to the ability to:</p> <ul> <li>Sort the list by clicking on the column headings.</li> <li>Search for specific devices using any data types shown on the screen.</li> <li>Filter the list using the categories in the left-hand column.</li> </ul> <p>There are also some drop-down elements worth noting:</p> <ul> <li>The “Group By” drop-down adds collapsable groupings to the list of devices.</li> <li>The “Actions” drop-down will export the displayed data to CSV or push it out to Metrics Explorer for deeper analysis.</li> <li>The “Customize” option in the upper right corner lets you add or remove data columns.</li> </ul> <p>And we’re already familiar with the friendly blue “Discover Devices” button.</p> <h3 id="the-interfaces-list">The Interfaces list</h3> <p>Remember all the cool stuff I just covered about devices? The following image looks similar, except it focuses on your network interfaces.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4m65D4xQ6PjcOcEjGhpOCn/449c590b88437d193d609a34da531332/interface-list.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Interfaces list" /> <h3 id="metrics-explorer">Metrics Explorer</h3> <img src="//images.ctfassets.net/6yom6slo28h2/56i2ql1nE2sYtEkHeZrlzA/287855d9b8bd33c34e2fb6f2eb60345c/metrics-explorer.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Metrics Explorer" /> <p>Metrics Explorer is, in many ways, identical to Kentik’s existing <a href="https://kb.kentik.com/v4/Db03.htm">Data Explorer</a> capability. It’s also incredibly robust and nuanced. 
So much so that it deserves, and will get, its own dedicated blog post.</p> <p>For now, I will give this incredibly brief overview just to keep this post moving along:</p> <p>First, all the real “action” (meaning how you interact with Metrics Explorer) happens in the right-hand column.</p> <p>Second, it’s important to remember that the entire point of the Metrics Explorer is to help you graphically build a query of the network observability data Kentik is collecting.</p> <p>With those two points out of the way, the right-hand column has five primary sections:</p> <ol> <li><strong>Measurement</strong>: This allows you to select which data elements to include and how they are used.</li> <li><strong>Visualization options</strong>: This section controls how the data displays on the left.</li> <li><strong>Time</strong>: The period of time to display data from and whether to display time markings in UTC or “local” (the time of whatever computer is viewing the graph).</li> <li><strong>Filtering</strong>: This will let you add limitations so that only data that matches (or does <em>not</em> match) certain criteria is included.</li> <li><strong>Table options</strong>: These set the options for the table of data that displays below the graph and let you select how many rows to show and whether they’ll be aggregated by Last, Min, Max, Average, or P95 methods.</li> </ol> <p>And that ends our brief tour!</p> <h2 id="parting-words">Parting words</h2> <p>We’ve only skimmed the surface of what Kentik NMS offers, but hopefully, you’re ready to start adding your own devices and interfaces. We’ll be back soon with more NMS tutorials and walkthroughs, but in the meantime, <a href="https://www.kentik.com/get-started/">sign up now to get started with a 30-day free trial of Kentik</a> and see <a href="https://www.kentik.com/product/network-monitoring-system/">Kentik NMS</a> in action yourself.</p><![CDATA[Understanding DDoS Attacks: Motivation and Impact]]><![CDATA[DDoS attacks disrupt services and damage reputations, with motivations ranging from political to personal. These attacks can also mask more severe security breaches, so early detection and mitigation are crucial. Learn how Kentik provides a solution by analyzing enriched NetFlow data to identify and mitigate DDoS threats. ]]>https://www.kentik.com/blog/understanding-ddos-attacks-motivation-and-impacthttps://www.kentik.com/blog/understanding-ddos-attacks-motivation-and-impact<![CDATA[Nina Bargisen]]>Wed, 21 Feb 2024 05:00:00 GMT<p>There are many motivations for initiating a DDoS attack. Many are political, some are motivated by competition, and others are out of spite — such as disgruntled/former employees. Perpetrators can bring a target’s infrastructure to its knees, leveraging the situation to extort money or information, or to apply negotiation pressure.</p> <p>DDoS attacks are also used as a smoke-screen for other more insidious attacks, such as the introduction of malware or a more overt crime, like theft.</p> <h2 id="ddos-protection-with-network-observability">DDoS protection with network observability</h2> <p>Early detection and mitigation are critical for businesses that want to protect themselves against a DDoS attack. Some DDoS attacks are sophisticated enough to shut down large servers successfully and even completely disable a target’s network.
This severe disruption to services and applications can result in direct revenue loss and damage to a brand’s reputation.</p> <p>Let’s look at how Kentik can help networks detect, analyze, and mitigate DDoS attacks.</p> <div as="Promo"></div> <h3 id="low-volume-ddos-attacks">Low volume DDoS attacks</h3> <p>When most people think of DDoS attacks, they think of massive volumetric attacks that crash websites or networks. In reality, most DDoS attacks are small in size and duration, often less than 1 Gbps and only a few minutes long, making them difficult to detect. DDoS detection tools are often configured with detection thresholds that ignore or don’t see these attacks. These low-volume attacks are often used to mask security breaches. Hackers will use a DDoS attack to distract SecOps while simultaneously launching a more rewarding security breach. The security breach could involve exfiltrating data, mapping networks for vulnerabilities, or infiltrating ransomware.</p> <h3 id="detection">Detection</h3> <p>Identifying where traffic originates, and what normal traffic flows from those sources look like, is keystone data for a defense strategy.</p> <p>Kentik analyzes your real-time and historical NetFlow data, constantly comparing this traffic flow data against benchmarks to catch anomalous traffic patterns, giving network and security engineers what they need most: the awareness and time to mitigate the attack and protect their network before it does damage.</p> <p>The solution allows you to baseline against small traffic volumes, and this way, your network engineers can fine-tune thresholds and alerts accordingly. The dynamic baseline function that uses machine learning to determine normal traffic is particularly powerful for low or slowly-growing traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/61hziEsu9Z7BB0ljMSU534/03507a512c6e8d689c22ac9aa39d69ac/ddos-defense-dashboard.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="DDoS detection using NetFlow and machine learning" /> <div as="Testimonial" index="0" color="blue"></div> <h3 id="policies">Policies</h3> <p>You can build policies based on specific geographies, such as an alert if the traffic is from an embargoed country. Being able to identify the source of the traffic can help tremendously in the detection of security breaches. Identifying traffic from an unusual source may be the key to early mitigation. Enriching the NetFlow data with network, geographical, and security context is the key to early detection.</p> <p>Part of building the policies is to decide what action to take when the policy results in an alert. Is this a straightforward attack that needs immediate mitigation, or does this require human eyes on the data to decide what action to take?</p> <h3 id="ddos-mitigation">DDoS Mitigation</h3> <p>You can mitigate a DDoS attack using several different methods. Two distinct groups of methods are supported for automation: BGP-native methods that remove traffic outright, and scrubbing.</p> <div as="Testimonial" index="1" color="purple"></div> <p>The simplest BGP native method is remote triggered black hole (RTBH), where you redirect all traffic destined to the attacked IP space, be it a single address or a prefix, to a black hole where the packets are dropped and not forwarded.</p> <p><a href="https://www.kentik.com/blog/what-is-adaptive-flowspec-and-does-it-solve-the-ddos-problem/">Flowspec</a> offers a much more granular removal of unwanted traffic. You can signal policies to the network edge and filter the unwanted traffic there based on the network context and details of the attack, like port numbers, protocols, source, destination, and more.</p>
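<p>To make that concrete, here’s a rough, vendor-specific sketch of what such a rule can look like (Junos-style syntax with placeholder values; the exact configuration varies by platform, and in practice Kentik generates the Flowspec announcements for you):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Hypothetical Flowspec route: drop a UDP reflection attack aimed at
# 203.0.113.10 from source port 389, leaving all other traffic untouched
routing-options {
    flow {
        route drop-reflection-attack {
            match {
                destination 203.0.113.10/32;
                protocol udp;
                source-port 389;
            }
            then discard;
        }
    }
}</code></pre></div> <p></p>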
<p>Hardware-based scrubbing requires dedicated devices and a way to redirect traffic to them; there, the attack traffic is identified and removed inline, and the clean traffic is forwarded to the attacked destination. Kentik’s powerful API is the foundation for integrations with A10 and Radware scrubbers and <a href="https://www.kentik.com/blog/cybersecurity-cloudflare-and-kentik-mitigate-ddos-attacks/">Cloudflare’s Magic Transit cloud-based scrubber</a>. The RTBH and Flowspec functionality can be combined with the API to build custom integrations to other scrubbing platforms.</p> <h2 id="understanding-the-attack-in-context">Understanding the attack in context</h2> <p>SNMP data is not enough!</p> <p>Flow data gives you the ability to understand the attack in context. It details where the attack is coming from and what IP addresses, ports, or protocols make up the attack. The enriched NetFlow in the platform provides a deeper analysis, with dimensions like device, interface, geographical location of the attack sources, which network hosts the sources of the attack, and which providers bring the traffic to you.</p> <p>This context helps with mitigation by making it possible to understand the nature of the attack better and apply more accurate filters against the traffic.</p> <p>When you create the alerting policies, you decide what traffic you monitor. Any query made in Kentik Data Explorer can be the basis for an alert. There are several predefined policies available that can be used directly or customized.</p> <p>From the alert overview or the DDoS workflow, there is quick access to visual and text information about what triggered an alert. From there, a deeper analysis is defined in the dashboards that are associated with the policy. This way, the experience of deep forensic analysis done either during or after an attack can be transformed into a predefined workflow for the next time this type of attack happens.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6jFXVWMje7P5012h7uoZrS/a952d690c9843c7598b9d5c8d91d6b18/ddos-investigation-alerts.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="DDoS alerting for forensic analysis" /> <div class="caption" style="margin-top: -25px;">Show the exact details you need for understanding an attack.</div> <h2 id="determining-the-effectiveness-of-ddos-mitigation">Determining the effectiveness of DDoS mitigation</h2> <p>Mitigation services and technologies sometimes don’t achieve full coverage, and attack traffic can circumvent the mitigation, leaving you exposed. Using NetFlow to analyze what DDoS traffic has been redirected for scrubbing and what traffic has been missed is essential. And perhaps just as important, monitoring BGP from hundreds of vantage points can enable you to understand how quickly your mitigation service achieved full coverage, if it did at all.</p> <p>The BGP visualization below shows a DDoS mitigation vendor (purple) appearing upstream of the customer network but never achieving complete coverage. Below, we can see the result of this incomplete activation, as only a portion of DDoS traffic is ultimately redirected to the DDoS mitigation vendor.
An incomplete DDoS mitigation permits attack traffic to reach the target network, imperiling critical services.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/64sLSbhNFc66YH8a9JzOZI/b4128bafaec334e0d455a04fff72a5af/incomplete-ddos-mitigation-example.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Incomplete DDoS mitigation example" /> <h2 id="conclusion">Conclusion</h2> <p>DDoS attacks present a significant threat to businesses of all sizes, capable of disrupting services, tarnishing brand reputations, and incurring substantial financial losses. As these attacks evolve in complexity and intensity, the need for advanced, comprehensive defense strategies becomes imperative. Kentik emerges as a powerful ally in this ongoing battle, offering a robust suite of tools for early detection, granular analysis, and effective mitigation of DDoS threats.</p> <p>Through its sophisticated analysis of NetFlow data, enriched with network, geographical, and security context, Kentik provides unparalleled visibility into network traffic. This enables businesses to detect and mitigate attacks in their infancy and understand the nature of these threats in a broader context. With capabilities ranging from detecting subtle, low-volume attacks to deploying advanced mitigation techniques like BGP Flowspec and hardware-based scrubbing, Kentik equips network and security engineers with the resources they need to protect their infrastructure.</p><![CDATA[Using Kentik Journeys for Network Troubleshooting]]><![CDATA[Kentik Journeys uses an AI-based, large language model to explore data from your network and troubleshoot problems in real time. Using natural language queries, Kentik Journeys is a huge step forward in leveraging AI to democratize data and make it simple for any engineer at any level to analyze network telemetry at scale.]]>https://www.kentik.com/blog/using-kentik-journeys-ai-for-network-troubleshootinghttps://www.kentik.com/blog/using-kentik-journeys-ai-for-network-troubleshooting<![CDATA[Phil Gervasi]]>Thu, 08 Feb 2024 05:00:00 GMT<p><a href="https://www.kentik.com/solutions/kentik-ai/" title="Learn more about Kentik Journeys and Kentik AI">Kentik Journeys</a> is a new way for engineers to dig deep into network telemetry using natural human language. Just like using ChatGPT to interrogate a dataset and ask a series of questions that build on each other, Journeys gives you the ability to leverage a large language model and natural language processing to explore data from your network and troubleshoot problems in real time.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3dbM47LB2A2m6c19RbRyGy/6742f18d4c38328d459f2c4cbda58df3/journeys-home.png" style="max-width: 700px;" class="image center" thumbnail withFrame alt="Journeys AI for network telemetry with LLM examples" /> <p>Think about how we typically troubleshoot a network problem. Usually, it starts with asking, “What’s wrong?” followed by a series of follow-up questions based on what we learn each step of the way. 
In the same way, Journeys gives you the ability to perform root cause analysis more easily and quickly than ever before.</p> <div class="pullquote center" style="text-align: center;">Kentik will remember what you asked and respond to your follow-up questions with previous questions and answers in mind.</div> <p>Rather than study graphs of device metrics, complex Sankeys of flow data, or the output of a dozen show commands, you can simply ask Kentik a question or ask it to show you some specific data, and the system will programmatically query device metrics and flow data for you. And because troubleshooting usually requires a series of probing questions, Kentik will remember what you asked and respond to your follow-up questions with previous questions and answers in mind.</p> <p>Application delivery relies on many network and network-adjacent components and services, so we’ve designed Journeys to work across the entire Kentik product surface. So, let’s walk step-by-step through an example of using Journeys to troubleshoot a performance issue. Pay special attention to how follow-up questions are related to previous questions and how we can save our Journey to share with our team.</p> <h2 id="scenario">Scenario</h2> <p>In our scenario, people in one specific branch office are complaining about application performance. Specifically, the connection to the application breaks intermittently, which affects application performance and the user experience.</p> <p>The location is connected to on-prem data centers and the public cloud by a Cisco SD-WAN, which the application’s local Postgresql mechanism uses to connect to the resources it needs.</p> <p>We can begin troubleshooting by asking some basic questions because Kentik ingests device metrics, flow data, and contextual telemetry about the entire organization, including its public cloud environment.</p> <h2 id="natural-language-troubleshooting-with-journeys">Natural language troubleshooting with Journeys</h2> <h3 id="step-1">Step 1</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/1jfv9jVGJr4hqVXAhk1u4Y/8d1e88bfd63dbebdbd6ecd5e3022507d/journeys-button.png" style="max-width: 220px;" class="image right" alt="Start an AI Journey" /> <p>First, we select “New Journey” to start the process. Because we intend to keep this entire conversation to refer to later, we’ll give it a more useful name, in this case, “Postgres issue.”</p> <p>Journeys currently supports any queries related to flow data, which would typically be visible in Data Explorer, as well as metrics from SNMP and streaming telemetry, which we’d normally see in Kentik NMS Metrics Explorer. Over time, expect this to expand to more telemetry from across your infrastructure and clouds.</p> <h3 id="step-2">Step 2</h3> <p>Since all our Cisco SD-WAN devices have “cedge” and the site name in their names, we can start by asking to see all the traffic traversing our SD-WAN edge devices in the last couple of hours when the issue was reported.</p> <p>Type the following query into the input text box:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2m5nZpbCp5HZ9iyVeJBPNz/58961f5974f312a76f471d0784890bf5/journeys-ai-search.png" style="max-width: 800px;" class="image center no-shadow" alt="AI query input bar" /> <p>Here, we’re using natural language to ask the system a question about traffic, which then turns into a query in the Data Explorer.
The system interprets our natural language and turns it into a set of filters to give us the result below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5K6QSVPGHEne2wYrtvZzrM/9442904723d52e316cde0d5331666f16/step2-input-query.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Natural language query shows the result in Data Explorer" /> <p>In the output above, we can see the applications traversing our cedge devices, which is a good start, but the application we’re interested in likely has low-volume traffic, and it is not visible among the top applications.</p> <p>We can drill down further and add filters using natural language that will look for our PostgreSQL application. Keep in mind that the system remembers what we already asked, so as long as we stay in this Journey, we can ask a follow-up question in that context.</p> <h3 id="step-3">Step 3</h3> <p>We can add an additional filter for PostgreSQL by typing this query in the text box:</p> <img src="//images.ctfassets.net/6yom6slo28h2/I5DDlwjPj2JxZMcxZP5GE/2a28d5fcc8dffc01dbedcc9a34fe63c6/step3-add-filter-app-postgresql.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Add a filter in conversational AI to expand your query" /> <p>Notice the specific traffic for our PostgreSQL application in the results above. This is great, but we need more details to understand why it isn’t working correctly. We need to see which SD-WAN edge devices are actually seeing this traffic, and to learn that, we can simply add that question to our growing Journey.</p> <h3 id="step-4">Step 4</h3> <p>For our next step, we can add:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2rLM4YNVk8dReXWhgWXzAb/65f7d284280425c3c65dadb5974b79af/step4-grouping-by-app-device.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Filter with natural language: Add grouping by application and device" /> <p>In the results, we can see that this traffic goes over multiple devices. However, since users reported the problem specifically for one location, Site 01, we need to filter our view to only the Site 01 edge device.</p> <p>Our cedge has multiple connections to the internet, so we also need to filter our interfaces to see precisely which WAN link our application traffic is using. Remember that we’re using natural language query (NLQ), so we only need to group these results by destination interface using plain language.</p> <h3 id="step-5">Step 5</h3> <p>We can do that by adding our next input:</p> <img src="//images.ctfassets.net/6yom6slo28h2/54sXTEs8KPku5BhhCW9K0v/13105e726f62433a5db6e2b50feb4296/step5-filter.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Build on NLQ with further filters" /> <p>Ok, so the output above is interesting. The result of adding that filter shows us that traffic has been going over <em>two</em> WAN interfaces and not just one. One of the interfaces has the Cisco SD-WAN color Silver, and the other Gold. Notice the two colors in the graph denoting two different interfaces, GigabitEthernet1 and GigabitEthernet2.</p> <p>If we hover the mouse over the results in the table, the system will highlight the results on the chart for us. When we do that (see images below), notice that the traffic is alternately routed over these two links in different time intervals.
This is not the behavior we want or expect, and it’s so far the most probable cause of the intermittent application issues.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4geDLzuRxYmwNn3eWMNrdL/0e0981f90c4ece8422b30a8670a0a09a/step5a.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Using AI to troubleshoot network traffic issues" /> <img src="//images.ctfassets.net/6yom6slo28h2/1fPDFTc5UcQiKCjq7uQA8K/e0cace31dc9bef82618b6d236ab22642/step5b.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Using AI to show details of network traffic routing" /> <p>Now we’re really getting somewhere, but seeing the likely cause is one thing — understanding <em>why</em> is a different story. We need to keep digging further to figure out what’s happening.</p> <p>It’s important to remember that rather than poring over charts and graphs or running show commands on cedge devices, we’ve been using natural language to easily query our data and ask follow-up questions. This means even a novice network engineer can troubleshoot complex network problems; novice or not, anyone using this method will get to the answer faster.</p> <h3 id="step-6">Step 6</h3> <p>Let’s keep digging to figure out what’s happening here. So far, we’ve learned that something is causing the cedge to switch the forwarding path from time to time. Usually, that’s because the cedge sees some sort of problem with the link, or in other words, loss, latency, jitter, etc., that exceeds the thresholds we set.</p> <p>We can check interface utilization and bitrate and ensure there are no bottlenecks that would cause packet drops.</p> <p>Since we’re now asking questions about device metrics, Journeys will automatically apply our natural language query to the underlying metrics dataset, Metrics Explorer.</p> <p>Let’s add the following filter to our series of queries:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5LQxggpgmJN1664OMRjTZF/64cdeae4564355a45120b193a6d9e424/step6.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Query is automatically applied to the dataset" /> <p>According to the output above, there isn’t a smoking gun we can point to. Utilization of each interface is relatively low, and there’s no apparent packet loss or other adverse behavior happening on the links connecting to our local service providers.</p> <p>However, Cisco SD-WAN devices also track the quality, in terms of latency, of each SD-WAN tunnel. The good news is that <a href="https://www.kentik.com/product/network-monitoring-system/" title="Learn more about Kentik NMS, the next-generation network monitoring system">Kentik Network Monitoring System (NMS)</a> collects this information as well, so we also have the metrics for SD-WAN tunnels.</p> <h3 id="step-7">Step 7</h3> <p>Let’s add latency grouped by remote IP and SD-WAN colors to our Journey:</p> <img src="//images.ctfassets.net/6yom6slo28h2/60KqMYQgZDoSNnhQpDzpzF/110c50981f6b0be8a6880208972aeb4c/step7.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Drill down into network traffic issues" /> <p>Take a look at that — latency on the Silver link toward Site 05 fluctuates over time from 10ms to over 300ms.
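</p> <p>To make the threshold mechanics concrete, here is a minimal sketch (our own simplified model, not Cisco’s actual algorithm) of how an edge device flips the forwarding path whenever a latency sample violates the SLA, producing exactly the kind of flapping we observed:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># A simplified model of threshold-based SD-WAN path selection (illustrative
# only): when a latency sample on the active tunnel violates the SLA, the
# edge device fails traffic over to the alternate tunnel.
LATENCY_SLA_MS = 150  # assumed SLA threshold for this application class

# Simulated per-interval latency samples (ms) on the Silver tunnel
silver_latency_ms = [12, 11, 310, 290, 14, 13, 280, 12]

for sample in silver_latency_ms:
    if sample > LATENCY_SLA_MS:
        active = "Gold"    # SLA violated on Silver: fail over
    else:
        active = "Silver"  # SLA met: traffic returns to Silver
    print(f"latency={sample}ms, forwarding over {active}")
</code></pre></div> <p>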
That frequent change in latency correlates directly with the path selection changes and, ultimately, the application performance issues.</p> <p>At this stage, we would examine the application routing policies on the SD-WAN controller and place a call to our service provider to ask what’s happening on their end that’s causing this latency fluctuation.</p> <h2 id="root-cause-analysis-using-natural-language">Root cause analysis using natural language</h2> <p>Being able to troubleshoot a network problem, especially one that’s intermittent, means analyzing data about devices, application flows, user behavior, service provider information, and so on. In other words, to get to the answer, it takes a lot of time and effort to mine through data, looking for clues, asking questions, and drawing conclusions.</p> <p>Using large language models, specifically natural language processing, relieves the engineer of clicking through multiple screens, mining through charts, and running show commands ad nauseam. Kentik’s heart is with the engineers working in the trenches to run networks day-to-day, so it has long been our goal to find ways to make network operations easier.</p> <p>Kentik Journeys is a huge step forward in democratizing data, making it simple for any engineer at any level to analyze network telemetry at scale and in real time.</p><![CDATA[Setting Sail with Kentik NMS: Unified Network Telemetry]]><![CDATA[Kentik NMS has launched and is setting sail in familiar waters. Monitoring with SNMP and streaming telemetry is only the first leg of the journey. In short order, we’ll unfurl additional options, increasing NMS’s velocity and maneuverability.]]>https://www.kentik.com/blog/setting-sail-with-kentik-nms-unified-network-telemetryhttps://www.kentik.com/blog/setting-sail-with-kentik-nms-unified-network-telemetry<![CDATA[Leon Adato]]>Wed, 07 Feb 2024 05:00:00 GMT<h2 id="a-view-from-the-prow-as-we-set-sail-on-the-rolling-seas-of-snmp">A view from the prow as we set sail on the rolling seas of SNMP</h2> <p>So you may have heard by now that Kentik has released a <a href="https://www.kentik.com/product/network-monitoring-system/">network monitoring system</a>, commonly known as an “NMS” in the smoky rooms of observability aficionados. This is more than just a cute little add-on to our robust flow-based monitoring capabilities. This is a stem-to-stern product that could stand alone if we wanted it to.</p> <p>But the question that arises in many people’s minds, like the first light peeking over a distant horizon, is: Why?</p> <p>Why, when the discipline of network monitoring is solidly into its third decade, would we think its shores are yet uncharted?</p> <p>More to the point, is Kentik trying to imply that – in this age of ubiquitous cloud, containers, microservices, and APIs – the regular old route-and-switch network even matters anymore?</p> <h2 id="the-time-is-right-for-modern-network-monitoring">The time is right for modern network monitoring</h2> <p>Given that we’re releasing Kentik NMS, the answer is obviously “yes.” But in this blog post, I need to get at why we’re doing it. Besides, you know, “customers keep asking for it” (although, admittedly, that is a pretty good reason).</p> <p>First, traditional monitoring still matters. Hardware — both in terms of availability and performance — still matters. On-premises systems still matter.
And “the network” — meaning anything from bare metal packet pushers in a closet all the way up to a Kubernetes cluster in the cloud — matters.</p> <p>All of those things I’ve just named, along with a myriad of other infrastructure elements, are still critical components for organizations large and small. Being able to collect telemetry and visualize it effectively, turning data into information that drives action, is a core capability for any — and every — business.</p> <p>Second, the people responsible for running and maintaining their networks keep telling us that the current set of solutions on the market has either failed to keep up or that the cost of keeping up is so high that the speed of adoption is unacceptably slow.</p> <div as="Promo"></div> <p>Before you interpret what I just said as an insult to existing vendors, let me be clear: I have a tremendous amount of respect for and even love the existing monitoring tools on the market. They do many things well, and in some cases, they were the first to do those things. They blazed trails, educated consumers, and established whole markets and sub-specialties within IT.</p> <p>But pivoting an entire product line is almost impossibly complicated. An established tool has existing customers who cannot be abandoned, which means keeping the current solutions more or less the same. Adding new capabilities is predicated on the ability of the tool to accommodate those new functions without breaking existing ones.</p> <p>For example, let’s look at collecting network metrics via API rather than a more traditional method like using <a href="https://www.kentik.com/blog/the-benefits-and-drawbacks-of-snmp-and-streaming-telemetry/">SNMP</a>. And please understand the irony of calling SNMP “traditional” versus APIs when network devices have included API options for the better part of a decade.</p> <p>While it’s been possible to collect data from hardware via API calls for quite some time, precious few network monitoring tools support this capability or do it particularly well. To be sure, the solutions that focus on application monitoring do it better, but even there, it’s in the context of the application rather than hardware.</p> <p>The reason for this isn’t that monitoring solutions vendors are lazy or uninspired.
It’s that the work of adding an API collector is hard; different vendors have implemented API interfaces in just different enough ways to create additional hurdles, and normalizing the API data with the other telemetry presents its own challenges.</p> <p>This difficulty stems from the fact that hardware didn’t support APIs when the tools were conceived and written.</p> <p>And if all of that is true for a 10-year-old technology like RESTful APIs, how much more so is it true for OpenTelemetry and its Cisco-specific cousin, streaming telemetry?</p> <h2 id="keep-your-network-even-keel">Keep your network on an even keel</h2> <img src="//images.ctfassets.net/6yom6slo28h2/5WBUvXlBaoXR6oTCw8zDY0/b2321babb7054603c411e224c6587362/nms-dashboard.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik Network Monitoring System dashboard" /> <div class="caption" style="margin-top: -35px;">Kentik NMS in action</div> <p>All of this is my way of saying that Kentik realized the world needed a new NMS because:<br> A) That data still matters, and<br> B) Creating an NMS from the ground up was actually easier than bolting additional capabilities onto an existing tool.</p> <p>This brings us to the point we find ourselves at today: <a href="https://www.kentik.com/blog/reinventing-network-monitoring-and-observability-with-kentik-ai/">Kentik NMS has launched</a> and is setting sail in familiar waters. Monitoring with SNMP and streaming telemetry is only the first leg of the journey. In short order, we’ll unfurl additional options, increasing NMS’s velocity and maneuverability.</p> <p>So, now that the ship has set sail, I hope you <a href="https://www.kentik.com/get-started/">come aboard</a> and have a look around.</p><![CDATA[Reinventing Network Monitoring and Observability with Kentik AI]]><![CDATA[We’re bringing AI to network observability. And adding new products to Kentik, including a modern NMS, to simplify troubleshooting complex networks. ]]>https://www.kentik.com/blog/reinventing-network-monitoring-and-observability-with-kentik-aihttps://www.kentik.com/blog/reinventing-network-monitoring-and-observability-with-kentik-ai<![CDATA[Christoph Pfister]]>Wed, 31 Jan 2024 05:00:00 GMT<p>Whether it’s limited visibility into hybrid cloud, challenges with diverse telemetry, failure to scale, or an inability to support a host of network attributes that have become table stakes, one thing is clear: Legacy network monitoring tools are not built for modern networks.</p> <p>That’s why we are expanding our network observability platform and launching Kentik NMS, the first AI-assisted modern network monitoring system. We’re bringing AI to network monitoring via natural language query for the network, as well as rolling out AI-assisted troubleshooting across the Kentik platform (aptly named Kentik AI).</p> <p>We built <a href="/solutions/kentik-ai/" title="Learn more about Kentik AI">Kentik AI</a> and <a href="/product/network-monitoring-system/" title="Learn more about Kentik&#x27;s next-generation network monitoring system">Kentik NMS</a> for everyone, enabling both technical and non-technical teams to answer any question about their network, improve performance, and reduce costs.
Our goal is to make everyone a superstar network engineer.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5yfhMnGqJXToT0RLhYzjXa/cec2633dee588d7164b35bd8b05becfe/journeys-cpu-usage.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik Network Monitoring System with AI" /> <h2 id="where-legacy-network-monitoring-tools-fall-short">Where legacy network monitoring tools fall short</h2> <p>The conventional view has been that NMS (or NPM as it’s sometimes called) is a solved problem. Yet, our conversations with customers painted a different picture. They expressed a need for something more approachable for stakeholders dependent on the network — which includes not only network teams, but also SREs, security, cloud engineers, application teams, and more.</p> <p>And we heard their frustrations: slow polling, siloed data, limited correlations and insights, lack of extensibility, manual upkeep, no support for new approaches to telemetry (e.g., streaming telemetry, OpenTelemetry), and for some, months or years between meaningful updates to their legacy tooling.</p> <div as="Testimonial" index="0" color="blue"></div> <h2 id="what-a-modern-nms-looks-like">What a modern NMS looks like</h2> <p>Kentik NMS is the first AI-assisted network monitoring system. It unifies flow, VPC Flow, eBPF, synthetic, SNMP, and other telemetry from Kentik’s network observability platform with real-time, custom, and streaming device metrics.</p> <p>When paired with a single view into the entire network — data centers, AWS, Azure, GCP, OCI, K8s, the internet, and beyond — Kentik becomes the network source of truth for any team.</p> <br /> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/8waesrgp2e" title="Kentik Query Assistant - AI-assisted Network Monitoring System" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <br /> <p>Kentik NMS enables enterprises and service providers to:</p> <ul> <li><strong>Decrease total cost of monitoring</strong>: Put a stop to spending on self-hosting, unload infrastructure costs (physical/virtual), and reduce engineering cycles required to maintain legacy platforms.</li> <li><strong>Improve network performance</strong>: Automatically identify inefficiencies, underutilized resources, and overprovisioning. Cut through troubleshooting guesswork, and resolve issues faster via AI-assisted investigations.</li> <li><strong>Democratize network insights</strong>: Enable any team to understand network metrics, interface stats, and the effects of network configurations — and make it easy with AI-assisted investigations. Now, everyone (not just network experts) can see, learn from, and act on insights.</li> </ul> <div as="Promo"></div> <h2 id="simplify-investigations-with-kentik-ai">Simplify investigations with Kentik AI</h2> <p>The networks that underpin our digital world need to be available and performant at all times. 
Yet with the advent of cloud, hybrid architectures, and microservices, these networks have become tremendously complex and difficult to troubleshoot.</p> <p><a href="/solutions/kentik-ai/" title="Learn more about Kentik AI">Kentik AI</a> is built to accelerate and simplify investigating these complex networks.</p> <p><strong>Kentik Query Assistant</strong> allows customers to use their preferred language to ask questions about their network. Traditionally, observability platforms like Kentik have provided query languages to give users a way to analyze the massive data sets we collect and store about a network. But these query languages, while powerful, have a learning curve. Kentik Query Assistant allows customers to harness the power of their data without having to understand all its idiosyncrasies or how to query it properly. This is a significant benefit for new and expert users alike.</p> <p><strong>Kentik Journeys</strong> takes this concept a step further by providing an AI-assisted conversational user experience purpose-built for the process of troubleshooting and network analysis.</p> <p>When troubleshooting, engineers typically ask a question, analyze the answer, then ask a new, hopefully better-informed question. Sometimes this process is simple. Often it is not. This process is repeated until the user comes up with the root cause and then applies a fix leading to resolution.</p> <p>Journeys provides a dedicated space to perform this iterative process, saving questions and answers as the user performs their troubleshooting journey.</p> <p>The real power of Journeys is its network context, meaning it understands the customer’s network: the devices involved, the VPCs and transit gateways running in the cloud, the apps consuming bandwidth, and so on.
Anyone can converse with Kentik Journeys about their specific network, within a user experience that is purpose-built for the journey of troubleshooting, making that journey much faster and more efficient.</p> <br /> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/7ndhnmcvzn" title="Kentik Journeys - The first network GenAI" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style="border: 3px solid #DAE0E7; max-width: 96%; position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <h2 id="see-kentik-nms-in-action">See Kentik NMS in action</h2> <p>There are a few ways you can get started with Kentik NMS.</p> <ol> <li>Sign up for a <a href="https://www.kentik.com/try-nms">Kentik trial</a>.</li> <li><a href="https://www.kentik.com/nms-demo">Request a demo</a>, and someone on our team will reach out to confirm a time.</li> <li>Log in to your <a href="https://portal.kentik.com/login">Kentik account</a> (if you are a customer).</li> </ol> <h2 id="interested-in-kentik-ai">Interested in Kentik AI?</h2> <p>By starting a <a href="https://www.kentik.com/try-nms">Kentik trial</a>, you will automatically have access to Kentik Query Assistant, and be able to ask questions about your network in any natural language.</p> <p>If you’d like to explore Kentik Journeys, our interactive AI troubleshooting, you can <a href="/go/offer/journeys-early-access/">request access to the preview here</a>.</p><![CDATA[The Cloudification of Telcos and the Evolution of the Telecom Ecosystem]]><![CDATA[In the rapidly evolving telecom sector, the concept of "cloudification" is not just a trend but a transformative shift, reshaping how services are delivered and managed. This change is underpinned by modern software architectures featuring modularity, microservices, and cloud-native designs. As we embrace this new era, marked by the rise of "netcos" and "servcos," we must also navigate the complexities it brings.]]>https://www.kentik.com/blog/the-cloudification-of-telcos-and-the-evolution-of-the-telecom-ecosystemhttps://www.kentik.com/blog/the-cloudification-of-telcos-and-the-evolution-of-the-telecom-ecosystem<![CDATA[Nina Bargisen]]>Wed, 24 Jan 2024 05:00:00 GMT<h2 id="introduction">Introduction</h2> <p>The telecom industry is currently undergoing a significant transformation, often referred to as “cloudification.” This change isn’t just a buzzword; it represents a fundamental shift in how telecom services are delivered and managed. Alongside this, there’s an intriguing development in the structure of the telecom ecosystem, particularly in the transition from traditional verticals to horizontal models, including the emergence of “netcos” (network companies) and “servcos” (service companies).</p> <div as="WistiaVideo" videoId="mo1m6maovk" audio></div> <p><strong>Modern software architecture: The backbone of cloudification</strong></p> <p>The cloudification of telecoms is deeply rooted in modern software architecture.
Key characteristics of this architecture include:</p> <ol> <li> <p><strong>Modularity</strong>: Breaking down software into smaller, interchangeable modules.</p> </li> <li> <p><strong>Scalability</strong>: Efficiently managing increasing workloads and user numbers.</p> </li> <li> <p><strong>Microservices architecture</strong>: Using independently deployable, loosely coupled services.</p> </li> <li> <p><strong>Cloud-native design</strong>: Exploiting cloud platforms’ advantages like distributed computing.</p> </li> <li> <p><strong>Continuous integration/deployment (CI/CD)</strong>: Streamlining development and deployment.</p> </li> <li> <p><strong>API-first design</strong>: Prioritizing API development for better integration.</p> </li> <li> <p><strong>Security and compliance</strong>: Incorporating security measures from the ground up.</p> </li> <li> <p><strong>Resilience and fault tolerance</strong>: Maintaining functionality amidst failures.</p> </li> <li> <p><strong>User-centric design</strong>: Focusing on intuitive and responsive user experience.</p> </li> <li> <p><strong>Agility and flexibility</strong>: Adapting rapidly to changing business and technology needs.</p> </li> </ol> <p>This shift to a modern approach for the software systems that make up or support network functionality creates the foundation for the change we see now.</p> <h2 id="the-shift-of-location-of-the-network-functionality">The shift of location of the network functionality</h2> <p>Cloudification in telecom also means moving some network functions from specialized hardware to generic server hardware. This transition enables software and functionalities to be deployed wherever it makes sense. However, it’s important to note that not all functions suit this shift. For example, big iron routers in global transit networks still consist of specialized hardware optimized for this particular function.</p> <p>Functions that are being cloudified include mobile core systems, customer management, network management, SDN controllers, firewalls, analytics, monitoring, and, crucially, APIs. Conversely, functions like switching, forwarding, routing, and physical network components still rely on dedicated, on-prem hardware and probably will for some time to come. But here, too, the systems are modularized, programmable, API-based, and scalable using modern software architecture.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3GhH1DFtYB50jrIOQDrJ0M/f099d71ee354dac08b0acb8238a58510/telco-cloudification-1.png" style="max-width: 550px;" class="image center no-shadow" alt="Telco cloudification diagram" /> <p>The telco companies have different approaches – some build close partnerships with the hyperscalers and can use the same technologies – both in the public clouds and in private instances dedicated to the telcos, while others are building their own private clouds from the ground up.</p> <p>Does this result in new needs? Yes. Just as traditional networks are supported by several tools and systems, for example, a network monitoring system, the visibility the operators are used to is still needed even though the software is now running in a public cloud.
We need to know about the health of the instances and how traffic flows in a system that now consists of dedicated and specialized hardware on-prem (or out in the field) and systems running in various versions on public and private cloud infrastructure.</p> <h2 id="the-telco-ecosystem-netcos-and-servcos">The telco ecosystem: Netcos and servcos</h2> <p>An interesting aspect of this transformation is restructuring the telecom and internet industry from verticals to horizontals. This change manifests in the division of companies into servcos and netcos. Servcos focus on developing and selling services and owning end-user customers, while netcos own and operate the network infrastructure. This decoupling allows, in the ideal scenario, services developed by servcos to run on any network facilitated by netcos.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5URuKFsO4VVUZU73ncp83e/050a87f7442d8621149d89bbf20281ab/telco-cloudification-2.png" style="max-width: 470px;" class="image center no-shadow" alt="" /> <p>This division mirrors the modularity and microservices approach in software design, where separate components (or services) are developed independently but work together. It allows for specialization in each domain, leading to more efficient and scalable operations. The servcos can be fast and agile since they no longer need to worry about operating legacy networks. With netcos moving to a pure wholesale model with only servcos as customers, they can simplify and optimize their operation. In some instances, the split goes beyond just a netco, dividing further into a netco, fiberco, and towerco, as we see in the UK and Italy. Another aspect of the netco operating model is to attract long-term investors to finance the physical infrastructure.</p> <p>In the ultimate scenario, a servco competes on service innovations and provides its products running on several netcos, with the underlying networks unknown to, and of no concern to, the end customers. Netcos, on the other hand, compete on efficiency and geographic scope and might even be subject to price regulation in some markets.</p> <p>The model isn’t entirely new. For example, BT’s Openreach was established in 2006 following a strategic review by Ofcom in 2005, which identified the need for a new organization to manage the UK’s phone and broadband network fairly, previously controlled by BT. This was to ensure equitable access for all communication providers. In 2017, as part of the Digital Communications Review by Ofcom, Openreach became a legally separate company, operating as a wholly owned subsidiary of BT plc. This separation aimed to improve the digital communications market for consumers and businesses.</p> <p>We have even seen the introduction of a new layer, the API provider, that helps create the links between the customers, the services, and the network operator.</p> <p>Reach is such a company. Reach is a global SaaS platform where customers can create wireless products. WideOpenWest (WOW) is using Reach to create a mobile offering for their internet access and cable customers. Reach acts as a middle layer between the servco WOW and the netco T-Mobile.</p> <p>The National Content and Technology Cooperative, which sets up tech deals for its members – 700+ cable and telco operators in the US – has entered an agreement with Reach so the members can use Reach to create mobile offerings for their end customers.</p> <h2 id="its-a-new-paradise-or-is-it">It’s a new paradise!
Or is it?</h2> <p>This all sounds fantastic, but what are the downsides to this model?</p> <p>Competition drives innovation up and prices down, and it is easy to imagine this happening in the servco layer. But it is just as easy to imagine consolidation and monopolies happening in the netco layer. So, what are the potential downsides and pitfalls ahead?</p> <p>As mentioned, monopolization, or at least limited competition, between netcos creates a risk of high prices, long delivery times, and limited innovation in this layer.</p> <p>Another risk is the increased complexity from end-user to production. With a number of different organizations and customer/provider relationships in the production chain, there will be an increased risk that when something goes wrong, it’s challenging to find out where and get it fixed quickly.</p> <p>Other risks include vendor lock-in and a lack of interoperability.</p> <h2 id="conclusion">Conclusion</h2> <p>In conclusion, the telecom industry is at a pivotal point, embracing cloudification and evolving into a more dynamic and flexible ecosystem. This transformation is not just a technological shift but a complete overhaul of traditional business models, leading to the emergence of “netcos” and “servcos.” Adopting modern software architecture principles, such as modularity and microservices, is central to this evolution, allowing telecom companies to be more agile, efficient, and customer-centric. This trend will likely accelerate as we look to the future, driven by continuous technological advancements and changing consumer demands. It’s an exciting time for the industry, with significant opportunities for innovation and growth, not just for the companies involved but also for the consumers who stand to benefit from more diverse, efficient, and tailored telecom services.</p> <p>The telco industry’s efforts to foster more agility with a modern approach bring inevitable complexity and distribution to its network and systems.
The distributed nature of modern telcos and service providers leans towards monitoring and observability systems that can see it all – from an on-prem data center to the cloud to the internet and everything in between.</p> <p><strong>Read more:</strong></p> <p><a href="https://www.linkedin.com/pulse/telecoms-delayering-coming-future-servco-netco-marco-pizzo/">Telecoms delayering is coming, the future is ServCo and NetCo</a></p> <p><a href="https://www.capgemini.com/news/press-releases/telcos-expected-to-invest-1-billion-in-cloud-transformation-to-enable-a-future-connected-world/">Telcos expected to invest $1 billion on average in network cloud transformation to enable a future connected world | Press Release | Capgemini</a></p> <p><a href="https://www.lightreading.com/cloud/telco-cloudification-poses-a-huge-systemic-risk">Telco cloudification poses a huge systemic risk?</a></p> <p><a href="https://www.lightreading.com/5g/nctc-reach-strike-mobile-deal-for-700-us-cable-operators">NCTC, Reach strike mobile deal for 700+ US cable operators</a></p><![CDATA[Anatomy of an OTT Traffic Surge: Peacock Delivers First Exclusively Streamed NFL Playoff Game]]><![CDATA[NFL playoffs are here, and Doug Madory tells us how Saturday's first-ever exclusively live-streamed NFL playoff game was delivered without making any references to pop superstar Taylor Swift or her sizzling romance with nine-time Pro Bowler Travis Kelce.]]>https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-peacock-streamed-nfl-playoff-gamehttps://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-peacock-streamed-nfl-playoff-game<![CDATA[Doug Madory]]>Wed, 17 Jan 2024 16:00:00 GMT<p>On Saturday night, NBC’s <a href="https://www.peacocktv.com/collections/nbc">Peacock</a> streaming service aired the first-ever exclusively live-streamed NFL playoff game. Football fans needed an account with the OTT service to watch the Kansas City Chiefs defeat the Miami Dolphins in arctic conditions. The result was the most-streamed NFL game ever and an OTT traffic surge. In this post, we’ll take a look at how that traffic was delivered during the game.</p> <h2 id="ott-service-tracking">OTT Service Tracking</h2> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/" title="Learn more about the Kentik Service Provider Analytics product">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how <a href="https://www.kentik.com/kentipedia/ott-services/" title="Kentipedia: OTT Services (Over-the-Top Services)">OTT services</a> are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/" title="Gaming as on OTT service: Virgin Media reveals that Call Of Duty: Warzone has the “biggest impact” on its network">Call of Duty update</a> or a <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday/">Microsoft Patch Tuesday</a>, these OTT traffic events can put a lot of load on a network and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. 
Knowing the source and destination IPs of the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p><a href="https://www.kentik.com/resources/kentik-true-origin/" title="Learn more about Kentik True Origin">Kentik True Origin</a> is the engine that powers the OTT Service Tracking workflow. True Origin detects and analyzes the DNA of over 540 categorized OTT services and providers and more than 50 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p> <h2 id="peacock-exclusive-wild-card">Peacock exclusive wild card</h2> <p>Peacock <a href="https://www.wsj.com/articles/peacock-to-carry-one-nfl-playoff-game-exclusively-next-season-fb339027">reportedly paid $110 million</a> for the exclusive rights to air the Kansas City Chiefs’ win over the Miami Dolphins in frigid temperatures on Saturday night. It’s an expensive ploy by the OTT service to bring in new subscribers without <a href="https://www.foxnews.com/sports/peacocks-exclusive-nfl-stream-chiefs-dolphins-playoff-game-irks-fans">alienating NFL fans</a> used to watching the NFL over broadcast television.</p> <p>The screenshot below from Kentik’s Data Explorer shows Peacock traffic experiencing a meteoric rise as millions of football fans get on the OTT service to watch the game. On Monday, NBC asserted that the game in Kansas City was the most streamed NFL game in history. By our stats, it is likely also the most streamed event in Peacock’s history.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6No3m0RVtc1PrvdAUB20Lk/e4595982cf73b466d41f1ca16a81ce3a/Peacock_Streaming_Traffic_January_2024b.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Peacock Traffic Source CDN" /> <div class="caption" style="margin-top: -30px;">Peacock OTT traffic analyzed with Kentik</div> <p>Based on our customer OTT data, the game was delivered via a variety of content providers, including Akamai (30.8%), Fastly (25.7%), Edgecast Verizon (18.7%), Amazon/AWS (11.9%), and Edgio and Lumen each delivering around 6%.</p> <p>The graphic below shows how Peacock was delivered during this one-week period. By breaking down the traffic by Source Connectivity Type (below), we can see that the wild card games were delivered by a variety of sources, including private peering, IXP, embedded cache, and transit. For this week in January, the content viewers were consuming overwhelmingly came via private peering (63.1%), but also via transit (28.1%) and IXP (7.6%).
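</p> <p>Stepping back to the mechanics for a moment, the reason DNS data is so central to this attribution is easy to illustrate. Here is a conceptual sketch with entirely illustrative values, not True Origin’s actual implementation: a DNS answer ties the CDN IP a subscriber connects to back to the OTT hostname they asked for, so flow bytes can be attributed to a service rather than just an IP address.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Conceptual sketch only -- not True Origin's actual implementation.
# A DNS answer maps (subscriber, resolved CDN IP) to the OTT hostname
# queried, letting flow bytes be attributed to a service, not just an IP.
dns_answers = {
    # (subscriber_ip, resolved_ip) mapped to the hostname queried
    ("10.0.0.5", "203.0.113.10"): "peacocktv.com",
}

flow_records = [
    # (subscriber_ip, server_ip, bytes)
    ("10.0.0.5", "203.0.113.10", 5_000_000),
    ("10.0.0.5", "198.51.100.7", 20_000),  # no DNS match: unattributed
]

for subscriber, server, nbytes in flow_records:
    service = dns_answers.get((subscriber, server), "unknown service")
    print(f"{nbytes} bytes {subscriber} to {server}: {service}")
</code></pre></div> <p>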
For Peacock, embedded caching (1%) barely registered.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4PLKJPk6iRZ1LAcgeJrmcD/9731d037ec7fd7c2ae7949dce01186c6/Peacock_Streaming_Traffic_January_2024c.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Peacock Traffic Source Type" /> <div class="caption" style="margin-top: -30px;">Peacock OTT traffic analysis by source</div> <p>It is normal for CDNs with a last-mile cache embedding program to heavily favor this mode of delivery over other connectivity types, as it allows:</p> <ol> <li>The ISP to save transit costs</li> <li>The subscribers to get demonstrably better last-mile performance</li> </ol> <p>In addition to source CDN and connectivity type, users of Kentik’s OTT Service Tracking are also able to break down traffic volumes by subscribers, specific router interfaces, and customer locations.</p> <h2 id="how-does-ott-service-tracking-help">How does OTT Service Tracking help?</h2> <p>Previously, my colleague Greg Villain <a href="https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update/" title="Learn more about recent OTT service tracking enhancements">described enhancements</a> to our OTT Service Tracking workflow, which allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like the release of a blockbuster movie on streaming can have impacts in all three areas. OTT Service Tracking is the key to understanding and responding when they occur. Learn more about the application of Kentik for <a href="https://www.kentik.com/product/subscriber-intelligence/" title="Subscriber Intelligence Use Cases for Kentik">subscriber intelligence</a>.</p> <p>Ready to improve <a href="https://www.kentik.com/kentipedia/ott-services/" title="Kentipedia: OTT Services (Over-the-Top Services)">over-the-top service</a> tracking for your own networks? <a href="#demo_dialog" title="Request your Kentik demo">Get a personalized demo</a>.</p><![CDATA[The Benefits and Drawbacks of SNMP and Streaming Telemetry]]><![CDATA[Is SNMP on life support, or is it as relevant today as ever? The answer is more complicated than a simple yes or no. SNMP is reliable, customizable, and very widely supported. However, SNMP has some serious limitations, especially for modern network monitoring -- limitations that streaming telemetry solves. In this post, learn about the advantages and drawbacks of SNMP and streaming telemetry and why they should both be a part of a network visibility strategy. ]]>https://www.kentik.com/blog/the-benefits-and-drawbacks-of-snmp-and-streaming-telemetryhttps://www.kentik.com/blog/the-benefits-and-drawbacks-of-snmp-and-streaming-telemetry<![CDATA[Phil Gervasi]]>Wed, 10 Jan 2024 05:00:00 GMT<p>Today, people consume most applications over a network. The system of routers, switches, firewalls, load balancers, wireless access points, network services, and so on all work together as a complex application delivery system. Each device type provides valuable insight into the health of the system and why an application is performing the way it is.</p> <div as="WistiaVideo" videoId="47qtvt2dvg" audio></div> <p>To ensure reliable and high-quality application performance, we must understand how the network is functioning, and to do that, we gather data.
Since 1988, that usually meant SNMP; however, some vendors are beginning to support various forms of streaming telemetry, which solves several glaring problems with SNMP.</p> <p>Naturally, some in the networking industry have claimed that SNMP is no longer relevant in light of streaming telemetry, and there’s an argument to be made there. Streaming telemetry does collect more accurate and complete data, so shouldn’t streaming telemetry be the default for everyone?</p> <p>But before we throw the baby out with the bath water, it’s important to understand that streaming telemetry has not replaced SNMP. Not yet, at least. Perhaps it will one day, but today and in the foreseeable future, a solid network visibility strategy must be able to ingest and analyze SNMP and streaming telemetry together.</p> <div as="Promo"></div> <h2 id="snmp-veteran-of-network-monitoring">SNMP: veteran of network monitoring</h2> <p>SNMP has been a cornerstone in <a href="https://www.kentik.com/kentipedia/network-monitoring/" title="Kentipedia: Network Monitoring: Ensuring Optimal Network Health and Performance">network monitoring</a> for decades, and for good reason. Despite its shortcomings, it’s a powerful network management and visibility mechanism.</p> <h3 id="benefits-of-snmp">Benefits of SNMP</h3> <p>SNMP has widespread compatibility. One of SNMP’s greatest strengths is its universal support across network devices (as well as many non-network devices), making it a reliable standard for basic monitoring tasks. Most network devices support it, regardless of the device type or vendor.</p> <p>For example, most brand-new $300,000 data center chassis switches support SNMP, and so does that 20-year-old router still in production for out-of-band management.</p> <p>Also, SNMP is relatively simple and cost-effective. It’s easy to set up and manage, which is a massive bonus for networks that don’t require deep granularity in monitoring and don’t have the staff to set up and manage more advanced forms of telemetry. It doesn’t require extensive training or specialized knowledge, making it accessible to a broad range of IT professionals.</p> <p>And though age is often looked down upon in tech, SNMP, one of the oldest protocols used in network management, is mature and proven reliable over time. This long history means that it has been thoroughly tested, resulting in a stable and dependable tool in network operations.</p> <p>From a cost perspective, many SNMP tools are available for free or at a low cost. This makes SNMP a cost-effective solution for network monitoring, especially for smaller organizations or those with limited budgets.</p> <p>It’s also important to remember that SNMP makes it possible to poll a device for a data point, to poll metadata, and, via SNMP traps, to receive event notifications. While this may sound like a basic feature, it’s critical for proactive network management and for responding quickly to issues.</p> <h3 id="drawbacks-of-snmp">Drawbacks of SNMP</h3> <p>There are drawbacks, however.</p> <p>Though SNMP makes it possible to poll devices and get a data point, and traps do offer event notification, alerts, notifications, and thresholding all occur in the network monitoring system, not within SNMP.</p> <p>Also, SNMP struggles to provide the more in-depth data that modern networks often need. For example, it doesn’t easily provide sub-minute information, at least without overloading a typical device.
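</p> <p>To make the mechanics concrete, here is a sketch of a single poll using the open-source pysnmp library, with placeholder target details. A monitoring system repeats a query like this for every metric, on every interface, at every polling cycle:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># A sketch of one SNMP GET with the open-source pysnmp library, fetching
# the inbound octet counter for interface index 1. The target address and
# community string are placeholders.
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

error_indication, error_status, error_index, var_binds = next(
    getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),      # SNMPv2c community
        UdpTransportTarget(("192.0.2.1", 161)),  # placeholder device
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifHCInOctets", 1)),
    )
)

if error_indication:
    print(error_indication)
else:
    for var_bind in var_binds:
        print(var_bind.prettyPrint())  # e.g., IF-MIB::ifHCInOctets.1 = ...
</code></pre></div> <p>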
Yet sub-minute monitoring is required for some modern network operations activities.</p> <p>That means SNMP can produce a misleading output. For example, when polling a device every five minutes, the resultant graph will show the information taken at that moment every five minutes, an eternity in the networking world. If there are significant changes in the metrics between those five-minute intervals, such as spikes or drops in CPU utilization, traffic, device memory, interface errors, and so on, SNMP wouldn’t report it.</p> <p>Notice the image below, from a presentation at <a href="https://youtu.be/McNm_WfQTHw?si=gPMcwQ5ubD3BD1jI">NANOG 73</a>, which shows spikes in bandwidth usage higher than the interface is capable of, a clearly incorrect result.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1ZooRB3l73DFDlk1b1Mzuy/b5ac52e9cb82dec3b10353c52ffb3b64/streaming-telemetry-vs-smnp.png" style="max-width: 500px;" class="image center" alt="Graph showing bandwidth spikes" /> <div class="caption" style="margin-top: -35px;">Image from NANOG 73 presentation by Carl Lebsack and Rob Shaki</div> <p>Ultimately, infrequent polling leads to averaging, which leads to not seeing spikes. SNMP is also not timestamped at the source, and the time it takes for SNMP to be sent, received, and recorded is variable. So when we poll every five minutes, we actually get back results at slightly different times. This means we’re making an educated guess as to what the truth is. This uncertainty introduces additional artifacts in the data, which isn’t the case with streaming telemetry, which timestamps at the source.</p> <p>Notice in the graphs below, also from NANOG 73, that incorrect timestamps create false spikes. The graph on the left shows how SNMP can produce incorrect results, in this case spikes in traffic that never occurred. On the right, we can see the result of streaming telemetry, which produces smoother data.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7gAmnGqKrHHoKZDpldpFGI/c0f885b549d53c7147006ab22cf94fc4/streaming-telemetry-vs-smnp-graphs.png" style="max-width: 650px;" class="image center" alt="Graph showing differences in SNMP and streaming telemetry" /> <div class="caption" style="margin-top: -35px;">Image from NANOG 73 presentation by Carl Lebsack and Rob Shaki</div> <p>Next, SNMP simply makes the current status and metrics available. Consider that every engineer in the world looks at charts. This occurs because, even though SNMP isn’t built for it, engineers are all doing the same thing: asking a question repeatedly at a consistent interval so we can plot the answers on a chart. This is at the heart of why SNMP is inefficient and how streaming telemetry makes data collection more efficient. SNMP is made for the NMS to ask the router a question and for the router to answer.</p> <p>Conversely, streaming telemetry is made for the NMS to ask the router a question and subscribe to the answer forever. Then, the question can be answered without repeatedly asking. Additionally, the router can schedule the preparation and sending of that data in a way that is efficient for itself.</p> <p>Also, let’s face it, SNMP doesn’t scale well. And by scale, it isn’t necessarily the number of devices we’re polling but the number of interfaces multiplied by the number of metrics, collected at every polling interval.</p> <p>For example, imagine polling a single switch with multiple linecards full of interfaces and hundreds of sub-interfaces, each with 20 metrics, once per one-minute interval (or even faster).
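</p> <p>The arithmetic behind that scenario is easy to sketch out with illustrative numbers:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Back-of-the-envelope polling load for the switch described above:
# hundreds of sub-interfaces, each with 20 metrics, polled every minute.
# Counts are illustrative.
interfaces = 500            # physical ports plus sub-interfaces
metrics_per_interface = 20
interval_seconds = 60       # one-minute polling

oids_per_cycle = interfaces * metrics_per_interface
print(f"{oids_per_cycle} OID fetches per polling cycle")
print(f"about {oids_per_cycle / interval_seconds:.0f} requests per second, continuously")
</code></pre></div> <p>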
Polling at that scale can overwhelm the switch and create an incredible amount of noise on the network, as the NMS is, in essence, asking for the same information over and over, usually getting the same answer from the switch every time.</p> <p>As network complexity and size increase, SNMP’s effectiveness tends to decrease in inverse proportion. This may not be an issue in small and medium-sized networks, but it can be a significant stumbling block for larger organizations.</p> <h2 id="streaming-telemetry-provides-a-modern-approach">Streaming telemetry provides a modern approach</h2> <p>Streaming telemetry is often touted as the next step in network monitoring evolution. It differs in how it works and not necessarily in the information it provides. It’s not that streaming telemetry is more customizable or provides more robust information than SNMP. Instead, streaming telemetry solves some of the problems we face when using SNMP.</p> <h3 id="benefits-of-streaming-telemetry">Benefits of streaming telemetry</h3> <p>The various forms of streaming telemetry, such as gNMI, structured data approaches (like YANG), and other proprietary methods, offer near-real-time data from network devices. Rather than waiting minutes for the subsequent polling to occur, network administrators get information about their devices in (almost) real time.</p> <p>Because data is pushed in real time from the devices and not pulled by an NMS, streaming telemetry can provide much higher-resolution data compared to SNMP. That is an important advantage, especially in complex and high-performance networks that need visibility at a sub-minute or better timeframe.</p> <p>A push-based model is generally more efficient than SNMP. Streaming telemetry processing often happens in hardware at the ASIC itself instead of the CPU. Streaming telemetry can, therefore, scale in more extensive networks without affecting the devices’ performance.</p> <h3 id="drawbacks-of-streaming-telemetry">Drawbacks of streaming telemetry</h3> <p>Arguably, the biggest drawback of streaming telemetry isn’t the technology itself but the fact that it isn’t widely supported yet. Many vendors don’t support any forms of streaming telemetry, or if they do, it’s often on only a select few platforms.</p> <p>Some network vendors use a proprietary form of streaming telemetry, requiring the vendor’s own monitoring system to ingest and analyze it. Vendor-specific constraints are a major stumbling block to widespread adoption.</p> <p>Next, there’s a learning curve that must be overcome when implementing streaming telemetry. Setting it up on devices and monitoring systems can be complex and may require specialized skills. For example, traditional network engineers may need to learn how to use APIs to query various network devices for the first time.</p> <p>Lastly, if not configured properly, streaming telemetry can send so much information that it negatively impacts bandwidth utilization and the storage capacity of a network monitoring system. Proper tuning is essential for any visibility technology, but it’s especially crucial with streaming telemetry.</p> <h2 id="balancing-todays-needs-with-tomorrows-tech">Balancing today’s needs with tomorrow’s tech</h2> <p>To throw SNMP out entirely likely means replacing many of the devices in a network.
In time, more and more devices will support streaming telemetry, but that just isn’t the case as of today.</p> <p>It’s fair to say that streaming telemetry will slowly gain in usage as vendors choose to support it on their platforms, but until then, SNMP is everywhere, reliable, and well-understood.</p> <p>Especially for smaller, simpler networks that don’t need incredibly granular data resolution, SNMP may work just fine for years to come. In the real world of network operations, there’s no one-size-fits-all solution. SNMP and streaming telemetry offer different benefits and cater to different needs, and in this case, the answer isn’t to throw out one protocol in favor of another.</p> <p>The answer is a visibility strategy that utilizes both SNMP and streaming telemetry to collect information from devices already deployed and those that will be in the future.</p><![CDATA[Kentik Pets: Making Fully Remote Work a Paw-some Experience]]><![CDATA[In this post, we share the joys and benefits of having pets while working from home. We discuss the scientific benefits of having pets around, including improved mental health and job performance and offer practical tips for working with pets. From providing moral support during meetings to adding fun to our workdays, our furry companions play a vital role in our daily routines at Kentik.]]>https://www.kentik.com/blog/kentik-pets-making-fully-remote-work-a-paw-some-experiencehttps://www.kentik.com/blog/kentik-pets-making-fully-remote-work-a-paw-some-experience<![CDATA[Kentik People Operations Team]]>Mon, 08 Jan 2024 04:00:00 GMT<p>Many of us working from home have furry companions who are always by our side. Whether they were our buddies before remote work became the norm or they joined our lives during the pandemic for a little extra company, pets have become an integral part of our home office setups.</p> <p>At Kentik, our fully remote team includes some very special four-legged colleagues who are experts at making our workdays more enjoyable. From napping on keyboards to offering moral support during virtual meetings, our pets add a dose of cheer and a sprinkle of fun to our daily grind.</p> <img src="//images.ctfassets.net/6yom6slo28h2/Hc6LbJPUcxdl7kgilnyfp/9ad7a3552859a05b5a3bac5b8d1aa1e2/kentik-pets-fezzik.jpg" style="max-width: 500px;" class="image center" alt="Kentik Pets: Fezzik" /> <h2 id="why-pets-make-remote-work-better">Why pets make remote work better</h2> <p>It’s not just a warm fuzzy feeling — there’s science behind it! Studies show that having pets around while working from home can be incredibly beneficial. For example, a 2022 study called <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9264855/">The Role of Dogs in the Relationship between Telework and Performance</a> revealed that pets can help lower blood pressure, ward off heart disease, and boost overall mental health. Essentially, pets help keep us happier and healthier, which can translate into improved job performance.</p> <p>In workplaces where pets are welcome, employees tend to be more open, enjoy greater flexibility, and have a better overall work experience. 
<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8834589/">Further research</a> has found that employees who frequently brought their pets to work reported higher job satisfaction, better work-life balance, and a more positive work environment compared to those who left their pets at home.</p> <h2 id="tips-for-working-efficiently-with-pets-around">Tips for working efficiently with pets around</h2> <p>While the benefits of having pets nearby are clear, it’s also true that they can be a bit distracting. But with a bit of planning, you can enjoy the best of both worlds. Here are some tips for juggling work and pet care:</p> <p><strong>Schedule pet breaks</strong>: Just like you schedule work tasks, schedule breaks to spend time with your pet. This not only gives them some attention but also lets you step away from your desk, <a href="https://hbr.org/2021/09/bring-the-outdoors-into-your-hybrid-work-routine">get some fresh air</a>, and recharge.</p> <p><strong>Set up a pet-friendly workspace</strong>: Keep a stash of toys or engaging activities nearby to keep your pet entertained while you focus on work. A busy pet is a happy pet—and a less distracting one!</p> <p><strong>Create a routine</strong>: Pets thrive on routine, so try to stick to a consistent schedule for feeding, walks, and playtime. This helps your pet know what to expect and can reduce disruptions during work hours.</p> <p><strong>Establish boundaries</strong>: If you need uninterrupted focus time, create a designated pet-free zone where your furry friend knows they’re not allowed during work hours. A comfy bed or play area just outside your workspace can do the trick.</p> <p>By integrating these strategies, you can keep your workday productive while enjoying the added joy of having your pet close by. It’s all about finding that perfect balance between work and play!</p><![CDATA[Digging into the Orange España Hack]]><![CDATA[Orange España, Spain's second largest mobile operator, suffered a major outage on January 3, 2024. The outage was unprecedented due to the use of RPKI, a mechanism designed to protect internet routing security, as a tool for denial of service. In this post, we dig into the outage and the unique manipulation of RPKI.]]>https://www.kentik.com/blog/digging-into-the-orange-espana-hackhttps://www.kentik.com/blog/digging-into-the-orange-espana-hack<![CDATA[Doug Madory]]>Thu, 04 Jan 2024 05:00:00 GMT<p>On January 3, 2024, Spain’s second largest mobile operator, Orange España, experienced a <a href="https://elpais.com/tecnologia/2024-01-03/orange-sufre-una-caida-del-servicio-de-internet-en-toda-espana-por-culpa-de-un-acceso-indebido.html">national outage</a> spanning multiple hours. The cause? A compromised password and an increasingly robust routing system. 
Turns out that the network operator’s favorite defense tool (RPKI) can be a double-edged sword.</p> <p>Using a password <a href="https://twitter.com/Ms_Snow_OwO/status/1742666456058470739">found in a public leak</a> of stolen credentials (the password was “ripeadmin”), a hacker was able to log into Orange España’s RIPE NCC portal. <em>Oops!</em> Once in, this individual began altering Orange España’s RPKI configuration, rendering many of its BGP routes RPKI-invalid.</p> <p>As demonstrated in <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">our earlier analysis</a>, the internet’s RPKI ROV deployment has reached the point where the propagation of a route is cut in half <em>or more</em> when evaluated as RPKI-invalid. Normally this is desired behavior, but when an RPKI config is intentionally loaded with misconfigured data, it can render address space unreachable, effectively becoming a tool for denial of service.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7orCUKNlz2HxMeJgxfb4SP/87b345c803dd65eb907c6b210e16a304/orange-espana-outage-aggregate-data.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Diagram showing an overview of the Orange Espana outage" /> <p>Using Kentik’s aggregate NetFlow, we observed the outage (illustrated above) as a large drop in the volume of inbound traffic to Orange España (AS12479) between 14:20 UTC (3:20pm local) and 18:00 UTC (7pm local). However, there were more developments prior to this window of time as well as some lingering effects, which we will dig into in the post below.</p> <h2 id="what-happened">What happened?</h2> <p>We already know the outage took place and how the attacker pulled it off. Now let’s trace the sequence of events using archived RPKI data from <a href="https://ripe86.ripe.net/archives/video/1030/">RPKIviews</a>.</p> <p>The story begins at 09:28 UTC on January 3, when someone (presumably the attacker) began tinkering with publishing and revoking ROAs for IP ranges belonging to the Spanish mobile operator. Then, at 09:42 UTC they published three new ROAs for Orange España IP ranges with material impact.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Origin   prefix          maxLength  ta    expiration
AS12479  93.117.88.0/22  22         ripe  1704355258
AS12479  93.117.88.0/21  21         ripe  1704355258
AS12479  149.74.0.0/16   16         ripe  1704355258</code></pre></div> <div class="caption" style="text-align: left;"><a href="https://console.rpki-client.org/rpki.ripe.net/repository/DEFAULT/a7/1a830a-f061-4cdc-bafb-a2fe9f015d71/1/DZSNRxWKRySgDA0vp-t_yWLOM8s.roa.html">Source</a></div> <p>Given that 93.117.88.0/22, 93.117.88.0/21, and 149.74.0.0/16 were all already originated by AS12479, those routes weren’t affected, but 149.74.0.0/16 had quite a few more-specifics that were now going to be evaluated as RPKI-invalid due to the max prefix length setting of 16.</p> <p>Perhaps realizing this, minutes later, that someone published a slew of additional ROAs to account for the more-specifics of 149.74.0.0/16. These had the proper origin (AS12479), and as a result, all of those more-specifics became valid.
All but one, that is.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Origin   prefix           maxLength  ta    expiration
AS12479  149.74.100.0/23  23         ripe  1704355258
AS12479  149.74.102.0/23  23         ripe  1704355258
AS12479  149.74.104.0/23  23         ripe  1704355258
AS12479  149.74.106.0/23  23         ripe  1704355258
AS12479  149.74.108.0/23  23         ripe  1704355258
(and many more)</code></pre></div> <p style="margin: 35px;"></p> <p>Using Kentik’s BGP visualization, we can compare the impact in reachability (aka propagation) for two adjacent more-specifics of 149.74.0.0/16. Shown below, 149.74.172.0/22 was the route missed in that follow-up publication of ROAs. Its reachability dropped for over four hours to as little as 20% of our BGP sources.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1ZxVCBqRJpVkOcOgJ7VVgZ/53937882a503d0351fdecfee92f2bac0/orange-espana-outage-impact-adjacents.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Reachability during the Orange Espana outage" /> <p>Conversely, the rest of the more-specifics looked like 149.74.168.0/22 below: a brief partial drop in reachability between the first and second publications of ROAs mentioned above.</p> <img src="//images.ctfassets.net/6yom6slo28h2/joWVNijxiGAPYItYMOj6Q/613a3b2aedaf019e5660b67e65458973/orange-espana-outage-brief-partial-drop.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Orange Espana outage showing a partial drop in reachability" /> <p>Although these prefixes were RPKI-invalid for several minutes, they only experienced a partial drop in reachability due to delays in the time to globally propagate ROAs, as documented in <a href="https://dl.acm.org/doi/10.1007/978-3-031-28486-1_18">recent research on the topic</a>. The act of blotting out a newly RPKI-invalid route is not instantaneous.</p> <div as="Promo"></div> <h2 id="wielding-rpki-as-a-weapon">Wielding RPKI as a weapon</h2> <p>Then the attacker took it a step further by creating ROAs with an origin other than Orange España’s.
At about the same time those additional ROAs were published covering the more-specifics of 149.74.0.0/16, four new ROAs were created for Orange España IP space with a deliberately incorrect origin of AS49581.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Origin   prefix          maxLength  ta    expiration
AS49581  149.74.0.0/16   16         ripe  1704355258
AS49581  1.178.232.0/21  21         ripe  1704355258
AS49581  145.1.240.0/20  20         ripe  1704355258
AS49581  62.36.0.0/16    16         ripe  1704355258</code></pre></div> <p style="margin: 35px;"></p> <p>The addition of the bogus ROA for 149.74.0.0/16 had no effect because the attacker had previously created a ROA with the correct origin (AS12479) — as long as one ROA matches, a route is evaluated as RPKI-valid.</p> <p>145.1.240.0/20 and 1.178.232.0/21 were only briefly invalid before the attacker published ROAs with correct origins.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Origin   prefix          maxLength  ta    expiration
AS12479  145.1.240.0/20  20         ripe  1704355258
AS12479  1.178.232.0/21  21         ripe  1704355258</code></pre></div> <p style="margin: 35px;"></p> <p>Only 62.36.0.0/16 (shown below) and its numerous more-specifics were rendered RPKI-invalid and had their reachability reduced for the duration of the outage due to the ROAs with bogus origins.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6SPt5RWoPj4sagFDqa1bPn/7262a46981fd94b9bdb6b8e5f2a81cde/orange-espana-outage-reachability-reduced-bogus-origins.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Outage showing reduced reachability due to bogus origins" /> <p>Thus far in the story, the attacker’s tinkering has led to the creation of a couple of RPKI-invalid routes and some minor reachability problems, but the major disruption was yet to come.</p> <p>It wasn’t until about 14:20 UTC (3:20pm local) that things got ugly. The attacker went for it and published four more ROAs with bogus origins. Two of the ROAs were /12s, which covered over a thousand routes originated by AS12479 — all rendered RPKI-invalid by the publication of the following ROAs:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Origin   prefix          maxLength  ta    expiration
AS49581  85.48.0.0/12    12         ripe  1704355258
AS49581  90.160.0.0/12   12         ripe  1704355258
AS49581  93.117.88.0/21  21         ripe  1704355258
AS49581  145.1.232.0/21  21         ripe  1704355258</code></pre></div> <p style="margin: 35px;"></p> <p>It was here that the traffic graph at the beginning of this blog post began to take a nose dive. The number of globally routed prefixes originated by AS12479 dropped from around 9,200 to 7,400, as backbone carriers which reject RPKI-invalid routes stopped carrying a large chunk of Orange España’s IP space.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3tHpdzSYRg5u42Uk7C4RUd/7e383307f1a9d1ce647aa00e8777a9f3/orange-espana-outage-traffic-drop.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Orange Espana traffic drop graph" /> <div class="caption" style="margin-top: -35px;">Reachability of 145.1.232.0/21 during the worst part of the outage.</div> <p>It wasn’t until just before 18:00 UTC (7pm local) that things began to return to normal.
Engineers from Spain’s second largest mobile operator regained control of their RIPE NCC account and began publishing new ROAs that would enable the carrier to restore service.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Origin   prefix          maxLength  ta    expiration
AS12479  85.48.0.0/12    12         ripe  1704384768
AS12479  90.160.0.0/12   12         ripe  1704384768
AS12479  62.36.0.0/16    16         ripe  1704384768
AS12479  93.117.88.0/21  21         ripe  1704384768
AS12479  145.1.232.0/21  21         ripe  1704384768
AS12479  93.117.92.0/22  22         ripe  1704384768
AS12479  62.36.21.0/24   24         ripe  1704384768</code></pre></div> <p style="margin: 35px;"></p> <h2 id="conclusion">Conclusion</h2> <p>While RPKI was employed as a central instrument of this attack, it should not be construed as the cause of the outage any more than we would blame a router if an adversary were to get ahold of the login credentials and start disabling interfaces.</p> <p>It seems that prior to January 3, the Spanish mobile operator’s RIPE NCC account had never created a ROA (although other parts of Orange had created some on its behalf). If RPKI wasn’t on Orange España’s radar before, it sure is now.</p> <p>Although the outage is over, there is still a lot of clean-up work to be done. As of this writing, over a thousand of the routes originated by AS12479 are still invalid, mostly due to the max prefix length setting on the ROAs for the two /12s. Between yesterday and today, the number of unique IPv4 addresses originated by AS12479 dropped from 7 million to 5 million, and a few bogus ROAs with an origin of AS49581 are still in circulation.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7hVi3Ekma9gNODKLPW6t7g/ccbdd9895df67228f9e16273e127062c/orange-espana-outage-invalid-routes.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Graph showing newly-invalid routes" /> <div class="caption" style="margin-top: -35px;">One of over a thousand newly RPKI-invalid routes still originated by AS12479.</div> <p>I would remind those engineers cleaning up the ROAs that max prefix length is an optional field and can simply be left empty, in which case a ROA matches only routes for that exact prefix and origin.
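</p> <p>To make the evaluation logic concrete, here is a minimal sketch of RPKI ROV in Python (my own simplified illustration; real validators such as rpki-client also perform cryptographic validation, handle expiry, and much more). It shows why a single matching ROA is enough to make a route valid, and why a covering ROA with a short max prefix length renders more-specifics invalid:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from ipaddress import ip_network

# Hypothetical ROA set modeled on the ones from this incident:
# (origin ASN, prefix, max prefix length)
ROAS = [
    (12479, ip_network("149.74.0.0/16"), 16),  # correct origin, maxLength 16
    (49581, ip_network("62.36.0.0/16"), 16),   # deliberately bogus origin
]

def rov(origin, route):
    """Return the ROV state of a BGP route: valid, invalid, or unknown."""
    prefix = ip_network(route)
    covered = False
    for roa_origin, roa_prefix, max_len in ROAS:
        if prefix.subnet_of(roa_prefix):  # a ROA covers this route
            covered = True
            # One matching ROA (same origin, length within maxLength)
            # is enough to make the route RPKI-valid.
            if roa_origin == origin and max_len >= prefix.prefixlen:
                return "valid"
    return "invalid" if covered else "unknown"

print(rov(12479, "149.74.0.0/16"))    # valid
print(rov(12479, "149.74.172.0/22"))  # invalid: /22 exceeds maxLength 16
print(rov(12479, "62.36.21.0/24"))    # invalid: the covering ROA names AS49581
print(rov(12479, "203.0.113.0/24"))   # unknown: no covering ROA</code></pre></div> <p>In the sketch, omitting the max length (so that only a route exactly matching the ROA’s prefix and origin can be valid) captures that advice. 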
This course of action was recently published as a <a href="https://www.rfc-editor.org/rfc/rfc9319.html">best current practice</a>.</p> <p>RIPE NCC, the <a href="https://en.wikipedia.org/wiki/Regional_Internet_registry">RIR</a> responsible for managing the allocation and registration of internet number resources (IP addresses and ASNs) in Europe, has launched <a href="https://www.ripe.net/publications/news/ripe-ncc-access-security-breach-investigation">an investigation</a> into the incident.</p> <p>Hopefully this incident can serve as a wake-up call to other service providers that their RIR portal account is mission-critical and needs to be protected by more than a simple password.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h2 id="update-16-jan-2024">UPDATE: 16 Jan 2024</h2> <p>In our original blog post, I mentioned some “clean-up work” that was still required to address a myriad of RPKI-invalid routes that were being originated by AS12479 due to the quick-fix ROA modifications that resolved the outage.</p> <p>These RPKI-invalid routes weren’t causing connectivity issues because three covering routes (85.48.0.0/12, 90.160.0.0/12 and 62.36.0.0/16) were RPKI-valid, ensuring the address space was globally reachable.</p> <p>Well, that clean-up work was accomplished at 16:00 UTC on January 8, five days after the multi-hour outage. Orange España modified the ROAs for these three IP address ranges by increasing the maximum prefix length to 24 (from 12 and 16).</p> <p>We can see the impact in BGP as the reachability of some of the formerly RPKI-invalid routes jumped once they were RPKI-valid and no longer being filtered. The graphic below shows how reachability through the upstreams of AS12479 (primarily Orange, AS5511 and Lumen, AS3356) changed over this time:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3gGnJPLt8yyIunf2Psklpi/379cd1578bf841cd0578ba17ee5ee084/orange-espana-outage-update.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Orange Espana outage update" /> <p>The change is also visible in Kentik’s aggregate NetFlow. Below is a graphic of traffic to AS12479 (in bits/sec) on January 8 colored by the RPKI evaluation of the destination address space.</p> <p>The traffic falls into three categories:</p> <ol> <li>RPKI-unknown (routes without ROAs)</li> <li>RPKI-valid (routes which match ROAs)</li> <li>RPKI-invalid, but covered (routes which are RPKI-invalid, but are reachable via a covering prefix)</li> </ol> <img src="https://images.ctfassets.net/6yom6slo28h2/2zifEyO6KEe4KyzGEQBJEy/bd3347ddfe0fbe993ab2d034fe71f378/orange-espana-outage-update-categories.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Orange Espana outage update" /> <p>We can see the “RPKI-invalid, but covered” traffic, along the bottom in a faint yellow, drops away to zero once the changes to the ROAs are published and that traffic becomes RPKI-valid.
Again, since there were RPKI-valid covering routes, there was no change in the overall volume of traffic reaching AS12479.</p> <h2 id="how-to-protect-your-ripe-ncc-account">How to protect your RIPE NCC account</h2> <p><a href="https://www.ripe.net/participate/member-support/ripe-ncc-access/two-step-verification">How to Enable Two Step Verification on a RIPE NCC Access account</a></p> <p><em>Thanks to Job Snijders of Fastly for his expert guidance on this blog post.</em></p><![CDATA[Kentik Announces Year-Round Charitable Giving Program: Kentik Cares!]]><![CDATA[Kentik is excited to launch the Kentik Cares Giving Program, a year-round initiative that empowers employees to support the causes they care about. Starting in January 2024, Kentik will match employee donations up to $200 annually, fostering a culture of generosity and community engagement. This program reflects our ongoing commitment to making a meaningful impact and giving back to the communities we serve.]]>https://www.kentik.com/blog/kentik-announces-year-round-charitable-giving-program-kentik-careshttps://www.kentik.com/blog/kentik-announces-year-round-charitable-giving-program-kentik-cares<![CDATA[Kentik People Operations Team]]>Tue, 02 Jan 2024 05:00:00 GMT<p>At Kentik, we believe in the power of giving back to the community, and we’re excited to share some significant news about our internal charitable initiative, the Kentik Cares Giving Program. In the past, our charitable efforts were limited to end-of-year giving during the winter holiday season. However, after thoughtful consideration and valuable input from our team, we have decided to enhance our commitment to philanthropy. We’re thrilled to announce that the Kentik Cares Giving Program is now a year-round initiative, providing ongoing support throughout the entire 2024 calendar year!</p> <p>Starting in January 2024, our employees will have the opportunity to contribute to the causes that matter most to them. For each donation made, Kentik will match contributions up to $200 per employee each year. This initiative reflects our dedication to making a meaningful and lasting impact on the issues that resonate with our team.</p> <h2 id="why-this-matters">Why this matters</h2> <p>The Kentik Cares Giving Program demonstrates our commitment to social responsibility and our belief that collective efforts can lead to significant change. By empowering our employees to support the causes they care about, we are not only fostering a culture of generosity but also enhancing our community engagement.</p> <h2 id="transparency-and-impact">Transparency and impact</h2> <p>To ensure transparency and track the program’s impact, we are implementing a dedicated dashboard where employees can monitor their donations and the corresponding company matches. This feature will showcase the organizations benefiting from our collective contributions and the difference we are making together.</p> <h2 id="our-commitment-to-giving">Our commitment to giving</h2> <p>We are grateful for the opportunity to support our employees in their philanthropic efforts. By transitioning to a year-round program, we hope to encourage even more participation and inspire our team to give back to their communities throughout the year.</p> <p>At Kentik, we are proud to champion initiatives that align with our core values and reflect the passions of our workforce. 
We believe that through the Kentik Cares Giving Program, we can make a meaningful impact on the issues that matter most—not just during the holiday season, but all year long.</p> <p>Together, let’s create positive change and strengthen the communities we serve!</p><![CDATA[December 2023: What’s New at Kentik?]]><![CDATA[The year might be ending, but the Kentik news never stops. We're back with our bite-sized roundup of everything you might have missed at Kentik in December 2023.]]>https://www.kentik.com/blog/december-2023-whats-new-at-kentikhttps://www.kentik.com/blog/december-2023-whats-new-at-kentik<![CDATA[David Hallinan]]>Sun, 31 Dec 2023 05:00:00 GMT<p>The year might be coming to a close, but we still have a few exciting pieces of Kentik news to share with you before we officially hit 2024.</p> <p>In case you missed it last month, this short and sweet series covers some of the best, brightest, and most interesting things to happen at Kentik in any given month. So, without further ado, what’s new at Kentik?</p> <h2 id="kentik-cloud-rolls-out-oci-observability">Kentik Cloud rolls out OCI observability</h2> <p>2023 has been quite the banner year for <a href="https://www.kentik.com/product/multi-cloud-observability/">Kentik Cloud</a>. First, we rolled out complete observability for <a href="https://www.kentik.com/blog/multi-cloud-made-simple-observability-enhancements-for-aws-google-cloud/">AWS and GCP</a>, followed shortly by <a href="https://www.kentik.com/blog/announcing-complete-azure-observability-for-kentik-cloud/">Azure</a>, and then we launched <a href="https://www.kentik.com/blog/introducing-kentik-kube/">Kentik Kube</a> for Kubernetes. Now, we’re excited to share that we’ve added <a href="https://www.kentik.com/blog/oracle-cloud-infrastructure-support-in-kentik-cloud/">Oracle Cloud Infrastructure (OCI)</a> observability to the list.</p> <p>Network and cloud teams can now use Kentik Cloud to gain <a href="https://www.kentik.com/solutions/oracle-cloud-infrastructure-observability/">network insight into OCI workloads</a>, allowing them to map, query, and visualize OCI, hybrid, and multi-cloud traffic and performance – enabling them to truly see <em>all</em> critical cloud workloads in <em>one place</em>!</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3haYoBvxTW9rQFNYMX5kDK/64f8b6cb4e5e9f090071bdd432978f41/oci-health-v2.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Oracle Cloud Infrastructure in the Kentik platform" /> <p><strong>Looking for more product updates?</strong> Head on over to <a href="https://new.kentik.com/">new.kentik.com</a> to check out some of the other updates we’ve recently rolled out to the platform, including a <a href="https://new.kentik.com/library-v3-is-live-1eHy8w">redesigned UX for Kentik Library</a> and new <a href="https://new.kentik.com/synthetics-tcp-udp-support-for-ping-test-8uAZq">TCP/UDP support </a>for our synthetic ping tests.</p> <h2 id="kentik-media-observability">Kentik media observability</h2> <p>Kentik’s CEO, <a href="https://www.linkedin.com/in/avifreedman">Avi Freedman</a>, recently sat down with Code Story to share how he got his start in networking, how Kentik came to be, and even a little about his professional poker-playing career. 
Tune in to listen to the <a href="https://codestory.co/podcast/bonus-avi-freedman-kentik/">entire podcast here</a>.</p> <p>Additionally, Kentik’s CPO, <a href="https://www.linkedin.com/in/pfisterc">Christoph Pfister</a>, had a chance to chat with TFIR about the release of Kentik Kube, how AI and natural language will drive the future of network observability, and why the network is fundamental to any modern business. Watch the entire interview below.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/R9SPad8CyTc?si=cIqKiJQwcD7OfgEx" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> </div> <h2 id="are-you-tired-of-reading-already">Are you tired of reading already?</h2> <p>We try our best to keep this blog series as short and easily digestible as possible, but some folks still want their Kentik news in a visual format.</p> <p>Good news, then! The second episode of <a href="https://www.youtube.com/watch?v=F0JgK7keq9U&#x26;list=PLLr1vqeAl9QNMvS3uHyb3YjrxsWrAGMDx">What’s New at Kentik?</a> with <a href="https://www.linkedin.com/in/leonadato">Leon Adato</a> is here to walk you through the latest and greatest from Kentik in a hilarious, informative, poignant, and, most importantly, fun way.</p> <p>We’d love to hear what you think about this new series, so please drop a comment, question, or general musing about life in the comment section. Be sure to <a href="https://www.youtube.com/channel/UCsTYhGUm81x6m6lQ-QUwfKA">subscribe to Kentik’s YouTube channel</a> to ensure you don’t miss our future episodes.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/awso2u0fbz" title="What's New at Kentik: Episode 2" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <h2 id="thats-whats-new-at-kentik">That’s what’s new at Kentik</h2> <p>That’s a wrap for 2023. We hope your year was as good as ours, and we can’t wait to show you what we have in store in 2024. 
(Hint: It’s very exciting.)</p> <p>For more Kentik news and updates, head over to the <a href="https://new.kentik.com/">product updates</a> and the <a href="https://www.kentik.com/media-coverage/">media coverage</a> pages, and be sure to <a href="https://www.kentik.com/blog/#subscribe">subscribe to our blog</a> so you don’t miss out on any upcoming What’s New at Kentik content.</p> <p>Until next month, network enthusiasts!</p><![CDATA[A Year in Internet Analysis: 2023]]><![CDATA[In this post, Doug Madory reviews the highlights of his wide-ranging internet analysis from the past year, which included covering the state of BGP (both routing leaks and progress in RPKI), submarine cables (both cuts and another historic activation), major outages, and how geopolitics has shaped the internet in 2023. ]]>https://www.kentik.com/blog/a-year-in-internet-analysis-2023https://www.kentik.com/blog/a-year-in-internet-analysis-2023<![CDATA[Doug Madory]]>Wed, 20 Dec 2023 05:00:00 GMT<p>It’s the end of another eventful year on the internet, and time for another end-of-year recap of the analysis we’ve published since our <a href="https://www.kentik.com/blog/a-year-in-internet-analysis-2022/">last annual recap</a>. This year, we’ve organized the pieces into three broad categories: BGP analysis, outage analysis, and finally, submarine cables and geopolitics. We hope you find this annual summary insightful.</p> <h2 id="border-gateway-protocol-bgp-analysis">Border Gateway Protocol (BGP) analysis</h2> <img src="//images.ctfassets.net/6yom6slo28h2/1KLS7sJBkMdtBbNb8TYtNF/d0fe0885becbb7b9819965ce09ad5a40/featured-doug-brief-history-of-bgp-leaks.jpg" style="max-width: 500px;" class="image center" alt="Brief History of BGP Leaks graphic" /> <p>In June, I published my post, <a href="https://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidents/">A Brief History of the Internet’s Biggest BGP Incidents</a>, which began as a section I wrote for the 2022 <a href="https://www.bitag.org/index.php">Broadband Internet Technical Advisory Group (BITAG)</a> comprehensive report on <a href="https://www.bitag.org/Routing_Security.php">internet routing security</a>. The piece began with the AS7007 leak of 1997 and covered the most notable and significant BGP incidents in the history of the internet, from traffic-disrupting BGP leaks to crypto-stealing BGP hijacks. It intended to familiarize the reader with the major events that shape our understanding of the vulnerabilities and weaknesses of BGP.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 40px; padding-bottom: 20px;"></div> <img src="//images.ctfassets.net/6yom6slo28h2/3ihFVfen5zsY6J5OUfHXDF/23a564b4fea28ace34134eb2ebe205c3/feature-internet-traffic-rpki-202305.png" style="max-width: 500px;" class="image center" alt="Pie chart: Internet traffic volume by RPKI evaluation" /> <p>In May, my friend Job Snijders of Fastly and I published <a href="https://www.kentik.com/blog/exploring-the-latest-rpki-rov-adoption-numbers/">updated RPKI ROV adoption numbers</a> based on our measurement approach from last year. While the degree to which RPKI-invalids are rejected didn’t change noticeably from the previous year, the number of ROAs created continues to climb. 
The additional ROAs serve to increase the share of internet traffic eligible for the protection that RPKI ROV offers.</p> <p>In fact, ROAs are being created at such a clip that the number of globally routed IPv4 BGP routes with ROAs will overtake those without at some point in the next year. IPv6 routes are already there. I’ve posted a survey on <a href="https://www.linkedin.com/feed/update/urn:li:activity:7142554453575880704/">LinkedIn</a> and <a href="https://twitter.com/DougMadory/status/1735707501750788500">X/Twitter</a> to gather predictions from my fellow BGP nerds.</p> <div as="Promo"></div> <p>The above analysis was cited three times by speakers at the FCC’s <a href="https://www.fcc.gov/news-events/events/2023/07/bgp-security-workshop">Border Gateway Protocol Security Workshop</a> held on July 31 in Washington D.C. My friend Tony Tauber, Engineering Fellow at Comcast, <a href="https://www.youtube.com/watch?v=VQhoNX2Q0aM&#x26;t=3973s">cited this analysis</a> to argue that traffic data (i.e., NetFlow) suggests that we’re farther along in RPKI ROV adoption than the raw counts of BGP routes might suggest. Another friend of mine, Nimrod Levy of AT&#x26;T, <a href="https://www.youtube.com/watch?v=VQhoNX2Q0aM&#x26;t=4332s">cited our observation</a> that a route that is evaluated as RPKI-invalid will have its propagation reduced by 50% or more.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3PsMazQtqrJOZqfv3Gx6tM/4186f961c41c641fde7a1dfa05db9248/nist-slide.jpg" style="max-width: 700px;" class="image center" alt="RPKI-ROA/ROV System Dynamics slide from NIST" /> <div class="caption" style="margin-top: -35px;"> Slide from Doug Montgomery of NIST, including the pie chart above in a slide at the workshop.</div> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 40px; padding-bottom: 30px;"></div> <p>In addition, I dug into a couple of BGP leak incidents this year. At the beginning of the year, I used <a href="https://www.kentik.com/blog/new-year-new-bgp-leaks/">two route leaks</a> to explore the impacts on traffic using our extensive NetFlow data. A common concern with a routing leak is the misdirection of traffic through a suboptimal path, but the other, often greater impact, is how much traffic is simply dropped due to congested links or high latencies.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/65ZikNty6szrsKv4RgepVq/8e686ebe097588290997358340b9205f/leaks-by-status.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Charts: Propagation of leaked routes by RPKI status" /> <div class="caption" style="margin-top: -35px;">Propagation analysis from <a href="https://www.kentik.com/blog/a-tale-of-two-bgp-leaks/">A Tale of Two BGP Leaks</a> post</div> <p>In other incidents, I analyzed the role of RPKI ROV in limiting the disruption caused by leaks. For example, in August, the government of <a href="https://www.kentik.com/blog/iraq-blocks-telegram-leaks-blackhole-bgp-routes/">Iraq attempted to block Telegram</a> by announcing a BGP hijack within the country. 
The route leaked out but didn’t propagate very far: Telegram had a ROA for the IP space, and the international carriers serving Iraq rejected the RPKI-invalid BGP hijack, preventing it from propagating outside the country and limiting the disruption to Telegram.</p> <p>In that post, I concluded that “while it likely didn’t have the potential to be another Pakistan/YouTube incident, BGP <em>non-events</em> like this are what RPKI successes look like.”</p> <h2 id="outage-analysis">Outage analysis</h2> <h3 id="azure">Azure</h3> <p>As with any year, there were many internet service outages of various types, but the two that I spent time digging into were the <a href="https://www.kentik.com/blog/digging-into-the-recent-azure-outage/">Microsoft Azure connectivity outage in January</a> and the <a href="https://www.kentik.com/blog/digging-into-the-optus-outage/">Optus outage in Australia</a> at the end of the year.</p> <p>In the early hours of Wednesday, January 25, Microsoft’s public cloud suffered a major outage that disrupted their cloud-based services and popular applications such as Sharepoint, Teams, and Office 365. Similar to the historic <a href="https://www.kentik.com/blog/facebooks-historic-outage-explained/">Facebook outage of October 2021</a>, Microsoft blamed the outage on a flawed router command, which took down a significant portion of the cloud’s connectivity.</p> <p>In my post, I analyzed the outage using a variety of unique datasets from Kentik, including our BGP visualizations, performance monitoring, and aggregate NetFlow analysis.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/ljU15ARogkLkblS03LCPk/5e045406ba23b4ea1831aff1937ccfe3/seattle-se-asia-disconnect.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure outage with BGP visualizations and network performance monitoring" /> <p>In the end, without intimate knowledge of Azure’s architecture, it was hard to explain why some parts of its connectivity were impacted while others were not, but we could surmise that Azure’s loss of connectivity varied greatly depending on the source and destination.</p> <p>A curious detail of this outage is that it surfaced BGP routes that purportedly showed Microsoft hijacking Saudi Telecom, T-Mobile, and the US Department of Defense, among other entities. After a lengthy email exchange involving numerous engineers from Microsoft, Vocus, TPG, and myself concerning the hijacks, they <a href="https://twitter.com/DougMadory/status/1622636431230238738">disappeared from the global routing table</a> without further explanation.</p> <h3 id="optus">Optus</h3> <p>The Optus outage in November was arguably much more impactful than the Azure glitch, and I looked into it in a post entitled <a href="https://www.kentik.com/blog/digging-into-the-optus-outage/">Digging into the Optus Outage</a>:</p> <blockquote>In the early hours of Wednesday, November 8, Australians woke up to find one of their major telecoms completely down. Internet and mobile provider Optus had suffered a catastrophic outage beginning at 17:04 UTC on November 7, or 4:04 am the following day in Sydney.
It wouldn’t fully recover until hours later, leaving millions of Australians without telephone and internet connectivity.</blockquote> <img src="https://images.ctfassets.net/6yom6slo28h2/17I4YFwQtVgr4a8RLqflU8/0523550769d8d6a5014691f3a5535c53/internet-outage-restoration-phases.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Graph showing internet traffic to Optus, the outage and restore" /> <p>In the post, I described how the Optus outage was similar to the <a href="https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-outage/">Rogers outage of July 2022</a>. In both cases, a portion of the networks’ BGP routes were withdrawn, but those withdrawals were merely symptoms of an internal issue, rather than the cause of the outage.</p> <p>The Rogers outage was caused by the removal of a route filter, which leaked the global routing table into the internal routing, overwhelming their routers and bringing the network down. In the Optus outage, a sibling network (also owned by parent company Singtel) leaked routes into Optus’s network. The leak triggered the <a href="https://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/25160-bgp-maximum-prefix.html">MAXPREF safeguard</a> on Optus’s routers. MAXPREF instructs routers to tear down a session when the number of routes advertised across it exceeds a specified maximum number.</p> <p>Unfortunately, the default retry time for Cisco routers is <a href="https://twitter.com/jaredmauch/status/1724410843125744002">‘forever,’</a> meaning that an engineer would need to manually intervene on each router to re-establish lost sessions. A safer practice is to set a retry time for the router to automatically re-establish the session after a minute when the routing leak has passed.</p> <p>The outage led to the <a href="https://www.theguardian.com/business/2023/nov/20/optus-ceo-kelly-bayer-rosmarin-resigns-network-outage-australia">resignation of the CEO</a> and the minting of a new internet star, <a href="https://www.dailymail.co.uk/news/article-12727421/Optus-outage-cat-Sydney.html">Luna the cat</a> whose <a href="https://www.studiesinaustralia.com/Blog/about-australia/the-modern-guide-to-aussie-slang#:~:text=Brekky%3A%20the%20first%20and%20most,Aussies%20call%20breakfast%20&#x27;brekky&#x27;.">brekkie</a> was another casualty of the outage.</p> <h2 id="submarine-cables-and-geopolitics">Submarine cables and geopolitics</h2> <img src="https://images.ctfassets.net/6yom6slo28h2/4wSZMuazTqeVn0z7ckwyBW/bfadc07e6addd06a84d4d3ff98eb392f/featured-cuba-submarine-cable.png" style="max-width: 500px;" class="image center" alt="Cuba graphic with submarine cable" /> <p>In January, we celebrated the 10th anniversary of the <a href="https://www.reuters.com/article/cuba-internet/cubas-mystery-fiber-optic-internet-cable-stirs-to-life-idUKL1N0AR9TQ20130122/">activation of the ALBA-1 submarine cable</a> to Cuba. 
However, just a month earlier, the US Department of Justice <a href="https://www.justice.gov/opa/press-release/file/1554426/download">recommended</a> that the FCC deny the request by the <a href="https://www.submarinecablemap.com/submarine-cable/arcos">ARCOS submarine cable</a> to build a spur connecting Cuba to the cable system.</p> <p>As I wrote in my <a href="https://www.kentik.com/blog/cuba-and-the-geopolitics-of-submarine-cables/">blog post in January</a>, the rejection of an ARCOS landing in Cuba showed that, almost a decade after the activation of the ALBA-1 cable, geopolitics continues to shape the physical internet — especially when it comes to Cuba.</p> <p>A large part of the rationale for denying ARCOS a landing in Cuba was the possibility that ETECSA, the state telecom of Cuba, could employ BGP hijacks to intercept US traffic. Of course, nothing stops ETECSA from trying that now, and the addition of a submarine cable doesn’t alter this risk, in my opinion.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 40px; padding-bottom: 20px;"></div> <p><img src="https://images.ctfassets.net/6yom6slo28h2/P3DfFXV8tQOZ44jL9Ns98/ebe750473c99657f50194c2a8192699f/feature-congo-canyon.png" style="max-width: 500px;" class="image center" alt="Congo Canyon Undersea Landslide photo" /></p> <p>In August, an undersea landslide in one of the world’s longest submarine canyons knocked out two of the most important submarine cables serving the African internet. The loss of these cables cut international internet bandwidth along the west coast of Africa.</p> <p>In my <a href="https://www.kentik.com/blog/dual-subsea-cable-cuts-disrupt-african-internet/">post on the cable cuts</a>, I reviewed some history of the impact of undersea landslides on submarine cables and concluded, “Make no mistake; the seafloor can be a dangerous place for cables.” I used some of Kentik’s unique data sets to explore the impacts of these cable breaks, which I later <a href="https://www.youtube.com/watch?v=mNinsrz9FmE">remotely presented</a> at Angola NOG (AONOG) in November.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 40px; padding-bottom: 20px;"></div> <p><img src="https://images.ctfassets.net/6yom6slo28h2/4Xhbe6j3UflEYdG61qxz8q/d5a49fb87c574df51af87cf66b0efd91/feature-napoleon-traffic.png" style="max-width: 500px;" class="image center" alt="Internet traffic to St. Helena" /></p> <p>It is rare that we get to celebrate a new submarine cable activation like the one that occurred on October 1 in Saint Helena.
In <a href="https://www.kentik.com/blog/ending-saint-helenas-exile-from-the-internet/">my blog post</a>, I published the first evidence of the activation of the submarine cable connection to the remote British overseas territory and the final place of exile for Napoleon Bonaparte.</p> <p>In the post, I went on to tell the epic story (more <a href="https://variety.com/2023/film/news/napoleon-inaccuracies-french-historians-pyramids-1235823975/">historically accurate</a> than the recent Napoleon movie) of the advocacy work by my friend German telecommunications expert Christian von der Ropp to make this cable activation a reality:</p> <blockquote>Realizing that landing a submarine cable branch in Saint Helena wasn’t going to happen without dedicated advocacy, Christian founded the non-profit Connect St Helena in early 2012 and began the lobbying effort.</blockquote> <p>After being rebuffed by the government of the UK, Saint Helena ultimately received funding from the EU to build a spur to Google’s upcoming <a href="https://www.submarinecablemap.com/submarine-cable/equiano">Equiano submarine cable</a>.</p> <blockquote>While the UK had voted to leave the EU in 2016, the implementation of “Brexit” wasn’t finalized until February 2020. In 2018, the (Saint Helena Government) was still eligible to receive EU funding. It would be one of the last EU benefits given to the UK.</blockquote> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 40px; padding-bottom: 20px;"></div> <p><img src=" https://images.ctfassets.net/6yom6slo28h2/6ok3ZENHnPRKX0SO6BHH1b/1e1f039900b8d6b294c342130abdf420/featured-russification.png" style="max-width: 500px;" class="image center" alt="Russification of Ukrainian IP addresses" /></p> <p>Finally, the conflict in Ukraine continues to rage on, impacting internet connectivity in war-ravaged parts of the country. I wrote two pieces of analysis early in the year focused on the situation in Ukraine.</p> <p>The first focused on the <a href="https://www.kentik.com/blog/the-russification-of-ukrainian-ip-registration/">Russification of Ukrainian IP addresses in RIPE registrations</a>. 
Ukrainian residents of Russian-held regions have been forced to adopt all things Russian: language, currency, telephone numbers, and, of course, internet service.</p> <p>Using a novel utility made available by RIPE NCC, I identified dozens of changes to RIPE registrations, revealing another target of this Russification effort: the geolocation of Ukrainian IP addresses.</p> <p>An example of the output of the historical RIPE whois query is below:</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> $ whois --diff-versions 9:10 178.158.128.0 - 178.158.191.255 % Difference between version 9 and 10 of object "178.158.128.0 - 178.158.191.255" @@ -2,3 +2,3 @@ netname: ISP-EAST-NET <span style="background-color: #fef89b;">-country: UA</span> <span style="background-color: #fef89b;">+country: RU</span> org: ORG-EL88-RIPE @@ -10,3 +10,3 @@ created: 2010-11-18T10:59:18Z -last-modified: 2016-04-14T10:43:56Z +last-modified: 2022-07-21T12:58:43Z source: RIPE </code></pre> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 40px; padding-bottom: 20px;"></div> <p><img src=" https://images.ctfassets.net/6yom6slo28h2/5QOnyg3hZsJFekDPqDuBZK/cf78c1d81ddcc1d5f0ab884086eb3b4a/featured-kentik-analysis-wsj.png" style="max-width: 500px;" class="image center" alt="Wall Street Journal graphic showing connectivity inside Ukraine" /></p> <p>The <a href="https://www.kentik.com/blog/ukraines-wartime-internet-from-the-inside/">second post</a> was a collaboration with the <a href="https://www.wsj.com/articles/ukrainians-work-through-blackouts-internet-outages-as-russia-targets-power-grid-218a0fd5">Wall Street Journal</a> and used a novel data source that allowed us to explore connectivity inside Ukraine: the <a href="https://www.caida.org/projects/ark/">ARK dataset</a> from the <a href="https://www.caida.org/">Center for Applied Internet Data Analysis (CAIDA)</a>.</p> <p>Expanding on our <a href="https://www.kentik.com/blog/rerouting-of-kherson-follows-familiar-gameplan/">collaborative work</a> with the <a href="https://www.nytimes.com/interactive/2022/08/09/technology/ukraine-internet-russia-censorship.html">New York Times</a> in the previous year, we explored how the war changed the path of traffic traversing the domestic Ukrainian internet. In the graphic below, we employed traceroutes from the ARK measurement server in Kyiv to various ISPs in the Kherson region over the course of a year.</p> <blockquote>There is a clear point when the latencies increase due to the Russian rerouting at the beginning of June 2022. The graphic also illustrates the result of the Ukrainian liberation effort in Kherson. Ukrainians recaptured half of the region, and we see a portion of the traceroutes reverting to a lower latency as those networks restore their Ukrainian transit connections. 
A few providers in the region of Kherson are still on Russian transit, presumably in the territory that is still under Russian control.</blockquote> <img src="https://images.ctfassets.net/6yom6slo28h2/4XXRULGwe25LcZA2skKP3e/9e9e758d7d17248a608094341acd82ad/kiev-kherson-1.png" style="max-width: 700px;" class="image center" withFrame thumbnail alt="Diagram showing measurements from Kiev to Kherson" /> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 40px; padding-bottom: 20px;"></div> <p>Lastly, I joined a team of researchers led by the <a href="https://censoredplanet.org/">Censored Planet</a> group at the University of Michigan to author a paper presented this summer at the <a href="https://www.usenix.org/conference/usenixsecurity23">32nd Usenix Security Symposium</a> entitled <a href="https://www.usenix.org/conference/usenixsecurity23/presentation/ramesh-network-responses">Network Responses to Russia’s Invasion of Ukraine in 2022: A Cautionary Tale for Internet Freedom</a>. My portion covered the Russian <a href="https://twitter.com/DougMadory/status/1508466367112093709">BGP hijack of Twitter</a>, and a video of the presentation is <a href="https://www.youtube.com/watch?v=kBK3N3dYZq4">available here</a>.</p> <h2 id="conclusion">Conclusion</h2> <p>Kentik provides network observability to hundreds of the most important service providers and enterprises in the world. As a result, it provides me, an internet measurement analyst, unparalleled data and tools to conduct analysis of important internet developments.</p> <p>As we look ahead to the new year, there is no shortage of challenges and opportunities for internet connectivity around the world. We intend to continue producing timely, informative, and impactful analysis that helps inform the public and industry about internet connectivity issues.</p> <p>Follow <a href="https://twitter.com/DougMadory">me on X/Twitter</a> and <a href="https://www.linkedin.com/company/kentik/mycompany/">Kentik on LinkedIn</a> to make sure you get notified as we publish posts in the future.</p><![CDATA[The Power of Paris Traceroute for Modern Load-Balanced Networks]]><![CDATA[Modern networking relies on the public internet, which heavily uses flow-based load balancing to optimize network traffic. However, the most common network tracing tool known to engineers, traceroute, can’t accurately map load-balanced topologies. Paris traceroute was developed to solve the problem of inferring a load-balanced topology, especially over the public internet, and help engineers troubleshoot network activity over complex networks we don’t own or manage. ]]>https://www.kentik.com/blog/the-power-of-paris-traceroute-for-modern-load-balanced-networkshttps://www.kentik.com/blog/the-power-of-paris-traceroute-for-modern-load-balanced-networks<![CDATA[Phil Gervasi]]>Tue, 19 Dec 2023 05:00:00 GMT<p>How we consume applications today requires a clear understanding of the paths application traffic takes over our networks, both locally and over the internet. Traceroute has always been an indispensable tool for tracing network paths and ultimately for troubleshooting network problems, so despite it being a technology almost four decades old, it’s more important today than ever.</p> <div as="WistiaVideo" videoId="8zuvx7yfu7" audio></div> <p>However, classic traceroute has limitations that hinder its usefulness in modern networking.
Our reliance on the public internet, distributed resources, deterministic and non-deterministic routing, various forms of load balancing, and many network-adjacent devices means engineers need a better tool to trace packets from source to destination.</p> <p>Paris traceroute solves some of the limitations of classic traceroute and its variants and helps engineers address concerns about path anomalies and false paths, especially those caused by load balancing and equal cost multipath, or ECMP. Ultimately, these advancements are critical when troubleshooting network activity over complex networks we don’t own or manage.</p> <h2 id="origin-of-traceroute">Origin of traceroute</h2> <p>Traceroute was first introduced in 1987 by Van Jacobson, an American computer scientist and prolific contributor to the development of the internet and several commonly used diagnostic tools. Jacobson and Steve Deering, another computer scientist credited with many early innovations in networking, developed a way to edit an IPv4 packet header’s TTL field so that a hop-by-hop path could be traced between a source and destination.</p> <p>Though some consider this a sort of hack because, in essence, traceroute “tricks” devices in the path into divulging information about themselves, it nonetheless quickly became a valuable tool in the hands of engineers using almost any operating system. Over time, that has included network and security devices and compute. Ultimately, the IP protocol lacks basic telemetry features, which is why traceroute was developed.</p> <div as="Promo"></div> <h2 id="traceroute-basics">Traceroute basics</h2> <p>The traceroute function sends out a series of network packets toward a destination, incrementing the time-to-live (TTL) value with each set. The TTL is a field in the IP header specifying the maximum number of hops a packet is allowed before it’s discarded. Along the way, each router decreases the TTL by one, and when it hits zero, the router sends back an ICMP “time exceeded” message, revealing its identity.</p> <p>This means that if we set the TTL to one for the first hop, we get an immediate response from our next-hop device. We then send a follow-up packet with a TTL of two to discover the second hop in the path. Then, a TTL value of three, four, and so on until we reach our destination and see a complete trace of the path in terms of hops by device, often a router of some type.</p> <p>By sending out a series of probes with increasing TTL values and listening for ICMP responses, the source computer builds the path by noting the sender of each time exceeded message and the time taken for the round trip. So, by starting with a TTL of 1 and incrementing it with each new set of packets, traceroute builds up a list of routers on the path and the round-trip time to each. This continues until it reaches the destination or reaches its limit, usually 30 hops.</p> <p>Classic traceroute uses ICMP, most widely known as ping, for the outgoing packets but can also use UDP and TCP.
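</p> <p>To ground the mechanics, here is a minimal UDP-mode traceroute sketch in Python (a simplified illustration of the technique, not any particular implementation; it needs root privileges for the raw ICMP socket and naively assumes the next ICMP packet received is the reply to our probe):</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import socket
import time

def traceroute(dest, max_hops=30, dport=33434):
    # Classic UDP traceroute: raise the TTL one hop at a time and report
    # whichever router returns the ICMP "time exceeded" message.
    dest_ip = socket.gethostbyname(dest)
    for ttl in range(1, max_hops + 1):
        tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        tx.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        rx = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
        rx.settimeout(2.0)
        start = time.time()
        tx.sendto(b"", (dest_ip, dport))  # classic traceroute varies this port per probe
        try:
            _, addr = rx.recvfrom(512)    # the router whose TTL decrement hit zero
            rtt_ms = (time.time() - start) * 1000
            print(f"{ttl:2d}  {addr[0]}  {rtt_ms:.1f} ms")
            if addr[0] == dest_ip:        # destination answers with "port unreachable"
                break
        except socket.timeout:
            print(f"{ttl:2d}  * * *")     # hop filtered ICMP or dropped the probe
        finally:
            tx.close()
            rx.close()

# traceroute("example.com")  # run with root privileges</code></pre></div> <p>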
The destination port is used as the sequence number for UDP traceroute packets, so it can vary throughout the entire trace.</p> <p>The output of a classic traceroute command shows the list of hops along the path, the IP addresses (or domain names if resolvable) of those routers, and the round-trip times for each hop.</p> <p>It looks something like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/4xFs0k2YcqIHEaA6QM3xv5/8312e0baa5517cd4bf760a8389ee3289/classic-traceroute-command.png" style="max-width: 800px;" class="image center" alt="Output of a classic traceroute command" /> <p>There are several immediate use cases for classic traceroute, including:</p> <ul> <li><strong>Network path verification</strong>: Ensuring packets are taking the expected route through the network.</li> <li><strong>Troubleshooting network issues</strong>: Identifying routers or network segments experiencing issues such as dropping packets.</li> <li><strong>Latency measurement</strong>: Understanding the time packets take to travel to a destination and back.</li> </ul> <p>Verifying a network path means simply using traceroute to ensure packets take the expected route. If there’s a problem, we can use classic traceroute to troubleshoot network issues, such as identifying routers or other devices in the network path that could be the source of the problem. And since we’re able to determine the RTT from each hop, we have a good understanding of latency in terms of the entire path from source to destination and between each hop.</p> <h2 id="limitations-of-classic-traceroute">Limitations of classic traceroute</h2> <p>Classic traceroute does have its limitations, however.</p> <p>In networks where there are multiple possible paths or load balancing in use, subsequent packets may report different devices in the path from source to destination, which can lead to confusing output at best and an inaccurate output at worst. Nodes or entire links in the path could be missing from the output of classic traceroute, and there is no mechanism to identify hops that could contain multiple interfaces, as is the case with load balancing.</p> <p>This is important to consider because load balancing and multipath networking are very common in more extensive networks, especially the global internet. For instance, OSPF and other IGPs use a form of dynamic multipathing to ensure packet delivery within internal networks. On the public internet, DNS load balancing and elastic load balancing are used to dynamically adjust the flow of traffic over the internet and among public cloud providers.</p> <p>Also, some routers or firewalls may be configured to ignore the traceroute packets or the ICMP time-exceeded messages, resulting in <code class="language-text">* * *</code> in the output and an incomplete path, leaving holes in an output that adversely affects an engineer’s ability to understand the entire path.</p> <p>We can end up with anomalies or incorrect outputs such as <em>loops</em> (often where there aren’t any), <em>cycles</em>, and <em>diamonds</em>.</p> <p><strong>Loops</strong></p> <p>A loop is when the same node appears multiple times in an output. Normally, routers don’t forward traffic back to itself on the same interface. However, misconfigured routing could potentially create a scenario in which a router sends a packet to its next hop. 
That next-hop router, in turn, could have the originating router configured (manually or dynamically) as its own next hop, sending the packet right back.</p> <p>However, more often in a classic traceroute output, a loop in an otherwise successful trace is likely an output anomaly in that a loop doesn’t actually exist. If the destination is reachable, the loop must then be some sort of artifact in the observed time-exceeded responses.</p> <p>In this case, a loop is likely caused by load balancing when there are multiple paths with different lengths. Notice the image below where a load balancer forwards the traceroute probes with TTL 7 and 8 to router A and the probes with TTL 9 to router B, producing two different results from the same source and destination.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4MmwFWemHJrCcAqtbmfmpp/2e8e9e1726dc6c2b597eba8d696fc5f8/traceroute-loop.png" style="max-width: 800px;" class="image center" alt="Diagram showing how load balancing causes a loop" /> <div class="caption" style="margin-top: -35px;">Image source: <a href="https://paris-traceroute.net/images/imc2006.pdf">https://paris-traceroute.net/images/imc2006.pdf</a></div> <p>Loops in the traceroute output can also be caused by misconfigured routing or faulty devices, such as when a router receives a probe with an expired TTL and, instead of dropping the packet, forwards it to the next hop with a TTL of 1. The next router, functioning correctly, decrements the TTL to zero, drops the packet, and generates an ICMP time exceeded message. This repeats for successive probes, which appears as a loop in the traceroute output.</p> <p>Yet another cause of apparent loops is <em>address rewriting</em>. Address rewriting is most commonly found in network address translation, or NAT. The purpose of NAT is to change the IP address in a packet’s source and/or destination fields, which can lead to an anomalous traceroute output.</p> <p>An anomalous loop in a classic traceroute output is a recurring problem, especially as organizations rely more heavily on the internet and public cloud, where load balancing is commonly deployed.</p> <p><strong>Cycles</strong></p> <p>The term “cycle” is sometimes used interchangeably with loop, but there is a subtle difference. Less common than loops, a cycle in a classic traceroute output refers to any repetitive output in the trace. Loops are considered a redundant output, but the term “cycles” is typically used to refer to a loop with additional nodes between the two looping devices (usually routers).</p> <p><strong>Diamonds</strong></p> <p>A traceroute diamond is often an anomalous output that occurs when there are multiple traceroute probes per hop. This is primarily caused by load balancing and can result in false links leading to an incorrect trace between source and destination. Diamonds occur in a significant number of classic traceroute results in large networks such as the internet itself.</p> <p>However, keep in mind that diamonds are an artifact of load balancing and, therefore, a genuine part of the topology. Diamonds in a traceroute output do not always indicate false links; instead, they may be representative of the actual topology.</p> <p>In the image below on the left, a load balancer (L) can send each individual probe of a single trace via a different path to the destination (G).
The result is a diamond-shaped trace and an inaccurate view of the actual network path.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2B5BW7jUxoXjnJ7KsZaqzc/c3192c789891e235b509b9e30ff5fcce/diamond-shaped-traceroute.png" style="max-width: 700px;" class="image center" alt="Diagram showing how a load balancer causes a diamond-shaped trace" /> <div class="caption" style="margin-top: -35px;">Image source:<a href="https://paris-traceroute.net/images/imc2006.pdf"> https://paris-traceroute.net/images/imc2006.pdf</a></div> <h2 id="what-is-paris-traceroute">What is Paris traceroute?</h2> <p>Paris traceroute is an important evolution of the traceroute tool because it solves the problem of <em>flow-based</em> (as opposed to packet-based) load balancing, which causes the inaccuracies in classic traceroute results. It ensures a consistent path is taken by all packets in a session, providing a more accurate view of the network path. Considering our reliance on the public internet and how common flow-based load balancing is, this method of topology inference is a critical advantage over classic traceroute.</p> <p>Introduced by Brice Augustin, Timur Friedman, and Renata Teixeira in the 2007 workshop <a href="https://ieeexplore.ieee.org/xpl/conhome/4261330/proceeding">End-to-End Monitoring Techniques and Services</a> in Munich, Germany, Paris traceroute is an “adaptive, stochastic probing algorithm, called the Multipath detection algorithm, to report all paths towards a destination.”</p> <p>In their research, Augustin, Friedman, and Teixeira found that classic traceroute often produces inaccurate and incomplete measured network paths. This means engineers relying on classic traceroute lack an adequate tool for network troubleshooting and for understanding where latency is occurring in a multi-hop path with intermediate routers.</p> <p>Additionally, they found that many modern routers deployed in production perform per-packet, per-flow, and per-destination load balancing, none of which are effectively measured with classic traceroute.</p> <p>Their work with Paris traceroute aimed to solve several of these problems, especially the difficulty of tracing flow-based load-balanced paths, and enable engineers to see more clearly how traffic traverses modern networks.</p> <h2 id="how-paris-traceroute-works">How Paris traceroute works</h2> <h3 id="maintaining-session-information">Maintaining session information</h3> <p>Paris traceroute changes the probing strategy and improves topology inference by increasing the number of probes and controlling the various identifiers in packet headers.</p> <p>First, the key to ensuring that all packets in a traceroute session are treated as part of the same flow is keeping certain header fields (commonly referred to as the 5-tuple) constant. For UDP packets these are the IP src/dst, IP protocol, and UDP src/dst port fields. For ICMP packets they are the IP src/dst, IP protocol, and ICMP type, code, and checksum. Classic traceroute changes some of these fields as the session progresses, causing the packets to be treated as different flows by load balancers. Paris traceroute rectifies this design flaw and uses other header fields to encode the state required to process the ICMP responses received from routers.</p> <p>Load balancers are typically designed to use destination port numbers to identify all the incoming packets in a flow and forward them down the same path. However, classic traceroute uses different port numbers for each probe.
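</p> <p>The toy model below (my own illustration; real routers use vendor-specific hash functions built on the same idea) shows why this matters: a per-flow load balancer hashes the 5-tuple to pick a path, so classic traceroute’s changing destination port scatters probes across paths, while a constant 5-tuple keeps them together:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import hashlib

PATHS = ["via router A", "via router B"]

def pick_path(src_ip, dst_ip, proto, sport, dport):
    # Toy per-flow load balancer: hash the 5-tuple and choose a next hop.
    # The same 5-tuple always hashes to the same path.
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return PATHS[int(hashlib.sha256(key).hexdigest(), 16) % len(PATHS)]

# Classic traceroute: the destination port changes with every probe,
# so successive probes may be forwarded down different paths.
for dport in (33434, 33435, 33436):
    print("classic:", pick_path("198.51.100.7", "192.0.2.1", 17, 55000, dport))

# Paris traceroute: the 5-tuple is held constant for the whole session,
# so every probe follows the same path.
for _ in range(3):
    print("paris:  ", pick_path("198.51.100.7", "192.0.2.1", 17, 55000, 33434))</code></pre></div> <p>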
<p>In the image below, notice the fields that are used for flow-based load balancing shaded in gray. These fields must be kept constant throughout the flow, but the set is different depending on whether it’s an ICMP traceroute or a UDP traceroute.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2Wd5OSbaTUwPwnBib5hDL9/9f9393ce5e4e18c0e4ba853940e0720d/flow-based-load-balancing.png" style="max-width: 700px;" class="image center" alt="Table showing fields for flow-based load balancing" /> <div class="caption" style="margin-top: -35px;">Image source:<a href="https://paris-traceroute.net/images/imc2006.pdf"> https://paris-traceroute.net/images/imc2006.pdf</a></div> <p>Second, Paris traceroute will also send probes with <em>different</em> flow identifiers (typically different destination port numbers) to differentiate between flows. This helps identify whether there are multiple potential next-hop interfaces that would cause incorrect or confusing output in the trace.</p> <p>Like classic traceroute, Paris traceroute incrementally increases the TTL value to discover each hop. By using several different modes to keep the relevant packet header fields constant, however, it maintains flow consistency for all packets. So, like classic traceroute, it systematically discovers and records the path taken by packets through the network, but it can account for load-balanced paths.</p> <p>Notice in the image below that a network may contain multiple paths due to load balancing, dynamic routing changes, or even faulty network devices. In this image, if we want to trace the path from source (Src) to destination (Dst), we have to contend with a load balancer (L) and the potential for multiple forwarding paths at A and B. On a network of global scale, the complexity would be significantly greater.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5pG8UtkGUwUgqbYfnRTYy3/fe890a79c0824d2884261d777ac06852/traceroute-paths1.png" style="max-width: 600px;" class="image center" alt="Traceroute paths example" /> <p>This becomes even more apparent when different path selections result in different nodes in the output – and even a different number of nodes – affecting the TTLs generated by the source.</p> <p>In the image below, notice that depending on the path probes take, the TTL value and the subsequent time-exceeded responses will be affected. This will result in anomalies and generally inaccurate results in the traceroute output.</p> <img src="//images.ctfassets.net/6yom6slo28h2/42uh0TaNWdHrmhWYilM1RS/092eb5dfd408c8382774968307bb1a4f/traceroute-paths2.png" style="max-width: 600px;" class="image center" alt="Example showing anomalies" /> <div class="caption" style="margin-top: -35px;">Original image:<a href="https://media.frnog.org/FRnOG_10/FRnOG_10-3.pdf"> https://media.frnog.org/FRnOG_10/FRnOG_10-3.pdf</a></div> <p>Note that there is a variation of the Paris traceroute algorithm, the Multipath Detection Algorithm (MDA), which sends from six up to 96 probes per hop to account for possible load balancing. By sending more probes and varying their flow identifiers (using UDP probes), the MDA aims to trace multiple paths more accurately, including diamonds caused by load balancing. However, the MDA is not commonly used, whereas Paris traceroute has become the industry standard.</p>
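<p>Before we look at the individual modes, here is a minimal sketch of the overall probing strategy: hold the flow identifier constant and increment only the TTL. It uses the third-party Scapy library, requires root privileges, and omits the response-matching bookkeeping that Paris traceroute actually performs, so treat it as an illustration of the idea rather than the real implementation:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from scapy.all import ICMP, IP, UDP, sr1  # third party: pip install scapy; run as root

DST = "192.0.2.1"  # placeholder documentation address; substitute a real target

# The 5-tuple (src/dst IP, protocol 17, sport 33434, dport 33434) never
# changes, so a per-flow load balancer treats every probe as one flow.
# Only the TTL increments, eliciting a time-exceeded reply from each hop.
for ttl in range(1, 16):
    probe = IP(dst=DST, ttl=ttl) / UDP(sport=33434, dport=33434)
    reply = sr1(probe, timeout=2, verbose=0)
    if reply is None:
        print(ttl, "*")  # no reply within the timeout
    elif reply.haslayer(ICMP) and reply[ICMP].type == 11:
        print(ttl, reply.src)  # ICMP time exceeded: an intermediate hop
    else:
        print(ttl, reply.src, "(destination or other reply)")
        break</code></pre></div> <p>Real Paris traceroute additionally encodes probe state in header fields that do not affect flow hashing, so it can match each time-exceeded reply to the probe that triggered it.</p>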
<h3 id="modes">Modes</h3> <p>Paris traceroute solves the problem of flow-based load balancing by using several modes.</p> <ul> <li> <p>First, in UDP mode, Paris traceroute keeps the source and destination port fields constant for all probes in a single run. This is different from traditional traceroute, which usually randomizes these values, especially the destination port, which traditional traceroute repurposes as a sequence number.</p> </li> <li> <p>Second, TCP mode is similar to UDP mode in that the source and destination port fields are kept constant across probes. Due to the nature of how TCP operates, it’s important to maintain flow consistency, which Paris traceroute accomplishes by controlling the TCP Sequence and Acknowledgement Numbers.</p> </li> </ul> <p>By keeping these fields constant (particularly the port numbers), Paris traceroute keeps all probes within the same flow and, therefore, on the same network path even when a flow-based load balancer is involved.</p> <ul> <li>Third, ICMP mode uses the ICMP sequence number field for its sequence number. However, this would cause the ICMP checksum to change, resulting in a changing flow identifier. To counteract this, Paris traceroute also manipulates the identifier field in the ICMP echo request message to keep the checksum constant throughout the session, an approach conceptually different from the UDP and TCP modes.</li> </ul> <h3 id="limitations">Limitations</h3> <p>There are several limitations of Paris traceroute:</p> <ul> <li>Per-packet load balancing forwards each packet independently, so there is no consistent flow for Paris traceroute to track.</li> <li>Dynamic routing can affect the path a flow takes if there is any change in the routing table and next-hop destinations on a hop-by-hop basis.</li> <li>NAT devices and other middleboxes can cause Paris traceroute to infer incorrect results.</li> <li>Paris traceroute isn’t able to accommodate faulty network devices and code bugs that can cause anomalous outputs in a trace.</li> <li>It’s more complex than classic traceroute and might require more understanding to interpret the results.</li> <li>Like any active measurement tool, it generates additional traffic, which might be a consideration in bandwidth-sensitive networks.</li> <li>Paris traceroute sends only one probe per hop, so it hides diamonds, which is an important limitation to keep in mind.</li> </ul> <h3 id="understanding-the-output">Understanding the output</h3> <p>The output of Paris traceroute is similar to classic traceroute in that it includes route information – in other words, the path(s) taken by the packets, including hop-by-hop information. A significant difference (and improvement) is that it also detects and reports on multiple paths when they exist due to routing policies or load balancing.</p> <p>The specific information reported when executing a Paris traceroute to a destination includes:</p> <ol> <li><strong>List of hops</strong>: Each line in the output represents a layer three hop along the path from the source to the destination.</li> <li><strong>IP addresses</strong>: For each hop, the IP address of the intermediary device (router, L3 switch, firewall, etc.) is displayed.</li> <li><strong>Round-trip time (RTT)</strong>: The output usually shows three round-trip times for each hop, representing the time it takes for a packet to go from the source to that hop and back.
RTTs are measured in milliseconds.</li> <li><strong>Possible path variations</strong>: Since Paris traceroute is designed to detect different paths taken by packets due to load balancing, the output displays multiple paths or varying IP addresses for the same hop number across successive traceroute executions.</li> <li><strong>Destination reach</strong>: The final line in the output shows the destination IP address and the round-trip times to it.</li> </ol> <p>An example output can look like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/DTrwA0pjUToP51jVq0YvM/f8c14b16d3175952e52d50b1c29a43ec/exampe-output.png" style="max-width: 800px;" class="image center" alt="Example output" /> <p>The image above shows each individual hop by IP address and the RTT for each probe sent for that specific next hop. We use these times to determine latency between hops and between source and destination.</p> <h3 id="verifying-network-paths-over-the-public-internet">Verifying network paths over the public internet</h3> <p>Load balancing is very common on the public internet, so Paris traceroute (among several other traceroute versions) has become a primary method for tracing a path from source to destination, even on a global scale.</p> <p>In the image below from the Kentik Portal, we are tracing a path between a source and a destination – in this case, from two instances in AWS to a destination target. You can see that by using Paris traceroute, we can trace the multiple paths between a source and destination over the public internet.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1ff8rwMLHw9sde8NqGWebA/8eb9769cf89898f7a0e36cf7e412e2d4/traceroute-portal1.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Traceroute in the Kentik portal" /> <p>If we drill down into each node, we can also see the gathered metrics for packet loss, network latency, and jitter for each individual hop in the path.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4CgnZOFw9t01d0wJZ9jImw/7915a29d717a8f13e7ae091d324c630e/traceroute-portal-detail.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Drill down into traceroute details" /> <p>The graphics above are built on the raw data underlying every trace and every probe in that trace, which you can see in the image below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2l7qGJnuznj4x0vhUyos9C/7017612ce6e3aa9ab69b2128acbe1386/traceroute-portal-raw-data.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Traceroute with raw data" /> <h2 id="conclusion">Conclusion</h2> <p>From its origins in the 1980s to the versatile tool it is today, traceroute remains a vital tool for network engineers and system administrators interested in tracing the paths applications take over a network. Indeed, the way we consume applications today, including our complete reliance on the public internet, means that to troubleshoot delivery and performance problems successfully, we need an accurate understanding of the paths our applications take, now more than ever.</p> <p>Especially in an environment where flow-based load balancing is the norm, Paris traceroute has become the de facto path-tracing solution for inferring network topology on networks we don’t own or manage.
Often replacing classic traceroute entirely, Paris traceroute helps engineers address anomalies, false paths, and the other limitations of its predecessor in load-balanced networks.</p> <h2 id="related-listening">Related listening</h2> <p>Check out my interview with Dr. Justin Rohrer about Paris traceroute in episode 32 of the <a href="https://www.kentik.com/telemetrynow/s01-e32/" title="Telemetry Now Podcast: Using Paris Traceroute for Modern Load-Balanced Networks">Telemetry Now podcast, “Using Paris Traceroute for Modern Load-Balanced Networks”</a>.</p><![CDATA[AWS re:Invent re:Cap]]><![CDATA[Despite the siren song of AI in the keynotes, visitors were far more focused on solving real-world problems. These are the issues that have plagued IT practitioners for years, if not decades: troubleshooting and validating performance and availability of their applications, services, and infrastructure.]]>https://www.kentik.com/blog/aws-re-invent-re-caphttps://www.kentik.com/blog/aws-re-invent-re-cap<![CDATA[Leon Adato]]>Thu, 14 Dec 2023 05:00:00 GMT<p>The first two demos are the most terrifying.</p> <p>After arriving at the booth, unpacking my stuff, and looking for the first cup of coffee, I spent a lot of time wondering what questions I’d be asked.</p> <blockquote><em>Side note: In my (not so) humble opinion, an organizer’s respect for sponsors is directly proportional to the number of coffee stations available to attendees before the expo floor opens.</em></blockquote> <p>And when the floor opens and attendees come flooding in, searching for answers and swag in equal measure, it takes at least a couple of demos before I find my groove, understand the major questions and themes attendees have on their minds, and overcome my imposter syndrome.</p> <p>After that, it’s just a matter of listening, clarifying, and then showing off Kentik so others have the same sense that I do of how awesome a solution it is.</p> <p>My initial event demo jitters aside, though, what else did I find notable about this year’s AWS re:Invent?</p> <img src="//images.ctfassets.net/6yom6slo28h2/D8cLUUYBxZWJRhL1wWn31/a04d7e8435d07feea8ce6a98f088fde6/aws-reinvent-booth.jpg" style="max-width: 600px;" class="image center" alt="Kentik booth at AWS re:Invent 2023" /> <div class="caption" style="margin-top: -35px;">Leon Adato (center, green lanyard) and Phil Gervasi (front) at the Kentik booth</div> <h2 id="the-biggest-aws-reveal">The biggest AWS re:Veal</h2> <p>As my colleague Justin commented <a href="https://ryburn.org/2023/12/07/aws-reinvent-2023-a-look-into-the-future-of-cloud-technology/">over on his blog</a>:</p> <blockquote>"AWS showcased significant updates and improvements in its AI and ML services. Advancements in natural language processing, neural nets for machine learning, and more intelligent tools for developers were the key highlights."</blockquote> <p>Of course, <a href="https://www.lastweekinaws.com/blog/aws-degenerative-ai-blunder/">not everyone</a> has been impressed with what AWS had to show or how Amazon Q is comporting itself.
While I’m personally inclined to give products the benefit of the doubt when they’re in the initial release stages, I also understand that the levels of both hype and forced integration Amazon has used to push Q on the public leave little room for that kind of generosity.</p> <h2 id="reflecting-on-whats-required">re:Flecting on what’s re:Quired</h2> <p>Despite the siren song of AI in the keynotes, visitors to the Kentik booth were far more focused on solving real-world problems. These are the issues that have – with minor variations – plagued IT practitioners for years, if not decades: troubleshooting and validating the performance and availability of their applications, services, and infrastructure.</p> <p>To be sure, there are differences. What qualifies as “infrastructure” today is vastly more complex and nuanced than in the past. A far more varied range of elements must work in concert – from “traditional” network infrastructure to cloud-based platforms, containers and orchestration, microservices and external APIs, and more.</p> <p>But even so, we’re still often asking the same questions at the end of the day. And honestly, that’s OK.</p> <p>As my buddy Phil Gervasi wrote in <a href="https://www.linkedin.com/posts/phillipgervasi_awscloud-aws-awsreinvent-activity-7136044553811161088-hQZd">his own review</a>:</p> <blockquote>"...it's important we keep that discussion in the context of using generative AI to solve problems rather than awkwardly look for problems to solve that could have been solved more simply."</blockquote> <h2 id="review-renew-relax-and-recover">re:View, re:New, re:Lax, and re:Cover</h2> <p>It’s always important to put a conference into its proper context: Spread across six separate casinos (Caesars, Encore, Mandalay Bay, MGM Grand, The Venetian, and Wynn), over 65,000 attendees attempted to jam in over 2,000 technical sessions and keynotes, and also visit as many of the 400+ vendors on the expo floor.</p> <p>It’s an impossible ask, even in a city like Las Vegas, where time has no meaning. Even for experienced conference attendees, that level of frenetic energy is hard to ignore.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/huzo54rzfb" title="AWS re:Invent pirouette view" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>This is why – even a couple of weeks later – I’m still processing the experiences and ideas I collected.
Nevertheless, I’m grateful to the people who stopped in and said hello, and I’m already looking forward to next year’s event.</p> <p>First demo jitters and all.</p><![CDATA[Kentik Expands Hybrid Cloud Observability with OCI Support]]><![CDATA[Kentik now provides network insight into Oracle Cloud Infrastructure (OCI) workloads, allowing customers to map, query, and visualize OCI, hybrid, and multi-cloud traffic and performance.]]>https://www.kentik.com/blog/oracle-cloud-infrastructure-support-in-kentik-cloudhttps://www.kentik.com/blog/oracle-cloud-infrastructure-support-in-kentik-cloud<![CDATA[Rosalind Whitley]]>Wed, 13 Dec 2023 05:00:00 GMT<p>Navigating the oceans of hybrid and multi-cloud computing requires a compass. As software infrastructure has evolved with abstractions up the stack, the network has remained as the critical foundation that connects distributed apps and services to each other, and to users. So network observability is your compass through rising complexity, a strategic asset empowering enterprises to make decisions with confidence and clarity. With today’s announcement of Oracle Cloud Infrastructure (OCI) support in Kentik Cloud, Kentik delivers understanding of all your critical cloud workloads.</p> <img src="//images.ctfassets.net/6yom6slo28h2/IWDFWFncW5M3ACfKrguC6/519ec30d94bc3f185b4c4e6582c0b0b6/oci-topology-migration.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Oracle Cloud Infrastructure topology in the Kentik platform" /> <h1 id="why-oci-observability">Why OCI observability?</h1> <p>OCI’s comprehensive cloud services suite offers robust compute, storage, and networking capabilities. It’s non-trivial to ensure performance, security, and cost efficiency when migrating or expanding infrastructure to leverage these OCI capabilities. Network observability helps by reducing risk, improving outcomes, and minimizing the impact of common, costly challenges such as:</p> <ul> <li>Prolonged, unexpected migration downtime due to forgotten services or infrastructure</li> <li>High, unexplained egress traffic costs driven by elephant flows</li> <li>Compromised data when development-phase security policy mistakenly ends up in production</li> </ul> <p>Network traffic analysis helps engineering teams understand which component microservices map to target applications. Live network traffic visualization makes the routes connecting internal and third-party services to deliver end-user experiences instantly recognizable. Flexible query capabilities quickly reveal exactly who is talking to whom at any given moment.</p> <p>Kentik, the leader in network observability, answers all of these questions, providing deep insights into network performance, security, and traffic patterns. Use Kentik Cloud’s new OCI support to:</p> <ul> <li>Collect, analyze, and visualize flow logs generated on OCI in Kentik’s Data Explorer</li> <li>Automatically visualize OCI and hybrid topology with the new Kentik Map for OCI</li> <li>Rapidly answer any question about network traffic between OCI and AWS, Google Cloud, Azure, data centers, SD-WAN, or the internet</li> </ul> <h1 id="oci-and-multi-cloud-network-visibility-in-the-enterprise">OCI and multi-cloud network visibility in the enterprise</h1> <p>Modern enterprises may not always enter into multi-cloud architectures intentionally, but these distributed environments are becoming a necessity for reliability, cost control, and business continuity. With this necessity comes complexity. 
Managing networks across multiple cloud providers can lead to expensive incidents, inefficiencies, security vulnerabilities, and increased costs. Enterprises looking to optimize their cloud infrastructure need visibility into these distributed networks to mitigate the risks.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3haYoBvxTW9rQFNYMX5kDK/64f8b6cb4e5e9f090071bdd432978f41/oci-health-v2.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Oracle Cloud Infrastructure (OCI) support in the Kentik platform" /> <p>Our customers employ OCI, now made visible by Kentik Cloud, in several critical use cases:</p> <ul> <li><strong>Data analytics and big data processing</strong>: Enterprises dealing with large volumes of data can leverage OCI’s powerful compute and storage capabilities, along with Kentik Cloud’s ability to monitor and analyze network traffic, to ensure optimal performance across data pipelines.</li> <li><strong>Reliable storage at lower cost</strong>: Sensitive data required or generated by large-scale workloads often demands secure, scalable storage and fast retrieval, and OCI delivers. Kentik Cloud provides insights into traffic patterns to and from OCI workloads, helping optimize resource allocation and improve user experience.</li> <li><strong>Disaster recovery and business continuity</strong>: With the combined strength of OCI’s robust infrastructure and Kentik Cloud’s traffic analysis, enterprises can use OCI to host mission-critical resources such as databases that require high availability, and ensure disaster recovery plan effectiveness through testing.</li> </ul> <p>The ability to leverage OCI to distribute hybrid and multi-cloud workloads for greater reliability and performance at lower cost hinges on the team’s ability to respond to incidents and plan migrations with full understanding of the infrastructure. Delivering the best experience to end users within split seconds still comes down to the network. This means rapid troubleshooting across complex distributed systems when issues inevitably arise. Hybrid and multi-cloud engineers use network-first observability as a critical tool for maximizing ROI.</p> <h1 id="the-correlation-advantage">The correlation advantage</h1> <p>The correlation of traffic analysis, infrastructure metrics, and synthetic testing that Kentik enables significantly reduces incident resolution times. By providing a granular view of network traffic in the same platform as monitoring and on-demand testing, Kentik enables engineering teams to quickly identify and address issues from the network layer on up, minimizing downtime and its associated costs. This level of analysis is invaluable in today’s fast-paced business environment, where every minute of downtime can have substantial financial repercussions.</p> <p>One of the less talked about but equally important benefits of network-first observability is the reduction in incident-related stress. Network issues can be incredibly stressful for engineering teams, often requiring hours of painstaking analysis to resolve. Kentik reduces the burden that context-switching between tools and resorting to tedious processes like TCP dumps puts on teams.
This frees engineers up to do their best work – including platform optimization, which helps development teams become more productive in turn.</p> <h1 id="oracle-cloud-observability-available-now">Oracle Cloud observability: Available now</h1> <p>Harness the power of advanced network intelligence to optimize your entire cloud infrastructure and do more with your team’s time. To learn more, visit our <a href="/solutions/oracle-cloud-infrastructure-observability/">OCI page</a>. Ready to observe all of your networks? <a href="https://www.kentik.com/get-started/">Try Kentik free for 30 days</a>, or <a href="https://www.kentik.com/go/get-started/demo/">contact our team</a> for a personalized demo.</p><![CDATA[How to Analyze Subscriber Behavior with Kentik]]><![CDATA[Learn how to analyze subscriber behavior using Kentik. In this post, we focus on the challenges and solutions of identifying and tracking the customers in an IP network while complying with regulations such as GDPR, show how Kentik Custom Dimensions and Data Explorer provide the analysis, and finally touch on how the associated APIs help automate and ease the entire process. ]]>https://www.kentik.com/blog/how-to-analyze-subscriber-behavior-with-kentikhttps://www.kentik.com/blog/how-to-analyze-subscriber-behavior-with-kentik<![CDATA[Nina Bargisen]]>Tue, 12 Dec 2023 05:00:00 GMT<h2 id="introduction">Introduction</h2> <p>Running an access network? Knowing what your customers are doing with the product or service they buy from you is key to success – a universal truth that applies to almost every business, but maybe even more so for those who sell access to the internet.</p> <p>In <a href="https://www.kentik.com/blog/analyzing-the-traffic-patterns-of-broadband-subscriber-behavior/">this previous blog post</a>, we discussed the various learnings of broadband subscriber behavior analysis and its uses. Here we will explore how Kentik can help perform these analyses.</p> <h2 id="the-challenge-identifying-the-subscriber">The challenge: Identifying the subscriber</h2> <p>The first step in analyzing customer behavior is identifying the subscriber. In IP networks, the usual method is through the IP address assigned to the customer’s connection. However, since these IP addresses are often dynamically assigned, the challenge is tracking which customer is using which IP address at any given time. Once you decide to track, it is crucial to do so in a way that complies with the regulations in place where your network operates. For example, inside the EU these actions fall under GDPR regulation. At a high level, this means that:</p> <ol> <li>You must be transparent with the customer that you do this.</li> <li>The customer must be able to refuse to let their data be included.</li> <li>They must be able to get the data you collect about them.</li> </ol> <h2 id="address-assignments-and-logging">Address assignments and logging</h2> <p>No matter what method you use to assign IP addresses to your customers – DHCP, RADIUS, PPPoE, or combinations – the servers involved will have the ability to log which customers are assigned which IP address. So you can use the logs to build a system where a customer’s current IP address is tracked and where changes trigger an update to Kentik, ensuring that traffic flows are attributed to the right customer. Part of this system should also include anonymization of the customer ID, so the privacy of the individual users is protected.</p> <p>A practical and secure method to achieve this is through hashing of customer IDs. Hashing transforms the original customer ID into a unique, fixed-size string of characters, which is nearly impossible to reverse-engineer.</p>
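<p>As a minimal sketch of what this anonymization step might look like, here is one way to generate a stable pseudonym per customer. The choice of a keyed hash (HMAC-SHA256), the truncation length, and the output format are illustrative assumptions, not a Kentik requirement:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import hashlib
import hmac

# Assumption: the key is stored in a secrets manager, never alongside the data.
SECRET_KEY = b"replace-with-a-securely-stored-secret"

def anonymize_customer_id(customer_id: str) -> str:
    # Keyed hash: without the key, the pseudonym cannot be rebuilt even
    # from a complete list of known customer IDs.
    digest = hmac.new(SECRET_KEY, customer_id.encode("utf-8"), hashlib.sha256)
    return "USER" + digest.hexdigest()[:8].upper()  # truncation length is illustrative

print(anonymize_customer_id("subscriber-000042"))  # same ID always yields the same pseudonym</code></pre></div> <p>Because the same input always yields the same pseudonym, flows remain attributable to a single (anonymized) subscriber across queries, while the keyed construction protects the mapping even if someone has both the flow data and the full list of customer IDs.</p>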
<p>By replacing the actual customer IDs with their hashed counterparts before they are stored or processed, you can ensure that all subsequent analyses are conducted on anonymized data. This approach not only enhances data security but also aligns with privacy regulations like GDPR, ensuring that the customer’s privacy is maintained without compromising the integrity of the analysis.</p> <h2 id="custom-dimensions">Custom dimensions</h2> <p>Kentik’s custom dimensions feature is the key to this approach. Custom dimensions allow you to add custom columns to your organization’s main tables in the Kentik Data Engine. The Kentik Data Engine is where the flow data is enhanced and stored so it can be queried for all the applications of the data the system offers.</p> <p>Like Kentik-provided dimensions, a custom dimension may be used for group-by in a query or as a filter in a filter group.</p> <h2 id="implementation">Implementation</h2> <p>A dimension in the Kentik Data Engine can be viewed as a column in the main table of the database. A Custom Dimension is populated by a set of rules, called “Populators,” and it is the ability to specify these yourself that defines the “custom” in Custom Dimensions.</p> <p>A Populator consists of a “value” and a set of criteria that are matched against predefined fields of each flow to determine whether the value should be assigned to the flow when it is ingested into Kentik.</p> <p>In this specific case, we will use the hashed CustomerID as the “value” for a Populator for each customer and add the corresponding IP address to the “address” field. We will also specify that the direction of the flow can be either destination or source.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5Gp296WJ9fq2wuXOAzDfsn/6eca2de19c05d788f1b90d7f9c315223/add-populator.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Populators in Custom Dimensions" /> <p>But how would this work in real life? Even with a small customer base using static addresses, it would be a tedious task to configure one populator per customer using the UI. Like many other tasks, this can be automated by using the Kentik API.</p> <h2 id="automated-management-via-the-api">Automated management via the API</h2> <p>The Kentik API offers a number of methods that enable programmatic control of Kentik. In this case, we need to use one of the Customization APIs, namely the Batch API.</p> <p>This API supports batch updates of flow tags or populators for Custom Dimensions, and we will dive into its use for updating populators.</p> <p>The Batch API is constructed to make it easy to do batch updates, which is exactly what this scenario requires: keeping a very large dataset updated – the mapping of hashed CustomerIDs to IP addresses.</p> <p>The Batch API uses a single POST method called “Batch Request” to add, update, or delete a set of populators and a GET method called “Batch Get Request” to retrieve status updates on the progress of the batch operations.</p> <p>For the create/update/delete method, multiple requests – each referred to as a “part” – are supported within one batch operation. This way, very large datasets can be managed in a staged and controlled manner.</p>
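<p>As a sketch of how batch parts like the JSON sample below might be submitted with Python’s requests library – the endpoint path, dimension name, header values, and the <code>guid</code> handling here are assumptions for illustration; consult the Kentik Knowledge Base for the exact URL and semantics for your tenant:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import requests  # third party: pip install requests

# Assumptions for illustration: the endpoint path and the dimension name
# ("c_subscriber") are placeholders; check the Kentik API docs for your tenant.
API_URL = "https://api.kentik.com/api/v5/batch/customdimensions/c_subscriber/populators"
HEADERS = {
    "X-CH-Auth-Email": "api-user@example.com",
    "X-CH-Auth-API-Token": "your-api-token",
    "Content-Type": "application/json",
}

def send_batch_part(upserts, deletes, guid=None, complete=False):
    """Send one part of a batch; set complete=True on the final part."""
    body = {"replace_all": False, "complete": complete,
            "upserts": upserts, "deletes": deletes}
    if guid:
        body["guid"] = guid  # ties follow-up parts to the same batch operation
    resp = requests.post(API_URL, headers=HEADERS, json=body)
    resp.raise_for_status()
    return resp.json().get("guid")

# First part opens the batch and returns its GUID; the final part closes it.
guid = send_batch_part(
    upserts=[{"value": "USERZUDW4GB0", "criteria": [{"addr": ["1.2.3.4"]}]}],
    deletes=[])
send_batch_part(upserts=[], deletes=[{"value": "USERU7EEATTH"}],
                guid=guid, complete=True)</code></pre></div> <p>The “Batch Get Request” method mentioned above can then be polled with the returned GUID to confirm when all parts have been processed.</p>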
<p>The POST method requires you to specify which criteria need to be matched for a specific value to be assigned to a flow when it is ingested into the platform. It is constructed such that there is no need for you to keep track of populator IDs; instead, it matches on the values to identify the populator that needs to be created, modified, or deleted. You only need to keep track of the Batch Global Unique Identifier (GUID) so you can control which parts belong to which batch request and follow its progress.</p> <p>A sample of the JSON to do this could look like this:</p> <div class="gatsby-highlight" data-language="json"><pre class="language-json"><code class="language-json">{
  "replace_all": false,
  "complete": true,
  "upserts": [
    {
      "value": "USERZUDW4GB0",
      "criteria": [
        { "addr": ["1.2.3.4"] }
      ]
    },
    {
      "value": "USERV0R2OOLU",
      "criteria": [
        { "addr": ["1.2.3.7"] }
      ]
    }
  ],
  "deletes": [
    { "value": "USERU7EEATTH" }
  ]
}</code></pre></div> <p>Note that the direction is not specified. The default value is “either,” which is exactly what we need for this use case.</p> <h2 id="data-analysis-and-application">Data analysis and application</h2> <p>Once the flow data is enhanced with hashed customer IDs, you can start the analysis. Kentik’s Data Explorer is the tool for this, capable of creating complex queries.</p> <h3 id="example">Example</h3> <p>An example query could be identifying the top 50 users of the top 100 most-used OTT services. Such insights can be instrumental for tailored customer outreach and special offers based on their OTT service usage.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3kCDdYKPfA6a5XDO9j0X0Z/e0521fddbbefe78f093af101f90beea4/ott-users1.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Top 50 users of top OTT services" /> <p>Once you are happy with the results, you can use the API to automate the queries and retrieve the data for post-processing – for example, creating an outreach to customers to offer products based on their OTT service use.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1PfsYQOal5paUYdDgcdirt/ad0c399ee76d76104f5716a42104480a/ott-users2.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Automation for data queries" /> <img src="//images.ctfassets.net/6yom6slo28h2/6Uk7jZIK79FO1xAjCzemfI/b403a5bb7342e9b22c5f96c1cf5472f8/data-app-call.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Data API call via cURL" /> <h2 id="need-more-analysis">Need more analysis?</h2> <p>While Kentik’s Data Explorer is powerful, some analyses might require more complex statistical calculations. For such cases, the <a href="https://www.kentik.com/product/firehose/">Kentik Firehose feature</a> allows for the export of data from the Kentik Data Engine to external analytics systems or data lakes, facilitating deeper analysis.</p> <h2 id="conclusion">Conclusion</h2> <p>In summary, analyzing subscriber behavior using Kentik offers a comprehensive and detailed approach to understanding how customers use their internet access.
By effectively identifying subscribers through dynamic IP address tracking and ensuring compliance with regulations like GDPR, the operator can gather essential insights while respecting customer privacy and use these to find new revenue streams and enhance the customer experience.</p><![CDATA[Nov 2023: What’s New at Kentik?]]><![CDATA[Have you ever sat down and thought to yourself, "What's new at Kentik?" Well, you're in luck! Welcome to the first edition of a blog series to highlight the best and brightest of Kentik that month.]]>https://www.kentik.com/blog/nov-2023-whats-new-at-kentikhttps://www.kentik.com/blog/nov-2023-whats-new-at-kentik<![CDATA[David Hallinan]]>Thu, 30 Nov 2023 05:00:00 GMT<p>We’re probably all familiar with the oft-repeated adage: knowledge is power. It’s relatively simple to understand. The more information we possess on a given subject, the better we can make informed decisions about it.</p> <p>To further the pursuit of knowledge for you, the reader, we decided we need a good place for you to learn about everything we’ve been up to (and there has been a lot). There are the <a href="https://new.kentik.com/">product updates</a> and the <a href="https://www.kentik.com/media-coverage/">media coverage</a> pages, but they are not enough. The people demand a fun, easily digestible blog aggregation of everything newsworthy from Kentik! So, to rectify these wrongs, welcome to our newest monthly series, “What’s New at Kentik?”</p> <p>This series will cover some of the best, brightest, and most interesting things to happen at Kentik in any given month. So, without further ado, what’s new at Kentik?</p> <h2 id="video-killed-the-radio-button-star">Video killed the radio (button) star</h2> <p>Some people don’t like to read. Blame it on cell phones, TikTok, or anything you want, but we all have the right to consume content in any way our heart desires. (This is in no way an endorsement of not reading. Read a book. It’s good for you.)</p> <p>But with the aim of giving our audience a new outlet for Kentik news, we’re beyond thrilled to announce a new, ongoing video series of the same name as this blog: What’s New at Kentik?</p> <p>In our new monthly series, <a href="https://www.linkedin.com/in/leonadato">Leon Adato</a> walks you through the latest and greatest from Kentik in a hilarious, informative, poignant, and, most importantly, fun way. We present to you the first episode of What’s New at Kentik:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/ga97i9p0r0" title="What's New at Kentik with Leon Adato" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>Be sure to <a href="https://www.youtube.com/channel/UCsTYhGUm81x6m6lQ-QUwfKA">subscribe to Kentik’s YouTube channel</a> to ensure you don’t miss our future episodes.</p> <h2 id="gleaming-the-kentik-kube">Gleaming the Kentik Kube</h2> <p>Kentik is <em>the</em> network observability company. Kubernetes is part of the network. 
Ergo, Kentik has been working hard to ensure we can provide comprehensive coverage for our cloud native compatriots. (We also work hard on alliteration.)</p> <p>So, we are excited to have <a href="https://www.kentik.com/blog/introducing-kentik-kube/">launched</a> our newest offering this month – <a href="https://www.kentik.com/solutions/kubernetes-networking/">Kentik Kube</a>, an industry-first solution that provides network insight into Kubernetes workloads, revealing the routes container traffic takes through data centers, public clouds, and the internet.</p> <p>Network and infrastructure teams can finally gain full visibility of network traffic within the context of their cloud and on-prem K8s deployments, enabling them to quickly troubleshoot performance problems and optimize costs.</p> <p>Kentik CEO <a href="https://www.linkedin.com/in/avifreedman">Avi Freedman</a> had the privilege of sitting down with Techstrong TV at KubeCon 2023 to discuss the launch of Kentik Kube <a href="https://techstrong.tv/videos/kubecon-cloudnativecon-north-america-2023/avi-freedman-kentik-kubecon-cloudnativecon-north-america-2023">in this extended interview</a>.</p> <p>Still want more Kube content? Watch <a href="https://www.linkedin.com/in/phillipgervasi">Phil Gervasi</a> demonstrate exactly how <a href="https://www.kentik.com/go/video/kentik-kube/">Kentik Kube can troubleshoot container latency, enforce security monitoring policies</a>, and more.</p> <p><a href="https://www.kentik.com/go/video/kentik-kube/"><img src="//images.ctfassets.net/6yom6slo28h2/6N4FZP3uJMTzjRtEWb07h2/29b61856a14d632831fdeebc3e0ae9e5/kentik-kube-video-phil.png" style="max-width: 600px;" class="image center" thumbnail alt="Phil Gervasi demoes Kentik Kube" /></a></p> <h2 id="rbac-to-the-future">RBAC to the future</h2> <p>You asked for it, and we delivered.</p> <p>Kentik has now rolled out a long-awaited feature to our platform and admin user base this month – Role Based Access Control (RBAC), which provides a more granular approach for Kentik admins to manage user access and permissions to Kentik features and settings. Get all the details about Kentik’s initial release of RBAC and how it will eventually replace our previous permission system with our <a href="https://new.kentik.com/role-based-access-control-is-live-3R1D68">product update blog post</a> and Kentik <a href="https://kb.kentik.com/v4/Cb32.htm">Knowledge Base (KB) article</a>.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1gats1E3woqpFJYNtvMHN6/6ea9df98df37a0ac0438d393f8330df3/rbac-screenshot.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="RBAC in the Kentik portal" /> <h2 id="thats-whats-new-at-kentik">That’s what’s new at Kentik</h2> <p>So, there you have it. November of 2023 was a busy month for us here at Kentik.
There’s so much more that we could have covered here, from <a href="https://www.futuriom.com/articles/news/itential-kentik-partner-on-network-automation/2023/11">new partnerships</a> to <a href="https://www.businesswire.com/news/home/20231102644004/en/Kentik-and-nLogic-Join-Forces-to-Deliver-Network-Observability-to-Nordic-Service-Providers-and-Enterprises">other new partnerships</a> to <a href="https://www.techtimes.com/articles/296960/20230928/top-5-solutions-network-observability-2023.htm">Kentik being named the #1 solution for network observability in 2023</a>.</p> <p>We can’t wait to show you some exciting things we plan to release soon, so be sure to subscribe to <a href="https://www.kentik.com/blog/">our blog</a> so you don’t miss out on any upcoming What’s New at Kentik content.</p> <p>Til next month, network enthusiasts!</p><![CDATA[What’s Stopping the Adoption of Network Automation?]]><![CDATA[Is there a gap between the potential of network automation and widespread industry implementation? Phil Gervasi explores how the adoption challenges of network automation are multifaceted and aren’t all technical in nature.]]>https://www.kentik.com/blog/whats-stopping-the-adoption-of-network-automationhttps://www.kentik.com/blog/whats-stopping-the-adoption-of-network-automation<![CDATA[Phil Gervasi]]>Wed, 29 Nov 2023 05:00:00 GMT<p><a href="https://networkautomation.forum/eventinfo">AutoCon 0</a> was a pivotal event in the network engineering world. Most attendees were involved with networking in some meaningful way, and everyone I spoke with was interested in increasing the adoption of network automation. It was a fantastic event to attend.</p> <img src="//images.ctfassets.net/6yom6slo28h2/42ACF2sUKQtbXslHdVrWNy/311b54eb75ed5411961af7b60516dc47/autocon.png" style="max-width: 400px;" class="image right" alt="AutoCon 0 event logo" /> <p>With the recent NetDevOps Days London and New York and now AutoCon, it’s fantastic to see a completely focused event for the networking community to gather and discuss methods, tools, workflows, and lessons learned of practical network automation. Specifically, AutoCon 0 was a laser-focused event, and the desire to embrace <a href="https://www.kentik.com/kentipedia/network-automation/" title="Kentipedia: Network Automation">network automation</a> was palpable among attendees. However, despite recent technical advancements and growing excitement, a recurring theme at the conference was the noticeable gap between the potential of network automation and its actual implementation in the industry.</p> <div as="WistiaVideo" videoId="93u2jcczdu" audio></div> <h2 id="a-network-automation-reality-check">A network automation reality check</h2> <p>AutoCon 0 brought together practitioners new to network automation and experts implementing complex workflows. However, according to a show of hands in one of the sessions, most attendees were not implementing network automation practices in any meaningful way.</p> <p>But why? We’ve been talking about network automation for years, so why isn’t the networking industry farther along with mass adoption?</p> <p>Over the last decade, innovation in network automation has involved fine-tuning automation workflows, creating device templates to make configurations more consistent and easier to automate, and creating integrations with network-adjacent services like DNS, IPAM, CRMs, etc. Network vendors have also been busy making their devices more accessible via NETCONF and REST APIs. So, people are trying, and vendors are responding. 
Nevertheless, adoption has yet to take off like we expected.</p> <h2 id="whats-the-network-automation-holdup">What’s the network automation holdup?</h2> <p>The stumbling blocks for the adoption of network automation are multifaceted and aren’t all technical in nature.</p> <h3 id="1-complexity-of-network-systems">1. Complexity of network systems</h3> <p>One of the most significant barriers is the inherent complexity of today’s networks. Many organizations still operate legacy systems that are not readily compatible with modern automation methods and solutions.</p> <p>We also have the complexity of modern networks, even if the systems are cutting-edge. A modern network includes campus devices from various vendors, abstractions and overlays, network-adjacent devices, cloud and container networking, and so on. Automating an infrastructure when that infrastructure is both complex and only partially owned by the operator is a stumbling block to getting started.</p> <p>And as much as we’d like to deny it, every network is still a snowflake. Even if we reach the point where networks are 80% the same from organization to organization, there will always be that unique use case, one-off problem, or bespoke design. The reality is that no one automation technique or method will work for everyone.</p> <h3 id="2-skill-gap">2. Skill gap</h3> <p>Next, network automation requires unique skills that blend traditional networking knowledge with programming ability. In theory, learning a new skill wouldn’t be a stumbling block if network operators working in the trenches had unlimited time, resources, and a reliable and stable network that didn’t need attention.</p> <p>I repeatedly heard that many network engineers got into networking because they hated their 100-level programming classes in college. Networking was a way to stay in the tech field without being a programmer. Therefore, many network engineers have a deep-seated aversion to programming and a reluctance to develop that skill. This has given rise to “codeless” automation platforms such as Itential, Gluware, and Pliant, as well as the developing partnerships between observability and automation platforms.</p> <h3 id="3-cultural-change">3. Cultural change</h3> <img src="//images.ctfassets.net/6yom6slo28h2/1QCZMheQpJU5ANrlqToicJ/a98b02ce6403cfe38dbf8f5f14725c60/autocon-philg-ryburn.jpg" style="max-width: 500px;" class="image right" alt="Justin Ryburn and Phil Gervasi at AutoCon 0" /> <p>Change is hard, especially in large, established organizations. There’s often a cultural resistance to adopting new technologies, particularly when it involves fundamental changes to how networks are managed. The truth is most of us have built our networks, including the internet itself, mostly without network automation. That means justifying the cost and pain of organizational change to leadership is a hard sell. Many network pros are heavily invested in (and tied to) their unique network implementation, and they may feel it’s solely their responsibility to keep it going with their embedded personal knowledge.</p> <h3 id="4-trust-issues">4. Trust issues</h3> <p>We should exercise healthy caution with anything that directly impacts application delivery. However, some engineers have trust issues with handing over configuration changes to a script. I wonder if that’s really warranted, considering we have imperfect people changing the config on the CLI directly. Mistakes certainly happen all the time with legacy network operations.
In fact, more than once, I’ve taken a network down or at least impacted network activity by implementing an incorrect command or some other human error.</p> <p>When network automation is implemented cautiously and as part of an operational workflow, rather than increasing errors, we’ll see <em>fewer</em> errors and more stability. It’s easier to correct errors if you have automation in place. We have version control, a central place to fix typos, etc. Nevertheless, we must still overcome trust issues between engineers and automation processes.</p> <h3 id="5-financial-constraints">5. Financial constraints</h3> <p>The initial investment in automating networks can be substantial. Consider the cost of upgrading skills, possibly hiring new engineers, the loss of production time to implement new workflows, and the licensing cost of new tools and databases. For many companies, the ROI is not immediate, which will lead some decision-makers to forgo implementing network automation altogether and focus on activities with quicker returns.</p> <p>I’d argue this is short-sighted, but it’s still a prevalent stumbling block mentioned several times throughout AutoCon 0.</p> <h2 id="lessons-learned-in-network-automation">Lessons learned in network automation</h2> <p>The lessons were plain to me. I heard them repeatedly in various presentations – even the session on AI and LLMs. Surprisingly, almost all of them relate to culture and people and very little to technology itself.</p> <p><strong>Embrace incremental change</strong></p> <p>All the experts who spoke at the conference emphasized the importance of taking incremental steps towards automation. Automating small, routine tasks can build confidence and demonstrate value, opening the door to bigger, more impactful network automation activities.</p> <p><strong>Invest in training</strong></p> <p>Bridging the skill gap is crucial. The industry needs to invest in training programs that equip network professionals with the necessary skills for automation. Relying only on individuals to independently acquire the skills they need to automate their networks means change will continue to go at a crawl. Organizations need to take professional development more seriously.</p> <p><strong>Culture shift</strong></p> <p>Organizations also need to cultivate a culture open to change and innovation. This involves top-down directives and encouraging a mindset change at all levels, from management to front-line engineers. Our networks have changed dramatically over the last few years, so it doesn’t make sense to operate them the same way we always have.</p> <p><strong>Showcase the long-term benefits</strong></p> <p>Getting around short-sighted leadership is an uphill battle. The value in network automation isn’t purely short-term, so explaining its diverse value will depend on an organization’s culture, goals, and leadership.</p> <p>It can be challenging to communicate the long-term benefits of network automation beyond immediate cost savings, especially when there may not be immediate cost savings. Network automation generally provides little to no short-term gains, so we need to showcase the long-term efficiency, reliability, and scalability improvements.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p>AutoCon 0 was a reminder that while the journey towards widespread network automation may be slower than expected, it’s still very much underway.
Yes, there are challenges and stumbling blocks for leadership and engineers. However, there is a growing excitement and community around network automation that, hopefully, will soon lead to significant changes in how we operate our networks.</p> <p>About a year ago, <a href="https://www.linkedin.com/in/scottrobohn/">Scott Robohn</a> and <a href="https://www.linkedin.com/in/cgrundemann/">Chris Grundemann</a> first had the idea for the <a href="https://networkautomation.forum/">Network Automation Forum</a> and AutoCon 0. Now that the community has rallied around advancing the adoption of network automation, I’m looking forward to the amazing success stories shared at AutoCon 1.</p><![CDATA[Cloud Observer: Subsea Cable Maintenance Impacts Cloud Connectivity]]><![CDATA[In this edition of the Cloud Observer, we dig into the impacts of recent submarine cable maintenance on intercontinental cloud connectivity and call for the greater transparency from the submarine cable industry about incidents which cause critical cables to be taken out of service.]]>https://www.kentik.com/blog/cloud-observer-subsea-cable-maintenance-impacts-cloud-connectivityhttps://www.kentik.com/blog/cloud-observer-subsea-cable-maintenance-impacts-cloud-connectivity<![CDATA[Doug Madory]]>Tue, 28 Nov 2023 05:00:00 GMT<p>In recent weeks, two of the internet’s major submarine cable systems were down for repairs, impacting internet traffic between Europe and Asia. As we’ve pointed out in the past, the major public clouds rely on the same submarine cable infrastructure as us regular internet users, so when a cable incident occurs, cloud connectivity is also affected.</p> <p>In this edition of the Cloud Observer, we’ll take a look at what we observed when <a href="https://www.submarinecablemap.com/submarine-cable/seamewe-5">Sea-Me-We-5</a> (SMW-5) and <a href="https://www.submarinecablemap.com/submarine-cable/imewe">IMEWE</a> were recently down for repairs.</p> <h2 id="cloud-synthetic-measurements">Cloud synthetic measurements</h2> <p>The Cloud Observer is a recurring series focused on analyzing inter-regional cloud connectivity. These posts utilize Kentik’s continuous measurements between our agents in over 110 cloud regions in the world’s largest cloud providers: AWS, Azure, and Google Cloud. Configured in a full mesh, these agents perform ping and traceroute tests to every other agent providing a continuous picture of interconnectivity within the cloud.</p> <h2 id="smw-5-downtime">SMW-5 downtime</h2> <p>Sea-Me-We-5’s name is an acronym that traces the geographic path the cable takes: Southeast Asia (SEA), Middle East (ME), Western Europe (WE). It was designated as ‘ready for service’ (RFS) in 2016 and is the fifth edition of a consortium-owned intercontinental submarine fiber optic cable carrying a large portion of the internet traffic between Europe and Asia.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2T90k8khScBEPDWXGhxXta/c56bd1c2c02f3fe8324ff213dc69b0a6/seamewe-5-path.jpg" style="max-width: 700px;" class="image center" alt="SeaMeWe-5 path" /> <div class="caption" style="margin-top: -35px;">SeaMeWe-5 path, credit: <a href="https://www.submarinecablemap.com">www.submarinecablemap.com</a></div> <p>As I described in a <a href="https://www.kentik.com/analysis/smw-5-maintenance-leads-to-higher-latencies-between-europe-and-asia/">brief post</a> last month, at 00:00 UTC on October 23, SMW-5, the major submarine cable connecting Asia to Europe, was taken down to repair a shunt fault. 
A shunt fault occurs when sea water breaches the cable’s insulation, causing a short circuit in the cable’s electrical feed.</p> <p>Submarine cables contain an electrical system that carries thousands of volts of electricity to power the amplifiers (often referred to as repeaters) that are placed every 70 km or so to maintain signal strength. Normally, a cable can continue operating despite suffering a shunt fault, but must be repaired as soon as possible.</p> <p>My friend <a href="https://www.linkedin.com/in/philippe-devaux-218423199/">Philippe Devaux</a> describes himself as “a keen observer of the global submarine cable systems industry” and has a knack for piercing through the secretive world of submarine cables (occasionally reaching the ship crews themselves) to find insiders willing to confirm details of submarine cable repairs.</p> <p><a href="https://www.linkedin.com/posts/philippe-devaux-218423199_01nov23-seamewe-5-gulf-of-aden-cable-shunt-activity-7125473317381132291-9GEE?utm_source=share&#x26;utm_medium=member_desktop">In a recent LinkedIn post</a> about the SMW-5 repairs, he identified the cable ship (CS) <a href="https://www.marinetraffic.com/en/ais/details/ships/shipid:726403/mmsi:563435000/imo:9063275/vessel:ASEAN_RESTORER">Asean Restorer</a> as the vessel performing the repair of the shunt fault disrupting Europe-to-Asia internet traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/71Liroeu1XfRUE3OtCQCQS/21c826398c513d592aa5e99b0f7df4de/asean-restorer-vessel1.jpg" style="max-width: 700px;" class="image center" alt="Asean Restorer vessel doing repairs on undersea cable" /> <div class="caption" style="margin-top: -35px;">Image credit: Philippe Devaux</div> <p>The loss of the major route between Asia and Europe was picked up in the latency measurements between our agents in cloud regions. Numerous routes experienced latency increases of around 40 ms. The criticality of SMW-5 was evident as all three major public clouds experienced latency changes due to the loss of the cable. Let’s take a look at some of the impacts caught in Kentik’s cloud measurements.</p> <h3 id="amazon-web-services">Amazon Web Services</h3> <img src="//images.ctfassets.net/6yom6slo28h2/5Y4UecHf6dZabO3MdXTqog/f24492d78fbaef1025912b3a649c72a0/latency-aws-frankfurt-sydney.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="AWS latency, Frankfurt to Sydney" /> <h3 id="google-cloud">Google Cloud</h3> <img src="//images.ctfassets.net/6yom6slo28h2/5Tn4z0qek81BgngxzGlZhy/4fa95caad3781860b2e2321db499ba3a/latency-gcp-amsterdam-jakarta.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Google Cloud latency, Amsterdam to Jakarta" /> <h3 id="azure">Azure</h3> <img src="//images.ctfassets.net/6yom6slo28h2/6Ty9aM1QEYIUeYwulWdXgG/6f4f0b5d1173e451da93c6c0b2e221ba/latency-azure-cardiff-singapore.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure latency, Cardiff to Singapore" /> <p>The impacts for AWS and GCP revealed a stable shift in latency, likely owing to a longer geographic path while SMW-5 was undergoing repairs. In contrast, the measurements involving the Azure regions showed periodic latency increases that align with work days, as pictured below.
This may suggest that Azure would shift a portion of its traffic to a longer path only during busy working hours.</p> <img src="//images.ctfassets.net/6yom6slo28h2/Dd9SW4aNebcLr5RsxcjDg/5d995fbe271e7b782b54ddbbed79fa74/latency-azure-amsterdam-canberra.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure latency, Amsterdam to Canberra" /> <p>And, of course, it wouldn’t be a submarine cable analysis without providing an example of a case when latencies <em>improved</em> due to the loss of a submarine cable. The graphic below illustrates the impact of the loss of SMW-5 on the latencies from AWS’s <code>eu-west-3</code> in Paris to Azure’s <code>uaenorth</code> in Dubai – a drop from 114 ms to 90 ms.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5pPMzp9fEjc0ADBfjYlAO/0c5294633d2e7e82bd54b6ec6cda27ea/latency-aws-paris-dubai.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="AWS latency, Paris to Dubai" /> <p>Why would latency drop as a result of the loss of a submarine cable? I addressed this in <a href="https://www.kentik.com/blog/dual-subsea-cable-cuts-disrupt-african-internet/">my blog post</a> from August about the submarine cable cuts off the west coast of Africa:</p> <blockquote>While this may seem counterintuitive, this is a phenomenon I have often encountered when analyzing submarine cable cuts... Essentially a higher latency route becomes unavailable, and traffic is forced to go a more direct path. Why would traffic be using the higher latency path in the first place? What you can’t see in these visualizations are factors like cost.</blockquote> <p>In that example, latencies between Cape Town and Seoul dropped when the WACS cable was cut and AWS had to redirect traffic over a more direct path through the Indian Ocean. In that case, there simply may not be a business case to justify paying a premium to ensure the lowest latency between those two distant locations.</p> <p>That logic would seem to hold less weight in the case of latencies between Dubai and Paris. Network engineers in the Middle East typically try to optimize latencies to Europe, the main source of global transit for the region. In this case, we’re looking at a traffic handoff between two clouds (AWS and Azure). As we will explore in future posts in this series, it is not uncommon to find suboptimal latencies between two clouds each operated by a different networking team with its own interconnection strategy.</p> <h2 id="imewe-downtime">IMEWE downtime</h2> <p>Similar to SMW-5, IMEWE’s name derives from its route, heading east to west: India (I), Middle East (ME), Western Europe (WE). The cable was RFS in 2010.
In a <a href="https://www.menog.org/presentations/menog-9/Doug%20Madory%20-%20Middle%20Eastern%20Latency%20Analysis.pdf">talk I gave</a> at MENOG in Muscat, Oman in 2011, I presented evidence of IMEWE’s activation for the country of Lebanon.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6ohUeWuK2hrfsv8N5Get2T/4eed184e1bec15c3cc3acea9ae606587/imewe-path.png" style="max-width: 700px;" class="image center" alt="IMEWE path" /> <div class="caption" style="margin-top: -35px;">IMEWE path, credit: <a href="https://www.submarinecablemap.com">www.submarinecablemap.com</a></div> <p>In a <a href="https://www.linkedin.com/posts/philippe-devaux-218423199_06nov23-imewe-india-middle-east-western-activity-7127355055858413568-TFNQ">LinkedIn post</a>, Philippe Devaux again identified the vessel performing the repair as the cable ship (CS) Maram, presently in the Gulf of Aden. Philippe added:</p> <blockquote>CS Maram departed Salalah 27OCT23, currently positioned about 60km off Aden, where she has been mostly stationary since 03NOV23, likely busy repairing IMEWE fault. (ETR Expected Time to Repair: 12NOV23)</blockquote> <img src="//images.ctfassets.net/6yom6slo28h2/2QkOGeKglLFckhc2noAcEQ/4b3f005313d585233281ca7646b709c2/imewe-cable-outage-repair.jpg" style="max-width: 700px;" class="image center" alt="IMEWE cable outage repair" /> <div class="caption" style="margin-top: -35px;">Image credit: Philippe Devaux</div> <p>We could see the impacts of this cable maintenance in a couple of places in our tools. Latency between AWS’s <code>eu-south-1</code> in Milan and <code>me-south-1</code> in Manama, Bahrain, dropped by 31 ms during the IMEWE outage.</p> <img src="//images.ctfassets.net/6yom6slo28h2/60rdRfjgKSZoZR9lXK5A1Z/66053c0a6930f7823676ed755799a178/latency-aws-milan-manama.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="AWS latency, Milan to Manama" /> <p>Additionally, <a href="https://kb.kentik.com/v4/Ha04.htm">Kentik Market Intelligence (KMI)</a> reported impacts, including the loss and subsequent return of transit from Telecom Italia Sparkle (AS6762) for <a href="https://www.incpak.com/tech/pakistan-may-experience-slow-internet-due-submarine-cable-outage/">Pakistani incumbent PTCL</a> (AS17557).</p> <p>Below is a visualization of KMI transit data over time showing the disappearance of AS6762 in peach on November 2nd, followed by an increase in transit from TMnet (AS4788 in maroon) before seeing a return of AS6762 transit for AS17557 on the 10th.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/15vCcg3y0HNQUANdgoA4Hk/d2e66f8b8889d23cfb156e78b589a722/kmi-visualization1.png" style="max-width: 700px;" class="image center" alt="Kentik Market Intelligence visualization" /> <p>Incidentally, I have a connection to the cable ship performing this repair. At a submarine cable conference in 2016, attendees were given a tour of the <a href="https://www.telecomreview.com/articles/telecom-vendors/924-e-marine-launched-its-new-cs-maram-cable-laying-vessel-in-dubai-strengthening-regional-presence">newly christened</a> CS Maram, docked in Dubai.
Below is a picture of me in the bay of the CS Maram in front of a <a href="https://www.smd.co.uk/our-products/qtrenchers/qtrencher-400/">QTrencher 400</a>, a massive remotely operated vehicle (ROV) for subsea trenching and cable maintenance.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3VAfoehlYKPsaZ1kuC0f4Y/56d11259f5d65069cc509ff5ef0102d9/doug-madory-cs-maram.jpg" style="max-width: 450px;" class="image center" alt="Doug Madory on the CS Maram" /> <h2 id="a-call-to-action">A call to action</h2> <p>Submarine cables require regular maintenance and repair, and as I mentioned in <a href="https://www.kentik.com/blog/dual-subsea-cable-cuts-disrupt-african-internet/">my post</a> about the submarine cable cuts in Africa: <em>the seafloor can be a dangerous place for cables.</em></p> <p>The loss of submarine cable connectivity can have profound impacts on the international flow of internet traffic. Unfortunately, the submarine cable industry isn’t communicative with the general public about events like this, and that is a problem. Typically, cable operators only report downtime and outages to their direct customers. If those customers fail to notify a broader audience, the general public is left to speculate on the causes of internet outages.</p> <p>In 2017, I wrote a blog post about <a href="https://circleid.com/posts/20170726_telecom_heroics_in_somalia">Telecom Heroics in Somalia</a>, which told the story of how the ISPs in Mogadishu faced down a terrorist threat to activate their first submarine cable landing. In that piece, I also covered an incident where the lack of communication from a cable operator could have gotten people hurt.</p> <blockquote>Somalia held a presidential election earlier this year, and as the candidates were getting ready for their first nationally televised debate, the country’s primary link to the global internet went out. Many Somalis were understandably concerned:</blockquote> <img src="//images.ctfassets.net/6yom6slo28h2/4PkzsEogJBYGUWZjvmjlje/4fe99e8eeb95da2a9a566bcd72ed8ae9/somalia-tweet.png" style="max-width: 400px;" class="image center" thumbnail withFrame alt="Tweet about the Somalia internet outage" /> <blockquote>However, despite its tremendously unfortunate timing, this Internet outage was due to emergency downtime on the <a href="https://www.submarinecablemap.com/submarine-cable/eastern-africa-submarine-system-eassy">EASSy cable</a>, which was needed to repair a cable break that occurred the previous week near Madagascar... Regardless, 12 presidential candidates <a href="https://allafrica.com/stories/201702010708.html">walked out on the debate</a> believing the outage was a political dirty trick.</blockquote> <p>Most recently, YemenNet suffered a <a href="https://twitter.com/DougMadory/status/1723016723459125292">multi-hour outage</a>, leading to speculation that the internet blackout was retaliation for missiles <a href="https://www.timesofisrael.com/missile-from-yemen-intercepted-over-red-sea-as-houthi-chief-vows-to-keep-up-attacks/">fired from Yemen at Israel’s</a> key Red Sea shipping port of Eilat.</p> <p>The speculation led cable owner GCX to publish a rare <a href="https://subsea.gcxworld.com/clarification-yemen-internet-outage/">statement</a> of explanation the following day:</p> <blockquote>With reference to the recent internet outage in Yemen reported in some media outlets, GCX can confirm that a scheduled maintenance event took place on 10th November on the FALCON cable system in conjunction with the Yemen cable landing party.
This maintenance event had been in planning for the past 3 months and was notified to all relevant parties and was completed successfully within the agreed maintenance window.</blockquote> <p>The secrecy around submarine cable maintenance events does not serve the public interest. It is long overdue for the submarine cable industry to develop a practice of widely disseminating (beyond just direct customers) advance notice of any scheduled cable maintenance activity that could lead to a widespread internet blackout.</p> <p>In a time of heightened tensions around the world, not doing so could have grave consequences.</p><![CDATA[Kubecon 2023: Code, Culture, Community, and Kubernetes]]><![CDATA[Kubecon 2023 was more than just another conference to check off my list. It marked my first chance to work in the booth with my incredible Kentik colleagues. It let me dive deep into the code, community, and culture of Kubernetes. It was a moment when members of an underrepresented group met face-to-face and experienced an event previously not an option. ]]>https://www.kentik.com/blog/kubecon-2023-code-culture-community-and-kuberneteshttps://www.kentik.com/blog/kubecon-2023-code-culture-community-and-kubernetes<![CDATA[Leon Adato]]>Tue, 21 Nov 2023 05:00:00 GMT<p>I didn’t find out I was headed to <a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/">Kubecon 2023</a> until the Thursday before the event. Which was, for all you procrastinators out there, still enough time to buy my ticket, find a hotel, and make reservations at <a href="https://miltsbbq.com/">Milt’s BBQ for the Perplexed</a>.</p> <p>While last-minute travel is always a bit of a nail-biter for me, I’m glad I decided to take the risk. The event was both massive and massively impactful in ways I couldn’t have imagined and in ways that no other event has been (at least for me). In this post, I wanted to share some of what I saw, thought, and learned.</p> <h2 id="yes-the-network-is-still-a-thing">Yes, “the network” is still a thing</h2> <img src="//images.ctfassets.net/6yom6slo28h2/6wyPjAeWSW1zNOc309ZHMG/3d9a85bdd85420c20461906981440198/meme-network-still-a-thing.jpg" style="max-width: 500px;" class="image center" alt="meme - network is still a thing" /> <p>I won’t pretend this is the first time I’ve heard the question, but it still brings me up short when I get it.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4sGR0ekLgCuuoaml8TnfAH/b006c30be0adcc0071f4c11b138fbf09/kubecon-booth-visitors.jpg" style="max-width: 400px;" class="image center" alt="Booth visitors at Kubecon" /> <p>People see the big “network observability” banner at the Kentik booth and come up to ask some variation of a question I’ve heard since attending my first DevOpsDays back in 2014:</p> <p><strong><em>“Are networks even really something we need to think about anymore?”</em></strong></p> <p>I understand why they’re asking. Between containers, orchestration, cloud platforms, microservices, and the application itself, there are so many layers of abstraction that the packets, bits, wiring, and routing have all faded into the background.
But it only takes a moment’s introspection to realize that, of course, they are all still there and still matter — both to the IT practitioners who manage that infrastructure and to the applications whose performance (let alone availability) relies on it.</p> <p>Ironically, people also came to the booth to give examples of when the network really <em>was</em> the problem, but they had no way to detect it because that type of data didn’t show up in their tools.</p> <p>My colleague <a href="https://www.kentik.com/blog/author/jryburn/">Justin Ryburn</a> echoed this observation in his <a href="https://ryburn.org/2023/11/16/navigating-kubecon-2023-a-dive-into-the-future-of-container-orchestration/">Kubecon review</a>:</p> <blockquote>Last year, I had to describe what eBPF was and how it might provide value when it comes to understanding the network traffic in a k8s pod. This year, most of the attendees asked me about eBPF.</blockquote> <p>So yes, the network matters.<br> It never didn’t matter.<br> Some folks just forgot.</p> <h2 id="the-kubernetes-connection">The Kubernetes connection</h2> <p>One of the points my colleague Mike Krygeris was quick to make was that every Kubernetes cluster looks like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2WGqz6KOX9ox5QMFAJM3Zx/f270530b6a9264c68769dd239793b3da/kubernetes-deepdive-diagram.png" style="max-width: 500px;" class="image center no-shadow" alt="Kubernetes diagram" /> <div class="caption" style="margin-top: -35px;"><a href="https://itnext.io/kubernetes-network-deep-dive-7492341e0ab5">Image source</a></div> <p>He follows it up with this explanation:</p> <blockquote><p>One original creator of the Calico CNI said, “Every Kubernetes node has a router built right into it. IP Tables.” Kubernetes still uses IP, and BGP is a battle-tested routing protocol with way less overhead than other tunneling techniques. The challenge is that the resources that are connected to that router are very ephemeral, so we need to add context to that in order to get useful observability from it. That context comes from metadata gleaned from the Kubernetes API. When you combine flow logs with all of the associated metadata, you get useful information like what services are being used and which are generating costs.</p> <p>eBPF allows us to add context and information about performance. We are able to add insight like connection latency and TCP retransmission information, which identifies when packet loss is causing application slowdowns.</p> <p>It’s important to recognize that this sort of insight is much harder to obtain further up the stack.</p></blockquote> <p>Mike stressed that folks who need higher performance often replace the kernel’s routing functions with an eBPF XDP data plane. Still, the trade-off is that these replacement data planes no longer work with standard kernel observability tools, and you have to turn to other solutions. An example would be using Hubble for Cilium’s XDP data plane.</p> <p>All of this should give you the context for why we announced <a href="https://www.kentik.com/solutions/kubernetes-networking/">Kentik Kube</a> on the first day of Kubecon.
<em>If</em>, as I’ve already explained, the network is still a thing that matters; <em>and</em>, as Mike has explained, every Kubernetes cluster has a router baked into it:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3QP4CBiqxUtofk2B208v8g/03da078c0f2f31c7c5fd2499e5ae79d7/kube-venn-diagram.png" style="max-width: 400px;" class="image center no-shadow" alt="Kentik Kube Venn diagram" /> <p>Then, it stands to reason that Kentik ought to be right in the middle of that Venn diagram. And that’s what Kentik Kube does: It collects network data from your Kubernetes instances — whether they are on-premises, in the cloud, in multiple clouds, or spanning any or all of those locations — and adds it to the not-insignificant network data already being collected. The result is that it provides you with deep and rich insights into your application architecture, performance, and uptime.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2zDh1NnBy1kgs3ljX41UXL/d0e68f1c1893925e2c6161861da037e8/kentik-kube-2023-shopping-no-header.png" style="max-width: 800px;" class="image center no-shadow" withFrame thumbnail alt="Kentik Kube" /> <h2 id="code-community-culture">Code, Community, Culture</h2> <p>The keynotes across all three days had wonderful speakers and amazing themes — from lessons learned from outages to Destiny O’Connor and Catherine Paganini explaining essential truths about accessibility (more on that in a minute). One of the statements that stood out for me came during the day two keynotes when one of the speakers celebrated the way Kubecon is centered around “Code, Community, and Culture.”</p> <p>If you read my <a href="https://www.kentik.com/blog/the-driving-force-of-community-at-all-things-open-2023/">retro blog from All Things Open</a>, you already know the importance I am placing on building and nurturing a strong community at Kentik and how there will be much more to discuss in the coming weeks and months. What I loved about this framing was recognizing how it all has to fit together. Building a strong community by itself, simply for the sake of <em>having</em> a community, isn’t enough. Nor is it enough to establish and be an exemplar of a particular culture (of curiosity, being data-driven, celebrating achievement, or any other laudable trait) among users or of “code,” which, in the case of Kubecon and the CNCF, translates to the overall product.</p> <p>The lesson I learned is that all three of those things need to be imagined from the very beginning as being woven together into a complementary and harmonious whole — each supporting and supplementing the other. It’s a powerful vision and one which I wholeheartedly agree with and embrace. I also plan to use this as a model here at Kentik. More on that coming soon.</p> <p>It also has to be said that part of the Kubecon 2023 culture was food-focused. Barely a single presentation failed to include images of deep dish pizza, hot dogs, and other well-known Chicagoan culinary delights.</p> <p>Honestly, I left almost every workshop, talk, and keynote hungrier than when I’d sat down.</p> <h2 id="and-communication">And communication</h2> <p>There aren’t many people (especially folks reading this blog, for whom English is a familiar, if not native, mode of communication) who would choose to repeatedly attend conferences where nobody spoke their language, where every session — along with all hallway conversations, vendor interactions, and after-hours events — was foreign.
And to be clear, I’m not talking about “foreign” as in “this is a technology I haven’t worked with,” but <em>completely</em> foreign. Every noun, verb, conjunction, and adjective is unintelligible.</p> <p>Sure, some of us are lucky enough to travel to other countries, where we might experience a bit of a struggle getting through the airport, into our hotel, ordering food, or reading street signs. But once those relatively minor hurdles are cleared and we get to the conference, we usually find our way.</p> <p>However, for the tech workers who are among the 430 million people worldwide with significant hearing impairment (to say nothing of the millions more with audio processing challenges), this is precisely the experience.</p> <p>I learned American Sign Language (ASL) in college and was privileged to meet incredible folks in the Deaf community and count them among my friends to this day. So, this is both a passion (accessibility) and a pet peeve (the lack thereof) of mine.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/gdhd8qdz2d" title="ASL conversation at Kubecon" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>In the ten years I’ve worked as a technical evangelist — attending conferences, user groups, meetups, and conventions running the gamut in scale from a modest handful to tens of thousands — I can count on one hand (with fingers to spare) the number of times I met up with someone for whom sign language was their primary mode of communication.</p> <p>This is <em>not</em> because Deaf and Hard of Hearing people don’t work in tech — quite the contrary. The reason Deaf/HoH people don’t come to conferences is that the effort and challenge (not to mention the cost) far outweigh the value.</p> <p>I’m proud to say this appears set to change. In June 2023, the CNCF established a Deaf and Hard-of-Hearing working group tasked with identifying conference accessibility shortcomings and creating a list of recommendations and guidelines to overcome them. The group — sponsored by <a href="https://www.linkedin.com/in/catherinepaganini/">Catherine Paganini</a> and co-chaired by <a href="https://www.linkedin.com/in/robkoch/">Rob Koch</a> and <a href="https://www.linkedin.com/in/destiny-o-connor-28b2a5255/">Destiny O’Connor</a> — began work immediately, drawing on input from a diverse group of members. A draft document was published in September 2023.</p> <p>The CNCF showed exactly how serious it was about this by making both the effort and the investment to contact and contract interpreters, provide scholarships, and reach out to the Deaf/HoH members so that everyone understood this would be an accessible event. Ten people may not sound like a lot for a conference that boasts ten thousand attendees. Still, given the short notice and lack of promotion, and in light of my earlier comment that conferences are simply not something Deaf/HoH folks expect to attend, it’s positively massive.
Add to that a few hearing ASL speakers like me, who joyfully and enthusiastically glommed onto the main group, and you had a respectable crowd.</p> <img src="//images.ctfassets.net/6yom6slo28h2/50ZLp2ZgndtaL6jKps5eae/4e0ae2d5a72d0013649f01cb3d609208/kubecon-group.jpg" style="max-width: 600px;" class="image center" alt="ASL speaker at Kubecon" /> <p>A significant portion of the people in that picture had never attended a conference before, simply because of the hurdles I described. Others had attended, but at significant cost to themselves or their companies, and the experience was always constrained by the limits of what they — individually — could advocate for.</p> <p>This is the part of the blog where someone usually expresses hope for the future, something like “I can only hope future events will be even more accessible.” And I do hope that. But I have more than hope. As a founding member of the Accessibility working group, I know there’s a plan to continue this effort and a commitment by the CNCF to ensure Kubecon 2023 is the rule rather than an exception.</p> <h2 id="summary">Summary</h2> <p>Kubecon 2023 was more than just another conference to check off my list. First, it marked my first chance to work in the Kentik booth and learn from incredible colleagues.</p> <img src="//images.ctfassets.net/6yom6slo28h2/10UV25PR7ANkXUMumy6Vhe/72c2d0ea9287505d944d3069257dfd9d/kubecon-rosalind-mikek.jpg" style="max-width: 600px;" class="image center" alt="Rosalind Whitley and Mike Krygeris at Kubecon" /> <p>Second, it was a chance to dive deep into the code, community, and culture of Kubernetes.</p> <p>Third, it was a moment for a few members of an underrepresented group to meet in person and experience an event that was previously not an option.</p> <p>As a postscript, I’ll mention that I was able to take advantage of Chicago’s famous kosher food scene. Kentik CEO <a href="https://www.kentik.com/blog/author/avi/">Avi Friedman</a> and I made it to <a href="https://miltsbbq.com/">Milt’s BBQ for the Perplexed</a> and enjoyed a mind-boggling seven-pound rack of ribs. I fear those calories will be with me, at least as long as the rush I got from attending the conference.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1JWvGxDwGnOinEFPFudr1h/16d55e1ff264054d25a2566f43bc68ed/kubernetes-bbq.jpg" style="max-width: 600px;" class="image center" alt="BBQ at Kubecon" /><![CDATA[Alerts Should Work for You, Not the Other Way Around]]><![CDATA[The entire reason we have monitoring is to understand what users are experiencing with an application. Full stop. If the user experience is impacted, sound the alarm and get people out of bed if necessary. All the other telemetry can be used to understand the details of the impact. But lower-level data points no longer have to be the trigger point for alerts.]]>https://www.kentik.com/blog/alerts-should-work-for-you-not-the-other-way-aroundhttps://www.kentik.com/blog/alerts-should-work-for-you-not-the-other-way-around<![CDATA[Leon Adato]]>Wed, 15 Nov 2023 05:00:00 GMT<p>“A few years back, I was tuning an alert rule and accidentally triggered the alert, which created 772 tickets. Twice.”</p> <p>This (all too true) story serves as the introduction to the main thesis of my talk in the video below (and the reason for its title): That alerts — contrary to the popular opinion held by IT practitioners across the spectrum of tech — don’t inherently suck.
The problem lies in how alerts are typically created, which causes them to be… well, let’s just say “sub-optimal” and leave it at that.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 600px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.166666666666664%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/cdb46q0i30" title="Alerts Don't Suck: YOUR Alerts Suck!" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>I’ve given this talk frequently at conferences such as <a href="https://www.youtube.com/watch?v=vtE-TFblEgE">DevOpsDays BRUM</a>, <a href="https://www.youtube.com/watch?v=9_-4ni7ncuw">DevOpsDays TLV</a>, <a href="https://vimeo.com/channels/1848325/843992066">Monitorama</a>, and others. I believe its popularity is largely due to its fun approach to a frustrating issue.</p> <p>I’d like to take a few moments of your time here to emphasize points I make in the talk, but then extend those ideas in ways that don’t fit the limitations of time or format common in conference presentations.</p> <h2 id="the-slippery-slope-to-monitoring-engineer">The slippery slope to “Monitoring Engineer”</h2> <p>If you’ve read this far, there’s a good chance you care about alerts for more than just your own personal reasons. You probably have people — whether on your immediate team or in the larger organization — who look to and rely on you for help designing, implementing, maintaining, and fixing alerts.</p> <p>While most of us first encounter monitoring solutions because we want to know more about our own sh… tuff, it quickly follows that we’re helping others set up monitoring for themselves. Before long, we find ourselves in the “resident expert” role. Once that reputation gets around, the job (whether official or not) is irrevocably added to our responsibilities.</p> <p>The good news is that this is a huge opportunity for those who enjoy the work. Monitoring is an undeniable game-changer in organizations willing to embrace and use it.</p> <h2 id="alerts--monitoring">Alerts ≠ Monitoring</h2> <p>One of my first encounters with alerting that was completely off the rails was at a company that defined uptime as “100% minus the # of alerts” in a given period. It was utterly unhinged.</p> <p>While it was an extreme example, the underlying issue — confusing alerting with monitoring — isn’t rare at all. For many individuals (and teams, departments, and entire companies), the raison d’être for monitoring is to have alerts, which is simply not helpful or effective.</p> <p>Monitoring is nothing more (and nothing less) than the consistent, persistent collection of data from a set of systems. Everything else that a monitoring and observability solution provides — dashboards, widgets, reports, automation, and alerts — is merely a happy by-product of having monitoring in the first place.</p> <p>As a monitoring engineer, I know something is amiss when I see people hyper-focusing on alerts to the exclusion (if not the detriment) of monitoring.</p> <h2 id="alerts-need-proof-of-their-value">Alerts need proof of their value</h2> <p>An alert should only exist if it has a proven, measurable, meaningful impact.
The best way to validate that is to see if an alert is intended to cause</p> <ul> <li>someone</li> <li>to do something</li> <li>RIGHT. NOW.</li> <li>about a problem</li> </ul> <p>If all of those conditions aren’t met, you’re looking at an alert that is trying to replace some other monitoring structure — a dashboard, a report, etc.</p> <p>But that merely proves that an alert is actionable, not valuable. And I must be clear: “important” isn’t the same as “valuable.” Importance implies that it is technically, intellectually, or (believe it or not) emotionally meaningful to some person or group.</p> <p>“Valuable” is much more particular: The existence of the alert can be directly tied to a financial outcome.</p> <p>How does one establish this? Start with what the world would look like without the alert:</p> <ul> <li>How would the people who can fix the issue find out about the problem? And more to the point, how LONG would it take them to find out?</li> <li>Are there any inherent losses while the problem is happening? An online sales system that generates $1,000 an hour loses that amount every hour it’s unavailable.</li> <li>How long would it take to fix the problem? In some cases, it’s the same amount of time, alert or not. But in far more circumstances, if the problem were left unaddressed for the length of time identified in the first bullet, it would take longer (possibly significantly longer) to resolve.</li> <li>What is the regular (“total loaded”) rate for the staff who can fix the issue?</li> <li>What is the “interruption cost” for that staff? This means the staff is (ostensibly) not sitting around waiting for this particular error. So what is the value of their normal work? Because they will NOT be doing it during the time they are addressing this issue.</li> </ul> <p>You are welcome to take the formula above and, as the saying goes, “salt to taste.”</p> <p>Once you have this, recalculate all of the above WITH the alert. The difference between the first calculation and the second is the dollar value of the alert.</p> <p>Now, you can set up a simple report showing the number of occurrences the alert triggered, multiplied by the value. That is the amount this one alert has saved the company during that time.</p>
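<p>To make that arithmetic concrete, here’s a toy version of the calculation in Python. Every number below is invented, so plug in your own figures and, again, salt to taste:</p> <pre><code>"""Toy alert-value calculation. All figures are invented for illustration."""

revenue_per_hour = 1_000        # what the system earns while it's up
detect_hours_without = 4.0      # time for a human to notice, with no alert
detect_hours_with = 0.25        # time to notice once the alert exists
fix_hours = 1.0                 # repair time once someone is working on it
staff_rate_per_hour = 150       # loaded rate for the responder
interruption_cost = 200         # value of the work the responder dropped

def incident_cost(detect_hours):
    """Cost of one incident given how long it takes to find out about it."""
    downtime = detect_hours + fix_hours
    return (downtime * revenue_per_hour          # lost revenue
            + fix_hours * staff_rate_per_hour    # responder's time
            + interruption_cost)                 # their abandoned work

value_per_occurrence = incident_cost(detect_hours_without) - incident_cost(detect_hours_with)
print(f"This alert is worth ${value_per_occurrence:,.2f} each time it fires.")
# Multiply by the number of occurrences for the report described above.
</code></pre> <p>(In real life, the fix itself often takes longer when detection is delayed, per the third bullet above, which only increases the alert’s value.)</p> <h2 id="observability-enables-us-to-change-our-focus">Observability enables us to change our focus</h2> <p>Back when I started working with monitoring solutions (yeah, yeah, Grampa. When dinosaurs ruled the earth and you had to chisel each bit into the hard drive by hand with a lodestone), we had to guess at the user’s experience from an array of much lower-level data. We’d look at network traffic, disk I/O, server connections, and other metrics and use those metrics to guess what was happening at the top of the OSI model.</p> <p>We didn’t do it because we thought it was the best option. We did it because it was the ONLY option. Tracing didn’t really come onto the scene — in terms of true application monitoring — until 2010. And it only took hold because of the fundamental change in application architecture.</p> <p>The widespread adoption of cloud computing (AWS EC2 went GA in 2006) and mobile phones (the first iPhone came on the scene in 2007) radically changed how we interacted with applications. Facebook had an unbelievable (for the time) 600 million users in 2010.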
That number grew to 800 million in 2011 and over 1 billion in 2012.</p> <p>Against THAT backdrop, application tracing and real user monitoring went from something we could only do in carefully controlled QA environments to a technique that was not only possible but game-changing.</p> <p>Because the entire reason we have monitoring — the whole damn point — is to understand what users are experiencing with an application. That’s it. That’s the whole enchilada.</p> <p>So, I will go on record as saying that alerting should focus on that aspect first and foremost. If the user experience is impacted, sound the alarm and get people out of bed if necessary.</p> <p>At that point, all the other telemetry — metrics, events, and logs — can be used to understand the details of the impact. But those lower-level data points no longer have to be the trigger point for alerts. Not in most cases.</p> <h2 id="where-do-we-go-from-here">Where do we go from here?</h2> <p>Hopefully, you have enough time between this blog and my talk to reflect on your existing alerts with an eye toward real improvement. You may find yourself deleting alerts you once thought essential. You will also undoubtedly spend time tweaking your alerts to make them more actionable, meaningful, and valuable.</p> <p>Just ensure you don’t trigger an alert storm in the process, or you’ll end up in the helpdesk, manually closing 1,544 tickets.</p> <p>Don’t ask me how I know.</p><![CDATA[ Digging into the Optus Outage]]><![CDATA[Last week a major internet outage took out one of Australia's biggest telecoms. In a statement out yesterday, Optus blames the hours-long outage, which left millions of Aussies without telephone and internet, on a route leak from a sibling company. In this post, we discuss the outage and how it compares to the historic outage suffered by Canadian telecom Rogers in July 2022.]]>https://www.kentik.com/blog/digging-into-the-optus-outagehttps://www.kentik.com/blog/digging-into-the-optus-outage<![CDATA[Doug Madory]]>Tue, 14 Nov 2023 05:00:00 GMT<p>In the early hours of Wednesday, November 8, Australians woke up to find one of their major telecoms completely down. Internet and mobile provider Optus had suffered a catastrophic outage beginning at 17:04 UTC on November 7, or 4:04 am the following day in Sydney. It wouldn’t fully recover until hours later, leaving millions of Australians without telephone and internet connectivity.</p> <p>On Monday, November 13, Optus released a statement about the outage blaming “routing information from an international peering network.” In this post, I will discuss this statement as well as review what we could observe from the outside. In short, from what we’ve seen so far, the Optus outage has parallels to the catastrophic outage suffered by Canadian telecom <a href="https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-outage/">Rogers in July 2022</a>.</p> <h2 id="analyzing-the-outage-from-afar">Analyzing the outage from afar</h2> <p>Let’s start with the overall picture of what the outage looked like in terms of traffic volume using Kentik’s aggregate NetFlow.
As depicted in the diagram below, we observed traffic to and from Optus’s network go to almost zero beginning at 17:04 UTC on November 7, or 4:04 am on November 8 in Sydney, Australia.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4Nxc9xoumafgsZMblCPV0c/559532d4d836fce3c28de938b99be58a/internet-traffic-to-optus-overview.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Overall view of the Optus internet outage and restoration" /> <p>Traffic began to return at 23:22 UTC (10:22 am in Sydney) following a complete outage that lasted over 6 hours. Pictured below, the restoration was phased, beginning with service in the east and working westward until it was fully recovered over three hours later.</p> <img src="//images.ctfassets.net/6yom6slo28h2/17I4YFwQtVgr4a8RLqflU8/0523550769d8d6a5014691f3a5535c53/internet-outage-restoration-phases.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Detailed view of Optus outage and restoration by region" /> <p>As stated in the introduction, there are notable parallels between this outage and the <a href="https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-outage/">Rogers outage of July 2022</a>, the first of which is the role of BGP in the outage.</p> <div as="Promo"></div> <p>In the case of Rogers, the <a href="https://twitter.com/atoonk/status/1550896354511036416">outage was triggered</a> when an internal route filter was removed, allowing the global routing table to be leaked into Rogers’s internal routing table. This development overwhelmed the internal routers within Rogers, causing them to stop passing traffic. Rogers also stopped announcing most, <em>but not all</em>, of the 900+ BGP routes it normally originated.</p> <p>In the case of last week’s outage, Optus’s AS4804 also withdrew most, <em>but not all,</em> of its routes from the global routing table at the time of the outage.</p> <p>As an example, below is a visualization of an Optus prefix (49.2.0.0/15) that was withdrawn at 17:04 UTC and returned at 23:20 UTC later that day.</p> <img src="//images.ctfassets.net/6yom6slo28h2/36I2MXdqmZouZpfYkl46tx/9b11fc308972481c89301222d83c8e66/optus-route-withdrawal-return.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP route withdrawal and return" /> <p>However, this didn’t render all of that IP space unreachable, as more-specifics such as 49.3.0.0/17 were not withdrawn during the outage, as is illustrated in the visualization below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2No5JTuTW7MMnhJjDbOi7r/a2473a2853d03e7891728c5c829cb584/optus-routes-not-withdrawn.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="View of reachable BGP routes" /> <p>In fact, the <a href="https://ioda.inetintel.cc.gatech.edu/">IODA tool</a> from Georgia Tech captures this aspect of the Optus outage well.
Pictured below, <a href="https://ioda.inetintel.cc.gatech.edu/asn/4804?from=1699351200&#x26;until=1699430400">IODA estimates</a> (in the green line) that AS4804 was still originating about 40% of the IP space it normally did, even though the successful pings (blue line) dropped to zero during the outage.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6vMI6QRbapuGLbZ0XlCXoo/a3ee469341336cc757c42b2d1071406d/ioda-optus-outage.png" style="max-width: 700px;" class="image center" thumbnail alt="IODA tool showing the Optus internet outage" /> <p>However, if we utilize Kentik’s aggregate NetFlow to analyze the traffic to the IP space of the routes that weren’t withdrawn, we still observe a drop in traffic of over 90%, nearly identical to the overall picture.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5QwvcB7LMCnycuCk9rfiB/96d1f8d263c91e03e1642208b88c5154/optus-routes-not-withdrawn-analysis.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="NetFlow analysis of the routes not withdrawn" /> <p>From this, we can conclude that, as was the case with Rogers, traffic stopped going to Optus, not due to a lack of available routes in circulation (i.e., reachability), but due to an internal failure within Optus that caused the network to stop sending and receiving traffic.</p> <h2 id="a-flood-of-bgps">A Flood of BGPs?</h2> <p>Earlier this year, my friend and BGP expert Mingwei Zhang <a href="https://blog.cloudflare.com/radar-routing/">announced</a> the expansion of the <a href="https://radar.cloudflare.com/">Cloudflare Radar</a> page to include various routing metrics. One of the statistics included on each AS page is a rolling count of BGP announcements relating to the routes originated by the AS.</p> <p>In the aftermath of the Optus outage, I saw postings on social media citing the page for AS4804, showing a spike in BGP messages at the time of the outage. It’s worth discussing what this does or does not mean.</p> <img src="//images.ctfassets.net/6yom6slo28h2/wEWAKhkKch0fBfMMkRz9Q/0ef4df6851c0608b7785de8f447f3cf9/optus-outage-cloudflare-bgp-spike.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cloudflare view of the Optus outage" /> <p>As you may already know, BGP is a “report by exception” protocol. In theory, if nothing is happening and no routes are changing, then no BGP announcements (UPDATE messages, to be precise) are sent.</p> <p>The withdrawal of a route triggers a flurry of UPDATE messages as the ASes of the internet search in vain for a route to replace the one that was lost. The more routes withdrawn, the larger the flurry of messages.</p> <p>This includes both messages that signal a new AS_path or other attribute, and messages that signal that the sending AS no longer has the route in its table. Let’s call these types of UPDATE messages “announcements” and “withdrawals,” respectively.</p>
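<p>If you’d like to count these two message types yourself, the open-source <a href="https://bgpstream.caida.org/">BGPStream</a> project makes it straightforward. Below is a rough sketch using its Python bindings; the collector choice and time window are illustrative, and you may need to adjust them:</p> <pre><code>import pybgpstream

# Count announcements vs. withdrawals for the withdrawn Optus prefix
# around the start of the outage (times in UTC).
stream = pybgpstream.BGPStream(
    from_time="2023-11-07 16:30:00",
    until_time="2023-11-07 18:30:00",
    collectors=["route-views2"],
    record_type="updates",
    filter="prefix exact 49.2.0.0/15",
)

announcements = withdrawals = 0
for elem in stream:
    if elem.type == "A":    # announcement: a new path or changed attributes
        announcements += 1
    elif elem.type == "W":  # withdrawal: the peer no longer has the route
        withdrawals += 1

print(f"announcements: {announcements}, withdrawals: {withdrawals}")
</code></pre> <p>Perhaps surprisingly, when a route is withdrawn, the number of announcements typically exceeds the number of withdrawals by an order of magnitude. If we take a look at the plight of the withdrawn Optus prefix from earlier in this post (49.2.0.0/15), we can see the following.</p> <img src="//images.ctfassets.net/6yom6slo28h2/57GHDmIYYhsMa0XJXStbg3/ea2b6ab40974711754be54ab55e743b0/bgp-announcements-withdrawals-detail.png" style="max-width: 600px;" class="image center" alt="Detail of BGP announcements and withdrawals" /> <p>The lower graph tracks the number of Routeviews BGP sources carrying 49.2.0.0/15 in their tables over time.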
The line begins its descent at 17:04 UTC, when the outage starts, and takes more than 20 minutes to reach zero. Coinciding with the drop in propagation on the bottom, the upper graph shows spikes in announcements and withdrawals — much more of the former than the latter.</p> <p>Optus did <a href="https://twitter.com/LachlanEagling/status/1722074113454342359">not break BGP</a>; its internal outage caused it to withdraw a subset of its routes. The spike in BGP announcements was simply the <a href="https://twitter.com/DougMadory/status/1722225758842851810">natural consequence</a> of those withdrawals.</p> <h2 id="optus-and-rpki">Optus and RPKI</h2> <p>First things first: RPKI <em>did not</em> play a role in this outage. But Optus has some issues regarding RPKI that are worth discussing here.</p> <p>On any given day, AS4804 announces more than 100 RPKI-invalid routes to the internet. These routes suffer from the same misconfiguration shared by the majority of the persistently RPKI-invalid routes found in the global routing table. Specifically, they are invalid due to the maxLength setting specified in the ROA.</p> <p>Take 220.239.80.0/20, for example (<a href="https://rpki-validator.ripe.net/ui/220.239.80.0%2F20?validate-bgp=true">validation shown below</a>). According to the ROA, this IP space is not to be announced in a BGP route with a prefix length longer than 19. Since its length is 20, it is RPKI-invalid and is being rejected by networks (including the <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">majority of the tier-1 backbone providers</a>) that have deployed RPKI.</p> <img src="//images.ctfassets.net/6yom6slo28h2/0XTvUb3YWqaNsziAD3iIN/52849c42fda9dc9b5a986e8b1333efcb/optus-outage-rpki-validation.png" style="max-width: 600px;" class="image center" alt="ROA RPKI validation" />
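<p>The validation logic itself is mechanical enough to sketch in a few lines of Python. Note that the ROA below is a stand-in with the attributes described above (origin AS4804, maxLength 19), not a copy of the actual published ROA:</p> <pre><code>import ipaddress

def validate_origin(prefix, origin_asn, roas):
    """Minimal RFC 6811-style route origin validation."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, roa_asn, max_length in roas:
        if route.subnet_of(ipaddress.ip_network(roa_prefix)):
            covered = True  # a ROA covers this space...
            if roa_asn == origin_asn and max_length >= route.prefixlen:
                return "valid"
    # ...but if no covering ROA matched both origin and length, it's invalid
    return "invalid" if covered else "not-found"

# Hypothetical ROA: (prefix, authorized origin AS, maxLength)
roas = [("220.239.64.0/18", 4804, 19)]

print(validate_origin("220.239.64.0/19", 4804, roas))  # valid: within maxLength
print(validate_origin("220.239.80.0/20", 4804, roas))  # invalid: /20 exceeds maxLength 19
</code></pre> <p>Since these routes are RPKI-invalid, their propagation is <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">greatly reduced</a>. Normally, this isn’t a problem because Optus also announces covering routes, ensuring that the IP space contained in the RPKI-invalid routes is reachable in the global routing table.</p> <p>However, during the outage last week, many of these covering routes were withdrawn, causing the internet to have to rely on the RPKI-invalid more-specific routes to reach this IP space. This would have caused this IP space to become unreachable for most of the internet, greatly reducing its ability to communicate with the world.</p> <p>In the example below, 220.239.64.0/20 and 220.239.80.0/20 are RPKI-invalid and, therefore, only seen by 26.2% of Kentik’s 3,000+ BGP sources. The covering prefix 220.239.64.0/19 is RPKI-valid and enjoys global propagation.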
During the outage, 220.239.64.0/19 was withdrawn, and as a result over 70% of the internet did not have a way to reach 220.239.64.0/20 and 220.239.80.0/20 due to the <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">greatly limited propagation</a> that RPKI-invalid routes typically experience.</p> <p>The reachability of these routes over time is illustrated in the diagram below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6Kn1SbIOMTL0n8jLKLcj7u/3f4f8e1c6940a2ac9f750c67f4a3f2f5/bgp-route-reachability-over-time.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Reachability of BGP routes over time" /> <p>Ultimately, this did not have a significant impact on the outage. We know this because our analysis above shows that traffic to IP space in routes that <em>weren’t withdrawn</em> also dropped by over 90%.</p> <p>Regardless, there is a simple fix Optus can make to resolve this issue. The maxLength setting can either be increased to match the prefix length of the route, or it can simply be removed, causing RPKI validation to match only on the AS origin of AS4804.</p> <p>To help alleviate the confusion of the maxLength setting, this latter course of action was recently published as a best practice in <a href="https://www.rfc-editor.org/rfc/rfc9319.html">RFC9319</a> <em>The Use of maxLength in the Resource Public Key Infrastructure (RPKI)</em>.</p> <h2 id="conclusion">Conclusion</h2> <p>The multi-hour outage last week left more than <a href="https://www.abc.net.au/news/2023-11-09/how-the-optus-outage-played-out/103079768">10 million Australians</a> without access to telephone and broadband services, including emergency lines. In a <a href="https://www.dailymail.co.uk/news/article-12727421/Optus-outage-cat-Sydney.html">story that went viral</a>, one woman discovered the outage when her cat Luna woke her up because the feline’s wifi-enabled feeder failed to dispense its <em>brekkie</em>.</p> <p>On Monday, November 13, 2023, <a href="https://www.reuters.com/business/media-telecom/singtel-owned-optus-says-massive-australia-outage-was-after-software-upgrade-2023-11-13/">Optus issued a press release</a> that nodded at the cause in the following paragraph:</p> <blockquote>At around 4:05 am Wednesday morning, the Optus network received changes to routing information from an international peering network following a routine software upgrade. These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves.</blockquote> <p>To me, that statement suggests that an external network (presumably now <a href="https://www.smh.com.au/technology/identity-of-third-party-who-brought-down-optus-network-revealed-20231114-p5ejy1.html#">identified</a>) that connects with AS4804 sent a large number of routes into Optus’s internal network, overwhelming their internal routers and bringing down their network.
If that is the correct interpretation of the statement, then the outage was nearly identical to the <a href="https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-outage/">Rogers outage in July 2022</a>, when the removal of route filters allowed a route leak to overwhelm the network’s internal routers.</p> <p>Like any network exchanging traffic on the internet, Optus’s network needs to be able to handle the likely scenario that a peer leaks routes to it — even if it is another subsidiary of its parent company. These types of mistakes happen all the time.</p> <p>Aside from filtering the types of routes accepted from a peer, a network should, at a minimum, employ a kind of circuit breaker setting (known as Maximum-Prefix or MAXPREF) to kill the session if the number of routes rises above a predefined amount.</p> <p>Another possibility is that Optus did use MAXPREF but used a higher threshold on its exterior perimeter than on the interior — allowing a surge of routes through its border routers but taking down sessions internally. When MAXPREF is reached, routers can be configured to automatically re-establish the session after a retry interval or go down “forever,” requiring manual intervention. <em>Be aware that Cisco’s default behavior is to go down forever.</em></p>
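<p>On a Cisco IOS router, for example, the guardrail is a single line of neighbor configuration. The ASNs, address, and threshold below are invented for illustration:</p> <pre><code>router bgp 64500
 neighbor 192.0.2.1 remote-as 64511
 ! Tear down the session if this peer ever sends more than 500,000
 ! prefixes, logging a warning at 90% of the limit. "restart 30"
 ! retries the session after 30 minutes; omit it and the session
 ! stays down until someone manually runs "clear ip bgp 192.0.2.1".
 neighbor 192.0.2.1 maximum-prefix 500000 90 restart 30
</code></pre> <p>Again, mistakes happen regularly in internet routing. Therefore, it is imperative that every network establish checks to prevent a catastrophic failure. It would seem that Optus got caught without some of their checks in place.</p> <p>The Australian government announced that it is <a href="https://www.zdnet.com/article/australia-to-investigate-optus-outage-that-impacted-millions/">launching an investigation</a> into the outage. At the time of this writing, we don’t yet know if Luna will be called to testify.</p> <div class="image center" style="max-width: 300px;"><img src="//images.ctfassets.net/6yom6slo28h2/5GErdAlurHL57JL7PvAMJ7/1458a23ca541c837783905bcd20fb584/luna-the-cat.jpg" style="max-width: 300px;" alt="Luna the cat alerted her owner of the internet outage" /><div class="caption" style="margin-bottom: 0; margin-top: 0;">Luna the cat. Credit: <a href="https://twitter.com/_angemccormack/status/1722416136690733473">Ange McCormack</a></div></div><![CDATA[Introducing Kentik Kube: Revolutionizing Kubernetes Network Observability]]><![CDATA[Kentik Kube provides network insight into Kubernetes workloads, revealing K8s traffic routes through an organization’s data centers, clouds, and the internet.]]>https://www.kentik.com/blog/introducing-kentik-kubehttps://www.kentik.com/blog/introducing-kentik-kube<![CDATA[Rosalind Whitley]]>Tue, 07 Nov 2023 04:00:00 GMT<p>Kentik is proud to announce the general availability of <a href="https://www.kentik.com/solutions/kubernetes-networking/">Kentik Kube</a>, an industry-first solution that provides network insight into Kubernetes workloads, revealing the routes enterprise container traffic takes through data centers, public clouds, and the internet.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6z0lvvO6TM3bmy73pd0uXv/53f3db9a43ec6a4995830ded2807ca83/kentik-kube-2023-1-no-header.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kubernetes infrastructure monitoring" /> <p>Enterprises today navigate an intricate landscape, leveraging multiple types of Kubernetes, including AWS’ EKS, Google Cloud’s GKE, Microsoft Azure’s AKS, VMWare’s Tanzu, RedHat’s OpenShift, and myriad other private and public cloud orchestration solutions.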
This diversity might empower teams with autonomy, but it has also introduced significant complexity in ensuring high-level network visibility, troubleshooting capabilities, and cost transparency. Enterprise organizations with complex networks face a considerable gap in understanding network traffic, particularly in hybrid and multi-cloud infrastructure estates. For example, most Kubernetes monitoring tools take an application-centric approach to observability and don’t adequately reveal cluster-to-cluster, internet-to-service, or service-to-internet traffic.</p> <h2 id="network-layer-kubernetes-visibility-at-your-fingertips">Network-layer Kubernetes visibility at your fingertips</h2> <img src="//images.ctfassets.net/6yom6slo28h2/2zDh1NnBy1kgs3ljX41UXL/d0e68f1c1893925e2c6161861da037e8/kentik-kube-2023-shopping-no-header.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kubernetes performance monitoring" /> <p>Kentik Kube is our response to these challenges. Kentik Kube gives network, cloud, and infrastructure engineers detailed network traffic and performance visibility inside and among their Kubernetes clusters, so they can quickly detect and solve network problems, surface anomalies and compliance issues, and identify outliers and misconfigurations that inflate network traffic costs (more on that below). All traffic data in Kentik, including K8s traffic analysis in Kentik Kube, is automatically enriched with critical business and security metadata that puts the telemetry in context.</p> <p>Kentik Kube ensures Kubernetes performance by enabling teams to quickly identify services and pods experiencing network anomalies and delays, helping them troubleshoot and resolve issues more efficiently. With the ability to configure alert policies, teams can proactively address high latency across nodes, pods, workloads, or services.</p> <p>Kentik customers use hybrid and multi-cloud networks to run their businesses, like most enterprises. Total network observability – one view into context-enriched telemetry from container and VM-based cloud primitives across the major providers, from hybrid cloud interconnect, and from data center workloads of all kinds – empowers their teams to make better decisions faster, with less overhead. Most observability tools start at the application layer to help teams understand issues impacting apps. This approach has a lot of value, but it neglects the bigger picture of how applications and shared services impact each other, and how communications across infrastructure boundaries impact cost and performance. As technologies and teams grow and change, it’s critical that organizations maintain a big picture of how their apps and services use the network to pass data and requests around the entire software estate.</p> <h2 id="a-lack-of-k8s-visibility-is-more-costly-than-you-may-think">A lack of K8s visibility is more costly than you may think</h2> <p>Let’s face it: Traffic inside Kubernetes can still be a “black box.” Sure, K8s abstractions make it possible to operate cloud-native workloads reliably at scale. But they also make it hard to pinpoint exactly what’s happening under the hood.
Since architecting Kubernetes is still a fairly new engineering discipline, the lack of standardization at many companies also makes troubleshooting and correcting configuration errors difficult.</p> <p>While most enterprises prioritize operating cost reduction, for infrastructure engineers working with new technologies, balancing reliability with cost and performance optimization can be tricky and slow. Regardless of where a team is on the Kubernetes learning curve, they likely deal with the consequences of normal user error. Even the most experienced teams often must contend with old design decisions, as critical infrastructure must be reliable even as we update it according to hard-earned lessons.</p> <p>Costly architectural decisions and misconfigurations are challenging to spot with application-centric monitoring tools, and many times, these errors impact the way Kubernetes, or cloud-managed Kubernetes services, route traffic. Most teams lack the Kubernetes network observability and expertise needed to identify, understand, and solve these problems quickly. Without context on historical traffic trends and anomalies, inter-service and inter-pod connections, and transfer types – context with IP and port-level specificity – many teams fly blind.</p> <p>Engineers tasked with architecting in K8s may not even know that transmitting data with a <a href="https://www.kentik.com/kentipedia/nat-gateway/" title="Kentipedia - NAT Gateways: A Guide to Cost Management and Cloud Performance">NAT gateway</a> can be expensive, or know how to recognize NAT gateways implemented in their environments. But when egress, inter-region transfer, and gateway charges get out of control, this expertise suddenly becomes urgently important – anything to avoid another cloud bill that’s $250,000 over budget. Similarly, awareness of the specific pod, application, or user causing a spike can be critical to controlling runaway costs. At many companies, getting this information can take days or weeks while the meter runs.</p> <p>Kentik Kube was created to address these pain points directly by detecting traffic changes tied to new deployments or misconfigurations before costs escalate. Kentik Kube’s total network visibility also allows teams to trace the history of pod deployments across nodes, identify communication channels between pods, services, and other clusters, and detect and address traffic sent to unapproved destinations or embargoed countries.</p> <p>Kentik Kube achieves this by collecting metadata across Kubernetes pods, clusters, and services combined with telemetry from a lightweight <a href="https://www.kentik.com/blog/ebpf-explained-why-its-important-for-observability/">eBPF</a> agent.
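</p> <p>To give a feel for what joining those two datasets involves, here is a hypothetical sketch that builds a pod-IP lookup table from the Kubernetes API and tags raw flow records with pod context. The record fields and the <code>enrich()</code> helper are invented for illustration; this is not Kentik Kube’s actual implementation:</p> <pre><code>from kubernetes import client, config

# Build a pod-IP -> (namespace, pod, node) index from the Kubernetes API.
config.load_kube_config()  # use config.load_incluster_config() in-cluster
v1 = client.CoreV1Api()

pod_index = {}
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    if pod.status.pod_ip:
        pod_index[pod.status.pod_ip] = (
            pod.metadata.namespace,
            pod.metadata.name,
            pod.spec.node_name,
        )

def enrich(flow):
    """Tag a raw flow record (src_ip/dst_ip keys) with pod context."""
    flow["src_pod"] = pod_index.get(flow["src_ip"])
    flow["dst_pod"] = pod_index.get(flow["dst_ip"])
    return flow

print(enrich({"src_ip": "10.42.0.17", "dst_ip": "10.42.1.9", "bytes": 1400}))
</code></pre> <p>Because pods are ephemeral, a real pipeline would watch the API for changes rather than listing pods once, but the enrichment principle is the same.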
This unique dataset, coupled with Kentik’s advanced analytics engine, empowers infrastructure and platform teams to move faster, reduce incident resolution times, and gain critical insights into the health and performance of their networks.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/9m7d1ijcl8" title="Kentik Kube: Troubleshooting Kubernetes Network Performance Issues" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <h2 id="kentik-kube-is-available-now">Kentik Kube is available now</h2> <p>As of today, we invite you to experience Kentik Kube firsthand. <a href="/get-started/">Sign up for a trial</a> or join us at <a href="https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/">KubeCon 2023</a> in Chicago from November 7-9 at <strong>booth #B23</strong> for a live demonstration of how Kentik Kube can reduce costs and increase efficiency for your organization. If you’re already a Kentik customer, reach out to your customer service partner to learn more.</p><![CDATA[Analyzing the Traffic Patterns of Broadband Subscriber Behavior]]><![CDATA[Broadband subscriber behavior analysis is the process of collecting and analyzing data on how broadband subscribers use the internet. This data can be used to gain insights into subscriber needs and preferences, as well as to identify potential problems with the broadband service.]]>https://www.kentik.com/blog/analyzing-the-traffic-patterns-of-broadband-subscriber-behaviorhttps://www.kentik.com/blog/analyzing-the-traffic-patterns-of-broadband-subscriber-behavior<![CDATA[Nina Bargisen]]>Thu, 02 Nov 2023 04:00:00 GMT<h2 id="broadband-subscriber-traffic-patterns-and-their-sources">Broadband subscriber traffic patterns and their sources</h2> <p>One of the most valuable insights that can be gained from broadband subscriber behavior analysis is an understanding of traffic patterns. The sources of the traffic data for a network operator vary, but common sources are deep packet inspection, <a href="https://www.kentik.com/kentipedia/sflow-collector/">various flow technologies like NetFlow or sFlow</a>, and DNS log analysis.</p> <p>By analyzing traffic patterns, broadband providers can learn a great deal about their subscribers’ needs and preferences. For example, if a large number of subscribers are using streaming video services during the evening hours, it is a good idea to work closely with the content providers to secure the traffic and to develop service offerings for this customer segment.</p> <p>Traffic patterns can also be used to identify potential problems with the broadband service. A typical use is recognizing when a subscriber has a higher demand than their service is dimensioned for.</p> <h2 id="broadband-subscriber-insights-from-individual-vs-aggregated-data">Broadband subscriber insights from individual vs. aggregated data</h2> <p>Broadband subscriber behavior analysis can be performed at the individual level or the aggregated level.
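</p> <p>As a toy sketch of the distinction, the two levels are simply different roll-ups of the same flow records. The records below are invented, and the subscriber IDs are hashed into pseudonyms of the kind discussed under regulatory compliance later in this post:</p> <pre><code>import hashlib
from collections import defaultdict

# Invented flow records: (subscriber_id, service, bytes transferred).
flows = [
    ("sub-1001", "video-streaming", 4_200_000_000),
    ("sub-1001", "web", 120_000_000),
    ("sub-1042", "videoconferencing", 900_000_000),
    ("sub-1042", "video-streaming", 300_000_000),
]

def pseudonymize(subscriber_id, salt=b"rotate-me"):
    """Hash the raw ID so analysts never handle it directly.
    A real deployment would use a managed secret (e.g., an HMAC key)."""
    return hashlib.sha256(salt + subscriber_id.encode()).hexdigest()[:12]

per_subscriber = defaultdict(int)  # individual level: bytes per subscriber
per_service = defaultdict(int)     # aggregated level: bytes per service

for subscriber, service, nbytes in flows:
    per_subscriber[pseudonymize(subscriber)] += nbytes
    per_service[service] += nbytes

print(dict(per_subscriber))
print(dict(per_service))
</code></pre> <p>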
<h2 id="broadband-subscriber-insights-from-individual-vs-aggregated-data">Broadband subscriber insights from individual vs. aggregated data</h2> <p>Broadband subscriber behavior analysis can be performed at the individual level or the aggregated level. Individual-level analysis involves collecting and analyzing data on the online behavior of individual subscribers. Aggregated-level analysis consists of collecting and analyzing data on the online behavior of groups of subscribers.</p> <h3 id="individual-level-analysis">Individual-level analysis</h3> <p>When we look into the data use of an individual subscriber, there is a wide range of applications for the insights.</p> <p><strong>Operational insights</strong>: Slow speeds or broken sessions can indicate issues with the subscriber’s last-mile connection or home Wi-Fi system. Malicious traffic from the subscriber often means that there are infected devices on the home network. All these types of information can be used to improve the customer experience proactively.</p> <p><strong>Marketing insights</strong>: Recognizing individual use patterns means the operator can tailor their marketing and direct existing offerings to the individual subscriber. For example, bundled streaming offers for a subscriber with a high use of video streaming or more bandwidth for a working-from-home professional with a high use of videoconferencing.</p> <h3 id="aggregated-analysis">Aggregated analysis</h3> <p>Aggregated-level analysis can provide insights into the overall trends in broadband use.</p> <p><strong>Operational insights</strong>: Capacity planning is the poster child of the use cases. While the typical approach is to look at the total traffic in the network when planning topology and capacity, it also pays to understand what kind of traffic different customer segments are generating, whether geographical segments or product segments. In either case, there may be an opportunity to optimize the capacity or the network topology for those segments.</p> <p>Another use case is traffic engineering, where an operator can control different types of traffic in various manners.</p> <p><strong>Strategic insights</strong>: The aggregated analysis gives insights into trends in the services used. Emerging services can be tracked, the users of each can be analyzed, and new offerings can be developed based on this information.</p> <h3 id="regulatory-compliance">Regulatory compliance</h3> <p>The main concern once you start breaking down traffic per user is data privacy. The GDPR (General Data Protection Regulation) governs data protection within the EU and EEA. Instituted in May 2018, it superseded Directive 95/46/EC, aiming to empower users over their data and harmonize data protection laws across the EU.</p> <p>For broadband providers, this translates to:</p> <ul> <li>Procuring explicit user consent before data collection or analysis.</li> <li>Granting users full access to their data, along with the right to deletion.</li> <li>Implementing robust measures to shield user data from unsanctioned access or misuse.</li> </ul> <p>Additionally, global regulations like the California Consumer Privacy Act (CCPA) and the Brazilian General Law on the Protection of Personal Data (LGPD) echo similar mandates, underscoring the universal thrust towards stringent data protection.</p> <h2 id="what-do-you-need-to-do-with-broadband-subscriber-analytics">What do you need to do with broadband subscriber analytics?</h2> <p>In summary, the essentials of subscriber analytics are:</p> <ul> <li>Data collection and enhancements</li> <li>Customer consent and anonymization (see the sketch below)</li> <li>Analytics</li> <li>Data protection and process</li> </ul>
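<p>Anonymization is worth making concrete. A common pattern is to pseudonymize subscriber identifiers with a keyed hash before they reach the analytics layer, so analysts can still group flows by subscriber without seeing who the subscriber is. A minimal sketch, assuming the secret key lives in a managed vault and is rotated on a schedule:</p>
<pre><code>import hmac, hashlib

SECRET_KEY = b"example-key-kept-in-a-vault"  # assumption: a managed secret

def pseudonymize(subscriber_id: str) -> str:
    """Return a stable, keyed pseudonym for a subscriber ID.
    Stable, so aggregation still works; keyed, so it cannot be
    reversed by brute-forcing IDs without the key, and pseudonyms
    can be retired simply by rotating the key."""
    digest = hmac.new(SECRET_KEY, subscriber_id.encode(), hashlib.sha256)
    return digest.hexdigest()[:16]

print(pseudonymize("subscriber-0042"))  # same output on every run
</code></pre>
<p>Enriched flow records then carry only the pseudonym as a custom dimension, which supports both individual-level and aggregated analysis while limiting the exposure of personal data.</p>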
<p>The Kentik Network Observability Platform makes that data collection, enrichment, and analysis straightforward. Based on flow and DNS data collected from your network, the <a href="https://www.kentik.com/kentipedia/ott-services/">Kentik True Origin engine detects OTT services</a>. The custom dimensions in the Kentik Data Engine allow for the enhancement of flow data with business dimensions such as anonymized customer IDs. Kentik provides the perfect foundation for aggregated and individual analysis of the traffic consumption of subscribers.</p> <img src="//images.ctfassets.net/6yom6slo28h2/f4wTvsF9Dhg0zWPNIUaBj/d065d41c189d70a619cdacf050521db5/top-subscriber-sankey.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Sankey chart showing top subscribers" /> <div class="caption" style="margin-top: -35px;">A query in Kentik showing the OTT services used by each subscriber</div> <h2 id="conclusion">Conclusion</h2> <p>Broadband subscriber behavior analysis can be a valuable tool for broadband providers to gain insights into their subscribers’ needs and preferences, as well as to identify potential problems with the broadband service. However, it is crucial to be aware of the potential pitfalls of broadband subscriber behavior analysis, such as privacy concerns and discrimination. It is also essential to be mindful of the role of the GDPR in broadband subscriber behavior analysis.</p><![CDATA[The Driving Force of Community at All Things Open 2023]]><![CDATA[The most noticeable takeaway from All Things Open 2023 was how visibly and demonstrably people were there for the event itself. Not to check a box or browse the swag but to be together, show their support of open source, and glean every last bit of knowledge they could. ]]>https://www.kentik.com/blog/the-driving-force-of-community-at-all-things-open-2023https://www.kentik.com/blog/the-driving-force-of-community-at-all-things-open-2023<![CDATA[Leon Adato]]>Tue, 31 Oct 2023 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/7mqBCiyQmLgy2D4Suo9MJg/696559a29877f109042f0c2057578d2a/ato-leon-entrance.jpg" style="max-width: 300px;" class="image right" alt="Leon at ATO 2023" /> <p>I recently attended my second <a href="https://2023.allthingsopen.org/">All Things Open</a> conference and wanted to share some of the observations, experiences, and lessons I learned along the way.</p> <p>As conferences go, All Things Open stands out to me for a few reasons. Hosted at the <a href="https://www.raleighconvention.com/">Raleigh Convention Center</a> in North Carolina, it’s one of the few larger shows that doesn’t sit in a wallet-busting city. And, at ~5,000 attendees, it’s definitely one of the more significant events, standing out from events like <a href="https://devopsdays.org/">DevOpsDays</a>, <a href="http://www.securitybsides.com/w/page/12194156/FrontPage">BSides</a>, and even <a href="https://monitorama.com/">Monitorama</a>, which attract much more modest crowds. And, of course, it stands out because of its focus.
While every event – from the aforementioned DevOpsDays all the way up to the monster conferences like <a href="https://www.ciscolive.com/">Cisco Live</a>, <a href="https://www.vmware.com/explore.html">VMware Explore</a>, and <a href="https://reinvent.awsevents.com/">re:Invent</a> – contains an acknowledgment of the importance of open source tools, none of those events is so wholly and wholesomely focused on the tools, techniques, and community that make open source the powerful force for good that it is.</p> <p>It’s hard to imagine any other event supporting over 200 sessions covering a comprehensive range of topics, from how to build and maintain inclusive and engaged communities; to the latest coding tools and techniques; to topics like how to make your alerts suck less.</p> <h2 id="committed-to-open-source">Committed to open source</h2> <p>But this “speeds and feeds” description doesn’t capture the heart of the event. What is most noticeable at All Things Open is how visibly and demonstrably people are there for the event itself. They aren’t there to check a box or browse the swag. It’s clear when you watch the interactions in the hallway and see the attendance in the room that everyone is there to be together, to show their support of open source, and to glean every last bit of knowledge they can.</p> <p>By way of explanation, I will share an anecdote that I promise is not a #humblebrag. My talk was scheduled for the last time slot of the day (3:45-4:30) on the last day of the event. By that point, even the vendors were packing up. I expected to have about ten people come to my session if for no other reason than I was sure people had flights to catch, parties to attend, dinners to eat, and so on.</p> <p>By 3:40, the room was filled to capacity.</p> <p>It was surprising, humbling, and immensely gratifying. But I hold no illusions that I’m some powerhouse speaker who can draw a crowd by virtue of my name – far from it. No, it was clear that everyone in the room was there because they wanted to hear about monitoring and alerting. Moreover, they were committed to attending every aspect of All Things Open to the fullest.</p> <p>If nothing else, that should explain why this event stands out.</p> <h2 id="open-source-community">Open source community</h2> <p>As I’ve already mentioned, the topic of community was ever-present during the event. From sessions and vendors providing the nuts-and-bolts view of how to build a community platform to the more psychological and sociological aspects of creating community spaces that were open, welcoming, and supportive, community was front and center in 10% of the sessions and everywhere in the hallway track.</p> <p>This should surprise nobody, as the entire concept of open source software is built upon the idea of community support. Often, that’s presumed to mean a community of developers working together toward a common goal (if not a common good). But for anyone who dips even a toe into an open source project, the truth becomes apparent almost immediately: community goes far beyond the people with hands on the keyboard slinging code, fixing issues, and submitting pull requests. It goes to the folks promoting the project, finding new and exciting ways to blend two or more projects together, using the solution during a hackathon, or even (maybe especially) providing financial support.</p> <p>None of that happens by accident or “organically” on its own.
Creating a community is an intentional act and goes far beyond the idea of logging issues or providing support. To be sure, vibrant online communities have a distinctively grass-roots feel and a by-the-nerds-for-the-nerds appeal. But an online space that passes what I call “the <a href="https://bechdeltest.com/">Bechdel test</a> for community”* requires more than “I’ve got some support agents, and my uncle has a Slack channel, so let’s put on a community” level of planning.</p> <p>It should be clear that the things I heard and learned about community made an impact on me, and I hope you’ll see some of those lessons reflected in changes to Kentik’s own community strategy in the very near term.</p> <p>* <em>1 - two or more community members with completed profiles; 2 - speaking to each other (not a staff member); 3 - about something other than the vendor</em></p> <h2 id="carefree-open-source">Carefree open source</h2> <p>For me personally, this event stood out because of what I didn’t do: for the last ten years, most of the events I’ve attended have been ones where I also had to work in the booth, talking about the company, giving demonstrations, and handing out swag. At some events, this job took up every spare moment, from the opening of the floor in the morning until the last beer bottle was cleaned up at night. At others, I’d have a half-day shift and could spend the rest of the event wandering around.</p> <p>Unless you’ve also had this experience, it’s hard to understand the distraction this imposes and how it’s difficult to fully invest in the event and really explore, listen, and learn.</p> <p>But in this case, I was a free agent, and I tried to make the best use of it – attending every session that piqued my interest or tickled my fancy, chatting with vendors about their product, their swag, or just their opinion of the industry. That respite from the responsibilities of representing “the brand” was an unexpected and utterly delightful breath of organizational oxygen, and I won’t soon forget the lesson.</p> <h2 id="closing">Closing</h2> <p>If your budget has room for a little travel and a lot of learning, All Things Open should be high on your list of options to consider. <a href="https://www.allthingsopen.org/ato-2024-save-the-date/">They’ve already announced</a> the dates for next year. I know I’ve already added it to my calendar.</p> <p>If you were at All Things Open 2023, and I walked past you without saying “hello,” chide me in the comments below or on <a href="https://www.linkedin.com/in/leonadato/">LinkedIn</a>. Or share your observations to keep this conversation going.</p><![CDATA[Visibly Kentik]]><![CDATA[If there's anything I've learned, monitoring data is the lifeblood of the business and a superpower for any IT practitioner. Monitoring allows organizations to react to changes, identify and recover, and understand the true health of the business. ]]>https://www.kentik.com/blog/visibly-kentikhttps://www.kentik.com/blog/visibly-kentik<![CDATA[Leon Adato]]>Fri, 27 Oct 2023 04:00:00 GMT<p>What’s the opposite of “Hello, I Must Be Going”?</p> <p>I ask because, here at Kentik, one of the requirements for a blog is to have a solid music reference. Or several. I inferred this from a review of existing blog content.
OK, actually, <a href="https://www.kentik.com/blog/why-cant-network-teams-have-nice-things/" title="Kentik Blog: Why Can&#x27;t Network Teams Have Nice Things?">just this blog</a>.</p> <p>It’s important that I get this right because – as both the location and the very existence of this blog imply – I’m able to call Kentik my home now, at least professionally, and I don’t want to do anything that might get things off on the wrong foot. Honestly, <a href="https://www.youtube.com/watch?v=TzIOeSB9RaU">I cannot believe it’s true</a>. And while I know <a href="https://www.youtube.com/watch?v=C9IwBJYTwQ0">you can’t hurry love</a>, I already feel like I’m hanging with old friends.</p> <p>I won’t belabor the point (or the song references). Still, I did want to highlight some of the things I’m excited about now that I’ve officially become a “Kentikian” (yeah, that’s what we call ourselves around here, but <a href="https://www.youtube.com/watch?v=daYXFb0JtsI">it don’t matter to me</a>).</p> <h2 id="observability-is-for-everyone">Observability is for everyone</h2> <p>Monitoring – both network-specific and the broader category of solutions that include APM, synthetics, traces, and even capital-O Observability (whether you spell it out or write it “o11y” like the super cool DevOps folks) – isn’t just a niche skill practiced by a few key people in an org.</p> <p>If there’s anything I’ve learned over the quarter-century of installing, managing, and extending monitoring solutions, it’s that the data they contain is the lifeblood of the business and a superpower for any IT practitioner motivated enough to leverage it. Monitoring and the insights it provides via alerts, reports, and dashboards allow organizations to react to changes more quickly, identify and recover from issues more reliably, and understand the true health and performance of the business and the systems on which that business is built.</p> <p>If there’s one thing I’m excited to share now that I’m here at Kentik, it’s the experiences, lessons, and insights I’ve gleaned from using dozens of tools over thousands of hours at companies that ranged from modest (25-100 systems), to moderate (1,000-5,000 systems), to mind-boggling (250,000 systems).</p> <h2 id="content-by-it-folks-for-it-folks">Content by IT folks, for IT folks</h2> <p>Whether it’s a blog <a href="https://www.kentik.com/blog/why-cant-network-teams-have-nice-things/" title="Kentik Blog: Why Can&#x27;t Network Teams Have Nice Things?">highlighting the ways network monitoring has (and hasn’t) changed</a> or an analysis of what it looks like when an <a href="https://www.kentik.com/blog/ending-saint-helenas-exile-from-the-internet/">entire island cuts over from satellite internet to (undersea) fiber</a>, Kentik places a high value on talking to IT folks the way we speak to each other.</p> <p>Of course, that starts with clear, concise, and detailed explanations of the latest technology or technique and how to leverage it within Kentik’s platform.</p> <p>But it also includes frank explanations when something doesn’t stack up. Even more importantly, Kentik isn’t afraid to share honest looks at what we need to do as tech practitioners to show up and do our best work every day for the businesses that depend on us.</p> <p><a href="https://www.youtube.com/watch?v=DuRF5MJ7xcE">It also means having fun sometimes</a>.
Because if the work were a never-ending joyless struggle, most of us would have built our careers as hamster ranchers, deep sea carpentry engineers, or competitive maraschino cherry jugglers.</p> <h2 id="with-all-due-respect-to-sesame-street-c-is-for-community">With all due respect to Sesame Street, “c” is for community</h2> <p>A user community has to be more than a support forum. It has to not only allow but encourage conversations and connections between members irrespective of their company affiliations or the problems they’re encountering at the moment. A community should uplift, inform, inspire, comfort, and celebrate.</p> <p>I’ll be honest (and one of the reasons I’m thrilled to be at Kentik is because this kind of honesty is not only permitted but valued), our community isn’t there – YET. And my saying that we need it is far more than aspirational. It’s happening. Stay tuned to this channel for more information as it becomes available.</p> <p>But community can be found in many places and in many ways. Community also happens in the comment section of blogs and videos, in the shared stories of the hallway track at conferences and user groups, and in the whispered interjections and hastily scribbled notes during keynotes.</p> <p>Kentik – and incredibly, I’m now included in that amazing collection of folks – is committed to building a vibrant, passionate, engaging, participatory community, and I hope you will stick around to be part of it because it’s going to be something special.</p> <h2 id="the-mostly-unnecessary-summary">The (mostly) unnecessary summary</h2> <p>I still haven’t figured out the opposite of “Hello, I Must Be Going,” but perhaps I don’t need to. In 1930, 21 years before Mr. Collins graced us with his presence, let alone his musical genius, Groucho Marx <a href="https://www.youtube.com/watch?v=_YrNQaXdOxU">sang the song</a> in the classic movie “Animal Crackers.” The lyrics make it clear that leaving doesn’t mean not staying:</p> <p><em>“I’ll stay a week or two <br> I’ll stay the summer through <br> But I am telling you: <br> I must be going”</em></p> <p>Like Groucho, no matter how many times I sign off, I don’t plan to really go anywhere. I plan to be extremely <strong>visible</strong> here at Kentik, whether on <a href="https://www.kentik.com/blog">the blog</a>, in the <a href="https://www.youtube.com/@KentikHQ">video channel</a>, at conferences, or in community spaces.</p> <p>I hope to see you there, too!</p> <p>*<em>For those who skipped comparative linguistics in school, “Kentik” is Yiddish for “visible.”</em></p><![CDATA[How to Harden Zero-Trust Cloud Network Policy with Kentik]]><![CDATA[Zero trust in the cloud is no longer a luxury in the modern digital age but an absolute necessity. Learn how Kentik secures cloud workloads with actionable views of inbound, outbound, and denied traffic.]]>https://www.kentik.com/blog/how-to-harden-zero-trust-cloud-network-policy-with-kentikhttps://www.kentik.com/blog/how-to-harden-zero-trust-cloud-network-policy-with-kentik<![CDATA[Phil Gervasi]]>Thu, 26 Oct 2023 04:00:00 GMT<p>Is managing the security of your cloud network a technical nightmare? The evolving nature of cloud workloads forces you to focus on the visibility of your security posture, but figuring out where to begin can be daunting. 
This is where a zero-trust <a href="https://www.kentik.com/kentipedia/cloud-security-policy-management/" title="Cloud Security Policy Management: Definitions, Benefits, and Challenges">cloud network policy</a> can make a big difference.</p> <p>Zero trust is based on the assumption that everything is potentially a threat, which means there can be no implicit trust. It doesn’t matter where a request is coming from or from which device; it’s assumed to be potentially unsafe.</p> <p>In this article, you’ll learn all about zero-trust cloud network policies, why they’re important to enterprises, and how Kentik can help you implement one.</p> <h2 id="what-is-a-zero-trust-framework">What is a zero-trust framework?</h2> <p>For a long time, cybersecurity was based on the concept of <a href="https://www.forrester.com/report/No-More-Chewy-Centers-The-Zero-Trust-Model-Of-Information-Security/RES56682" title="Forrester Research report on Zero-Trust Information Security Model">trust but verify</a>, which worked when threats existed outside the network. This traditional security model operated on the assumption that the external network was not to be trusted, while everything within it was granted implicit trust. In other cases, network engineers who moonlit as security engineers often focused only (or mainly) on perimeter security. However, changes such as remote work, bring your own device (BYOD), and cloud computing have made this model obsolete. The perimeter is now much more fluid, and the old <em>bad guys out, good guys in</em> philosophy doesn’t work.</p> <p>This is where zero trust comes in. Zero trust helps to mitigate the ever-present threat of <a href="https://www.cisa.gov/topics/physical-security/insider-threat-mitigation/defining-insider-threats" title="CISA: Defining Insider Threats">malicious insiders</a> (i.e., users who misuse the authorized access they’ve been granted). It achieves this by individually evaluating every request, regardless of where it originated.</p> <p>Moreover, zero trust significantly strengthens an organization’s defenses against cyber threats. Within a zero-trust environment, the network is not dependent on separate security solutions; instead, it becomes a security control in its own right by assuming every request is potentially malicious.</p> <p>The principles of zero trust closely align with the mandates and guidelines set out in regulatory frameworks and standards like the <a href="https://gdpr-info.eu">General Data Protection Regulation (GDPR)</a>, <a href="https://listings.pcisecuritystandards.org/documents/PCI_DSS-QRG-v3_2_1.pdf">Payment Card Industry Data Security Standard (PCI DSS)</a>, and <a href="https://www.cdc.gov/phlp/publications/topic/hipaa.html">Health Insurance Portability and Accountability Act (HIPAA)</a>. Therefore, following a zero-trust policy for cybersecurity can help you avoid costly fines for noncompliance.</p> <p>Additionally, zero trust streamlines employee onboarding by eliminating traditional bureaucratic barriers, enabling quicker integration of new hires.
Employees can use personal devices for the initial onboarding processes, and chief information security officers (CISOs) can rest easy knowing the zero-trust engine is evaluating and assessing every request against its trust policies:</p> <img src="//images.ctfassets.net/6yom6slo28h2/30rn8avGPRRDEhixN7l6QC/5712e06a61aaae8b3fa281d5ed20f250/zero-trust.png" style="max-width: 500px" class="image center no-shadow" alt="Zero-trust engine" /> <p>While embracing a zero-trust policy can yield many rewards, it comes with unique challenges. For instance, implementing a network policy that spans both on-premises and cloud networks can be challenging, and you’ll probably consider using a solutions provider that can cover as much of your network infrastructure as possible. This becomes particularly crucial when adopting a <a href="https://www.kentik.com/kentipedia/multicloud-networking/" title="Kentipedia: Multicloud Networking">multicloud strategy</a>, which balances cost savings among different cloud providers and their products. However, this approach is often complicated since you have to manage all the other services offered by the various providers.</p> <p>The key to <a href="https://www.kentik.com/solutions/harden-zero-trust-cloud-network-policy/" title="Learn more about Kentik solutions for hardening zero-trust cloud network policies">implementing a zero-trust network</a> lies in translating the principles into actionable technical controls that can fortify your cloud environments.</p> <p>Imagine a scenario where you’re responsible for the cloud environment of a major fintech company. The holiday season is about to begin, and you want to lock down the environment against cybercriminals who are all too eager to compromise it. Following the principles of zero trust seems like a great place to start, but how do you actually implement it? What should you focus on, and how do you do it without getting bogged down in the complexities of it all?</p> <p>This is where solutions like Kentik can help. The <a href="https://www.kentik.com/product/kentik-platform/" title="Learn more about Kentik’s Network Observability Platform">Kentik Network Observability Platform</a> gives companies insights into their cloud environments, enabling them to detect security issues, establish baselines for network activity, understand how traffic flows through the network and public internet, and see what SaaS applications are in use.</p> <p>In the next section, you’ll learn how easy it is to use Kentik to lock down security groups and cloud access and monitor rejected traffic volume.</p> <h2 id="implementing-a-zero-trust-cloud-network-policy-with-kentik">Implementing a zero-trust cloud network policy with Kentik</h2> <p>Let’s take a look at how Kentik can enable a zero-trust network in the cloud.</p> <h3 id="lock-down-security-groups">Lock down security groups</h3> <p>Security groups are often the entry points into cloud workloads. The rules defined in security groups dictate how and where workloads can be accessed.</p> <p>A frequent oversight in cloud environments is opening up security groups to troubleshoot issues and then forgetting to close them. This violates zero-trust principles and becomes a vulnerability that attackers can exploit.</p>
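<p>Outside of any particular tool, it’s worth periodically sweeping for the classic “opened for troubleshooting, never closed” rule. Here’s a minimal sketch using AWS’s boto3 SDK to list security group rules open to the entire internet (credentials, region, and what you do with the findings are assumptions you’d adapt):</p>
<pre><code>import boto3

ec2 = boto3.client("ec2")  # assumes credentials/region come from the environment

def world_open_rules():
    """Yield (group id, group name, port range) for ingress rules
    that allow traffic from 0.0.0.0/0."""
    for sg in ec2.describe_security_groups()["SecurityGroups"]:
        for perm in sg["IpPermissions"]:
            for ip_range in perm.get("IpRanges", []):
                if ip_range.get("CidrIp") == "0.0.0.0/0":
                    yield (sg["GroupId"], sg["GroupName"],
                           perm.get("FromPort"), perm.get("ToPort"))

for group_id, name, low, high in world_open_rules():
    print(f"{group_id} ({name}): ports {low}-{high} open to the internet")
</code></pre>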
<p>Locking down security groups is essential but far from easy, especially given how quickly cloud environments change. Thankfully, Kentik can help to automate and streamline the process.</p> <p>Following are the steps you can take in Kentik to lock down your cloud environment:</p> <p>If you don’t already have one, <a href="https://www.kentik.com/get-started/" title="Click here to get started with Kentik">create a Kentik account</a> and log in.</p> <p>Once you’ve configured one of your cloud providers to <a href="https://kb.kentik.com/v4/Ka02.htm#Ka02-About_Cloud_Setup" title="Kentik Knowledgebase: Cloud Setup">send flow logs to Kentik</a>, you should be able to see directional data flow on your dashboard.</p> <p>Navigate to the <strong>Dashboard</strong> to get a 360-degree view of your cloud posture, and use Kentik’s filtering options to identify open security groups that should be locked down.</p> <p>Now, you can harden the security group settings to a more restrictive state, allowing only the needed traffic. This step is essential as it <a href="https://kb.kentik.com/v4/Ga08.htm#Ga08-Building_the_Baseline" title="Kentik Knowledgebase: Building the Baseline">creates a baseline</a> for expected and/or unexpected traffic patterns.</p> <p>Then <a href="https://kb.kentik.com/v0/Ab10.htm#Ab10-Policybased_Alerting" title="Kentik Knowledgebase: Policy-based Alerting">create alerts</a> that inform you if the security groups deviate from your set baseline.</p> <p>By enabling these settings, Kentik continually monitors your cloud environment, catching security groups accidentally left open, misconfigured rules, and previously unknown attack vectors.</p> <h3 id="monitor-rejected-traffic-volume-and-update-cloud-policies-as-needed">Monitor rejected traffic volume and update cloud policies as needed</h3> <p>The cloud’s noisy nature makes it essential to isolate the data signals that indicate a threat. A high volume of rejected traffic can reveal many things, including a potential cyber threat trying every entry point to gain access. Properly functioning policies will deny all traffic except the tiny sliver of allowed traffic. Either way, a good zero-trust network policy should continually be adjusted in response to changing conditions and threats.</p> <p>Striking the balance between securing your network and allowing productivity can be challenging, but Kentik can help. Kentik allows you to analyze rejected data volumes and fine-tune policies in real time. Essentially, it is a balance of allowing access while also maintaining a policy of least privilege.</p> <p>To do so, you just need to log into your Kentik account and navigate to the dashboard. There, you can look for instances of rejected traffic and any potential spikes.</p> <p>Kentik’s <a href="https://www.kentik.com/kentipedia/network-traffic-analysis/">analytical features</a> allow you to dive deep into trends and isolate their origins and where the traffic was headed. This enables you to identify both potential threats and bottlenecks within your network.</p> <p>Once you’ve identified the issue causing problems, navigate to the cloud policy settings within Kentik and fine-tune the specified policies. Make sure you harden the policy more if you feel that the rejected traffic indicates a broader threat.</p> <p>Finally, <a href="https://kb.kentik.com/v0/Ab10.htm#Ab10-Policybased_Alerting" title="Kentik Knowledgebase: Policy-based Alerting">create alerts</a> that can notify you in case of surges of rejected traffic. This helps you proactively respond before a potential threat can materialize.</p>
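<p>To show the shape of that alerting logic, here’s a toy spike detector over per-interval rejected-flow counts. It is purely illustrative; in practice the baseline, window, and data source would come from your observability platform rather than a hardcoded list:</p>
<pre><code>from statistics import mean, stdev

def is_spike(history, current, sigmas=3.0):
    """Flag the current rejected-flow count if it sits more than
    `sigmas` standard deviations above the historical baseline."""
    baseline, spread = mean(history), stdev(history)
    return current > baseline + sigmas * spread

# Rejected flows per minute over the last eight minutes:
rejected_history = [120, 135, 118, 142, 127, 131, 125, 138]
print(is_spike(rejected_history, current=460))  # True: investigate
</code></pre>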
<p>Using data to fine-tune security controls is a vital characteristic of a zero-trust network. Kentik makes this entire process intuitive and easy to manage, enabling you to balance usability and security.</p> <h2 id="locking-down-cloud-access">Locking down cloud access</h2> <p>The <a href="https://csrc.nist.gov/glossary/term/least_privilege" title="Least privilege definition from NIST Computer Security Resource Center">principle of least privilege</a> is one of the most fundamental aspects of zero trust, but it can also be one of the most challenging to implement in the cloud. Every user, workload, and API call needs permissions to function, which can become hard to manage over time. However, the consequence of not locking down your cloud environment is that you could potentially leave overly permissive access open that an attacker can misuse.</p> <p>Applying the principles of least privilege in a cloud environment is even more challenging due to cloud resources’ dynamic and ephemeral nature. This is where Kentik adds value by providing an easy way to restrict access without bogging the user down.</p> <p>To lock down cloud access with Kentik, navigate to the <strong>Dashboard</strong> and identify which users need to be able to access your cloud resources. Remember to create alerts that notify you of unexpected user activity (i.e., any actors outside your established baseline).</p> <p>Kentik provides users with an intuitive way of restricting access to the bare minimum required and enforcing zero-trust principles within the cloud.</p> <h3 id="audit-access-to-cloud-resources">Audit access to cloud resources</h3> <p>Your cloud environment can house some sensitive assets, like client databases or signed documents uploaded to an <a href="https://aws.amazon.com/s3/" title="AWS S3">Amazon Simple Storage Service (Amazon S3)</a> bucket. It’s essential that you maintain constant vigilance over these assets, as unauthorized access can lead to a data breach.</p> <p>For zero trust to be effective, it’s critical to have real-time insights into who is performing which actions and where within your cloud environment. Unfortunately, given the sheer volume of activities that can take place, it’s difficult to extract any meaningful information.</p> <p>Early detection of suspicious activity can mean the difference between a simple alert and a full-on data breach. Kentik offers a simple way of extracting useful information without drowning you in it. All you need to do is navigate to your dashboard and analyze Kentik’s access logs to get an overview of what is happening within your cloud environment.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/57xtKXHdgJ211YhH4oJoRU/a98a0b129791e8e533c6e362dbe9f1fc/azure-denied-traffic-investigation.gif" style="max-width: 800px" class="image center" withFrame thumbnail alt="Investigating denied traffic in Azure" /> <p>It’s recommended that you have an idea of what constitutes normal within your environment so that it’s easier to identify any deviations from the regular baseline.</p> <p>Kentik also allows you to drill down using various criteria, such as focusing on a particular resource, time frame, or user. This enables you to get meaningful context on events and audit them.</p>
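<p>The same baseline idea applies to access auditing. Below is a toy sketch that flags actors outside an approved baseline in an access log; the field names and log format are invented for illustration, since real log schemas vary by cloud provider:</p>
<pre><code>APPROVED_ACTORS = {"ci-deployer", "backup-service", "alice"}

access_log = [
    {"actor": "alice", "action": "GetObject", "resource": "contracts-bucket"},
    {"actor": "mystery-role", "action": "ListBuckets", "resource": "*"},
]

def unapproved_access(entries):
    """Yield log entries whose actor falls outside the approved baseline."""
    for entry in entries:
        if entry["actor"] not in APPROVED_ACTORS:
            yield entry

for entry in unapproved_access(access_log):
    print("Review:", entry)  # flags the 'mystery-role' entry
</code></pre>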
<p>You can also set up alerts for particular patterns or access attempts for proactive security monitoring and audits.</p> <p>Auditing can change from a cumbersome, manual activity to a graphical, intuitive task with Kentik. Its interface helps you identify any activity that warrants further investigation and gives you detailed visibility and understanding of your environment.</p> <h2 id="conclusion">Conclusion</h2> <p>The zero-trust framework marked a significant advancement in the evolution of network security principles. This framework incorporates the following fundamental principles:</p> <ul> <li>Users and devices on a network should not be trusted but verified.</li> <li>Every user should get only the necessary access to fulfill their purpose.</li> <li>Continuous monitoring of access logs is necessary to identify incorrect access patterns.</li> </ul> <p>Zero trust is an evolving field within cloud security that requires the proper tooling to be appropriately implemented. Kentik helps users enforce zero-trust principles from a unified interface without drowning in technical complexities.</p> <p>Zero trust in the cloud is no longer a luxury in the modern digital age but an absolute necessity. By using Kentik, you can enforce zero trust with complete confidence and strike that magic balance between security and usability within the cloud.</p><![CDATA[What Is Adaptive Flowspec and Does It Solve the DDoS Problem?]]><![CDATA[Managing modern networks means taking on the complexity of downtime, config errors, and vulnerabilities that hackers can exploit. Learn how BGP Flow Specification (Flowspec) can help to mitigate DDoS attacks through disseminating traffic flow specification rules throughout a network.]]>https://www.kentik.com/blog/what-is-adaptive-flowspec-and-does-it-solve-the-ddos-problemhttps://www.kentik.com/blog/what-is-adaptive-flowspec-and-does-it-solve-the-ddos-problem<![CDATA[Phil Gervasi]]>Wed, 25 Oct 2023 04:00:00 GMT<p>Network administrators need to make sure their users have secure and efficient connectivity to their applications and data. At the same time, our networks have become increasingly complex. This is partly due to the sheer volume of data needed to run everyday business plus our growing reliance on the public internet. Add in technologies like cloud and SaaS, AI, big data, and the continuous addition of users and devices, and you can see that the resulting threat landscape is always changing.</p> <p>It can be incredibly difficult to manage any modern network with this sort of complexity, and that difficulty is compounded as the risk of downtime, configuration errors, security loopholes, and exploitable vulnerabilities grows along with it.</p> <p>One specific threat to our network security is the distributed denial of service (DDoS) attack. From a high level, a DDoS attack floods a website or a system with fake requests, making the system unavailable for legitimate users. There are several different variations of DDoS attacks, but regardless of the form, DDoS attacks are more of an issue today than they’ve ever been because of our reliance on the public internet and internet-based services.</p> <div as="Promo"></div> <h2 id="ddos-attacks--they-are-everywhere">DDoS attacks — they are everywhere</h2> <p>There are several notable examples of DDoS attacks in the news in recent years.
For example:</p> <ul> <li>During the Winter Olympic Games in February 2018, the official website of the Olympic Organizing Committee was forced to shut down due to a DDoS attack.</li> <li>GitHub also fell victim to a DDoS attack in February 2018, with traffic peaking at 1.35 Tbps.</li> <li>Amazon Web Services experienced a similar attack in October 2019, with an outage affecting many websites for several hours.</li> </ul> <p>And this is just the tip of the iceberg — many other <a href="https://www.kentik.com/analysis/" title="Kentik Network Analysis Center: Tracking DDoS Attacks, BGP leaks, and other network incidents">significant examples have been reported</a>, with the number growing yearly.</p> <p>These attacks are often carried out by a network of devices, either computers or IoT devices, which hackers infected with malware. These infected devices, known as bots, form a network called a botnet. In a volumetric type of DDoS attack, these bots, or “zombies,” simultaneously send requests to the target IP address, overwhelming the system and leading to service degradation or even a complete shutdown.</p> <img src="//images.ctfassets.net/6yom6slo28h2/40EMhkorF5tijIS8sxk0s6/c35564f8a29f3a4129c49bee7fdeea13/nature-of-ddos-attack.png" style="max-width: 420px;" class="image center no-shadow" alt="Diagram showing the nature of a DDoS attack" /> <p>As noted in <a href="https://www.statista.com/statistics/1260806/ddos-attacks-by-vertical-united-states/" title="Statista: Frequency of distributed denial of service (DDoS) attacks in the United States in 2020, by vertical">one research source</a>, there were over 174,000 DDoS attacks in the USA alone in 2021. These attacks result in downtime, transaction loss, reputational risk, and real dollar impact on your service or product sales.</p> <p>In other words, a DDoS attack poses both a financial and a business risk.</p> <h2 id="overview-of-flowspec-and-adaptive-flowspec">Overview of Flowspec and Adaptive Flowspec</h2> <h3 id="what-is-bgp-flowspec">What is BGP Flowspec?</h3> <p>BGP Flow Specification (Flowspec) is a powerful technology to mitigate DDoS attacks. It’s an extension to the Border Gateway Protocol (BGP) that allows for disseminating traffic flow specification rules throughout a network.</p> <p>BGP policies using Flowspec help mitigate DDoS attacks by rapidly filtering traffic across an entire network. In light of the ever-growing DDoS attacks network administrators have been dealing with as we move to a cloud network model, the IETF’s RFC 5575 proposed Flowspec explicitly as a DDoS mitigation mechanism to work on top of BGP.</p> <p>Using BGP Flowspec, we have granular control to match a particular flow with a source and destination, and also match parameters such as packet length, fragment, and so on. With Flowspec, we can dynamically perform three different actions:</p> <ul> <li>Drop the traffic</li> <li>Inject it in a different VRF (virtual routing and forwarding) for analysis</li> <li>Allow it, but throttle it at a defined rate</li> </ul> <p>BGP directs a specific flow format to border routers, prompting them to generate access control lists (ACLs), which can be used to manage packet forwarding from external sources. In this way, filtering policies can be pushed to edge devices to mitigate a DDoS attack. As traffic matches a rule pushed by BGP Flowspec, routers take the prescribed action.</p>
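<p>To make the match-and-action model concrete, here’s a small sketch of what a Flowspec rule amounts to. The rendered text loosely mirrors the textual form used by open source BGP injectors such as ExaBGP, but treat the exact syntax as an assumption and consult your router or injector documentation:</p>
<pre><code>from dataclasses import dataclass

@dataclass
class FlowspecRule:
    destination: str   # prefix under attack
    protocol: str      # e.g., "udp"
    dest_port: int     # e.g., 53 for a DNS flood
    action: str        # "discard", "redirect VRF-NAME", or "rate-limit 9600"

    def render(self) -> str:
        """Render the rule in an ExaBGP-like textual form (approximate)."""
        return (f"flow route {{ match {{ destination {self.destination}; "
                f"protocol {self.protocol}; destination-port ={self.dest_port}; }} "
                f"then {{ {self.action}; }} }}")

# Drop a UDP/53 flood aimed at a single victim prefix:
print(FlowspecRule("192.0.2.10/32", "udp", 53, "discard").render())
</code></pre>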
<h2 id="does-bgp-flowspec-prevent-all-ddos-attacks">Does BGP Flowspec prevent <em>all</em> DDoS attacks?</h2> <p>While BGP Flowspec offers volumetric DDoS attack mitigation, it struggles against some more advanced forms of attack. In volumetric DDoS attacks, clients apply Flowspec rules at border routers, then ask the service provider to filter specific traffic types targeting the affected IP addresses and ports.</p> <p>However, for other types of DDoS attacks, such as amplification DDoS attacks, clients must understand the ISP’s network to effectively mitigate the attack, which means much more coordination between the client and the ISP.</p> <img src="//images.ctfassets.net/6yom6slo28h2/43Pptd4iBERIBwJ6sEiGLb/0cf839cc731fa2be0f1d549216e4299c/bgp-flowspec.png" style="max-width: 800px;" class="image center no-shadow" alt="BGP Flowspec diagram" /> <div class="caption" style="margin-top: -55px;">Working of BGP Flowspec (adapted from cisco.com)</div> <h2 id="what-is-adaptive-flowspec">What is Adaptive Flowspec?</h2> <p>Adaptive Flowspec improves upon Flowspec by dynamically adjusting or modifying flow specification parameters based on changing network conditions or requirements; this is the “adaptive” part of Adaptive Flowspec. This adaptability allows for more flexible and efficient traffic management in response to varying network conditions, ensuring optimal performance and resource utilization.</p> <p>The dynamic nature of Adaptive Flowspec is its power, and this ability to update rules based on changing network conditions, or in the case of this discussion, DDoS attack traits, can be completely automated, relying on a programmatic workflow and real-time network and traffic analysis.</p> <p>With traditional Flowspec, a subscriber to a distributed filtering service can request filtering for detected DDoS flows. However, while filtering DDoS traffic based on protocol, TCP flags, and destination is straightforward, source-based filtering is more complex.</p> <p>Source-based Flowspec policies can discard traffic at several levels of granularity:</p> <ul> <li>Discard all traffic from an IP prefix.</li> <li>Discard all traffic from a specific IP.</li> <li>Discard traffic from an IP with a particular source port.</li> </ul> <p>Adaptive filtering lets subscribers switch between these granularities as required. This enhances the dynamic nature of responding to what are likely adaptive attacks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/63TP0T8EzQVV5nnjKMyaDI/033d0cf6a3ca14f2f46d3ee70d8848bf/flowspec-policy-example.png" style="max-width: 600px;" class="image center" alt="Flowspec policy" /> <div class="caption" style="margin-top: -35px;">A simple Junos configuration example of a Flowspec policy</div> <p>For an IP prefix producing DDoS traffic, if its legitimate traffic is minimal, filtering the entire IP prefix might be suitable, especially during intense attacks. Otherwise, evaluate each subprefix, create rules for subprefixes sending DDoS traffic, ignore those mainly sending legitimate traffic, and recursively apply this strategy to subprefixes with mixed traffic.</p>
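<p>That recursive strategy is simple enough to sketch directly. Here’s a toy version that decides, per prefix, whether to filter wholesale, leave alone, or split and recurse. The 90% purity threshold and the traffic-share oracle are assumptions for illustration:</p>
<pre><code>import ipaddress

def plan_filters(prefix, ddos_share, purity=0.9):
    """Return the list of prefixes to filter.

    ddos_share(prefix) gives the fraction of the prefix's traffic that
    is DDoS. Mostly-attack prefixes are filtered whole, mostly-legitimate
    ones are left alone, and mixed ones are split in half and evaluated
    recursively."""
    share = ddos_share(prefix)
    if share >= purity:
        return [prefix]  # filter the whole prefix
    if share <= 1 - purity or prefix.prefixlen == prefix.max_prefixlen:
        return []        # leave mostly legitimate traffic alone
    rules = []
    for sub in prefix.subnets(prefixlen_diff=1):  # mixed traffic: go finer
        rules.extend(plan_filters(sub, ddos_share, purity))
    return rules

# Toy oracle: pretend all attack traffic comes from 203.0.113.0/25.
ATTACK_SOURCE = ipaddress.ip_network("203.0.113.0/25")

def toy_share(p):
    if p.subnet_of(ATTACK_SOURCE):
        return 0.95
    return 0.5 if p.overlaps(ATTACK_SOURCE) else 0.02

print(plan_filters(ipaddress.ip_network("203.0.113.0/24"), toy_share))
# -> [IPv4Network('203.0.113.0/25')]
</code></pre>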
<p>Adaptive Flowspec uses distributed filtering across network nodes. This model works at various scales: internet-wide, within ISPs, or with DDoS-scrubbing services. DDoS traffic can be filtered at different autonomous systems (ASes) across the internet (Figure 1), inside ISP routers or programmable switches (Figure 2), or within DDoS-scrubbing service data centers (Figure 3).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4LryWBkl1UxRMW6pyrj9l4/f08d1188662a7a4d66c089da53ca6e00/adaptive-flowspec-fig1.png" style="max-width: 500px;" class="image center no-shadow" alt="figure 1" /> <img src="https://images.ctfassets.net/6yom6slo28h2/2VVKLwJMXPNNs85ntEYiIF/829ffcb9afd2238abe3baf23ed4999a4/adaptive-flowspec-fig2.png" style="max-width: 500px;" class="image center no-shadow" alt="figure 2" /> <img src="https://images.ctfassets.net/6yom6slo28h2/6Ize3FlgLzUGAD8chNC7IQ/137681b71fd14ba0409e3e0616cb5022/adaptive-flowspec-fig3.png" style="max-width: 500px;" class="image center no-shadow" alt="figure 3" /> <p>Adaptive Flowspec employs feedback loops to refine its filtering policies. One issue that can arise is that a policy aiming to filter DDoS traffic sometimes blocks legitimate traffic as well; the feedback loops help it better distinguish and target only the malicious traffic.</p> <h2 id="mitigating-ddos-attacks-using-adaptive-flowspec">Mitigating DDoS attacks using Adaptive Flowspec</h2> <p>Once a DDoS attack is detected, reactive DDoS mitigation strategies are employed. Ultimately, the DDoS attack has already begun, so these strategies are about minimizing the damage caused by the attack. Common mitigations include rate limiting and traffic scrubbing, both of which focus on mitigating a DDoS attack in real time.</p> <p>Adaptive Flowspec generates rules at a more granular level, so the mitigation is more precise, with less unintended traffic blocking. It deploys rules at the most appropriate filtering nodes on the path of the actual DDoS traffic, which implies a great deal of network visibility is required to understand where the malicious traffic is flowing.</p> <p>Because traffic classification can be done on a granular basis, or in other words, flow-by-flow, it’s essential to run adaptive filtering on top of DDoS classification. Network administrators can then fine-tune traffic rules based on actual real-time traffic information and threat detection.</p> <img src="//images.ctfassets.net/6yom6slo28h2/62m2KvovNGN4GqRAmXy635/6c862bf9cff72265b729ec1b4a9f28c3/fine-tune-traffic-rules.png" style="max-width: 450px;" class="image center no-shadow" alt="Fine-tuning traffic rules" /> <p>This BGP-based strategy is used because BGP is so customizable and ubiquitous in modern networking. It’s an excellent tool to easily propagate and apply traffic rules in normal operations and in the event of malicious network activity.</p> <h2 id="challenges-and-considerations">Challenges and considerations</h2> <p>Thousands of DDoS flows can stem from only a few IP prefixes, millions from a group of IP addresses, and even more from combined IP and port numbers. Monitoring and determining filtering rules for each very granular scenario can be costly in terms of the time and resources it takes to figure out what’s going on and how to mitigate the attack properly.
There’s also the risk of collateral damage if and when good traffic is inadvertently dropped, as well as the burden of managing an inordinate number of classification and mitigation rules.</p> <p>Nodes on the DDoS traffic path must distinguish and filter DDoS from legitimate traffic to avoid dropping legitimate source traffic; Adaptive Flowspec addresses these challenges by countering DDoS attacks proactively and dynamically.</p> <h2 id="is-adaptive-flowspec-the-optimal-ddos-mitigation-tool">Is Adaptive Flowspec the optimal DDoS mitigation tool?</h2> <p>The BGP Flow Specification (Flowspec) and its adaptive counterpart, Adaptive Flowspec, present formidable tools against DDoS threats. Flowspec disseminates traffic rules across a network, providing a method for traffic control, rapid deployment, and dynamic responses to DDoS attack variations.</p> <p>Meanwhile, Adaptive Flowspec can dynamically recalibrate these rules based on evolving network conditions and attack characteristics, providing a proactive, adaptive mitigation method. Of course, both solutions require deep and broad network visibility to monitor, identify, and filter traffic during an attack.</p><![CDATA[Why Can’t Network Teams Have Nice Things?]]><![CDATA[Let me tell you something you already know: Networks are more complex than ever. They are massive. They are confounding. Modern networks are obtuse superorganisms of switches, routers, containers, and overlays; a hodgepodge of telemetry from AWS, Azure, GCP, OCI, and sprawling infrastructure that spans more than a dozen timezones.]]>https://www.kentik.com/blog/why-cant-network-teams-have-nice-thingshttps://www.kentik.com/blog/why-cant-network-teams-have-nice-things<![CDATA[David Klein]]>Thu, 19 Oct 2023 04:00:00 GMT<p>Despite this (or perhaps in part because of it), customer expectations have never been higher. Everyone expects everything to work perfectly all the time. Every time. Every moment.</p> <p>So, in my loudest, most strident, and most concerned voice, I must ask, “Why are network engineers stuck with tools that don’t work?” Or, in a more gentle tone: “Don’t network teams deserve nice things too?”</p> <p>Modern networks require — <em>not need, but require</em> — modern network solutions.</p> <p>Legacy <a href="https://www.kentik.com/kentipedia/network-performance-monitoring/" title="Kentipedia: Network Performance Monitoring">network performance monitoring</a> systems (they might rhyme with shiverbed or rollerwinds) are not capable of delivering on the ever-expanding needs of network, cloud, and security engineers. And, more importantly, the needs of the eight billion people whose lives are, for better or worse, inextricably linked to the networks that connect them to the world and one another.</p> <h2 id="this-is-the-part-where-i-talk-about-kentik">This is the part where I talk about Kentik</h2> <p>Kentik ingests <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/" title="Kentik Blog: Best Practices for Enriching Network Telemetry to Support Network Observability">telemetry</a> from physical and virtual infrastructure at mind-numbing scale. (200 trillion flows were processed and stored in 2022!) It synthesizes, analyzes, and adds context to this data so anyone can answer any question about their network.
Or set up alerts and automation to surface and resolve issues without even needing to ask.</p> <img src="//images.ctfassets.net/6yom6slo28h2/523sKzpBEenGgzkNjZDuRR/9dd23e3cad41304b3c84b6ad6bf6d55a/network-observability-questions.jpg" style="max-width: 600px;" class="image center" alt="Network observability questions" /> <p>Another way to look at it: Kentik doesn’t just highlight <em>what</em> is happening, but indicates why, and what needs to be done to resolve an issue.</p> <p>Rather than re-label features and point solutions like three network monitoring solutions in a trenchcoat (I may have watched too many episodes of BoJack Horseman), Kentik is a single, unified platform for network performance, visibility, cost, capacity, and security that just so happens to have an NPS score so high it can only be described as <del>“cult-like”</del> super duper good. Ya, that’s a safer way to write that.</p> <div as="Promo"></div> <h2 id="hybrid-cloud-visibility-is-table-stakes-now-or-at-least-will-be-soon">Hybrid cloud visibility is table stakes now (or at least will be soon)</h2> <p>Visibility with a hybrid cloud infrastructure is tough. Not only do you need insight into the traffic flowing within the networks you control, but you also need to see traffic in networks you don’t control. And that’s not even taking the public internet into account. It’s a lot of data from a lot of sources, and even the smallest blind spot can end up being disastrous. This can be something really bad, like losing hundreds of thousands of dollars because customers cannot order your products, or something really, really bad, like having to explain to a five-year-old that she can’t watch Bluey.</p> <p>Kentik’s <a href="https://www.kentik.com/product/multi-cloud-observability/" title="Learn more about Kentik solutions for Hybrid and Multi-Cloud Network Observability">hybrid cloud visibility</a> is drastically different from legacy tools.</p> <p>First, Kentik has a unique view of internet traffic patterns and performance. We create path traces and gather performance metrics among major service providers, monitor the major cloud providers’ performance over the internet, and ingest the global routing table to augment our understanding of path selection between our customers and the cloud.</p> <p>Second, our cloud observability solution (can you guess the name?), <a href="https://www.kentik.com/product/cloud/">Kentik Cloud</a>, ingests flow logs, cost information, application and process tags, and a variety of other data to provide deep visibility into how traffic is moving within a cloud instance, in between cloud regions, between different public clouds, and to on-premises data centers.</p> <p>This means an engineer can easily visualize the overall traffic flow between AWS (or any of the major cloud providers) and their own private data center, including the internet in between.</p> <p>Look, we know that legacy NPM and NMS companies pretty much wrote the book on monitoring WANs (even if it is with Java interfaces), but their cloud visibility hasn’t kept up. Collecting flow information with a single tool deployed as a physical or virtual appliance on-prem or in the cloud is a solution for a bygone era. No one, and I mean no one, is excited by the idea of another network appliance. (Except for people selling network appliances.
They are over the moon.)</p> <p>It also means that legacy tools lack visibility into significant network activity, specifically how traffic moves between resources in various clouds, back on-premises, and on the public internet. If you’re running a multi-cloud architecture, good luck finding a legacy solution that supports every major cloud provider — and answers any question you have about your network or how customers experience it.</p> <h2 id="this-is-not-yet-the-part-where-we-quote-vanilla-ice">This is not yet the part where we quote Vanilla Ice</h2> <p>Installing a lightweight test agent at a branch office is one thing, but legacy NMS generally means installing physical and virtual appliances in multiple locations. That means appliances to manage, cooling and power to consider, and failed hard drives and power supplies to plan around; a lot of extra steps with a whole new suite of complications.</p> <p>That’s why Kentik was designed as a fully SaaS-delivered solution with no physical appliances for customers to install and maintain.</p> <p>Here’s the typical process of engaging with Kentik:</p> <ul> <li><strong>Step 1</strong>: Read or watch something from the marketing team that CHANGES YOUR LIFE</li> <li><strong>Step 2</strong>: Write the CMO a nice email about how amazing the marketing team is</li> <li><strong>Step 3</strong>: Succumb to a web form (yes, no one likes them) and start a Kentik trial or get a demo</li> <li><strong>Step 4</strong>: Unbridled excitement!</li> <li><strong>Step 5</strong>: All the ROI: Reduced costs, improved performance, security, etc.</li> </ul> <h2 id="finding-meaning-in-the-data-and-possibly-life">Finding meaning in the data (and possibly life)</h2> <p>There’s a big difference between seeing more data on graphs and charts — and understanding what that data says about application delivery. It’s not enough to show an interface statistic or the number of bits on a colorful graph. Modern networks require an advanced solution to apply statistical analysis algorithms and appropriate machine learning models to the entire dataset to derive meaningful insight.</p> <p>Kentik ingests all telemetry into the platform and performs essential activities such as transforming data and applying specific models to identify patterns, detect anomalies, find meaningful correlations, and make predictions.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4Paz01GG6V7ZT8DvDVbn9B/783b26b188ec1f9d8ae14d6c6673c282/kentik-map-see-your-network.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik Map" /> <p>Kentik Insights, for example, provides unsolicited insight generated by the system that highlights what an engineer needs to pay attention to. That could be an interface being unexpectedly overutilized after a configuration change, AWS costs trending upward after deploying a new DNS service, botnet activity detected in a given geo, and so on.</p> <p>Legacy network performance monitoring tools struggle to parse and filter the data quickly enough for network teams to take action before customers are affected.</p> <h2 id="enriching-data-with-business-context-is-good-right">Enriching data with business context is good, right?</h2> <p>To truly understand application performance and delivery, we need more than just flows, SNMP information, streaming telemetry, etc. We need qualitative information that wraps the hard data in a business context. That way, engineers can understand more than just bits per second.
Wouldn’t it be helpful to know how specific traffic patterns affect a monthly cloud bill, or how latency among containers in AWS relates to the performance of a particular application, for a specific end-user, in an exact location?</p> <p>Kentik enriches its database with a cavalcade of additional information such as geographic identifiers, the global routing table, information from IPAM databases, container process identifiers, DNS server information, application and security tags, etc. We want to color our data with this additional context to aid in our understanding of why, not just what.</p> <p>Something legacy NMS and NPM solutions seem to forget: Today’s applications are delivered primarily over the public internet. A visibility solution must provide data and insight into what’s happening with an application’s traffic traversing service provider networks. So, why do so many legacy visibility solutions avoid it altogether?</p> <p>Kentik’s extensive work in global provider networks means that alongside internal flow, SNMP, streaming telemetry, and enrichment data is the entire global routing table, path traces between global cloud providers, metrics among ISP points of presence, and even peering information from <a href="https://www.kentik.com/blog/making-peering-easy-announcing-the-integration-of-peeringdb-and-kentik/">sources such as PeeringDB</a>.</p> <p>Captain Obvious alert: Today’s applications are delivered as SaaS apps or from public cloud resources. Applications rely entirely on the public internet, meaning having provider visibility is essential to modern network operations. (Not the thoughtiest leadership that’s ever been thoughted, but still a valid point.)</p> <h2 id="why-should-you-ditch-your-network-tool">Why should you ditch your network tool?</h2> <p>To quote famed network engineer Vanilla Ice in the 1991 classic <em>Cool As Ice</em>, “Drop that zero and get with the hero.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/1UhQnMEdXJLaX3C9iOQasq/2762a6268033ed3e64bca940315ac8e9/vanilla-ice-poster.jpg" style="max-width: 400px;" class="image center" alt="" /> <p>Is that perhaps a bit much? Yes. But it’s so hard to find a relevant Vanilla Ice quote. Anything from <em>Ice Ice Baby</em> is too trite for a professional blog post (this loosely falls under that category), and “Go ninja, go ninja, go!” is painfully jaunty and completely asinine, and I don’t think I can make it work even with a desperate shout out to Robert Griesemer, Rob Pike, and Ken Thompson.</p> <p>So, who are the heroes in this poorly chosen analogy? The Kentik product and engineering teams! As I write this, I’m singing the loudest Joe Esposito I can muster.</p> <p>You know the song. That one.</p> <p>Enough asides: Kentik was built from the ground up, from its very inception, as a network observability platform. From the beginning, the goal was (and is) to collect more data than the day before, find more types of telemetry to add to the dataset, and refine the insight derived from a unified data repository.</p> <p>It seems obvious after looking at the myriad ways legacy NPM and NMS tools have struggled to modernize alongside network infrastructure.
Kentik is a modern network observability solution created in the SaaS/big data era, made for today’s network.</p> <p>The legacy NPM and NMS tools used (and, dare I say, not loved) by many of today’s network teams are an outdated suite of point solutions bundled together for how engineers used to work.</p> <h2 id="what-you-should-do-now">What you should do now</h2> <ul> <li>Tell <a href="https://www.linkedin.com/in/avifreedman/">Kentik’s CEO, Avi Freedman</a>, that this blog post was a good idea, and not a very, very bad idea.</li> <li>Sign up for a <a href="https://www.kentik.com/get-demo">Kentik demo</a>.</li> <li>Listen to the Karate Kid soundtrack (but only briefly).</li> </ul><![CDATA[Ending Saint Helena’s Exile from the Internet]]><![CDATA[Just after midnight on October 1, 2023, the remote island of Saint Helena in the South Atlantic began passing internet traffic over its long-awaited, first-ever submarine cable connection. In this blog post, we cover how Kentik’s measurements captured this historic activation, as well as the epic story of the advocacy work it took to make this development possible.]]>https://www.kentik.com/blog/ending-saint-helenas-exile-from-the-internethttps://www.kentik.com/blog/ending-saint-helenas-exile-from-the-internet<![CDATA[Doug Madory]]>Wed, 11 Oct 2023 04:00:00 GMT<p>Just after midnight on October 1, 2023, the remote island of Saint Helena in the South Atlantic began passing internet traffic over its long-awaited, first-ever submarine cable connection. Hours later, I <a href="https://www.kentik.com/analysis/saint-helena-activates-long-awaited-subsea-connection/">published the first evidence</a> of the submarine cable activation, continuing a practice of mine which has included reporting the activations of the <a href="https://archive.nanog.org/meetings/nanog57/presentations/Tuesday/tue.lightning1.madory.alba-1.pdf">ALBA-1 cable</a> connecting Cuba to Venezuela, the <a href="https://twitter.com/DougMadory/status/1482414748666941445">Tonga Cable</a> in the South Pacific, and the <a href="https://www.vice.com/en/article/ypw35k/russia-built-an-underwater-cable-to-bring-its-internet-to-newly-annexed-crimea">Kerch Strait cable</a> connecting Crimea to mainland Russia.</p> <p>If you’ve ever heard of Saint Helena, it might be due to its role as the <a href="https://www.historic-uk.com/HistoryUK/HistoryofBritain/Napoleons-Exile-on-St-Helena/">final place of exile</a> for Napoleon Bonaparte, beginning in 1815. Up until a couple of weeks ago, the <a href="https://www.worldometers.info/world-population/saint-helena-population/">5,300 residents</a> of this British overseas territory lived in an updated form of exile: from the benefits of high-speed internet access which define modern society.</p> <p>In this blog post, I cover how Kentik’s measurements captured this historic activation, as well as the epic story of the advocacy work it took to make this development possible.</p> <img src="//images.ctfassets.net/6yom6slo28h2/34YmMrCQtQuCqduojrMXUw/2e3fec13c6df20e846d2e2d360798f49/globe-st-helena.png" style="max-width: 400px;" class="image center no-shadow" alt="Globe showing location of St. Helena" /> <h2 id="evidence-of-the-cable-activation">Evidence of the cable activation</h2> <p><a href="https://www.sure.co.sh">Sure, South Atlantic</a> (AS210841) isn’t just <em>one of the ISPs</em> in Saint Helena; it is <em>the</em> <em>only ISP</em> for Saint Helena.
Sure also enjoys monopoly status in the other British overseas territories of <a href="https://en.wikipedia.org/wiki/Ascension_Island">Ascension</a> and the <a href="https://en.wikipedia.org/wiki/Falkland_Islands">Falkland Islands</a>. More on that later.</p> <p>In anticipation of Saint Helena’s cable activation, I had set up numerous <a href="https://www.kentik.com/product/synthetic-monitoring/">synthetic tests</a> in Kentik to measure latency to the island from dozens of our agents around the world. Below is an example of what one of those tests reported in the early hours of October 1.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3sXsetf6jcBfDWs8RbMv9t/a7daddfb47eb6ca3b39c79cd228d6f85/london-st-helena-oct1.png" style="max-width: 800px;" thumbnail withFrame class="image center" alt="Synthetic tests London to St. Helena" /> <p>The screenshot above illustrates the dramatic drop in latency (from 657ms to 131ms!) from AWS’s <code>eu-west-2</code> region in London to Sure (AS210841).</p> <p>Due to the speed of light and the location of the orbiting satellites, it is impossible to attain a round-trip time (RTT) less than 480ms over a geostationary satellite connection. The observed drop in latency was almost exactly this constant value, which makes sense given it was due to the removal of a high-latency satellite hop.</p> <div as="Promo"></div> <p>Furthermore, the switch to fiber optic connectivity evidently eliminated the jitter (variability in latency) that had been experienced over the geostationary satellite link.</p> <p>Conversely, latency from AWS’s region in Cape Town, South Africa (pictured below) remains high despite a similar drop. Clearly, traffic is still required to make its way up to Europe before coming back along the Equiano cable.</p> <img src="//images.ctfassets.net/6yom6slo28h2/56PF1eZYjoqMZtu0Ru8UXw/59076f81e7e3c132527a8ed5dcad73fd/capetown-st-helena.png" style="max-width: 800px;" thumbnail withFrame class="image center" alt="Latency from Cape Town to St. Helena" /> <p>The cable activation was also visible in BGP. Prior to October 1, satellite operator Intelsat (AS22351) was the sole transit provider for Sure (AS210841), colored in light green in the stacked plot below. Then, the upstream ASN changed from Intelsat to another Sure network (AS8680) — initially, and then permanently as the satellite connection was replaced with a submarine cable connection.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2Ajs56i0i5HQe2GEW0UNx8/f69cde3f59e46ecbad66269366ba878a/connection-replaced.png" style="max-width: 800px;" thumbnail withFrame class="image center" alt="Chart showing satellite connection replaced with submarine cable connection" /> <p>Lastly, the change was quite visible in traffic stats as well. We saw an enormous surge of traffic flowing to Saint Helena, based on Kentik’s aggregate NetFlow, as the submarine cable offered a huge increase in capacity versus the previous satellite connection.
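</p> <p>As an aside, that ~480ms satellite floor is easy to sanity-check. The sketch below is simple geometry, assuming a ground station directly beneath the satellite (real paths are slightly longer):</p> <pre><code># Back-of-the-envelope check of the ~480ms geostationary RTT floor.
C_KM_PER_S = 299_792.458   # speed of light in a vacuum
GEO_ALT_KM = 35_786        # geostationary orbit altitude

# One RTT = query up/down to the satellite, plus the reply up/down.
path_km = 4 * GEO_ALT_KM
rtt_ms = path_km / C_KM_PER_S * 1000
print(f"{rtt_ms:.0f} ms")  # -> 477 ms, before any terrestrial latency
</code></pre> <p>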
Our friends at Cloudflare <a href="https://twitter.com/lpoinsig/status/1708656447415017625">reported a similar spike</a> in traffic volume delivered to Saint Helena after the cable activation.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1FNsJbxMIaVmkWenEtpf21/0c7285a69f3bd3de95d2955695dd94ee/traffic-to-st-helena.png" style="max-width: 800px;" thumbnail withFrame class="image center" alt="Traffic surge with subsea cable" /> <h2 id="the-story-behind-the-activation">The story behind the activation</h2> <p>Saint Helena’s submarine cable connection would never have happened without a tireless lobbying effort spanning more than a decade. The leader of this effort was my good friend, German telecommunications expert <a href="https://www.linkedin.com/in/cvdropp/recent-activity/all/">Christian von der Ropp</a>.</p> <p>Christian first learned of Saint Helena’s situation from a childhood friend who spent ten months there in 2008. One of the most remote places on earth, Saint Helena was only reachable by boat until the UK government appropriated £200m to fund the <a href="https://www.gov.uk/government/news/airport-to-be-built-on-st-helena">construction of an airport</a> on the island in 2011.</p> <p>It is not uncommon for island nations to receive funding from development banks or similar organizations to construct an undersea fiber optic line to their shores. This was how <a href="https://www.worldbank.org/en/news/press-release/2011/08/15/world-bank-and-asian-development-bank-to-support-high-speed-internet-in-tonga">Tonga got its submarine cable</a> — the one that was also damaged by last year’s undersea volcano eruption. As I wrote in my <a href="https://www.kentik.com/blog/tonga-downed-by-massive-undersea-volcano-eruption/">blog post on that incident</a>, development bank funding is necessary because cables to island nations are “thin routes,” i.e., cable projects unlikely to offer a return on investment. They are funded as development projects intended to help the local economy by enabling the connectivity necessary for modern society.</p> <p>Unfortunately, because Saint Helena is a territory of the United Kingdom, it is ineligible to receive funding from organizations like the World Bank or African Development Bank. Like the airport, it would be the UK’s responsibility to fund a submarine cable connection. But also, like the airport, a high-speed internet connection would be essential to help alleviate the many issues stemming from the island’s remoteness, putting it on a path to self-sufficiency.</p> <p>Soon after the construction of Saint Helena’s airport, a South African company announced its plan to build a submarine cable through the South Atlantic called the <a href="https://en.wikipedia.org/wiki/SAex">South Atlantic Express (SAEx) cable</a>. The proposed cable would be the first to span the South Atlantic and run relatively close to Saint Helena. When Christian reached out to both the Saint Helena government and the South African company behind the cable, he found that neither group had considered the possibility of landing SAEx in Saint Helena.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2enYbbsIA1YTKuZWMVI4z7/a43fb8bb9349bd3180054cf4361e8bab/st-helena-logo.png" style="max-width: 350px;" class="image right no-shadow" alt="Connect St.
Helena logo" /> <p>Realizing that landing a submarine cable branch in Saint Helena wasn’t going to happen without dedicated advocacy, Christian founded the non-profit Connect St Helena in early 2012 and began the lobbying effort.</p> <p>As part of this effort, Saint Helena islander James Greenwood submitted a petition to the UK government to fund a cable branch. While the petition failed to get the necessary 100,000 signatures to be taken up by the UK government, <a href="https://en.wikipedia.org/wiki/Andrew_Rosindell">British MP Andrew Rosindell</a> filed <a href="https://publications.parliament.uk/pa/cm201212/cmhansrd/cm120312/text/120312w0003.htm">parliamentary questions</a> on the matter. This action attracted the first media attention to Saint Helena’s quest to have a submarine cable hookup.</p> <p>In 2015, the UK government expressed concerns to the <a href="https://www.sainthelena.gov.sh/">Saint Helena Government (SHG)</a> about the financial viability of a submarine cable serving such a small population. Particularly concerning were the ongoing O&#x26;M charges that would be required to maintain the cable. To address these concerns, Christian collaborated with <a href="https://www.linkedin.com/in/holvey/">SHG’s chief economist Tom Holvey</a> to explore the possibility of attracting satellite ground stations to the island.</p> <p>As <a href="http://www.earthstation.sh/">earthstation.sh</a> — a website Christian set up for SHG — explains, by using the remote island’s submarine cable connection as a source of backhaul connectivity, a satellite operator could better serve large parts of the South Atlantic. The satellite operator would become a customer of the cable, contributing to its ongoing costs. Within a few weeks, Christian had received seven expressions of interest, including from <a href="https://oneweb.net/">OneWeb</a> (a LEO satellite operator and competitor to Starlink).</p> <img src="//images.ctfassets.net/6yom6slo28h2/1NH8ywvgwCZPlpWCBacUof/21e8eaaa4e721c49103d57f46a3e7b11/vonderropp-balfe.jpg" style="max-width: 600px;" class="image center" alt="Christian von der Ropp meeting with Lord Richard Balfe at Westminster Palace" /> <div class="caption" style="margin-top: -35px;">Christian von der Ropp meeting with Lord Richard Balfe at Westminster Palace</div> <p>Christian shared the expressions of interest with two MPs: Andrew Rosindell (mentioned earlier) and <a href="https://en.wikipedia.org/wiki/Richard_Balfe">Lord Richard Balfe</a>, who each wrote formal letters to the Secretary of State for International Development, <a href="https://en.wikipedia.org/wiki/Justine_Greening">Justine Greening</a>. These letters led the UK government to award the first feasibility study on the possibility of funding a cable to Saint Helena.</p> <img src="//images.ctfassets.net/6yom6slo28h2/23xnQQvmvY164pDPGNpld1/da986acf67ae1d0448029e33d00eb239/letter-feasability-study.jpg" style="max-width: 600px;" class="image center" alt="UKAID letter" /> <div class="caption" style="margin-top: -35px;">Response from the office of Justine Greening, Secretary of State for International Development</div> <p>Despite the progress from the UK government, the SAEx project failed to raise the funding necessary and would eventually fold. Needing to find another submarine cable to connect to, Christian reached out to Angola Cables, who were in the process of building the <a href="https://www.submarinecablemap.com/submarine-cable/south-atlantic-cable-system-sacs">South Atlantic Cable System (SACS)</a> cable.
In 2018, I would publish the <a href="https://www.lacnic.net/innovaportal/file/3209/1/sacs_lightning_talk_lacnic30.pdf">first evidence</a> of SACS carrying traffic across the South Atlantic, the first submarine cable to do so.</p> <p>Unfortunately, Angola Cables was too late in the process of building SACS and would have required an upfront payment of $50 million within weeks — an impossibility for this project. Without a major submarine cable to connect to, the effort seemed to have hit a dead end.</p> <p>Just when it appeared that all hope was lost, Christian heard rumors of content providers like Facebook and Google building new submarine cables around Africa. In January 2018, Christian wrote to Andrew Metcalf in the submarine cable division at Google, who revealed that Google was working on a new cable that would later become known as Equiano.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/60VLLe7dGZf5iKwGOafTsE/d6fdebda77adfeade265fbd446b36a54/vonderropp-message.png" style="max-width: 500px;" class="image center" alt="Metcalf message" /> <p>After realizing there was now another submarine cable project in the vicinity of Saint Helena, the SHG lobbied the European Development Fund (instead of the UK) for help. In June 2018, just six months after Christian’s initial outreach to Google, <a href="https://www.sainthelena.gov.sh/2018/public-announcements/shg-signs-financing-agreement-for-the-territorial-allocation-of-11th-edf/">SHG received a grant of €21.5 million</a> to cover the cost of building the cable branch connecting Saint Helena to Equiano.</p> <p>While the UK had voted to leave the EU in 2016, the implementation of “Brexit” wasn’t finalized until <a href="https://en.wikipedia.org/wiki/Timeline_of_Brexit">February 2020</a>. In 2018, the SHG was still eligible to receive EU funding. It would be one of the last EU benefits given to the UK.</p> <p>The EDF grant was the final piece of the puzzle. It would cover the cost of the cable branch to Google’s Equiano, and OneWeb’s ground station would contribute to the operational costs of maintaining the branch into the future.</p> <p>The cable branch to Equiano was built from Saint Helena in August 2021, and initially, it just terminated on the ocean floor because the main trunk had yet to be laid. In May 2023, the branch was connected to the main trunk, and Saint Helena had a fiber optic link to the world, even if it was not yet carrying traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3nJoakQJLSqpdyjvzsb7up/81d4e68cf50b8caa5ec97b61cf97d3f9/cable-branch-landing.jpg" style="max-width: 800px;" class="image center" alt="Photo of the landing of cable branch to the Equiano cable" /> <div class="caption" style="margin-top: -35px;">Landing of cable branch to the Equiano cable, August 2021</div> <p>Earlier in this piece, I mentioned that Sure enjoys a monopoly on internet service in Saint Helena. That isn’t simply because it is the only local ISP; it is because it successfully lobbied the SHG to <a href="https://www.sainthelena.gov.sh/2023/news/use-of-unlicenced-telecommunications-equipment/">outlaw Starlink</a> or any other competing source of internet service to protect its revenues.</p> <p>As the only legal source of internet service on the island, Sure initially refused to make use of the new submarine cable connection as it risked the volume-based tariffs it was collecting from Saint Helena islanders.
Sure only gave in and activated the high-speed link to the Equiano cable when the SHG threatened to legalize Starlink, as the island of <a href="https://www.facebook.com/groups/6023925134/permalink/10160799751240135/">Ascension had done</a> to spur Sure into action.</p> <h2 id="conclusion">Conclusion</h2> <p>While internet service has improved exponentially in Saint Helena, tariffs are still quite expensive when compared to the average local income, and basic users face a data cap which seems hard to justify given the most expensive component — the submarine cable — was essentially gifted to Saint Helena by European taxpayers.</p> <p>Additionally, Saint Helena islanders connect to the internet via ADSL2+ copper lines, meaning they rarely achieve more than 10Mbps download. The SHG began building its own FTTH network, but a major issue is Sure’s license to operate, which states that the SHG is required to compensate Sure for all its infrastructure if its exclusivity is not protected.</p> <p>New internet connections don’t just happen by chance, and this one faced enormous challenges, not all of which were geographic. It took over twelve years of lobbying to make Saint Helena’s submarine cable hookup a reality. Christian von der Ropp and numerous others on and off the island helped Saint Helena take a major step to eliminate its exile from the global internet.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7GWWWgl69OG7Qo75BJGiXA/ca87ae27ec10997c25c77ba2084562cf/napoleon-in-exile.jpg" style="max-width: 600px;" class="image center" alt="Napoleon on Saint Helena, watercolor by Franz Josef Sandmann, c. 1820" /> <div class="caption" style="margin-top: -35px;">Napoleon on Saint Helena, watercolor by Franz Josef Sandmann, c. 1820</div><![CDATA[AWS Performance Monitoring with Kentik]]><![CDATA[CloudWatch can be a great start for monitoring your AWS environments, but it has some limitations in terms of granularity, customization, alerting, and integration with third-party tools. In this article, learn all the ways that Kentik can supercharge your AWS performance monitoring and improve visibility.]]>https://www.kentik.com/blog/aws-performance-monitoring-with-kentikhttps://www.kentik.com/blog/aws-performance-monitoring-with-kentik<![CDATA[Phil Gervasi]]>Thu, 05 Oct 2023 04:00:00 GMT<p>Nowadays, you’d be hard-pressed to find a digital business that doesn’t rely on cloud infrastructure. Within these cloud environments, it’s paramount for businesses to ensure optimal performance and efficiency to deliver a smooth customer experience.</p> <p>As one of the leading cloud service providers, Amazon Web Services (AWS) provides built-in solutions like CloudWatch to help users monitor the performance of their applications. However, as you’ll soon see, these built-in tools may not be sufficient for more advanced use cases.</p> <p>In this article, we’ll explore AWS performance monitoring with Kentik Cloud, a solution designed to empower businesses with greater insights and control over their AWS deployments. We’ll learn about some of the key challenges associated with AWS performance monitoring to see why built-in tools like CloudWatch might fall short.
We’ll also explore some of the unique and practical capabilities of Kentik Cloud that you can use to gain valuable insights into your AWS environments.</p> <h2 id="what-is-aws-performance-monitoring">What is AWS performance monitoring?</h2> <p>AWS performance monitoring refers to the practice of tracking, analyzing, and optimizing the performance of your AWS resources and services. Here, performance is typically attributed to metrics such as compute resource utilization, network traffic, and application latency. Confirming that your AWS application properly emits these metrics and then diligently tracking them is critical to guarantee the efficient operation of your resources.</p> <h3 id="why-is-aws-performance-monitoring-important">Why is AWS performance monitoring important?</h3> <p>Let’s highlight some of the reasons why AWS performance monitoring is important:</p> <ul> <li><strong>Improved user experience:</strong> Users are much more likely to have a positive experience using your service or application if latency and response times are low. Using performance monitoring to identify and address performance bottlenecks can ultimately lead to higher customer satisfaction and retention.</li> <li><strong>Enhanced availability:</strong> These days, the best applications demand 99.999 percent availability (also known as <a href="https://www.stratus.com/about/company-information/uptime-meter/#:~:text=Availability%20is%20normally%20expressed%20in,have%20on%20your%20server%20downtime.">“five nines”</a>). Monitoring can help you anticipate resource constraints and other anomalies before they happen. This can help you mitigate or avoid downtime altogether, ensuring high availability and minimizing business disruptions.</li> <li><strong>Reduced costs:</strong> Effective monitoring can help organizations identify underutilized and overprovisioned AWS resources. By rightsizing essential resources, such as Amazon Elastic Compute Cloud (Amazon EC2) instances, businesses can reduce costs and put their money toward other resources that require scaling.</li> <li><strong>Better security and compliance:</strong> Proper monitoring also includes visibility into security and compliance metrics. Tracking them can help organizations rapidly detect and respond to security threats and vulnerabilities.</li> <li><strong>Reduced network connectivity issues:</strong> Keeping an eye on network traffic and connectivity metrics helps identify issues such as packet loss and high latency. Addressing these issues ensures smooth communication between services and contributes to better availability.</li> <li><strong>Enables data-driven decision-making:</strong> Overall, the trend with all these points is that AWS performance monitoring enables informed, data-driven decision-making. Any choice that you make regarding your application’s infrastructure scaling, workload optimization, or resource allocation needs to be backed up with sufficient data.</li> </ul> <h3 id="built-in-aws-monitoring-solutions">Built-in AWS monitoring solutions</h3> <p>One of the most popular services in AWS’s 200+ service catalog is Amazon CloudWatch. CloudWatch is the built-in AWS monitoring solution designed to help users track various metrics, collect and view log files, and set alarms. With this data, users can create custom dashboards to view different alarms and metrics in one place.
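</p> <p>If you haven’t worked with CloudWatch programmatically, a minimal sketch of pulling a metric series with boto3 looks something like this (the namespace, metric, and instance ID are just illustrative placeholders):</p> <pre><code># Minimal sketch: pull a CloudWatch metric series with boto3.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[{
        "Id": "bytes_in",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/EC2",        # illustrative namespace
                "MetricName": "NetworkIn",     # bytes received per instance
                "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
            },
            "Period": 300,   # 5-minute granularity
            "Stat": "Average",
        },
    }],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
)
result = resp["MetricDataResults"][0]
for ts, value in zip(result["Timestamps"], result["Values"]):
    print(ts.isoformat(), round(value, 1))
</code></pre> <p>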
For instance, here’s a custom CloudWatch dashboard titled “cloudwatch-networking-dashboard”:</p> <img src="//images.ctfassets.net/6yom6slo28h2/319yDh5idHoTbKXgZKbCT0/dee6b1386e10ea059890b56d5644e00f/cloudwatch-dashboard.png" style="max-width: 600px;" class="image center" alt="A screenshot of a custom CloudWatch dashboard containing a line graph widget to display Amazon Virtual Private Cloud (VPC) network usage metrics and three alarm widgets to display various alarm statuses" /> <p>With this dashboard, users can immediately visualize a key Amazon VPC metric: <a href="https://docs.aws.amazon.com/vpc/latest/userguide/network-address-usage.html">Network Address Usage (NAU)</a>. Next to this line graph widget, you can also quickly obtain the status of three CloudWatch alarms (whoops, one is currently in alarm!).</p> <p>Overall, CloudWatch is a great tool for monitoring your AWS environments, but it can fall short in a few key areas:</p> <ul> <li><strong>Limited granularity:</strong> CloudWatch provides metrics at predefined granularities. For example, the “NetworkAddressUsage” metric can be viewed at one-second, five-second, ten-second, thirty-second, one-minute, and all the way up to thirty-day intervals (the previous example used fifteen-minute intervals). However, users with requirements for custom intervals may find these limiting.</li> <li><strong>No predictive analysis:</strong> CloudWatch primarily acts as a central location for historical and current metrics for your AWS resources. However, it lacks built-in predictive analysis capabilities to forecast future trends or performance issues.</li> <li><strong>Lack of customization:</strong> CloudWatch does support creating custom metrics and dashboards. However, its customization options are somewhat limited compared to other specialized third-party monitoring tools, and learning <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/publishingMetrics.html">how to publish custom metrics</a> can take additional time.</li> <li><strong>Limited alerting options:</strong> While CloudWatch alarms are excellent for basic alerting, they may not be as sophisticated as some third-party alerting and incident management tools. For instance, one CloudWatch alarm primarily tracks a single metric and goes into alarm when its threshold is breached. This might not work for use cases that demand more advanced alerting logic, such as time-based alerting, anomaly detection, or alerting based on dynamic usage patterns.</li> <li><strong>Limited integration with third-party tools:</strong> CloudWatch is primarily designed to monitor AWS resources and services, and its integration capabilities with third-party tools and services are limited. This can be a significant drawback for organizations that use a mix of cloud providers or on-premise infrastructure.</li> </ul> <p>In light of these limitations, it’s advantageous to consider <a href="https://www.kentik.com/product/multi-cloud-observability/">Kentik Cloud</a>, a performance monitoring tool that’s geared toward providing insights into networks.</p> <h2 id="kentik-cloud-the-better-aws-performance-monitoring-tool">Kentik Cloud: The better AWS performance monitoring tool</h2> <p>Because Kentik Cloud is a dedicated network observability tool, it offers a very extensive set of features and capabilities, making it a superior choice for organizations with more complex AWS environments or more advanced monitoring needs. 
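</p> <p>Before digging into Kentik Cloud’s capabilities, it helps to have a picture of the baseline we’re comparing against. The basic alerting described above boils down to a single metric checked against a single static threshold. In boto3, that looks roughly like the sketch below, where all names, values, and ARNs are placeholders:</p> <pre><code># A classic single-metric, static-threshold CloudWatch alarm.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="high-network-in",
    Namespace="AWS/EC2",
    MetricName="NetworkIn",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                    # evaluate 5-minute averages...
    EvaluationPeriods=3,           # ...for three consecutive periods
    Threshold=500_000_000,         # a static cutoff, in bytes per period
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
</code></pre> <p>Everything that follows is about moving past this one-metric, one-threshold model.</p> <p>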
As you’ll see in the following points, Kentik Cloud has the potential to offer deeper insights compared to built-in AWS performance monitoring tools like CloudWatch.</p> <h3 id="powerful-visualizations">Powerful visualizations</h3> <p>Kentik Cloud can collect network data from a variety of cloud providers. For your AWS VPCs, you can set up Kentik to observe your VPC metrics by doing a <a href="https://kb.kentik.com/v0/Bd06.htm#Bd06-AWS_Resource_Information_Types">cloud export</a>. This allows Kentik to ingest your VPC’s flow logs and metadata.</p> <p>Once you’re set up, Kentik can process this network data and produce powerful visualizations. In the <a href="https://kb.kentik.com/v4/Db03.htm">Data Explorer</a>, you can graph select metrics to comprehensively view your network traffic and performance. See the example below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/4obagKA9HsUpI0C6IPfVv9/cce6698f1082ca4e800a8bc7c99074d8/data-explorer.png" withFrame style="max-width: 800px;" class="image center" alt="A screenshot from Kentik Cloud that showcases the Data Explorer" /> <p>Kentik Cloud provides a variety of different filters that you can use on your data to produce the right visualization for your needs at whatever granularity you like. Notice that there are two main components in each visualization, a graph and a table:</p> <ul> <li><strong>For the graph</strong>, you can select from one of many <a href="https://kb.kentik.com/v4/Ja06.htm">chart view types</a>. In this example, the individual data points use a stacked area chart, which is often useful for comparing traffic amounts between two different networks. The aggregated total uses a line chart.</li> <li><strong>The table</strong> lists the query results in tabular form. Whereas the graph is useful for visualizations, the table can help you extract the exact value of key metrics. This is lacking in CloudWatch, which doesn’t currently provide a tabular form of queried metric data.</li> </ul> <h3 id="real-time-metrics">Real-time metrics</h3> <p>Data visualizations in Kentik Cloud are provided in real time to help you understand the current state of your network. When <a href="https://kb.kentik.com/v4/Db03.htm#Db03-Auto_Update_Mode">Auto Update</a> mode is on, metrics and visualizations are continuously updated as new data flows in.</p> <p>One metric that can be particularly useful to track in real time is traffic percentage. For example, if you’re slowly migrating traffic off one network and onto another, being able to monitor this transition in real time can help you detect if and when something goes wrong. The <a href="https://kb.kentik.com/v4/Ja06.htm#Ja06-Line_Chart">100 percent stacked area chart</a> is particularly useful in this case:</p> <img src="//images.ctfassets.net/6yom6slo28h2/gLKnR1EWXIwS6YfC4Ttkq/77917994038bf223e25c173fb9a8c9d3/stacked-chart.png" style="max-width: 500px;" class="image center" alt="A screenshot from the Kentik Cloud documentation that showcases a 100 percent stacked area chart" /> <p>In this chart, the relative loads of different networks can be visualized in a single, real-time graph. It can help you see changes in one network’s traffic relative to all other networks, as well as the total combined usage. 
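</p> <p>The arithmetic behind a 100 percent stacked view is worth spelling out: each network’s share is its traffic divided by the per-interval total. Here’s a tiny sketch with invented sample numbers:</p> <pre><code># Per-interval traffic share: the math behind a 100 percent stacked chart.
samples = {                  # bits/sec per network, one value per interval
    "vpc-a": [40, 55, 70],
    "vpc-b": [60, 45, 30],
}

intervals = len(next(iter(samples.values())))
for i in range(intervals):
    total = sum(series[i] for series in samples.values())
    shares = ", ".join(
        f"{net}={100 * series[i] / total:.0f}%" for net, series in samples.items()
    )
    print(f"t{i}: {shares}")   # every interval sums to 100%
</code></pre> <p>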
CloudWatch doesn’t natively support the ability to graph relative values like this (though you can manually implement a similar idea using <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/using-metric-math.html">metric math</a>).</p> <h3 id="synthetic-testing-of-your-cloud-services">Synthetic testing of your cloud services</h3> <p><a href="https://www.kentik.com/blog/everything-you-need-to-know-about-synthetic-testing/">Synthetic testing</a> is a technique that can help you proactively discover networking issues before your users do. In synthetic testing, you simulate different types of traffic through specific routing paths. For example, you might use synthetic testing to observe communications between your VPCs.</p> <p>A good synthetic monitoring platform offers insights into a test’s high-level metrics and low-level details. You’ll want to know about issues like unavailability, high latency, and packet loss. At the same time, you also want to be able to trace issues down to a specific area of the network, such as a host infrastructure issue in a third-party service.</p> <p>Synthetic testing is not available in a service like CloudWatch, which isn’t really intended for such use cases. However, Kentik has a <a href="https://www.kentik.com/product/synthetic-monitoring/" title="Learn about Kentik solutions for synthetic monitoring">synthetic testing and monitoring solution</a> built into its platform. This allows you to plan and execute routine synthetic testing and immediately visualize the results:</p> <img src="//images.ctfassets.net/6yom6slo28h2/18FUZgBwwxOruYbjq8VnIr/68bb1e70381ef7386232271653ff262b/synthetics-dashboard.png" withFrame style="max-width: 800px;" class="image center" alt="A screenshot from the Kentik Synthetics Dashboard that graphs the result of synthetic testing over the course of a day" /> <p>Below the visualization, you’ll also get a log of the exact tests and their respective statuses. You can also combine many other features with synthetic testing to customize your network monitoring experience, such as setting up <a href="https://kb.kentik.com/v4/Ea09.htm">alerts</a> when synthetic tests detect performance issues or unusual behavior. Alerts that are based on the direct results of synthetic testing can often be more telling of underlying issues than simply alerting based on metric thresholds, which is typically what you’re limited to in CloudWatch.</p> <p>Perhaps the most useful feature is that network operators can use Kentik to set up <a href="https://kb.kentik.com/v4/Ma00.htm#Ma00-Synthetic_Test_Types">automatic synthetic testing</a>. In a single click, Kentik can predict the key routes through your network and create custom testing plans. Few other monitoring solutions double as a network testing platform, let alone one that can automatically detect key paths through your network and generate entire synthetic testing plans. As you develop and change your network infrastructure, Kentik can also automatically update test configurations in real time.</p> <h3 id="monitoring-across-multiple-cloud-providers">Monitoring across multiple cloud providers</h3> <p>As previously stated, the main limitation of CloudWatch is that it really only shines in monitoring AWS environments. However, many organizations these days employ a mix of different cloud providers, and any in-house monitoring tool within these providers is likely not going to integrate well with others. 
In contrast, Kentik allows you to see and monitor all your networks in one place, with native integrations built for <a href="https://kb.kentik.com/v0/Bd06.htm">AWS</a>, <a href="https://kb.kentik.com/v0/Bd07.htm">Google Cloud Platform (GCP)</a>, <a href="https://kb.kentik.com/v0/Bd08.htm">Azure</a>, and <a href="https://kb.kentik.com/v0/Bd09.htm">IBM Cloud</a>. This offers a few immediate benefits for organizations that use a multicloud strategy:</p> <ul> <li><strong>Data aggregation:</strong> Kentik can aggregate network traffic and performance data from all your cloud providers into a single interface. This means you can analyze data from multiple clouds without having to switch between different monitoring tools.</li> <li><strong>Single unified dashboard:</strong> Kentik can provide a dashboard that displays data from your cloud providers side by side. This can be useful if you need to compare performance or traffic patterns across different clouds.</li> <li><strong>Single alerting system:</strong> Instead of manually setting up custom alerting rules in each cloud provider, you can now do this all in Kentik.</li> </ul> <p>Overall, using a platform like Kentik can help you gain a holistic view of all your networks, even if they’re distributed across many different cloud providers. (And you can learn more about Kentik and other alternatives in our article, “<a href="https://www.kentik.com/kentipedia/cloudwatch-alternatives-multicloud-network-observability/" title="CloudWatch Alternatives: Enhancing Multicloud Network Observability">CloudWatch Alternatives: Enhancing Multicloud Network Observability</a>.”)</p> <h2 id="how-to-improve-aws-performance-monitoring-with-kentik">How to improve AWS performance monitoring with Kentik</h2> <p>AWS performance monitoring is essential for ensuring the efficiency, availability, and security of cloud-based applications. Though CloudWatch can be a great start for monitoring your AWS environments, you also learned that it has some limitations in terms of granularity, predictive analysis, customization, alerting, and integration with third-party tools.</p> <p>To address these shortcomings, you learned about some of the key features of a tool called <a href="https://www.kentik.com/product/multi-cloud-observability/" title="Learn more about multicloud observability with Kentik">Kentik Cloud</a>. Kentik’s cloud observability solution offers powerful visualizations, real-time monitoring, and built-in synthetic testing capabilities to provide a comprehensive view of network performance across all your cloud providers, not just AWS. By embracing Kentik Cloud, your business can get ahead of networking issues before they reach your end users and ultimately make informed, data-driven decisions about your network applications.</p> <p>Try Kentik Cloud for yourself today: Start a <a href="#signup_dialog" title="Request a Free Trial of Kentik">free trial</a> or <a href="#demo_dialog" title="Request a Demo of Kentik">request a personalized demo</a> to learn more about how to improve AWS performance with Kentik.</p><![CDATA[State of the Internet: Monitoring SaaS Application Performance]]><![CDATA[With the increasing reliance on SaaS applications in organizations and homes, monitoring connectivity and connection quality is crucial.
In this post, learn how, with Kentik’s State of the Internet, you can dive deep into the performance metrics of the most popular SaaS applications.]]>https://www.kentik.com/blog/state-of-the-internet-monitoring-saas-application-performancehttps://www.kentik.com/blog/state-of-the-internet-monitoring-saas-application-performance<![CDATA[Phil Gervasi]]>Tue, 03 Oct 2023 04:00:00 GMT<p>Most of us, at work and at home, are using SaaS applications every day for productivity tools, mission-critical applications, and entertainment. Therefore, it’s important to monitor connectivity to these services, as well as the quality of those connections. However, that’s easier said than done.</p> <p>We don’t have much insight into what’s going on in SaaS provider networks, so when a SaaS app is performing poorly, we can feel left in the dark. We don’t own Google’s DNS service, we don’t own Microsoft 365, we don’t own Salesforce, Zoom, and so on. This is where Kentik’s State of the Internet comes in.</p> <p>The State of the Internet is a part of Kentik’s comprehensive network observability platform and is included for all our customers to use. We’ve deployed hundreds of <a href="https://www.kentik.com/product/global-agents/">network and application testing agents</a> around the world to monitor some of the most popular SaaS providers continually and report on the results in the Kentik portal.</p> <p>We gather information on packet loss, network and application latency, jitter, DNS resolution, and path tracing from each agent to specific SaaS points of presence. That way, we can get an understanding of both the SaaS applications’ performance locally, on their end, and in the path in between.</p> <h2 id="the-state-of-the-internet-grid">The State of the Internet grid</h2> <p>Notice in the image below the connection quality metrics for many of the most popular SaaS providers in a grid. We’re capturing the HTTP status code, response size, domain lookup time, connection time, response time, average HTTP latency, average latency, average jitter, and packet loss. Together, these metrics gauge the quality of the connection from multiple perspectives – network, application layer, even DNS.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5DyE6ocXnhZBYKkSKlEgLI/386f7ed9c5fd62c415596d7b73bbd3d0/state-of-internet-saas.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="State of the Internet SaaS apps" /> <h2 id="drilling-down-for-more-info">Drilling down for more info</h2> <p>As an example, if you have a problem with a particular SaaS provider, you can get a quick view of that provider’s metrics right on the main page.
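</p> <p>If you’re curious how metrics like domain lookup time, connection time, and response time are even measured, here’s a stripped-down probe in the same spirit. It’s a simplified sketch against an example host, not how Kentik’s agents are implemented:</p> <pre><code># Time DNS resolution, TCP connect, and an HTTP GET separately.
import socket
import time
import urllib.request

host = "www.example.com"

t0 = time.perf_counter()
addr = socket.getaddrinfo(host, 443)[0][4][0]             # DNS lookup
t1 = time.perf_counter()

sock = socket.create_connection((addr, 443), timeout=5)   # TCP handshake
t2 = time.perf_counter()
sock.close()

resp = urllib.request.urlopen(f"https://{host}/", timeout=5)  # full HTTP GET
t3 = time.perf_counter()

print(f"dns={(t1 - t0) * 1000:.1f}ms connect={(t2 - t1) * 1000:.1f}ms "
      f"http={(t3 - t2) * 1000:.1f}ms status={resp.status}")
</code></pre> <p>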
This quick glance is very helpful and extremely popular with our customers, but you can also drill down further by clicking on the provider name.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Hnr4CXCe0ojO6E6nCHUUA/0ed36810f81122882c7b42dc66e788a0/state-of-internet-drill-down.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Drill down into SaaS provider" /> <p>When you select a specific provider, you can see those same metrics in a map or time series, you can adjust your time range, and you can drill down further to see the detailed metrics and path view for that specific provider.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7JitLqs4FTBVKEcTGtAWos/d0edf304550780453383aaa106f78480/state-of-internet-salesforce.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Drill down to path view" /> <h2 id="public-cloud-monitoring">Public cloud monitoring</h2> <p>We’re also monitoring the quality of the connection to the <a href="/solutions/improve-cloud-performance/">major public clouds</a>, which you can see in the grid below, and just like the main State of the Internet grid, you can also drill down into any of these details from here.</p> <p>Keep in mind that these are the connections from the Kentik test agents deployed all over the world and not our individual customer agents. That way, just by taking a quick look at the State of the Internet in the Kentik portal, you can get a general overview of the major public clouds’ performance characteristics in regions all over the world instead of testing only one instance.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2g2t9i0BiPoWeSyjzpwjYZ/d06ae4e97a8820be58e1aab44d9641c9/state-of-internet-mesh.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Public cloud monitoring" /> <h2 id="we-still-care-about-dns">We still care about DNS</h2> <p>Lastly, we understand that DNS is a critical component of application delivery and performance, so we’re monitoring the connection to and resolution time of the major public DNS services like Cloudflare, Google, and Quad9. This is presented in the form of a health status bar and a time series graph so that you can see changes over time. We also create a rolling standard deviation so we understand what “normal” looks like for a particular DNS service.
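</p> <p>The rolling baseline idea is simple enough to show in a few lines. This sketch flags any DNS resolution time more than three standard deviations above a trailing window’s mean; the window size and sample data are invented:</p> <pre><code># Flag DNS resolution times outside mean + 3 sigma of a trailing window.
import statistics

WINDOW = 12                    # trailing samples in the baseline
resolutions_ms = [9, 11, 10, 9, 12, 10, 11, 9, 10, 11, 10, 9, 48]

for i in range(WINDOW, len(resolutions_ms)):
    window = resolutions_ms[i - WINDOW:i]
    mean = statistics.mean(window)
    sigma = statistics.stdev(window)
    if resolutions_ms[i] > mean + 3 * sigma:
        print(f"sample {i}: {resolutions_ms[i]}ms is outside normal "
              f"(baseline {mean:.1f}ms, sigma {sigma:.1f}ms)")
</code></pre> <p>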
And of course you can also adjust your time range to see what the metrics were a few hours ago, yesterday, last week, or last month.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1xdpKcStCGw52yQAVNwIVU/a8b388c114179d34f6b7b438b66f12d5/state-of-internet-dns.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="DNS health status" /> <p>The State of the Internet is built with Kentik Synthetics, so as a Kentik customer you can create your own custom tests with very specific parameters for whatever application or service you want to monitor, internally, in your own cloud instances, or on the public internet.</p> <p>For more information about the State of the Internet and how Kentik can help you monitor your SaaS application performance, check out this short overview video.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe//l1g8m7cvoi" title="State of the Internet: Monitoring SaaS Application Performance" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div><![CDATA[The Evolution of Data Center Networking for AI Workloads]]><![CDATA[Traditional data center networking can’t meet the needs of today’s AI workload communication. We need a different networking paradigm to meet these new challenges. In this blog post, learn about the technical changes happening in data center networking from the silicon to the hardware to the cables in between.]]>https://www.kentik.com/blog/the-evolution-of-data-center-networking-for-ai-workloadshttps://www.kentik.com/blog/the-evolution-of-data-center-networking-for-ai-workloads<![CDATA[Phil Gervasi]]>Wed, 27 Sep 2023 04:00:00 GMT<p>As artificial intelligence continues its march into every facet of modern life and business, the infrastructure behind it has been undergoing rapid and significant changes. Data center networking, or in other words, the infrastructure that connects AI compute nodes, won’t meet the needs of today’s artificial intelligence workload communication, which means traditional data center networking must evolve to meet these new challenges.</p> <div as="WistiaVideo" videoId="n035vc7lup" audio></div> <h2 id="what-makes-ai-workloads-unique">What makes AI workloads unique?</h2> <p>AI workloads are fundamentally different from traditional data center tasks. They:</p> <ul> <li>Rely on extremely high-performance computing nodes.</li> <li>Are normally distributed across multiple CPUs and GPUs that need to communicate with each other in real-time.</li> <li>Predominantly use IP networking but require extremely low/no latency, non-blocking, and high bandwidth communication.</li> <li>Cannot afford “time on network,” where GPUs wait on data from one another, leading to inefficiencies and delays in the overall “job completion time.”</li> </ul> <p>The underlying premise is that scalability of these workloads doesn’t come from a single, giant monolithic computer, but from distributing tasks among numerous <em>connected</em> devices.
It echoes the famous Sun Microsystems tagline, “the network is the computer,” and that has never been more true than it is today.</p> <h2 id="traffic-patterns-of-ai-workloads">Traffic patterns of AI workloads</h2> <p>Traditional data center traffic typically consists of many asynchronous flows. These could be database calls, end-users making requests of a web server, and so on. AI workloads, on the other hand, involve ‘elephant flows’ where vast amounts of data can be transferred between all or a subset of GPUs for extended periods.</p> <p>These GPUs connect to the network with very high bandwidth NICs, such as 200Gbps and soon even 400Gbps and 800Gbps. GPUs usually communicate with each other in a synchronized mesh or partial mesh. For example, when a pod of GPUs completes a particular calculation, it sends that data to another entire pod of GPUs, possibly numbering in the thousands, to use that data to train an ML model or perform some other AI-related task.</p> <p>As opposed to traditional networking, the pod of GPUs, and in fact each individual GPU, requires <em>all</em> the data before it can move forward with its own tasks. In that way, we see huge flows of data among a mesh of GPUs rather than numerous lightweight flows that can sometimes even tolerate some missing data.</p> <p>And since each GPU relies on all the data from another GPU, any stalling of one GPU, even for a few milliseconds, can lead to a cascade effect, stalling many others. This makes job completion time (JCT) crucial, as the entire workload then relies on the slowest path in the network. In that sense, the network can easily become the bottleneck.</p> <h2 id="challenges-with-traditional-data-center-networking">Challenges with traditional data center networking</h2> <p>Since AI workloads are so large and synchronous, one flow can cause collisions and delays in a pathway shared by other elephant flows. To get around this, we need to re-evaluate the old data center design principles of oversubscription, load balancing, how we handle latency and out-of-order packets, and what type of control plane we use to keep traffic moving correctly.</p> <p>In traditional data center networking, we might have configured a 2:1, 3:1, 4:1, or even 5:1 oversubscription of downstream to upstream bandwidth under the assumption that not all connected devices would be communicating at maximum bandwidth all the time.</p> <p>We also need to consider how we load balance across paths and links. Using a technology like a simple LAG or ECMP hashing technically works, but with very high bandwidth elephant flows that don’t tolerate packet loss, latency, etc., we end up with flow collisions and network latency on some links while others sit idle.</p> <p>We also end up with an uneven distribution of traffic since ECMP will load-balance entire flows, from the first packet to the last, and not each individual packet. That could also result in collisions, but it could also cause ingest bottlenecks.</p> <h2 id="what-ai-interconnect-networks-require">What AI interconnect networks require</h2> <p>AI interconnect, or the new data center network purpose-built for AI workloads, has several requirements:</p> <ol> <li>Non-blocking architecture</li> <li>1:1 subscription ratio</li> <li>Ultra-low network latency</li> <li>High bandwidth availability</li> <li>Absence of congestion</li> </ol> <p>To get us there, we need to consider exactly <em>how</em> we can engineer traffic programmatically in real-time.
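</p> <p>A toy model makes the ECMP problem easy to see: a 5-tuple hash pins every packet of a flow to one link, so two elephant flows can pile onto the same path while others sit idle, whereas per-packet spraying spreads the same load evenly. This is a conceptual sketch, not how any particular switch implements it:</p> <pre><code># Hash-based ECMP vs. per-packet spraying, conceptually.
LINKS = 4

def ecmp_link(flow):          # classic 5-tuple hash: one link per flow
    return hash(flow) % LINKS

flows = [("10.0.0.1", "10.0.1.1", 6, 49152, 179),
         ("10.0.0.2", "10.0.1.2", 6, 49153, 179)]
print("ECMP:", [ecmp_link(f) for f in flows])   # both flows may land on one link

def spray_link(packet_seq):   # packet spraying: round-robin per packet
    return packet_seq % LINKS

print("Spray:", [spray_link(seq) for seq in range(8)])  # 0,1,2,3,0,1,2,3
</code></pre> <p>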
Engineering traffic this way means intelligence at the data plane level, with minimal (or no) latency from control plane traffic.</p> <h2 id="ai-interconnect">AI interconnect</h2> <p>An AI interconnect will solve the issues of traditional data center networking by making use of several new technologies and several re-purposed old technologies.</p> <ul> <li>Multi-pathing (packet spraying)</li> <li>Scheduled fabric</li> <li>Hardware-based solutions such as RDMA</li> <li>Adaptive routing and switching</li> </ul> <p>Packet spraying is the process of distributing individual packets of a flow across multiple paths or links rather than sending all the packets of that flow over a single path. ECMP typically pins a flow to a link, which won’t work well for AI workload traffic. So the objective of packet-spraying is to make more effective use of available bandwidth, reduce congestion on a single link, and improve overall network throughput.</p> <p>A scheduled fabric, in particular, will manage the packet-spraying intelligence and solve the inability of ECMP to avoid flow collisions. The catch is that path/link selection on a packet-by-packet basis needs to be local on the switch, or even on the NIC itself, so that there isn’t a need for runtime control plane traffic traversing the network. So though there may be a controller involved, policy is pushed and decisions are made locally.</p> <p>Next, RDMA, or Remote Direct Memory Access, is a protocol that allows computers in a network to exchange data in main memory without involving the processor, cache, or operating system of either computer. By bypassing this additional overhead, RDMA improves data transfer speed and reduces latency. It essentially permits memory-to-memory transfer without the need to continuously interrupt the processor.</p> <p>Adaptive routing and switching is the ability to change routing or switching decisions based on the current state and conditions of the network. This is different from static or predetermined routing, which always follows the same path for data packets regardless of network conditions. Instead, adaptive routing can adjust paths based on factors like congestion, link failures, or other dynamic variables, all pointing ultimately to link and path <em>quality</em>.</p> <p>This kind of runtime dynamic routing improves performance, fault tolerance, and reliability, and is exactly what is needed for the type of traffic produced by a distributed AI workload.</p> <p>And lastly, apart from networking, infrastructure challenges also include the choice of cabling solutions, power, cooling requirements, and the extremely high cost of optics.</p> <h2 id="the-ultra-ethernet-consortium">The Ultra Ethernet Consortium</h2> <p>The <a href="https://ultraethernet.org/">Ultra Ethernet Consortium</a> comprises mostly network vendors and one hyperscaler, Meta. They’re working collectively to address the networking challenges posed by AI interconnects. By 2024, the consortium aims to release its first set of standards, coupled with awareness drives like conferences and literature.</p> <h2 id="the-importance-of-visibility">The importance of visibility</h2> <p>For efficient AI workloads, there’s a need for robust telemetry from the AI interconnect.
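</p> <p>Here’s the adaptive idea boiled down to a toy model: instead of a static hash, each packet takes whichever link currently has the shallowest queue. The queue depths stand in for the kind of telemetry a real fabric would use; none of this reflects a specific vendor implementation:</p> <pre><code># Toy adaptive switching: pick each packet's link from live queue depths.
queue_depth = [3, 0, 7, 1]     # cells queued per link (invented telemetry)

def adaptive_link():
    """Send the next packet down the least-congested link."""
    return min(range(len(queue_depth)), key=queue_depth.__getitem__)

for _ in range(4):
    link = adaptive_link()
    queue_depth[link] += 1     # our own packet deepens that queue
    print("chose link", link, "depths now", queue_depth)
</code></pre> <p>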
Telemetry like this allows for monitoring of switches, communication between hosts and switches, and the identification of issues affecting job completion time.</p> <p>Ultimately, without flow-based and very granular telemetry, a switched fabric doesn’t have the information needed to make path selection decisions in real-time as well as schedule path selection decisions for the next packet, or in other words, short-term predictive path selection.</p> <p>As AI continues to shape the future, it’s important for data centers focused on this kind of compute to evolve and accommodate the unique requirements of AI workloads. With innovations in networking, the industry is taking steps in the right direction, ensuring that the backbone of AI – the purpose-built AI data center – is robust, efficient, and future-proof.</p> <p>For more information, <a href="https://www.youtube.com/watch?v=cObFQVETxxs">watch our recent LinkedIn Live</a> with Phillip Gervasi and Justin Ryburn in which we go deeper into AI interconnect.</p><![CDATA[Cloud Cost Optimization Best Practices]]><![CDATA[Businesses are rapidly transitioning to the cloud, making effective cloud cost management vital. This article discusses best practices that you can use to help reduce cloud costs. ]]>https://www.kentik.com/blog/cloud-cost-optimization-best-practiceshttps://www.kentik.com/blog/cloud-cost-optimization-best-practices<![CDATA[Rosalind Whitley]]>Thu, 21 Sep 2023 04:00:00 GMT<p>Businesses that have transitioned to the cloud are starting to re-evaluate their cloud strategy to manage application performance and unexpected cloud expenses. You’re likely already aware that companies must invest significantly in their cloud infrastructures. However, these investments can swiftly go to waste if not optimized.</p> <p>Additionally, cloud cost optimization isn’t just about reducing costs; it’s about efficiently allocating resources and optimizing network access to those resources to maximize value without compromising performance or security.</p> <p>That’s why this guide focuses on best practices you can use to decrease your cloud costs and optimize your resources. By the end of this article, you’ll understand several strategies that will help ensure you’re getting the most bang for your buck in your cloud environment.</p> <h2 id="best-practices-to-master-cloud-cost-optimization">Best practices to master cloud cost optimization</h2> <p>The following best practices will teach you how to optimize your cloud costs and, when implemented, will ensure efficient resource utilization and maximize your company’s return on investment.</p> <h2 id="set-budgets-and-monitor-both-idle-services-and-cost-anomalies">Set budgets and monitor both idle services and cost anomalies</h2> <p>Keeping a close eye on your cloud spending is critical. In the same way you’d establish a budget for a project or department in your organization, you should also set a budget for your cloud resources. This budget not only keeps you in check financially but also fosters a culture of accountability and efficiency.</p> <p>For instance, imagine a scenario where one of your applications has been erroneously deploying redundant instances due to a misconfigured automation script. Over a month, this could amount to considerable, unplanned costs.
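</p> <p>A programmatic budget with a notification threshold is one way to catch this kind of runaway spend early. Here’s a minimal sketch using boto3’s AWS Budgets API; the account ID, amount, and email address are placeholders:</p> <pre><code># Monthly cost budget with an email alert at 75 percent of actual spend.
import boto3

budgets = boto3.client("budgets")

budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "monthly-cloud-spend",
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 75.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "ops@example.com"}],
    }],
)
</code></pre> <p>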
When you set a budget and actively monitor it, you can promptly identify and rectify such issues, avoiding potential financial pitfalls.</p> <p>To set budgets and monitor services and anomalies, consider implementing the following:</p> <ul> <li><strong>Define clear budget parameters:</strong> Begin by understanding your cloud spending patterns. Assess past bills, factor in planned projects, and set a realistic yet stringent budget. Tools like <a href="https://aws.amazon.com/aws-cost-management/aws-budgets/">AWS Budgets</a> or <a href="https://azure.microsoft.com/en-us/products/cost-management">Microsoft Cost Management</a> can assist in this process.</li> <li><strong>Monitor idle resources:</strong> Regularly audit your cloud resources. Look for unused or underutilized instances, storage volumes, or services. Implement automated scripts or use tools like the <a href="https://cloud.google.com/compute/docs/instances/idle-vm-recommendations-overview">Google Cloud Idle VMs</a> report to flag and take action on these resources.</li> <li><strong>Set up alerts:</strong> Establish threshold-based alerts for when your spending reaches a certain percentage of the budget (<em>ie</em> 75 percent or 90 percent), allowing ample time to assess and react before potential overruns. Most cloud providers offer built-in mechanisms for this, such as <a href="https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-sns-policy.html">Amazon SNS notifications for AWS Budgets</a>.</li> <li><strong>Analyze and rectify anomalies:</strong> If your monitoring tools detect unusual spikes in usage or costs, don’t ignore them. Dive deep to find the root cause (<em>ie</em> a misconfigured service, an unnoticed DDoS attack, or any other anomaly).</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/5zjpCiaFqFTsyfMHjqSfIs/d39ed3a4d9a1e2a732c86a5ba479740e/attribut-cloud-costs-to-business.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Azure Cloud Costs" /> <h2 id="integrate-cloud-cost-optimization-in-your-software-development-lifecycle">Integrate cloud cost optimization in your software development lifecycle</h2> <p>As the development lifecycle evolves, so does the need to ensure that each phase is optimized for cloud resource usage. By embedding cost optimization practices into your software development lifecycle (SDLC), you can ensure that resources are used efficiently from development to deployment.</p> <p>Consider a DevOps team that develops a feature-heavy application, only to realize post-deployment that it consumes three times the anticipated cloud resources. The culprit? Non-optimized code and redundant database queries. This scenario could have been avoided with proactive cost considerations at each SDLC phase.</p> <p>Consider the following points to help you integrate cloud cost optimization in your SDLC:</p> <ul> <li><strong>Requirement analysis:</strong> At this initial stage, outline the cloud resources that will be required. For example, factor in database scaling and storage needs if you’re developing a data-intensive application.</li> <li><strong>Design and planning:</strong> Design your application architecture with scalability and efficiency in mind. Opt for serverless architectures or microservices where feasible, as they can scale according to demand, often resulting in cost savings.</li> <li><strong>Development:</strong> Train your developers to write efficient, modular code. 
Implement code reviews focusing not only on functionality but also on resource optimization. Use tools that can identify resource-heavy code snippets, such as <a href="https://www.ej-technologies.com/products/jprofiler/overview.html">JProfiler in Java</a> or the <a href="https://nodejs.org/en/docs/guides/simple-profiling">built-in profiler in Node.js</a>.</li> <li><strong>Testing:</strong> During your testing phase, check for functionality and monitor how the application impacts cloud resource consumption. Tools like the <a href="https://aws.amazon.com/aws-cost-management/aws-cost-explorer/">AWS Cost Explorer</a> can provide insights into how specific services consume resources during test runs.</li> <li><strong>Deployment:</strong> Optimize your deployment strategies. Use strategies such as <a href="https://codefresh.io/learn/software-deployment/what-is-blue-green-deployment/">blue-green deployments</a> or <a href="https://martinfowler.com/bliki/CanaryRelease.html">canary releases</a> to ensure you’re not provisioning unnecessary resources during the rollout.</li> <li><strong>Maintenance:</strong> Regularly review and refactor your code. Older applications might be running on outdated services or architectures that are no longer cost-effective. Upgrading or transitioning to newer, more efficient services can yield significant savings.</li> </ul> <h2 id="analyze-cost-attribution-and-usage-to-rightsize-hybrid-cloud-capacity-and-failover">Analyze cost attribution and usage to rightsize hybrid cloud capacity and failover</h2> <p>Optimizing a hybrid cloud environment can be particularly challenging, given the interplay between on-premises, private cloud, and public cloud resources. The key is understanding cost attribution and accurately sizing your capacities for primary operations and failover scenarios.</p> <p>Imagine a company that’s overprovisioned its private cloud, assuming high traffic. Concurrently, it’s underprovisioned its public cloud failover capacity. During an unexpected traffic spike, the public cloud resources are quickly overwhelmed, causing significant downtime and lost revenue. A proper analysis of cost attribution and resource usage would have painted a clearer picture of actual needs.</p> <p>To help you analyze cost attribution, consider implementing the following:</p> <ul> <li><strong>Understand your workloads:</strong> Start by identifying which applications and workloads are best suited for private clouds and which can be shifted to public clouds. Some sensitive applications need the security of a private cloud, while more scalable, consumer-facing apps could benefit from the elasticity of public clouds.</li> <li><strong>Monitor usage patterns:</strong> Regularly monitor the usage patterns of your workloads. Utilize tools that provide insights into peak usage times, idle times, and resource demands. This data is invaluable in rightsizing your capacities.</li> <li><strong>Implement cost attribution tools:</strong> Use tools that can break down costs by departments, teams, projects, or even individual applications.
Platforms like <a href="https://cloudhealth.vmware.com/">CloudHealth</a> or <a href="https://www.apptio.com/products/cloudability/">Cloudability</a> can provide granular insights into where your cloud costs are originating.</li> </ul> <h2 id="efficiently-transfer-cloud-traffic-with-full-visibility-low-costs-and-no-sprawl">Efficiently transfer cloud traffic with full visibility, low costs, and no sprawl</h2> <p>Migrating and managing data traffic between public and private clouds is critical to hybrid cloud architectures. Achieving this with complete visibility while minimizing costs requires a strategic approach.</p> <p>For instance, consider a scenario where a company transfers large data sets between clouds without a traffic management strategy. Soon, they’re hit with hefty data transfer fees, and to compound the issue, they can’t pinpoint the exact sources of high costs due to a lack of visibility. This lack of control can also lead to <a href="https://www.tierpoint.com/blog/cloud-sprawl-what-is-it-and-how-to-control-it/">cloud sprawl</a>—an uncontrolled proliferation of cloud instances, storage, or services.</p> <p>To mitigate this scenario and help efficiently manage data traffic, consider implementing the following:</p> <ul> <li><strong>Audit data movement patterns:</strong> Understand what data needs to be moved, how frequently, and why. Regularly auditing this can help you spot inefficiencies or redundancies in your data transfer patterns.</li> <li><strong>Implement traffic visibility tools:</strong> Platforms like <a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html">Amazon VPC Flow Logs</a> provide insights into your cloud traffic, allowing you to monitor and optimize data transfers effectively.</li> <li><strong>Localize data when possible:</strong> Keep data closer to its primary point of use. If most of your users or applications are in a specific region, try to store the data in the same region to reduce inter-region data transfer costs.</li> <li><strong>Control cloud sprawl:</strong> Implement strict governance policies. Tools like <a href="https://aws.amazon.com/servicecatalog/">AWS Service Catalog</a> or <a href="https://learn.microsoft.com/en-us/azure/governance/policy/overview">Azure Policy</a> can help enforce rules on which services can be provisioned, reducing the risk of unnecessary resource proliferation.</li> </ul> <h2 id="balance-cost-and-performance-with-real-time-analysis-during-and-after-migration">Balance cost and performance with real-time analysis during and after migration</h2> <p>Navigating the constant juggling act between performance and costs is an ongoing challenge in cloud management. This equilibrium becomes even more crucial during workload migrations, as unexpected costs or performance hiccups can significantly disrupt operations.</p> <p>Consider a business migrating a mission-critical application to a new cloud provider. Without real-time analysis during this migration, they could overspend on resources or end up with insufficient capacity, leading to poor application performance. Real-time monitoring can act as a safety net, ensuring neither scenario unfolds.</p> <p>Consider implementing the following best practices to help you maintain this balance:</p> <ul> <li><strong>Define performance metrics:</strong> Outline the key performance indicators (KPIs) crucial for your workloads before migrating.
This could include response times, availability percentages, or error rates.</li> <li><strong>Deploy real-time monitoring tools:</strong> Platforms like <a href="https://newrelic.com/">New Relic</a> or <a href="https://grafana.com/">Grafana</a> can provide live insights into how your workloads are performing during migration. These tools can alert you to potential issues before they escalate.</li> <li><strong>Opt for incremental migrations:</strong> Transfer workloads in phases instead of migrating everything simultaneously. This allows you to monitor and adjust in real time, ensuring each migration phase is optimized for cost and performance.</li> <li><strong>Optimize over time:</strong> The cloud is dynamic. Regularly review the performance and costs of your migrated workloads. As you gather more data, refine your resource allocation strategies to maintain the delicate balance.</li> </ul> <h2 id="optimize-hybrid-cloud-interconnect-capacity">Optimize hybrid cloud interconnect capacity</h2> <p>For organizations leveraging a hybrid cloud model, the capacity of interconnections between their on-premise infrastructure, private clouds, and public clouds is crucial. An optimized interconnect ensures seamless operations, high availability, and efficient resource usage.</p> <p>For instance, consider a hypothetical enterprise that has applications split across private and public clouds. These applications frequently communicate, but performance degrades due to a bottleneck in the interconnect, especially during peak times. Had the interconnect capacity been optimized, such a performance hit could have been avoided.</p> <p>Following are a few ways you can optimize hybrid cloud interconnect:</p> <ul> <li><strong>Assess traffic patterns:</strong> Regularly review the volume and nature of traffic flowing between your on-premises infrastructure and cloud setups. This gives you insights into the required interconnect capacity.</li> <li><strong>Scale as needed:</strong> Interconnect capacity shouldn’t be static. During times of expected high traffic, scale up the capacity and then scale it down again during off-peak times. This dynamic approach ensures you’re only paying for what you need.</li> <li><strong>Regularly review SLAs:</strong> The service level agreements (SLAs) associated with your interconnects should be reviewed periodically. Ensure they align with your current requirements, and don’t hesitate to renegotiate if necessary.</li> </ul> <h2 id="plan-migrations-by-understanding-baseline-traffic-and-dependencies">Plan migrations by understanding baseline traffic and dependencies</h2> <p>Migrating workloads to or between cloud environments is no trivial task. The foundation of a smooth migration lies in understanding the baseline traffic and the interdependencies of applications and services.</p> <p>Imagine an e-commerce company migrating its inventory management system without considering its dependency on the billing system. During the migration, the inventory system gets momentarily disconnected from the billing system, resulting in failed transactions and lost revenue. Such pitfalls can be avoided by having a clear map of traffic and dependencies.</p> <p>Use the following best practices to help you understand baseline traffic and dependencies:</p> <ul> <li><strong>Conduct a traffic baseline analysis:</strong> Before any migration, assess the regular traffic patterns of the applications or services in question. 
Tools like Kentik can provide historical data that can help plan migration schedules with minimal disruption.</li> <li><strong>Map out dependencies:</strong> Understand how different applications, databases, and services communicate with each other. Tools like <a href="https://www.dynatrace.com/">Dynatrace</a> or <a href="https://www.appdynamics.com/">AppDynamics</a> can help visualize these dependencies, ensuring no crucial links are broken during migration.</li> <li><strong>Choose the right migration window:</strong> Pick a migration window during off-peak hours (based on the <a href="https://www.kentik.com/kentipedia/network-traffic-analysis/" title="Kentipedia: Network Traffic Analysis">baseline traffic analysis</a>) to minimize potential disruptions to end users or other services.</li> <li><strong>Test in staging environments:</strong> Replicate the process in a staging environment before the actual migration. This highlights any potential issues or gaps in the migration plan.</li> <li><strong>Communicate and coordinate:</strong> Ensure that all relevant teams (DevOps, <a href="https://www.kentik.com/kentipedia/what-is-netops-network-operations/" title="Kentipedia: What is NetOps?">NetOps</a>, or IT support) are in the loop. Clear communication ensures everyone is on the same page and ready to address any unforeseen issues promptly.</li> <li><strong>Monitor actively during migration:</strong> Keep a close eye on the migration process. Real-time monitoring tools can provide insights into bottlenecks, failures, or performance degradation, allowing for swift remediation.</li> </ul> <h2 id="monitor-test-and-optimize-the-performance-of-interconnect-traffic-between-cloud-and-data-center">Monitor, test, and optimize the performance of interconnect traffic between cloud and data center</h2> <p>The bridge between your traditional data center and the cloud is not merely a data highway; it’s the lifeline of your hybrid operations. Ensuring its optimal performance is paramount for the seamless functioning of your applications and services.</p> <p>For instance, consider a financial firm that relies on real-time data feeds from its data center to its cloud-based analytics platform. Any latency or disruption in this interconnect could delay crucial investment decisions, resulting in potential losses. The importance of consistent monitoring and optimization can’t be overstated.</p> <p>Implement the following best practices to help you monitor, test, and optimize the performance of interconnect traffic:</p> <ul> <li><strong>Deploy traffic monitoring tools:</strong> Utilize monitoring tools such as Kentik to gain insights into the traffic flow between your data center, the cloud, and traffic between and among multiple clouds.</li> <li><strong>Establish performance benchmarks:</strong> Define the interconnect’s acceptable latency, throughput, and error rates. This provides a baseline against which to measure and optimize.</li> <li><strong>Conduct regular stress tests:</strong> Simulate high-traffic scenarios to understand how the interconnect performs under load. This can help you spot potential bottlenecks or weak points.</li> <li><strong>Proactively address issues:</strong> Don’t wait for problems to escalate. If monitoring tools indicate a potential issue, address it immediately. 
This proactive approach reduces downtime and ensures optimal performance.</li> </ul> <h2 id="tune-capacity-and-redundancy-based-on-workloads-and-traffic-shifts">Tune capacity and redundancy based on workloads and traffic shifts</h2> <p>In a dynamic cloud environment, the only constant is change. As your business evolves, so will its demands on the cloud infrastructure. Actively tuning capacity and ensuring redundancy becomes vital to keeping costs in check while maintaining performance.</p> <p>Consider a streaming service that sees a sudden influx of users due to a hit show’s release. If capacity isn’t adjusted in real time, services could slow down or crash, leading to user dissatisfaction and the possibility of losing subscribers. Conversely, overprovisioning during off-peak times would result in unnecessary costs.</p> <p>To help tune capacity and redundancy, consider implementing the following best practices:</p> <ul> <li><strong>Implement dynamic scaling:</strong> Most cloud providers, like <a href="https://aws.amazon.com/">Amazon Web Services (AWS)</a> and <a href="https://azure.microsoft.com/en-us">Microsoft Azure</a>, offer autoscaling capabilities. These tools automatically adjust resources based on real-time demand, ensuring performance while optimizing costs.</li> <li><strong>Utilize predictive analytics:</strong> Use tools like <a href="https://aws.amazon.com/forecast/">Amazon Forecast</a> to forecast demand based on historical data and trends. Predicting surges allows you to adjust capacity preemptively.</li> <li><strong>Perform frequent capacity audits:</strong> Regularly assess your infrastructure to identify underutilized resources. Scaling down or decommissioning these can lead to significant cost savings.</li> <li><strong>Optimize for seasonality:</strong> Adjust your infrastructure if your business experiences seasonal demand fluctuations. For instance, an e-commerce platform might need more resources during holiday sales but can scale down during off-seasons.</li> </ul> <h2 id="use-reserved-and-spot-instances-for-cost-efficient-capacity-planning">Use reserved and spot instances for cost-efficient capacity planning</h2> <p>Cloud cost management isn’t just about the amount of resources you consume but also how you procure and utilize them. Reserved and spot instances offer opportunities to use cloud resources cost-effectively, especially when paired with diligent capacity planning.</p> <p>Think of an online retail platform that experiences consistent traffic throughout the year but sees significant spikes during sale events. Relying solely on on-demand instances can be costly during regular operations and might not guarantee availability during high-demand periods. A strategic mix of reserved and spot instances can optimize costs and performance.</p> <p>To that end, consider the following best practices:</p> <ul> <li><strong>Understand your workloads:</strong> Classify your workloads into predictable (consistent usage) and variable (sporadic spikes) categories. This informs your decision on which instance types to use.</li> <li><strong>Leverage reserved instances:</strong> For predictable workloads, purchase reserved instances. These offer substantial discounts compared to on-demand pricing. 
Cloud providers like AWS, Azure, and <a href="https://cloud.google.com/">Google Cloud (GCP)</a> offer various reserved instance options with different pricing and flexibility levels.</li> <li><strong>Tap into spot instances:</strong> Consider using spot instances for short-term, variable workloads. You can bid for these spare cloud resources, often available at a fraction of the on-demand price. However, be mindful that they can be terminated if the resources are needed elsewhere.</li> <li><strong>Stay informed on pricing models:</strong> Cloud providers frequently adjust their pricing models and introduce new offerings. Keep an eye out for any changes or promotions that can offer better value for your operations.</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>Cloud cost optimization is not a one-off task but a continuous journey. From setting budgets to leveraging various instance types, there are numerous strategies to ensure you’re getting the most bang for your buck in the cloud environment. Whether you’re a cloud engineer, network engineer, DevOps, or NetOps professional, understanding these best practices can significantly influence your operations’ efficiency, performance, and cost-effectiveness.</p> <p>As you venture deeper into cloud cost optimization, don’t do it alone. <a href="https://www.kentik.com/product/cloud/">Kentik Cloud</a> can be a powerful ally. Kentik offers unparalleled insights into your cloud utilization, empowering you to make decisions that lead to tangible savings. Why not take the next step? Dive into the <a href="https://www.kentik.com/get-started/">world of Kentik</a> and discover how it can be the cornerstone of your cloud cost optimization.</p><![CDATA[Using Kentik to Fight DDoS at the Source]]><![CDATA[In this blog post, we describe how one backbone service provider uses Kentik to identify and root out spoofed traffic used to launch DDoS attacks. It's a "moral responsibility," says their chief architect.]]>https://www.kentik.com/blog/using-kentik-to-fight-ddos-at-the-sourcehttps://www.kentik.com/blog/using-kentik-to-fight-ddos-at-the-source<![CDATA[Doug Madory]]>Thu, 14 Sep 2023 04:00:00 GMT<p>For decades, the scourge of distributed denial of service (DDoS) attacks has plagued the internet. One of the most common forms of DDoS attack is the <a href="https://en.wikipedia.org/wiki/Denial-of-service_attack#Reflected_attack">reflection attack</a>. This type of DDoS takes advantage of stateless protocols to trick a myriad of internet-connected devices into sending an overwhelming amount of traffic to an intended target, degrading its connectivity or completely knocking it offline.</p> <p>The attacker sends thousands of requests with “spoofed” source IP addresses. Instead of the attacker’s real IP address, the source IP address is modified to be the address of the intended target. These requests utilize a stateless protocol like NTP or DNS.
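</p> <p>The power of a reflection attack comes from amplification: a small spoofed request elicits a much larger response aimed at the victim. The arithmetic below is purely illustrative; the request and response sizes are hypothetical, and real amplification factors vary by protocol, server implementation, and query type.</p> <pre><code># Illustrative reflection/amplification arithmetic. The byte counts
# below are hypothetical; real factors vary widely in practice.

protocols = {
    # protocol: (request_bytes, typical_response_bytes) -- made-up values
    "dns_any": (60, 3000),
    "ntp_monlist": (50, 25000),
    "ssdp": (90, 2700),
}

attacker_uplink_bps = 100_000_000  # 100 Mbps of spoofed request traffic

for name, (req, resp) in protocols.items():
    factor = resp / req
    victim_bps = attacker_uplink_bps * factor
    print(f"{name}: ~{factor:.0f}x amplification, "
          f"{victim_bps / 1e9:.1f} Gbps aimed at the victim")</code></pre> <p>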
It’s typically a UDP-based protocol since the attacker wouldn’t be able to complete TCP’s three-way handshake with a spoofed source address, although <a href="https://blog.apnic.net/2022/10/18/a-new-ddos-attack-vector-tcp-middlebox-reflection/">some attacks</a> have found ways around this restriction.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2vzJMJeTgumIvJuOC1wpLi/b01d085a727a85677ab8df52da518590/attacker-ip-spoof.png" style="max-width: 600px;" class="image center no-shadow" alt="Spoofing attack" /> <p>During the attack, thousands of servers unwittingly respond to these UDP requests with responses sent to the forged source address, which, of course, is the target of the attack. Incidents such as these take place numerous times every day and have given rise to the symbiotic industries of DDoS-for-hire services as well as the DDoS mitigation vendors that defend against their attacks.</p> <p>One of the reasons that reflection attacks are so difficult to stop is that the victim has no idea where on the internet the attack is actually being initiated. While the victim network might see the IP addresses of the servers doing the reflection, it is unable to discern the true source from the attack traffic.</p> <div as="Promo"></div> <h2 id="addressing-the-problem">Addressing the problem</h2> <p>There are multiple places to break the chain of events leading to reflection-style DDoS attacks. First, networks should implement <a href="http://www.bcp38.info/">BCP38</a> wherever possible. This will prevent the hosts on a network from being able to send spoofed traffic to the internet.</p> <p>A BCP is an IETF Request for Comments (RFC) document that has been designated a Best Current Practice. BCP38 presently refers to <a href="https://datatracker.ietf.org/doc/html/rfc2827">RFC2827</a>: Network Ingress Filtering: Defeating Denial of Service Attacks which employ IP Source Address Spoofing.</p> <p>BCP38 specifies that a network should drop packets with source IP addresses that are likely forged. It determines that a packet has a forged source address if the address is not in the forwarding table of the interface it was received on. In other words, if a packet’s source address could never receive a response back through the interface the packet arrived on, that source address is likely spoofed.</p> <p>Despite the publication of BCP38 in the year 2000, <a href="https://spoofer.caida.org/summary.php">many networks</a> still allow spoofed traffic. In some cases, this is because BCP38 filters can block legitimate traffic, such as in the case of multihoming (although there is a <a href="https://www.rfc-editor.org/info/bcp84">BCP</a> for that, as well).</p> <p>Regardless of the reason, we simply can’t solely rely on BCP38 filtering to prevent spoofed traffic. We need service providers (SPs) to do their part as well. One way SPs can help is to police their network for spoofed traffic using a NetFlow analytics platform like Kentik.</p> <h2 id="using-kentik-to-traceback-ddos-sources">Using Kentik to trace back DDoS sources</h2> <p>“Service providers have a moral responsibility to identify and remediate customer networks which are sending spoofed traffic,” says Aaron Weintraub, chief architect for Cogent Communications, one of the world’s largest network service providers.</p> <p>Aaron utilizes a customized workflow in Kentik’s Data Explorer (KDE) to identify customer networks that are sending traffic in violation of BCP38.
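</p> <p>Before getting into the workflow itself, it’s worth pausing on what the BCP38 check actually does. The sketch below is a simplified, uRPF-style reverse-path check in Python; a real router performs this lookup against its FIB in hardware, and the two-entry routing table here is hypothetical.</p> <pre><code># A strict reverse-path (uRPF-style) check in the spirit of BCP38:
# accept a packet only if the best route back to its source address
# points out the interface the packet arrived on. The routing table
# below is a hypothetical two-entry example.

import ipaddress

ROUTES = {
    ipaddress.ip_network("198.51.100.0/24"): "eth0",  # customer LAN
    ipaddress.ip_network("0.0.0.0/0"): "eth1",        # default to upstream
}

def reverse_path_interface(src_ip):
    """Return the egress interface of the most specific route to src_ip."""
    matches = [(net, ifname) for net, ifname in ROUTES.items()
               if src_ip in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

def accept(src, arrived_on):
    return reverse_path_interface(ipaddress.ip_address(src)) == arrived_on

print(accept("198.51.100.7", "eth0"))  # True: source routes back out eth0
print(accept("203.0.113.9", "eth0"))   # False: likely spoofed on eth0</code></pre> <p>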
KDE is an extensive NetFlow analysis platform used by hundreds of companies to optimize and manage how traffic traverses their network.</p> <p>His methodology boils down to two steps:</p> <p><strong>Step 1.</strong> Find spikes of packets from customer networks to a large set of unique destination IP addresses using commonly abused UDP ports. For this Aaron uses the following ports: 19, 53, 123, 161, 389, 427, 1900, 3283, 3702, 10001, 10074, 11211, 37810, 32414.</p> <p>He excludes a list of customer networks that he has already investigated — either because the traffic turned out to be legitimate (e.g., a network hosting a popular DNS server), or because he is already actively engaging with the customer about the problematic traffic.</p> <p>Aaron builds a query like the one below, using packets/sec and unique destination IP address count as the metrics. Then, he plots the results on a line chart for visual inspection. To expedite this process, he has these queries saved in a library, so they are trivial to rerun.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6prKSGhvHb40NDFAWDqniy/0d164c9a2365ae7a16513410f79da628/filtering-options.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Filtering options in the Kentik platform" /> <p><strong>Step 2.</strong> For any suspicious spikes in packets to those selected UDP ports, a quick subsequent investigation reveals the nature of the traffic. He next checks the source IPs of these packets coming from that customer.</p> <p>Aaron runs a query to isolate traffic coming from the router interface(s) serving the customer. He’s looking for the telltale signs of spoofed traffic responsible for launching DDoS attacks.</p> <p>That traffic will have “source IPs” from an improbably diverse set of ASNs — networks which would have never routed traffic through the customer network to Cogent. This diverse set of ASNs are actually the targets of on-going DDoS attacks that are being launched by this spoofed traffic.</p> <p>Below is an example of the secondary analysis described above. It shows spikes of packets across a customer interface to thousands of unique destination IPs (the unwitting reflection nodes) and seemingly from a variety of source networks, such as Kazakh mobile provider Beeline and US mobile provider Verizon, the targets of the DDoS attacks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1NK29vKj8o4LKMg9XAkiVO/7e01b001e2097c08565c9863264d8719/secondary-analysis_copy.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Spikes from the secondary analysis" /> <h2 id="now-the-fun-begins">Now the fun begins</h2> <p>Finding a customer network that is the source of spoofed traffic triggering DDoS attacks might actually be the easy part (provided one has Kentik) — getting them to stop? Well, that is a different animal entirely.</p> <p>Aaron could simply refer the traffic to Cogent’s abuse team for them to take action according to their acceptable use policy (AUP) that they, as well as every customer agreed to upon initiating service. That might ultimately get them off Cogent’s network, but it will have made little difference in the broader fight against DDoS.</p> <p>His objective is to get the customer’s networking team to understand that they are the source of problematic spoofed traffic and address it. 
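</p> <p>To make the two-step hunt concrete, here is a rough sketch of the first step’s logic expressed in Python: for each customer network, count the unique destination IPs hitting commonly abused UDP ports. The flow field names and the threshold are hypothetical, and in practice this is a saved query in Kentik’s Data Explorer rather than a script; the code simply illustrates the idea.</p> <pre><code># Step 1 of the hunt, sketched in code: for each customer network, count
# unique destination IPs on commonly abused UDP ports. Field names and
# the threshold are hypothetical; in practice this is a saved Data
# Explorer query, not a script.

from collections import defaultdict

ABUSED_UDP_PORTS = {19, 53, 123, 161, 389, 427, 1900, 3283,
                    3702, 10001, 10074, 11211, 37810, 32414}
ALREADY_VETTED = {"customer-dns-host"}   # known-legitimate senders
UNIQUE_DST_THRESHOLD = 1000              # arbitrary example threshold

def suspicious_customers(flows):
    """flows: iterable of dicts with customer, dst_ip, dst_port, proto."""
    dsts = defaultdict(set)
    for f in flows:
        if f["proto"] == "udp" and f["dst_port"] in ABUSED_UDP_PORTS:
            dsts[f["customer"]].add(f["dst_ip"])
    return [c for c, ips in dsts.items()
            if len(ips) > UNIQUE_DST_THRESHOLD and c not in ALREADY_VETTED]</code></pre> <p>Ideally, the customer’s team acknowledges the finding and stops the spoofed traffic at its source. 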
If they don’t, he can either impose strict BCP38 filtering from his side and risk blocking legitimate traffic or, as a last resort, refer them to his abuse team.</p> <p>Using an email template that describes how spoofed traffic fuels DDoS attacks, Aaron reaches out to the customer with a shareable link to the view above in Kentik showing the spoofed traffic coming from the customer’s network.</p> <p>Engaging with customers can be a very time-consuming process. There can be language barriers, network engineers who are either overworked or poorly trained, and, unfortunately, some networking teams who are simply uninterested in fixing the problem.</p> <p>A certain cat at a cloud provider made a humorous anti-spoofing reflection response bingo card to capture the variety of unhelpful responses encountered when trying to bring spoofing to the attention of an external peer network.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2ym74W6ih1qXCsfKZeehw4/bf5c7ebfa4d198cd49a2b7dcdfb04fcd/anti-spoofing-bingo.jpg" style="max-width: 550px;" class="image center no-shadow" alt="Anti-spoofing Bingo card" /> <h2 id="yielding-results-against-ddos">Yielding results against DDoS</h2> <p>It is a painstaking process, but it is part of a larger effort that has begun to yield results. One of the key insights of those battling DDoS was that there was a finite amount of networks which were responsible for launching most attacks. This insight meant that if the large service providers coordinated with each other, they could identify and eliminate many of the sources of spoofed traffic and thus make a dent in the number and severity of DDoS attacks.</p> <p>In a <a href="https://www.kentik.com/resources/replay-what-network-teams-need-to-know-to-be-successful-in-2023/">webinar</a> I did earlier this year with backbone carrier Arelion’s Chief Evangelist Mattias Fridström, he reported seeing a decline in the number of DDoS attacks seen across their global network and chalked it up to this collaboration between Tier 1 operators (service providers). Mattias’s slide is below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/65KKAgxl6sXeFE0oLm91Rd/38c635cad7ea38cb6ed3cbcc1c06395e/arelion-slide.png" style="max-width: 800px;" class="image center" alt="Arelion - DDoS attack trending downward" /> <p>Another factor contributing to the reduction in DDoS is the work of the “Big Pipes” effort, <a href="https://www.wired.com/story/big-pipes-ddos-for-hire-fbi/">profiled</a> in WIRED back in May. This group is another example of experts from multiple (sometimes competing) companies working together with the common objective of eliminating DDoS attacks. Big Pipes works with law enforcement to take down DDoS-for-hire services, better known as booter services.</p> <p>The bottom line is if your network is allowing spoofed traffic, there is a good chance that someone is taking advantage of this fact and using a connection on your infrastructure to launch DDoS attacks against victims around the world.</p> <p>If you run a network that operates as a service provider, whether a university or a backbone carrier, you have a responsibility to the rest of the internet to actively look for and eliminate spoofed traffic. And a solution like Kentik is perfect for the job.</p> <p>If someone like Aaron reaches out to you about spoofed traffic coming from your network, you’ll need tools in place to investigate and address the claims. 
You don’t want to be the one to complete his <em>anti-spoofing response bingo card.</em></p><![CDATA[Box and Kentik: A Google Cloud Migration Success Story]]><![CDATA[Tools and partners can make or break the cloud migration process. Read how Box used Kentik to make their Google Cloud migration successful.]]>https://www.kentik.com/blog/box-and-kentik-a-google-cloud-migration-success-storyhttps://www.kentik.com/blog/box-and-kentik-a-google-cloud-migration-success-story<![CDATA[Phil Gervasi]]>Wed, 13 Sep 2023 04:00:00 GMT<h2 id="introduction">Introduction</h2> <p>In the age of digital transformation, secure content management and collaborative prowess are as crucial as ever. <a href="https://www.box.com/">Box</a> is an esteemed leader in cloud content management, serving an impressive 68% of the Fortune 500. Over the years, Box has leaned on the Kentik Network Observability Platform to bolster its network performance. So, when Box faced the complex task of migrating to Google Cloud, they chose yet again to use Kentik as a trusted partner.</p> <h2 id="background">Background</h2> <p>Established in 2005, Box has built its reputation as the preeminent Content Cloud. Their ethos revolves around offering a singular platform for organizations to manage, collaborate, and secure content seamlessly. It’s no surprise that for such a pivotal migration, the importance of preserving their network’s integrity was at the forefront. Box simply couldn’t allow their users to see an interruption in service while they underwent the Google Cloud migration.</p> <h2 id="the-scenario">The scenario</h2> <p>Box’s network team faced the challenge of migrating to Google Cloud on a very short timetable. Given the condensed timeline, Box’s network team faced the dual challenge of ensuring on-prem visibility during and post-migration. Google Cloud does offer a suite of native tools for monitoring; however, they fall short in granting the comprehensive visibility that Box, accustomed to Kentik’s dashboards and insights, needed.</p> <p>“We love the dashboards and the custom alerts we get with Kentik. Adding another network monitoring platform would have made our jobs so much more difficult. With Kentik, we are confident that we have the visibility we need to take action to ensure our network is performing optimally,” said Louis Bolanos, Staff Cloud Network Engineer.</p> <p>That visibility enabled Box to discover lingering dependencies between cloud and on-prem services, helping to ensure an exceptional experience for customers during the migration.</p> <img src="//images.ctfassets.net/6yom6slo28h2/fmfWSQfWWCVhpDJGZ0Wnn/cd51299329a54f3ceab3f4a6a473a84e/kubernetes-monitoring1.jpg" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Kubernetes in the Kentik platform" /> <p>However, Box’s existing on-prem infrastructure still needed maintenance throughout the migration. Monitoring and maintaining on-prem, cloud, and Kubernetes networks simultaneously without additional headcount would challenge this migration. Additionally, Box knew they needed to pinpoint latency issues, bandwidth utilization, and hairpinning between on-prem and cloud services.</p> <p>Meeting security and compliance requirements was also an important consideration. 
Box needed to ensure stringent security measures were in place to protect sensitive customer data and comply with industry regulations while maintaining its high standard of seamless content collaboration and data sharing.</p> <p>They soon realized that cloud costs could quickly spiral out of control without visibility into inter- and intra-cloud traffic flows. Identifying inter-region traffic volumes to perform cost attribution efficiently, even down to the level of Kubernetes clusters, would be a challenge.</p> <img src="//images.ctfassets.net/6yom6slo28h2/62Kx5UFf83lE8WqY9szEJC/85221a15c600fd3ddccdf107721804ff/sankey.jpg" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Sankey diagram" /> <h2 id="kentiks-role-in-the-migration">Kentik’s role in the migration</h2> <p>Box’s partnership with Kentik proved instrumental in navigating the intricacies of the Google Cloud migration. Using Kentik offered Box:</p> <ul> <li><strong>Cloud monitoring</strong>: Kentik was vital in providing in-depth traffic analysis, bottleneck identification, and overall network optimization during the migration. Network behavior within and between container workloads, particularly Kubernetes, was also critical to the Box networking team. “Visibility into cloud-deployed Kubernetes clusters in Kentik is super clear and makes troubleshooting so much easier,” said Bolanos.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/3n8nNu9Lk90xEb7ttCDCb6/c182e47f68cc281531cbd67005818c38/map.jpg" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Kentik Map" /> <ul> <li> <p><strong>Anomaly detection and alerting</strong>: Kentik highlighted all unusual network behaviors. Their real-time alerts enabled Box’s IT teams to avert potential service disruptions swiftly.</p> </li> <li> <p><strong>Security analytics</strong>: Kentik helped Box monitor network traffic for potential security threats and see malicious or anomalous activity easily.</p> </li> <li> <p><strong>Scalability and flexibility</strong>: Kentik’s cloud-native architecture seamlessly integrated with Box’s existing infrastructure and provided scalability to handle its growing user base and evolving business requirements. “We were able to provide custom migration dashboards for internal service owners, allowing them to observe traffic declines on services that were being sunset and discover misconfigured services routing calls to the wrong locations,” said Bolanos.</p> </li> <li> <p><strong>Cost efficiency</strong>: Crucial insights from Kentik ensured that traffic flowing to, from, and within Google Cloud remained performance- and cost-optimized.</p> </li> </ul> <h2 id="the-results">The results</h2> <p>Kentik’s role in the migration was transformative. Bolanos encapsulates the sentiment, “Kentik saved us time and gave us full confidence that we’d be aware of and be able to respond to any cloud or on-prem network issues on a timely basis.”</p> <p>With Kentik, Box could inventory and track migration progress efficiently, collaborate without pivot fatigue, and monitor/validate the performance of Box services during migration.</p> <p>“As our network continues to evolve, we have confidence that Kentik will continue to meet our network observability needs. With the network diversifying into various flavors of virtualization on top of cloud migrations, this drove us to look at various aspects of network performance in different tools. 
Kentik’s Google Cloud and Kubernetes observability allowed us to investigate traffic on Google Cloud VMs and K8s clusters, including the separation of workload namespace to help simplify our suite of tools used for reporting and troubleshooting.”</p> <h2 id="conclusion">Conclusion</h2> <p>Tools and partners can make or break the cloud migration process. The Kentik Network Observability Platform helped Box achieve the visibility and responsiveness native tools couldn’t provide while migrating to Google Cloud. Box can rely on Kentik when introducing or extending a cloud network, performing migrations, or maintaining hybrid cloud networks. “To save time and costs and improve the performance of hybrid networks, there’s no need to retool if you use Kentik,” said Bolanos.</p> <p><a href="https://www.kentik.com/resources/case-study-box/">Read the entire case study</a>.</p><![CDATA[Anatomy of an OTT Traffic Surge: NFL Kickoff on Peacock]]><![CDATA[Football is officially back, and Doug Madory is here to show you exactly how well the NFL's streaming traffic was delivered. ]]>https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-nfl-kickoff-on-peacockhttps://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-nfl-kickoff-on-peacock<![CDATA[Doug Madory]]>Tue, 12 Sep 2023 16:00:00 GMT<p><a href="https://www.peacocktv.com/collections/nbc">Peacock</a>, NBC’s streaming service, carried the NFL’s opening night game on Thursday featuring the defending Super Bowl champions Kansas City Chiefs facing off against the Detroit Lions. Then a few days later, Peacock carried the Sunday night blowout win by the Dallas Cowboys over the New York Giants. In this post, we’ll take a look at how that traffic was delivered during those games.</p> <h2 id="ott-service-tracking">OTT Service Tracking</h2> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/" title="Learn more about the Kentik Service Provider Analytics product">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/" title="Gaming as an OTT service: Virgin Media reveals that Call Of Duty: Warzone has the “biggest impact” on its network">Call of Duty update</a> or a <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday/">Microsoft Patch Tuesday</a>, these OTT traffic events can put a lot of load on a network, and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. Knowing the source and destination IPs in the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p><a href="https://www.kentik.com/resources/kentik-true-origin/" title="Learn more about Kentik True Origin">Kentik True Origin</a> is the engine that powers the OTT Service Tracking workflow.
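</p> <p>To illustrate the general idea of joining DNS answers with flow records, here is a deliberately simplified sketch in Python. This is not Kentik’s actual implementation; the hostnames, patterns, and field names are all hypothetical.</p> <pre><code># The general idea behind OTT attribution: DNS answers reveal which
# service a CDN IP is currently serving for a subscriber, and that
# mapping is then used to label NetFlow records. Simplified sketch
# only; all names and values below are hypothetical.

dns_answers = [
    # (queried hostname, answer IP) -- hypothetical observations
    ("vod.peacock.example-cdn.net", "192.0.2.10"),
    ("updates.example-game.net", "192.0.2.20"),
]

SERVICE_PATTERNS = {"peacock": "Peacock", "example-game": "Game updates"}

ip_to_service = {}
for hostname, ip in dns_answers:
    for pattern, service in SERVICE_PATTERNS.items():
        if pattern in hostname:
            ip_to_service[ip] = service

flows = [{"src_ip": "192.0.2.10", "bytes": 5_000_000},
         {"src_ip": "192.0.2.20", "bytes": 800_000}]

for flow in flows:
    service = ip_to_service.get(flow["src_ip"], "unknown")
    print(f'{service}: {flow["bytes"]} bytes')</code></pre> <p>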
True Origin detects and analyzes the DNA of over 540 categorized OTT services and providers and more than 50 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p> <h2 id="kicking-off-the-104th-season-of-the-nfl">Kicking off the 104th season of the NFL</h2> <p>In these days of an endlessly fractured media landscape, professional football remains a ratings powerhouse in the United States, consistently drawing in audiences numbered in the millions.</p> <p>As illustrated below in a screenshot from Kentik’s Data Explorer view, Peacock traffic dramatically surged during the evening on Thursday and Sunday. If traffic volume can indicate viewership, then broadcasting the first game of the NFL season more than doubled the number of households watching Peacock during the games.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6MvtoVmYJyQo6sXTZ3HCXM/98b401bc793137e16a8e13bfd53b9499/Peacock_NFL_Traffic_Source.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Peacock NFL Traffic Source" /> <div class="caption" style="margin-top: -30px;">Peacock OTT traffic analyzed with Kentik</div> <p>Based on our customer OTT data, the game was delivered via a variety of content providers including Fastly (28.3%), Edgio/Limelight (23.1%), Amazon/AWS (20%), Akamai (13.5%), and Lumen (11.3%).</p> <p>The graphic below shows how Peacock was delivered during this one-week period. By breaking down the traffic by Source Connectivity Type (below), we can see how the 2023 season kickoff was delivered by a variety of sources including private peering, IXP, embedded cache, and transit. For this mid-September week, the content viewers were consuming overwhelmingly came via private peering (68.3%), but also via transit (25.1%) and IXP (6.3%). For Peacock, embedded caching (0.1%) barely registered.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/qFbpGA2puBpw0CCojHflV/62839060c82ddac031f835067dd25fa4/Peacock_NFL_Source_Type.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Peacock NFL Traffic Source Type" /> <div class="caption" style="margin-top: -30px;">Peacock OTT traffic analysis by source</div> <p>It is normal for CDNs with a last mile cache embedding program to heavily favor this mode of delivery over other connectivity types as it allows:</p> <ol> <li>The ISP to save transit costs</li> <li>The subscribers to get demonstrably better last-mile performance</li> </ol> <p>In addition to source CDN and connectivity type, users of Kentik’s OTT Service Tracking are also able to break down traffic volumes by subscribers, specific router interfaces, and customer locations.</p> <h2 id="how-does-ott-service-tracking-help">How does OTT Service Tracking help?</h2> <p>Previously, my colleague Greg Villain <a href="https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update/" title="Learn more about recent OTT service tracking enhancements">described the latest enhancements</a> to our OTT Service Tracking workflow which allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like the release of a blockbuster movie on streaming can have impacts in all three areas. 
OTT Service Tracking is the key to understanding and responding when they occur. Learn more about the application of Kentik for <a href="https://www.kentik.com/product/subscriber-intelligence/" title="Subscriber Intelligence Use Cases for Kentik">subscriber intelligence</a>.</p> <p>Ready to improve <a href="https://www.kentik.com/kentipedia/ott-services/" title="Kentipedia: OTT Services (Over-the-Top Services)">over-the-top service</a> tracking for your own networks? <a href="#demo_dialog" title="Request your Kentik demo">Get a personalized demo</a>.</p><![CDATA[A Tale of Two BGP Leaks]]><![CDATA[Doug Madory investigates two large BGP leaks from August 28 and 29, 2023 and how RPKI ROV and other technologies can help mitigate widespread internet disruptions that can result for incidents like these.]]>https://www.kentik.com/blog/a-tale-of-two-bgp-leakshttps://www.kentik.com/blog/a-tale-of-two-bgp-leaks<![CDATA[Doug Madory]]>Thu, 31 Aug 2023 04:00:00 GMT<p><em>“It was the best of routes, it was the worst of routes.”</em></p> <p>Earlier this week, the internet experienced another two BGP leaks. Monday saw a path leak emanating from Bangladesh and, on Tuesday, an origination leak from Brazil. While they were brief, they resulted in misdirected traffic and dropped packets, and as such are worthy of investigation.</p> <p>In this blog post, I’ll look into these leaks using Kentik’s unique data and capabilities and see what we can learn from them.</p> <h2 id="what-are-bgp-leaks">What are BGP leaks?</h2> <p><em>“A route leak is the propagation of routing announcement(s) beyond their intended scope.”</em></p> <p>That was the overarching definition of a <a href="https://www.kentik.com/kentipedia/bgp-route-leaks/" title="Kentipedia: What are BGP Route Leaks and How to Protect Your Networks Against Them">BGP route leak</a> introduced by <a href="https://datatracker.ietf.org/doc/html/rfc7908" title="BGP route leaks as defined in RFC7908">RFC7908</a> in 2016. Border Gateway Protocol (BGP) enables the internet to function by providing a mechanism by which autonomous systems (ex: telecoms, companies, universities, etc.) exchange information on how to forward packets based on their destination IP addresses.</p> <p>In this context, the term “route,” when a noun, is shorthand for the prefix (range of IP addresses), AS_PATH and other associated information relating to packet delivery. When routes are circulated farther than where they are supposed to go, traffic can be misdirected, or even disrupted, as happens <a href="https://www.kentik.com/blog/new-year-new-bgp-leaks/" title="Kentik Blog: New Year, New BGP Route Leaks">numerous times per year</a>.</p> <div as="Promo"></div> <p>RFC7908 went on to define a taxonomy for BGP leaks by enumerating six common scenarios, half of which appear in the two leaks covered in this post. In my writing on route leaks, I like to group them into two broad categories: origination leaks and path leaks. 
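</p> <p>To make the distinction concrete, here is a toy classifier in Python. It checks an announcement’s origin against an expected-origin table (standing in for ROA data) and then scans the AS path for a valley-free violation using a relationship table. Both tables are assumptions for illustration: the provider entries reflect the transit relationships described below for Monday’s leak, and peer relationships are omitted entirely.</p> <pre><code># A toy classifier for the two leak categories. EXPECTED_ORIGIN stands
# in for ROA data; PROVIDERS_OF maps an AS to its transit providers.
# Both tables are simplified assumptions for illustration.

EXPECTED_ORIGIN = {"20.46.144.0/20": 8075}   # Microsoft prefix
PROVIDERS_OF = {58715: {9498, 17494}}        # AS58715's transit providers

def classify(prefix, as_path):
    # as_path runs from the receiving AS (left) down to the origin (right)
    origin = as_path[-1]
    if origin != EXPECTED_ORIGIN.get(prefix, origin):
        return "origination leak (wrong origin AS)"
    for left, mid, right in zip(as_path, as_path[1:], as_path[2:]):
        # mid learned the route from right and announced it to left;
        # doing both across provider links violates valley-free routing
        learned_from_provider = right in PROVIDERS_OF.get(mid, set())
        sent_to_provider = left in PROVIDERS_OF.get(mid, set())
        if learned_from_provider and sent_to_provider:
            return "path leak (valley-free violation)"
    return "looks clean"

print(classify("20.46.144.0/20", [17494, 58715, 9498, 59605, 8075]))
print(classify("20.46.144.0/20", [6939, 1299, 8075]))</code></pre> <p>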
As I described in my blog post earlier this year, <a href="https://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidents/" title="Kentik Blog: A Brief History of the Internet’s Biggest BGP Incidents">A Brief History of the Internet’s Biggest BGP Incidents</a>, this distinction is useful because the two types of error require different mitigation strategies.</p> <p>With those definitions out of the way, let’s get into the two leaks.</p> <h2 id="path-leak-on-monday">Path leak on Monday</h2> <p>Beginning at 11:12 UTC, AS58715 leaked almost 30,000 BGP routes to its transit provider BTCL (AS17494), <a href="https://twitter.com/Qrator_Radar/status/1696155402512015609" title="Qrator Twitter status">first reported</a> by our friends at Qrator. These routes were learned from both AS58715’s peers and its other transit providers, making the incident a combination of leak type 1 and 4 from RFC7908.</p> <p>Once circulated onto the internet, the leaked routes misdirected internet traffic from around the world through AS58715 in Bangladesh. Below is a visualization based on Kentik’s aggregate NetFlow data of the traffic <em>that wasn’t destined for Bangladesh</em> seen flowing to AS58715 via AS17494.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3T2zfatSCUoz3bRu81htqN/2418d90c45488c5aba2e139fe77619cf/first-leak.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="BGP Path leak on August 28" /> <p>Let’s dig into a couple of problematic BGP announcements to illustrate the leak. These messages can be found in the Routeviews archive <a href="https://archive.routeviews.org/route-views3/bgpdata/2023.08/UPDATES/updates.20230828.1100.bz2" title="Routeviews data">here</a>.</p> <p>In this first message, we see AS58715 passing an Amazon route (13.32.249.0/24) from its peering session with Amazon (AS16509) to its transit provider AS17494 (RFC7908 leak type 4). Although the leaked Amazon routes didn’t propagate very far, AS16509 was still the largest destination for misdirected packets (see graphic above) simply due to the large volume of traffic AWS handles.</p> <pre><code>TIME: 08/28/23 11:13:01.660163 TYPE: BGP4MP_ET/MESSAGE/Update FROM: 43.226.4.1 AS63927 TO: 128.223.51.108 AS6447 ORIGIN: IGP</code></pre> <pre><code>ASPATH: 63927 <span style="background-color: yellow">17494 58715 16509</span></code></pre> <pre><code>NEXT_HOP: 43.226.4.1 COMMUNITY: 24115:17494 24115:65012 63927:106 63927:2101 63927:5201 LARGE_COMMUNITY: 24115:1000:1 24115:1001:1 24115:1002:1 24115:1003:40 24115:1004:17494 ANNOUNCE</code></pre> <pre><code> <span style="background-color: yellow">13.32.249.0/24</span></code></pre> <p>In this second message, AS58715 passes a Microsoft route (20.46.144.0/20), learned from one transit provider AS9498, to another, AS17494 (RFC7908 leak type 1). 
These routes were circulated widely because they are not typically seen in the global routing table and, therefore, faced no competition with existing routes.</p> <pre><code>TIME: 08/28/23 11:13:06.020567 TYPE: BGP4MP_ET/MESSAGE/Update FROM: 64.71.137.241 AS6939 TO: 128.223.51.108 AS6447 ORIGIN: IGP</code></pre> <pre><code>ASPATH: 6939 1299 174 <span style="background-color: yellow">17494 58715 9498</span> 59605 8075</code></pre> <pre><code>NEXT_HOP: 64.71.137.241 ANNOUNCE</code></pre> <pre><code> <span style="background-color: yellow">20.46.144.0/20</span></code></pre> <p>In fact, if we visualize the propagation of this route using Kentik’s BGP visualization, we can see 85.4% of our BGP sources would have chosen to send traffic to the IP addresses in 20.46.144.0/20 and other similarly leaked routes via AS58715 in Bangladesh.</p> <img src="//images.ctfassets.net/6yom6slo28h2/opvV6DBMIvS5zbjtkeusy/fe5d214bf25dfd6c8b32a2e0b8d5e1f9/bgp-monitor-microsoft.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Impacted prefixes" /> <p>Since the leak did not change the origins of the routes nor introduce more-specific routes, RPKI ROV wasn’t able to help. In fact, of the leaked routes, 17,173 were RPKI-unknown (without a ROA) and 12,588 were RPKI-valid. In either case, networks rejecting RPKI-invalid routes would not have rejected routes from this leak.</p> <p><a href="https://datatracker.ietf.org/doc/draft-ietf-sidrops-aspa-verification/" title="ASPA Verification draft spec">Autonomous System Provider Authorization</a> (ASPA) was designed to address path leaks like this one. ASPA works by allowing ASes to assert their transit relationships in RPKI which enables other ASes to identify improper routes due to <a href="https://blog.ipspace.net/2018/09/valley-free-routing.html" title="Valley-Free Routing at ipSpace.net">valley-free violations</a> and reject them. However, ASPA is still in its early stages and is not yet fully fielded.</p> <h2 id="origination-leak-on-tuesday">Origination leak on Tuesday</h2> <p>And then on Tuesday, August 29, AS266970 accidentally began originating nearly every prefix in the IPv6 global routing table. This lasted for 10 minutes and resulted in the misdirection of a significant amount of internet traffic, as observed in our aggregate NetFlow, pictured below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3gzU42Tbiyleunte68mZgz/bef375b17e9bf9bb01953abf69290e04/second-leak-aug29.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Path leak on August 29" /> <p>Unlike the BGP leak on the previous day, this was an origination leak — the type of leak that RPKI ROV is supposed to help contain, limiting the disruption. <em>So, how much did it help?</em></p> <p>Last year, Job Snijders of Fastly and I <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/" title="Kentik Blog: How Much Does RPKI ROV Reduce the Propagation of Invalid Routes?">explored the question</a> of how much does RPKI ROV reduce the propagation of RPKI-invalid routes. It is an important question because that ultimately is the objective of RPKI ROV — to reduce the propagation of problematic routes, thus reducing the disruption they cause.</p> <p>We can estimate propagation by counting how many Routeviews BGP sources had each leaked route in their routing table during the leak. Then, we can separate these routes by their RPKI evaluation, as shown in the plots below. 
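</p> <p>The estimation itself boils down to a simple count, sketched below in Python with toy records standing in for the Routeviews data. The observation tuples and resulting percentages are hypothetical, not our measured results.</p> <pre><code># Estimating propagation: for each leaked prefix, count how many BGP
# vantage points selected a path with the leaker as origin, then group
# by RPKI status. The observations are hypothetical stand-ins for
# Routeviews table dumps.

from collections import defaultdict

LEAKER = 266970

# (prefix, rpki_status, origin chosen by one BGP source) -- toy data
observations = [
    ("2a02:ee80:4270::/48", "unknown", 266970),
    ("2a02:ee80:4270::/48", "unknown", 3573),
    ("2801:1f0:4017::/48", "valid", 3573),
    ("2801:1f0:4017::/48", "valid", 3573),
    ("2801:1f0:4017::/48", "valid", 266970),
]

seen = defaultdict(int)
leaked = defaultdict(int)
for prefix, status, origin in observations:
    seen[(prefix, status)] += 1
    if origin == LEAKER:
        leaked[(prefix, status)] += 1

for (prefix, status), total in seen.items():
    pct = 100.0 * leaked[(prefix, status)] / total
    print(f"{prefix} (RPKI {status}): {pct:.0f}% of sources chose the leaker")</code></pre> <p>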
(Note: The two graphs below contain the same data, but the one on the left uses a log plot.)</p> <img src="//images.ctfassets.net/6yom6slo28h2/65ZikNty6szrsKv4RgepVq/8e686ebe097588290997358340b9205f/leaks-by-status.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Graphs showing unknown and invalid routes" /> <p>Although numerous factors can influence the propagation of any individual route, it is clear that the RPKI-invalid routes propagated less, primarily due to networks rejecting RPKI-invalid routes.</p> <p>To isolate the impact of RPKI ROV, let’s compare two routes originated by the same ASN, one with a ROA and the other without. 2a02:ee80:4270::/48 lacked a ROA, and the Kentik BGP visualization below illustrates how it fared. During the peak of the leak, 60.7% of our BGP sources saw the leaker (AS266970) as the origin and would have directed its traffic there.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3zJG99INEdzYAFmSJKUHts/1298df04ba9869f06ac532f0cb3a7abf/bgp-monitor-accenture.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="RPKI ROV comparison 1" /> <p>Conversely, 2801:1f0:4017::/48 was also originated by AS3573 and was impacted by the leak. However, at its peak, only 2.4% of our BGP sources saw the leaker (AS266970) as the origin. 2801:1f0:4017::/48 <a href="https://rpki-validator.ripe.net/ui/2801%3A1f0%3A4017%3A%3A%2F48?validate-bgp=true" title="RPKI validator at RIPE">has a ROA</a> that asserts AS3573 as the valid origin, and was hardly impacted by the leak.</p> <img src="//images.ctfassets.net/6yom6slo28h2/73DgPzIGUTJHdETvpvzFMc/27019fa99c2904938c604289029459b8/bgp-monitor-accenture2.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="RPKI ROV comparison 2" /> <h2 id="conclusion">Conclusion</h2> <p>Years ago, large routing leaks like these might have been the cause of widespread internet disruption. Not so much anymore.</p> <p>Humans are still (for the time being) configuring routers and, being human, are prone to the occasional mistake. What has changed is that the global routing system has become better at containing the inevitable goof-ups. <a href="https://www.kentik.com/blog/exploring-the-latest-rpki-rov-adoption-numbers/" title="Kentik Blog: Exploring the Latest RPKI-ROV Adoption Numbers">Route hygiene has improved</a> due to efforts like <a href="https://www.manrs.org/" title="MANRS.org site">MANRS</a> and the hard work of network engineers around the world.</p> <p>That progress is largely at the macro level, but there are many individual networks that have not deployed RPKI, and to them I would say the following:</p> <ol> <li>Creating ROAs will help to protect your inbound traffic by asserting to the rest of the internet which origin is the legitimate one — useful during an origination leak like what happened on Tuesday.</li> <li>Conversely, rejecting RPKI-invalids helps to protect your outbound traffic by rejecting leaked routes that might misdirect that traffic or lead to a disruption.</li> </ol> <p>By reducing the impact of BGP leaks, we can focus on the harder problems left to be solved in routing security. Problems such as the “determined adversary” scenario witnessed in last year’s <a href="https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services/" title="Kentik Blog: BGP Hijacks Targeting Cryptocurrency Services">attacks on cryptocurrency services</a>.
In that realm, there is still much work to be done.</p> <p>“<em>It is a far, far better ROA to create, than to never have done. It is a far, far better RPKI-invalid to reject than to forward on.</em>”</p><![CDATA[Beyond the Hype: The Power of AI and Large Language Models in Networking]]><![CDATA[Artificial intelligence is certainly a hot topic right now, but what does it mean for the networking industry? In this post, Phil Gervasi looks at the role of AI and LLM in networking and separates the hype from the reality.]]>https://www.kentik.com/blog/beyond-the-hype-the-power-of-ai-and-large-language-models-in-networkinghttps://www.kentik.com/blog/beyond-the-hype-the-power-of-ai-and-large-language-models-in-networking<![CDATA[Phil Gervasi]]>Wed, 30 Aug 2023 04:00:00 GMT<p>There’s probably no bigger buzzword right now than artificial intelligence. Normally, I recoil at the sound of the latest buzzword repeated ad nauseam by influencers, tech media, and, in this case, even the talking heads on late-night television. However, I believe this is one technology that, despite popular culture latching onto it, is extremely important for us to understand, especially in the context of networking.</p> <h2 id="what-is-ai">What is AI?</h2> <p>In a <a href="https://www.kentik.com/telemetrynow/s01-e22/">recent episode of Telemetry Now</a>, my friend Ryan Booth joined me to talk about AI, specifically in the context of networking. Ryan explained that AI is a broad term, not a specific technology. That means it’s important to distinguish between artificial intelligence and other technologies like machine learning.</p> <p>AI describes computer systems and workflows that mimic human intelligence and behavior. Machine learning, along with many other technologies lumped into the AI category, is a component of AI and a means of attaining human-like intelligence in computer form. For example, for AI to mimic human intelligence, it needs to be able to learn and adapt, which is why statistical analysis and machine learning models are crucial elements of an overall AI architecture.</p> <p>This way, computers working together in a common AI workflow can ingest, organize, and analyze data, then discover new information such as correlations, patterns, trends, etc., and take action based on that data analysis workflow.</p> <p>And remember that network telemetry is actually a broad category of many different types of data, in different formats, representing very different aspects of network activity. The application of AI to networking is in the context of this growing diversity and complexity, which is inherently difficult for a human being to work with.</p> <h2 id="machine-learning-as-the-foundation">Machine learning as the foundation</h2> <p>Machine learning has existed for decades, so applying algorithms and ML models to perform a specific task is well understood. Basic ML has few layers and uses relatively simple algorithms and models to achieve a particular mission. At its core, an ML workflow focuses on a given task.</p> <p>When we add layers to a basic ML workflow, we get into the realm of deep learning and neural networks, which, at their core, use forward and backward propagation to train a model and feed the results of one ML workflow into another.
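</p> <p>As a toy illustration of one model’s output feeding another, consider the two-stage sketch below: stage one scores live samples against a learned baseline, and stage two consumes those scores to make a decision. The data is synthetic and the thresholds are arbitrary; this illustrates the stacking idea, not a production design.</p> <pre><code># A toy two-stage pipeline: stage one scores each sample against a
# learned baseline (a z-score), and stage two consumes those scores
# as features for a simple decision. Synthetic data, arbitrary thresholds.

import statistics

baseline = [100, 104, 98, 101, 99, 103, 97, 102]   # training window (Mbps)
mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

def stage_one(sample):
    """Anomaly score: how many standard deviations from the baseline."""
    return abs(sample - mean) / stdev

def stage_two(scores):
    """Downstream decision fed by stage one's outputs."""
    if max(scores) > 6.0 and statistics.mean(scores) > 2.0:
        return "raise alert: sustained anomaly"
    return "normal"

live_samples = [101, 99, 240, 260, 251]   # synthetic traffic readings
print(stage_two([stage_one(s) for s in live_samples]))</code></pre> <p>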
When working in multiple layers like this, computers can learn from data, make swift decisions with a high degree of accuracy, and ultimately go beyond the single task of a basic ML algorithm.</p> <p>In networking, we use ML models to determine trends, find patterns in traffic, identify anomalies that could indicate a security breach, and so on. We’re clearly already headed in the right direction, but what does the latest buzz about large language models have to do with anything?</p> <h2 id="large-language-models">Large language models</h2> <p>Large language models, or LLMs, are one manifestation of ML models stacked together into an AI workflow. LLMs analyze vast amounts of data, usually in the form of text such as the internet itself, to discover the relationships between words, phrases, and language patterns.</p> <p>LLMs have elements of pattern recognition, correlation, and prediction to anticipate what the next word in a sentence will be and what the meaning of a sentence is beyond the denotation of each individual word.</p> <p>Using recurrent neural network models, long short-term memory network models, and today the transformer (an entire deep learning architecture), we can process massive data sets sequentially, for time series prediction, and for machine translation. The transformer is the underlying mechanism for the natural language processing (NLP) function in large language models like BERT and ChatGPT.</p> <h2 id="ai-and-llm-in-networking">AI and LLM in networking</h2> <p>LLMs already give us a significant advantage in tasks like document summarization, pattern recognition, document generation, creating computer programming code, and so on. The real-world applications of AI and LLM are varied, especially in networking.</p> <p>For example, LLMs are helpful for generating computer programming code, analyzing documentation, creating new documentation based on a set of data, and information classification, such as named-entity recognition.</p> <p>In networking, this relates directly to operational tasks such as analyzing divergent and voluminous network telemetry, generating code, generating documentation based on network data, etc. This makes an engineer more efficient, reduces error, expedites time to resolution, and makes it possible to find insight in vast quantities of data far more easily than would be possible manually.</p> <p>Additionally, natural language processing, or NLP, allows us to interact with networks in a way we’ve never been able to before. We can now use actual human language to query a database of network telemetry, push configuration changes, or glean insight from a programmatic analytics workflow.</p> <p>Of course, this presupposes that there is an underlying AI workflow to ingest, organize, classify, and analyze network data. As we develop more sophisticated layers of ML models to do this work for us, we’re solving an operational problem to derive insight we otherwise wouldn’t be able to find quickly and easily.</p> <p>The precise future of AI in networking is still somewhat uncertain, but real-world applications are now starting to emerge. As data scientists and engineers discover how to use these tools to solve operational problems, we will see the landscape of networking and our understanding of complex networking systems change for the better.</p>
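<p>To ground the idea of ML-driven anomaly detection in traffic data, here is a deliberately simple sketch using a rolling z-score over hypothetical bits-per-second samples. Production workflows use far more sophisticated models than this, so treat it purely as an illustration of the underlying statistical idea.</p> <pre><code class="language-python">from statistics import mean, stdev

def find_anomalies(series, window=12, threshold=3.0):
    """Flag points that deviate sharply from the recent rolling baseline."""
    anomalies = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            anomalies.append((i, series[i]))
    return anomalies

# Hypothetical traffic samples with a sudden spike, as a breach might produce.
traffic = [100, 104, 98, 101, 99, 103, 97, 102, 100, 105, 98, 101, 520, 99, 102]
print(find_anomalies(traffic))  # [(12, 520)]
</code></pre>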
<h2 id="2024-update-ai-network-monitoring-has-arrived">2024 Update: AI Network Monitoring has Arrived</h2> <p>Just a few months after we originally mused about the role that AI might play in future network monitoring solutions, the future has arrived… Kentik’s <a href="https://www.kentik.com/product/network-monitoring-system/" title="Learn more about Kentik NMS, the next-generation network monitoring system">next-generation network monitoring system, Kentik NMS</a>, includes advanced artificial intelligence-based features for network monitoring.</p> <div as="Promo"></div> <p><a href="https://www.kentik.com/solutions/kentik-ai/" title="Learn more about Kentik AI">Kentik AI</a> allows NetOps professionals and non-experts alike to answer any question about the status or performance of their networks using natural language queries.</p> <p>These new features allow network pros to understand on-premises, hybrid, and multicloud networking environments from a single query engine. And because Kentik combines network data from all sorts of protocols—including flow data, SNMP, streaming telemetry, containers, and cloud flow logs—Kentik AI enables unprecedented visibility into modern networks. Check out a short preview below:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/7ndhnmcvzn?seo=true&amp;videoFoam=false" title="Kentik AI-powered Queries with Kentik Journeys" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <div class="caption">Network Monitoring meets AI: The Journeys feature of Kentik NMS lets you answer any question about even the most complex networks using natural language queries.</div><![CDATA[How to Use Kentik to Perform a Forensic Analysis After a Security Breach]]><![CDATA[In this post, Phil Gervasi uses the power of Kentik’s data-driven network observability platform to visualize network traffic moving globally among public cloud providers and then perform a forensic analysis after a major security incident.]]>https://www.kentik.com/blog/how-to-use-kentik-to-perform-a-forensic-analysis-after-a-security-breachhttps://www.kentik.com/blog/how-to-use-kentik-to-perform-a-forensic-analysis-after-a-security-breach<![CDATA[Phil Gervasi]]>Tue, 22 Aug 2023 04:00:00 GMT<p>Kentik provides a macro view of how your cloud traffic moves among and within regions and also granular visibility of very specific traffic moving among regions, countries, data centers, and down to specific applications and endpoints.</p> <p>Having a single view of all these cloud components, including their connections to on-premises resources, means engineers have the ability to identify compliance issues, spot inefficient movement of traffic, troubleshoot application performance problems, and perform <a href="https://www.kentik.com/kentipedia/network-forensics/" title="Kentipedia: Network Forensics and the Role of Flow Data in Network Security">forensic analyses</a> after a security incident.</p>
<h2 id="the-scenario">The scenario</h2> <p>In this post, we’ll explore a scenario in which we had a serious security incident. As part of our analysis, we need to learn how the attacker gained entry, what specific devices were compromised, and the extent of the data exfiltration.</p> <p>To set the stage, our security tool alerted us to a suspected attack. Numerous connection attempts on port 22 were made to our public-facing host in our Azure east US region. A spike in connection attempts and traffic would show up as anomalous behavior with most popular SIEMs. However, we need to confirm that this is indeed a security incident and not a false positive. Along with understanding the nature of the data exfiltration, we also need to know whether data crossed international boundaries, a serious compliance violation in many highly regulated industries.</p> <h2 id="the-kentik-map">The Kentik Map</h2> <p>The Kentik Map is a great way to see an overview of on-prem data centers, branch offices, internet connections, and all your public cloud resources. Seeing this data overlaid on a geographic map allows you to get a quick glance at how traffic is moving among all your locations, including across international boundaries.</p> <img src="//images.ctfassets.net/6yom6slo28h2/iprokbqImxR1Drki4ynO0/2150c70b9808096bc923536adf36ca64/security-blog-kentik-map.png" withFrame style="max-width: 800px;" class="image center" alt="Kentik Map" /> <p>First, we need to find out the source of the attack and where the attacker is in the world geographically. A robust security tool would give you the attacker’s IP as well, but in our workflow, we need to go beyond IPs to understand the nature of the flows and where our attacker is located geographically.</p> <p>We can start by drilling down into our east US site and see if there’s anything there that can help start us off.</p> <img src="//images.ctfassets.net/6yom6slo28h2/53GLJkS5Xvxz1zf8E2ofMT/ccf46d959ef55be1fc7fe91fb59d7af6/entity-azure-cloud.png" style="max-width: 300px;" class="image right" alt="Select geography" /> <p>Notice in the image on the right, when we select our geographic region, we can see what resources we have there, including Azure east US. Having quick access to all the sites in a specific region, whether on-prem or cloud, means engineers are seeing all their networks in one place rather than with multiple screens or tools.</p> <p>When we continue to drill down into our Azure east US region, we can see the Cloud Region Details, which gives us a quick glance at historical traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/LLpcRSaBjw8FW7zyfKTu7/7ce363579b8f9951c32b3d16ed19dabe/details-sidebar-500w.png" style="max-width: 400px;" class="image center" alt="Cloud region details" /> <p>In the image above, notice we get some quick details about our region, such as the Tenant ID, Name, CIDRs, and so on.
We can see a traffic summary that can be filtered depending on application, total or average, and using bits/s, flows, packets, etc.</p> <p>And lastly, we can also view the total flows according to the Azure network security group.</p> <p>Notice below that we can see an animation showing the connections this subnet is making and the connections the entire region is making.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/K0DJulgXFhXwe3olKVvsk/4cd3622848b3abe48e32ada49eca8c3a/map-animation2.gif" withFrame style="max-width: 800px;" class="image center" alt="Topology animation" /> <p>Now that we’ve confirmed the indicator of compromise in the security alert, we can begin to analyze further. To filter and explore the underlying data, we can use the <a href="https://kb.kentik.com/v3/Db03.htm">Kentik Data Explorer</a>.</p> <h2 id="understanding-the-attack-vector">Understanding the attack vector</h2> <p>We can run a query to understand better what was hitting the affected Azure host. We know the IPs from the security alert, but by using Data Explorer, we can see all the traffic over time and get a better understanding of what led up to the breach.</p> <p>From the data sources, we select Azure (since our host is in Azure and we can use those logs), but notice we can select multiple data sources or even all of them if we’d like.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Z97c5qaybiqv7B1NsSap1/f8c9ccf2a0fbafce31de990544974e04/data-source-selection.png" withFrame style="max-width: 500px;" class="image center" alt="Select data sources" /> <p>From the dimensions menu, we’ll select source IP, source port, destination IP, destination region in Azure, and Firewall Action. We should also capture the source country from our geolocation option because we want to know if data is leaving our Azure east US region across international boundaries, and we also want to know the application.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5Z6QDA4eBsYVYAR5WLwFW/c161d16cd91631ac18edf26e93756349/group-by-dimensions.png" withFrame style="max-width: 800px;" class="image center" alt="Select dimensions" /> <p>Also, we can modify the custom filter to narrow the scope to just our Azure host. Lastly, we can change the time range to the last several weeks. This will give us a good picture of what outside IP addresses are trying to make connections to our host.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6RiR6IQNrDodKizaJdCOhb/d110082a2f3ce19cf6f7661dba6957b1/filtering-options.png" withFrame style="max-width: 800px;" class="image center" alt="Filtering options" /> <p>The results of this first query, which you can see below, show a lot of connection attempts to our host in public-facing Azure over ssh. Since this is a public host and ssh is allowed, we’re seeing all Allow statements from the firewall, which is expected. However, notice all the connection attempts are from one IP address with nominal or zero data transfer.</p> <p>This is indicative of TCP resets after a failed logon attempt. 
And now that we’ve confirmed the attacker’s IP address and source country, and can see a flow at the top of the list with significant data transfer, we have a point in time that we can use to isolate the logs locally on the host itself.</p> <img src="//images.ctfassets.net/6yom6slo28h2/ZmOhIbodpqn9UAjVw8URZ/b8be7b54dd72f991d60987ba6ce30cc7/connection-attempts-zero-transfer.png" style="max-width: 800px;" class="image center" withFrame alt="Connection attempts" /> <p><em>Thus far, our host in Azure was targeted on port 22 repeatedly from different source ports, which our SIEM identified as an indicator of compromise, with the attacker presumably trying different login credentials until finally gaining entry.</em></p> <h2 id="looking-at-lateral-traffic-internally">Looking at lateral traffic internally</h2> <p>Next, we can drill down on our compromised host and pivot to internal visibility. This will allow us to see if the compromised host talked to anything internally that may have also been compromised. Looking at lateral movement is one of Kentik’s strengths.</p> <p>Our dimensions will include the source and destination as well as the application, and we’ll change the filter so our source is the Azure host with IP address 10.170.50.5. We can leave the destination blank, but we should exclude our attacker’s IP so we see only what else our host talked to laterally within our own network.</p> <p>The results of this query, as seen below, show numerous connections to one inside host at 10.170.100.10, all with minimal data transfer and on various ports. This is certainly indicative of a scan of some type, likely in an attempt to find a means of accessing this inside host. And notice that our Azure host at 10.170.50.5 eventually initiates significant MySQL data transfer with the inside host at 10.170.100.10.</p> <img src="//images.ctfassets.net/6yom6slo28h2/61NV7ziHjyiQBQhrYyfaCt/9a0862e97a45169d5054dc94f1bacaad/connections-table2.png" withFrame style="max-width: 800px;" class="image center" alt="Data transfer" /> <p>Now that we have evidence of another compromised host and a likely data exfiltration, we can change the filter to focus on just these IPs and MySQL and ssh in a time series to compare.</p> <p>Our first filter rule isolates the lateral traffic within Azure, and our second filter isolates the traffic between our Azure host and the attacker, which should be a similar amount of ssh traffic, though it doesn’t necessarily have to be. For this visualization, we’ll enable bi-directional mode to more easily see the correlation.</p> <p>In the results below, we see clearly that the spike for both MySQL inside and ssh outbound happened simultaneously. Also, the two values are very close in the actual amount of traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2bKsExC2cH7DgKfvOq0Je3/897eaaf39fbd1f30559273dca684344b/security-blog-spike.png" withFrame style="max-width: 800px;" class="image center" alt="Simultaneous spikes" />
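<p>As a rough illustration of the pattern we just isolated, the sketch below applies the same logic to hypothetical flow records in plain Python: many small ssh flows from a single source across many ephemeral ports, followed by one large transfer. The record format and the thresholds are invented for the example.</p> <pre><code class="language-python">from collections import defaultdict

# Hypothetical flow records: (src_ip, src_port, dst_ip, dst_port, bytes)
flows = [
    ("203.0.113.77", 51000 + i, "10.170.50.5", 22, 60) for i in range(200)
] + [("203.0.113.77", 51999, "10.170.50.5", 22, 48_000_000)]  # the successful session

attempts = defaultdict(lambda: {"count": 0, "ports": set(), "bytes": 0})
for src, sport, dst, dport, nbytes in flows:
    if dport == 22:
        rec = attempts[(src, dst)]
        rec["count"] += 1
        rec["ports"].add(sport)
        rec["bytes"] += nbytes

for (src, dst), rec in attempts.items():
    # Many SSH flows from one source over many ephemeral ports, almost all tiny,
    # followed by one large transfer, is the brute-force signature described above.
    if rec["count"] > 100 and len(rec["ports"]) > 100:
        print(f"possible brute force: {src} -> {dst}, "
              f"{rec['count']} flows, {rec['bytes']} bytes")
</code></pre>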
<h2 id="the-forensic-analysis-workflow">The forensic analysis workflow</h2> <ol> <li>We started by confirming our alert from our security tool was valid and not a false positive.</li> <li>We verified that an attacker scanned our outside host and identified port 22 as open.</li> <li>Based on the data, we assume that the attacker then initiated a brute-force dictionary attack, as evidenced by the many ssh connections with very minimal traffic, followed by a successful attempt with much more traffic.</li> <li>Then, with this open connection, our attacker was able to identify an internal host running MySQL. We saw the data transfer internally, and then we saw the data transfer back to the attacker.</li> </ol> <p>At this point, we can hand this off to our security team to look into the system logs themselves and probably revisit the ssh credentials on our outside host and the MySQL credentials on the inside host. Depending on what they find, they may also implement two-factor authentication and a brute-force prevention mechanism like fail2ban.</p> <p>Ultimately, visibility is a cornerstone of effective cybersecurity. Kentik’s data-driven approach to network observability provides engineers with the tools they need to make informed decisions to remediate problems and improve their security posture.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/dmlquhotwl" title="How to Perform a Forensic Analysis After a Security Breach" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>Watch the demo video to see this entire forensic analysis step-by-step.</p><![CDATA[Dual Subsea Cable Cuts Disrupt African Internet]]><![CDATA[On Sunday, August 6, an undersea landslide in one of the world’s longest submarine canyons knocked out two of the most important submarine cables serving the African internet. The loss of these cables cut off international internet bandwidth along the west coast of Africa. In this blog post, we review some history of the impact of undersea landslides on submarine cables and use some of Kentik’s unique data sets to explore the impacts of these cable breaks.]]>https://www.kentik.com/blog/dual-subsea-cable-cuts-disrupt-african-internethttps://www.kentik.com/blog/dual-subsea-cable-cuts-disrupt-african-internet<![CDATA[Doug Madory]]>Thu, 17 Aug 2023 04:00:00 GMT<p>On Sunday, August 6, an undersea landslide in one of the world’s longest submarine canyons <a href="https://mybroadband.co.za/news/fibre/503568-bad-news-about-break-in-undersea-cables.html">knocked out two</a> of the most important submarine cables serving the African internet.
The landslide took place in the <a href="https://en.wikipedia.org/wiki/Congo_Canyon">Congo Canyon</a>, located at the mouth of the Congo River, separating Angola from the Democratic Republic of the Congo.</p> <div as="WistiaVideo" videoId="mswbehusup" audio></div> <p>The <a href="https://www.submarinecablemap.com/submarine-cable/sat-3wasc">SAT-3 cable</a> was the first to suffer an outage, followed hours later by the failure of the <a href="https://www.submarinecablemap.com/submarine-cable/west-africa-cable-system-wacs">WACS cable</a>. The loss of these cables knocked out international internet bandwidth along the west coast of Africa. In this blog post, I review some history of the impact of undersea landslides on submarine cables and use some of Kentik’s unique data sets to explore the impacts of these cable breaks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6xPKkRWjrEOwKHP9r3yxmE/d4f14a3d8afbb56afb1bf886e330ed37/africa-congo-canyon-zoomout.jpg" style="max-width: 650px" class="image center" alt="Congo Canyon" /> <img src="//images.ctfassets.net/6yom6slo28h2/2MsWBHhweaZJ6OxRIdhHpQ/129168354e4dc4afdebf6bb087049357/africa-congo-canyon.jpg" style="max-width: 650px" class="image center" alt="Zoomed in on Congo Canyon" /> <div class="caption" style="margin-top: -35px;">Detail of the Congo Canyon</div> <h2 id="seismic-threats-to-submarine-cables">Seismic threats to submarine cables</h2> <p>By far, the greatest threat to the submarine cables connecting the global internet is human maritime activity. This usually involves seafaring vessels breaking submarine cables either by snagging them during fishing operations (especially <a href="https://en.wikipedia.org/wiki/Trawling">trawling</a>) or by inadvertently dragging their anchors along the seafloor, the cause of the <a href="https://en.wikipedia.org/wiki/2008_submarine_cable_disruption">2008 submarine cable cuts</a> in the Mediterranean Sea.</p> <p>But after marine activity, the next biggest category of threat consists of natural causes — and let me emphasize, I’m <em>not</em> talking about <a href="https://arstechnica.com/tech-policy/2015/07/its-official-sharks-no-longer-a-threat-to-subsea-internet-cables/amp/">sharks biting cables</a>. On numerous occasions, undersea landslides and earthquakes have damaged cables laying on the seafloor.</p> <p>In December 2006, a <a href="https://www.nytimes.com/2006/12/29/business/worldbusiness/29connect.html">large earthquake</a> off the southern coast of Taiwan crippled internet communications in East Asia by rupturing numerous submarine cables. 
According to a <a href="https://www.iscpc.org/documents/?id=9">press release</a> from the International Cable Protection Committee (ICPC), as a result of the <a href="https://en.wikipedia.org/wiki/2006_Hengchun_earthquakes">Hengchun earthquakes</a>, “21 faults were recorded in … 9 cables, and it took 11 ships 49 days to restore everything back to normal.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/6NsUzTCfNtgJhlAMO3nmW8/cc682b7d1ae1fbe051b42fbda774226a/icpc-hengchun-earthquake.jpg" style="max-width: 800px" class="image center" alt="Hengchun earthquakes map" /> <div class="caption" style="margin-top: -35px;">Graphic from the ICPC press release on the Hengchun earthquakes </div> <p>During a panel at the Suboptic 2013 submarine cable conference in Paris, the audience listened in rapt attention as a speaker from Japan described the efforts to repair submarine cables damaged by the <a href="https://en.wikipedia.org/wiki/2011_T%C5%8Dhoku_earthquake_and_tsunami">Great East Japan Earthquake</a> two years earlier. According to him, one of the cables had been dragged <em>over a kilometer</em> by an undersea landslide triggered by the powerful earthquake. Additionally, the sequence in which the cables were to be restored was driven, not by usual contractual priorities, but by where the cable ship could safely navigate while avoiding the radiation cloud emanating from the crippled <a href="https://en.wikipedia.org/wiki/Fukushima_nuclear_disaster">Fukushima Daiichi Nuclear Power Plant</a>.</p> <p>In another case, a decade ago, I <a href="https://web.archive.org/web/20150910063825/http://research.dyn.com/2013/02/faraway-fallout-from-black-sea/">wrote about</a> an undersea earthquake in the Black Sea that unexpectedly impacted Iranian transit. It was initially reported that <em><a href="https://en.trend.az/regions/scaucasus/georgia/2105346.html">an underwater volcano</a></em> had severed a submarine cable in the Black Sea running between Poti, Georgia, and Sochi, Russia. This caught the attention of WIRED magazine’s resident geologist, who dutifully <a href="http://www.wired.com/2013/01/submarine-eruption-in-the-black-sea-off-georgia-not-likely/">pointed out</a> that there were no volcanoes in the Black Sea. As was later <a href="https://ictna.ir/id/052590/">reported in the Iranian press</a>, it was an earthquake that led to the failure of the cable — likely by triggering another underwater landslide.</p> <p>Finally, this isn’t the first time that these cables (SAT-3 and WACS) have been downed by undersea landslides in the Congo Canyon.
A study, <a href="https://www.datacenterdynamics.com/en/news/study-finds-mudslides-caused-by-river-flooding-can-damage-subsea-cable-damage/">published in June 2021</a>, described how these cables had suffered breaks in 2020 due to, in one instance, “an exceptionally large and powerful submarine mudslide that originated at the mouth of the Congo River.”</p> <p>Make no mistake; <em>the seafloor can be a dangerous place for cables.</em></p> <h2 id="internet-impacts-due-to-the-cable-cuts">Internet impacts due to the cable cuts</h2> <p>Shifting back to the cable cuts in Africa, let’s begin by looking at the impact on cloud connectivity.</p> <h3 id="cloud-performance">Cloud performance</h3> <p>Last year, a terrestrial cable cut in Egypt temporarily knocked out service for multiple submarine cables, a <a href="https://www.kentik.com/blog/outage-in-egypt-impacted-aws-gcp-and-azure-interregional-connectivity/">situation I analyzed</a> by illustrating the impact on <em>internal</em> connectivity for the three major public cloud providers: AWS, Azure, and Google Cloud. Well, this cable incident was no different, proving once again that even the big hyperscalers must rely on the same submarine cable infrastructure as everybody else.</p> <div as="Promo"></div> <p>Below is a screenshot of our performance monitoring from <code class="language-text">af-south-1</code>, AWS’s region in Cape Town, South Africa, to <code class="language-text">eu-west-2</code> in London, England. It shows an increase in latency, from 150 to 195 ms, as AWS’s traffic is diverted to a backup route, presumably with a longer geographic distance, to reach London in this example.</p> <img src="//images.ctfassets.net/6yom6slo28h2/y5ibM0Ikq1i26QdMS3IVk/d3e5b6eec07be08d482c703e1039cf9a/capetown-london.png" withFrame style="max-width: 800px" class="image center" alt="AWS from Cape Town to London" /> <p>Conversely, we can also see a <em>drop</em> in latency from af-south-1 to some points in Asia. The screenshot below shows a latency decrease from 386 to 304 ms from Cape Town to Seoul at 17:30 UTC, when the WACS cable was cut.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3S8SpRmv2yDutBpZHquKNs/c00810e821d98f182bbc001ed6cc1e31/capetown-seoul.png" withFrame style="max-width: 800px" class="image center" alt="AWS from Cape Town to Seoul" /> <p>While this may seem counterintuitive, this is a phenomenon I have often encountered when analyzing submarine cable cuts — see slide 10 in <a href="https://www.linkedin.com/in/dougmadory/details/featured/50016397/single-media-viewer/?profileId=ACoAAABgDfMBQqg6K3WmEobvLYAvoetjdTSF-R0">my presentation</a> at Suboptic 2013. Essentially, a higher-latency route becomes unavailable, and traffic is forced onto a more direct path.</p> <p>Why would traffic be using the higher-latency path in the first place? What you can’t see in these visualizations are factors like cost. The business case for maintaining the lowest possible latency between Cape Town and Seoul may not justify a higher-cost but lower-latency route.</p> <h3 id="traffic-volume-as-measured-by-aggregate-netflow">Traffic volume as measured by aggregate NetFlow</h3> <p>We observed a drop in traffic volume as seen in our aggregate NetFlow to the affected countries following the cable cuts. Let’s take a closer look at the impacts in the largest affected market, South Africa.
According to our data, the largest traffic destination in South Africa is Telkom SA (AS37457), which experienced a 20% drop in peak traffic following the cuts.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6VdV7NJXXDtSa9eWkd1moW/67d53b327eb8e9f04ccc080882ab5aeb/traffic-to-telkom-sa.png" withFrame style="max-width: 800px" class="image center" alt="Internet traffic to Telkom SA" /> <p>Namibia was another country in southwest Africa impacted by these cable cuts. When WACS was cut at 17:30 UTC, Telecom Namibia (AS20459) lost transit from Cogent (AS174) and BICS (AS6774). Two hours later, the Namibian incumbent partially restored service using transit from Angola to the south.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1yTyVIYnsPuRSeAeMutKbY/4dd855349c68a7699a27b01e4902ccf3/traffic-to-telecom-namibia.png" withFrame style="max-width: 800px" class="image center" alt="Failure of WACS submarine cable" /> <p><a href="https://www.kentik.com/product/market-intelligence/">Kentik Market Intelligence (KMI)</a> also reported on these changes. Recall that KMI, based on BGP, enables users to navigate the dynamics hidden in the global routing table by identifying the ASes operating in any given country, determining their providers, customers, and peers, and reporting when there are changes to those relationships.</p> <p>Telecom Namibia, for example, lost its two main transit providers, Cogent (AS174) and BICS (AS6774), as a result of the WACS cable cut. To restore service, it had to activate emergency service from Angola Cables (AS37468). Those developments were reported in KMI Insights and are shown below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4ti7PBgsPi6ZfZl6emUXtc/22a68ae9aef1abf86f67a7481b648271/namibia.png" style="max-width: 600px" class="image center" alt="Provider loss in Namibia" /> <h3 id="bgp-visualization">BGP visualization</h3> <p>Since KMI is strictly based on interpreting BGP data, let’s see what an individual route change looked like. Let’s take this Telecom Namibia route as an example: 41.205.128.0/19.</p> <p>When we take the upstream view from AS20459, we can see that, like the NetFlow-based graphic above, Telecom Namibia lost AS174 and AS6774 at around 17:30 UTC and gained transit from AS37468 at 22:00 UTC on August 6. In the interim, the reachability of the route became very low, along with the traffic reaching its destination.</p> <img src="//images.ctfassets.net/6yom6slo28h2/51cVl1KT6qdmk4A6U0VlYU/72b9ac86cf713db82472cdd6300d86c7/telecom-namibia.png" withFrame style="max-width: 800px" class="image center" alt="Namibia reachability in the Kentik platform" />
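<p>For the curious, deriving that upstream view from raw BGP data is conceptually simple. The sketch below shows the idea using hypothetical AS paths, one per BGP vantage point; the paths and the mechanics are heavily simplified for illustration.</p> <pre><code class="language-python">from collections import Counter

# Hypothetical AS paths for 41.205.128.0/19, one per BGP vantage point,
# before and after the WACS cut. The origin is the last ASN in each path;
# its upstream is the ASN immediately before it.
paths_before = [
    [3356, 174, 20459],
    [1299, 6774, 20459],
    [2914, 174, 20459],
]
paths_after = [
    [3356, 6453, 37468, 20459],
    [1299, 37468, 20459],
]

def upstream_share(paths, origin=20459):
    """Share of vantage points seeing each AS as the origin's direct upstream."""
    upstreams = Counter(
        path[-2] for path in paths if len(path) > 1 and path[-1] == origin
    )
    total = sum(upstreams.values())
    return {asn: count / total for asn, count in upstreams.items()}

print(upstream_share(paths_before))  # AS174 and AS6774 dominate
print(upstream_share(paths_after))   # AS37468 (Angola Cables) takes over
</code></pre>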
<h2 id="conclusion">Conclusion</h2> <p>At the time of the cuts, the cable repair ship operating in the region (<a href="https://www.marinetraffic.com/en/ais/home/shipid:761048/zoom:10">CS Leon Thevenin</a>) was <a href="https://twitter.com/benliquidkenya/status/1689424154749976576">busy with submarine cable work in West Africa</a> but has since shifted its mission and set sail for Cape Town, South Africa. Once on location, the repairs may take additional weeks to complete, leaving a significant portion of the African internet without critical internet bandwidth well into <a href="https://twitter.com/philBE2/status/1689628954536382464">September</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4iUDSvPOqpEU2valF9jWAF/0de44d0ad1980dae65d6d3df9aa6b348/marine-traffic-com.jpg" style="max-width: 800px" class="image center" alt="MarineTraffic.com view" /> <div class="caption" style="margin-top: -35px;">Image courtesy of <a href="https://www.marinetraffic.com/">MarineTraffic.com</a></div> <p>To make up for the loss of capacity, traffic has been shifted to other submarine cables, such as Google’s new <a href="https://www.submarinecablemap.com/submarine-cable/equiano">Equiano cable</a>, which was activated earlier this year. Like WACS and SAT-3, Equiano also runs along the west coast of Africa, but was not impacted by the undersea landslide earlier this month. This fact was highlighted by Equiano client Liquid Dataport (formerly Liquid Telecom) in a <a href="https://liquid.tech/about-us/news/liquid-dataport-subsea-cables/">press release</a> last week. Liquid has managed to use its service on Equiano to fill the gaps left by the loss of WACS and SAT-3.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3QN9pLep8a998tQo6VbtEy/fcece9214f95ffd6c6f07bc187e77f84/liquid-dataport-map.png" style="max-width: 600px" class="image center" alt="Cable Map" /> <p>For more information about the fascinating world of submarine cables, check out our <a href="https://www.kentik.com/telemetrynow/s01-e10/">Telemetry Now podcast episode</a> with Alan Mauldin of Telegeography.</p><![CDATA[Iraq Blocks Telegram, Leaks Blackhole BGP Routes]]><![CDATA[This past weekend, the government of Iraq blocked the popular messaging app Telegram, citing the need to protect Iraqis’ personal data. However, when an Iraqi government network leaked out a BGP hijack used for the block, it became yet another BGP incident that was both intentional and accidental. Thankfully, disruption was minimized by Telegram’s use of RPKI.]]>https://www.kentik.com/blog/iraq-blocks-telegram-leaks-blackhole-bgp-routeshttps://www.kentik.com/blog/iraq-blocks-telegram-leaks-blackhole-bgp-routes<![CDATA[Doug Madory]]>Thu, 10 Aug 2023 04:00:00 GMT<p>This past weekend, the government of Iraq took the step of blocking the popular messaging app Telegram, citing the need to <a href="https://www.reuters.com/technology/iraq-blocks-telegram-app-cites-personal-data-violations-2023-08-06/">protect the personal data of Iraqi users</a> following a leak of confidential information. According to data from our friends over at Tor’s <a href="https://explorer.ooni.org/">Open Observatory for Network Interference (OONI)</a>, the block was implemented by <a href="https://explorer.ooni.org/chart/mat?test_name=telegram&#x26;axis_x=measurement_start_day&#x26;since=2023-07-08&#x26;until=2023-08-08&#x26;time_grain=day&#x26;probe_cc=IQ">blocking Telegram’s IP addresses</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6ZEi3RyL5qch9cAoVw3I86/27bd164bfb1581cebcc45be93851cc64/iraq-telegram-test.png" style="max-width: 600px;" class="image center" alt="Iraq Telegram test graph from OONI" /> <p>Evidently, when the Iraqi government began blocking Telegram, it started by <a href="https://twitter.com/DougMadory/status/1688660150087823360">using BGP to hijack traffic</a> destined for IP addresses associated with the messaging service, redirecting them to the proverbial bitbucket.
And, as has happened before on numerous occasions, these BGP hijack routes leaked out of the country.</p> <p>However, despite this technical error, no Telegram disruption was reported outside of Iraq, in part due to the fact that Telegram had created <a href="https://en.wikipedia.org/wiki/Resource_Public_Key_Infrastructure#Route_Origin_Authorizations">Route Origin Authorizations</a> (ROAs) for its routes, allowing ASes outside of Iraq to automatically reject the hijacks. A ROA is a record in <a href="https://en.wikipedia.org/wiki/Resource_Public_Key_Infrastructure">RPKI</a> that specifies the AS origin that is authorized to originate the IP address range.</p> <h2 id="intentional-but-also-accidental">Intentional, but also accidental</h2> <p>Perhaps the most famous BGP hijack ever was Pakistan’s <a href="https://www.wired.com/2008/02/pakistans-accid/">hijack of YouTube in 2008</a> (also see <a href="https://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidents/">The Internet’s Biggest BGP Incidents</a>). In that case, the Pakistani government ordered a block of YouTube in the country. The Pakistani state telecom, PTCL, created BGP routes to hijack traffic destined for YouTube and blackhole it. However, the hijacks leaked out of Pakistan, leading to a global disruption of YouTube. Over the years, there have been many such leaks of BGP hijacks meant to censor content, such as those in <a href="https://web.archive.org/web/20210512142832mp_/https://blogs.oracle.com/internetintelligence/ukraine-bans-russian-social-media">Ukraine</a> and <a href="https://web.archive.org/web/20170327193134/http://dyn.com/blog/iran-leaks-censorship-via-bgp-hijacks/">Iran</a>.</p> <p>More recently, during the initial weeks of the <a href="https://www.kentik.com/blog/myanmar-goes-offline-during-military-coup/">Myanmar military coup in 2021</a>, the military junta in charge ordered social media to be blocked. To comply with the order, one Myanmar ISP elected to use BGP to hijack local Twitter traffic and drop it. Unfortunately, <a href="https://www.manrs.org/2021/02/did-someone-try-to-hijack-twitter-yes/">their hijack route was inadvertently leaked out</a> of Myanmar, causing disruptions to Twitter around South Asia.</p> <p>And finally, last year, during Russia’s crackdown on social media and independent journalism following their invasion of Ukraine, a <a href="https://www.itnews.com.au/news/russian-network-hijacked-twitter-traffic-578000">Russian ISP elected to use BGP to blackhole traffic</a> to Twitter by hijacking the <em>exact same prefix</em> (104.244.42.0/24) that was hijacked a year earlier in Myanmar.</p> <p>However, there was a difference between the two hijacks of Twitter’s 104.244.42.0/24 by Myanmar in 2021 and then again by Russia in 2022. In the intervening year, Twitter deployed RPKI, creating ROAs in RPKI for nearly all of its routes. By doing so, it enabled ASes that reject RPKI-invalid routes to automatically reject the hijack routes, limiting the disruption to Twitter.</p> <p>As Twitter’s CISO <a href="https://twitter.com/LeaKissner/status/1508503164374315011">wrote at the time</a>, “a bunch of the point of having security is to keep your systems from breaking all of the time” and in this case, it kept a BGP hijack from breaking access to Twitter for users outside of Russia.
The Russian hijack propagated <a href="https://twitter.com/DougMadory/status/1508466367112093709"><em>slightly less</em></a> than the hijack from Myanmar, but it is hard to directly compare since the hijacks were announced from different places on the internet, and there are numerous factors that can influence BGP route propagation.</p> <p>Anecdotally, after my <a href="https://www.youtube.com/watch?v=hKwjq94Quhc">NANOG 86 presentation</a> on the internet impacts due to the war in Ukraine, one of Twitter’s network engineers shared that, from a traffic standpoint, they observed significantly less disruption due to the Russian hijack. They believed the difference was due to RPKI.</p> <p>The recent episode in Iraq is similar to the aforementioned cases because it was a BGP incident that was intentional, but also accidental. The network accidentally leaked out a BGP hijack that it intentionally created.</p> <h2 id="the-hijacks-from-iraq">The hijacks from Iraq</h2> <p>Except for the networks in the semi-autonomous region of Kurdistan, all Iraqi internet service must go through <a href="https://bgp.tools/as/208293">AS208293</a>, a government network that serves as the country’s international gateway. AS208293 normally doesn’t originate any routes; it only connects the country’s telecoms to international transit providers.</p> <p>However, at 13:52 UTC on 5 August 2023, AS208293 started originating the following prefixes:</p> <table> <tbody> <tr> <td>151.106.160.0/19</td> <td>95.161.64.0/20</td> </tr> <tr> <td>95.161.0.0/17</td> <td>91.108.8.0/22</td> </tr> <tr> <td>91.108.0.0/18</td> <td>91.108.4.0/22</td> </tr> <tr> <td>149.154.160.0/20</td> <td>149.154.164.0/22</td> </tr> <tr> <td>91.108.56.0/22</td> <td>149.154.160.0/22</td> </tr> </tbody> </table> <p>All but the first are address ranges used by Telegram. 151.106.160.0/19 was last utilized by the <a href="https://www.linkedin.com/posts/subspace-com_we-regret-to-announce-that-effective-may-activity-6930952295824207873-2z25/">now-defunct Subspace network</a>, so the reason for its inclusion is unclear.</p> <p>Illustrated in Kentik’s BGP Route Viewer below, 149.154.160.0/20 was originated by AS208293 and attained global circulation for two reasons. The first was that there was no competing route in circulation — 149.154.160.0/20 is “less-specific” than existing Telegram routes. The second was that there was no matching ROA in RPKI to give ASes that have deployed RPKI a reason to reject it.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2SKv4tHTkjO96rSP4SS2Iu/87b18e94f21eacd721dbe2ac24a88c5d/bgp-route-viewer-spike.png" withFrame thumbnail style="max-width: 800px;" class="image center" alt="BGP Route Viewer showing origination" /> <p>Conversely, when we look at the propagation of the corresponding “more-specific” 149.154.160.0/22, we see that nearly the entire internet believed (correctly!) that AS62041 (Telegram) was the origin. While AS208293 shows up in red in the ball-and-stick diagram along the bottom, it can hardly be seen in the upper stacked plot, which depicts route propagation by origin.</p> <img src="//images.ctfassets.net/6yom6slo28h2/51o8y6IEJZnm7ehxutN4hj/fe94a5563b3a3fd40eb8865738acd7b7/bgp-route-viewer-propagation.png" withFrame thumbnail style="max-width: 800px;" class="image center" alt="BGP Route Viewer showing propagation" />
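<p>The interplay of “less-specific” and “more-specific” routes here comes down to longest-prefix matching. The sketch below demonstrates the effect with Python’s ipaddress module; the origins are simplified and the example destination addresses are arbitrary.</p> <pre><code class="language-python">import ipaddress

# Routes as (prefix, origin ASN); the /20 hijack is "less-specific" than
# Telegram's legitimate /22 announcements (prefixes from the table above,
# with origins simplified for illustration).
routes = [
    (ipaddress.ip_network("149.154.160.0/20"), 208293),  # blackhole hijack
    (ipaddress.ip_network("149.154.160.0/22"), 62041),   # legitimate Telegram route
    (ipaddress.ip_network("149.154.164.0/22"), 62041),   # legitimate Telegram route
]

def best_route(dst_ip):
    """Longest-prefix match: the most specific covering route wins."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [(pfx, asn) for pfx, asn in routes if dst in pfx]
    return max(matches, key=lambda m: m[0].prefixlen)

# Where a more-specific legitimate route exists, it shields traffic from the hijack...
print(best_route("149.154.161.10"))  # 149.154.160.0/22, origin 62041
# ...but addresses covered only by the /20 follow the hijack to the blackhole.
print(best_route("149.154.170.1"))   # 149.154.160.0/20, origin 208293
</code></pre>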
<p>The lack of propagation of a BGP route with AS208293 as the origin of 149.154.160.0/22 is due both to the fact that there was already an existing route to compete with (originated by AS62041) and to the fact that the route from Iraq was <a href="https://rpki-validator.ripe.net/ui/149.154.160.0%2F22/AS208293?include=related_alloc,related_less_specific">RPKI-invalid</a>, dramatically <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">limiting its reach</a>.</p> <h2 id="conclusion">Conclusion</h2> <p>The challenge of measuring the positive impact of any security mechanism is that it often requires knowledge of the incidents that didn’t occur as a result of its use. RPKI is no different. While it likely didn’t have the potential to be another Pakistan/YouTube incident, BGP non-events like this are what RPKI success looks like.</p> <p>Of course, this is little solace to the Iraqi users who must now employ a VPN to circumvent this IP-level blockage. Censoring Telegram will do little to protect the personal data of any Iraqis, but it does cut them off from a popular communication tool and source of news and information.</p><![CDATA[Troubleshooting Cloud Application Performance: A Guide to Effective Cloud Monitoring]]><![CDATA[The scalability, flexibility, and cost-effectiveness of cloud-based applications are well known, but they're not immune to performance issues. We've got some of the best practices for ensuring effective application performance in the cloud.]]>https://www.kentik.com/blog/troubleshooting-cloud-application-performance-a-guide-to-effective-cloud-monitoringhttps://www.kentik.com/blog/troubleshooting-cloud-application-performance-a-guide-to-effective-cloud-monitoring<![CDATA[Phil Gervasi]]>Wed, 09 Aug 2023 04:00:00 GMT<p>Cloud-based applications offer unparalleled benefits in terms of scalability, flexibility, and cost-effectiveness. However, these applications are not immune to performance issues. Like any other software, your cloud applications may experience misconfiguration problems, traffic congestion, or other issues that can affect the overall user experience. It’s essential to be able to troubleshoot application activity in the cloud to ensure effective performance.</p> <p>Unfortunately, troubleshooting can be challenging due to the complexity of cloud architectures and the myriad tech stacks involved. That’s why this article will discuss some best practices for troubleshooting cloud apps, including identifying misconfigurations and denied or dropped flows, as well as congestion in east-west and cloud-to-site connections.</p> <p>After reading this article, you’ll have learned about several practical ways to fix performance problems in cloud applications.
This includes finding misconfigurations, understanding network traffic, using distributed tracing mechanisms, and using synthetic testing.</p> <h2 id="why-you-should-troubleshoot-cloud-application-performance">Why you should troubleshoot cloud application performance</h2> <p>Modern cloud applications consist of various microservices, infrastructures, languages, and frameworks. Their distributed nature introduces multiple points of potential failure. Without a proper governance framework, these complexities can quickly disrupt availability and create performance issues stemming from underlying problems.</p> <p>Organizations need effective observability solutions that provide a holistic view of the application and its components, offering operational teams much-needed visibility. By troubleshooting cloud applications, these organizations can detect and rectify issues before they escalate into full-fledged problems. Employing effective troubleshooting practices enables the early detection of performance issues and ensures compliance with service level agreements (SLAs).</p> <p>The benefits of troubleshooting cloud applications are numerous. It enhances user experience, reduces unnecessary costs, and strengthens the security of your services. Adopting effective troubleshooting practices is essential for enterprises aiming to stay ahead in a competitive industry.</p> <h2 id="best-practices-for-troubleshooting-cloud-application-performance-issues">Best practices for troubleshooting cloud application performance issues</h2> <p>Because of the complexity of cloud applications, identifying and resolving performance issues can be complicated. Whether dealing with complex application configurations or network latency, knowing how to diagnose and troubleshoot issues can be the key to maintaining optimal performance.</p> <p>The following sections will outline some practical strategies for resolving performance issues in cloud applications.</p> <h3 id="implement-log-aggregation">Implement log aggregation</h3> <p>Logs play a crucial role in troubleshooting your cloud applications for performance issues. Log aggregation, the practice of collecting logs from various sources and consolidating them into a unified platform, is a powerful tool in this process. Each component in your cloud architecture exposes logs that provide detailed information about its status and behavior at any given point in time.</p> <p>Application logs generated from call stacks and runtime libraries, such as <a href="https://logging.apache.org/log4j/2.x/">Log4j</a>, help determine application behavior at runtime. System logs emitted from firewalls, servers, and object storage buckets, often via Syslog, provide security insights:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5aTja2qrEa4yyQcniOnlna/55ab4531f3ae96948ff0f00e7150d4e9/log-aggregation.png" style="max-width: 600px;" class="image center simple" alt="Log aggregation diagram" /> <p>Examples of these types of log telemetry data are VPC Flow Logs, which capture information about the IP traffic going to and from network interfaces in your VPC, and NetFlow, a network protocol for collecting IP traffic information and monitoring network flow.</p> <p>Logs are also generated for network resources, databases, and serverless functions. These amount to massive volumes of data that can be hard to navigate without proper tools and frameworks. Log aggregation simplifies this process and helps correlate events and find patterns that might be hard to catch when looking at logs in isolation.</p> <p>For example, log aggregation can quickly reveal if a performance issue is caused by a particular service’s increased latency, possibly due to a problematic interaction between two microservices.</p>
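<p>As a small, concrete example of this kind of telemetry, the following sketch parses two records in the default AWS VPC Flow Logs (version 2) format and aggregates bytes per conversation and rejected flows per source. The records themselves are fabricated, and a real pipeline would read from S3 or CloudWatch rather than an inline string.</p> <pre><code class="language-python">from collections import Counter

# Two fabricated records in the default AWS VPC Flow Logs (version 2) format:
# version account-id interface-id srcaddr dstaddr srcport dstport protocol
# packets bytes start end action log-status
raw_logs = """\
2 123456789010 eni-0a1b2c3d 203.0.113.12 10.0.1.5 44321 443 6 20 4249 1418530010 1418530070 ACCEPT OK
2 123456789010 eni-0a1b2c3d 198.51.100.7 10.0.1.5 52000 22 6 3 180 1418530010 1418530070 REJECT OK
"""

bytes_by_pair = Counter()
rejects_by_src = Counter()
for line in raw_logs.strip().splitlines():
    f = line.split()
    src, dst, action, nbytes = f[3], f[4], f[12], int(f[9])
    bytes_by_pair[(src, dst)] += nbytes
    if action == "REJECT":
        rejects_by_src[src] += 1

print(bytes_by_pair.most_common(5))   # top talkers by volume
print(rejects_by_src.most_common(5))  # sources with the most rejected flows
</code></pre>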
<h3 id="use-a-centralized-configuration-management-solution">Use a centralized configuration management solution</h3> <p>In a complex, distributed cloud environment, settings are often scattered across services and components. This scattering can quickly lead to misconfigurations and cause serious performance issues or service disruptions.</p> <p>A centralized configuration management solution, such as <a href="https://spring.io/projects/spring-cloud-config">Spring Cloud Config</a> or <a href="https://www.consul.io/">HashiCorp Consul</a>, can help manage these configurations by collecting all of them into one unified platform. These platforms are designed to keep your configuration settings in one location that’s easy to manage and visualize. Additionally, such systems allow for consistent configurations across your environment, minimizing discrepancies that can cause performance problems.</p> <p>While centralized configuration management solutions help maintain and manage configurations, automation tools, such as <a href="https://www.ansible.com/">Ansible</a>, <a href="https://www.chef.io/">Chef</a>, and <a href="https://www.puppet.com/">Puppet</a>, can be employed for dynamic configuration changes. Although these tools aren’t primarily focused on configuration management, they can be effectively utilized with proper <a href="https://www.redhat.com/en/topics/devops/what-is-gitops">GitOps practices</a>.</p> <p>By combining centralized configuration management solutions and automation tools, operations teams can maintain a consistent, efficient, and robust environment. This helps to reduce the risk of misconfigurations, improve system resilience, and ultimately boost the performance of cloud applications.</p> <h3 id="diagnose-network-traffic">Diagnose network traffic</h3> <p>When troubleshooting cloud application performance, understanding network traffic behavior is critical. Network congestion can significantly impact application performance, particularly in east-west (i.e., traffic between servers in the same data center) and <a href="https://cloud.google.com/network-connectivity/docs/network-connectivity-center/concepts/site-to-cloud">cloud-to-site connections</a>. Implementing effective network traffic diagnosis is essential to mitigate these issues, and using appropriate network diagnosis tools can make this task more manageable.</p> <p>Choosing a network observability platform that provides real-time visibility into network traffic is ideal. The tool should be able to capture network data and enrich it with application and business context, delivering actionable insights for operations teams. This enables rapid identification of network congestion and its root cause, which may be due to sudden spikes in demand, distributed denial-of-service (DDoS) attacks, or misconfigurations.</p> <p>The key to maintaining optimal application performance is understanding the flow and behavior of network traffic, including detecting network congestion points. Network observability tools can help simplify this task by providing real-time insights, enabling operations teams to rapidly identify and address potential network issues.</p> <p>Regardless of the tool you use, the objective remains the same—ensure seamless network operations for efficient cloud application performance.</p>
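<p>To make this tangible, here is a deliberately minimal active probe that measures TCP connect time to a few endpoints and flags slow or unreachable ones. The hostnames and the 200 ms threshold are placeholders; a real observability platform runs far richer tests continuously and from many vantage points.</p> <pre><code class="language-python">import socket
import time

def tcp_connect_ms(host, port, timeout=3.0):
    """Measure TCP connect time in milliseconds; a crude congestion/latency probe."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None  # unreachable or timed out

# Probe a few hypothetical service endpoints and flag slow ones.
for host, port in [("app.example.com", 443), ("db.internal.example", 5432)]:
    ms = tcp_connect_ms(host, port)
    if ms is None:
        print(f"{host}:{port} unreachable")
    elif ms > 200:
        print(f"{host}:{port} slow connect: {ms:.1f} ms")
    else:
        print(f"{host}:{port} ok: {ms:.1f} ms")
</code></pre>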
<h3 id="add-health-endpoints">Add health endpoints</h3> <p>Another crucial practice when troubleshooting cloud application performance is adding health endpoints to your applications. <a href="https://www.ibm.com/garage/method/practices/manage/health-check-apis/">Health endpoints</a> are specific URIs in your application that return the status of different aspects of your app, such as database connectivity, memory usage, and uptime. This information is helpful for checking the health of your application and its components.</p> <p>You can readily integrate health checks into monitoring tools to get real-time health updates and alerts in case of failures or performance degradation. By monitoring these endpoints, you can detect issues early and take remedial action before they escalate into major problems.</p> <p>In a distributed cloud environment, health endpoints on each service can enable operations teams to keep a pulse on the overall system.</p> <h3 id="use-distributed-tracing-mechanisms">Use distributed tracing mechanisms</h3> <p>Another effective troubleshooting strategy for cloud application performance is using distributed tracing mechanisms. Distributed tracing tracks and monitors requests as they flow through various microservices and components of your application:</p> <img src="//images.ctfassets.net/6yom6slo28h2/78GmMhMLn2dLUbXQ81o7ph/032047a68787f6aaf28fea97cc26187d/distributed-tracing.png" style="max-width: 750px;" class="image center simple" alt="Distributed tracing diagram" /> <p>By implementing distributed tracing, you gain valuable insights into how requests are processed across different services. Each request is assigned a unique identifier, and as it travels through the application, the tracing mechanism captures information about its journey, including processing times and any errors encountered along the way.</p> <p>With distributed tracing, you can identify bottlenecks and pinpoint the exact services or components causing performance issues. For example, if a user experiences slow response times, distributed tracing can help you trace the request flow and identify the specific microservice responsible for the delay.</p> <h3 id="use-a-service-mesh">Use a service mesh</h3> <p>A <a href="https://www.kentik.com/blog/kubernetes-and-the-service-mesh-era/">service mesh</a> is a dedicated infrastructure layer that facilitates service-to-service communications in a microservice-based architecture. By doing so, it decouples the networking code from application logic, making it possible to update or change the network layer of your cloud application independently. It helps manage traffic flow, enforce policies, and offer valuable observability features.</p> <p>Using a service mesh can help simplify the diagnosis and resolution of performance issues in the cloud as it provides insight into the complex interactions between microservices, making it easier to identify problematic patterns.
Features such as traffic control and load balancing further enhance application performance.</p> <p>Moreover, service mesh solutions such as <a href="https://istio.io/">Istio</a> or <a href="https://linkerd.io/">Linkerd</a> offer built-in observability features, providing metrics, logs, and traces necessary for comprehensive troubleshooting. Implementing one such solution can help you manage the communication layer of your cloud apps more proactively and avoid many potential performance issues.</p> <h3 id="implement-synthetic-testing">Implement synthetic testing</h3> <p><a href="https://www.kentik.com/blog/everything-you-need-to-know-about-synthetic-testing/">Synthetic tests</a> are simulated scenarios that mimic user behavior or interactions with an application. They’re especially valuable for troubleshooting cloud application performance, as they help identify potential bottlenecks and issues before end users encounter them. Additionally, they can help reduce mean time to resolution (MTTR), uphold SLAs, and support product launches in new regions.</p> <p>Synthetic tests encompass various testing types, including load, stress, and latency. These tests allow you to monitor performance under different conditions and configurations. By doing so, operations teams can detect issues and anomalies that might degrade application performance in the future.</p> <h2 id="kentik-your-powerful-ally-for-troubleshooting-cloud-applications">Kentik: Your powerful ally for troubleshooting cloud applications</h2> <p>Kentik is a leading network observability solution that provides real-time visibility into your cloud applications’ network traffic, performance, and security. Its robust data analysis platform makes collecting and analyzing network data from various sources easy. It gives actionable insights so NetOps teams can proactively detect and diagnose performance problems.</p> <p>Organizations can use Kentik’s network observability platform to implement digital experience monitoring, increase visibility into network traffic, and improve application performance testing. By integrating Kentik into your cloud troubleshooting process, you can enjoy the benefits of data-driven insights, allowing you to stay ahead of issues and ensure optimal application performance.</p> <h2 id="conclusion">Conclusion</h2> <p>The distributed architecture of modern cloud applications makes it hard to troubleshoot performance issues. However, effective cloud troubleshooting can help mitigate these issues, reduce operational costs, and provide better security.</p> <p>This article highlighted vital strategies to mitigate performance issues, including implementing log aggregation and centralized configuration management, adding health endpoints, and leveraging robust tools such as Kentik. Combined, these strategies result in a powerful framework for troubleshooting cloud application performance issues, ensuring your applications are always reliable, secure, and performant.</p> <p>Check out <a href="https://www.kentik.com/product/cloud/">Kentik Cloud</a> to learn more about increasing your application’s performance. Discover the benefits of the Kentik Network Observability Platform for yourself — start a <a href="#signup_dialog" title="Request a Free Trial of Kentik">free trial</a> or <a href="#demo_dialog" title="Request a Demo of Kentik">request a personalized demo</a> today.</p><![CDATA[SD-WAN Best Practices]]><![CDATA[SD-WAN offers a reliable, fast, and secure approach to the WAN.
In this guide, you’ll learn some best practices for planning, monitoring, analyzing, and managing modern SD-WANs. ]]>https://www.kentik.com/blog/sd-wan-best-practiceshttps://www.kentik.com/blog/sd-wan-best-practices<![CDATA[Phil Gervasi]]>Wed, 26 Jul 2023 04:00:00 GMT<p>Having the right infrastructure is crucial for businesses going through a digital transformation: you need a foundation that can support your growth and expansion.</p> <p>Without well-defined strategies and effective solutions, organizations will encounter substantial challenges, including limited application accessibility, suboptimal performance, and interrupted connectivity to essential assets and resources.</p> <p>Additionally, a robust infrastructure plays a crucial role in supporting the <a href="https://www.techopedia.com/definition/13767/business-continuity-and-disaster-recovery-bcdr">business continuity and disaster recovery (BCDR)</a> strategy. It fortifies businesses against a number of crises, including natural disasters, infrastructure collapses, and economic downturns.</p> <p>Today, businesses need integrated network solutions to more easily adapt, maintain a constant online presence, increase productivity, and reduce operational costs. One such integrated solution is the software-defined wide area network (SD-WAN), an innovative approach to WAN implementation that aims to strengthen the digital presence of businesses in every market, redefining the management of their daily operations and accelerating their evolution.</p> <p>In this guide, you’ll learn more about SD-WAN and some best practices for planning, monitoring, analyzing, and managing modern SD-WANs.</p> <h2 id="what-is-sd-wan">What is SD-WAN?</h2> <p>A typical WAN often consists of a mix of private circuits, such as <a href="https://www.kentik.com/kentipedia/network-traffic-engineering/#multi-protocol-label-switching-mpls-and-traffic-engineering" title="Kentipedia: MPLS">Multiprotocol Label Switching (MPLS)</a> and <a href="https://www.bsimplify.com/what-is-a-point-to-point-circuit/">point-to-point circuits</a>, as well as public internet circuits spread across wide geographical locations.</p> <p>Bandwidth requirements have grown alongside the popularity of cloud and software-as-a-service (SaaS) applications. However, in a traditional WAN, each data packet is still routed without considering the condition or congestion of the underlying links. As a result, some connections can sit nearly idle while others are overloaded with traffic.</p> <p>In a traditional WAN setup, it’s common for data from remote locations to be routed to a central data center. This routing allows for security inspections to be performed before the data is sent to internet locations. Of course, not all data needs to go through this process, and some traffic can be sent directly to the internet without going through a data center. Routing and security inspection decisions depend on an organization’s specific needs and network setup.</p> <p>In any case, this approach negatively impacts application performance and can be more expensive due to the higher costs of MPLS-based connections. When a company adds multiple branch offices or a large number of remote workers, these costs escalate while also impairing performance.</p> <p>Additionally, a traditional WAN requires device-by-device configuration, management, and troubleshooting. This is both highly inefficient and error-prone and can lead to problems with configuration drift and security compliance issues.
Instead, SD-WAN utilizes a central controller so that all WAN devices are managed more easily from a single UI with templates rather than individual box-by-box configuration.</p> <p>SD-WAN addresses these problems directly: it centrally manages all network connections and policies and offers an innovative approach to reducing network costs and optimizing bandwidth usage.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2DjbxISZUkWVEIxQ8VUUQG/9a6cf5b71abac941dc7b87b8832bb297/traditional-wan-vs-sd-wan.png" style="max-width: 750px;" class="image center no-shadow" alt="Traditional WAN vs. SD-WAN diagram" /> <p>SD-WAN separates the control and data planes, allowing for smarter and more secure traffic routing across the WAN and to cloud services. It makes use of cost-effective internet connections, such as broadband, 4G, or 5G wireless, saving businesses money while enhancing cloud-based applications’ performance.</p> <p>An SD-WAN makes dynamic path selections on a per-flow or even per-packet basis (depending on the vendor), based on the quality of the local connection and sometimes the path between the source and destination. This way, an SD-WAN can meet the performance requirements of sensitive applications, improving the end user’s experience.</p> <p>SD-WAN also comes packed with <a href="https://www.netify.com/learning/what-is-sd-wan-security">security features</a> to protect application traffic. Additionally, businesses can use SD-WAN to segment the network into distinct sections based on user identities or job roles, create secure connections by encrypting data, and tightly integrate with cloud security features.</p> <h2 id="best-practices-for-planning-monitoring-analyzing-and-managing-modern-sd-wans">Best practices for planning, monitoring, analyzing, and managing modern SD-WANs</h2> <p>Now that you know how SD-WAN works, it’s time to explore some best practices to help you get the most out of this technology.</p> <h3 id="design-your-sd-wan-carefully">Design your SD-WAN carefully</h3> <p>SD-WAN is undoubtedly an intelligent tool. However, given the increasing complexity of modern networks, simplifying things is challenging. Before implementing an SD-WAN solution, enterprises must invest significant effort in design, planning carefully and considering several crucial factors.</p> <h4 id="consider-your-background-and-architecture">Consider your background and architecture</h4> <p>One aspect you need to consider is your organization’s background and architecture, including understanding the desired network topology. This entails examining the existing topology to clearly define the transition from point A to point B and defining the necessary components.</p> <p>For instance, many organizations have a traditional <a href="https://learn.microsoft.com/en-us/azure/cloud-adoption-framework/ready/azure-best-practices/hub-spoke-network-topology">hub-and-spoke</a> environment but want to shift toward a full mesh network or a local breakout-style network. Understanding these requirements is vital to ensure that the architecture functions effectively both in its end state and throughout the transitional phase of the project.</p> <h4 id="emphasize-collaboration">Emphasize collaboration</h4> <p>Collaboration among stakeholders is another critical factor to consider while designing your SD-WAN.
Because SD-WAN impacts several teams, including networking, security, and application teams, you must ensure everyone is on the same page.</p> <h4 id="consider-third-party-involvement">Consider third-party involvement</h4> <p>The next area of focus is third-party involvement. Organizations increasingly rely on third-party connections (internet service providers, cloud service providers, SD-WAN vendors, etc.), which raises connectivity and security concerns. It’s important to define what third-party connections are needed and their underlying functionality beforehand.</p> <h4 id="take-into-account-your-long-term-goals">Take into account your long-term goals</h4> <p>Finally, the overall strategy and vision of your business must be considered. While designing a network to meet current needs is critical, it’s also essential to understand the company’s long-term strategic direction. This prevents you from having to undertake a complete network redesign shortly after deployment.</p> <p>Understanding the strategic direction ensures the design is future-proof and meets current and future needs.</p> <h3 id="keep-a-close-eye-on-network-performance">Keep a close eye on network performance</h3> <p>A network’s performance significantly impacts user experience, productivity, and overall business operations. SD-WAN technology provides a massive opportunity for businesses to improve network performance and reliability. However, you need to thoroughly monitor your SD-WAN to guarantee optimal network performance, security, and the best application experience possible for your users.</p> <p>In fact, a major part of designing an SD-WAN is understanding and planning for what the traffic will mostly consist of. Applications with specific thresholds for <a href="https://www.kentik.com/kentipedia/understanding-latency-packet-loss-and-jitter-in-networking/" title="Kentipedia: Understanding Latency, Packet Loss, and Jitter in Network Performance">jitter and latency</a>, for example, should be part of an SD-WAN design to ensure the proper path selection policies are implemented.</p> <p>Some organizations struggle with network visibility because they can’t monitor their SD-WAN effectively. Often this is because SD-WAN vendors don’t have much visibility into the underlay network, which in the case of SD-WAN is the public internet itself.</p> <p>Despite the bold claims SD-WAN vendors make about their features and performance, various issues still affect SD-WAN networks that these vendors can’t detect with their native monitoring features. Native monitoring capabilities often lack the depth required for comprehensive network monitoring and troubleshooting. Consequently, when performance issues arise, there can be a gray area: it’s unclear where the problem originated, who needs to fix it, or how.</p> <p>Thankfully, by implementing an appropriate <a href="https://www.kentik.com/solutions/usecase/wan-and-sd-wan/" title="Kentik solutions for SD-WAN monitoring">SD-WAN monitoring solution such as Kentik</a>, businesses can proactively identify and address any issues that may arise, ensuring optimal network performance.</p> <p>This means that businesses using SD-WAN, paired with a high-quality network observability platform, can monitor every network connection from an end-user perspective.</p>
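<p>What does an “end-user perspective” measurement actually look like? At its simplest, it’s a small agent at the branch timing real requests. Here’s a minimal sketch in Node.js; the target URL is hypothetical, and Node 18+ is assumed for the built-in <code class="language-text">fetch</code>:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">// Sketch: a tiny "end-user perspective" probe that measures HTTP response
// time to an app from a branch location. The target URL is illustrative;
// a real synthetic agent would also capture DNS, TLS, and TTFB timings.
const TARGET = "https://app.example.com/health"; // hypothetical endpoint

async function probe() {
  const started = performance.now();
  try {
    const res = await fetch(TARGET, { redirect: "follow" });
    const elapsedMs = Math.round(performance.now() - started);
    console.log(`${new Date().toISOString()} status=${res.status} ms=${elapsedMs}`);
  } catch (err) {
    console.log(`${new Date().toISOString()} error=${err.message}`);
  }
}

// Probe once a minute, mimicking a synthetic agent's test frequency.
probe();
setInterval(probe, 60_000);</code></pre></div> <p>Run from each site, even a probe this small turns “the app feels slow” into a measurable trend you can compare across locations.</p>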
<p>Monitoring from the user’s vantage point helps companies assess whether the performance of their SD-WAN service meets expectations.</p> <h3 id="pay-special-attention-to-security">Pay special attention to security</h3> <p>As you implement an SD-WAN solution, there are several actions you’ll undertake. These include enabling direct internet access for branch offices and providing home workers with the same capability. However, these actions also introduce a number of potential vulnerabilities into your organization.</p> <p>In the past, with more traditional networks, safeguarding your network was relatively straightforward, involving the use of firewalls, intrusion prevention systems, and a <a href="https://www.okta.com/identity-101/dmz/">demilitarized zone (DMZ)</a> as the sole entrance and exit point for network access.</p> <p>With SD-WAN, you face a much greater number of potential entry points. That’s why it’s critical to pay special attention to security.</p> <p>To begin, you need to know your organization’s intended <a href="https://www.kentik.com/blog/how-kentik-reduces-likelihood-full-blown-cyber-attack-before-it-happens/" title="Kentik Blog: How Kentik Reduces the Likelihood of a Full-blown Cyberattack Before it Happens">security posture</a>. You also need to decide where security measures will be implemented. For instance, will you implement security locally at each branch, or offload it to the cloud?</p> <p>Moreover, you need to address compliance and regulatory needs. Regulations such as the <a href="https://gdpr.eu/what-is-gdpr/">General Data Protection Regulation (GDPR)</a> and <a href="https://oag.ca.gov/privacy/ccpa">California Consumer Privacy Act (CCPA)</a>, which require the security of sensitive information, customer data, and employee data, already impact many organizations. You need to determine the scope of these requirements, including aspects of visibility such as what information can be accessed, stored, and retained, as well as how long it is kept.</p> <p>Additionally, encryption is essential for data in transit and at rest, so you must understand your organization’s encryption processes. At the edge in particular, the level of encryption used can have a considerable impact on performance and delivery.</p> <p>Finally, you must consider any additional security services. With expanding threats and a growing possibility of breaches, security is complex. Kentik not only allows you to <a href="https://www.kentik.com/solutions/usecase/ddos-detection-and-network-security/" title="Using Kentik for DDoS Detection and Network Security">establish and maintain a robust security posture</a> but also supports managed services that help you monitor for security threats, detect them, and resolve them promptly.</p> <h3 id="test-and-update-your-sd-wan-regularly">Test and update your SD-WAN regularly</h3> <p>Another critical best practice is regularly testing and making appropriate updates to your SD-WAN deployment. This guarantees that your network functions properly and fulfills your organization’s requirements.</p> <p>The simplest and most efficient way to do this is for your network team to design a monitoring strategy that covers a variety of topics, including event management, active path testing, and <a href="https://www.kentik.com/kentipedia/what-is-network-topology/" title="Kentipedia: What is Network Topology?">network topology</a>.</p>
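<p>Active path testing, in particular, doesn’t have to start big. Even a scheduled reachability probe whose results feed your event pipeline catches a lot. Here’s a minimal sketch; it assumes a Unix-like <code class="language-text">ping</code> binary on the agent host, and the hub hostname is hypothetical:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">// Sketch: a scheduled reachability probe for active path testing.
// Assumes a Unix-like `ping` binary on the agent host; Windows flags differ.
const { execFile } = require("node:child_process");

function pathTest(host) {
  execFile("ping", ["-c", "4", host], (err, stdout) => {
    if (err) {
      console.log(`${host}: unreachable (${err.message})`);
      return;
    }
    // Summary line looks like: "rtt min/avg/max/mdev = 9.1/9.8/10.2/0.4 ms"
    const match = stdout.match(/= ([\d.]+)\/([\d.]+)\/([\d.]+)\/([\d.]+) ms/);
    if (match) {
      console.log(`${host}: avg rtt ${match[2]} ms, deviation ~${match[4]} ms`);
    }
  });
}

// Hostnames are illustrative; point these at your own hub and targets.
["sdwan-hub.example.net", "8.8.8.8"].forEach(pathTest);</code></pre></div>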
<p>Together, these components are critical for effective troubleshooting and for ensuring that your SD-WAN solution operates as intended.</p> <p>It’s also important to test the functionality of the features provided by your SD-WAN solution. Many systems offer capabilities, such as traffic optimization, bandwidth management, and advanced encryption, that improve network efficiency and security. Evaluating these functionalities is critical to ensure they deliver the stated advantages.</p> <p>However, SD-WAN testing extends beyond functionality validation alone. It’s also vital to focus on quality of service (QoS), availability, and failover. There is no true QoS on the public internet (no control over its buffers, queues, and so on), so what SD-WANs do instead is monitor links moment by moment and choose the best path for a given flow or packet. This improves the overall performance of applications delivered over the internet.</p> <p>Additionally, organizations must evaluate the scalability of their SD-WAN solution to ensure it can accommodate evolving needs.</p> <p>Regular testing is also vital so that SD-WAN deployments can meet <a href="https://www.techtarget.com/searchnetworking/tip/SD-WAN-and-SLAs-Why-crafting-internal-SLAs-is-a-smart-move">service level agreements (SLAs)</a> and provide the necessary network performance and reliability. Testing should occur during and after the deployment process. It should take place regularly and continuously to adapt to expanding IT infrastructure and evolving business requirements.</p> <p>It’s crucial to approach testing and updating in a controlled manner, with proper planning and consideration for potential impacts on production environments. You need to review and test regularly, then act on the lessons you learn. This way, organizations can ensure that their SD-WAN deployment remains optimized and aligned with their needs.</p> <h3 id="optimize-traffic-routing-using-analytics">Optimize traffic routing using analytics</h3> <p>SD-WAN offers significant advantages regarding <a href="https://www.kentik.com/kentipedia/network-performance-monitoring/" title="Kentipedia: Network Performance Monitoring">network performance</a>. One key strength lies in its ability to prioritize and route traffic through optimal paths, making the best use of network resources.</p> <p>You can optimize traffic routing using monitoring tools to continuously analyze network components for patterns and trends. Leveraging analytics and data, you can decide how to prioritize traffic and optimize routing policies to <a href="https://www.kentik.com/kentipedia/bandwidth-utilization-monitoring/" title="Kentipedia: What is Bandwidth Utilization Monitoring?">maximize bandwidth</a> use.</p> <p>Implementing an analytics solution can help you understand your network traffic and effectively identify bottlenecks. By leveraging analytics data, you can evaluate various network paths and components. This evaluation ensures efficient routing decisions based on metrics such as latency, jitter, throughput, and available bandwidth.</p> <p>Additionally, organizations should try to estimate future network traffic demands and capacity requirements by examining historical data and patterns. This proactive strategy anticipates traffic increases, simplifies scalability design, and allows for early adjustments to routing policies and network resources.</p> <p>Analytics can also be used to discover irregularities in network traffic patterns.</p>
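<p>The core of such a check can be surprisingly small. Here’s a toy sketch that flags bandwidth samples deviating sharply from a recent baseline; production analytics use far richer models, but the shape of the idea is the same:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">// Toy sketch: flag bandwidth samples that deviate sharply from the recent
// baseline. Real analytics platforms use far richer models; this just
// illustrates the shape of the idea.
function detectSpikes(samplesMbps, windowSize = 12, threshold = 3) {
  const alerts = [];
  samplesMbps.slice(windowSize).forEach((value, offset) => {
    const i = offset + windowSize;
    const window = samplesMbps.slice(i - windowSize, i);
    const mean = window.reduce((a, b) => a + b, 0) / windowSize;
    const variance = window.reduce((a, b) => a + (b - mean) ** 2, 0) / windowSize;
    const stdDev = Math.sqrt(variance) || 1; // guard against zero variance
    const zScore = (value - mean) / stdDev;
    if (zScore > threshold) {
      alerts.push({ index: i, mbps: value, zScore: Number(zScore.toFixed(1)) });
    }
  });
  return alerts;
}

// Steady ~100 Mbps traffic with one suspicious burst at the end:
const samples = [98, 101, 99, 102, 100, 97, 103, 99, 100, 101, 98, 100, 480];
console.log(detectSpikes(samples)); // flags the 480 Mbps outlier</code></pre></div>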
<p>Unexpected bandwidth spikes or unusual application activity can suggest potential security issues. When you identify these issues via analytics, you can take the necessary steps, such as adjusting security policies or routing suspect traffic for additional examination.</p> <h3 id="distribute-your-traffic-in-multiple-tunnels-to-ensure-redundancy-and-reliability">Distribute your traffic in multiple tunnels to ensure redundancy and reliability</h3> <p>To ensure redundancy and reliability in your SD-WAN deployment, you need to distribute your traffic across multiple virtual tunnels (e.g., <a href="https://www.kentik.com/kentipedia/ip-network/" title="Kentipedia: IP Network">IPsec</a>). Traffic distribution gives you several ways to mitigate potential failures and minimize downtime. With more than one path available to handle requests, the system becomes far more resilient and can process traffic in parallel, dividing the workload and improving overall performance.</p> <p>With SD-WAN, you can create a failover for your organization through multiple connections, which leads to increased redundancy. This means you can stay connected even if one of your connections fails. The failover mechanism helps your users develop a sense of trust and dependability while maintaining session traffic and security.</p> <p>Additionally, using multiple paths and auto-failover mechanisms, SD-WAN can significantly improve delivery time. SD-WAN redirects traffic around bottlenecks and poorly performing WAN connections, ensuring that critical data reaches its destination without delays.</p> <p>Finally, SD-WAN enables accurate resource allocation and appropriate service delivery: by distributing traffic effectively across multiple tunnels, it optimizes network resource utilization, resulting in improved reliability and overall system performance.</p> <h2 id="conclusion">Conclusion</h2> <p>SD-WAN is a pioneering network architecture that offers a reliable, secure, and performance-optimized WAN. This modern technology provides uninterrupted connectivity and centralized, intelligent management of your entire network, saving time and reducing costs.</p> <p>It’s crucial to adhere to a set of best practices in planning, monitoring, analyzing, and managing to make the most of SD-WAN. Carefully designing your SD-WAN, closely monitoring network performance, and giving special attention to security are excellent starting points. Additionally, regularly testing and updating your SD-WAN, optimizing traffic routing using analytics, and distributing your traffic across multiple tunnels help ensure optimal performance.</p> <p>The Kentik network observability solution is a valuable tool for ensuring the success of your SD-WAN implementation. With Kentik, you gain the necessary operational oversight to effectively monitor, manage, and plan your SD-WAN environment.</p> <p>To explore the capabilities of Kentik and see how it can specifically support your SD-WAN needs, <a href="https://www.kentik.com/go/get-demo/">request a personalized demo</a>.</p><![CDATA[Website Performance and Transaction Monitoring with Kentik]]><![CDATA[Learn about website performance monitoring, why you need it, and how to set up automated transaction testing using Kentik’s network observability platform in this step-by-step tutorial.
You can even follow along using the free trial version of Kentik.]]>https://www.kentik.com/blog/website-performance-and-transaction-monitoring-with-kentikhttps://www.kentik.com/blog/website-performance-and-transaction-monitoring-with-kentik<![CDATA[Amr Abdou]]>Thu, 20 Jul 2023 07:00:00 GMT<p>In this article, you’ll learn more about website performance monitoring and why you need it. You’ll also learn how to set up automated transaction testing using <a href="https://www.kentik.com/">Kentik’s</a> synthetic monitoring tools.</p> <h2 id="what-is-website-performance-monitoring">What is website performance monitoring?</h2> <p>Website performance monitoring is the process of continually tracking and analyzing the performance of a website or web application, including metrics such as page load time, uptime, server response time, and error rates. Automating the performance monitoring process helps businesses identify and address issues impacting user experience, availability, search engine ranking, capacity, and scalability.</p> <h3 id="why-do-you-need-website-performance-monitoring">Why do you need website performance monitoring?</h3> <p>If you’re reading this article, chances are you know you need to use different tests and techniques to monitor a website’s performance. Tests like page load speed and HTTP response can help you monitor your website’s performance, but they don’t offer a comprehensive understanding of user behavior and its impact on performance. Synthetic transaction tests provide a more effective solution by allowing you to monitor website performance based on the user’s digital experience, enhancing the observability of your system.</p> <p>This is where website performance monitoring plays a vital role. By ensuring reliable and insightful performance monitoring, you can achieve lower bounce rates, improve customer retention, and enhance the overall user experience.</p> <p>Most major search engines use multiple performance metrics as ranking factors, such as <a href="https://www.searchenginejournal.com/core-web-vitals/" title="Search Engine Journal: Core Web Vitals: A Complete Guide">Google’s Core Web Vitals</a> and <a href="https://www.bing.com/webmasters/help/webmaster-guidelines-30fba23a" title="Bing Webmaster Guidelines">Bing Webmaster Guidelines</a>. Improving your website’s performance can also help you achieve high search engine rankings and increase the discoverability of your website.</p> <p>Additionally, performance monitoring helps to maintain web applications’ availability and to indicate the right time to scale. By leveraging the insights provided by effective monitoring tools, you gain valuable data that can inform business decisions regarding your development roadmap, capacity planning, and scalability.</p> <h3 id="introducing-synthetic-transaction-monitoring">Introducing Synthetic Transaction Monitoring</h3> <p>Synthetic transaction monitoring takes the process further by tracking specific types of digital transactions, such as registration, form submission, and purchases. It can help you gain performance insights based on real user experience scenarios and identify issues that may arise during the transaction process, such as slow response times, bottlenecks, and browser errors.</p> <p>Synthetic transaction monitoring involves running user transactions from different geographical locations to help you focus on performance where most users are located.
It can also provide accurate transaction performance metrics from a new location where a service will be launched.</p> <p>Transaction monitoring is valuable in several scenarios, including e-commerce and software-as-a-service (SaaS) websites. For example, it can help e-commerce websites create an efficient checkout process and allow complex SaaS applications to provide a better user experience and determine scalability needs. Moreover, it can help maintain fast streaming speed for media and entertainment websites.</p> <h2 id="how-to-monitor-website-performance-with-kentik">How to monitor website performance with Kentik</h2> <p>Now that you know why you need performance monitoring, it’s time to set it up using Kentik’s <a href="https://www.kentik.com/blog/everything-you-need-to-know-about-synthetic-testing/" title="Kentik Blog: Everything You Need to Know About Synthetic Testing">synthetic transaction testing</a> features. As previously noted, transaction tests help you monitor the real-time performance of specific user experience scenarios. Here, you’ll learn how to create tests that run periodically using testing agents from multiple locations.</p> <p>With Kentik’s transaction monitoring, you’ll receive real-time alerts that help you visualize issues based on a monitored global digital experience. This can save you time and help you determine when and how to improve performance.</p> <p>Before you begin, you need to have the latest version of the Chrome browser (version 113 was used here) and a <a href="https://www.kentik.com/get-started/" title="Start a 30-day free trial of Kentik">free Kentik account</a>.</p> <p>Once you’ve opened an account, it’s time to get started.</p> <h3 id="record-a-transaction-in-chrome">Record a transaction in Chrome</h3> <p>To begin, you need to simulate users’ actions using the <a href="https://developer.chrome.com/docs/devtools/recorder/" title="Chrome Developers: Record, replay, and measure user flows">Chrome DevTools Recorder</a>. This is a relatively new function that allows you to record a user’s digital experience and export it in several formats, including JSON, <a href="https://pptr.dev/" title="Puppeteer Node.js library">Puppeteer</a>, or test tool scripts, such as <a href="https://www.cypress.io/" title="Cypress Testing Framework">Cypress</a> and <a href="https://nightwatchjs.org/" title="Nightwatch Testing Framework">Nightwatch</a>.</p> <p>In your Chrome browser, press <strong>F12</strong> (or <strong>Ctrl + Shift + I</strong> if you’re using Windows/Linux or <strong>Command + Option + I</strong> on Mac) to open Chrome DevTools, then open the <strong>Recorder</strong> tab:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2moc2V2l635IVNZGfusOmO/d067705dde2afd57df3de158bacbc20b/chrome-dev-tools-recorder-pane-web-monitoring-1.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Chrome DevTools Recorder" /> <div class="caption" style="margin-top: -35px;">Accessing the Chrome DevTools recorder.</div> <p>Click the <strong>Recording</strong> button, then navigate to the website or application you want to test and perform some actions. For instance, you can record navigating to your website and completing the registration process. If you’re testing an e-commerce website, you can mimic the shopping experience and add products to the shopping cart. Once you’re done, click <strong>Stop</strong>.</p> <p>In this example, we navigated to a company’s website, clicked buttons that triggered JavaScript-animated scrolling, and submitted a contact form.
This is what the recording results look like:</p> <img src="//images.ctfassets.net/6yom6slo28h2/61LOV1iGTFwgnLFIuMfiKa/3ff830b214d18e4df494d761d8e04853/web-performance-recording-2.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Chrome Recorder result" /> <div class="caption" style="margin-top: -35px;">Results from the Chrome recorder.</div> <p>To use the Recorder with Kentik’s transaction testing, you need to export the recorded actions as a <a href="https://developer.chrome.com/docs/puppeteer/" title="Chrome Developers: Puppeteer">Puppeteer script</a>. If you’re unfamiliar with Puppeteer, it’s a Node.js library that provides an API to create and control a headless Chrome browser.</p> <p>To view the Puppeteer script, select <strong>Show code</strong> and then select <strong>Puppeteer</strong> from the drop-down menu and copy the code:</p> <img src="//images.ctfassets.net/6yom6slo28h2/EkvWSbYXKWOnaDDiqLvQl/589821c1c6aac88b9347664500f77810/puppeteer-script-web-monitoring-3.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Get Puppeteer script" /> <div class="caption" style="margin-top: -35px;">Getting the code for our Puppeteer test script.</div> <h3 id="edit-the-puppeteer-script">Edit the Puppeteer script</h3> <p>The exported Puppeteer script contains the recorded actions but takes no screenshots. You need to add screenshot calls manually in order to see what the website looks like during the test.</p> <p>To add screenshots, go through the script and paste <code class="language-text">await page.screenshot({ path: "screenshot1.jpg" });</code> after every event, such as a page load, an AJAX request, or a browser closing. Make sure every screenshot image has a unique name, and that every image has a <code class="language-text">.jpg</code> extension, like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">{
  const targetPage = page;
  const promises = [];
  promises.push(targetPage.waitForNavigation());
  await targetPage.goto('https://yourwebsite.com/');
  await page.screenshot({ path: "screenshot1.jpg" });
  await Promise.all(promises);
}</code></pre></div> <p>Here’s an example of what adding a screenshot would look like for a JavaScript scrolling event:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">await element.click({
  offset: {
    x: 98,
    y: 34,
  },
});
await page.screenshot({ path: "screenshot2.jpg" });
await Promise.all(promises);</code></pre></div> <p>Ensure that the viewport size of the browser instance is big enough to capture a large part of the screen.
Your code for that should look something like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">{
  const targetPage = page;
  await targetPage.setViewport({ width: 1906, height: 600 });
}</code></pre></div> <h3 id="create-a-kentik-test">Create a Kentik test</h3> <p>To create a new test, log into your <a href="https://portal.kentik.com/" title="Kentik portal">Kentik account</a> and navigate to the <strong>Test Control Center</strong> under the <strong>Synthetics</strong> menu:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Ft8hrUpd8zfmYIIqrEFNh/e145a5c98fba328feca9ccd455bccbfb/synthetics-test-control-center-4.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Synthetic Test Control Center" /> <div class="caption" style="margin-top: -35px;">Navigating to the Test Control Center in Kentik.</div> <p>On the <strong>Test Control Center</strong> page, click <strong>Add Test</strong> in the top-right corner and select <strong>Transaction</strong>:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3DUduGE4G6kiZjh2ZS8VXA/cd6f24f06efd9f9f49a3fd26f810bcd2/adding-a-new-test-in-kentik-5.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Creating a new transaction test" /> <div class="caption" style="margin-top: -35px;">Creating a new synthetic transaction test in Kentik.</div> <p>Now you need to enter the following basic information and test frequency:</p> <ul> <li><strong>Test name:</strong> A name to identify the test. In this example, it is named “Basic Browsing.”</li> <li><strong>Test description:</strong> A description of the test.</li> <li><strong>Labels:</strong> You can create and assign labels for each test to help your group identify tests for future reference. In this example, the test is labeled “Browsing and Scrolling.”</li> <li><strong>Test frequency:</strong> The period between each test. You can choose from five, ten, fifteen, thirty, sixty, and ninety minutes.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/2AH418fcFLpSIP6Ng3f81Q/5d51a8ff84d031a0443d37ebfd28ed71/synthetic-test-setup-in-kentik-6.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Setting up our new synthetic test" /> <div class="caption" style="margin-top: -35px;">Setting up our new synthetic transaction test.</div> <h4 id="target-and-agents">Target and agents</h4> <p><a href="https://kb.kentik.com/v4/Ma01.htm" title="Kentik Knowledgebase: Kentik Synthetics Agents">Test agents</a> in Kentik are test servers maintained in various data centers worldwide.
You can select agents in the geographic locations where most of your website users are located, or in a region where you’re planning to launch your service.</p> <p>Then paste the Puppeteer script you exported and customized from the Chrome DevTools Recorder:</p> <img src="//images.ctfassets.net/6yom6slo28h2/7EuKrsbgWY9YiGIbPwqa1g/066fd45932e717e59948004a15a1601e/transaction-test-target-and-agents-7.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Synthetic transaction test target and agents" /> <div class="caption" style="margin-top: -35px;">Selecting our test target, agents, and Puppeteer script.</div> <h4 id="optional-settings">Optional settings</h4> <p>The next three steps are optional settings that can help you customize the HTTP timeout, test health metrics, and alert settings.</p> <p>If you want to create the test now to see how it works, you can skip these settings, click the <strong>Create Test</strong> button, and move ahead to the <strong>View and Analyze the Results</strong> section. You can come back to edit these settings anytime by clicking the <strong>Edit</strong> icon next to the test in the <strong>Test Control Center</strong>.</p> <h4 id="http-requests-settings">HTTP requests settings</h4> <p>The HTTP settings enable you to set the HTTP request timeout and choose whether to ignore TLS errors. The <strong>HTTP Timeout</strong> is set to 5000ms by default. If you’re creating this test for a staging or development environment, you can enable <strong>Ignore TLS Errors</strong>:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6rMdU0zMs4hwx3AX8y3n9O/29f11cf0346bcd2ba914984ad658856f/transaction-test-http-settings-8.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="HTTP settings for synthetic transaction test" /> <div class="caption" style="margin-top: -35px;">Modifying HTTP settings for our test.</div> <h4 id="health-metrics">Health metrics</h4> <p>The health status of each agent subtest is divided into three categories: healthy, warning, and critical. Based on the length of your recorded transaction, you can set the <strong>Transaction Time Threshold</strong> on which the test status is measured. You can specify this threshold value either in milliseconds or as a standard deviation based on a baseline computed by Kentik. Then you can set the minimum number of unhealthy subtests required to classify the overall test as unhealthy.</p> <p>Here, the threshold is set to five seconds for <strong>Warning</strong> and seven seconds for <strong>Critical</strong>. The test will be deemed unhealthy if at least three subtests exceed this time threshold:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6ERBu2JkvCe839rgrBNce7/64370823ac71cc19c63982f2d7524292/transaction-test-health-settings-9.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Health settings for synthetic transaction test" /> <div class="caption" style="margin-top: -35px;">Configuring Health settings for our test.</div> <h4 id="configure-alerts-and-notifications">Configure alerts and notifications</h4> <p>You can also set alerts and notifications based on the number of unhealthy tests, the time gap between them, and the overall time range of test execution.</p>
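<p>Available channels include Slack, Microsoft Teams, email, and custom webhooks, as shown in the next step. If you choose a custom webhook, the receiving end can be a very small HTTP service. Here’s a minimal sketch; it assumes Express is installed, and the payload fields are illustrative rather than Kentik’s documented schema:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">// Sketch: a minimal receiver for alert webhooks. Express is assumed to be
// installed (npm install express); the payload fields shown are illustrative,
// not Kentik's documented schema. Inspect a real delivery to see the shape.
const express = require("express");
const app = express();
app.use(express.json());

app.post("/kentik-alert", (req, res) => {
  // Log the raw payload, then acknowledge quickly so deliveries don't retry.
  console.log("alert received:", JSON.stringify(req.body));
  if (req.body.severity === "critical") {
    // Hypothetical field: page the on-call engineer, open a ticket, etc.
  }
  res.sendStatus(200);
});

app.listen(3000, () => console.log("webhook receiver listening on :3000"));</code></pre></div>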
<p>Select your preferred notification channels from the available options:</p> <img src="//images.ctfassets.net/6yom6slo28h2/7iQ7OiFdkwH8xZtDcbkPrW/651740d8f958b972ac58fec7b13505b3/synthetic-transaction-test-alerts-and-notifications-10.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Alerts and notifications settings for synthetic transaction test" /> <div class="caption" style="margin-top: -35px;">Configuring alerts and notifications for our website performance tests.</div> <p>Once you’ve configured the optional settings, click <strong>Create Test</strong>. You’ll be redirected to the <strong>Test Control Center</strong>, where you can see the test you just created. The test result will show as pending. Wait a few minutes until the first test is executed, and then open the <strong>Test Results</strong> page.</p> <h3 id="view-and-analyze-the-results">View and analyze the results</h3> <p>The <strong>Test Results</strong> page shows the latest test run using each assigned agent. The following are the metrics included for each test:</p> <ul> <li><strong>Health status</strong> is based on the threshold set during the test creation.</li> <li><strong>Total transaction time</strong> is the end-to-end time the scripted transaction took to complete.</li> <li><strong>Transaction completion</strong> shows whether the transaction ran to completion.</li> <li><strong>Screenshots</strong> shows the number of screenshots taken during the test.</li> </ul> <p>In this sample result, one subtest exceeded the warning threshold and another failed to run:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1IZDmMIk7lBTmvjFLBz6nk/7f3cb69424c12bd4d946bc9a8935a8d2/synthetic-transaction-test-results-11.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Synthetic transaction test results" /> <div class="caption" style="margin-top: -35px;">Results from our synthetic transaction tests.</div> <p>The health status of this test will appear as <strong>healthy</strong> because the unhealthy-subtest threshold was set to three, and only one subtest failed.</p> <p>To compare test results over a period, wait for a few more tests to run, then switch to the <strong>Time Series</strong> view. This view shows the result of tests within a specific time range that you set in the <strong>Input field</strong> at the top-right corner. When you examine the test results within a time series, you can spot, for example, a test that continually fails from a specific agent:</p> <img src="//images.ctfassets.net/6yom6slo28h2/JLeJMptMsoYzwqvTsBzJi/161a68950acb1e4c5999ac2b8df5b9d4/synthetic-test-time-series-view-12.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Synthetic transaction test time series" /> <div class="caption" style="margin-top: -35px;">Time series view of our synthetic transaction test results.</div> <p>To view an individual agent’s test results, click the <strong>Details</strong> button. This page shows you the recorded transaction’s timeframe and the screenshots taken.
In this view, you can identify performance bottlenecks and plan the appropriate actions to fix them:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5mcIF1JZqsUEduPfRw2GPy/27adf1f0df97e7c41e542e1e21994cbd/synthetic-transaction-test-results-13.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Synthetic transaction test details" /> <div class="caption" style="margin-top: -35px;">Detail view of our synthetic transaction test results.</div> <p>The <strong>Waterfall</strong> view (located in the menu at the top) shows detailed data about every resource loaded during the transaction test, helping you determine whether a particular resource is slowing the user experience:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3rVyMjjrqHMrQfQMlIzowa/451381cf37ec7673ee0bfac637dc49eb/synthetic-transaction-test-waterfall-view-14.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Synthetic transaction test results waterfall" /> <div class="caption" style="margin-top: -35px;">Waterfall view of our synthetic transaction test results.</div> <p>To share the test result with your team members, you can turn on <strong>Alerting and Notification</strong>, one of the optional settings you went through while creating the test. If you skipped the optional settings, go to the <a href="https://portal.kentik.com/v4/synthetics/tests" title="Kentik portal: Test Control Center"><strong>Test Control Center</strong></a> and click the edit icon next to the test. Then click <strong>Alerting and Notification</strong> in the left-side menu:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6DXAUvgDfCwq1UnHPaT9JB/e5dc24c4006ceaf8eebc148bbd49fff7/editing-alerts-and-notifications-15.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Editing alerts and notifications" /> <div class="caption" style="margin-top: -35px;">Editing Kentik alerts and notifications.</div> <h2 id="conclusion-web-and-transaction-monitoring-in-kentik">Conclusion: Web and transaction monitoring in Kentik</h2> <p>Transaction testing is a method used to monitor website performance based on realistic digital user experience transactions. Using the right tool helps you set up an efficient monitoring process and improve your website or application.</p> <p>In this article, you learned about the importance of website performance monitoring and explored how to set up transaction monitoring using Kentik, a cloud-native network observability platform. You also learned how synthetic transaction monitoring increases the observability of your system and detects issues as soon as they appear by monitoring the real-time performance of user transactions.</p> <p>In the current market, improving website performance is crucial, and setting up a <a href="https://www.kentik.com/get-started/" title="Get started with a Kentik trial">free Kentik account</a> is an easy way to improve end-user experience and search engine rankings while staying ahead of your competitors.
Kentik’s free trial comes with more than enough test credits to explore how synthetic transaction monitoring can benefit your own sites or apps.</p> <p>Additionally, Kentik’s network observability solution can help your business improve its <a href="https://www.kentik.com/solutions/usecase/digital-experience-monitoring/">digital experience monitoring</a>, <a href="https://www.kentik.com/solutions/usecase/improve-peering-interconnection/">peering and interconnection</a>, <a href="https://www.kentik.com/solutions/usecase/clouds-and-hybrid/">clouds and hybrid connectivity monitoring</a>, <a href="https://www.kentik.com/solutions/usecase/ddos-detection-and-network-security/">DDoS detection and network security</a>, <a href="https://kentik.com/solutions/usecase/troubleshoot-networks/">network troubleshooting</a>, <a href="https://www.kentik.com/solutions/usecase/network-capacity-planning/">capacity planning</a>, and <a href="https://www.kentik.com/solutions/usecase/network-business-analytics/">network business analytics</a>.</p><![CDATA[Troubleshooting a SaaS Performance Problem with Kentik]]><![CDATA[Discover how Kentik’s network observability platform aids in troubleshooting SaaS performance problems, offering a detailed view of packet loss, latency, jitter, DNS resolution time, and more. Phil Gervasi explains how to use Kentik’s synthetic testing and State of the Internet service to monitor popular SaaS providers like Microsoft 365.]]>https://www.kentik.com/blog/troubleshooting-a-saas-performance-problem-with-kentikhttps://www.kentik.com/blog/troubleshooting-a-saas-performance-problem-with-kentik<![CDATA[Phil Gervasi]]>Tue, 18 Jul 2023 04:00:00 GMT<p>SaaS applications make provisioning new apps simple for IT operations, but what’s not so easy is troubleshooting performance problems, since we own and manage neither the SaaS provider’s network nor the public internet.</p> <p>Kentik’s network observability platform can monitor SaaS providers like Microsoft 365, Salesforce, GitHub, ServiceNow, and many more, gathering information about packet loss, HTTP latency, jitter, DNS resolution time, web page load time, etc. In that way, you can monitor a SaaS application’s connection <em>and</em> detailed performance characteristics, including tracing the network path over the public internet.</p> <h2 id="synthetic-testing-and-saas-monitoring">Synthetic testing and SaaS monitoring</h2> <p>We use a <a href="https://www.kentik.com/blog/everything-you-need-to-know-about-synthetic-testing/" title="Kentik Blog: Everything You Need to Know About Synthetic Testing">synthetic testing</a> mechanism to continually interact with a particular SaaS provider and capture metrics from test agents deployed anywhere in the world. Agents are lightweight programs that can be deployed almost anywhere, including individual branch offices, to test SaaS application delivery to and from a specific location.</p> <p>Tests include simple ping tests to check connectivity and gather metrics on <a href="https://www.kentik.com/kentipedia/understanding-latency-packet-loss-and-jitter-in-networking" title="Kentipedia: Understanding Latency, Packet Loss, and Jitter in Network Performance">loss, latency, and jitter</a>.</p>
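<p>For intuition, here’s how those three numbers fall out of a series of probe round-trip times. This toy sketch isn’t Kentik’s implementation, just the arithmetic behind the metrics:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">// Toy sketch: how loss, latency, and jitter fall out of a series of probe
// round-trip times. `null` marks a probe that never came back.
const rttsMs = [21.4, 22.1, null, 20.9, 35.7, 21.8, null, 22.3];

const answered = rttsMs.filter((rtt) => rtt !== null);
const lossPct = ((rttsMs.length - answered.length) / rttsMs.length) * 100;
const avgLatency = answered.reduce((a, b) => a + b, 0) / answered.length;

// Jitter as the mean absolute difference between consecutive RTTs,
// in the spirit of RFC 3550's interarrival jitter.
let jitter = 0;
for (let i = 1; i !== answered.length; i++) {
  jitter += Math.abs(answered[i] - answered[i - 1]);
}
jitter /= answered.length - 1;

console.log(`loss: ${lossPct.toFixed(1)}%`);         // loss: 25.0%
console.log(`latency: ${avgLatency.toFixed(1)} ms`); // latency: 24.0 ms
console.log(`jitter: ${jitter.toFixed(1)} ms`);</code></pre></div>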
<p>Synthetic tests can also monitor a web server’s response to an HTTP(S) request or API call, simulate an end user interacting with an application, measure the responses from DNS requests, and so on.</p> <h2 id="the-state-of-the-internet">The State of the Internet</h2> <p>The State of the Internet is a service Kentik provides to all our customers as a built-in function of the platform. We’ve deployed test agents globally in strategic locations to gather performance metrics of many of the most popular SaaS providers and services.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2eW8g9D8fux2SQlzLpBvYM/6fcf729ac99f3a6a4a432a5d62e6034c/state-of-the-internet-saas-apps.png" style="max-width: 800px;" class="image center" alt="SaaS app performance" withFrame thumbnail /> <div class="caption" style="margin-top: -35px;">Monitoring SaaS app performance with Kentik’s State of the Internet service.</div> <p>Notice in the image above that we’re reporting on the HTTP status code, response size, domain lookup time, connection time, response time, HTTP latency, network latency, jitter, and packet loss.</p> <p>Additionally, we monitor the major cloud providers, such as AWS and Azure, from multiple vantage points. And since DNS plays a critical role in an end user’s interaction with an application over the internet, we also track public DNS services, monitoring both connectivity and actual name resolution times.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3STIDxhEGbZ17OKkvUrBbC/6f42b14ff926b099454b5e56825e852f/state-of-the-internet-dns.png" style="max-width: 800px;" class="image center" alt="SaaS apps by public DNS" withFrame thumbnail /> <h2 id="how-to-troubleshoot-a-slow-saas-application">How to troubleshoot a slow SaaS application</h2> <p>To troubleshoot the performance of a specific SaaS application for your end users in a particular location, we can deploy private test agents in that branch office or even in a home network to capture metrics from that location programmatically.</p> <p>For example, to monitor Microsoft 365 performance from a branch office, we can deploy test agents on-premises at that location. We can then use the results of those tests in real time, or better yet, run them continuously to collect information over time. That means we can troubleshoot an issue as it’s happening, but we can also go back in time in our data to see what was happening when end users experienced a problem.</p> <h3 id="scenario-monitoring-microsoft-365">Scenario: Monitoring Microsoft 365</h3> <p>Imagine Microsoft 365 feels very slow to the end users in our upstate New York branch office. This is an essential suite of productivity applications for our end users, so we monitor it programmatically and tie those tests to our alerting systems.</p> <p>For our example, I’ve set up tests to monitor Microsoft 365 (among other SaaS apps), as well as tests to monitor several on-prem devices like the gateway, office router, and an on-prem wireless controller.</p> <p>For this scenario, we received trouble tickets reporting that our end users couldn’t log into Microsoft 365 earlier in the day for about an hour.
Sometimes the login was slow and failed, and sometimes the login page itself wouldn’t load at all.</p> <p>Kentik integrates with most ticketing and alerting systems, so real-time alerts generated by the platform can be emailed, made part of a ChatOps workflow, or sent to whatever ticketing system you prefer, such as ServiceNow.</p> <h3 id="the-workflow">The workflow</h3> <p>We can start by filtering tests to look at only Microsoft 365. Since logins are failing, starting with our login simulation test makes sense. This is a <a href="https://www.kentik.com/kentipedia/what-is-synthetic-transaction-monitoring/" title="Kentipedia: What is Synthetic Transaction Monitoring (STM)?">synthetic transaction monitor</a> that uses a built-in script to interact with an application and capture the overall transaction time and all its individual components. In our demo example, the test logs into Microsoft 365 and tracks how long it takes to go through the process until all the apps, like PowerPoint and Word, are available.</p> <p>In the image below, notice that when we look back over the past six hours, we see that the test was reporting as PASS before and after that one hour when logins were failing. Also, notice that when the test passes, the total transaction time is around seven to eight seconds, which the system established as the average time it takes for the script to complete successfully.</p> <p>Notice in the image below that during the period users were experiencing issues, the total transaction time spikes to over 20 seconds, and the transaction timeout indicator tells us this is what’s causing the test to fail. This indicates something is slowing everything down and causing a timeout, which is exactly what our end users reported.</p> <img src="//images.ctfassets.net/6yom6slo28h2/0knlqbQBntNvz4AdfZ4W5/572a9632727e1f47ba50d111ebb24b95/ms365-login-simulation-test.png" style="max-width: 800px;" class="image center" alt="Office365 Login Simulation Test" withFrame thumbnail /> <div class="caption" style="margin-top: -35px;">Using synthetic transaction monitoring to troubleshoot Microsoft 365 in Kentik</div> <p>We can also analyze the login page loading by looking at our page load test. Below, notice that when things are working fine, we see a 200 status code, and both navigation time and domain lookup time are very low. Our average HTTP latency is around a second and a half, which is normal. Our average latency, which represents the connection time (in other words, network latency), is around 15 or 16 milliseconds. These all indicate good performance.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7sCmP6sUVnKIJ3GPXKB6ag/86785ddfc2f6f0b6170c72bd2d11a732/ms365-login-page-load-monitor-healthy.png" style="max-width: 800px;" class="image center" alt="Page load monitoring - healthy" withFrame thumbnail /> <p>Next, in the screenshot below, take a closer look at the time period in which users reported issues. The navigation time and domain lookup time look OK, suggesting this is likely not a DNS problem. However, notice that the HTTP latency spiked significantly. This would certainly affect application performance and an end user’s experience.
Also, see that the average latency (which, remember, is network related) spiked as well, suggesting a possible network issue.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6GQHKi2qfUKCk81c7tOqev/3cffd8007dfd648a53f5e754d3a3e8a7/ms365-login-page-load-monitor-critical.png" style="max-width: 800px;" class="image center" alt="Page load monitoring - critical" withFrame thumbnail /> <h3 id="analyzing-page-load-test-results">Analyzing page load test results</h3> <p>To analyze the individual components of the page load, we can look at the Waterfall breakdown (below). In our scenario, no one file or element stands out as the culprit, but we do see that many files are taking a very long time (several seconds) to queue and ultimately send. Clearly, something is slowing the actual transmission of data, and it doesn’t seem to be any particular corrupt file or a DNS problem.</p> <img src="//images.ctfassets.net/6yom6slo28h2/LKf2tM7wxuJuuBnxLIOtR/333545f9f30f77ffae464fa8cdb9d377/synthetic-testing-waterfall.png" style="max-width: 800px;" class="image center" alt="Synthetic monitoring waterfall view" withFrame thumbnail /> <div class="caption" style="margin-top: -35px;">Analyzing synthetic page load tests in Kentik’s Waterfall view.</div> <p>Since we now suspect this is a network issue, let’s look at the local network resources to see if anything is causing or reporting latency at our locally connected devices.</p> <p>The following screenshots show that both the gateway device and our local office router report no latency, jitter, or packet loss problems during that one hour. This indicates that the network problem must not be on our local network.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7dE5PVmJLGL1YcPRTTBCbC/3d3a3f3a9ac3a7fc981b194542a843f4/albany-office-gateway-device.png" style="max-width: 800px;" class="image center" alt="Gateway device performance" withFrame thumbnail /> <img src="//images.ctfassets.net/6yom6slo28h2/6UkXQmRwQ2bAFz4wB5usfy/51459bcda11dfef9e24e3f4826d03750/albany-office-router-monitor.png" style="max-width: 800px;" class="image center" alt="Router performance" withFrame thumbnail /> <p>We can also monitor the connection to a SaaS app over the public internet using the network connection test, either by IP address or hostname. In our scenario, we use the hostname because Microsoft uses a variety of IPs for connectivity.</p> <p>Looking back at that one hour, we can see in the graphic below that there was a clear and dramatic increase in latency, which subsided at the same time our end users reported that the login problem went away.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2OY60aIXepWUAQNWOzBONx/126971ffb018c1d460757154e7372b34/ms365-connection-monitor-critical.png" style="max-width: 800px;" class="image center" alt="Office365 connection monitor" withFrame thumbnail /> <p>This is helpful, but to figure out exactly where the latency is happening, we can look at the path view generated using traceroute, or more specifically, Paris traceroute.</p>
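<p>Why Paris traceroute? Routers on the internet commonly load-balance per flow: they hash fields from each packet’s five-tuple (source and destination IP, ports, and protocol) to pick among equal-cost paths. Classic traceroute varies the destination port from probe to probe, so successive probes can ride different paths and sketch a misleading topology; Paris traceroute holds the hashed fields constant so every probe follows the same path. Here’s a toy illustration of the idea (the hash function and addresses are made up for demonstration, not a real router’s algorithm):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">// Toy illustration of per-flow (ECMP) load balancing: a router hashes the
// five-tuple to pick one of several equal-cost paths. Not a real router
// hash, just enough to show why varying a port mid-traceroute can change
// the path a probe takes.
function pickPath(fiveTuple, pathCount) {
  const key = Object.values(fiveTuple).join("|");
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple string hash
  }
  return hash % pathCount;
}

const base = { src: "10.1.1.5", dst: "52.96.0.1", proto: "udp", sport: "50000" };

// Classic traceroute: the destination port changes per probe, so the
// computed path may change from probe to probe.
for (const dport of ["33434", "33435", "33436"]) {
  console.log(`dport ${dport}: path ${pickPath({ ...base, dport }, 4)}`);
}

// Paris traceroute: the five-tuple is held constant, one consistent path.
console.log(`fixed tuple: path ${pickPath({ ...base, dport: "33434" }, 4)}`);</code></pre></div>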
<p>In the next image, notice that there’s network latency with our upstream provider during that period, which disappears right around the time the Microsoft 365 performance problems go away.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4jcJJERdjTduO80Q2OgRlA/3947ecd27e553bc195755bacba42def7/ms365-connection-monitor.png" style="max-width: 800px;" class="image center" alt="Office365 connection monitor" withFrame thumbnail /> <div class="caption" style="margin-top: -35px;">Troubleshooting Microsoft 365 latency with Kentik’s traceroute path view.</div> <img src="//images.ctfassets.net/6yom6slo28h2/1VLmo0T9uAgi3uD7l9sxxz/fa61a0f1307cf2732c9b1e8c45fa813e/ms365-latency-detail.png" style="max-width: 800px;" class="image center" alt="" withFrame thumbnail /> <p>With Kentik, we were able to investigate a slow SaaS application both from a global perspective using the State of the Internet and from a regional perspective with custom performance monitoring from one of our branch offices. We identified that there was no local network problem, no DNS problem, and no individual file or web page element that slowed things down on its own. Still, there was significant latency in at least one hop upstream from our local last-mile provider.</p> <h2 id="video-microsoft-365-troubleshooting">Video: Microsoft 365 troubleshooting</h2> <p>Follow along with me as I walk through the troubleshooting steps outlined above in this short video:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/6oxw07l673" title="Troubleshooting a SaaS App in Kentik" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>To learn more about Kentik and how we can help you monitor your SaaS applications, <a href="#demo_dialog" title="Request a Demo of Kentik">request a demo</a> or start a <a href="#signup_dialog" title="Start a Free Trial of Kentik">free trial</a> today.</p><![CDATA[How Kentik Helps You Mitigate Cyberattacks Faster]]><![CDATA[No matter how much prevention you have, serious security incidents will inevitably occur. Read the next article in our security series that covers how to understand cyberattacks as quickly as possible so that your organization can respond swiftly.]]>https://www.kentik.com/blog/how-kentik-helps-you-mitigate-cyberattacks-fasterhttps://www.kentik.com/blog/how-kentik-helps-you-mitigate-cyberattacks-faster<![CDATA[Christoph Pfister]]>Thu, 13 Jul 2023 04:00:00 GMT<p><em>This is part 2 of 3 in a blog series about how to fortify your security posture with Kentik. Also see <a href="/blog/how-kentik-reduces-likelihood-full-blown-cyber-attack-before-it-happens/">part 1</a>.</em></p> <p>Kentik plays a crucial role in strengthening our customers’ security posture before, during, and after a cyberattack.
We do this by using deeply enriched network data from across your entire data center, cloud, and container footprint to prevent, detect, and respond to cyber threats.</p> <ul> <li><strong>Prevent</strong>: Kentik reduces the likelihood of a full-blown attack before it happens.</li> <li><strong>Detect</strong>: Kentik helps you mitigate attacks faster when they do occur.</li> <li><strong>Respond</strong>: Kentik uses real-time network data to kick off mitigation efforts and, when the dust settles, obtain a deep understanding of what happened so you can prevent future attacks.</li> </ul> <p>In <a href="https://www.kentik.com/blog/how-kentik-reduces-likelihood-full-blown-cyber-attack-before-it-happens/">Part 1 of this series</a>, we looked at how Kentik reduces the likelihood of a full-blown attack before it happens. By going beyond the basics of tracking IP, port, and protocol, Kentik’s deep multidimensional enrichment of network data, integrated threat feeds, and anomaly detection enable customers to detect threats faster, before attackers can pivot deeper into the network.</p> <p>But no matter how much prevention you put in place, serious security incidents will inevitably occur. It is essential to understand what is happening as quickly as possible so that you can respond. That is the topic of today’s post.</p> <h2 id="the-1-10-60-rule">The 1-10-60 rule</h2> <p>Timing is everything in cyberattacks. It is increasingly difficult to defend IT resources against a determined attacker; however, the blast radius and severity of an attack can be significantly reduced if it is detected sooner.</p> <p>CrowdStrike popularized this idea with the <a href="https://www.darkreading.com/threat-intelligence/most-companies-lag-behind-1-10-60-benchmark-for-breach-response">1-10-60 rule</a>. The rule states that an organization should aim to detect an attack in one minute, investigate its source in 10 minutes, and remediate the root cause in 60 minutes. Our last post showed that the average time to detect and contain a breach is 277 days. Clearly, much work still needs to be done, but Kentik can help.</p> <h2 id="real-time-visibility">Real-time visibility</h2> <p>Kentik provides real-time visibility into network traffic, allowing security teams to monitor ongoing attacks. This visibility is essential to identify the attack source, understand the attack vectors, and take immediate action to mitigate the attack. And it applies at one minute, 10 minutes, and 60 minutes of the attack investigation.</p> <p>For example, Kentik allows security teams to see all of their networks in a single view and understand traffic and telemetry across clouds, data centers, edge, SaaS, WAN, and SD-WAN.</p> <p>Incident response teams can query, filter, drill in, and add context to find the answers they need, even across mountains of data. And they are supported by Kentik’s intuitive dashboards and reports, which make patterns quick to spot.
These reports can easily be shared with colleagues even if they are not Kentik users, which is extremely useful for network and security team collaboration.</p> <div as="Testimonial" index="0" color="blue"></div> <img src="https://images.ctfassets.net/6yom6slo28h2/3pZfO0RpfBUrFoKt6pASkS/35c06df4bdd8ace8afa414711495dd33/network-explorer-202307.png" style="max-width: 800px;" class="image center" alt="Network Explorer" withFrame thumbnail /> <h2 id="real-time-enrichment">Real-time enrichment</h2> <p>When seconds count, it is crucial to correlate seemingly unrelated events to find the root cause of the cyberattack faster.</p> <p>For instance, if your security team has identified that an attack is taking place, it would be extremely useful to understand that the traffic is originating from an embargoed country. Kentik can tell you that and more.</p> <p>This is where Kentik’s real-time enrichment comes in. The additional context provided by enrichment enables security and network teams to identify root causes much faster because they have a complete picture of traffic in and out of their networks, not just the basics of IP, port, and protocol.</p> <p>Kentik lets you use various types of metadata to provide insights into flows. For instance, it can include information like geolocated IP addresses, the specific service being communicated with, the names of autonomous systems (AS), or metadata from cloud providers.</p> <p>Kentik allows its customers to enrich their data with their own sources, ensuring the most relevant context is available. The sources for enrichment can range from any logs or event data to telemetry at the endpoint or application level.</p> <h2 id="siem-integrations-and-more">SIEM integrations and more</h2> <p>As great as the Kentik real-time dashboards are, and they are pretty great, your organization most likely has other tooling in place to monitor security incidents. With Kentik, you don’t have to choose between multi-dimensional network data that is rich in context or your integrated security incident response toolkit. You can send real-time enriched network data to your SIEM tool via the <a href="https://www.kentik.com/product/firehose/">Kentik Firehose</a> to give your security teams complete visibility into an attack as it happens.</p> <p>Kentik also provides other integrations that come in handy during a security incident. For example, we support modern workflows like Chatops, on-call systems such as PagerDuty, and even configurable webhooks so that you can integrate Kentik with any downstream system that needs to know when an attack or vulnerability has been detected.</p> <h2 id="advanced-ddos-detection-algorithms">Advanced DDoS detection algorithms</h2> <p>Kentik’s DDoS protection streamlines network defense against attacks by offering customizable preset alert policies and automatic mitigation triggers. Kentik uses machine learning-based traffic profiling to eliminate false positives/negatives and reduce response time. Users can visualize attack characteristics and their impact on the network and trigger automatic mitigation actions.</p> <p>Some popular configuration options include:</p> <ol> <li><strong>Enable attack profiles</strong>: DDoS Defense offers preset alert policies for different attack profiles. You can enable specific attack profiles relevant to your network and adjust the threshold settings to tailor the detection parameters based on your network’s characteristics.
This allows you to customize the detection and response to different types of DDoS attacks.</li> <li><strong>Exclude interfaces</strong>: You can choose to exclude specific interfaces from being monitored for DDoS attack traffic. This allows you to focus on monitoring only your network’s most vulnerable or critical interfaces.</li> <li><strong>Exclude IP addresses</strong>: You have the option to globally exclude specific IP addresses from being considered in the baseline for normal traffic patterns. This helps ensure that particular IP addresses, such as trusted sources or known outliers, do not affect the accuracy of DDoS attack detection.</li> </ol> <p>Once an attack has been identified, you can automatically trigger your own mitigation strategy via RTBH/Flowspec or integrate with threat mitigation providers like Cloudflare, Radware, and A10. Again, our webhooks can integrate with any downstream system for complete automation.</p> <p>Kentik customer <a href="https://www.kentik.com/resources/case-study-square-enix/">Square Enix’s use of Kentik’s DDoS detection services</a> illustrates this point. According to Square Enix, “As a gaming company, we see a lot of DDoS attacks, and when we see them, we use Kentik to analyze where they’re coming from and alert our security team so they can deploy countermeasures.”</p> <p>Before Kentik, the customer explained, “All we had was an on-prem DDoS monitor that would send an alert about unusual changes in traffic volume, but we didn’t have any information about what type of traffic it was. We could only react to alerts and start hunting across a wide range of potential sources of an attack. With Kentik, we can pinpoint the source of suspicious traffic virtually in real time.”</p> <p>Kentik can also help identify a false alarm, the customer adds. “In the past, we might have seen unexpected spikes in traffic from developers that set off the on-prem DDoS alarm. Now with Kentik, we can see exactly what that traffic is and where it’s coming from to determine if it’s normal traffic or something threatening.”</p> <h2 id="sophisticated-bgp-analysis">Sophisticated BGP analysis</h2> <p>As we wrote in our recent <a href="https://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidents/">A Brief History of the Internet’s Biggest BGP Incidents</a>, BGP routing incidents can be problematic for various reasons. In some cases, they simply disrupt the flow of legitimate internet traffic, while in others, they can result in the misdirection of communications, posing a security risk from interception or manipulation. Routing incidents occur with some regularity and can vary significantly in operational impact.</p> <p>Kentik’s BGP monitoring solution helps mitigate BGP attacks faster by actively monitoring BGP for routing issues, including hijack detection and RPKI problems.
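To get a feel for what an RPKI check involves, you can validate a single origin/prefix pair by hand against public data. Here is a quick sketch using RIPEstat’s public data API (the endpoint and response field are as documented at the time of writing; verify against the current docs before relying on it):</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Ask whether AS15169 is an RPKI-valid origin for 8.8.8.0/24; the status
# field comes back as "valid", "invalid_asn", "invalid_length", or "unknown"
curl -sG "https://stat.ripe.net/data/rpki-validation/data.json" \
  --data-urlencode "resource=AS15169" \
  --data-urlencode "prefix=8.8.8.0/24" | jq -r '.data.status'</code></pre></div> <p>Doing that continuously for every prefix you originate, and alerting on changes, is the part that does not scale by hand. 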
Kentik provides the following capabilities:</p> <ul> <li><strong>Event tracking</strong>: Analyze route announcements and withdrawals over time to identify unusual activities.</li> <li><strong>BGP hijack detection</strong>: Detect and receive instant alerts about hijacking incidents for quick response.</li> <li><strong>Route leak detection</strong>: Identify and alert about route leaks caused by configuration errors.</li> <li><strong>RPKI status checks</strong>: Monitor unexpected origins and invalid RPKI status for secure BGP routing.</li> <li><strong>Reachability tracking</strong>: Track changes in prefix visibility and receive alerts when prefixes become unreachable.</li> <li><strong>AS path change tracking and visualization</strong>: Monitor frequent AS path changes and visualize route changes for faster troubleshooting.</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>Attacks are inevitable. Responding to them faster is the key to minimizing their impact. With Kentik’s real-time visibility, data enrichment, and deep integration into the rest of your security stack, your company can be well on its way to living up to the 1-10-60 rule. In our third and final post in this series, we will look at how Kentik uses this real-time network data to kick off mitigation efforts and obtain a deep understanding of what happened so you can prevent future attacks.</p><![CDATA[$3 Million in Savings and Improved Performance: A Case Study Featuring StackPath]]><![CDATA[When your business is all about providing cloud services at the edge, optimizing the quality of your network connectivity is paramount to customer success. Learn how StackPath saved $3 million and optimized their network performance by using Kentik.]]>https://www.kentik.com/blog/case-study-stackpathhttps://www.kentik.com/blog/case-study-stackpath<![CDATA[Stephen Condon]]>Wed, 12 Jul 2023 04:00:00 GMT<h2 id="introduction">Introduction</h2> <p>Here at Kentik, we’re all about sharing the real-world success stories of industry leaders using our platform to achieve incredible results. In this post, we’ll dive into our recently published <a href="/resources/case-study-stackpath/">StackPath case study</a> that showcases how they used the Kentik Network Observability Platform to drive annual cost savings of over $3 million and enhance the performance of their cloud computing platform.</p> <h2 id="understanding-the-need-for-network-observability">Understanding the need for network observability</h2> <p>In today’s digital landscape, businesses heavily rely on their networks to deliver seamless user experiences and ensure optimal performance. For StackPath, their network is the foundation of their service offering. StackPath offers a cloud computing platform at the edge of the internet and value-added services on top of this infrastructure. They serve telecom, internet, cloud services, security, and gaming industries and have been a Kentik customer since 2015.</p> <p>StackPath has grown through product innovation and acquisitions, including MaxCDN, HighWinds, and Fireblade. 
However, these acquisitions presented new product and network consolidation challenges, leading StackPath to seek a trusted network observability partner.</p> <p>For this case study, we spoke with Brad Raymo, vice president of network strategy at StackPath, who drives the company’s network expansion initiatives, manages peering relationships, and oversees the optimization of network spend.</p> <h2 id="the-collaboration">The collaboration</h2> <p>StackPath realized early on that network observability would be crucial in managing network integration and driving network efficiencies. StackPath selected Kentik to gain deep visibility into its network infrastructure. With Kentik’s powerful platform, StackPath embarked on a journey to unlock cost savings and elevate its service delivery to new heights.</p> <p>StackPath’s partnership with Kentik revolved around a shared vision of harnessing the power of network data to optimize performance and deliver exceptional user experiences. By leveraging Kentik’s advanced network analytics, StackPath gained deeper visibility, proactively identified issues, and made data-driven decisions to enhance its network operations.</p> <h2 id="the-power-of-kentiks-network-observability-solutions">The power of Kentik’s network observability solutions</h2> <p>Kentik’s platform provided StackPath with a comprehensive suite of tools and capabilities to monitor, analyze, and optimize its network performance. Here are some key features and benefits StackPath enjoyed through its partnership with Kentik:</p> <ul> <li> <p><strong>Real-time visibility</strong>: Kentik’s real-time analytics gave StackPath granular visibility into network traffic, performance metrics, and application behavior. This invaluable insight allowed them to quickly identify bottlenecks, anomalies, and potential threats, enabling proactive network management and troubleshooting.</p> </li> <li> <p><strong>Actionable intelligence</strong>: Kentik’s intelligent alerts and customizable dashboards enabled StackPath to extract meaningful insights from their network data. By utilizing comprehensive visualizations and tailored reports, StackPath’s teams gained the ability to make data-driven decisions and optimize network resources effectively. This was especially important when faced with the challenge of integrating acquired networks.</p> </li> <li> <p><strong>Scalability and flexibility</strong>: Kentik’s cloud-native architecture seamlessly accommodated StackPath’s growing infrastructure and evolving business needs. With the ability to scale effortlessly, StackPath used Kentik’s platform to analyze massive volumes of network data and gain real-time insights across their entire network footprint.</p> </li> <li> <p><strong>Proactive performance optimization</strong>: StackPath identified performance bottlenecks and optimized its network infrastructure for enhanced delivery of content and services. By proactively addressing issues, StackPath significantly improved the user experience, reduced latency, and increased customer satisfaction.</p> </li> </ul> <h2 id="results-worth-celebrating">Results worth celebrating</h2> <p>The collaboration between StackPath and Kentik delivered exceptional outcomes that surpassed expectations. StackPath achieved significant improvements in network observability, performance optimization, and user experience.
Importantly, through peering and interconnection optimizations, StackPath found approximately <strong>$3 million in annualized savings</strong> with the help of the Kentik Network Observability Platform.</p> <blockquote> <p>“When I joined the company, I did a full deep dive into how the traffic was flowing and where I could find cost and performance improvements. In my first months, I was able to find approximately $3 million in annual savings in just traffic engineering changes. This would not have been possible without the analysis and insights provided by the Kentik platform,” said Brad Raymo.</p> </blockquote> <p>Along with these savings, StackPath has gained:</p> <ul> <li>Enhanced network visibility allowing them to identify and address potential issues proactively</li> <li>Faster issue resolution and reduced MTTR, with Kentik’s intelligent alerts and customizable dashboards that surface network issues before they impact end users</li> <li>Improved user experience by optimizing network performance, reducing latency, and enhancing overall satisfaction</li> <li>Efficient resource utilization by optimizing their network infrastructure, peering, and interconnection, driving further cost savings</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>The partnership between StackPath and Kentik exemplifies the transformative power of network observability. Kentik’s advanced analytics and real-time insights empowered StackPath to optimize its network infrastructure, elevate service delivery, and drive cost efficiencies. You can <a href="/resources/case-study-stackpath/">read or download the StackPath case study here</a>.</p> <p>To learn more about how Kentik can help your organization with unrivaled network observability, <a href="https://www.kentik.com/get-demo/">contact us now to set up a personalized demo</a>.</p><![CDATA[How Kentik Reduces the Likelihood of a Full-blown Cyberattack Before It Happens]]><![CDATA[Organizations are under constant attack, and it's critical to reduce the time it takes to detect attacks to minimize their cost. This first article in our new security series dives deep into how Kentik helps customers before, during, and after a cyberattack.]]>https://www.kentik.com/blog/how-kentik-reduces-likelihood-full-blown-cyber-attack-before-it-happenshttps://www.kentik.com/blog/how-kentik-reduces-likelihood-full-blown-cyber-attack-before-it-happens<![CDATA[Christoph Pfister]]>Thu, 29 Jun 2023 04:00:00 GMT<p><em>This is part 1 of 3 in a blog series about how to fortify your security posture with Kentik. Also see <a href="/blog/how-kentik-helps-you-mitigate-cyberattacks-faster/">part 2</a>.</em></p> <p>Kentik is crucial in strengthening the security posture for our customers before, during, and after a cyberattack.
We do this by using deeply enriched network data from across your entire data center, cloud, and container footprint to prevent, detect, and respond to cyber threats.</p> <ul> <li><strong>Prevent</strong>: Kentik reduces the likelihood of a full-blown attack before it happens.</li> <li><strong>Detect</strong>: Kentik helps you reduce the blast radius by detecting attacks faster when they do occur.</li> <li><strong>Respond</strong>: Kentik uses real-time network data to kick off mitigation efforts and, when the dust settles, obtain a deep understanding of what happened so you can prevent future attacks.</li> </ul> <p>In Part 1 of this series, we will look specifically at how Kentik reduces the likelihood of a full-blown attack before it happens.</p> <p>Before we get into how, let’s look at why.</p> <p>Here is some data that is scaring the pants off CIOs right now.</p> <ul> <li>The <a href="https://www.ibm.com/reports/data-breach">average cost of a data breach globally is $4.35 million</a>, or $165/record. In the US, the cost of a breach more than doubles to $9.44 million.</li> <li>It only takes threat actors <a href="https://go.crowdstrike.com/rs/281-OBQ-266/images/CrowdStrike2023GlobalThreatReport.pdf">84 minutes on average</a> to pivot deeper into your network after an initial compromise. Responding faster to an initial penetration is essential to prevent a small breach from turning into a multimillion-dollar liability.</li> <li>And yet it takes <a href="https://www.ibm.com/reports/data-breach">277 days to identify and contain a breach</a> on average. That is almost nine months! The bulk of that time is on the identification side, with the average breach taking 207 days to identify.</li> </ul> <p>With every organization under constant attack, it is critical to shrink the time it takes to detect attacks in order to minimize their cost. Within the context of this blog, an attack assumes not only that your network has been penetrated, but also that an attacker has been able to pivot and do something with their newfound foothold.</p> <p>With that in mind, here is a summary of how Kentik helps prevent full-blown attacks in the first place.</p> <h2 id="verify-and-enforce-network-policy">Verify and enforce network policy</h2> <p>Many attacks can be prevented simply by verifying and enforcing the network policy that you already have in place. It’s one thing to say, “My Kubernetes clusters should not be communicating with external IP addresses,” or “unencrypted HTTP or FTP traffic should not appear in these specific network zones.” It is another thing to enforce these rules.</p> <p>With Kentik, you can easily do just that. For example, you can use Kentik to detect and alert on unintended connections that might be potential threats. Remember, it only takes 84 minutes on average for an attacker to pivot from a compromised host. That HTTP traffic in a sensitive area of your network might be a sloppy attacker who you can shut down well before they can make an impact.</p> <p>For example, one Kentik customer recently used our network observability platform to identify spoofed traffic coming from one of their customers before it became a problem. To do this, they simply set up filters to capture outbound internal network traffic with source addresses that did not match expected internal IP ranges.
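The same idea is easy to sanity-check by hand at the packet level. Below is a minimal, generic sketch of such a filter using tcpdump, not the customer’s actual Kentik configuration; it assumes eth0 is the egress interface and that RFC 1918 space is the expected internal range (adjust both for your environment):</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Show outbound packets whose source address falls outside the internal
# ranges; any hit is candidate spoofed or misrouted traffic worth a look
sudo tcpdump -ni eth0 'outbound and not (src net 10.0.0.0/8 or src net 172.16.0.0/12 or src net 192.168.0.0/16)'</code></pre></div> <p>A continuous, enriched version of that check across every exit point, with alerting attached, is what the platform automates. 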
With this real-time monitoring and alerting in place, they were able to investigate potential threats long before the attacker could pivot.</p> <div as="Promo"></div> <h2 id="traffic-monitoring">Traffic monitoring</h2> <p>What does normal network traffic look like? This is an incredibly important question to answer if you are to detect abnormalities. Kentik can continuously monitor your traffic, providing visibility into normal network behavior. This helps establish a baseline and detect any abnormal activities or patterns indicating a potential attack.</p> <h2 id="integrated-threat-feeds">Integrated threat feeds</h2> <p>Once you know what normal looks like, Kentik can start to look for aberrations. One way we do this is by enriching flow records with data from threat intelligence feeds, identifying threats such as botnet command and control servers, malware distribution points, phishing websites, and spam sources.</p> <p>One of Kentik’s airline customers was impressed that our threat feed picked up one of their internal tools scanning across their AWS VPCs, traffic that none of their other monitoring tools had alerted them to. In security, granularity really matters.</p> <h2 id="network-data-enrichment">Network data enrichment</h2> <p>Speaking of granularity, Kentik goes beyond the basics of tracking IP, port, and protocol by providing deep multi-dimensional enrichment of network data like NetFlow and SNMP to detect threats faster. Kentik customers can even do enrichment based on their own data sources so that they have the most relevant context.</p> <p>With Kentik, any type of metadata can be used to provide additional context to flows. Some examples include geoIP data, the service being communicated with, AS names, or cloud provider metadata. Enrichment sources can include any arbitrary logs/event data or endpoint and application-level <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/" title="Best Practices for Enriching Network Telemetry to Support Network Observability">telemetry</a>.</p> <div as="Testimonial" index="0" color="purple"></div> <h2 id="anomaly-detection">Anomaly detection</h2> <p>With a baseline established using multi-dimensional sources, Kentik can identify anomalous behavior or deviations from normal network patterns, such as traffic to/from banned or embargoed countries or domains. This enables faster detection of potential security threats before they can escalate into full-blown attacks.</p> <h2 id="proactive-threat-hunting">Proactive threat hunting</h2> <p>Kentik’s advanced network analytics capabilities allow security teams to perform proactive threat hunting. They can search for indicators of compromise (IOCs), analyze historical data, and identify potential vulnerabilities or weak points in the network infrastructure. This helps fortify the security posture and address vulnerabilities proactively.</p> <p>For example, let’s say your DevOps team has completed a CI/CD project that no longer requires engineers to SSH into individual machines to conduct deployments.
Well, if you see SSH connections over port 22 still happening, it is worth investigating.</p> <p>With rich sources of context in the form of enriched network data at their fingertips, network and security teams can use Kentik to pinpoint vulnerabilities when they suspect they might be under attack.</p> <h2 id="conclusion">Conclusion</h2> <p>This blog could have been titled “An ounce of prevention is worth a pound of cure.” If the average breach takes nine months to detect and costs nearly $5 million, it is worth asking, “Is my organization doing enough to prevent breaches in the first place?”</p> <p>This blog post has explored six concrete ways that <a href="https://www.kentik.com/product/protect/">Kentik helps mitigate attacks</a> before they take hold. Our network observability platform is a powerful tool in the hands of security-minded network administrators and security teams for shutting down threats before they escalate. But what if an attack does occur? Kentik can still help, and that is what we will explore in our next blog, including how Kentik can help identify the attack source so teams can respond more effectively.</p><![CDATA[Announcing Complete Azure Observability for Kentik Cloud]]><![CDATA[Kentik customers can now map traffic and performance of Microsoft Azure infrastructure with visibility into Azure Firewalls, Express Routes, Load Balancers, VWANs, and more in Kentik Cloud. ]]>https://www.kentik.com/blog/announcing-complete-azure-observability-for-kentik-cloudhttps://www.kentik.com/blog/announcing-complete-azure-observability-for-kentik-cloud<![CDATA[Rosalind Whitley]]>Wed, 28 Jun 2023 04:00:00 GMT<p>Today, the phrase “cloud migration” means a lot more than it used to – gone are the days of the simple lift and shift. Kentik customers move workloads to (and from) multiple clouds, integrate existing hybrid applications with new cloud services, migrate to Virtual WAN to secure private network traffic, and make on-premises data and applications redundant to multiple clouds – or cloud data and applications redundant to the data center.</p> <p>These strategies bring fresh challenges to teams working in hybrid and multi-cloud environments, including:</p> <ul> <li>Optimizing resource allocation to reduce cost</li> <li>Maintaining consistent operations using diverse platforms, APIs, networking configurations, and security models</li> <li>Integrating multiple clouds with on-premises infrastructure and legacy systems</li> <li>Managing data across diverse storage systems with strong and compliant data governance that ensures consistency, integrity, and privacy</li> </ul> <p>At Kentik, we help infrastructure, network, and cloud engineering teams meet those challenges head-on. To that end, we’re excited to announce major updates to Kentik Cloud that will make your teams more efficient (and happier) in multi-cloud. <strong>Today, we’re unveiling <a href="https://www.kentik.com/solutions/usecase/microsoft-azure/" title="Microsoft Azure Observability">Kentik Map for Azure</a> and extensive support for Microsoft Azure infrastructure</strong> within the Kentik platform.
With this release, your teams can:</p> <ul> <li><strong>Reference live visualizations</strong> of Azure infrastructure topology in Kentik Map for Azure, with both current and historical views of paths and detailed metadata</li> <li><strong>Quickly understand path and performance</strong> details of traffic routes that connect Microsoft Azure workloads to data centers across Azure Express Routes</li> <li><strong>Visualize and interrogate traffic</strong> across Azure Firewalls, Load Balancers, Application Gateways, and other network infrastructure</li> <li><strong>Visualize and interrogate</strong> all branch, VPN, private cloud, and intra-cloud connectivity that is connected with Azure VWAN</li> <li><strong>Drill into the impact</strong> of NSG (Network Security Group) configuration on security and productivity with live traffic data</li> <li><strong>Use custom tags</strong> to add any custom context to Microsoft Azure data along with pre-built Azure-native attributes sourced from Azure APIs</li> <li><strong>Create custom dashboards</strong> with configurable pre-built widgets to visualize important Azure cost and performance trends</li> </ul> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <br> <iframe src="https://fast.wistia.net/embed/iframe/ftc1a5jsqm" title="Underutilized Azure Firewall Analysis" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>The complexity of moving things between cloud and on-premises to meet cost, customer experience, regulatory, and security requirements is here to stay. Network and infrastructure teams need the ability to rapidly answer any question about their networks to resolve incidents, understand tradeoffs, and make great decisions at scale. Kentik’s comprehensive network observability, spanning all of your multi-cloud deployments, is a critical tool for meeting these challenges.</p> <h2 id="purpose-built-for-azure">Purpose-built for Azure</h2> <p>Kentik Map now visualizes Azure infrastructure in an interactive, data- and context-rich map highlighting how resources nest within each other and connect to on-prem environments. We designed this new map specifically around Azure hybrid cloud architectural patterns in response to the needs of some of our largest enterprise customers. It includes rich metrics for understanding the volume, path, business context, and performance of flows traveling through Azure network infrastructure.</p> <p>Live traffic flow arrows demonstrate how Azure Express Routes, Firewalls, Load Balancers, Application Gateways, and VWANs connect in the Kentik Map, which updates dynamically as topology changes for effortless architecture reference. When it comes to data, the Kentik Map for Azure surfaces hard-to-find health, usage, and utilization metrics for these Azure-native resources based on what matters most for each, including information about CPU utilization, route change frequency, bit and packet receipt rates, failed request counts, latency, port utilization, throughput, data processed, and more as it is available. 
For example, Express Route metrics include data about inbound and outbound dropped packets.</p> <p>Kentik Map for Azure makes denied traffic easily discoverable from each subnet visualized. Beyond the Map, Azure data sources such as NSG (Network Security Group) and Firewall flow logs are now deeply integrated into all of Kentik Cloud to power Azure-aware <a href="https://www.kentik.com/solutions/usecase/clouds-and-hybrid/" title="Kentik Use Case: Hybrid &#x26; Multicloud Networking">hybrid and multi-cloud</a> investigations with easy correlation of NSGs and Firewalls to other Azure networking resources.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/t46f3kv0d1" title="Azure cost attribution by subscription" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <h2 id="why-do-you-need-complete-network-telemetry">Why do you need complete network telemetry?</h2> <p>Kentik now collects, analyzes, and contextualizes traffic flow and performance data from all major public clouds – from Microsoft Azure, <a href="https://www.kentik.com/solutions/usecase/google-cloud-platform/" title="Google Cloud Observability from Kentik">Google Cloud</a>, and <a href="https://www.kentik.com/solutions/usecase/amazon-web-services/" title="AWS Cloud Observability from Kentik">AWS</a> services – along with data from on-premises networks. Kentik enriches all this network telemetry with deep application, business, and security context to provide observability across all hybrid and multi-cloud environments that you can use to make smart decisions faster. It also provides custom alerts and synthetic testing for each environment, including Azure.</p> <p>Engineering teams use this unified telemetry to automate insights across the networks they own, like data centers and corporate IT systems, and the networks they don’t, like cloud software deployments and backups. This improves investigation efficiency by reducing time spent context-switching across tools and manually extracting data and insights. 
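None of this works, of course, unless the raw logs are flowing from the Azure side. If NSG flow logs are not enabled yet, a one-time setup along these lines is the prerequisite. This is a sketch using the Azure CLI with placeholder resource names; the exact subcommand has shifted across CLI versions, so check <code class="language-text">az network watcher flow-log --help</code> on yours:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Enable flow logging for an NSG, writing to a storage account
# (resource names below are placeholders for your own)
az network watcher flow-log create \
  --resource-group my-rg \
  --location eastus \
  --name my-nsg-flowlog \
  --nsg my-nsg \
  --storage-account mystorageacct \
  --enabled true</code></pre></div> <p>Once the logs land in storage, they can be ingested and enriched like any other telemetry source. 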
Complete <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/" title="Best Practices for Enriching Network Telemetry to Support Network Observability">network telemetry</a> also prevents critical security, policy, and performance data from falling through the cracks.</p> <p>Only Kentik Cloud provides this complete telemetry, and we will continue to rapidly expand the platform’s capabilities to help our customers optimize performance, resolve network incidents, reduce cloud costs, and maximize ROI from both acquired and migrated workloads.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4uxeWz41ZpnZKRI3wjA8dT/72805ac1085acabb374f901ef41289c3/azure-main-view.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Microsoft Azure topology in the Kentik Map" /> <p>To learn more, visit our <a href="https://www.kentik.com/solutions/usecase/microsoft-azure/" title="Microsoft Azure Observability">Azure page</a>, and check out the <a href="https://www.kentik.com/go/video/kentik-map-azure/" title="Videos: Kentik Map for Microsoft Azure">Azure video library</a> for more scenario-based demos of all things Azure in Kentik Cloud. Ready to observe your networks? <a href="https://www.kentik.com/get-started/">Try Kentik free for 30 days</a>, or <a href="https://www.kentik.com/go/get-started/demo/">contact our team</a> for a personalized tour.</p><![CDATA[Where No (Enterprise) WAN Has Gone Before]]><![CDATA[Today's modern enterprise WAN is a mix of public internet, cloud provider networks, SD-WAN overlays, containers, and CASBs. This means that as we develop a network visibility strategy, we must go where no engineer has gone before to meet the needs of how applications are delivered today. ]]>https://www.kentik.com/blog/where-no-enterprise-wan-has-gone-beforehttps://www.kentik.com/blog/where-no-enterprise-wan-has-gone-before<![CDATA[Phil Gervasi]]>Thu, 15 Jun 2023 04:00:00 GMT<p>We all know the story.</p> <p>As the Enterprise entered the Mutara Nebula, Khan lost sight of his prey. Sensors were inoperable from the firefight, and finding anything in the gaseous cloud was near impossible.</p> <p>Khan maneuvered, stalked, and hunted his mortal enemy with the cautious vengeance of a madman tempered by misguided intelligence and patience. But this type of encounter in space, this new application of battle strategy borne from intelligence without experience, meant Khan was handicapped from the start.</p> <p><strong>His pattern suggested… two-dimensional thinking</strong>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7cB7vcK6efcsAftMUzTes3/838a2b32b27ba464fba8ff7ec2de46c3/khan-spock-kirk.jpg" style="max-width: 500px;" class="image center" alt="Spock and Captain Kirk" /> <p>Spock, spotting the fatal flaw, raised his head from the familiar viewfinder to report to his captain this new advantage. Without delay, Kirk ordered the Enterprise to a full stop and Z-minus 10,000 meters. 
The Enterprise was now positioned close to the Reliant but beneath it, in such a way that the great Khan Noonien Singh, despite his intellect and inimitable prowess, would likely fail to spot the famed Federation starship.</p> <img src="//images.ctfassets.net/6yom6slo28h2/56MLh8UymVEJqgeICJQe8g/b4f9a0af46e2425fbeaf1fe38deee7c2/khan-enterprise-reliant.jpg" style="max-width: 500px;" class="image center" alt="Starship Enterprise" /> <p>And so the battle progressed until Captain Kirk and his crew destroyed the Reliant, or more accurately, put Khan into a position where he destroyed himself with the Genesis device.</p> <p>Khan, a brilliant tactician, was unable to meet the new challenge because of legacy thinking. He thought in terms of two dimensions. Of a flat universe. But the universe isn’t two-dimensional, is it? And therein lay the end of Khan (and the greatest of Star Trek films).</p> <img src="//images.ctfassets.net/6yom6slo28h2/29VGbS6q9s49TNUMm135p3/08b86a8ad32ec76a086f4200f6cccf68/wrath-of-khan-ricardo.jpg" style="max-width: 500px;" class="image center" alt="The Wrath of Khan" /> <h2 id="enterprise-wan-in-2023">Enterprise WAN in 2023</h2> <p><a href="https://www.kentik.com/blog/todays-enterprise-wan-isnt-what-it-used-to-be/" title="Kentik blog: Today&#x27;s Enterprise WAN Isn&#x27;t What It Used To Be">Enterprise WAN</a> networking in 2023 is very much the same. An engineer standing in front of a console today stares at the traffic moving from their on-prem data center up and out to a CASB, receiving DNS responses from a cloud-provided DNS service, and then on through an ephemeral microservices architecture in a public cloud.</p> <p>And this, of course, is just to reach the front end. On the back end is yet another series of intricate and <a href="https://www.kentik.com/blog/when-reliability-goes-wrong-in-cloud-networks/">complex traffic patterns within and among various public clouds</a> and back to an end user, an actual human being, working on a mobile device on a train heading into a tunnel.</p> <p><em>To succeed as an engineer in this new network, and to successfully manage the infrastructure and services that deliver applications to people, we must rid ourselves of two-dimensional thinking</em>.</p> <p>I recently had the privilege of attending the WAN and AWS Summits in London. Both events, focused by virtue of their names on very different aspects of technology, were in practice and conversation all about moving resources from the public cloud to a person anywhere in the world. In other words, both events were all about <a href="https://www.kentik.com/kentipedia/what-is-cloud-networking/">cloud networking</a>.</p> <p>Not many years ago, all my WAN projects were IPsec tunnels to a headend, dual-hub DMVPN designs, coordinating MPLS handoffs from last-mile providers, etc. Full or partial mesh topologies, backhauling to data centers, testing connectivity to on-prem resources.</p> <p>Recently, conversations with colleagues and customers depict very different traffic patterns. Today, it’s up and out. User to the cloud. Cloud to cloud. There’s very little going on with branch-to-branch connectivity or backhauling traffic to centralized data centers. There are exceptions for sure, but they are not the norm.</p> <p>For the most part, people access resources, usually in the form of applications, directly from the cloud — whether that’s public cloud, private cloud, or a SaaS provider.
This means traffic patterns have changed, and by extension, the nature of network visibility has changed.</p> <p>Today’s network is more a collection of autonomous networks of varying sizes under one administrative domain than a highly interconnected mesh of networks. Sometimes those networks are individual branch offices that talk to no other branch or private data center. Sometimes, they are large campuses with minimal resources on-site, and very often, they are networks of individual end users working from home.</p> <h2 id="the-genesis-of-new-enterprise-wan-visibility">The genesis of new enterprise WAN visibility</h2> <p>Enterprise network visibility in 2023 means much more than collecting flows and SNMP trap messages. Yes, those forms of telemetry are still important, but the network is very different than it used to be, so we need a new strategy to understand what’s happening.</p> <h3 id="cloud">Cloud</h3> <p>First, we can start by collecting data from our cloud providers. This may go without saying, but if you’re using a public cloud in any way, you should also collect whatever logs or flow data your cloud provider offers for visibility. This includes information about resource connectivity and traffic flow within a particular cloud provider’s environment and among different cloud providers.</p> <p>AWS provides <a href="https://www.kentik.com/resources/aws-vpc-flow-logs-for-kentik/">VPC Flow Logs</a>, which allow you to gather info about traffic going in and out of your VPC interfaces.</p> <p>GCP also uses <a href="https://www.kentik.com/resources/google-cloud-vpc-flow-logs-for-kentik/">VPC Flow Logs</a> to record a sample of network flows sent from and received by VM instances, including instances used as GKE nodes.</p> <p><a href="/resources/azure-nsg-flow-logs-for-kentik/">Azure NSG flow logs</a> inform us about ingress and egress IP traffic through a Network Security Group.</p> <h3 id="the-internet">The internet</h3> <p>Next, considering we access cloud resources mostly over the public internet, we need to collect data about the state of the internet itself. This has always been something network engineers have wanted, but it’s never been more critical than it is today.</p> <p>We can start with collecting whatever telemetry our service providers will give us, especially our last-mile providers. But that’s typically lacking at best, so we should also explore other data types and sources.</p> <p>First, the global routing table is freely available to ingest from various reliable sources. This allows us to see the internet weather, as it were, which can help us understand why traffic moves the way it does.</p> <p>Next, we can deploy test agents to monitor connectivity to public resources such as public DNS servers, SaaS providers, etc. More than simple up/down, we can test for latency, jitter, route and path changes, and so on.</p> <p>Remember that many, if not most, of our end-users are accessing our applications over the internet today, so understanding what’s happening on the global network we don’t own or manage is critical to understanding application delivery.</p> <h3 id="containers">Containers</h3> <p>So many applications are built on microservices architectures today that collecting metrics about container network performance is crucial.
<a href="https://www.kentik.com/blog/ebpf-explained-why-its-important-for-observability/">eBPF</a> is a common way to interact with the container network stack to collect metrics such as packet loss, latency, and jitter, along with TCP-level details like retransmissions and fragments.</p> <p>Monitoring container networking also means tracking the traffic among containers and between your pods deployed in the cloud, on-premises services, and interacting with third-party SaaS services.</p> <h3 id="network-overlays">Network overlays</h3> <p>Network overlays were once relegated to the data center, used to stretch VLANs or meet other niche requirements. Today, network overlays form the basis of how we connect to resources over the internet.</p> <p>SD-WAN, for example, is an overlay technology that abstracts an underlying physical network, usually the public internet. It provides many benefits to network operations, network security, and, above all else, cloud connectivity.</p> <p>In my professional experience, a conversation about SD-WAN is usually about the cloud. So without this telemetry as part of our strategy, we’re missing visibility into the forwarding and policy component of how our end users actually reach their apps.</p> <p>Some SD-WAN vendors expose an API you can use to collect information, and some export flow, SNMP, and more modern telemetry like streaming. Getting a clear picture of how the overlay and underlay networks interact and how the underlay affects what we see in the overlay is crucial for understanding application delivery today.</p> <h3 id="end-users">End users</h3> <p>We want to know as much as possible about our end-user’s experience with their applications. After all, that’s the reason the network exists in the first place. And for that, we can collect a variety of data.</p> <p>There are browser plugins that can tell us how an application performs, and of course, there are locally installed agents that can tell us about a computer’s resource utilization, such as memory and CPU.</p> <p>However, to be proactive about monitoring an end-user’s experience, we can deploy test agents to simulate end users interacting with an application in a deliberate and prescribed manner so we can gather the specific telemetry about each step in a digital transaction.</p> <h2 id="the-starship-enterprise-wan">The Starship Enterprise WAN</h2> <p>In 2023, applications are consumed almost entirely over the internet. With few exceptions, the enterprise network is more critical than ever, more complex than ever, more distributed than ever, and more impactful than ever to an end-user’s experience. That means designing and maintaining a performant application delivery mechanism requires a different form of visibility than we’ve previously had.</p> <p>Like the Kobayashi Maru, we must change the conditions of the test. We can’t afford to lose, and as enterprise network engineers, we can’t afford to believe in a no-win scenario. The network <em>has</em> to work, or no one has access to anything.</p> <p>Using more data, additional sources, and new data analysis workflows, network observability is the three-dimensional thinking necessary to meet the needs of today’s network.</p><![CDATA[No, You Haven’t Missed the Streaming Telemetry Bandwagon - Part 1]]><![CDATA[Streaming telemetry holds the promise of radically improving the reliability and performance of today’s complex network infrastructures, but it does come with caveats.
In the first of a new series, Kentik CEO Avi Freedman covers streaming telemetry's history and original development.]]>https://www.kentik.com/blog/no-you-havent-missed-the-streaming-telemetry-bandwagon-part-1https://www.kentik.com/blog/no-you-havent-missed-the-streaming-telemetry-bandwagon-part-1<![CDATA[Avi Freedman]]>Wed, 14 Jun 2023 04:00:00 GMT<p><em>“One of the wonderful things about standards is that there are so many to choose from.”</em><br> — Inscription on the Tomb of the Unknown Network Engineer</p> <p>One recurring hot topic in network observability is streaming telemetry. The application stack and many of the layers of abstraction, orchestration, and automation saw huge innovation in telemetry over the last 15 years, and networking leaders wanted to take a similar approach to support both “known” telemetry and more modern observability approaches. To listen to the vendors and practitioners, streaming telemetry is the promised land of network telemetry and observability. But what is it, and how does it relate to SNMP and other more traditional types of network metrics telemetry?</p> <p>This is the first of a two-part blog series on streaming telemetry.</p> <p>In it, I’ll give background about the history and original development of streaming telemetry, and in the second part I will cover enterprise and service provider adoption, as well as our take on likely next steps of evolution in streaming telemetry.</p> <h2 id="the-basics-of-streaming-telemetry">The basics of streaming telemetry</h2> <p>Let’s start with terminology to level-set the conversation. Streaming telemetry is one of five methods for collecting performance data from devices in a network (e.g., routers, switches, servers, interfaces, and links).</p> <p>The other four are:</p> <ul> <li>Command-line interface (CLI) for devices like routers</li> <li>Syslog messages</li> <li>APIs to the config and control plane of routers (most prominently Juniper), used by large customers who create their own network management systems</li> <li>Simple Network Management Protocol (SNMP)</li> </ul> <p>Of these methods, SNMP is by far the most widely used.</p> <p>Streaming telemetry is fundamentally different from the other four in one crucial way: it made a design choice to be a “push” method to send all of its telemetry data, while the others are mostly “pull” for metrics. I’ll get into the significance of that difference a little later. For now, though, I need to make an important observation about taxonomy.</p> <p>There are classes of network telemetry: device, traffic, synthetic, config, and metadata. Some network professionals believe streaming telemetry covers all of these classes. But it doesn’t. In my view, the focus of streaming telemetry has been mainly on device telemetry.</p> <p>Telemetry (of any kind) is critical to network health. Without understanding what’s happening on a device at a distance, I&#x26;O teams can’t see directly when something goes wrong, troubleshoot the problem, predict future failures, and implement fixes. Device telemetry has always been an essential element in networking.</p> <p>Enter SNMP.</p> <h2 id="snmp-a-lovehate-story">SNMP: A love/hate story</h2> <p>SNMP is the <em>de facto</em> standard for internetworking management. But it’s like the U.S.
tax code: there’s something in it for everyone to hate (for some folks, <em>a lot</em> to hate), and everyone still uses it.</p> <p>Before diving into what’s wrong with SNMP, let’s acknowledge one of its strengths: the Management Information Base (MIB). The MIB collects, stores, and organizes telemetry data from devices on the network. While MIB has its faults, it does one crucial thing: it provides a common way to examine the activity and health of equipment from multiple vendors. In today’s complex multi-vendor, multi-layer hybrid cloud environment, that is invaluable.</p> <p>The bottom line is that SNMP is functional and ubiquitous, and no alternative has achieved significant momentum.</p> <p>What is it about SNMP that people hate? Simply put, it’s the wrong paradigm for the third decade of the 21st century.</p> <p>To start with, it demands a lot of administrative attention, and the arcane interface is at odds with today’s expectations of easy-to-use GUIs.</p> <p>But the main problem with SNMP is that it’s pull-based. The devices being monitored send data only when requested by the network management system. I&#x26;O teams have to select the devices to poll, set the timing for polling, etc. In highly complex networks, like those of service providers or global enterprises, the interval for polling may be anywhere from 30 seconds to even five minutes – an unacceptable length of time when you’re talking about events potentially occurring in the millions per second. This leaves the network vulnerable to catastrophic interruptions in service if it takes too long to spot an anomaly, error, or (even worse) a DDoS.</p> <p>(There are such things as SNMP traps, which can send a message about the device without being polled. But traps are limited in functionality and require significant operator intervention. They are intended to be emergency messages, but in reality, not all traps signal a true emergency, and even some emergencies don’t generate a trap.)</p> <p>SNMP polling involves so much host-client communication that CPUs can be overwhelmed, particularly if multiple network management tools are employed simultaneously (as is common) by I&#x26;O teams.</p> <p>Network operators look at all this traffic and ask: why can’t each device send this information once, by itself, so everyone can absorb it when and how they want?</p> <p>It’s an important question, especially when today’s highly complex networks are expected to be always-on and highly reliable. Any network performance monitoring and diagnostics (NPMD) system must constantly and thoroughly check the network’s health. 
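To make the pull model concrete, here is roughly what one slice of a polling cycle looks like from the collector’s side: one request and one response per counter (or per bulk walk), per device, repeated every interval. A minimal sketch using the standard Net-SNMP command-line tools, with an illustrative hostname and community string:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Fetch a single 64-bit counter: inbound octets on interface index 1
snmpget -v2c -c public core-router1 IF-MIB::ifHCInOctets.1

# Walk that counter across every interface on the device; a real cycle
# repeats walks like this for dozens of tables on every device polled
snmpbulkwalk -v2c -c public core-router1 IF-MIB::ifHCInOctets</code></pre></div> <p>Multiply that by thousands of interfaces, several tools polling the same devices independently, and a 30-second interval, and the CPU burden described above follows directly. 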
The goal is a consistent flow of real-time and comprehensive data to deliver meaningful, actionable network intelligence.</p> <p>Enter streaming telemetry.</p> <h2 id="streaming-telemetry-a-river-of-useful-data">Streaming telemetry: A river of useful data</h2> <p>Streaming telemetry is a push-based methodology in which data from network devices is streamed continually and automatically, a “set it and forget it” approach that lifts much of the burden imposed on I&#x26;O teams by SNMP.</p> <p>Streaming telemetry taps into three important trends in today’s IT environment:</p> <ul> <li><strong>Big data</strong> – The more information you get and the more frequently you get it, the better informed you will be to take actions that optimize network performance and avoid problems before they occur.</li> <li><strong>Automation</strong> – The fewer repetitive tasks I&#x26;O teams have to perform, the more attention they can pay to higher-value, more strategic priorities.</li> <li><strong>Proactivity</strong> – Network management must be a constant and conscious effort to optimize network performance and reliability that applies directly to the goals of the organization and the quality of the end-user experience.</li> </ul> <p>Streaming telemetry holds the promise of radically improving the reliability and performance of today’s complex network infrastructures, as well as assisting with prediction, capacity planning, cost analysis, performance, and even security. But, there’s a flaw in streaming telemetry that goes back to its roots more than ten years ago. The creators of streaming telemetry abandoned the idea of multi-vendor normalization and dispensed with the MIB.</p> <p>Why?</p> <p>Back then, the mega-scale web companies were trying to tackle one of their biggest problems: providing reliable service to businesses and consumers amid the hyper-scale growth of web traffic. All they wanted, they told the network equipment makers, was a constant stream of data from all their devices. No need for a MIB, they said; we can do our own debugging, troubleshooting, and optimization. Those web giants got what they wanted, and it has worked for them for the most part.</p> <p>When streaming telemetry technology was presented to large enterprises and service providers, the reaction was initially enthusiastic. But when they actually saw what streaming telemetry delivered to them — and required of them — they concluded: “We can’t handle all this by ourselves.” Streaming telemetry was looking like it would become another in a long line of <a href="https://review.firstround.com/the-three-infrastructure-mistakes-your-company-must-not-make">hipster tools and technologies that generate more buzz than adoption</a>.</p> <p>So, if you’re worried that you’re late in adopting streaming telemetry, rest assured the train has not left the station.</p> <p>In the second part of this blog, I’ll explore what’s happened to streaming telemetry since those early days and where it might be headed.</p> <p>I’ll also discuss a key question on the minds of many — since some devices don’t (and may never) support streaming telemetry, how do you unify SNMP and streaming telemetry to get a coherent view of device state?</p><![CDATA[API Monitoring with Kentik]]><![CDATA[API monitoring is the process of keeping tabs on the performance of your REST APIs. Learn how Kentik’s API monitoring tools let you identify bottlenecks, spot performance drops, and maintain API availability.
Learn more in this API monitoring how-to tutorial.]]>https://www.kentik.com/blog/api-monitoring-with-kentikhttps://www.kentik.com/blog/api-monitoring-with-kentik<![CDATA[James Konik]]>Tue, 13 Jun 2023 07:00:00 GMT<p>APIs are complex, particularly if you run lots of microservices. If you want to deliver the best possible service, you have to monitor them. Monitoring your API lets you identify bottlenecks, spot performance drops, and maintain availability, all essential to ensuring a quality experience for your end users.</p> <p>Let’s take a look at how to use Kentik as a comprehensive API monitoring solution. We’ll go over how to glean metrics and enable reporting and alerts, as well as what you need to look for in a monitoring tool in general.</p> <h2 id="what-is-api-monitoring">What is API monitoring?</h2> <p>API monitoring is simply the process of keeping tabs on the performance of your APIs. You can monitor their status and collect metrics like response time and resource consumption, just to name a few.</p> <p>Your API monitoring system should be running all the time, continually logging information and letting you spot any peaks or troughs in your service delivery. These details can help you identify problems and figure out what needs to improve. For example, understanding the relative performance of your API’s endpoints and how they interact with one another lets you improve overall user experience.</p> <h3 id="why-do-you-need-api-monitoring">Why do you need API monitoring?</h3> <p>Monitoring can improve your API in several areas:</p> <ul> <li><strong>Performance.</strong> APIs need to respond quickly to ensure that users aren’t kept waiting. That’s all the more important for interconnected services, as one slow service can affect many others. Monitoring alerts you to performance drops, so you can fix the troublesome service and see what else it’s impacting.</li> <li><strong>Availability and reliability.</strong> If a service becomes unavailable, you need to know right away. You want the details, such as how long it was down, what the impact on other services was, and what happened prior to it dropping.</li> <li><strong>Internal bottlenecks.</strong> In complex, interdependent systems, any minor issue can disproportionately affect the wider system. API monitoring lets you pinpoint the source of issues and quickly diagnose and fix problems.</li> <li><strong>End user experience.</strong> When your API runs smoothly, your users have a better experience. You can monitor your system to ensure the metrics that matter to your users are within expected levels. A system that detects drops in performance before they become serious can give you a head start. You might even be able to fix problems before your users notice.</li> <li><strong>Target delivery.</strong> Metrics are critical for business relationships — if you’re contractually obligated to provide a certain level of performance with your API, you probably have a service level agreement (SLA) that defines those requirements. Service level indicators (SLIs) are the metrics used in these agreements.</li> </ul> <h2 id="how-to-monitor-apis-with-kentik">How to monitor APIs with Kentik</h2> <p>Let’s take a look at how Kentik addresses monitoring for APIs.</p> <h3 id="setting-up-your-data-source">Setting up your data source</h3> <p>Of course, you need a data source before getting properly started with monitoring. You can use an existing API or create a quick boilerplate system for testing purposes.
Kentik also allows you to switch to its fully functional demo mode using pre-existing data.</p> <p>You can create a simple API to test with on a Linux system. Ensure Docker is installed with the following:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">docker</span> --version</code></pre></div> <p>If Docker isn’t installed, install it via:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">sudo</span> <span class="token function">apt</span> <span class="token function">install</span> docker.io</code></pre></div> <p>You can <a href="https://www.freecodecamp.org/news/a-beginners-guide-to-docker-how-to-create-your-first-docker-application-cc03de9b639f/">create a Docker image</a> yourself or <a href="https://hub.docker.com/">download one</a> to use as a starting point.</p> <p>To <a href="https://www.howtogeek.com/devops/how-to-get-started-using-the-docker-engine-api/">expose Docker’s API to the world</a>, use:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">sudo</span> dockerd <span class="token parameter variable">-H</span> unix:///var/run/docker.sock <span class="token parameter variable">-H</span> tcp://0.0.0.0:2375</code></pre></div> <p>Connect to your server at its IP address over port 2375. Be aware that this exposes an unauthenticated control port, so only do this on a trusted test network.</p> <h3 id="setting-up-kentik">Setting up Kentik</h3> <p>Now let’s set up Kentik as an API monitoring solution.</p> <img src="//images.ctfassets.net/6yom6slo28h2/NRaAJpQtmZPpGbI33wx5o/b42600ae2449c2ca9453a75e8683e30b/api-monitoring-get-started.png" style="max-width: 700px;" class="image center" thumbnail withFrame alt="Set up Kentik for API monitoring" /> <div class="caption" style="margin-top: -35px;">Add the hostname or IP address of your API here.</div> <p><a href="https://www.kentik.com/get-started/">Sign up with Kentik</a> (you can follow that link or fill out the “Try Kentik” form at left) and log in. Follow the prompts to let Kentik guide you to the right features to set up your monitoring tasks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3vrDwrg7SRDPxRS5kAfM1t/a4ed7585cad58a2724607db121e832f3/kentik-api-setup-2.png" style="max-width: 700px;" class="image center" thumbnail withFrame alt="Use case selection options" /> <div class="caption" style="margin-top: -35px;">Select Improve Users’ Digital Experience to create a setup task for synthetic monitoring.</div> <p>You can add your Linux server as a data source here, along with whatever other data sources you want to pipe in.</p> <p>Proceed to Kentik’s dashboards for an overview of the information you’re capturing. From here, you can see everything you need to know about your API in one place. These dashboards are heavily customizable — you can create a blank one of your own or use one of many preset options. You access them through the library in the Kentik portal.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2WKIbmZIzzPGBF51C1YCCE/1738fd1b3a8b60684dee248bb54e62fe/kentik-add-dashboard-api.png" style="max-width: 600px;" class="image center" alt="Kentik add dashboard dialog" /> <div class="caption" style="margin-top: -35px;">Adding an API monitoring dashboard</div> <p>You can choose various display options for a dashboard according to what best serves your use case. Decide how often your data sources are updated and what related info is displayed. 
You can keep your dashboards private, share them with your team, or quickly create PDF reports from their data.</p> <p>Of course, your dashboards are an excellent place to monitor the status of your API. For example, if your dashboard indicates a broken endpoint, you can go back to Docker and fix the endpoint, and the dashboard will update its status appropriately.</p> <p>As mentioned earlier, you can configure how often Kentik gets new info, so ensure you have a low enough value to pick up changes promptly. Too few updates, and you won’t spot outages quickly enough. Too many, and your resource consumption will be higher. Hourly updates are acceptable in many cases, but you can keep a closer eye on critical services where needed.</p> <h3 id="running-synthetic-tests-on-your-api">Running synthetic tests on your API</h3> <p>Synthetic tests let you run continual tests on your APIs. They’re enabled via agents that connect your infrastructure to Kentik’s monitoring tools.</p> <p>To run synthetic tests on your own APIs, <a href="https://kb.kentik.com/v4/Ma07.htm#Ma07-App_Agent_Package_Install">install Kentik’s ksynth</a> agent. You can do that on Debian-based systems via:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">apt-get</span> <span class="token function">install</span> ksynth-agent</code></pre></div> <p>Or on RPM-based systems:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">yum <span class="token function">install</span> ksynth-agent</code></pre></div> <p>Once ksynth is installed, start the agent via:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">systemctl start ksynth-agent</code></pre></div> <p>To set up a ksynth agent for Docker, create a folder for it and then run:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">docker</span> run <span class="token parameter variable">-it</span> <span class="token parameter variable">--rm</span> <span class="token parameter variable">-v</span> /path/to/local/directory:/var/lib/ksynth-agent kentik/ksynth-agent:latest ksynth-agent <span class="token parameter variable">-u</span></code></pre></div> <p>You also need to register and activate the agent. <a href="https://kb.kentik.com/v4/Ma07.htm#Ma07-Register_and_Activate_ksynth">Consult Kentik’s documentation</a> if you need help with that — the process varies by system.</p> <p>To allow ksynth agents to communicate with Kentik, you need to <a href="https://kb.kentik.com/v4/Ma01.htm#Ma01-About_Synthetics_Agents">let them through your firewall</a>. That means whitelisting the following:</p> <ul> <li><strong>IP addresses</strong> <ul> <li><code class="language-text">208.76.14.0/24</code></li> <li><code class="language-text">2620:129::/44</code></li> </ul> </li> <li><strong>Domains</strong> <ul> <li><code class="language-text">flow.kentik.com</code></li> <li><code class="language-text">api.kentik.com</code></li> </ul> </li> </ul> <p>To create a test, navigate to a Kentik dashboard and click <strong>Add</strong> next to <strong>Synthetic Test View</strong>. 
In the panel that appears, select your test type, display type, and so on, before choosing the data source for your test.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/31yJ15QY6khVe01ygFZ5KY/77738507c4e5afd5c5588737d0b65339/add-synthetic-test-dialog-api.png" style="max-width: 500px;" class="image center" alt="Kentik add synthetic test panel" /> <div class="caption" style="margin-top: -35px;">Adding a synthetic HTTP/API test in Kentik</div> <p>This example selects an HTTPS/API test using CloudFront data and displays it as a table. Here’s how that looks on the dashboard:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5iYOUCbEYWy7YV7NmTaG25/88d2e4ff05b532b1198e08972b2f3f73/kentik-synthetic-test-results-panel-api-5.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="API Synthetic Test Results" /> <div class="caption" style="margin-top: -35px;">API monitoring: Synthetic test results in Kentik.</div> <p>This shows metrics for various AWS addresses. From this panel, you can see the certificate expiry and the status code. Both are easy metrics to monitor and helpful for generating alerts when an outage needs fixing or a certificate is about to expire.</p> <p>The panel also shows performance metrics, such as domain lookup time, connection time, response time, and HTTP latency. You can set thresholds to determine what metrics are acceptable and when you should receive an alert.</p> <h3 id="setting-useful-metrics-in-a-kentik-dashboard">Setting useful metrics in a Kentik dashboard</h3> <p>Kentik can help you monitor a wide variety of metrics. To add a new metric, click <strong>Add</strong> next to <strong>Data Explorer View</strong>. On the <strong>Add Data Panel</strong> window that appears, select from various visualizations, data sources, and metrics.</p> <p>You can also customize the metrics, add filters, and change the timescale over which they’re collected. Let’s go through some of them:</p> <ul> <li><strong>Latency.</strong> A key metric that tells you how long users have to wait before getting a response from the server. Select server latency from the <strong>Metrics</strong> section of the <strong>Add Data Panel</strong> window.</li> </ul> <img src="https://images.ctfassets.net/6yom6slo28h2/7KAPn84gGntBXtDjwPSmVL/e3911d589553552b7000c932666ab360/server-latency-graph-api.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Kentik's Add Data Panel, showing a server latency graph being added" /> <div class="caption" style="margin-top: -35px;">Adding a server latency graph</div> <ul> <li><strong>Throughput.</strong> Throughput shows how much work your server is doing, i.e., the total rate of traffic your services handle in bits per second.</li> </ul> <img src="https://images.ctfassets.net/6yom6slo28h2/2z4ozRe8AcfsFnyXUw95Zw/79dc066d5f65396c27b5e012c306b505/api-performance-metrics-throughput.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Panel on Kentik dashboard showing throughput API performance in bits per second" /> <div class="caption" style="margin-top: -35px;">Kentik API performance dashboard showing throughput in bits per second.</div> <ul> <li><strong>Jitter.</strong> A measure of <a href="https://www.kentik.com/kentipedia/understanding-latency-packet-loss-and-jitter-in-networking/">how latency varies over time</a>. 
Jitter can cause packet loss if packets are received out of order, or higher latency if networks adjust to compensate for the varying speed.</li> </ul> <img src="https://images.ctfassets.net/6yom6slo28h2/2wLwy2vKflwi2QvGTv4haL/740219112bcdf2bf9edc36d175266156/api-performance-metrics-jitter.png" style="max-width: 500px;" class="image center" thumbnail alt="Kentik synthetic dashboard, showing jitter and other API metrics" /> <div class="caption" style="margin-top: -35px;">Kentik Synthetics dashboard showing latency, packet loss, and jitter API metrics</div> <p>Other metrics you could add to your dashboard include the following:</p> <ul> <li><strong>Response time.</strong> How long a service takes to respond following a call. The lower, the better!</li> <li><strong>Error rate.</strong> Divide the failed calls by the total calls made to calculate the percentage of calls that failed (for example, 5 failed calls out of 1,000 is a 0.5% error rate). Any errors are worth investigating, but calls with high error rates are clearly more urgent.</li> <li><strong>Uptime.</strong> Track how long service outages last, divide this by the time you’ve been monitoring to get your downtime percentage, and subtract that from 100% to calculate your API’s uptime. This can be key to delivering on your SLAs.</li> </ul> <h3 id="alerts-and-reporting-for-api-monitoring">Alerts and reporting for API monitoring</h3> <p>Now that you’re tracking various metrics, what can you do with that information? Let’s set up alerts so Kentik can inform you about any problems it detects.</p> <p>Kentik can alert you via various channels, including email, JSON, Slack, Microsoft Teams, custom webhooks, and more. You can configure who is notified and which channels are used on Kentik’s notification settings page.</p> <img src="//images.ctfassets.net/6yom6slo28h2/30NzhWbnDbrrgqyb0K0bGx/cb61b9a921606afde78169aaa0642b26/kentik-alerts-settings-api-monitoring-9.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Kentik notification settings screen, showing the many available channels, with a list of notification targets" /> <div class="caption" style="margin-top: -35px;">Kentik notification settings screen, showing available channels and a list of notification targets</div> <p>To view existing alerts, click <strong>Alerting</strong> from Kentik’s menu. You’ll see a list of current alerts and their related metrics. You can filter these in various ways, such as by type, status, or severity. That’s extremely helpful as the number of your API monitoring alerts increases.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4cuicbSGCJYNn4HtFRlvio/e125392513606f023ec93b56ada1d858/list-of-alerts-in-kentik-api-monitor-10.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="List of alerts in Kentik" /> <div class="caption" style="margin-top: -35px;">List of alerts in Kentik</div> <p>If you click <strong>Configure Alert Policies</strong>, you’ll see a list of policies available to you. Click the box next to each policy’s status to enable or disable it. On the <strong>Policy Templates</strong> page, you can set your own policies, defining the conditions for alerts yourself.</p> <h2 id="what-to-look-for-when-choosing-an-api-monitoring-tool">What to look for when choosing an API monitoring tool</h2> <p>Not all monitoring tools are created equal; picking one that matches your needs is essential. Easier said than done? 
Organize your research around a few considerations to narrow your options:</p> <ul> <li><strong>Core features.</strong> Your API monitoring solution should be able to view all the networks in your infrastructure and collect all needed telemetry. 24x7 monitoring is also important — you don’t want gaps in your coverage. Instant alerts are another critical feature that helps you respond to issues in a timely manner. After all, detecting issues isn’t much help if you can’t respond to the problem quickly.</li> <li><strong>Customization and flexibility.</strong> To truly fit your use case, your solution should let you easily customize rules, notification thresholds, and alerts across multiple channels.</li> <li><strong>Usability.</strong> Usable software encourages engagement. The UI needs to be easy to use and quick to work with. You should be able to easily find the metrics you need and set alerts and thresholds painlessly.</li> <li><strong>Scalability.</strong> You want your system to keep working as your user base grows, and you don’t want to be babysitting your tools when you have cool new features to work on. Your monitoring solution should be capable of growing along with your API and not hold you back when traffic builds.</li> <li><strong>Security.</strong> Your API monitoring system will store critical data and should do so securely. You need to be able to control who can access your monitoring tool and make sure its data is secure in transmission. <a href="https://kb.kentik.com/v0/Ab03.htm">Encryption in transit</a>, as well as for stored data, is a significant feature to look for.</li> <li><strong>Data collection and analysis.</strong> Your monitoring solution should provide tools to collect and analyze your data. That should include all kinds of metrics options and a customizable dashboard where you can choose what is shown. Ideally, you should be able to build multiple dashboards for different use cases.</li> <li><strong>Third-party integrations.</strong> Your solution should work well with <a href="https://www.kentik.com/product/integrations/">the rest of your toolset</a> and not just your API. It should be able to export data to the other monitoring tools you use and to your data center. It should be compatible with different hosting providers, container tools, and collaboration tools.</li> <li><strong>Cost.</strong> Last but not least (and perhaps most obvious), your solution should fit within your budget and still provide sufficient value. You want a tool that saves your engineers time and detects problems before they become serious. Flexible pricing options are also helpful, particularly if you expect significant growth.</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>API monitoring is an essential part of service delivery. It helps you understand what’s working within your API and what needs attention. With a proper monitoring tool, you can automate many tasks that would be impractical to do manually, especially at scale. Take full advantage of your chosen solution’s power, and you’ll be able to keep your services running well, giving your users the best experience possible.</p> <p>Fulfill your SLAs with ease with Kentik, and lower your costs while you’re at it. 
To try Kentik, <a href="#signup_dialog" title="Start a Free Trial of Kentik">sign up for a free trial</a> or <a href="#demo_dialog" title="Request a Demo of Kentik">request a demo</a> for a personalized walkthrough on how it can become your perfect API monitoring solution.</p><![CDATA[Multi-Cloud Made Simple: Announcing Kentik Observability Enhancements for AWS and Google Cloud]]><![CDATA[Limited visibility into network performance across multi-clouds frustrates even the best teams. That's why we're thrilled to announce enhanced AWS and GCP support for Kentik Cloud, enabling network, cloud, and infrastructure teams to rapidly troubleshoot and understand multi-cloud traffic. ]]>https://www.kentik.com/blog/multi-cloud-made-simple-observability-enhancements-for-aws-google-cloudhttps://www.kentik.com/blog/multi-cloud-made-simple-observability-enhancements-for-aws-google-cloud<![CDATA[Rosalind Whitley]]>Thu, 08 Jun 2023 04:00:00 GMT<p>Enterprises migrate to multi-cloud networks not because they want to, but because they have to. There’s an acquisition. An initiative to reduce costs. A mandate for redundancy.</p> <p>Regardless of the catalyst (and despite a number of benefits), one outcome is always the same: limited visibility into end-to-end performance across AWS, Azure, GCP, and on-prem.</p> <p>Today we are thrilled to announce updates to <a href="https://www.kentik.com/product/cloud/">Kentik Cloud</a> that enable network, cloud, and infrastructure teams to rapidly troubleshoot and understand multi-cloud traffic — and futureproof their organizations against the rising network complexity that comes with cloud adoption.</p> <p>Two exciting new capabilities help you quickly answer any question about your multi-cloud network:</p> <ul> <li> <p>Kentik Cloud users can now <strong>collect, analyze, and visualize flow logs generated on AWS Transit Gateways</strong>. (This is in addition to cloud VPC flow logs and other Kentik data sources for cloud and hybrid environments: NetFlow, sFlow, IPFIX, J-Flow, and sFlow-RT logs.)</p> </li> <li> <p>Kentik Cloud users can now <strong>access the new Kentik Map for Google Cloud</strong> to automatically visualize detailed Google Cloud and hybrid cloud infrastructure topology.</p> </li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/7n9r90LJZjJcAqHaSUwuhT/7970c415281f369458664710f0c1cb7e/harden-zero-trust.gif" style="max-width: 800px;" class="image center" withFrame thumbnail alt="AWS Flow Logs - harden zero-trust policy" /> <div class="caption" style="margin-top: -35px;">See cloud security policy in action, find gaps in network security groups, and easily refine cloud policies.</div> <p>With these enhancements, Kentik Cloud enables you to monitor network traffic, analyze performance metrics, and detect security threats seamlessly across your entire hybrid cloud network.</p> <p>In this blog post, we will dive into the details of these new Kentik Cloud capabilities and explore how they can uplevel your network monitoring and troubleshooting.</p> <h2 id="key-benefits-of-kentik-clouds-enhanced-aws-and-google-cloud-support">Key benefits of Kentik Cloud’s enhanced AWS and Google Cloud support</h2> <ul> <li><strong>Centralized visibility</strong>: Kentik Cloud aggregates flow logs from AWS, Google Cloud, and every environment in your hybrid cloud, giving you a comprehensive view of network traffic across multiple VPCs and on-premises networks. 
Centralized logs from AWS Transit Gateways don’t require access to flow logs for every attached VPC.</li> <li><strong>Advanced analysis</strong>: With Kentik’s powerful analytics engine, you can perform an in-depth analysis of flow logs from any cloud. Detect patterns, identify trends, and uncover anomalies in your network traffic. Use custom queries and filters to drill down into specific traffic patterns or attributes for detailed investigation.</li> <li><strong>Real-time monitoring and alerting</strong>: Kentik Cloud processes flow logs in near real-time, enabling proactive monitoring and rapid issue detection. Set up custom alerts based on specific network conditions or security events to receive instant notifications when anomalies occur in any environment.</li> <li><strong>Traffic optimization and performance analysis</strong>: By analyzing and visualizing flow logs from your cloud resources, Kentik Cloud helps you optimize network performance. Identify congested links, understand application-level traffic patterns, and make data-driven decisions to improve resource allocation and network efficiency.</li> <li><strong>Enhanced security insights</strong>: Flow log analysis provides valuable insights into potential security threats and network vulnerabilities. Kentik Cloud shines a light in the cloud to detect suspicious traffic patterns, identify unauthorized access attempts, and strengthen your security posture.</li> </ul> <h2 id="what-are-flow-logs-and-aws-transit-gateways">What are Flow Logs and AWS Transit Gateways?</h2> <p><a href="https://www.kentik.com/resources/aws-vpc-flow-logs-for-kentik/">Flow logs are a valuable source of network traffic information in AWS</a>. They capture detailed metadata about the traffic flowing through various components of your network, such as VPCs, subnets, and network interfaces. By analyzing flow logs, you can gain insights into network behavior, detect anomalies, monitor performance, and improve security.</p> <p>AWS Transit Gateways act as a centralized hub for connecting multiple VPCs and on-premises networks. They simplify network architecture and enable efficient traffic routing between different environments. By consuming flow logs generated on AWS Transit Gateways, Kentik Cloud provides a unified view of traffic across VPCs and facilitates centralized monitoring and analysis.</p> <h2 id="how-can-i-use-kentik-clouds-aws-transit-gateway-flow-log-support">How can I use Kentik Cloud’s AWS Transit Gateway Flow Log support?</h2> <p>Analyzing your transit gateway flow logs in Kentik Cloud can help you to troubleshoot cloud and hybrid network problems, plan network capacity based on past utilization patterns, detect suspicious traffic, and audit compliance with security policies. Let’s dig into the details.</p> <h3 id="troubleshooting-cloud-and-hybrid-network-problems">Troubleshooting cloud and hybrid network problems</h3> <p>Transit Gateway Flow Logs allow you to analyze and detect patterns or anomalies across multiple VPCs, making it easier to identify and troubleshoot performance-impacting issues that span multiple environments. 
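Getting these logs flowing is an AWS-side prerequisite: you publish them to an S3 bucket that Kentik is then configured to ingest (see the setup steps later in this post). As a rough sketch using the AWS CLI, where the gateway ID and bucket name are placeholders:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Publish Transit Gateway Flow Logs to an S3 bucket for later ingestion
# tgw-0123456789abcdef0 and my-tgw-flow-logs are placeholder names
aws ec2 create-flow-logs \
  --resource-type TransitGateway \
  --resource-ids tgw-0123456789abcdef0 \
  --log-destination-type s3 \
  --log-destination arn:aws:s3:::my-tgw-flow-logs</code></pre></div> <p>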
They capture detailed information about network traffic, including:</p> <ul> <li>Source and destination IP addresses</li> <li>Ports</li> <li>Protocols</li> <li>Packet counts</li> </ul> <p>By analyzing these logs, you can gain visibility into the volume, patterns, and characteristics of traffic, such as whether it’s being dropped or incorrectly routed, to identify connectivity issues.</p> <p>Analyzing Transit Gateway Flow Logs can also help in identifying performance bottlenecks. For example, you can recognize congested paths by examining which types of traffic use the same transit gateway as an important app or service, and assess latency and performance during high-traffic events. This information enables you to optimize network routing and adjust capacity, to improve network performance.</p> <h3 id="network-capacity-planning">Network capacity planning</h3> <p>Transit Gateway Flow Logs contain information about network traffic volumes and patterns over time. <a href="https://www.kentik.com/resources/video-data-explorer-in-kentik-portal/">Kentik’s Data Explorer</a> allows you to compare performance across time periods according to the metrics and attributes you care about.</p> <p>By analyzing and comparing historical logs, you can identify usage trends, peak traffic periods, and forecast future network capacity requirements. This helps in effective capacity planning and resource allocation; maybe you need to upgrade capacity or redistribute traffic among new VPCs or Direct Connects to deliver acceptable performance for peak traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4FuaG7ZhRtPwFx6v7tL0g5/65654029abbde0d348d5bfc7ad147c2b/migrated-application-traffic-volume.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="AWS Flow Logs - plan migrations based on traffic and dependencies" /> <div class="caption" style="margin-top: -35px;">Plan migrations based on traffic and dependencies, and drive decisions on capacity, latency, geo-location, and policy.</div> <h3 id="detecting-suspicious-traffic">Detecting suspicious traffic</h3> <p>Transit Gateway Flow Logs provide visibility into network traffic, allowing you to detect potential security threats or anomalies. In the logs, you can identify suspicious or unauthorized access attempts, unusual traffic patterns, or communication with blacklisted IP addresses. Combined with Kentik Alerts, this helps you detect and respond promptly to security incidents.</p> <p>For example, an audit may be warranted if unexpected TCP traffic is flowing into a MariaDB instance but isn’t entering on port 3306. Similarly, connection requests on port 22, where teams don’t need SSH access to operate a service, may be a red flag that a bad actor is trying to access other resources on your network. Alerts can flag this activity before it impacts your organization.</p> <p>Port scanning detection and alerting can also help to prevent network threats. Kentik can identify port scanning behavior by spotting unusual volumes of requests that originate from a single IP but have many destinations across your network. Centralized network observability and Transit Gateway Flow Log support make it simple to set up Alerts for this behavior, and make it easy to understand flagged activity.</p> <h3 id="auditing-compliance">Auditing compliance</h3> <p>Transit Gateway Flow Logs can assist with auditing network communications and upholding compliance requirements. 
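As a simplified illustration of the kind of question an audit asks, once log objects are downloaded from the S3 bucket, even a shell one-liner can tally SSH attempts. Flow logs delivered to S3 are gzip-compressed, and the field position below is an assumption, so adjust it to match your configured log format:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Count flow records with destination port 22 (SSH) in downloaded logs
# Assumes space-separated records with dstport as the seventh field;
# verify against your flow log format before relying on this
zcat *.log.gz | awk '$7 == 22 { n++ } END { print n+0 " SSH flow records" }'</code></pre></div> <p>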
By analyzing the logs, you can granularly track historical network activity, verify compliance with security policies, generate audit trails, and import enriched flow data to your SIEM to add context. This information can be useful for meeting regulatory requirements and conducting post-incident investigations.</p> <p>For example, if one of your company’s AWS accounts is involved in a security incident, you may need to answer the question, “Which VPC resources in other AWS accounts are attached to the Transit Gateway impacted by this incident?” to understand the blast radius. Using Transit Gateway Flow logs, you can easily answer this question by scanning flows from the affected accounts and VPCs in one place to check for unauthorized activity.</p> <h2 id="kentik-map-for-google-cloud">Kentik Map for Google Cloud</h2> <p>The Kentik Map visualizes every aspect of network infrastructure, both on-prem and cloud, to enable an instant understanding of how resources connect and how that impacts traffic patterns, network health, and customer experience. This searchable, interactive tool displays dynamic traffic, routing, and interconnect topology and metadata for all current and historical resources that customers own in Google Cloud. The Kentik Map automatically updates in real-time as networks change to provide maintenance-free documentation that’s always up to date – our customers love using it for:</p> <ul> <li>Planning cloud migrations</li> <li>Investigating connectivity and device issues</li> <li>Onboarding new employees</li> </ul> <p>One unique benefit of Google Cloud is that its VPCs (Virtual Private Clouds) can span multiple regions, which limits overhead associated with scaling applications and services across regions and globally. To help customers make the most of this benefit, <a href="/solutions/usecase/google-cloud-platform/">Kentik Map for Google Cloud</a> groups VPCs by region, displaying the subnets in each region beneath each parent VPC. In addition, the map visualizes Dedicated Interconnect attachments, VM interfaces, and VPN gateways with their associated on-prem and cloud routers. It surfaces static link paths for on-prem routers and external VPN gateways, as well as traffic links generated from subnets and internet types, for easy access to the specifics users need to answer questions and solve network problems.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3S9NtoYaEGJBOLg8hpmm2V/4cfb09ef47dae29246fd2c88fca20b00/google-cloud-ensure-smooth-migrations.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Visibility into Google Cloud with Kentik Map" /> <div class="caption" style="margin-top: -35px;">Visibility into Google Cloud with the Kentik Map.</div> <h2 id="how-do-i-enable-transit-gateway-flow-log-analysis-in-kentik-cloud">How do I enable Transit Gateway Flow Log Analysis in Kentik Cloud?</h2> <p>To leverage Kentik Cloud’s new capability for consuming and analyzing flow logs on AWS Transit Gateways, follow these steps:</p> <ol> <li><strong>Integration setup</strong>: Connect your AWS account to Kentik Cloud by providing the necessary permissions to access flow logs. Kentik Cloud securely retrieves flow logs from your AWS environment.</li> <li><strong>Flow Log ingestion</strong>: Kentik Cloud automatically ingests AWS Transit Gateway Flow Logs from a configured S3 bucket, ensuring a seamless data collection process. 
The ingestion process includes parsing and enriching flow log data for deeper analysis.</li> <li><strong>Visualization and analysis</strong>: Once flow logs are ingested, use Kentik Cloud’s intuitive user interface to explore and analyze your network traffic. Create custom dashboards, visualizations, and reports to gain insights into your AWS network performance and behavior.</li> <li><strong>Alerts and automation</strong>: Set up proactive alerts based on specific traffic conditions or security events. Configure automated actions, such as sending notifications, triggering workflows, or scaling resources based on defined thresholds.</li> </ol> <p>To learn more, visit our <a href="https://www.kentik.com/solutions/usecase/amazon-web-services/">AWS Cloud Observability page</a> for a fully-featured demo video of Kentik Cloud, and check out our <a href="https://www.kentik.com/solutions/usecase/google-cloud-platform/">Google Cloud Observability page</a> for a quick tour of the new Kentik Map for Google Cloud.</p> <p>Ready to observe your networks? <a href="https://www.kentik.com/get-started/">Try Kentik free for 30 days</a>, or <a href="https://www.kentik.com/go/get-started/demo/">contact our team</a> for a personalized demo.</p><![CDATA[A Brief History of the Internet’s Biggest BGP Incidents]]><![CDATA[Stretching back to the AS7007 leak of 1997, this comprehensive blog post covers the most notable and significant BGP incidents in the history of the internet, from traffic-disrupting BGP leaks to crypto-stealing BGP hijacks.]]>https://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidentshttps://www.kentik.com/blog/a-brief-history-of-the-internets-biggest-bgp-incidents<![CDATA[Doug Madory]]>Tue, 06 Jun 2023 04:00:00 GMT<p><em>In the summer of 2022, I joined a team of BGP experts organized by the <a href="https://www.bitag.org/index.php">Broadband Internet Technical Advisory Group (BITAG)</a> to draft a <a href="https://www.bitag.org/Routing_Security.php">comprehensive report</a> covering the security of the internet’s routing infrastructure. The section that I was primarily responsible for covered the history of notable BGP incidents, a topic I have written about extensively throughout my career in the internet industry.</em></p> <p><em>Below is an edited version of my take on the internet’s most notable BGP incidents. <a href="https://henrybirgelee.com">Henry Birge-Lee of Princeton</a> was the primary author of a large portion of the section on the attacks on cryptocurrency services.</em></p> <h2 id="bgp-routing-security-incidents-in-the-wild">BGP routing security incidents in the wild</h2> <p>BGP routing incidents can be problematic for a range of reasons. In some cases, they simply disrupt the flow of legitimate internet traffic while in others, they can result in the misdirection of communications, posing a security risk from interception or manipulation. Routing incidents occur with some regularity and can vary greatly in operational impact. In this blog post, I will address selected specific incidents which have demonstrated the range and gravity of threats to the stability and security of the internet’s routing system.</p> <h3 id="disruptions-and-attacks-caused-by-bgp-incidents">Disruptions and attacks caused by BGP incidents</h3> <p>In BGP parlance, the term “routing leak” broadly refers to a routing incident in which one or more BGP advertisements are propagated between ASes (Autonomous Systems) in a way they were not intended to. 
Often these incidents occur accidentally, but malicious actors may also attempt to camouflage intentional attacks under the guise of apparent accidents.</p> <p>In 2016, <a href="https://datatracker.ietf.org/doc/html/rfc7908">RFC 7908</a> introduced a more complex taxonomy of BGP routing leaks, but in this post, I will simply employ two main categories of error: origination and AS path.</p> <ul> <li>A mis-origination occurs when an AS originates (announces with its ASN as the origin) a new advertisement of a route to an IP address block over which it does not possess legitimate control, consequently soliciting traffic destined to those IP addresses.</li> <li>An AS path error occurs when an AS inserts itself as an illegitimate intermediary into the forwarding path of traffic bound for a different destination.</li> </ul> <p>This distinction is important because the two types of error require different mitigation strategies.</p> <h4 id="what-is-the-difference-between-a-bgp-hijack-and-a-bgp-route-leak">What is the difference between a BGP hijack and a BGP route leak?</h4> <p>Generally, the phrase “BGP hijack” connotes malicious intent, whereas a “BGP route leak” is assumed to be accidental. Complicating matters, some BGP incidents involve both intentional and accidental components, and for some we simply don’t know whether they were intentional. Experts in this area can hold varying opinions about what constitutes a BGP leak versus a BGP hijack.</p> <div as="Promo"></div> <h3 id="bgp-origination-errors">BGP origination errors</h3> <p>The <a href="https://en.wikipedia.org/wiki/AS_7007_incident">AS7007 incident</a> in April 1997 was arguably the first major internet disruption caused by a routing leak. In this incident, a software bug caused a router to announce a large part of the IP address ranges present in the global routing table as if they were originated by AS7007. This origination leak was compounded by the fact that the routes were more-specifics (i.e., smaller IP address ranges) and, therefore, higher priority according to the BGP selection algorithm.</p> <p>An additional factor contributing to the degree of disruption was the fact that the leaked routes persisted even after the problematic router was disconnected from the internet. During the leak, a large portion of the internet’s traffic was redirected to AS7007, where it overwhelmed its networking equipment and was dropped.</p> <p>The AS7007 incident was followed soon after by a <a href="https://mailman.nanog.org/pipermail/nanog/1997-October/123970.html">massive leak from AS701</a>, which was <a href="https://en.wikipedia.org/wiki/UUNET">UUNet</a> at the time. In this incident, AS701 originated all of the IPv4 space contained in 128.0.0.0/9 as /24s, disrupting the flow of traffic to a large portion of the global routing table.</p> <p>In subsequent years, other similarly large origination leaks have occurred, disrupting internet communications. These incidents include the <a href="https://archive.nanog.org/meetings/nanog34/presentations/underwood.pdf">Turk Telecom leak of December 2004</a>, the <a href="https://www.computerworld.com/article/2516953/a-chinese-isp-momentarily-hijacks-the-internet--again-.html">China Telecom leak of April 2010</a>, and <a href="https://www.bgpmon.net/massive-route-leak-cause-internet-slowdown/">Telecom Malaysia leak of June 2015</a>. 
Each of these disruptions lasted less than an hour and appeared indiscriminate in the address blocks affected.</p> <p>Large-scale origination leaks like these have become less frequent in recent years due to increases in the automation of router configuration in topologically-central networks. Two competing methodologies, RPSL and RPKI, are used to inform the defensive configuration of routers. In both cases, information pairing IP address blocks with authorized origin ASNs (Autonomous System Numbers) is made public, and is distilled by the operators of most large networks into “filter lists,” which block the assimilation of nonconforming BGP route advertisements into local routing tables.</p> <p>Origination errors can also include incidents that weren’t completely accidental, more commonly referred to as BGP hijacks. Perhaps the most famous BGP hijack was the <a href="https://www.wired.com/2008/02/pakistans-accid/">incident in February 2008</a> involving the state telecom of Pakistan, PTCL, and YouTube. In that instance, the government of Pakistan ordered access to YouTube to be blocked in the country due to a video it deemed anti-Islamic.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/771pth2ZCilVLDZ1y7UxrW/698075713b4196f9f305e43df4a7a822/diagram-pakistan-hijack-youtube.png" style="max-width: 500px;" class="image center" alt="Pakistan Telecom hijack of YouTube" /> <div class="caption" style="margin-top: -35px;">Diagram of Pakistan Telecom hijack of YouTube in 2008 (<a href="https://dl.acm.org/doi/fullHtml/10.1145/2668152.2668966">source</a>)</div> <p>To implement the block, PTCL announced more-specifics of YouTube’s BGP routes to intentionally hijack Pakistan’s traffic to the video streaming service. Once the traffic was hijacked, PTCL’s goal was to black hole it, preventing Pakistanis from being able to access YouTube. However, things went downhill when PTCL passed these routes to its international transit providers, who carried the routes around the world, blocking YouTube for a large portion of the global internet.</p> <p>Since the PTCL-YouTube hijack, there have been other instances of localized traffic manipulation implemented in BGP leaking out to the internet. In 2017, Russian state telecom Rostelecom leaked out a <a href="https://www.bgpmon.net/bgpstream-and-the-curious-case-of-as12389/">curious set of routes</a>, including those from major financial institutions.</p> <p>During both the internet crackdown following the <a href="https://www.manrs.org/2021/02/did-someone-try-to-hijack-twitter-yes/">military coup in Myanmar in 2021</a> and the <a href="https://arstechnica.com/information-technology/2022/03/absence-of-malice-russian-isps-hijacking-%20of-twitter-ips-appears-to-be-a-goof/">Russian crackdown of social media</a> following its invasion of Ukraine in 2022, telecoms in each of these countries attempted to block access to Twitter using a BGP hijack to black hole traffic. 
In each case, the intentionally hijacked BGP route was <em>unintentionally</em> propagated onto the internet, affecting Twitter users outside of the originating countries.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6unJtS5WDjRMuU1y5GXiNE/c1b96d2ae6db32f1e6e60c99acc44fef/russian-bgp-hijack-twitter.png" style="max-width: 600px;" class="image center" withFrame alt="Russian BGP hijack of Twitter as seen in the Kentik portal" /> <div class="caption" style="margin-top: -35px;">Kentik visualization of Russian BGP hijack of Twitter in February 2022 (<a href="https://storage.googleapis.com/site-media-prod/meetings/NANOG86/4493/20221017_Madory_Internet_Impacts_Due_v1.pdf">source</a>)</div> <p>In 2008, researchers outlined how <a href="https://www.wired.com/2008/08/revealed-the-in/">BGP could be manipulated</a> to conduct a man-in-the-middle attack over the internet. The first documented case of a BGP-based man-in-the-middle attack like the one outlined in 2008 was <a href="https://www.wired.com/2013/12/bgp-hijacking-belarus-iceland/">discovered in 2013, originating in Belarus</a> and targeting the networks of major US credit card companies and governments worldwide.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5Q4UwH0mUUCAfE64xI0ZPJ/8050497404ae70f3f39cb4569c674af8/traffic-misdirection-bgp-based-mitm.png" style="max-width: 800px;" class="image center" alt="Diagram of traffic misdirection due to BGP-based MITM in 2013" /> <div class="caption" style="margin-top: -35px;">Diagram of traffic misdirection due to BGP-based MITM in 2013 (<a href="https://www.wired.com/2013/12/bgp-hijacking-belarus-iceland/">source</a>)</div> <p>During a 6-day period in August 2013, spyware service provider <a href="https://arstechnica.com/information-technology/2015/07/hacking-team-orchestrated-brazen-bgp-hack-to-hijack-ips-it-didnt-own/">Hacking Team conducted BGP hijacks</a> on behalf of the Special Operations Group of the Italian National Military Police, according to leaked documents revealed during a breach of Hacking Team’s network.</p> <p>And finally, in 2016, the security company Backconnect <a href="https://mailman.nanog.org/pipermail/nanog/2016-September/087902.html">publicly defended a BGP hijack </a>it admitted to performing in order to regain control of a botnet server responsible for DDoS attacks. <a href="https://krebsonsecurity.com/2016/09/ddos-mitigation-firm-has-history-of-hijacks/">Researchers subsequently found </a>that the DDoS mitigation firm had been involved in numerous prior BGP hijacks and had been utilizing a DDoS-for-hire service to drum up business.</p> <h3 id="bgp-as-path-errors">BGP AS path errors</h3> <p>Not all routing incidents involve the perpetrator specifying its own ASN as the origin of the erroneous route. In November 2018, MainOne, a large telecommunications company in Nigeria, <a href="https://www.internetsociety.org/blog/2018/11/route-leak-caused-a-major-google-outage/">leaked routes received from a number of its peers</a>, including major content delivery networks, to its upstream transit providers.</p> <p>One of MainOne’s transit providers, China Telecom, failed to filter these incoming erroneous announcements, integrated them into its own routing tables, and proceeded to propagate them onward to its many customers and peers. Consequently, a significant portion of internet traffic bound for the victim networks was misdirected through China. 
Shortly afterwards, <a href="https://twitter.com/Mainoneservice/status/1062321496838885376">MainOne confirmed the leak </a>was caused by its mistaken router configuration. Despite the error, misdirected traffic could still have been subject to interception or manipulation.</p> <p>In June 2019, <a href="https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/">Allegheny Technologies leaked thousands of routes</a> learned from one transit provider (DQE Communications) to another, Verizon. The routes that Allegheny leaked included many more-specifics that had been generated by a route optimizer employed by DQE. The result was that these leaked more-specific routes propagated throughout the internet and misdirected substantial amounts of internet traffic to Allegheny, causing a severe disruption.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4O84DUVOb1xt25y1ffpJzX/6559ed890c42951ca5bda5ae878598ce/allegheny-tech-bgp-leak.png" style="max-width: 600px;" class="image center" alt="Allegheny Tech BGP leak" /> <div class="caption" style="margin-top: -35px;">Diagram of the Allegheny Technologies BGP leak of June 2019 (<a href="https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/">source</a>)</div> <p>And finally, for a period lasting more than two years, <a href="https://arstechnica.com/information-technology/2018/11/strange-snafu-misroutes-%20domestic-us-internet-traffic-through-china-telecom/">China Telecom leaked routes from Verizon’s Asia-Pacific network</a> that were learned through a common South Korean peer AS. The result was that a portion of internet traffic from around the world destined for Verizon Asia-Pacific was misdirected through mainland China. Without this leak, China Telecom would have only been in the path to Verizon Asia-Pacific for traffic originating from its customers in China. Additionally, for ten days in 2017, Verizon passed its US routes to China Telecom through the common South Korean peer, causing a portion of US-to-US domestic internet traffic to be misdirected through mainland China. 
There are technical proposals such as <a href="https://datatracker.ietf.org/doc/draft-ietf-sidrops-aspa-verification/">Autonomous System Provider Authorization</a> (ASPA) that are in discussion, but no internet-wide mechanism exists presently to eliminate these types of incidents.</p> <h3 id="attacks-on-cryptocurrency-services">Attacks on cryptocurrency services</h3> <p>This section focuses on intentional BGP incidents (all of which were erroneous originations) that enabled larger attacks to successfully steal cryptocurrency, a particularly lucrative target.</p> <p>In 2014, <a href="https://www.secureworks.com/research/bgp-hijacking-for-cryptocurrency-profit">BGP hijacks were used to intercept</a> unprotected communication between Bitcoin miners and mining pools. This allowed an adversary to obtain bitcoin that should have been allocated to the mining pool. While this incident serves as an example of a BGP hijack targeting behind-the-scenes communication of cryptocurrency mining, more recent attacks have used BGP to attack cryptocurrencies with a more direct approach: stealing currency from users of online cryptocurrency wallets.</p> <p>In 2018, attackers <a href="https://www.theregister.com/2018/04/24/myetherwallet_dns_hijack/">employed a BGP hijack</a> that redirected traffic to Amazon’s authoritative DNS service. Having hijacked the DNS traffic, the adversary answered DNS queries for the web-based cryptocurrency wallet “myetherwallet.com” with a malicious IP address. Users that received this erroneous DNS response were directed to an imposter “myetherwallet.com” website. Some users entered their login credentials, which were then stolen by the adversary, along with the contents of their cryptocurrency wallets.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5k0P7e6IdFvcPyvbTdOeiU/7df1cf68085c2af5ee29c46fc71fcd87/myetherwallet-attack.png" style="max-width: 800px;" class="image center" alt="BGP and DNS hijacks targeting myetherwallet.com" /> <div class="caption" style="margin-top: -35px;">Diagram of the BGP and DNS hijacks targeting myetherwallet.com (<a href="https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services/">source</a>)</div> <p>While more advanced DNS security measures (e.g., DNSSEC) could have prevented this attack, the primary protection in place was the Transport Layer Security (TLS) protocol, which requires all connections to be encrypted. When TLS establishes an encrypted connection, the server must present a valid certificate that vouches for the server’s identity. Because MyEtherWallet did use TLS, users that were directed to the imposter site were presented with a prominent warning that their connection might be under attack. Despite this, many users clicked past the warning, and the adversary amassed <a href="https://www.theverge.com/2018/4/24/17275982/myetherwallet-hack-bgp-dns-hijacking-stolen-ethereum">$17 million in the cryptocurrency Ethereum</a>.</p> <p>While the 2018 attack was quite effective and demonstrated the viability of BGP attacks against cryptocurrency at a large scale, there were some silver linings. In particular, communication was still (at least in theory) protected by the TLS protocol, which led to the security warning. In the majority of cases, the proper behavior for a TLS connection when it gets an untrusted certificate is to abort the connection, and newer versions of Firefox <a href="https://support.mozilla.org/en-%20US/questions/1175070">do not allow users to click past</a> TLS certificate warnings. 
Additionally, had the website used DNSSEC to secure its DNS traffic, the attack would not have succeeded.</p> <p>However, both of these security technologies were completely bypassed in an attack in 2022 on the Korean cryptocurrency exchange KLAYswap. As <a href="https://henrybirgelee.com">Henry Birge-Lee of Princeton</a> described in his <a href="https://freedom-to-tinker.com/2022/03/09/attackers-exploit-fundamental-flaw-in-the-webs-security-to-steal-2-million-in-cryptocurrency/">write-up</a>, the attack on KLAYswap exploited several vulnerabilities of KLAYswap’s cryptocurrency exchange web app.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4e5bdNcFlBHQjl3eMbwJJL/e121a6bd7b09c15e04deeff92cd7620b/klayswap-kakao.png" style="max-width: 600px;" class="image center" alt="Diagram of KLAYswap attack" /> <div class="caption" style="margin-top: -35px;">Diagram of KLAYswap attack (<a href="https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services/">source</a>)</div> <p>The adversaries used BGP to hijack the IP address of a server that belonged to Kakao Corp and was hosting a specific piece of javascript code used by the KLAYswap platform. The adversary’s objective was to serve a malicious version of this code file that would ultimately cause users of the KLAYswap platform to unknowingly transfer their cryptocurrency to the adversary’s account.</p> <p>However, like MyEtherWallet, KLAYswap and Kakao Corp were using TLS, so without the adversary presenting a valid certificate to complete the TLS connection, the adversary’s code would not be loaded. This did not stop the adversary, as it used <a href="https://www.usenix.org/conference/usenixsecurity18/presentation/birge-lee">an attack known in the research community</a> where, after launching the initial attack, it approached a trusted certificate authority (or CA, the entities that sign TLS certificates) and requested a certificate for the domain name of Kakao Corp’s server that was hosting the javascript file.</p> <p>CAs have to operate under guidelines designed to prevent the issuance of malicious certificates, which require the CA to <a href="https://cabforum.org/baseline-requirements-documents/">verify the party requesting the certificate </a>has control of the domain names in the certificate. One of the approved verification methods involves contacting the server at the domain through an unencrypted HTTP connection and verifying the presence of a specific piece of content requested by the CA. This cannot be done over an encrypted and authenticated connection, as the party requesting the certificate may be requesting a certificate for the first time.</p> <p>During the attack, when the CA went to verify the domain ownership, its request was routed to the adversary’s server because of the BGP hijack. This falsely led the CA to believe the adversary was the legitimate owner of the domain and caused it to issue a certificate to the adversary. The adversary then completed the attack by using this certificate to establish an “authenticated” connection with KLAYswap users and serve its malicious code. Ultimately <a href="https://www.bankinfosecurity.com/crypto-exchange-klayswap-loses-19m-after-bgp-hijack-a-18518">$2 million was stolen from KLAYswap users</a> over the span of several hours.</p> <p>This attack is particularly notable because it involves a BGP attack successfully exploiting a system that was compliant with current best security practices. 
Even more aggressive application-layer defenses like DNSSEC and better TLS certificate error behavior would have been ineffective at preventing this attack because the adversary did not manipulate any DNS responses and served its malicious code over a trusted encrypted connection. In the current web ecosystem, millions of other websites, including those following best practices, are vulnerable to this type of attack.</p> <p>In August 2022, cryptocurrency service Celer Bridge was <a href="https://www.coinbase.com/blog/celer-bridge-incident-analysis">attacked using a BGP hijack</a> that employed fake entries in AltDB, a free alternative to the IRR databases, as well as <a href="https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services/">forged BGP announcements</a>. By surreptitiously altering the contents of AltDB, the attacker was able to trick a transit provider into believing that a small hosting center in the UK was allowed to transit address space belonging to Amazon Web Services, which hosted Celer Bridge infrastructure. The attacker then forged the AS path of its hijack announcements to include an Amazon ASN as the origin, thereby defeating RPKI ROV. The hijack enabled the attacker to redirect cryptocurrency funds to an account it controlled.</p> <h3 id="ip-squatting">IP Squatting</h3> <p>The discussion above has focused mainly on the disruptions or security implications of misrouting IP addresses which were actively in use (i.e., routed) at the time of the leak or hijack. However, there are bad actors that announce normally unrouted IP address ranges that don’t belong to them for the purpose of evading IP-based blocklists and complicating attribution. This phenomenon is generally referred to as “IP squatting,” but since it involves unauthorized BGP announcements, it is sometimes also referred to as BGP hijacking.</p> <p>Since there is no effective legal or technical measure preventing this practice, bad actors can announce previously unused IP ranges belonging to others until networks on the internet take steps to block them for this bad behavior. In July 2018, a network that became known as the “<a href="https://blog.apnic.net/2018/07/12/shutting-down-the-bgp-hijack-factory/">BGP hijack factory</a>” was <a href="https://krebsonsecurity.com/2018/07/notorious-hijack-factory-shunned-from-web/">removed from the internet</a> through a collective effort. However, such a remediation is highly unusual and cannot be counted on to keep the practice at bay.</p> <h2 id="closing-thoughts">Closing thoughts</h2> <p>Originally composed for the <a href="https://www.bitag.org/Routing_Security.php">BITAG report on routing security</a>, the preceding paragraphs discuss only the most notable of many incidents, accidental or otherwise, involving BGP over the years. This extensive list of incidents bolsters the case that networks must take routing security seriously and implement measures to either protect themselves or other parts of the internet.</p> <p>At a minimum, we recommend using a <a href="https://www.kentik.com/blog/bgp-monitoring-from-kentik/">BGP monitoring solution</a> to make sure you are alerted when an incident such as the ones above affects IP address space belonging to your business or organization. 
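A quick spot check of where a given route stands today is possible with public data sources; for example, this sketch queries the RIPEstat API (the origin AS and prefix shown are documentation placeholders, so substitute your own):</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Ask RIPEstat whether an origin AS / prefix pair is RPKI-valid
# AS64500 and 192.0.2.0/24 are example values; use your own ASN and prefix
curl -s "https://stat.ripe.net/data/rpki-validation/data.json?resource=AS64500&#x26;prefix=192.0.2.0/24"</code></pre></div> <p>A status of “valid” in the response indicates that a matching ROA covers the announcement.</p> <p>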
<p>Additional recommended actions for routing security can be found on the website of <a href="https://www.manrs.org/">Mutually Agreed Norms for Routing Security (MANRS)</a>, which describes itself as a “global initiative that helps reduce the most common routing threats.”</p> <p>We have made <a href="https://www.kentik.com/blog/exploring-the-latest-rpki-rov-adoption-numbers/">tremendous progress</a> over the last decade. For example, we have not experienced a large-scale origination leak in many years, and that is not an accident. Many engineers at many companies have worked to improve overall routing hygiene, and we are all the beneficiaries of such work. However, there is much more to be done before we can say we’ve secured the BGP routing protocol, so we must continue to make progress on this complex and difficult task.</p><![CDATA[When Reliability Goes Wrong in Cloud Networks]]><![CDATA[Cloud network reliability has become a catch-all for four related concerns: availability, resiliency, durability, and security. In this post, we'll discuss why NetOps plays an integral role in delivering on the promise of reliability.]]>https://www.kentik.com/blog/when-reliability-goes-wrong-in-cloud-networkshttps://www.kentik.com/blog/when-reliability-goes-wrong-in-cloud-networks<![CDATA[Ted Turner]]>Thu, 01 Jun 2023 04:00:00 GMT<p>In the <a href="https://www.kentik.com/blog/practical-steps-for-enhancing-reliability-in-cloud-networks-part-1/">first part of this series</a>, I introduced network reliability as a concept foundational to success for IT and business operations. Reliability has become a catch-all for four related concerns: availability, resiliency, durability, and security. I also pointed out that because of necessary factors like redundancy, the pursuit of reliability will inevitably mean making compromises that affect a network’s cost or performance.</p> <p>As a cloud solutions architect with Kentik, I have the opportunity to work with some of the planet’s most cutting-edge, massively scaled networks. This has given me a front-row seat to innovative design and implementation strategies and, most importantly, well-intentioned solutions with unintended consequences.</p> <p>In this article, I want to underscore why <a href="https://www.kentik.com/blog/five-issues-your-netops-team-will-face-in-the-cloud/">NetOps has an integral role</a> (and more responsibility) in delivering on the promise of reliability and highlight a few examples of how engineering for reliability can make networks less reliable.</p> <h2 id="reliability-is-a-massive-burden-for-network-operators">Reliability is a massive burden for network operators</h2> <p>While DevOps teams and their SREs focus on the reliability of the application environment, an enterprise’s network concerns often extend well beyond the uptime of a suite of customer-facing web and mobile apps. For many enterprises, applications represent only a portion of a much larger reliability mandate, including offices, robotics, hardware, and IoT, and the complex networking, data, and observability infrastructure required to facilitate such a mandate.</p> <p>For NetOps, this mandate includes a wide range of tasks, including monitoring and identifying top talkers, careful capacity planning, resource availability and consumption, path analysis, security and infrastructure monitoring and management, and more. 
As I mentioned, these responsibilities fall under one of the four closely related reliability verticals: availability, resiliency, durability, and security.</p> <p>I want to look at each of these individually and examine some of the pitfall scenarios I’ve seen as teams attempt to bolster the reliability of their networks.</p> <h3 id="availability">Availability</h3> <p>The core mission of availability is uptime, and the brute force way to see this done is via redundancy, typically in the form of horizontal scaling to ensure that if a network component is compromised, there is another instance ready to continue.</p> <p>Under this model, network topology is highly variable, creating a complexity that can mask root causes and turn proactive availability configurations into a highly brittle part of the network. A single misconfiguration, such as an incorrect firewall rule or a misrouted connection, can trigger a cascade of failures. For instance, an erroneous firewall configuration might not account for a redundant router or application instance and end up blocking traffic critical to maintaining uptime.</p> <h3 id="resiliency">Resiliency</h3> <p>One resiliency strategy for NetOps teams working with the cloud to consider is multizonal deployments. In this case, zone- or region-wide disruptions of the internet or your cloud provider affect only a portion of traffic and leave safe destinations available to re-route this affected traffic. Status pages are a great way to communicate with customers or users about outages or shifts in deployment regions (you do have a status page, don’t you?).</p> <img src="//images.ctfassets.net/6yom6slo28h2/2iHkRzsHmGVJKsBNyC3zGx/d0ebf26c087245a5d8777a7f4b847c4c/google-cloud-service-health.png" style="max-width: 800px;" class="image center" alt="Screenshot of Google Cloud Service Health during a disruption" /> <p>Here are a few examples of potential unintended side effects of relying on multizonal infrastructure for resiliency:</p> <p><strong>Split-brain scenario</strong>: In a multizonal deployment with redundant components, such as load balancers or routers, a split-brain scenario can occur. This happens when communication between the zones is disrupted, leading to the independent operation of each zone. In this situation, traffic may be routed to both zones simultaneously, causing inconsistencies in data processing and potentially leading to data corruption or other issues.</p> <p><strong>Failover loops</strong>: When implementing failover mechanisms across multiple zones, there is a risk of creating a failover loop. This occurs when zones repeatedly detect failures in each other and trigger failover actions back and forth. As a result, traffic continuously switches between zones, causing unnecessary network congestion and affecting the overall performance and stability of the system.</p> <p><strong>Out-of-sync state</strong>: Maintaining a consistent state across all zones can be challenging in a network with multizonal deployments. In some cases, due to network latency or synchronization delays, different zones may have slightly different versions of the data or application state. This can lead to unexpected traffic patterns as zones exchange data or attempt to reconcile inconsistent states, potentially causing increased network traffic and delays.</p>
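<p>A common guard against the split-brain scenario described above is a majority quorum. Here is a minimal Python sketch of the rule, a deliberate simplification; real coordination services such as etcd or ZooKeeper implement it with leases and consensus protocols:</p> <pre><code>def has_quorum(reachable_zones: int, total_zones: int) -> bool:
    # A zone may act as primary only when it can reach a strict
    # majority of the deployment, itself included. Systems that skip
    # this check risk two independent primaries: split-brain.
    return reachable_zones > total_zones // 2

# Two zones, network partition: each side sees only itself.
print(has_quorum(1, 2))  # False on both sides; no side may lead
# Three zones, one isolated: the remaining pair keeps quorum.
print(has_quorum(2, 3))  # True for the majority partition
</code></pre> <p>The arithmetic also shows why two-zone deployments are particularly exposed: a clean partition leaves neither side with a majority.</p>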
<h3 id="durability">Durability</h3> <p>In the context of cloud networks, durability refers to the ability of the network to retain and protect data over an extended period of time, even in the face of hardware failures, software bugs, or other types of disruptions like attacks on the network. While many of the data-specific measures, like replication, versioning, or the use of distributed storage services like Amazon S3, fall under the purview of data engineers, it is up to NetOps to monitor and manage this infrastructure’s connections to the network.</p> <p>This is no small feat and can lead to significant overhead and resource consumption. While cloud providers invest heavily in being able to ensure their data services are highly durable (<a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DataDurability.html">S3 guarantees eleven 9s</a>, or 99.999999999%, durability over a given year), this represents only one portion of the durability story. As this data moves to and from highly durable storage services, NetOps must guarantee that the data remains secure and accurate. Replication, analysis, and data transfer all present opportunities for security threats, data integrity loss, and intense bandwidth and memory consumption.</p> <p>In distributed, service-oriented development environments, it is not uncommon for these efforts to happen primarily outside of the attention of NetOps until a problem arises. While these teams and development efforts may be distributed, the underlying network infrastructure they depend on is often shared, and that is where the trouble starts. The intense resource demands of durability efforts can create contention, which triggers latency and cascading failures. The network underneath becomes overconsumed by contention, latency, and the retries of all the cascading failures.</p> <p>I wrote an article a while ago <a href="https://www.kentik.com/blog/cascaded-lag-impact-of-latency-on-applications-user-experience/">addressing latency</a>. Contention and latency feed each other, and together they can take out many services and networking devices.</p> <h3 id="security">Security</h3> <p>There are several reliability strategies that, if not properly and carefully accounted for, can increase a network’s threat surface area. Here are two examples:</p> <ul> <li> <p><strong>Redundancy and high availability</strong>: While redundant components and geographic distribution of resources enhance reliability, multiple network entry points and load balancing mechanisms can be exploited to launch DDoS attacks, bypass security controls, or otherwise overwhelm resources.</p> </li> <li> <p><strong>Elasticity and scalability</strong>: As new instances, containers, VMs, or other network resources are dynamically added to the network, improper configurations and monitoring can create vulnerabilities for attackers to exploit.</p> </li> </ul> <h2 id="conclusion">Conclusion</h2> <p>Reliability is a cornerstone of delivering top-tier IT services and customer experiences. For NetOps, especially those responsible for the highly scaled networks found in enterprises, service providers, and telecom companies, the pursuit of reliability via availability, resiliency, durability, and security measures can introduce its own challenges. 
Handling these challenges can come down to rigorous planning, careful monitoring, and dynamic systems, but as any network specialist knows, there are always unknown unknowns.</p> <p>In this series’ next and final installment, I will examine how network observability offers an opportunity to engage with these unknown unknowns, and why it is the most comprehensive and robust path to addressing and avoiding these challenges.</p><![CDATA[Automating Capacity Planning for IP Networks: A Journey into the Future]]><![CDATA[By automating capacity planning for IP networks, we can achieve cost reduction, enhanced accuracy, and better scalability. This process requires us to collect data, build predictive models, define optimization objectives, design decision algorithms, and carry out consistent monitoring and adjustment. However, the initial investment is large and the result will still require human oversight.]]>https://www.kentik.com/blog/automating-capacity-planning-for-ip-networks-a-journey-into-the-futurehttps://www.kentik.com/blog/automating-capacity-planning-for-ip-networks-a-journey-into-the-future<![CDATA[Nina Bargisen]]>Wed, 31 May 2023 04:00:00 GMT<p>Capacity planning for IP networks is an essential and continuous process that ensures internet service providers (ISPs) and content providers can meet the growing demands of their users. With the rapid advancement of technology, the idea of automating this process is becoming increasingly appealing. In this post, we will dive into what it would take to automate capacity planning and the possible caveats.</p> <h2 id="why-automate-capacity-planning">Why automate capacity planning?</h2> <p>Automating capacity planning offers several benefits:</p> <ol> <li><strong>Cost</strong>: Automation reduces the time and effort required to carry out capacity planning, ultimately reducing operational costs.</li> <li><strong>Accuracy</strong>: Automated systems can quickly process large amounts of data, leading to more accurate projections and decisions.</li> <li><strong>Scale</strong>: As the network grows, automated processes can easily adapt and scale to accommodate changes without human intervention.</li> </ol> <h2 id="steps-to-automate-capacity-planning">Steps to automate capacity planning</h2> <p>So what does it take to automate the process? The tasks include:</p> <ol> <li>Data Collection</li> <li>Predictive Models</li> <li>Optimization Objectives</li> <li>Decision Algorithms</li> <li>Monitoring and Adjustment</li> </ol> <p><strong>Data collection</strong> is the foundation of automating capacity planning. This includes accurate and comprehensive data on network topology, traffic patterns, historical data, and equipment specifications. A robust data collection and enhancement system must be in place to facilitate automated decision-making. Raw data from different sources will not do on its own; it must be enhanced and combined so that all relevant aspects of the network are considered.</p> <p>Using collected data, <strong>predictive models</strong> can be developed to forecast network usage, capacity requirements, and potential bottlenecks. Machine learning and artificial intelligence techniques can be employed to enhance the accuracy of these predictions over time. The models should ideally be able to take input from third parties – <a href="https://www.kentik.com/blog/unlocking-power-embedded-cdns-comprehensive-guide-deployment-optimal-use-cases/">like a CDN, content, or ISP partner</a>.</p> <p>Then we can clearly outline the goals of the automated capacity planning process. These <strong>objectives</strong> might include minimizing costs, maximizing network efficiency, ensuring high-quality service, or maintaining redundancy. Having well-defined objectives will help guide the development and evaluation of the automated system.</p>
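<p>To make the forecasting step concrete, here is a toy Python sketch that fits a linear trend to monthly peak utilization and flags when the projected exhaustion date falls inside an assumed procurement lead time. The numbers are invented, and a real system would use far richer models (seasonality, confidence bands, third-party input):</p> <pre><code>from statistics import linear_regression  # Python 3.10+

def months_until_exhaustion(peak_gbps, capacity_gbps):
    # Fit peak utilization against month index and project the month
    # at which the trend line crosses link capacity.
    months = list(range(len(peak_gbps)))
    slope, intercept = linear_regression(months, peak_gbps)
    if slope > 0:
        return (capacity_gbps - intercept) / slope
    return None  # flat or shrinking traffic: no projected exhaustion

peaks = [52, 55, 61, 64, 70, 73]  # monthly peaks on a 100 Gbps link
LEAD_TIME_MONTHS = 9              # assumed order-to-turn-up time
m = months_until_exhaustion(peaks, capacity_gbps=100)
if m is not None:
    remaining = m - (len(peaks) - 1)
    print(f"Projected exhaustion in {remaining:.1f} months")
    if LEAD_TIME_MONTHS > remaining:
        print("Start the upgrade project now")
</code></pre> <p>The threshold check at the end already blurs into the decision algorithms discussed next; in practice the forecasting and decision layers feed each other.</p>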
<p>With the data and predictive models in place, we can develop the <strong>decision algorithms</strong> to optimize network capacity based on the defined objectives. These algorithms should be capable of processing the collected data and making real-time adjustments to network configurations, equipment upgrades, and other relevant factors.</p> <p>Finally, the automation process should include continuous <strong>network performance and usage monitoring and regular adjustments</strong> to the predictive models and decision algorithms. This ensures that the system remains up-to-date and can respond effectively to changes in network conditions and user requirements. Traffic might not grow precisely according to plan, so this part is crucial in deciding when to start an upgrade project.</p> <p>It sounds easy when listed like this, until you dive into each point. Comprehensive data collection alone is a massive undertaking. Let’s have a look at some of the other caveats.</p> <h2 id="caveats-and-considerations-of-automating-capacity-planning">Caveats and considerations of automating capacity planning</h2> <p>Aside from the scale of such a project, there are a number of other things to watch out for:</p> <ol> <li>Complexity</li> <li>Adaptability</li> <li>Human oversight</li> <li>Cost</li> </ol> <p>Developing an automated capacity planning system is a complex task that requires a thorough understanding of network infrastructure, traffic patterns, and capacity management. No software or data engineer can do this alone, so a collaborative team of developers, network engineers, and planners needs to work together to design and eventually use the system.</p> <p>The automation system must be adaptable to the dynamic nature of IP networks. It should handle changes in network topology, user demands, and technology advancements. Regular updates and maintenance are necessary to ensure the system remains responsive to the ever-changing industry landscape.</p> <p>While automating capacity planning can save time and resources, human oversight cannot safely be eliminated. Human expertise is necessary to validate the system’s decisions, monitor its performance, and intervene in case of unforeseen issues or complex situations that the automation system might not be equipped to handle.</p> <p>Developing and implementing an automated capacity planning system requires a significant initial investment in technology, infrastructure, and skilled personnel. You should carefully weigh the potential benefits against the costs and consider the long-term return on investment.</p> <p>The caveats seem significant, but if you think about it, you might already be halfway there. If you are already running a <a href="https://www.kentik.com/blog/network-capacity-planning-for-2022-made-easy/">network observability tool like Kentik</a>, you will have solved a large part of the data collection task. 
You will know your traffic in detail, and this, combined with your inventory data and configuration management system, is the first big step on the way.</p><![CDATA[Understanding MTTR Networking: How to Improve Incident Response Time]]><![CDATA[As organizations continue to shift their operations to cloud networks, maintaining the performance and security of these systems becomes increasingly important. Read on to learn about incident management and the tools and strategies organizations can use to reduce MTTR and incident response times in their networks.]]>https://www.kentik.com/blog/understanding-mttr-networking-how-to-improve-incident-response-timehttps://www.kentik.com/blog/understanding-mttr-networking-how-to-improve-incident-response-time<![CDATA[Stephen Condon]]>Tue, 30 May 2023 04:00:00 GMT<p>As organizations continue to shift their operations to cloud networks, maintaining the performance and security of these systems becomes increasingly important. Responding to incidents promptly and efficiently is vital to minimizing damage, reducing downtime, and safeguarding critical data and infrastructure. One metric that plays a crucial role in incident response is mean time to repair (MTTR), which measures the average time it takes to fix a network issue once it has been detected.</p> <p>Focusing on reducing MTTR can help organizations improve their incident response capabilities. This involves streamlining processes, having the right tools to identify issues in real time, determining bottlenecks, and implementing automation where possible. By reducing the time it takes to identify and resolve incidents, organizations can ensure that their systems are up and running again as quickly as possible, minimizing the impact on user experience and system performance.</p> <p>This article will cover incident management and the tools and strategies organizations can use to reduce MTTR and incident response times in their networks. To close, we will see how network observability helps NetOps teams adopt a proactive approach to incident response that reduces MTTR and incident-related costs while enhancing their networks’ security and performance.</p> <h2 id="stages-of-incident-management">Stages of incident management</h2> <p>Incident management refers to the processes, procedures, and tools used to identify and resolve performance issues and disruptions in a computer network.</p> <p>The “lifecycle” of incident management can be thought of in five steps:</p> <ul> <li>Prepare</li> <li>Detect</li> <li>Isolate</li> <li>Resolve</li> <li>Optimize</li> </ul> <p>Let’s take a closer look at each of these stages of incident management.</p> <h3 id="prepare">Prepare</h3> <p>This stage involves developing and implementing incident response plans, policies, and tools and training employees to respond to incidents.</p> <h3 id="detect">Detect</h3> <p>This stage involves identifying and detecting signs of performance issues or disruptions in a networking environment. This can be done through the use of monitoring and observability tools that provide visibility into the health and performance of network infrastructure and services.</p> <h3 id="isolate">Isolate</h3> <p>Once an issue is detected, the next step is to isolate the affected systems or resources, if an immediate fix isn’t apparent, to prevent further performance degradation or disruption. 
This could involve diverting traffic to alternate resources or scaling resources up or down based on demand.</p> <h3 id="resolve">Resolve</h3> <p>This stage involves identifying the root cause of the performance issue and developing and executing a plan to resolve it. This could include adjusting resource utilization, upgrading hardware or software, or reconfiguring the environment to better align with application requirements.</p> <h3 id="optimize">Optimize</h3> <p>Once the incident has been resolved, it’s essential to maintain ongoing monitoring and analysis to ensure that the network remains performant and resilient. Post-mortems provide excellent data points for optimizations, which could involve updating monitoring and alerting systems, incident management preparations, new security or traffic management policies, or other measures such as revisiting infrastructure design or hardware choices.</p> <h2 id="metrics-for-incident-management">Metrics for incident management</h2> <p>Assessing incident response in enterprise networks can be accomplished with a wide variety of metrics. Here are some of the most foundational incident management metrics:</p> <ul> <li> <p><strong>Mean time to detect (MTTD)</strong>: Closely aligned with mean time to innocence (MTTI), this metric measures the time to detect an incident from the moment it occurs. It includes the time it takes to identify the incident, investigate it, and confirm its existence.</p> </li> <li> <p><strong>Mean time to repair (MTTR)</strong>: This metric measures the time to resolve an incident and restore services to normal. It includes the time it takes to identify the problem, diagnose it, plan and execute a solution, and verify that it has worked.</p> </li> <li> <p><strong>First call resolution (FCR)</strong>: This metric measures the percentage of incidents that are resolved in a single interaction between the customer and the support team. A high FCR rate is indicative of efficient and effective incident management.</p> </li> <li> <p><strong>Incident response time</strong>: This metric measures the time the support team takes to respond to an incident once it has been reported. A fast response time can help prevent an incident from escalating and minimize its impact.</p> </li> <li> <p><strong>Incident severity</strong>: This metric categorizes incidents based on their severity level, which helps prioritize incident management efforts. A severity level may be determined by the impact on services, the number of users affected, the urgency of the situation, or a more service-specific metric.</p> </li> <li> <p><strong>Incident backlog</strong>: This metric measures the number of unresolved incidents at any given time. A large backlog can indicate that the support team is overwhelmed and may require additional resources to manage incidents effectively.</p> </li> </ul> <p>By tracking and analyzing these metrics, network operators and engineers can better understand their incident management performance and identify areas for improvement.</p> <h2 id="what-is-mttr">What is MTTR?</h2> <p>MTTR most often stands for mean time to repair. It refers to the average amount of time it takes to resolve an incident, from the moment it is detected to the point where the system or service is fully operational again.</p> <p>MTTR is an important metric because it helps teams measure their efficiency and effectiveness in responding to incidents. A low MTTR indicates that the system, the team, or both can quickly detect and resolve incidents, minimizing downtime and reducing the impact on users. On the other hand, a high MTTR may indicate inefficiencies in the incident response process, leading to longer downtime and increased costs.</p>
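<p>These definitions translate directly into arithmetic over incident timestamps. Here is a minimal Python sketch on toy records; the field names are illustrative, not any particular ITSM schema:</p> <pre><code>from datetime import datetime, timedelta
from statistics import mean

# Each record holds the moments an incident occurred, was detected,
# and was fully resolved (invented data for illustration).
incidents = [
    {"occurred": datetime(2023, 5, 1, 9, 0),
     "detected": datetime(2023, 5, 1, 9, 12),
     "resolved": datetime(2023, 5, 1, 10, 30)},
    {"occurred": datetime(2023, 5, 3, 14, 0),
     "detected": datetime(2023, 5, 3, 14, 5),
     "resolved": datetime(2023, 5, 3, 14, 50)},
]

def mean_minutes(pairs):
    # Average the elapsed time, in minutes, across (start, end) pairs.
    return mean((end - start) / timedelta(minutes=1) for start, end in pairs)

mttd = mean_minutes((i["occurred"], i["detected"]) for i in incidents)
mttr = mean_minutes((i["detected"], i["resolved"]) for i in incidents)
print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
</code></pre>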
<p>By tracking MTTR over time, teams can identify trends and improve their incident response process to reduce downtime and improve service reliability. It may also be an appropriate metric to share with customers to set expectations.</p> <h2 id="other-mttrs">Other MTTRs</h2> <p>Besides mean time to repair, MTTR can refer to one of several related metrics for network operators and incident response teams.</p> <ul> <li><strong>Mean time to respond</strong>: This MTTR measures the average time it takes for an incident response team to acknowledge and respond to an incident. It includes the time it takes to detect and identify the issue and the time it takes to initiate a response. This can be considered a sub-metric for mean time to repair.</li> <li><strong>Mean time to recovery</strong>: This metric picks up where <em>mean time to respond</em> leaves off and takes into account the time it takes to restore systems and services to full functionality and the time it takes to verify that the systems and services are working as expected. This can also be considered a sub-metric for mean time to repair.</li> <li><strong>Mean time to restore</strong>: Synonymous with <em>mean time to repair</em>.</li> <li><strong>Mean time to resolution</strong>: Synonymous with <em>mean time to repair</em>.</li> </ul> <h2 id="factors-that-affect-mttr-in-cloud-networks">Factors that affect MTTR in cloud networks</h2> <p>Here are five factors that can impact MTTR for network operators:</p> <ol> <li><strong>Complexity of the network</strong>: The complexity of the network can lead to longer resolution times, as it becomes more challenging to identify and troubleshoot issues. Are public and private resources being utilized? Are containers in the mix?</li> <li><strong>Lack of visibility</strong>: Lack of visibility into the network infrastructure and its components can make it challenging to pinpoint the root cause of an incident. This problem can be exacerbated by limited visibility into traffic flows within public clouds.</li> <li><strong>Inefficient incident management processes</strong>: Inefficient incident management processes, such as a lack of documentation or communication, can delay resolution times and lead to more extended downtimes.</li> <li><strong>Human error</strong>: Human error can also contribute to longer MTTR, particularly if mistakes are made during incident response or if staff lack the necessary skills and experience.</li> <li><strong>Lack of data analytics</strong>: Poor analytics can impede incident resolution as teams may struggle to identify patterns or trends that could aid in troubleshooting and incident response, especially prevention.</li> </ol> <h2 id="strategies-for-improving-mttr">Strategies for improving MTTR</h2> <p>Despite the complexity of modern networks, NetOps have more tools and strategies than ever to help reduce their mean time to repair.</p> <p>Let’s examine some of them here.</p> <h3 id="processes-and-tools">Processes and tools</h3> <p>Establishing an effective incident management process is essential for reducing MTTR in cloud networking. This process involves several steps, including incident identification, triage, investigation, resolution, and post-incident review. Implementing a centralized incident tracking system that integrates with IT service management (ITSM) tools can help streamline incident resolution and reduce response times.</p>
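<p>That integration can be as simple as a webhook that turns a monitoring alert into a tracked incident. A minimal sketch; the endpoint and payload fields here are hypothetical, not a specific ITSM product’s API:</p> <pre><code>import json
import urllib.request

ITSM_WEBHOOK = "https://itsm.example.com/api/incidents"  # hypothetical endpoint

def open_incident(summary: str, severity: str, dedup_key: str) -> None:
    # Post a normalized alert payload to the incident tracker. The
    # dedup_key lets the tracker fold repeated alerts for the same
    # condition into one incident instead of paging repeatedly.
    payload = json.dumps({
        "summary": summary,
        "severity": severity,
        "dedup_key": dedup_key,
    }).encode()
    req = urllib.request.Request(
        ITSM_WEBHOOK, data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=10)

# A monitoring check calls this the moment a threshold is breached,
# removing the human hand-off from the detect-to-triage step.
# open_incident("Packet loss above 2% on edge-router-1", "high", "loss:edge-router-1")
</code></pre>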
<p>Network observability involves tools and practices that enable real-time monitoring, analysis, and troubleshooting of network performance and health. Using highly contextual instrumentation and powerful data analytics helps IT teams proactively identify and resolve issues before they become significant incidents, understand system usage patterns, and test network-wide optimizations. Comprehensive network observability should include a combination of monitoring your actual traffic and test (synthetic) traffic to give a complete picture of all threats and impacts to QoS.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4VrNqs7Ptw2YbKW2JvKLkO/1fdb1e444796125975069cb3f1b7373e/cloud-performance-monitor.png" style="max-width: 800px;" thumbnail withFrame class="image center" alt="Monitor cloud performance in Kentik" /> <p>Automation tools can reduce the time spent on manual tasks and improve overall efficiency. For example, implementing a chatbot that automatically creates incidents, categorizes them, and assigns them to the appropriate support team can significantly reduce MTTR. Incident management tools like PagerDuty and Opsgenie can help teams quickly identify and prioritize issues and send alerts to the right stakeholders for prompt resolution.</p> <h3 id="support-team-training">Support team training</h3> <p>Providing training for support teams is critical for improving MTTR. Teams must have the necessary skills, knowledge, and experience to troubleshoot and resolve complex issues effectively. That means continuous training and development on new technologies like container orchestration platforms, cloud-native security solutions, and the latest features and updates to cloud networking platforms like AWS, Azure, and GCP.</p> <h3 id="incident-response-drills">Incident response drills</h3> <p>Regular incident response drills can help identify gaps in incident response plans and improve overall preparedness for handling complex issues. These drills can help teams practice responding to simulated incidents and improve their ability to resolve them quickly.</p> <p>For instance, teams can simulate a DDoS attack on the network and practice isolating the affected resources and mitigating the attack. Regular drills can also help identify gaps in documentation, procedures, and tools, which can be addressed to improve response times during actual incidents.</p> <h2 id="how-kentik-can-help">How Kentik can help</h2> <p>Kentik’s network observability platform can help IT organizations reduce mean time to repair (MTTR) by providing unparalleled, real-time visibility into network infrastructure. By collecting and analyzing network telemetry from across the entire network, Kentik can identify performance issues and anomalies before the system experiences significant performance or security impacts. This proactive approach to network observability helps IT teams quickly identify and address issues, reducing the time it takes to resolve them.</p> <p>Kentik monitors your actual traffic alongside test traffic – <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a>. Synthetic tests can be a significant asset in reducing MTTR, especially in heading off issues before they are customer-impacting. 
Synthetics also help you identify whether the problem is with your network infrastructure or a broader issue impacting internet apps or public cloud infrastructure.</p> <p>Kentik’s platform provides detailed network performance metrics, allowing IT teams to quickly understand the root cause of network issues. With customizable alerts and dashboards, Kentik helps IT teams respond to network issues as they arise. Powerful workflows and third-party integrations help ensure prompt, automated, and highly actionable responses.</p> <p>To see how Kentik can help reduce MTTR in your incident response, <a href="https://www.kentik.com/get-started/">start your free 30-day trial</a>.</p><![CDATA[The Subtle Details of Livestreaming Prime Video with Embedded CDNs]]><![CDATA[Live sports have moved to the internet and are now streaming instead of being broadcast. Traditional streaming protocols have a built-in delay that challenges the experience of a live game. Amazon Prime has found a solution by combining a new protocol with a very distributed CDN.]]>https://www.kentik.com/blog/the-subtle-details-of-livestreaming-prime-video-with-embedded-cdnshttps://www.kentik.com/blog/the-subtle-details-of-livestreaming-prime-video-with-embedded-cdns<![CDATA[Nina Bargisen]]>Thu, 25 May 2023 04:00:00 GMT<p>It has been interesting to observe how professional sports have moved their transmissions onto the internet instead of requiring viewers to sit at home and watch their TV. Many still do that — the details of any sportsball game are best enjoyed on a big screen — but the games are no longer broadcast via a TV signal in the air or on the cable; they arrive via the connection to the internet.</p> <p>Amazon Prime Video is one of the services delivering a lot of sports — you might recall my colleague Doug Madory’s blog in 2022 about <a href="https://www.kentik.com/blog/anatomy-ott-traffic-surge-thursday-night-football-amazon-prime-video/">Thursday Night Football traffic</a>, as seen in the Kentik data.</p> <h2 id="the-challenges-in-transitioning-to-internet-broadcasting">The challenges in transitioning to internet broadcasting</h2> <p>Before we dig into the Prime Video solution, what is the main challenge with moving live events from broadcast to the internet?</p> <p>Flow TV (traditional television) broadcasts are transmitted using dedicated networks, such as over-the-air, cable, or satellite. In this case, the signal is sent once to all viewers, and multiple users watching the same channel do not create additional load on the network. This is because the broadcast signal is shared, and the network’s capacity is designed to handle the full audience without degradation of the signal.</p> <p>Internet broadcasts rely on data packets transmitted over the internet. Each viewer receives a separate stream of data, which means that as more users watch a live event, the load on the network increases. This can lead to congestion, slower connection speeds, or buffering, particularly if the network does not have enough capacity to handle the increased demand.</p> <div as="Promo"></div> <h2 id="what-is-the-impact-of-network-congestion-on-live-streaming">What is the impact of network congestion on live streaming?</h2> <p>So the main difference is that more viewers lead to a higher load on the network that carries the event. One consequence of the congestion that might result from the higher load is delay — which is nearly unacceptable for live sportsball events. 
I remember watching the World Cup in football (aka soccer for you Americans) in 2018 on the internet — in the summer with open windows. We got a good warning to pay attention for the next couple of minutes whenever we heard the neighbors cheer. (My country’s team had a good run in that World Cup.)</p> <p>However, this delay was not due to congestion but to the immaturity of live-streaming technology at the time.</p> <h2 id="understanding-adaptive-bitrate-streaming-and-its-limitations">Understanding adaptive bitrate streaming and its limitations</h2> <p>The reason for the delay — or the unsynchronized delivery of the packets that make up the live stream — is that traditional adaptive streaming protocols are based on chopping the video up into small segments. These are then encoded at several different bitrates. The client compares the playback speed to the download speed of each segment; if a segment downloads more slowly than it plays, the client requests the next segment at a lower bitrate, and vice versa (until the maximum bitrate for the player is reached).</p> <p>The primary advantage of ABS is its adaptability. It allows for a smooth viewing experience, reducing buffering and improving playback by adjusting the stream’s quality to match the viewer’s network conditions.</p> <p>However, ABS can have issues with latency. Because it requires a certain amount of video to be buffered to switch between different quality levels, this can introduce a delay in live streaming scenarios, which can be problematic for real-time content like sports events or online gaming.</p> <h2 id="sye-a-solution-for-livestream-latency-and-synchronization-issues">Sye: A solution for livestream latency and synchronization issues</h2> <p>Sye is a streaming technology developed by Net Insight. It’s designed to address key issues with traditional live streaming: latency and synchronization.</p> <p>Sye offers frame-accurate synchronization, meaning all viewers see the same frame simultaneously, no matter their device or network. This is particularly important for live sports or esports, where real-time performance is critical.</p> <p>Another achievement is that Sye does not segment the streams; instead, it uses a technique where the player can switch between streams with different bitrates on the fly to adapt to varying network conditions. The switch is decided and executed on the server side, not the client.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/YkE1Bhwnmu8s79Tb6MfR5/464ce206074f9e2895f21fa397a79cd6/adaptive-bitrate-vs-sye-streaming.png" style="max-width: 650px" class="image center no-shadow" alt="Adaptive Bitrate vs Sye Streaming" /> <h2 id="implementation-of-sye-by-amazon">Implementation of Sye by Amazon</h2> <p>Amazon acquired the technology from Net Insight in 2020 and has implemented Sye instances on all edge devices in CloudFront, its own CDN. This means Prime Video has end-to-end control over the live streaming and can deliver a near real-time, synchronized live-event experience for end users, along with the other live streaming enhancements that the platform offers.</p> <h2 id="how-do-you-address-network-congestion-and-enhance-user-experience">How do you address network congestion and enhance user experience?</h2> <p>With this technology, congestion is the biggest threat to a fully synchronized and high-quality end-user experience. The risk is also high — a really good game will attract even more viewers, and those viewers are not necessarily drawn away from other services, like VOD, running on the same infrastructure, so the load is additive.</p>
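<p>For contrast with Sye’s server-side switching, here is a minimal Python sketch of the classic client-side heuristic a segmented ABS player uses to pick its next bitrate. The bitrate ladder and safety factor are invented for illustration; real players also weigh buffer occupancy:</p> <pre><code>BITRATES_KBPS = [400, 1200, 2500, 5000]  # example encoding ladder, low to high

def next_segment_bitrate(measured_throughput_kbps: float,
                         safety_factor: float = 0.8) -> int:
    # Pick the highest encoding the connection can sustain, with some
    # headroom so a small throughput dip does not stall playback.
    budget = measured_throughput_kbps * safety_factor
    usable = [b for b in BITRATES_KBPS if budget >= b]
    return usable[-1] if usable else BITRATES_KBPS[0]

print(next_segment_bitrate(3000))  # budget 2400 kbps -> picks 1200
print(next_segment_bitrate(7000))  # budget 5600 kbps -> picks 5000
</code></pre> <p>Every such decision happens at a segment boundary, which is one reason segmented protocols buffer ahead and accumulate delay; Sye’s in-stream, server-side switching avoids that boundary entirely.</p>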
<p>One way of trying to mitigate this congestion risk is to deploy CDN servers as close as possible to the end users, and AWS is, in this respect, following in the footsteps of other CDNs by offering ISPs the ability to embed servers from CloudFront into their network, as we have described in two different blog posts:</p> <ul> <li><a href="https://www.kentik.com/blog/speeding-up-the-web-comprehensive-guide-content-delivery-networks-embedded-caching/">Speeding Up the Web: A Comprehensive Guide to Content Delivery Networks and Embedded Caching</a></li> <li><a href="https://www.kentik.com/blog/unlocking-power-embedded-cdns-comprehensive-guide-deployment-optimal-use-cases/">Unlocking the Power of Embedded CDNs: A Comprehensive Guide to Deployment Scenarios and Optimal Use Cases</a></li> </ul> <h2 id="understanding-cdn-mapping-and-ensuring-system-resilience">Understanding CDN mapping and ensuring system resilience</h2> <p>So what specifically do you have to keep in mind when working out how best to deploy embedded edge appliances from AWS to support live video from Prime Video?</p> <ul> <li>The end user to embedded server cluster mapping</li> <li>Failure scenarios, and how to ensure that the event can run smoothly in the case of failures in the network</li> </ul> <p>In the recent blog posts, we detailed the most common ways CDNs map end users to the correct cache location. AWS has a similar system that, when play is first pressed in the player, sets up a session with the proper cache, taking information like location and load on the caches into account.</p> <p>As we discussed in the previous posts, location is defined by the location of the DNS resolver used by the end user or by the end user’s IP address. In the latter case, the ISP signals via BGP which caches should be prioritized for end users in which IP prefixes. Once the session between the player and the cache is up, the cache will push a stream to the player based on the bandwidth estimation between the cache and the player, so the experience is always as good as possible. Quality data is also continuously sent to the Sye control system and compared. If the quality from another cache to end users in that same area is better, the stream is switched.</p> <p>Pushing the caches out to many smaller locations has an often overlooked side effect on the system’s resilience. Imagine a large ISP that connects to AWS with PNI peering, with one connection in each of two locations — the ones where both networks are present. If there is a fault in one of the locations, the second PNI should be able to handle all the failover traffic. If it cannot, not only the users served by the faulty PNI but all users will experience the consequences of the fault. This means we must build 100% extra capacity to handle one fault.</p> <p>If, instead, the traffic is served from 10 smaller embedded locations and there is an outage at one of these, then only 1/10th of the total traffic is affected, and only 1/10th of the total traffic needs to be redistributed to the functioning nine locations. This means the total demand for spare capacity to keep a resilient installation is much smaller.</p>
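<p>The arithmetic behind that claim is easy to check. A minimal Python sketch, assuming equal load across sites and a single site failure:</p> <pre><code>def per_site_headroom(total_gbps: float, sites: int) -> float:
    # With `sites` equal locations and one failure, the survivors split
    # the failed site's load, so each must carry total/(sites-1) instead
    # of total/sites. Headroom is the extra capacity above nominal load.
    nominal = total_gbps / sites
    worst_case = total_gbps / (sites - 1)
    return worst_case / nominal - 1.0

print(f"{per_site_headroom(1000, 2):.0%}")   # 100% spare with 2 PNIs
print(f"{per_site_headroom(1000, 10):.1%}")  # 11.1% spare with 10 embedded sites
</code></pre> <p>Going from two sites to ten cuts the required failover headroom from 100% to roughly 11% per site.</p>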
<h2 id="getting-started-with-amazons-peering-and-isp-relationships-team">Getting started with Amazon’s peering and ISP relationships team</h2> <p>So how do you get started?</p> <p>First, understand how the traffic flows within your network. When you want to build several sites, you need to know how much traffic the customer base for each site will demand. As we discussed earlier, dividing your customers into groups must be consistent with how Amazon maps customers to cache locations, so you identify the groups either by their IP addresses or by their DNS resolvers. In <a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-2/">The Peering Coordinators’ Toolbox - Part 2</a>, we show an example of how you can use this grouping to analyze the traffic with a NetFlow-based tool like Kentik.</p> <p>Use this information to reach out to the team at Amazon that handles peering and ISP relationships, and they will work with you to create the best solution.</p> <p>If you want to learn more about Sye and the live streaming of NFL Thursday Night Football, check out these videos: <a href="https://www.youtube.com/watch?v=27IaaibmoGw">Sye - Live Streaming</a> and <a href="https://youtu.be/Tw93_k_QxuI">AWS re:Invent 2022 - How Prime Video delivers NFL’s Thursday Night Football globally on AWS</a>.</p><![CDATA[Exploring the Latest RPKI ROV Adoption Numbers]]><![CDATA[In this blog post, BGP experts Doug Madory of Kentik and Job Snijders of Fastly update their RPKI ROV analysis from last year while discussing its impact on internet routing security.]]>https://www.kentik.com/blog/exploring-the-latest-rpki-rov-adoption-numbershttps://www.kentik.com/blog/exploring-the-latest-rpki-rov-adoption-numbers<![CDATA[Doug Madory, Job Snijders]]>Wed, 24 May 2023 04:00:00 GMT<blockquote> <p>For the latest RPKI ROV adoption numbers, see our May 1, 2024 article, “<a href="https://www.kentik.com/blog/rpki-rov-deployment-reaches-major-milestone/" title="Kentik Blog: RPKI ROV Deployment Reaches Major Milestone">RPKI ROV Deployment Reaches Major Milestone.</a>”</p> </blockquote> <p>Last year, we published two pieces of analysis that assessed where we were with RPKI ROV adoption (RPKI is Resource Public Key Infrastructure, ROV is Route Origin Validation). This routing security technology continues to be the best defense against accidental BGP hijacks and origination leaks. For it to do its job (rejecting RPKI-invalid routes), two steps must be taken: ROAs must be created, and ASes must reject routes that aren’t consistent with the ROAs.</p> <div as="WistiaVideo" videoId="qpfh2174t5" audio></div> <p>In the <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">first piece</a>, we looked at the state of ROA creation through the lens of traffic volume using Kentik’s aggregate NetFlow. What we found was that although only a <em>minority</em> of BGP routes had valid ROAs, the <em>majority</em> of traffic sent on the internet was destined for those routes. We had made more progress than most people realized!</p> <p>In the <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">second</a>, we looked at the propagation of RPKI-invalid routes and found that they experienced a dramatic drop in circulation compared to valid or unknown routes. 
When the “blast radius” of problematic routes is reduced, so is the potential for connectivity disruption as a result of an origination leak.</p> <p>How have things changed from last year? Let’s take a look at the numbers.</p> <h2 id="measuring-roa-creation-with-netflow">Measuring ROA creation with NetFlow</h2> <p>As mentioned above, our NetFlow analysis from the spring of 2022 on the state of ROA creation showed that despite only a third of BGP routes having valid ROAs (34.89% for IPv4 and 34.28% for IPv6), we were seeing a majority (56.4%) of internet traffic (bits/sec based on aggregate NetFlow) going to RPKI-valid routes — a far more optimistic picture.</p> <p>Since our analysis last year, the number of ROAs has climbed continuously and, at the time of this writing, stands at 43.17% for IPv4 and 45.17% for IPv6. These new figures represent increases of 8.28 and 10.89 percentage points, respectively. Below is <a href="https://rpki-monitor.antd.nist.gov/">NIST’s chart</a> of BGP routes over time that are evaluated as RPKI-valid (green), RPKI-invalid (red), and RPKI-not-found (yellow) for those BGP routes without a ROA.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3BTrLlLHqsoatEITmDdhNs/0718541ac9b998c230997734b65db864/rpki-rov-original.png" style="max-width: 800px;" class="image center" alt="RPKI-ROV History of Unique Prefix-Origin Pairs - Original" /> <p>So in the past year, the share of BGP routes evaluated as RPKI-valid increased by roughly a quarter for IPv4 and roughly a third for IPv6! Given this increase, we would expect to see a change in the share of traffic to RPKI-valid BGP routes, so let’s take a look.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4Oxx7LuYtHgWpwfHzmdnKT/672e3912721269fb9051597304ad99a7/internet-traffic-rpki-202305.png" style="max-width: 600px;" class="image center" alt="Internet traffic volume by RPKI evaluation" /> <p>As one would expect, the percentage of internet traffic to RPKI-valid BGP routes increased — from 56.4% last year to 62.5%! As was the case with our analysis from last year, these numbers are driven by major RPKI deployments in both large content providers (Amazon, Google, Cloudflare, Akamai, etc.) as well as large access networks (Comcast, Spectrum, etc.). These networks are responsible for the lion’s share of traffic exchanged on the internet, which has become only more concentrated in recent years.</p> <p>If we assume steady growth of the share of BGP routes with ROAs, it should become the majority case about a year from now (May 2024). Mark your calendars!</p> <img src="//images.ctfassets.net/6yom6slo28h2/7g0IkHcVNSYaDiy3ptCXET/04983302a6280835c6e7c8c3475f9627/rpki-rov-trend.png" style="max-width: 800px;" class="image center" alt="RPKI-ROV History of Unique Prefix-Origin Pairs - Trend" /> <h2 id="reduction-of-propagation-of-rpki-invalid-routes-due-to-rpki">Reduction of propagation of RPKI-invalid routes due to RPKI</h2> <p>The <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">second part</a> of last year’s analysis was to better understand the rejection of RPKI-invalids. Multiple groups have attempted to enumerate which ASes reject RPKI-invalid routes, which is a tricky endeavor. Instead of trying to determine precisely which ASes reject invalids, we chose to measure how route propagation differs between RPKI-invalid routes and routes of other types.</p>
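<p>The measurement itself is simple to express. Here is a minimal Python sketch of the comparison on toy data, where each route carries its RPKI evaluation and the share of BGP vantage points that observed it; real inputs would come from route collectors such as RIPE RIS or RouteViews:</p> <pre><code>from statistics import mean

# Toy records: (RPKI evaluation, share of vantage points that saw the route).
routes = [
    ("valid", 0.98), ("valid", 1.00), ("unknown", 0.97),
    ("invalid", 0.41), ("invalid", 0.25), ("unknown", 0.95),
]

by_state = {}
for state, seen_share in routes:
    by_state.setdefault(state, []).append(seen_share)

# Invalid routes show a markedly lower mean propagation than the rest.
for state, shares in sorted(by_state.items()):
    print(f"{state:8s} mean propagation: {mean(shares):.0%}")
</code></pre>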
<img src="https://images.ctfassets.net/6yom6slo28h2/7uL7smjzcsmg6Hec6Pvl1A/294032450ffb693669e9c8ab9c14625a/IPv4-v6_propagation_by_RPKI.png" style="max-width: 800px;" class="image center" alt="Diagrams showing RPKI invalid, valid, routes not found" /> <p>Our <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/#conclusion">conclusion</a> was that “the evaluation of a route as invalid reduces its propagation by anywhere between one-half to two-thirds.” This really hasn’t changed much in the past year, in part because the impact was driven by tier-1 backbone providers such as Arelion, Lumen, NTT, and Cogent rejecting invalids. Due to the immense scale of these backbone providers, they end up shielding much of the internet from RPKI-invalid routes.</p> <p>We see the impact of RPKI on the propagation of invalid routes every day. Kentik’s BGP visualization captures the drop in reachability (see the upper stacked plot below) as propagation drops when a route becomes invalid and starts getting rejected. This recently occurred during yet another instance of an all-too-frequent <a href="https://twitter.com/DougMadory/status/1648775274912264192">prepending typo</a>, like this one from April:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6lo1zXRLgCtMkmSUozrtQZ/9e0a6084108d230d706db303903b3665/bangladesh-drop-in-reachability.png" style="max-width: 800px;" withFrame class="image center" alt="Drop in reachability due to prepending typo" /> <p>Or during the recent <a href="https://twitter.com/DougMadory/status/1646937938687664128">mysterious daily BGP hijacks</a> of Iranian networks by AS41689 of Iran, many of which are RPKI-invalid (and therefore getting rejected). Note that ROAs cover <a href="https://stat.ripe.net/app/launchpad/S1_IR_C32C23C5C26C8_TMR672">97% of Iranian IPv4 space</a>:</p> <img src="//images.ctfassets.net/6yom6slo28h2/57ddUjQtiWVSJBT5OAQOfw/f2aa56cd033d90ce36f9c78314395c43/asiatech-rpki-invalid.png" style="max-width: 800px;" class="image center" withFrame alt="Daily BGP hijacks of Iranian networks as seen in Kentik" /> <p>If an AS doesn’t reject RPKI-invalid routes but its transit providers do, it is almost as if it does, too. Unless, of course, the invalid routes arrive over a peering connection, circumventing transit.</p> <p>This brings us to <a href="https://blog.cloudflare.com/rpki-updates-data/">Cloudflare’s recent analysis</a> from December. Their report observed a “very low effective coverage of just 6.5% over the measured ASes, corresponding to 261 million end users currently safe from (malicious and accidental) route leaks.”</p> <p>Let’s contrast their observation with our conclusion that RPKI ROV dramatically reduces the propagation of RPKI-invalid routes. These assertions might seem contradictory, but they are not. 
It is certainly the case that, numerically, very few ASes reject RPKI-invalids (Cloudflare’s observation), and, at the same time, RPKI-invalids experience a severe propagation penalty (our observation).</p> <p>Cloudflare refers to our observation as “limit(ing) the blast radius” through “indirect validation”:</p> <blockquote>In other words, the methodology used focuses on ROV adoption by end-user networks (e.g., ISPs) and isn’t meant to reflect the eventual effect of indirect validation from (perhaps validating) upper-tier transit networks. While indirect validation may limit the "blast radius" of (malicious or accidental) route leaks, it still leaves non-validating ASes vulnerable to leaks coming from their peers.</blockquote> <p>Unless a leak originates from a highly peered network like Cloudflare’s, problematic routes will need to traverse large transit providers to have a widespread impact. That is why having backbone providers rejecting RPKI-invalid routes is highly beneficial for the health of the global internet.</p> <p>There are a few unheralded successes every day due to RPKI. Take this example from January. Pakistani incumbent PTCL began leaking 90.0.0.0/24, which was a more-specific of a route announced by Orange’s domestic network in France (AS3215), 90.0.0.0/17. This was probably an internal route that was accidentally leaked onto the internet. Still, since Orange had a ROA for this route, most backbone carriers automatically rejected the bogus Pakistani route.</p> <p>In Kentik’s BGP visualization below, we reported that only 25% of our BGP sources observed the problematic more-specific route, which without RPKI would have been globally propagated (i.e., 100% of BGP sources). When our analytics picked this up, we contacted Orange, who alerted PTCL of the leak, and they stopped announcing it.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4jzEqGTFU9kAzt7bk5k43J/bafb59c5aadace8f7599e5c2394b251f/orange-bgp-monitor.png" withFrame style="max-width: 700px;" class="image center" alt="Kentik BGP visualization" /> <p>In this case, had one additional backbone carrier been rejecting RPKI-invalids, the propagation percentage would have been as low as 1% (just peers of PTCL that do not reject invalids). The lower the percentage, the lower the potential for disruption.</p> <h2 id="conclusion">Conclusion</h2> <p>To be sure, RPKI ROV does not alone solve the security issues facing BGP. A determined adversary can still forge an AS path to create a route that would be evaluated as RPKI-valid, as was the case in last year’s <a href="https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services/">Celer Bridge attack</a>.</p> <p>When discussing the impact of RPKI ROV, we intentionally point to its benefits against accidental hijacks due to typos and leaks, such as the PTCL example above or the <a href="https://twitter.com/DougMadory/status/1508466367112093709">leak of a Russian hijack</a> against Twitter last year.</p> <p>Yet to be fielded, BGPSEC is the technology intended to prevent the impersonation of ASes through forged AS paths. However, since BGPSEC’s protection will only extend through ASes that are BGPSEC-aware, its benefits are limited to a subset of the internet. 
Despite this limitation, it is important to understand why BGPSEC will still offer protection in a partial deployment scenario.</p> <p>Returning to the Celer Bridge incident, let’s look at the problematic AS path of that BGP hijack:</p> <p><strong>… 1299 209243 14618</strong></p> <p>If Amazon (AS14618) and Arelion (AS1299) were BGPSEC-aware, Arelion would have ignored the above announcement because the signed route would have been preferred. The BGPSEC-verified announcements governing the same IP space passed directly between Amazon and Arelion would have trumped the unverified announcement from the phony Amazon. The hijack route would not have been selected, and the attack would have been prevented <em>without any other AS needing to be BGPSEC-aware</em>.</p> <p>As with RPKI ROV, adoption by major cloud providers and network service providers alone can severely limit the efficacy of AS impersonations by greatly restricting the propagation of those harmful routes. Partial deployment <em>does</em> offer benefits, as it immediately helps the deployer.</p> <p>Getting there will take time. The progress described above on the adoption of RPKI ROV took many years, but if we are ever to secure BGP, we must keep marching forward.</p><![CDATA[Making Peering Easy: Announcing the Integration of PeeringDB and Kentik]]><![CDATA[Peering evaluations are now so much easier. PeeringDB, the database of networks and the go-to location for interconnection data, is now integrated into Kentik and available to all Kentik customers at no additional cost.]]>https://www.kentik.com/blog/making-peering-easy-announcing-the-integration-of-peeringdb-and-kentikhttps://www.kentik.com/blog/making-peering-easy-announcing-the-integration-of-peeringdb-and-kentik<![CDATA[Greg Villain]]>Tue, 23 May 2023 04:00:00 GMT<p>At Kentik, our mission is to make life awesome for people building and running the connected world. Today we are thrilled to announce a new milestone: Kentik is now the first network observability platform to seamlessly integrate the PeeringDB dataset, elevating our position as the leading choice for network operators seeking unparalleled insights and efficiency.</p> <h2 id="peeringdb-integration-boosts-kentik-user-experience">PeeringDB integration boosts Kentik user experience</h2> <p>Kentik customers can now streamline their peering partner evaluation thanks to PeeringDB data integrated directly into the Kentik Network Observability Platform. This integration allows Kentik users to configure mappings between their Kentik sites and PeeringDB facilities, and between their IX-classified interfaces and PeeringDB exchanges. With these mappings, non-PeeringDB-registered networks can now evaluate their common footprint with any PeeringDB-registered network.</p> <p>In response to demand, we’re taking this strategic step forward for several compelling reasons:</p> <ul> <li> <p><strong>Improving the usability of PeeringDB data</strong>: Previously, users relied on makeshift solutions and scripts to access the dataset, lacking a unified approach to streamline the peering decision process. 
Kentik’s integration now effectively addresses these issues by improving accessibility and adding mapping.</p> </li> <li> <p><strong>Enhancing footprint display and computation</strong>: While PeeringDB effectively showcases a member network’s footprint, it doesn’t compute the common footprint between two networks or map location data—areas where Kentik’s integration excels.</p> </li> <li> <p><strong>Providing traffic flow insights</strong>: PeeringDB lacks information on customers’ traffic flows to and from ASNs of interest, a gap filled by Kentik’s integration.</p> </li> <li> <p><strong>Streamlining market intelligence</strong>: <a href="https://www.kentik.com/product/market-intelligence/">Kentik Market Intelligence</a> (KMI) users can now effortlessly navigate between market intelligence scoring/ranking data and footprint evaluation, thanks to the seamless workflow enabled by this integration.</p> </li> </ul> <div as="Testimonial" index="0" color="orange"></div> <h2 id="what-is-peeringdb">What is PeeringDB?</h2> <img src="//images.ctfassets.net/6yom6slo28h2/31hOfi5BGsZWsPX3wh4uJR/a22e0f13752309119a70f6ae241e14b6/peeringDB.svg" style="max-width: 200px;" alt="PeeringDB logo" class="image right no-shadow"/> <p><a href="https://www.peeringdb.com/">PeeringDB</a> is a freely available online database that provides information on networks, internet exchange points (IXPs), and other related organizations. It allows network operators to maintain and share detailed information about their network infrastructure, peering policies, and contact details with other network operators.</p> <p>PeeringDB is a non-profit, community-driven organization that maintains a peering and infrastructure directory. They have grown to be one of the internet’s critical infrastructure datasets, used in strategic peering decision-making across networks in the interconnection ecosystem.</p> <h2 id="how-peeringdb-works-in-kentik">How PeeringDB works in Kentik</h2> <p>To achieve seamless integration with peering-related workflows throughout the Kentik portal, we’ve implemented the PeeringDB integration as a new tab within the dynamically generated ASN Quick-View screens, available for any ASN. This newly added tab offers contextualized IN and OUT traffic levels, ratios, and traffic mix data necessary for evaluation against any network’s policy. Moreover, policy data for the displayed network is also available on this screen, along with geo-filter-friendly map and tabular data for assessing common footprint.</p> <img src="//images.ctfassets.net/6yom6slo28h2/HMfoC5OB5Ns4pM9vr3knX/bf5d4546c7405c685db49b20161ee077/peeringdb-microsoft.png" style="max-width: 800px;" alt="PeeringDB data in the Kentik network observability platform" class="image center" withFrame thumbnail /> <p>The screenshot above demonstrates the presentation of PeeringDB data within the Kentik platform. The “Traffic Profile” data displays the inbound and outbound traffic from the target ASN — in this instance, Kentik traffic in and out of Microsoft’s AS8075. Below the traffic profile, you’ll find the PeeringDB data for that ASN. On the right side, the peering and exchange facilities used by this ASN are shown, with common facilities (Microsoft and Kentik) highlighted in blue. By selecting a common facility, in this case, Equinix Ashburn, you can access a comprehensive list of other networks and exchanges at this facility, along with their peering policy.</p> <p>This is just the first stage in Kentik’s utilization of PeeringDB data. 
<p>We have an exciting roadmap for the coming quarters to further leverage this data and make peering optimization simpler and more intuitive. This value-added service will be available to all Kentik customers at no additional charge.</p> <div as="Promo"></div> <p>Integrating PeeringDB data into Kentik’s platform provides crucial context for network planners and streamlines peering operations, making them more efficient and actionable. Kentik’s platform helps network operators improve their operational efficiency by offering a single, comprehensive view for making data-driven, analytical peering decisions. We’d love to give you a demonstration! <a href="https://www.kentik.com/get-demo/">Get in touch.</a></p><![CDATA[Mind Your MANRS: A Safer Internet Through Secure Global Routing]]><![CDATA[We access most of the applications we use today over the internet, which means securing global routing matters to all of us. Surprisingly, the most common method is through trust relationships. MANRS, or the Mutually Agreed Norms for Routing Security, is an initiative to secure internet routing through a community of network practitioners to facilitate open communication, accountability, and the sharing of information.]]>https://www.kentik.com/blog/mind-your-manrs-a-safer-internet-through-secure-global-routinghttps://www.kentik.com/blog/mind-your-manrs-a-safer-internet-through-secure-global-routing<![CDATA[Phil Gervasi]]>Wed, 17 May 2023 04:00:00 GMT<p>MANRS, or the Mutually Agreed Norms for Routing Security, is an initiative of the <a href="https://www.internetsociety.org/">Internet Society</a> to help secure internet peering relationships and ultimately help secure global routing. It’s not a technology, it’s not a formal regulatory body, and it’s not a new encryption method. Instead, <a href="https://www.manrs.org/">MANRS</a> is a culture, a philosophy, and a community.</p> <div as="WistiaVideo" videoId="r48u1kuypm" audio></div> <h2 id="mind-your-manrs">Mind your MANRS</h2> <p>In a <a href="https://www.kentik.com/telemetrynow/s01-e13/">recent episode</a> of Telemetry Now, <a href="https://www.internetsociety.org/author/siddiqui/">Aftab Siddiqui</a> from the Internet Society joined us to talk about how the MANRS initiative is the center of a global community of engineers trying to keep peering relationships and routing advertisements safe and secure. <em>Community</em> is undoubtedly a central theme with MANRS. From their website:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2RMo7U9oNjN8FMdAgtb6id/a1b54e950ef7050211d5ee6f72e648df/manrs.svg" style="max-width: 200px;" class="image right simple" alt="MANRS logo" /> <p><em>“Joining MANRS means joining a community of security-minded organizations committed to making the global routing infrastructure more robust and secure. MANRS outlines simple, concrete actions organizations can take, tailored to their role on the Internet….”</em></p> <h2 id="how-manrs-helps-with-global-routing-security">How MANRS helps with global routing security</h2> <p>MANRS is a matter of accountability, peer review, coordination, and collaboration. Membership is open to any organization with an internet presence that wants to secure its global routing footprint. MANRS also provides a third-party mechanism organizations can use to show the world they follow best practices to secure routing relationships and configurations. 
This way, a service provider, CDN, or any other member organization can point to their MANRS membership to validate they follow routing best practices.</p> <p>Made up of a self-governed collection of service providers, CDNs, IXPs, large enterprises, and network vendors, the MANRS community operates with a charter and steering committee to promote the global adoption of MANRS actions and improvements in routing security. To get there, members work together to provide reliable tools for compliance and measurement, such as the MANRS Observatory, to build the capacity of network engineers through training and fellowship programs, and to advocate for policies that strengthen routing security.</p> <p>The MANRS initiative identifies four “actions” member organizations can take to improve routing security.</p> <ol> <li><strong>Filtering</strong> – defining a clear routing policy and implementing a system to ensure that announcements to adjacent networks are correct.</li> <li><strong>Anti-spoofing</strong> – enabling source address validation (SAV) and implementing anti-spoofing to prevent packets with incorrect source IP addresses from entering and leaving the network.</li> <li><strong>Coordination</strong> – maintaining globally accessible, up-to-date contact information to assist with incident response.</li> <li><strong>Global validation</strong> – publishing data that enable other stakeholders to validate routing information worldwide.</li> </ol> <p>Notice that these actions comprise configuration best practices <em>and</em> the collaborative work of peer review and open communication. It’s not simply a matter of fixing bad <a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-1/">BGP configuration</a>. It’s also a framework for open communication among peers, a mechanism for peer review, an encouragement to employ <a href="https://www.kentik.com/telemetrynow/s01-e12/">technologies such as RPKI</a>, and a community to help with security issues.</p> <h2 id="routing-security-and-best-practices">Routing security and best practices</h2> <p>The term “routing security” may conjure thoughts of malicious activity, routing hijacks, <a href="https://www.kentik.com/kentipedia/ddos-detection/">DDoS attacks</a>, etc. However, many of the problems with global routing are the result of unintended misconfiguration or poor implementation.</p> <p>So as much as the MANRS community is indeed concerned with securing global routing from malicious attacks and activity, it’s also concerned with helping members use configuration best practices.</p> <p>For example, members are periodically peer-reviewed to ensure their configurations and policies meet routing best practices. Additionally, the MANRS organization provides best practice guides, policies, and monitoring and debugging tools.</p> <p>Community members are also encouraged to employ methods to verify source addresses, peering entities, and more using techniques such as ASPA, ROA, and RPKI.</p>
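<p>To make the first two actions concrete, here is a hedged sketch of what prefix filtering and source address validation can look like on a Cisco IOS-style edge router. The prefixes and ASNs are reserved documentation values, not a recommended policy, and exact syntax varies by vendor and platform:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">! Action 1 (filtering): only accept the customer's registered prefix
ip prefix-list CUSTOMER-IN seq 5 permit 203.0.113.0/24
!
router bgp 64496
 neighbor 192.0.2.1 remote-as 64511
 neighbor 192.0.2.1 prefix-list CUSTOMER-IN in
!
! Action 2 (anti-spoofing): drop packets with spoofed source addresses
! using strict unicast Reverse Path Forwarding on the customer-facing link
interface GigabitEthernet0/1
 ip verify unicast source reachable-via rx</code></pre></div> <p>Whatever the platform, the idea is the same: announcements and packets should only be accepted when their origins can be verified.</p>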
<p>In all of these ways, the MANRS initiative helps its members ensure they have the tools, knowledge, and community to protect their presence on the internet and their peering relationships.</p> <p>To learn more and hear a very informative conversation with Aftab Siddiqui, listen to the recent Telemetry Now podcast episode “<a href="https://www.kentik.com/telemetrynow/s01-e13/">The MANRS initiative to secure global internet routing</a>.”</p><![CDATA[Network Operator Confidential: Diving Into Our Latest Webinars on DDoS Trends, RPKI Adoption, and Market Intel for Service Providers]]><![CDATA[Didn’t have time to watch our two recent webinars on the top trends network operators need to know about to be successful in 2023? We’ve got you covered. Let’s look at the biggest takeaways and break down some key concepts. ]]>https://www.kentik.com/blog/network-operator-confidential-ddos-trends-rpki-adoption-market-intel-service-providershttps://www.kentik.com/blog/network-operator-confidential-ddos-trends-rpki-adoption-market-intel-service-providers<![CDATA[Lauren Basile]]>Tue, 16 May 2023 04:00:00 GMT<p>Kentik’s own <a href="https://www.kentik.com/analysis/">Doug Madory, head of internet analysis</a>, recently joined Mattias Friström, VP and chief evangelist at Arelion, and Sonia Missul, IP transit product manager at Orange International Carriers, as panelists on two webinars hosted by <a href="/resources/replay-what-network-teams-need-to-know-to-be-successful-in-2023/">Fierce Telecom</a> and <a href="https://www.kentik.com/resources/understanding-global-internet-connectivity-trends-and-insights-replay/">Capacity Media</a>, respectively. The webinars featured lively discussions on global connectivity patterns and events, trends in RPKI compliance, and how service providers can keep an eye on their market share and competitor activities.</p> <h2 id="the-changing-ddos-landscape">The changing DDoS landscape</h2> <p>As one of the 15 global Tier 1 providers, Orange’s vast network helps keep a pulse on the internet. Sonia shared some global connectivity trends Orange is seeing that will impact network teams this year:</p> <ul> <li>After record-breaking traffic levels during 2020 and 2021 due to the pandemic, IP transit has leveled out and is returning to pre-Covid levels.</li> <li>Increasingly, more critical business is being conducted on IP transit networks due to cloud usage and adoption across the enterprise.</li> <li>Leading service providers, including Orange, are investing more in threat mitigation system (TMS) platforms for robust DDoS protection.</li> </ul> <p>For the 100,000+ networks that make up the public internet, DDoS protection is always top of mind as the first line of network defense against increasing attacks – but analysis from Arelion shows a fundamental shift occurring. During the Fierce Telecom webinar, Mattias presented insights from Arelion’s internal traffic analysis showing DDoS attacks are actually trending <em>downwards</em>, with fewer attacks per week being recorded. This is in stark contrast to 2018-2020, when DDoS attacks increased in both size and volume YoY. What’s behind this? 
That has yet to be fully analyzed, but coordination between network service providers focused on rooting out sources of spoofed traffic could be one cause of the decline.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3MMAEOLpN0CwpXUfJGJLzb/bccc7721ecb40b4e27da9763bceac7eb/arelion-downward-ddos-trends.png" style="max-width: 800px; border: 8px solid #ebeff3; border-radius: 8px; padding: 10px;" class="image center no-shadow" alt="Arelion - Downward DDoS Trends" /> <p>In addition, as part of his analysis, Mattias shared interesting observations about the DDoS threat landscape spanning both volume/size-based and network protocol attacks. Based on Arelion’s internal data:</p> <ul> <li>Attack lengths are still very short, with 75% of attacks lasting less than 20 minutes</li> <li>The most common volumetric attacks are less than 5Gbps</li> <li>The majority of protocol attacks are lower than 5Mpps</li> <li>Attacks are typically carried out after office hours and over weekends</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/3x8OOv7RGTPUnE2zJHtnVG/f9e47e7773fa3340aa836f729e702b04/ddos-attack-length-pie-charts.png" style="max-width: 800px" class="image center" alt="DDoS Attack Length" /> <img src="//images.ctfassets.net/6yom6slo28h2/4wvJDtdXjzQrGYtQYok5W8/04f829429fc3daa35c9e1ce022a56190/ddos-alert-ip-location.png" style="max-width: 550px" class="image center" alt="DDoS Alert with IP Location" /> <p>Arelion is seeing significant growth globally in ‘carpet bombing,’ where an entire range of addresses or subnets, potentially containing hundreds or thousands of destination IP addresses, is attacked. Given how swiftly DDoS attacks are carried out, immediate detection and response time are critical to mitigating service downtime. Mattias encouraged service providers to embrace automation for DDoS protection where possible, noting Arelion now auto-mitigates more than 70% of DDoS attacks, with only sophisticated cases requiring manual intervention.</p> <p>Learn more about how <a href="https://www.kentik.com/blog/cybersecurity-cloudflare-and-kentik-mitigate-ddos-attacks/">Kentik integrates with Cloudflare to mitigate DDoS attacks on demand</a>.</p> <h2 id="the-rpki-tipping-point">The RPKI tipping point</h2> <p>An update on <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">Resource Public Key Infrastructure (RPKI) Route Origin Validation (ROV)</a> adoption was next on the docket. A leading authority on BGP analysis, Doug Madory presented his latest RPKI adoption findings and gave his take on what it means for the industry moving forward.</p> <p>As network operators across the internet continue to implement RPKI in a collaborative effort to increase global routing security, Doug noted there are several positive signs that we have reached a true adoption tipping point. To frame the conversation, he first outlined the two steps it takes to reject an RPKI-invalid BGP route:</p> <ol> <li>Route Origin Authorizations (ROAs) must be created to assert valid origin and prefix length</li> <li>Networks must reject RPKI-invalids</li> </ol> <p>On the ROA creation front, the below graphic from <a href="https://rpki-monitor.antd.nist.gov/">NIST</a> shows that the number of routes with ROAs has risen dramatically since 2020 and is on track to outpace routes without ROAs soon.</p>
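<p>For readers who haven’t worked with RPKI directly, an ROA is a small, cryptographically signed object. Conceptually, it carries just three essential pieces of information; the values below are illustrative documentation examples, not a real ROA:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Route Origin Authorization (conceptual view)

  Prefix:      203.0.113.0/24
  Max length:  /24
  Origin ASN:  AS64496

Meaning: only AS64496 may originate 203.0.113.0/24, and any announcement
more specific than /24 for this space is RPKI-invalid.</code></pre></div>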
<p>Doug estimates the curves will cross within the next year, at which point the majority of routes in circulation will have ROAs and be eligible for protection, thereby cementing RPKI as the internet’s cornerstone defense against hijacks due to accidental typos and leaks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3erf44Z0NBoiKOm0LsdIIk/506b1943c32323eb4db1ea674342f5b0/rpki-rov-history.png" style="max-width: 650px" class="image center" alt="RPKI" /> <p>However, by evaluating RPKI adoption using Kentik’s NetFlow traffic data, we can see that the state of ROA creation is better than what the current NIST metrics reflect. Specifically, Doug observed that while only 43% of routes have ROAs and are valid, 63.9% of traffic (in bits/sec) is going to RPKI-valid routes.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2AR20y4cY2LgUIWA6GwbX1/2d632fecd0770fd4bc904d823ea791d1/internet-traffic-rpki-202304.png" style="max-width: 550px" class="image center" alt="Internet traffic volume by RPKI evaluation" /> <p>As he described in a <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">blog post last year</a>, this discrepancy can be attributed to RPKI deployments at large content providers and major access networks, which push much of the traffic on the internet.</p> <p>The second part of the RPKI analysis considered the degree to which RPKI-invalid routes are rejected on the internet. Developing a reliable methodology to determine which ASes reject RPKI-invalid routes is an active area of research. But Doug explained that we can see the impact of RPKI on invalid routes without knowing precisely which ASes are rejecting them.</p> <p>In an <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">analysis from last year</a>, he investigated what happened to BGP routes that are evaluated as RPKI-invalid. The conclusion was that they experienced a reduction in propagation of anywhere between one-half and two-thirds.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7uL7smjzcsmg6Hec6Pvl1A/294032450ffb693669e9c8ab9c14625a/IPv4-v6_propagation_by_RPKI.png" style="max-width: 800px" class="image center" alt="BGP routes - RPKI invalid" /> <h2 id="rpki-success-storytime">RPKI success storytime</h2> <p>To further illustrate the real-world impact of RPKI adoption, Doug shared a story of it in action. In January, Pakistan’s PTCL accidentally leaked 90.0.0.0/24, which is a more-specific route of Orange’s (AS3215) 90.0.0.0/17. If it had been globally circulated, it could have caused disruptions to Orange customers using that IP address range.</p> <p>Thankfully, the route had an ROA, enabling most of PTCL’s providers to reject the leaked route automatically. The screenshot below shows the leaked more-specific route propagating to only 25% of the internet. The limited propagation was due to RPKI and required no human intervention. Once Kentik noticed the leak, we alerted Orange, who contacted PTCL to block the route. If only one more major backbone carrier had RPKI fully implemented, then it likely would not have circulated at all - a reality that we may be on the cusp of achieving.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3K0RjrXbX1K7IDDa5KibpF/6b37b3f032a0eda7c8f08da2cc959800/orange-kentik-synthetics.png" withFrame style="max-width: 600px" class="image center" alt="Orange - BGP Monitoring in Kentik Synthetics" /> <p>A natural question then quickly arises – why hasn’t the entire industry adopted RPKI yet? 
While creating ROAs doesn’t cost providers any money, Doug estimates that a lack of awareness, older hardware that makes rejecting invalids difficult, and limited engineering team resources are contributing to adoption delays.</p> <p>During the Q&#x26;A portion, a thought-provoking question was raised about whether RPKI is worth doing, given that most internet traffic is transmitted via direct private peering rather than IP transit. Both Doug and Mattias wholeheartedly support RPKI measures, arguing that IP transit is very important: although it may not carry the majority of traffic percentage-wise, critical queries and internet traffic travel through IP transit connections that must be secured.</p> <p>Arelion, the first Tier 1 to fully implement RPKI and a fierce advocate for global routing security, reinforced that the RPKI tipping point is here. Mattias shared how more RFQs from large customers are requiring RPKI – a resounding alarm to backbone carriers and major players that adoption will soon no longer be a choice.</p> <h2 id="the-future-of-global-routing-security">The future of global routing security</h2> <p>While RPKI is not 100% foolproof (AS paths can be forged and slip past it undetected), its widespread adoption is a significant first step in the journey to fully securing BGP. Doug predicts the next routing security frontiers the industry will need to embrace are <a href="https://www.kentik.com/telemetrynow/s01-e12/">BGPsec and autonomous system provider authorization (ASPA)</a>, which would help close the gap in securing BGP routing from determined adversaries.</p> <h2 id="how-to-stay-ahead-of-the-curve-with-market-intel">How to stay ahead of the curve with market intel</h2> <p>Wrapping up our highlights, Doug gave a quick video walkthrough of the <a href="https://www.kentik.com/product/market-intelligence/">Kentik Market Intelligence (KMI) tool</a>, which both Orange and Arelion use to keep an eye on the market. KMI uses BGP routing data to display AS transit and peering relationships for any market in the world, equipping service providers with instant access to data to better understand their market share and customer and competitor activities.</p> <p>Be sure to watch the <a href="/resources/replay-what-network-teams-need-to-know-to-be-successful-in-2023/">webinar recording</a> to see the full demo, where Doug walks through network planning, sales, and marketing use cases for the tool.</p> <p>Have questions or want to learn more? <a href="https://www.kentik.com/get-demo/">Sign up for a personalized demonstration</a> or <a href="/get-started/">start a trial</a> to see Kentik in action.</p><![CDATA[A Step-by-Step Guide to Writing VPC Flow Logs to an S3 Bucket]]><![CDATA[Virtual Private Cloud (VPC) flow logs are essential for monitoring and troubleshooting network traffic in an AWS environment. In this article, we'll guide you through the process of writing AWS flow logs to an S3 bucket.]]>https://www.kentik.com/blog/a-step-by-step-guide-to-writing-vpc-flow-logs-to-an-s3-buckethttps://www.kentik.com/blog/a-step-by-step-guide-to-writing-vpc-flow-logs-to-an-s3-bucket<![CDATA[Phil Gervasi]]>Fri, 12 May 2023 04:00:00 GMT<p>In the world of Amazon Web Services, flow logs are analogous to the flow records (e.g., NetFlow, sFlow, etc.) generated by devices on physical networks.</p>
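<p>To make the analogy concrete, here is what a single record looks like in the default format. The values are illustrative (modeled on the examples in AWS’s documentation), and the field order shown assumes the default version 2 format:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># version account-id interface-id srcaddr dstaddr srcport dstport
# protocol packets bytes start end action log-status
2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK</code></pre></div>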
<p>A flow log consists of a set of records about the flows that either originated or ended in a given Virtual Private Cloud, with each individual record made up of a set of fields providing information about a single flow.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6frgkVjwHYYAUiMUAKaau4/73d67a564490fd4162962e54525c8eb9/sample-record.png" class="image center" style="max-width: 650px;" alt="VPC flow log example" /> <p><a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/">Virtual Private Cloud (VPC) flow logs</a> are essential for monitoring and troubleshooting network traffic in an AWS environment. Storing these logs in an Amazon S3 bucket allows for easy accessibility, analysis, and long-term retention.</p> <p>This blog post will walk you through the process of writing VPC flow logs to an S3 bucket step by step.</p> <h2 id="prerequisites">Prerequisites</h2> <p>Before you begin, ensure you have the following:</p> <ol> <li>An active AWS account</li> <li>A configured VPC with one or more subnets</li> <li>An S3 bucket to store the VPC flow logs</li> </ol> <h2 id="step-1-create-an-iam-policy-for-vpc-flow-logs">Step 1: Create an IAM Policy for VPC Flow Logs</h2> <p>To write VPC flow logs to an S3 bucket, you must create an IAM policy granting the necessary permissions.</p> <ol> <li>Navigate to the <a href="https://aws.amazon.com/console/">AWS Management Console</a> and open the <a href="https://console.aws.amazon.com/iam/">IAM service</a>.</li> <li>Click <strong>Policies</strong> in the left navigation pane, then click the <strong>Create policy</strong> button.</li> <li>Select the <strong>JSON</strong> tab and paste the following policy, replacing <strong>your-bucket-name</strong> with the name of your S3 bucket. (Note that <strong>Version</strong> is the IAM policy language version, which is always the literal string <strong>2012-10-17</strong>, and that <strong>s3:GetBucketAcl</strong> applies to the bucket itself rather than to the objects within it.)</li> </ol> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": "s3:PutObject", "Resource": "arn:aws:s3:::your-bucket-name/*" }, { "Effect": "Allow", "Action": "s3:GetBucketAcl", "Resource": "arn:aws:s3:::your-bucket-name" } ] }</code></pre></div> <ol start="4"> <li>Click <strong>Review policy</strong>, give your policy a name and description, and click <strong>Create policy</strong>.</li> </ol> <h2 id="step-2-create-an-iam-role-for-vpc-flow-logs">Step 2: Create an IAM Role for VPC Flow Logs</h2> <p>Next, create an IAM role that uses the policy you created in Step 1.</p> <ol> <li>In the IAM service, click <strong>Roles</strong> in the left navigation pane and click <strong>Create role</strong>.</li> <li>For <strong>Trusted entity type</strong>, choose <strong>Custom trust policy</strong> so that the VPC Flow Logs service is allowed to assume the role.</li>
<li>In the trust policy editor, replace <strong>"Principal": {}</strong>, with the following, then select <strong>Next</strong>.</li> </ol> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">"Principal": { "Service": "vpc-flow-logs.amazonaws.com" },</code></pre></div> <ol start="4"> <li>On the <strong>Add permissions</strong> page, select the checkbox for the policy you created in Step 1, then choose <strong>Next</strong>.</li> <li>Enter a name for your role and provide an optional description.</li> <li>(Optional) Add any tags you’d like.</li> <li>Finally, choose <strong>Create role</strong>.</li> </ol> <h2 id="step-3-configure-vpc-flow-logs-to-write-to-s3-bucket">Step 3: Configure VPC Flow Logs to Write to S3 Bucket</h2> <p>Now that you have the necessary IAM role, you can configure VPC flow logs to write to your S3 bucket.</p> <ol> <li>Open the VPC service in the AWS Management Console.</li> <li>In the left navigation pane, click on <strong>Your VPCs</strong>.</li> <li>Select the VPC you want to create flow logs for, then click on the <strong>Actions</strong> button and choose <strong>Create flow log</strong>.</li> <li>In the <strong>Filter</strong> section, choose the type of traffic you want to capture (All, Accept, or Reject).</li> <li>In the <strong>Destination</strong> section, choose <strong>Send to an S3 bucket</strong>.</li> <li>Provide the ARN of your S3 bucket in the format <strong>arn:aws:s3:::your-bucket-name</strong>.</li> <li>For the <strong>IAM Role</strong>, select the role you created in Step 2.</li> <li>Click <strong>Create flow log</strong>.</li> </ol> <h2 id="step-4-view-your-vpc-flow-log-records">Step 4: View Your VPC Flow Log Records</h2> <p>With everything properly configured, you can now view your flow log records in the S3 console. Remember, it may take up to ten minutes for the first log files to appear in your S3 bucket, so be patient.</p> <ol> <li>Open the Amazon S3 console at <a href="https://console.aws.amazon.com/s3/">https://console.aws.amazon.com/s3/</a>.</li> <li>Click the name of the bucket to open its details page.</li> <li>Navigate to the folder with the log files. An example of what that path would look like is:</li> </ol> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">prefix/AWSLogs/account_id/vpcflowlogs/region/year/month/day/</code></pre></div> <ol start="4"> <li>Select the checkbox next to the file name and choose <strong>Download</strong>.</li> </ol> <h2 id="conclusion">Conclusion</h2> <p>Congratulations! You have now successfully configured VPC flow logs to write to an S3 bucket.</p> <p>These flow logs will be stored in your specified S3 bucket, making it easy for you to access, analyze, and retain your VPC network traffic data. Remember to monitor your logs regularly to gain valuable insights into your network activity and troubleshoot potential issues.</p> <p>Learn more about <a href="https://www.kentik.com/solutions/usecase/amazon-web-services/">AWS cloud observability</a> and how <a href="https://www.kentik.com/product/cloud/">Kentik Cloud</a> can help you troubleshoot issues and ensure smooth migrations to and from AWS.</p><![CDATA[Modernizing Network Data Analytics With a Unified Data Repository]]><![CDATA[In this post, learn about what a UDR is, how it benefits machine learning, and what it has to do with networking. Analyzing multiple databases using multiple tools on multiple screens is error-prone, slow, and tedious at best. 
Yet, that’s exactly how many network operators perform analytics on the telemetry they collect from switches, routers, firewalls, and so on. A unified data repository unifies all of that data in a single database that can be organized, secured, queried, and analyzed better than when working with disparate tools.]]>https://www.kentik.com/blog/modernizing-data-analytics-with-a-unified-data-repositoryhttps://www.kentik.com/blog/modernizing-data-analytics-with-a-unified-data-repository<![CDATA[Phil Gervasi]]>Thu, 11 May 2023 04:00:00 GMT<h2 id="what-is-a-udr-unified-data-repository">What is a UDR (Unified Data Repository)?</h2> <p>A unified data repository (UDR) is a centralized storage system that consolidates, organizes, and manages data from different sources. It serves as a single point of access for various types of data, which could be a significant volume of very diverse data formats in an extensive network.</p> <h2 id="why-do-we-need-a-udr">Why do we need a UDR?</h2> <p>In a typical network, there are so many different types of devices and services producing some form of telemetry that there is the potential to end up with disparate visibility tools and databases. In fact, this is very common to see in many network operations centers. This is a problem because having many separate databases makes it difficult for a network operations team to collect, organize, secure, and, most importantly, analyze all the data simultaneously.</p> <p>We want to be able to do that simply because application delivery doesn’t rely on just one type of network device or part of a network infrastructure. Instead, application delivery touches a massive number of devices, network-adjacent devices and services, the public internet itself, and so on. Therefore, we have to analyze all of this data as a whole to truly understand application performance over the network.</p> <p>The alternative is to have separate visibility tools that a human engineer looks at one at a time, searching for meaningful correlation and deriving insights in their own head. This process of manual clue-chaining is tedious at best, error-prone, and impossible to do at scale. And this problem will only get worse as new forms of telemetry are collected from the ever-changing nature of networks.</p> <p><em>The primary purpose of a UDR is to help solve this problem and enable more comprehensive data analysis across data types, formats, and from different sources.</em></p> <h2 id="how-a-udr-benefits-machine-learning">How a UDR benefits machine learning</h2> <p>Machine learning relies on large amounts of data to build models, make predictions, find correlations, and ensure its results’ accuracy. ML is inherently data-driven, so it’s vital to implement the right data management strategy for data analysis to be successful.</p> <p>UDRs can be implemented using various technologies, such as data warehouses (for structured data), data lakes (for unstructured data), or hybrid solutions that combine both. The choice of technology largely depends on an organization’s specific needs, data types, volume, and the desired level of scalability and flexibility.</p> <p>A UDR plays a critical role in machine learning projects by:</p> <ul> <li><strong>Providing high-quality, consistent data</strong>: A UDR streamlines data management and improves data quality, ensuring that machine learning models can access accurate and consistent data for training and validation. 
Because applications rely on so many devices and services, we must analyze them all at the same time using consistent scaling, etc.</li> <li><strong>Accelerating the training process</strong>: By consolidating data into a single location, a UDR reduces the time spent on data collection and preprocessing, allowing engineers, data scientists, and ML practitioners to focus on developing and optimizing the data analysis workflow.</li> <li><strong>Enhancing model performance</strong>: With access to a wide range of diverse data in a single database, ML models can be trained on more representative samples, leading to better generalization, prediction, and improved performance in real-world scenarios. This is very important, especially in networking, in which engineers care about understanding trends, seasonality, and the predictive capacity of a visibility tool.</li> <li><strong>Facilitating collaboration</strong>: A UDR enables data scientists, engineers, and others to collaborate more effectively on data analytics projects by providing a centralized data source, reducing the risk of duplicated efforts or inconsistent results. This is a direct benefit to operational teams running networks day-to-day.</li> </ul> <p>Remember that most network telemetry is unstructured and unlabeled data. This means that to apply certain ML models, the data must first be labeled, organized, and structured somehow. To do this, we want to use a pre-processing workflow to standardize and normalize data of different types, formats, and scales.</p> <p>Once this is complete, the UDR unifies these now normalized, standardized, and scaled data so that a single tool can perform the automated data analysis efficiently, accurately, and much faster than a human engineer. That’s when we can start identifying correlations and patterns among multiple data types from different sources and over time.</p> <h2 id="how-a-udr-benefits-networking">How a UDR benefits networking</h2> <p>For networking, a unified database means the algorithms have the data necessary to make accurate predictions and find meaningful correlations across data types and formats.</p> <p><em>In other words, the more unified the data, the better the analytics and the better the results.</em></p> <p>For example, a typical network operations center might have one <a href="https://www.kentik.com/kentipedia/netflow-analyzers-and-netflow-tools/">tool to collect NetFlow information</a>, another for SNMP, another for packet analysis, another for cloud logs, and so on. Each of these tools, though potentially excellent on its own, has separate underlying databases. This means it’s up to a human engineer to log into each tool, look for the relevant data, then log into the following tool and search for the related data again.</p> <p>This is not only error-prone but also incredibly tedious and slow, even in a medium-sized network environment. Therefore, we can improve day-to-day network operations using a UDR and appropriate data analysis workflows. A UDR will also allow ML models to make better predictions. 
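</p> <p>To see the difference a single repository makes, consider what a cross-telemetry question looks like when everything lives in one queryable store. The schema below is purely hypothetical (the table and column names are invented for illustration), but it captures the idea: one query spanning flow, SNMP, and synthetic-test data instead of three tools and three exports:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">-- Hypothetical unified schema: correlate traffic volume, interface errors,
-- and measured latency for one application over the same time window
SELECT t.device_name,
       SUM(t.bytes)              AS total_bytes,
       AVG(m.interface_errors)   AS avg_errors,
       AVG(s.latency_ms)         AS avg_latency_ms
FROM flow_records t
JOIN snmp_metrics    m ON m.device_name = t.device_name
JOIN synthetic_tests s ON s.device_name = t.device_name
WHERE t.application = 'checkout'
  AND t.ts > now() - interval '1 hour'
GROUP BY t.device_name
ORDER BY avg_latency_ms DESC;</code></pre></div> <p>Correlation like this is exactly what gets lost when each data type is locked away in its own tool.</p>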
<p>Consider seasonality as an example: instead of identifying it with only one data type, such as NetFlow, an ML model operating from a unified database of multiple forms of telemetry can go beyond NetFlow seasonality alone and programmatically identify how that seasonality relates to a specific application’s latency, possible security vulnerabilities, and so on.</p> <h2 id="kentiks-udr---kentik-network-observability-platform">Kentik’s UDR - Kentik Network Observability Platform</h2> <p>Kentik uses a custom-built scalable columnar datastore called Kentik Data Engine. KDE ingests flow data (e.g., NetFlow or sFlow), correlates it with additional data such as GeoIP and BGP, and stores it (redundantly) in flow records that can be queried from the Kentik portal, via Kentik APIs, or through a fully ANSI SQL-compliant PostgreSQL interface.</p> <p>KDE keeps a discrete set of databases for the flow records of each customer. These databases are made up of “Main Tables” for the flow records and associated data received for each device, as well as for additional data learned from the flow data. The columns of those Main Tables contain the data that Kentik queries, and most of these columns are represented as dimensions used for filtering and group-by in queries.</p> <p>Everything ingested or learned by backend data analytics is stored in the KDE, making it the unified data repository for the Kentik platform. This unified, data-driven approach to network observability allows advanced analytics across data sources and types in a single platform.</p> <p>Watch the <a href="https://www.kentik.com/resources/data-driven-network-observability/">Data-Driven Network Observability</a> presentation from Networking Field Day 31 to learn more.</p><![CDATA[Resilience vs Redundancy in Networking: Exploring Key Comparisons]]><![CDATA[In this post, we discuss the crucial differences between resilience versus redundancy in networking. Learn how to optimize your network for seamless performance.]]>https://www.kentik.com/blog/resilience-and-redundancy-in-networkinghttps://www.kentik.com/blog/resilience-and-redundancy-in-networking<![CDATA[Stephen Condon]]>Tue, 09 May 2023 04:00:00 GMT<p>Predictability in network flows is the ability to consistently deliver traffic from one point to another, even in the face of disruptions. Yet, establishing predictability has its share of challenges. Learn all about resilience in networking and how it relates to redundancy.</p> <p>Let’s start with some definitions to set the stage.</p> <h2 id="core-concepts-for-resilience-vs-redundancy">Core concepts for resilience vs redundancy</h2> <h3 id="what-is-resilience-in-networking">What is resilience in networking?</h3> <p><strong>Resilience in networking is the ability of a network to withstand and quickly recover from failures or changes in its environment</strong>. This includes the ability to:</p> <ul> <li>Dynamically adjust to changes in <a href="https://www.kentik.com/kentipedia/what-is-network-topology/">network topology</a></li> <li>Detect and respond to outages</li> <li>Route around faults in order to maintain connectivity and service levels.</li> </ul> <h3 id="why-is-resilience-important">Why is resilience important?</h3> <p>Resilience in networking is more than just a measure to ensure smooth operations; it’s an assurance of continuity, reliability, and trust in the network’s ability to perform its function, even in challenging scenarios. 
Let’s explore why resilience holds such significance:</p> <ol> <li> <p><strong>Fault Tolerance</strong>: This is a primary goal of resilience. In the world of networking, faults are inevitable. They can arise from various factors, including hardware malfunctions, software glitches, or unexpected disruptions. A resilient network is designed to tolerate these faults, ensuring that the overall system remains functional even when certain parts of the network malfunction.</p> </li> <li> <p><strong>High Availability</strong>: Organizations rely on networks to perform crucial operations, from facilitating communications to carrying out transactions. A network’s downtime can mean significant losses for businesses. High availability ensures the network remains operational for the maximum possible time, minimizing outages and interruptions. Resilient networks achieve this by dynamically adjusting to changes and rerouting traffic as needed.</p> </li> <li> <p><strong>Disaster Recovery</strong>: Disasters can be natural, like earthquakes, floods, or fires, or they can be artificial, such as cyberattacks or power outages. A resilient network incorporates disaster recovery protocols that enable quick restoration of network functionality after a catastrophic event. This could involve switching to backup systems, rerouting traffic to alternative paths, or activating standby resources.</p> </li> <li> <p><strong>Business Continuity</strong>: Today’s businesses are intrinsically linked with their networks. Whether it’s an e-commerce platform relying on its online storefront or a global corporation depending on its intranet for daily operations, any disruption can have cascading effects on the bottom line and brand reputation. Resilience ensures business continuity by prioritizing critical network functions, mitigating risks, and ensuring that the network can adapt to changing conditions.</p> </li> <li> <p><strong>Trust and Reputation</strong>: Customers, clients, and stakeholders trust businesses that offer consistent and reliable services. Network disruptions can lead to dissatisfied customers, missed opportunities, and tarnished reputations. By emphasizing resilience, companies demonstrate a commitment to delivering uninterrupted services, fostering trust and enhancing their reputation in the market.</p> </li> </ol> <div as="Promo"></div> <h3 id="what-is-redundancy-in-networking">What is redundancy in networking?</h3> <p><strong>Redundancy in networking is the use of multiple interchangeable components to increase the reliability of a system</strong>. These components can include hardware, software, or services. When one of the components fails, it can automatically be replaced with another component, keeping the network intact.</p> <h3 id="why-is-network-redundancy-important">Why is network redundancy important?</h3> <p>Network redundancy ensures that networks can maintain service levels and remain reliable during outages or disruptions. Redundancy can also be used to increase performance and handle temporary spikes.</p> <p>Redundancy is a cornerstone of resilience. Eventually, every component will fail. Hardware components have a finite lifetime. Even software components will have bugs, faulty designs, or misconfigurations. 
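</p> <p>A back-of-the-envelope calculation shows why redundancy is such a powerful answer to unreliable parts. The numbers below are a simplified sketch that assumes the two components fail independently, which real deployments only approximate:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">One component at 99% availability:      ~3.65 days of downtime per year

Two redundant components in parallel:
  1 - (1 - 0.99)^2 = 1 - 0.0001 = 99.99%
                                        ~53 minutes of downtime per year</code></pre></div>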
<p>Redundancy allows you to establish resilient and reliable networks from fallible and unreliable components.</p> <p>Now that we’ve laid out these core concepts, let’s take a closer look at how to build redundancy in a network.</p> <h2 id="how-to-build-redundancy-in-networking">How to build redundancy in networking</h2> <p>In IP networks, redundancy is accomplished by providing multiple paths for traffic to travel through. If one path is disrupted or fails, traffic can be rerouted through another path. This allows the network to remain operational even if some parts are unresponsive or slow.</p> <p>Common network redundancy mechanisms include node and link redundancy, redundant power supplies, and load balancers. However, simply installing redundant components is not enough. The network must be able to quickly detect when packets fail (or take too long) to route along a specific path and fail over to an alternative path to maintain connectivity and service levels.</p> <h2 id="what-are-the-different-types-of-network-redundancies">What are the different types of network redundancies?</h2> <p>To create a network that can adapt, adjust, and maintain its functionality despite unforeseen circumstances, various forms of network redundancy have been devised. Here are some of the most prominent types of network redundancies that NetOps professionals should be familiar with:</p> <ol> <li> <p><strong>Multiple Spanning Trees (MST)</strong> is an extension of the basic Spanning Tree Protocol (STP). It is designed to alleviate some of the limitations of STP, especially in larger or more complex networks. MST allows for multiple VLANs to be mapped to fewer spanning-tree instances, reducing the load on network devices and optimizing network resources.</p> </li> <li> <p><strong>Multi-Protocol Label Switching (MPLS)</strong> provides a mechanism to route traffic based on simple label switching instead of traditional IP routing. While MPLS isn’t strictly a “redundancy” mechanism, it can be used with other techniques to create redundant network paths. In essence, MPLS optimizes traffic flow and can be part of a redundant network design.</p> </li> <li> <p><strong>Diverse Trunking</strong> is a straightforward redundancy approach that ensures continuous communication by establishing multiple communication pathways. It’s a classic form of redundancy and is very effective for ensuring network resilience.</p> </li> <li> <p><strong>Virtual Router Redundancy Protocol (VRRP)</strong> is a high-availability protocol designed to eliminate the single point of failure in a static default-route environment. By creating a virtual router (an abstraction over the physical routers), VRRP ensures that if the active router fails, the backup router takes over the virtual IP address and continues forwarding (see the configuration sketch later in this post).</p> </li> <li> <p><strong>BGP Multipath</strong>: BGP is a path-vector protocol that traditionally selects a single best path based on a list of attributes. However, there are scenarios where multiple paths are available and equally desirable because they share the same attributes (like AS_PATH length, MED, etc.). Instead of selecting one and discarding the rest, BGP multipath allows for the utilization of multiple paths to distribute outgoing traffic. 
This offers two primary benefits: <strong>Load Balancing</strong>—By leveraging multiple equally preferable paths, networks can distribute the traffic over these paths, thus efficiently utilizing available bandwidth and preventing a single link from becoming a bottleneck. <strong>Redundancy</strong>—If one of the paths experiences an issue or becomes unavailable, traffic can continue to flow using the other paths, ensuring uninterrupted network services.</p> </li> </ol> <p>When implementing BGP multipath, it’s crucial to ensure proper configurations and monitor the distribution of traffic to avoid unintentional traffic engineering issues or uneven load distribution.</p> <p>While these techniques and protocols enhance redundancy and resilience, they should be carefully designed and implemented. Misconfiguration or lack of understanding can lead to network anomalies or even failures. Furthermore, the choice of which redundancies to implement should be based on the specific requirements, constraints, and goals of the network in question.</p> <h2 id="the-advantages-of-redundancy-in-networking">The advantages of redundancy in networking</h2> <p>Incorporating redundancy within a network design is not just about having backup systems—it’s about ensuring robustness, adaptability, and an enhanced user experience. Here are some key advantages:</p> <ol> <li> <p><strong>Increased Network Reliability</strong>: Redundancy is pivotal in ensuring network availability. By having backup systems or pathways, the risk of a total network failure is significantly reduced. This consistent functionality fosters trust among users and stakeholders.</p> </li> <li> <p><strong>Improved Network Uptime</strong>: Downtime can be financially and reputationally costly. Network redundancy ensures that even when certain components fail, the network as a whole remains operational, leading to enhanced uptime and ensuring business continuity.</p> </li> <li> <p><strong>Enhanced Performance and Load Balancing</strong>: Redundant systems often facilitate load balancing, distributing data traffic across multiple paths or servers. This not only optimizes the use of network resources but also ensures a smoother, lag-free user experience.</p> </li> <li> <p><strong>Resilience Against Cyber Attacks</strong>: Redundant networks offer an added layer of security against cyber threats. If one part of the network is compromised, traffic can be rerouted through another safe path, minimizing potential damage and ensuring uninterrupted service.</p> </li> <li> <p><strong>Scalability and Future-Proofing</strong>: As businesses grow and data traffic increases, the network must scale accordingly. Redundant systems are inherently scalable, allowing for the addition of new components without major overhauls. Plus, by accounting for future demands and potential technological advancements, network redundancy future-proofs the system against upcoming challenges.</p> </li> </ol> <h2 id="the-disadvantages-of-redundancy-in-networking">The disadvantages of redundancy in networking</h2> <p>Although redundancy in networks is important, keep in mind that it also has some downsides. First, redundancy increases the complexity of the network. Also, a network must be appropriately configured to leverage redundancy measures to provide the expected benefits. Finally, redundancy adds cost. 
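</p> <p>Before closing out the cost discussion, here is the configuration sketch promised above. The syntax is classic Cisco IOS-style VRRP and the addresses are documentation values; other platforms express the same idea with different commands:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">! Primary router: owns the virtual gateway 192.0.2.1 while healthy
interface GigabitEthernet0/0
 ip address 192.0.2.2 255.255.255.0
 vrrp 10 ip 192.0.2.1
 vrrp 10 priority 120
 vrrp 10 preempt

! Backup router: same virtual IP, lower priority; takes over on failure
interface GigabitEthernet0/0
 ip address 192.0.2.3 255.255.255.0
 vrrp 10 ip 192.0.2.1
 vrrp 10 priority 100</code></pre></div> <p>Hosts simply use 192.0.2.1 as their default gateway and never notice which physical router is currently answering.</p>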
<p>And on that last point: in some cases, the return on investment might not justify the cost.</p> <p>While redundancy significantly contributes to network resilience, other mechanisms, protocols, and methods can also contribute to overall network resilience.</p> <h2 id="additional-network-resilience-mechanisms">Additional network resilience mechanisms</h2> <p>Successfully routing a packet over the internet from its source to its destination is not trivial. Many network protocols have been designed to handle different aspects of this process. Let’s highlight some of the primary protocols.</p> <h3 id="built-in-data-verification-with-checksums">Built-in data verification with checksums</h3> <p>Network packets are sent with a checksum. If the payload (or checksum) somehow becomes corrupt along the way, the checksum will not match, and the packet is discarded; reliable protocols such as TCP then retransmit it. This verification mechanism keeps corrupted data from being silently accepted.</p> <h3 id="message-integrity-and-guaranteed-delivery-with-tcpip">Message integrity and guaranteed delivery with TCP/IP</h3> <p>Because IP does not require acknowledgments from endpoints, it does not ensure delivery; therefore, it is considered an unreliable protocol.</p> <p>On the other hand, Transmission Control Protocol (TCP) provides a connection-based, reliable byte stream. TCP/IP is TCP built on top of IP, and it ensures that bytes sent from a source to a destination will be received in the order sent. Under the hood, TCP handles retransmissions, breaking and recomposing packets, and much more. TCP/IP is one of the fundamental networking protocols and the basis for common protocols like HTTP.</p> <h3 id="ip-address-abstraction-and-network-failover-capabilities-with-dns">IP address abstraction and network failover capabilities with DNS</h3> <p>Domain Name System (DNS) is a system for mapping domain names to IP addresses. By using DNS, users can access a website or other service using a human-readable name without needing to remember the IP address. DNS also provides redundancy and failover capabilities by allowing multiple IP addresses to be associated with a single domain name. If one IP address fails, traffic can be routed to another to maintain service levels.</p> <h3 id="maintaining-network-connectivity-with-bgp">Maintaining network connectivity with BGP</h3> <p>Border Gateway Protocol (BGP) is a routing protocol that connects different networks over the internet. It is used to maintain network connectivity by helping routers find the best path for traffic to travel through. BGP also helps to create network resilience by allowing routers to quickly detect and respond to outages or changes in the network topology. Traffic can be rerouted around outages or disruptions, maintaining service levels.</p> <p>Now that we have looked at several network resilience mechanisms, let’s examine how predictability in network flows is related to network resilience.</p> <h2 id="what-is-predictability-in-network-flows">What is predictability in network flows?</h2> <p>Predictability in network flows is the ability to consistently deliver traffic from one point to another—even in the face of disruptions. This is another facet of network resilience. Let’s consider the most important concepts.</p> <h3 id="how-network-flows-enter-and-leave-a-network">How network flows enter and leave a network</h3> <p>Network flows enter and leave a network through network interfaces or routers, which connect different networks. 
Routers use routing protocols such as <a href="https://www.kentik.com/kentipedia/what-is-bgp-border-gateway-protocol/">Border Gateway Protocol (BGP)</a> to determine the best path for traffic and then forward the traffic to its destination based on that path. Routers also provide security features such as firewalls and access control lists, which can help to protect the network from malicious traffic. Other lower-level pieces, such as Address Resolution Protocol (ARP) and Media Access Control (MAC), are also involved in this routing step.</p> <h3 id="ingress-and-egress-stability-when-a-fault-occurs-on-the-path-between-them">Ingress and egress stability when a fault occurs on the path between them</h3> <p>“Ingress” and “egress” are the points at which traffic enters and exits a network. When a fault (such as a link or a node failure) occurs on the path between the ingress and egress points, the ingress or egress points are not affected directly. The various protocols and mechanisms we have discussed should be able to reroute traffic around the fault.</p> <h3 id="the-challenge-of-maintaining-predictable-network-flows">The challenge of maintaining predictable network flows</h3> <p>Predictable network flows are a desirable property of networks, but they are challenging to accomplish. The following are concrete scenarios that can challenge a network’s ability to maintain predictable network flows.</p> <ul> <li><strong>Traffic spikes</strong> can cause bottlenecks that lead to packet loss, retransmission, timeouts, and additional overall load on the network.</li> <li><strong>Changes in network topology</strong>, such as adding or removing devices, can cause unpredictable changes to a network’s performance and even inhibit the reachability of some endpoints.</li> <li><strong>Hardware failures</strong> can be as minimal and local as a router or switch going down or as severe as an entire underwater cable getting cut.</li> <li><strong>Software bugs and device misconfigurations</strong> can cause disruption—from lost packets to poor performance—and might be very challenging to detect in large, dynamic networks.</li> <li><strong>Security threats</strong> are a constant danger as networks are often the entry point for an attack.</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>In this post, we’ve covered the basics of resilience and redundancy in networking, providing a detailed treatment of mechanisms for network resilience and predictability of network flows.</p> <p>Redundancy is essential for creating resilience in IP networks by providing multiple paths for traffic to travel through. Additional mechanisms for network resilience include data verification, TCP/IP, DNS, and BGP.</p> <p>Predictability in network flows is the ability to consistently deliver traffic from one point to another, even in the face of disruptions. Yet, establishing predictability in network flows has its share of challenges.</p> <p>In order to operate a reliable, performant, secure, and cost-effective network, it is crucial to understand network resilience and how to improve it using redundancy and other mechanisms.</p> <p><a href="https://www.kentik.com/blog/reinforcing-networks-advancing-resiliency-and-redundancy-techniques/">Continue reading about reinforcing networks and advancing resiliency and redundancy techniques</a>.</p><![CDATA[Why Your Data-Driven Strategies for Network Reliability Aren’t Working]]><![CDATA[What do network operators want most from all their hard work? 
The answer is a stable, reliable, performant network that delivers great application experiences to people. In daily network operations, that means deep, extensive, and reliable network observability. In other words, the answer is a data-driven approach to gathering and analyzing a large volume and variety of network telemetry so that engineers have the insight they need to keep things running smoothly.]]>https://www.kentik.com/blog/optimizing-network-stability-and-reliability-through-data-driven-strategieshttps://www.kentik.com/blog/optimizing-network-stability-and-reliability-through-data-driven-strategies<![CDATA[Phil Gervasi]]>Tue, 25 Apr 2023 04:00:00 GMT<p>What do network engineers working in the trenches, slinging packets, untangling coils of fiber, and spending too much time in the hot aisle really want from all their efforts? The answer is simple. They want a rock-solid, reliable, stable network that doesn’t keep them awake at night and ensures great application performance.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4Lx4cuGmlesEPTrcjN7Bvd/b9fd5e3c133f4fb543a3ea90bc18aa56/nfd-logo-org-border.png" style="max-width: 150px;" class="image right no-shadow" alt="Networking Field Day logo" /> <p>So how do we get there? What do we have to do short of locking all end-users out of their computers to have five minutes of peace?</p> <p>This was what Kentik was all about at Networking Field Day 31. We believe a data-driven approach to network operations is the key to maintaining the mechanism that delivers applications from data centers, public clouds, and containerized architectures to actual human beings. Ultimately, we’re solving a network operations problem using a data-driven approach.</p> <h2 id="more-data">More data!</h2> <img src="//images.ctfassets.net/6yom6slo28h2/2fdhrhufokD33998mIfucM/4bf892449b6ed45a43d7c5d079250e6c/lt-commander-data.jpg" style="max-width: 250px;" class="image left" alt="Lieutenant Commander Data" /> <p>A data-driven approach means nothing if it doesn’t mean more data. If you think about everything application traffic flows through between its source and destination, the sheer variety and volume of physical and virtual devices are enormous. Some of these devices an enterprise network engineer owns and manages, and many of them they don’t.</p> <p>Rather than collecting <em>only</em> flow data from routers, or <em>only</em> eBPF information from containers, the fact that application traffic traverses so many devices means we need much more data to get an accurate view of what’s happening.</p> <p>Only then can we pinpoint <em>why</em> one of our data center ToR switches is overwhelmed with unexpected traffic, <em>why</em> our line-of-business application is experiencing latency over the SD-WAN, <em>why</em> an OSPF adjacency is flapping, or <em>why</em> our SaaS app performance is terrible despite having a ton of available bandwidth.</p> <p>These are all examples of instability in our network and its adverse effects on application performance. But notice all of these examples start with “why.” That’s a big difference between traditional network visibility and modern network observability.</p> <h2 id="seeing-and-understanding">Seeing and understanding</h2> <p>You can think of <a href="https://www.kentik.com/kentipedia/what-is-network-observability/" title="Kentipedia: What is Network Observability?">network observability</a> as the difference between seeing and understanding. 
It’s the difference between simply seeing more data points on pretty graphs and understanding why something happened. It’s a difficult leap to make, and it requires as much telemetry as we can get our hands on (hence more data).</p> <img src="//images.ctfassets.net/6yom6slo28h2/3bF6q1tJp8gY22SaeRGjuO/212e91064bbe9bbc61425991108214e2/visibility-vs-observability.png" style="max-width: 500px;" class="image center" alt="Network visibility vs network observability" /> <p>And this is one big reason that Kentik’s approach starts with data and not features; it’s data-driven from the start because, just like we learned in our boring high school statistics class, the larger the dataset, the more accurate our predictions, the more likely it is we’ll find a strong correlation, and the more precise our understanding of what’s happening on the network.</p> <p>In the context of modern networking, this means a lot more than it used to. There was a day when flow and SNMP were enough, but that’s not the case today. Today, when we learn that an application is slow, we need to be able to pinpoint precisely where in the path latency is happening. And when we find that, we’ll want to see interface errors, DNS response times, TCP retransmissions from container resources, and so on, all in a time series to identify what is causing that latency.</p> <p><em>Just seeing more data points won’t help maintain a reliable network, but understanding how those data points relate to each other will.</em></p> <h2 id="volume-and-variety">Volume and variety</h2> <p>As much as we need all this data, it does create two problems for us.</p> <p>Think again about everything involved with handling packets, including network-adjacent services that don’t necessarily forward packets, but are critical for getting your application from containers in AWS to the cell phone in your hand.</p> <p>Here’s a ridiculous list for you:</p> <ul> <li>Switches</li> <li>Routers</li> <li>Firewalls</li> <li>CASBs</li> <li>IDS/IPS appliances</li> <li>Wireless access points</li> <li>Public clouds</li> <li>Network load balancers</li> <li>Application load balancers</li> <li>Service provider networks</li> <li>5G networks</li> <li>Data center overlays</li> <li>SD-WAN overlays</li> <li>Container network interfaces</li> <li>Proxies</li> <li>DHCP services</li> <li>IPAM databases</li> </ul> <p>…and the list goes on.</p> <p>Now think about every make and model of device and service for each of the network elements I just listed. Then think of all the <em>types</em> of data we can collect from these devices. I almost want to create another ridiculous list below to prove my point, but I’m confident you get the idea.</p> <p>So, the first problem is the sheer volume of data we have. Volume in and of itself isn’t a problem, but what is a problem is querying that data extremely fast while troubleshooting an issue in real time. Remember how we’re collecting telemetry from about eleventy billion devices? That means a querying engine needs to work extra hard to filter through exactly what you’re looking for, even as new data is ingested into the system.</p> <p>Filtering a query quickly is critical for network operations, but it’s difficult when querying a massive database filled with disparate data types.</p> <p>The second problem is related to the first, but instead of the volume of data, it’s how many <em>different types</em> of data we have to deal with. 
We have flow records, security tags, SNMP metrics, VPC flow logs, eBPF metrics, threat feeds, routing tables, DNS mappings, geo-id information, etc.</p> <p>Each telemetry type exists in very different formats, often using completely different scales. A metric about throughput would be in millions of bits or packets per second, whereas a flow record may give you source and destination information. A security tag is a random identifier, not a quantity of anything. Different formats and scales require an entire workflow to pre-process the data before applying an algorithm or machine learning model.</p> <p>A data-driven approach to network observability uses an ML pre-processing workflow to solve this problem by scaling and normalizing the data. This step alone may provide enough insight that a straightforward statistical analysis algorithm or time series model can produce the understanding a network engineer needs to find correlations, identify patterns, and predict network behavior.</p>
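<p>To make that pre-processing step concrete, here is a minimal sketch of scaling disparate telemetry onto a common range so it can feed a single model. It assumes scikit-learn is available, and the features and values are invented placeholders rather than Kentik’s actual pipeline:</p> <pre class="language-python"><code class="language-python">from sklearn.preprocessing import MinMaxScaler

# Hypothetical telemetry rows mixing wildly different scales:
# throughput in bits/sec, latency in ms, TCP retransmit counts.
samples = [
    [9.4e8, 12.0, 3.0],
    [1.2e9, 85.0, 41.0],
    [8.7e8, 15.0, 5.0],
]

# Rescale every feature to [0, 1] so no single unit (like bits/sec)
# dominates a downstream model or anomaly detector.
normalized = MinMaxScaler().fit_transform(samples)
print(normalized)</code></pre>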
<h2 id="a-new-world-of-network-telemetry">A new world of network telemetry</h2> <p>Justin Ryburn, Kentik’s VP of global solutions engineering, <a href="https://www.kentik.com/resources/container-network-observability/">explained</a> that the list of data we need to collect is only getting bigger. Many applications are built on containers, each of which has a network interface of some type (depending on the type of container). That means we now also need to collect container network information as part of our overall network telemetry dataset.</p> <img src="//images.ctfassets.net/6yom6slo28h2/62Wwr1S9XohSyARLvMg6iF/2c822ec8903de67e0caf8c478f1b991e/kappa-host-agent.png" style="max-width: 800px;" class="image center no-shadow" alt="Kappa host agent diagram" /> <div class="caption" style="margin-top: -30px">The Kappa Host agent deployed to collect kernel information and return it to the Kentik platform.</div> <p>Using <a href="https://www.kentik.com/blog/ebpf-explained-why-its-important-for-observability/" title="Learn more about eBPF in our post, eBPF Explained: Why it&#x27;s Important for Observability">eBPF</a>, we can observe the interaction between an application and the underlying Linux kernel within the application’s container for resources and network processes. Kappa, Kentik’s host-based telemetry agent, is built on eBPF to provide ultra-efficient observability for containers on-prem and in the cloud.</p> <p>And since we rely so much on the public cloud today, Ted Turner, cloud solutions architect at Kentik, explained how important it is to collect whatever <a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/" title="Kentipedia: What are VPC Flow Logs?">cloud flow logs</a> we can and combine them with other relevant telemetry, such as the metrics we can learn from SaaS providers, public internet service providers, DNS databases, and so on.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7xCOI2O0v2qdljQIdfqBPd/a65c8e0b0a00e19d72529bef0ec2874d/kenti-map-nfd.png" style="max-width: 800px;" class="image center" withFrame alt="Kentik Map" /> <div class="caption" style="margin-top: -30px">The Kentik Map in action</div> <p>With all this data in one place, we can map how traffic flows among our cloud instances, in between our multiple cloud providers, and back to our on-premises data centers (see above).</p> <p>The key here is that as new technologies emerge, just as containers have in the last few years, new telemetry will need to be collected to give us the best picture of application delivery we can get.</p> <p>The answer isn’t one single type of data that serves as a magic bullet to give us all the answers. Instead, the answer is a data-driven approach that optimizes the volume and variety of data to provide network engineers the insight they need to maintain a stable, reliable network.</p> <p>Watch all of <a href="https://www.kentik.com/go/event/nfd31/">Kentik’s presentations at NFD 31 here</a>.</p><![CDATA[Unlocking the Power of Embedded CDNs: A Comprehensive Guide to Deployment Scenarios and Optimal Use Cases]]><![CDATA[This guide explores the benefits of embedded caching for ISPs and discusses deployment optimization strategies and future trends in CDN technology. Embedded CDNs help reduce network congestion, save costs, and improve user experiences. ISPs must carefully plan their deployment strategies by considering how each of the CDNs distributes content and directs end-users to the caches. They need to know both the CDNs and their network architecture in detail to build a successful solution. ]]>https://www.kentik.com/blog/unlocking-power-embedded-cdns-comprehensive-guide-deployment-optimal-use-caseshttps://www.kentik.com/blog/unlocking-power-embedded-cdns-comprehensive-guide-deployment-optimal-use-cases<![CDATA[Nina Bargisen]]>Thu, 20 Apr 2023 04:00:00 GMT<p>In today’s fast-paced digital world, content distribution networks (CDNs) and embedded caching play a vital role in delivering a high-quality user experience. In this post, we will explore the benefits of embedded caching for internet service providers (ISPs) and discuss various strategies for optimizing the deployment of embedded caches in their networks. Additionally, we will take a closer look at the future developments and trends in CDN technology to provide a comprehensive understanding of this dynamic industry.</p> <p>Previously we wrote about <a href="https://www.kentik.com/blog/speeding-up-the-web-comprehensive-guide-content-delivery-networks-embedded-caching/">how embedded CDNs work</a>.
In this post, we will dive deeper into when it makes sense for ISPs to deploy embedded caches, the potential benefits, and how you can optimize the deployment to your network.</p> <h2 id="what-is-a-cdn">What is a CDN?</h2> <p>Content distribution networks are a type of network that emerged in the 90s, early in the internet’s history, when content on the internet grew “richer” – moving from text to images and video. Akamai was one of the first CDNs and remains a strong player in today’s market.</p> <p>A high-level definition of a CDN is:</p> <ul> <li>A collection of geographically distributed caches</li> <li>A method to place content on the caches</li> <li>A technique to steer end-users to the closest caches</li> </ul> <p>The purpose of a CDN is to place the data-heavy or latency-sensitive content as close to the content consumers as possible – in whatever sense close means. (More on that later.)</p> <h2 id="what-is-embedding">What is embedding?</h2> <p>The need to have the content close to the content consumers fostered the idea of placing caches belonging to the CDN inside the access network’s border. This idea was novel and still challenges the mindset of network operators today.</p> <p>We call such caches embedded caches. They are usually intended only to serve end users in that network. These caches often use address space originating in the ISP’s ASN, not the CDN.</p> <h2 id="why-do-cdns-offer-this-to-isps">Why do CDNs offer this to ISPs?</h2> <p>There are several reasons for a CDN to offer embedded caches to ISPs. For one, it does bring the content closer to the end users, which increases the quality of the product the CDN provides their customers. This is just the CDN idea taken a bit further.</p> <p>Secondly, offering ISPs caches can strengthen the relationship between the CDN and the ISP and foster collaboration that can lead to service improvements for both parties. It is not a secret that the traffic volumes served by the CDNs can be a burden on the network, and the embedded solution can be viewed as a way for the CDNs to play their part in the supply chain of delivering over-the-top services instead of paying the ISPs for the traffic.</p> <div as="Promo"></div> <p>Sometimes there is a perception that the embedded solution is about saving money for space and power needed to host the servers, but this is mostly a misunderstanding. It is far more efficient for most CDNs to operate large clusters in fewer locations than to operate a large number of small clusters.</p> <h2 id="so-what-is-in-it-for-the-isp">So, what is in it for the ISP?</h2> <p>In general, the ISP also benefits from bringing the content closer to the end users. Still, the question is whether bringing it inside their network edge provides enough benefit to justify the extra cost of hosting and operating the CDN caches.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5gHdGldqOm6IIalJb6amQV/7bccef7e9a6dd3392518bb932f6d909d/isp-benefits.png" style="max-width: 700px" class="image center no-shadow" alt="Diagram showing how ISPs and content connect" /> <p>Let’s look at how end-user ISPs and content reach each other. In this diagram, we show where money typically flows between the parties and where traffic flows. 
Let’s explore when there is a benefit for the ISP to deploy embedded caches.</p> <ol> <li>The CDN traffic is running on the ISP’s transit connections, and peering is not possible for one reason or another.</li> <li>The CDN and the ISP peer over an IXP, the volume drives port upgrades, and private connections are impossible.</li> <li>The CDN and the ISP peer, but in a limited number of locations that are not close to a large number of consumers of the content from the CDN, so the traffic travels a long distance internally in the ISP’s network.</li> </ol> <p>The first two scenarios are straightforward. Removing traffic from the edge will save money on transit and IXP port fees and lower the risk of congestion. The business case can directly compare the saved cost and the estimated cost of space and power for the embedded servers. More on what to consider when building the business case later.</p> <p>Let’s have a look at the third case. We can roughly split that into two different issues.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7ChyPE4KjW3lwRAvlecxmS/ac40e4a3407824df18574db928dba175/home-network-with-consumers-1.png" style="max-width: 500px" class="image center no-shadow" alt="Diagram showing peering and transit far from end users" /> <p>The first issue: The peering and transit sites are far away from the majority of the end users, and there is a high cost of building the connectivity to those sites. For example, there is the cost of transport capacity or dark fiber and housing on top of the interface-related costs. This case is similar to 1 and 2 above when it comes to working out whether a deployment makes sense.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3AbjVYLxBHdOj8sJtLAjqi/4bdf67ea4022d7caf2027a5be1e58946/home-network-with-consumers-2.png" style="max-width: 500px" class="image center no-shadow" alt="Diagram showing end users at a distance" /> <p>The second issue: End users in the network are spread out, and the traffic is causing a heavy load on the ISP’s internal backbone. This one is trickier to evaluate.</p> <p>But how about we just put a cache at each location where the end users are terminated and be done with it?</p> <h2 id="challenges-with-a-distributed-deployment-of-embedded-caches">Challenges with a distributed deployment of embedded caches</h2> <p>Unfortunately, this might work for some CDNs, but traffic will usually flow on the backbone links since the end-user mapping is most likely not granular enough to support a super-distributed deployment.</p> <p>This is due to how end users are directed to the closest cache or how the closest cache is defined in the CDN. And guess what? They do not do that the same way 🙂.</p> <p>We touched on this in our earlier blog post, but let’s look at potential solutions to the challenges the end-user mappings create for distributed deployments of embedded caches.</p> <h2 id="getting-end-users-to-the-closest-cache-using-bgp">Getting end users to the closest cache using BGP</h2> <p>Open Connect from Netflix stands out from most of the other players in this matter.
They heavily rely on BGP (Border Gateway Protocol, the protocol that networks use to exchange routes) to define which cache an end user is directed to.</p> <p>A little about BGP:</p> <ul> <li>BGP is used to signal to a cache which IP ranges it should serve.</li> <li>A setup where all caches have identical announcements will work for most deployments since the next tie-breaker is the geolocation information for the end user’s IP address.</li> <li>Prefixes are sent from the ISP to the caches.</li> </ul> <div style="border: 1px solid #efefef; padding: 30px; background-color: #ebeff3; max-width: 340px; margin: 0 auto"> <p><strong><span style="color: #FA541C">•</span> x.y.z.0/24</strong> is announced to <strong>A</strong><br> <strong><span style="color: #FA541C">•</span> x.y.w.0/24 </strong>is announced to <strong>B</strong></p> <p><strong style="color: #FA541C">1.</strong> Give me movie<br> <strong style="color: #FA541C">2.</strong> Go to <strong>B</strong> and get movie<br> <strong style="color: #FA541C">3.</strong> Give me movie<br> <strong style="color: #FA541C">4.</strong> Movie</p> </div> <img src="//images.ctfassets.net/6yom6slo28h2/7dAlT3sSikEwjRFhOiln80/14142fb8d3c5abbbb06e87d1d5c6423d/embedded-cdns-bgp.png" style="max-width: 500px; border: 1px solid #bbbbbb; padding: 30px" class="image center no-shadow" alt="End-user mapping with BGP" /> <p>Open Connect is unique among the CDNs since it does not rely on the DNS system to direct the end user to the suitable cache. The request to play a movie is made to the CDN, which replies with an IP address for the cache that will serve the movie without using the DNS system.</p> <p>Some of the other CDNs will also use communities tagged to the routes from the ISP to map end users to clusters.</p> <p>The BGP method works well when the ISP has a regionalized IP address plan such that each region uses one set of prefixes and another region uses a different set, which can be announced to the CDN. However, this is not always possible – sometimes, the entity that deploys the caches does not have control over the address plan for the access network.</p> <p>Some ISPs who decided to go all-in on embedded caches and built a large number of locations for caches solved this problem by implementing centralized route servers for the BGP sessions to the caches. These route servers communicate with a back-end system that keeps track of which IP addresses are in use where, such that the correct IP addresses are announced to the suitable clusters (or tagged with the right community). Interestingly, the benefit of a distributed deployment in these extensive networks justifies the investment in developing and maintaining this system.</p> <h2 id="dns-server-dependency">DNS server dependency</h2> <p>Most CDNs map an end user to a cache by mapping the DNS server the end user queries for the content to a cache or a cache location.</p> <p>This means that if an ISP wants to deploy embedded caches from such a CDN, they must also dedicate DNS servers to each region they will divide their network into.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6Djr3pXc5szzbnPsCqNxHX/cd30167866d1a9f779ddaa4f0a469d77/embedded-cdns-cloud-diagram.png" style="max-width: 600px" class="image center no-shadow" alt="Dedicated DNS servers in each region" /> <p>There are more efficient ways of operating DNS resolvers, so are there any workarounds available?</p> <p>Many embedded CDNs support eDNS0 Client Subnet (ECS) – an optional extension to the DNS protocol, published in RFC7871.
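</p> <p>As a concrete sketch of what an ECS-tagged lookup can look like, here is how one might build such a query with the dnspython library. The hostname, client prefix, and resolver below are placeholders:</p> <pre class="language-python"><code class="language-python">import dns.edns
import dns.message
import dns.query

# Attach an EDNS0 Client Subnet option advertising the end user's /24,
# so the CDN's authoritative side can pick a nearby cache without ever
# seeing the full client address.
ecs = dns.edns.ECSOption("198.51.100.0", 24)
query = dns.message.make_query("cdn.example.com", "A", use_edns=0, options=[ecs])

# 8.8.8.8 honors ECS; the answer should point at a cache close to that /24.
response = dns.query.udp(query, "8.8.8.8", timeout=2.0)
print(response.answer)</code></pre> <p>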
If an ISP enables ECS on their DNS resolvers, the DNS request will contain information about the request’s source – the end user’s IP address. This way, the CDN can use the IP addresses to map the end user to a cache server location, providing a more granular mapping than when based on the DNS server.</p> <p>When enabling ECS, the method of mapping the end user to a cluster varies from pure geolocation to a mix of geolocation, QoS measurements of the ISP’s prefixes, and the BGP method mentioned above.</p> <p>If end users decide to use one of the publicly available DNS services instead of the one provided by their ISP, this might mess up the end-user mapping. 8.8.8.8 supports ECS, but 1.1.1.1 is a service created to preserve privacy, so here, ECS is only supported for debugging purposes.</p> <p>Now we have determined where to offload traffic and whether our network supports a widely distributed deployment or if a centralized deployment makes better sense. The next thing to investigate is how much traffic will be moved to the embedded solution.</p> <h2 id="building-the-business-case">Building the business case</h2> <p>The first step is to measure the total traffic to each potential region where a cache cluster could make sense. A good flow tool like Kentik is crucial unless the CDN is able and willing to help you with the detailed analysis. In our blog, <a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-2">The peering coordinator’s toolbox part 2</a>, we discussed how to use Kentik to analyze the traffic going to different regions in your network.</p> <p>In our earlier blog about embedded CDNs, we described how the different methods of distributing content to the caches determine the offload/hit rate. CDNs such as Netflix’s Open Connect, which determines the required file locations beforehand, can distribute content to caches during periods of low demand. In contrast, pull-based CDNs distribute content to caches when demand is high. Both types of CDNs have a long tail of content that will never be cached on the embedded caches. So there is no general rule on how to calculate this.</p> <p>CDNs built to serve a single content service achieve very high offload, whereas the offload of commercial CDNs depends heavily on variations in demand, and the mix of content in each ISP means there is no general offload number that can be achieved. A common denominator is that the higher the demand for the content, the better the offload that will result from a deployment. Higher demand means the need for bandwidth will justify more servers, which again means more disk space will be available to cache more content.</p> <p>Given the lack of a general calculation, the best way to make this business case is in close dialogue with the CDNs you are considering embedding.</p>
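<p>As a back-of-the-envelope illustration of that business case, the sketch below nets estimated transit savings against hosting costs. Every number is a made-up placeholder; your measured traffic, offload rates, and unit prices will differ, and transit is typically billed at the 95th percentile rather than at peak:</p> <pre class="language-python"><code class="language-python"># Hypothetical inputs: substitute your own measurements and prices.
peak_traffic_gbps = 120.0  # measured CDN traffic toward one region
expected_offload = 0.85    # fraction the embedded caches would absorb
transit_price = 0.35       # $/Mbps/month at the edge being offloaded
hosting_cost = 2500.0      # $/month for space, power, and cross-connects

offloaded_mbps = peak_traffic_gbps * 1000 * expected_offload
monthly_savings = offloaded_mbps * transit_price - hosting_cost

print(f"Estimated net savings: ${monthly_savings:,.0f}/month")</code></pre>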
<h2 id="future-developments-and-trends">Future developments and trends</h2> <p>As technology advances, CDNs and embedded caching are evolving to meet the growing demands of internet users. Some future developments and trends to watch for include:</p> <ul> <li><strong>Edge computing</strong>: With the rise of edge computing, CDNs are expected to extend their networks further, bringing content closer to users by deploying caches at the edge of the network. This will help reduce latency and improve the user experience, particularly for latency-sensitive gaming and virtual reality applications.</li> <li><strong>Artificial intelligence (AI) and machine learning (ML)</strong>: AI and ML technologies are expected to optimize CDN operations significantly. These technologies can help predict traffic patterns, automate cache placement decisions, and improve content delivery algorithms, leading to more efficient and cost-effective CDN operations.</li> <li><strong>5G and IoT</strong>: The widespread adoption of 5G and the Internet of Things (IoT) will increase the number of connected devices and the volume of data traffic. CDNs must adapt and scale their infrastructure to accommodate this growth, including deploying embedded caches to manage the increased traffic and maintain high-quality content delivery.</li> </ul> <p>As more CDNs have begun offering embedded solutions over the past year, it seems likely that even more will move in that direction to support these new use cases.</p> <h2 id="conclusion">Conclusion</h2> <p>In summary, embedded caching offers numerous benefits to ISPs, including reduced network congestion, cost savings, and improved user experiences. To fully harness these benefits, ISPs must carefully plan and optimize their deployment strategies while staying informed of future developments and trends in CDN technology. By doing so, ISPs can ensure they are well prepared to meet the growing demands of their users and thrive in the ever-evolving digital landscape.</p><![CDATA[Data-Driven Defense: Exploring Global Cybersecurity and the Human Factor]]><![CDATA[A data-driven approach to cybersecurity provides the situational awareness to see what's happening with our infrastructure, but this approach also requires people to interact with the data. That's how we bring meaning to the data and make those decisions that, as yet, computers can't make for us. In this post, Phil Gervasi unpacks what it means to have a data-driven approach to cybersecurity.]]>https://www.kentik.com/blog/data-driven-defense-exploring-global-cybersecurity-and-the-human-factorhttps://www.kentik.com/blog/data-driven-defense-exploring-global-cybersecurity-and-the-human-factor<![CDATA[Phil Gervasi]]>Wed, 19 Apr 2023 04:00:00 GMT<p>A security breach often manifests itself in some sort of performance degradation of services — a slow network, an application that isn’t behaving correctly, or in some scenarios, a complete hard down. But this isn’t always the case. Especially when an attacker is interested in covert data exfiltration, a breach may go unnoticed for weeks, months, or even years.</p> <p>So how can we ever know if we’re under attack when a determined adversary can go unnoticed for so long?</p> <div as="WistiaVideo" videoId="ldnhzphtm5" audio></div> <p>The answer isn’t purely technical. There is no silver bullet that can solve all of our security problems and prevent any and all would-be attackers from carrying out an attack. Instead, to be as effective as possible, a cybersecurity defense posture has to be a two-pronged approach.</p> <p>It must be data-driven and consist of the appropriate volume and types of telemetry necessary for situational awareness from a high level to the most granular.
It also needs the human factor, or in other words, the subjective and critical decision-making ability of a human being.</p> <h2 id="a-data-driven-approach">A data-driven approach</h2> <p>A strong security posture means constant vigilance, which means the continuous collection and analysis of telemetry from various sources. This includes infrastructure devices involved with running and delivering services and applications, as well as the code underlying those services.</p> <p>This could encompass packet captures, flow records, server logs, routing advertisements, file access records, etc. In a large organization, this is both a huge volume and a massive variety of data to process, analyze, and store.</p> <p>For example, a threat feed can tell us what a particular attack looks like. Using appropriate data analysis methods, we can analyze a large volume of telemetry in real-time to find evidence for that attack. We can look at flow records, interface utilization, server logs, and more to see what type of attack we’re experiencing, who it’s affecting, when it started, and so on.</p>
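<p>A toy version of that kind of first-pass triage might look like the sketch below: match flow telemetry against a feed of known-bad addresses and surface candidates for an analyst. Every value here is an invented placeholder:</p> <pre class="language-python"><code class="language-python"># Hypothetical threat feed of known-bad IPs and a few flow records.
threat_feed = {"203.0.113.7", "198.51.100.99"}

flows = [
    {"src": "10.1.4.22", "dst": "203.0.113.7", "bytes": 91_000_000},
    {"src": "10.1.4.23", "dst": "93.184.216.34", "bytes": 12_000},
]

# Flag any flow whose destination appears in the feed; these become
# candidates for a human analyst, not automatic verdicts.
for flow in flows:
    if flow["dst"] in threat_feed:
        print(f"review: {flow['src']} to {flow['dst']} ({flow['bytes']} bytes)")</code></pre>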
<p><em>However, this is only part of the process. A data-driven defense also means a subjective component — the human factor</em>.</p> <h2 id="the-human-factor">The human factor</h2> <p>In a <a href="https://www.kentik.com/telemetrynow/s01-e11/">recent episode of Telemetry Now</a>, TJ Sayers, Manager of the MS and EI-ISAC’s Cyber Threat Intelligence team at the Center for Internet Security, explains that almost 85% of incoming alerts, alerts for a potential attack, are reviewed by actual human beings to judge if the alert warrants further action. As much as a sophisticated data science workflow can augment the threat hunting and analysis process, the actual human interaction with the data still provides a true <em>understanding</em> of the data.</p> <p>In this case, it takes a human being to determine if a threshold that was exceeded, or a file that was accessed outside of business hours, is significant and meaningful. TJ explained that a big part of a strong security posture is simply open communication among teams. With good communication, a security analyst looking at some strange behavior just needs to ask the network engineer, application owner, or sysadmin if that event is indeed something to worry about.</p> <p>Although the industry is working on systems to do this programmatically, adding this subjective component to a machine learning workflow is no small feat. Yes — a statistical analysis algorithm and an ML model can significantly help us, especially with organizing data, identifying patterns, and detecting anomalies. However, those technologies still fall short of the subtle and subjective insight only a human can provide.</p> <p>For example, consider an alert for link utilization reaching 100% on a pair of active/standby data center WAN ports. This behavior is unexpected and seems to correlate with a known active <a href="https://www.kentik.com/resources/kentik-protect-ddos-detection-and-defense/">DDoS attack</a> occurring against similar organizations in the region.</p> <p>With this information, a system could automatically kick off a mitigation workflow, but a SOC engineer might approach this differently. With a quick call to the data center team, an L1 SOC engineer would learn that the team just kicked off a replication task across the WAN but didn’t notify anyone. Ultimately, this was not a security incident but poorly planned maintenance.</p> <h2 id="data-and-people">Data and people</h2> <p>People processes, communication, and incident response workflows aren’t necessarily data themselves but are part of a data-driven approach to cybersecurity. Without them, the data is simply more noise leading to alert fatigue, security breaches, and attackers having their way with our networks.</p> <p>We need the data for situational awareness and the assistance of programmatic workflows to help us sort, organize, and analyze that data. However, a data-driven approach to cybersecurity also requires people — human beings — to interact with the data, bringing meaning, finding significance, and making those decisions that machines still aren’t genuinely able to make.</p><![CDATA[NetOps for Application Developers: Understanding the Importance of Network Operations in Modern Development]]><![CDATA[True observability requires visibility into both the application and network layers. For companies reliant on multi-zonal cloud networks, the days of NetOps existing as a team siloed away from application developers are over.]]>https://www.kentik.com/blog/netops-app-developers-importance-of-network-operations-modern-networkshttps://www.kentik.com/blog/netops-app-developers-importance-of-network-operations-modern-networks<![CDATA[Ted Turner]]>Mon, 17 Apr 2023 04:00:00 GMT<p>One of the great successes of software development in the last ten years has been the relatively decentralized approach to application development made available by containerization, allowing for rapid iteration, service-specific stacks, and (sometimes) elegant deployment and orchestration implementations that piece it all together.</p> <p>At scale, and primarily when carried out in cloud and hybrid-cloud environments, these distributed, service-oriented architectures and deployment strategies create a complexity that can buckle the most experienced network professionals when things go wrong, costs need to be explained, or optimizations need to be made.</p> <p>In my experience, many of these complexities and challenges can be addressed proactively if organizations include their network specialists in the planning and monitoring of their distributed applications. CTOs and other umbrella decision-makers recognize that software <em>and</em> network engineers must work together to deliver secure and performant applications.</p> <h2 id="devops-is-blind-to-the-network">DevOps is blind to the network</h2> <p>While DevOps teams may be skilled at building and deploying applications in the cloud, they often have less expertise when it comes to optimizing cloud networking, storage, and security. Unaddressed, this can lead to unreliable (and unsafe) application environments.</p> <p>A common assumption among application developers is that cloud environments are highly available and resilient <em>by default</em>. While cloud providers offer highly available and resilient infrastructure, it is still up to application developers to properly configure and manage their cloud resources to ensure optimal performance and availability. As these applications scale, and engineering for reliability comes to the forefront, DevOps engineers begin to rely on networking concepts like load balancing, auto-scaling, traffic management, and network security.</p> <p>Something else I’ve run into is that DevOps teams often fail to fully appreciate the importance of cost optimization in cloud environments.
Cloud resources can be highly flexible and scalable, but they can also be cripplingly expensive if not properly managed. DevOps teams need to be aware of the cost implications of their cloud infrastructure and take steps to optimize their resource usage, such as using reserved instances, automating resource management, peering, and implementing cost monitoring with business context.</p> <p>Observability and its SRE (site reliability engineer) champions have risen in demand as applications have evolved into these deeply distributed architectures. Observability strategies like collecting, sampling, and analyzing MELT (metrics, events, logs, and traces) telemetry have dramatically improved structural responses to challenges like incident response and system-wide optimizations.</p> <p>However, realizing this at the organizational level can involve significant stutter steps as applications grow in their capabilities and sophistication, with many observability implementations still containing significant blind spots to network-specific telemetry, such as from transit and ingress gateways, CDNs, IoT, SD-WANs, routers, switches, and so much more.</p> <h2 id="devops-and-netops-need-to-work-together">DevOps and NetOps need to work together</h2> <p>Collaboration is often a two-way street. While DevOps may indeed be “blind to the network,” achieving visibility will involve a lot of work and contribution from NetOps.</p> <p>Following are a few key ways NetOps and DevOps can collaborate to build more reliable systems.</p> <h3 id="architectural-decisions">Architectural decisions</h3> <p>First and foremost, having NetOps at the table means allowing network specialists to provide input at the very earliest stages of cloud development. Designing modern applications requires practices like the loose coupling of containerized application stacks, resulting orchestration layers, and data management abstractions like caching, replication, and transformations; these efforts rely on network principles and infrastructure for performance, reliability, and security.</p> <p>Having an expert perspective on network protocols helps ensure data will be moved securely and with network performance in mind. As this infrastructure is designed and configured across multi-zonal hybrid networks, the NetOps perspective can detail key performance metrics and analysis methods to instrument at the application layer, like when dealing with containers and load balancers. These insights become key as applications mature and find they need to scale in sometimes dramatic and unpredictable ways.</p> <h3 id="cross-functional-teams">Cross-functional teams</h3> <p>One way to ensure architectural decisions include the perspective of both application and network specialists is to create cross-functional teams. Instead of a siloed IT or infra team trying to manage the hydra of networks and configurations on its own, cross-functional teams help ensure each service or development vertical has a NetOps representative from planning through deployment. This personnel shift provides a proactive solution to the networking challenges of highly scaled cloud applications.</p> <h3 id="shared-tools-and-processes">Shared tools and processes</h3> <p>As NetOps teams gain a seat at the table, one of the significant shifts for many network professionals will be adopting the methodologies and development pacing of the application developers they are now working with more closely.
This can be a big challenge for NetOps, which has historically worked within stable, insular silos.</p> <p>But the complexity of modern networks calls for a change, and adapting network operations to include continuous integration/delivery, automated testing and security scanning, and more human-centered tools for monitoring, alerting, and visualizing information gives application developers and network operators a shared understanding of the systems they both support.</p> <h3 id="unified-telemetry--data-management">Unified telemetry + data management</h3> <p>DevOps observability has provided a great roadmap for network engineers for better collecting and analyzing the massive amount of data generated by today’s applications. Sampling strategies, technologies like tracing, and contextual instrumentation have made application-layer data a gold mine for root cause analysis, customer or region-specific optimizations, and an overall improved capability to “<a href="https://www.kentik.com/why-kentik-network-observability-and-monitoring/">ask anything</a>” about an application <em>in real time</em>.</p> <p>This instrumentation (providing high-cardinality context to network-layer data) is a critical step in unifying the data available to both DevOps and NetOps and sets up automation efforts that better take into account the full spectrum of components within a system.</p> <h3 id="automated-workflows">Automated workflows</h3> <p>Automation is critical to both DevOps and NetOps. IaC (infrastructure-as-code) can automate vital tasks such as provisioning infrastructure, configuring servers, and deploying applications, granting both teams velocity, reducing the risk of human error, and ensuring that these workflows take into account both application and networking concerns.</p>
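<p>As a minimal sketch of what that can look like in practice, here is network infrastructure declared as code using Pulumi’s Python SDK. The resource names and CIDR blocks are placeholders, and the same idea applies to other IaC tools:</p> <pre class="language-python"><code class="language-python">import pulumi
import pulumi_aws as aws

# Declare a VPC and a subnet as code: reviewable, versioned, and
# reproducible, so network and application changes can ship together.
vpc = aws.ec2.Vpc("app-vpc", cidr_block="10.0.0.0/16")
subnet = aws.ec2.Subnet("app-subnet", vpc_id=vpc.id, cidr_block="10.0.1.0/24")

# Export the IDs so application stacks can reference them.
pulumi.export("vpc_id", vpc.id)
pulumi.export("subnet_id", subnet.id)</code></pre>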
<a href="https://www.kentik.com/kentipedia/network-observability-in-modern-apm/#telemetry-data">Network telemetry</a> is the process of collecting and analyzing data from network devices, including switches, routers, firewalls, and servers, to gain visibility into network traffic and performance.</p> <p>By combining network telemetry and DataOps, organizations can improve their network visibility and gain actionable insights that can help them optimize their network performance, improve security, and enhance the overall user experience.</p> <h2 id="what-is-dataops">What is DataOps?</h2> <p>DataOps is a methodology that aims to streamline and automate managing and delivering data throughout its lifecycle, from ingestion to analysis and visualization. It is an extension of DevOps principles and practices to data management, enabling organizations to manage and automate data pipelines for quality, accuracy, and reliability.</p> <p>The DataOps ecosystem comprises several components, including people, processes, and tools. At the heart of DataOps is the agile development methodology, which emphasizes collaboration, iteration, and continuous delivery. Data scientists play a critical role in the DataOps ecosystem, leveraging advanced analytics and machine learning techniques to gain insights from large and complex data sets.</p> <p>DataOps also involves a range of tools and technologies, including data integration and ETL (extract, transform, load) tools, data quality and governance tools, data catalog and metadata management tools, and data visualization and reporting tools. DataOps strategies require a robust data infrastructure, including data warehouses, data lakes, caches, and other data storage and processing systems.</p> <h3 id="dataops-team-roles">DataOps team roles</h3> <p>In a DataOps team, several key roles work together to ensure the data pipeline is efficient, reliable, and scalable. These roles include data specialists, data engineers, and principal data engineers.</p> <h4 id="data-specialists">Data specialists</h4> <p>Data specialists are responsible for ensuring the quality of data and its suitability for analysis. They work closely with data owners and data consumers to understand their needs and requirements, and they use their expertise to ensure that data is collected, processed, and stored correctly. Data specialists also ensure that data is accessible to those who need it, and they monitor the data pipeline to identify and resolve any issues.</p> <h4 id="data-engineers">Data engineers</h4> <p>Data engineers are responsible for building and maintaining the data pipeline infrastructure. They work with data scientists and specialists to design and implement data processing workflows, ensuring that data is transformed and loaded into the appropriate data stores. Data engineers also ensure that the data pipeline is scalable, reliable, and efficient, and they monitor the pipeline to identify and address any bottlenecks or issues.</p> <h4 id="principal-data-engineers">Principal data engineers</h4> <p>Principal data engineers are senior members of the DataOps team who oversee the design and development of the data pipeline infrastructure. They work closely with data engineers to ensure the pipeline is robust and scalable. They also work with data scientists and specialists to ensure that the pipeline meets the needs of the business. 
Principal data engineers also play a crucial role in identifying and evaluating new technologies and tools that can improve the efficiency and effectiveness of the data pipeline.</p> <h3 id="how-is-dataops-different-from-devops">How is DataOps different from DevOps?</h3> <p>While both DevOps and DataOps efforts can be applied to network observability, they have different approaches and focus on different aspects of network management.</p> <p>DevOps focuses on the software development lifecycle and is principally a telemetry source and operational context for network operators. DataOps specifically targets data management and delivery, leveraging advanced analytics and machine learning techniques to gain insights and improve network performance.</p> <p>For NetOps, DataOps represents a pivotal approach to successfully managing and leveraging network data in highly scaled systems.</p> <h3 id="why-invest-in-dataops">Why invest in DataOps?</h3> <p>By using DataOps, businesses can ensure that the data they collect and analyze is high-quality, accurate, and reliable, which is essential for effective data analysis. With DataOps, businesses can improve their data agility and accelerate their time to insights, enabling them to make faster and better-informed decisions. These improvements can lead to operational efficiency, reduced costs, and improved customer satisfaction, all critical for meeting the demands of today’s business environment.</p> <h2 id="the-importance-of-telemetry-data-to-network-visibility">The importance of telemetry data to network visibility</h2> <p>Telemetry data is essential for keeping a network up and running. It provides real-time visibility into the performance of network devices, applications, and traffic, enabling network operators to detect and resolve issues quickly. Telemetry data includes information such as network traffic patterns, packet loss, latency, and jitter, as well as device metrics such as CPU utilization, memory usage, and interface errors.</p> <p>With network telemetry data, network operators can gain a holistic view of the network and identify performance issues before they impact end users. For example, suppose telemetry data shows network traffic congested at a particular interface. In that case, network operators can take proactive measures to alleviate the congestion, such as increasing bandwidth or optimizing traffic routing.</p> <p>Telemetry data is also critical for network security. Network operators can detect and respond to security threats in real-time, such as <a href="https://www.kentik.com/kentipedia/ddos-detection/">DDoS attacks</a> or unauthorized access attempts, by monitoring telemetry data.</p> <p>To ensure the effectiveness of telemetry data, it is essential to <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/">enrich it with context</a> and metadata, such as device and application information. This enables network operators to better understand the performance of the network and the root causes of issues. By following best practices for enriching telemetry data, network operators can improve network observability and ensure the reliability and availability of their networks.</p>
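<p>As a small illustration of that enrichment step, the sketch below joins raw flow records with a device-metadata table using pandas. Both tables are invented placeholders:</p> <pre class="language-python"><code class="language-python">import pandas as pd

# Hypothetical raw flow records and a device-metadata table.
flows = pd.DataFrame({
    "device_ip": ["10.0.0.1", "10.0.0.2"],
    "bytes": [1_200_000, 440_000],
})
metadata = pd.DataFrame({
    "device_ip": ["10.0.0.1", "10.0.0.2"],
    "site": ["nyc-dc1", "sjc-dc2"],
    "role": ["core-router", "tor-switch"],
})

# Enrich each flow with site and role context by joining on device IP,
# so later analysis can answer questions like "which site is congested?"
enriched = flows.merge(metadata, on="device_ip", how="left")
print(enriched)</code></pre>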
<h2 id="how-dataops-operationalizes-network-telemetry-data">How DataOps operationalizes network telemetry data</h2> <p>DataOps leverages network telemetry to remove bottlenecks by providing end-to-end visibility and actionable intelligence. DataOps can identify issues and prioritize responses and optimizations based on key performance indicators (KPIs), like connections per second, latency, and packet loss, by aggregating and analyzing real-time network performance telemetry.</p> <p>This data can be used to automate workflows and improve data governance, ensuring that data is accurate, reliable, and compliant. With network telemetry, DataOps teams can gain deeper insights into network performance and make more informed decisions, ultimately leading to improved network visibility and optimized performance.</p>
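<p>A toy example of that KPI-driven triage: aggregate per-interface samples and flag anything breaching a service-level target. The column names, values, and thresholds below are invented for illustration:</p> <pre class="language-python"><code class="language-python">import pandas as pd

# Hypothetical per-interface telemetry samples.
telemetry = pd.DataFrame({
    "interface": ["eth0", "eth0", "wan1", "wan1", "wan2"],
    "latency_ms": [12.1, 14.0, 96.5, 102.3, 31.7],
    "packet_loss_pct": [0.0, 0.1, 2.4, 3.1, 0.2],
})

# Average per interface, then flag breaches of an (invented) SLO of
# 75 ms mean latency or 1% mean loss.
kpis = telemetry.groupby("interface").mean()
breaches = kpis[(kpis["latency_ms"] > 75) | (kpis["packet_loss_pct"] > 1)]
print(breaches)</code></pre>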
<h2 id="leverage-telemetry-data-with-kentik-to-solve-network-issues-before-they-start">Leverage telemetry data with Kentik to solve network issues before they start</h2> <p>Kentik can help companies solve network issues before they start by leveraging telemetry data to provide real-time network observability and analytics. Kentik ingests <a href="https://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources/">telemetry data from a wide range of sources</a>, including <a href="https://www.kentik.com/kentipedia/netflow-guide-types-of-network-flow-analysis/">network flows</a>, <a href="https://www.kentik.com/kentipedia/evolution-of-network-monitoring-snmp-to-network-observability/">SNMP</a>, <a href="https://www.kentik.com/kentipedia/what-is-bgp-border-gateway-protocol/">BGP</a>, and more, providing end-to-end visibility across hybrid and multi-cloud environments.</p> <p>Kentik’s machine learning algorithms and advanced analytics provide actionable insights into network performance, security, and capacity planning, enabling companies to proactively identify and resolve issues before they impact end-users. With Kentik, companies can set up custom alerts and thresholds to monitor network KPIs and receive automated notifications when problems arise.</p> <p>In addition to its advanced analytics and automation capabilities, Kentik provides robust data governance features, enabling companies to ensure that their data is accurate, reliable, and compliant. Kentik’s user-friendly dashboards and reporting tools enable companies to quickly and easily visualize their network data and gain insights into network performance.</p> <p>Kentik provides companies with the tools and insights they need to optimize network performance, proactively identify and resolve issues, and ensure the reliability and availability of their networks.</p> <p>To get started with Kentik, <a href="https://www.kentik.com/go/get-started/demo/">sign up for a demo today</a>.</p><![CDATA[Ukraine’s Wartime Internet from the Inside]]><![CDATA[It has now been over a year since Russian forces invaded their neighbor to the west, leading to the largest conflict in Europe since World War II. Kentik’s Doug Madory reviews what has happened with internal connectivity within Ukraine over the course of the war in this analysis done for a collaboration with the Wall Street Journal.]]>https://www.kentik.com/blog/ukraines-wartime-internet-from-the-insidehttps://www.kentik.com/blog/ukraines-wartime-internet-from-the-inside<![CDATA[Doug Madory]]>Tue, 11 Apr 2023 04:00:00 GMT<p>This February marked a grim milestone in the ongoing war in Ukraine. It has now been over a year since Russian forces invaded their neighbor to the west, leading to the largest conflict in Europe since World War II.</p> <p>In the past year, we have used Kentik’s unique datasets to show some of the conflict’s impacts on Ukraine’s external internet connectivity, ranging from <a href="https://twitter.com/DougMadory/status/1496598152706772993">DDoS attacks</a> and <a href="https://twitter.com/DougMadory/status/1501215336573591552">large outages</a>, to the <a href="https://www.kentik.com/blog/rerouting-of-kherson-follows-familiar-gameplan/">rerouting of internet service</a> in the southern region of Kherson.</p> <p>This blog post contains analysis done for a <a href="https://www.wsj.com/articles/ukrainians-work-through-blackouts-internet-outages-as-russia-targets-power-grid-218a0fd5">collaboration with the Wall Street Journal</a> (pictured below) using a novel data source that allows us to explore connectivity inside Ukraine: the <a href="https://www.caida.org/projects/ark/">Ark dataset</a> from the <a href="https://www.caida.org/">Center for Applied Internet Data Analysis (CAIDA)</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5Sr6FS4BRXokWYujsdpB5K/a20afab582c336078b14780686104b23/wsj-analysis.png" style="max-width: 600px;" class="image center" alt="Wall Street Journal Ukraine diagram with Kentik data" /> <div class="caption" style="margin-top: -30px;">Kentik analysis on Ukraine in the Wall Street Journal, March 6, 2023</div> <h2 id="domestic-measurements-using-the-ark-dataset">Domestic measurements using the Ark dataset</h2> <p>Based at the University of California San Diego, the <a href="https://www.caida.org/">Center for Applied Internet Data Analysis (CAIDA)</a> is a leader in the academic field of internet measurement. Among their numerous measurement projects is the <a href="https://www.caida.org/projects/ark/">Archipelago Measurement Infrastructure</a>, or Ark, for short. Ark consists of servers located around the world continuously performing traceroutes to randomly selected IP addresses.</p> <p>One of those Ark servers is located in Kyiv. For the past year, it has been dutifully performing measurements to IP addresses around the world, including destinations within Ukraine and Russia. CAIDA graciously provided the data generated from this server to Kentik for the following analysis.</p> <p>The data gives us a unique view into the internal connectivity within Ukraine over the course of the war — at least from the perspective of this one important internet connection in Kyiv.</p> <p>Arguably the most dramatic development that appears in the data was the <a href="https://www.kentik.com/blog/rerouting-of-kherson-follows-familiar-gameplan/">rerouting of internet service</a> to Kherson through Russia. To analyze this development, we extracted the traceroutes performed by the Ark server in Kyiv to IP address space originated by the ASes of Kherson.</p> <p>The data from these traceroutes is plotted below by the overall latency (y-axis), time of measurement (x-axis), and AS origin of the last responding hop (color).</p> <img src="//images.ctfassets.net/6yom6slo28h2/4XXRULGwe25LcZA2skKP3e/9e9e758d7d17248a608094341acd82ad/kiev-kherson-1.png" withFrame style="max-width: 600px;" class="image center" alt="Traceroutes from Kyiv to Kherson" /> <p>There is a clear point when the latencies increase due to the Russian rerouting at the beginning of June 2022.
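</p> <p>(For readers who want to reproduce this kind of view from raw measurements, the sketch below uses the same encoding: time on the x-axis, latency on the y-axis, and color by origin AS. The records are invented placeholders, not the Ark data.)</p> <pre class="language-python"><code class="language-python">import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-ins for traceroute records: measurement time, overall
# round-trip latency in ms, and origin AS of the last responding hop.
runs = pd.DataFrame({
    "time": pd.to_datetime(["2022-05-01", "2022-06-15", "2022-07-20"]),
    "latency_ms": [14.2, 88.7, 91.3],
    "asn": ["AS21151", "AS21151", "AS49465"],
})

# One scatter series per origin AS, mirroring the chart above.
for asn, group in runs.groupby("asn"):
    plt.scatter(group["time"], group["latency_ms"], label=asn)

plt.ylabel("Round-trip latency (ms)")
plt.legend()
plt.show()</code></pre> <p>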
As one would expect, this aligns with the <a href="https://www.kentik.com/blog/rerouting-of-kherson-follows-familiar-gameplan/">timing observable in BGP</a>. The above graphic also illustrates the result of the Ukrainian liberation effort in Kherson. Ukrainians have recaptured <a href="https://en.wikipedia.org/wiki/Russian_occupation_of_Kherson_Oblast">half of the region</a>, and we see a portion of the traceroutes reverting to a lower latency as those networks restore their Ukrainian transit connections. A few providers in the region of Kherson are still on Russian transit, presumably in the territory that is still under Russian control.</p> <p>If we isolate the measurements to one particular ASN in Kherson, AS21151 (Ukrcom), the traceroutes tell a clear story of the network’s transition from Ukrainian to Russian transit and back again.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6BgetfWBXDc6gS4F8tE3sF/50c60d2d0d8db76f97443c5224eb04c8/kiev-kherson-2-isolation.png" withFrame style="max-width: 600px;" class="image center" alt="Isolated view of traceroutes in Kherson" /> <p>If we convert the individual traceroutes used in the above graphic into a readable version, we can see how the internet path between Kyiv and Kherson changed hop-by-hop.</p> <p>Before the invasion, packets could get from Kyiv to Kherson in as little as 11ms:</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> Traceroute from 88.81.224.210 to 185.43.224.143 at <b>2022-01-20</b> 00:34:58 (COMPLETED): 1. 88.81.224.209 PJSC Datagroup 3326 88.81.224.209.ipv4.datagroup.ua. 0.999 2. 185.1.62.22 UBNIX-GLOBAL-IX 199524 uarnet-ix.giganet.ua. 0.724 3. 194.44.100.254 http://www.uar.net 3255 14.859 4. 193.109.128.78 Ukrcom Ltd 21151 12.551 5. 193.109.128.182 Ukrcom Ltd 21151 193-109-128-182.ukrcom.kherson.ua. 11.824 6. 185.43.224.143 Ukrcom Ltd 21151 185-43-224-143.ukrcom.kherson.ua. 11.883 </code></pre> <p>During the Russian occupation, traceroutes needed to leave the country to reach Kherson. In the example below, this traceroute travels from Kyiv to Vienna <code class="language-text">win</code> to Frankfurt <code class="language-text">ffm</code> to Moscow <code class="language-text">12389</code> to Simferopol, Crimea <code class="language-text">smfl</code> and finally to Kherson <code class="language-text">21151</code>.</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> Traceroute from 88.81.224.210 to 194.79.22.56 at <b>2022-06-13</b> 18:52:08 (GAPLIMIT): 1. 88.81.224.209 PJSC Datagroup 3326 88.81.224.209.ipv4.datagroup.ua. 0.504 2. 88.81.244.145 PJSC Datagroup 3326 xe-1-0-2.2.at.mx02.iev2.core.as3326.net. 0.62 3. 62.115.189.58 TELIANET 1299 <b style="background-color: #fef89b;">kiev</b>-b1-link.ip.twelve99.net. 0.825 4. 62.115.123.130 TELIANET 1299 <b style="background-color: #fef89b;">win</b>-bb4-link.ip.twelve99.net. 32.405 5. 62.115.138.22 TELIANET 1299 <b style="background-color: #fef89b;">ffm</b>-bb2-link.ip.twelve99.net. 31.878 6. 62.115.151.97 TELIANET 1299 rostelecom-ic319651-ffm-b11.ip.twelve99-cust.net. 64.392 7. 87.226.183.91 PJSC Rostelecom <b style="background-color: #fef89b;">12389</b> 46.853 8. 185.64.45.207 Miranda-Media Ltd 201776 76.742 9. 31.40.132.165 Osipenko Alexander Nikolaevich 201776 ae20-13623.<b style="background-color: #fef89b;">smfl</b>-04-bpe1.miranda-media.net. 
94.087 10. 193.109.128.17 Ukrcom Ltd 21151 193-109-128-17.ukrcom.<b style="background-color: #fef89b;">kherson</b>.ua. 103.082 11. 193.109.128.78 Ukrcom Ltd 21151 105.668 12. 194.79.22.26 Ukrcom Ltd 21151 194-79-22-26.ukrcom.kherson.ua. 95.683 </code></pre> <p>As a consequence of the greatly increased geographic distance traveled, the overall latencies from Kyiv to Kherson jumped up to over 70ms - greater than a round-trip time across the Atlantic Ocean.</p> <p>Following the liberation of Kherson, traceroutes revert to a shorter, more direct path:</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> Traceroute from 88.81.224.210 to 194.79.22.43 at <b>2022-12-06</b> 22:50:52 (GAPLIMIT): 1. 88.81.224.209 PJSC Datagroup 3326 88.81.224.209.ipv4.datagroup.ua. 18.287 2. 185.1.213.30 UA-EUROLINE-20210701 None compnetua.1-ix.net. 0.41 3. 193.109.128.137 Ukrcom Ltd 21151 du-137.ukrcom.kherson.ua. 10.254 4. 193.109.128.78 Ukrcom Ltd 21151 10.253 5. 193.109.128.82 Ukrcom Ltd 21151 nat-polar.ukrcom.kherson.ua. 9.983 </code></pre> <p>Not every Kherson AS switched back to Ukrainian transit. Russia still occupies half of the region, and a few ASes operate in that half. RubinTelecom (AS49465) is one of them and remains on Russian transit. It also suffered extended outages.</p> <img src="//images.ctfassets.net/6yom6slo28h2/n8YnNhA5BGSwNTyyLghAn/15c4b68b90ef077e58862b7258db5819/kiev-kherson-3.png" withFrame style="max-width: 600px;" class="image center" alt="RubinTelecom Outages" /> <h2 id="connectivity-to-donbas">Connectivity to Donbas</h2> <p>Measurements to Russian-held Donetsk and Luhansk also exhibited clear changes in latency and path. We can’t be sure whether these changes were due to a technical failure along a more direct path or an administrative disabling of the link. On February 17, 2022 — one full week <em>before</em> the invasion — we saw latencies to Luhansk double from 35ms to 70ms.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Ibevq1pwRB3pgVM6Qq3zU/03da0e092b56453a5a796d3df50dca2d/kiev-luhansk.png" withFrame style="max-width: 600px;" class="image center" alt="Measurements from Kiev to Russian-held Luhansk" /> <p>Looking at the individual traceroutes shows a clear change in path. Initially, the measurements heading to Russian-held Luhansk headed directly to the <a href="https://en.wikipedia.org/wiki/Moscow_Internet_Exchange">Moscow Internet Exchange</a> (MSK-IX) by way of Kharkiv on Ukraine’s northeast border with Russia, as illustrated in the traceroute below:</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> Traceroute from 88.81.224.210 to 176.113.251.26 at 2022-02-14 19:39:45 (COMPLETED): 1. 88.81.224.209 PJSC Datagroup 3326 88.81.224.209.ipv4.datagroup.ua. 0.244 2. 88.81.244.175 PJSC Datagroup 3326 xe-11-1-2.2.at.mx01.iev1.core.as3326.net. 31.632 3. 88.81.244.1 PJSC Datagroup 3326 xe-0-1-0.2.at.mx01.hrk1.core.as3326.net. 35.792 4. 195.208.210.92 <span style="background-color: #fef89b;">MSK-IX</span> None 42.23 5. 193.228.160.179 Telematika LLC 43201 33.902 6. 194.31.154.3 Luganet 39728 pool.luganet.ru. 34.875 7. 
176.113.251.26 Luganet 39728 35.3 </code></pre> <p>After the change, traceroutes had to head west to <a href="https://www.de-cix.net/en/locations/frankfurt">DECIX in Frankfurt</a> before getting carried to Moscow in Rostelecom and then on to Luhansk in Eastern Ukraine.</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> Traceroute from 88.81.224.210 to 176.113.242.30 at 2022-02-17 23:44:13 (COMPLETED): 1. 88.81.224.209 PJSC Datagroup 3326 88.81.224.209.ipv4.datagroup.ua. 0.583 2. 88.81.245.221 PJSC Datagroup 3326 xe-3-3-1.2.at.mx01.fra1.core.as3326.net. 30.813 3. 80.81.194.31 <span style="background-color: #fef89b;">DE-CIX Management</span> 6695 frkt-ar2.intl.ip.rostelecom.ru. 32.264 4. 188.128.126.238 PJSC Rostelecom 12389 94.839 5. 178.35.228.91 PJSC Rostelecom 12389 67.917 6. 193.228.160.179 Telematika LLC 43201 94.294 7. 91.217.5.131 Luganet 39728 pool.luga.net.ua. 67.113 8. 176.113.242.30 Luganet 39728 69.986 </code></pre> <p>Datagroup’s connection to Russia via Kharkiv appears to have been disabled ahead of the invasion, perhaps as a measure to thwart <a href="https://twitter.com/DougMadory/status/1496598152706772993">cyberattacks</a> from Russia.</p> <p>In mid-March, traceroutes from Kyiv to Russian-held Donetsk exhibited a similar leap in latency, from as low as 34ms to over 80ms:</p> <img src="//images.ctfassets.net/6yom6slo28h2/15ty7wYNgOqaTm7niwMfPk/d82358e3876e97be7cb3e0f2c0cbfae5/kiev-donetsk.png" withFrame style="max-width: 600px;" class="image center" alt="Traceroutes from Kiev to Russian-held Donetsk" /> <h2 id="connectivity-to-russia">Connectivity to Russia</h2> <p>Below is an illustration of the traceroutes from the Ark server in Kyiv through Russian state telecom, Rostelecom (AS12389), to <em>any global destination</em>. These are traceroutes that received responses from the target addresses (i.e., status = COMPLETED) and, again, are plotted by the overall latency (y-axis), time of measurement (x-axis), and AS origin of the target IP (color).</p> <img src="//images.ctfassets.net/6yom6slo28h2/4DQHXi14trQsBAX3KPPpib/b2d2d3c1c713c121896a593d959a72ca/kiev-russia-state-telecom.png" withFrame style="max-width: 600px;" class="image center" alt="Traceroutes from Kiev through Rostelecom" /> <p>The chart above reveals that traceroutes destined for address space originated by Rostelecom (AS12389, AS42610, AS25490, etc.) and many other Russian networks became unreachable. But the traceroutes didn’t begin and then stop at a certain point. Instead, beginning around 09:00 UTC on February 28, 2022, Datagroup appeared to simply no longer carry any routes from many Russian ASNs in its routing table.</p> <p>The Ark server did, however, continue to successfully run traceroutes <em>through</em> Rostelecom on to other countries: notably Kazakhstan and Iran. In the graphic above, traceroutes to Kazakh Telecom (AS9198) appear in green, while traceroutes to Iran’s Information Technology Company (AS58224) appear in purple. 
<p>The lower band of measurements in yellow and dark gray represents successful traceroutes into Russian-held Donbas and Crimea — perhaps demonstrating a reluctance to block traffic to any parts of Ukraine, even if occupied by Russian forces.</p> <h2 id="conclusion">Conclusion</h2> <p>The analysis above only scratches the surface of what is contained in the measurement data that was produced by CAIDA’s Ark server in Kyiv. Traceroutes performed from inside Ukraine can reveal changes in latency and path for domestic traffic that aren’t possible to observe from outside of the country.</p> <p>In the case of Kherson, the internal view provided by Ark revealed the performance impact of forcibly re-routing domestic traffic through an external country. Of course, the issue of increased latency pales in comparison to the security concerns that arise from having one’s communications re-routed through an invading country.</p> <p>Recall that Skynet (formerly Kherson Telecom) was the first ISP in Kherson to begin using Russian transit in May of 2022. It was an action that the <a href="https://www.wired.com/story/ukraine-russia-internet-takeover/">CEO defended on social media</a>. After the <a href="https://en.wikipedia.org/wiki/2022_Kherson_counteroffensive">Ukrainian counteroffensive</a> in the fall of 2022 recaptured much of the region of Kherson and ISPs there started using Ukrainian transit again, Skynet went dark. Since then, it hasn’t appeared in the <a href="https://stat.ripe.net/widget/routing-history#w.resource=47598">global routing table</a>.</p> <p>The war in Ukraine has devastated the country, and its telecommunications infrastructure, while still operational, has paid a heavy price. For most of Ukraine, the Russian-held Donbas got further away, both in terms of internet distance and national cohesion, while much of the Russian internet became unreachable.</p> <p>Ukrainian telecommunications technicians are continuing to face <a href="https://www.forbes.com/sites/thomasbrewster/2022/03/22/while-russians-bombs-fall-around-them-ukraines-engineers-battle-to-keep-the-internet-running/">unforgiving challenges</a> while working to <a href="https://nogalliance.org/our-task-forces/keep-ukraine-connected/">keep Ukraine connected</a> to the outside world and, of course, <em>from the inside</em>.</p><![CDATA[Speeding Up the Web: A Comprehensive Guide to Content Delivery Networks and Embedded Caching]]><![CDATA[Content delivery networks are an important part of the internet, as they ensure a short path between content and the consumers. The idea of placing CDN caches inside ISPs’ networks was created early in the days of CDNs. The number of CDNs with this offering is growing and ISPs all over the world take advantage of the idea. This post explains how this works and what to look out for to do it right. ]]>https://www.kentik.com/blog/speeding-up-the-web-comprehensive-guide-content-delivery-networks-embedded-cachinghttps://www.kentik.com/blog/speeding-up-the-web-comprehensive-guide-content-delivery-networks-embedded-caching<![CDATA[Nina Bargisen]]>Thu, 06 Apr 2023 04:00:00 GMT<h2 id="what-is-a-cdn">What is a CDN?</h2> <p>Content distribution networks are a type of network that emerged in the 90s, early in the internet’s history, when content on the internet grew “richer” – moving from text to images and video.
Akamai was one of the first CDNs and remains a strong player in today’s market.</p> <div as="WistiaVideo" videoId="5hvet22hzw" audio></div> <p>A high-level definition of a CDN is:</p> <ul> <li>A collection of geographically distributed caches</li> <li>A method to place content on the caches</li> <li>A method to steer end users to the closest caches</li> </ul> <p>The purpose of a CDN is to place the data-heavy content or latency-sensitive content as close to the content consumers as possible – in whatever sense close means. (More on that later.)</p> <p>Most website publishers use CDN services to deliver their sites and applications to ensure reliable and responsive performance for their end users. CDN services just make sense. If a publisher in the U.S. has many visitors from Africa, it isn’t efficient to be sending that content across multiple terrestrial and submarine cables to reach the end user every time it’s requested. It makes logical sense to store this content locally when first requested and then serve it to subsequent African viewers locally. End users get a higher-quality viewing experience, and it reduces expensive traffic on those backbone networks.</p> <h2 id="what-is-embedding">What is embedding?</h2> <p>The need to have the content close to the content consumers fostered the idea of placing caches belonging to the CDN inside the access network’s network border. This idea was novel and still challenges the mindset of network operators today.</p> <p>We call such caches “embedded caches.” They are usually intended only to serve end users in that network. These caches often use address space originating in the ISP’s ASN and not in the CDN’s.</p> <h2 id="who-are-the-players">Who are the players?</h2> <p>Online video viewership really increased the demand for CDN services. Internet video consumption started in the late 1990s, and OTT streaming services have accelerated the growth in online video consumption. In the early days, specialized video CDNs such as Mark Cuban’s Broadcast.com (acquired by Yahoo) and INTERVU (acquired by Akamai) served the market. By the early 2000s, CDNs had to have a video delivery solution to compete. In 2023, work-from-home and e-learning applications continue to drive video consumption growth.</p> <p>The CDN market has only become more active. Traditional players like Akamai, Lumen, Tata, and Edgio still generate a large share of their business from traditional content delivery, but these services have become commoditized. For these players to grow their business and compete with newer entrants like Cloudflare, Fastly, and Stackpath, they are leveraging their distributed infrastructure to offer a more diverse and specialized set of services like security services and DDoS protection, high-performance real-time delivery services, as well as moving into public cloud services.</p> <p>Large content producers who use commercial CDNs often use multiple CDNs. This supports different types of content, leverages different geographical strengths, creates resilience, and, finally, means better leverage when negotiating contracts.</p> <p>Giant content producers like Netflix, Apple, Microsoft, Amazon, Facebook, and Google have all built their own specialized CDN to support their core services. Some also compete in the marketplace selling CDN services – including Amazon, Google, and Microsoft.
In 2023, CDNs are looking for ways to leverage their investment in their distributed infrastructure by identifying high-growth services traditionally offered by public clouds. CDNs are now offering storage and compute services. Akamai recently closed on acquiring Linode to add cloud compute to its service offerings.</p> <p>The major CDNs that offer an embedded solution to ISPs are, among others:</p> <ul> <li>Akamai</li> <li>Netflix</li> <li>Google</li> <li>Amazon</li> <li>Facebook</li> <li>Cloudflare</li> <li>CDN77</li> <li>Microsoft</li> <li>Apple</li> <li>Qwilt</li> </ul> <p>In the SIGCOMM ‘21 presentation, “<a href="https://dl.acm.org/doi/10.1145/3452296.3472928">Seven years in the life of Hypergiants’ off-nets</a>,” Petros Gigis and team found that more than 4,500 networks globally in 2021 had embedded caches from at least one of the CDNs – a number that has tripled from 2013 to 2021. Google, Netflix, Facebook, and Akamai are by far the most widely deployed embedded caches, with almost all of the 4,500 networks hosting at least one and often two or more of the four.</p> <h2 id="the-benefit-of-embedding">The benefit of embedding</h2> <p>The benefit of the CDN is that the content served by the CDN can be placed closer to the consumer. For the ISP, the benefit is primarily savings on the internet edge – in the form of transit and capacity costs. Depending on the type of embedded deployment, some ISPs can also save capacity on their internal network. Traffic from CDNs rarely creates revenue for the ISP as most CDNs prefer peering over buying transit from end-user ISPs, so it does become crucial to use as little of the network as possible to deliver the traffic to consumers.</p> <div as="Promo"></div> <h2 id="the-downside-of-embedding">The downside of embedding</h2> <p>The challenge for ISPs is the added complexity of operating caches managed by other networks inside their network border. On top of that, space and cooling needs differ from what their own equipment needs. Complicating matters further is that these embedded caches’ space and cooling needs differ from one CDN to the next.</p> <p>Another reported downside is that some embedded CDNs’ operational processes are misaligned with the ISPs’ operational processes. Expectations of access to the caches or speed in physical replacements are sometimes misaligned.</p> <h2 id="offload--what-to-expect">Offload – what to expect</h2> <p>A common surprise for ISPs new to embedded CDN caches is that they rarely see 100% of the traffic from the CDN served from the embedded caches. There are several reasons why we see some traffic from the CDN over the network edges.</p> <p>To understand this, let’s examine the ways content is placed on the embedded caches (or the caches in the CDN in general).</p> <h3 id="different-ways-of-placing-content-and-embedding--illustrated-by-typical-traffic-profiles-from-select-kentik-customers">Different ways of placing content and embedding – illustrated by typical traffic profiles from select Kentik customers</h3> <p>A few CDNs – most prominently Netflix Open Connect, push the content to the caches in the CDN. 
A frequent calculation determines which files should be placed where, and the system then distributes the files during low-demand hours, providing two significant benefits:</p> <ul> <li>The fill traffic runs on network connections outside of peak hours, lowering the strain on the ISP’s interconnection capacity.</li> <li>The caches’ resources can be fully used to serve traffic in peak demand hours, since no resources are taken up storing new content files during these times.</li> </ul> <p>For ISP partners, this means they will see the fill traffic during the night, and if they agree to let their embedded caches fill from each other, the amount of file downloads to the caches from outside the network is minimal, as you can see in the example below.</p> <p>The graph below shows the traffic profile of Netflix traffic for a network with a well-dimensioned Open Connect deployment. The fill traffic over the network border is the smallest spike, with the cache-to-cache fill having a more considerable spike at night.</p> <p>Notice the large spikes in the peak hours of traffic served from the embedded caches. In the graph, we can see a traffic peak from outside of the network to the end users during the peak hours. This long tail is the content that was not placed on the embedded caches because the expected demand was not high enough. The total catalog is too extensive for the amount of disk space in a typical embedded cluster, so some content will need to be served from the larger POPs of the CDN.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3aQiINWynpnSveK2aReHnG/f4e20471d0d7a84c0e86cba77bd1a3e8/embedded-cdns-1.png" style="max-width: 700px;" class="image center" withFrame alt="Traffic profile of Netflix traffic for a network with a well-dimensioned Open Connect deployment" /> <p>When the CDN uses <em>pull</em> to place the content, the end user’s request for content triggers the content to be downloaded to the embedded cache. The traffic from outside the network to the embedded caches peaks at the peak hour for the content download to end users. Like before, a long tail of content is served directly to the end users from outside the network. This is most often content that the content owners decided wasn’t worth caching, as high demand was not expected. It might be more dynamic content, such as content personalized for the end viewer or with a specific security profile. Finally, some CDNs will serve the file directly to the end user and then subsequently store it on the embedded caches after the first request for future delivery.</p> <img src="//images.ctfassets.net/6yom6slo28h2/22eRyWjzP99DkRSTyCksb2/81d0a6a2dd3e2b2d63835cb509276741/embedded-cdns-2.png" withFrame style="max-width: 700px;" class="image center" alt="" /> <p>So deploying embedded caches from the major CDN sources of the inbound traffic can save ISPs significant traffic and capacity on the network edge. Not all the traffic, but most of the traffic, can be moved away.</p> <p>The next question that arises is where should an ISP deploy embedded caches? Ideally, a cache would be placed at each location where the customers’ lines connect to the IP network.
This should remove most CDN traffic from the network links, right?</p> <p>Well, that depends on the topology of the network but also on which CDN and how that CDN maps the end users to a cluster.</p> <p>Let’s look at the most common methods and how they affect the traffic flows from the embedded caches to end users inside the network.</p> <h2 id="end-user-mapping">End-user mapping</h2> <h3 id="getting-end-users-to-the-closest-cache">Getting end users to the closest cache</h3> <p>Again, we have Open Connect from Netflix standing out from most of the other players. They heavily rely on BGP (Border Gateway Protocol, the protocol that networks use to exchange routes) to define which cache an end user is directed to.</p> <p>BGP:</p> <ul> <li>BGP is used to signal to a cache which IP ranges it should serve</li> <li>A setup where all caches have identical announcements will work for most deployments since the next tie-breaker is the geolocation information for the end user’s IP address.</li> <li>Prefixes are sent from the ISP to the caches.</li> </ul> <div style="border: 1px solid #efefef; padding: 30px; background-color: #ebeff3; max-width: 340px; margin: 0 auto"> <p><strong><span style="color: #FA541C">•</span> x.y.z.0/24</strong> is announced to <strong>A</strong><br> <strong><span style="color: #FA541C">•</span> x.y.w.0/24 </strong>is announced to <strong>B</strong></p> <p><strong style="color: #FA541C">1.</strong> Give me movie<br> <strong style="color: #FA541C">2.</strong> Go to <strong>B</strong> and get movie<br> <strong style="color: #FA541C">3.</strong> Give me movie<br> <strong style="color: #FA541C">4.</strong> Movie</p> </div> <img src="//images.ctfassets.net/6yom6slo28h2/7dAlT3sSikEwjRFhOiln80/14142fb8d3c5abbbb06e87d1d5c6423d/embedded-cdns-bgp.png" style="max-width: 500px; border: 1px solid #bbbbbb; padding: 30px" class="image center no-shadow" alt="End-user mapping with BGP" />
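<p>Conceptually, the selection this BGP signaling drives is a longest-prefix match of the end user’s address against whatever each cache has been told to serve. Here is a minimal sketch using Python’s ipaddress module, with made-up documentation prefixes rather than any real deployment:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">import ipaddress

# hypothetical example: which prefixes the ISP announces to each cache
announcements = {
    "cache-a": [ipaddress.ip_network("198.51.100.0/24")],
    "cache-b": [ipaddress.ip_network("203.0.113.0/24"),
                ipaddress.ip_network("203.0.113.128/25")],
}

def cache_for(client_ip):
    """Pick the cache with the most specific announced prefix covering the client."""
    ip = ipaddress.ip_address(client_ip)
    best, best_len = None, -1
    for cache, prefixes in announcements.items():
        for prefix in prefixes:
            if ip in prefix and prefix.prefixlen > best_len:
                best, best_len = cache, prefix.prefixlen
    return best

print(cache_for("203.0.113.200"))  # the most specific match wins: cache-b
</code></pre></div>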
<p>Open Connect is unique among the CDNs since they do not rely on the DNS system to direct the end user to the suitable cache. The request to play a movie is made to the CDN, which replies with an IP address for the cache that will serve the movie without using the DNS system.</p> <p>Most other CDNs map end users to a cache by mapping the DNS server the end user is using to request the content to a cache or a cache location.</p> <p>The typical DNS-based flow for content served by a CDN looks like this:</p> <div style="border: 1px solid #efefef; padding: 30px; background-color: #ebeff3; max-width: 420px; margin: 0 auto"> <p><strong style="color: #1890FF">1.</strong> Where is <strong>site.com</strong>?<br> <strong style="color: #1890FF">2.</strong> Where is <strong>site.com</strong>?<br> <strong style="color: #1890FF">3.</strong> <strong>site.com</strong> is <strong>site.com.cdnsomething.com</strong><br> <strong style="color: #1890FF">4.</strong> Where is <strong>site.com.cdnsomething.com</strong><br> <strong style="color: #1890FF;">5.</strong> <strong>site.com.cdnsomething.com</strong> is <strong>A</strong> <em>for you</em><br> <strong style="color: #1890FF">6.</strong> <strong>Site.com</strong> is <strong>A</strong><br> <strong style="color: #1890FF">7.</strong> Give me <strong>Site.com</strong><br> <strong style="color: #1890FF">8.</strong> <strong>Site.com</strong></p> </div> <img src="//images.ctfassets.net/6yom6slo28h2/6WQnfq1SVlqJlsx2KLt9kX/dbfddfd4cb7f519f09651b0e85ee66e0/embedded-cdns-dns.png" style="max-width: 500px; border: 1px solid #bbbbbb; padding: 30px" class="image center no-shadow" alt="DNS-based flow" /> <p>But how is mapping the DNS server to the cache or cache locations done? This is where the individual CDNs add their own magic.</p> <p>The mapping can take several different parameters into account. For example:</p> <ul> <li><strong>Latency</strong>: Measurements from the CDN’s network, clients, and embedded caches create a latency map determining the closest cache for a given DNS server.</li> <li><strong>Connectivity to the ISP</strong>: Is the cache embedded, reached by private or public peering or transit?</li> <li><strong>Load</strong>: No CDN wants to direct a user to a cache that is already busy serving content, so the mapping takes the current load into account.</li> </ul> <p>Note that this means that the mappings in the DNS system have quite a short TTL.</p>
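<p>You can observe both the CNAME redirection and the short TTLs yourself with a resolver library. A minimal sketch using dnspython (the hostname is a placeholder, not a real CDN customer):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">import dns.resolver  # pip install dnspython

# placeholder hostname standing in for any CDN-fronted site
answer = dns.resolver.resolve("www.example.com", "A")

# the canonical name exposes any CNAME chain into the CDN,
# and the TTL shows how quickly the mapping is allowed to change
print("canonical name:", answer.canonical_name)
print("ttl (seconds):", answer.rrset.ttl)
for record in answer:
    print("mapped to:", record.address)
</code></pre></div>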
<p>Some CDNs use anycast to direct the end user to the nearest cache. Announcements and withdrawal of the anycast prefixes from the caches in the ISP are then used for traffic management by the CDN.</p> <h3 id="how-does-end-user-mapping-affect-the-benefits-of-embedding">How does end-user mapping affect the benefits of embedding?</h3> <p>What does the end-user mapping mean for the traffic flows internally in the ISP’s network in the case of more than one cache location? In the case of Open Connect and similar BGP-based mappings, the ISP has optimal control if the IP addresses used by the end users are regionalized. End users will then be served by the caches in that region and only use others in the case of failure or missing capacity. The fail-over caches can also be signaled with BGP, just like you would do with a primary and backup connection to your transit provider.</p> <p>If the address plan is not regionalized, all customers should be announced to all caches. The geolocation of the end user’s IP address and physical distance will determine where the end users are sent. This works well for geographically large networks but less so if you run an extensive network in a small area. In that case, it is challenging to prevent traffic from crisscrossing all over the network, and a better solution is to build a larger cache location in the center of the network.</p> <p>In the case of CDNs using DNS server mapping, the ISPs must run dedicated DNS servers for each region where they want to build a cache location.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6Djr3pXc5szzbnPsCqNxHX/cd30167866d1a9f779ddaa4f0a469d77/embedded-cdns-cloud-diagram.png" style="max-width: 600px;" class="image center no-shadow" alt="CDNs with DNS server mapping" /> <h2 id="conclusion">Conclusion</h2> <p>Slow page loads and jittery audio and video repel viewers. Without CDN technology efficiently delivering internet content, we would be unlikely to be able to read, listen, and watch the wide variety of content that’s available online today – not without paying more for it or suffering from poor quality experiences. Embedding caches from the CDNs into your network can help you optimize the delivery and remove a lot of the strain of the large amounts of traffic. Kentik can help you analyze and understand the impact of CDN traffic on your network and whether embedding caches is for you. Please watch this <a href="https://www.kentik.com/resources/nfd-sp-2-observing-content-delivery-and-over-the-top-ott-services/">Networking Field Day presentation by Steve Meuse</a>, a Kentik solutions architect, for an overview of how Kentik can help you understand how OTT traffic and CDNs impact your network.</p><![CDATA[Practical Steps for Enhancing Reliability in Cloud Networks - Part I]]><![CDATA[Delivering on network reliability causes an enterprise’s data to become more distributed, introducing advanced challenges like complexity and data gravity for network engineers and operators. Learn concrete steps on how to implement cloud reliability and the trade-offs that come with it.]]>https://www.kentik.com/blog/practical-steps-for-enhancing-reliability-in-cloud-networks-part-1https://www.kentik.com/blog/practical-steps-for-enhancing-reliability-in-cloud-networks-part-1<![CDATA[Ted Turner]]>Wed, 05 Apr 2023 04:00:00 GMT<p>When evaluating solutions, whether to internal problems or those of our customers, I like to keep the core metrics fairly simple: will this reduce costs, increase performance, or improve the network’s reliability?</p> <p>It’s often taken for granted by network specialists that there is a trade-off among these three facets. If a solution is cheap, it is probably not very performant or particularly reliable. Does a solution offer an impressive degree of reliability? Then it is unlikely to also be inexpensive and performant. Balancing these trade-offs across the many components of at-scale cloud networks sits at the core of network design and implementation.</p> <p>While there is much to be said about <a href="https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-i/">cloud costs and performance</a>, I want to focus this article primarily on reliability. More than anything, reliability becomes the principal challenge for network engineers working in and with the cloud.</p> <h2 id="what-is-cloud-network-reliability">What is cloud network reliability?</h2> <p>Reliability is the degree to which a network is able to behave as expected, even in the presence of stress or failures.
Accounting for these stresses and failures in highly scaled, distributed cloud networks can be particularly challenging, as many network components and services are temporary, highly elastic, and outside the complete control of network operators.</p> <h3 id="the-components-of-reliability">The components of reliability</h3> <p>With such a broad mission, reliability is almost a catch-all for several related network health components. Each of these components provides a critical aspect of delivering a reliable network:</p> <ul> <li><strong>Availability</strong>. Simply put, availability is uptime. Highly available networks are resistant to failures or interruptions that lead to downtime and can be achieved via various strategies, including redundancy, savvy configuration, and architectural services like load balancing.</li> <li><strong>Resiliency</strong>. Resiliency is a network’s (or network component’s) ability to recover in the presence of stressors or failure. Resilient networks can handle attacks, dropped connections, and interrupted workflows. Resiliency can be contrasted against redundancy, which replaces, as opposed to restores, a compromised network component.</li> <li><strong>Durability</strong>. When the network has been subjected to interrupted service, durability measures ensure that network data remains accurate and whole. This can mean redundant and more complex data infrastructure or properly accounting for more nuanced concepts like idempotency and determinism in the presence of failure.</li> <li><strong>Security</strong>. Vulnerabilities enable malicious actors to compromise a network’s availability, resiliency, and durability. Even the most detailed reliability engineering can be easily undermined in an insecure network.</li> </ul> <h2 id="implementing-cloud-reliability">Implementing cloud reliability</h2> <p>Making a cloud network reliable is, as of the writing of this post, more than a matter of checking a few boxes in a GUI. It takes cross-departmental planning to envision, create, test, and monitor the best-for-business version of a reliable network. If implemented poorly, many organizations find themselves wasting resources on arbitrary monitoring and persistently vulnerable systems.</p> <p>Whatever your organization’s best-fit version of reliability, there will still be some consistent features and challenges.</p> <h3 id="redundancy">Redundancy</h3> <p>While not always the most affordable option, one of the most direct reliability strategies is redundancy. Be it power supplies, servers, routers, load balancers, proxies, or any other physical and virtual network components, the horizontal scaling that redundancy provides is the ultimate safety net in the presence of failure or atypical traffic demands.</p> <p>Besides the direct costs associated with having more instances of a given component, redundancy introduces additional engineering concerns like orchestration and data management. Careful attention needs to be paid here as these efforts can complicate other aspects of network reliability, namely durability.</p> <h3 id="monitoring">Monitoring</h3> <p>An essential part of my definition of reliability is “as expected,” which implies that there are baselines and parameters in place to shape what reliable performance means for a given network. Direct business demands like SLAs (service level agreements) help define firm boundaries for network performance. 
However, arriving at specs for other aspects of network performance requires extensive monitoring, dashboarding, and data engineering to unify this data and help make it meaningful.</p> <p>By collecting and analyzing network telemetry, including traffic flows, bandwidth usage, packet loss rates, and error rates, NetOps leverage monitoring to detect and diagnose potential bottlenecks, security threats, and other issues that can impact network reliability, often before end users even notice a problem. Additionally, monitoring becomes critical for network optimizations by identifying areas where resources are under or overutilized.</p>
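<p>To make baselining concrete, here is a toy sketch of the kind of check a monitoring pipeline might perform: it learns a baseline from a trailing window of latency samples and flags samples that deviate sharply. Real systems are far more sophisticated; the numbers and threshold here are purely illustrative:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">import statistics

def flag_anomalies(samples, window=20, z_threshold=3.0):
    """Flag samples that deviate strongly from the trailing-window baseline."""
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline) or 1e-9  # guard against divide-by-zero
        z = (samples[i] - mean) / stdev
        if abs(z) > z_threshold:
            alerts.append((i, samples[i], round(z, 1)))
    return alerts

# illustrative latency series (ms): steady around 20, then a spike
latency_ms = [20.1, 19.8, 20.3] * 10 + [45.0]
print(flag_anomalies(latency_ms))
</code></pre></div>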
<h3 id="network-reliability-engineers-nres">Network reliability engineers (NREs)</h3> <p>The scale, transient nature of resources and topology, and extensive network boundaries make delivering on network reliability a complex and specialized task. Much the way containerization and cloud computing called for the advent of site reliability engineers (SREs) in application development, the complexity of cloud networks has led to the rise of network reliability engineers (NREs).</p> <p>Their role involves working closely with network architects, software developers, and operations teams to ensure consistent networking priorities and implementations. NREs typically have a strong background in network engineering and are well-versed in technologies such as routing protocols, switching, load balancing, firewalls, and virtual private networks (VPNs). They also have expertise in automation, scripting, and software-defined networking (SDN) to streamline network operations and reduce manual errors.</p> <h2 id="the-trade-offs-of-cloud-network-reliability">The trade-offs of cloud network reliability</h2> <p>To close, I’d like to bring the discussion back to the trade-offs in cost and performance that network operators have to make when prioritizing reliability.</p> <h3 id="costs">Costs</h3> <p>Redundancy isn’t cheap. No matter how you slice it, additional instances, hardware, etc., will simply cost more than having fewer. And while redundancy isn’t the only way to ensure reliability in the cloud, it will be a significant part of an organization’s strategy.</p> <p>Thankfully, cloud networks give us a lot of options for tuning reliability up (or down):</p> <ul> <li>VPN or private backbone </li> <li>Load balancer at network tier or application tier </li> <li>Application-focused load distribution and degraded failures instead of total failures </li> <li>Availability zone allocation and multiple networks and paths </li> <li>Diversity of WAN providers (two or more providers) </li> <li>Multipath choices <ul> <li>LAG</li> <li>ECMP </li> <li>Routing path selection (even if sub-optimal) </li> </ul> </li> <li>Diversity of geography </li> <li>Diversity of naming resolution (DNS), which can choose from the above choices </li> <li>Caching of content (CDN) to reduce the volume of data being delivered to an endpoint </li> <li>Applying quality of service for preferred applications <ul> <li>Examples of preferred applications typically thrown into this category are Voice and Video (Zoom, WebEx, etc.), accounting, cash registers, and customer-facing sites</li> <li>Examples of deferred applications are backups, data replication (if it can catch up after peak business loads daily), and backend staffing needs like housekeeping or HR </li> </ul> </li> </ul> <p>Optimizing against these choices is going to be a shifting challenge for network operators, but if handled correctly, networks can achieve reliability that fits both cost restraints and business goals.</p> <h3 id="performance">Performance</h3> <p>If not adequately accounted for, the additional data infrastructure, monitoring efforts, and redundancy measures can soak up bandwidth available in data centers and cloud contracts, leading to network issues like cascading failures that can be difficult to pinpoint.</p> <p>Top talkers are often the cause of outages for application stacks. Top talkers can be something like customers on a Black Friday sale. But sometimes, a top talker is a backup operation that simply grew in size, and the backup process now exceeds the “projected/estimated” backup window.</p> <p>When backup operations occur during staffing, customer visits, or partner-critical operations, contention occurs. However, as we move into the cloud, some organizations take much of their customer receipts, invoicing, and other components as a system of record (SOR) and drop the records into a cloud storage location. This is something of a backup for the business operations but does not meet the typical definition of a “backup operation.”</p> <p>Working with customers, I have seen bandwidth consumption patterns of 4:1, where four 10GB links are needed to store customer records in cloud storage, but only a single 10GB link is necessary for typical application processing for day-to-day business needs. In this case, choosing to separate the storage traffic from the normal business traffic enhances both performance and reliability.
However, deploying four more 10GB circuits for the cloud storage becomes the cost of achieving performance and reliability.</p> <h2 id="conclusion">Conclusion</h2> <p>Delivering on network reliability causes an enterprise’s data to become (more) distributed, introducing advanced challenges like complexity and <a href="https://www.kentik.com/blog/data-gravity-in-cloud-networks-massive-data/">data gravity</a> for network engineers and operators.</p> <p>In my next article, we will take a closer look at how these challenges manifest and how to manage them.</p><![CDATA[eBPF Explained: Why it's Important for Observability]]><![CDATA[eBPF is a powerful technical framework to see every interaction between an application and the Linux kernel it relies on. eBPF allows us to get granular visibility into network activity, resource utilization, file access, and much more. It has become a primary method for observability of our applications on premises and in the cloud. In this post, we’ll explore in-depth how eBPF works, its use cases, and how we can use it today specifically for container monitoring.]]>https://www.kentik.com/blog/ebpf-explained-why-its-important-for-observabilityhttps://www.kentik.com/blog/ebpf-explained-why-its-important-for-observability<![CDATA[Phil Gervasi]]>Tue, 04 Apr 2023 04:00:00 GMT<p>eBPF is a lightweight runtime environment that gives you the ability to run programs inside the kernel of an operating system, usually a recent version of Linux. That’s the short definition. The longer definition will take some time to unpack. In this post, we’ll look at what eBPF is, how it works, and why it’s become such a common technology in observability.</p> <h2 id="what-is-ebpf">What is eBPF?</h2> <p>eBPF, which stands for <strong><a href="https://www.kentik.com/kentipedia/what-is-ebpf-extended-berkeley-packet-filter/" title="Kentipedia: What is eBPF? (Extended Berkeley Packet Filter)">Extended Berkeley Packet Filter</a></strong>, is a lightweight virtual machine that can run sandboxed programs in a Linux kernel without modifying the kernel source code or installing any additional modules.</p> <p>eBPF operates with hooks into the kernel so that whenever one of the hooks triggers, the eBPF program will run. Since the kernel is basically the software layer between the applications you’re running and the underlying hardware, eBPF operates just about as close as you can get to the line-rate activity of a host.</p> <p>An application runs in what’s called <strong>user space</strong>, an unprivileged layer of the technology stack that requires the application to request resources via the system call interface to the underlying hardware. Those calls could be for kernel services, network services, accessing the file system, and so on.</p> <p>When an application runs from the user space, it interacts with the kernel many, many times. eBPF is able to see everything happening at the kernel level, including those requests from the user space, or in other words, by applications. 
Therefore, by looking at the interactions between the application and the kernel, we can learn almost everything we want to know about application performance, including local network activity.</p> <p>Note that eBPF can also be used to monitor user space via uprobes, but we focus primarily on kernel activity for network observability.</p> <div as="Promo"></div> <h2 id="how-does-ebpf-work">How does eBPF work?</h2> <h3 id="bytecode">Bytecode</h3> <p>The BPF virtual machine runs a custom bytecode designed for verifiability, which is to say that you <em>can</em> write directly in bytecode, though writing directly in bytecode is onerous at best. Typically, eBPF programs are written to bytecode using some other language. For example, developers often write programs in C or <a href="/blog/using-rust-for-kentiks-new-synthetic-network-monitoring-agent/" title="Kentik Blog: Using Rust for Kentik's New Synthetic Monitoring Agent">Rust</a> and compile them with clang, part of the LLVM toolchain, into usable bytecode.</p> <p>Bytecode is generated by a compiler, but the actual programs are compiled just-in-time (JIT). This also allows the kernel to validate the code within boundaries before running it. The JIT step is optional and should occur after validation.</p> <p>eBPF bytecode is a low-level instruction set written as a series of 64-bit instructions executed by the kernel. The eBPF bytecode instructions are expressed as hexadecimal numbers, each consisting of an opcode and zero or more operands.</p> <p>Here’s an example of what eBPF bytecode might look like:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">0x85, 0x00, 0x00, 0x00, 0x02 ; load 2 into register 0
0x18, 0x00, 0x00, 0x00, 0x00 ; load 0 into register 1
0x07, 0x00, 0x00, 0x00, 0x00 ; add registers 0 and 1, store result in register 0
0xbf, 0x00, 0x01, 0x00, 0x00 ; exit syscall with return value from register 0</code></pre></div> <p>This example code loads the value 2 into register 0, loads 0 into register 1, adds the values in registers 0 and 1, and then exits the program with the result in register 0. This is a simple example, but eBPF bytecode can perform much more complex operations.</p> <h3 id="using-python-to-write-ebpf-applications">Using Python to write eBPF applications</h3> <p>Additionally, developers often use a Python front end to write an eBPF application in user space. This makes writing eBPF programs much easier because of how commonly used Python is, but also because of the many existing libraries for developers to take advantage of. However, it’s important to note that these Python programs are very specific to the BCC toolchain, not a general way of writing BPF apps.</p> <p>To program eBPF with Python, you can use the <code class="language-text">bpf</code> module in the <code class="language-text">bpfcc</code> library.
This library provides a Python interface to the BPF Compiler Collection (BCC), which allows you to write and load eBPF programs from Python.</p> <p>For example, to write an eBPF program in Python to monitor <code class="language-text">tcpretransmits</code>, you can use the <code class="language-text">BPF</code> class from the <code class="language-text">bpfcc</code> library to define a <code class="language-text">kprobe</code> that attaches to the <code class="language-text">tcp_retransmit_skb</code> function and captures information about retransmissions.</p> <p>Here’s an example of what the Python code might look like:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">from bcc import BPF
import ctypes as ct

# define the eBPF program
prog = """
#include &lt;uapi/linux/ptrace.h>
#include &lt;net/sock.h>

BPF_HASH(start, u32);
BPF_PERF_OUTPUT(events);

int trace_retransmit(struct pt_regs *ctx, struct sock *sk, struct sk_buff *skb)
{
    u32 pid = bpf_get_current_pid_tgid();

    // track the starting time of the retransmit attempt
    u64 ts = bpf_ktime_get_ns();
    start.update(&amp;pid, &amp;ts);
    return 0;
}

int trace_retransmit_ret(struct pt_regs *ctx)
{
    u32 pid = bpf_get_current_pid_tgid();
    u64 *tsp = start.lookup(&amp;pid);

    // calculate the duration of the retransmit attempt
    if (tsp != NULL) {
        u64 now = bpf_ktime_get_ns();
        u64 delta = now - *tsp;
        events.perf_submit(ctx, &amp;delta, sizeof(delta));
        start.delete(&amp;pid);
    }
    return 0;
}
"""

# create and load the eBPF program
bpf = BPF(text=prog)

# attach the eBPF program to the tcp_retransmit_skb function
bpf.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")
bpf.attach_kretprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit_ret")

# define a function to handle the perf output events
def print_event(cpu, data, size):
    # unpack the duration (in nanoseconds) of the retransmit attempt
    duration_ns = ct.cast(data, ct.POINTER(ct.c_ulonglong)).contents.value
    print("TCP retransmit detected (duration: %0.2f ms)" % (duration_ns / 1000000.0))

# loop and handle perf output events
bpf["events"].open_perf_buffer(print_event)
while True:
    bpf.kprobe_poll()</code></pre></div> <p>In this example, we define an eBPF program that creates a hash map to track the starting time of retransmit attempts and a <code class="language-text">PERF_OUTPUT</code> event that captures the duration of the retransmit attempt. We then attach the eBPF program to the <code class="language-text">tcp_retransmit_skb</code> function using both a <code class="language-text">kprobe</code> and a <code class="language-text">kretprobe</code>, which allows us to capture both the start and end of the function call.</p> <p>We define a function to handle the <code class="language-text">PERF_OUTPUT</code> events, which unpacks the duration of the retransmit attempt and prints it to the console. Finally, we loop and handle the perf output events using the <code class="language-text">open_perf_buffer</code> and <code class="language-text">kprobe_poll</code> methods.</p> <p>This eBPF program will track <code class="language-text">tcpretransmits</code> and print out the duration of each retransmit attempt. You can modify the program to capture other information as well, such as the number of retransmit attempts or the source and destination IP addresses of the affected packets.</p> <p>This method can help us understand the cause of an application performance problem. In general, retransmits are a sign that some packet loss is occurring on the network.
This could be due to congestion or even errors on the remote NIC. Usually, when this happens, the network connection seems slow, but things are still working.</p> <h3 id="higher-level-abstractions">Higher level abstractions</h3> <p>For another level of abstraction, open source tools have emerged, such as <a href="https://cilium.io/" title="Cilium: Open source eBPF-based Networking, Observability, Security">Cilium</a>, which runs as an agent in container pods or on servers. Often tied with common tools like Grafana and Prometheus, Cilium is a management overlay used to manage container networking using eBPF. However, Cilium is also much more than this. It’s a data plane that leverages eBPF to implement service meshes, observability, and networking functions as well.</p> <p>Now owned by New Relic, <a href="https://px.dev/" title="Pixie: Open source Kubernetes observability">Pixie</a> is another popular open source eBPF management overlay with an attractive graphical user interface. Since these management tools operate at the kernel level via eBPF, they can also be used for observability, especially with containers.</p> <h2 id="ebpf-programs-interacting-between-user-space-and-kernel">eBPF Programs: Interacting between user space and kernel</h2> <p>Regardless of how you write them, the eBPF programs themselves are loaded from user space into the kernel and attached to a kernel event. This is when we start to see the benefits of eBPF because when the event we attached our program to occurs, our program runs automatically. So after being loaded from user space, an eBPF program will live in the kernel.</p> <p>Before the program is loaded into the kernel, it’s run through a built-in verification function called a verifier that ensures the eBPF program is safe from both operational and security perspectives. This is important because it’s in this way that we know that our eBPF programs won’t use resources they shouldn’t or create a type of loop scenario. However, it’s important to note that the verifier doesn’t perform any sort of policy checks on what can be intercepted.</p> <p>After the eBPF program passes the verifier, it’s just-in-time compiled into native instructions and attached to the hooks you want to use for your custom program. On the left side of the graphic below, you can see the eBPF program go from user space (Process) through the verifier, the JIT compiler, and then on the right, attached to the relevant hook(s).</p> <img src="//images.ctfassets.net/6yom6slo28h2/6qzHTmPGuypnPzcOLd4QKQ/2ece9b868909168bf1d5dba7dd2df267/ebpf-diagram-process-verifier-compiler.png" style="max-width: 600px;" class="image center" alt="Diagram of eBPF, the compiler, verifier and hooks" /> <div class="caption" style="margin-top: -30px;">Image source: <a href="https://ebpf.io/what-is-ebpf/">https://ebpf.io/what-is-ebpf/</a></div> <p>Hooks can be almost anything running in the kernel, so on the one hand, eBPF programs can be highly customized, but on the other hand, there are also inherent limitations due to the verifier limiting access to the program.</p> <p>Once run, the eBPF program may have gathered information that needs to be sent back to user space for some other application to access. This could be to retrieve configuration to run when a hook is triggered or to store gathered telemetry for another program to retrieve. For this, we can use <strong>eBPF maps</strong>. eBPF maps are basically generic data structures with key/value pairs and read/write access by the eBPF program, other eBPF programs, and user space code such as another application.</p> <p>Like eBPF programs, eBPF maps live in the kernel, and they are created and accessed from user space using the BPF syscall and accessed by the kernel via BPF helper functions. There are several types of maps, such as an array, hash, prog array, stack trace, and others, with hash maps and arrays being the most commonly used.</p> <p>Though eBPF maps are a common method for coordinating with user space, Linux <strong>perf events</strong> would also likely be used for large volumes of data like telemetry.</p>
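<p>To make the coordination between the two sides concrete, here is a minimal sketch written against the BCC toolchain introduced earlier. It counts process forks in a <code class="language-text">BPF_HASH</code> map inside the kernel and then reads the map from user space like a dictionary. It is an illustration, not production tooling:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">import time
from bcc import BPF

prog = """
BPF_HASH(counts, u32, u64);

TRACEPOINT_PROBE(sched, sched_process_fork) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *val = counts.lookup_or_try_init(&amp;pid, &amp;zero);
    if (val) { (*val)++; }
    return 0;
}
"""

b = BPF(text=prog)  # the map is created in the kernel at load time
time.sleep(5)       # let the tracepoint accumulate some data

# from user space, the eBPF map behaves like a dict of ctypes values
for pid, forks in b["counts"].items():
    print("pid %d forked %d times" % (pid.value, forks.value))
</code></pre></div>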
<h2 id="lightweight-performance-monitoring-with-ebpf">Lightweight performance monitoring with eBPF</h2> <p>In the context of eBPF, “lightweight” means several things. First, eBPF is fast and performant. The eBPF program uses very minimal resources. eBPF uses a just-in-time (JIT) compiler, so once the bytecode is compiled, it isn’t necessary to re-interpret the code every time a program is run. Instead, the eBPF program runs as native instructions, which is a faster and more efficient method for running the underlying bytecode.</p> <p>Second, an eBPF program doesn’t rely on probes or a visibility touchpoint in the network or application, so no traffic is added to the network. This may not be an issue in a very small, low-performance network; however, in a large network that requires many probes and touchpoints to monitor effectively, adding traffic can adversely affect the performance of the network in terms of latency, thereby skewing the monitoring results and possibly impacting application performance.</p> <blockquote style="border-left: 4px solid #1890FF"><em> There is an important distinction between monitoring traffic originating from or terminating at the system running the BPF program, as opposed to network traffic in general.</em></blockquote> <p>Of course, using probes and artificially generated traffic isn’t inherently bad. That sort of monitoring is very useful and plays a significant role in active monitoring. Still, in some scenarios, passive monitoring is required to get the granular, real-time performance statistics of production traffic as opposed to the artificial traffic among monitoring agents.</p> <p>Third, because eBPF can glean telemetry directly from the processes running in the kernel, there’s no need to capture every single packet to achieve extremely granular visibility. Imagine a scenario in which you’re running 40Gbps, 100Gbps, or even 400Gbps links, and you need that level of granularity. Capturing every packet at those link rates would be nearly impossible, let alone prohibitively expensive to do. Using eBPF, there’s no need for an additional physical tap network, and there’s no need to store the enormous number of copied packets.</p> <p>Next, eBPF doesn’t rely on traffic passing through probes or agents, which may need to traverse a variety of network devices both on-premises and in the cloud. For example, to determine latency using traffic generated from probes or by analyzing packets, that traffic would likely pass through routers, firewalls, security appliances, load balancers, etc. Each of those network elements could potentially add latency, especially the security devices doing DPI.</p> <p>Lastly, prior to eBPF, kernel modules had to be written and inserted into the kernel. This could, and often did, have catastrophic results.
Before eBPF, if a new module inserted into the kernel faulted, the module would also cause the kernel to crash.</p> <h3 id="ebpf-and-application-latency">eBPF and application latency</h3> <p>When determining application latency accurately, eBPF is very useful because it draws information directly from the kernel and not from traffic moving around the network. Additionally, those routers, load balancers, and firewalls could potentially route traffic differently packet-by-packet or flow-by-flow, meaning the visibility results may not be accurate.</p> <p>Deterministic best-path selection is a strength of modern networking, but when it comes to measuring latency, if your probes take a different path each time, it poses a problem in getting an accurate picture of network latency between two targets.</p> <p>Instead, an eBPF program is designed to observe what’s happening in the kernel and report on it. Network and kernel I/O latency have a direct relationship with application latency, and there are no probes to skew the data or packets to capture and process.</p> <h2 id="active-vs-passive-monitoring">Active vs. passive monitoring</h2> <p>There are two main categories of visibility: active and passive.</p> <h3 id="active-monitoring">Active monitoring</h3> <p>Active visibility tools modify a system, in our case a network, to obtain telemetry or perform a test. This is very useful, especially in networking, because we can use active visibility to test the state of a network function or network segment without relying on end-user production traffic.</p> <p>For example, when you ping a target to test its availability, you add non-user-related ICMP traffic to the network. In that way, you can see if your resource is online and responding, at least at layer 3. You can also get an idea of round trip time, latency, jitter, and so on.</p> <p>This is also how synthetic testing, sometimes called <a href="https://www.kentik.com/kentipedia/what-is-synthetic-monitoring/" title="Kentipedia: What is Synthetic Monitoring?">synthetic monitoring</a> or just <a href="https://www.kentik.com/product/synthetics/">synthetics</a>, works. Synthetic testing also uses artificial traffic instead of production traffic to perform some type of test function, such as measuring latency, confirming availability, or monitoring path selection.</p> <p>Synthetic tests can be very advanced in what they monitor. For example, we can use synthetic testing to simulate an end-user logging into an e-commerce site. Using a synthetic test, we can capture the metrics for each component of that interaction from layer 3 to the application layer itself.</p>
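<p>At its simplest, an active check is just a probe you originate and time yourself. The sketch below is deliberately bare-bones – nothing like a full synthetic monitoring agent – and measures TCP connect time to a target as a rough layer 4 availability and latency test (the host and port are placeholders):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">import socket
import time

def tcp_connect_ms(host, port, timeout=2.0):
    """Time a TCP handshake to the target; returns None if unreachable."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None

# placeholder target; a real synthetic test would also measure DNS, TLS, HTTP, etc.
rtt = tcp_connect_ms("www.example.com", 443)
print("unreachable" if rtt is None else "TCP connect: %.1f ms" % rtt)
</code></pre></div>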
<p>However, though active visibility is very powerful and should be a part of any overall monitoring solution, there are several inherent drawbacks.</p> <p>First, by adding traffic to the system, you’re technically not measuring what’s happening with your application or end-user traffic. You’re collecting telemetry on the test traffic. This isn’t necessarily a bad thing, but it isn’t the same as collecting metrics on the application’s activity itself.</p> <p>For example, suppose you want to know the accurate network latency affecting a production application located in your private cloud. In that case, the traffic of a ping or even a synthetic test may take a different path there and back. Therefore, you would have a twofold problem: first, the active monitoring didn’t test actual application activity, and second, the results may be for a completely different network path.</p> <p>Second, in a busy network, devices such as routers, switches, and firewalls may already be operating at relatively high CPU utilization. This is common in service provider networks and for data center core devices. In this scenario, sending test traffic to the busy router or switch would be a bad idea, since it adds to the packets the device has to process. In some instances, the ongoing monitoring activity of a router might be enough to affect the performance of other applications adversely.</p> <h3 id="passive-monitoring">Passive monitoring</h3> <p>Passive monitoring provides information on what’s happening both historically and in near-real-time. The telemetry gathered from passive monitoring is of actual application, system, and network activity, making the information the most relevant for knowing how an application performs and what the end-user experience is like. No changes are made to the system that would affect your visibility results.</p> <p>However, passive monitoring also has its limitations. Because you’re gathering telemetry from actual production traffic, you’re relying on end-user activity to tell you if things are bad. That means to know if there’s a problem, your end-users are probably already having a poor experience.</p> <p>One workaround is that passive telemetry tools can use hard and dynamic thresholds to alert you when metrics are trending worse. In that way, an engineer can anticipate a poor end-user experience before it happens, or at least before it gets really bad. However, alerting with passive monitoring still relies on production traffic trending worse, so though we can anticipate poor performance to an extent, it’s still not ideal.</p> <p>In its truest form, observability is about monitoring a system without affecting or changing it. It’s about looking at the various outputs of a system to determine its health. eBPF sees the activity happening in the kernel and reports on it rather than adding anything to the system other than the nominal resources it consumes to operate.</p> <blockquote style="border-left: 4px solid #1890FF"><em>Therefore, eBPF is a form of passive monitoring because no changes are made to the system, the application, or the traffic itself.</em></blockquote> <h2 id="ebpf-use-cases">eBPF use cases</h2> <p>There are several use cases for running eBPF at the kernel level. The first is for networking, specifically routing. Using eBPF, we can program kernel-level packet forwarding logic, which is how certain high-performance routers, firewalls, and load balancers operate today. Programming the forwarding logic at the kernel level results in significant performance gains since we are, in effect, routing in hardware at line-rate.</p> <p>The most commonly used hooks for networking are <a href="https://github.com/xdp-project/xdp-tutorial" title="XDP Programming Hands-On Tutorial repo on GitHub">XDP</a>, or eXpress Data Path, tc, or traffic control, and the variety of hooks used for programming the data plane directly. XDP and tc are often used in conjunction because XDP can capture only ingress traffic information, so tc will also be used to capture information about egress traffic.</p>
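<p>To give a feel for the XDP hook itself, here is a minimal sketch, again using BCC, that attaches a trivial program at the ingress path of a network interface and counts packets before the kernel’s network stack ever sees them. The interface name is an assumption for illustration, and the program simply passes every packet along (<code class="language-text">XDP_PASS</code>) rather than doing any real forwarding:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">import time
from bcc import BPF

prog = """
#define KBUILD_MODNAME "xdp_counter"
#include &lt;uapi/linux/bpf.h>

BPF_ARRAY(pkt_count, u64, 1);

int xdp_counter(struct xdp_md *ctx) {
    int key = 0;
    u64 *value = pkt_count.lookup(&amp;key);
    if (value) { __sync_fetch_and_add(value, 1); }
    return XDP_PASS;  // hand every packet on to the normal stack
}
"""

device = "eth0"  # assumed interface name
b = BPF(text=prog)
fn = b.load_func("xdp_counter", BPF.XDP)
b.attach_xdp(device, fn, 0)
try:
    time.sleep(10)
    total = sum(v.value for v in b["pkt_count"].values())
    print("packets seen on %s: %d" % (device, total))
finally:
    b.remove_xdp(device, 0)
</code></pre></div>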
<p>Second, eBPF can be used for both packet-level and system-call visibility and filtering, making it a powerful security tool. If an undesirable or potentially malicious system-call is observed, a rule can be applied to block it. If certain packet-level activity is observed, a filter can be applied to modify it. The benefit of this is providing visibility and remediation as close to the target as possible.</p> <p>A third use case is observability, which we’ll focus on in this post. In the classic sense, observability is determining the state of a system by looking at its outcomes without making any changes to the system itself. Since eBPF doesn’t affect the performance of the kernel, including its processes, we can get extremely accurate information about network and application performance without it being skewed by having to draw resources from the kernel itself.</p> <p>In this way, you can gather runtime telemetry data from a system that does not otherwise have to expose any visibility points that take up system resources. Furthermore, collecting telemetry this way represents data at the actual source of the event rather than using an exported format of sampled data.</p> <h2 id="what-can-you-learn-with-ebpf">What can you learn with eBPF?</h2> <p>In the graphic below from <a href="https://www.brendangregg.com/index.html" title="Brendan Gregg's Homepage">Brendan Gregg’s website</a>, dedicated to his extensive work with eBPF and observability, notice the variety of data we can collect directly from a device’s kernel.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3A5GyLEiPjDQmLKM94mHtb/e39b3e7896eba38de5e1c075e702e594/linux-bcc-bpf-tracing-tools.png" style="max-width: 800px;" class="image center" alt="eBPF and network observability - Brendan Gregg" /> <p>You can learn a tremendous amount of information using eBPF; this graphic represents only a portion of what you can do. eBPF is event-driven, so we collect information about every event in the kernel. We can then learn about everything happening on a host machine or container, including each individual application.</p> <p>An eBPF program runs when the kernel or the application you’re interested in passes a specified hook or hook point, including network events, system calls, function entry, function exit, etc.</p> <p>So if we want to know about a single application’s activity and overall performance, we can learn by using specific hooks that grab that telemetry without modifying the application or inadvertently affecting its performance.</p> <p>In the graphic above, the network stack is the area in light green.
Notice again what kind of information you can learn directly from the source using eBPF.</p> <p>These are all important functions of observability at the network layer, so to expand on just a few:</p> <ul> <li><strong>tcptop</strong> allows you to summarize send and receive throughput by host.</li> <li><strong>tcpdrop</strong> allows you to trace TCP packet drops.</li> <li><strong>tcpconnect</strong> allows you to trace active TCP connections.</li> <li><strong>tcpretransmit</strong> allows you to see the retransmission of TCP packets when an acknowledgement expires, a common cause of latency.</li> <li><strong>tcpstate</strong> allows you to see the TCP state changes and the duration in each part of the process.</li> </ul> <p>With the information we get from the functions above and other eBPF tracing functions, we can ask questions such as:</p> <ul> <li> <p>Is my slow application experiencing TCP retransmits?</p> </li> <li> <p>Is network latency affecting the performance of my interactive application?</p> </li> <li> <p>Is traffic from my container(s) going to an embargoed country?</p> </li> <li> <p>Is the remote server I don’t own taking longer than expected to process my TCP request?</p> </li> </ul> <h2 id="monitoring-containers-with-ebpf">Monitoring containers with eBPF</h2> <p>Since we’re running production workloads today using containerized microservices, we can’t ignore the need for container visibility. However, containers present a problem for traditional visibility tools and methods.</p> <p>First, containers are ephemeral by nature, meaning they are often short-lived. They are spawned when needed and destroyed when unnecessary. Though we can do the same with virtual machines, it’s done so frequently with containers that capturing telemetry information gets difficult.</p> <p>Typical application, network, or infrastructure monitoring can’t easily capture the information we want from containers. You can consider each container as an individual host. Since containers can be enormous in number in a production environment, the sheer amount of metrics and telemetry available to gather is overwhelming.</p> <p>Also, containers are usually deployed en masse in cloud environments, making getting visibility information that much more difficult. It’s not as simple as monitoring virtual machines in your EC2 instance, running an APM solution, and collecting packets and flows from the network devices between you and your cloud environment.</p> <p>Since eBPF runs at the kernel level of a host, or in this case, a container, we can use eBPF programs to collect telemetry from ephemeral constructs such as containers and, to an extent, consolidate network, application, and infrastructure visibility tools into a single eBPF-based visibility solution.</p> <p>So with eBPF, we can capture information about processes, memory utilization, network activity, file access, and so on, at the container level, whether those containers are deployed in the cloud or not.</p> <h2 id="kappa-kentiks-process-aware-telemetry-agent">Kappa: Kentik's process-aware telemetry agent</h2> <p>Kappa is Kentik’s host-based telemetry agent, designed to address visibility gaps in east/west flows across production data centers.
<h2 id="kappa-kentiks-process-aware-telemetry-agent">Kappa: Kentik’s process-aware telemetry agent</h2> <p>Kappa is Kentik’s host-based telemetry agent, designed to address visibility gaps in east/west flows across production data centers. Kappa was built to help organizations better understand traffic flows, find congestion and performance hotspots, visualize and identify application dependencies, and perform network forensics across on-premises and cloud workloads.</p> <h3 id="kappa-features">Kappa features</h3> <p>Kappa uses eBPF to consume as few system resources as possible and to scale to 10 gigabits per second of sustained traffic throughput while consuming only a single core. Generating kernel flow data using eBPF allows Kentik to see the total traffic passing between any source and destination IP, port, and protocol across every conversation taking place within a host, cluster, or data center. Because this information is generated using the Linux kernel, Kappa also reports performance characteristics such as session latency and TCP retransmit statistics.</p> <h3 id="enrichment-for-container-monitoring">Enrichment for container monitoring</h3> <p>Kappa also enriches these flow summaries with application context. Using Kappa, we can associate conversations with the process name, PID, and the command-line syntax used to launch the IP conversation.</p> <p>The container ID is also associated if the process runs inside a container. If the container was scheduled by Kubernetes, Kappa enriches the flow record with the Kubernetes pod, namespace, workload, and the relevant node identifiers.</p> <p>Before exporting these records to Kentik, Kappa also looks for any records associated with other nodes in an environment and joins the duplicate traffic sources together with the source and destination context. This gives us a more complete picture of application communication within a data center.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6yxDyrsbnUsL5Rw45U9ftc/5d4654a7b8e371f1aec76ee73c537380/kappa-ebpf-container-monitoring.png" style="max-width: 800px;" withFrame class="image center" alt="Kubernetes monitoring in the Kentik platform" /> <p>Though container network monitoring is an important use case for Kappa, it was designed to seamlessly monitor bare-metal, containerized, and cloud-native workloads using a single, flexible agent deployed as a process, as a container, or directly into a Kubernetes cluster.</p> <p>Kappa is distributed as Kubernetes configuration files. Linux packages for VM/bare metal use are available at <a href="https://packagecloud.io/kentik/kappa" title="Kentik kappa on packagecloud">https://packagecloud.io/kentik/kappa</a>, and its configuration can be viewed on GitHub (<a href="https://github.com/kentik/kappa-cfg" title="kappa-cfg - kubernetes configs for kappa repo on GitHub">https://github.com/kentik/kappa-cfg</a>) or can be cloned to your workstation:</p> <p><code class="language-text">$ git clone https://github.com/kentik/kappa-cfg.git</code></p> <h2 id="conclusion">Conclusion</h2> <p>The nature of modern application delivery requires new methods for observability. Because so many applications are delivered over a network, ensuring application availability and great performance means having deep system visibility and granular network visibility in the context of those applications.</p> <p>This means gathering telemetry for a variety of devices related to applications and their delivery, including containers. 
And when you also factor in public cloud, SaaS, and the various network overlays of today’s WAN and campus, gathering this telemetry becomes more important and difficult.</p> <p>eBPF has emerged as a perfect solution for collecting passive telemetry in modern environments, especially in the context of cloud and cloud-native containers. Operating at the kernel level and being lightweight means eBPF can provide us the telemetry we need specifically about application activity without inadvertently and adversely affecting application performance.</p> <p>Though eBPF has other uses, such as networking and security, its benefits to modern observability are changing the way we both see and understand what’s happening with the applications making their way over networks around the world.</p><![CDATA[Reinforcing Networks: Advancing Resiliency and Redundancy Techniques]]><![CDATA[Resiliency is a network's ability to recover and maintain its performance despite failures or disruptions, and redundancy is the duplication of critical components or functions to ensure continuous operation in case of failure. But how do the two concepts interact? Is doubling up on capacity and devices always needed to keep the service levels up? ]]>https://www.kentik.com/blog/reinforcing-networks-advancing-resiliency-and-redundancy-techniqueshttps://www.kentik.com/blog/reinforcing-networks-advancing-resiliency-and-redundancy-techniques<![CDATA[Nina Bargisen]]>Thu, 30 Mar 2023 04:00:00 GMT<p>The truth is, designing a network that can withstand the test of time, traffic, and potential disasters is a challenging feat. That’s where network resiliency and redundancy come into play, helping network planners construct robust and efficient networks. But do we always need 100% redundancy to achieve resilience? Let’s find out.</p> <h2 id="resiliency-and-redundancy-in-networking">Resiliency and redundancy in networking</h2> <p>First things first, let’s define resiliency and redundancy in the context of networking. Resiliency is a network’s ability to recover and maintain its performance despite failures or disruptions, and redundancy is the duplication of critical components or functions to ensure continuous operation in case of failure.</p> <p>But how do these two concepts interplay? While redundancy can contribute to network resiliency, it’s not the only approach to building a resilient network. Various other mechanisms can also be employed, often leading to a more cost-effective and efficient network design.</p> <h2 id="routing-protocols-and-their-impact-on-network-resilience-the-roles-of-igp-and-bgp">Routing protocols and their impact on network resilience: The roles of IGP and BGP</h2> <p>Let’s first dive into how routing protocols, particularly Interior Gateway Protocols (IGP) and <a href="https://www.kentik.com/blog/new-year-new-bgp-leaks/">Border Gateway Protocol (BGP)</a>, can influence network resilience and help reduce the need for complete redundancy.</p> <p>Interior Gateway Protocols, like OSPF and IS-IS, are the unsung heroes of routing within an autonomous system. They play a critical role in determining the most efficient paths for data packets, ensuring fast and reliable communication within the network. By quickly adapting to network topology changes, IGPs contribute to network resilience by rerouting traffic when a link or node fails. 
This adaptability minimizes downtime and service disruption, thereby reducing the need for 100% redundancy.</p> <p>On the other hand, BGP governs how networks interact with each other on the internet. It maintains network connectivity and stability by selecting the best routes between autonomous systems. BGP’s ability to reroute traffic during network failures strengthens network resilience, as it can seamlessly switch to an alternative path if the primary route becomes unavailable.</p> <p>Segment routing (SR) has also emerged as a powerful approach to enhance network performance and reliability. As a modern source-routing technique, SR simplifies traffic engineering, optimizes resource utilization, and provides better scalability than traditional routing methods.</p> <p>When it comes to traffic engineering, segment routing allows network operators to define explicit paths for traffic, distributing it more evenly across the network. By doing so, it minimizes congestion and ensures optimal resource usage, ultimately leading to improved network resilience. Additionally, SR’s ability to enable fast rerouting in case of link or node failures is critical to maintaining network continuity. By proactively computing backup paths, traffic can be swiftly switched to an alternative path when a failure occurs, reducing the impact of failures on network performance.</p> <p>Segment routing’s flexibility is another key factor in its contribution to network resilience. With granular control over traffic flows, SR can be easily integrated with other network resilience mechanisms, such as load balancing and traffic prioritization. This adaptability allows network operators to design and implement resilient networks tailored to their specific requirements.</p> <h2 id="traffic-classes">Traffic classes</h2> <p>Traffic classes, a more traditional way of traffic prioritization, can help network planners prioritize certain types of traffic over others, allowing for better resource allocation and improved network performance. By prioritizing critical traffic and ensuring it always has the necessary resources, network planners can further enhance network resilience without the need for complete redundancy.</p> <p>The decision between 100% redundancy and using traffic classes depends on the specific requirements of the network and the business it supports. Both approaches come with their unique advantages and trade-offs.</p> <p>Opting for 100% redundancy may be more suitable for networks with strict uptime requirements or when data integrity is paramount. In such cases, the added investment in redundancy can pay off in the long run, as it provides an extra layer of protection against network failures.</p> <p>On the other hand, traffic classes offer a more flexible and cost-effective solution for networks that don’t require such stringent uptime guarantees. By prioritizing critical traffic and intelligently managing network resources, traffic classes can enhance network resilience without incurring the high costs associated with full redundancy.</p> <p>It should come as no surprise that there’s no one-size-fits-all solution when it comes to building a resilient network. Network planners must weigh the benefits and drawbacks of each approach, considering factors such as <a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1/">cost, performance, and specific business needs</a>. 
By understanding the role of routing protocols, traffic classes, and other network resiliency mechanisms, planners can make informed decisions that strike the right balance between resilience and redundancy.</p> <h2 id="resiliency-and-redundancy-for-efficient-networks">Resiliency and redundancy for efficient networks</h2> <p>Ultimately, it’s essential to grasp the interplay between resiliency and redundancy in networking to build robust and efficient networks. While redundancy can enhance network resiliency, a more comprehensive strategy that includes routing protocols, traffic classes, and other resiliency mechanisms can lead to a cost-effective and adaptable solution. Network planners must thoughtfully assess their network and business needs to determine the most suitable approach for constructing a resilient and efficient network infrastructure. By finding the perfect harmony between resilience and redundancy, network planners can establish a solid and dependable network capable of adapting to the ever-changing demands of the business.</p> <p>Remember, understanding and applying these concepts appropriately is key to ensuring your network’s smooth operation and longevity. With a well-planned and resilient network, you’ll be better equipped to handle challenges and changes as they arise.</p><![CDATA[Diving Deep into Submarine Cables: The Undersea Lifelines of Internet Connectivity]]><![CDATA[Under the waves at the bottom of the Earth’s oceans are almost 1.5 million kilometers of submarine fiber optic cables. Going unnoticed by most everyone in the world, these cables underpin the entire global internet and our modern information age. In this post, Phil Gervasi explains the technology, politics, environmental impact, and economics of submarine telecommunications cables.]]>https://www.kentik.com/blog/diving-deep-into-submarine-cables-undersea-lifelines-of-internet-connectivityhttps://www.kentik.com/blog/diving-deep-into-submarine-cables-undersea-lifelines-of-internet-connectivity<![CDATA[Phil Gervasi]]>Tue, 28 Mar 2023 04:00:00 GMT<p>Worldwide telecommunications began more than 170 years ago, in 1850, with the first commercial international submarine cable between England and France. A few years later, by 1858, the first trans-Atlantic telegraph cable connected London with North America when 143 words were transmitted in about 10 hours.</p> <p>Today, the United States’ financial and military command systems rely on global submarine cables. Trillions of dollars of daily transactions mean the entire global economy depends on submarine cables. 
Access to information for people around the world is dependent on submarine cables.</p> <div as="WistiaVideo" videoId="mswbehusup" audio></div> <p><em>Kentik’s data on submarine cable outages was highlighted in the <a href="https://www.nytimes.com/interactive/2024/11/30/world/africa/subsea-cables.html">New York Times article Undersea Surgeons</a>, featuring our director of internet analysis, Doug Madory.</em></p> <p>The infrastructure underpinning this incredible advancement in the transfer, availability, and access to information around the world is made almost entirely of submarine telecommunications cables running hundreds or thousands of meters below the waves on the ocean floor.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1qlEHXIfSxQZXCxFkf62eE/163a940c997a73ea7ae851dfedeebeaf/submarine-cable-map.jpg" style="max-width: 700px; padding: 0;" class="image center" thumbnail alt="Map of submarine cables" /> <div class="caption" style="margin-top: -30px;"><a href="https://www2.telegeography.com/">Image from TeleGeography</a></div> <p>TeleGeography’s <a href="https://submarine-cable-map-2023.telegeography.com/">interactive map</a> explores the world’s system of undersea fiber optic cables crossing oceans and following the coast of entire continents. And in a recent episode of <a href="https://www.kentik.com/telemetrynow/s01-e10/">Telemetry Now</a>, <a href="https://www.linkedin.com/in/alan-mauldin-53ba622/">Alan Mauldin</a>, Research Director for <a href="https://www2.telegeography.com/">TeleGeography</a>, explained that the world of submarine cables is about politics, money, the environment, global commerce, and of course, moving data vast distances across a wire.</p> <p>This post will explore some of these areas starting with the technology itself.</p> <h2 id="fiber-optics">Fiber optics</h2> <p>The submarine cables that move internet traffic around the world are made from silica glass fiber optic strands that most network engineers are likely familiar with. However, submarine cables need to allow light to travel very long distances with minimal attenuation, so the <a href="https://www.thefoa.org/tech/smf.htm">G.654</a> subset of fiber is used for undersea applications.</p> <p>The cables make use of Dense Wavelength Division Multiplexing (DWDM) to move large amounts of data by allowing different data streams in the form of multiple wavelengths of light to be sent simultaneously over a single fiber infrastructure. In the image below, notice how DWDM technology will take multiple data streams operating at different wavelengths and combine them for transport on a single optical fiber.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6r2OkfX1AGULaMnZKJHqMV/033430593077f88bfdea5b217d049fcf/dwdm-data-streams.png" style="max-width: 650px;" class="image center no-shadow" alt="DWDM technology - data streams" /> <div class="caption" style="margin-top: -30px;"><a href="https://www.ufispace.com/company/blog/what-is-dwdm-its-uses-benefits-components">https://www.ufispace.com/company/blog/what-is-dwdm-its-uses-benefits-components</a></div> <p>Submarine optical fibers are typically G.654A-D, meaning they are single mode and have lower attenuation than most common land-based fiber optic cable types. G.654E fibers are in the same family of fibers but are typically used for specialized land-based applications requiring an even lower loss fiber.</p> <div as="Promo"></div> <p>Low attenuation, or low signal loss, is important since crossing oceans means light will be traveling considerable distances. G.654 compliant fibers have a zero-dispersion wavelength at about 1300 nm and are optimized for use in the 1500 nm-1600 nm range. Attenuation is in the range of 0.15 dB/km to 0.17 dB/km, which means there is very little loss over long distances. However, attenuation is not zero, so the signal does need to be repeated or amplified at prescribed points in the path of very long cables.</p> <p>The amplifiers, sometimes called repeaters, are installed inline as part of the cable to restore the signal. These amplifiers are placed about every 70 km or so to enable the signal to travel vast distances while maintaining signal integrity.</p>
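<p>A quick back-of-the-envelope calculation shows why that amplifier spacing matters. The numbers below are illustrative mid-range values taken from the figures above, not the specifications of any particular cable system:</p> <pre><code class="language-python"># Rough loss budget for a single amplifier span (illustrative values only)
attenuation_db_per_km = 0.16   # mid-range of the 0.15-0.17 dB/km figure above
span_km = 70                   # approximate spacing between inline amplifiers

span_loss_db = attenuation_db_per_km * span_km     # 11.2 dB per span
power_remaining = 10 ** (-span_loss_db / 10)       # fraction of launch power left

print(f"Loss per span: {span_loss_db:.1f} dB")
print(f"Power reaching the amplifier: {power_remaining:.1%}")  # about 7.6%
</code></pre> <p>Even with excellent G.654 fiber, only around 8% of the optical power survives each 70 km span, which is why a trans-oceanic cable simply cannot work without periodic inline amplification.</p>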
<p>Most submarine cables are powered with about 20,000 volts from both ends. Each power source powers half of the inline amplifiers. If power is lost from one end, the other power source can power the amplifiers throughout the cable until the power issue is resolved.</p> <p>The image below is of a typical submarine cable amplifier, though they can vary in size.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7ClDyBT5WqmuHJXru29LYz/d45d50398c44ff7057cfb1148dd9b716/submarine-cable-amplifier.jpg" style="max-width: 700px;" class="image center" alt="Typical submarine cable amplifier" /> <h2 id="submarine-cable-structure">Submarine cable structure</h2> <p>A submarine cable is made up of several layers. Submarine cables are typically composed primarily of marine-grade polyethylene with steel strength members, copper conductors, and a glass fiber core.</p> <div class="image right no-shadow" style="max-width: 400px; padding-top: 10px;"> <img src="//images.ctfassets.net/6yom6slo28h2/6ezt61AVZLUNXwB0omNYIU/b860bc51bc0258cc3b40b13491147b7c/submarine-cable-structure.png" class="no-shadow" alt="Composition of a cable" /> <div class="caption" style="margin-bottom: 0;"><a href="https://www2.telegeography.com/">Image from TeleGeography</a></div></div> <p>At the core are the silica glass optical fibers themselves, each of which is not much thicker than a human hair. Surrounding the fibers are various layers of protection and insulation.</p> <p>The total thickness of a submarine cable without additional protective armor is only about that of a garden hose, approximately 20mm. The cable could be 50mm or more with the additional protective armor, depending on the application. The outermost armor layer protects the cable in the harsh environment at the bottom of the ocean.</p> <p>Data transfer rates vary among the approximately 550 active cables running along the ocean floor. Because submarine cables are designed with a 25-year life expectancy in mind, some active cables are already two decades old and support a lower data transfer rate compared to newer cables like the <a href="https://www.submarinecablemap.com/submarine-cable/marea">MAREA cable</a>, laid in 2018, which runs over 6,600 km and supports a transfer rate of 224 Tbps.</p> <p>When cables reach the end of their lifespan due to economic or technical reasons, they could be simply left unused on the ocean floor, repurposed in some way, or salvaged for raw materials.</p> <p>There are approximately 1.4 million kilometers of active submarine cables in the world today. 
They vary in length from short spurs only a few kilometers long to long-distance cables such as the recently installed 2Africa cable at 45,000 km long. The 2Africa cable will connect 34 countries on three different continents and will be fully operational in 2024.</p> <p>Some larger three-core power cables may also have some fiber strands to carry communication traffic. Still, dedicated communication submarine cables are generally much smaller, varying in size based on the amount of protective armor.</p> <h2 id="cable-landing-stations">Cable landing stations</h2> <p>Ultimately, submarine cables lay on the ocean floor with terminations, or landings, on either end with the possibility for spurs along the path.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6ERKonco4kYi0vrUFq622z/467255fba5e68b3fdbf937c171770b57/submarine-cable-ocean-floor.jpg" style="max-width: 700px;" class="image center" alt="Cable on the ocean floor" /> <div class="caption" style="margin-top: -30px;"><a href="https://www.telecomreviewafrica.com/en/articles/general-news/1976-new-country-to-connect-to-ellalink-submarine-cable">https://www.telecomreviewafrica.com/en/articles/general-news/1976-new-country-to-connect-to-ellalink-submarine-cable</a></div> <p>The location of a cable landing station, or CLS, is typically chosen because it has less marine traffic than busier port cities. This piece of cable infrastructure is a critical component of the entire system as it serves as the point at which the cable makes landfall and transitions from under the water (wet) to land (dry).</p> <p>The CLS is where the cable connects to land-based power and various networking provider infrastructure. A CLS could be a small, nondescript building in a small coastal town or part of a much larger data center. In either case, they are often secure facilities.</p> <h2 id="the-environment">The environment</h2> <p>Since submarine (telecommunication) cables have a small footprint compared to much larger power cables and gas pipelines, their impact on the environment is minimal. In fact, when properly installed, a submarine cable will actually serve as a new marine ecosystem substrate within just a few months of installation. The areas around certain sections of cable are designated cable protection zones, which often also turn into marine wildlife sanctuaries as a side effect.</p> <p>Notice the cable in the image below encrusted with marine organisms.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1iD3NUZ9shnrS9g9up2B2k/8faccaca08b9cedecec1fc109a4b298e/cable-marine-life.jpg" style="max-width: 700px;" class="image center" alt="Cable encrusted with marine life" /> <div class="caption" style="margin-top: -30px;">Image by Glauco Rivera: <a href="https://www.iscpc.org/">https://www.iscpc.org/</a></div> <p>Cable faults rarely occur due to marine life, such as shark bites or whale entanglements. The last reported incident of a submarine communication cable being damaged by a marine organism was from the mid-1980s when a damaged cable was recovered near the Canary Islands with shark bite marks on the cable itself.</p> <p>As a result, cable armor has improved, and laying techniques have changed such that there is no evidence at all of any damage to a submarine cable due to any type of marine organism (including sharks) since then.</p> <p>However, the environment does affect submarine cables in other ways. 
The ocean floor is inherently a harsh environment, and the naturally occurring activity at great depths, sometimes thousands of meters deep, can adversely affect the integrity of the cable structure.</p> <p>Undersea current abrasion, weather events such as hurricanes, and natural phenomena like earthquakes and volcanoes have all played a role in damaged cables over the years. But these events are very few and account for only around 10% of cable faults.</p> <h2 id="human-activity">Human activity</h2> <p>Fishing, trawling, and shipping activity all play a much more significant role in causing cable faults today. Despite cable protection zones being identified on nautical charts, the most common cause of cable damage is a ship anchor or fishing equipment coming in contact with a cable in shallower depths of less than 200m.</p> <p>In the image below, you can see a cable protection zone off the coast of Perth, Australia, highlighted in purple. This is where the SEA-ME-WE 3 submarine cable lands at the Perth Cable Landing Station.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4vEEIKQSial7SUMUJeHFNd/a81b960577b5a41fc5a42e722a9318c7/cable-perth.jpg" style="max-width: 800px;" class="image center" alt="Cable protection zone off the coast of Perth, Australia" /> <div class="caption" style="margin-top: -30px;"><a href="https://www.submarinenetworks.com/stations/oceania/australia/perth">https://www.submarinenetworks.com/stations/oceania/australia/perth</a></div> <p>However, commercial fishing and ship anchors still routinely damage submarine cables in shallower depths despite these designated areas. Trawling gear or anchors getting tangled with cables is one of the main causes of cable faults, not because the gear completely severs the cable, but because it can damage the armor layer and sheathing that protect the fibers deep inside the cable structure. When this happens, salt water can enter the cable and disrupt communication activity.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1usY9fjdpalE6tuHcFDN24/473fb290f65c5616b584b1d70dfdd99a/undersea-cable.jpg" style="max-width: 450px;" class="image right" alt="Submarine cable damaged by trawler gear" /> <p>Repair of damaged cables means sending a repair ship to the location of the problem, splicing the cable, repairing it, and re-attaching the ends.</p> <p>In shallow waters, this can be done by divers, but in deeper waters, this requires capturing the cable and hoisting it up to a ship to repair it.</p> <p>Damaged cables can result in a loss of capacity, increased latency, or in the worst case, a halt of all communication activity. There is significant redundancy in paths, so even a major problem with a submarine cable doesn’t necessarily mean all traffic between two continents stops.</p> <p>In the image above from Seaworks and Transpower NZ and the <a href="https://www.iscpc.org/">International Cable Protection Committee</a>, you can see a submarine cable damaged by trawler gear.</p> <h2 id="people-working-together">People working together</h2> <p>About 99% of all intercontinental internet traffic goes over submarine cables. Though satellites are a great solution for edge connectivity onto the global internet, especially in locations that don’t have easy access to physical infrastructure, they don’t represent a significant amount of overall global capacity. 
The capacity and latency limitations of satellite communication technology mean that fiber optic submarine cables are the dominant method for moving large volumes of data worldwide.</p> <p>Investment in the submarine cable infrastructure is strong, and work is being done to improve data rates, strengthen security, and extend access to more parts of the world. However, the overall system of submarine telecommunications cables is less like a global public utility and much more like the global airline industry, with private airlines sharing the cost of airports.</p> <h3 id="who-owns-the-cables">Who owns the cables?</h3> <p>Years ago, submarine cables were first owned by telecommunication companies that would band together to form a consortium of all the parties interested in using the submarine cable. Over time and as the internet grew in the mid to late 90s, more companies saw the potential of investing in the infrastructure that enabled the global internet to take off.</p> <p>It didn’t take long for countries around the world to recognize that submarine cables were becoming part of critical infrastructure for governments and the private sector. In recent years, even content providers like Google, Meta, and Amazon have become prominent investors in developing and installing new cable. In fact, the capacity added to the overall network of submarine cables by these private companies has far outpaced the growth of the traditional telecom providers.</p> <p>Submarine cables are then each owned by various private investors, either large telecom carriers, content delivery providers, or investor groups. Groups of a few or many private organizations own about 99% of cables, while only about 1% of submarine cables are owned, in whole or in part, by a government entity. These investors then form a consortium to operate and maintain a cable. Sometimes there are conflicts of interest, as is the case with countries that may not be on good terms.</p> <p>However, remember that traffic doesn’t always need a direct route to reach its destination. Even in those situations when there isn’t a direct physical path between two countries, data can travel in a more circuitous route and still get to most destinations.</p> <p>This consortium of owners and investors will contract repair services to private third-party companies that have the ships, staff, and resources to make repairs in the middle of the ocean very quickly. 
The number of total cable repair ships in service, either installing a new cable or repairing a cable, is surprisingly small at around 60.</p> <p>Below is an image of a repair ship mending a damaged submarine cable connecting Malta and Sicily.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3UhXpLu50LDhQNTUIr5Udu/f5a4112e2aaedb239ee660aa9f064077/cable-repair-ship-malta-sicily.jpg" style="max-width: 700px;" class="image center" alt="Repair ship mending a damaged submarine cable" /> <div class="caption" style="margin-top: -30px;"><a href="https://www.offshore-energy.biz/nexans-patches-up-malta-sicily-subsea-interconnector/">https://www.offshore-energy.biz/nexans-patches-up-malta-sicily-subsea-interconnector/</a></div> <h2 id="submarine-cable-oversight">Submarine cable oversight</h2> <p>In the United States, “the <a href="https://www.fcc.gov/submarine-cables">Federal Communications Commission</a> (FCC) and various other agencies are involved in the oversight of the submarine cables and/or other undersea activities that may impact submarine cable infrastructure.” When cable companies are from outside the US, “the FCC’s International Bureau grants licenses authorizing cable landing license applicants to own and operate submarine cables and associated landing stations in the United States.”</p> <p>The <a href="https://www.iscpc.org/">International Cable Protection Committee</a> was also formed in 1958 as a neutral body to establish standards for cable installation, maintenance, and protection. The ICPC monitors international treaties, various legislation, the installation of cable, and supports research initiatives to ensure that submarine cable interests are protected.</p> <p>From the <a href="https://www.iscpc.org/about-the-icpc/">ICPC website</a>:</p> <div style="padding: 30px; background-color:#ebeff3; border: solid 1px #d3d7dc; margin-bottom:30px; max-width: 90%;"> <p><b>The International Cable Protection Committee (ICPC)</b> was founded in 1958 and its Membership comprises of governmental administrations and commercial companies that own or operate submarine telecommunications or power cables, as well as other companies that have an interest in the submarine cable industry—including most of the world’s major cable system owners and cable ship operators. 
The primary purpose of the ICPC is to help its Members to improve the security of undersea cables by providing a forum in which relevant technical, legal and environmental information can be exchanged.</p> <p><strong>Prime Activities of the ICPC</strong>:</p> <ul> <li>Promote awareness of submarine cables as critical infrastructure to governments and other users of the seabed</li> <li>Establish internationally agreed recommendations for cable installation, protection and maintenance</li> <li>Monitor the evolution of international treaties and national legislation and help to ensure that submarine cable interests are fully protected</li> <li>Liaison with UN Bodies</li> </ul> </div> <h2 id="the-economics-of-submarine-cables">The economics of submarine cables</h2> <h3 id="ownership-and-investment">Ownership and investment</h3> <p>Because many countries depend on internet connectivity for a large portion of their GDP, and because many nation-state and military organizations around the world rely on submarine cables to function at all, the economics of submarine telecommunication cables is more complex than simply commercial supply and demand.</p> <p>Submarine cables certainly provide the underlying mechanism for much of global commerce, but they also serve the public good in various ways. In fact, a <a href="https://www.worldbank.org/en/news/press-release/2014/02/06/access-to-high-speed-internet-key-to-job-creation-social-inclusion-arab-world">World Bank study from 2014</a> indicates that “a 10% increase in broadband penetration results in a 1.38% increase in gross domestic product (GDP) growth in low and middle-income countries.”</p> <p>Submarine cables are financed almost entirely with private money from companies worldwide. Their interest is to expand their ability to acquire and serve more customers, but working together in groups of investors means submarine cables are shared resources even among telecom competitors. The majority of the capacity is sold as long-term leases, with 15 years being a standard term.</p> <p>Cables cost upwards of $40,000 per mile, meaning a longer cable can easily run in the hundreds of millions of dollars. Recent trans-oceanic cable bids have reached upwards of $250-$300 million for trans-Atlantic and $300-$400 million for trans-Pacific. Though the cable and cable landing stations are themselves expensive, a significant part of the cost of the cable, at about 25%, is the marine installation, which is paid for by the consortium of investors who share both the cost and the risk.</p> <p>Cable installation and ongoing maintenance are done by third-party contractors hired by the consortium of investors/owners. SLAs for maintenance work can be surprisingly aggressive regarding response time, with a 24-hour response to a fault being very common. Actual repair may take much longer, of course. Cable maintenance ships and crews are specialized and, therefore, expensive. The various maintenance contractors compete on efficiency, speed, and cost.</p>
<h3 id="major-submarine-cable-fabricators">Major submarine cable fabricators</h3> <ul> <li><a href="https://www.subcom.com/">SubCom</a> in the United States and <a href="https://web.asn.com/en/">Alcatel Submarine Networks</a> in France are two of the main submarine cable fabricators serving North American and European markets.</li> <li>The main fabricator in China is <a href="https://www.hmntechnologies.com/enCompany.jhtml">HMN Technologies Co., Limited</a>, which is majority owned by Shanghai-listed Hengtong Optic-Electric Co., Ltd.</li> <li>In Japan, the leading submarine cable fabricator is <a href="https://www.nec.com/en/global/prod/nw/submarine/index.html">NEC</a>, along with OCC Corporation and Sumitomo Electric Industries, Ltd. NEC has fabricated cables crossing waters in southeast Asia, the Pacific and Indian Oceans, the Mediterranean, and even the South Atlantic.</li> <li>The fabrication of a submarine cable is no small feat, and many other companies manufacture cables and their components, including Corning, General Cable, and NSW (now owned by <a href="https://www.prysmiangroup.com/en">Prysmian Group</a>).</li> </ul> <h2 id="the-future-of-submarine-cables">The future of submarine cables</h2> <h3 id="sdm-and-multicore">SDM and Multicore</h3> <p>International bandwidth demand is expected to grow at a compound annual rate of 20-40% over the next several years. Clearly, global data usage is expected only to increase over the next few years. The growth of 5G mobile data, the expansion of the internet to underserved parts of the world, and the ever-increasing volume of content among data centers and to billions of consumers mean the need for increased submarine cable capacity is crucial to the growth of the internet and the global economy.</p> <p>One technology designed to help meet this need is space division multiplexing, or SDM. SDM increases the cable’s total capacity by increasing the number of independent spatial channels in the cable. Not only does this result in higher capacity, but it also reduces power consumption. SDM basically adds more fiber pairs to a single cable.</p> <p>Unlike SDM, multicore technology increases the number of parallel optical fiber cores within each individual fiber.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7lMKYYzje9Z54Z7ulQhmZV/74bbea2b79d853236ef5779c49ce4e51/multicore-diagram.png" style="max-width: 700px;" class="image center no-shadow" alt="Single core vs. multicore fiber cables" /> <div class="caption" style="margin-top: -30px;"><a href="https://www.nec.com/en/press/202110/global_20211004_01.html">Comparison of conventional and multicore fiber cables</a></div> <p>From a technical perspective, multicore submarine cable technology is a big part of the future of submarine telecom cables. In 2021, NEC announced the completion of an uncoupled 4-core submarine fiber cable trial.</p> <p>This new 4-core fiber can be installed at depths up to 8,000 meters, and a cable using it can contain up to 32 individual fibers. This kind of new technology means the number of cores within the cable itself can be increased without increasing the overall diameter of the entire cable system.</p> <p>With these new cables, a single cable system can increase its capacity significantly without significantly increasing the cost of the cable or its installation. And from an economic perspective, this will reduce the total cost per bit of data transfer.</p>
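<p>The scale of that improvement is easy to see with the cable’s own numbers. The sketch below uses the 32-fiber, 4-core figures mentioned above simply to count spatial channels; per-channel capacity varies by system and is deliberately left out:</p> <pre><code class="language-python"># Counting parallel spatial channels: conventional vs. multicore cable
fibers = 32              # individual fiber strands in the cable (figure above)
cores_conventional = 1   # one core per fiber in a conventional design
cores_multicore = 4      # uncoupled 4-core fiber

print(fibers * cores_conventional)  # 32 spatial channels
print(fibers * cores_multicore)     # 128 spatial channels, same cable diameter
</code></pre> <p>Quadrupling the number of spatial channels without changing the cable’s diameter is precisely where the cost-per-bit savings come from.</p>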
<h2 id="conclusion">Conclusion</h2> <p>Considering 75% of the world’s population is predicted to be connected to the internet (mainly by mobile devices) by 2025, the future of submarine cable technology and the submarine cable industry is very bright.</p> <p>From the physics of light used to transmit data thousands of miles to the economics of consortiums of some of the largest companies on Earth, submarine cables are complex and now crucial to our global economy and access to information.</p> <p>Worldwide telecommunications began in earnest in the mid-19th century, more than 170 years ago. This led to an unprecedented change in our human experience. Underpinned by nearly 1.5 million kilometers of submarine fiber optic cables, the internet has changed the world. And these cables, criss-crossing the ocean floor and touching continents and countries, go largely unnoticed by an entire planet that now relies on them for many daily activities.</p> <p><em>Special thanks to <a href="/blog/author/doug-madory/">Doug Madory</a> from Kentik and Alan Mauldin from TeleGeography for their input and review of this article for accuracy.</em></p><![CDATA[What Does It Mean To Build a Successful Networking Team?]]><![CDATA[What does it mean to build a successful networking team? Is it hiring a team of CCIEs? Is it making sure candidates know public cloud inside and out? Or maybe it’s making sure candidates have only the most sophisticated project experience on their resume. In this post, we’ll discuss what a successful networking team looks like and what characteristics we should look for in candidates.]]>https://www.kentik.com/blog/what-does-it-mean-to-build-a-successful-networking-teamhttps://www.kentik.com/blog/what-does-it-mean-to-build-a-successful-networking-team<![CDATA[Phil Gervasi]]>Tue, 21 Mar 2023 04:00:00 GMT<p>What does it mean to build a successful networking team? Is it all about gathering the most highly certified engineers you can afford? Is it about getting a group of specialists, each with their own subject matter expertise? Or is it all about finding those people with years and years of experience working in the trenches?</p> <div as="WistiaVideo" videoId="ce29bt2ixc" audio></div> <p>I’ve been thinking about this as I’ve been talking to some engineer friends of mine. And based on my own experience and the experience of my packet-slinging comrades, though technology changes faster as I get older, the underlying ideas of building a successful networking team really haven’t changed. And for me, we can boil it down to these few areas.</p> <h2 id="harnessing-an-engineering-mindset">Harnessing an engineering mindset</h2> <p>Having an engineering mindset is one of the foremost things each member of a successful networking team needs to have. I learned this over the years as a network engineer working in daily network operations and as a VAR engineer working on my customers’ networks.</p> <p>In a <a href="https://www.kentik.com/telemetrynow/s01-e09/">recent Telemetry Now podcast</a>, my guest Tony Efantis echoed the same thing when he spoke about a candidate for a position on his team needing to show genuine curiosity and willingness to tackle new technical projects.</p> <p>An engineering mindset is all about the ability to look at a technical problem and literally <em>find a solution</em> rather than give up or never even start in the first place. 
It’s a willingness to keep pushing until you find the answer, treating a technical problem like a puzzle.</p> <p>For a network engineer, that means thinking logically about a problem, sometimes suspending best practices, drawing on the fundamentals, researching the issue online, diving in without fear of the unknown, and trying ideas until it works.</p> <p>In that sense, to build a successful networking team, find people willing to jump into a technical project and do whatever they must to figure out the solution. They don’t need to be subject matter experts in every technology already, and they certainly don’t need to have all the answers.</p> <h2 id="an-understanding-of-how-technologies-work-together">An understanding of how technologies work together</h2> <p>In addition to a willingness to jump in, an engineer needs a broad understanding of how different technologies work together. Any decent network engineer should know the fundamentals of how frames and packets move across the wire (or air). But a successful networking team will have network engineers who also have a high-level understanding of how network activity relates to server activity, public cloud resources, security functions, containers, and so on.</p> <p>No, I don’t believe a successful network team needs to be made of subject matter experts in all of those technologies, and I don’t think a successful network team should be responsible for all of them, either. But I do think that a broad understanding of how all of these services and components work together over the network provides a team with a fundamental understanding of how the networking piece should function.</p> <h2 id="willingness-to-learn">Willingness to learn</h2> <p>If the networking team members aren’t willing to learn new things, they will fail. It’s as simple as that. Technology changes quickly, and though the fundamentals may stay mostly the same, there is a lot to learn if a team is going to maintain a performant, reliable, and competitive network infrastructure.</p> <p>Building a successful networking team means prioritizing ongoing training and professional development. It means supporting those initiatives with money, time, and lab resources. Building that kind of team also means finding engineers eager to develop their skills and learn the latest relevant technology.</p> <h2 id="cool-under-pressure">Cool under pressure</h2> <p>I’ve been through enough network cutovers and sweated through enough P1 troubleshooting sessions to know that the most successful networking teams and network engineers keep cool under pressure.</p> <p>Networking is different from other areas of technology. If the network blips, even for a second, everyone knows it. Sometimes even a minor issue on a seemingly non-critical network device can significantly impact the rest of the network and, therefore, application delivery. For me, at least, working on a production network can be stressful and sometimes outright scary.</p> <div as="Testimonial" index="0" color="green"></div> <p>This means when things go south, which they inevitably do, it’s critical to keep a cool head, stay focused, and lean on the foundational knowledge that forms the basis of most of what network engineering is all about. 
Building a successful networking team means finding those engineers who, instead of freaking out during an outage, carefully look at the scenario and methodically work through the problem, even in the face of screaming CIOs and angry customers.</p> <h2 id="being-a-team-player">Being a team player</h2> <p>This last one goes without saying. Building a successful networking team means finding engineers who are team players. There is no such thing as an engineer who knows everything, so a successful team will be built from engineers with the humility to know they don’t know everything and the humility to lean on others when necessary.</p> <p>That goes both ways, too. A team player on a successful networking team is willing to jump in when needed, even if it means doing lower-level tasks for the sake of the team’s success.</p> <h2 id="a-tall-order">A tall order</h2> <p>Finding a group of engineers that meets all these criteria is a pretty tall order, isn’t it? The thing is, it’s not about finding that one perfect candidate, that one shiny unicorn. It’s about building a team where these characteristics are the norm and in which they can even flourish.</p> <p>If you think about it that way, building a successful networking team is less about finding the best engineers with the most desirable attributes and more about finding decent people that can grow into that kind of team.</p> <p>My experience has been all over the place. I’ve worked with amazing people willing to jump in at a moment’s notice, day or night. I’ve worked with arrogant jerks who wouldn’t help carry new gear into the building because they were the senior engineer on the project.</p> <p>But looking over the landscape of my career, the very best teams I worked on had individuals that excelled in one or more of these attributes. They were teams that succeeded as engineers and thrived as people.</p><![CDATA[Securing Your Network Against Attacks: Prevent, Detect, and Mitigate Cyberthreats]]><![CDATA[As networks become distributed and virtualized, the points at which they can be made vulnerable, or their threat surface, expand dramatically. Learn best practices for preventing, detecting, and mitigating the impact of cyberthreats.]]>https://www.kentik.com/blog/secure-your-network-against-attacks-prevent-detect-mitigate-cyberthreatshttps://www.kentik.com/blog/secure-your-network-against-attacks-prevent-detect-mitigate-cyberthreats<![CDATA[Stephen Condon]]>Thu, 16 Mar 2023 04:00:00 GMT<p>As networks become distributed and virtualized, the points at which they can be made vulnerable, or their <em>threat surface</em>, expand dramatically. Multi- and hybrid cloud infrastructures add further complexity to securing threat surfaces, adding even more variability to permissions, more configuration points where human error can enter the scene, and more network boundaries to exploit.</p> <p>This is compounded by recent trends toward remote work, where network operators need to wrestle with the fact that employees often access the network via work sites with far less governance. Cyberthreat strategies have evolved in step with modern cloud networks, often using cheap, virtualized cloud resources to exploit the threat surface topology I briefly described above. 
As these attacks become more subtle and sophisticated, having real-time, largely automated responses from NetOps and SecOps becomes critical.</p> <p>This article will examine what defines a network attack, the most common types of network attacks, and how network observability can protect your network against such threats.</p> <h2 id="what-is-a-network-attack">What is a network attack?</h2> <p>A network attack is any attempt to gain access to or otherwise compromise the integrity or availability of a network. Network attacks come in many forms. Some are highly automated, machine-based attacks, while others are more subtle, relying on human vulnerability.</p> <p>Guaranteeing network security and performance in the face of these attacks, whatever form they take, is one of the principal responsibilities of network operators today.</p> <h2 id="what-are-the-most-common-types-of-network-attacks">What are the most common types of network attacks?</h2> <p>Here are several of the most common attacks against enterprise networks:</p> <ul> <li>Distributed denial of service</li> <li>Man in the middle</li> <li>Privilege escalation</li> <li>Unauthorized access</li> <li>Insider threats</li> <li>Code and SQL injection</li> </ul> <p>Let’s take a closer look at each of these types of network attacks.</p> <h3 id="distributed-denial-of-service-ddos">Distributed denial of service (DDoS)</h3> <p>DDoS attacks are cyber attacks that most often have the purpose of causing application downtime. This downtime can itself be the attacker’s ultimate goal, but it can also be used to weaken security systems, cover tracks, or act as a red herring for investigators while a more significant vulnerability is exploited elsewhere in the network. These attacks can involve coordinating thousands of devices, virtual or otherwise, to overwhelm the target server’s resources.</p> <p>There are three main types of DDoS attacks:</p> <ul> <li><strong>Volume-based</strong>. These attacks aim to overwhelm a service’s bandwidth capabilities with prohibitively high traffic volumes. Common volume-based DDoS attacks are ICMP and UDP floods.</li> <li><strong>Protocol-based</strong>. These attacks overwhelm network infrastructure resources, targeting layer 3 and layer 4 communication protocols. Common protocol-based attacks are Ping of Death, Smurf DDoS, and SYN floods.</li> <li><strong>Application layer</strong>. These attacks typically seek out web server vulnerabilities with malformed or high-volume requests in layer 7 services. Common application layer attacks include HTTP floods and slowloris.</li> </ul> <h3 id="man-in-the-middle-mitm">Man in the middle (MITM)</h3> <p>MITM attacks use false or redirected interfaces to exploit vulnerabilities in both human and machine-based protocols.</p> <p>For a little more detail, here are some of the main MITM attacks levied against networks:</p> <ul> <li><strong>Spoofing</strong>. A hallmark MITM strategy, wherein a malicious agent intercepts and redirects traffic with false credentials. The attacker can spoof IP addresses, DNS, HTTPS headers, and more to deceive users into interacting with compromised applications.</li> <li><strong>Hijacking</strong>. This type of attack involves compromising a network component and intercepting incoming traffic. 
BGP hijacking has increasingly become a concern, as these attacks alter the routing between the internet’s autonomous systems and can severely disrupt a web application’s internet traffic.</li> </ul> <h3 id="unauthorized-access-and-privilege-escalation">Unauthorized access and privilege escalation</h3> <p>For access or privilege-based attacks, attackers rely on various techniques to gain the initial set of network credentials, including social engineering ploys like phishing, dark web data purchases, malware, password breaking, and many others. A surprisingly common exploit is default passwords on hardware that were never reconfigured upon installation.</p> <p>Privilege escalation is a type of network attack that exploits poorly defined roles and security boundaries. Once unauthorized access has been achieved, malicious agents seek access to additional accounts/privileges to infiltrate the network further. Broadly speaking, there are two types of privilege escalation: horizontal and vertical.</p> <p>In <strong>horizontal escalation</strong>, bad actors seek to “hop” laterally across similarly privileged accounts to find the data they are looking for. For example, gaining access to one sales rep’s account and using it to compromise other reps’ accounts to access their customer data. <strong>Vertical privilege escalation</strong>, on the other hand, seeks to move up the privilege scale. For example, moving from a sales rep’s account (basic user privileges) into a sysadmin’s account (admin privileges).</p> <h3 id="insider-threats">Insider threats</h3> <p>True to their name, insider threats can come from anyone with access to facilities, hardware, interfaces, or operational knowledge. Depending on their privileges, and the use of techniques like privilege escalation, these insiders are capable of causing real harm to networks. Insider threats can be unintentional or malicious, but they present a very significant vulnerability for network and security specialists.</p> <p>Most insider threat attacks can be grouped into one of two categories: espionage and sabotage. Espionage involves using network access to steal other network users’ personal, system, or IP data. Sabotage is less interested in preserving secrets and far more intent on destroying or disrupting data or network availability.</p> <h3 id="injection">Injection</h3> <p>An injection attack is any vector that exploits input vulnerabilities in an application or network. Injection attacks focus heavily on exploiting bad programming practices that provide input opportunities where, when, or to whom there should be none.</p> <p>File uploads, form fields, and a poorly designed API can all create injection opportunities.</p> <p>Injection vectors include:</p> <ul> <li><strong>SQL</strong>. Injection attacks that use SQL inputs to alter or otherwise corrupt servers. A common scenario, as shown in the sketch below, is a poorly validated user login that accepts SQL statements.</li> <li><strong>Code</strong>. Code injection attacks also make use of poorly validated inputs. But, instead of using the vulnerability to inject SQL commands, the attackers use application code to retrieve or alter data.</li> </ul>
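<p>To illustrate the SQL case, here is a small, self-contained sketch using Python and an in-memory SQLite database. The table, payload, and queries are all hypothetical; the point is simply the difference between concatenating user input into a statement and passing it as a bound parameter:</p> <pre><code class="language-python">import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

user_input = "' OR '1'='1"  # a classic injection payload

# Vulnerable: user input is concatenated directly into the SQL statement,
# so the payload rewrites the WHERE clause and matches every row.
query = "SELECT * FROM users WHERE name = '" + user_input + "'"
print(conn.execute(query).fetchall())   # [('alice', 'admin')]

# Safer: a parameterized query treats the input strictly as data.
print(conn.execute(
    "SELECT * FROM users WHERE name = ?", (user_input,)
).fetchall())                           # []
</code></pre> <p>The same principle, never letting untrusted input reach an interpreter as code, applies equally to the code injection case.</p>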
<h2 id="using-network-observability-to-keep-your-network-secure">Using network observability to keep your network secure</h2> <p>NetOps and SecOps must work closely to ensure proper threat surface coverage for complex networks. While it is impossible to eradicate the possibility of an attack entirely, the tools and strategies of observability offer network specialists exciting new capabilities.</p> <h3 id="prevention">Prevention</h3> <p>The best-case scenario for an attack is that it is completely prevented. Security philosophies like zero trust help IT structure privileges, authentication, and device control around the idea that every user and entry point to the network should be vigilantly verified. In some cases, multi-factor authentication (MFA) has been shown to prevent up to 99.9% of automated password attacks.</p> <p>From a networking perspective, a lot of valuable prevention data is available to engineers who know how to use it. Here are some of the top network data sources for attack prevention:</p> <ul> <li><strong>Global IP reputation assessments</strong>. Prevent risky IP addresses from interacting with your network.</li> <li><strong>BGP route leak detection</strong>. Find misconfigurations and bad code that expose your network’s Border Gateway Protocol routes to vulnerability.</li> <li><strong>RPKI status checks</strong>. Know if any route containing your address space is being evaluated as RPKI-invalid. This would include accidental leaks, intentional hijacks, or a misconfigured Route Origin Authorization (ROA) that are causing your authorized route to be rejected.</li> <li><strong>Previous attacks</strong>. Post-mortems provide valuable insights for strengthening network policies, updating hardware and software, and further optimizing security strategy.</li> </ul> <p>Network observability solutions unify this data in a single pane of glass and provide the highly scalable data infrastructure necessary to collect, analyze, and adapt this information to highly automated attack prevention strategies.</p> <h3 id="detection">Detection</h3> <p>Try as we might, attacks will still happen. This means early detection is the name of the game in network security, where minutes or even seconds of compromised security can have disastrous implications for an organization’s brand equity (and bottom line). A <a href="https://neustarsecurityservices.com/blog/cyber-threats-and-trends-2020">2020 Neustar report</a> on cyber threats and trends found that even though DDoS detection and mitigation needs to happen within a minute, as of November 2020, only 25% of DDoS mitigations are initiated soon enough.</p> <p>Capable network observability solutions will automatically detect and alert on early signs of attacks, such as traffic spikes, excessive latency, traffic from unexpected regions, or other anomalous traffic behavior, by analyzing your real-time and historical NetFlow data. This traffic flow data is constantly compared against benchmarks to catch anomalous traffic patterns. It gives network and security engineers what they need most: the awareness and the time to mitigate the attack and protect their network before it can cause damage.</p> <h3 id="mitigation">Mitigation</h3> <p>Once an attack has been identified, it is essential to limit the impact. Mitigation strategies vary greatly depending on the nature of an attack. For traffic-based attacks like DDoS, strategies like remote-triggered black holes (RTBH) offer a way to re-route traffic toward harmless infrastructure.</p>
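<p>As a simplified illustration of how an RTBH trigger can work, here is a sketch of a small helper script for <a href="https://github.com/Exa-Networks/exabgp">ExaBGP</a>, an open source BGP toolkit. It assumes an ExaBGP instance already peered with your edge routers, and the prefix and next-hop values are placeholders; the <code class="language-text">65535:666</code> community is the standard BLACKHOLE community defined in RFC 7999:</p> <pre><code class="language-python">#!/usr/bin/env python3
# Hypothetical ExaBGP helper process that announces an RTBH route for a
# victim host. ExaBGP reads announcements from this process's stdout.
import sys
import time

victim = "203.0.113.10/32"   # placeholder: the host under attack
next_hop = "192.0.2.1"       # placeholder: a next-hop your routers discard

sys.stdout.write(
    f"announce route {victim} next-hop {next_hop} community [65535:666]\n"
)
sys.stdout.flush()

# Stay alive so ExaBGP keeps the route announced until we're terminated.
while True:
    time.sleep(60)
</code></pre> <p>Once upstream routers see the announcement, traffic to the victim prefix is discarded at their edge, absorbing the flood before it reaches the target. The trade-off is that legitimate traffic to that one host is dropped too, which is why RTBH is usually reserved for large volumetric attacks.</p>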
<p>Whatever mitigation strategy is being deployed, effective network observability will feature integrations with leading cybersecurity tools to help foster seamless, automated mitigation of attacks.</p> <h2 id="secure-your-network-against-attacks-with-kentik">Secure your network against attacks with Kentik</h2> <p>Kentik ingests and unifies, in real time, massive volumes of NetFlow, sFlow, IPFIX, and BGP data, network performance metrics, and SNMP device and interface data. These rich telemetry sets power dynamic network maps and traffic visualizations and enable robust flow analysis to automate prevention efforts like botnet detection.</p> <p>Once a threat has been identified, Kentik automates hybrid mitigation via standards-based BGP Flowspec and remote-triggered black hole (RTBH) and integrations with mitigation solutions from leading vendors, including <a href="https://www.kentik.com/resources/network-traffic-intelligence-ddos-protection-kentik-cloudflare-magic-transit/">Cloudflare</a>, A10, Juniper, and Radware. Kentik’s integration with these services provides detailed monitoring that lets operators know for certain whether a <a href="https://www.kentik.com/blog/how-bgp-propagation-affects-ddos-mitigation/">mitigation effort was partial or complete</a>.</p> <p>The “<a href="/why-kentik-network-observability-and-monitoring/">ask anything</a>” ethos of Kentik’s network observability enhances the ability to investigate and understand attacks with deep ad-hoc traffic analysis:</p> <ul> <li>Where did the attack come from?</li> <li>What paths were used?</li> <li>What protocols were leveraged?</li> <li>What customers were affected?</li> </ul> <img src="https://images.ctfassets.net/6yom6slo28h2/4Gr5x2VW7TAb5pzyHxXjyG/ae5537f247f4e7f3edbe9a1e722f8242/ddos-diagram.png" style="max-width: 800px" class="image center" alt="Kentik end-to-end DDoS and attack protection" /> <p>With its early detection, customizable dashboarding, and powerful automation capabilities, Kentik Protect is the most comprehensive tool for securing your networks.</p><![CDATA[Today’s Enterprise WAN Isn’t What It Used To Be]]><![CDATA[Today’s enterprise WAN isn’t what it used to be. These days, a conversation about the WAN is a conversation about cloud connectivity. SD-WAN and the latest network overlays are less about big iron and branch-to-branch connectivity, and more about getting you access to your resources in the cloud. Read Phil’s thoughts about what brought us here.]]>https://www.kentik.com/blog/todays-enterprise-wan-isnt-what-it-used-to-behttps://www.kentik.com/blog/todays-enterprise-wan-isnt-what-it-used-to-be<![CDATA[Phil Gervasi]]>Tue, 14 Mar 2023 04:00:00 GMT<p>For most enterprise NetOps teams, a discussion about the WAN is a discussion about the cloud. Whether it’s as simple as ensuring solid connectivity with a SaaS provider or designing a robust, secure, hybrid, and multi-cloud architecture, the enterprise wide area network is all about connecting us to our resources, wherever they are.</p> <h2 id="how-it-used-to-be">How it used to be</h2> <p>When I started my career in networking, servers were down the hall or in the campus data center. The WAN was how we got access to some websites and sent emails.
Most resources were local, accessed remotely over some sort of leased line, or at worst, over a site-to-site VPN back to the organization’s private data center.</p> <p>QoS on the WAN wasn’t a thing outside an SLA you might have on a private circuit, and running anything mission-critical over an ISP’s network was sketchy at best, especially for sensitive traffic like voice. We just didn’t do it because the quality was low and generally unreliable.</p> <h2 id="what-changed">What changed</h2> <p>Over the last 15 years, though, the quality of the public internet has improved significantly. Just think about the audio quality of your last Zoom call. Sure, maybe it wasn’t absolutely perfect, but you have to admit that it was probably just fine. Yes, there’s something to say about how applications are written, but on the public internet side, we’ve seen decreases in latency and cost, and a massive increase in available bandwidth.</p> <p>This coincided with the advent of public cloud platforms like AWS, Azure, and GCP. Because the public internet has generally become higher quality and more reliable, moving resources, even mission-critical ones, to these public cloud providers became feasible and desirable once you factor in the other benefits the cloud offers you, like agility, speed, and (virtually) unlimited scalability.</p> <p>And if you think about it, many, if not most, enterprise organizations today are hybrid, multi-cloud, or both, often without realizing it. Even a small business with 100 employees in a single region uses SaaS providers like Google, Microsoft, Salesforce, Quickbooks, and many other cloud-based applications. Even a simple thing like Active Directory in the cloud means almost nothing needs to live locally anymore unless you have a particular need to do so.</p> <p>The result of moving resources to the cloud and using SaaS is that the very concept of branch office routing has also changed. Why do we need to create site-to-site VPNs or some sort of modern SD-WAN topology connecting all our branches when almost all traffic goes to the public internet and the cloud?</p> <p>Yes, of course, I’m oversimplifying here. Some organizations have security requirements to backhaul their traffic to a single security stack, and some organizations have found it more cost-effective to bring specific resources back in-house. But by and large, the vast majority of traffic is going up and out, not branch-to-branch.</p> <h2 id="what-is-todays-enterprise-wan">What is today’s enterprise WAN?</h2> <p>So what does this mean for today’s enterprise network engineer? What can we say is the enterprise WAN of today?</p> <img src="https://images.ctfassets.net/6yom6slo28h2/kRHw2cfLUQOC5MVWRxhq4/bf495d9c666fb38c1507c293483ee7f0/what-is-todays-enterprise-wan_1.png" style="max-width: 500px;" class="image center no-shadow" alt="Is it running EIGRP over a full mesh DMVPN topology with dual hubs? Is it the latest SD-WAN overlay providing robust branch-to-branch connectivity? Is it the collection of big iron routers, perimeter firewalls, and circuit IDs?" /> <p>Personally, I don’t think it’s any of those anymore, though ten years ago, that’s precisely what I would have said.
I haven’t even mentioned CASBs, NaaS, containerized network services, and NFV, so you can imagine how vastly different the concerns of today’s network engineer are from those of an engineer of yesteryear trying to figure out NHRP and BFD settings.</p> <p>Today’s wide area network is predominantly how we connect to the cloud for small, medium-sized, and even large enterprise networks. Of course, we still have to think about new security issues and new visibility problems, learn new skills, and figure out who’s responsible for what.</p> <p><strong>So where are we today? What is the enterprise WAN?</strong></p> <p>We still need to connect our infrastructure to the public internet, so the enterprise WAN is still about routers, circuit IDs, and perimeter firewalls. But today, those routers are likely a centrally managed SD-WAN, and those perimeter firewalls could be a CASB or other cloud-based firewall service.</p> <p>Since the need to backhaul traffic has gone way down, and the need for branch-to-branch connectivity is all but irrelevant for most organizations, the enterprise WAN is more like a collection of standalone networks under a single administrative domain. Today, engineers are dealing with an <a href="https://www.kentik.com/kentipedia/sd-wan-software-defined-networking-defined-and-explained/">SD-WAN controller</a>, a centralized CASB interface, and cloud-based network services like DNS and user authentication.</p> <p><em>This also means that today’s enterprise WAN poses new challenges for a network engineer.</em></p> <p>For example, cloud visibility, container network visibility, and public internet visibility are much more important today than they once were, but that type of visibility isn’t easy. It requires new tools, new skills, and an understanding of how application traffic travels over the internet.</p> <p>Also, service chaining network functions could lead to vendor lock-in, which may not necessarily be so bad, but it’s a potential problem when relying on so many third parties for application delivery. It’s not a deal-breaker, but I have seen the headaches some customers face when switching from one CASB to another.</p> <p>I know there are always exceptions. There will always be those few organizations that, for whatever reason, need to completely own and manage their own traditional WAN with legacy devices and topologies. But those are the exception. Today’s WAN is all about cloud connectivity, which means changing how we approach routing, perimeter security, and network operations.</p> <p>The new reality is <em>the WAN isn’t what it used to be</em>.</p><![CDATA[Using Device Telemetry to Answer Questions About Your Network Health]]><![CDATA[When coupled with a network observability platform, device telemetry provides network engineers and operators critical insight into cost, performance, reliability, and security. Learn how to create actionable results with device telemetry in our new article.]]>https://www.kentik.com/blog/using-device-telemetry-to-answer-questions-about-your-network-healthhttps://www.kentik.com/blog/using-device-telemetry-to-answer-questions-about-your-network-health<![CDATA[Stephen Condon]]>Fri, 10 Mar 2023 04:00:00 GMT<p>For cloud network specialists, the landscape for their observability efforts includes a mix of physical and virtual networking devices.
These devices generate signals (by design or through instrumentation) that provide critical information to those responsible for managing network health.</p> <p>In this article, I will provide some background on different types of telemetry, discuss key network performance signals, and highlight ways network specialists can leverage this device telemetry in their <a href="https://www.kentik.com/kentipedia/what-is-network-observability/">network observability</a> efforts.</p> <h2 id="an-overview-of-telemetry">An overview of telemetry</h2> <p>Telemetry, in its broadest sense, is any signal that is automatically measured, transmitted, and then processed and stored. This can be an error message transmitted from your phone to the manufacturer, any of the myriad signals sent from your vehicle’s many sensors to their respective CPUs, or life-preserving health monitors updating nurses on their patients’ conditions.</p> <p>Regarding telemetry in cloud networks, the measurement, transmission, collection, and processing of these signals is both a tremendous challenge and an opportunity for network operators. Traditional network monitoring relies on telemetry sources such as Simple Network Management Protocol (SNMP), sFlow, NetFlow, CPU, memory, and other device-specific metrics. That said, network observability draws on an even more comprehensive set of telemetry sources for network specialists to leverage.</p> <h2 id="what-is-network-telemetry">What is network telemetry?</h2> <p>In the framing of cloud networks, network telemetry becomes a much more significant concept, as these signals are being generated in a vast, deep system of networks. With so many network boundaries being navigated (application, service, cloud providers, subnets, <a href="https://www.kentik.com/blog/todays-enterprise-wan-isnt-what-it-used-to-be/">SD-WANs</a>, etc.), network operators and engineers cast as wide a net as possible to source their telemetry.</p> <h2 id="what-is-device-telemetry">What is device telemetry?</h2> <p>In the network observability world, one of the principal telemetry types operators have to concern themselves with is device telemetry. Your switches, servers, transits, gateways, load balancers, and more are all capturing critical information about their resource utilization and traffic characteristics. This is the case for both physical devices and their digital abstractions.</p> <p>Whether or not this telemetry is being collected and analyzed depends on an organization’s needs, constraints, and budgets. Still, it holds immense value for operators making cost, performance, and reliability decisions.</p> <h3 id="what-is-endpoint-telemetry">What is endpoint telemetry?</h3> <p>A subset of device telemetry, endpoint telemetry includes physical sources such as mobile phones, handheld payment processors (think Square or Stripe hardware), personal computers, heavy machinery, and telemetry from applications operating at those endpoints. These endpoints represent the very <a href="https://www.kentik.com/product/edge/">edge of modern networks</a> and have considerable operational and security implications.</p> <h3 id="application-telemetry">Application telemetry</h3> <p>The bread and butter of the DevOps world, application-level, or “layer 7,” telemetry is finding increasing value under the purview of NetOps. Representing diverse sources such as application functions, schedulers, orchestration tools like Kubernetes, and more, application-level telemetry is critical to providing the context that operators and engineers need to make sense of traffic, performance, and security in their networks.</p> <p>With the level of detail that application-level telemetry provides, operators can quickly answer: Is this even a network problem?</p>
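<p>As a minimal sketch of what instrumenting this kind of context can look like, the snippet below uses the OpenTelemetry Python SDK to attach application and deployment attributes to a span. The service, pod, and version names are invented for illustration:</p> <pre class="language-python"><code class="language-python"># pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("process_order") as span:
    # Attributes like these are what let NetOps correlate a network flow
    # with the application context that produced it
    span.set_attribute("deploy.version", "2.4.1")
    span.set_attribute("k8s.pod.name", "checkout-7d9f")
    span.set_attribute("net.peer.ip", "10.0.12.34")
</code></pre>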
<h2 id="what-is-the-internet-of-things-iot">What is the Internet of Things (IoT)?</h2> <p>The Internet of Things refers to the networks that power and support enterprises at the edge. It includes devices like the ones we covered in the “endpoint telemetry” section above, plus the connectivity layer that attaches them to an enterprise’s more extensive network. This connectivity layer can include tech like WiFi and Bluetooth or network abstractions like WANs and SD-WANs, among others.</p> <p>IoT is about more than just thermostats that can connect to WiFi. A real business scenario to consider is a fast food chain using employees outside in the drive-thru to manage traffic during the busiest hours. Armed with a tablet to run point-of-sale, communicate with staff inside, and manage wait times, these employees rely on IoT to delight their customers and keep things running smoothly. Can you imagine what happens to sales if these tablets have trouble connecting to the network or are not adequately secured?</p> <p>At scale, network issues affecting IoT, like the devices at this fast food chain, can be devastating to an organization’s bottom line and reputation.</p> <h2 id="common-network-device-metrics-for-understanding-network-health">Common network device metrics for understanding network health</h2> <p>So far, we’ve covered common sources for device telemetry, but network operators want to know which signals prove their value among all this data noise.</p> <p>In this section, we’ll take a closer look at a few of these critical signals:</p> <ul> <li>Uptime</li> <li>Bandwidth and throughput</li> <li>CPU and memory</li> <li>Interface errors</li> </ul> <h3 id="uptime">Uptime</h3> <p>One of the more foundational device signals is uptime. Usually communicated as a percentage (ideally in the range of 99.999% or “five nines” for most of today’s offerings), uptime is the ratio of time the network device or service is available versus not. Five nines leaves room for only about five minutes of downtime per year.</p> <p>Lagging uptime is a key signal that performance is suboptimal and, in complex, distributed systems, should be addressed swiftly, as services deep within a network can cause severe performance issues further up.</p> <p>As a network’s uptime begins to be better described as “downtime,” there is more at stake than client expectations and user experience: the integrity of the network’s data. Malicious actors can target specific network devices that provide security layers, creating isolated points of vulnerability that can be difficult to detect if uptime telemetry isn’t being collected and analyzed.</p> <h3 id="bandwidth-and-throughput">Bandwidth and throughput</h3> <p>Probably the hallmark signal for network specialists, a device’s bandwidth refers to its maximum capacity for data transfer. A more detailed picture of how a device handles its role in the network can be seen when bandwidth is coupled with a device’s throughput, the amount of data actually moving across a device in a given time frame.</p>
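<p>The relationship between the two is the standard utilization calculation: sample an interface’s octet counter twice (e.g., SNMP’s ifHCInOctets), convert the delta to bits per second, and divide by the interface speed. A minimal sketch, with invented counter values:</p> <pre class="language-python"><code class="language-python">def utilization_pct(octets_t0, octets_t1, interval_s, if_speed_bps):
    # Production code must also handle counter wraps and device reboots
    bits_transferred = (octets_t1 - octets_t0) * 8
    throughput_bps = bits_transferred / interval_s
    return 100.0 * throughput_bps / if_speed_bps

# Two polls of a 10 Gbps interface, 60 seconds apart
print(utilization_pct(
    octets_t0=912_345_678_000,
    octets_t1=912_420_678_000,
    interval_s=60,
    if_speed_bps=10_000_000_000,
))  # 0.1 -- this interface is moving 10 Mbps, or 0.1% of capacity
</code></pre>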
<p>Bandwidth and throughput telemetry are instrumental in capacity planning, identifying cyberattacks in their earliest stages, and providing meaningful baselines for optimization efforts.</p> <h3 id="cpu-and-memory">CPU and memory</h3> <p>Monitoring a device’s CPU and memory utilization gives operators insight into several aspects of network health, allowing them to ask and answer key questions:</p> <ul> <li>Is this device fit for its task?</li> <li>Is a given level of utilization abnormal?</li> <li>Can this device handle the next release?</li> <li>Is this a network issue, an application issue, or a unique blend of both?</li> <li>Is the network under attack?</li> </ul> <h3 id="interface-errors">Interface errors</h3> <p>Interfaces provide entry points between devices and networks. Collecting error telemetry from these interfaces can give network operators insight into link-level problems as well as potential security threats.</p> <h2 id="supporting-network-observability-with-device-telemetry">Supporting network observability with device telemetry</h2> <p>Unfortunately, this telemetry is not very meaningful if left to its own devices. But when incorporated into a big data approach like network observability, it provides robust statistical baselines for the network engineers and operators making decisions around cost, performance, reliability, and security.</p> <h3 id="you-need-a-unified-data-platform">You need a unified data platform</h3> <p>Even businesses of modest scale can generate petabytes of device telemetry data. Collecting, processing, and storing this data requires significant engineering efforts that can span entire organizations. Managing this data separately across distributed teams sets organizations up for failure or, at the very least, underperformance.</p> <p>Assuming it can scale, a unified data and analytics platform provides a central, single source of truth that reduces the risk of miscommunication and accelerates incident resolution and optimization in large, complex networks where different teams are responsible for various aspects of network operations.</p> <p>Unified data platforms also facilitate using machine learning and artificial intelligence algorithms to detect and alert on network issues automatically. Although the full utility of AI and ML in NetOps is emerging, having access to a unified data platform gives these technologies richer datasets.</p>
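<p>Even a toy version of this automated detection shows why the unified baseline matters. The sketch below flags an interval whose traffic strays several standard deviations from history; the sample values are invented, and real platforms use far richer, per-dimension baselines:</p> <pre class="language-python"><code class="language-python">from statistics import mean, stdev

history = [10.2, 9.8, 11.1, 10.5, 9.9, 10.7, 10.1]  # Gbits per interval
current = 42.3                                       # newest sample

mu, sigma = mean(history), stdev(history)
z_score = (current - mu) / sigma

# Flag anything more than three standard deviations above the baseline
if z_score > 3:
    print(f"anomaly: {current} Gbits is {z_score:.1f} sigma above baseline")
</code></pre>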
<p>In short, data platforms make the “big data” analysis central to network observability a possibility.</p> <p>To see what network observability can do for you, <a href="https://www.kentik.com/get-demo/">get a Kentik demo</a> today.</p><![CDATA[Data Gravity in Cloud Networks: Distributed Gravity and Network Observability]]><![CDATA[In the final entry of the Data Gravity series, Ted Turner outlines concrete examples of how network observability solves complex issues for scaling enterprises.]]>https://www.kentik.com/blog/data-gravity-in-cloud-networks-distributed-gravity-and-network-observabilityhttps://www.kentik.com/blog/data-gravity-in-cloud-networks-distributed-gravity-and-network-observability<![CDATA[Ted Turner]]>Wed, 08 Mar 2023 05:00:00 GMT<p>So far in <a href="https://www.kentik.com/blog/data-gravity-in-cloud-networks-massive-data/">this series</a>, I’ve outlined how a scaling enterprise’s accumulation of data (data gravity) struggles against three consistent forces: cost, performance, and reliability. This struggle changes an enterprise; this is “digital transformation,” affecting everything from how business domains are represented in IT to software architectures, development and deployment models, and even personnel structures.</p> <p>As datasets scale and networks become distributed to free their data, the data gravity story begins to morph into a data complexity story. The more distributed an enterprise’s data, the more heterogeneous its sources. This means moving more data more frequently, carrying out more complex transformations, and managing this data across more complex pathways. To accommodate this, new data infrastructures must be implemented to facilitate the data lifecycle across multiple zones, providers, customers, etc.</p> <p>There are several main problems with this complexity for networks:</p> <ul> <li>The complexity gets worse as networks scale.</li> <li>Finding the root causes of issues across these distributed networks takes time and effort, and is ultimately expensive.</li> <li>Cloud networks give operators a much wider surface area to protect against cyberattacks.</li> <li>How do network operators effectively optimize against cost, performance, and reliability with so many moving parts?</li> </ul> <p>Solving these problems for distributed cloud networks has required a big data approach, ultimately resulting in the evolution of network observability.</p> <h2 id="tenets-of-network-observability">Tenets of network observability</h2> <p>A <a href="https://www.kentik.com/kentipedia/what-is-network-observability/">detailed explanation of network observability</a> itself is out of scope for this article, but I want to focus on its core tenets before exploring a couple of brief case studies.</p> <p>Network observability, when properly implemented, enables operators to:</p> <ul> <li><strong>Ingest telemetry from every part of the network</strong>. The transition from monitoring to observability requires access to as many system signals as possible.</li> <li><strong>Have full data context</strong>. Enriching network telemetry with business, application, and operational context gives operators multifaceted views of traffic and behavior.</li> <li><strong>Ask any question about their network</strong>.
Rich context and real-time datasets allow network engineers to dynamically filter, drill down, and map networks as queries adjust.</li> <li><strong>Leverage automated insights and response flows</strong>. The “big data” approach enables powerful automation features to initiate in-house or third-party workflows for performance and security anomalies.</li> <li><strong>Engage with all of these features on a central platform</strong>. Features and insights siloed in different teams or services have limited impact. Unifying network data onto a single platform provides a single observability interface.</li> </ul> <p>As mentioned earlier, engineering data pipelines for the intense flow of telemetry is a huge component of making systems observable. Extract, transform, and load (ETL) systems are used to modify the data received and coordinate it with other data streams. Often this is the first tier of “enriching the data,” where correlations between network details like IP addresses, DNS names, application stack tags, and deployment versions can be made (a small sketch of this step follows the first case study below).</p> <p>These ETL/ELT servers can provide a standard method of synthesis, which can be replicated across as many servers as it takes to ingest all of the data. This clustering of servers at the beginning of the pipeline enables the growth of data sets beyond the capabilities of most legacy commercial or open source data systems. It represents the technical foundation of observability features like a centralized interface, network context, network maps, and the ability to “ask anything” about a network.</p> <h2 id="two-case-studies-network-observability-in-action">Two case studies: Network observability in action</h2> <p>Ideas about networks are great, but nothing beats seeing them play out in production. Here are two scenarios from current and former clients managing data gravity and how network observability could have been used to triage or prevent the issue.</p> <h3 id="unexpected-traffic-patterns">Unexpected traffic patterns</h3> <p>For the first case study, I want to discuss an international DevOps team using a 50-node Kubernetes cluster. The question they brought to us was: which nodes/pods are ultimately responsible for pushing the traffic from region to region? They were looking for more info about two pain points: exorbitant inter-region costs and degrading performance.</p> <p>A hallmark of cloud networking is that many networking decisions are 1) limited by the provider, and 2) simultaneously made by developers at many different points (application, orchestration, reliability, etc.). This can make for a very opaque and inconsistent traffic narrative for network operators.</p> <p>It turned out that there was a significant transaction set in region one, but the K8s cluster was deployed with insufficient constraints; automated provisioning pushed the whale traffic flows toward another, less utilized (and farther) region. Without nuanced oversight, this reliability move (being multi-region) led to increased latency and degraded performance. And by not prioritizing the movement of smaller traffic flows, the automated networking decisions attached to this reliability setup proved very expensive.</p> <p>With a network observability solution in place, these mysterious traffic flows would have been mapped and quickly accessible. Assuming proper instrumentation, observability’s use of rich context would have made short work of identifying the K8s components involved.</p>
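<p>To make that enrichment step concrete, here is a minimal sketch of joining a raw flow record against metadata keyed by IP address. The records, names, and tags are invented for illustration; production pipelines do this at massive scale inside the ETL tier:</p> <pre class="language-python"><code class="language-python"># A raw flow record as it might arrive from a collector
flow = {"src_ip": "10.0.12.34", "dst_ip": "10.0.40.2", "bytes": 48_200}

# Metadata tables populated from DNS, orchestration, and deploy tooling
dns_names = {"10.0.12.34": "checkout-7d9f.pod.cluster.local"}
app_tags = {"10.0.12.34": {"service": "checkout", "deploy": "2.4.1"}}

def enrich(record):
    src = record["src_ip"]
    record["src_dns"] = dns_names.get(src, "unknown")
    record.update(app_tags.get(src, {}))
    return record

# The enriched record now carries the context needed to tie a whale flow
# back to a specific service and deployment
print(enrich(flow))
</code></pre>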
<h3 id="cascading-failures">Cascading failures</h3> <p>In another case, a large retailer with a global footprint asked for help locating some “top talkers” on their private network. With some regularity, one (or multiple) of their internal services inadvertently shut down the network. Despite what was supposed to be a robust and expansive series of ExpressRoutes with Azure, the retailer’s system kept experiencing cascading failures.</p> <p>After quite a bit of sleuthing, it became clear that site reliability engineers (SREs) were implementing data replication pathways (distributing data gravity) that were inadvertently causing cascading failures because of faulty bandwidth assumptions.</p> <p>This scenario highlighted the need for the following:</p> <ul> <li>A platform to provide a comprehensive, single source of truth to unite siloed engineering efforts</li> <li>Network maps and visualizations to quickly diagnose traffic flows, bottlenecks, and complex connections</li> <li>Rich context like service and network metadata to help isolate and identify top talkers</li> </ul> <h2 id="new-data-gravity-concerns-with-observability">New data gravity concerns with observability</h2> <p>While I am quick to celebrate the pragmatism of a network observability implementation, it has to be pointed out in a discussion of data gravity that the significant ingress of network telemetry is challenging to handle. The highly scalable model of today’s observability data pipelines requires robust and extensive reliability frameworks (similar to their parent networks).</p> <p>Besides the inherent transit and storage costs involved in this additional level of reliability, observed systems present other constraints that engineers need to negotiate:</p> <ul> <li>Devices being monitored can be pestered to death; SNMP queries for many OIDs (Object IDs) can cause CPU or memory constraints on the system being monitored.</li> <li>Low CPU/memory capacity devices like network equipment can sometimes cause outages during high bandwidth consumption events (DDoS attacks, large volumetric data transfers, many customers accessing network resources concurrently, etc.)</li> </ul> <p>Despite its significant data footprint, observability is ultimately about saving precious time and optimizing your network. Still, many of the same <a href="https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-i/">hidden costs associated with cloud networking</a> can creep up as the network observability platform scales and achieves “mission critical” status.</p> <h2 id="data-gravity-series-conclusion">Data gravity series conclusion</h2> <p>For the scaling enterprise, data gravity represents a severe challenge for application, network, and data engineers. Distributing this data gravity across multiple DCs, zones, and providers offers organizations a competitive edge in pursuit of lower costs, higher performance, and more reliable systems. But this distributed data leads to a complex networking infrastructure that, at scale, can become an availability and security nightmare.</p> <p>The best way to manage this complexity is with a network observability platform.</p> <p>Want to talk to a cloud networking professional about data gravity concerns in your network?
<a href="https://www.kentik.com/get-demo/">Reach out to Kentik today</a>.</p><![CDATA[Exploring Your Network Data With Kentik Data Explorer]]><![CDATA[A cornerstone of network observability is the ability to ask any question of your network. In this post, we’ll look at the Kentik Data Explorer, the interface between an engineer and the vast database of telemetry within the Kentik platform. With the Data Explorer, an engineer can very quickly parse and filter the database in any manner and get back the results in almost any form.]]>https://www.kentik.com/blog/exploring-your-network-data-with-kentik-data-explorerhttps://www.kentik.com/blog/exploring-your-network-data-with-kentik-data-explorer<![CDATA[Phil Gervasi]]>Tue, 28 Feb 2023 05:00:00 GMT<p>A cornerstone of network observability is the ability to ask any question of your network. That means having an unbound capacity to explore the tremendous amount and variety of network telemetry you collect. It means seeing trends and patterns from a macro level, but it also means getting very granular to pursue any line of analysis of your data.</p> <p>Collecting information from <a href="https://www.kentik.com/kentipedia/what-is-netflow-overview/" title="What is NetFlow? An Overview of the NetFlow Protocol">flow records</a>, <a href="https://www.kentik.com/blog/snmp-vs-netflow/" title="SNMP vs NetFlow">SNMP</a>, <a href="https://www.kentik.com/blog/how-to-maximize-the-value-of-streaming-telemetry/" title="How to Maximize the Value of Streaming Telemetry">streaming telemetry</a>, <a href="https://www.kentik.com/kentipedia/what-is-bgp-border-gateway-protocol/" title="Kentipedia: What Is BGP? Border Gateway Protocol Explained">BGP</a>, <a href="https://www.kentik.com/blog/ebpf-explained-why-its-important-for-observability/" title="eBPF Explained: Why it&#x27;s Important for Observability">eBPF</a>, and so on is indeed very important. Still, it will be useless if it takes forever to do anything meaningful with that data. What’s almost as important as the data is the ability to query that data very quickly according to your specific role or needs.</p> <h2 id="the-problem-with-so-much-data">The problem with so much data</h2> <p>The problem is that when you collect telemetry from today’s network, you’re collecting information from your campus, data center(s), WAN, public cloud resources, container environments, etc. It’s a tremendous amount of information and highly varied in type and format.</p> <p>Querying such a large database, or more likely multiple databases, can take a very long time, which isn’t an option when triaging an incident. The whole point of collecting all of this data is to mine out answers to our questions and to solve problems. Because flow records alone can comprise millions of rows in a database, trying to mine out specific answers can be almost pointless if it takes too long or gives us inaccurate results.</p> <p>So how can we process the data in such a massive database faster?</p> <h2 id="the-need-for-speed">The need for speed</h2> <p>The Kentik Data Explorer is Kentik’s interface between you as an engineer, whether that’s network, systems, cloud, security, or SRE, and the database of information you’ve collected with the Kentik platform.</p> <p>Using the Kentik Data Explorer, you can manually explore all that data, which is stored in the main tables of the Kentik Data Engine. But the real key here is that the Kentik Data Explorer was purpose-built for querying a massive database.
Speed and efficiency were the main drivers in its development from the start.</p> <p>Though massive, the underlying database itself is also highly distributed among many servers and locations, so fetching and serving the data needed for a query happens in parallel. So instead of querying data in a monolithic stack of iron (or virtual iron), we use a distributed cluster of compute resources.</p> <p>Now you can quickly filter an extensive database on specific parameters such as time range, various data sources and types, and hundreds of dimensions such as IP addresses, ASNs, application and security tags, container information, the fields inside <a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/" title="What are VPC Flow Logs?">VPC flow logs</a>, and much more.</p> <div as="Testimonial" index="0" color="green"></div> <h2 id="ask-any-question-about-your-network">Ask any question about your network</h2> <p>Being able to parse the data the way you need to is critical to making progress with troubleshooting and analysis. Imagine collecting all of this great information from all over your network, the cloud, containers, etc., but being unable to zoom in on precisely what you need for your specific issue.</p> <p>With Data Explorer, you can query the database using built-in filters, or you can create custom ones, which is key to being able to pursue any line of questioning of your network data. Look at the graphic below, in which you can see dozens of different dimensions on which you can filter. In a live screen <a href="#video">(see the video below)</a>, you can also scroll down and see dozens more dimensions organized in categories like network devices, public cloud, containers, etc.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3t8xDHnxw01fb3Yvu3aTDv/97a9ebe34cafda2dd2d3773b4849b915/dimensions-filters.png" style="max-width: 700px;" class="image center" withFrame alt="Dimensions for filtering" /> <p>Data Explorer represents the entire underlying database, so whether you’re doing a security investigation, trying to figure out why an app is slow, figuring out the path your traffic is taking from public cloud to public cloud, or anything else you can think of, the data is there. But this means you need to be able to get the data back in the way that makes the most sense to you.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5VVvZgaJ65q36oZhdgKZMG/2829468b0d47716e83604f2c3f58afe6/aws-filters.png" style="max-width: 700px;" class="image center" withFrame alt="AWS dimensions for filtering" /> <p>Is seeing the data in bits per second more relevant than packets per second? Are you more interested in source IP than destination IP? You can get back the results based on any metric or combination that matters to you, and if you know networking, you also know that it can get pretty complex.</p> <p>In the graphics above, we saw the many dimensions you can filter on, and in the graphic below, you can see just a small sample of the dozens and dozens of metrics that you can use to refine your search results.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4g8kqq12da2Tcu3bb2AiQ8/3c889bc0d9c760a4182c0d85da92412a/metrics-filters.png" style="max-width: 800px;" class="image center" withFrame alt="Metrics to refine the filters" /> <p>With just a few clicks, you can mine the database for TCP retransmits, TCP fragments, client latency, flow information, and dozens of other metrics. This means you can use Data Explorer to follow any line of analysis you need to solve various complex problems beyond just seeing CPU utilization and CRC errors on a pretty graph.</p>
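<p>Conceptually, every one of these queries has the same shape: filter on some dimensions, group by others, and aggregate a metric. As a rough, scaled-down analogy, here is that shape expressed in pandas against an invented flow table (Data Explorer itself runs this kind of query against the distributed Kentik Data Engine):</p> <pre class="language-python"><code class="language-python">import pandas as pd

# A tiny stand-in for a flow record table
flows = pd.DataFrame({
    "src_ip":   ["10.0.1.5", "10.0.1.5", "10.0.2.9", "198.51.100.7"],
    "dst_asn":  [15169, 15169, 32934, 15169],
    "protocol": ["tcp", "tcp", "udp", "tcp"],
    "bytes":    [48_200, 91_500, 12_000, 310_000],
})

# Filter on one dimension, group by another, aggregate a metric
top_talkers = (
    flows[flows["protocol"] == "tcp"]
    .groupby("dst_asn")["bytes"]
    .sum()
    .sort_values(ascending=False)
)
print(top_talkers)
</code></pre>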
<p>The idea is an unbound, efficient ability to explore all of the very different types of network telemetry in the database. In other words, ask any question you want — and find the answer you need.</p> <p><span id="video">See the Data Explorer</span> in action in this short demo.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/t1wbhwmgu4" title="Exploring Network Data with Kentik Video" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div><![CDATA[The Russification of Ukrainian IP Registration]]><![CDATA[In this article, Doug Madory uncovers the little-known "Russification" of Ukrainian IP addresses -- a phenomenon that complicates the task of internet measurement and impacts Ukrainians connecting to the internet using IP addresses suddenly considered Russian.]]>https://www.kentik.com/blog/the-russification-of-ukrainian-ip-registrationhttps://www.kentik.com/blog/the-russification-of-ukrainian-ip-registration<![CDATA[Doug Madory]]>Thu, 23 Feb 2023 05:00:00 GMT<p>Last summer we teamed up with the New York Times to <a href="https://www.nytimes.com/interactive/2022/08/09/technology/ukraine-internet-russia-censorship.html">analyze the re-routing</a> of internet service to Kherson, a region in southern Ukraine that was, at the time, under Russian occupation. In my <a href="https://www.kentik.com/blog/rerouting-of-kherson-follows-familiar-gameplan/">accompanying blog post</a>, I described how that development mirrored what took place following Russia’s annexation of Crimea in 2014.</p> <p>Along with the Russian-held parts of eastern Ukraine, these regions have experienced a type of <em><a href="https://en.wikipedia.org/wiki/Russification#:~:text=Russification%20(Russian%3A%20%D1%80%D1%83%D1%81%D0%B8%D1%84%D0%B8%D0%BA%D0%B0%D1%86%D0%B8%D1%8F%2C%20romanized,culture%20and%20the%20Russian%20language."
target="_blank">Russification</a></em>, an assimilation where the Ukrainian residents of these regions have been forced to adopt all things Russian: language, currency, telephone numbers, and, of course, internet service.</p> <p>Using a novel utility made available by <a href="https://www.ripe.net/">RIPE NCC</a>, we have identified dozens of changes to registrations, revealing another target of this Russification effort: the geolocation of Ukrainian IP addresses.</p> <h2 id="russifying-occupied-donbas">Russifying occupied Donbas</h2> <p>Internet service in the Russian-held parts of eastern Ukraine has primarily been going through Russian transit providers for many years, but there appears to have been a concerted effort in the past year to make internet resources in Russian-occupied Donetsk and Luhansk appear to the world as if they were, in fact, Russian.</p> <p>Using RIPE NCC’s <a href="https://apps.db.ripe.net/docs/13.Types-of-Queries/16-Historical-Queries.html#historical-queries">historical query</a> functionality, we can see for ourselves how, in recent months, the registrations of IP ranges located in these contested parts of Ukraine had their geolocation fields changed from Ukraine to Russia.</p> <p>Take, for example, the IP address range 178.158.128.0/18. This prefix has been continuously announced out of Donetsk, Ukraine, for many years and has the following versions of RIPE registrations:</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> $ whois --list-versions 178.158.128.0 - 178.158.191.255 % Version history for INETNUM object "178.158.128.0 - 178.158.191.255" % You can use "--show-version rev#" to get an exact version of the object. rev# Date Op. 1 2010-11-18T10:59:18Z ADD/UPD 2 2014-07-11T13:12:56Z ADD/UPD 3 2014-07-16T16:53:24Z ADD/UPD 4 2014-07-16T17:24:45Z ADD/UPD 5 2015-03-04T16:47:14Z ADD/UPD 6 2015-05-05T01:39:50Z ADD/UPD 7 2016-04-12T09:42:35Z ADD/UPD 8 2016-04-14T10:43:56Z ADD/UPD 9 2016-06-02T10:21:40Z ADD/UPD <span style="background-color: #fef89b;">10 2022-07-21T12:58:43Z ADD/UPD</span> </code></pre> <p>This registration was most recently modified last July, five months after Russia invaded Ukraine. With the command below, we can do a “diff” on versions 9 and 10 to see exactly what was changed last summer:</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> $ whois --diff-versions 9:10 178.158.128.0 - 178.158.191.255 % Difference between version 9 and 10 of object "178.158.128.0 - 178.158.191.255" @@ -2,3 +2,3 @@ netname: ISP-EAST-NET <span style="background-color: #fef89b;">-country: UA</span> <span style="background-color: #fef89b;">+country: RU</span> org: ORG-EL88-RIPE @@ -10,3 +10,3 @@ created: 2010-11-18T10:59:18Z -last-modified: 2016-04-14T10:43:56Z +last-modified: 2022-07-21T12:58:43Z source: RIPE </code></pre> <p>The highlighted portion reveals that this registration had its country field changed from UA (Ukraine) to RU (Russia) last July, but it wasn’t the only one. In fact, dozens of registrations for IP address ranges originated by networks in Donetsk and Luhansk changed their countries from Ukraine to Russia in the past year.</p> <p>For another example, take 151.0.0.0/20, which is originated by Online Technologies LTD (AS45025) in the Donetsk region. 
A change, highlighted below, on July 18th last year updated the country field from Ukraine to Russia.</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> $ whois --diff-versions 3:4 151.0.0.0 - 151.0.31.255 % Difference between version 3 and 4 of object "151.0.0.0 - 151.0.31.255" @@ -3,3 +3,3 @@ descr: Online Technologies LTD <span style="background-color: #fef89b;">-country: UA</span> <span style="background-color: #fef89b;">+country: RU</span> geoloc: 48.045955739960114 37.96531677246094 @@ -10,3 +10,3 @@ created: 2012-01-05T13:39:09Z -last-modified: 2018-12-10T12:06:53Z +last-modified: 2022-07-18T12:09:23Z source: RIPE </code></pre> <p>In case there was any doubt about where this network is purportedly located, this registration entry helpfully contains <a href="https://www.google.com/maps/place/48%C2%B002&#x27;45.4%22N+37%C2%B057&#x27;55.1%22E/@48.0255277,37.6144356,10z/data=!4m5!3m4!1s0x0:0x8b821988daf9d5b0!8m2!3d48.0459557!4d37.9653168">lat/long coordinates</a> which point to an address in Makiivka, just to the east of the city of Donetsk and the site of a <a href="https://www.theguardian.com/world/2023/jan/04/makiivka-strike-what-we-know-about-the-deadliest-attack-on-russian-troops-since-ukraine-war-began">deadly missile strike</a> on New Year’s Eve.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4pS44LBlwWMNw2l1z8rKCi/a7ab7a6a67271aa4fe86c0e3381873fa/donetsk-map.png" style="max-width: 650px;" class="image center" alt="Map of Makiivka near Donetsk" /> <p>Similar changes have been taking place in Russian-held Luhansk (also spelled Lugansk). 178.219.192.0/20 is originated by AS197129 in the Russian-held part of the region and also changed its country field from Ukraine to Russia on July 18, 2022.</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> $ whois --diff-versions 11:12 178.219.192.0 - 178.219.207.255 % Difference between version 11 and 12 of object "178.219.192.0 - 178.219.207.255" @@ -2,3 +2,3 @@ netname: VRLINE-NET <span style="background-color: #fef89b;">-country: UA</span> <span style="background-color: #fef89b;">+country: RU</span> org: ORG-KOPI1-RIPE @@ -10,3 +10,3 @@ created: 2010-06-18T06:52:49Z -last-modified: 2016-04-14T10:29:32Z +last-modified: 2022-07-18T09:13:53Z source: RIPE </code></pre> <p>Not all country changes occurred in July last year. The prefixes originated by Luganet (AS39728) changed their country codes from Ukraine to Russia in September, just before their <a href="https://en.wikipedia.org/wiki/2022_annexation_referendums_in_Russian-occupied_Ukraine">controversial referendum</a> for independence.
The registration diff for AS39728’s 194.31.152.0/22 is shown below:</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> $ whois --diff-versions 3:4 194.31.152.0 - 194.31.155.255 % Difference between version 3 and 4 of object "194.31.152.0 - 194.31.155.255" @@ -2,3 +2,3 @@ netname: RU-OMEGA-20181128 <span style="background-color: #fef89b;">-country: UA</span> <span style="background-color: #fef89b;">+country: RU</span> geoloc: 48.5335 -39.2783 @@ -9,3 +9,3 @@ created: 2018-11-28T10:49:57Z -last-modified: 2019-11-25T13:11:22Z +last-modified: 2022-09-07T13:45:55Z source: RIPE </code></pre> <p>And finally, consider 95.215.51.0/24, originated by Optima-East (AS48882) in Krasnodon in the Luhansk region along the Russian border. Its registration record changed from Ukraine to Russia on November 2, 2022.</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0;"><code class="language-text" style="color: #000; font-size: 1.05em;"> $ whois --diff-versions 12:13 95.215.48.0 - 95.215.51.255 % Difference between version 12 and 13 of object "95.215.48.0 - 95.215.51.255" @@ -2,3 +2,3 @@ netname: NET-IPCOM <span style="background-color: #fef89b;">-country: UA</span> <span style="background-color: #fef89b;">+country: RU</span> org: ORG-JCI3-RIPE @@ -10,3 +10,3 @@ created: 2009-02-17T13:32:28Z -last-modified: 2016-04-14T10:42:15Z +last-modified: 2022-11-02T07:40:20Z source: RIPE </code></pre> <h2 id="russification-of-crimea">Russification of Crimea</h2> <p>It is important to note that the Russification of Ukrainian RIPE registrations didn’t start in the past year. In fact, the RIPE NCC <a href="https://apps.db.ripe.net/docs/13.Types-of-Queries/16-Historical-Queries.html#historical-queries">historical query</a> allows us to also identify registration changes taking place in Crimea following the Russian annexation in March 2014.</p> <p>The country field of 91.194.163.0/24, originated by CrimeaCom (AS28761), was changed on December 12, 2014, from Ukraine to Russia.</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0; margin-bottom: 20px;"><code class="language-text" style="color: #000; font-size: 1.05em;"> $ whois --diff-versions 8:9 91.194.162.0 - 91.194.163.255 % Difference between version 8 and 9 of object "91.194.162.0 - 91.194.163.255" @@ -1,5 +1,5 @@ inetnum: 91.194.162.0 - 91.194.163.255 -netname: SINET-NET -descr: CrimeaCom LLC <span style="background-color: #fef89b;">-country: UA</span> +netname: CrimeaCom-Network +descr: CrimeaCom South LLC <span style="background-color: #fef89b;">+country: RU</span> org: ORG-CL205-RIPE </code></pre> <p>In fact, one Crimean provider wasted no time in changing the country field of its IP addresses. Based in Sevastopol, Lancom’s IP registrations were changed from Ukraine to Russia on March 18th, 2014, the exact same day Russia signed the <a href="http://en.kremlin.ru/events/president/news/20604">Treaty of Accession of the Republic of Crimea to Russia</a>.</p> <pre class="language-text" style="border-radius: 4px; padding-top: 0;"><code class="language-text" style="color: #000; font-size: 1.05em;"> $ whois --diff-versions 1:2 46.35.224.0 - 46.35.255.255 % Difference between version 1 and 2 of object "46.35.224.0 - 46.35.255.255" @@ -3,3 +3,3 @@ descr: Lancom Ltd.
<span style="background-color: #fef89b;">-country: ua</span> <span style="background-color: #fef89b;">+country: ru</span> org: ORG-LL42-RIPE </code></pre> <h2 id="so-whats-the-upshot-of-all-of-this">So what’s the upshot of all of this?</h2> <p>Registrations allow internet resource owners the ability to communicate to the internet their intentions of how — or more appropriately, in this case, <em>where</em> — the resource (i.e., an IP address range) will be used.</p> <p>There is no requirement that the resource is <em>actually</em> used in the location listed in its registration — there are plenty of misgeolocations out there that demonstrate that. But there are some practical implications of changing all of these IP address ranges to being registered in Russia.</p> <p>Geolocation service providers take most geolocation information found in registration data at face value. With very few exceptions, changing the registered country of an IP address will cause these services to change the geolocation they report to their customers. Below is the reported geolocation for the first IP range mentioned at the top of the blog post:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5H9hAs7ri9rs3nGFwsUVdC/8e7e97b6751f415b4b1acd5b14ecd98b/ipinfo.png" style="max-width: 650px;" withFrame class="image center" alt="Reported geolocation in ipinfo.io" /> <div class="caption" style="margin-top: -30px;">Geolocation of the IP range in the first example from ipinfo.io</div> <p>Take Crimea, for example, which has been under Russian control since 2014. Today you can find Crimean providers announcing IP ranges registered as Ukrainian and others announcing ranges registered as Russian. Pick your favorite geolocation service provider, and you will see an impossible mix of country-level geo for things in a region that can only be in a single country. One is left to maintain personal lists of prefixes and ASNs belonging to networks that are known to operate in these regions.</p> <p>Aside from complicating the task of internet measurement, perhaps the biggest practical impact may be to the Ukrainians having to connect to the internet using IP addresses that are suddenly considered to be Russian. These users may encounter problems accessing services that have been blocked to Russian IP addresses, requiring them to use a VPN or other means to sidestep geoblocking.</p> <p>While these changes are perhaps intended to be symbolic, they can have subtle and unintended consequences on connectivity. The Russian government wants the world to believe these internet service providers are operating on Russian soil. Changing the registrations of these IP ranges to reflect that worldview is part of a wider effort of the Russification of captured Ukrainian territories.</p><![CDATA[Implementing a Cost-aware Cloud Networking Infrastructure]]><![CDATA[Cloud networks introduce a multitude of costs that can become challenging to predict. 
Learn how to implement a cost-aware infrastructure through maximum visibility in cloud networking.]]>https://www.kentik.com/blog/implementing-a-cost-aware-cloud-networking-infrastructurehttps://www.kentik.com/blog/implementing-a-cost-aware-cloud-networking-infrastructure<![CDATA[Ted Turner]]>Tue, 21 Feb 2023 05:00:00 GMT<h2 id="what-is-cloud-networking">What is cloud networking?</h2> <p><a href="https://www.kentik.com/kentipedia/what-is-cloud-networking/">Cloud networking</a> is the IT infrastructure necessary to host or interact with applications and services in public or private clouds, typically via the internet. It’s an umbrella term for the devices and strategies that connect all variations of on-premise, edge, and cloud-based services.</p> <h2 id="why-is-cloud-networking-important">Why is cloud networking important?</h2> <p>Being able to leverage cloud services positions companies to scale in ways that would be cost- and time-prohibitive without the infrastructure, distribution, and services of cloud providers. Gaining access to these vast cloud resources allows enterprises to engage in high-velocity development practices, develop highly reliable networks, and perform big data operations like artificial intelligence, machine learning, and observability.</p> <p>These distributed, data-intensive architectural components require layers of abstraction for deployment, orchestration, security, observability, and platform integrations, creating complex and dynamic networks. Cloud networking helps enterprises ensure these connections are cost-effective, secure, performant, and reliable.</p> <h2 id="cloud-networking-vs-cloud-computing">Cloud networking vs. cloud computing</h2> <p>Cloud networking can be thought of as a subset of cloud computing. Cloud computing includes all the concepts, tools, and strategies for providing, managing, accessing, and utilizing cloud-based resources.</p> <h2 id="the-different-types-of-cloud-networking">The different types of cloud networking</h2> <p>When you encounter the phrase “cloud networking,” it may refer to several specific subsets of cloud networking, including, but not limited to:</p> <ul> <li>Hybrid cloud networking</li> <li>Multi-cloud networking</li> <li>Cloud-based networking</li> </ul> <h3 id="multi-cloud-networking">Multi-cloud networking</h3> <p>Multi-cloud networking refers to networks that use multiple cloud providers. For example, a particular microservice might be hosted on AWS for better serverless performance but send sampled data to a larger Azure data lake. The resulting network can be considered multi-cloud.</p> <h3 id="hybrid-cloud-networking">Hybrid cloud networking</h3> <p>Hybrid cloud networking refers specifically to the connectivity between two different types of cloud environments. A hybrid cloud network could involve connections between on-premise private, hosted private, and public clouds.</p> <h3 id="cloud-based-networking">Cloud-based networking</h3> <p>Slightly different is cloud-based networking, which refers specifically to networking solutions that offer a control plane hosted and delivered via public cloud.
This might include caches, load balancers, service meshes, SD-WANs, or any other cloud networking component.</p> <h2 id="designing-a-cloud-networking-environment-with-cost-in-mind">Designing a cloud networking environment with cost in mind</h2> <p>With their planet-wide data centers, a full suite of features, and resources for developers, cloud providers offer unparalleled capabilities in scaling and distributing applications across multiple availability zones (AZ), all while minimizing expenses like overhead for data centers and managers.</p> <p>However, cloud networking presents some unique challenges for network operators and engineers accustomed to managing on-prem networks. Many of these resources and services have highly complex pricing structures and involve significant transit fees at scale, and their distributed nature can expose organizations to unexpected network scenarios that ultimately result in unforeseen costs.</p> <p>Here are some common features and scenarios that cloud network operators should consider.</p> <h3 id="instance-types">Instance types</h3> <p>Cloud resources are made available to customers in various instance types, each with special hardware or software configurations optimized for a given resource or use case. Some common instance types include:</p> <ul> <li>General purpose</li> <li>Memory optimized</li> <li>Compute optimized</li> <li>Storage optimized</li> <li>Accelerated computing</li> </ul> <p>With accessible interfaces, making configuration decisions on cloud platforms is easier than ever. This is a double-edged sword, as selecting the wrong instance type for intensive applications or workloads can lead to significant cost inefficiencies.</p> <h3 id="on-demand-vs-reserved-vs-spot">On-demand vs. Reserved vs. Spot</h3> <p>One of the ways cloud providers guarantee availability is by entering into contracts with customers. These contracts can broadly be grouped into three types, each with a corresponding degree of expectations around resource availability (cloud provider) and commitment (customer):</p> <ul> <li> <p><strong>On-demand.</strong> With on-demand pricing, the customer provides zero commitment to the cloud provider. As such, on-demand pricing is elastic to supply and demand and is typically the most expensive option. Network operators should beware of using on-demand pricing for long-running or resource-intensive workloads, as this can expose the organization to dramatic costs in high-demand windows.</p> </li> <li> <p><strong>Reserved.</strong> Reserved instances allow customers more competitive rates at the cost of some time commitment to the provider. They are also complex, involving varying costs for different term limits, activation windows, AZs, regions, and instance types.</p> </li> <li> <p><strong>Spot.</strong> With spot instances, operators bid at competitive rates for instances. Spot instances are an attractive option for short-running workloads or uses that are not hindered by interruptions, as they imply no guarantee of availability from the provider.</p> </li> </ul> <p>With the potential for dramatic variability in pricing, it’s crucial for operators structuring their cloud networks to be mindful of the architectural effects of their pricing decisions (see the comparison sketch below). For example, spot instances will always be the most affordable, but they require more reliability engineering to keep services up. Spot instances also require dynamic responses to instance market adjustments, which can mean additional engineering or oversight.</p>
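<p>A back-of-the-envelope comparison shows how quickly these choices compound for an always-on workload. All of the hourly rates below are invented placeholders; real prices vary by provider, region, and instance type:</p> <pre class="language-python"><code class="language-python">HOURS_PER_MONTH = 730  # roughly one month of continuous runtime

# Hypothetical rates for the same instance size under each model
rates = {
    "on-demand": 0.40,  # $/hour, zero commitment
    "reserved":  0.25,  # $/hour, 1-year commitment
    "spot":      0.12,  # $/hour, interruptible
}

for model, rate in rates.items():
    print(f"{model:>10}: ${rate * HOURS_PER_MONTH:,.2f}/month")
# on-demand: $292.00/month, reserved: $182.50/month, spot: $87.60/month
</code></pre> <p>The spot option is by far the cheapest on paper, but as noted above, that spread has to be weighed against the engineering cost of tolerating interruptions.</p>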
<p>Spot instances also require dynamic responses to instance market adjustments, which can mean additional engineering or oversight.</p> <h3 id="traffic-costs">Traffic costs</h3> <p>One of the key cost factors for cloud network operators to consider when designing an infrastructure is traffic. What data is moving, how much of it, via what methods, and between what points in the network are critical in determining traffic costs.</p> <p>Much as spot instances allow organizations to bid down, network operators responsible for large amounts of traffic can enter into peering relationships that help facilitate efficient use of network resources while maintaining a quality end-user experience. Similarly, architects and operators must remember that data doesn’t have to be moved over the public internet, and private backbones can be established for more secure and affordable data transfer.</p> <h2 id="how-kentik-supports-cost-aware-cloud-networking">How Kentik supports cost-aware cloud networking</h2> <p>As the complexity of a cloud network grows, so do the opportunities for network costs to behave in unexpected ways. For such large systems, rapidly diagnosing issues, automating solutions, and predicting shifting sands that might spell trouble for the network (or your bottom line) are vital.</p> <p>As a <a href="https://www.kentik.com/product/kentik-platform/">network observability platform</a>, Kentik provides a single pane of glass for managing and making the most of your cloud networks. By combining traditional network monitoring data like SNMP and NetFlow with a global and real-time stream of device and security telemetry, Kentik gives network operators and engineers a look at network activity that is fine-grained, comprehensive, and rich with context.</p> <h3 id="incident-resolution">Incident resolution</h3> <p>When something goes wrong in a cloud network, mitigating the ripple effects and finding the root cause can be an overwhelming and costly endeavor. Ephemeral connections, cascading failures, and siloed engineering and monitoring efforts all contribute to a maze of alerts and network data that is difficult to parse or synthesize.</p> <p>Kentik offers operators a range of <a href="https://www.kentik.com/solutions/usecase/troubleshoot-networks/">troubleshooting tools</a>, including customizable and real-time queries, dynamic network maps, and a range of integrations to trigger responsive alerting and mitigation workflows.</p> <h3 id="connectivity-costs">Connectivity costs</h3> <p>As covered above, costs associated with moving traffic into, out of, and between your networks can be some of the most difficult to manage. <a href="https://www.kentik.com/product/edge/">Kentik Edge</a> gives operators a look at their network’s boundaries. By compiling pricing and network data, Kentik helps organizations compare billing statements directly to usage, surface cost trends and predictions, guide capacity planning and traffic engineering, and even dynamically negotiate and monitor peering relationships.</p> <p>This level of visibility into and control over network traffic gives operators the upper hand in efficiently managing connectivity costs.</p> <h3 id="security">Security</h3> <p>Cloud networks offer an immense surface area for malicious actors. 
Many modern cyber threats are also cloud-based, and with access to cheap virtual machines, attacks have become more sophisticated and harder to identify in network activity until some sort of interruption or compromise has already occurred.</p> <p>Kentik’s network security offering, <a href="https://www.kentik.com/product/protect/">Kentik Protect</a>, unifies and compares network, device, and threat telemetry to surface anomalous traffic <em>before</em> the network is compromised. It enhances the ability to investigate and understand attacks with deep ad-hoc traffic analysis and powers automated mitigation solutions once a threat has been identified.</p> <h3 id="optimization">Optimization</h3> <p>Software versions, customer regions, and a range of other data points all provide critical context when searching for cost and performance efficiencies.</p> <p>By providing deep, context-rich insights around resource utilization and traffic patterns, Kentik is excellent for establishing baselines around cost, performance, and reliability and systematically testing improvements to your network.</p> <h2 id="conclusion">Conclusion</h2> <p>Cloud networks introduce many opportunities for costs to become unpredictable, but implementing a cost-aware infrastructure is possible:</p> <ul> <li>Choose the right instance types throughout your cloud network</li> <li>Understand the best-fit pricing plan for the use case</li> <li>Manage traffic and peering relationships efficiently</li> <li>Introduce a network observability solution to provide the visibility needed to make these decisions, keep the system humming against performance and security issues, and search for optimizations</li> </ul> <p>To see how observability can make your cloud networks cost-aware, <a href="https://www.kentik.com/get-demo/">sign up for a demo</a>.</p><![CDATA[The Consolidation of Networking Tasks in Engineering]]><![CDATA[The advent of various network abstractions has meant many day-to-day networking tasks normally done by network engineers are now done by other teams. What's left for many networking experts is the remaining high-level design and troubleshooting. In this post, Phil Gervasi unpacks why this change is happening and what it means for network engineers.]]>https://www.kentik.com/blog/the-consolidation-of-networking-tasks-in-engineeringhttps://www.kentik.com/blog/the-consolidation-of-networking-tasks-in-engineering<![CDATA[Phil Gervasi]]>Thu, 16 Feb 2023 05:00:00 GMT<p>In recent years, the rapid development of cloud-based networking, network abstractions such as SD-WAN, and controller-based campus networking has meant that basic, day-to-day network operations have become easier for non-network engineers. The result we’re starting to see today is a sort of <em>consolidation</em> of networking tasks, leading to a need for only a small number of highly skilled network engineers to handle the less frequent heavy lifting of advanced design and troubleshooting.</p> <h2 id="network-abstractions-make-it-easier">Network abstractions make it easier</h2> <p>One of the major drivers of this consolidation is the <a href="https://www.kentik.com/solutions/usecase/clouds-and-hybrid/">rise of cloud-based networking</a>, such as AWS and Azure, which provide a centralized platform for network engineers to manage their cloud network infrastructure. The major cloud providers have made it easy for non-network engineers to handle basic networking operations, such as setting up VPNs and configuring firewalls.</p>
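<p>To see how low that bar has become, consider that a firewall change which once required vendor CLI expertise is now a short API call. Here is a hedged sketch using Python and boto3; the security group ID and CIDR are placeholders, not real values:</p> <pre><code class="language-python">import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allow inbound HTTPS from a single office CIDR on an existing
# security group. "sg-0abc1234" is a hypothetical placeholder.
ec2.authorize_security_group_ingress(
    GroupId="sg-0abc1234",
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 443,
            "ToPort": 443,
            "IpRanges": [
                {"CidrIp": "203.0.113.0/24", "Description": "Office HTTPS"}
            ],
        }
    ],
)
</code></pre>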
<p>The major cloud platforms also provide a centralized view of the cloud network, making monitoring and troubleshooting issues easier without having to be a high-level engineer.</p> <p>Also, think about the development of network abstractions such as SD-WAN and controller-based campus networking. Among other things, SD-WAN abstracts the underlying network infrastructure, allowing network engineers to focus less on configuring interfaces and more on the business requirements of the network. Though there’s undoubtedly a technical component to even the most user-friendly SD-WAN, these solutions generally make it easier for non-network engineers to manage basic networking operations, such as configuring branch office connectivity, simple security policies, remote access VPN, and routing.</p> <p>In the same way, controller-based campus networking solutions provide a centralized platform for managing campus networks, also making it easier for non-network engineers to handle the more mundane aspects of campus networking. Rather than memorizing many commands and understanding how protocols and features like STP or BPDU Guard work, engineers can click through templates and menus in a controller to set up their layer 2 networks. The controller abstracts a lot of the very low-level configuration away.</p> <h2 id="the-new-role-of-the-network-expert">The new role of the network expert</h2> <p>So, what’s left? The consolidation of network engineering has led to a need for only a few highly skilled advanced network engineers. These advanced network engineers are responsible for designing and implementing complex network infrastructure, such as data center and service provider networks. They’ll still need that deep technical understanding of networking beyond the fundamentals when things inevitably (but infrequently) break.</p> <p>For example, a cloud engineer may be a novice in how encapsulation technologies work. Still, they can use the SD-WAN controller to set up connectivity between their on-prem data center and AWS instance quickly and easily. However, when a problem can’t be solved by simply looking through the controller settings, the high-level networking expert is brought in to analyze packet captures and logs to eventually find an MTU mismatch in the configuration.</p> <p>As much as we, as practicing network engineers, feel like these kinds of problems happen constantly, they really don’t. It just feels that way because it’s all we work on from project to project. In reality, the bulk of network operations activity is mundane configuration and monitoring. This is why I believe that we’re seeing a consolidation of network engineers, at least higher-level ones.</p> <p>Just as we saw BGP move from appearing only on the CCIE exam to the CCNP and now the CCNA, the last remaining high-level network engineers will be the experts still being kept around when things really fall apart, while cloud engineers, developers, systems administrators, and so on will be doing the mundane work of simple network operations using whatever UI they’re working in that day.</p><![CDATA[Understanding Data-platform Needs to Support Network Observability]]><![CDATA[By providing a central, single source of truth, a unified data platform reduces the risk of miscommunication for large, complex networks. 
In this article, we dive into how data observability can play a critical part in network observability.]]>https://www.kentik.com/blog/understanding-data-platform-needs-to-support-network-observabilityhttps://www.kentik.com/blog/understanding-data-platform-needs-to-support-network-observability<![CDATA[Stephen Condon]]>Wed, 15 Feb 2023 05:00:00 GMT<p>In today’s digital world, data is generated faster and at a larger scale than ever before. With software architectures increasingly adopting distributed, <a href="https://www.kentik.com/product/cloud/">cloud-based models</a>, network infrastructures have become complex webs of virtual and physical devices.</p> <p>The interplay of multiple, simultaneous deployments (multi-cloud, blue/green, canary), service-oriented production and infrastructure models, the ephemeral nature of the data associated with technologies like containers and serverless, thousands of data sources and destinations, and the need to monitor and act on <em>all</em> of this information creates a high-stakes environment for data management.</p> <h2 id="what-is-a-data-platform">What is a data platform?</h2> <p>A data platform is an abstraction that offers a single pane of glass for managing your data wherever that data is being generated, processed, or stored.</p> <p>Through instrumentation, integrations, automated analysis, visualizations, and a full suite of data management features, data platforms offer data managers and engineers a unique opportunity to interact with distributed data at a scale that siloed data infrastructures could never support.</p> <h2 id="what-are-the-main-features-of-a-modern-data-platform">What are the main features of a modern data platform?</h2> <p>Data platforms offer enterprises a range of features:</p> <ul> <li>Data ingestion</li> <li>Data storage</li> <li>Data transformation</li> <li>Data modeling</li> <li>Data discovery</li> <li>Data observability</li> <li>Data security</li> <li>Business intelligence</li> </ul> <h3 id="data-platform-ingestion">Data platform ingestion</h3> <p>Useful data is generated at every layer of an application. Different formats, models, and protocols constrain data from these different domains accordingly. The act of incorporating this disparate data into a single, accessible data mass is called data ingestion.</p> <p>In a modern network, data ingestion is likely to happen at multiple points, as some data needs to be sampled, analyzed, or otherwise processed before reaching the central data store/lake/warehouse.</p> <p>Engineers build data pipelines for specific sources and destinations to structure incoming data for querying and subsequent analysis. Whether directly or through integrations, data platforms provide the tools needed to build and optimize these data pipelines.</p>
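<p>What might one stage of such a pipeline look like? Here is a minimal sketch in plain Python (the record layout is invented for illustration) that normalizes a raw flow record into a common schema before it reaches the central store:</p> <pre><code class="language-python">from datetime import datetime, timezone

def normalize_flow_record(raw: dict) -> dict:
    """Map one vendor-specific flow record onto a common schema.

    The field names here are illustrative; a real pipeline maps each
    source's layout (NetFlow, sFlow, VPC flow logs) explicitly.
    """
    return {
        "timestamp": datetime.fromtimestamp(raw["ts"], tz=timezone.utc).isoformat(),
        "src_ip": raw["src"],
        "dst_ip": raw["dst"],
        "bytes": int(raw.get("bytes", 0)),
        "protocol": raw.get("proto", "unknown").lower(),
    }

raw_record = {"ts": 1675200000, "src": "10.0.1.5", "dst": "10.0.2.9",
              "bytes": "4096", "proto": "TCP"}
print(normalize_flow_record(raw_record))
</code></pre>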
<p>These pipelines provide a standard method of synthesis that can be replicated across as many servers as it takes to ingest all of the data, allowing the rest of the system to operate at scale and serve analytics across multiple teams.</p> <h3 id="data-platform-storage">Data platform storage</h3> <p>Data platforms offer a central interface for accessing, querying against, and managing the data storage services in your network, across infrastructure boundaries like cloud providers.</p> <p>With storage metadata, data platforms offer engineers opportunities for additional insights into data access, resource allotment and capacity, performance, and more.</p> <h3 id="data-transformation">Data transformation</h3> <p>As data is ingested or moved between services, its values, structure, or format may need to be transformed to be consumable by its destination. To this end, data transformation may involve validating, scrubbing, and reconfiguring data. Some common data transformation techniques include data generalization, smoothing, aggregation, normalization, and the creation of novel data structures from ingested data (e.g., value1 * value2 = newDataPoint).</p> <p>ETL (extract, transform, load) systems modify and coordinate data with other data streams. This is often the first point of data enrichment for network operators, providing an opportunity to correlate things like IP addresses with DNS names, application stack tags, and deployment versions.</p> <h3 id="data-platform-modeling">Data platform modeling</h3> <p>Whether generating schemas and database profiles from abstract models or generating visualizations based on integrated data sources, an effective data platform will give engineers the data modeling tools they need to design, test, and better understand their data infrastructures.</p> <h3 id="data-platform-discovery">Data platform discovery</h3> <p>One of the principal features of a data platform is providing a single, central location for cross-boundary data discovery. By synthesizing data from every part of your data infrastructure, data discovery enables pattern recognition and trend analysis impossible for human operators alone.</p> <p>By being customizable and extensible, data platforms enable engineers to tailor their data discovery for various applications: dashboard creation, reporting, monitoring, cross-domain analytics, and more.</p> <h3 id="data-platform-observability">Data platform observability</h3> <p>Observability is the degree to which the internal state of a system can be deduced from its outputs. In distributed, hybrid cloud networks, <a href="https://www.kentik.com/product/core/">data observability</a> leverages information like logs, metrics, traces, and flow data to provide end-to-end visibility into the data lifecycle.</p> <p>With instrumentation, contexts such as business domain, customer data, geolocation, feature flags, and blue/green deployments (really, whatever proves to be useful) can be added to data as it moves through your networks.</p>
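<p>As a hedged sketch of what that enrichment can look like in practice, here is a toy step that tags a flow record with service and deployment context before it is stored (the lookup tables are invented for illustration; real ones would come from a CMDB, IPAM, or deployment metadata service):</p> <pre><code class="language-python"># Invented lookup tables standing in for a CMDB or IPAM service.
SUBNET_TO_SERVICE = {"10.0.1": "checkout-api", "10.0.2": "postgres-primary"}
SERVICE_TO_DEPLOY = {"checkout-api": "blue", "postgres-primary": "green"}

def enrich(record: dict) -> dict:
    """Attach service and deployment context to a flow record."""
    prefix = ".".join(record["src_ip"].split(".")[:3])
    service = SUBNET_TO_SERVICE.get(prefix, "unknown")
    record["service"] = service
    record["deployment"] = SERVICE_TO_DEPLOY.get(service, "unknown")
    return record

flow = {"src_ip": "10.0.1.5", "dst_ip": "10.0.2.9", "bytes": 4096}
print(enrich(flow))
# {'src_ip': '10.0.1.5', ..., 'service': 'checkout-api', 'deployment': 'blue'}
</code></pre>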
<p>This context adds dimensionality to querying and modeling and gives operators and engineers a much better view of end-user experience.</p> <p>A modern data platform will offer tools and interfaces for making the most of your observable data, including anomaly detection, root cause analysis, and highly nuanced data discovery.</p> <h3 id="data-platform-security">Data platform security</h3> <p>As modern SaaS solutions, data platforms offer organizations with highly complex data infrastructures the ability to manage access, craft and enforce security protocols, perform comprehensive threat analysis, and automate alerting and mitigation efforts.</p> <h3 id="business-intelligence-with-data-platforms">Business intelligence with data platforms</h3> <p>The context of this data innovation is simple: driving business goals forward and delighting customers.</p> <p>Data platform features like data observability, modeling, and discovery enable the next generation of business intelligence efforts. Offering unparalleled insight into customer behavior, logistics, and the economic performance of your data infrastructures, the ability to synthesize “big data” into novel and competitive insights is a crucial feature of data platforms.</p> <h2 id="what-are-the-advantages-of-a-data-platform">What are the advantages of a data platform?</h2> <p>Here are the key advantages of a data platform:</p> <ul> <li><strong>Centralized.</strong> By providing an abstraction that encompasses an enterprise’s data infrastructure, data platforms tear down data silos and give teams a single pane of glass for their data management.</li> <li><strong>Comprehensive.</strong> From data creation to destruction and everywhere in between, data platforms provide the context and visibility to get the most from your data wherever it is in its lifecycle.</li> <li><strong>Customizable.</strong> No two enterprises have identical data needs. Data platform features are highly customizable: visualizations, queries, and database management policies can all be fine-tuned from one service.</li> <li><strong>Extensible.</strong> Data platforms allow organizations to add features and build novel services incorporating the full scope of an organization’s data.</li> <li><strong>Scalable.</strong> Built to handle planet-scale operations, data platforms are designed to provide dynamic, scalable solutions across the data lifecycle.</li> <li><strong>Secure.</strong> By simplifying access, facilitating strategies like microsegmentation, and providing real-time threat analysis that incorporates all your data sources, data platforms offer unique security management, alerting, and mitigation capabilities.</li> </ul> <h2 id="what-is-network-observability">What is network observability?</h2> <p>Similar to data observability, network observability applies the notion that context is critical. In combination with logs, metrics, traces, and flow data, network observability incorporates device (virtual or physical) data with the rich contextualization made available through instrumentation.</p> <p>This context allows engineers and operators to “<a href="https://www.kentik.com/why-kentik-network-observability-and-monitoring/">ask anything</a>” and figure out the unknown unknowns in their networks. 
Network observability’s <a href="https://www.kentik.com/product/kentik-platform/">big data approach</a> helps teams transcend simply monitoring their networks and makes the most of machine learning and automation to identify and mitigate performance and security issues.</p> <h2 id="what-are-the-advantages-of-network-observability">What are the advantages of network observability?</h2> <p>Network observability’s unique approach offers distinct advantages to network operators:</p> <ul> <li>Visibility into all your networks</li> <li>Context</li> <li>Customizable queries</li> <li>Visualizations</li> <li>Real-time analysis</li> <li>Automated insights and solutions</li> </ul> <h2 id="supporting-network-observability-with-a-unified-data-platform">Supporting network observability with a unified data platform</h2> <p>By providing a central, single source of truth, a unified data platform reduces the risk of miscommunication. This is particularly important in large, complex networks where different teams are responsible for various aspects of network operations.</p> <p>Unified data platforms also facilitate using machine learning and artificial intelligence algorithms to detect and alert on network issues automatically. Although the utility of AI and ML in NetOps is emerging, having access to a unified data platform gives these technologies richer datasets.</p> <p>In short, data platforms make the “big data” analysis central to network observability a possibility.</p> <p>To see what network observability can do for you, <a href="https://www.kentik.com/get-demo/">get a Kentik demo</a> today.</p><![CDATA[Gathering, Understanding, and Using Traffic Telemetry for Network Observability]]><![CDATA[Traffic telemetry is the foundation of network observability. Learn from Phil Gervasi on how to gather, analyze, and understand the data that is key to your organization's success.]]>https://www.kentik.com/blog/gathering-understanding-using-traffic-telemetry-for-network-observabilityhttps://www.kentik.com/blog/gathering-understanding-using-traffic-telemetry-for-network-observability<![CDATA[Phil Gervasi]]>Thu, 09 Feb 2023 05:00:00 GMT<p>Traffic telemetry is the data collected from network devices and used for analysis. With traffic telemetry, engineers can gain real-time visibility into traffic patterns, correlate events, and make predictions of future traffic patterns. As a critical input to a network observability platform, this data can help monitor and optimize network performance, troubleshoot issues, and detect security threats.</p> <p>However, traffic telemetry can be difficult to understand. While some enterprises might gather the data, many don’t know what to do with it—or they don’t have the tools to make the most of it.</p> <p>In this article, we’ll look at:</p> <ul> <li>How to effectively gather traffic telemetry</li> <li>Tools for analyzing and understanding the collected data</li> <li>How to leverage traffic telemetry for network observability</li> </ul> <p>Let’s start by looking at data collection.</p> <h2 id="how-to-gather-traffic-telemetry">How to gather traffic telemetry</h2> <p>Gathering traffic telemetry involves <strong>monitoring and collecting data on the flow of network traffic</strong>. This can be done through a variety of techniques, such as using network taps, port mirroring, or software probes.</p> <h3 id="network-taps">Network taps</h3> <p>One common method of gathering traffic telemetry is by using a network tap. 
A network tap is a hardware device that enables the passive monitoring of network traffic by copying all data passing through a network connection. This data can then be analyzed and used to monitor network performance, detect issues, and troubleshoot problems.</p> <h3 id="port-mirroring">Port mirroring</h3> <p>Another method for gathering traffic telemetry is port mirroring, which replicates (mirrors) the network traffic on a switch or router to a monitoring device. With port mirroring, an engineer can monitor the traffic on specific ports or VLANs, which is especially useful for identifying issues with specific devices or applications.</p> <h3 id="software-probes">Software probes</h3> <p>Software probes are programs that can be installed on servers or network devices to collect data on network traffic. They can provide detailed information on traffic patterns and usage. <a href="http://envoyproxy.io/">Envoy</a> is a popular network proxy used by many service meshes that can also operate as a software probe.</p> <p>Many cloud providers already have capabilities in place for collecting traffic telemetry. Azure has the <a href="https://learn.microsoft.com/en-us/azure/network-watcher/network-watcher-monitoring-overview">Azure Network Watcher</a>, AWS has <a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html">CloudWatch</a>, and GCP has <a href="https://cloud.google.com/monitoring/docs">Cloud Monitoring</a>.</p> <p>Monitoring network traffic can be expensive. It’s easy to fall into the trap of capturing every bit that goes through your network, then storing it forever. This may give you perfect knowledge, but unless this amount of information is required for compliance and regulatory reasons, this approach is undoubtedly overkill. You should think carefully about how much traffic telemetry you collect, balancing the value you derive from the data against the cost of capturing and storing it.</p> <h2 id="tools-for-analyzing-and-understanding-traffic-telemetry">Tools for analyzing and understanding traffic telemetry</h2> <p>The raw data of traffic telemetry is too massive and low-level to understand directly and utilize. You’ll need to process this data, normalizing and aggregating it. Then, you’ll correlate it with other data sources. Finally, you can start the analysis after storing the processed data to identify trends and patterns.</p> <p>Popular open-source projects for storage and visualization of traffic telemetry data (and other types of data) include:</p> <ul> <li><strong>Prometheus and Grafana</strong>: This is a powerful combination of the Prometheus time series database (that focuses on metrics collection) and the Grafana dashboarding-and-alerting platform.</li> <li><strong>Elasticsearch and Kibana</strong>: Geared primarily towards log data, this combination is often used along with Logstash or Fluentd to form the ELK (Elasticsearch + Logstash + Kibana) or EFK (Elasticsearch + Fluentd + Kibana) stacks.</li> </ul> <p>Kentik has several open-source projects in the area of network observability. 
Check out <a href="https://kentiklabs.com/">Kentik Labs</a> for more information.</p> <p>In the commercial space, <a href="/product/kentik-platform/">Kentik’s network observability platform</a> offers a streamlined, turnkey solution that can integrate with your existing infrastructure and open-source observability solutions.</p> <h2 id="how-to-leverage-traffic-telemetry-for-network-observability">How to leverage traffic telemetry for network observability</h2> <p>Gathering and analyzing data might be fun, but at the end of the day, the goal is to leverage your data and gain value from it. Let’s consider simple use cases for leveraging traffic telemetry to troubleshoot network issues and improve performance.</p> <h3 id="identifying-bottlenecks">Identifying bottlenecks</h3> <p>You can identify bottlenecks by looking for high utilization on specific network segments, large numbers of dropped packets, or retransmissions.</p> <h3 id="detecting-malware-and-ddos-attacks">Detecting malware and DDoS attacks</h3> <p>You can detect malware and DDoS attacks by looking for unusual traffic volumes, unexpected traffic destinations or sources, or abnormal protocol usage. Correlate the data with other information, such as firewall logs, threat intelligence feeds, and endpoint security data. <img src="https://images.ctfassets.net/6yom6slo28h2/4YO6vcn0wqeDT40qdZHsl0/cfbc2484e5674b511d119f387d9390fa/home-protect-ddos.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="DDoS defense in the Kentik platform" /></p> <div class="caption" style="margin-top: -30px;">Your network observability solution can help you detect security incidents or malicious threats by analyzing network telemetry for unusual traffic spikes or patterns.</div> <h3 id="capacity-planning">Capacity planning</h3> <p>Traffic telemetry can assist you with capacity planning as you look for daily and weekly traffic patterns, seasonal variations, and traffic spikes. With these insights in hand, you can use statistical techniques such as trend analysis, time series analysis, or forecasting algorithms to predict future traffic patterns. The forecast will allow you to determine future resource needs.</p> <h3 id="identifying-devices-using-a-lot-of-network-resources">Identifying devices using a lot of network resources</h3> <p>Look for patterns such as high utilization on specific network segments, large numbers of packets sent or received, or high numbers of retransmissions. Then, correlate that data with information such as IP addresses, MAC addresses, and device names. 
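</p> <p>For instance, a hedged sketch of that correlation step, joining flow records against a device inventory and ranking devices by bytes transferred (pandas, with invented sample data), might look like this:</p> <pre><code class="language-python">import pandas as pd

# Invented sample data; real inputs would be flow exports and an
# inventory pulled from DHCP, IPAM, or an asset database.
flows = pd.DataFrame({
    "src_ip": ["10.0.1.5", "10.0.1.7", "10.0.1.5", "10.0.1.9"],
    "bytes":  [52_000_000, 1_200_000, 48_000_000, 300_000],
})
inventory = pd.DataFrame({
    "src_ip": ["10.0.1.5", "10.0.1.7", "10.0.1.9"],
    "device_name": ["build-server-3", "printer-2f", "hvac-controller"],
})

# Join, then total bytes per device, largest first.
top_talkers = (
    flows.merge(inventory, on="src_ip", how="left")
         .groupby(["src_ip", "device_name"], as_index=False)["bytes"].sum()
         .sort_values("bytes", ascending=False)
)
print(top_talkers)
</code></pre> <p>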
In doing so, you’ll identify specific devices that are using a lot of network resources.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/61sxKAjYcmvjpxR0TYeXzR/0263c3c38a7fd4d4c7120c403e3252a1/observation-deck.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik dashboard with network telemetry data" /> <div class="caption" style="margin-top: -30px;">Dashboards from Kentik’s Network Observability Platform provide visualizations to help you make sense of your network telemetry data.</div> <p>As you take concrete action on your traffic telemetry, consider these best practices:</p> <ul> <li>Regularly review network configurations to identify and correct misconfigurations.</li> <li>Regularly update and patch servers and devices for security and performance.</li> <li>Identify and isolate malicious traffic and attackers.</li> <li>Detect and alert on non-encrypted traffic.</li> <li>Implement continuous monitoring over time.</li> </ul> <h2 id="conclusion">Conclusion</h2> <p>Traffic telemetry is the foundation of your network observability, providing several benefits that include improved visibility, faster troubleshooting, network optimization, better security, tighter compliance, and response automation. However, putting in place a full-fledged network observability solution is not for the faint of heart. You’ll need good tools and a robust network observability platform. To learn more about Kentik, <a href="#demo_dialog" title="Request your Kentik demo">sign up for a demo</a>.</p><![CDATA[Data Gravity in Cloud Networks: Achieving Escape Velocity]]><![CDATA[In the second of our data gravity series, Ted Turner examines how enterprises can address cost, performance, and reliability and help the data in their networks achieve escape velocity.]]>https://www.kentik.com/blog/data-gravity-in-cloud-networks-achieving-escape-velocityhttps://www.kentik.com/blog/data-gravity-in-cloud-networks-achieving-escape-velocity<![CDATA[Ted Turner]]>Wed, 08 Feb 2023 05:00:00 GMT<p>In an ideal world, organizations can establish a single, citadel-like data center that accumulates data and hosts their applications and all associated services, all while enjoying a customer base that is also geographically close. As this <a href="https://www.kentik.com/blog/data-gravity-in-cloud-networks-massive-data/">data grows in mass and gravity</a>, it’s okay because all the new services, applications, and customers will continue to be just as close to the data. This is the “have your cake and eat it too” scenario for a scaling business’s IT.</p> <div as="WistiaVideo" videoId="xwqjildv55" audio></div> <p>But what are network operators to do when their cloud networks <em>have</em> to be distributed, both architecturally and geographically? Suddenly, this dense data mass is a considerable burden, and the same forces that happily drew in service and customer data find that that data is now trapped and extremely expensive and complicated to move. 
As a cloud network scales, three forces compete against the acceptable accumulation of this gravity: cost, performance, and reliability.</p> <p>In this second installment in my data gravity series, I want to examine how enterprises fight this gravity and help the data in their networks achieve escape velocity.</p> <p>As a reminder, “escape velocity” (<a href="https://datagravitas.com/2010/12/18/cloud-escape-velocity-switching-cloud-providers/">thanks Dave</a>) in the context of cloud networks refers to the amount of effort required to move data between providers or services. Data mass, data gravity, and escape velocity are all directly related.</p> <p>To see this played out, I want to use three discrete points along the cloud network lifecycle where many of today’s companies find themselves:</p> <ul> <li>Lift and shift</li> <li>Multizonal, hybrid</li> <li>Multi-cloud, global, observable</li> </ul> <p>These stages do not, in my opinion, represent an evolution in complexity for complexity’s sake. They do, however, represent an architectural response to the central problem of data gravity.</p> <h2 id="all-systems-go">All systems go</h2> <p>The lift and shift, re-deploying an existing code base “as is” into the cloud, is many organizations’ first step into cloud networking. This might mean a complete transition to cloud-based services and infrastructure or isolating an IT or business domain in a microservice, like data backups or auth, and establishing proof-of-concept.</p> <p>Either way, it’s a step that forces teams to deal with new data, network problems, and potential latency. One of my personal experiences is with identity. My first experience breaking down applications into microservices and deploying in new data centers failed due to latency and data gravity in our San Diego data center. San Diego was where all of our customer data was stored. Dallas, Texas, is where our customer data was backed up for reliability/fault tolerance.</p> <p>As we started our journey to refactor our applications into smaller microservices, we built a cool new data center based on the premise of hyperscaling. The identity team built the IAM application for high volume processing of customer logins and validation of transaction requests. We did not account for the considerable volume of logons at a distance.</p> <p>San Diego is <a href="https://en.wikipedia.org/wiki/2011_Southwest_blackout">electricity poor</a>, and our executives decided to build in a more energy-rich, mild climate with less seismic activity. We built inside the Cascade mountain range near the Seattle area.</p> <p>What we found is that our cool new data center, built using the latest hyperscaler technologies, suffered from one deficiency – latency and distance from our customer data staged/stored in San Diego. The next sprint was to move the customer data to the new hyperscaler facility near Seattle. 
Once the customer data was staged next to the identity system, most applications worked great for the customers.</p> <p>Over time, the applications were refactored to account for application-tier latency (introduced by network distance) and slowly degrade in place instead of failing outright.</p> <p>The lesson here is that latency kills large data applications when customers (or staff) need to access their data live.</p> <h2 id="second-stage-distributing-your-networks-data">Second stage: Distributing your network’s data</h2> <p>How do IT organizations maintain cost, performance, and reliability targets as networks scale and cater to business and customer needs across a wider geographical footprint?</p> <p>Let’s elaborate on the previous e-commerce example. The cloud architecture has matured as the company has grown, with client-facing apps deployed in several regional availability zones for national coverage. Sensitive data, backups, and analytics are being handled in their Chicago data center, with private cloud access via backbones like ExpressRoute or Direct Connect to help efficiently move data out of the cloud and into the data center.</p> <p>In terms of performance, the distribution into multiple zones has improved latencies and throughput and decreased data gravity in the network. With multiple availability zones and fully private backups, this network’s reliability has significantly improved. Though it must be said, this still leaves the network vulnerable to a provider-wide outage.</p> <p>Regarding cost, the improvements can be less clear at this stage. As organizations progress their digital transformation, it is not simply a matter of new tech and infrastructure. IT personnel structure will need to undergo a corresponding shift as service models change, needed cloud competencies proliferate, and teams start to leverage strategies like continuous integration and continuous delivery/deployment (CI/CD). These adaptations can be expensive at the outset. I wrote another series, <a href="https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-i/">Managing the hidden costs of cloud networking</a>, that outlines the many ways migrating to cloud networks can lead to unforeseen expenses.</p> <h3 id="adapting-to-distributed-scale">Adapting to distributed scale</h3> <p>The tools and strategies needed to deploy and manage a multizonal network successfully are the building blocks for a truly scalable digital transformation. Once in place, this re-org of both IT architecture and personnel enables a dramatic increase in the size of datasets that network operators need to contend with.</p> <p>There are many tools and concepts in this effort to distribute data gravity that many network and application engineers will already be familiar with:</p> <ul> <li>Load balancer</li> <li>Content delivery networks</li> <li>Queues</li> <li>Proxies</li> <li>Caches</li> <li>Replication</li> <li>Statelessness</li> <li>Compression</li> </ul> <p>What do these have to do with data gravity? You’ll recall that throughput and latency are the two main drivers of data gravity.</p>
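<p>The arithmetic behind those two drivers is easy to underestimate. This back-of-the-envelope sketch (plain Python, with round illustrative numbers) shows why a sufficiently massive dataset resists being moved at all:</p> <pre><code class="language-python">def transfer_days(dataset_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Rough days needed to move a dataset over a link at a given utilization."""
    bits = dataset_tb * 8e12                       # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 86_400

# Moving 500 TB over a 10 Gbps link at 80% sustained utilization:
print(f"{transfer_days(500, 10):.1f} days")        # roughly 5.8 days
</code></pre>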
<p>Essentially, the tools listed above all serve to abstract either the distance or the size of the data to maintain acceptable throughput and latency as the data’s point-of-use moves further and further away from your data centers or cloud zones.</p> <p>It becomes imperative that network operators/engineers and data engineers work closely together to think about where data is being stored, for how long, when and how often it is replicated or transformed, where it is being analyzed, and how it is being moved within and between the subnets, SD-WANs, VPNs, and other network abstractions that make up the larger network.</p> <h3 id="reliability">Reliability</h3> <p>Maintaining the integrity of networks has implications beyond uptime. Security, data correctness, and consistency mean that cloud networks must be highly reliable. This means the infrastructure and processes necessary to ensure that data stays available must be put in place <em>before</em> an incident occurs.</p> <p>At scale, this reliability becomes one of the more significant challenges with data gravity. As we’ve discussed, moving, replicating, storing, and otherwise transforming massive data sets is cost-prohibitive. Overcoming these challenges brings networks into their most mature, cloud-ready state.</p> <h2 id="a-network-of-gravities">A network of gravities</h2> <p>Reformatting personnel and IT to perform in a more distributed context improves application performance, sets the groundwork for cost-effective scaling, and creates a more reliable network with geographical redundancies.</p> <p>Take a hypothetical organization that has scaled well on the multizonal, single-provider cloud architecture outlined in the previous section. As it grew on this new framework, one of its leading competitors was brought down by a cloud provider outage. The outage took down one of the competitor’s critical security services and exposed it to attack. Taking the hint, the organization’s board room pushed to refactor the codebase and networking policies to support multi-cloud deployments. As with every previous step along this digital transformation, this new step introduces added complexity and strains against data gravity.</p> <p>Another way of achieving multi-cloud deployments in 2023 is acquisitions. We see many current conglomerates acquiring technologies and customers that reside in cloud providers different from their core deployments.</p> <h3 id="multi-cloud-networks">Multi-cloud networks</h3> <p>For security, monitoring, and resource management, a fundamental principle in multi-cloud networks is <em>microsegmentation</em>. Like microservices, networks are divided into discrete units complete with their own compute and storage resources, dependencies, and networking policies.</p> <p>Now take the same principles of microsegmentation when considering network access to resources worldwide for physical retail, office space, warehousing, and manufacturing. The resources reach into the cloud to find and store reference data (customer purchase history lookup, manufacturing CAD/CAM files, etc.). 
The complexity is dizzying!</p> <p><a href="https://twitter.com/forrestbrazeal/status/1612473738259316736?s=20"><img src="https://images.ctfassets.net/6yom6slo28h2/2Im2XJrKnQzqghhmAQyoAT/2edea100e122e92ea7e19ffae6e09565/forrest-brazeal-tweet.png" style="max-width: 500px;" class="image center" alt="Tweet from Forrest Brazeal, January 9, 2023" /></a></p> <h2 id="managing-data-complexity-with-network-observability">Managing data complexity with network observability</h2> <p>The tools and abstractions (data pipelines, ELT, CDNs, caches, etc.) that help distribute data in a network add significant complexity to monitoring and observability efforts. In the next installment of this series, we will consider how to leverage network observability to improve data management in your cloud networks.</p> <p>In the meantime, check out the <a href="https://www.kentik.com/product/cloud/">great features in Kentik Cloud</a> that make it the network observability solution of choice for today’s leading IT enterprises.</p><![CDATA[Digging Into the Recent Azure Outage]]><![CDATA[In the early hours of Wednesday, January 25, Azure, Microsoft’s public cloud, suffered a major outage that disrupted their cloud-based services and popular applications such as SharePoint, Teams, and Office 365. In this post, we’ll highlight some of what we saw using Kentik’s unique capabilities, including some surprising aftereffects of the outage that continue to this day.]]>https://www.kentik.com/blog/digging-into-the-recent-azure-outagehttps://www.kentik.com/blog/digging-into-the-recent-azure-outage<![CDATA[Doug Madory]]>Fri, 03 Feb 2023 05:00:00 GMT<p>In the early hours of Wednesday, January 25, Microsoft’s public cloud suffered a major outage that disrupted their cloud-based services and popular applications such as SharePoint, Teams, and Office 365. Microsoft has since <a href="https://status.azure.com/en-us/status/history/#:~:text=Preliminary%20Post%20Incident%20Review%20(PIR)%20%E2%80%93%20Azure%20Networking%20%E2%80%93%20Global%20WAN%20issues%20(Tracking%20ID%20VSG1%2DB90)">blamed the outage</a> on a flawed router command that took down a significant portion of the cloud’s connectivity beginning at 07:09 UTC.</p> <p>In this post, we’ll highlight some of what we saw using Kentik’s unique capabilities, including some surprising aftereffects of the outage that continue to this day.</p> <h2 id="azure-outage-in-synthetic-performance-tests">Azure outage in synthetic performance tests</h2> <p>Kentik operates an agent-to-agent <a href="https://www.kentik.com/go/webinar/performance-mesh-testing/">mesh</a> for each of the public clouds — a portion of which is freely available to any Kentik customer in the <a href="https://kb.kentik.com/v4/Ma09.htm#Ma09-Public_Clouds_Tab">Public Clouds tab</a> of the State of the Internet page. When Azure began having problems last week, our performance mesh lit up like a Christmas tree with red alerts mixed in with green statuses. A partial view is shown below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2oL6kPXXcz90FmzjdMjG1k/d99459c7a6afe6cd5c36f8b826be0e1c/azure-christmas-tree.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Azure outage in the Kentik portal" /> <p>Clicking into any of these red cells takes the user to a view such as the one below, revealing latency, packet loss, and jitter between our agents hosted in Azure. 
The screenshot below shows a temporary disconnection from 07:20 to 08:40 UTC between Azure’s <code class="language-text">westus2</code> region in Seattle and its <code class="language-text">southeastasia</code> region in Singapore.</p> <img src="//images.ctfassets.net/6yom6slo28h2/ljU15ARogkLkblS03LCPk/5e045406ba23b4ea1831aff1937ccfe3/seattle-se-asia-disconnect.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Temporary disconnection between Seattle and Singapore" /> <p>However, not all intra-Azure connections suffered disruptions like this. While <code class="language-text">westus2</code>’s link to <code class="language-text">southeastasia</code> may have gone down, its link to <code class="language-text">eastus</code> in the Washington DC area only suffered minor latency bumps, which were within our alerting thresholds.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6KMZs5KQ2kZj6ey379XVkW/44b8339a63fc25f079b09cd0f014be22/seattle-dc-connection.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Minor latency between Seattle and DC" /> <p>To give another curious example of the variety of impacts from this outage, consider these two equivalent measurements from different cities in South Korea to Canada. Tests between <code class="language-text">koreasouth</code> in Busan and <code class="language-text">canadaeast</code> in Quebec were failing between 07:50 and 08:25 UTC. At the same time, connectivity between <code class="language-text">koreacentral</code> in Seoul and the same region in Canada hardly registered any impact at all.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2fUCJOXENQz8G6DuU26C54/4d171ab6e0a3379aad4ae0b2ebe4fbb6/busan-quebec.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Busan to Quebec" /> <div class="caption" style="margin-top: -30px">Tests between Busan and Quebec</div> <img src="//images.ctfassets.net/6yom6slo28h2/6bjFFYeC6r0wma9GGeVvNW/a7172b9f67dad58d819ae645e9814c0c/seoul-quebec.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Seoul to Quebec" /> <div class="caption" style="margin-top: -30px">Tests between Seoul and Quebec</div> <p>It is hard to know exactly why some parts of Azure’s connectivity were impacted while others were not without intimate knowledge of their architecture, but we can surmise that Azure’s loss of connectivity varied greatly depending on the source and destination.</p> <h2 id="azure-outage-in-netflow">Azure outage in NetFlow</h2> <p>When we look at the impact of the outage on Microsoft services in aggregate NetFlow based on Kentik’s <a href="https://kb.kentik.com/v4/Ha02.htm">OTT service tracking</a>, we arrive at the graphic below, which shows two distinct drops in traffic volume at 07:09 UTC and 07:42 UTC and an overall decreased level of traffic volume until 08:43 UTC.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5Sva65Z5Dx8VjASLBAhUZ/e300de145c88ba3f0fbdcb5ad0cb6ce3/ms-services-two-drops.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Drops in traffic based on aggregate NetFlow" /> <p>Given the time of day of the outage (07:09 UTC), the impact was more disruptive in East Asia. If we look at the aggregate NetFlow from our service provider customers in Asia to Microsoft’s AS8075, we see a <em>surge in traffic</em>. 
Why, you might ask, would there be a surge in traffic during an outage?</p> <img src="//images.ctfassets.net/6yom6slo28h2/7AQ6D3RahSqt4P37Iq6TU9/3a3df52d1ad912c19db3ef6883f0c592/azure-outage-traffic-surge.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Surge in traffic during an outage" /> <p>This has to do with the fact that many peering sessions with AS8075 were disrupted during the outage and traffic shifted to traversing transit links and onto the networks of Kentik’s customers in Asia. We touch on this phenomenon again in the next section.</p> <h2 id="azure-outage-in-bgp">Azure outage in BGP</h2> <p>And finally, let’s take a look at this outage from the perspective of BGP. Microsoft uses a few ASNs to originate routes for its cloud services. AS8075 is Microsoft’s main ASN, and it originates over 370 prefixes. In this incident, 104 of those prefixes exhibited no impact whatsoever — see 45.143.224.0/24, 2.58.103.0/24, and 66.178.148.0/24 as examples.</p> <div as="Promo"></div> <p>However, the other prefixes originated by Microsoft exhibited a high degree of instability, indicating that something was awry inside the Azure cloud. Let’s take a look at a couple of examples.</p> <p>Consider the plight of 20.47.42.0/24, which is used at Azure’s <code class="language-text">centralindia</code> region located in Pune, India. Kentik’s BGP monitoring, pictured below, presents three time series statistics above a ball-and-stick AS-level diagram.</p> <p>In this particular case, the prefix was withdrawn at 07:12 and later restored at 10:36 UTC. The upper plot displays the percentage of our BGP sources that had 20.47.42.0/24 in their routing tables over this time period — the outage is marked with an arrow. The route withdrawals and re-announcements are tallied in the BGP Events timeline and contribute to corresponding spikes of AS Path Changes.</p> <img src="//images.ctfassets.net/6yom6slo28h2/y1kamwRANU5w2zXoJuQNU/cd9237d3f227e373519b32ea21e31dc2/route-withdrawals-announcements.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP events timeline and AP path changes" /> <p>Not every Microsoft prefix experienced such a cut-and-dry outage. Consider 198.180.97.0/24 used in <code class="language-text">eastus2</code> pictured below. The upper graphic is a stacked plot that conveys how the internet reaches AS8075 for this prefix over time. As is common with cloud providers, AS8075 is a highly peered network, meaning much of the internet reaches it via a myriad of peering relationships.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7qkp6pxeh2850nKUH9nHWR/b6cdae405ce6bb5b7bb918823c8e6c7d/highly-peered-network.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Highly peered network example" /> <p>Since each of those relationships will be seen by only a few BGP vantage points, they get grouped together in the <em>Other</em> category (deep purple band above). Once the problems start at 07:09 UTC, we can see spikes in the AS Path Changes timeline indicating that something is happening with 198.180.97.0/24.</p> <p>At that time the share of <em>Other</em> decreases in the stacked plot, meaning many networks on the internet are losing their peering with Microsoft. As <em>Other</em> decreases, AS8075’s transit providers (Arelion, Lumen, GTT, and Vocus) see their share increase as the internet is shifting to reach AS8075 via transit, having lost a portion of its peering links to Microsoft. 
This is consistent with what we were seeing in the NetFlow from our service provider customers in Asia in the previous section.</p> <h2 id="azure-outages-bgp-hijacks">Azure outage’s BGP hijacks</h2> <p>The most curious part of this outage is that AS8075 appeared to begin announcing a number of new prefixes, some of which are hijacks of routed address space belonging to other networks. At the time of this writing, these routes are still in circulation.</p> <p>Between 07:42 and 07:48 UTC on January 25th, 157.0.0.0/16, 158.0.0.0/16, 159.0.0.0/16, and 167.0.0.0/16 appeared in the global routing table with the following AS path: <code class="language-text">… 2764 4826 8075</code>.</p> <p>AS2764 and AS4826 belong to the Australian telecoms AAPT and Vocus, respectively. The problem is that these routes belong to China Unicom, the US Department of Defense, and Saudi Telecom Company (STC), and were already being originated by AS4837, AS721, and AS25019, respectively. 167.0.0.0/16 is presently unrouted, but belongs to <a href="https://bgp.tools/prefix/167.0.0.0/16#whois">Telefónica Colombia</a>.</p> <p>Below is Kentik’s BGP visualization showing that 9.3% of our BGP vantage points see AS8075 as the origin of this prefix belonging to the Saudi Arabian incumbent.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6np47LuXPhEdNOrBAWIq6a/4a17fcf1265a795cd2bef9a0f208adb8/bgp-visualization-saudi-arabia.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="BGP visualization - Saudi Arabia" /> <div class="caption" style="margin-top: -30px">Kentik BGP visualization of 159.0.0.0/16.</div> <p>Shortly afterwards, AS8075 also began appearing as the origin of several other large ranges, including 70.0.0.0/15 (a more-specific of T-Mobile’s 70.0.0.0/13). All of these routes contain AS paths ending with “2764 4826 8075”, and, for the time being, they have largely remained within Australia, limiting the disruption that they might have otherwise caused.</p> <p>What is strange is that despite AS4826 being in the AS path, Vocus does not report these routes in their <a href="https://lg.vocus.network/">looking glass</a>. Meanwhile, over on <a href="http://looking-glass.connect.com.au/lg/">AAPT’s looking glass</a>, it reports the routes as coming from Vocus (AS4826):</p> <img src="//images.ctfassets.net/6yom6slo28h2/3aybARBO7liYa6zALQY7a0/577b466dd28fdef838502cf3b1c54690/vocus.png" style="max-width: 700px;" class="image center" thumbnail alt="Vocus looking glass" /> <img src="//images.ctfassets.net/6yom6slo28h2/j7tavlJPZdlevtUsLVzfQ/43a804c2f5de242099fa745e825f87cb/tpg-aapt-looking-glass.png" style="max-width: 700px;" class="image center" thumbnail alt="AAPT looking glass" /> <p>We contacted Microsoft and Vocus, and they are actively working to remove these problematic BGP routes.</p> <p>This isn’t the first routing snafu involving Microsoft and an Australian provider. In a <a href="https://www.linkedin.com/in/dougmadory/overlay/1635515851078/single-media-viewer/?profileId=ACoAAABgDfMBQqg6K3WmEobvLYAvoetjdTSF-R0">presentation</a> on routing leaks I gave at <a href="https://archive.nanog.org/meetings/nanog63/agenda">NANOG 63</a> eight years ago <em>to the day</em>, I described a situation where Microsoft had mistakenly removed the BGP communities that prevented the routes they announced through Vocus from propagating back to the US. 
The result was that for six days, traffic from users in the US going to Microsoft’s US data centers was routed through Australia, incurring a significant latency penalty.</p> <img src="//images.ctfassets.net/6yom6slo28h2/JeJDEaOopzuZZc7nYcvxm/9b91e2dc8b29e5bc75b46e74b1a9eee8/dyn-san-jose-australia.png" style="max-width: 800px;" class="image center" thumbnail alt="Presentation from NANOG 63" /> <h2 id="conclusion">Conclusion</h2> <p>Following the outage, Microsoft published a <a href="https://status.azure.com/en-us/status/history/#:~:text=Preliminary%20Post%20Incident%20Review%20(PIR)%20%E2%80%93%20Azure%20Networking%20%E2%80%93%20Global%20WAN%20issues%20(Tracking%20ID%20VSG1%2DB90)">Post Incident Review</a> that summarized the root cause in the following paragraph:</p> <blockquote>As part of a planned change to update the IP address on a WAN router, a command given to the router caused it to send messages to all other routers in the WAN, which resulted in all of them recomputing their adjacency and forwarding tables. During this re-computation process, the routers were unable to correctly forward packets traversing them.</blockquote> <p>Reminiscent of the <a href="https://www.kentik.com/blog/facebooks-historic-outage-explained/">historic Facebook outage of 2021</a>, a flawed router command appears to have taken down a portion of Azure’s WAN links. If I understand their explanation correctly, the command triggered an internal update storm. This is similar to what happened during the <a href="https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-outage/">Rogers outage of summer 2022</a>, when an internal routing leak overwhelmed their routers, causing them to drop traffic. As was the case with the Rogers outage, the instability we can observe in BGP was likely a symptom, not the cause, of this outage.</p> <p>Managing the networking at a major public cloud like Microsoft’s Azure is no simple task, <em>nor is analyzing their outages when they occasionally occur</em>. Understanding all of the potential impacts and dependencies in a highly complex computing environment is one of the top challenges facing networking teams in large organizations today.</p> <p>It is these networking teams that we think about at Kentik. The key is having the ability to <em>answer any question</em> from your telemetry tooling; this is a core tenet of Kentik’s approach to <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/">network observability</a>.</p><![CDATA[Best Practices for Enriching Network Telemetry to Support Network Observability]]><![CDATA[Collecting and enriching network telemetry data with DevOps observability data is key to ensuring organizational success. Read on to learn how to identify the right KPIs, collect vital data, and achieve critical goals.]]>https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observabilityhttps://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability<![CDATA[Phil Gervasi]]>Wed, 01 Feb 2023 05:00:00 GMT<p><a href="https://www.kentik.com/kentipedia/what-is-network-observability">Network observability</a> is critical. You need the ability to answer any question about your network—across clouds, on-prem, edge locations, and user devices—quickly and easily.</p> <p>But network observability is not always easy. To be successful, you need to collect <strong>network telemetry</strong>, and that telemetry needs to be extensive and diverse. 
And once you have that raw telemetry data, you need to interpret it. And even then, key questions—such as <em>Am I using my network resources effectively?</em>—are not always easy to answer.</p> <p>To answer the business-level questions that can move the needle, you need to enrich your network telemetry. This post will provide concrete guidance on how to do just that. We’ll look at how, by combining DevOps observability data with network telemetry, you can get strong, network-focused observability. Let’s begin with a discussion of KPIs.</p> <h2 id="identifying-key-performance-indicators-kpis-in-networking">Identifying Key Performance Indicators (KPIs) in networking</h2> <p>The first step toward comprehensive network observability is identifying your key performance indicators. Here are some examples of network-related KPIs:</p> <ul> <li>Network latency</li> <li>Packet loss</li> <li>Throughput</li> <li>Connections per second</li> <li>Bandwidth utilization</li> </ul> <p>Note that these KPIs can be aggregated at different levels of the hierarchy—individual endpoints or instances, multi-instance services, entire data centers, across regions, and globally.</p> <p>After identifying and categorizing the relevant KPIs, you need to gather data about these KPIs. Network monitoring tools use various techniques for data gathering, including polling, collecting metrics from network devices, and scraping traffic logs.</p> <p>In the cloud, you can ingest network telemetry data from cloud providers into your network observability platform. In your own data centers, you will need to choose, install, and configure network monitoring tools.</p> <h2 id="identifying-and-collecting-data-from-auxiliary-data-sources">Identifying and collecting data from auxiliary data sources</h2> <p>The next step is to collect the auxiliary data that will be used to enrich the network telemetry data. Let’s cover the different major types of auxiliary data.</p> <h3 id="logs-and-events">Logs and events</h3> <p>Log files from network devices, servers, and applications can contain information relevant to your network observability KPIs. The basic process looks like this:</p> <ol> <li>Extract relevant information.</li> <li>Correlate that information with network telemetry using timestamps, shared tags, and geographical locations.</li> <li>Ingest the auxiliary data into your network observability platform.</li> </ol> <p>Events, such as alerts generated by network devices, can also be ingested into the observability platform, potentially triggering a higher-level alert.</p> <h3 id="endpoint-telemetry">Endpoint telemetry</h3> <p>Endpoint telemetry refers to data collected from devices that are connected to the network, such as laptops, tablets, and smartphones. This data may include performance metrics and resource usage of the devices, as well as the applications and services running on them. This endpoint telemetry data, too, can be used to enrich network telemetry.</p> <p>For example, if you see a spike in CPU usage on endpoint devices, this might indicate an issue on the network, causing the devices to work harder than usual.</p> <p>As another example, let’s assume you see an increase in network latency.
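<p>How might that latency increase be flagged in the first place? Here is a minimal sketch of the kind of trailing-baseline check an observability pipeline might run against latency KPI samples. The data shape, window size, and threshold are illustrative assumptions for this post, not any particular product’s logic:</p> <pre><code class="language-python">from statistics import mean, stdev

def latency_alerts(samples, window=60, sigma=3.0):
    """Flag latency samples that sit well above a trailing baseline.

    samples: list of (timestamp, latency_ms) tuples, oldest first.
    Returns the (timestamp, latency_ms) pairs that exceed the trailing
    baseline mean by more than sigma standard deviations.
    """
    alerts = []
    for i in range(window, len(samples)):
        baseline = [lat for _, lat in samples[i - window:i]]
        threshold = mean(baseline) + sigma * stdev(baseline)
        ts, lat = samples[i]
        if lat > threshold:
            alerts.append((ts, lat))
    return alerts
</code></pre> <p>A production system would use far more robust statistics, but the principle is the same: a KPI only becomes actionable once it can be compared against an expected baseline.</p>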
As part of your investigation into the issue, you can use endpoint telemetry data to see if there are changes in network access patterns on endpoint devices.</p> <h3 id="application-level-telemetry">Application-level telemetry</h3> <p>Application-level telemetry refers to data collected from the applications and services running on the network, such as web servers, databases, and custom business applications. This data includes performance, errors, and resource usage by these applications and services.</p> <p>Imagine that your monitoring of application-level telemetry shows a spike in response times. This might indicate an issue on the network that is causing the application to wait longer for network responses. Application-level telemetry can help you determine if your network is having problems. When properly correlated with network telemetry, it can even help you with root cause analysis.</p> <p>When considering observability at the application level, take advantage of distributed tracing, making sure to use it comprehensively. This can be especially helpful for enriching network telemetry if your system is based on a microservice architecture.</p> <h3 id="leveraging-ai-and-ml-to-gain-insights">Leveraging AI and ML to gain insights</h3> <p>Your network observability platform should have dashboards and visualizations for humans to understand overall network health and performance. However, at scale, humans alone can’t detect and respond to issues fast enough.</p> <p>Machine learning (when implemented and trained properly) excels at digesting high-dimensionality data like enriched network telemetry. It can identify trends, predict future outcomes, and discover anomalies. These are network observability insights that even keen-eyed human operators would be unable to spot.</p> <p>In addition, <a href="https://www.kentik.com/blog/the-reality-of-machine-learning-in-network-observability/">AI/ML-backed tools</a> can be used to summarize and consolidate complex data to make it digestible by humans. As they help human operators understand the state of the network, these tools can also recommend courses of action during incidents.</p> <h2 id="key-considerations-for-managing-and-storing-enriched-telemetry-data">Key considerations for managing and storing enriched telemetry data</h2> <p>Now, let’s look at a few of the key factors you’ll want to consider when managing and storing all this enriched telemetry data.</p> <p>First, when collecting the data, accounting for <strong>user privacy is imperative</strong>. You need to be aware of the types of data you feed into your network observability platform and ensure you comply with all relevant laws and regulations.</p> <p>Next, observability doesn’t come cheap. It is easy to collect a lot of data, but you must consider the cost of collection, storage, and analysis and weigh that against the value that you derive from your data. For example, do you need to capture and analyze every network packet, or is it sufficient to analyze only 10% of the packets? Do you need to store your flow logs forever, or can you purge them after two years?</p> <p>Finally, the value of enriching network telemetry is clear. However, an organization must manage and store all that data appropriately in order to reap the benefits. This is where a network observability platform like Kentik comes in.
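<p>To see why, recall the extract-correlate-ingest process described earlier. Even a toy version of the correlate step, joining auxiliary log events to flow records by shared device and time window, takes some care. The field names and the in-memory join below are invented for illustration and would never scale to real telemetry volumes:</p> <pre><code class="language-python">from datetime import timedelta

def enrich_flows(flows, events, window_s=30):
    """Attach log events to flow records from the same device that
    occurred within window_s seconds. A toy, in-memory join.

    flows:  dicts like {"ts": datetime, "device": str, "bytes": int}
    events: dicts like {"ts": datetime, "device": str, "message": str}
    """
    window = timedelta(seconds=window_s)
    enriched = []
    for flow in flows:
        related = [
            e["message"]
            for e in events
            if e["device"] == flow["device"]
            and window >= abs(e["ts"] - flow["ts"])
        ]
        enriched.append({**flow, "events": related})
    return enriched
</code></pre> <p>At production scale, this join has to happen in a streaming pipeline against billions of records per day, which is exactly the heavy lifting you want a platform to do for you.</p>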
You need a solid platform that follows industry best practices, integrates with all the standard tools and network providers, and offers a turnkey (yet customizable) solution.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4iKqyVpe73bQ4fs8IprQ9y/37cc541f7b4dd40dee189ad7aab610d7/home-network-explorer202009.png" style="max-width: 800px;" class="image center" withFrame alt="Network Explorer" /> <h2 id="conclusion">Conclusion</h2> <p>Let’s recap. <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/">Network telemetry is fundamental to network observability</a>, but it can be much more useful if you enrich it with data from auxiliary sources such as logs, events, endpoint device telemetry, and application-level telemetry. Once you identify your network observability KPIs, you can collect all the relevant data and feed it into your network observability platform.</p> <p>Meanwhile, you should leverage AI/ML-backed tools to understand your network, detect problems early, and provide predictive analysis.</p> <p>Ongoing analysis of network telemetry data is crucial for maintaining network health and performance. <em>Enriched</em> network telemetry can level up your network observability effectiveness significantly. This is because your network—and the traffic attempting to pass through it—is dynamic and constantly changing. Real-time analysis of the current state and behavior of the network can help network administrators (and automation safeguards) to identify issues and take proactive measures to resolve them.</p> <p>To take advantage of enriched network telemetry and realize the goal of true network observability, you need a robust network observability platform like Kentik.</p> <hr> <h2 id="network-telemetry-faqs">Network Telemetry FAQs</h2> <h3 id="what-is-network-telemetry">What is network telemetry?</h3> <p>Network telemetry, a fundamental concept for network observability, is the process of collecting, processing, and interpreting data from various network devices and components to monitor their performance, behavior, and status in real-time.</p> <p>In the broadest sense, network telemetry encompasses a vast array of measurements that cover everything from latency and bandwidth utilization to packet loss and connections per second. This data is usually obtained by active monitoring methods like synthetic tests or passive monitoring methods such as packet capture or flow data.</p> <p>Once this data is collected, it is transmitted from the network’s edge to a central location for storage, analysis, and visualization. This telemetry data provides network administrators with valuable insights into the health and performance of the network, enabling them to detect and troubleshoot issues before they impact end-users or escalate into larger problems.</p> <p>What sets network telemetry apart from network performance metrics is its focus on depth, detail, and dynamism. It’s not just about capturing basic metrics or sporadic snapshots of network activity. Instead, network telemetry strives to collect comprehensive, granular, and real-time data across all areas of the network, providing a more accurate and timely picture of network behavior and performance.</p> <p>In an era where networks are increasingly complex, distributed, and vital to business operations, network telemetry has become an indispensable tool for maintaining network reliability, efficiency, and security. 
By enriching network telemetry with auxiliary data sources such as logs, events, endpoint device telemetry, and application-level telemetry, network operators can enhance their network observability and manage their networks more effectively.</p> <h3 id="what-is-meant-by-streaming-network-telemetry">What is meant by streaming network telemetry?</h3> <p>Streaming network telemetry refers to the continuous, real-time transmission of network data from network devices to a centralized system or platform for analysis and visualization. Unlike traditional methods which involve periodic polling or snapshotting of network status, streaming telemetry enables the network administrators to have a more dynamic, up-to-the-minute view of the network’s health and performance.</p> <p>In the context of network observability, streaming telemetry provides more granular insights into network behaviors, helping to identify patterns, predict potential issues, and initiate quick troubleshooting. With streaming network telemetry, network anomalies can be detected and addressed as they happen, reducing the risk of network downtime or performance degradation.</p> <p>The increased visibility and real-time nature of streaming network telemetry make it a crucial component of modern, robust network observability strategies, especially in complex, distributed network environments. By leveraging this technology, network administrators are better equipped to ensure network reliability, performance, and security.</p> <h3 id="how-does-network-telemetry-improve-network-security">How does network telemetry improve network security?</h3> <p>Network telemetry data can provide in-depth visibility into network traffic, which is crucial for detecting security threats. Detailed telemetry data can help identify patterns or anomalies that might indicate a security breach or cyberattack. By using telemetry data, security teams can respond quickly to threats, isolate affected systems, and prevent further damage.</p> <h3 id="what-are-some-common-network-telemetry-data-sources">What are some common network telemetry data sources?</h3> <p>Network telemetry data sources span a wide array of network types and elements. 
Common sources include:</p> <ul> <li><strong>Cloud infrastructure:</strong> Service meshes, transit gateways, and ingress gateways specific to cloud environments.</li> <li><strong>Data center:</strong> Leaf and spine switches, top of rack switches, and API gateways for digital services.</li> <li><strong>Internet and broadband infrastructure:</strong> Includes access and transit networks, edge and exchange points, and Content Delivery Networks (CDNs).</li> <li><strong>4G, 5G networks:</strong> Components like evolved packet core (v)EPC, Multi-access edge computing (MEC), optical transport switches (ONT/OLT), and Radio Access Network (RAN).</li> <li><strong>IoT:</strong> IoT endpoints, gateways, and industrial switches.</li> <li><strong>Campus network:</strong> Ethernet switches, layer 2 and 3 switches, hubs, network extenders, wireless access points and controllers.</li> <li><strong>Traditional WAN:</strong> WAN access switches, integrated services routers, and cloud access routers.</li> <li><strong>SD-WAN:</strong> Access gateways, uCPE, vCPE, and composed SD-WAN services including their cloud overlays.</li> <li><strong>Service provider backbone:</strong> Edge and core routers, transport switches, optical switches, and Data Center Interconnects.</li> <li><strong>MSO (Multiple System Operators):</strong> Cable Access Platforms (CAP), CMTS, Optical Distribution Network (ODN), and Broadband Network Gateway (BNG).</li> </ul> <p>Additionally, there are observation points that could generate telemetry data:</p> <ul> <li><strong>Network devices:</strong> Physical and virtual routers, switches, wireless access points, application delivery controllers, and other network elements.</li> <li><strong>Endpoints:</strong> Client and server/service endpoints, including physical, virtual, and overlay/tunnel interfaces.</li> <li><strong>Controllers:</strong> Software-defined network controllers, orchestrators, and path computation applications.</li> <li><strong>Network TAPs, SPAN ports, and Network Packet Brokers (NPBs):</strong> These provide access to network traffic for monitoring and analysis.</li> <li><strong>L4-7 network elements:</strong> These include web appliances, content delivery networks, and application delivery controllers.</li> <li><strong>Firewalls and security appliances/services:</strong> These act as gateways, enforce policies, and generate telemetry data.</li> <li><strong>Application layer:</strong> Elements like Application Delivery Controllers (ADCs), load balancers, and service meshes.</li> </ul> <p>In terms of telemetry protocols, these devices may use standardized formats such as NetFlow, IPFIX, or VPC Flow Logs, amongst others.</p> <p>In modern networks, visibility across all these data sources is crucial for comprehensive network observability. However, due to the diversity of devices from multiple vendors, achieving unified visibility can be a challenge. With a well-designed data platform, it’s possible to iteratively work towards complete coverage, starting with key areas and expanding over time.</p><![CDATA[Understanding the Advantages of Flow Sampling: Maximizing Efficiency without Breaking the Bank]]><![CDATA[Does flow sampling reduce the accuracy of our visibility data? 
In this post, learn why flow sampling provides extremely accurate and reliable results while also reducing the overhead required for network visibility and increasing our ability to scale our monitoring footprint.]]>https://www.kentik.com/blog/advantages-of-flow-sampling-maximize-efficiency-without-breaking-the-bankhttps://www.kentik.com/blog/advantages-of-flow-sampling-maximize-efficiency-without-breaking-the-bank<![CDATA[Phil Gervasi]]>Wed, 18 Jan 2023 05:00:00 GMT<p>The whole point of our beloved networks is to deliver applications and services to real people sitting at computers. So, as network engineers, monitoring the performance and efficiency of our networks is a crucial part of our job. Flow data, in particular, is a powerful tool that provides valuable insights into what’s happening in our networks, both for ongoing monitoring and for troubleshooting poor-performing applications.</p> <h2 id="what-is-flow-data">What is flow data?</h2> <p><a href="https://www.kentik.com/telemetrynow/s01-e04/">Flow data is a type of metadata</a> derived from packets that summarizes the information embedded in a network stream. We use flow data to monitor IP conversations, protocol activity, and applications, to see trends, and to identify patterns in traffic. The network devices generate this metadata in the form of flow records sent to flow collectors, usually over the production network.</p> <p>However, the volume of data that needs to be processed, especially in large networks with high-speed links, can be overwhelming for both the network device creating the flow records and the monitoring system. To solve this problem, we can use <em>sampling</em>.</p> <h2 id="what-is-flow-sampling">What is flow sampling?</h2> <p>Sampling is a method used to reduce the amount of flow data that needs to be processed by a network device, such as a router or a switch, as well as the monitoring system. As traffic traverses a link, the network device selects a subset of packets to represent the whole, rather than make a copy of every single packet. It then sends this sampled data as a flow record to a flow collector for processing and analysis.</p> <p>Consider a router in a busy network with very high-speed links. Since the router itself is our de facto monitoring device, we use it both to monitor network traffic and to perform its primary function: forwarding packets. The problem is that capturing every single packet and generating many flow records places a massive burden on the local CPU, the flow cache, and the network itself, even though individual flow records are lightweight.</p> <p>So we can configure a <em>sampling rate</em> to capture only a portion of those packets crossing the link and generate a flow record from this sampled data. Based on your needs, you may collect only 1 out of every 1,000 packets, for example, reducing the amount of information that must be processed locally and by your monitoring system.</p> <h2 id="pros-and-cons-of-network-flow-sampling">Pros and cons of network flow sampling</h2> <p>The benefits of sampling are pretty straightforward. Your network devices aren’t taxed as heavily, your monitoring system doesn’t have to process as much, and you aren’t adding as much extra traffic to your production network.</p> <p>This means we can scale our monitoring to a much larger footprint of network devices, servers, containers, clouds, and end-users — a scale that may be nearly impossible otherwise.
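<p>The arithmetic behind sampled measurements is worth seeing once: at a 1:N sampling rate, each sampled packet stands in for N packets, so totals are estimated by scaling the sampled counts back up. A quick simulation against synthetic traffic (purely illustrative, and not how any particular router implements sampling) shows how close the estimate lands:</p> <pre><code class="language-python">import random

random.seed(7)
RATE = 1000  # a 1:1,000 sampling rate

# Synthetic traffic: one million packets of 40 to 1,500 bytes each.
packets = [random.randint(40, 1500) for _ in range(1_000_000)]
actual = sum(packets)

# Keep roughly one packet in a thousand, then scale the sampled
# byte count back up by the sampling rate to estimate the total.
sampled = [size for size in packets if random.randrange(RATE) == 0]
estimate = sum(sampled) * RATE

error_pct = 100 * abs(estimate - actual) / actual
print(f"actual={actual:,} estimate={estimate:,} error={error_pct:.2f}%")
</code></pre> <p>Runs of this sketch typically land within a few percent of the true byte count, and the estimate tightens as the traffic volume grows, which is the statistical intuition behind the discussion that follows.</p>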
Sampling allows us to improve our monitoring tools’ performance and scope since we can process large amounts of data much more efficiently.</p> <p>However, a common argument against sampling is that capturing only a subset of packets gives us incomplete visibility and potentially reduces the accuracy of the results. In other words, sampling costs us visibility and can lead to underrepresentation of the data since there’s less of it collected. This results in an inaccurate picture of the network and skews the conclusions we draw from the data.</p> <h2 id="accuracy-and-sampling-rate">Accuracy and sampling rate</h2> <p>Though this concern may be valid in certain scenarios, it usually isn’t an issue because of the size of the dataset or, in statistics terms, the population we’re dealing with. Millions and millions of packets traverse a typical router or switch link in a very short period of time. And in statistics, the larger the population, the more accurately a sample will represent it.</p> <p><strong>Let’s look at an example.</strong></p> <p>A sampling rate of 1:1,000 means you’re collecting statistics from one packet out of every 1,000. So, in this case, you’re ignoring the information embedded in those other 999 packets. That’s where the concern arises for some.</p> <p>We might ask ourselves, “What am I not seeing that could be in those other 999 packets?”</p> <p>In reality, you’d be capturing a statistically significant sample of a vast dataset when collecting one out of every 1,000 packets (a sampling rate of 1:1,000) on a typical 1Gbps link. And that would give you enough information about your network without the fear of overwhelming your routers or monitoring system.</p> <p>This is because a sampling rate of 1:1,000 provides a sufficiently large sample to reflect the entire flow accurately. So if you’re using a sampling rate of 1:1,000 with a random sampling mode, and if a flow has 100,000 packets and you’re able to sample 100 of them, you’ll have a statistically significant idea of what the other 99,900 packets looked like. In fact, you’ll likely be able to identify the average packet size correctly and even figure out how many bytes were transmitted within a fraction of a percent. This is why most providers employ sampling to determine bandwidth and peering choices.</p> <p>And therein lies the challenge: you have to consider your specific scenario, including whether your network devices can handle creating and sending a lot of flow records, whether you have the available bandwidth to accommodate the additional flow record traffic, and what level of resolution you need in the first place.</p> <p>Are you looking for the very highest resolution visibility possible? Then you’ll want a sampling rate of 1:1 or pretty close to that, assuming your devices can handle it. If you’re OK with seeing trends, IP conversations, what applications are taking up the most bandwidth, and what endpoints are the chattiest, you can bump that sampling rate to 1:100, 1:1,000, or even 1:10,000.</p> <p>But is there a sweet spot where we sample enough to scale, yet not so coarsely that the data becomes useless? Like almost everything in tech, the answer to how much sampling is just right is “<em>it depends</em>.”</p> <h2 id="sampling-trade-offs">Sampling trade-offs</h2> <p>When monitoring a high-speed network like the type used in some banking and finance organizations, keep in mind that individual transactions complete in a very short amount of time.
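<p>A quick back-of-the-envelope calculation, using illustrative numbers, shows why this matters. A 10Gbps link carrying packets that average 800 bytes moves roughly 1.5 million packets per second, so a 1:10,000 sampling rate yields only about 150 samples per second for the entire link. A transaction that lasts 50 milliseconds and spans a few hundred packets would, on average, contribute only a fraction of a single sample and would often be invisible in the flow data.</p>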
Though sampling shouldn’t affect accuracy in most situations, a sampling rate of 1:10,000 could miss vital visibility information for this type of application. So, in this case, you may want a lower sampling ratio, such as 1:100, since there’s so much activity happening so quickly. This means allocating more resources to your monitoring systems, like CPU and storage, and it means your network devices will have to work a little harder.</p> <p>On the other hand, a large school district in the United States should be able to get more than enough network visibility information from more heavily sampled flow data. With thousands of students, various devices (including personal devices), and thousands of teachers and staff, a school district would be hard-pressed to handle a low sampling ratio in terms of the staff needed to handle the information, the budget required to process and store the data, and the toll it takes on the network itself.</p> <p>The inaccuracy we get from sampling is statistically small and generally within acceptable limits for network operators. So, in general, it’s a good practice to start with a conservative sampling rate and then adjust it as needed based on the type of links being monitored, the volume of data being captured, and your desired level of resolution. It’s also a good idea to monitor your network devices themselves to ensure that the sampling rate is not causing any issues, such as high CPU or memory usage.</p> <p>As with most things in tech, there are always trade-offs. You adjust your sampling rate according to your specific scenario and accept that there may be a slight decrease in accuracy, a cost so small that it may be statistically irrelevant.</p> <p>Even so, this trade-off is worth it because we can scale our network monitoring to a much greater number and type of devices both on-premises and in the cloud. In the end, sampling costs us a few packets, so we see less. But it allows us to <a href="https://www.kentik.com/blog/maximizing-application-performance-extract-practical-data-from-your-network/">scale our visibility</a>, so we see more.</p><![CDATA[Cuba and the Geopolitics of Submarine Cables]]><![CDATA[This week marks a decade since the ALBA-1 submarine cable began carrying traffic between Cuba and the global internet. And last month’s recommendation by the US Department of Justice to deny the request by the ARCOS cable system to connect Cuba shows that, almost a decade later, geopolitics continues to shape the physical internet — especially when it comes to Cuba.]]>https://www.kentik.com/blog/cuba-and-the-geopolitics-of-submarine-cableshttps://www.kentik.com/blog/cuba-and-the-geopolitics-of-submarine-cables<![CDATA[Doug Madory]]>Fri, 13 Jan 2023 05:00:00 GMT<p><em>The views expressed in the following article are those of the author and do not necessarily reflect the views or positions of any entities they represent</em>.</p> <p>This week marks a decade since the <a href="https://en.wikipedia.org/wiki/ALBA-1">ALBA-1 submarine cable</a> began carrying traffic between Cuba and the global internet.
On 20 January 2013, I <a href="https://web.archive.org/web/20130121174409/http://www.renesys.com/blog/2013/01/cuban-mystery-cable-activated.shtml">published the first evidence</a> of this <a href="https://www.reuters.com/article/cuba-internet/cubas-mystery-fiber-optic-internet-cable-stirs-to-life-idUKL1N0AR9TQ20130122">historic subsea cable activation</a>, which enabled Cuba to finally break its dependence on geostationary satellite service for the country’s international connectivity.</p> <p>ALBA-1 was one of my first lessons on how geopolitics can shape the physical internet.</p> <p>Last month’s recommendation by the US Department of Justice to deny the request by the <a href="https://www.submarinecablemap.com/submarine-cable/arcos">ARCOS</a> cable system to connect Cuba shows that, almost a decade later, geopolitics continues to shape the physical internet — <em>especially when it comes to Cuba</em>.</p> <h2 id="the-alba-1-activation">The ALBA-1 activation</h2> <img src="//images.ctfassets.net/6yom6slo28h2/vM7qKqAdhB76ZEniUk9cF/9b65dc6c29bb513be73454d491a829e0/submarine-cable-map.png" style="max-width: 600px; padding: 0;" class="image center" alt="Submarine cable map" /> <div class="caption" style="margin-top: -30px;">ALBA-1 submarine cable from <a href="https://www.submarinecablemap.com">www.submarinecablemap.com</a></div> <p>I first learned of the mystery of the ALBA-1 submarine cable from <a href="http://laredcubana.blogspot.com/">The Internet in Cuba</a> blog of <a href="http://som.csudh.edu/fac/">Larry Press</a>, computer science professor at California State University, Dominguez Hills. At that time, the cable had reportedly been constructed but was inexplicably lying dormant for over a year, prompting intense suspicion and speculation about its status.</p> <p>As the world became more connected, Cuba had been excluded from every previous submarine cable system in the Caribbean due to the U.S. embargo against the communist nation. As a result, the Cuban internet was completely dependent on geostationary satellites to reach the outside world. In contrast to undersea fiber optics, geostationary satellite service offers lower capacity with significantly higher latencies, all at a higher cost per Mb — not great for a developing nation’s sole source of internet service.</p> <p>Ultimately, it was the Venezuelan government (ally to Cuba and adversary to the U.S.) that put up the money to build a submarine cable to improve Cuba’s connection to the internet. In 2007, a <a href="https://www.commsupdate.com/articles/2007/10/01/venezuela-cuba-cable-set-for-2009/">joint Cuba-Venezuela venture</a> announced its intention to construct an undersea fiber optic cable to Cuba by 2009.</p> <p>The ALBA (<em>Alternativa Bolivariana para los Pueblos de nuestra América</em>) cable experienced numerous delays but was eventually designated RFS (ready-for-service) in <a href="http://www.cubadebate.cu/noticias/2011/01/18/llega-a-venezuela-buque-con-fibra-optica-para-cuba/">February 2011</a>. However, as the months progressed, there remained <a href="http://laredcubana.blogspot.com/2012/05/hard-data-on-idle-alba-1-undersea-cable.html">no evidence</a> that the new cable had made any difference for the Cuban internet. It was a mystery closely followed by Cuba watchers everywhere.</p> <p>I configured my internet monitoring tools to look for any new connection into Cuba, and a year later, they found one — a new BGP adjacency between Cuban state telecom ETECSA (AS11960) and Spanish telecom giant Telefonica (AS12956).
When I checked our active measurements into Cuba that traversed this new link, we observed latencies that were impossibly low for a round-trip time over geostationary satellite.</p> <p>This graphic from <a href="https://archive.nanog.org/meetings/nanog57/presentations/Tuesday/tue.lightning1.madory.alba-1.pdf">my presentation at NANOG 57</a> in February 2013 captures the migration of latency measurements to Cuba observed from one of Renesys’s measurement servers.</p> <img src="//images.ctfassets.net/6yom6slo28h2/41pGNZTyNCSbS6K4Uj9Akb/441583f3fb6afd14c87ed1438f1de76e/latency-measurements-cuba.png" style="max-width: 600px;" class="image center" alt="Renesys chart showing migration of latency measurements to Cuba" /> <p>When we published our <a href="https://web.archive.org/web/20130121174409/http://www.renesys.com/blog/2013/01/cuban-mystery-cable-activated.shtml">initial report on the ALBA-1 activation</a>, we speculated on why the latencies weren’t even lower than what we were observing:</p> <blockquote>We believe it is likely that Telefonica's service to ETECSA is, either by design or misconfiguration, using its new cable asymmetrically (i.e., for traffic in only one direction).</blockquote> <p>This misconfiguration appeared to have been resolved when the latencies dropped further a few days later. During my participation in <a href="https://www.lacnic.net/2268/32/evento/lacnic19-report">LACNIC 19</a> in Medellin, Colombia, a few months later, I was introduced to the Director of ETECSA, who confessed that his engineers fixed the asymmetric routing snafu after seeing my blog post. <em>You never know who’s reading your stuff!</em></p> <p>In the intervening decade, internet adoption in Cuba grew at an anemic pace: from relying on <a href="http://www.cubadebate.cu/noticias/2013/06/04/salas-de-navegacion-en-cuba-listas-para-acceso-a-internet/">internet cafes</a> and <a href="https://insightcuba.com/blog/2017/03/05/wifi-online-cuba">wifi hotspots</a> to the eventual activation of <a href="https://web.archive.org/web/20190109015016/dyn.com/blog/cubas-new-3g-service-six-years-after-ALBA-1/">3G mobile service</a> in December 2018. But adoption grew enough to enable Cubans to start enjoying a freedom of communication that many in other countries around the world expect and sometimes take for granted.</p> <p>In fact, the internet became important enough that the Cuban government <a href="https://www.kentik.com/analysis/internet-disruptions-in-cuba-following-widespread-protests/">began cutting service</a> following the largest anti-government protests in decades in the summer of 2021. While Cuba has been a repressive state for many years, it wasn’t until the past two years that the country felt the <a href="https://www.kentik.com/blog/suppressing-dissent-the-rise-of-the-internet-curfew/">need to shut down services</a> to counteract protests. 
Maybe that’s progress.</p> <h2 id="blocking-the-arcos-segment-to-cuba">Blocking the ARCOS segment to Cuba</h2> <img src="//images.ctfassets.net/6yom6slo28h2/6oe4Echwk9Vj6as92YM6Ba/eff3fe1c3b162883cade0a4d87342bab/arcos-request-denied.png" style="max-width: 600px;" class="image center" alt="" /> <p>The latest chapter in the saga of the Cuban internet came last month, when the US Department of Justice’s Team Telecom <a href="https://www.justice.gov/opa/press-release/file/1554426/download">published their recommendation</a> that the FCC deny a request by the <a href="https://www.submarinecablemap.com/submarine-cable/arcos">ARCOS submarine cable system</a> to add a segment that would connect Cuba to the cable. According to their recommendation, building such an extension would create “immitigable risks to the national security and law enforcement interests of the United States.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/5ogml1hu2pWOrMAO8efZI9/f513df191f9dc59ab49c8864277a03b5/submarine-cable-map2.png" style="max-width: 600px; padding: 0;" class="image center" alt="Submarine cable map" /> <div class="caption" style="margin-top: -30px;">ARCOS submarine cable from <a href="https://www.submarinecablemap.com">www.submarinecablemap.com</a></div> <p>Unfortunately, the technical rationales provided in the published decision reveal a fundamental misunderstanding of how the logical internet and its underlying physical infrastructure work together. It is safe to say that as a result of this recommendation, the ARCOS cable will not build a segment to Cuba, leaving the country almost completely reliant on a single submarine cable.</p> <p>The recommendation describes the government of Cuba as “a foreign adversary that poses a national security threat to the United States.” Make no mistake; the Cuban government is an authoritarian regime that represses political dissent, and the two countries’ governments have been adversaries for decades. As the recommendation states, the Cuban government censors its internet and has shut down service in response to anti-government protests. All true statements.</p> <p>Be that as it may, the argument that adding a Cuban segment to the ARCOS cable system poses “immitigable risks” to the United States’ national security falls flat. Let’s look at some key pieces of that argument.</p> <blockquote><em>The Cuban government could soon glean communications, travel records, health records, credit information and any other information transiting Segment 26, and share that information with [China]...</em></blockquote> <p>The recommendation states that if the segment were built, Cuban state telecom ETECSA would control the cable landing station and therefore be in a position to intercept any US internet traffic traversing the segment. The truth is that ETECSA owns and operates <em>all of the telecommunications infrastructure</em> in Cuba. Therefore, <em>all internet traffic</em> entering or leaving the country could be at risk of interception by the Cuban government. 
This is true today, whether or not Cuba gets a submarine cable directly linking it to the United States.</p> <blockquote><em>ETECSA could also take advantage of these vulnerabilities to cause BGP route leaks, leading traffic not destined for Cuba to be misrouted over Segment 26 and into the Cuban government’s hands.</em></blockquote> <p>The recommendation argues that a submarine cable directly linking the United States to Cuba would somehow enable BGP hijacks – a topic I have researched extensively for more than a decade. In fact, to bolster its case, the recommendation cites the FCC’s decision to revoke China Telecom’s license to operate in the US over reported incidents of traffic misdirection stemming from BGP vulnerabilities. That FCC decision <a href="https://www.hsgac.senate.gov/imo/media/doc/2020-06-09%20PSI%20Staff%20Report%20-%20Threats%20to%20U.S.%20Communications%20Networks.pdf">cited my work</a> over the years documenting these incidents, so I have some bona fides on this matter.</p> <p>Regardless, a submarine cable alone cannot enable a BGP hijack. The only things that stop ETECSA (or any telecom) from performing BGP hijacks today are route filtering by its transit providers and routing security mechanisms such as RPKI ROV. A submarine cable directly linking Cuba to the United States does not increase the possibility of a BGP hijack from Cuba.</p> <blockquote><em>ETECSA could manipulate routing information upstream so that more data transits Cuba instead of other routes, including by offering low-or-no-cost transit to small internet providers to entice traffic to transit Segment 26 into Cuba.</em></blockquote> <p>Lastly, the recommendation describes scenarios in which internet traffic in the Caribbean could be re-routed through ETECSA. The first scenario hypothesizes that if ETECSA started selling transit to other Caribbean telecoms (supposedly at a greatly reduced cost — ostensibly subsidized by China), then they could attract non-Cuban traffic through their infrastructure, putting it at risk of interception. In the second, ETECSA could use its connection to ARCOS to directly exchange traffic with other telecoms in the Caribbean, and those connections could double as backup links for other telecoms in the region. Activating a link to ETECSA as a backup would risk the interception of non-Cuban traffic.</p> <p>Both of these hypothetical scenarios could already have taken place, but neither has. Recall that ALBA-1 is a cable “system,” meaning that it is made up of two submarine cables: one to Venezuela and one to Jamaica. In fact, for several weeks in 2013, ETECSA was <a href="https://www.commsupdate.com/articles/2013/05/23/jamaica-branch-of-cuba-venezuela-cable-goes-live/">using the Jamaica segment</a> to attain transit from Cable &#x26; Wireless Jamaica.
Nothing is stopping ETECSA from peering with Jamaican providers today to offer cheap transit or a backup route in case of a cable failure.</p> <h2 id="conclusion">Conclusion</h2> <p>In my opinion, the technical basis for Team Telecom’s recommendation to deny ARCOS’s request to build a segment to Cuba is flimsy but perhaps unsurprising — why help an adversary improve their internet service?</p> <p>However, limiting Cuba’s international connectivity is, in my opinion, in the interest of neither the Cuban people nor the United States, whose Cuba Internet Task Force’s <a href="https://www.state.gov/cuba-internet-task-force-final-report/">top recommendation</a> in 2018 was the “construction of a new submarine cable.” Blocking such a cable runs counter to the US’s purported support for greater internet connectivity in Cuba as well as the national interests of both countries.</p> <p>Cuba’s dependence on a single submarine cable is a risk to the internet of Cuba and, ultimately, the Cuban people. A second submarine cable to Cuba would offer the benefits of lower latencies to the United States and neighboring Caribbean countries, increased bandwidth capacity to the global internet in general, and greater resiliency in the event of a submarine cable failure.</p> <p>Help appears to be on the way for the internet of Cuba. Immediately following the publication of DoJ’s recommendation to reject ARCOS’s segment to Cuba, <a href="https://www.submarinenetworks.com/en/systems/brazil-us/arimao/orange-and-etecsa-to-build-arimao-subsea-cable-to-cuba">ETECSA announced a deal</a> with French incumbent Orange to build a submarine cable to Martinique. The proposed <a href="https://www.submarinecablemap.com/submarine-cable/arimao">Arimao submarine cable</a> will span over 2400 km and be the first new submarine cable to connect to Cuba (that doesn’t land at Guantanamo Bay Naval Base) in over a dozen years.</p> <p>Crafting effective policy towards adversarial nations is no easy task. Still, there is a growing consensus within the digital rights space that any measures to restrict internet and telecommunications are counterproductive. Case in point is the <a href="https://home.treasury.gov/news/press-releases/jy0974">recent decision</a> by the U.S. Department of the Treasury to modify sanctions against Iran to explicitly allow cloud providers such as Google to provide free services in Iran. If we sincerely want to support the people of countries like Cuba and Iran, I think we should encourage, not block, initiatives that would improve the reliability of their internet access.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p><em>See <a href="https://www.yucabyte.org/">Yucabyte’s</a> terrific <a href="https://twitter.com/YucaByte/status/1600950975925465088">Twitter thread</a> (en Español) on the history of connectivity between the US and Cuba.</em></p><![CDATA[Data Gravity in Cloud Networks: Massive Data]]><![CDATA[With data sets reaching record scale, it is more important than ever for network operators to understand how data gravity is impacting their bottom line and other critical factors.
Learn how network observability is essential for managing data gravity.]]>https://www.kentik.com/blog/data-gravity-in-cloud-networks-massive-datahttps://www.kentik.com/blog/data-gravity-in-cloud-networks-massive-data<![CDATA[Ted Turner]]>Wed, 11 Jan 2023 05:00:00 GMT<p>I spent the last few months of 2022 <a href="https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-i/">sharing my experience</a> transitioning networks to the cloud, with a focus on spotting and managing some of the associated costs that aren’t always part of the “sticker price” of digital transformation.</p> <p>I originally envisioned this data gravity content as part of the <a href="/blog/managing-the-hidden-costs-of-cloud-networking-part-i/">Managing Hidden Costs series</a>, but the more planning I did, the more I realized it was an important topic all its own. Why data gravity? With data sets reaching record scale, and organizations pressed from multiple angles to distribute their data across availability zones, it is more important than ever for network operators to understand how data gravity is impacting not only their bottom line but other critical factors like user experience, development velocity, and engineering decisions.</p> <p>In this and the following articles of this series, we will:</p> <ul> <li>Define data gravity and related concepts</li> <li>Explore how data gravity affects network costs, performance, and reliability</li> <li>Discuss how to help your data achieve “escape velocity”</li> <li>Make the case for network observability as an indispensable tool for managing data gravity</li> </ul> <h2 id="what-is-data-gravity">What is data gravity?</h2> <p>I have Dave McCrory to thank for the conceptual framework of <em>data gravity</em>, an idea <a href="https://datagravitas.com/2010/12/07/data-gravity-in-the-clouds/">he introduced in 2010</a>. Similarly, the concepts I will introduce later, service <em>energy</em> and <em>escape velocity</em>, were also coined by Dave as they apply to data in cloud systems.</p> <p>To quote him directly:</p> <blockquote>“Consider data as if it were a planet or other object with sufficient mass. As Data accumulates (builds mass), there is a greater likelihood that additional services and applications will be attracted to this data. This is the same effect Gravity has on objects around a planet.”</blockquote> <p>As data accumulates <em>in a specific location</em>, its gravity increases. As Dave notes in the quote above, this draws apps and services that rely on this data closer. But why? Physics. As the gravity of the data grows, associated services and apps will rely more on low latency and high throughput and continue their acceleration toward the data to compensate for the increase in data mass.</p> <h2 id="data-gravity-in-cloud-networks">Data gravity in cloud networks</h2> <p>One of the best examples of data gravity in the cloud is the cloud providers themselves. As Amazon, Microsoft, and Google (and others) started offering cloud data services, their data centers had to grow dramatically to accommodate the increased data. The more apps and services that use these cloud services, the more bound enterprises are to the provider’s data centers.
This increased gravity meant pulling in apps and services from a wider area, though eventually at the expense of cost and performance for users.</p> <p>The cloud providers understood this problem and started to build data centers around the country/internationally to redistribute their data mass, enabling high throughput and low latency for a greater geographical range of customers. In the early days, this could involve days or weeks of downtime (unthinkable to today’s cloud consumers), as data had to be physically copied and moved – a sensitive and time-consuming endeavor.</p> <p>For network operators and teams building hybrid cloud systems, managing this data redistribution within and between private and public clouds is easier said than done. As the mass (scale) of the data grows beyond a certain point, the movement of the data becomes time and cost prohibitive. This may be accounted for as a one-off event, but data needs to move freely to be valuable.</p> <p>Additionally, services that update, validate, prune, report, and automate will build structures around the data. These services are interesting components when planning out application stack infrastructure. These data services become the network and run on top of “the network.”</p> <p>To better understand how and when to use these abstractions to help your data achieve escape velocity (or where high gravity is beneficial), let’s look at data gravity and its relationship to cost, performance, and reliability in your cloud networks.</p> <h3 id="data-gravity-vs-cost">Data gravity vs. cost</h3> <p>The closer your apps and services are to your data, the cheaper your egress and transit costs. At face value, this suggests that a high-gravity, “monolithic” approach to data storage is the way to go. Unfortunately, most enterprises handling massive data must do so across various zones and countries for reliability, security, and performance.</p> <p>So, in a vacuum, high data gravity equals low costs. Still, with operations and customer use spread geographically, costs can start to rise without multi-zonal presence and additional data infrastructure.</p> <p>Higher data gravity for distributed networks at scale will ultimately equal higher costs. The data needs to be moved in from the edge, moved around for analytics, and otherwise exposed to egress/transit costs.</p> <h3 id="data-gravity-vs-performance">Data gravity vs. performance</h3> <p>As pointed out earlier, the closer services and apps can be to the data, the lower the default latency. So, high data gravity for a system with a geographically close footprint will result in improved performance.</p> <p>But, as with cost considerations, performance makes its own demands against data gravity. To deliver optimal performance, efficient data engineering and networking infrastructure are critical when generating or interacting with massive data sets. If apps and services cannot be any closer to the data, abstractions must be engineered that bring the data closer.</p> <h3 id="data-gravity-vs-reliability">Data gravity vs. reliability</h3> <p>For me, reliability is the crux of the issue with data gravity. Because even if an organization is operating in a relatively tight geographical range, allowing data gravity to grow too high in a given cloud zone is a recipe for unintended downtime and broken SLAs.
<a href="https://www.kentik.com/analysis/">What if the provider’s data center experiences an outage?</a> What if your provider has run out of capacity for your (massive) data analysis processes?</p> <p>Reliability, though, is not just about availability but about the correctness and persistence of data. Let’s take a look at a few services needed to make a large chunk of data useful:</p> <ul> <li>Authorization, authentication, accounting (leads towards IAM)</li> <li>Naming conventions (leads towards DNS)</li> <li>Data lookup</li> <li>Data caching to accelerate live or batch processes</li> <li>Reporting</li> <li>Anomaly detection in data usage patterns</li> <li>Data caching to accelerate real-time updates</li> </ul> <p>Many of the items above can be skipped entirely when crafting the application stack. However, when data volumes get big, most organizations must incorporate one or more of the above concepts. The more valuable the data store becomes, the more likely it is that the services above will be required to enable the growth of the data set.</p> <p>Most organizations that house large data structures will build many teams around data usage. Each team will take on an aspect of adding data, cleaning up data, reporting on data, taking action on the data stored, and finally purging/pruning/archiving data. In cloud networking, this means taking on more cloud-native approaches to these concepts, forcing teams to rethink/rearchitect/rebuild their stack.</p> <p>This process is what is often called <em>digital transformation</em>. Digital transformations do not require big data, but big data usually requires digital transformations.</p> <h2 id="final-takeaways">Final takeaways</h2> <p>Network architects and operators handling massive data sets in the cloud must be aware of the effects of that data accumulating in one region (or even with one provider). If left alone, these high-gravity data sets become impossible to move, difficult to use, and vulnerable to outages and security threats while weighing down the balance sheet.</p> <p>Network components and abstractions are required to counter this and “free” data. But, this introduces some of its own challenges, which we will explore in part II of this series.</p><![CDATA[New Year, New BGP Leaks]]><![CDATA[Only two days into the new year and we had our first BGP routing leak. It was followed by a couple more in subsequent days. In this blog post, we use some of Kentik’s unique capabilities to take a look at what happened, what the impacts were, and what might prevent these in the future.]]>https://www.kentik.com/blog/new-year-new-bgp-leakshttps://www.kentik.com/blog/new-year-new-bgp-leaks<![CDATA[Doug Madory]]>Tue, 10 Jan 2023 05:00:00 GMT<p>Only two days into the new year, and we had our first BGP routing leak. It was followed by a couple more in subsequent days.
Although these incidents were brief, with marginal operational impact on the internet, they are still worth analyzing because they shed light on the cracks in the internet’s routing system.</p> <p>In this blog post, I’m going to use some of Kentik’s unique capabilities to take a look at what happened, what the impacts were, and what might prevent these in the future.</p> <h2 id="so-first--what-happened">So first — what happened?</h2> <p><a href="https://twitter.com/Qrator_Radar/status/1609793967935078400">Flagged first</a> by our friends over at <a href="https://qrator.net/en/">Qrator</a>, the first leak was perpetrated by AS138805 of Indonesia when it passed several thousand routes learned from one transit provider (TELIN, AS7713) to another transit provider (Lintasarta, AS4800). The leak began at 05:37 UTC on 2 January 2023 and only lasted around five minutes, but there are some interesting insights that are worth analyzing here.</p> <p>Perhaps the first question to answer is, <em>why would this leak propagate at all?</em> The leak didn’t introduce more-specifics, as is often the case with leaks involving <a href="https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/">route optimizers</a> or <a href="https://medium.com/oracledevs/widespread-impact-caused-by-level-3-bgp-route-leak-internet-intelligence-3dbd724d9ac5">leaked traffic engineering</a>. To answer this, let’s look at the propagation of affected prefixes in Kentik’s BGP visualization.</p> <p>Let’s start with a prefix from a Nova Scotian provider named for a colorful bovine. One of Purple Cow Internet’s prefixes (104.36.174.0/23) got sucked into this leak. In the graphic below, the upper stacked time series is a measure of route propagation over time. It shows how our BGP sources reach this prefix via each upstream of the origin. Normally about 20% of our sources see this route <em>at all</em> (17.4% via Hurricane Electric), but during the leak it jumps up to about 70% (with 51.7% suddenly seeing it via the Toronto Internet Exchange TorIX, AS11670).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7sLNvCPcfg2Fa9ayK8Q77v/cc9ba2c6fcd821f91afa4faf0619fd51/purple-cow-internet1.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Leak progression in BGP visualization" /> <p>The ball-and-stick diagram at the bottom shows the AS-AS level adjacencies based on aggregated routes at the time of the leak. The diagram is pruned to exclude edges seen by less than 3% of our BGP sources.</p> <p>The highlighted path shows the leak progression. Purple Cow Internet (AS397545) originates this prefix and was sharing it at TorIX, perhaps via a route server. TELIN (Telekom Indonesia, AS7713) picked it up there and shared it with its transit customers. However, one of those customers (AS138805) accidentally announced it to another of its transit providers, AS4800, who then shared it with the wider internet. There were a number of other Canadian routes leaked from TorIX in this manner.</p> <p>Below is another impacted prefix from a different part of the world. This Taiwanese prefix is usually seen in the routing tables of 30% of our BGP sources; however, that figure jumped to 70% during the leak.
This time the leak went through the Hong Kong Internet Exchange (HKIX, AS4635), as shown in the lower portion of the Kentik BGP visualization below.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1f5wzqsNl0NENt3Ty42PJE/2be598cab5c096d6f5aa7deee2f94315/digital-united1.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Leak progression in BGP visualization - Hong Kong Internet Exchange" /> <p>You may have picked up on a commonality between these two prefixes. Neither was globally routed to begin with. Networks will often intentionally limit the propagation of a route that they prefer to be used only in certain geographies, for example. The problem with these “regional routes” is that when a leak occurs, there is nothing for the leaked route to compete against.</p> <p>Each of these leaked routes has an AS_PATH that is longer than normal (extended by “4800 138805 7713” at the very least), so it <em>should be</em> the loser in BGP’s shortest path comparison. But if there is no other path to compare it to, the leaked route gets selected and propagated.</p> <p>If we plot how many <a href="https://routeviews.org/">Routeviews</a> BGP sources saw each leaked route versus how many typically see the same route, we arrive at the chart below. There is a clear negative correlation between leak propagation and, let’s call it, steady state propagation.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4QEqxgIn4bCgwwXe5Qz53n/4ca6f95352962db4dacad175d90c624b/routeviews-leakpropgation-steadystate.png" style="max-width: 600px;" class="image center" alt="Routeviews negative correlation between leak propagation and steady state" /> <p>In other words, leaked versions of routes with limited propagation propagate further. If those routes with limited propagation are more-specifics, a lot of traffic destined for the covering routes will get dropped or misdirected.</p> <p>Maybe the next question is, how much traffic gets dropped versus misdirected in an incident like this? The joy of being a BGP analyst at a NetFlow analysis company is that I can dig into our aggregate NetFlow to explore the answer to this question. Kentik annotates NetFlow records upon intake with the AS_PATH of the source and destination IPs as seen by the router generating the NetFlow. This enables users of Kentik’s <a href="https://kb.kentik.com/v3/Db03.htm">Data Explorer</a> to essentially perform BGP analysis using NetFlow.</p> <p>If we search for NetFlow records destined for an IP marked with an AS_PATH that includes the leak subsequence “4800 138805 7713”, we can see the leak’s impact on actual internet traffic. Below is a screenshot of the results of that query when grouped by destination country.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3BNB3nVfMEjpdXgfxRIfM2/241fd274ae7e28d9d097cc035baf2cf0/spike-destination-country.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Traffic spike corresponding to leak" /> <p>The graph shows a spike in traffic that corresponds to the leak. The top five most affected countries by total misdirected traffic (in bits/sec) based on Kentik’s aggregate NetFlow were Hong Kong, Guam, Indonesia, the United Kingdom, and Brazil. We’re no longer talking about theoretical impact, as is often the case with BGP leak analysis. Here we can show this leak misdirected actual internet traffic.</p> <p>But here is where it gets interesting. Can we compare the amount of misdirected traffic to the amount of lost traffic?
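<p>Conceptually, the classification is a subsequence match against the AS_PATH annotation carried by each flow record. A simplified sketch of the idea is below; the record shape is hypothetical, and Kentik’s actual pipeline performs this kind of annotation at ingest across billions of records:</p> <pre><code class="language-python">LEAK_SEQ = ["4800", "138805", "7713"]

def has_leak_path(as_path):
    """True if the leak subsequence appears as consecutive hops.

    as_path: a space-separated AS_PATH string annotated onto a flow
    record, e.g. "4800 138805 7713 397545".
    """
    hops = as_path.split()
    n = len(LEAK_SEQ)
    return any(hops[i:i + n] == LEAK_SEQ for i in range(len(hops) - n + 1))

def split_traffic(flows):
    """Sum the bits/sec of flows that did and did not follow the leak."""
    leaked = clean = 0.0
    for flow in flows:
        if has_leak_path(flow["dst_as_path"]):
            leaked += flow["bps"]
        else:
            clean += flow["bps"]
    return leaked, clean
</code></pre> <p>Matching whole hops rather than raw substrings avoids false positives, such as AS14800 matching AS4800. With flows tagged this way, the misdirected portion can be charted directly against the total traffic destined for the affected prefixes.</p>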
If we look at the traffic from around the world destined for the top 500 leaked prefixes (when ranked by leak route propagation), we can isolate the portion that followed “4800 138805 7713” using the BGP annotations in our NetFlow records.</p> <p>This approach yields the screenshot below. In this 90-minute period, there is a steady flow of traffic to the affected prefixes. At the time of the leak, we can see two impacts:</p> <ol> <li>A drop in overall traffic due to packet loss</li> <li>A separate portion of traffic that was misdirected along “4800 138805 7713”</li> </ol> <p>Again, the leak was brief, but the ability to perform this style of traffic analysis on a BGP leak using NetFlow is unique to Kentik.</p> <img src="//images.ctfassets.net/6yom6slo28h2/63fawTqgz9l8iAICMhCOOh/48b6ea3de545b5855ac37e62504f37fe/impacts-from-leak-1.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Impacts to the affected prefixes" /> <p>The subsequent leaks occurred two days later. The first was when B. Online (formerly Gulfnet, AS3225) briefly leaked routes from Zain (AS59605) to Telecom Italia Sparkle (AS6762) — again <a href="https://twitter.com/Qrator_Radar/status/1610579759737479168">spotted first</a> by Qrator. Like the previous leak, this was also a <em>Type 1: Hairpin Turn</em> leak, as defined by <a href="https://www.rfc-editor.org/rfc/rfc7908.html">RFC 7908</a>. It occurred twice on 4 January 2023, first at 9:39 UTC and then again at 11:33 UTC. Each instance lasted just a few minutes.</p> <p>At 10:16 UTC the same day, Bangladesh Telecom (AS17494) leaked over 1700 routes from its peering sessions to its transit provider Bharti Airtel (AS9498). Unlike the other two, this leak was a <em>Type 4: Leak of Peer Prefixes to Transit Provider</em> from <a href="https://www.rfc-editor.org/rfc/rfc7908.html">RFC 7908</a>. Regardless, for the next eight minutes, providers around the world began sending Bangladesh Telecom their traffic destined for faraway destinations such as Vietnam, South Korea, and Great Britain.</p> <p>Below is an example affected prefix from UK-based ISP TalkTalk. The hump in the center is the leak from Bangladesh. If you look very closely, you can also see some marginal effects of the two AS3225 leaks at 9:39 UTC and 11:33 UTC.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7kq3WHtx99M3DE9JHm0Hv5/e16922eba5b64f6e3bf5f9f1e0e24201/talktalk-leaked-prefix1.png" style="max-width: 800px;" class="image center" withFrame thumbnail alt="Leak from Bangladesh" /> <p>Below is a graphic showing the internet traffic misdirected by the leaks on 4 January 2023 based on Kentik’s aggregate NetFlow.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5mnhTg2TBspmBQPvGpOPYE/83b9ac262d0bb9cdbc858c1b126f20e1/traffic-misdirected-by-leaks.png" style="max-width: 800px;" withFrame thumbnail class="image center" alt="Internet traffic misdirected by leaks" /> <p>Using the same approach as with the first leak, we can observe that the amount of packet loss due to the Bangladesh leak was much more significant than the amount misdirected.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1B4h6ZmPIteYs1RxY97gs6/26019a55f65e1fff7a172f17fc6105cf/packet-loss-bangladesh-leaks.png" style="max-width: 800px;" withFrame thumbnail class="image center" alt="Packet loss due to leak compared to misdirected" /> <h2 id="what-can-we-learn-from-all-of-this">What can we learn from all of this?</h2> <p>There are two impacts from routing leaks: misdirected traffic and dropped traffic.
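</p> <p>Given a pre-leak baseline, both impacts can be estimated with simple arithmetic. Here is a rough Python sketch; the numbers are invented, and a real analysis would use the AS_PATH-tagged flow volumes described above.</p> <pre><code class="language-python"># Rough estimate of the two leak impacts from traffic totals.
# All numbers are invented for illustration; a real analysis would
# use AS_PATH-tagged flow volumes from the NetFlow queries above.

baseline_bps    = 1_000_000  # typical traffic to the affected prefixes
observed_bps    =   700_000  # total traffic observed during the leak
misdirected_bps =   120_000  # observed portion that followed the leak path

dropped_bps = max(baseline_bps - observed_bps, 0)
print(f"dropped ~{dropped_bps} bps, misdirected ~{misdirected_bps} bps")</code></pre> <p>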
Dropped traffic is a result of congested links and represents the disruption caused by the leak. Misdirected traffic, by contrast, usually incurs a performance penalty (i.e., higher latency) and introduces security concerns, specifically the possibility of interception or manipulation.</p> <p>Which impact is greater will depend on a variety of factors, including the ASes involved in the leak — and our view of it is, of course, subject to the biases present in our aggregate NetFlow dataset. Regardless, it is fascinating to be able to begin measuring these impacts on actual internet traffic using aggregate NetFlow data.</p> <p>I view the challenges facing routing security as a constellation of problems, and that constellation necessitates multiple solutions. None of these adjacency leaks altered the origins of the routes or introduced more-specifics; therefore, RPKI ROV (<a href="https://mailman.nanog.org/pipermail/nanog/2022-December/221230.html">despite significant growth in adoption</a>) would not have had any beneficial effect.</p> <p>To prevent leaks like these, network service providers typically filter the routes they receive from their transit customers. Yet leaks like these continue to slip through the cracks, which is why <a href="https://datatracker.ietf.org/doc/draft-ietf-sidrops-aspa-verification/">Autonomous System Provider Authorization (ASPA)</a> was proposed. ASPA enables providers to enumerate their transit relationships within the RPKI system (ASPA records are published and validated using the exact same infrastructure as ROAs).</p> <p>This enables providers participating in the system to evaluate routes whose AS_PATHs contain <a href="https://www.cs.princeton.edu/courses/archive/fall06/cos561/papers/gao01.pdf">valley-free</a> violations as invalid and reject them. Using the first example above, if it were known a priori that AS138805 was a customer of both AS4800 and AS7713, then a route with “4800 138805 7713” could be evaluated as invalid and rejected, limiting the impact of the leak.</p> <p>Just as RPKI ROV aims to limit the disruption due to accidental origination leaks, ASPA helps to address issues in the middle of the AS_PATH due to accidental adjacency leaks. The examples above were brief but show that the internet is still vulnerable to this all-too-common BGP routing mistake.</p> <p><em>Thanks to Job Snijders of Fastly for his expert advice on this blog post.</em></p><![CDATA[Kubernetes and the Service Mesh Era: Exploring the Role of Kubernetes Service Mesh in Networking]]><![CDATA[In this blog, we discuss how Kubernetes approaches networking, the gaps in Kubernetes networking, and how Kubernetes service meshes address those gaps.]]>https://www.kentik.com/blog/kubernetes-and-the-service-mesh-erahttps://www.kentik.com/blog/kubernetes-and-the-service-mesh-era<![CDATA[Ted Turner]]>Thu, 05 Jan 2023 05:00:00 GMT<p>The adoption of Kubernetes in enterprise organizations is revolutionizing the way businesses manage their IT infrastructures. Automating deployment, scaling, and management of containerized applications allows organizations to embrace a cloud-native paradigm at scale and more easily employ best practices, such as microservices and DevSecOps.</p> <p>But as with all tech, Kubernetes has its limits. <a href="https://twitter.com/kelseyhightower?lang=en">Kelsey Hightower</a> famously tweeted that “Kubernetes is a platform for building platforms.
It’s a better place to start; not the endgame.”</p> <p>And networking is arguably the area where this quote is most applicable. Kubernetes provides a generic <a href="https://www.kentik.com/blog/kubernetes-networking-101/" title="Kentik Blog: Kubernetes Networking 101">networking baseline</a>—a flat address model, services for discovery, simple ingress/egress, and network policies—but anything beyond these basics must come from an extension or integration.</p> <p><a href="https://www.kentik.com/blog/kubernetes-and-cross-cloud-service-meshes/" title="Kentik Blog: Kubernetes and Cross-cloud Service Meshes">Service meshes</a> were built to close this gap by providing advanced services around traffic management, security, and observability.</p> <p>Let’s start with a background on the Kubernetes networking model.</p> <h2 id="understanding-the-kubernetes-networking-model">Understanding the Kubernetes networking model</h2> <p>Kubernetes runs workloads (services, applications, and jobs) in <strong>pods</strong>. Each pod contains one or more <strong>containers</strong>. Each pod also has a unique (within the cluster) <strong>private IP address</strong>.</p> <p>All of the containers in a pod run on the same node and (since they share the same IP address) can communicate with one another over localhost. Within a cluster, pods are directly reachable by these private IP addresses. Kubernetes has an internal service type (ClusterIP) that is used for this communication and load balancing inside the cluster.</p> <p>However, communication with a pod from <em>outside</em> the cluster is a little more complicated. There are several options:</p> <ul> <li>Provide a <em>public</em> IP address to pods (not recommended)</li> <li>Use a <a href="https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport" title="Kubernetes docs: NodePort service">NodePort service</a> that exposes a port on every node’s IP</li> <li>Use a <a href="https://kubernetes.io/docs/concepts/services-networking/service/#loadbalancer" title="Kubernetes docs: LoadBalancer service">LoadBalancer service</a> with a public IP</li> </ul> <p>In short, a <a href="https://kubernetes.io/docs/concepts/services-networking/service/" title="Kubernetes docs: service">service</a> in Kubernetes is a way to group a set of pods and present them as a single entity via an IP address and/or DNS name. When the service receives a request, it delegates it to one of the backing pods.</p> <p>Pods can discover services in two ways: environment variables and DNS. The environment of each pod contains the endpoint of every service in the cluster, via variables such as <code class="language-text">REDIS_SERVICE_HOST=10.0.0.11</code> and <code class="language-text">REDIS_SERVICE_PORT=6379</code>. Every service has a DNS name based on its name and namespace, such as <code class="language-text">&lt;service>.&lt;namespace>.svc.cluster.local</code>.</p> <p>Let’s look at several other important networking constructs that Kubernetes works with.</p> <h3 id="dns">DNS</h3> <p><a href="https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/" title="Kubernetes docs: DNS pod service">DNS</a> is a staple of networking beyond Kubernetes. Kubernetes comes with its own internal DNS server, <a href="https://kubernetes.io/docs/tasks/administer-cluster/coredns/#about-coredns" title="Kubernetes docs: About CoreDNS">CoreDNS</a>, to create DNS records for pods and services.
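</p> <p>To make the two discovery paths concrete, below is a minimal Python sketch of how code inside a pod might locate a service. The service name, namespace, and port are invented examples; the environment variable naming follows the <code class="language-text">{SVCNAME}_SERVICE_HOST</code>/<code class="language-text">{SVCNAME}_SERVICE_PORT</code> convention.</p> <pre><code class="language-python"># Minimal sketch of in-cluster service discovery from a pod.
# "redis" and "default" are invented example names.

import os

def discover(service="redis", namespace="default", default_port=6379):
    """Return (host, port) from env vars if present, else the DNS name."""
    prefix = service.upper().replace("-", "_")
    host = os.environ.get(f"{prefix}_SERVICE_HOST")
    port = int(os.environ.get(f"{prefix}_SERVICE_PORT", default_port))
    if host is None:
        # Fall back to the DNS name CoreDNS publishes for the service;
        # this name only resolves from inside the cluster.
        host = f"{service}.{namespace}.svc.cluster.local"
    return host, port

print(discover())  # ('redis.default.svc.cluster.local', 6379) outside a pod</code></pre> <p>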
CoreDNS is its own CNCF project.</p> <h3 id="kubernetes-networkpolicies">Kubernetes NetworkPolicies</h3> <p><a href="https://kubernetes.io/docs/concepts/services-networking/network-policies/" title="Kubernetes docs: NetworkPolicies">Kubernetes NetworkPolicies</a> allow you to manage traffic between pods at the IP address and port level. You can set policies for ingress or egress within the cluster.</p> <h3 id="ingress">Ingress</h3> <p><a href="https://kubernetes.io/docs/concepts/services-networking/ingress/" title="Kubernetes docs: Ingress">Ingress</a> resources control HTTP and HTTPS traffic from the outside world to services inside the cluster. You can define ingress rules, but Kubernetes itself doesn’t know what to do with them. This is one of the extensibility points where you need to deploy a third-party ingress controller to watch for the ingress resources you define and enforce the rules.</p> <h3 id="gateway-api">Gateway API</h3> <p>The <a href="https://gateway-api.sigs.k8s.io/" title="Kubernetes Gateway API">Gateway API</a> is the evolution of Ingress. The Gateway API is more expressive and flexible. It is broken into multiple resources that allow application developers and cluster administrators to cooperate without stepping on each other’s toes.</p> <h2 id="the-network-requirements-of-enterprise-systems">The network requirements of enterprise systems</h2> <p>As you can see, Kubernetes provides a robust and clean networking model. Many of the fundamental building blocks of networking are supported. However, as an enterprise organization, you probably need much more.</p> <p>For example:</p> <ul> <li>Sophisticated routing</li> <li>Strong security</li> <li>Observability</li> <li>Inter-service authentication and authorization</li> <li>Load balancing</li> <li>Health checks</li> <li>Timeouts and retries</li> <li>Fault injection</li> <li>Bulkheads</li> <li>Rate limiting</li> </ul> <p>Before the cloud-native era, the landscape for enterprise organizations was proprietary. Organizations ran their systems in private data centers. Infrastructure was mostly static, with separate IT teams responsible for capacity planning. Software was typically a large monolith with long release cycles.</p> <p>To handle the enterprise networking requirements mentioned above and ensure adherence to policies and interconnectivity between subsystems, the common practice was to have standard client libraries used by all software teams. This, of course, led to a lack of flexibility, over-budget and past-deadline project failures, and slow decay, as there was no way for complex software systems to stay up to date with modern innovation.</p> <p>Let’s fast forward to the cloud-native age!</p> <h2 id="networking-in-the-cloud-native-age">Networking in the cloud-native age</h2> <p>In the modern age, software systems are deployed in the cloud (often on multiple clouds), in private data centers, and even at edge locations. The infrastructure is dynamic. The software comprises hundreds or even thousands of microservices that may be implemented in multiple programming languages. The infrastructure and application development follow DevOps practices for continuous delivery. Security is integrated into the process following DevSecOps practices. Different components of the system are released constantly.</p> <p>This was a boon for productivity and flexibility—but brought on new problems of management, control, and policy enforcement. All these microservices implemented in multiple languages somehow need to interact.
Developers and administrators need to understand the flow of information, be able to detect and mitigate problems, and secure the data and the infrastructure.</p> <p>Enter the service mesh.</p> <h2 id="what-is-a-kubernetes-service-mesh">What is a Kubernetes service mesh?</h2> <p>A service mesh is a software layer that oversees and manages traffic within a network. Similar to Kubernetes, the service mesh comprises a <strong>data plane</strong> and a <strong>control plane</strong>: the control plane sets up policies, while the data plane uses proxies or node agents to intercept network traffic and implement those policies.</p> <h3 id="the-data-plane">The data plane</h3> <p>The data plane is composed of proxies that run alongside each service. These proxies capture every request between services on the network, apply relevant policies, determine how to process the request, and—assuming the request is approved and authorized—decide the routing path.</p> <p>The data plane can adopt one of two proxy models: the sidecar model or the host model. In the sidecar model, a mesh proxy is connected as a sidecar container to every pod. In the host model, each node operates an agent that functions as the proxy for all workloads running on that node.</p> <h3 id="the-control-plane">The control plane</h3> <p>The control plane offers an API that mesh operators use to establish policies dictating the mesh’s behavior. Additionally, the control plane facilitates communication by identifying all data plane proxies and updating them with the service locations throughout the mesh.</p> <h2 id="the-benefits-of-service-mesh">The benefits of service mesh</h2> <p>A service mesh has many benefits in a modern, large, and dynamic networking environment, such as Kubernetes-based systems, where new workloads are deployed constantly, pods come and go, and instances scale up or down.</p> <ul> <li>The service mesh externalizes all the networking concerns from the applications. Now they can be managed and updated centrally. By offloading all networking concerns to the service mesh, service developers can focus their efforts solely on their application and business logic.</li> <li>When you upgrade the service mesh, every service transparently and immediately enjoys the latest and greatest. Traditionally, to introduce a change or upgrade to a client library, you would need to negotiate with each team individually, supporting multiple versions of libraries across multiple programming languages.</li> <li>You benefit from the efforts of experts who keep evolving, improving, and optimizing the service mesh. The service mesh is also used and battle-tested by many organizations. This means that problems that might impact you may have already been discovered and reported by other users.</li> <li>As a central component that touches all of your services, the service mesh can handle cross-cutting concerns—such as observability, health checks, and access policy enforcement—across all services in your Kubernetes-based system.</li> <li>The service mesh can add a layer of security to an enterprise’s inter-service communication by employing a zero-trust approach to access and using mTLS to encrypt traffic for secure communication.
Additionally, limiting access from application to application helps to ensure that a malicious attacker who exploits one service cannot move laterally through your network to exploit other services.</li> </ul> <h2 id="service-meshes-on-kubernetes">Service meshes on Kubernetes</h2> <p>Service mesh fits Kubernetes like a glove. Kubernetes makes it easy for service meshes to integrate with the platform due to its extensibility. The synergy between Kubernetes and service meshes is powerful, as the service mesh builds on top of the basic Kubernetes networking model.</p> <p>For large systems—in particular, systems composed of multiple Kubernetes clusters—the service mesh becomes a standard add-on. Once enterprises begin working with multiple clusters, which might be spread across different clouds, the service mesh becomes an essential component for properly facilitating and securing inter-service communication.</p> <h2 id="a-quick-review-of-popular-service-meshes-on-kubernetes">A quick review of popular service meshes on Kubernetes</h2> <p>If you’re ready to implement a service mesh on top of Kubernetes, there are many choices. Let’s look at a few of them and their strengths and attributes.</p> <ul> <li> <p><a href="https://istio.io/" title="Istio">Istio</a> is arguably the most popular service mesh for Kubernetes. Google, IBM, and Lyft originally developed it. It uses the <a href="https://www.envoyproxy.io/">Envoy</a> project from Lyft as its data plane.</p> </li> <li> <p><a href="https://linkerd.io/" title="Linkerd">Linkerd</a> was the first service mesh. Its claim to fame is that it is more performant and less complicated than Istio. It implements its own data plane using Rust.</p> </li> <li> <p><a href="https://kuma.io/" title="Kuma">Kuma</a> is a service mesh originally developed by Kong, which also has an enterprise service mesh called Kong Mesh built on top of Kuma. Kuma also uses Envoy as its data plane. Its claim to fame is that it allows connecting Kubernetes clusters with non-Kubernetes workloads running on VMs (but Istio now has this capability, too).</p> </li> </ul> <p>Here are several other service meshes you may want to explore:</p> <ul> <li><a href="https://traefik.io/traefik-mesh/" title="Traefik Mesh">Traefik Mesh</a> (node agents as the data plane)</li> <li><a href="https://openservicemesh.io/" title="Open Service Mesh">Open Service Mesh</a> (heavily pushed by Microsoft, can be enabled on AKS as an add-on)</li> <li><a href="https://aws.amazon.com/app-mesh/" title="AWS App Mesh">AWS App Mesh</a> (AWS proprietary service mesh, strong integration with EKS, ECS, and EC2)</li> <li><a href="https://docs.cilium.io/en/v1.9/gettingstarted/clustermesh/" title="Cilium Mesh">Cilium Mesh</a> (up-and-coming service mesh using eBPF in the data plane)</li> </ul> <h2 id="drawbacks-of-not-utilizing-a-kubernetes-service-mesh">Drawbacks of not utilizing a Kubernetes service mesh</h2> <p>As powerful as Kubernetes is, not utilizing a service mesh can lead to several drawbacks in the networking and management of your containerized applications. Without one, you may face the following challenges:</p> <h3 id="lack-of-observability">Lack of observability</h3> <p>Without a service mesh, it becomes difficult to gain insights into the traffic flow and performance of your microservices.
This lack of visibility can make it challenging to identify issues, optimize performance, and troubleshoot problems within your infrastructure.</p> <h3 id="limited-traffic-management">Limited traffic management</h3> <p>In the absence of a service mesh, you may find it harder to implement advanced traffic management features such as load balancing, timeouts, retries, and fault injection. This can lead to suboptimal routing decisions and an overall less resilient system.</p> <h3 id="security-concerns">Security concerns</h3> <p>Not using a service mesh may expose your infrastructure to security vulnerabilities. With a service mesh, you can implement a zero-trust approach and use mTLS to encrypt inter-service communication, ensuring a more secure environment.</p> <h3 id="increased-complexity">Increased complexity</h3> <p>Managing networking configurations and policies for a large number of microservices can become complex without a service mesh. A service mesh helps centralize and standardize the management of networking concerns, simplifying the overall process.</p> <h3 id="performance-issues">Performance issues</h3> <p>Without a service mesh, you may face performance bottlenecks and scalability issues as your infrastructure grows. Service meshes are designed to handle the dynamic nature of microservices, ensuring that your system remains performant and efficient even as it scales.</p> <h2 id="kubernetes-service-meshes-and-kentik-cloud">Kubernetes, service meshes, and Kentik Cloud</h2> <p>As we’ve seen, Kubernetes with a service mesh is a powerful combination that lets you connect workloads across clouds, data centers, and the edge and enforce policies and best practices.</p> <p>As an additional benefit, as the service mesh works, it collects a lot of valuable data from flow logs and metrics related to your network traffic. This data can help you create a more robust and reliable system. But being able to understand and use this data in a meaningful way, and make it actionable, can be difficult. A strong and robust observability solution (such as <a href="https://www.kentik.com/product/cloud/" title="Learn more about Kentik Cloud">Kentik Cloud</a>) can help you make sense of the data from your service mesh, ensuring your system is cost-effective, healthy, and performs well. It can also help to mitigate incidents and/or attacks.</p> <h2 id="conclusion">Conclusion</h2> <p>Kubernetes is a powerful tool for modern cloud infrastructure. Out of the box, it offers some networking capabilities, but by adding a service mesh on top, you gain a long list of benefits. Hopefully, you now understand how <a href="https://www.kentik.com/solutions/usecase/kubernetes-networking/" title="Using Kentik for Kubernetes Networking">Kubernetes</a> and service meshes can work together to create modern and robust enterprise systems.</p> <p>To learn more about how multi-cluster service meshes solve hybrid and multi-cloud networking complexities, read our previous article, <a href="https://www.kentik.com/blog/kubernetes-and-cross-cloud-service-meshes/">Kubernetes and Cross-cloud Service Meshes</a>.</p><![CDATA[The Reality of Machine Learning in Network Observability]]><![CDATA[Machine learning has taken the networking industry by storm, but is it just hype, or is it a valuable tool for engineers trying to solve real world problems? The reality of machine learning is that it’s simply another tool in a network engineer’s toolbox, and like any other tool, we use it when it makes sense, and we don’t use it when it doesn’t. 
]]>https://www.kentik.com/blog/the-reality-of-machine-learning-in-network-observabilityhttps://www.kentik.com/blog/the-reality-of-machine-learning-in-network-observability<![CDATA[Phil Gervasi]]>Wed, 04 Jan 2023 05:00:00 GMT<p>For the last few years, the entire networking industry has focused on analytics and mining more and more information out of the network. This makes sense because of all the changes in networking over the last decade. Changes like network overlays, public cloud, applications delivered as a service, and containers mean we need to pay attention to much more diverse information out there.</p> <div as="WistiaVideo" videoId="z97e03uyb1" audio></div> <p>After all, if we want to figure out why an application delivered over a network isn’t performing well, we need to look at all these new network components involved in getting that application down to you and me, sitting behind a computer screen.</p> <p><a href="https://www.kentik.com/telemetrynow/s01-e03/">The industry has looked to machine learning for the answer</a>, sometimes resulting in severe eye-rolling. So let’s take a step back and look at what problems we’re trying to solve and how we actually solve them, whether that’s machine learning or not.</p> <p>Let’s look at some of the problems we face today with <a href="https://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources">network telemetry</a>.</p> <h2 id="dealing-with-vast-amounts-of-data">Dealing with vast amounts of data</h2> <p>First, the sheer volume of network data we have available today eclipses the smattering of flows, SNMP information, and occasional packet capture we used to rely on. Consider what we can collect now — the sum of just one day’s worth of flows, packet captures, SNMP messages, VPC flow logs, SD-WAN logs, routing tables, and streaming telemetry can overwhelm a network operations team.</p> <p>Add to that additional information like IPAM databases, configuration files, server logs, and security threat feeds, and a network operations team would be hard-pressed to find meaning in the ocean of information flooding their monitoring systems.</p> <p>So to handle this volume of data, we first need to figure out how to:</p> <ul> <li>Collect very different types of data securely and fast enough to be relevant</li> <li>Query massive databases in real time</li> <li>Reduce the overall data volume</li> <li>Secure data at rest both internally and externally</li> </ul> <p>This isn’t trivial, and most answers will have little or nothing to do with machine learning. These are database architecture decisions and workflow designs using sufficient compute resources. For example, a columnar database strikes a good balance between fast queries and the ability to separate data vertically for multi-tenant scenarios. From a high level, there really isn’t any reason to cram an ML model into this process.</p> <p>But dealing with large databases is just one problem. How do we analyze the variety of very <em>different</em> data types?</p> <h2 id="handling-different-data-types">Handling different data types</h2> <p>The second problem, which is really two distinct problems, is working with a variety of very different data types. Think about some of the very basic telemetry we’ve been collecting for years.</p> <ul> <li> <p>Flow data can tell us the volume of one protocol on our network relative to all the other protocols.
So the data point is a percentage, like 66%, and describes the volume of a protocol.</p> </li> <li> <p>SNMP can tell us what VLAN is active on an interface, which ultimately is just an arbitrary tag, not a percentage.</p> </li> <li> <p>SNMP can also tell us the uptime of a device, represented in seconds, minutes, hours, days, and years. A measurement of time, not a percentage and not a tag.</p> </li> <li> <p>A packet collection and aggregation tool can tell us how many packets go over a wire in a given amount of time — some number in the millions or billions, a finite but dynamic number many orders of magnitude larger than a percentage.</p> </li> </ul> <p>It’s not just the huge volume of data but also how different the data are and the scales they use: telemetry represented as percentages, bits per second, arbitrary ID tags, timestamps, routing tables, and so on.</p> <p>We can solve these problems using standardization or normalization at the point of ingest into the system. Remember that although data scientists often use the standardization and normalization functions of statistical analysis in machine learning preprocessing, these functions aren’t technically machine learning themselves.</p> <p>Using normalization, which is simple math, we can transform diverse data points on vastly different scales into new values all appearing on the same scale, usually 0 to 1. Now we can compare what was previously very different data and do more interesting things like finding correlations and identifying patterns.</p> <h2 id="when-machine-learning-makes-sense">When machine learning makes sense</h2> <p>The right database design and fundamental statistical analysis are enough to do some amazing things. It isn’t necessary to <em>start</em> with ML. When we reach the point in our analysis where we can’t do much more with a basic algorithm, we can apply an ML model to get the desired result. And that’s the key here — the <em>result</em>, not the method.</p> <p>That’s why the reality of ML is that it’s another tool in our toolbox, not the only tool we use to analyze network telemetry. We use it when it makes sense to get us the desired result, and we don’t use it when it doesn’t.</p> <p>So, for example, we can apply an ML model when we want to do more advanced analysis, such as:</p> <ul> <li>Correlating dynamic events</li> <li>Correlating dynamic events with static elements</li> <li>Discovering patterns and seasonality</li> <li>Predicting future activity</li> <li>Identifying causal relationships</li> </ul> <p>Because network telemetry is inherently unstructured, dynamic, and usually unlabeled, it makes sense that we use ML to perform specific tasks.</p> <blockquote style="border-left: 4px solid #1890FF">We should use clustering to group unlabeled data and reduce the overall amount of data we have to deal with.</blockquote> <blockquote style="border-left: 4px solid #1890FF">We should use classification to recognize patterns in data and classify new data.</blockquote> <blockquote style="border-left: 4px solid #1890FF">We should apply a time series model to estimate seasonality, make predictions, detect anomalies, and identify short- and long-range dependencies.</blockquote> <p>At this stage of our analysis, ML serves a specific purpose. It isn’t just a bolt-on technology to join the fray of vendors claiming they use ML to solve the world’s problems.
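</p> <p>To illustrate that order of operations (simple math first, ML only where it earns its keep), here is a small Python sketch that min-max normalizes mixed-scale telemetry before any model is applied. The values are invented for illustration.</p> <pre><code class="language-python"># Sketch: min-max normalize mixed-scale telemetry so percentages,
# packet rates, and uptimes become comparable (0 to 1) before any
# ML step. All values are invented for illustration.

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

protocol_share = [66.0, 12.0, 22.0]        # percent of total traffic
packets_per_s  = [8e6, 1.2e7, 9.5e5]       # raw packet rates
uptime_seconds = [86_400, 31_536_000, 3_600]

features = list(zip(min_max(protocol_share),
                    min_max(packets_per_s),
                    min_max(uptime_seconds)))
print(features[0])  # every dimension now lives on the same 0-1 scale</code></pre> <p>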
And when a model doesn’t produce the result we want, is inaccurate, is unreliable, or takes too many resources to run, we drop it and do something simpler that gets us most of the way there.</p> <p>I don’t believe using machine learning (and artificial intelligence) is just marketing hype, though it can be when presented as a one-size-fits-all solution. The reality of ML in network observability is that it’s just another tool in the data scientist’s toolbox. We use ML to get a specific result. We use it when it makes sense and don’t use it when it doesn’t.</p><![CDATA[A Year in Internet Analysis: 2022]]><![CDATA[This past year was another busy one for the internet. This year-end blog post highlights some of the top pieces of analysis that we published in the past 12 months. This analysis employs Kentik’s data, technology, and expertise to inform the industry and the public about issues involving the technical underpinnings of the global internet and how global events can impact connectivity.]]>https://www.kentik.com/blog/a-year-in-internet-analysis-2022https://www.kentik.com/blog/a-year-in-internet-analysis-2022<![CDATA[Doug Madory]]>Wed, 14 Dec 2022 05:00:00 GMT<p>This past year was another busy one for the internet. In this blog post, I will highlight some of the top pieces of analysis that we published in the past 12 months. This analysis employs Kentik’s data, technology, and expertise to inform the industry and the public about issues involving the technical underpinnings of the global internet and how global events can impact connectivity.</p> <div as="WistiaVideo" videoId="716mfhkgzo" audio></div> <p>These posts are organized into two broad categories: major internet disruptions and BGP routing security.</p> <h2 id="internet-disruptions">Internet disruptions</h2> <h3 id="undersea-volcano-eruption-near-tonga">Undersea volcano eruption near Tonga</h3> <img src="//images.ctfassets.net/6yom6slo28h2/1RMZvRaFAdXPVuxcMbLc6B/7ab88f88c432c10215034efe38a1d024/tonga-undersea-volcano.gif" style="max-width: 600px;" class="image center" alt="The Hunga Tonga–Hunga Ha‘apai eruption near Tonga" /> <div class="caption" style="margin-top: -30px;">The Hunga Tonga–Hunga Ha‘apai eruption near Tonga. <br />Credit: Visible Earth/NASA</div> <p>The year began with a bang – literally! The eruption of an undersea volcano on January 15 in the South Pacific devastated the island nation of Tonga, killing <a href="https://www.dw.com/en/deaths-confirmed-in-tonga-after-devastating-volcano-eruption/a-60455979">three of its people</a> and knocking out <a href="https://apnews.com/article/tonga-volcanic-eruption-tsunami-threat-d434936e25079859df8261780976def8">voice and internet communications</a>.</p> <p>I wrote a <a href="https://www.kentik.com/blog/tonga-downed-by-massive-undersea-volcano-eruption/">blog post</a> following the outage, giving some history of connectivity in Tonga, which I had previously covered. I first spotted <a href="https://www.submarinecablemap.com/submarine-cable/tonga-cable">Tonga’s submarine cable</a> carrying traffic on August 5, 2013. The cable was funded by the World Bank and Asian Development Bank due to its status as a “thin route,” a term for a submarine cable that promises a small (or <em>thin</em>) return on investment.
The undersea eruption destroyed this cable, disconnecting Tonga from the world.</p> <p>On January 20, <a href="https://twitter.com/DougMadory/status/1484215403119353857">we saw the first internet traffic to Tonga</a> as the island began restoring its connection to the world via connections from satellite operators Speedcast and Kacific.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2sMVbmPdW9g4lxgw7X7oLQ/7dfad4be27333a2f82489496f0409471/internet-traffic-tonga.png" withFrame style="max-width: 700px;" class="image center" alt="Internet traffic restored" /> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="terrestrial-outage-in-egypt-led-to-global-impacts">Terrestrial outage in Egypt led to global impacts</h3> <p>In June, a fiber outage in Egypt’s TransEgypt overland route, which connects submarine cables in the Mediterranean Sea to the Red Sea, caused <a href="https://twitter.com/DougMadory/status/1534231462035218432">international internet disruptions</a>. In this <a href="https://www.kentik.com/blog/outage-in-egypt-impacted-aws-gcp-and-azure-interregional-connectivity/">blog post</a>, we covered the outage by combining our data with that from <a href="https://radar.cloudflare.com/">Cloudflare Radar</a>, <a href="https://ioda.inetintel.cc.gatech.edu/">IODA</a>, and <a href="https://www.iijlab.net/en/">IIJ Research Lab</a>. WIRED magazine later took this content about the history of Egypt’s role as a global internet chokepoint and expanded it into a <a href="https://www.wired.com/story/submarine-internet-cables-egypt/">feature story</a>.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7rse8Fiw9yrwHDFDmwhaNY/c166d59cfac43ab33ce0e8f6aa603138/transegypt-diagram.jpg" style="max-width: 600px;" class="image center" alt="TransEgypt diagram Telecom Egypt from 2013" /> <div class="caption" style="margin-top: -30px;">TransEgypt diagram Telecom Egypt from 2013</div> <p>The post also offered a unique view from Kentik’s cloud measurements comparing how the outage impacted intra-cloud connectivity for Amazon Web Services, Microsoft Azure, and Google Cloud.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3bjnsd8v1YuHUV4HRIba3f/f9e87e077df4320b43c4ac425e8f6ee7/london-singapore-gcp.png?q=80&w=1560&fm=webp" style="max-width: 800px;" class="image center" withFrame alt="Intra-cloud connectivity" /> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="conflict-in-ukraine">Conflict in Ukraine</h3> <p>Russia’s invasion of Ukraine continues to be one of the biggest and most impactful stories of 2022. This brutal conflict has brought death and destruction to the Ukrainian and Russian people, as well as outages, DDoS attacks, and the rewiring of internet transit for the southern city of Kherson.</p> <p>I covered all of these topics and more in an invited talk at <a href="https://www.nanog.org/events/nanog-86/">NANOG 86</a> in Hollywood, California, in October. 
The talk was well-received and sat at the top of the program committee’s <a href="https://www.nanog.org/stories/nanog-86-pc-picks/">list of favorite talks</a>.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/hKwjq94Quhc?rel=0" title="Internet Impacts Due to the War in Ukraine" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> </div> <p>Additionally, I collaborated with the New York Times this summer to tell the story of the re-routing of internet service in Russian-occupied Kherson from Ukrainian transit to Russian transit. The <a href="https://www.nytimes.com/interactive/2022/08/09/technology/ukraine-internet-russia-censorship.html">Times’ story</a> landed on the front page, and we published an <a href="https://www.kentik.com/blog/rerouting-of-kherson-follows-familiar-gameplan/">accompanying blog post</a> that provided more technical detail and some comparisons to Crimea’s switchover to Russian transit in 2014.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 64.75507765830346%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/3quz3zd1s2" title="Reconfiguration of ASes in Kherson" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="rogers-outage">Rogers outage</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/24Q0bOjtYtMlM4Pa2mtmfR/3c9728060de2d7d775d805544d8c2055/rogers-outage-view-from-kentik.png?q=80&w=1560&fm=webp" style="max-width: 600px;" class="image center" withFrame alt="Rogers outage and service restoration" /> <p>In July, Canadian telecommunications giant Rogers Communications suffered what is arguably the most significant internet outage in Canadian history. In my <a href="https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-outage/">blog post</a> on the incident, we highlighted the impacts of the outage and tried to shed some light on the role of BGP, which had been the focus of blame in the initial accounts of the outage.</p> <blockquote>I’m here to say that BGP gets a bad rap during big outages. It’s an important protocol that governs the movement of traffic through the internet. It’s also one that every internet measurement analyst observes and analyzes. When there’s a big outage, we can often see the impacts in BGP data, but often these are the symptoms, not the cause.
<em>If you mistakenly tell your routers to withdraw your BGP routes, and they comply, that’s not BGP’s fault.</em></blockquote> <p>A couple of weeks later, the CRTC (Canadian Radio-television and Telecommunications Commission) <a href="https://crtc.gc.ca/otf/eng/2022/8000/c12-202203868.htm">published an explanation of the outage</a>, which blamed an internal route leak. Basically, a filter had been removed, allowing the global routing table to be leaked into Rogers’ interior routing protocol, which overwhelmed their routers with a flood of internal routing updates.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 40px;"></div> <h3 id="the-rise-of-the-internet-curfew">The Rise of the Internet Curfew</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/ZAPhmLM6YIH2lJtwESSAo/427039420d190d0ba4933f99c1d03186/featured-internet-curfew.png" style="max-width: 600px;" class="image center" alt="Internet curfew graphic" /> <p>Government-directed shutdowns in <a href="https://twitter.com/DougMadory/status/1576169117039501313">Cuba</a> and <a href="https://twitter.com/DougMadory/status/1577341946783318035">Iran</a> this fall led me to join up with <a href="https://twitter.com/lawyerpants">Peter Micek</a> of digital rights NGO <a href="https://www.accessnow.org/">Access Now</a> to write a <a href="https://www.kentik.com/blog/suppressing-dissent-the-rise-of-the-internet-curfew/">blog post that traced the history and logic motivating “internet curfews,”</a> a tactic of communication suppression in which internet service is temporarily blocked on a recurring basis. We wrote:</p> <blockquote>The objective of internet curfews … is to reduce the cost of shutdowns on the authorities that order them. By reducing the costs of these shutdowns, they become a more palatable option for an embattled leader and, therefore, are likely to continue in the future.</blockquote> <p>The practice was first seen in <a href="https://www.vice.com/en/article/gv58vw/gabon-internet-censorship-election-outage">Gabon in 2016</a> but reappeared in <a href="https://twitter.com/DougMadory/status/1362853999691636741">Myanmar</a> last year following the military coup. Similar incidents in Cuba and Iran suggest this is, sadly, a tactic that we will see again in the future.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 40px;"></div> <h3 id="protests-in-iran">Protests in Iran</h3> <p>In September, a <a href="https://www.reuters.com/world/middle-east/tehran-governor-accuses-protesters-attacks-least-22-arrested-2022-09-20/">widespread protest movement</a> sprang up in cities across Iran following the <a href="https://www.cnn.com/2022/10/26/middleeast/iran-clashes-mahsa-amini-grave-intl">death of Mahsa Amini</a> while in police custody.
In an effort to combat the protests, the Iranian government directed the three major mobile operators to begin <a href="https://twitter.com/DougMadory/status/1574813306547847169">disabling internet service</a> across the country every evening before restoring service in the early hours of the following morning.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1YjSvk3w1qkR0D13hlFrm5/13e937d0fbf29c7a20b3c907f34b8e99/iran-sept21.png" style="max-width: 600px;" class="image center" withFrame alt="Graph of internet service disabled and restoring" /> <p>In addition to <a href="https://twitter.com/DougMadory/status/1577341946783318035">reporting on the internet disruptions</a> in Iran, we combined our data and expertise with that of multiple academic, industry, and civil society organizations to produce a <a href="https://ooni.org/post/2022-iran-technical-multistakeholder-report/">comprehensive report</a> on the various internet disruptions in Iran in recent months. The timeline below summarizes the events covered in the report.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4r4eWWUNV87aTjJpaFONqg/6959151474dc8edd97c81593db7a608a/timeline-iran-internet-disruptions.png" style="max-width: 750px;" class="image center" alt="Timeline of internet disruptions in Iran" /> <p>This joint effort echoed our collaboration with multiple other organizations last year to produce a <a href="https://ooni.org/post/2021-multiperspective-view-internet-censorship-myanmar/">comprehensive report</a> on the internet disruptions following the military coup in Myanmar. It is work like this that resulted in <a href="https://share.america.gov/meet-watchdogs-guarding-internet-access/">Kentik being named one of “the watchdogs guarding internet access”</a> by the US State Department.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 0px; margin-top: 40px;"></div> <h2 id="bgp-routing-security">BGP routing security</h2> <p>Under the internet’s hood, there was also a lot of talk about BGP routing security this year. In June, I was invited to speak on BGP security at the <a href="https://nam2022.namex.it/">NAMEX annual meeting</a> in Rome, Italy. The title of the talk updated a <a href="https://en.wikipedia.org/wiki/SPQR#Modern_use">local joke</a> about the meaning of the Roman acronym <a href="https://en.wikipedia.org/wiki/SPQR">SPQR</a>: Sono Pazze Queste Rotte (they’re crazy, these routes 🤣). <a href="https://www.youtube.com/watch?v=nczAT3Ik1T8&#x26;t=3727s">Watch the talk on YouTube</a>.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 40px;"></div> <h3 id="rpki-rov-progress">RPKI ROV progress</h3> <p>I teamed up with the preeminent routing security expert <a href="https://instituut.net/~job/">Job Snijders</a> of Fastly on two analysis projects to shed some light on the progress made on the deployment of <a href="https://rpki-monitor.antd.nist.gov/ROV">RPKI ROV</a>.
The first used Kentik’s aggregate NetFlow (<a href="https://www.kentik.com/blog/bgp-and-rpki-a-path-made-clear-with-kentik/">annotated with RPKI ROV evaluations</a>) to measure progress in RPKI ROA creation in terms of traffic volume instead of just counting prefixes or IP address space, as had been done in the past.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/RXBfaD7hWQfl2gp54lNZU/ac55ead76c8363766ec6a6fb81e60626/internet-traffic-rpki.png" style="max-width: 600px;" class="image center" alt="Pie chart showing internet traffic volume by RPKI evaluation" /> <p>The conclusion was that, due to RPKI ROV deployments by major content providers and access networks, the majority of internet traffic (measured in bits per second) presently goes to routes with valid ROAs. This means that most internet traffic is eligible for the protection that RPKI ROV provides, further reinforcing the value of rejecting RPKI-invalid routes.</p> <p>This analysis was presented at NANOG 84 in Austin, Texas, earlier this year:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/dLd27cJo8Ds?rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> </div> <p>The second part of this analysis looked at the rejection of RPKI-invalid routes. As we wrote in our <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">joint blog post</a> detailing the conclusions:</p> <blockquote>ROAs alone are useless if only a few networks are rejecting invalid routes. The next step in understanding where we are at with RPKI ROV deployment is to better understand how widespread the rejection of invalid routes is.</blockquote> <p>So we ran the numbers and found that the evaluation of a BGP route as invalid reduces its propagation by anywhere between one-half and two-thirds! This is the system working as designed. As a result, we can be confident that the propagation of routes from a future origination leak, for example, will be suppressed in favor of the legitimate routes with valid ROAs.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7uL7smjzcsmg6Hec6Pvl1A/294032450ffb693669e9c8ab9c14625a/IPv4-v6_propagation_by_RPKI.png" style="max-width: 800px;" class="image center" alt="IPv4-v6 - Propagation by RPKI" /> <blockquote>From these histograms, we can see that invalid routes rarely, if ever, experience propagation greater than half that experienced by RPKI-valid and RPKI-not-found routes. In fact, many experience propagation significantly less than half, but the amount of reduction depends on a number of factors, including the upstreams involved in transiting the prefixes.
Nonetheless, it is evident that RPKI ROV dramatically reduces the propagation of invalid routes.</blockquote> <p>Given that most internet traffic now flows towards routes with ROAs, we concluded that RPKI ROV is presently offering a significant degree of protection for the internet in the event of a routing mishap.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 40px;"></div> <h3 id="bgp-hijacks-targeting-cryptocurrency">BGP hijacks targeting cryptocurrency</h3> <p>While the above analysis demonstrated that progress has been made in RPKI ROV, more sophisticated BGP hijack attacks remain a major unmitigated risk. This year saw two such attacks use BGP hijacking to successfully steal large amounts of cryptocurrency.</p> <p>This led to a <a href="https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services/">blog post</a> reviewing the successful BGP hijack against AWS that was used to attack the Celer Bridge, a service that allows users to convert between cryptocurrencies.</p> <p>After the hijack in August, I <a href="https://twitter.com/DougMadory/status/1562089866321698819">tweeted out</a> the following Kentik BGP visualization showing the propagation of this malicious route. The upper portion shows 44.235.216.0/24 appearing with an origin of AS14618 (in green) at 19:39 UTC and quickly becoming globally routed. It was withdrawn at 20:22 UTC but returned at 20:38, 20:54, and 21:30 before being withdrawn for good at 22:07 UTC.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7yMBjSi56iZ8wGRQILqDzW/2677e9272c2babf503bae0d53df495d2/bgp-visualization-propagation-malicious-route.png" style="max-width: 800px;" class="image center" withFrame alt="BGP visualization of propagation of malicious route" /> <div class="caption" style="margin-top: -30px;">View from Kentik BGP visualization showing the propagation of the hijack route (green) and the legitimate route (purple).</div> <p>My conclusion was that although the attackers had effectively eluded RPKI ROV by forging an AS_PATH with Amazon’s ASN, there is reason to believe that tighter ROA definitions and better BGP route monitoring could have limited the efficacy of the attack and reduced the time it took Amazon to respond. Ars Technica was rather harsh on Amazon in <a href="https://arstechnica.com/information-technology/2022/09/how-3-hours-of-inaction-from-amazon-cost-cryptocurrency-holders-235000/">their coverage</a> of the incident based on our analysis.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 40px;"></div> <h3 id="report-collaborations">Report collaborations</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/20KevPoQCBEu2bFmwX97yJ/297c586c3808b9113cc99fdfcdd51b4f/oecd-digital-economy-papers.png" style="max-width: 200px; padding: 0" class="image left" alt="OECD Digital Economy Papers book" /> <p>This year, the Organisation for Economic Co-operation and Development published a series of technology-focused reports called the <a href="https://www.oecd-ilibrary.org/science-and-technology/oecd-digital-economy-papers_20716826">OECD Digital Economy Papers</a>.
Compiled by the OECD’s Directorate for Science, Technology, and Innovation (STI), these papers are intended to help readers “better understand how information and communication technologies (ICTs) contribute to sustainable economic growth and social well-being.”</p> <p>In October, the OECD published <a href="https://www.oecd-ilibrary.org/science-and-technology/routing-security_40be69c8-en">Routing security: BGP incidents, mitigation techniques, and policy actions</a>, a report which included some of my prior analysis of BGP leaks and hijacks. <a href="https://arstechnica.com/information-technology/2018/11/strange-snafu-misroutes-domestic-us-internet-traffic-through-china-telecom/">One incident in particular</a> that got highlighted was the misrouting of Verizon’s Asia-Pacific network (AS703) through China Telecom (AS4134) from 2015 to 2017.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3xGOjstr3rN5FfvPQ6pq7u/4317f61a65356bc836b6b92e6e98c2f0/OECD-excerpt.png" style="max-width: 600px;" class="image center" alt="OECD book excerpt" /> <div class="caption" style="margin-top: -30px;">OECD Routing security: BGP incidents, mitigation techniques, and policy actions, page 18</div> <p>This incident was also covered in the recent book <a href="https://www.csis.org/analysis/digital-silk-road-chinas-quest-wire-world-and-win-future">The Digital Silk Road</a> by <a href="https://www.linkedin.com/in/hillmanjonathan/">Jonathan Hillman</a>, a senior advisor to the US Secretary of State. Chapter 5 of that book is entitled <em>A Crease In the Internet</em> and covers my discovery and remediation of this traffic misdirection, which lasted for nearly two years.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2DbzlRkgfP33OPe9CAsCpt/9386a6ef83e31ea585ca79ccffe5f71a/bitag-logo.png" style="max-width: 250px;" class="image right no-shadow" alt="BITAG logo" /> <p>In a separate effort, I joined a team of subject matter experts to help the Broadband Internet Technical Advisory Group (BITAG) draft a comprehensive report on BGP routing security. BITAG is an organization that provides technical guidance to the US broadband industry on various topics, and periodically it forms Technical Working Groups to draft technical papers intended to influence US policymakers on technical subjects.</p> <p>In November, our Technical Working Group published <a href="https://www.bitag.org/Routing_Security.php">Security of the Internet’s Routing Infrastructure</a>, which explains issues surrounding BGP routing from the basics to recent BGP hijacks targeting cryptocurrency. It concludes with recommendations for both network operators and policymakers, such as the following:</p> <ul> <li>Respect the internet’s multi-stakeholder standards development process. If regulation is considered, set goals rather than specifying technologies.</li> <li>Engage the internet community to address regional policy incentive issues that slow the adoption of standardized routing security technologies.</li> <li>Fund the long-term monitoring programs needed to understand internet routing and the effects of changes over time.</li> </ul> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 0px; margin-top: 40px;"></div> <h2 id="looking-ahead-to-2023">Looking ahead to 2023</h2> <p>As we look ahead to the new year, there is no shortage of challenges and opportunities for internet connectivity around the world.
We intend to continue producing timely, informative, and impactful analysis that helps inform the public and industry about internet connectivity issues.</p> <p>Follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to make sure you get notified as we publish posts in the future.</p><![CDATA[Kubernetes and Cross-cloud Service Meshes]]><![CDATA[Kubernetes is a powerful platform for large-scale distributed systems, but out of the box, it doesn't address all the needs of complex enterprise systems deployed across multiple clouds and private data centers. Service meshes fill that gap by connecting multiple clusters into a cohesive mesh.]]>https://www.kentik.com/blog/kubernetes-and-cross-cloud-service-mesheshttps://www.kentik.com/blog/kubernetes-and-cross-cloud-service-meshes<![CDATA[Ted Turner]]>Tue, 13 Dec 2022 05:00:00 GMT<p>As today’s enterprises shift to the cloud, Kubernetes has emerged as the de facto platform for running containerized microservices. And while Kubernetes itself operates at the level of a single cluster, enterprises inevitably run their applications on a complex, often confusing, architecture of multiple clusters deployed to a hybrid of multiple cloud providers and private data centers.</p> <p>This approach creates a lot of problems. How do your services find each other? How do they communicate securely? How do you enforce access and communication policies? How do you troubleshoot and monitor health?</p> <p>Even on a single cluster, these are not trivial concerns. In a multi- or hybrid-cloud setup, the complexity can be overwhelming.</p> <p>In this post, we’ll explore how service meshes — specifically, multi-cluster service meshes — can help solve these challenges.</p> <p>Let’s first look at how multi-cluster connectivity works in Kubernetes.</p> <h2 id="understanding-kubernetes-multi-cluster-connectivity">Understanding Kubernetes multi-cluster connectivity</h2> <p>Kubernetes operates in terms of clusters. A cluster is a set of nodes that run containerized applications.</p> <p>Every cluster has two components:</p> <ul> <li> <p>A <strong>control plane</strong> that is responsible for maintaining the cluster (what is running, what images are used). This control plane also contains the API server used to communicate within the cluster.</p> </li> <li> <p>A <strong>data plane</strong> that is made up of one or more worker nodes. These nodes run the actual applications. Nodes run pods and their containers; pods are the smallest deployable units of computing in Kubernetes, each grouping one or more containers.</p> </li> </ul> <p>When running a single cluster, this all works well. The challenge comes with multiple clusters because the API server in the control plane is only aware of the nodes, pods, and services in <em>its own cluster</em>.</p> <p>So, when you run multiple Kubernetes clusters and want workloads to talk to one another <em>across clusters</em>, Kubernetes (alone) can’t help you. You need some sort of networking solution to connect your Kubernetes clusters.</p> <p>There are two main models for multi-cluster connectivity: global flat IP address space and gateway-based connectivity. Let’s look at each.</p> <h3 id="global-flat-ip-address-space">Global flat IP address space</h3> <p>In this model, <strong>each pod across all Kubernetes clusters has its own unique private IP address</strong> that is kept in a central location and known to all other clusters. The networks of all the clusters are peered together to form a single conceptual network.
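</p> <p>Peering clusters into one network only works if the pod CIDRs assigned to the clusters never overlap. Here is a tiny Python sketch of such a conflict check; the cluster names and CIDR assignments are invented for illustration.</p> <pre><code class="language-python"># Sketch: verify that the pod CIDRs assigned to clusters do not
# overlap, a prerequisite for the global flat IP address space model.
# Cluster names and CIDRs are invented for illustration.

import ipaddress
from itertools import combinations

pod_cidrs = {
    "gke-prod": "10.0.0.0/16",
    "eks-prod": "10.1.0.0/16",
    "aks-prod": "10.1.128.0/17",  # conflicts with eks-prod
}

for (a, net_a), (b, net_b) in combinations(pod_cidrs.items(), 2):
    if ipaddress.ip_network(net_a).overlaps(ipaddress.ip_network(net_b)):
        print(f"conflict: {a} ({net_a}) overlaps {b} ({net_b})")</code></pre> <p>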
Various technologies — such as cross-cloud VPN or interconnects — may be needed when connecting clusters across clouds. Using this model requires tight control over the networks and subnets used by all Kubernetes clusters. The engineering team will need to centrally manage the global IP space to ensure there are no IP address conflicts.</p> <h3 id="gateway-based-connectivity">Gateway-based connectivity</h3> <p>In this model, <strong>each cluster has its own private IP address space that is not exposed to other clusters</strong>. Multiple clusters can use the same private IP addresses. Each cluster also has its own gateway (or load balancer), which knows how to route incoming requests to individual services within the cluster. The gateway has a public IP address, and communication between clusters always goes through the gateway. This adds one extra hop to every cross-cluster communication.</p> <p>Let’s look now at how these models apply within the context of service meshes in general and then with cross-cloud service meshes in particular.</p> <h2 id="the-service-mesh-a-brief-primer">The service mesh: a brief primer</h2> <p><strong>A service mesh is a software layer that monitors and controls traffic over a network</strong>. Just like with Kubernetes, the service mesh has a data plane and a control plane.</p> <p>The <strong>data plane</strong> consists of proxies running alongside every service. These proxies intercept every request between services in the network, apply any relevant policies, decide how to handle the request, and—if the request is approved and authorized—decide how to route it.</p> <p>The data plane can follow one of two models for these proxies: the sidecar model or the host model. With the sidecar model, a mesh proxy is attached as a sidecar container to every pod. With the host mode, every node runs an agent that serves as the proxy for all workloads running on the node.</p> <p>The <strong>control plane</strong> exposes an API that the mesh operator uses to configure policies governing the behavior of the mesh. The control plane also routes communications by discovering all the data plane proxies and updating them with the locations of services throughout the entire mesh.</p> <h2 id="what-is-a-cross-cloud-service-mesh">What is a cross-cloud service mesh?</h2> <p>A <em>cross-cluster</em> service mesh is a service mesh that connects workloads running on different Kubernetes clusters — and potentially standalone VMs. When connecting those clusters across multiple cloud providers — we now have a <em>cross-cloud</em> service mesh. However, from a technical standpoint, there is little difference between a cross-cluster and a cross-cloud service mesh.</p> <p>The cross-cloud service mesh brings all the benefits of a general service mesh but with the additional benefit of offering a single view into multiple clouds. There are two primary deployment models, and they’re analogous to the multi-cluster connectivity models discussed above.</p> <p>The first model is a <strong>single conceptual network</strong>. In practice, that network could be composed of multiple, peered physical networks. <strong>Multiple clusters share the same global private IP address space, and every pod has a unique private IP address across all clusters</strong>. 
The following diagram shows an Istio mesh deployment across two clusters that belong to the same conceptual network.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5Nf0rKpJYaD5klnRcL1wMD/d39338b1097a652c3f772a020b017bd5/multi-primary-cluster.png" style="max-width: 600px;" class="image center no-shadow" alt="Multi-primary cluster" /> <div class="caption" style="margin-top: -30px;">Source: Multi-primary cluster; an example from <a href="https://istio.io/latest/docs/setup/install/multicluster/multi-primary/">Istio</a></div> <p>Services in <code class="language-text">cluster1</code> can directly talk to services in <code class="language-text">cluster2</code> via private IP addresses.</p> <p>The second model is a <strong>federation of isolated clusters</strong> in which the <strong>internal IP addresses of each cluster are not exposed to the other clusters</strong>. Each cluster exposes a gateway, and services in one cluster talk to services in other clusters through the gateway. The following diagram shows an Istio mesh deployment across two clusters that belong to two separate networks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2TKHlpWMLXIbeyCO63RTiM/b707d8ade848ab2461cb49fe76d9be81/multi-primary-cluster-different-networks.png" style="max-width: 600px;" class="image center no-shadow" alt="Multi-primary cluster on different networks" /> <div class="caption" style="margin-top: -30px;">Source: Multi-primary cluster on different networks; an example from <a href="https://istio.io/latest/docs/setup/install/multicluster/multi-primary/">Istio</a></div> <p>Services within a cluster can talk to one another directly using their private IP addresses. However, communication between clusters must go through a gateway with a public IP address.</p> <h2 id="benefits-of-using-a-cross-cloud-service-mesh-with-kubernetes">Benefits of using a cross-cloud service mesh with Kubernetes</h2> <p>As stated in the opening, understanding the enterprise’s overall networking and security posture and all its components is very demanding. Cross-cloud service meshes help you by allowing you to connect large-scale enterprise systems in complex ways. This creates many benefits.</p> <p>First, a cross-cloud service mesh provides a one-stop shop for observability, visualization, health checks, policy management, and policy enforcement. Offloading these concerns — such as authentication, authorization, metrics collections, and health checks — to a central component (the cross-cloud service mesh) is a big deal. Think about it as “aspect-oriented programming in the cloud.”</p> <p>Second, this approach allows an organization to realize the vision of microservices by focusing on the business logic of its applications and services. Teams can pick the best tool for the job and write their microservices in any programming language, while the service mesh combines them all to form a cohesive whole. There is no need to develop, maintain, and upgrade multiple client libraries in each supported programming language just to facilitate the complex networking side of being in a multi-cloud environment.</p> <p>And third, a cross-cloud service mesh allows organizations to deploy their services to cloud providers that make the most sense for them based on cost, convenience, or other factors. 
Instead of being locked into a single vendor, organizations can use cross-cloud service meshes to connect all of their clusters and services together, regardless of the cloud that each one runs in.</p> <h2 id="service-mesh-examples">Service mesh examples</h2> <p>There are several successful service meshes. Here are some of the most mature and popular services mesh projects:</p> <ul> <li><a href="https://istio.io/">Istio</a></li> <li><a href="https://linkerd.io/">Linkerd</a></li> <li><a href="https://kuma.io/">Kuma</a></li> <li><a href="https://developer.hashicorp.com/consul/tutorials/kubernetes-deploy/service-mesh">Consul Service Mesh</a></li> <li><a href="https://traefik.io/traefik-mesh/">Traefik Mesh</a></li> <li><a href="https://openservicemesh.io/">Open Service Mesh</a></li> <li><a href="https://aws.amazon.com/app-mesh/">AWS App Mesh</a></li> <li><a href="https://docs.cilium.io/en/v1.12/">Cilium</a></li> </ul> <p>Every one of these service meshes can be set up for multi-cluster environments, allowing enterprises to use them as cross-cloud service meshes.</p> <h2 id="conclusion">Conclusion</h2> <p>Kubernetes is a powerful platform for large-scale distributed systems, but out of the box, it doesn’t address all the needs of complex enterprise systems deployed across multiple clouds and private data centers. Service meshes fill that gap by connecting multiple clusters into a cohesive mesh.</p> <p>While a service mesh helps your enterprise with cross-cloud connectivity, gaining a clear picture of everything you’re running across your clouds is still challenging. That’s where <a href="https://www.kentik.com/product/cloud/">Kentik Cloud</a> steps in to help. Kentik Cloud gives you the tools you need for monitoring, visualization, optimizing, and troubleshooting your network — whether you run multi-cloud or hybrid-cloud, and whether you use Kubernetes or non-Kubernetes deployments. As your systems grow in scale and <a href="https://www.kentik.com/resources/kentik-kube-container-network-performance-monitoring/">spread across more clusters and clouds</a>, the ability to see it all through a single pane of glass is critical.</p> <p>Read about <a href="https://www.kentik.com/solutions/usecase/clouds-and-hybrid">Kentik’s multi-cloud solutions</a> for more details.</p><![CDATA[Maximizing Application Performance: How to Extract Practical Data from Your Network]]><![CDATA[Most of the applications we use today are delivered over the internet. That means there's valuable application telemetry embedded right in the network itself. To solve today's application performance problems, we need to take a network-centric approach that recognizes there's much more to application performance monitoring than reviewing code and server logs.]]>https://www.kentik.com/blog/maximizing-application-performance-extract-practical-data-from-your-networkhttps://www.kentik.com/blog/maximizing-application-performance-extract-practical-data-from-your-network<![CDATA[Phil Gervasi]]>Thu, 08 Dec 2022 05:00:00 GMT<p>How many applications do you use that exist only on the computer in front of you? Two? One? 
None at all?</p> <p>I occasionally use two applications that live locally on my computer, but all the rest, including every application I use for personal and professional work, are delivered over the internet.</p> <p>That’s pretty much where we are these days, isn’t it?</p> <p>Most of the applications we use for work and our personal lives are delivered over the internet and the local network we happen to be on that day.</p> <p>That means a tremendous amount of application performance information is inherently embedded in the network itself. Today, a big part of identifying the cause of application delivery problems, such as slowness, instability, and insecurity, is looking at an application’s <em>network</em> activity.</p> <h2 id="maximizing-an-applications-performance-means-extracting-useful-information-from-the-network-itself">Maximizing an application’s performance means extracting useful information from the network itself.</h2> <p>Don’t get me wrong — there’s good information we can get from looking at an application’s code, server logs, underlying compute resources, and overall architecture, but because of how we consume applications today, we must take a more network-centric approach to application performance visibility.</p> <p>Even a properly designed SaaS application with sufficient compute resources and a robust architecture still relies on a separate system of network protocols, switches, routers, firewalls, DNS servers, and load balancers to reach an end-user.</p> <h2 id="all-of-the-activity-happens-over-the-network">All of the activity happens over the network.</h2> <p>Those application performance metrics are embedded in <a href="https://www.kentik.com/kentipedia/what-is-netflow-overview/" title="Kentipedia: What is NetFlow?">flow data</a>, <a href="https://www.kentik.com/blog/ebpf-explained-why-its-important-for-observability/" title="eBPF Explained">eBPF</a>, packet captures, and the <a href="https://www.kentik.com/kentipedia/what-is-synthetic-monitoring/" title="Kentipedia: What is Synthetic Monitoring?">results of synthetic tests</a>.</p> <p>In the past, we’d use that type of <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/" title="Best Practices to Enrich Network Telemetry">telemetry</a> to learn from a macro level what applications are running on our network, which IP endpoints are in those conversations, which interfaces are used on which routers, what the CPU utilization of our core is, and so on. Network-level stuff. Important, but not enough.</p> <p>Now we can parse network telemetry to identify application-specific performance metrics and even make inferences about non-network resources.</p> <p>Let me give you a few real examples from my personal experience.</p> <blockquote style="border-left: 4px solid #ebeff3">A network-centric view of application activity can tell you that the server hosting the back end of your EMR is taking a long time to respond to the request from the front end. Because the server is responding over the network, you can see exactly when it finally does respond and that there was no delay in the network itself. We can infer that the issue is with the server itself. </blockquote> <blockquote style="border-left: 4px solid #ebeff3">A network-centric view of application activity can tell you it’s taking an abnormally long time to resolve a hostname related to your CRM platform. 
That one slowdown affects the entire application’s performance over the network and impacts the user’s experience. And from the network, you can determine exactly which DNS server is the culprit. </blockquote> <blockquote style="border-left: 4px solid #ebeff3">A network-centric view of application activity can tell you the number of server turns occurring to fulfill a user request to the front end of your payroll app. Digging deeper with a synthetic page load test, we can discover that one particular file is failing to load over the network and the number of server turns is exceptionally high. </blockquote> <blockquote style="border-left: 4px solid #ebeff3">With a network-centric approach to application monitoring, we can identify one tiny corrupt file hosted in the cloud that’s ruining the entire user experience.</blockquote> <p>Yes, sometimes the network itself is the cause of the performance degradation (I’ve seen my share of interface errors and power supply failures). Still, very often, it’s some other component of the application’s architecture that is not part of the network.</p> <p>A <a href="https://www.kentik.com/blog/unifying-application-and-network-observability-with-new-relic/" title="Unifying Application and Network Observability with New Relic">network-centric approach to application monitoring</a> gives us new and valuable insight into an application’s performance — insight you can’t get from server logs and a code review. When troubleshooting application performance, the answers are often right there — embedded in the same network the application depends on.</p><![CDATA[ Multi-cloud vs. hybrid cloud networks: What’s the difference?]]><![CDATA[In today’s digital landscape, application demands such as scalability, performance, and reliability push many IT organizations toward cloud-based networks. Learn what the best cloud solution for your enterprise needs is.]]>https://www.kentik.com/blog/multi-cloud-vs-hybrid-cloud-networks-whats-the-differencehttps://www.kentik.com/blog/multi-cloud-vs-hybrid-cloud-networks-whats-the-difference<![CDATA[Stephen Condon]]>Sat, 03 Dec 2022 05:00:00 GMT<p>In today’s digital landscape, application demands such as scalability, performance, and reliability push many IT organizations toward cloud-based networks. Initially, cloud providers’ main offering was managed, virtualized data storage and services, or <em>cloud computing</em>. As cloud ecosystems have matured, so have the tools, services, and use cases available to their customers.</p> <p>To get the most from cloud providers, many organizations adopt <em>multi-cloud</em> or <em>hybrid cloud</em> strategies for their IT needs.</p> <p>With this article, I want to break down the <a href="/kentipedia/what-is-cloud-networking/">differences between the major types of cloud offerings</a> and look at what these differences mean for teams trying to decide on the best cloud solution for their network.</p> <h2 id="public-cloud-vs-private-cloud">Public cloud vs. private cloud</h2> <p>The first place to start is understanding the difference between a <em>public cloud</em> and a <em>private cloud</em>. When a cloud is public, it means that its data stores and services are fully hosted and managed by the cloud provider and are accessible via the public internet.</p> <p>On the other hand, a private cloud will be accessible through a private network and may be on-premises or hosted by a remote third party. 
Private clouds shift many management responsibilities onto the cloud consumer but offer greater degrees of security and, in some cases, performance.</p> <h2 id="what-is-multi-cloud">What is multi-cloud?</h2> <p>Multi-cloud networks rely on cloud services and infrastructure from multiple cloud providers. What this looks like in a given application can vary as network or engineering demands dictate.</p> <h3 id="benefits-of-multi-cloud">Benefits of multi-cloud</h3> <p>Not all services are created equal between cloud providers, and being able to host certain applications, data stores, CDNs, etc., with a variety of providers offers network engineers a range of benefits:</p> <ul> <li><strong>Proximity</strong>: When performance is key, using the closest data center possible can reduce latency and costs. Pursuing a multi-cloud strategy gives you the ability to have your applications hosted in closer proximity to more end users.</li> <li><strong>Reliability</strong>: Outages are unlikely to affect multiple providers at once. Multi-cloud networks offer greater reliability to networks that can’t afford downtime.</li> <li><strong>Flexibility</strong>: Not every cloud provider is good at the same things. AWS might have the best pricing and performance for serverless, while your larger traffic loads have a better track record with Azure’s capacity management, but your caching performance is best with GCP…and so on.</li> </ul> <h3 id="challenges-of-multi-cloud">Challenges of multi-cloud</h3> <ul> <li><strong>Complexity</strong>: The cloud is complex. The proliferation of APIs, third-party tools, and the risk for silos when handling multiple clouds within a single network exacerbates this complexity.</li> <li><strong>Cost</strong>: Hard-to-follow cloud pricing models increased data engineering needs, and myriad opportunities for runaway billing make optimizing against costs in multi-cloud networks a full-time job.</li> <li><strong>Management</strong>: Managing multiple cloud deployments and infrastructures often requires the formation of new personnel abstractions, like a cloud or platform team, to monitor and manage the larger network interactions.</li> <li><strong>Security</strong>: While a boon for reliability, multi-cloud networks have an increased surface area for cyber threats and must be vigilantly monitored for vulnerabilities.</li> </ul> <h2 id="what-is-a-hybrid-cloud">What is a hybrid cloud?</h2> <p>A <a href="https://www.kentik.com/resources/hybrid-cloud-monitoring/">hybrid cloud network</a> uses both public and private clouds. As covered above, a private cloud can be a private and secure connection to a cloud provider or an on-prem data store. In both cases, the customer faces a greater degree of responsibility than with a traditional public cloud offering.</p> <h3 id="different-types-of-hybrid-clouds">Different types of hybrid clouds</h3> <p>Hybrid cloud networks come in two forms, homogeneous or heterogeneous.</p> <ul> <li> <p><strong>Homogeneous</strong>: Homogeneous hybrid clouds work with only one provider for all their cloud services. The entire application stack deals with only one cloud provider in this configuration.</p> </li> <li> <p><strong>Heterogeneous</strong>: In a heterogeneous hybrid cloud, private and public cloud services are delivered by various providers. 
The only difference between a multi-cloud and a heterogeneous hybrid cloud network is the latter’s inclusion of a private cloud.</p> </li> </ul> <h3 id="benefits-of-hybrid-cloud">Benefits of hybrid cloud</h3> <p>The stand-out benefit for hybrid clouds is <strong>security</strong>. The inclusion of private clouds offers organizations with financial, government, healthcare, or other sensitive data greater control and peace of mind around data access. For services handling large data sets or requiring application-specific hardware, hybrid clouds offer scaling and computation abilities that would be cost-prohibitive or otherwise unavailable.</p> <h3 id="challenges-of-hybrid-cloud">Challenges of hybrid cloud</h3> <p>The challenges of hybrid cloud networks mirror many of the challenges of multi-cloud deployments. Namely, increased complexity, greater cloud management responsibilities, and the need for increased cost control measures. In addition to the costs associated with managing and maintaining hybrid cloud networks, the presence of private clouds requires additional CapEx costs for hardware acquisition, storage, and maintenance.</p> <h2 id="can-a-hybrid-cloud-also-be-a-multi-cloud">Can a hybrid cloud also be a multi-cloud?</h2> <p><strong>Yes.</strong> Hybrid cloud and multi-cloud are not mutually exclusive concepts. A “heterogeneous hybrid cloud” is the same as a multi-cloud network that includes a private cloud.</p> <h2 id="the-key-differences-between-multi-cloud-and-hybrid-cloud">The key differences between multi-cloud and hybrid cloud</h2> <table> <tbody> <tr> <td></td> <td><b>Multi-cloud</b></td> <td><b>Hybrid-cloud</b></td> </tr> <tr> <td>Public cloud</td> <td>Yes</td> <td>Yes</td> </tr> <tr> <td>Private cloud</td> <td>No</td> <td>Yes</td> </tr> <tr> <td>Single provider</td> <td>No</td> <td>Maybe</td> </tr> <tr> <td>Multiple providers</td> <td>Yes</td> <td>Maybe</td> </tr> </tbody> </table> <p>In terms of cost and performance, there are many factors outside the pure architectural decision to be hybrid cloud or multi-cloud that affect the financial footprint and performance of your cloud networks. Generally speaking, scale will be the most critical detail when considering cost and performance with cloud deployments. Still, prudent networking, monitoring, and optimization decisions can save organizations millions of dollars.</p> <p>However, regarding reliability and security, we see some key differences between the two. Multi-cloud has a distinct reliability edge as it is extremely unlikely outages, however significant, will affect multiple cloud providers simultaneously. For critical networks, hybrid clouds that rely on a single provider are simply too vulnerable.</p> <p>For security, the edge goes to hybrid clouds because of the inclusion of private clouds. This establishes tighter controls around physical access, authorization configurations, and the ability to respond should a compromise be identified.</p> <h2 id="how-to-choose-between-multi-cloud-and-hybrid-cloud">How to choose between multi-cloud and hybrid cloud?</h2> <p>As we’ve seen, choosing between multi-cloud and hybrid-cloud boils down to application needs.</p> <p>Does part of your application need a private cloud for security or specific hardware? Then a hybrid cloud will probably be your best solution. Should it be homogeneous or heterogeneous? A heterogeneous hybrid cloud is best if you need the engineering flexibility and greater reliability that comes from multiple providers. 
Can’t afford the overhead associated with the complexity of numerous providers? A homogeneous hybrid cloud might suit your organization best.</p> <p>If, however, your network demands widespread availability, unshakeable reliability, and the engineering flexibility of multiple providers, a multi-cloud configuration is a way to go.</p> <h2 id="how-can-kentik-help">How can Kentik help?</h2> <p>With integrations across all cloud providers, the market’s most comprehensive network monitoring capabilities, and AI-driven insights, Kentik offers <a href="https://www.kentik.com/product/cloud/">best-in-class network observability</a> for public, private, hybrid, and multi-cloud networks, and infrastructures.</p><![CDATA[Peering? ...in this Economy?]]><![CDATA[In this post, Nina Bargisen explores how peering coordinators can use combined NetFlow and BGP analysis tools to work around different capacity upgrades when money or delivery times pose a challenge.]]>https://www.kentik.com/blog/peering-in-this-economyhttps://www.kentik.com/blog/peering-in-this-economy<![CDATA[Nina Bargisen]]>Thu, 01 Dec 2022 05:00:00 GMT<p><a href="https://www.kentik.com/kentipedia/what-is-internet-peering/" title="Kentipedia: What is Internet Peering?">Peering</a> is great for quality and cost savings, but how do you ensure your network is up to the task?</p> <p><a href="https://www.kentik.com/kentipedia/network-capacity-planning/" title="Kentipedia: Network Capacity Planning">Capacity planning</a> is a classic task that should be very familiar to network operators. Providing the right capacity at the right time is crucial to finding a balance between cost and quality in your network.</p> <p>The obvious capacity constraint on an edge router is bandwidth, but handling the ever-increasing amount of prefixes can push CPU and memory to their limits when the BGP process is crunching the data. So in a normal operational cadence, you will be monitoring your CPU and memory usage as well as the link loads on your equipment.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5ItWjhq0ltxL5RAuwlqcN5/a81620eb96925c108a801dc5751e25b8/peering-active-bgp-entries.png" style="max-width: 650px;" class="image center" alt="Graph showing the effects of crunching an increasing number of BGP prefixes" /> <div class="caption" style="margin-top: -30px; max-width: 650px;">With an ever-increasing number of prefixes, crunching BGP data can be an overlooked source of strain on your hardware.<br />Data source: <a href="https://bgp.potaroo.net/">BGP Routing Table Analysis</a>; chart source: <a href="https://www.cidr-report.org/cgi-bin/plota?file=%2fvar%2fdata%2fbgp%2fas2.0%2fbgp%2dactive%2etxt&descr=Active%20BGP%20entries%20%28FIB%29&ylabel=Active%20BGP%20entries%20%28FIB%29&with=step">CIDR Report</a></div> <p>But, in the wake of the global pandemic and now the war in Europe in 2022, delivery times on the equipment have started to climb. So what can be done to manage the capacity with the possibility of delayed upgrades?</p> <h2 id="the-truth-is-in-the-traffic">The truth is in the traffic</h2> <p>Let’s first look at the situation with external interfaces running full.</p> <p>When external interfaces run full, and we cannot upgrade the capacity, the only solution is to try to move network traffic elsewhere. 
It can be acceptable for some types of traffic to just let the interface run full and drop packets, but most of us want to avoid that solution as much as possible.</p> <p>To do this efficiently, we want to chase a set of prefixes that will move precisely the right amount of traffic. You will also need to know where it is best to move it from a quality and an economic perspective.</p> <p><strong>So, where do we move the traffic?</strong></p> <p>We already discussed the importance of <a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1/" title="Blog post: The Peering Coordinator&#x27;s Toolbox">keeping track of our connectivity cost when making peering decisions</a>. More so than ever, it is vital to balance the cost with the most optimal route and, to some extent, quality.</p> <p>A selection of potential other paths to check out is:</p> <ul> <li>Another direct connection to the ASN</li> <li>A peering ASN that is upstream to the ASN</li> <li>A transit connection with available capacity within the commit</li> <li>The cheapest transit connection if we do not have any room within the commit</li> </ul> <p>We can always debate whether moving traffic to peer or a transit with room within the commit is the best option. Price-wise, these are similar, so choosing the best path is now the more important choice.</p> <p>In the examples below, you can learn more about how to validate that the traffic will move to the selected path.</p> <p><strong>What to move and how?</strong></p> <p>The foundation of this analysis is to know our traffic per prefix and ASN. We will need a good <a href="https://www.kentik.com/kentipedia/netflow-analyzers-and-netflow-tools/" title="Kentipedia: NetFlow Analyzers and NetFlow Tools">NetFlow tool</a> that can optimally give us enough insight – in both directions.</p> <p><em>For inbound traffic:</em></p> <p>Select prefixes (or ASNs) with combined destinations that give us the right amount of traffic (make sure to pick a long enough period) – remove these from the announcements of the eBGP sessions on the particular interface.</p> <p><strong>Now, where will the traffic move to?</strong></p> <ul> <li> <p>If we have other direct connections to the ASN, traffic will move to the best route from the source of the traffic unless the ASN in question is a CDN. In this case, the traffic will most likely move the closest connection to this CDN for the destinations in the network. If the addresses in the prefixes are spread all over the network, it might be challenging to predict how much traffic is moving where. If they are used in more well-defined areas of the network, for example, in a metro, we can better predict where the traffic will move.</p> </li> <li> <p>If we do not have any other connections to the ASN or ASNs, then we need to understand how the traffic sources we remove will reach the network. If the source ASN has a looking glass, that is the best way to find out. If not (which is unfortunately quite common these days), we can do some qualified guessing by using tools like <a href="https://www.kentik.com/blog/introducing-kmi-insights/" title="Learn More About Kentik Market Intelligence">Kentik Market Intelligence</a> or <a href="https://www.kentik.com/kentipedia/bgp-netflow-analysis/" title="Kentipedia: BGP NetFlow Analysis">BGP analysis tools</a>. With this, we can determine which providers the ASN uses and work out the potential paths to the network. 
Inbound traffic is complex, so be prepared for surprises and keep a good eye on the quality of the traffic after the move.</p> </li> </ul> <p><em>For outbound traffic:</em></p> <p>Select prefixes with destinations that combined give us the right amount of traffic. Filter those on the eBGP session on the particular interface, so we do not accept them from our peers or transits over the interface.</p> <p>It is much easier to predict where traffic moves for outbound traffic since we are in control, or at least have full knowledge, of the routing policies out of the network. If we do not have access to an internal looking glass, we can check what BGP routes to the filtered prefixes exist on the other edge routers. Recall that the BGP tables will contain all the BGP paths known to the router and show alternatives if the best is running via the interface we are trying to offload. We may need to tweak the BGP policies to ensure the best route after filtering.</p> <p><em>To see how Kentik can help with your traffic engineering, check out the <a href="https://kb.kentik.com/v4/Fa02.htm" title="Kentik Knowledgebase: Traffic Engineering Workflow">Traffic Engineering workflow</a>.</em></p> <p><strong>Overloaded CPU and memory due to BGP processes</strong></p> <p>In this case, we will only consider the outbound traffic. We will look for ASNs announcing large amounts of prefixes but where we are sending only a little to them. If we have a traffic volume issue simultaneously, this may also solve that issue.</p> <p>The goal is to reduce the number of prefixes the router needs to process. If we do not have volume issues, we want to preserve as much outbound traffic as possible on the interfaces.</p> <p>Examples of the kind of ASNs we can chase:</p> <ul> <li>ASNs with a large number of prefixes with little to no traffic</li> <li>Provider networks and their customers where you peer with both parties.</li> <li>Provider networks, where only one or a few downstreams are significant destinations</li> </ul> <h2 id="asns-with-minimal-traffic-and-a-large-number-of-prefixes">ASNs with minimal traffic and a large number of prefixes</h2> <p>The NetFlow analysis lets us know which networks are candidates to be filtered, but getting the number of received prefixes can sometimes take time and effort. This is straightforward if we have a peering tool available with received routes per peer or have CLI access and can just look it up. But if we don’t have either, we will need help from the same tools we used when estimating how the inbound traffic would move around. In that case, you can look up both transited and originated routes per ASN, and the sum is a reasonable estimate of what each ASN is announcing to you.</p> <p><em>Note that if you are peering on a route-server at an internet exchange, this estimate is the best guess unless you want to parse through all the received routes and count prefixes per ASN with a script.</em></p> <div style="width:100%;background-color:#ebeff3;border:1px solid #BADDFF; padding: 20px; border-radius: 12px;margin: 0 auto;"><p style="font-weight: 500; text-align: center;font-size: 98%">PEERING WITH A ROUTE SERVER</p> <p style="font-size: 98%;">Peering with a given route-server means that a participant with a single BGP session peers with every other ASN peering with that same route server. 
The participant receives the same prefixes from the other ASNs, just as they would if they had a session with each of the ASNs, and each of the other ASNs will receive the prefixes that the participant announces during the session. It is true one-to-many peering.</p> <p style="font-size: 98%;">Many IXPs have implemented feature-rich route-servers where the participants can select which ASNs their prefixes should announce, effectively providing a selective peering policy.</p></div> <h2 id="peering-with-provider-networks-and-their-customers">Peering with provider networks and their customers</h2> <p><strong>How do we identify those?</strong></p> <p>Breaking our NetFlow traffic into AS paths will not cut it since our router has already picked the best route to install into the FIB. We will not send any traffic on the usable paths, so this approach is not viable.</p> <p>The most precise way, however, would be to analyze the AS paths from all the received routes on your router. Some good scripting will give us what we need here.</p> <p>Another way to get started is to identify the ASNs with more significant amounts of prefixes. This reduces your candidate list enough so that a manual check of whether we have a relationship with one of their providers using the BGP analysis tools will suffice.</p> <p>Once you know which provider/customer pairs we peer with, you can evaluate which to keep and which to either filter or de-peer to offload the router. Maybe it’s best to keep a provider who aggregates several other ASNs. Maybe it is best to keep the customers and de-peer the provider. Or maybe we can filter most of the prefixes and only use a provider to reach a small subset of their customers.</p> <p>While evaluating, we want to make sure to consider the back-up path for the ASNs we want to reach and the alternative paths to those you want to filter, just like we did when moving traffic to offload traffic.</p> <h2 id="monitoring-the-quality-of-the-traffic">Monitoring the quality of the traffic</h2> <p>From this point onwards, the main concern is maintaining high-quality traffic. When we do not directly connect to an ASN that we have traffic to or from, monitoring the quality becomes critical.</p> <p>Monitoring for latency, jitter, and packet loss is a good indication of traffic quality. We can set up continuous testing from agents deployed at the peering locations in the network and create alerts when degradation happens.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7iTAUeSjhCV15gEIdwiAcI/e1059255abc5be7441eacbdab23b3fd2/customize-asn-testing-coverage.png " style="max-width: 800px;" class="image center" withFrame alt="Screenshot of the ASN testing dashboard in the Kentik UI, with General, Health, Ping, and Traceroute options available" /> <div class="caption" style="margin-top: -30px;">With Kentik, your NetOps team can automate and customize ASN testing coverage, signal thresholds, and frequency to monitor your peering connections.</div> <p>We can also dig into the traceroutes when alerted and see if traffic uses the paths we were planning.</p> <p>We have covered some suggestions on what can be done when necessary upgrades are impossible. 
Still, there are several more things to consider when going down this path, like the [risk of blackholing](<a href="https://www.kentik.com/blog/why-you-need-to-monitor-bgp/">https://www.kentik.com/blog/why-you-need-to-monitor-bgp/</a> “Learn more about blackholing in “Why You Need to Monitor BGP"") and the relationship to our peers.</p> <p>Join us soon as we dive into these issues and more in the next post in our ongoing peering series.</p> <p>Don’t want to wait? <a href="#demo_dialog" title="Request your Kentik demo">Sign up for a Kentik demo today</a>.</p><![CDATA[Close the Cloud Monitoring Gap with Network Observability]]><![CDATA[To fully capitalize on the promises of digital transformation, IT leaders have come to recognize that a mix of cloud and data center infrastructure provides several business advantages. Read on to learn how an observable network leads to a better customer experience.]]>https://www.kentik.com/blog/close-the-cloud-monitoring-gap-with-network-observabilityhttps://www.kentik.com/blog/close-the-cloud-monitoring-gap-with-network-observability<![CDATA[Kevin Woods]]>Wed, 30 Nov 2022 05:00:00 GMT<p>To fully capitalize on the promises of digital transformation, IT leaders have come to recognize that a mix of cloud and data center infrastructure provides several business advantages, including increased agility, cost efficiencies, global availability, and, ultimately, better customer experiences.</p> <p>But, for the network and infrastructure teams making these hybrid infrastructures a reality, it is increasingly difficult to entirely understand and control what’s happening in these networks. The patchwork of cloud providers, architectures, services, carriers, peers, and the transient nature of many cloud-based connections and resources creates an equally patchy monitoring terrain, further complicated by global scale. These inputs can affect hybrid cloud networks’ cost, performance, and reliability. Network operators must account for their performance individually and holistically understand this data.</p> <div as="Testimonial" index="0" color="blue"></div> <p>Traditional <a href="https://www.kentik.com/kentipedia/what-is-npmd-network-performance-monitoring-and-diagnostics/">network performance monitoring (NPM)</a> strategies fail to synthesize this distributed telemetry in an actionable way, leaving operators and engineers with critical visibility gaps as traffic moves around and between networks. And it isn’t just a matter of collecting data. How is it being analyzed? How is it presented? Are you able to ask questions about your data?</p> <p>With the rest of this blog post, I want to explore how we use network observability here at Kentik to bridge these gaps and help our customers build and maintain affordable, performant, and reliable networks.</p> <h2 id="making-your-networks-observable">Making your networks observable</h2> <p>Making a network observable requires a ground-up effort that, for some systems, might require rethinking fundamental network assumptions. At the very least, it will involve becoming very intimate with your network’s telemetry and perhaps even thinking of new data to establish and collect.</p> <h3 id="instrumentation">Instrumentation</h3> <p>The tools and strategies of observability began in DevOps circles trying to tackle the problems of monitoring distributed systems at scale. 
One of the more interesting projects is <a href="https://opentelemetry.io/">OpenTelemetry</a>, an effort to standardize metric, log, and trace data instrumentation, generation, and collection for observability efforts. As one of the fastest-growing CNCF (Cloud Native Computing Foundation) projects, OpenTelemetry’s popularity highlights the need for distributed systems engineers to be able to instrument their code to provide otherwise unavailable telemetry.</p> <p>In network observability, this granular instrumentation is one of the principal data mechanisms allowing operators to “ask any question” about their observable systems. As such, instrumentation begs considerations for collection, sampling, storage, persistence, and analysis. But what needs instrumentation? There is readily accessible flow data from public clouds and telemetry from network appliances. Still, synthetic agents need instrumentation at the host level and the instrumentation for application and service context at the orchestration and service mesh layer.</p> <h3 id="contextualization">Contextualization</h3> <p>Instrumentation provides the “how,” and contextualization provides the “what” for network telemetry.</p> <p>As I mentioned, it’s possible to get rich performance metrics from your key application and infrastructure servers, even components like proxies and load balancers. You can map those metrics against contextual details like customer or application ID, internet routes (BGP), and location (GeoIP) and correlate them with volumetric traffic flow details (NetFlow, sFlow, IPFIX) from your network infrastructure. Storing these details unsummarized for months enables operators to get answers in seconds on sophisticated queries across multi-billion row datasets (but can become an engineering challenge at scale).</p> <p>Implementing contextual instrumentation often represents the early “heavy lifting” when making networks observable. It provides much of the raw data that observability platforms like Kentik rely on to provide powerful querying support and detailed network visualizations.</p> <h3 id="visualization">Visualization</h3> <p>Besides providing contextually rich telemetry for analysis, instrumentation allows networks to be visualized by operators in several helpful ways. Visualizations provide a layer of abstraction that enables network operators to unify disparate data sources such as data centers, private and public clouds, Software-as-a-Service (SaaS) applications, network edge devices, internet, backbone, WAN and SD-WAN, and more.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4Swfu3aMSxZtQt7EtCeaLd/9572ea0d175869a7e2e26d9f53cc21df/cloud-map-home.png" style="max-width: 800px;" withFrame class="image center" alt="Kentik network visualization" /> <p>You might ask, so what? Visualizations can be paired with performance and security alerts for quick reads on “this is where and what your problem is,” provide a framework for digging down into data, and deliver an efficient, consistent (everyone sees the same thing) way to monitor traffic and overall network performance. This means a more precise picture and faster, more targeted responses when things go wrong.</p> <h3 id="exploration">Exploration</h3> <p>Having an observable network means getting answers quickly in a system that lets you query, filter, drill in, zoom out, and map your network telemetry, no matter how large or complex the data set. 
This is the “ask anything” tenet of network observability and represents one of the more significant departures from traditional network monitoring, closing the gaps between data siloed in separate network performance tools.</p> <p>Getting immediate responses to your queries is key when business-critical network operations are at stake. Troubleshooting often requires being able to sharpen your filters through repeated attempts, and slow responses can cause this workflow to become tedious and insufficient for the task.</p> <p>This simple yet powerful ability to engage in open-ended network exploration from a single pane of glass saves network operators time, subscription costs, and even helps maintain or reduce overhead costs in the face of scaling.</p> <h3 id="automation-and-optimization">Automation and optimization</h3> <p>Thankfully, one of the real gems of an observable network is that it creates an operations environment that is not beholden to constantly <em>reacting</em> to network performance issues and security threats. This uniquely contextual data of network observability allows network operators to craft predictable, highly-specific automation strategies for security threats, performance deviations, and even network-specific demands like peering policies, regardless of scale and complexity.</p> <p>With unparalleled visibility into traffic, network observability lets you see your blind spots. Optimize against specific resources, applications, locations, and customers, ensuring your networks are dynamic and delivering the highest quality experience at the best cost.</p><![CDATA[Hidden Costs of Cloud Networking: Optimizing for the Cloud - Part 3]]><![CDATA[Getting the most out of cloud networks requires new tools and strategies, captured in the idea of network observability. Read the final entry in our series on managing the hidden costs of cloud networking.]]>https://www.kentik.com/blog/hidden-costs-of-cloud-networking-optimizing-for-the-cloud-part-3https://www.kentik.com/blog/hidden-costs-of-cloud-networking-optimizing-for-the-cloud-part-3<![CDATA[Ted Turner]]>Wed, 23 Nov 2022 04:00:00 GMT<p>I used the first two parts of this series to lay out my case for how and why cloud-based networks can effectively “Trojan horse” costs into your networking spend and highlighted some real-world instances I’ve come across in my career.</p> <p>In the third and final installment of this series, I want to focus on ways you can optimize your personnel and cloud infrastructures to prevent or offset some of these novel costs. When considering network optimizations, I like to group them in terms of <strong>cost</strong>, <strong>performance</strong>, and <strong>reliability</strong>.</p> <p>As engineering goals, cost, performance, and reliability are often at odds with each other, the cheapest path for your traffic is often not the most performant; a highly reliable network is maybe not the most cost-effective. If there are natural trade-offs, how do we decide what to prioritize? Cost, performance, and reliability must constantly be balanced against clearly defined business objectives.</p> <div class="pullquote right" style="max-width: 350px;">Unfortunately for network operators, the scale and complexity of networking in cloud, hybrid cloud, and multi-cloud environments have largely rendered traditional network monitoring strategies obsolete.</div> <p>The first step is understanding what cloud network your customers and business objectives demand. 
Does your network infrastructure need to scale dynamically to deal with highly variable traffic volumes? Should your network be multi-zonal, and if so, how should your network configurations differ based on these different zones? Which cloud provider or combination of providers is the best for your infrastructure? What peering strategy makes the most sense, and for which customers? Are your resources, cloud or otherwise, being used in the cheapest, most performant, most reliable way? And, of course, is this constantly shifting network safe?</p> <p>Being able to answer these and similar questions about your network regularly is the foundation of any honest cloud optimization attempt. Unfortunately for network operators, the scale and complexity of networking in cloud, hybrid cloud, and multi-cloud environments have largely rendered traditional network monitoring strategies obsolete. To fill this gap, <a href="https://www.kentik.com/go/ebook/ultimate-guide-to-network-observability" title="Get Kentik&#x27;s ebook The Ulitmate Guide to Network Observability">network observability</a> has arisen as a body of tools and strategies to help you answer any question about your networks.</p> <h2 id="optimizing-your-org-chart">Optimizing your org chart</h2> <p>Before I go over some optimizations that can be made leveraging network observability, I want to take a couple of steps back up the development chain and discuss some best practices at the org level. While cloud networks are certainly “software problems,” my experience with customers and my work history has shown me that how businesses organize themselves around their software problems is a crucial component of success.</p> <p>Just as many software stacks have transformed from monoliths to distributed, service-oriented architectures, the modern network has evolved from a strong, single data center into a distributed web of networks with service-specific configurations and infrastructures. If you’ll recall from <a href="https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-i/">Part I</a>, one of the benefits of working with cloud resources is the pace of development enabled by loosely-coupled services with independent architectures (and networks). While this is undoubtedly a pro, a con is that this independence can lead to siloed decision-making and costly networking oversights and miscommunications.</p> <p>For such a development environment, I’ve found it essential to have a platform-level NetOps team to manage visibility over the many service-level networks. This personnel abstraction establishes responsibility for inter- and extra-service networking issues and creates a team to manage a network observability tool.</p> <h2 id="optimizing-your-cloud-network">Optimizing your cloud network</h2> <p>As a complement to the metrics, logs, and traces of DevOps observability, networking observability takes additional, network-specific data into consideration: flow logs, network hardware telemetry, prefixes, paths, underlays and overlays, software-defined networks, and more. 
A good network observability tool incorporates this data and helps platform-level teams have real-time visibility into the state of the network and its impact on the business.</p> <p>The visibility provided by such a tool allows NetOps to answer any question about the network and forms the technical foundation for optimizing away your cloud network’s “hidden costs.”</p> <h3 id="optimizing-network-costs">Optimizing network costs</h3> <p>Keeping track of costs can get tricky in cloud networks that incorporate a range of provider contracts, variable cost models, router configs, and peering relationships. An <a href="https://www.kentik.com/solutions/usecase/clouds-and-hybrid/">effective network observability tool</a> can provide actionable insights by allowing you to overlay configurable financial models over this traffic accounting data. It’s one thing to see that there was some dropped traffic and quite another to immediately be shown affected customers, any SLAs that might have been breached, and the real-time dollar and cent impact.</p> <p>By getting immediate and actionable insights into the real costs associated with networking decisions, network observability can help you optimize costs by:</p> <ul> <li>Spotting and avoiding inefficient and costly traffic routing</li> <li>Implementing peering policies that protect you from overspending</li> <li>Avoiding congested and degraded paths</li> <li>Identifying unique applications affected when a pathway is impacted</li> <li>Identifying impacted availability zones, regions (cloud PaaS), or geographies (ISP, WAN, data center, campus, retail location…)</li> </ul> <h3 id="optimizing-network-performance">Optimizing network performance</h3> <p>As mentioned above, DevOps-centric observability focuses on logs, metrics, and trace telemetry and analyzes this data to provide useful ways of exploring performance in distributed systems. By combining visualizations like service maps with statistical analysis of <a href="https://sre.google/sre-book/monitoring-distributed-systems/">key signals</a>, DevOps teams can not only surface root causes when there are issues but provide a wealth of data to create meaningful performance baselines and thresholds to optimize against.</p> <p>A good network observability tool ingests <a href="https://www.kentik.com/kentipedia/what-is-npmd-network-performance-monitoring-and-diagnostics/">network performance telemetry</a> from your private, cloud, multi-cloud, or hybrid cloud infrastructure. 
It adds contextual details such as customer, cloud provider, region, or traffic type to this telemetry, giving you the complete picture of your data when weighing optimization strategies.</p> <p>One of the most powerful ways network observability can help you optimize performance is by allowing you to ask (and answer) any question about your cloud network:</p> <ul> <li>What is the most performant path for my traffic?</li> <li>Do we need to invest in more hardware to meet capacity, or can we improve routing or peering efficiency for our cloud onramps?</li> <li>Is my traffic optimally distributed across cloud regions?</li> <li>What is our highest priority traffic?</li> </ul> <h3 id="optimizing-network-reliability">Optimizing network reliability</h3> <div class="pullquote right" style="margin-top: 10px;">It’s important to have a performant network; it’s very important to have a cost-effective network, but it is absolutely critical to have a reliable network.</div> <p>It’s important to have a performant network; it’s very important to have a cost-effective network, but it is absolutely critical to have a reliable network, as being up and running is something of a prerequisite for both performance and cost.</p> <p>Network reliability has two main facets: security and availability. The baselines and thresholds established through your performance optimizations and a powerful analytics engine can give you <a href="https://www.kentik.com/solutions/usecase/ddos-detection-and-network-security/">advanced warnings about capacity or security issues</a>. An effective network observability tool can detect these outliers in traffic quality or source immediately. Automated processes can be triggered that reroute or re-provision your traffic accordingly.</p> <p>Failures will happen. As a network engineer, you must ensure that <a href="https://en.wikipedia.org/wiki/System_of_record">systems of record</a>, <a href="https://en.wikipedia.org/wiki/Single_source_of_truth">sources of truth</a>, and customer and internal data are accurate and available in multiple regions. Validating that the pathways between data stores can communicate without error or latency ensures data replication strategies can execute. When errors or latencies crop up, as they inevitably will, knowing that the data stores are out of sync is critical to making service restoration choices. It could be devastating to a business to choose to transmit stale data into a data store due to unawareness.</p> <h2 id="conclusion">Conclusion</h2> <p>Working in the cloud isn’t for every business, and if the pure motivation for the migration is cost savings, you should be wary. If scalability, availability, and an infrastructure that can support distributed, service-oriented development and deployment are your primary goals, running your network in the cloud can significantly improve cost, performance, and reliability.</p> <p>For networks at scale, managing the complexity of your public, hybrid, or multi-cloud traffic and infrastructure in a way that captures actionable insights and deepens business value is a tall order. Getting the most out of these cloud networks requires new tools and strategies, captured in the idea of network observability. By taking a big data approach and centering the ability to ask and answer any question about our networks, network observability tools such as Kentik offer NetOps teams and engineers an unrivaled tool to secure, optimize, and strengthen your cloud network. 
<a href="#demo_dialog" title="Request your Kentik demo">Get a demo</a> and see for yourself.</p><![CDATA[Avoid These Five Cloud Networking Deployment Mistakes]]><![CDATA[When transitioning from physical infrastructure to the cloud, it's easy to think that your networks will instantly be faster, more reliable, and less costly overnight. As it turns out, there's more to it than that.]]>https://www.kentik.com/blog/avoid-these-five-cloud-networking-deployment-mistakeshttps://www.kentik.com/blog/avoid-these-five-cloud-networking-deployment-mistakes<![CDATA[Kevin Woods]]>Thu, 17 Nov 2022 05:00:00 GMT<p>When transitioning from physical infrastructure to the cloud, it’s easy to think that your networks will instantly be faster, more reliable, and produce windfalls of cost savings overnight. Unfortunately, this wishful line of thinking fails to account for some of the complexities of cloud networking and is one of the biggest drivers of the cloud deployment mistakes we see.</p> <p>Cloud networking often introduces new concepts to teams, such as VPCs, cloud interconnects, and multiple availability zones and regions, and this lack of familiarity can lead to questionable implementation. On top of this, these networks connect with other clouds and the internet, forming hybrid and multi-cloud architectures. Combined with the rapid pace of deployment and lack of visibility into how cloud resources are being used, it’s easy to make costly mistakes.</p> <p>After years of helping clients work through these lessons, we’ve compiled this list to help you avoid making the same mistakes others have when transitioning to the cloud.</p> <h2 id="mistake-1-duplicate-services-and-unknown-dependencies">Mistake #1: Duplicate services and unknown dependencies</h2> <p>This mistake commonly happens when multiple, siloed teams jump into the cloud without giving much thought to shared architecture. Imagine separate teams building separate applications in the cloud. Each team spins up cloud resources such as compute, storage, network components, etc. Many of them are actually reusable and shareable. Some are obvious standard services like DNS, databases, load-balancers, etc.</p> <img src="//images.ctfassets.net/6yom6slo28h2/cFhxjgQpYYPrfyVG7MD1v/3a38f5480cae03797d396b75253e0346/networked-sphere.png" style="max-width: 500px;" class="image center no-shadow" alt="Unknown dependencies in cloud networking" /> <p>But it isn’t just duplicating cloud resources. We also see duplication of custom-developed microservices that perform precisely the same function. With all the teams running fast in parallel, it’s not hard to see how they may reinvent the wheel repeatedly. Soon, the cloud environment becomes a tangled web of interdependencies. Without some kind of visibility, including brittle architecture, wasted development effort, and massive cloud overspending.</p> <p>Here is a scenario to illustrate:</p> <p>Team A sets up a DNS service for their apps. Not knowing about Team A’s DNS service, Team B creates a duplicate DNS service for their own app. Now their organization is paying twice for instances that could easily be collapsed into shared infrastructure.</p> <p>Imagine that a third team (Team C) also needs a DNS service for their app. They begin using Team B’s DNS without their knowledge. Sometime later, Team B learns about Team A’s DNS service, so they start using it and shut down their own. 
Now Team C’s app has an outage because a service it depended on disappeared without warning.</p> <h2 id="mistake-2-traffic-or-request-hairpinning">Mistake #2: Traffic or request hairpinning</h2> <img src="//images.ctfassets.net/6yom6slo28h2/5riaGOOStcpyV2n9nKCa1X/2b6ac250b13af6150600f06aec968499/hairpinning-traffic.png" style="max-width: 330px; margin-top: 20px" class="image right no-shadow" alt="Network traffic hairpinning" /> <p>What is hairpinning? It happens when services that should communicate over short, fast, and cheap network paths end up communicating over long, expensive paths with lots of latency. There are multiple causes and flavors of this problem. Let’s discuss two common examples here.</p> <p>In the first scenario, imagine two services sitting in separate zones within a physical data center. Through poor architecture choices or simple IP routing misconfiguration, the communication path between those services traverses cloud interconnects and VPCs, instead of the local data center network fabric. Every time those services communicate, they’re experiencing much higher latency and racking up expensive per-GB cloud data transfer charges. The inverse is also possible: two cloud-deployed services communicating via a network path that traverses a physical data center.</p> <p>In the second scenario, imagine a service chain consisting of a web front-end, application server, and database backend. A DevOps team has migrated the application server to the cloud, but the web front-end and database are still in the legacy data center. Now every web request results in a chain of calls that traverse cloud interconnects twice, resulting in poor performance and cost exposure.</p> <p>Without visibility, scenarios like these can persist indefinitely.</p> <h2 id="mistake-3-unnecessary-inter-region-traffic">Mistake #3: Unnecessary inter-region traffic</h2> <p>It’s relatively common knowledge that most cloud components are priced using a metered, pay-as-you-go model. But the pricing details of specific components often get lost in the weeds. Did you know that the per-GB cost of inter-region data transfer is significantly more expensive than intra-region data transfer? On the AWS pricing calculator, inter-region data transfer is twice the price of intra-region data transfer! It’s even more dramatic on Google Cloud, with egress between regions almost 10x as expensive as egress within regions.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1fD3Bwa09uaKHxKKTwRIRj/f41e0b25ce61caea6a0bf70674073551/intra-inter-region-traffic-cost.png" style="max-width: 600px;" class="image center no-shadow" alt="Inter-region and intra-region costs" /> <p>Cloud architects and developers build their applications or infrastructures in a way that generates significant, unnecessary traffic between regions, potentially causing a dramatic impact on the bottom line.</p> <p>To streamline cloud network costs, we recommend the following three steps:</p> <ul> <li> <p><strong>Identify</strong>: Get a view of significant contributors to inter-region internet egress traffic. Answer questions such as: What applications are causing that big bandwidth bill? Which application teams are responsible for this traffic? And, are we exposed to data transfer charges that are unnecessary?</p> </li> <li> <p><strong>Relocate</strong>: Evaluate workloads to move within the infrastructure. 
Replace workloads to minimize inter-region communication.</p> </li> <li> <p><strong>Consolidate</strong>: Carefully examine all workloads, understand the network traffic generated, and combine what can be merged to save more resources.</p> </li> </ul> <h2 id="mistake-4-accessing-cloud-services-over-the-internet-from-local-vpcs">Mistake #4: Accessing cloud services over the internet from local VPCs</h2> <p>Everyone building apps in the cloud knows you can supercharge efforts by using native cloud services — e.g., AWS S3, Google Cloud Pub/Sub, Azure AD — as building blocks rather than starting from scratch. These services are convenient and save time but come with per-hour, per-gigabit, or per-transaction price tags. It’s also important to remember that there are data egress costs that, at scale, can add up quickly. The good news is that savvy operators can considerably decrease these costs.</p> <p>Some background helps to understand these costs better. First, cloud providers make their services accessible to the internet using IP address ranges representing services running in a single region. For example, S3 service running in the AWS us-west-1 region is advertised to the internet from the 3.5.160.0/22 address range. To use these services, clients resolve DNS names, like <code class="language-text">[sms.eu-west-3.amazonaws.com]</code> <code class="language-text">(http://sms.eu-west-3.amazonaws.com/)</code>, to the IP ranges where services are hosted. They then send traffic along to the IP address returned from the query.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/GomKlIrtzfa7jiCpHIdJV/8b82d82f55cc196d38e9e4af0d79b13e/aws-cloud-network.png" style="max-width: 700px;" class="image center no-shadow" alt="AWS region diagram" /> <p>When cloud engineers stand up new VPCs, they must also set up routing rules that determine how traffic forwards toward any non-local destinations. Traffic bound to the internet always follows a default route that points to a gateway device. Traffic crossing such a device is defined as internet egress traffic. This includes traffic heading to the internet only to reach cloud services — even those in the same region or zone as your VPC instances. These costs add up over time with sufficient volume.</p> <p>Luckily, most cloud providers released features commonly referred to as “endpoint services.” They’re designed to keep this traffic away from the public internet, reducing both data transfer costs and security risks associated with sending traffic over the internet. Endpoint services allow users to configure local network interfaces with private IP addresses inside their VPC subnets.</p> <p>These interfaces act as proxies for any traffic destined toward the endpoint services configured. The result is that traffic to these services stays local and private with lower egress pricing.</p> <p>For example, traffic from a VPC in us-east-1 to an AWS service hosted in us-east-1 will cost $0.02/GB. Not a big deal at the scale of most companies. But when transferring 50 TB per month to S3, these costs can quickly surpass $1000/month. You’ll decrease your costs by setting up a VPC endpoint to direct this traffic to S3. VPC endpoint pricing includes $0.01/hour per interface; any data egressed over that is charged at $0.01/GB. 
The same 50 TB egressed to S3 over an AWS PrivateLink endpoint will cost approximately $520/month — a 50% savings over standard internet egress charges.</p> <p>Read more about AWS VPCs in <a href="https://www.kentik.com/resources/ebook-network-pros-guide-to-the-public-cloud/">The Network Pro’s Guide to the Public Cloud</a>.</p> <h2 id="mistake-5-using-default-internet-traffic-delivery">Mistake #5: Using default internet traffic delivery</h2> <p>Using your cloud provider’s default internet egress is definitely straightforward, but the costs can add up quickly, especially when your business needs to deliver tons of bits. Costs can be more than ten times as expensive as traditional IP transit on a per-GB basis. Cloud migrations can result in a considerable billing surprise without considering and implementing other traffic delivery options.</p> <img src="//images.ctfassets.net/6yom6slo28h2/63xpdrW5tKwtRbY09FTTMk/70ac73929710b989f2b2a6c7ed914ca5/cloud-mistake-default-traffic.png" style="max-width: 700px;" class="image center no-shadow" alt="Network traffic map" /> <p>A simple first step is to inventory which apps deliver traffic to the internet and how much. Teams may deploy new apps using default internet egress because they aren’t aware of the cost impact. Discovery is critical for cost control.</p> <p>Here are some traffic delivery options that can reduce internet egress charges:</p> <p><strong>Leverage CDNs (content delivery networks)</strong>: These services can cache frequently requested objects on nodes distributed all worldwide, serving the traffic to end users at a much lower per-GB cost than serving those users directly from your origin servers, and with much better performance (i.e., lower latency). Cloud providers offer their own built-in CDN services, along with third-party providers.</p> <p><strong>In-app packaging for mobile</strong>: For cloud services that interact with mobile apps, consider packaging large, relatively static objects as part of the app instead of serving them over the network.</p> <p><strong>Private egress</strong>: For services generating lots of traffic, it can be more economical to backhaul internet egress over cloud interconnects to a PoP that’s well-connected to lots of relatively cheap IP transit. This is attractive for networks with PoPs and transit in place for traditional physical data centers.</p> <h2 id="an-increasingly-complex-future-for-multi-cloud-ops-teams">An increasingly complex future for multi-cloud ops teams</h2> <p>As organizations adopt hybrid, multi-cloud environments, network and infrastructure teams will face serious blind spots with siloed tools impacting their ability to identify and troubleshoot problems.</p> <p>Public cloud and cloud-native application infrastructure have introduced incredible new ways to build and deploy applications (including <a href="https://www.kentik.com/kentipedia/cloud-native-network-functions-cnf/" title="Kentipedia: Cloud-Native Network Functions (CNF)">cloud-native network functions</a>). 
Along with this convenience has come the need to understand and operate complex virtual networks, service mesh architectures, hybrid, multi-region, and multi-cloud networking.</p> <p>To learn even more about mastering the operational aspects of multi-cloud deployments, please download our guide, <a href="https://www.kentik.com/resources/guide-to-network-and-application-observability-multi-cloud-ops-teams/">Network and Application Observability for Multi-Cloud Ops Teams</a>.</p><![CDATA[Suppressing Dissent: The Rise of the Internet Curfew]]><![CDATA[The populations in Cuba and Iran were only the latest to experience what has become an increasingly common tactic of digital authoritarianism: the internet curfew. This tactic, in which internet service is temporarily disabled on a recurring basis, lowers the costs and thus increases the likelihood of government-directed internet disruptions.]]>https://www.kentik.com/blog/suppressing-dissent-the-rise-of-the-internet-curfewhttps://www.kentik.com/blog/suppressing-dissent-the-rise-of-the-internet-curfew<![CDATA[Doug Madory, Peter Micek]]>Wed, 16 Nov 2022 05:00:00 GMT<p>In the evening on September 30, people across Cuba found their <a href="https://twitter.com/DougMadory/status/1575703586495156224">internet service cut.</a> The residents of this Caribbean nation had <a href="https://www.cnn.com/2022/10/15/americas/cuba-hurricane-ian-protestors-charges">begun protesting</a> their government’s tepid response to Hurricane Ian which had wrought destruction a week earlier. Internet service returned to normal the following morning, but this outage wasn’t caused by storm-related damage. This blackout was a deliberate act, a fact confirmed when service <a href="https://twitter.com/DougMadory/status/1576169117039501313">dropped out</a> for the same period of time the following day.</p> <div as="WistiaVideo" videoId="9pouve0mvo" audio></div> <img src="//images.ctfassets.net/6yom6slo28h2/9KdtLCp21apjqadYvxxbr/b3a4af0079f2ad1fff821ffa61ae3f86/curfew-cuba-sept30.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Cuba internet shutdown, September 2022" /> <p>Half a world away, Iranians were struggling with a similar version of episodic internet disruptions. A <a href="https://www.reuters.com/world/middle-east/tehran-governor-accuses-protesters-attacks-least-22-arrested-2022-09-20/">widespread protest movement</a> had sprung up in cities across Iran following the <a href="https://www.cnn.com/2022/10/26/middleeast/iran-clashes-mahsa-amini-grave-intl">death of Masha Amini</a> while in police custody. In an effort to combat the protests, the Iranian government directed the three major mobile operators to begin <a href="https://twitter.com/DougMadory/status/1574813306547847169">disabling internet service</a> across the country every evening before restoring service in the early hours of the following morning.</p> <p>The people of Cuba and Iran were only the latest to experience what has become an increasingly common tactic of digital authoritarianism: the internet curfew. The word curfew has traditionally been used in the context of keeping people indoors during evening or nighttime hours. However, in the world of <a href="https://www.accessnow.org/internet-shutdown-types/">internet shutdowns</a>, it has taken on a new meaning. 
In this blog post, we will discuss the origins and logic of this increasingly utilized tactic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/ZAPhmLM6YIH2lJtwESSAo/427039420d190d0ba4933f99c1d03186/featured-internet-curfew.png" style="max-width: 550px;" class="image center" alt="Rise of the internet curfew" /> <h2 id="evolution-of-a-tactic">Evolution of a tactic</h2> <p>Egypt’s internet shutdown in the 2011 Arab Spring was a <a href="https://www.kentik.com/blog/from-egypt-to-uganda-a-decade-of-internet-shutdowns/">watershed moment for the internet community</a>, but it was also extremely disruptive to the country’s economy. While the internet blackout was intended to disrupt the organization and coverage of the anti-government protests, it also disabled the government’s ability to communicate as well as crippled the operations of Egypt-based companies. An Egypt-style total internet shutdown involves significant collateral damage and can <a href="https://www.oecd.org/countries/egypt/theeconomicimpactofshuttingdowninternetandmobilephoneservicesinegypt.htm">incur significant costs</a> for the government, as well.</p> <p>By the time Syria experienced their first <a href="https://www.bbc.com/news/world-middle-east-13642917">internet shutdown</a> five months later, the approach of taking down the internet appeared to have evolved. Instead of taking down all of the country’s internet communications, the embattled Syrian government only took down <a href="https://opennet.net/blog/2011/06/syria-goes-mostly-offline-protests-intensify"><em>most</em> of the country’s BGP routes</a>. Routes belonging to mobile and residential networks were pulled, while <a href="https://web.archive.org/web/20110605200740/http://www.renesys.com/blog/2011/06/syrian-internet-shutdown.shtml">those belonging to the government</a> were left untouched.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6PIQBiZeQQY3BjjBFdWf3B/95ae9db199ca70e6d8278ffed6cdf389/renesys-syria.png" style="max-width: 500px;" class="image center" alt="Renesys chart - globally reachable Syrian networks, 2011" /> <p>The logic was presumably to focus on suppressing the communication of the Syrian citizens who were revolting while letting the Assad government and its allies continue to conduct business as usual. This was an early attempt at limiting the collateral damage of an internet shutdown.</p> <h2 id="first-appearance-in-gabon">First appearance in Gabon</h2> <p>Years later in September 2016, the West African nation of Gabon disabled its <a href="https://www.buzzfeednews.com/article/sheerafrenkel/this-countrys-internet-has-been-blocked-for-72-hours-and-cou#.utDnwge0K">internet service for four days</a> following a contested presidential election which led to deadly riots. When service returned, it came with a catch. Service was only up during the day, but was mostly taken down in the evening and at night.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7mbEVKaLwfncQc4MEnEP9F/986d86c82ce34139cabf89df01376136/dyn-gabon.png" style="max-width: 500px;" class="image center" alt="Dyn chart - Gabon internet outage 2016" /> <p>These daily 12-hour national outages in Gabon were soon being referred to as an <a href="https://qz.com/africa/781752/gabon-has-been-imposing-a-12-hour-a-day-internet-curfew-as-its-political-crisis-grows">internet curfew</a>. 
It marked the latest evolution in blocking access to the internet while minimizing blowback to the president, who continued to <a href="https://twitter.com/vote4africa/status/773546411224727552">tweet his defense</a> while everyone else in the country was either offline or blocked from reaching social media.</p> <p>Afterward, the Hertie School’s <a href="https://www.anitagohdes.net/">Anita Gohdes</a>, who researches political violence and state repression, <a href="https://politicalviolenceataglance.org/2016/09/16/internet-shutdowns-during-political-unrest-are-becoming-normal-and-it-should-worry-us/">concluded the following</a> about the internet disruptions in Gabon:</p> <blockquote>The fact that the authorities in Gabon have limited the shutdowns to the evenings and nights speaks to their simultaneous fear of increasing unrest and their need to provide digital infrastructure to keep their citizens content. The fact that they are continuously doing it amidst unrest and uncertainty suggests the incumbent administration is going to continue its course of violent repression to maintain power.</blockquote> <h2 id="return-in-myanmar">Return in Myanmar</h2> <p>Then again, last year, in the aftermath of the coup in Myanmar, the military junta running the country issued multiple restrictions on the internet. Social media was blocked (leading to an <a href="https://www.manrs.org/2021/02/did-someone-try-to-hijack-twitter-yes/">inadvertent BGP hijack of Twitter</a>), a weekend total shutdown, and later, another internet curfew: nightly internet blackouts of mobile internet service throughout the country.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7JC4N7tW3t00aZttBhB1oK/c278faca254a2c8c2dc6baf73ab8eb50/myanmar-feb2021.png" style="max-width: 700px;" class="image center" thumbnail withFrame alt="Internet disruption in Myanmar, 2021" /> <p>Last year, we teamed up with the <a href="https://ooni.org/">Open Observatory for Network Interference (OONI)</a>, <a href="https://ioda.inetintel.cc.gatech.edu/">Internet Outage Detection and Analysis (IODA)</a>, and <a href="https://censoredplanet.org/">Censored Planet</a> to jointly publish an academic paper documenting the various disruptions in Myanmar using our combined datasets. The paper, <a href="https://dl.acm.org/doi/10.1145/3473604.3474562"><em>A multi-perspective view of Internet censorship in Myanmar</em></a>, was published in <em>ACM SIGCOMM 2021 Workshop on Free and Open Communications on the Internet</em> and relies on Kentik’s traffic data to document the internet curfew in Myanmar.</p> <h2 id="latest-internet-curfews">Latest internet curfews</h2> <p>In September 2022, widespread anti-government protests broke out in Iran after the death of a young woman in police custody. As a result, the Iranian government directed a series of actions to block the Iranian people’s access to multiple types of internet services.</p> <p>On September 21, the Iranian government directed the three major mobile operators of Iran (Irancell, Rightel, and MCCI) to block access to the internet from around 4pm to midnight local for almost two weeks. 
As was the case in the previous internet curfews, this tactic was intended to target the protestors in the streets while allowing the government and industry to continue to communicate.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1YjSvk3w1qkR0D13hlFrm5/13e937d0fbf29c7a20b3c907f34b8e99/iran-sept21.png" thumbnail withFrame style="max-width: 800px;" class="image center" alt="Iranian internet curfew" /> <p>In general, fixed-line services remained up during this period of time. In fact, we were able to observe a small uptick in traffic volume on fixed line service during the mobile shutdowns as some Iranians were able to find alternative sources of connectivity. However, this small increase in traffic was nowhere near enough to make up for the overall loss of traffic due to the nightly mobile shutdowns.</p> <p>Additionally, while fixed line services might still have been connected, they also suffered from interference. Specific websites like <a href="https://explorer.ooni.org/chart/mat?probe_cc=IR&#x26;test_name=whatsapp&#x26;since=2022-08-24&#x26;until=2022-09-24&#x26;axis_x=measurement_start_day">Instagram and WhatsApp were blocked</a> nationally and the region of <a href="https://twitter.com/DougMadory/status/1575964622598774784">Sistan-Baluchestan</a> suffered a complete outage.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1FQoekjJWNXYeEhuSFhvZI/c8305b00a93f505cccdca251e053d550/curfew-iran-irancell-tci.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Blocked internet access in Iran, September 2022" /> <p>The graphic above shows the traffic volume in bits/sec seen to the largest mobile operator by traffic (Irancell, AS44244) and the largest fixed line operator by traffic (TCI, AS58224) during the first few days of the Iranian internet curfew. Note the slight increases in fixed line service (blue line) when the mobile service drops out (green line).</p> <h2 id="making-shutdowns-more-likely">Making shutdowns more likely</h2> <p>When deciding on the method of suppressing communication, the authoritarian state must balance the disruptive aspects of internet shutdowns with its political goals of maintaining control over its population. If there is a possibility of being deposed from power, then any cost is acceptable for an authoritarian ruler.</p> <p>However, not all civil unrest poses this level of risk to the authoritarian state. In the years following the complete disconnection in Egypt, there was hope that similar total disruptions might occur with less frequency due to the <a href="https://www.brookings.edu/wp-content/uploads/2016/10/intenet-shutdowns-v-3.pdf">severe collateral damage</a> they caused.</p> <p>The thought continued that authoritarian governments may instead turn to a more surgical approach of just censoring social media and other unwanted internet services. 
However, increases in innovation and awareness of censorship circumvention technology over the past decade may have rendered this approach less effective, at least for those technically savvy internet users.</p> <p>In the cat-and-mouse game of communication suppression, this innovation may have had the unintended consequence of making complete disruptions again seem necessary despite the costs: you can’t circumvent censorship if you are completely without internet service.</p> <p>Over the past decade, Iran has been building a National Internet Network (NIN), ostensibly to allow the country to continue to function in the event that it was cut off from (<em>or elected to cut itself off from</em>) the outside world. <a href="https://twitter.com/Ammir">Amir Rashidi</a>, Director of Digital Rights and Security at the human rights organization <a href="https://www.miaan.org/">Miaan Group</a> has argued that Iran’s development of the NIN meant “cutting off access to the global internet is much less costly for the Iranian government—and thus more likely.”</p> <p>The objective of internet curfews, like Iran’s NIN, is to reduce the cost of shutdowns on the authorities that order them. By reducing the costs of these shutdowns, they become a more palatable option for an embattled leader and, therefore, are likely to continue in the future. The task for civil society and organizations like <a href="https://www.accessnow.org/">Access Now</a> is to maintain pressure on these governments by finding ways to increase costs for disrupting people’s access to the internet, and expose the hidden damage that these shutdowns inevitably cause to marginalized and vulnerable communities.</p><![CDATA[Five Issues Your NetOps Team Will Face in the Cloud]]><![CDATA[For organizations of all types and sizes, hybrid cloud environments pose a substantial challenge. In our new article, we learn about the five biggest issues that NetOps teams face in the cloud.]]>https://www.kentik.com/blog/five-issues-your-netops-team-will-face-in-the-cloudhttps://www.kentik.com/blog/five-issues-your-netops-team-will-face-in-the-cloud<![CDATA[Kevin Woods]]>Tue, 15 Nov 2022 05:00:00 GMT<p>For organizations of all types and sizes, networks are more than just a line item on a departmental budget. Modern digital enterprises understand that long-term competitiveness depends on maximizing their IT assets to create exceptional user experiences. As part of this shift, business managers and infrastructure leaders are expanding their definition of “network performance” from reactive monitoring of flat networks to a global, real-time awareness of the entire network, including all hardware, software, and third-party components.</p> <p>To deliver on these expanded expectations of network performance, network and infrastructure engineering teams need greater visibility into these complex, hybrid cloud infrastructures. 
<a href="https://www.kentik.com/kentipedia/what-is-npmd-network-performance-monitoring-and-diagnostics/">Gartner observes</a> that “new dynamic network architectures are affecting the efficacy of traditional network monitoring stacks,” and predicts that “by 2024, 50% of network operations teams will be required to re-architect their network monitoring stack due to the impact of hybrid networking — a significant increase from 20% in 2019.”</p> <p>Network and infrastructure engineering teams are caught in a vice: deliver networks with ever-higher functionality, reliability, performance, and security while coping with the increasing complexity and uncertainty.</p> <p>For network and infrastructure teams to address and thrive in the era of hybrid clouds, they need to adapt to these five things:</p> <h2 id="1-clouds-and-data-centers-can-have-very-different-architectures">1. Clouds and data centers can have very different architectures</h2> <p>As more and more applications are moved to the public or private cloud or are natively built in the cloud, <a href="https://www.kentik.com/solutions/usecase/clouds-and-hybrid/">the increased use of containers and microservices is fundamentally changing the visibility and flow of network traffic</a>. Accenture notes, “As a result of digital decoupling and the adoption of microservices, applications are evolving to more complex patterns and topologies, increasingly requiring more dynamic underlying compute, storage, and networking infrastructure. Cloud-native patterns and technologies are typically more ephemeral than traditional environments.”</p> <p>As such, monitoring tools are forced to engage alternative telemetry methods to gain visibility into container-to-container traffic by leveraging APIs from orchestration tools and mesh services. In addition, they need to ingest new types of data (such as <a href="https://www.kentik.com/resources/google-cloud-vpc-flow-logs-for-kentik/">VPC flow logs</a> and NSG VNet flow) to monitor flows between services in different virtual cloud subnets.</p> <p>Distributed architectures to accommodate microservices are also changing the intra-data center traffic topology, creating a highly dynamic network between services and tiers. According to Cisco’s Global Cloud Index, east-west traffic will represent 85% of total data center traffic by 2021, and north-south traffic will account for the remaining 15% of the traffic associated with data centers.</p> <p>Given this, it’s essential to recognize that data center architectures are highly dynamic. Network architecture upgrades, device upgrades, acquisitions, and changing business priorities mean that all data centers are unique. As a result, each needs to be accurately described by network performance tools so network and infrastructure teams can have a fighting chance to do their job effectively and support the organization’s mission.</p> <p>Public cloud architectures differ from private clouds and data centers, but there are differences between public clouds. Dealing with more than one cloud provider is not at all unusual. A Gartner survey of public cloud users found that 81% of respondents said they were working with two or more providers. The Flexera State of the Cloud Report found that enterprises employed an average of 2.2 public and 2.2 private clouds. This change in the IT landscape—where applications, data, and users fluidly interact—means network and infrastructure teams cannot use siloed legacy monitoring tools.</p> <p>Why not? 
Data center architectures are radically changing to accommodate new paradigms like containerized microservices while still supporting legacy apps, incorporating mergers and acquisitions, and keeping pace with advancing technology. This requires an investment in new tooling to bridge visibility gaps. For instance, for microservices deployed in the cloud, network and infrastructure teams cannot simply assume that the cloud infrastructure runs all the time perfectly and does not require monitoring. Quite the opposite is true. When hybrid environments are involved, network and infrastructure teams are responsible for the entirety of the application delivery and user experience across all infrastructures and networks. That’s why fragmented tooling is inadequate. This is yet another example of how strategic IT decisions (like moving to the cloud and adopting containerization) sometimes are made without fully considering and understanding the impact on network requirements and monitoring capabilities.</p> <h2 id="2-highly-diverse-toolsets-and-increased-vendor-complexity">2. Highly diverse toolsets and increased vendor complexity</h2> <p>While monitoring connectivity within and between on-prem, internet, and WAN infrastructures is a traditional capability of network performance monitoring and diagnostic (NPMD) solutions, the addition of cloud-specific network management solutions from public cloud providers creates an unmanageable number of tools and vendor complexity.</p> <p>According to a survey of IT professionals reported in “The State of Cloud Monitoring,” 35% of respondents use up to five monitoring tools to keep tabs on hybrid cloud and multi-cloud environments. Each tool has a specific purpose—device health, synthetic performance monitoring, traffic flow metrics, packet data capture, etc.—and thus tells only an isolated portion of the story.</p> <p>The same research found that nearly 70% felt “public cloud monitoring is more difficult than monitoring data centers and private clouds,” and a stunning 95% said they had experienced one or more performance issues caused by a lack of visibility into the public cloud. Only 19% said they had all the data they needed to monitor a hybrid network (compared to 82% who had enough information to monitor their on-prem network).</p> <h2 id="3-there-is-no-single-view-of-network-activity-across-a-hybrid-infrastructure">3. There is no single view of network activity across a hybrid infrastructure</h2> <p>There are backbone network maps, capacity maps between sites and devices, cloud visualization maps for a specific cloud, and edge maps for WAN/SD-WAN edge networks. But, there needs to be a consolidated, fully integrated map focused on the performance, health, and traffic analytics required to operate a hybrid cloud network.</p> <p>With applications and data shifting between on-prem and cloud environments—and the absence of a single point of observability—there are gaps in visibility, leading to gaps in network intelligence.</p> <p>The bottom line is that <a href="https://www.kentik.com/blog/avoid-these-five-cloud-networking-deployment-mistakes/">hybrid clouds are complex and challenging to work with</a>. Here’s why:</p> <p>Using traditional network management tools, network and infrastructure professionals tracking down bandwidth, performance, and availability problems need help to piece the network together in time to fix critical issues. Discovering which devices and interfaces make up a data path takes time and effort. 
Correlating these elements’ traffic, health, and performance consumes valuable time that could be spent optimizing or adding features.</p> <p>Automation, orchestration, and software-defined networking further complicate matters by constantly shifting where applications are located and redefining how they connect. Without tooling built to comprehend this new reality, the network — and those who run it — are at a disadvantage.</p> <h2 id="4-limited-visibility-leads-to-decreased-agility">4. Limited visibility leads to decreased agility</h2> <p>Visibility and intelligence gaps created by separate network monitoring and diagnostic tools slow down troubleshooting, increase infrastructure engineering burn-out, and prevent proactive operations by keeping the team consumed with reactive tasks. Fragmented apps lead to increased risk of outages, reachability issues, errant traffic flows, increased cyber threats, impact from unknown dependencies, and other threats to network stability. Network and infrastructure teams can’t troubleshoot as fast as necessary when apps and data have moved to the cloud.</p> <p>Even as networks are ever-more critical to business success, network operations are becoming increasingly complex, with disruptions intrinsically harder to foresee and recover from.</p> <p>These visibility gaps create an agility gap that:</p> <ul> <li>Undermines mean time to resolution (MTTR) with disparate and uncorrelated data</li> <li>Diminishes analytics intelligence used for automation and prevention</li> <li>Increases the chances of operator error from constantly working in reactive mode</li> </ul> <h2 id="5-falling-behind-their-organizations-increased-expectations">5. Falling behind their organizations’ increased expectations</h2> <p>Network and infrastructure managers should be included in decisions about balancing on-prem and cloud deployments. Decisions about moving to the cloud are usually made for reasons of cost containment, flexibility, business agility, and efficiency. They are seldom, if ever, made considering the impact on network and infrastructure teams regarding network performance monitoring.</p> <p>This presents network engineers and operators with a daunting challenge. New engineers come into an environment without context, and the maps available don’t provide any. Yet, speed to insight for a new data center network engineer is critical. Incumbent engineers, meanwhile, can’t keep up with all the changes, and their tools don’t reflect or accommodate the new and constantly evolving environment.</p> <p>Compounding these challenges are business pressures to optimize operations across multiple infrastructures, especially when the business is migrating from data centers to clouds, acquiring additional infrastructures through mergers and acquisitions and integrating with legacy systems.</p> <h2 id="looking-ahead">Looking ahead</h2> <p>If infrastructure engineering teams are to meet the increasingly high expectation being placed on them, they need to adopt a proactive approach to meeting the <a href="https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-2/">challenges posed by hybrid cloud environments</a>.</p> <p>And to accomplish this, they need the right kind of tools. Most helpful would be a comprehensive, integrated platform that provides visibility across public/private clouds, on-prem networks, SaaS apps, and other critical workloads. 
It would be a single place to go for network maps that help operators visualize — in real-time — every aspect of their network and keep track of changes to their infrastructure.</p> <p>It would be pretty cool if someone <a href="https://www.kentik.com/">built something like that</a>.</p><![CDATA[Using Kentik Synthetics for Your Cloud Monitoring Needs]]><![CDATA[The final installment of our three-part series explaining the process of using continuous synthetic testing as a comprehensive cloud monitoring solution to find problems before they start.]]>https://www.kentik.com/blog/using-kentik-synthetics-for-your-cloud-monitoring-needshttps://www.kentik.com/blog/using-kentik-synthetics-for-your-cloud-monitoring-needs<![CDATA[Stephen Condon]]>Mon, 14 Nov 2022 18:00:00 GMT<p><em>The final post of a three-part guide to assuring performance and availability of critical cloud services across public and hybrid clouds and the internet</em></p> <p><a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a> adds integrated, autonomous, pervasive performance test telemetry to our market-leading network traffic analytics and observability platform, the <a href="https://www.kentik.com/product/kentik-platform/">Kentik Network Observability Platform</a>. For modern clouds, proactive synthetic monitoring can no longer be delivered as a standalone tool. Integration and context are essential, such as the real traffic and how it flows in the context of the various cloud infrastructures.</p> <p>Network observability gives you the ability to answer any question about your network quickly and easily: across any kind of network (cloud or on-prem), any kind of network element (physical, virtual, or cloud service), whether isolated at the network level or with application and business context.</p> <p>Use it to plan, run, and fix any network that connects in and across public, private, and hybrid clouds.</p> <h2 id="the-importance-of-monitoring-cloud-traffic-for-all-types-of-cloud-architectures">The importance of monitoring cloud traffic for all types of cloud architectures</h2> <p>Whether you use a public cloud, private cloud, or <a href="https://www.kentik.com/resources/hybrid-cloud-monitoring/">hybrid/multi-cloud architecture</a>, it’s critical to monitor traffic as it travels between any cloud services and data centers within your environment. But gaining complete visibility can be difficult if you rely in part on public clouds, which don’t allow the deployment of monitoring tools directly on their network infrastructure.</p> <p>This is why Kentik Synthetics provides dashboards that track vital network metrics (such as latency, packet loss, and jitter) within and between all major public clouds. Kentik Synthetics also identifies the exact path packets take within cloud environments, providing an extra layer of data and context that helps you debug issues faster.</p> <p>In addition, if you need to extend Kentik Synthetics into a hybrid cloud or on-prem environment, you can customize monitoring agents so that they interface with your own data centers or internal network, allowing you to track network metrics and paths across all segments of your cloud environment.</p> <h2 id="using-active-synthetic-testing-to-identify-cloud-networking-problems">Using active synthetic testing to identify cloud networking problems</h2> <p>Continuous testing using network synthetics allows you to proactively measure the uptime of critical network services and simulate network traffic. 
Know immediately when network incidents start to emerge so you can eliminate availability and functionality problems and poor performance across the user journey.</p> <p>Continuous testing is like an insurance policy for network accessibility. Find the source of an issue faster by viewing networking failures alongside the health of your applications, cloud services, and Kubernetes environments. Quickly understand if a problem stems from a public cloud outage, slow third-party resources, or the health of your backend services and infrastructure.</p> <p>Continuous testing allows you to baseline and objectively measure network SLAs. Baseline analyze trends and variance between peak and off-peak hours and plan capacity. Managing SLAs is critical today as so many companies rely on third-party cloud services to host all or parts of their applications. Monitor the performance of any third-party cloud service at frequencies you want to validate and from locations you choose at any time.</p> <p>By running synthetic tests as frequently as every second — instead of every ten minutes or more, the interval legacy synthetic monitoring tools support — you can catch network performance degradations as they happen. Even subtle degradations become visible when you test continuously.</p> <p>This means not only that your engineering team gains complete confidence in the quality of the network at all times, but also that they can minimize MTTR and optimize customers’ digital experience. At the same time, the business can reap higher ROI on its networking investments by reducing the level of redundancy and safe range that the network must support.</p> <h3 id="how-to-monitor-third-party-cloud-and-saas-services-using-kentik-synthetics">How to monitor third-party cloud and SaaS services using Kentik Synthetics</h3> <p>Third-party services and resources are essential to the digital experience you deliver to your end users. Kentik Synthetics can continuously test public cloud services and SaaS apps hosted outside your own environment. Kentik customers have access to a State of the Internet dashboard that constantly monitors the status and response times, from our public agents, of the most popular SaaS apps, public cloud services (AWS, Azure, and Google Cloud), as well as DNS services. Your own custom agents can be installed in your infrastructure to test from these agents to the Kentik agents or between your agents. Jitter, packet loss, and latency results are shown in a grid and color-coded to bring performance issues to your attention in seconds. With Kentik Synthetics, you can also monitor BGP and the performance of DNS, CDN, and API services.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/w55ieR8cePiqziNFNtVgM/44ca7e7082bd394140533328f3cb5e21/multi-cloud-performance.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Multi-site performance mesh test" /> <div class="caption" style="margin-top: -30px;">Multi-site performance mesh test in the Kentik platform</div> <h2 id="extend-proactive-monitoring-to-your-hybrid-private-cloud">Extend proactive monitoring to your hybrid, private cloud</h2> <p>Kentik makes it easy to collect data from any components of your networking architecture, including both public and private resources.</p> <p>The <a href="/product/global-agents/">Kentik Global Synthetic Network</a> provides more than 200 agents for tracking popular cloud and SaaS platforms. 
In addition, you can deploy private agents wherever you need — such as within your own data centers — and customize them to gain hyper-focused visibility into traffic flows and network performance issues.</p> <p>This means you can proactively sectionalize your monitoring routine to track traffic flows across all network components of your software stack — public clouds, private clouds, CDN networks, private data centers, and beyond — then pinpoint which component of the system is causing performance issues.</p> <p>Private synthetic agents also benchmark and baseline the performance of your in-house applications, APIs, websites, and more to measure and track performance across periods (weeks, months, years) and geographies, then compare your performance to that of peers or industry standards.</p> <h2 id="putting-it-all-together-troubleshooting-an-incident">Putting it all together: Troubleshooting an incident</h2> <p>Third-party services and resources are as essential to your cloud environment as your own apps.</p> <p>When your observability tools detect a problem, the first step is determining whether the issue lies in your application or the network. But you can’t stop there.</p> <p>If it’s a network problem, the next step is to sectionalize your network to assess which component is causing the issue. Does your private data center have a problem? Is it a public cloud service issue? Is the internet as whole experiencing performance problems?</p> <div as="Testimonial" index="0" color="green"></div> <p>The <a href="https://www.kentik.com/product/kentik-platform/">Kentik Network Observability Platform</a> makes it easy to answer these questions by running synthetic tests on your cloud service mesh, private data centers, and third-party internet services. By viewing and verifying routes and path flows across all sections of your network architecture, you can quickly determine where each problem lies.</p> <p>In turn, you can hold third-party vendors, service providers, and partners accountable when a problem on their end impacts your users. Is your payment processor’s network causing delays and cart abandonment? Is an employee unable to access email because a SaaS app is down?</p> <h2 id="why-choose-kentik-synthetics-to-optimize-your-cloud-networking">Why choose Kentik Synthetics to optimize your cloud networking</h2> <p>Kentik has developed a unique global network footprint of intelligent, software-based agents located in over 180 cities and 90 cloud regions across Amazon Web Services, Microsoft Azure, Google Cloud, and IBM Cloud. 
This provides unprecedented visibility into every aspect of network communication: data center to data center, between data centers and public clouds, between public cloud regions, and between external SaaS, applications, and API endpoints.</p> <p>By fully integrating network traffic data and synthetic analytics into a single platform, Kentik enables network and infrastructure engineering teams to autonomously and accurately measure performance and availability metrics of essential network infrastructure, applications, and services.</p> <p>The real-time results are presented in a full context across the entire network infrastructure, no matter the combination of data center networks, private clouds, or public clouds.</p> <p>Only Kentik:</p> <ul> <li>Combines actual flow data and synthetic testing in a single, comprehensive solution</li> <li>Uses real-world network conditions to automate synthetic test selection and deployment</li> <li>Eliminates the shortcomings of legacy synthetic tools that can burden network teams with “noise” and meaningless distractions</li> <li>Presents test results in a simple, easy-to-understand format where details can be explored, and changes made with unprecedented speed and ease</li> <li>Offers a pricing model that allows for dramatically more frequent and comprehensive synthetic testing</li> </ul><![CDATA[Everything You Need to Know About Synthetic Testing]]><![CDATA[The second of a three-part guide series explaining the process of using continuous synthetic testing as a comprehensive cloud monitoring solution to find problems before they start.]]>https://www.kentik.com/blog/everything-you-need-to-know-about-synthetic-testinghttps://www.kentik.com/blog/everything-you-need-to-know-about-synthetic-testing<![CDATA[Stephen Condon]]>Mon, 14 Nov 2022 05:00:00 GMT<p><em>Part two of a three-part guide to assuring performance and availability of critical cloud services across public and hybrid clouds and the internet</em></p> <p>Monitoring your user traffic is critical for knowing the quality of the digital experience you are delivering, but what about the performance of new cloud or container deployments, expected new users in a new region, or new web pages or applications that don’t have established traffic? This is where synthetic testing can be invaluable.</p> <h2 id="what-is-synthetic-testing">What is synthetic testing?</h2> <p>Synthetic testing is a technique that proactively simulates different conditions and scenarios, then tracks the results. In a networking context, synthetic testing evaluates a broad range of variables to detect problems before they reach actual end-users. You can test different types of traffic—like web, audio, and video—or trace traffic across different routing paths. You can also test different types of user actions, like browsing, logging in, and <a href="https://www.kentik.com/kentipedia/what-is-synthetic-transaction-monitoring/" title="Kentipedia: What is Synthetic Transaction Monitoring (STM)">checking out</a>. 
You can even generate traffic from different geographical locations to test how a user in, say, New Zealand will experience your service compared to one in San Francisco.</p> <p>Synthetic monitoring is used to generate different types of traffic (e.g., network, DNS, HTTP, web, etc.), and send it to a specific target (e.g., IP address, server, host, web page, etc.), and then measure metrics associated with that “test” KPIs can be established using those metrics.</p> <h2 id="how-does-synthetic-testing-work">How does synthetic testing work?</h2> <h3 id="synthetic-monitoring-puts-you-in-full-control-of-what-and-how-you-test">Synthetic monitoring puts you in full control of what and how you test</h3> <p>You start by determining the conditions you want to evaluate—which configurations to test for, which routing paths to evaluate, which geographic regions will serve as traffic sources, and endpoints, and so on. Your <a href="https://www.kentik.com/kentipedia/what-is-synthetic-monitoring/">synthetic monitoring</a> platform then generates traffic according to your specifications. From there, it tracks the flow and identifies problems like unavailability, high latency, and packet loss. Finally, synthetic testing helps you pinpoint the source of problems. Was your request slow because of a problem in an API gateway? Downed infrastructure? A performance degradation in a third-party service? Whatever the root of the issue, synthetic testing helps you find and fix it before your users experience it on a large scale.</p> <h3 id="synthetic-testing-works-with-any-type-of-cloud-environment-or-configuration">Synthetic testing works with any type of cloud environment or configuration.</h3> <p>Whether you’re deploying a monolithic application on-premises or running a set of microservices across a scale-out cluster that spans multiple clouds or data centers, synthetic testing lets you evaluate how your applications and services perform under varying conditions. The alternative to synthetic testing, which is to analyze traffic from actual user requests as they happen, doesn’t deliver this level of control and flexibility. It lets you track only the traffic produced by actual users and doesn’t surface problems until your users are already experiencing them.</p> <h2 id="types-of-synthetic-testing">Types of synthetic testing</h2> <p>Synthetic testing can be used to measure a wide variety of network performance metrics. <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a> can continuously test public cloud services and SaaS apps that are hosted outside your own environment. Synthetic testing can also be used to monitor DNS and HTTP server performance. Service providers and corporations with large networks are using synthetic testing to monitor real-time connectivity between corporate sites and displaying results in a site mesh grid to quickly get a view of network status.</p> <p>Continuous monitoring of all services and infrastructure within your stack—including those managed or owned by third parties—is critical for understanding performance problems even of basic services like DNS. As mentioned, synthetic testing uses artificially generated traffic (e.g., network, DNS, HTTP, web, etc.) and sends it to a specific target (e.g., IP address, server, host, web page, etc.). 
Metrics associated with that “test,” such as page load times, jitter, and packet loss, can then be used to build KPIs using those metrics.</p> <p>An example of synthetic testing is the Kentik “State of the Internet” for SaaS applications. Each test is performed by sending a GET request to the HTTP endpoint that the user (employee) would access for the specific application.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7nFU8SisMkzV7BYCd9i5Zo/5b6bd021103d922e2bf6f68e7d7398b2/state-of-the-internet-monitoring-saas.png" style="max-width: 800px;" class="image center" withFrame alt="SaaS apps and publisc clouds in the Kentik platform" /> <div class="caption" style="margin-top: -40px;">The State of the Internet in the Kentik platform</div> <p>In addition, the test also performs a network layer (ICMP ping + traceroute) to the IP address resolved for the host that the HTTP server is running on or behind. If any part of that stack (network, DNS, or HTTP layer) slows down or fails, the test catches it immediately and can show you whether the failure was due to the network or at a higher layer. Synthetic testing isn’t just useful in the context of network connectivity and infrastructure. It can include BGP, DNS, CDN, and API services.</p> <h2 id="what-is-the-difference-between-synthetic-testing-and-real-user-monitoring">What is the difference between synthetic testing and real user monitoring?</h2> <p>Real user monitoring (RUM) aims to directly provide information on your application’s frontend performance from end users. Unlike active or synthetic monitoring, which attempts to gain internet performance insights by regularly testing synthetic interactions, RUM monitors how your users interact with sites or apps. Both synthetic testing and RUM can use agents or Javascript to generate measurements.</p> <p>There are downsides to RUM. The first is the sheer amount of user analytics it can generate, making it difficult to identify and diagnose potential issues impacting end users. The second, which is more difficult to overcome, is that by its nature, RUM doesn’t enable you to test. If you want to see the quality of a connection to a particular region and there are no users online, you are out of luck. Or, if you are deploying a new app or public cloud resources, RUM won’t help you get a picture of the expected digital experience. RUM sheds light on the network path end users take beyond the web server so that it won’t generate data on WAN, SDWAN, and cloud-resource performance. Also, many SaaS applications your organization may utilize won’t allow agents to be embedded.</p> <h2 id="monitoring-cloud-networking-synthetics-vs-flow">Monitoring cloud networking: Synthetics vs. flow</h2> <p>Synthetic network testing is only one of the key tools for monitoring and troubleshooting network performance. The other is flow monitoring, which lets you track and analyze the traffic generated by your actual users as they engage with your applications and environment. Flow monitoring uses various network flow protocols — such as <a href="https://www.kentik.com/kentipedia/what-is-netflow-overview/" title="Kentipedia: What is NetFlow? An Overview of the NetFlow Protocol">NetFlow</a>, <a href="https://www.kentik.com/blog/netflow-vs-sflow/" title="NetFlow vs. 
sFlow: What’s the Difference?">sFlow</a>, and <a href="https://www.kentik.com/kentipedia/ipfix-collector/" title="Kentipedia: IPFIX Collector Tools">IPFIX</a> — or (in the case of cloud networking) <a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/" title="Kentipedia: What are VPC Flow Logs?">VPC flow logs</a> to collect and analyze data about actual IP traffic going to and from network interfaces and other devices.</p> <p>Synthetic monitoring and flow monitoring provide total visibility into the network. Flow monitoring tells your team what is happening and what your users are experiencing in real-time. It alerts you when a service goes down, helps you track network egress costs, warns you when you are close to exhausting network capacity, and so on.</p> <p>Meanwhile, synthetic testing provides insights into what will happen if conditions change. It’s a means of future-proofing your network as you add DNS servers, integrate new third-party services, or start supporting users based in a new region. It also offers a preview of what might result if a cloud provider has an outage or one of your data centers goes down.</p> <p>In short, synthetic testing keeps you prepared for the future and ready to react, while flow monitoring ensures you catch issues that slip past your synthetic testing safeguards. Together, they enable network observability.</p> <h2 id="kentik-makes-synthetic-testing-autonomous">Kentik makes synthetic testing autonomous</h2> <p>With infrastructure, traffic, and path awareness, Kentik Synthetics can autonomously identify the most important places to test from to troubleshoot performance immediately after deployment.</p> <p>Using path distribution information contained in flow, Kentik autonomously determines which path to use for synthetic tests so that test data best reflects actual paths. Kentik stays fully aware of traffic dynamics, allowing test configurations to adapt to network traffic and infrastructure changes.</p> <p>With a single click, the Kentik can automatically provision different testing scenarios, including site-to-site, site-mesh, and hybrid infrastructure tests with combinations of ASNs, sites, countries, connectivity types, CDNs, and OTT providers.</p> <p>Because they lack infrastructure and intelligence insights, point products and non-network solutions require users to engage in a manual and time-consuming process to configure tests—especially when infrastructure or traffic paths change.</p> <p>Organizations are embracing cloud environments that are becoming increasingly complex. As a result, their ability to identify and troubleshoot problems in these resources is of increasing concern. Kentik is the industry’s first solution to close operational gaps introduced by cloud and hybrid cloud environments. 
The next article in this series, “<a href="/blog/using-kentik-synthetics-for-your-cloud-monitoring-needs/" title="Using Kentik Synthetics for your Cloud Monitoring Needs">Using Kentik Synthetics for your Cloud Monitoring Needs</a>,” outlines how Kentik can close some of these operational gaps.</p><![CDATA[A Guide to Cloud Monitoring Through Synthetic Testing]]><![CDATA[The first in a three-part guide series explaining the process of using continuous synthetic testing as a comprehensive cloud monitoring solution to find problems before they start.]]>https://www.kentik.com/blog/a-guide-to-cloud-monitoring-through-synthetic-testinghttps://www.kentik.com/blog/a-guide-to-cloud-monitoring-through-synthetic-testing<![CDATA[Stephen Condon]]>Sat, 12 Nov 2022 05:00:00 GMT<p><em>A three-part guide to assuring performance and availability of critical cloud services across public and hybrid clouds and the internet</em></p> <h2 id="what-is-cloud-monitoring">What is cloud monitoring?</h2> <p>As the world moves to utilize cloud-based resources to be more nimble and scale its IT infrastructure, the challenge of <a href="https://www.kentik.com/kentipedia/cloud-network-performance-monitoring/">cloud monitoring</a> has emerged. Cloud monitoring involves observing and managing operational workflows in a cloud-based IT infrastructure. Monitoring should be designed to confirm the availability and performance of websites, servers, VPCs, containers, gateways, load balancers, applications, and other cloud infrastructure such as Kubernetes deployments.</p> <p>Cloud monitoring should encompass both actual traffic and test traffic. This guide focuses on monitoring test traffic using synthetic testing.</p> <h2 id="what-are-some-types-of-cloud-monitoring">What are some types of cloud monitoring?</h2> <p>Networking in and with the cloud involves managing the interconnections of a wide array of devices, services, and applications. You’ll want to test your connection to your cloud resources and the performance of traffic flowing from your website and applications through new and existing cloud resources.</p> <p>Find a tool that enables you to monitor <a href="https://www.kentik.com/solutions/usecase/amazon-web-services/">AWS</a>, <a href="https://www.kentik.com/solutions/usecase/microsoft-azure/">Microsoft Azure</a>, and <a href="https://www.kentik.com/solutions/usecase/google-cloud-platform/">Google Cloud</a> to perform website monitoring, virtual machine monitoring, database monitoring, virtual network monitoring, and cloud storage monitoring, and to see traffic flows between and within your cloud resources.</p> <p>With synthetic monitoring, you can see traffic flows between and within pods deployed on <a href="https://www.kentik.com/solutions/usecase/kubernetes-networking/">Kubernetes clusters</a>. This provides an integrated view, allowing you to identify performance problems before they impact users.
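</p> <p>As a toy illustration of what such a tool automates under the hood, the sketch below enumerates a site-mesh of checks in Python. The region names and hostnames are hypothetical, and a real deployment would run each check from an agent in that region rather than from a single machine.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import itertools
import socket
import time

# Hypothetical vantage points and targets; in practice each vantage
# point would be an agent deployed in that cloud region or cluster.
VANTAGE_POINTS = ["us-east", "eu-west"]
TARGETS = {
    "storefront": ("www.example.com", 443),
    "api": ("api.example.com", 443),
}

def tcp_connect_ms(host, port, timeout=3.0):
    """Time a TCP handshake as a crude reachability and latency check."""
    t0 = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        return (time.monotonic() - t0) * 1000

# A full mesh: every vantage point checks every target. Run locally this
# measures from one machine; a real mesh runs the check per region.
for vp, (name, (host, port)) in itertools.product(VANTAGE_POINTS, TARGETS.items()):
    try:
        print(f"{vp} -> {name}: {tcp_connect_ms(host, port):.1f} ms")
    except OSError as exc:
        print(f"{vp} -> {name}: FAILED ({exc})")</code></pre></div> <p>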
Synthetic testing enables you to monitor and drill into cloud and data center environments, troubleshoot VPC flows, subnets, and east-west traffic, and check device interfaces.</p> <h2 id="why-cloud-monitoring-through-continuous-testing-matters">Why cloud monitoring through continuous testing matters</h2> <h3 id="find-problems-before-your-users-do">Find problems before your users do</h3> <div class="pullquote right" style="max-width: 370px; border: none; font-weight: 300; font-size: .96em; color: #393a45; line-height: 110%; padding-top: 0;"><img src="https://images.ctfassets.net/6yom6slo28h2/SvcSL8YI10FopX21MyybG/5028326b25c46ea32673b2b7dc3f7a20/latency-new-outage.png" class="image right" alt="Is latency the new outage?" /><div class="caption" style="margin-bottom: 0; padding-bottom: 0;">Good reasons for proactively monitoring with synthetic testing</div></div> <p>Continuous <a href="https://www.kentik.com/kentipedia/what-is-synthetic-monitoring/">synthetic monitoring</a> of your cloud environment ensures that you can detect availability and performance problems before they impact end users. That’s true whether the issues are caused by services you have deployed yourself or by a third-party resource. From the customer’s perspective, the difference doesn’t matter.</p> <p>Synthetic monitoring also helps catch issues that affect only certain users — such as those in a specific geography. And it allows you to test how an application update or an architecture change, such as migration to the cloud or the adoption of a CDN, affects performance before you deploy the changes to customers.</p> <p>In short, continuous cloud monitoring helps you ensure the smooth digital experience that users expect in today’s world, where subsecond delays or short periods of unavailability can break SLAs.
It also provides the reports you need to validate SLA adherence.</p> <h3 id="the-cloud-is-not-a-singular-entity">The cloud is not a singular entity</h3> <p>It’s a sprawling, ever-changing stack of disparate services, infrastructure, applications, and other resources that need continuous monitoring to guarantee operational excellence.</p> <p>From API gateways and WANs to public-facing SaaS apps and private LOB apps, to CDNs and transit networks and beyond, any wrinkle in the complex set of technologies that power your cloud will be felt by users — unless you catch it first via synthetic monitoring.</p> <p>That’s why frequent, continuous, and systematic monitoring of all resources within your cloud environment is critical for guaranteeing the digital experience that your company promises to customers.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/69MLhqdPyYCTwiBTMb75Ot/5ac2e74ffd10e912bc9770b1f0b5556c/synthetic-testing-proactive-monitor.png" class="image center" style="max-width: 700px;" alt="Using synthetic testing for cloud monitoring" /> <div class="caption" style="margin-top: -30px;">What synthetic testing can help you proactively monitor</div> <h2 id="how-active-cloud-monitoring-helps-you-solve-problems-before-they-start">How active cloud monitoring helps you solve problems before they start</h2> <p>Understanding what is happening in a modern environment requires monitoring that is both autonomous — meaning it happens automatically — and active, which means proactively sending synthetic traffic across your environment to track the response.</p> <p>Without an autonomous approach, orchestrating monitoring and testing for large-scale applications consisting of dozens of microservices that may be distributed across multiple clouds simply isn’t feasible.</p> <p>And without active monitoring, you’re stuck waiting on user-generated traffic flows before you can detect and respond to problems. You struggle to catch issues that affect only certain geographies, for example, or that involve services that aren’t part of typical flows.</p> <p>Active (synthetic) monitoring puts you in control. Stop waiting passively for problems to arise. Start finding and fixing actively.</p> <h2 id="next-up-synthetic-testing">Next up: Synthetic testing</h2> <p>Synthetic testing provides insights into what will happen if conditions change. It’s a means of future-proofing your network as you add DNS servers, add new cloud regions or providers, integrate new third-party services, or start supporting users based in a new region. It also offers a preview of what might result if a cloud provider has an outage or one of your data centers goes down. In short, synthetic testing keeps you prepared for the future and ready to react, while flow/actual traffic monitoring ensures you catch issues that slip past your synthetic testing safeguards. Together, they enable network observability.</p> <p>Stay tuned for our next post in this series: <a href="/blog/everything-you-need-to-know-about-synthetic-testing/">Everything You Need to Know About Synthetic Testing</a>.</p><![CDATA[Kentik takes network observability to KubeCon 2022]]><![CDATA[We're fresh off KubeCon NA, where we showcased our new Kubernetes observability product, Kentik Kube, to the hordes of cloud native architecture enthusiasts. Learn about how deep visibility into container networking across clusters and clouds is the future of k8s networking.
]]>https://www.kentik.com/blog/kentik-takes-network-observability-to-kubecon-2022https://www.kentik.com/blog/kentik-takes-network-observability-to-kubecon-2022<![CDATA[Phil Gervasi]]>Thu, 10 Nov 2022 05:00:00 GMT<p>If you’re an engineer trying to fix real problems with your apps, looking at just one small part of the picture isn’t going to cut it. This is why Kentik is so focused on helping you understand what’s going on beyond single k8s instances, and it’s a big part of what network observability is all about.</p> <p>This was Kentik’s message at KubeCon 2022, which was a memorable event for us. The conference was larger than in the past, the location was perfect, and our team was excited to show off <a href="https://www.kentik.com/solutions/usecase/kubernetes-networking/">Kentik Kube</a>, a recent addition to our observability portfolio.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6ghF6QXBOcb0bPUHkxFipd/646884333f3188da1ee2703bfaa442b8/kentik-kubecon-booth.jpg" style="max-width: 500px;" class="image center" alt="Kentik booth at KubeCon" /> <div class="caption" style="margin-top: -30px; max-width: 500px;">Mike Krygeris and Paul Sancimino at our booth, ready to talk tech and give demos on k8s visibility from the container to the cloud. </div> <p>We’re excited because Kentik Kube is our groundbreaking new way to get deep interactive visibility into container network connections and performance. We’re already super nerds when it comes to network telemetry in general, so collecting metrics from k8s clusters and visualizing them in our portal was the natural evolution of what we’re already good at.</p> <p>The reality is most organizations are at least <em>somewhat</em> hybrid in that they have on-prem data centers, a public cloud footprint, and k8s clusters communicating between them. The micro-services apps running in these clusters have to communicate with each other as well as with the outside world, so we want to understand if any of these communications are failing or slow, not to mention whether there is unexpected communication going on that could be a security issue.</p> <p>We also want to understand the bigger picture of what connects with what else and how, and whether the policies are set up in a way to allow that (NACLs, Routes, etc.). Tying our latest container visibility advancements into our larger platform of telemetry and analytics provides a much broader understanding of what’s going on than isolating k8s instances alone.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/TVm42kJnG97rAwnM9gpyM/03c0444de845d6f0bb3536c423a6430f/kentik-kube-k8s-connections.png" style="max-width: 800px;" class="image center" withFrame alt="k8s connections in the Kentik portal" /> <p>At these types of tech conferences, we like to do demos that show how you can use Kentik to solve real problems, so Mike and Paul walked through a scenario in which an application built on k8s was performing slowly and how to use Kentik Kube to figure out what’s going on.</p> <p>Remember when I said visibility from the container to the cloud? Well, check out the graphic below, a screenshot from our KubeCon demo. You’ll see that this cluster communicates with the internet (which makes sense as it runs a customer-facing store), but using “show path,” you can see it also communicates with another EKS cluster in US-East-1.</p> <p>We can go deeper to see that NACLs, etc., are all configured correctly; there’s no denied traffic, and the route tables look okay.
So far, the issue is not between these two clusters.</p> <img src="//images.ctfassets.net/6yom6slo28h2/238Tk60a1IvajSJsB1vTOz/8c0e46b09b00691a7f61c3457624c8f8/kentik-kube-nacls.png" style="max-width: 320px;" class="image center" alt="Detail of Kubernetes clusters" /> <p>It’s time to really dig into the k8s communication, which we do by deploying an eBPF telemetry agent into these clusters. Notice in the graphic below that you can see the four nodes in our cluster. You can see the traffic between the nodes and that there’s some considerable latency going on.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4ddmocCNj5pJkmb3zKquEs/3de658c5ea84f4931c6928992f279c5e/kentik-kube-four-nodes.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="k8s communication" /> <p>This next part is even cooler. Look at how we can drill down into specific nodes to see several interesting details. You can see the list of pods that are running on this node, their traffic profiles, as well as the fact that there’s been a recent problem with latency.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4Zcp8AqsD6rzUxVPgtXIbQ/76d10ccd39daa4a406e7cfbe86d1a3d0/kentik-kubecon-latency-details1.png" style="max-width: 600px;" class="image center" withFrame alt="Kubernetes - drill into specific nodes" /> <p>Of course, knowing that there’s latency at the node level is helpful, but we really want to see what causes the app to be slow. Kentik allows us to drill down even further to see the namespace – and <em>voilà!</em> You can see all the pods, how they communicate with each other, the amount of traffic they exchange, and the latency between them.</p> <img src="//images.ctfassets.net/6yom6slo28h2/75BbRfNj2CKjVYT51ot6qK/b62461a5313c987f515667bf7665e63d/kentik-kube-pods-communication.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kubernetes pods and how they communicate" /> <p>And there’s the smoking gun: The pod hosting the frontend of our app is experiencing high latency impacting all the communications that go via this pod.</p> <p>Most of our applications are delivered over the network today, and many of them are built on a distributed k8s architecture. This means that a ton of <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/" title="Kentik Blog: Best Practices for Enriching Network Telemetry to Support Network Observability">application-relevant telemetry</a> is embedded right into the network. Kentik Kube gives you the unique ability to see what’s going on with your k8s networking, from the container level itself all the way to the cloud.</p> <p>To learn more about Kentik Kube, <a href="#demo_dialog" title="Request your Kentik demo">request a live demo</a>.</p><![CDATA[Are you a network observability champion?]]><![CDATA[The concept of network observability has taken hold in the market. But how are users adopting this technology, and what gaps remain? This article provides a taste of the user survey results from a newly released market study. ]]>https://www.kentik.com/blog/are-you-a-network-observability-championhttps://www.kentik.com/blog/are-you-a-network-observability-champion<![CDATA[Kevin Woods]]>Tue, 08 Nov 2022 05:00:00 GMT<p>At Kentik, we pride ourselves on being innovators and thought leaders in network observability. “Kentik <em>is</em> network observability” is more than a slogan for us.
It’s an idea that informs our product roadmap and guides our problem-solving with customers. We’ve done a lot to explain network observability to prospects. One overview, <a href="https://www.kentik.com/resources/nfd-29-network-observability-the-evolution-of-network-visibility-at-kentik/">Network observability: The evolution of network visibility with Kentik</a>, by Kentik tech evangelist <a href="https://www.kentik.com/blog/author/phil-gervasi/">Phil Gervasi</a>, discusses how today’s complex environment of cloud-hosted applications, containerized services, and overlay networks requires a network-centric approach to observability.</p> <p>But more important than what we think, what do <em>you</em> think about network observability?</p> <p>To find out what the NetOps community thinks about network observability, our analyst partner, Enterprise Management Associates (EMA), conducted a <a href="/resources/ema-research-report-network-observability-netops/">detailed market study</a> surveying over 400 IT networking professionals. The study aimed to understand how NetOps teams see network observability — what it is, how it helps, and what gaps remain in the available solutions.</p> <p>There are many noteworthy results in this study. For example, when asked, “How successful do you think your organization is with its use of network monitoring or network observability tools?” only 25.9% of organizations reported being fully successful. This was a significant decrease from the 47% success rate reported when EMA posed the same question to respondents in 2018.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3jbtUp9uGBA0Nk6N8ghPgA/0227dcf7a75bfca6aa1daba813138d3f/ema-network-observability-chart.png" style="max-width: 600px;" alt="Pie chart showing how successful survey participants think their organization is with its use of network monitoring or network observability tools" class="image center" /> <p>The results also cover other key areas of change, such as:</p> <ul> <li><strong>Data</strong>: Are IT organizations trying to expand the volume and variety of data they collect and analyze with their NetOps tools? Are certain classes of data becoming more important or less important as IT organizations evolve their approach to NetOps?</li> <li><strong>Insights</strong>: How do network managers draw distinctions between tools that provide access to information (data) and tools that provide insights? What roles do AI and ML play in defining this distinction? What insights are essential to users?</li> <li><strong>Tasks and workflows</strong>: What tasks are <a href="https://www.kentik.com/kentipedia/what-is-netops-network-operations/" title="Kentipedia: What is NetOps?">NetOps</a> teams interested in transforming with network observability? How will these workflows change?
Troubleshooting, network planning and capacity management, policy configuration and management, traffic engineering, attack mitigation, and cost analytics are all explored.</li> </ul> <h2 id="more-from-the-ema-network-observability-report">More from the EMA network observability report</h2> <p>There are two ways to learn about these fascinating results from this first-ever market study.</p> <ul> <li> <p>Join Shamus McGillicuddy on November 9 for EMA’s webinar <a href="https://www.kentik.com/go/webinar/ema-report-network-observability-use-cases/">Network Observability: Delivering Actionable Insights to Network Operations</a></p> </li> <li> <p>Download EMA’s report <a href="https://www.kentik.com/resources/ema-research-report-network-observability-netops/">Network Observability: Delivering Actionable Insights to Network Operations</a></p> </li> </ul><![CDATA[Kentik Market Intelligence just increased its IQ — introducing KMI Insights!]]><![CDATA[KMI Insights is now available. Insights brings the dynamics of transit and peering relationships that are of most interest to you into focus.]]>https://www.kentik.com/blog/introducing-kmi-insightshttps://www.kentik.com/blog/introducing-kmi-insights<![CDATA[Greg Villain]]>Tue, 01 Nov 2022 04:00:00 GMT<p>Early this year we launched <a href="https://www.kentik.com/resources/kentik-market-intelligence/">Kentik Market Intelligence (KMI)</a>. If you missed it, KMI enumerates transit and peering relationships as well as produces rankings based on the volume of IP space transited by ASes in different geographies. Using tables and charts, KMI offers a global view of the internet out-of-the-box without any configuration or setup.</p> <p>KMI uses public BGP routing data to rank ASes based on their advertised IP space. Rankings are updated multiple times daily and come in various forms: total customer base, customer base type (retail, wholesale, backbone), and peering.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3ySSryPONgGk78bLsuvfI3/db8827922daadf1e7cc60423b0374917/kmi-insights-overlay.png" style="max-width: 800px;" thumbnail class="image center no-shadow" alt="Market rankings with new Insights panel" /> <div class="caption" style="margin-top: -50px;">An example of a KMI table with the new KMI insights window</div> <p>We think KMI is an invaluable tool for the marketing, sales and product management staff of any service provider who sells transit. We believe that it’s the only tool that exists to support critical activities such as:</p> <ul> <li>Selecting the best upstream provider in any locale based on objective criteria</li> <li>Benchmarking ASes against each other based on interconnection to eyeballs or content providers</li> <li>Supporting IP transit sales prospecting and interconnection KPIs</li> <li>Identifying your competitors’ single-homed customers.</li> </ul> <p>We have now enhanced KMI with a significant new feature set: KMI Insights is now available!</p> <h2 id="kmi-insights-tell-me-more">KMI Insights, tell me more…</h2> <p>Insights bring the market dynamics that are of most interest to you into focus.
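</p> <p>To give a feel for the kind of computation involved, here is a toy Python sketch that ranks ASes by the total IPv4 address space they originate, given (prefix, origin ASN) pairs of the sort extracted from a BGP dump. The data is invented, and a real ranking like KMI’s would also deduplicate overlapping prefixes and account for provider/customer relationships.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import ipaddress
from collections import defaultdict

# Toy (prefix, origin ASN) pairs as might be extracted from a BGP RIB dump.
routes = [
    ("203.0.113.0/24", 64500),
    ("198.51.100.0/24", 64501),
    ("192.0.2.0/25", 64500),
]

# Sum the advertised IPv4 address space per origin AS.
space = defaultdict(int)
for prefix, asn in routes:
    space[asn] += ipaddress.ip_network(prefix).num_addresses

# Rank ASes by total advertised address space, largest first.
for asn, addrs in sorted(space.items(), key=lambda kv: kv[1], reverse=True):
    print(f"AS{asn}: {addrs} addresses")</code></pre></div> <p>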
When crunching billions of BGP dump entries multiple times a day, KMI now identifies insights and contextualizes them in the KMI UI, tagging them with the networks involved and the market in which they are observed.</p> <p>Sample insights include <strong>&#x3C;this provider></strong> added <strong>&#x3C;this transit customer></strong> within <strong>&#x3C;this geography></strong>, any network adding new providers, and rank changes or changes in routing announcements between two networks. You can configure your personal insights using the selections below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6DHh48hXdLB2cHYTFzGlC3/8fb8ad25fec0ca03133bba1e5925094b/kmi-insight-types.png" style="max-width: 420px;" class="image center" alt="Configure market insights" /> <p>Each insight comes with attributes:</p> <ul> <li>A customer network - as identified in the Provider/Customer ASN relationship</li> <li>A provider network - as identified in the Provider/Customer ASN relationship</li> <li>A market - that the insight applies to</li> <li>Insight type - based on its nature (see list in the screenshot above)</li> <li>Magnitude - a 1-5 value that helps order these insights from most important to least important (based on our ranking of significance)</li> </ul> <h2 id="insights-contextualized">Insights contextualized</h2> <p><a href="https://www.kentik.com/product/service-provider-analytics/">Kentik Service Provider Analytics</a> customers now get a Top Insights panel to bring their attention to market dynamics that they are most interested in:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1qCkuUPaIr8pqV6NVOB4C5/b90e519b42fef8d5360d03acf5350ae6/kmi-top-insights.png" style="max-width: 800px;" thumbnail withFrame class="image center" alt="Market dynamics for service providers" /> <p>This side panel can be configured to include or exclude specific types of insights, or to extend the period covered by the insights.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5IXsGtcV5flppkfF1S7h38/39b6ab55079a1a3fe96eeb36b3c385fa/kmi-configure-insights.png" style="max-width: 800px;" thumbnail withFrame class="image center" alt="Insights based on markets and networks" /> <h2 id="track-networks-and-markets-that-matter-to-you">Track networks and markets that matter to you!</h2> <p>A while back, we introduced Observation Deck, which is a place for every user to compose their landing page with the areas of network visibility that specifically matter to them, based on widgets from specific workflows and areas of the product. With the introduction of Insights, KMI now has a widget that can be incorporated into your Kentik Observation Deck.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6nyJqNewjc40l4mTBaybFn/394defad5df6968b5918f624f873bbd1/kmi-observation-deck-widget.png" style="max-width: 800px;" thumbnail withFrame class="image center" alt="Configure market insights" /> <p>You can now create as many KMI widgets as you’d like, so you can focus on the markets and/or networks that are important to you.</p> <p>In the future, we will add more KMI-related widgets to the Observation Deck so that you can embed widgets showing rankings and visualizations from KMI.</p> <p>If you have further questions about KMI and the new Insights feature, I’d be happy to answer them and <a href="#demo_dialog" title="Request a demo">give you a demo</a>. This new feature considerably enhances KMI, which is near and dear to my heart.</p> <p>We hope you’ll enjoy this update as much as we do.
Please <a href="mailto:[email protected]">send us your feedback</a> on it; we’d love to hear what you think!</p><![CDATA[Insight and reliability through continuous synthetic testing in Kubernetes]]><![CDATA[Kubernetes has become the de facto standard for cloud-based applications. As companies migrate more and more workloads, ensuring reliable connectivity and performance are critical not just for user applications but also for the cluster itself.]]>https://www.kentik.com/blog/insight-and-reliability-through-continuous-synthetic-testing-in-kuberneteshttps://www.kentik.com/blog/insight-and-reliability-through-continuous-synthetic-testing-in-kubernetes<![CDATA[Evan Hazlett]]>Thu, 27 Oct 2022 04:00:00 GMT<p>Kubernetes has become the de facto standard for cloud-based applications. As companies migrate more and more workloads, ensuring reliable connectivity and performance is critical not just for user applications but also for the cluster itself. In this article, we will discuss how augmenting your system monitoring with in-cluster synthetic testing can give you proactive indicators that something might be headed for trouble.</p> <h2 id="synthetic-testing-with-kubernetes">Synthetic testing with Kubernetes</h2> <p>Synthetic tests provide a way of simulating real user actions in a variety of ways. They give insight into real-world usage of your applications and services, whether it be network availability, DNS resolution, or server response times, to name a few.</p> <p>Kubernetes is intended to streamline deployment and application management. By design, it enables faster delivery by hiding the complexity of the underlying infrastructure, allowing teams to deploy using common patterns. This gives teams more control over their application development and lifecycle, but also means the responsibility for ensuring application health is largely on them. In an ideal scenario, the team would instrument applications and services to both observe and continuously check health, but in reality, we know that is not always the case. By giving teams the ability to integrate synthetic tests directly into their current development workflow, they can quickly gain insight into how their applications are performing.</p> <p>At Kentik, we provide <a href="https://www.kentik.com/product/synthetics/">synthetic testing</a> to monitor the digital experience across network infrastructure, multiple clouds, SaaS applications, and page transaction testing. We maintain a network of hundreds of agents around the world that continuously monitor designated targets using a range of tests, including page response, DNS lookup times, BGP monitoring, and more. We wanted to bring this same level of synthetic testing to Kubernetes, enabling developers to easily add checks for their applications, so we created <a href="https://github.com/kentik/odyssey/">Odyssey: an open source Kubernetes operator for synthetic testing</a>.</p> <h2 id="odyssey-an-open-source-kubernetes-operator">Odyssey: An open source Kubernetes operator</h2> <p>Odyssey is an open source Kubernetes operator that automates agent provisioning and configuration for a synthetic test framework in a Kubernetes cluster. The <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">operator pattern</a> enables developers to extend Kubernetes for their own applications and still follow the Kubernetes principles.
When used with Kentik Synthetics, this gives a number of benefits:</p> <ul> <li>A standard synthetic platform for all teams</li> <li>Operator-controlled base framework</li> <li>The operator, not the team, manages cloud provider access</li> <li>Developers focus on writing checks, not managing infrastructure</li> </ul> <p>Once the Odyssey operator is deployed, teams can leverage their existing Kubernetes knowledge by having a familiar and simpler method to add synthetic checks to their applications. Let’s look at an example deployment with a service:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
  labels:
    app: demo
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: demo
        image: docker.io/ehazlett/sysinfo:latest
        ports:
        - containerPort: 8080
          name: app
---
apiVersion: v1
kind: Service
metadata:
  name: demo
  labels:
    app: demo
spec:
  selector:
    app: demo
  ports:
  - name: app
    port: 8080
    targetPort: app
    protocol: TCP</code></pre></div> <p>In this example, we have a deployment that exposes port 8080 and a service that provides access to the deployment on port 8080. Let’s look at adding an Odyssey check to the application that performs an HTTP GET request on the service.</p> <p>Odyssey provides a <a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/">custom resource</a> to define a synthetic task. This can be one of a variety of checks, including ping, which pings a host, and fetch, which performs an HTTP request. The following example shows how to create an HTTP check with the Odyssey fetch task:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">apiVersion: synthetics.kentiklabs.com/v1
kind: SyntheticTask
metadata:
  name: demo
spec:
  fetch:
  - service: demo
    target: /
    port: 8080
    method: GET
    period: 10s
    expiry: 5s</code></pre></div> <p>This test will perform an HTTP GET against the Kubernetes service every ten seconds. The operator takes care of resolving the service address and backends, so the developer only has to define the application target, not the underlying infrastructure pieces. Once deployed, the Kentik synthetic agent will be provisioned and configured to run the task. When paired with the Kentik Synthetic Dashboard, it gives a great view into both current and historical performance.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4OXpo8zAyl0RLuDVMx0KpN/f40aad76cb97a83657f7c9b158e7ad68/odyssey-synthetics-dashboard.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Dashboard showing an open source Kubernetes operator for synthetic testing" /> <p>With more and more applications moving to Kubernetes, having an easy way to integrate continuous synthetic testing into your cluster brings the necessary insight and operational reliability needed for production services.
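</p> <p>For teams that prefer to create the task programmatically rather than with kubectl, a sketch like the following uses the official Kubernetes Python client. One assumption to flag: the CRD plural name <code>synthetictasks</code> is our guess, so check the CRD registered in your cluster before relying on it.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from kubernetes import client, config

# Load credentials from the local kubeconfig (use
# config.load_incluster_config() when running inside the cluster).
config.load_kube_config()

# The same SyntheticTask as the YAML above, expressed as a dict.
task = {
    "apiVersion": "synthetics.kentiklabs.com/v1",
    "kind": "SyntheticTask",
    "metadata": {"name": "demo"},
    "spec": {
        "fetch": [{
            "service": "demo",
            "target": "/",
            "port": 8080,
            "method": "GET",
            "period": "10s",
            "expiry": "5s",
        }]
    },
}

# Create the custom resource; "synthetictasks" is an assumed plural.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="synthetics.kentiklabs.com",
    version="v1",
    namespace="default",
    plural="synthetictasks",
    body=task,
)</code></pre></div> <p>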
When coupled with Kentik Synthetics, it gives teams a complete view of their environment from network to application endpoint.</p> <h2 id="kentik-kube">Kentik Kube</h2> <p>We’re excited to announce our beta launch of <a href="https://www.kentik.com/solutions/usecase/kubernetes-networking/" title="Kentik for Kubernetes Networking">Kentik Kube</a>, an industry-first solution that reveals how K8s traffic routes through an organization’s data center, cloud, and the internet.</p> <p>With this launch, Kentik can observe the entire network — on-prem, in the cloud, on physical hardware or virtual machines, and anywhere in between. Kentik Kube enables network, infrastructure, platform, and DevOps engineers to gain full visibility of network traffic within the context of their Kubernetes deployments — so they can quickly detect and solve network problems and surface traffic costs from pods to external services.</p><![CDATA[Managing the hidden costs of cloud networking - Part 2]]><![CDATA[Companies considering cloud adoption should ensure that they take these valuable lessons into account to avoid hidden cloud costs.]]>https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-2https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-2<![CDATA[Ted Turner]]>Wed, 26 Oct 2022 04:00:00 GMT<p>In the <a href="https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-i/">first post of this series</a>, I detailed ways companies considering cloud adoption can achieve quick wins in performance and cost savings. While these benefits of the cloud certainly remain true <em>in theory</em>, realizing these benefits <em>in practice</em> can be increasingly difficult as applications and their networks become more complex.</p> <p>To illustrate, I’d like to talk about my time as a network engineer at <em>Company X</em>. The company’s foray into the cloud began with a hardware expansion of several racks’ worth of equipment to support new SMS capabilities. As these SMS capabilities were released into the product suite, the expenditure on the hardware proved to be overkill, ultimately resulting in the equipment being repurposed for other business needs.</p> <p>This marked an important transition for the company. From there on out, our development teams were instructed to build all new services in the cloud, with the mandate that the newly delivered applications function in both our cloud and data center.</p> <p>In this post, I will focus on cost management lessons learned from two of those cloud projects:</p> <ul> <li>a geographically distributed service</li> <li>a network data aggregation tool</li> </ul> <h2 id="hidden-performance-costs">Hidden performance costs</h2> <div class="pullquote right" style="margin-top: 15px;">It is vital for teams that transition to distributed solutions to consider extra performance dimensions.</div> <p>My first introduction to applications running in a distributed environment was one of our large, internal, Software-as-a-Service (SaaS)-based applications, which we ran on specific hardware in San Diego, CA.
Around this time, the Southwest was experiencing rolling and persistent power outages, so we decided to host the Identity/Authorization (IAM) service for this application in the Pacific Northwest.</p> <p>After making the switch, customer logon latencies surged from 11ms to around 1.2 seconds, which led to painful timeouts and a poor overall user experience.</p> <p>Was this a network issue or a coding issue?</p> <p>Technically, it was both. The request latency was appropriate from a network perspective, given the physical distance between the two data centers.</p> <p>So, while not really a developer-caused issue, it was definitely going to be a developer-caused solution: take into account the new networking constraints presented by geographically distributed requests, and make the appropriate changes to request timers, acknowledgment of the delay in the UI, etc.</p> <p>This experience taught me that it is vital for teams that transition to distributed solutions to consider these extra performance dimensions.</p> <h3 id="troubleshooting-cloud-networks">Troubleshooting cloud networks</h3> <p>The IAM service example was straightforward, but imagine this misstep between dev and networking teams as a building block in what can potentially be a much larger issue in a distributed application at scale.</p> <p>To help engineers avoid these missteps, cloud platforms and infrastructure as code (IaC) have emerged as valuable tools for programmatically managing this networking complexity. They have also introduced a reasonably significant redistribution of networking responsibilities. With IaC, networking decisions are made not just by network engineers, but also by devs and DevOps engineers, often independently across teams with different scopes and priorities. In the ideal state, these loosely coupled decisions allow teams to run their code in a (miraculously) harmonized network, where the actions of one don’t affect the others.</p> <p>But, unfortunately, reality is not the ideal state. So, where do I look when things go wrong? With more networking decisions being made via IaC, DNS often becomes the culprit, as it is responsible for selecting the path, firewall, load balancer, caching, and fault-tolerant paths, all of which must be considered, configured in IaC, and deployed correctly.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6sSbXZZQKApXG10x5mIrXM/add52057b43962a93293c5b18b1ea185/synthetic-testing-against-hostnames.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Synthetic monitoring results" /> <div class="caption" style="margin-top: -30px; max-width: 700px;">Kentik synthetic testing against hostnames, enabling understanding of IP addresses in use, latency, jitter, and packet loss.</div> <h2 id="hidden-financial-costs">Hidden financial costs</h2> <p>Near the end of my ten years at Company X, we were building a data aggregator to help us identify and isolate networking issues while providing the information we needed to answer our most asked questions:</p> <ul> <li>Is the request communicating to internal or external services?</li> <li>What path is the request taking through the network?</li> <li>How is DNS involved in the request?</li> <li>Is the request interacting with any firewalls? If so, L3? L4?
L7?</li> <li>How is the request affecting the network’s load balancers?</li> <li>Alongside the latency, is the request experiencing any packet loss or jitter?</li> </ul> <p>To do this, we needed to be able to ingest trillions of records, standardize data formatting and tagging across multiple metric streams, compress this data, and ultimately deposit it onto a “single pane of glass” visualization tool.</p> <p>After deploying the tool, we started looking for any serious performance issues. Did the single pane of glass accurately represent the detailed metrics we were forwarding? Did the data arrive in a timely fashion? Could we use the system to troubleshoot faster? These are all excellent questions that we needed to answer.</p> <p>A question we did not expect, however, was an inquiry from the finance team asking why our DEV project (that’s right, not even in production yet) was costing $40,000 per month. Yikes!</p> <div class="pullquote right" style="max-width: 320px;">With the line between app development and networking so blurred in the cloud, sometimes, a single line of code can lead to dramatic networking consequences. </div> <p>Upon further inspection, it turned out there was a code-driven networking issue this time. We were always calling the publicly available name instead of the internal names/IP addresses, creating lots of unnecessary outbound charges.</p> <p>We were able to cheaply fix this by providing a simple checkbox in the UI to select private versus public addresses.</p> <p>All this to say, moving massive amounts of data in a cloud or hybrid-cloud environment can be very costly:</p> <ul> <li>Sending traffic from a cloud back into campus or DC costs money, whereas this type of traffic was previously free.</li> <li>Sending traffic from a cloud, across the public internet, back to a cloud resource incurs outbound charges, even though “everything is running in the cloud.”</li> </ul> <p>With the line between app development and networking so blurred in the cloud, sometimes, a single line of code can lead to dramatic networking consequences. Having visibility into your network can help you track performance and financial costs quickly, even as network traffic and complexity grow.</p> <p>Don’t miss <a href="/blog/hidden-costs-of-cloud-networking-optimizing-for-the-cloud-part-3/">Part III of this series</a>, where I outline how best to organize cloud networks and teams to facilitate high-velocity development that provides top value.</p><![CDATA[Kentik Kube extends network observability to Kubernetes deployments]]><![CDATA[We’re excited to announce our beta launch of Kentik Kube, an industry-first solution that reveals how K8s traffic routes through an organization's data center, cloud, and the internet. ]]>https://www.kentik.com/blog/kentik-kube-extends-network-observability-to-kubernetes-deploymentshttps://www.kentik.com/blog/kentik-kube-extends-network-observability-to-kubernetes-deployments<![CDATA[Nick Stinemates]]>Tue, 25 Oct 2022 04:00:00 GMT<p>We’re excited to announce our beta launch of Kentik Kube, an industry-first solution that reveals how K8s traffic routes through an organization’s data center, cloud, and the internet.</p> <p>With this launch, Kentik can observe the entire network — on-prem, in the cloud, on physical hardware or virtual machines, and anywhere in between.
Kentik Kube enables network, infrastructure, platform, and DevOps engineers to <a href="https://www.kentik.com/go/offer/kentik-kube/">gain full visibility of network traffic within the context of their Kubernetes deployments</a> — so they can quickly detect and solve network problems, and surface traffic flowing from pods to external services.</p> <img src="//images.ctfassets.net/6yom6slo28h2/48sOCgzbaM5FMaZOJTGSBn/ee57c71ca62430068a105b2302342951/kentik-kube-traffic-in-among-kubernetes-clusters.png" class="image center" thumbnail withFrame style="max-width: 800px;" alt="Network traffic analytics inside and among Kubernetes clusters" /> <div class="caption" style="margin-top: -30px;">Kubernetes cluster running on GKE, displaying latency between nodes for an online shopping site.</div> <h2 id="why-we-built-kentik-kube">Why we built Kentik Kube</h2> <p>Very often, pods and services experience network delays that degrade a user’s experience. Until now, there has been no way to identify which Kubernetes services and pods are experiencing those delays. The complexity of microservices leaves developers wondering whether the network reality matches their design, who the top requesters consuming Kubernetes services are, which microservices are oversubscribed, and how the infrastructure is communicating both within itself and across the internet.</p> <p>Kubernetes has become the de facto standard for cloud-based applications. As companies migrate their workloads, ensuring reliability, connectivity, and performance, from user applications and their clusters to the entire infrastructure and the internet, is critical.</p> <h2 id="kentik-kube-use-cases">Kentik Kube use cases</h2> <p>We built <a href="/solutions/usecase/kubernetes-networking/">Kentik Kube</a> to provide visibility for cloud-managed Kubernetes clusters (AKS, EKS, and GKE) as well as on-prem, self-managed clusters using the most widely implemented network models. Teams responsible for complex networks can:</p> <p><strong>Improve network performance</strong></p> <ul> <li>Discover which services and pods are experiencing network latency</li> <li>Identify service misconfigurations without capturing packets</li> <li>Configure alert policies to proactively find high latency impacting nodes, pods, workloads or services.</li> </ul> <p><strong>Gain end-to-end K8s visibility</strong></p> <ul> <li>Identify all clients and requesters consuming your Kubernetes services</li> <li>Know exactly who was talking to which pod, and when.</li> </ul> <p><strong>Validate policies and security measures</strong></p> <ul> <li>See which pods, namespaces, and services are speaking with each other to ensure configured policy is working as expected.</li> <li>Identify pods and services that are communicating with non-Kubernetes infrastructure or the internet — when they should not be.</li> </ul> <h2 id="how-kentik-kube-works">How Kentik Kube works</h2> <p>Kentik Kube relies on data generated from a lightweight eBPF agent that is installed onto your Kubernetes cluster. It sends data back to the Kentik SaaS platform, allowing you to query, graph and alert on conditions in your data. This data, coupled with our analytics engine, enables users to gain complete visibility and context for traffic performance inside and among Kubernetes clusters.</p> <h2 id="mapping-your-network-with-kentik-kube">Mapping your network with Kentik Kube</h2> <p>Kentik Kube provides east-west and north-south traffic analytics inside and among Kubernetes clusters.
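</p> <p>As a rough illustration of the query-and-alert logic this kind of telemetry enables (not Kentik’s actual API), the sketch below aggregates per-pod latency samples, of the sort an eBPF agent might emit, and flags pods whose p95 latency crosses a threshold. The sample records are invented.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from statistics import quantiles

# Invented latency samples (ms) per pod, standing in for eBPF-derived telemetry.
samples = {
    "frontend-7d4b9": [12.1, 14.8, 13.0, 220.5, 15.2, 230.9, 14.1],
    "checkout-5f6c2": [8.2, 7.9, 8.4, 8.1, 8.7, 8.3, 8.0],
}

THRESHOLD_MS = 100.0

for pod, latencies in samples.items():
    # quantiles(..., n=20)[18] approximates the 95th percentile.
    p95 = quantiles(latencies, n=20)[18]
    status = "ALERT" if p95 > THRESHOLD_MS else "ok"
    print(f"{pod}: p95={p95:.1f} ms [{status}]")</code></pre></div> <p>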
Kentik will automatically build your network map once you have deployed the eBPF agent.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1MdvBb735tEngcbYj5jExJ/9d860fe4acaed05f67c0b4fcb4997332/network-map-eks-clusters.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik Kube showing latency" /> <div class="caption" style="margin-top: -30px;">Network map showing EKS clusters communicating within AWS regions.</div> <p>Kentik Kube can display details so you can see if your route tables, NACLs, etc. are all configured correctly. You can drill down into a cluster to see if there are latency or other issues. The eBPF telemetry agent deployed into these clusters lets you see the traffic between the nodes and the latency between them.</p> <img src="//images.ctfassets.net/6yom6slo28h2/42gWGqUueWb8vRKS3zrHy8/559177c7230dac39324284420b8b697d/kentik-kube.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Kentik Kube showing latency" /> <h2 id="how-to-get-started-with-kentik-kube">How to get started with Kentik Kube</h2> <p>Kentik Kube is now in beta. You can apply to <a href="https://www.kentik.com/go/offer/kentik-kube/">trial the beta here</a>. Please <a href="mailto:[email protected]">share your feedback</a> with us. We’d love to hear what you think.</p><![CDATA[How Kentik Visualizes the BGP Propagation of a DDoS Mitigation]]><![CDATA[At first glance, a DDoS attack may seem less sophisticated than other types of network attacks, but its effects can be devastating. Visibility into the attack and mitigation is therefore critical for any organization with a public internet presence. Learn how to use Kentik to see the propagation of BGP announcements on the public internet before, during, and after the DDoS attack mitigation.]]>https://www.kentik.com/blog/how-bgp-propagation-affects-ddos-mitigationhttps://www.kentik.com/blog/how-bgp-propagation-affects-ddos-mitigation<![CDATA[Phil Gervasi]]>Tue, 18 Oct 2022 04:00:00 GMT<p>We often think of DDoS attacks as volumetric malicious traffic targeted against organizations to effectively take a service offline. While such attacks are most frequently detected by anomalous behavior found in NetFlow, sFlow, IPFIX, and BGP data, what may not be well understood is how the DDoS <em>mitigation</em> works and how it’s possible to visualize the effectiveness of the mitigation during and after an attack.</p> <h2 id="ddos-targets">DDoS targets</h2> <p><a href="https://www.kentik.com/kentipedia/how-to-prevent-ddos-attacks/">DDoS attacks</a> can take place against publicly available websites and even entire <a href="https://www.kentik.com/resources/nfd-sp-2-observing-content-delivery-and-over-the-top-ott-services/">content delivery networks</a>, but often they’re targeted against specific IP addresses representing some piece of infrastructure or a corporate network.</p> <p>These targeted attacks usually aim to reduce a service’s availability or completely take down a corporate network’s public availability on the internet. The IP addresses in these attacks are public addresses configured on internet-facing routers and used to announce routing information to the public internet.</p> <p>An organization announces its routing information to the world using a unique autonomous system number, or ASN, which serves as its identifier in the great sea of organizations also announcing themselves on the internet.
Both the organization’s network information, in terms of IP addresses, and its ASN are propagated using the Border Gateway Protocol, or BGP, to “announce” to the world the availability and reachability of the network.</p> <p>Therefore, mitigating these types of targeted DDoS attacks must focus on protecting these external IP addresses and an organization’s unique ASN.</p> <h2 id="ddos-mitigation">DDoS mitigation</h2> <p>Today, the most common <a href="https://www.kentik.com/resources/detecting-and-mitigating-ddos-attacks-with-kentik-protect/">DDoS mitigation method</a> is for an organization to partner with a third-party DDoS mitigation service that will assume the public BGP route announcement of the company being attacked.</p> <p>When a visibility tool, such as <a href="https://www.kentik.com/product/protect/">Kentik Protect</a>, sees the beginnings of a DDoS attack, an organization can start the mitigation process by coordinating the transfer of its routing to the mitigation vendor. This way, the attackers will see the third-party mitigation vendor as the source of the IP addresses and send their malicious traffic there rather than to the original target.</p> <div as="Testimonial" index="0" color="green"></div> <p>Notice in the screenshot below from a <a href="https://www.kentik.com/resources/how-bgp-propagation-affects-ddos-mitigation/">demonstration</a> by Doug Madory, Kentik’s director of internet analysis, that we can clearly see the moment <a href="https://www.lightreading.com/security/kentik-integrates-with-cloudflare-for-ddos-protection/d/d-id/776797">Cloudflare</a>, a DDoS mitigation provider, took over routing announcements for Hibernia Services, Ltd.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6qHrDYuPkLgLl7kSeo5wgZ/dfc2caa44e7f0a886148036047fb88a9/moment-of-ddos-mitigation-activation.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Moment of DDoS mitigation activation" /> <p>The third-party mitigation service will then scrub the traffic and forward it back to the originally intended recipient organization using a tunnel, usually GRE, to obfuscate the mitigation process from the attacker.</p> <h2 id="bgp-propagation">BGP propagation</h2> <p>DDoS mitigation ultimately depends on the proper operation of BGP on the public internet. This means the switchover, in which the mitigation vendor takes over announcements for its customer, is time-sensitive, albeit nuanced.</p> <p>For the mitigation to occur, the victim must stop announcing its networks via BGP to the public internet. Simultaneously, the mitigation vendor must take over and announce those same networks on its customer’s behalf. If there is overlap, there can be routing contention on the internet. If there’s a delay, the victim’s network could be adversely affected by the DDoS attack or simply offline because no routing announcements occur during the switchover.</p> <p>However, the nuance goes deeper. There may also be filtering in place by another service provider, which breaks when the mitigation vendor takes over. An organization may experience a performance degradation of its service (an e-commerce website, for example) due to the path changes and filtering happening on the internet.</p> <h2 id="visualizing-bgp-announcements">Visualizing BGP announcements</h2> <p>Considering that a DDoS attack and its mitigation can take many minutes and affect an organization’s line-of-business application(s), visibility into the various elements of the process is crucial.
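</p> <p>One way to observe this switchover yourself is to watch BGP updates for your prefixes from public route collectors. The sketch below uses the open source pybgpstream library, assuming its BGPStream v2 API is installed; the prefix and time window are placeholders. When the mitigation activates, the origin AS at the end of the AS path should change from the victim’s ASN to the mitigation provider’s.</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import pybgpstream

# Placeholder prefix and window; substitute your own prefix and the
# period surrounding the mitigation activation.
stream = pybgpstream.BGPStream(
    from_time="2022-10-17 00:00:00",
    until_time="2022-10-17 06:00:00",
    collectors=["route-views2", "rrc00"],
    record_type="updates",
    filter="prefix more 203.0.113.0/24",
)

# Print each announcement's origin AS so the handoff to the
# mitigation provider's ASN is visible in the timeline.
for elem in stream:
    if elem.type == "A":  # announcement (withdrawals are type "W")
        origin = elem.fields["as-path"].split()[-1]
        print(f"{elem.time} peer AS{elem.peer_asn} sees origin AS{origin}")</code></pre></div> <p>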
For example, an organization under attack will need visibility into the following:</p> <ul> <li>When an attack occurred</li> <li>When the mitigation activation took place</li> <li>How effective that mitigation is</li> <li>How routes are being announced before, during, and after the mitigation</li> <li>When the mitigation ends</li> </ul> <p>Most organizations can’t afford to have a visibility blind spot that lasts many minutes, especially during this type of crisis, so visualizing these elements is very important to managing the attack and returning to normal.</p> <p>In the graphic below, notice that we can see when the mitigation was activated. There was also a problem with the activation, and the mitigation provider never entirely took over routing announcements for the victim. This would mean both a delay in the mitigation activation and contention with routing on the public internet. As a result, there would be reachability problems to the victim’s network and publicly available services. This specific kind of network visibility is crucial to managing an attack successfully.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3sACsTZvNAmGRTbrhgs9ZS/1f11918ec84c9634b13a3d38502b120f/ddos-mitigation-problem.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Problem with DDoS mitigation activation" /> <h2 id="kentiks-unique-visibility-capability">Kentik’s unique visibility capability</h2> <p>Kentik provides visibility into an organization’s routing announcements before, during, and after a DDoS attack and mitigation. Using Kentik Protect and Kentik’s synthetic testing functionality, we can recognize the beginnings of a DDoS attack, alert the network team, and activate automated mitigation. More precisely, we can detect the exact moment an activation occurred, its effectiveness, when it ended, and if/when routing returned to normal.</p> <p>At first glance, a DDoS attack may seem less sophisticated than other types of network attacks. Still, because it directly impacts BGP routing announcements to the world, it can be devastating. Visibility into the attack and mitigation is therefore critical for any organization with a public internet presence.</p> <p>Watch this video interview and demonstration to learn more about DDoS mitigation and Kentik’s unique ability to provide visibility into the entire process before, during, and after.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/t3ft98f9ev" title="How BGP propagation affects DDoS mitigation" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div><![CDATA[Flows vs.
packet captures for network visibility]]><![CDATA[A packet capture is a great option for troubleshooting network issues and performing digital forensics, but is it a good option for always-on visibility considering flow data gives us the vast majority of the information we need for normal network operations?]]>https://www.kentik.com/blog/flows-vs-packet-captures-for-network-visibilityhttps://www.kentik.com/blog/flows-vs-packet-captures-for-network-visibility<![CDATA[Phil Gervasi]]>Thu, 13 Oct 2022 04:00:00 GMT<p>Recently, I saw some discussion online about how flow data, like <a href="https://www.kentik.com/kentipedia/what-is-netflow-overview/" title="Kentipedia: What is NetFlow?">NetFlow</a> and <a href="https://www.kentik.com/blog/netflow-vs-sflow/" title="Kentik Blog: NetFlow vs. sFlow: What’s the Difference?">sFlow</a>, doesn’t provide enough network visibility compared to doing full packet captures. The idea was that unless you’re doing full packet captures, you’re not doing visibility right. Because I’ve used packet captures so many times in my career, I admit there’s a part of me that wants to agree with this. But I’ve been eyeball-deep in how network visibility works for the last two years, so I really can’t agree.</p> <h2 id="benefits-of-network-flow-data">Benefits of network flow data</h2> <p>In my experience, most of the time the level of visibility we get with flow data is perfect for what we’re trying to do, even when you take sampling into account. We get information about both ends of a conversation, application information, ports, protocols, path information, QoS activity, and BGP info. We can detect anomalies, recognize traffic patterns, and get information beyond just headers, such as DNS info. So there’s really a lot we’re getting without capturing, processing, and storing every single packet.</p> <h3 id="security-and-compliance-concerns-with-packet-captures">Security and compliance concerns with packet captures</h3> <p>And, of course, we’re also avoiding the compliance issues or privacy violations, such as with <a href="https://www.hhs.gov/hipaa/for-professionals/privacy/index.html">HIPAA</a>, and the <a href="https://resources.infosecinstitute.com/topic/computer-forensics-chain-custody/">chain of custody</a> concerns of some law enforcement agencies, that arise when we start cracking open payloads.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6bOHgIuQHEdeZn1suscOXL/05595707bc90aaa7c96dbb3e4bd85ab3/burt-macklin-fbi.gif" style="max-width: 498px;" class="image center" alt="Burt Macklin, FBI, from Parks and Recreation" /> <h2 id="other-issues-with-packet-capture-vs-flow">Other issues with packet capture vs. flow</h2> <p>Remember that to capture literally every single packet on the network, we’re talking about installing <a href="https://www.kentik.com/blog/what-is-port-mirroring-span-explained/" title="Kentik Blog: What Is Port Mirroring? SPAN Explained">taps</a> to collect, aggregate, and forward that traffic to some other collector. That means an additional parallel—and a typically very expensive—network to buy and maintain.</p> <p>Ain’t nobody got time for that.</p> <p>A cool thing about flow is that you actually do have the ability to extract the relevant parts of the packet, create aggregated counters, and export that data through a protocol light enough that even the wimpiest processors can handle it. And if you really need to, there are mechanisms that allow the export of actual packets alongside the interfaces they were seen on.
Cisco has “Packet section” for v9 and sFlow has the ability to export payloads up to whatever length you want. And what’s crazy is that some devices will let you do that without sampling, which is usually the biggest gripe I hear about flow.</p> <h3 id="encryption-and-packet-captures">Encryption and packet captures</h3> <p>Another issue is that today we’re often working with TLS 1.3, which means we need keys for every single encrypted session to look at payloads. The problem is a lot of devices just don’t support exporting session keys, which is critical for decryption of payloads. So that means you end up storing a bunch of packets with encrypted payloads you can’t do anything with. For many scenarios, that’s totally pointless, especially considering you can get a lot of the same metadata from flows.</p> <h2 id="when-should-full-packet-captures-be-used">When should full packet captures be used?</h2> <p><em>But what about those times when full packet captures <strong>do</strong> make sense?</em></p> <p>Orgs in some highly regulated industries are <em>required</em> to keep full packet captures, so whether or not it’s useful or cost prohibitive is pretty much irrelevant if they want to be in compliance.</p> <p>From a troubleshooting perspective, you can’t beat a <a href="https://en.wikipedia.org/wiki/Pcap" title="Wikipedia: pcap">pcap</a>. Assuming it’s traffic from your internal network, you can reconstruct an entire TCP conversation. I bet many-an-engineer would admit to running Wireshark on a lonely Friday night just to mess around with reconstructing a VoIP call.</p> <p>And security folks might look at individual packets to do deep forensics analysis after a breach.</p> <p>So clearly, there are specific use cases for full packet captures, but those are the exceptions and not the norm, at least not for most network operations teams. For those infrequent times when you need to perform granular troubleshooting or <a href="https://www.kentik.com/kentipedia/network-forensics/" title="Kentipedia: Network Forensics and the Role of Flow Data in Network Security">network forensic analysis</a>, you can spin up your favorite tool and run an ad-hoc, temporary, and targeted packet capture.</p> <h2 id="netflow-vs-pcap-in-the-real-world">NetFlow vs. pcap in the real world</h2> <p>I love pcaps as much as the next packet herder, but it just doesn’t make sense in most daily network operations as the primary, always-on visibility method. Most of the time, the level of visibility we get with flow data gives us the cost-effective and useful visibility we’re looking for.</p> <p>In a perfect world, we could all afford to keep every packet and it would take 0ms to query out the data we want. But in the real world, we pcap where we <em>have</em> to, and we extract the stuff we really need — via flows — everywhere else.</p><![CDATA[Anatomy of an OTT traffic surge: Thursday Night Football on Amazon Prime Video]]><![CDATA[This fall Amazon Prime Video became the exclusive broadcaster of the NFL's *Thursday Night Football*. This move continued Prime Video's push into the lucrative world of live sports broadcasting. 
As you can imagine, these games have led to a surge in traffic for this OTT service.]]>https://www.kentik.com/blog/anatomy-ott-traffic-surge-thursday-night-football-amazon-prime-videohttps://www.kentik.com/blog/anatomy-ott-traffic-surge-thursday-night-football-amazon-prime-video<![CDATA[Doug Madory]]>Wed, 28 Sep 2022 16:00:00 GMT<p>This fall Amazon Prime Video became the exclusive broadcaster of the NFL’s <em>Thursday Night Football</em>. This move continued <a href="https://en.wikipedia.org/wiki/Sports_on_Amazon_Prime_Video">Prime Video’s push</a> into the lucrative world of live sports broadcasting.</p> <p>While they had previously aired TNF, as it is known, this is the first season Amazon Prime Video has exclusive rights to broadcast these games. As you can imagine, airing these games has led to a surge in traffic for this <a href="https://en.wikipedia.org/wiki/Over-the-top_media_service">OTT service</a>.</p> <h3 id="ott-service-tracking">OTT Service Tracking</h3> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/" title="Learn more about the Kentik Service Provider Analytics product">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/" title="Gaming as on OTT service: Virgin Media reveals that Call Of Duty: Warzone has the “biggest impact” on its network">Call of Duty update</a> or a <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday/">Microsoft Patch Tuesday</a>, these OTT traffic events can put a lot of load on a network and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. Knowing the source and destination IPs of the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p><a href="https://www.kentik.com/resources/kentik-true-origin/" title="Learn more about Kentik True Origin">Kentik True Origin</a> is the engine that powers OTT Service Tracking workflow. True Origin detects and analyzes the DNA of over 540 categorized OTT services and providers and more than 50 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p> <h3 id="-are-you-ready-for-some-caching-">🎶 Are you ready for some caching? 🎶</h3> <p>In these days of an endlessly fractured media landscape, professional football remains a ratings powerhouse in the United States, consistently drawing in audiences numbered in the millions.</p> <p>As illustrated below in a screenshot from Kentik’s Data Explorer view, Amazon Prime Video traffic dramatically surged during the previous two Thursday evenings. 
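</p> <p>Conceptually, the DNS-to-flow join behind this kind of attribution can be pictured with the toy sketch below: DNS answers tell you which server IPs belong to which OTT service, and flow records are then attributed by matching on those IPs. The addresses and byte counts here are invented, and the real system performs this join continuously and at enormous scale:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from collections import Counter

# Server IP to OTT service, learned from subscribers' DNS answers (invented values).
dns_answers = {
    "203.0.113.7": "Prime Video",
    "198.51.100.9": "AnotherStream",
}

# Flow records reduced to (server-side IP, bytes delivered toward subscribers).
flows = [("203.0.113.7", 1_200_000), ("203.0.113.7", 950_000), ("198.51.100.9", 80_000)]

traffic_by_service = Counter()
for ip, nbytes in flows:
    traffic_by_service[dns_answers.get(ip, "unattributed")] += nbytes

print(traffic_by_service.most_common())</code></pre></div> <p>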
If traffic volume can indicate viewership, then broadcasting Thursday Night Football more than doubled the number of households watching Amazon Prime Video.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1QyPaemN0jHPJJkELlftdt/b0be75c71b980574ced85c6e5272ab6a/OTT_TNF_on_Amazon_Prime_Video.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="OTT TNF on Amazon Prime Video" /> <div class="caption" style="margin-top: -30px;">Amazon Prime Video OTT traffic analyzed with Kentik</div> <p>Perhaps more interesting is the breakdown in source CDNs: not all traffic was delivered via Amazon Prime Video’s sister business unit Amazon Web Services (AWS). While we saw AWS delivering the majority of the traffic, also in the CDN mix were Edgio (Limelight), Akamai, Fastly, and Lumen.</p> <p>The graphic below shows how Amazon Prime Video was delivered during this two-week period. By breaking down the traffic by Source Connectivity Type (below), we can see how TNF was delivered by a variety of sources including private peering, IXP, embedded cache, and transit. For the game on September 22nd, the connectivity source breakdown was embedded cache (42.7%), private peering (40.4%), IXP (8.5%), and transit (8.4%).</p> <img src="//images.ctfassets.net/6yom6slo28h2/57bAiEkXfonZizp2p9HXS9/e5fd6968af9e7590d6c01c4e0b5f9dcc/OTT_TNF_on_Prime_Video_src.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="OTT: TNF on Prime Video by Connectivity Type" /> <div class="caption" style="margin-top: -30px;">Amazon Prime Video OTT traffic analysis by source</div> <p>It is normal for CDNs with a last mile cache embedding program to heavily favor this mode of delivery over other connectivity types as it allows:</p> <ol> <li>The ISP to save transit costs</li> <li>The subscribers to get demonstrably better last-mile performance</li> </ol> <p>In addition to source CDN and connectivity type, users of Kentik’s OTT Service Tracking are also able to break down traffic volumes by subscribers, specific router interfaces, and customer locations.</p> <h3 id="how-does-ott-service-tracking-help">How does OTT Service Tracking help?</h3> <p>In July, my colleague Greg Villain <a href="https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update/" title="Learn more about recent OTT service tracking enhancements">described the latest enhancements</a> to our OTT Service Tracking workflow which allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like the release of a blockbuster movie on streaming can have impacts in all three areas. OTT Service Tracking is the key to understanding and responding when they occur. Learn more about the application of Kentik for <a href="https://www.kentik.com/solutions/usecase/network-business-analytics/" title="Network Business Analytics Use Cases for Kentik">network business analytics here</a>.</p> <p>Ready to improve over-the-top service tracking for your own networks? <a href="#demo_dialog" title="Request your Kentik demo">Get a personalized demo</a>.</p><![CDATA[Webinar recap: A NetOps guide to DDoS defense]]><![CDATA[Didn’t have time to watch our NetOps guide to DDoS defense webinar with Cloudflare? 
This blog recaps what was presented and discussed.]]>https://www.kentik.com/blog/webinar-recap-a-netops-guide-to-ddos-defensehttps://www.kentik.com/blog/webinar-recap-a-netops-guide-to-ddos-defense<![CDATA[Stephen Condon]]>Tue, 27 Sep 2022 04:00:00 GMT<p>We recently broadcast a <a href="https://www.kentik.com/go/webinar/netops-guide-to-ddos-defense/">webinar to discuss DDoS trends and developments</a>. Our partner Cloudflare had their cybersecurity evangelist, Ameet Naik, join us along with our own <a href="https://www.kentik.com/analysis/">Doug Madory, head of internet analysis</a>. I had the pleasure of moderating the discussion.</p> <h2 id="ddos-attack-trends">DDoS attack trends</h2> <p>Ameet got the presentation started with an update on DDoS attack trends. Cloudflare has incredible visibility into global internet traffic and attack trends due to its presence in 275 cities globally, with over 11,000 networks directly connecting to it. Cloudflare blocks 117 billion cyber threats each day. Highlights included that SYN attacks accounted for more than 50% of network-layer DDoS attacks in Q2 2022. Other significant threats were attributed to:</p> <ul> <li>DNS: 17.6%</li> <li>RST: 7.8%</li> <li>UDP: 6.7%</li> </ul> <p>In terms of emerging threats, Cloudflare detected significant rises in the following:</p> <ul> <li>CHARGEN up 378%</li> <li>Ubiquiti up 328%</li> <li>Memcached up 287%</li> </ul> <p>Ameet gave a brief description of the profile of these attacks and the impact they can have.</p> <p>Next was the Russia vs. Ukraine cyber war. Ameet spoke about how entities targeting Ukrainian companies appear to be trying to silence information. The most attacked industries in Ukraine are broadcasting, internet, online media, and publishing. Attacks on these industries make up almost 80% of all DDoS attacks targeting Ukraine.</p> <p>Concerning attacks on Russian cyber assets, Russian banking, financial services, and insurance (BFSI) companies came under the most attacks. Almost 45% of all DDoS attacks targeted the BFSI sector. The second most targeted was the cryptocurrency industry, followed by online media.</p> <h2 id="ransom-ddos-attacks">Ransom DDoS attacks</h2> <p>The number of respondents reporting threats or ransom notes in Q2 increased by 11% QoQ and YoY. These attacks involve demanding a ransom in exchange for stopping the attack. During this quarter, Cloudflare has been mitigating ransom DDoS attacks launched by entities claiming to be the Advanced Persistent Threat (APT) group “Fancy Lazarus.” The campaign has been focusing on financial institutions and cryptocurrency companies. There was some discussion about these attacks being correlated with high seasons — note the historical peak in December.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3gdy1MfnabqeOpL6zQfUfA/c2313f4331eec4b74d7bfadb195bb667/cloudflare-ransom-ddos-attacks.png" style="max-width: 800px;" class="image center" alt="Cloudflare - Ransom DDoS attacks and threats" /> <p>Next were some key observations about the DDoS threat landscape: Most DDoS attacks are cyber vandalism, which can be powerful and cause damage.</p> <p>Sophisticated, large, or well-funded attacks are rare, but hit hard and fast. They are initiated by humans, but executed by machines. Attackers can be very persistent in learning your network topology and identifying weak points.</p> <p>Ameet concluded with an overview of the Cloudflare approach to DDoS protection and how Kentik integrates with Cloudflare to trigger mitigation.
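</p> <p>Mechanically, an integration like this boils down to a simple loop: when detected attack traffic crosses a threshold derived from your baselines, an orchestration call asks the mitigation provider to begin scrubbing. The sketch below shows the shape of that logic; the endpoint and payload are placeholders, not Cloudflare’s actual API:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import json
import urllib.request

THRESHOLD_BPS = 2_000_000_000  # derived from baselines of normal traffic

def trigger_mitigation(prefix):
    """Ask the mitigation provider to start scrubbing a prefix.

    The endpoint and payload are placeholders, not Cloudflare's real API.
    """
    body = json.dumps({"prefix": prefix, "action": "mitigate"}).encode()
    req = urllib.request.Request(
        "https://mitigation.example.com/v1/activate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)

def on_traffic_sample(prefix, bps):
    """Called for each detection-engine sample; fires the mitigation on breach."""
    if bps > THRESHOLD_BPS:
        trigger_mitigation(prefix)</code></pre></div> <p>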
If you’d like to learn more about how this integration works, please read our blog post <a href="https://www.kentik.com/blog/cybersecurity-cloudflare-and-kentik-mitigate-ddos-attacks/">Working with Cloudflare to mitigate DDoS attacks</a> from earlier this year.</p> <h2 id="bgp-and-ddos-mitigation">BGP and DDoS mitigation</h2> <p>Our internet analyst was up next. Doug Madory is a leading authority on BGP analysis. He started the discussion with an interesting story about the origins of Prolexic, how they used BGP to protect an offshore gambling site from DDoS attacks, and how he got involved with the New York Times and their <a href="https://www.nytimes.com/news-event/sports-betting-daily-fantasy-games-fanduel-draftkings">online gambling investigation</a>. You’ll have to listen to the webinar to hear Doug tell the story — or what he was willing to share publicly.</p> <p>Doug then outlined the role BGP plays in DDoS mitigation and what can go wrong, including:</p> <ul> <li>A mitigation vendor announcement getting filtered based on RPKI or IRR data.</li> <li>Previous upstream(s) not stopping their announcements when the mitigation vendor begins, leading to contention and incomplete activation.</li> <li>Other problems with the signaling to the mitigation vendor that affect the timing of the activation.</li> </ul> <p>Kentik has some new visualizations of DDoS mitigation activations, an effort Doug has been leading. These views use Kentik synthetic agents to view BGP announcements from hundreds of vantage points. Companies using a BGP-based DDoS mitigation service generally cannot see how fast or complete the mitigation is when they are under attack. As you can see from the following visualizations, Kentik has a solution.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6WomohGX3oJTSTSpj1Vs7O/379b876852b02e9c5bd5fa486d8ad27c/effective-ddos-mitigation.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="An effective DDoS mitigation" /> <div class="caption" style="margin-top: -30px;">What effective mitigation looks like</div> <img src="//images.ctfassets.net/6yom6slo28h2/4RSocS3V6Lc7C7hX5ZOuh1/94d87554aa62268f03bcc3dab0b76841/incomplete-ddos-mitigation.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Incomplete DDoS mitigation" /> <div class="caption" style="margin-top: -30px;">An incomplete DDoS mitigation</div> <p>Ameet mentioned that he often gets asked how long it takes for a mitigation to propagate fully. Well, with these visualizations he can now answer that question.</p> <h2 id="why-network-observability-is-critical-to-ddos-defense">Why network observability is critical to DDoS defense</h2> <p>Before the Q&#x26;A, I concluded the webinar with a recap of my recent blog <a href="https://www.kentik.com/blog/8-reasons-network-observability-critical-for-ddos-detection-and-mitigation/">8 reasons why observability is critical to DDoS defense</a>.
Don’t miss the <a href="https://www.kentik.com/resources/detecting-and-mitigating-ddos-attacks-with-kentik-protect/">video walk-through of Kentik Protect</a> showing how we detect and analyze DDoS attacks.</p> <p>Also, learn more about monitoring BGP in the context of DDoS mitigation in this short video.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/t3ft98f9ev" title="How BGP propagation affects DDoS mitigation" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>Don’t hesitate to reach out if you have questions or <a href="#demo_dialog" title="Request your Kentik demo">want a personal demonstration</a>.</p><![CDATA[What can be learned from recent BGP hijacks targeting cryptocurrency services?]]><![CDATA[On August 17, 2022, an attacker was able to steal $235,000 in cryptocurrency by employing a BGP hijack against Celer Bridge, a cryptocurrency service. While this incident is the latest such attack, there are lessons beyond the world of cryptocurrency for any organization that conducts business on the internet.]]>https://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-serviceshttps://www.kentik.com/blog/bgp-hijacks-targeting-cryptocurrency-services<![CDATA[Doug Madory]]>Thu, 22 Sep 2022 04:00:00 GMT<p>On August 17, 2022, an attacker was able to steal approximately $235,000 in cryptocurrency by employing a <a href="http://www.kentik.com/kentipedia/what-is-bgp-border-gateway-protocol/#as-announcements-and-related-issues-and-risks" title="Kentipedia: Learn more about BGP and risks associated with AS announcements, including BGP hijacks">BGP hijack</a> against the Celer Bridge, a service that allows users to convert between cryptocurrencies.</p> <p>This blog post discusses this and previous infrastructure attacks against cryptocurrency services. While these episodes revolve around cryptocurrency theft, the underlying attacks hold lessons for securing the BGP routing of any organization that conducts business on the internet.</p> <div as="Promo"></div> <h2 id="the-attack-against-celer-bridge">The Attack Against Celer Bridge</h2> <p>In a detailed <a href="https://blog.coinbase.com/celer-bridge-incident-analysis-895a9fc77e57">blog post</a> earlier this month, the threat intelligence team from Coinbase explained how the attack went down. <em>(Note: Coinbase was not the target of the attack.)</em> In short, the attacker used a BGP hijack to gain control of a portion of Amazon’s IP address space.</p> <p>Doing so allowed it to impersonate part of the Celer Bridge infrastructure, which was hosted by Amazon, and issue malicious smart contracts. These “phishing contracts” stole the victim’s assets by redirecting them to the attacker’s wallet.</p> <p>To ensure the BGP hijack would be successful, the attacker needed to make sure its malicious BGP announcements wouldn’t get filtered by an upstream network by taking two steps. First, it inserted bogus route objects (shown below) for QuickhostUK in AltDB, a free alternative to the IRR databases managed by RADB and the five RIRs. 
<em>At this time, it is unclear whether QuickhostUK was also a victim in this attack</em>.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">irrd.log-20220817.gz:31106270-ADD 96126
irrd.log-20220817.gz:31106280-
irrd.log-20220817.gz:31106281-as-set: AS-SET209243
irrd.log-20220817.gz:31106306-descr: quickhost set
irrd.log-20220817.gz:31106332-members: AS209243, AS16509
irrd.log-20220817.gz:31106362:mnt-by: MAINT-QUICKHOSTUK
irrd.log-20220817.gz:31106392-changed: crussell()quickhostuk net 20220816
irrd.log-20220817.gz:31106438-source: ALTDB

irrd.log-20220817.gz:31147549-ADD 96127
irrd.log-20220817.gz:31147559-
irrd.log-20220817.gz:31147560-route: 44.235.216.0/24
irrd.log-20220817.gz:31147588-descr: route
irrd.log-20220817.gz:31147606-origin: AS16509
irrd.log-20220817.gz:31147626:mnt-by: MAINT-QUICKHOSTUK
irrd.log-20220817.gz:31147656-changed: crussell()quickhostuk net 20220816
irrd.log-20220817.gz:31147702-source: ALTDB</code></pre></div> <div class="caption">Credit: Siyuan Miao of Misaka on <a href="https://seclists.org/nanog/2022/Aug/236">NANOG list</a></div> <p>Since network service providers build their route filters using these IRR databases, this step was necessary to ensure upstream networks wouldn’t filter the bogus announcements coming from QuickhostUK.</p> <p>Second, the attacker altered the AS_PATH of the route so that it appeared to originate from an Amazon ASN (specifically AS14618). This also caused the route to be evaluated as RPKI-valid. More on that later.</p> <p>After the hijack in August, I <a href="https://twitter.com/DougMadory/status/1562089866321698819">tweeted out</a> the following Kentik BGP visualization showing the propagation of this malicious route. The upper portion shows 44.235.216.0/24 appearing with an origin of AS14618 (in green) at 19:39 UTC and quickly becoming globally routed. It was withdrawn at 20:22 UTC but <a href="https://twitter.com/DougMadory/status/1562090975287197697">returned again</a> at 20:38, 20:54, and 21:30 before being withdrawn for good at 22:07 UTC.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7yMBjSi56iZ8wGRQILqDzW/2677e9272c2babf503bae0d53df495d2/bgp-visualization-propagation-malicious-route.png" style="max-width: 800px" class="image center" withFrame alt="Propagation of malicious route" /> <p>Amazon didn’t begin announcing this identical /24 until 23:07 UTC (in purple), an hour after the last hijack was finished and more than three hours after the hijacks began. According to Coinbase’s timeline, victims had cryptocurrency stolen in separate events between 19:51 and 21:49 UTC.</p> <h2 id="prior-infrastructure-attacks-against-crypto">Prior Infrastructure Attacks Against Crypto</h2> <p>The Celer Bridge attack wasn’t the first time a cryptocurrency service was targeted using a BGP hijack. In April 2018, Amazon’s authoritative DNS service, Route 53, <a href="https://www.theverge.com/2018/4/24/17275982/myetherwallet-hack-bgp-dns-hijacking-stolen-ethereum">was hijacked</a> to redirect certain DNS queries to an imposter DNS service, as is illustrated below.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5k0P7e6IdFvcPyvbTdOeiU/7df1cf68085c2af5ee29c46fc71fcd87/myetherwallet-attack.png" style="max-width: 800px" class="image center no-shadow" alt="Diagram of the myetherwallet attack" /> <p>The imposter authoritative DNS server returned bogus responses for myetherwallet.com, misdirecting users to an imposter version of MyEtherWallet’s website.
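</p> <p>This is precisely the failure mode that DNS monitoring (discussed below) is designed to catch: resolve the domain from many vantage points and compare the answers against an expected set. A single-vantage-point sketch, with placeholder domain and expected addresses:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import socket

DOMAIN = "example.com"
EXPECTED = {"203.0.113.10", "203.0.113.11"}  # placeholder "known good" answers

# gethostbyname_ex returns (hostname, aliases, addresses)
_, _, addresses = socket.gethostbyname_ex(DOMAIN)
unexpected = set(addresses) - EXPECTED
if unexpected:
    print(f"ALERT: {DOMAIN} resolved to unexpected addresses: {unexpected}")</code></pre></div> <p>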
When users logged into their accounts (after clicking past the certificate error, ∗sigh∗), the cryptocurrency was drained from their accounts.</p> <p>Within a couple of months of this incident, Amazon had <a href="https://twitter.com/JobSnijders/status/1286817558851723264">published ROAs</a> for their routes, including those of its <a href="https://twitter.com/JobSnijders/status/1005163149346181120">authoritative DNS service</a>. This move enabled RPKI ROV to help offer some protection against such an attack in the future.</p> <p>Since public DNS services like Google DNS peer directly with Amazon and reject RPKI-invalids, it would be difficult, if not impossible, to fool Google DNS like this again. If an attacker surreptitiously appended an Amazon ASN to the AS_PATH of its hijack route in order to render it RPKI-valid, it would be unlikely to be selected over the legitimate route from Amazon because of its longer AS_PATH length.</p> <p>The BGP hijack against Amazon would not be the last to target cryptocurrency.</p> <p>Earlier this year, another incident involved the manipulation of BGP to target a cryptocurrency service. Attackers were able to make off with over $2 million in cryptocurrency by employing a BGP hijack against KLAYswap, an online cryptocurrency exchange based in South Korea.</p> <p>Henry Birge-Lee and his colleagues at Princeton authored an <a href="https://freedom-to-tinker.com/2022/03/09/attackers-exploit-fundamental-flaw-in-the-webs-security-to-steal-2-million-in-cryptocurrency/">excellent post</a> on this incident. In this incident, the attackers went after the users of the KLAYswap cryptocurrency exchange by performing a BGP hijack of the IP space of a South Korean hosting provider (Kakao), as illustrated below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4e5bdNcFlBHQjl3eMbwJJL/e121a6bd7b09c15e04deeff92cd7620b/klayswap-kakao.png" style="max-width: 600px; margin-bottom: 5px;" class="image center no-shadow" alt="Diagram of attackers going after cryptocurrency using a BGP hijack" /> <div class="caption">Original image credit: Henry Birge-Lee, <a href="https://freedom-to-tinker.com/">https://freedom-to-tinker.com</a></div> <p>This is because Kakao hosted a JavaScript library loaded when users were on the KLAYswap platform. The BGP hijack enabled the attackers to impersonate Kakao and return a malicious version of this library, redirecting user transactions to destinations controlled by the attackers.</p> <hr> <h4 id="explainer-department">Explainer Department</h4> <h2 id="what-is-bgp-hijacking">What is BGP Hijacking?</h2> <p>BGP hijacking, or IP hijacking, refers to a malicious attempt by attackers to illicitly take control of a group of IP prefixes via the Border Gateway Protocol (BGP). By manipulating the internet’s routing tables, the attacker reroutes internet traffic to a system under their control or in a manner that disrupts the intended path of data, leading to significant disruptions and potential cybersecurity breaches.</p> <p>The core mechanism of BGP hijacking is founded on the inherent trust in the internet’s infrastructure. BGP, an essential protocol used to route web traffic between IP networks, doesn’t have intrinsic security measures, making it susceptible to hijacking. The attacker announces fake BGP routes that confuse routers into redirecting traffic meant for a particular IP address range to a different, unanticipated route. 
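</p> <p>Part of what makes hijacks so effective is longest-prefix matching: routers prefer the most specific route to a destination, so a hijacker who announces a more-specific prefix inside a victim’s address block captures that traffic outright. A toy illustration of the selection logic, with a simplified routing table and invented addresses:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import ipaddress

# A simplified routing table: (announced prefix, who announced it).
routes = [
    (ipaddress.ip_network("10.10.0.0/16"), "legitimate origin"),
    (ipaddress.ip_network("10.10.200.0/24"), "hijacker"),  # more-specific announcement
]

dst = ipaddress.ip_address("10.10.200.5")
candidates = [(net, owner) for net, owner in routes if dst in net]
best = max(candidates, key=lambda c: c[0].prefixlen)
print(best[1])  # "hijacker" -- the most specific matching route wins</code></pre></div> <p>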
This strategy allows the hijacker to eavesdrop, manipulate data, or halt the traffic, causing substantial harm.</p> <p>The consequences of BGP hijacking can be far-reaching. In addition to the immediate impact of traffic disruption, it can lead to severe security incidents, such as data breaches and violation of privacy. For example, an attacker could potentially capture sensitive information transmitted over the hijacked path, modify the content of data in transit, or use the hijacked routes for activities like DDoS attacks. Due to these high-risk scenarios, securing BGP and developing strategies to detect and mitigate BGP hijacking have become critical to maintaining internet security.</p> <p>While malicious intent is undoubtedly a concern, it’s important to note that not all instances of BGP hijacking are intentionally harmful. Many instances result from inadvertent misconfigurations rather than from malicious activity. For example, a simple typo in a configuration file can unintentionally lead to the announcement of incorrect BGP routes, causing unintended traffic redirection. Despite the lack of malice in these cases, the effect on internet traffic and the potential for information exposure remains significant.</p> <hr> <h2 id="what-can-be-done-to-protect-against-bgp-hijacks">What Can be Done to Protect Against BGP Hijacks?</h2> <p>While the abovementioned incidents involved the targeting of cryptocurrency services, the underlying issues are universal and can affect any organization that uses internet-based services. To safeguard against attacks like these, BGP and DNS monitoring need to play a central role in your monitoring strategy. Strict RPKI configuration can also increase the difficulty for someone to hijack your routes, as I will explain.</p> <h3 id="bgp-and-dns-monitoring">BGP and DNS Monitoring</h3> <p>DNS monitoring exists for the scenario that unfolded with MyEtherWallet in 2018. It uses agents around the world to check that queries for a specified domain return expected results. If a response contains something other than what was expected, it will fire off an alert.</p> <p>In the case of last month’s Celer Bridge attack, <a href="https://www.kentik.com/blog/introducing-bgp-monitoring-from-kentik/">BGP monitoring</a> <em>could</em> have alerted that a new /24 of Amazon address space was being announced, although the forged Amazon origin may have caused it to appear legitimate.</p> <p>However, when this new /24 appeared with an unexpected upstream of AS209243 (Quickhost), an alert should have drawn attention to this anomaly. The key detail here that would have distinguished this alert from the appearance of just another peer of Amazon would have been that the new upstream was seen by 100% of BGP vantage points. In other words, this new Amazon prefix was getting exclusively transited by this relatively unknown hosting provider. That should have raised some eyebrows among the Amazon NetOps team.</p> <h3 id="rpki-rov">RPKI ROV</h3> <p>Amazon had an ROA for the hijacked prefix, so why didn’t RPKI ROV help here? It is important first to emphasize that RPKI ROV is intended to limit the impact of <em>inadvertent or accidental hijacks</em> due to routing leaks or misconfigurations. This is because ROV alone cannot prevent a “determined adversary” from forging the origin in the AS_PATH, rendering a malicious hijack RPKI-valid.</p> <p>Having said that, it <em>could</em> have still helped if the ROA were set up differently. 
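</p> <p>The ROV decision itself is mechanical, which is why the details of the ROA matter so much. The simplified single-ROA validator below (real validators handle multiple ROAs, AS 0, and other cases) shows how a permissive maxLength lets a forged-origin /24 evaluate as valid, while a strict maxLength would have rendered it invalid:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import ipaddress

def rov(route, origin, roa_prefix, roa_origin, roa_maxlen):
    """Simplified single-ROA route origin validation."""
    route = ipaddress.ip_network(route)
    roa_prefix = ipaddress.ip_network(roa_prefix)
    if not route.subnet_of(roa_prefix):
        return "not-found"  # no covering ROA
    if origin == roa_origin and route.prefixlen &lt;= roa_maxlen:
        return "valid"
    return "invalid"

# Liberal ROA (maxLength 24): the hijacked /24 with a forged origin is "valid".
print(rov("44.235.216.0/24", 16509, "44.224.0.0/11", 16509, 24))  # valid
# Strict ROA (maxLength 11): any more-specific announcement becomes "invalid".
print(rov("44.235.216.0/24", 16509, "44.224.0.0/11", 16509, 11))  # invalid</code></pre></div> <p>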
As it stands, the ROA for the address space in question is quite liberal. Basically, three different Amazon ASNs (16509, 8987, and 14618) can all announce parts of this address space with prefixes ranging in size from a /10 all the way down to a /24. See the output of the <a href="https://rpki-validator.ripe.net/ui/44.235.216.0%2F24?validate-bgp=true">Routinator web UI</a>:</p> <img src="//images.ctfassets.net/6yom6slo28h2/229HNW2RopCK93e2jVEVG8/ef880943cec3bc5275a711fd5de53668/routinator-output.png" style="max-width: 600px" class="image center" alt="Routinator output" /> <p>An alternative approach to ROA creation would be to do what other networks such as Cloudflare and Comcast have done: set the origin and maximum prefix length to be <em>identical</em> to how the prefix is routed. While this approach incurs an overhead cost of updating an ROA every time a route is modified, it also leaves little room for alternate versions of the route to come into circulation. Another consideration is the propagation time of the RPKI system itself: changes to ROAs take time to propagate around the world, and networks only periodically update their RPKI data.</p> <p>In its <a href="https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/">blog post</a> following the June 2019 routing leak by Allegheny Technologies, Cloudflare made the argument that had Verizon deployed RPKI ROV and had been rejecting RPKI-invalid routes, the leak would not have circulated, and the impact would have been minimal. As I discussed in my <a href="https://storage.googleapis.com/site-media-prod/meetings/NANOG78/2104/20200211_Madory_Visualizing_Major_Routing_v1.pdf">talk at NANOG 78</a> in February 2020, this statement is only true because the maximum prefix lengths in Cloudflare’s ROAs matched the prefix lengths of their routes. This is not true of many ROAs, including Amazon’s.</p> <p>At NANOG 84 earlier this year, <a href="https://storage.googleapis.com/site-media-prod/meetings/NANOG84/2491/20220214_Tauber_One_Rpki_Deployment_v1.pdf">Comcast presented the story</a> of how they deployed RPKI ROV on their network. <a href="https://youtu.be/DPJxl1jPUbU?t=1802">In the Q&#x26;A</a>, they confirmed that they adopted a strategy of using automation to maintain exact matches of maximum prefix lengths in their ROAs to avoid this route optimizer leak scenario.</p> <p>Had Amazon created a ROA specifically for 44.224.0.0/11 with an origin of AS16509 and a max-prefix-len of 11, then the attacker would have had to do one of two things to pull off this attack. One option would be to announce the same route (44.224.0.0/11 with a forged origin of AS16509). This route would have been RPKI-valid but would have had to contend with the real AS16509 for route propagation. Alternatively, if the attacker announced a more-specific route, it would have been evaluated as RPKI-invalid and had its <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/">propagation dramatically reduced</a>, if not completely blocked, given that the upstream in this case was Arelion and they <a href="https://blog.arelion.com/2020/05/06/routing-security-rpki-update-q2-20/">reject RPKI-invalid routes</a>.</p> <h2 id="conclusion">Conclusion</h2> <p>As mentioned, the attacks against cryptocurrency services in recent years highlight universal problems that aren’t restricted to cryptocurrencies.
Companies looking to secure their internet-facing infrastructures need to deploy robust BGP and DNS monitoring of their infrastructure and any internet-based dependencies they may have.</p> <p>Additionally, companies should reject RPKI-invalid routes while creating <em>strict</em> ROAs for their IP address space by including maximum prefix lengths that match the prefix lengths used in their routes. In fact, <a href="https://www.rfc-editor.org/rfc/rfc9319.html">RFC 9319 The Use of maxLength in the Resource Public Key Infrastructure (RPKI)</a> states that it is a “best current practice” that networks entirely avoid using the maxLength attribute in ROAs, except in certain circumstances. Leaving the maxLength field blank in a ROA has the same effect as setting the maxLength field to match the prefix. These steps can significantly reduce the window of opportunity for an attacker to subvert your internet infrastructure.</p><![CDATA[Separating the answers from the data: Networking Field Day 29]]><![CDATA[There is a critical difference between having more data and more answers. Read our recap of Networking Field Day 29 and learn how network observability provides the insight necessary to support your app over the network.]]>https://www.kentik.com/blog/separating-the-answers-from-the-data-networking-field-day-29https://www.kentik.com/blog/separating-the-answers-from-the-data-networking-field-day-29<![CDATA[Phil Gervasi]]>Thu, 15 Sep 2022 04:00:00 GMT<p>There is a key difference between having more data and having more answers.</p> <p>That was the theme for Kentik at <a href="https://techfieldday.com/event/nfd29/">Networking Field Day 29</a>. Rather than show the delegates how we collect network telemetry and use it to populate pretty graphs (who doesn’t love a colorful Sankey?), we wanted to show the delegates how everything we do is about helping an engineer actually solve real network operations problems — problems like figuring out <em>why</em> an application is super slow, or detecting a DDoS attack in real-time. In other words, it’s the difference between seeing <em>what</em> is happening and understanding <em>why</em> it’s happening.</p> <p>That’s one of the things I appreciate about the Kentik company culture. In many of the internal conversations I’ve had with co-workers, I sense a genuine desire to find ways to help network engineers keep service delivery smooth and stable. Even the most academic and nerdy conversations were in the context of the problem we were trying to solve.</p> <p>So when we talk about using a combination of flow data <em>and</em> synthetic tests to troubleshoot a problem, it’s because we found that’s a faster and better way to find the root cause of network issues. And honestly, it’s a better way to find proof that it <em>isn’t</em> a network problem, too.</p> <p>In the presentation below, Brian Davenport, one of Kentik’s superstar Solutions Engineers, talked about how Kentik combines flow data, enriches it with relevant metadata for more context, and augments it with the results of synthetic tests. The goal here isn’t just <em>more</em> data, although we certainly love more input into the system; instead, it’s all about how we combine it to better visualize how these metrics are related. 
That’s one important way we help you solve a complex problem faster and with less manual clue-chaining.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/60kxv6sfot" title="NFD 29: What's new at Kentik" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>Our VP of Solutions Engineering, Justin Ryburn, spoke about the same concept, but focused on how Kentik consolidates and aggregates all the diverse telemetry we have today in one place. So, since Brian really hammered the idea that flow data and synthetic test data work hand-in-hand, Justin showed how flows, SNMP, <a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/" title="Kentipedia: What are VPC Flow Logs?">VPC flow logs</a>, streaming telemetry, and everything else we ingest all come together into one coherent system. We understand that <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/" title="Kentik Blog: Best Practices for Enriching Network Telemetry to Support Network Observability">each type of telemetry</a> provides a different perspective of the network, and in his demo, he stepped through how we use all of this information to quickly isolate the cause of a slow web application (and yes, totally prove that it’s <em>not</em> the network).</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/5blts5sufa" title="NFD 29: Observability across the enterprise" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>This was the driving force behind what I wanted to say, too. I really wanted the delegates to understand that network observability, built on a foundation of diverse and accurate network data, is all about augmenting an engineer. 
It’s all about providing network operations with <em>useful</em>, insightful intelligence.</p> <p>Sure, we use ML and some well-known statistical analysis algorithms to do some of that, but we consider ML another tool in our toolbox, not the end-all be-all of what we’re doing.</p> <p>So the way we use a time series model to forecast, or how we perform anomaly detection without a crazy number of false positives, or when we compare traffic patterns to our global threat feed to <a href="https://www.kentik.com/blog/8-reasons-network-observability-critical-for-ddos-detection-and-mitigation/">alert you of a DDoS attack in real-time,</a> it always comes down to providing useful insight, not just more data to look at.</p> <p>And when we alert you to a problem, we can often also tell you what other things you should check out and what the possible root cause could be. I mean, that’s like the actual dictionary definition of the word “insight,” isn’t it?</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/x7ylqb212q" title="NFD 29: Network observability - the evolution of network visibility with Kentik" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>I’ve heard some in the networking community talk about <a href="/blog/network-observability-hype-or-reality/">network observability as marketecture or just a re-branding</a> of traditional monitoring and visibility, but it really is so much more. Built on a foundation of traditional network visibility, network observability is the evolution from just being able to see what’s going on in your network to understanding <em>why</em> something is happening.</p> <p>To learn more, watch the entire <a href="https://www.kentik.com/go/event/nfd29/">playlist of our presentations</a> from Networking Field Day 29 and download the <a href="https://www.kentik.com/resources/network-observability-for-dummies/">Network Observability for Dummies ebook</a>.</p><![CDATA[8 reasons why network observability is critical for DDoS detection and mitigation]]><![CDATA[Learn eight ways that network monitoring can be critical for DDoS detection and mitigation.]]>https://www.kentik.com/blog/8-reasons-network-observability-critical-for-ddos-detection-and-mitigationhttps://www.kentik.com/blog/8-reasons-network-observability-critical-for-ddos-detection-and-mitigation<![CDATA[Stephen Condon]]>Wed, 14 Sep 2022 04:00:00 GMT<p>Distributed denial-of-service (DDoS) attacks have been a continuous threat since the advent of the commercial internet. The struggle between security experts and DDoS attackers is an asymmetrical war in which $30 attacks can cost companies millions of dollars in downtime and breaches of contract. They can also be a smokescreen for something worse, such as the infiltration of malware.
In addition to ever-larger traffic volumes, attackers are also increasing their target diversity, with attack traffic simultaneously spanning data, applications, and infrastructure to increase the attack’s chances of success.</p> <p>At Kentik, we see thousands of DDoS mitigations activated each week. DDoS attacks continue to increase in number, volume, and sophistication. A <a href="https://blog.cloudflare.com/mantis-botnet/" title="Mantis - the most powerful botnet to date">June 2022 blog post by our partner Cloudflare</a> detailed one of the largest and most powerful DDoS attacks ever — the Mantis botnet was able to launch an attack that generated 26 million HTTPS requests per second!</p> <p>The cost to undertake DDoS attacks is plummeting, while the tools for carrying them out are becoming more sophisticated.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7ujYJgELlVjzofdjocFZ1I/5e8054f8f9ab04f0ff4c58ab54451a79/ddos-graphic.png" style="max-width: 400px; margin-bottom: -30px;" class="image center no-shadow" alt="DDoS protection" /> <h2 id="why-launch-a-ddos-attack">Why launch a DDoS attack?</h2> <p>There are many motivations for initiating a DDoS attack. Many are political, some are motivated by competition, and others are out of spite — such as disgruntled or former employees. Perpetrators can bring a target’s infrastructure to its knees, leveraging the situation to extort money or information, or to apply negotiation pressure.</p> <p>DDoS attacks are also used as a smokescreen for other more insidious attacks, such as the introduction of malware or a more overt crime, like theft.</p> <h2 id="ddos-protection-with-network-observability">DDoS protection with network observability</h2> <p>Early detection and mitigation are critical for businesses that want to protect themselves against a DDoS attack. Some DDoS attacks are sophisticated enough to successfully shut down large servers, and even completely disable a target’s network. This severe disruption to services and applications can result in direct revenue loss and damage to a brand’s reputation.</p> <p>Network observability can help you detect and mitigate malicious or accidental cybersecurity threats at their onset.</p> <p>Here are our top eight reasons why network observability is critical for defense against modern DDoS attacks:</p> <h3 id="1-early-detection">1. Early detection</h3> <p>The importance of early detection and mitigation of a DDoS attack cannot be overstated. It will save you time, frustration, revenue, and brand equity, and help you keep your infrastructure secure. Leading network observability solutions will understand your traffic by analyzing your real-time and historic NetFlow data, constantly comparing this traffic flow data against benchmarks to catch anomalous traffic patterns, giving network and security engineers what they need most: the awareness and time to mitigate the attack and protect their network before it does damage.</p> <h3 id="2-detecting-low-volume-attacks">2. Detecting low-volume attacks</h3> <p>When most people think of DDoS attacks, they think of massive volumetric attacks that crash websites or networks. In reality, most DDoS attacks are small in size and duration, often less than 1 Gbps and only a few minutes long, making them difficult to detect. DDoS detection tools are often configured with detection thresholds that ignore or don’t see these attacks. These low-volume attacks are often used to mask security breaches.
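</p> <p>Catching them comes down to baselining: learn what normal looks like for each traffic dimension, and alert on deviations well below any headline threshold. A bare-bones illustration of the idea using an exponentially weighted moving average, one of many possible approaches, with invented numbers:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">ALPHA = 0.1  # EWMA smoothing factor
K = 4.0      # alert when a sample sits K deviations above baseline

baseline, deviation = 100.0, 10.0  # seeded for the sketch; learned in practice

def observe(mbps):
    """Return True if this sample looks anomalous against the learned baseline."""
    global baseline, deviation
    anomaly = mbps - baseline > K * deviation
    baseline = (1 - ALPHA) * baseline + ALPHA * mbps
    deviation = (1 - ALPHA) * deviation + ALPHA * abs(mbps - baseline)
    return anomaly

for sample in [104, 97, 101, 99, 230]:  # a quiet link, then a small flood
    if observe(sample):
        print(f"anomaly: {sample} Mbps")  # fires only on 230</code></pre></div> <p>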
Hackers will use a DDoS attack to distract SecOps, while simultaneously launching a more rewarding security breach. The security breach could involve data being exfiltrated, networks being mapped for vulnerabilities, or ransomware being planted.</p> <p>Network observability solutions allow you to baseline against small traffic volumes, enabling network engineers to fine-tune thresholds and alerts accordingly.</p> <h3 id="3-granular-identification-of-traffic-sources">3. Granular identification of traffic sources</h3> <p>Identifying where traffic originates, and what normal traffic flows from those sources look like, is foundational to a defense strategy. The <a href="https://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources/" title="Kentik Blog: The Network Also Needs to be Observable, Part 2: Network Telemetry Sources">context-rich telemetry</a> that network observability solutions leverage includes critical network information like geolocation.</p> <p>To protect your infrastructure, you need to be able to build policies based on certain geographies, such as an alert if the traffic is from an embargoed country. Being able to identify the source of the traffic can help tremendously in the detection of security breaches. Identifying traffic from an unusual source may be the key to early mitigation.</p> <h3 id="4-understanding-the-attack-in-context">4. Understanding the attack in context</h3> <p>SNMP data is not enough! Flow data gives you the ability to understand the attack in context. It gives details on where the attack is coming from, as well as what IP addresses, ports, or protocols make up the attack.</p> <p>This context helps with mitigation, letting you better understand the nature of the attack and apply more accurate filters against the traffic.</p> <div class="pullquote center">In a DDoS attack, “you want to look at traffic volumes, but with Kentik we also can look at source IPs, AS numbers and other metrics to see if it’s a distributed attack. This is so easy to do in Kentik; you simply add the source IP address dimension to the analysis.”<div class="attribution">&mdash; Jurriën Rasing, Group Product Manager for Platform Engineering, Booking.com | <a href="https://www.kentik.com/resources/booking-com-has-no-reservations-about-the-value-of-kentik/" title="Case Study: Booking.com">Read the case study</a></div></div> <h3 id="5-determining-the-effectiveness-of-mitigations">5. Determining the effectiveness of mitigations</h3> <p>Mitigation services and technologies sometimes don’t achieve full coverage, and attack traffic can circumvent the mitigation, leaving you exposed. It’s important to be able to use NetFlow to analyze what DDoS traffic has been redirected for scrubbing and what traffic has been missed. And perhaps just as important, being able to monitor BGP from hundreds of vantage points can enable you to understand how quickly your mitigation service achieved full coverage, if it did at all.</p> <p>The BGP visualization below shows a DDoS mitigation vendor (purple) appearing upstream of the customer network but never achieving complete coverage of the customer network. Below that, we can see the result of this incomplete activation as only a portion of DDoS traffic is ultimately redirected to the DDoS mitigation vendor.
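</p> <p>Quantifying that coverage is straightforward once you can observe AS paths from many vantage points: measure the share of them that traverse the mitigation provider’s ASN. A toy calculation with hypothetical paths:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># AS paths observed for the victim's prefix at each BGP vantage point
# (hypothetical paths; 64500 is the mitigation provider, 64496 the victim).
paths = [
    [3356, 64500, 64496],
    [1299, 64500, 64496],
    [6939, 64496],  # this vantage point still reaches the victim directly
]

MITIGATION_ASN = 64500
coverage = sum(MITIGATION_ASN in path for path in paths) / len(paths)
print(f"mitigation coverage: {coverage:.0%}")  # 67% -- an incomplete activation</code></pre></div> <p>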
An incomplete DDoS mitigation permits attack traffic to reach the target network, imperiling critical services.</p> <img src="//images.ctfassets.net/6yom6slo28h2/64sLSbhNFc66YH8a9JzOZI/b4128bafaec334e0d455a04fff72a5af/incomplete-ddos-mitigation-example.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="Incomplete DDoS mitigation example" /> <img src="//images.ctfassets.net/6yom6slo28h2/3hCQatBSAc1Gq3GLiQmGIH/1adbebb518f9f71a64b6c14b70d395c0/ddos-attack-bgp-based-mitigation.png" style="max-width: 800px;" class="image center" thumbnail withFrame alt="DDoS attack with BGP-based mitigation" /> <h3 id="6-performing-attack-forensics">6. Performing attack forensics</h3> <p>Many DDoS attacks fit a pattern. Many of the same bad actors perpetrate them, and their fingerprints aren’t always obvious. A good network observability solution will allow you to look back in time and ask: Have we seen this attack before? Are there patterns? How can this be prevented altogether?</p> <h3 id="7-eliminating-false-positives">7. Eliminating false positives</h3> <p>Without a network observability platform that gives you granular traffic analysis, automated mitigations can cause you to filter traffic that is needed by your end users. This can result in you causing an outage for your users in an attempt to block an attack.</p> <p>False positives can be a big distraction for your SOC team. Alerts that, upon investigation, are revealed to be normal traffic result in alert fatigue. Eventually, your security experts will stop paying attention to the noise, leaving you open to malicious attacks.</p> <div class="pullquote center">“We were hesitant to consider a fully-automated DDoS mitigation approach. Initially, we had team members approving each mitigation because we thought there would be false positives. After a few weeks with Kentik, we began to trust the detection completely, and full automation is now easy and essential for us. We no longer have to sit around waiting for the next attack to happen.”<div class="attribution">&mdash; David Marble, President and CEO, OSHEAN | <a href="https://www.kentik.com/resources/case-study-oshean/" title="OSHEAN Case Study">Read the case study</a></div></div> <h3 id="8-controlling-costs">8. Controlling costs</h3> <p>DDoS traffic can wreak havoc with 95/5 (95th percentile) billing models, and always-on mitigation services can be expensive. True network observability will give you the ability to detect attacks at their onset, decreasing the chances of exceeding traffic limits, protecting your infrastructure, and giving you the ability to engage a mitigation service before the attack takes hold.</p> <div class="pullquote center">“With the high cost-per-bit of satellite infrastructure, bandwidth is a precious resource for us.
Kentik has allowed us to remove a substantial amount of abusive and malicious traffic from our network, with a huge measurable impact on our bottom line,”<div class="attribution">&mdash; Alex Kitthikoune, Network Administrator, Viasat | <a href="https://www.kentik.com/resources/case-study-viasat/" title="Viasat DDoS Case Study">Read the case study</a></div></div> <h2 id="takeaways">Takeaways</h2> <p>Network observability provides an unmatched solution for detecting and mitigating DDoS attacks, and, for these eight key reasons, is critical for DDoS defense in the modern network:</p> <ol> <li>Early detection</li> <li>Detecting low-volume attacks</li> <li>Granular identification of traffic sources</li> <li>Understanding the attacks in context</li> <li>Determining the effectiveness of mitigations</li> <li>Performing attack forensics</li> <li>Eliminating false positives</li> <li>Controlling costs</li> </ol> <p>Watch this short video to learn how Kentik can help you defend your network:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/8limu4gw0j" title="Detecting and Mitigating DDoS Attacks with Kentik Protect" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>Also, check out the <a href="/resources/kentik-protect-ddos-detection-and-defense/" title="Kentik Protect: Neutralize DDoS attacks. Analyze incidents. Catch botnets.">Kentik Protect solution brief</a>.</p> <p>You can find all of our latest resources about DDoS attacks, BGP hijacking, and other network security threats in our <a href="https://www.kentik.com/get/ddos-protection-and-network-security-package/" title="Latest DDoS Protection and Network Security resources from Kentik">DDoS protection and network security package</a>.</p><![CDATA[Managing the hidden costs of cloud networking - Part I]]><![CDATA[Cloud architectures and their managed infrastructures keep personnel and networking costs down while promoting high-velocity software development, but there are hidden costs. Read the first post in our series on managing the hidden costs of cloud networking.]]>https://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-ihttps://www.kentik.com/blog/managing-the-hidden-costs-of-cloud-networking-part-i<![CDATA[Ted Turner]]>Mon, 12 Sep 2022 04:00:00 GMT<p>Technologies like virtualization and containerization have gained significant traction over the last decade as foundational tools for modern application development. 
As companies like Amazon (AWS), Microsoft (Azure), and Google (Google Cloud) started to invest in the hardware and software infrastructure required to support access to these virtualized resources, “the cloud” was born.</p> <p>Networking in and with the cloud involves managing the interconnections of a wide array of devices, services, and applications: <a href="https://www.kentik.com/kentipedia/what-is-a-vpc-virtual-private-cloud/" title="Kentipedia: What is a VPC (Virtual Private Cloud)?">VPC containers</a>, gateways, load balancers, controllers, firewalls, routers, switches, servers, clients, IoT endpoints, service meshes, <a href="https://www.kentik.com/kentipedia/edge-computing-and-edge-networking/" title="Kentipedia: Edge Computing and Edge Networking">edge services</a>, probes, and more, not to mention the many application integrations. These distributed networks present unique challenges but, if correctly managed, can provide highly scalable, robust, and available applications with a competitive ROI.</p> <p>With over two decades of experience in networking at scale, I’ve had the privilege of working with on-prem, hybrid cloud, and multi-cloud networks and <a href="https://www.kentik.com/resources/network-tips-to-ensure-a-successful-aws-migration-webinar/">transitioning between these architectures</a>. In this series, I will first explain what I see as the main promises, both fiscal and technological, of using the cloud, and what costs these promises can cover up.</p> <h2 id="promised-cost-reductions-with-cloud-services">Promised cost reductions with cloud services</h2> <p>Proponents of cloud architectures maintain that their managed infrastructures keep personnel and networking costs down, all the while promoting high-velocity software development. So, the logic goes, cloud networks are cheaper to build, operate, service, and secure.</p> <h3 id="managed-networking-hardware-and-storage">Managed networking, hardware, and storage</h3> <p>With functions, instances, clusters, and connections disappearing and reappearing in mere moments, many integral cloud network components (service meshes, API gateways, controllers, VPCs, etc.) have strong automagic qualities that abstract the pain of an ephemeral network away from the network engineer. This managed networking allows for more network elasticity, and provides an easy way to store, back up, and secure data, all while reducing costs with more granular, on-demand pricing that eliminates idle or over-provisioned resources.</p> <p>Security, maintenance, and advancement of hardware are also abstracted away from cloud customers, reducing CAPEX and IT personnel costs.</p> <h3 id="high-velocity-development">High velocity development</h3> <p>Agile software development is a byproduct of allowing software development teams to iterate quickly in their architectural choices to best fit their business demands. The virtualization available via cloud services enables this agility by allowing engineers to deploy services with highly customized, decoupled architectures.
These cloud-based architectures can be updated regularly and independently of their larger application context, especially with the help of CI/CD tools, removing many common deployment bottlenecks from the development cycle.</p> <p>This accelerated delivery of software changes in the cloud allows for faster turnover of features, bug fixes, and security updates, and is more likely to drive customer adoption, satisfaction, and ultimately, revenue generation.</p> <h2 id="the-hidden-costs-of-cloud-networking">The hidden costs of cloud networking</h2> <p>I just covered what I consider to be some of the common, high-level cost reduction promises of cloud providers and proponents. But moving to the cloud can’t be a perfect, pain-free solution, can it?</p> <p>Unfortunately, cloud provider marketing tends to leave out some significant production realities for teams considering the move to cloud-centric development.</p> <h3 id="the-cost-of-complexity">The cost of complexity</h3> <p>In a word, cloud-based development is complex. The myriad of services, applications, devices, regions, policies, access privileges, protocols, security threats, architectures, and deployment strategies makes the scale of complexity in cloud networks a truly unique challenge.</p> <p>While cloud providers and associated SaaS companies do their best to abstract away a lot of the pain of this complexity, this can make cloud-native organizations strongly dependent on the fiscal and engineering choices of the service providers.</p> <h4 id="cloud-based-pricing-models">Cloud-based pricing models</h4> <p>This <a href="https://thesai.org/Downloads/Volume7No2/Paper_11-Pricing_Schemes_in_Cloud_Computing_An_Overview.pdf">research paper</a> tracks the pricing strategies of cloud businesses and clearly illustrates why runaway cloud spend is such a threat for engineering teams. With so many concurrent pricing models at play, keeping track of and optimizing for cost is a tall order.</p>
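<p>Even one of the simpler models shows why. Here’s a back-of-the-envelope sketch, in TypeScript, of tiered, volume-based egress pricing; the tier boundaries and rates below are made up for illustration, so check your provider’s current price sheet before using numbers like these:</p> <div class="gatsby-highlight" data-language="typescript"><pre class="language-typescript"><code class="language-typescript">// Hypothetical tiered egress pricing: [GB in tier, $ per GB].
// Illustrative numbers only: real rates vary by provider,
// region, and destination.
const tiers: [number, number][] = [
  [10 * 1024, 0.09],  // first 10 TB
  [40 * 1024, 0.085], // next 40 TB
  [Infinity, 0.07],   // everything beyond
];

// Cost of a month's egress under volume-based pricing.
function egressCost(gb: number): number {
  let remaining = gb;
  let cost = 0;
  for (const [tierSize, rate] of tiers) {
    const inTier = Math.min(remaining, tierSize);
    cost += inTier * rate;
    remaining -= inTier;
    if (remaining &lt;= 0) break;
  }
  return cost;
}

// 60 TB/month: the blended rate matches no single tier's rate.
console.log(egressCost(60 * 1024)); // 5120 (USD)</code></pre></div> <p>Now multiply that by every service, region, and pricing dimension in play, and the scale of the problem becomes clear.</p>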
<p>Network engineers often have to take into account a cocktail of variable pricing strategies, including but not limited to:</p> <ul> <li>Time-based</li> <li>Volume-based</li> <li>Resource-based</li> <li>Service-based</li> <li>Content-based</li> <li>Location-based/edge</li> <li>Priority</li> <li>Subscription</li> <li>Dynamic</li> </ul> <h3 id="the-cost-of-monitoring-complexity">The cost of monitoring complexity</h3> <p>For distributed networks at scale, monitoring becomes a daunting expense if not handled well.</p> <p>Consider the massive amounts of data, for every instance, with multiple copies, and often millions of data points between metrics, labels, traces, flow logs, etc., and the costs that this data incurs:</p> <ul> <li>The cost of the services used to instrument, aggregate, move, transform, store, and analyze this data</li> <li>The cost of the teams required to build and maintain these monitoring platforms and tools</li> </ul> <h3 id="the-cost-of-dependency">The cost of dependency</h3> <p>Cloud-native development is rife with dependencies: the cloud providers, plus the SaaS and open source tools used for development, deployment, and monitoring.</p> <h2 id="takeaways-about-cloud-networking-costs">Takeaways about cloud networking costs</h2> <p>For simple, standalone applications, the cloud can offer quick wins and cost savings via:</p> <ul> <li>Easy delivery of items like static web pages</li> <li>Easy storage and backup of data</li> <li>Reduced personnel costs</li> <li>Increased development agility</li> <li>Shifting from capital to operating expenditures</li> </ul> <p>As applications and their networking demands become more complex, the cloud can present high costs that are difficult to predict, control, or optimize. But, with network observability principles and strategies, the cost of these complexities can be managed, ultimately providing significant improvements to your organization’s bottom line.</p> <h2 id="stay-tuned-for-part-ii">Stay tuned for Part II</h2> <p><a href="/blog/managing-the-hidden-costs-of-cloud-networking-part-2/">In Part II of this series</a>, we examine the case study of Company X, where I worked for ten years and helped manage migrations to and from the cloud.</p> <p>I share what I learned about how complex cloud networks can put a strain on an organization when not properly implemented, and the lessons I was able to take from responding to those challenges.</p><![CDATA[How much does RPKI ROV reduce the propagation of invalid routes?]]><![CDATA[Our analysis from earlier this year estimated that the majority of internet traffic now goes to routes covered by ROAs and is thus eligible for the protection that RPKI ROV offers. This analysis takes the next step in understanding RPKI ROV deployment by measuring the rejection of invalid routes.]]>https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routeshttps://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes<![CDATA[Doug Madory, Job Snijders]]>Wed, 24 Aug 2022 04:00:00 GMT<p>Earlier this year, Job Snijders and I <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">published an analysis</a> that estimated the proportion of internet traffic destined for BGP routes with ROAs. 
The conclusion was that the majority of internet traffic goes to routes covered by ROAs and is thus eligible for the protection that RPKI ROV offers.</p> <p>However, ROAs <em>alone</em> are useless if only a few networks are rejecting invalid routes. The next step in understanding where we are with RPKI ROV deployment is to better understand how widespread the rejection of invalid routes is.</p> <p>It’s a tricky question, but one way to explore this is to measure the impact on propagation when a route is evaluated as invalid. The more networks that reject invalids, the smaller the propagation of those invalid routes. Reduced invalid route propagation reduces the damage caused by route leaks or inadvertent hijacks.</p> <p>So the question is, <strong>how much does being evaluated as invalid reduce a route’s propagation?</strong></p> <p>To answer this question, we can make use of thousands of BGP routes that are persistently invalid due to ROA misconfigurations. Using a public BGP dataset like <a href="http://www.routeviews.org/routeviews/">Routeviews</a>, we can measure how far persistently invalid routes propagate on the internet as compared to their RPKI-valid and RPKI-not-found (i.e. no ROA) sibling routes. We can also look at what happens to the propagation of individual routes as their statuses change from valid to invalid.</p> <div as="WistiaVideo" videoId="qpfh2174t5" audio></div> <h3 id="histogram-time">Histogram time!</h3> <p>These days, the IPv4 table consists of over 900,000 routes while the IPv6 table has a little more than 150,000 routes. If we take each route and count how many Routeviews vantage points have that route in their routing tables, we can make a histogram showing how many routes are seen by how many vantage points. The count of vantage points can serve as a measure of a route’s propagation — the more vantage points, the more propagation.</p> <p>Here are histograms for the two address families. The peak of globally routed prefixes (those seen by nearly all vantage points) is around 295 for IPv4 and 240 for IPv6 (the lower number reflects the smaller number of IPv6 vantage points in the Routeviews dataset).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6bwUhS731UHob5ffCaZsKM/5204faea50b53841120b831bfc74dbe0/propagation_distribution.png" style="max-width: 800px;" class="image center" alt="Propagation distribution" thumbnail /> <p>If we decompose each of these histograms by RPKI ROV status, we arrive at the following histograms which illustrate the difference in propagation that invalids experience as compared to the alternatives. The distributions for IPv4 (left) and IPv6 (right) are colored by RPKI status: RPKI-not-found (blue), RPKI-valid (orange) and RPKI-invalid (green).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7uL7smjzcsmg6Hec6Pvl1A/294032450ffb693669e9c8ab9c14625a/IPv4-v6_propagation_by_RPKI.png" style="max-width: 800px;" class="image center" thumbnail alt="IPv4 and IPv6 propagation by RPKI" /> <p>From these histograms, we can see that invalid routes rarely, if ever, experience propagation greater than half that experienced by RPKI-valid and RPKI-not-found routes. In fact, many experience propagation significantly less than half, but the amount of reduction depends on a number of factors including the upstreams involved in transiting the prefixes. Nonetheless, it is evident that RPKI ROV dramatically reduces the propagation of invalid routes.</p>
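<p>The counting behind these histograms is simple enough to sketch. Here’s a rough TypeScript version, assuming the RIB dumps have already been parsed into (vantage point, prefix) observations; the parsing, not the counting, is the laborious part:</p> <div class="gatsby-highlight" data-language="typescript"><pre class="language-typescript"><code class="language-typescript">// One record per (vantage point, prefix) pair, parsed from a
// Routeviews RIB dump (parsing not shown).
interface Observation { vantagePoint: string; prefix: string; }

function propagationHistogram(obs: Observation[]): Map&lt;number, number> {
  // First, how many vantage points carry each prefix...
  const seenBy = new Map&lt;string, Set&lt;string>>();
  for (const o of obs) {
    if (!seenBy.has(o.prefix)) seenBy.set(o.prefix, new Set());
    seenBy.get(o.prefix)!.add(o.vantagePoint);
  }
  // ...then, for each vantage-point count, how many prefixes have it.
  const hist = new Map&lt;number, number>();
  for (const vps of seenBy.values()) {
    hist.set(vps.size, (hist.get(vps.size) ?? 0) + 1);
  }
  return hist;
}</code></pre></div> <p>Decomposing by RPKI status just means running the same count over three subsets of prefixes: valid, invalid, and not-found.</p>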
<h3 id="the-plight-of-the-invalid">The plight of the invalid</h3> <p>On any given day, there are hundreds of BGP routes that change RPKI ROV states either due to changes in their ROAs or due to a change in their origin. Let’s take a look at a couple of recent examples of routes that became invalid and how that affected how far the route was propagated.</p> <p>A common BGP error that never seems to go away is the single-digit BGP hijack caused by a simple router misconfiguration. My friend Anurag Bhatia <a href="https://anuragbhatia.com/2013/08/networking/as-number-hijacking-due-to-misconfiguration/">wrote about this phenomenon</a> back in 2013 and it was also <a href="https://twitter.com/DougMadory/status/1404915171726925827">discussed</a> in the recent <a href="https://storage.googleapis.com/site-media-prod/meetings/NANOG82Virtual/2353/20210608_Siddiqui_No_It_Wasn_T_v1.pdf">NANOG talk</a> by Aftab Siddiqui of <a href="https://www.manrs.org/">MANRS</a>. These errors occur when a network engineer attempts to prepend an AS three times, for example, but instead ends up prepending the <em>number 3</em> to the AS path. The result is that it appears as though the prestigious Massachusetts Institute of Technology (AS3) is hijacking the route when in reality it was a simple misconfig.</p> <p>Well, when you add RPKI ROV to the mix, this error has more than a superficial impact. In the example below, AS210974 changed how it announced 212.192.2.0/24 on August 4, 2022. It began prepending the number 3 to its AS path; however, since there was a ROA for this prefix, the error also caused the route to become invalid, leading to a significant drop in propagation.</p> <p>This is depicted below in Kentik’s BGP visualization which reports on the percentage of BGP vantage points that have routes of each origin in their tables. When AS3 becomes the origin due to the misconfig (red), the percentage of vantage points carrying this route drops by half as numerous backbone providers (including Cogent (AS174), GTT (AS3257), Lumen (AS3356), and Tata (AS6453)) stop accepting this route from Bezeq (AS8551).</p> <img src="//images.ctfassets.net/6yom6slo28h2/4WTt7EyHNZmqSfEsGesxOe/628b6c27b93b9659c2db3a6445d7df62/prepending_error_conflicts_with_RPKI_ROV.png" style="max-width: 800px;" class="image center" alt="Prepending error conflicts with RPKI ROV" /> <p>In another recent case, <a href="https://rpki-validator.ripe.net/ui/103.169.138.0%2F23?validate-bgp=true">103.169.138.0/23</a> changed origins on August 11, 2022 from AS142343 to AS38758. However, since the new origin isn’t listed as an authorized Origin AS in the ROA for this route, 103.169.138.0/23 became an invalid route. As a result, the propagation dropped by more than half of what it was at the beginning of the day, when Cogent (AS174) and Hurricane Electric (AS6939) stopped accepting the invalid route from Telin (AS7713).</p> <img src="//images.ctfassets.net/6yom6slo28h2/4S6jXRgR88tmGXh3Cp7NdR/fd2c42633b6f02012ac0a9a65c3addeb/RPKI_conflict.png" style="max-width: 800px;" class="image center" alt="RPKI conflict" /> <p>Incidents like these are becoming more commonplace as RPKI adoption increases and the rate of misconfigurations remains steady. All the more reason to use route monitoring that reports on the RPKI ROV validity of your routes. Operators can monitor their routes <a href="https://www.kentik.com/blog/introducing-bgp-monitoring-from-kentik/">using Kentik</a> or the open source package <a href="https://github.com/nttgin/BGPAlerter">BGPAlerter</a>.</p>
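<p>If the origin-validation procedure itself is unfamiliar, it’s compact enough to sketch. Here is a simplified TypeScript rendition of the comparison defined in RFC 6811, IPv4 only, with the cryptographic validation of the ROAs themselves (the job of a validator like Routinator) left out:</p> <div class="gatsby-highlight" data-language="typescript"><pre class="language-typescript"><code class="language-typescript">interface Roa { prefix: string; maxLength: number; originAs: number; }

// Zero-padded 32-bit binary string for a dotted-quad IPv4 address.
function ipBits(ip: string): string {
  return ip.split(".").map((o) => Number(o).toString(2).padStart(8, "0")).join("");
}

// Does the ROA's prefix cover the announced prefix?
function covers(roa: Roa, prefix: string): boolean {
  const [roaIp, roaLen] = roa.prefix.split("/");
  const [rtIp, rtLen] = prefix.split("/");
  if (Number(rtLen) &lt; Number(roaLen)) return false;
  const n = Number(roaLen);
  return ipBits(roaIp).slice(0, n) === ipBits(rtIp).slice(0, n);
}

// Origin validation for one route against a set of ROAs.
function rov(prefix: string, originAs: number, roas: Roa[]): string {
  const covering = roas.filter((r) => covers(r, prefix));
  if (covering.length === 0) return "not-found";
  const rtLen = Number(prefix.split("/")[1]);
  const ok = covering.some((r) => r.originAs === originAs &amp;&amp; rtLen &lt;= r.maxLength);
  return ok ? "valid" : "invalid";
}

// The prepending mishap above: with a ROA authorizing AS210974,
// the same prefix becomes invalid once AS3 appears as the origin.
const roas = [{ prefix: "212.192.2.0/24", maxLength: 24, originAs: 210974 }];
console.log(rov("212.192.2.0/24", 210974, roas)); // "valid"
console.log(rov("212.192.2.0/24", 3, roas));      // "invalid"</code></pre></div>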
<h3 id="conclusion">Conclusion</h3> <p>Based on the analysis above, the evaluation of a route as invalid reduces its propagation by anywhere between one half and two thirds. Given that the <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">majority of internet traffic</a> now flows towards routes with ROAs, this offers a significant degree of protection for the internet in the event of a routing leak or other inadvertent BGP hijack.</p> <p>Just recently, Zayo (AS6461) <a href="https://mailman.nanog.org/pipermail/nanog/2022-August/220287.html">announced</a> that they would begin rejecting invalid routes. By doing so, they will join the ranks of the other tier-1 backbone networks like Arelion, NTT, Cogent, Lumen, and GTT that reject invalid routes. It will be fascinating to see how much the above distributions change when this announced change takes effect.</p> <p>At this time, Zayo will only be rejecting invalids coming from peers (not customers). Since most of their biggest peers are already dropping invalids, the overall impact may be subtle. Having all of the internet’s largest backbone networks dropping invalids would be a substantial achievement in routing security, and we’re almost there!</p><![CDATA[Understanding AS relationships, outage analysis and more Network Operator Confidential gems]]><![CDATA[Doug Madory shared global internet market insights from recent months, and we were joined by Unitas Global for a discussion of their push for greater customer transparency using Kentik Synthetics.]]>https://www.kentik.com/blog/understanding-as-relationships-outage-analysis-network-operator-confidentialhttps://www.kentik.com/blog/understanding-as-relationships-outage-analysis-network-operator-confidential<![CDATA[Stephen Condon]]>Thu, 11 Aug 2022 04:00:00 GMT<p>The objective of Network Operator Confidential is to share our global internet market insights from recent months. 
Kentik, and our customers, have access to views and analysis of global internet traffic that no one else can match.</p> <p>In our <a href="https://www.kentik.com/go/webinar/network-operator-confidential/">first Network Operator Confidential webinar</a>, I was joined by Doug Madory, Kentik’s director of internet analysis, and Grant Kirkwood, founder and CTO at Unitas Global.</p> <p>We covered:</p> <ul> <li>Market share dynamics as revealed by Kentik Market Intelligence (KMI)</li> <li>Internet outages</li> <li>How Unitas Global uses Kentik to generate a live network status grid for their customers.</li> </ul> <h3 id="network-operator-market-dynamics">Network Operator Market Dynamics</h3> <p><a href="https://www.kentik.com/resources/kentik-market-intelligence/">Kentik Market Intelligence</a> (KMI) is a SaaS business intelligence tool that helps network operators <a href="https://www.kentik.com/blog/launching-a-labor-of-love-kentik-market-intelligence/">understand AS transit and peering relationships</a> for any market in the world.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7LgKA06Lyxn1M5j6V0y8iI/94f009aeaeb8d5b0b971471113186a14/kmi-noc-1.jpg" style="max-width: 800px;" class="image center" alt="Kentik Market Intelligence" thumbnail /> <p>In the Kentik app, you can view KMI data on ASes by:</p> <ul> <li>Customer base</li> <li>Customer growth</li> <li>Peering</li> </ul> <p>And within these views you can isolate:</p> <ul> <li>IPv4 and IPv6</li> <li>Geographic market</li> <li>Customer base type: all, retail, wholesale and backbone</li> </ul> <p>Doug demonstrated a number of KMI use cases, including how it can be used to identify a provider’s single-homed customers. KMI highlights all the customers of an AS that only use that one provider. The single-homed customers represent prospects for a competing provider’s sales team. Additionally, by identifying contract start dates (coming out soon), service provider sales teams will have a good indication of when they should be reaching out to prospects.</p> <p>Since Unitas was on the call, Doug showed a KMI view of their rankings and discussed how their <a href="https://www.inap.com/press-release/unitas-global-acquires-inaps-network-business-assets/">recent acquisition of INAP</a> (previously Internap) would be reflected in KMI rankings.</p> <h3 id="network-outage-analysis">Network outage analysis</h3> <p>Doug showed his analysis of the recent Rogers outage on July 8, as seen by Kentik. The outage was the largest in Canadian history and took down the internet for 25% of the Canadian population. I won’t go into too much detail as Doug posted a full analysis in <a href="https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-outage/">A deeper dive into the Rogers outage</a>, but here are some of Doug’s key observations:</p> <ul> <li>Rogers said a config change removed a filter allowing routes from the internet to be circulated by their interior gateway protocol (IGP) — per Rogers’ filing with the CRTC.</li> <li>This led to an update flood exceeding the memory and processing capacity of their routers.</li> <li>Initial assessments were that AS812’s routes were withdrawn in BGP, rendering the network unreachable.</li> <li>However, we could immediately see that many AS812 routes stayed up but still couldn’t communicate.</li> </ul> <p>Therefore, Doug’s opinion is that it wasn’t a reachability problem and that BGP shouldn’t be blamed. 
In the Rogers outage, exterior BGP withdrawals were symptoms, not the cause of any lack of connectivity.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5EYDyOKDCS8oDE4ITaweOk/c3f17c78e459b82d5ded0ea597ce2c5b/rogers-outage-route--withdrawal.png" style="max-width: 800px;" class="image center" alt="Rogers outage - withdrawal and return" thumbnail /> <p>Next was a discussion of the outages resulting from the Ukraine conflict with particular attention given to Russia rerouting traffic emanating from Kherson. Doug drew parallels with what Russia did in Crimea in 2014. Kherson is now being forced to use transit exclusively from Rostelecom via Crimea, which raises a host of surveillance and censorship concerns. Since presenting his analysis in the webinar, Doug has <a href="https://www.nytimes.com/interactive/2022/08/09/technology/ukraine-internet-russia-censorship.html">collaborated with the New York Times</a> on their analysis of the Kherson rerouting, and <a href="https://www.kentik.com/blog/rerouting-of-kherson-follows-familiar-gameplan/">this post gives his full analysis</a>. Here’s an animation that depicts the re-routing:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 64.75507765830346%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/3quz3zd1s2" title="Reconfiguration of ASes in Kherson" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <h3 id="how-unitas-global-exceeds-customer-expectations-with-kentik-synthetics">How Unitas Global exceeds customer expectations with Kentik Synthetics</h3> <p><a href="https://unitasglobal.com/">Unitas</a> is a global access provider that operates a network focused on meeting the needs of large multinational enterprises. Their customers have complex connectivity requirements consisting of internet, cloud, last mile access, and interconnections into the major cloud providers. According to Grant, Kentik is a <a href="https://www.kentik.com/resources/case-study-unitas-global/">large part of how they manage and optimize their network and service</a>. The Unitas network encompasses:</p> <ul> <li>170+ PoPs</li> <li>1,100+ customers</li> <li>5,500+ peer networks</li> </ul> <p>Grant spoke about their recent acquisition of INAP and how the Kentik service is helping them combine the two networks. INAP was also a Kentik customer before the merger. Having a common set of tooling to understand the traffic profiles has been a huge advantage in combining the networks. Grant mentioned other large providers who acquired networks and are still operating separate ASNs ten years later. Grant is confident that they’ll be able to expedite the integration with an assist from Kentik.</p> <p>Next, Grant showed how he uses <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a> to generate a <a href="https://stats.unitasglobal.net/">live mesh view of the status of the Unitas Global network</a>. Kentik Synthetics private agents are deployed in each of their PoPs, giving a view of performance between each of the major metro PoPs. 
This gives Unitas and their customers a baseline of performance as well as an early warning system for changes to performance on particular routes. The Unitas sales team sees this as a key sales tool that reflects their transparent approach to serving customers.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7yRkM3vcqatSZZPCaxCZzH/4b009f73a4d7af37961151fa2d3378d4/unitas-reach.jpg" style="max-width: 800px;" class="image center" alt="Unitas Reach" thumbnail /> <h3 id="questions">Questions</h3> <p>We finished the webinar with a discussion about the impact <a href="https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow/">RPKI ROV</a> was having on mitigating hijacks. It was explained that with the majority of tier 1 providers now dropping RPKI-invalid routes, you can expect a two-thirds reduction in the propagation of invalid routes. RPKI has now emerged as the internet’s best defense against BGP hijacks due to typos and other routing mishaps. So it’s in your interest to deploy RPKI ROV and reject invalid routes. (Doug alluded to a future post on this subject.)</p> <p>That’s a wrap on the first Network Operator Confidential. Please <a href="https://www.kentik.com/contact/">contact us</a> if you have any suggested topics for the next NOC webinar.</p><![CDATA[Bringing business context to network analytics]]><![CDATA[At Networking Field Day: Service Provider 2, Steve Meuse showed how Kentik’s OTT analysis tool can help a service provider better understand the services running on their network. Doug Madory introduced Kentik Market Intelligence, a SaaS-based business intelligence tool, and Nina Bargisen discussed optimizing peering relationships. ]]>https://www.kentik.com/blog/bringing-business-context-to-network-analyticshttps://www.kentik.com/blog/bringing-business-context-to-network-analytics<![CDATA[Phil Gervasi]]>Wed, 10 Aug 2022 04:00:00 GMT<p><strong>Kentik brings real-world business context to the telemetry we collect and the analytics we provide.</strong> That’s the overarching theme I got from <a href="/resources/nfd-sp-2-whats-new-with-kentik-in-2022/">Networking Field Day: Service Provider 2</a>.</p> <p>As I watched and listened to each presentation, it was pretty obvious to me that Avi, Steve, Doug, and Nina, all technical powerhouses, were a little less focused on packets and a little more focused on how we can improve network operations and a service provider’s ability to make smart business decisions.</p> <p>What struck me was how the presenters explained Kentik’s technology. It was always in the context of what all this <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/" title="Kentik Blog: Best Practices for Enriching Network Telemetry to Support Network Observability">amazing telemetry</a>, analytics, and data-crunching can do to help you, as a service provider, run your network smarter and provide a better service to your customers.</p> <p>For example, Steve Meuse, a long-time solutions architect working with service providers, talked about how <a href="/resources/nfd-sp-2-observing-content-delivery-and-over-the-top-ott-services/">the network-centric perspective of ASNs, peering relationships, transit networks, etc. needs to change</a>. Instead, our perspective needs to be on the actual services that run on top of all that cool technology.</p> <div class="pullquote center" style="text-align: center;">It's all about SERVICES!</div> <p>Check out the screenshot from his presentation below. 
This is a graphic Steve snagged from the Kentik portal that shows a major outage from last year with ASN32934. Do you remember that outage? No? Neither do I. But I do remember the <a href="https://www.kentik.com/blog/facebooks-historic-outage-explained/" title="Kentik Blog: Facebook&#x27;s Historic Outage Explained">Facebook outage</a> that happened at the exact same time and involved the exact same ASN. <strong>Real-world business context</strong>.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/T3hVP8CAkRZ68WakHTpXg/634d52fb560544057f5a9a993bf82312/nfd-sp2-steve.jpg" style="max-width: 800px;" class="image center" alt="Steve Meuse presentation" /> <p>Nina Bargisen had her own take on this, too. Nina looked specifically at <a href="/resources/nfd-sp-2-peering-in-this-economy/">how to use Kentik to analyze traffic volume and quality to and from specific peers</a> — specific ASNs — so that an engineer can be selective about who they peer with. An engineer could then actually conserve real, physical hardware resources to make the most out of the gear chugging along in the rack today. This has a direct impact on a service provider’s budget and ability to deliver services to their own customers. <strong>Real-world business context</strong>.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5jGBGvragogscCFDWHjLzq/aed9b40ae50b6fd08cb29df34519536e/nfd-sp2-nina.jpg" style="max-width: 800px;" class="image center" alt="Nina Bargisen presentation" /> <p>Doug Madory, Kentik’s director of <a href="https://www.kentik.com/analysis/" title="Kentik Internet Outage Tracker and Analysis">internet analysis</a>, <a href="/resources/nfd-sp-2-parsing-bgp-tables-for-fun-and-profit/">introduced Kentik Market Intelligence</a>, a SaaS business intelligence tool for understanding AS transit and peering relationships for any market in the world. To me, this was a break from Doug’s typical deep dive technical analyses into what’s happening on the internet, but in reality, it actually wasn’t.</p> <p>You see, <a href="/product/market-intelligence/">Kentik Market Intelligence</a> starts with a deep dive into peering relationships and AS paths, but it presupposes (correctly) that each hop represents an actual business relationship. KMI asks the questions, “how much does this specific hop cost?”, and “how does a specific ASN compare to other ASNs in a similar market?”</p> <img src="//images.ctfassets.net/6yom6slo28h2/3csLTXmzWQ0QwVzBYUtCyM/551ec438ca441bc8d88c8a57299486ad/nfd-sp2-doug-kmi.jpg" style="max-width: 800px;" class="image center" alt="Doug Madory presentation" /> <p>Doug explained that with KMI, an engineer can make better business peering decisions, find future customers in specific markets, and determine who are the top retail, wholesale or backbone providers in any country. This is rich business context around Kentik’s very technical underlying analysis. Again, <strong>real-world business context</strong>.</p> <p>Networking Field Day events are some of my favorites because I can nerd-out and talk packets and flows without shame. But this last <a href="https://techfieldday.com/event/nfdsp2/">NFD: Service Provider 2</a> brought that technical conversation to a new level. Yes, Doug got into the details of transit networks, and Steve’s presentation was like taking an online class, and Nina got into the weeds on peering. 
But it was awesome to see all that technical depth and analysis wrapped neatly in a real-world business context for service providers trying to run a business and deliver a great product to their customers.</p> <p><a href="/resources/nfd-sp-2-whats-new-with-kentik-in-2022/">Watch all of Kentik’s presentations at Networking Field Day: Service Provider 2</a>.</p><![CDATA[Rerouting of Kherson follows familiar gameplan]]><![CDATA[Since the beginning of June 2022, internet connectivity in the Russian-held Ukrainian city of Kherson has been rerouted through Crimea, the peninsula in southern Ukraine that has been occupied by Russia since March 2014. As I explain in this blog post, the rerouting of internet service in Kherson appears to parallel what took place following the Russian annexation of Crimea.]]>https://www.kentik.com/blog/rerouting-of-kherson-follows-familiar-gameplanhttps://www.kentik.com/blog/rerouting-of-kherson-follows-familiar-gameplan<![CDATA[Doug Madory]]>Tue, 09 Aug 2022 04:00:00 GMT<blockquote><i>This blog post was published as a companion piece to an article on the same topic in today's <a href="https://www.nytimes.com/interactive/2022/08/09/technology/ukraine-internet-russia-censorship.html">New York Times</a>.</i></blockquote> <p>Since the beginning of June this year, internet connectivity in the Russian-held Ukrainian city of Kherson has been rerouted through Crimea, the peninsula in southern Ukraine that has been occupied by Russia since March 2014. As I explain in this blog post, the rerouting of internet service in Kherson appears to parallel what took place following the Russian annexation of the Crimean peninsula.</p> <h3 id="russian-rerouting-of-crimea-2014">Russian rerouting of Crimea (2014)</h3> <p>Following Russia’s annexation of Crimea in March 2014, Russian Prime Minister Dmitry Medvedev made a <a href="https://www.bbc.com/news/world-europe-26816033">highly publicized visit</a> to the Crimean city of Simferopol to announce plans to integrate the newly acquired territory into Russia.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3llkBufdYN0CL2lyTGHiNy/71466b18860ee642f54b510d00567604/medvedev-tweet.png" style="max-width: 600px;" class="image center" alt="Medvedev tweet on Crimea" /> <p>During his visit, Medvedev <a href="https://www.commsupdate.com/articles/2014/03/25/medvedev-rostelecom-should-launch-in-crimea-to-safeguard-communication-channels/">stated</a> (also <a href="https://twitter.com/MedvedevRussia/status/448048962474225664">tweeted</a> above) that it was imperative for Rostelecom (state telecom of Russia) to immediately begin construction of a submarine cable from mainland Russia to Crimea across the <a href="https://en.wikipedia.org/wiki/Kerch_Strait">Kerch Strait</a>. “The transfer of important information by foreign companies is unacceptable,” he said. 
The foreign companies he was referring to were the Ukrainian ones.</p> <p>In April 2014, the Russians <a href="https://www.comnews.ru/content/81914/2014-04-28/rostelekom-prisoedinil-krym">announced</a> that a new 46km submarine fiber optic cable spanning the shallow littoral waters of the Kerch Strait had been completed and was operational with a reported <a href="https://www.telecompaper.com/news/rostelecom-launches-fibre-link-to-crimea--1010208">capacity of 110 Gbps</a>.</p> <p>The following month, Rostelecom <a href="https://web.archive.org/web/20220422103232/http://gazeta.sebastopol.ua/2014/05/13/rostelekom-budet-rabotat-v-krymu-pod-brendom-miranda-media/">announced</a> that its branch in Crimea would operate under the name «Миранда-медиа» or “Miranda-Media”. However, it wasn’t until July 17th that a new <a href="https://en.wikipedia.org/wiki/Autonomous_system_(Internet)">autonomous system</a> number, <a href="https://bgp.tools/as/201776">AS201776</a>, appeared in the global routing table by the name of Miranda-Media. A week after its appearance, AS201776 began providing transit to two ISPs in Crimea on July 24th, namely, KCT (AS48004) of Simferopol and ACS-Group (AS42986) of Alupka.</p> <p>On July 31st 2014, I <a href="https://web.archive.org/web/20140731152826/http://www.renesys.com/2014/07/no-turning-back-russia-crimea/">broke the story</a> of the activation of the <a href="https://www.submarinecablemap.com/submarine-cable/kerch-strait-cable">Kerch Strait cable</a> linking Russia to Crimea, covered by <a href="https://www.vice.com/en/article/ypw35k/russia-built-an-underwater-cable-to-bring-its-internet-to-newly-annexed-crimea">Vice</a>, <a href="https://www.theverge.com/2014/8/2/5962145/crimeans-are-now-using-the-russian-internet">The Verge</a>, among others.</p> <p>One of the first prefixes announced by Miranda-Media was 178.34.152.0/21. RIPE registration data for the prefix at that time showed that it had been the address space assigned to the Sochi Winter Olympics several weeks earlier. After serving that mission, it had been transferred to Russia’s next strategic project, Crimea.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">inetnum:    178.34.128.0 - 178.34.191.255
netname:    Macroregional_South
descr:      OJSC Rostelecom Macroregional Branch South
descr:      Winter Olympic Games of Sochi 2014
descr:      Sochi, Russia
descr:      The Olympic Games are finished. Transition period. Apr, 2014
country:    RU</code></pre></div> <p>In the subsequent months in 2014, Crimean ISPs changed how they reached the global internet. Instead of connecting to Ukrainian telecommunications companies over the land bridge to mainland Ukraine, they connected to Russia via a submarine fiber optic cable. As a consequence, latencies to websites hosted in Russia dropped while latencies to websites hosted in mainland Ukraine spiked. 
The graphic below illustrates the increase in latency from Kiev, Ukraine to CrimeaCom (AS28761) in July 2014.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4KJ039P1sTrSuHfVa2VFrx/f3b1f9a829a6c1244ffb7411e956e80e/crimeacom-latencies.png" style="max-width: 500px;" class="image center" alt="CrimeaCom latencies 2014" /> <p>At the time, one of Miranda-Media’s new Crimean customers even advised its users about this change in latency in the following message on its website:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7onXYiRKJtPSwisqo0pKlG/72fd8a1167df10fd2aeae99d15f7cc39/gigabyte-335w.png" style="max-width: 335px; margin-bottom: 15px;" class="image center" alt="" /> <div class="caption" style="max-width: 600px;"><i>Translation: The Russian transport operator was changed. Traffic now goes through the optical cable laid along the bottom of the Kerch Strait. As a result, response time (ping) to Russian resources has decreased, while response time to foreign and Ukrainian resources has increased.</i></div> <p>The Russian takeover of the Crimean internet was about more than just re-routing traffic through Miranda-Media’s link to Russia; it was about taking over the local network as well. On September 24, 2014, Ukrtelecom reported that its assets in Sevastopol were <a href="https://www.epravda.com.ua/news/2014/09/24/493654/">forcibly seized</a> and turned over to a new entity, Sevastopol Telecom (aka Sevtelekom).</p> <p>The Russian press coverage of the seizure had a <a href="https://www.comnews.ru/content/87635/2014-09-25/ukrtelekom-otklyuchil-sevastopol-bez-preduprezhdeniya">different perspective</a>: it stated that the reason for the disconnection of Ukrtelecom’s Sevastopol network was unknown and lauded “Russian networks” for quickly restoring Sevastopol’s communications.</p> <p>On that same day, AS59833 (Sevtelekom) first appeared in the global routing table, another new Miranda-Media customer network.</p> <h3 id="russian-rerouting-of-kherson-2022">Russian rerouting of Kherson (2022)</h3> <p>The city of Kherson lies in the south of Ukraine and has been under Russian control since the early days of the invasion. Despite some brief outages, internet service there had been <a href="https://ioda.inetintel.cc.gatech.edu/region/4378?from=1645506000&#x26;until=1651291199">largely uninterrupted</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/38KVF9KJ6o19HO3rhnQELf/dca22e220a4d57702e18a3c964e4ca56/kherson-map.png" style="max-width: 400px; margin-bottom: 15px;" class="image center" alt="Kherson, Ukraine map" /> <div class="caption">Map credit: <a href="https://www.fpri.org/article/2016/09/distinguishing-true-false-fakes-forgeries-russias-information-war-ukraine/map-of-kherson/">Foreign Policy Research Institute</a></div> <p>However, at 16:12 UTC on April 30th, internet service in the city of Kherson and its surrounding areas went down. SkyNet (Khersontelecom, AS47598) was the first to return to service the following day at 16:02 UTC on May 1st, but it wasn’t connecting through its normal upstreams. 
Instead of its Ukrainian transit providers, it was using Miranda-Media (AS201776) in Crimea to reach Rostelecom in Russia for connectivity to the outside world.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/W8Rb9aHhLQQcJ6MsgZkrv/631d196b2215f4f9a6db2fe2d4885eb6/kentik-portal-traffic-to-kherson.png" style="max-width: 800px;" class="image center" alt="Internet traffic to Kherson, Ukraine" thumbnail /> <p>At 16:15 UTC on May 2nd, a second Kherson ISP (Brok-X, AS49168) came back online using transit from Miranda-Media. However, by the end of the day on May 4th, internet service across Kherson was up and again routed via Ukrainian transit. By then, both AS47598 and AS49168 ceased using Miranda-Media to route traffic to the internet.</p> <p>Later that same day, the owner of SkyNet (KhersonTelecom) <a href="https://www.facebook.com/skynet.kherson.ua/posts/1199572964193714">took to Facebook</a> to explain his rationale for reconnecting his service through Russia. “On April 30, EVERYTHING fell!”, he said, adding that he had no choice but to “turn on the circuit of the Crimean operator Miranda.” While some <a href="https://www.wired.com/story/ukraine-russia-internet-takeover/">local residents thanked him</a>, others accused him of using “enemy internet”, a decision he defended in the post.</p> <p>Following the restoration on May 4th, ISPs in Kherson returned to their previous Ukrainian transit providers until May 30th when another city-wide internet outage struck at 14:43 UTC. At that time, Kherson ISPs lost their links to other Ukrainian networks, possibly for good. The few Kherson ISPs that stayed online did so by using transit from Miranda-Media. Within a few days, every network in Kherson was either offline or connected to the internet via the Crimean provider.</p> <p>Below is an animation of the phased reconfiguration of the ASes that make up the internet of Kherson over this period.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 64.75507765830346%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/3quz3zd1s2" title="Reconfiguration of ASes in Kherson" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <h3 id="conclusion">Conclusion</h3> <p>With Kherson’s internet traffic routed through Moscow, the Russian government has the ability to surveil, intercept, and block communications with the outside world. This creates a grave risk for the citizens of Kherson, especially those resisting the occupation of the city. Additionally, if Crimea is any guide, we may also see telecommunications assets in Kherson seized and turned over to Russian owned companies.</p> <p>A routing configuration with a single upstream for a whole city is a fragile one. If anything should happen to the circuit to Miranda-Media, Kherson will have no options to restore service while their links into Ukraine are down.</p> <p>Under normal circumstances, the use of multiple transit providers offers leverage to the customer when it comes to contract negotiations. A customer with a single upstream becomes a captive audience for the provider. 
However, given the situation in Kherson, internet users may be a captive audience in more ways than one.</p><![CDATA[Network telemetry in Sumo Logic]]><![CDATA[In this post we will show how to leverage Kentik Labs' open source project ktranslate to bring network observability into the Sumo Logic platform.]]>https://www.kentik.com/blog/network-telemetry-in-sumo-logichttps://www.kentik.com/blog/network-telemetry-in-sumo-logic<![CDATA[Evan Hazlett]]>Mon, 18 Jul 2022 04:00:00 GMT<p>Recently we caught up with the Sumo Logic team to discuss network visibility and optimizing application stack views. We took a look at their API and found that it would be easiest to use the <a href="https://help.sumologic.com/03Send-Data/Sources/02Sources-for-Hosted-Collectors/HTTP-Source/Upload-Metrics-to-an-HTTP-Source">HTTP Ingest</a> method to send in our metric data.</p> <h3 id="format">Format</h3> <p><a href="https://www.sumologic.com/">Sumo Logic</a> offers a few different formats in which to ingest metrics. We chose the Carbon 2.0 format as it is relatively simple, yet offers enough flexibility to handle various additional enriched data that we augment the network flow with, such as ASN, geographical source/destination, protocols, etc. By leveraging the extensible design of <a href="https://github.com/kentik/ktranslate">ktranslate</a> we created a new <a href="https://github.com/kentik/ktranslate/pull/338">Carbon 2.0 output formatter</a> that looks like the following:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">metric=in_bytes mtype=rate unit=B/s device_id=100 Type=kflow dst_addr=192.168.5.15 src_endpoint=216.176.96.90:8080 src_addr=216.176.96.90 sample_rate=1 eventType=KFlow dst_endpoint=192.168.5.15:52454 protocol=TCP provider=kentik-flow-device src_as_name=RTCCOM dst_route_prefix=0.0.0.0 input_port=54429 src_route_prefix=0.0.0.0 src_geo=US src_as=14574 l4_src_port=8080 l4_dst_port=52454 tcp_flags=27 dst_as_name=0 355 1655310976</code></pre></div> <p>By using the Carbon intrinsic tags such as the “mtype” and “unit” we can get rich data views from within Sumo Logic. We also add our enriched data as meta tags, which enable better queries and filters.</p>
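<p>If you want a feel for how little is involved in emitting these lines, here’s a rough TypeScript sketch of the same idea (the real formatter lives in the ktranslate repo linked above; the field names below are taken from the example line):</p> <div class="gatsby-highlight" data-language="typescript"><pre class="language-typescript"><code class="language-typescript">// A metric with its Carbon 2.0 intrinsic tags plus meta tags
// carrying the flow enrichment (geo, ASN, protocol, ...).
interface CarbonMetric {
  metric: string;     // e.g., "in_bytes"
  mtype: string;      // e.g., "rate"
  unit: string;       // e.g., "B/s"
  meta: { [key: string]: string };
  value: number;
  timestamp: number;  // unix seconds
}

// Render one metric as a space-separated Carbon 2.0 line.
function toCarbon2(m: CarbonMetric): string {
  const intrinsic = `metric=${m.metric} mtype=${m.mtype} unit=${m.unit}`;
  const meta = Object.entries(m.meta).map(([k, v]) => `${k}=${v}`).join(" ");
  return `${intrinsic} ${meta} ${m.value} ${m.timestamp}`;
}

console.log(toCarbon2({
  metric: "in_bytes", mtype: "rate", unit: "B/s",
  meta: { protocol: "TCP", src_geo: "US", src_as: "14574" },
  value: 355, timestamp: 1655310976,
}));
// metric=in_bytes mtype=rate unit=B/s protocol=TCP src_geo=US src_as=14574 355 1655310976</code></pre></div>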
<h3 id="sumo-logic">Sumo Logic</h3> <p>Once we had the format ready it was pretty straightforward to send the data to the Sumo Logic API using the ktranslate HTTP exporter. We needed to have the Sumo Logic API hosted collector endpoint and we were good to go. Here is an example:</p> <div class="gatsby-highlight" data-language="shell"><pre class="language-shell"><code class="language-shell">$<span class="token operator">></span> ktranslate <span class="token punctuation">\</span> <span class="token parameter variable">--sinks</span> http <span class="token punctuation">\</span> <span class="token parameter variable">--http_url</span> <span class="token string">"https://endpoint4.collection.sumologic.com/receiver/v1/http/&lt;your-private-endpoint-here>"</span> <span class="token punctuation">\</span> <span class="token parameter variable">--http_header</span> <span class="token string">"Content-Type:application/vnd.sumologic.carbon2"</span> <span class="token punctuation">\</span> <span class="token parameter variable">--format</span> carbon</code></pre></div> <p>Once we have ktranslate sending metrics we should be able to see them on the Sumo Logic “Metrics” view:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6b8rzfvjAVXncjfWtBWaDc/eff549d5e7e52a5ddf766ff6e66481a1/sumologic-metrics-view-new.png" style="max-width: 800px;" class="image center" alt="Sumo Logic Metrics view" thumbnail /> <p>We can then build a simple dashboard showing network data such as source and destination transfer:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/0KGBMyWPqnO2Ygn48RHmK/76acd1f664f776aceda2817f458785fa/sumologic-dashboard-new.png" style="max-width: 800px;" class="image center" alt="Sumo Logic dashboard" thumbnail /> <p>By using ktranslate and kprobe we can get vital network metrics into a variety of services. If you have questions or comments please join us on <a href="https://discord.gg/kentik">Discord</a> or <a href="https://github.com/kentik/ktranslate">GitHub</a>.</p><![CDATA[A deeper dive into the Rogers outage]]><![CDATA[On July 8, 2022, Canadian telecommunications giant Rogers Communications suffered a catastrophic outage taking down nearly all services for its 11 million customers in the largest internet outage in Canadian history. We dig into the outage and debunk the notion that it was caused by the withdrawal of BGP routes. ]]>https://www.kentik.com/blog/a-deeper-dive-into-the-rogers-outagehttps://www.kentik.com/blog/a-deeper-dive-into-the-rogers-outage<![CDATA[Doug Madory]]>Fri, 15 Jul 2022 04:00:00 GMT<p>Beginning at 8:44 UTC (4:44am EDT) on July 8, 2022, Canadian telecommunications giant Rogers Communications suffered a catastrophic outage taking down nearly all services for its 11 million customers in what is arguably the largest internet outage in Canadian history. Internet services began to return after 15 hours of downtime and were still being restored throughout the following day.</p> <p><a href="https://blog.cloudflare.com/cloudflares-view-of-the-rogers-communications-outage-in-canada/">Initial analyses</a> of the outage reported that Rogers (AS812) couldn’t communicate with the internet because its BGP routes had been withdrawn from the global routing table. While a majority of AS812’s routes were withdrawn, there were hundreds of IPv4 and IPv6 routes that continued to be announced but stopped carrying traffic nonetheless. This fact points more to an internal breakdown in Rogers’ network such as its <a href="https://en.wikipedia.org/wiki/Interior_gateway_protocol">interior gateway protocol</a> (IGP) rather than its <a href="https://www.techtarget.com/whatis/definition/Exterior-Gateway-Protocol-EGP">exterior gateway protocol</a> (i.e. 
BGP).</p> <div as="Promo"></div> <h2 id="the-view-from-kentik">The view from Kentik</h2> <p>The graphic below shows Kentik’s view of the outage through our aggregated NetFlow data. Traffic drops to near-zero at 8:44 UTC and doesn’t begin to return until after midnight UTC. At 00:05 UTC, IPv4 traffic began flowing again to AS812 but IPv6 didn’t return until 08:30 UTC on July 9 for a total downtime of no less than 15 hours. In all, it took several hours more until traffic levels were close to normal.</p> <img src="//images.ctfassets.net/6yom6slo28h2/24Q0bOjtYtMlM4Pa2mtmfR/3c9728060de2d7d775d805544d8c2055/rogers-outage-view-from-kentik.png" style="max-width: 800px;" class="image center" alt="Rogers outage: Aggregated NetFlow data" thumbnail /> <h2 id="route-withdrawals">Route withdrawals</h2> <p>On a normal day, AS812 originates around 970 IPv4 and IPv6 prefixes that are seen by at least 100 <a href="http://www.routeviews.org/routeviews/">Routeviews vantage points</a>. During the outage, hundreds of these routes were withdrawn — but not all at the same time. Most of the withdrawn routes went down around 8:44 UTC, the same time that traffic stopped flowing to AS812, but there were also batches of route withdrawals at 8:50, 8:54, 8:59, 9:10 and 9:25.</p> <p>On the day of the outage, a widely shared <a href="https://www.reddit.com/r/Rogers/comments/vuk17t/timelapse_of_rogers_bgp_losing_practically_all_of/">Reddit post</a> featured a Youtube video of a <a href="https://stat.ripe.net/widget/bgplay">BGPlay</a> animation of the withdrawal of one of AS812’s routes: 99.247.0.0/18. Below is a reachability visualization of this prefix over the 24-hour period containing the outage. I believe it paints a clearer picture of the plight of this particular prefix: the timing of its initial withdrawal, its temporary restoration, and its eventual return.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4B7HGqk0OGF7pVmE0AqnOO/2cb5dcb70b235e21668618356c36943c/rogers-outage-reachability-visualization.png" style="max-width: 800px;" class="image center" alt="Rogers outage - 24 hour period" thumbnail /> <p>As was the case with 99.247.0.0/18, most of the withdrawn routes returned to the global routing table for periods of an hour or more beginning at 14:33 UTC (see the reachability visualization for 74.198.29.0/24 below). Despite the temporary return of these routes, almost no traffic moved into or out of AS812, meaning that it wasn’t a lack of reachability that was causing the outage.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5EYDyOKDCS8oDE4ITaweOk/c3f17c78e459b82d5ded0ea597ce2c5b/rogers-outage-route--withdrawal.png" style="max-width: 800px;" class="image center" alt="Rogers outage - withdrawal and return" thumbnail /> <h2 id="routes-that-stayed-up">Routes that stayed up</h2> <p>An interesting feature of this outage is the fact that the BGP routes that stayed up stopped carrying traffic. At 10:00 UTC on July 8 (over an hour since the outage began), about 240 IPv4 routes were still globally routed, such as 74.198.65.0/24, illustrated below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3HYpXOTC6rL6r34HL7u4bo/091421e1d7a41514e3aa83876aadc4d3/rogers-outage-routes-that-stayed-up.png" style="max-width: 800px;" class="image center" alt="Rogers outage - routes that stayed up" thumbnail /> <p>Additionally, at 10:00 UTC, there were about 120 IPv6 routes still being originated by AS812 and seen by at least 100 Routeviews vantage points. 
Most of these didn’t get withdrawn until 17:11 UTC — more than eight hours into the outage. The reachability visualization of one of those routes (2605:8d80:4c0::/44) is shown below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2PGvv6HzBcxYj3SO4NLJOh/5ef14ffe833daf69dae4549263c45a29/rogers-outage-upstream-restoration.png" style="max-width: 800px;" class="image center" alt="Rogers outage - reachability visualization" thumbnail /> <p>If we isolate the traffic seen in our aggregate NetFlow to these 120 IPv6 holdouts, we can see traffic drop away at 08:44 UTC despite the fact that the routes remained in circulation for several more hours. Therefore, the drop in traffic was not caused by a lack of reachability in BGP.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6ZV5p9rm9i0UntfP6t0nfs/d258d742102f996a92b65d16fac381f8/rogers-outage-netflow.png" style="max-width: 800px;" class="image center" alt="Rogers outage - isolated NetFlow traffic" thumbnail />
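<p>Conceptually, that isolation is a simple filter-and-sum over flow records. Here’s a rough TypeScript sketch, with flow records reduced to the few fields that matter here (the real analysis, of course, ran over Kentik’s aggregated NetFlow data at much larger scale):</p> <div class="gatsby-highlight" data-language="typescript"><pre class="language-typescript"><code class="language-typescript">// A flow record reduced to what this analysis needs.
interface Flow { timestamp: number; dstPrefix: string; bytes: number; }

// Sum traffic per minute for flows whose destination route is one of
// the ~120 IPv6 routes that AS812 kept announcing (the "holdouts").
function holdoutTraffic(flows: Flow[], holdouts: Set&lt;string>): Map&lt;number, number> {
  const perMinute = new Map&lt;number, number>();
  for (const f of flows) {
    if (!holdouts.has(f.dstPrefix)) continue;
    const minute = Math.floor(f.timestamp / 60) * 60;
    perMinute.set(minute, (perMinute.get(minute) ?? 0) + f.bytes);
  }
  return perMinute; // plotted, this series drops away at 08:44 UTC
}</code></pre></div>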
<p>Finally, while it didn’t seem to do much to assuage the impact of the outage, there were also dozens of routes originated by AS812 that stayed online. They continued to carry traffic because they were re-routed through competitors such as Bell Canada (AS577), Beanfield Technologies (AS23498) and Telus (AS852). These routes transited another Rogers ASN, AS19835, to reach those providers as illustrated in the graphic below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4aDFXvKvfDH30Gr1NzR0SE/2c0a61d7066d65a6632060bd04e7b432/rogers-outage-transit-shift.png" style="max-width: 800px;" class="image center" alt="Routes that stayed online" thumbnail /> <h2 id="dont-blame-bgp--">Don’t blame BGP :-)</h2> <p>Last October, <a href="https://www.kentik.com/blog/facebooks-historic-outage-explained/">Facebook suffered a historic outage</a> when its automation software mistakenly withdrew the anycasted BGP routes handling its authoritative DNS, rendering its services unusable. Last month, <a href="https://www.kentik.com/analysis/cloudflare-suffers-30-minute-outage/">Cloudflare suffered a 30-minute outage</a> when they <a href="https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/">pushed a configuration mistake</a> in their automation software which also caused BGP routes to be withdrawn. From these and other recent episodes, one might conclude that BGP is an irredeemable <a href="https://www.wired.com/story/bgp-route-leak-internet-outage/">mess</a>.</p> <div class="pullquote right">If you mistakenly tell your routers to withdraw your BGP routes, and they comply, that’s not BGP’s fault.</div> <p>But I’m here to say that BGP gets a bad rap during big outages. It’s an important protocol that governs the movement of traffic through the internet. It’s also one that every internet measurement analyst observes and analyzes. When there’s a <a href="https://twitter.com/DougMadory/status/1545926797925052416">big outage</a>, we can often see the impacts in BGP data, but often these are the symptoms, not the cause. <em>If you mistakenly tell your routers to withdraw your BGP routes, and they comply, that’s not BGP’s fault</em>.</p> <p>However, in the case of the Rogers outage, the fact that traffic stopped for many routes that were still being advertised suggests the issue wasn’t a lack of reachability caused by AS812 not advertising its presence in the global routing table. In other words, the exterior BGP withdrawals were symptoms, not causes of this outage.</p> <p>Lastly, the fact that users of Rogers’ Fido mobile service reported having <a href="https://twitter.com/Hanna8f9/status/1545444979585204231">no bars of service</a> during the outage is also a head-scratcher — what is a common dependency that would take down the mobile signal along with internal routing?</p> <p>We hope the engineers at Rogers can quickly publish a thorough root cause analysis (RCA) to help the internet community learn from this complex outage. As a model, they should look to Cloudflare’s <a href="https://blog.cloudflare.com/cloudflare-outage-on-june-21-2022/">most recent RCA</a>, which was both informative and educational.</p> <p>In the meantime, we remain religious about the value of <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/">network observability</a> to understand the health of one’s network and remediate issues when they occur. To learn more, <a href="#demo_dialog" title="get a demo">contact us for a live demo</a>.</p><![CDATA[Investigating digital experience with Synthetic Transaction Monitoring]]><![CDATA[Investigating a user’s digital experience used to start with a help desk ticket, but with Kentik’s Synthetic Transaction Monitoring, you can proactively simulate and monitor a user’s interaction with any web application.]]>https://www.kentik.com/blog/investigating-digital-experience-with-synthetic-transaction-monitoringhttps://www.kentik.com/blog/investigating-digital-experience-with-synthetic-transaction-monitoring<![CDATA[Phil Gervasi]]>Tue, 12 Jul 2022 04:00:00 GMT<p><a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a> is all about proactively testing and monitoring specific elements of your network, the services it relies on, and the applications it delivers. That means using artificial traffic instead of end-user traffic to test a variety of aspects of digital experience monitoring like device availability, DNS activity, web application page load times, and BGP activity.</p> <p><em>But to test an end-user’s experience interacting with a website, we need to approach things differently.</em></p> <h3 id="what-does-stm-solve">What does STM solve?</h3> <div class="pullquote right">With STM, you can test the availability of an HTTP service, analyze how every component of a web application is loading, and simulate an end-user interacting with the web application itself.</div> <p><a href="https://www.kentik.com/resources/synthetic-transaction-monitoring-stm/">Kentik’s Synthetic Transaction Monitoring</a> takes you up the stack from monitoring the network layer and expands Kentik’s suite of application tests. Along with the HTTP test and <a href="https://www.kentik.com/blog/solving-slow-web-applications-with-the-kentik-page-load-test/">Page Load test</a>, with Synthetic Transaction Monitoring, or STM, you can test the availability of an HTTP service, analyze how every component of a web application is loading, and simulate an end-user interacting with the web application itself.</p> <p>Imagine an ecommerce website hosted in several <a href="https://www.kentik.com/kentipedia/cloud-network-performance-monitoring/" title="Kentipedia: Cloud Network Performance Monitoring">public cloud regions</a>. Your customers access it from all over the world to log in, make purchases, book flights, reserve cars, or leave comments on a news article. 
Rather than collect passive information that tells you simply that the website is up or down or how much of one particular protocol is consuming your WAN interfaces, STM actively tests what it’s like to log into your site, click on various elements on the web page, add something to the shopping cart, and so on.</p> <p>Every step in an end-user’s interaction with a site can be simulated, which means you have the ability to monitor the digital experience of an end user without having to wait for trouble tickets to come in. These could be sites you host in your private data center, in a public cloud or delivered by a <a href="https://www.kentik.com/kentipedia/saas-network-performance-monitoring/" title="Kentipedia: SaaS Network Performance Monitoring">SaaS provider</a>.</p> <h3 id="how-does-stm-work">How does STM work?</h3> <p>STM uses Kentik’s hundreds of global agents dispersed throughout the world to run proactive tests against a specific target, for example, your public website. These agents programmatically run a script that simulates an end user logging into the site, clicking on objects, adding items to a shopping cart, and doing whatever it is that’s relevant to you and your business.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5snEIL0hRLSJmoR4D9Wj11/827af2b53b4afa099d521b083beb007d/global-agents-low-res.png" style="max-width: 500px;" class="image center" alt="Synthetic testing global agents" /> <p>The script is embedded within the tool, and though you can create it manually, you can also generate it automatically by performing the actions you want to test in a Chrome browser and recording the transaction using the built-in dev tools recorder. After that, it’s a simple matter of exporting the activity as a Puppeteer script and pasting it into your test.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5uOF4JtObEThS9R6qINaro/cb46de59e0b7c42d590cfd475cfbd76d/puppeteer-975w.png" style="max-width: 600px;" class="image center" alt="Puppeteer script" />
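<p>If you haven’t seen one before, an exported script is just ordinary Puppeteer code. Here’s a minimal, hand-written example of the kind of script the recorder produces; the URL and element selectors are placeholders, and your exported script will reflect the actual pages you clicked through:</p> <div class="gatsby-highlight" data-language="typescript"><pre class="language-typescript"><code class="language-typescript">import puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Step 1: load the login page.
  await page.goto("https://www.example.com/login");

  // Step 2: fill in credentials and submit.
  await page.type("#username", "synthetic-test-user");
  await page.type("#password", "************");
  await Promise.all([page.waitForNavigation(), page.click("#login-button")]);

  // Step 3: add an item to the cart, as a real user would.
  await page.click("#add-to-cart");
  await page.waitForSelector("#cart-confirmation");

  await browser.close();
})();</code></pre></div>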
This way you can see at a glance if there are any issues with the simulated transaction over a given time range. For each moment in time, the transaction time will be compared to a rolling standard deviation which is calculated automatically.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6nTy6g9wFJiChcmju1CwJI/e2c665f3abd25c865684d50348d1c27f/transaction-testing-login.png" style="max-width: 700px;" class="image center" alt="Synthetic transaction monitoring test results" /> <p>That’s very important, but the really awesome benefit of the test is the ability to get much more granular with the individual steps in the transaction.</p> <p>The system takes snapshots of the website throughout the transaction process, which is a great way for an engineer to see the reality of what’s going on – what a real person would see interacting with the site.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6jKagQKgdA9cvoUll1zkmg/4cad601603c23893efa935572c8a8bf9/stm-with-screenshots.png" style="max-width: 700px;" class="image center" alt="Transaction testing process snapshots" /> <p>Additionally, performance metrics are collected for every single piece of the transaction script that’s run. The test captures the DOM file so that you can see exactly how long it took for each component to load, and therefore, for each simulated user activity to be processed. Any slowness can be tracked down to a specific end-user activity or specific object on the page.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5nHsq08ASTFw4cj5HOfiWw/69e48227bc88cae92843c78624c43f0e/stm-waterfall.png" style="max-width: 700px;" class="image center" alt="STM waterfall" /> <p>Lastly, you can perform the test from multiple geographic locations and of course with the variations you need to simulate your own environment.</p> <h3 id="watch-a-short-demo">Watch a short demo</h3> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/31oztyfmhr" title="Synthetic Transaction Monitoring with Kentik Video" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>Learn more about Kentik’s STM solution in this <a href="https://www.kentik.com/resources/synthetic-transaction-monitoring-with-kentik/">short transaction monitoring video</a> where Sunil Kodiyan, one of Kentik’s resident subject matter experts on synthetic testing, walks through how to use the tool to simulate an end-user’s activity and peer into their digital experience with your web application.</p><![CDATA[Kentik moves up the stack with Synthetic Transaction Monitoring]]><![CDATA[Kentik goes up the stack and launched Synthetic Transaction Monitoring (STM). 
Read this post to learn about enhancements to our synthetic testing capability and how to perform STM in Kentik.]]>https://www.kentik.com/blog/kentik-moves-up-the-stack-with-synthetic-transaction-monitoringhttps://www.kentik.com/blog/kentik-moves-up-the-stack-with-synthetic-transaction-monitoring<![CDATA[Sunil Kodiyan]]>Thu, 23 Jun 2022 04:00:00 GMT<p>In our quest to provide the leading network observability solution, Kentik has been focused on developing a service for NetOps teams that empowers them to have intimate knowledge of their network traffic and the devices that route traffic. Our service helps them plan capacity, project costs, optimize routes, detect unwanted traffic, troubleshoot issues and analyze events.</p> <p>In 2020 we launched <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a>, which allows NetOps to be more proactive in testing and simulating traffic not only across traditional network resources, but also to and within public cloud resources and recently to software apps. In February of this year we added <a href="https://www.kentik.com/blog/introducing-bgp-monitoring-from-kentik/">BGP monitoring</a> to give insights into BGP announcements and withdrawals, <a href="https://www.kentik.com/kentipedia/bgp-route-leaks/" title="Kentipedia: What are BGP Route Leaks and How to Protect Your Networks Against Them">detect route leaks</a> and monitor <a href="https://www.kentik.com/blog/how-much-does-rpki-rov-reduce-the-propagation-of-invalid-routes/" title="Kentik Blog: How Much Does RPKI ROV Reduce the Propagation of Invalid Routes?">RPKI</a> status.</p> <p>In a further move up the stack, we recently launched <a href="https://www.kentik.com/blog/solving-slow-web-applications-with-the-kentik-page-load-test/">synthetic Page Load testing</a>, which gives our customers the ability to test how a website loads from global locations. Our results give a granular breakdown of how every component on a web page loads so that you can easily identify what’s impacting performance, whether it’s a site hosted on-premises, in the cloud or hosted by a SaaS provider.</p> <h3 id="introducing-synthetic-transaction-monitoring">Introducing Synthetic Transaction Monitoring</h3> <p>I’m excited to announce that our initial launch of <a href="/resources/synthetic-transaction-monitoring-stm/">Synthetic Transaction Monitoring (STM)</a> is now GA! There are plenty of Application Performance Monitoring (APM) tools on the market that enable DevOps to get detailed insights into the performance of applications, but none that we are aware of can do this in the context of actual <a href="https://www.kentik.com/blog/best-practices-enrich-network-telemetry-to-support-network-observability/" title="Kentik Blog: Best Practices for Enriching Network Telemetry to Support Network Observability">network telemetry</a>.</p> <p>With this new feature, network teams and site reliability engineers will be able to identify where performance problems are occurring without having to rely on DevOps tools. Kentik provides full visibility of actual network traffic alongside synthetic test traffic – all the way up to measuring the performance of applications by emulating user transactions. 
(For a more general introduction to the concept of STM, see our Kentipedia entry, “<a href="https://www.kentik.com/kentipedia/what-is-synthetic-transaction-monitoring-stm/" title="Kentipedia: What is Synthetic Transaction Monitoring (STM)?">What is Synthetic Transaction Monitoring?</a>”)</p> <p>Our integrated sharing tools enable your team to collaborate via your platform of choice, and our public link sharing feature allows users to share test results with unauthenticated users to efficiently run troubleshooting sessions internally and externally.</p> <h3 id="how-to-set-up-synthetic-transaction-monitoring">How to set up Synthetic Transaction Monitoring</h3> <p>To set up STM in the Kentik portal, you first use Chrome Developer Tools to record the actions a user would take to interact with an application. Once you’ve recorded these steps, you log in to your Kentik account and create a test. Here’s what the setup looks like in Kentik:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3YPqec3SQnzzUiGeZIrTF4/c97bf5f6b6a868bb2d28773f5d09212e/stm-kentik-setup.png" class="image center" style="max-width: 450px;" alt="Synthetic page load test" thumbnail /> <p>You can select any of our public application agents to test from, and/or easily deploy your own private Application agents, which we can supply (for Docker, x86 and ARM). STM tests can be set to run automatically at periodic time intervals.</p> <h3 id="presentation-of-synthetic-test-monitoring-results">Presentation of Synthetic Transaction Monitoring results</h3> <p>Results are presented on a timeline that shows transaction completion time. Time intervals can be edited. Performance is measured against dynamically calculated baselines, and lags in performance are colored – orange is a warning, red is critical. Selecting a point on the line will indicate total completion time and the rolling standard deviation baseline. Screenshots captured during the transaction process provide insight into the script execution flow and aid in troubleshooting.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7BIIRYzqAT17239AOTCLwQ/015a8f562ab6671edc3528ef74967046/transaction-monitoring-test-results.png" class="image center" style="max-width: 800px; margin-bottom: 15px;" alt="Synthetic page load test" thumbnail /> <div class="caption">Transaction Monitoring test results in the Kentik Portal depicting completion time with screen captures of the journey</div> <p>The Waterfall tab shows the load order and load duration of each element in the Document Object Model (DOM) of every page visited.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3rp1xNsz2PWSI4Qk5dpyhm/e8504ca2e447159bb91b30e48d79d32a/stm-waterfall-view.png" class="image center" style="max-width: 800px; margin-bottom: 15px;" alt="Synthetic page load test" thumbnail /> <div class="caption">Waterfall view of test results in the Kentik Portal depicting load time of page elements</div> <h3 id="key-differentiators-of-kentik-synthetics">Key differentiators of Kentik Synthetics</h3> <p>While most other synthetic transaction testing tools are implemented on a Selenium framework, at Kentik we chose <strong>Puppeteer</strong>, a testing framework developed by Google. With syntax nearly identical to Selenium’s, users get the benefit of a familiar approach along with a framework that is more tightly coupled with the Google Chrome browser it controls.
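</p> <p>To make that concrete, here is a minimal sketch of the kind of Puppeteer script an STM test runs. The URL, selectors and credentials are hypothetical placeholders, not a real test definition:</p> <pre><code>import puppeteer from "puppeteer";

// Hypothetical login transaction: the URL, selectors and credentials are
// placeholders, not a real Kentik test definition.
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://shop.example.com/login");
  await page.type("#email", "user@example.com");     // fill in the login form
  await page.type("#password", "placeholder-secret");
  await Promise.all([
    page.waitForNavigation(),                        // wait for the post-login page
    page.click("#submit"),
  ]);
  await page.click("#add-to-cart");                  // simulate a user action
  await browser.close();
})();
</code></pre> <p>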
Every Chrome browser has a transaction recorder built in, avoiding the need to maintain an independent scripting environment.</p> <p>NetOps teams have long been held back by the lack of <strong>affordability</strong> of synthetic transaction testing, which meant that only the most critical apps got coverage at less than optimal time intervals. With Kentik you can afford to monitor more apps at lower time intervals with the same or less budget than competitive offerings.</p> <p>When downtime can cost you dearly in terms of reputation damage and lost revenue, testing every minute won’t cut it. Your critical applications often require sub-minute test frequencies. Only Kentik provides network (ping and traceroute) <strong>testing down to one second intervals</strong>, which coupled with STM at five minute intervals results in you never missing a possible performance degradation.</p> <p>As well as the obvious advantage of having a single service to correlate real network telemetry with synthetic test traffic, a significant feature of the Kentik service is that you can use actual traffic to <strong>automatically configure</strong> and maintain the synthetic tests you should be running. This can save you a huge amount of setup and administration time. If there’s a change in your traffic flows, Kentik will automatically adjust the locations to which you are testing.</p> <h3 id="conclusion">Conclusion</h3> <p>We don’t know of any other service that can give network operators the ability to easily visualize real traffic alongside test traffic so they can quickly analyze events and answer any network question. Thanks to our engineering teams for their rapid development and deployment of this feature and to our customers who have guided its development!</p> <p>Kentik Synthetics is available for a <a href="https://www.kentik.com/get-started/">30-day free trial</a>. Our public agents are available for you to test from so there’s no heavy integration work to get up and running – in no time you’ll be performing STM or page load tests. For more information on the Kentik STM capability <a href="https://www.kentik.com/go/webinar/debug-faster-web-and-network-performance/">view this webinar</a> or <a href="https://www.kentik.com/get-demo/">contact us for a demo</a>.</p> <p><button as="DemoButton">GET A DEMO</button></p><![CDATA[Cisco Live 2022 recap]]><![CDATA[The theme of augmenting the network engineer with diverse data and machine learning really took the main stage at the World of Solutions. Read Phil Gervasi's recap.]]>https://www.kentik.com/blog/cisco-live-2022-recaphttps://www.kentik.com/blog/cisco-live-2022-recap<![CDATA[Phil Gervasi]]>Wed, 22 Jun 2022 04:00:00 GMT<p>It was really great to see so many old friends again at Cisco Live this year in Las Vegas. It’s been a few years since I’ve been to a conference other than a few small work events, so it was exciting to wander the World of Solutions, talk to people at the many booths, and bump into people I knew online but never met in person.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6XtwsC4LvqZ0evY8OwmuaZ/33236141b62c2481026896e91861fc35/cisco-live-2022.jpg" class="image center" style="max-width: 600px;" alt="Cisco Live 2022" /> <p>The conversation often started with something like “hey it’s great to finally meet you in person,” followed by “doesn’t it feel like everyone’s talking about the same thing this year?” And honestly, it kind of did. 
It’s not that every booth had the exact same message, but there was definitely a theme emerging over the course of the week — <em>mining data from the network to improve network operations</em>.</p> <p>We’ve had visibility into our networks to one extent or another since the early days of networking. Over the course of the 80s, 90s, 2000s, and today, we’ve gotten progressively better at gathering flow data, SNMP info and streaming telemetry from network devices, and logs from various sources internally and from public clouds.</p> <p>My impression this year is that network visibility was taking a big step forward. Not toward gathering more kinds of data, and not toward creating prettier graphs, but toward adding intelligence to the analytics we’ve been doing manually since the beginning. From my perspective, <em><a href="https://www.kentik.com/kentipedia/what-is-network-observability/">observability</a></em> took the main stage this year in the World of Solutions.</p> <img src="//images.ctfassets.net/6yom6slo28h2/L2DIwLzpSH4yJEzWfNohx/4e1e19c0ae9e0e78c590f69136394e47/cisco-live-2022-meme.jpg" class="image center" style="max-width: 450px;" alt="Cisco Live observability meme" /> <p>A few years ago, even uttering that term caused engineers to cringe, but this year it was pretty obvious observability was as cool as Chris Cornell in 1994 (or really any year for that matter). But is it just some new marketing term everyone’s jumping on like the intent based networking craze of a few years ago?</p> <p>That’s something I’ve been thinking about a lot over the last two years. What’s going on in network operations that observability is trying to solve? How is observability different from visibility? Is the underlying technology that makes observability possible real, or <a href="https://www.kentik.com/blog/network-observability-hype-or-reality/">is this all marketing fluff</a>?</p> <p>For me, the answers to these questions started with reflecting back on my days as an engineer working in the trenches to build and troubleshoot big networks. Usually, that meant figuring out why some application wasn’t working well. Rarely was I troubleshooting a network just to make the network better. Analyzing logs, troubleshooting QoS policies, or figuring out why traffic was taking a particular path were to make an application perform better for real people trying to get real work done.</p> <p>That’s really one of the underlying themes of observability and of this year’s Cisco Live. It’s all about augmenting the engineer to help network operations work faster and more efficiently. 
That means root cause analysis, security mitigation and forensics, and correlating data to get deep insight into everything going on in the network.</p> <p><em>This is the difference between only <strong>seeing</strong> what’s going on in the network and understanding <strong>why</strong> it’s happening</em>.</p> <p>For example, when end users start complaining that their line of business application feels very slow, legacy visibility can tell you that an interface is utilized a little more than normal and that the time to load a web page is taking a little longer than expected.</p> <p>Observability, on the other hand, will use all of that legacy visibility as the foundation to then analyze events and automatically discover a high probability of correlation between swinging traffic between data centers, DNS lookup times rising dramatically, and application page load times increasing to the point that the app is almost unusable.</p> <p>Observability is built on a foundation of visibility. However, it goes beyond just seeing what’s going on to also ingesting, scaling, normalizing, and correlating data so that you, as a network engineer, can get to the root cause of a problem faster than ever before.</p> <p>This is the point of <a href="https://www.kentik.com/why-kentik-network-observability-and-monitoring/">Kentik’s network observability platform</a> and why, for instance, you can see how the specific results of a synthetic test correlate with flow data. Diverse visibility, normalized data, correlated results produce real insight into <em>why</em> something is happening on the network.</p> <p>Cisco Live 2022 had its share of networking goodness including some cool advances in wireless tech, cloud networking, and even in network security. But it was the theme of augmenting the network engineer with diverse data and machine learning that took the main stage at the World of Solutions. Regardless of how the term was used a few years ago, <em>observability</em> is certainly a very real and practical solution for handling the growing complexity of application delivery today.</p> <p>Learn more about Kentik and try it for free for 30 days.</p> <p><button as="SignupButton">TRY KENTIK</button></p><![CDATA[Outage in Egypt impacted AWS, GCP and Azure interregional connectivity]]><![CDATA[On Tuesday, June 7, internet users in numerous countries from East Africa to the Middle East to South Asia experienced an hours-long degradation in service due to an outage at one of the internet’s most critical chokepoints: Egypt. In this blog post, we review some of the impacts as well as compare how this disruption affected connectivity within the three major public cloud providers.]]>https://www.kentik.com/blog/outage-in-egypt-impacted-aws-gcp-and-azure-interregional-connectivityhttps://www.kentik.com/blog/outage-in-egypt-impacted-aws-gcp-and-azure-interregional-connectivity<![CDATA[Doug Madory]]>Tue, 14 Jun 2022 04:00:00 GMT<p>On Tuesday, June 7, internet users in numerous countries from East Africa to the Middle East to South Asia experienced an hours-long degradation in service due to an outage at one of the internet’s most critical chokepoints: Egypt. Beginning at approximately 12:25 UTC, multiple submarine cables connecting Europe and Asia experienced outages lasting over four hours.</p> <p>As I show below, the impacts were visible in various types of internet measurement data to the affected countries. 
In addition, the impacts were visible in Kentik’s synthetic measurements between the regions of the three major public cloud providers: AWS, GCP and Azure underscoring that even the cloud must rely on the same submarine cable infrastructure that the rest of us do.</p> <h3 id="transegypt">TransEgypt</h3> <p>Egypt’s Suez Canal is a vital waypoint in the world of international shipping. The canal’s role as a global maritime chokepoint was underscored last year when the massive cargo ship <a href="https://www.wired.co.uk/article/ever-given-global-supply-chain">Ever Given became stuck in the canal</a>. In the six days it took to free the Ever Given, hundreds of other ships were delayed in making their deliveries, temporarily halting a <a href="https://en.wikipedia.org/wiki/2021_Suez_Canal_obstruction#Economic_impact">large portion of global trade</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/rmTlVXhsXHfV2rhMmKXhE/c9583af000991034c11d5ebde0712550/ever-given-suez-canal.jpg" style="max-width: 600px; margin-bottom: 15px;" class="image center" /> <div class="caption">Ever Given stuck in the Suez Canal in March 2021</div> <p>Perhaps less appreciated is the fact that Egypt also serves as a critical internet chokepoint. Virtually all internet traffic between Europe and Asia rides along submarine cables which run through this country.</p> <p>Of course, the submarine cables traversing Egypt aren’t <em>laid in the canal</em> itself. Instead they come ashore at Abu Talat and Alexandria on the Mediterranean, travel overland to the Red Sea where they are deposited back in the water at cable landing stations such as Zafarna. Telecom Egypt operates this overland passage called TransEgypt and collects a hefty payment from each submarine cable using the route. For those circuits that don’t travel over the Egyptian desert, the Suez Canal Authority offers fiber optic lines run in ductwork encased in concrete on the side of the canal.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7rse8Fiw9yrwHDFDmwhaNY/c166d59cfac43ab33ce0e8f6aa603138/transegypt-diagram.jpg" style="max-width: 700px; margin-bottom: 15px;" class="image center" /> <div class="caption">TransEgypt diagram Telecom Egypt from 2013</div> <p>According to Telecom Egypt, TransEgypt operates as a mesh, meaning that it should be able to survive the failure of any particular segment. Despite this assurance, there have been a handful of brief outages in this overland route through the years. When these outages occur, they can impact millions of people in dozens of countries.</p> <p>In February 2013, multiple submarine cables traversing Egypt experienced <a href="https://twitter.com/InternetIntel/status/302119254545805312">outages lasting several hours</a> due to a <a href="https://mybroadband.co.za/news/broadband/70566-slow-browsing-seacom-is-down-reports-mweb.html">fire set by thieves</a> in a supremely misguided attempt to extract copper from the fiber optic cables. In March 2013, the <a href="https://www.theguardian.com/technology/2013/mar/28/egypt-undersea-cable-arrests#:~:text=Undersea%20internet%20cables%20off%20Egypt%20disrupted%20as%20navy%20arrests%20three,-This%20article%20is&#x26;text=Egyptian%20naval%20forces%20have%20arrested,capacity%20between%20Europe%20and%20Egypt.">Egyptian Navy arrested divers</a> off the coast of Alexandria who had damaged <a href="https://www.submarinecablemap.com/submarine-cable/seamewe-4">SeaMeWe-4</a> when they detonated underwater explosives purportedly to collect scrap metal. 
(Note: the TransEgypt circuits were unaffected by the <a href="https://www.kentik.com/blog/from-egypt-to-uganda-a-decade-of-internet-shutdowns/">government-directed internet shutdown</a> in January 2011).</p> <p>Thankfully last week’s outage occurred on land and could be quickly repaired. The alternative would have been a nightmare scenario. The SeaMeWe-4 cut in March 2013 led to severe <a href="https://web.archive.org/web/20130820092123/http://www.renesys.com/2013/03/intrigue-surrounds-smw4-cut/">connectivity problems</a> in numerous countries, lasting weeks in many cases.</p> <p>Egypt’s status as a global internet chokepoint is a perennial topic at nearly every submarine cable conference and there have been several attempts to establish routes around Egypt and the Suez. The <a href="https://www.commsupdate.com/articles/2010/06/21/jadi-to-link-middle-east-with-far-east-europe/">JADI system</a> was one such route. Its name came from the cities which constituted its path: Jeddah, Amman, Damascus and Istanbul. Unfortunately, the complete circuit was short-lived. Months after its activation, Syria descended into civil war. The fighting severed the link which was never reconstituted.</p> <p>Additionally, <a href="https://en.wikipedia.org/wiki/Europe-Persia_Express_Gateway">EPEG</a> (Europe-Persia Express Gateway) is a terrestrial circuit running from Frankfurt, Germany to Iran and Oman in the Persian Gulf. It was <a href="https://lirneasia.net/2013/04/iran-and-russia-throw-lifeline-to-internet/">activated in April 2013</a> months ahead of schedule to supply bandwidth to the Middle East reeling from the loss of SeaMeWe-4 the previous month. Due to a variety of factors, EPEG never altered the region’s dependence on Egypt-based routes and most recently, the circuit was <a href="https://www.bbc.com/persian/iran-60637605">severed in March</a> due to the on-going fighting in Ukraine.</p> <h3 id="various-country-level-impacts">Various country-level impacts</h3> <p>The Internet Outage Detection and Analysis (IODA) project at Georgia Tech <a href="https://twitter.com/gatech_ioda/status/1534193779615334400">published charts</a> showing impacts to Somalia, Tanzania, Djibouti and Ethiopia in East Africa.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2Teb7EH9hkmQE9K7fnbqnW/11078c6712c7f77650bdfaa19ffabf64/ioda-signals-djbouti.png" style="max-width: 700px;" class="image center" /> <p>Cloudflare Radar <a href="https://blog.cloudflare.com/aae-1-smw5-cable-cuts/">published a blog post</a> including the one of Tanzania below showing the impact to their traffic resulting from the outage.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5v3PLX9Ymj9Zn40OIOyu89/ef9a74a148ee98161783ac5ee396e55f/traffic-tanzania.png" style="max-width: 800px;" class="image center" /> <p>And finally, Romain Fontugne of the <a href="https://www.iijlab.net/en/">IIJ Research Lab</a> in Japan <a href="https://twitter.com/romain_fontugne/status/1534356883502698497">contributed</a> the view below based on <a href="https://atlas.ripe.net/probes/">RIPE Atlas probe</a> measurements between Europe and Asia. 
According to these measurements, latencies from one region to the other experienced dramatic increases until the Egypt circuits were restored.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4iSLhvVvQO44bNneCVO7p6/49e3a4da5187b4185583027a59de9724/europe-asia-latencies.png" style="max-width: 800px;" class="image center" /> <p>When we look at the traffic impacts through Kentik’s aggregate NetFlow data, we can see network service providers in various countries shifting transit to recover from the loss of connections disrupted due to the outage in Egypt.</p> <p>In the example below, Ooredoo (AS8781, formerly Qatar Telecom) lost transit from Lumen (AS3356) and Cogent (AS174) during the outage, with inbound traffic shifting to PCCW (AS3491) and Tata (AS6453) in their absence.</p> <img src="//images.ctfassets.net/6yom6slo28h2/31ED4waqJCgv8S88OWnUML/392081384fa0519366a77218e635dc3a/traffic-to-ooredoo.png" style="max-width: 800px;" class="image center" thumbnail /> <h3 id="cloud-impacts">Cloud impacts</h3> <p>As part of our synthetics measurement suite, Kentik measures latency, packet loss and jitter between each of the major public cloud regions. As I mentioned in the opening, even these cloud providers have to rely on the same submarine cable infrastructure that everyone else does. Cloud providers will buy capacity on numerous cables with the objective of maintaining connectivity in the event any single cable suffers an outage.</p> <p>Having said that, it is fascinating to observe the differences between how the major public cloud providers experienced the outage. By choosing routes that are common to AWS, GCP and Azure (London, England to Singapore and Washington DC to Mumbai, India), we can attempt a couple of fair comparisons.</p> <p>In our data, there appeared to be two distinct outage events at 12:25 and 12:45 UTC. In the charts below, we can see AWS and Azure being briefly impacted at 12:25 UTC, but mostly surviving intact. Conversely, Google’s GCP appears to have felt a far greater impact beginning at 12:45 UTC.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3bjnsd8v1YuHUV4HRIba3f/f9e87e077df4320b43c4ac425e8f6ee7/london-singapore-gcp.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">GCP experienced significant packet loss and increased latency for over an hour.</div> <img src="//images.ctfassets.net/6yom6slo28h2/3DOVaNay6J3JUs6WGfByth/202f91f1aae8b606f4016d6bf24b9462/london-singapore-azure.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">Azure experienced a temporary increase in latency.</div> <img src="//images.ctfassets.net/6yom6slo28h2/Kcf8DlSqVWQnyqdg1W9gl/1f575fbafa1e249d268713f75b8c02f6/london-singapore-aws.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">AWS experienced a momentary increase in latency.</div> <p>Latencies for AWS returned to normal within minutes after a brief spike. Latencies for Azure remained elevated for hours until returning to previous levels after the outage was resolved. According to our measurements, only GCP experienced large amounts of packet loss for over an hour.</p> <p>The pattern repeats for other routes as well. Here are three views of measurements from the Washington DC area to Mumbai, India.
The greatest impact is visible in GCP, much less in Azure and almost nothing in AWS.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1A4cOtWF8DUJXMvIB8QFPC/fb56c25a32b77be7b908f9b6edda23dd/dc-mumbai-gcp.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">GCP experienced a significant spike in packet loss followed by increased latencies.</div> <img src="//images.ctfassets.net/6yom6slo28h2/18MY6yE5BFRyomKa9rCBLj/c41acfcea066dca1aa4c3d5b493bf28a/dc-mumbai-azre.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">Azure experienced elevated latencies along this route for a couple of hours.</div> <img src="//images.ctfassets.net/6yom6slo28h2/4I4mhIy0kfasRrNif78K7k/28d482527a24243b0b3010b9e7965458/ashburn-mumbai-aws.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">The impact to AWS along this route was negligible.</div> <h3 id="conclusion">Conclusion</h3> <p>The submarine cable industry continues to try to build alternatives to the Egyptian internet chokepoint. Most recently, Google has teamed up with Telecom Italia Sparkle and Omantel to build the <a href="https://www.submarinenetworks.com/en/systems/asia-europe-africa/blue-raman/google-s-blue-raman-cable-creates-new-route-across-israel">upcoming Blue-Raman cable system</a>. This cable system (depicted in dotted red and orange below) would come ashore in Tel Aviv, Israel before crossing over land to Aqaba, Jordan and continuing on to Mumbai, India.</p> <img src="//images.ctfassets.net/6yom6slo28h2/o6XjTyTTwgK7o7Oy7acJv/2a47996eb5009a8b48fbb489a0f89d79/israel-submarine-cables.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">Existing and planned submarine cables connecting Israel, credit: <a href="https://www.haaretz.com/israel-news/business/2020-04-14/ty-article/.premium/israel-to-play-key-role-in-giant-google-fiber-optic-cable-project/0000017f-e5ec-da9b-a1ff-edef50650000" target="_blank">Haaretz</a></div> <p><a href="https://en.wikipedia.org/wiki/Global_commons">Global commons</a> are resources that humanity must share like the oceans, the atmosphere and outer space. Whether the internet should be included on that list is a <a href="https://www.cigionline.org/publications/internet-global-commons/">matter of debate</a>, but what is evident is that the global internet depends on shared resources like submarine cables.</p> <p>This is true whether you are a hyperscale cloud provider or a digital business. In either case, the need for network observability is critical to keeping the packets flowing and delivering services whether the risks are <a href="https://www.theguardian.com/technology/2013/mar/28/egypt-undersea-cable-arrests">underwater explosives</a>, <a href="https://www.bbc.com/news/world-europe-jersey-38141230">ship anchors</a>, or <a href="https://arstechnica.com/tech-policy/2015/07/its-official-sharks-no-longer-a-threat-to-subsea-internet-cables/">sharks</a>!</p><![CDATA[Solving slow web applications with the Kentik Page Load test]]><![CDATA[By peeling back layers of your users' interactions, you can investigate what’s going on in every aspect of their digital experience -- from the network layer all the way to application. 
]]>https://www.kentik.com/blog/solving-slow-web-applications-with-the-kentik-page-load-testhttps://www.kentik.com/blog/solving-slow-web-applications-with-the-kentik-page-load-test<![CDATA[Phil Gervasi]]>Fri, 03 Jun 2022 04:00:00 GMT<p><a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a> is all about <em>proactive</em> monitoring. With synthetic monitoring, you can investigate users’ digital experience by peeling back, layer by layer, exactly what’s going on in every aspect of the digital experience from the network layer all the way to application.</p> <p>Because synthetic tests can be so granular, the results provide different information than you can get from flows, streaming telemetry or other observability data. For example, you can test specific DNS activity, HTTPS activity or simulate an actual web transaction. This is a powerful method to proactively monitor digital experience rather than gather data passively.</p> <h3 id="the-kentik-page-load-test">The Kentik Page Load test</h3> <p>The Kentik Synthetics menu of tests continues to expand rapidly — we recently introduced a range of <a href="https://www.kentik.com/blog/introducing-bgp-monitoring-from-kentik/">BGP monitoring</a> capabilities. The Page Load test is a way to test how a website loads from specific locations. The process is pretty straightforward — you select which locations you want as your source, you specify a URL, and the system tests a full browser page load using Headless Chromium run by Kentik app agents. The results are a readout of all the status codes and the times for performance indicators such as response, navigation, domain lookup, etc.</p> <p>The Page Load test gives you a granular breakdown of how every component on the page is loading so that you can easily track down exactly what’s impacting the site’s performance, whether it’s a site hosted on-premises, in the cloud, or hosted by a SaaS provider.</p> <h3 id="the-page-load-test-in-action">The Page Load test in action</h3> <p>Take a look at the graphic below in which we’re using the Kentik Synthetics Page Load test to figure out why a website is slow. Notice in the lower left we have several synthetic test agents that live in the Kentik cloud which you can use to monitor pretty much any website you want.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4fUb3f4oqxZGvFl0cO52zU/d21b66d1b91ad6cea78e69140b359747/page-load-synthetic-agents.png" class="image center" style="max-width: 800px;" alt="Page load test with synthetic agents" /> <p>Above on the right you can see individual times for how long different components of the website took to render and load. Notice that you can see the domain lookup time, response time, average HTTP latency, and so on.</p> <p>The color-coded bar at the top of the screen is a warning indicator using an aggregate of these metrics. Green means no problem, of course, whereas orange and red indicate some issue with the website’s performance. These metrics can be based on auto-configured dynamic baselines or manually configured in the advanced configuration for the test.</p>
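<p>Under the hood, component timings like these map onto the browser’s standard Navigation Timing entries, which Kentik’s app agents collect and baseline for you. As a rough sketch of the same idea, assuming Puppeteer and a placeholder URL:</p> <pre><code>import puppeteer from "puppeteer";

// Rough sketch: load a page headlessly and read standard Navigation Timing
// metrics. The target URL is a placeholder; Kentik's app agents gather and
// baseline these numbers for you.
(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto("https://www.example.com", { waitUntil: "load" });
  const timings = await page.evaluate(() => {
    const nav = performance.getEntriesByType("navigation")[0] as PerformanceNavigationTiming;
    return {
      domainLookupMs: nav.domainLookupEnd - nav.domainLookupStart, // DNS time
      responseMs: nav.responseEnd - nav.requestStart,              // HTTP response time
      domCompleteMs: nav.domComplete,   // full render, ms since navigation start
    };
  });
  console.log(timings);
  await browser.close();
})();
</code></pre> <p>But we need to go even deeper. 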
In the next image below, notice that when you dig into the breakdown of one of the agent tests, you can see right away that there’s a problem, with the DNS lookup time increasing over time.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2Cph0ujGUBO04p4Bd3VGC3/b077d2fbf620233067cdbced9c6134a8/page-load-agent-detail.png" class="image center" style="max-width: 800px;" alt="Detail of the page load test agent" /> <p>You can open a waterfall view to see the entire rendition of the transaction so you can see how every single individual component of the web page loaded over time. Check out the last screenshot below and notice that you can see the metrics for every GET request in detail.</p> <p>This is a great way to find the root cause of a slow web application, and in this example you can clearly see the slowness was caused by an inordinately high (and increasing) DNS lookup time. And it’s only a matter of a few clicks to run this test on-going so you can get historical results for this website which can be correlated with your other visibility data.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5jopQRVYXIhoTIm5x3JU1Z/4c0c953086f7e080a3710ca41a2e1865/page-load-root-cause.png" class="image center" style="max-width: 800px;" alt="Troubleshoot cause slow web application" /> <p>In this video, one of Kentik’s subject matter experts on Synthetics, Sunil Kodiyan, walks through this scenario and answers some questions about how Kentik Synthetics fits into an overall observability solution.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/r0uqoqkt2o" title="Solving slow web applications - Kentik Synthetics Pageload Test Video" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>To learn more about Kentik Synthetics and the Kentik observability platform, <a href="/get-demo/">sign up for a demo</a>.</p> <p><button as="DemoButton">GET A DEMO</button></p><![CDATA[How New Relic uses Kentik for network observability]]><![CDATA[With Kentik Synthetics, New Relic gains proactive insights about network performance and can ensure its customers have a reliable digital experience.]]>https://www.kentik.com/blog/how-new-relic-uses-kentik-for-network-observabilityhttps://www.kentik.com/blog/how-new-relic-uses-kentik-for-network-observability<![CDATA[Stephen Condon]]>Tue, 17 May 2022 04:00:00 GMT<p>New Relic is known for empowering the world’s leading engineering teams to deliver great software performance and reliability. And the network that delivers that service to New Relic’s users plays a critical role. Hiccups in the performance of the network between New Relic’s mission-critical service and their users can create a cascade of problems.</p> <p>New Relic uses their own service to manage performance. However, when it comes to the network, a solution that could monitor from the outside was needed. 
As Pedro Carvalho, a senior network engineer at New Relic, <a href="https://www.kentik.com/resources/case-study-new-relic/">put so succinctly</a>, “One of the big challenges we find in a hybrid environment is troubleshooting the network. It’s very hard to find and diagnose network connections unless you have a tool that can see things from the outside.”</p> <p>With the move to public cloud infrastructure and the adoption of containerization, the challenge of monitoring and troubleshooting network performance has become incredibly complex. Detailed service level indicators (SLIs) and service level objectives (SLOs) aren’t useful if you don’t have the insights you need to address problems. This is where Kentik excels, giving you one platform to view real traffic monitoring alongside synthetically generated tests.</p> <h3 id="are-your-providers-meeting-their-slas">Are your providers meeting their SLAs?</h3> <p>Troubleshooting network issues in a hybrid environment is a challenge that New Relic does not face alone. The network outside and the cloud services required are in the hands of third-party providers (cloud SaaS, cloud IaaS, dedicated telecom, contracted telecom, etc.). Holding your cloud and network providers accountable is only possible if you have detailed data about how they are performing and how they will perform given certain conditions.</p> <p>That’s where <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a> can help. By using our global public agents and private agents deployed on your infrastructure, we can give you an intimate understanding of the network conditions your users/customers are experiencing — even if you do have an outage. Kentik can provide synthetic test results and related traffic flow information side-by-side, giving you a detailed understanding of the experience users are actually having and what experience they will have if conditions change.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/40I9iLisnFqncKgVl9SDUp/0f04cada52093e14aabd6cb2ef3cb436/synthetics-test-results.png" style="max-width: 800px;" class="image center" thumbnail /> <p>Keeping vendors accountable is also an area where we are helping New Relic. April Carter, the senior software engineering manager at New Relic, says, “Kentik gives us the ability to diagnose issues, and if it’s vendor-related, we can instantly tell that vendor exactly what links are affected and when.”</p> <h3 id="cloud-performance-monitoring">Cloud Performance Monitoring</h3> <p>Our performance monitor can give you the ability to diagnose interconnection performance problems impacting traffic between your cloud and on-prem networks. It will auto-discover VPN/DCG interconnections and give you latency, jitter and packet loss statistics between and within VPCs.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5EGswLtYDUM5tLTJ6hkKdJ/875526406a0a82dedd013aab0d07c562/cloud-performance-monitor.png" style="max-width: 800px;" class="image center" thumbnail /> <h3 id="facilitate-devops-collaboration">Facilitate DevOps collaboration</h3> <p>One of the features of the Kentik service that New Relic found especially useful is <a href="https://www.kentik.com/product/firehose/">Kentik Firehose</a>. Kentik Firehose uses a binary, called KTranslate, also available as an <a href="https://github.com/kentik/ktranslate/">open source project</a>, to listen to HTTP(s) traffic from the Kentik platform and perform the desired transformations to deliver the data in the format that the customer’s system can ingest.
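</p> <p>The integration pattern is easy to picture. Here is a minimal sketch of a downstream receiver that accepts JSON flow records over HTTP and reshapes them for another system; the field names are hypothetical placeholders, not the actual KTranslate output schema:</p> <pre><code>import { createServer } from "node:http";

// Hypothetical downstream receiver: accept JSON flow records over HTTP and
// reshape them for another system. Field names are illustrative only, not
// the actual KTranslate output schema.
createServer((req, res) => {
  let body = "";
  req.on("data", (chunk) => { body += chunk; });
  req.on("end", () => {
    for (const record of JSON.parse(body)) {
      // Map each record into whatever shape the target system expects.
      console.log({ src: record.src_addr, dst: record.dst_addr, bytes: record.in_bytes });
    }
    res.statusCode = 204;
    res.end();
  });
}).listen(8080);
</code></pre> <p>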
New Relic is then able to integrate this data into interfaces that their internal users are familiar with.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6uLqegPpRvTz40difGiQBg/871880418b5130d346c21dbdf991db8c/firehose-20210818.png" style="max-width: 800px;" class="image center no-shadow" thumbnail /> <p>This integration enables network and DevOps teams to collaborate in detecting and geo-locating network threats affecting their applications and to understand the role of cloud infrastructure in application performance. New Relic and their users are able to troubleshoot application performance degradations in an expedited manner — critical in complex network environments such as distributed services in cloud deployments.</p> <h3 id="new-relic-case-study-now-available">New Relic Case Study Now Available</h3> <p>Kentik is proud to partner with New Relic, both by providing them with our network observability solution and as a technology partner. There’s no greater example of how the network can be made observable within the industry’s leading observability solution. And for Kentik and New Relic, that’s proven at both the engineering level and the solution level for our joint customers.</p> <p>To learn more about how New Relic uses Kentik, <a href="https://www.kentik.com/resources/case-study-new-relic/">read the complete case study</a>. Want to learn more about Kentik? <a href="https://www.kentik.com/get-started/">Start a 30-day trial</a> and give Kentik Synthetics a try.</p><![CDATA[How to prepare for a peering-partner business review]]><![CDATA[A critical milestone in any peering relationship is the business review; and when it comes to business reviews, it’s all about preparation. Learn how Kentik can help.]]>https://www.kentik.com/blog/how-to-prepare-for-a-peering-partner-business-reviewhttps://www.kentik.com/blog/how-to-prepare-for-a-peering-partner-business-review<![CDATA[Nina Bargisen]]>Thu, 12 May 2022 04:00:00 GMT<p>Peering is more than just setting up sessions with any AS that will accept one. Peering can involve long-term relationships that require reviews and joint planning to grow synergy. A critical milestone in any peering relationship is the business review – and when it comes to business reviews, it’s all about preparation.</p> <p>So where to start? It can help to think about the review from two different perspectives: How does the relationship work from your side and, maybe even more importantly, how do you think it works from your partner’s point of view?</p> <p>The first action is to read the partner’s peering policy.</p> <h3 id="the-basics-of-peering-policies">The basics of peering policies</h3> <p>A peering policy is a declaration of a network’s intentions to peer. A network can state if it has the following:</p> <ul> <li><strong>Open peering policy</strong> - Will peer with everyone and everywhere possible</li> <li><strong>Selective peering policy</strong> - Will generally peer, but there is a set of requirements that define how mutual benefit can be gained from peering</li> <li><strong>Restrictive peering policy</strong> - Will peer, but is not seeking new peers and will generally decline any requests</li> </ul> <p>When networks agree to peer, it’s most common that they peer in all mutual locations. That’s because, well, it’s mutually beneficial. Most networks will hand off the traffic as soon as possible. And peering in all mutual locations makes it more likely that the burden of carrying the traffic is evenly shared.
When that’s not the case – for example, when an eyeball network peers with a content network – the parties work out an agreement on where to peer so it makes sense for both parties.</p> <p>As for the networks with a selective peering policy, they usually document their requirements and lay out their justification for saying no to peering. The most common requirements or restrictions are:</p> <ul> <li><strong>Ratio</strong> - The network requires a certain balance between the sent and received traffic from the potential peer</li> <li><strong>Volume</strong> - The network requires a certain volume to justify the increased workload involved in setting up and maintaining connections</li> <li><strong>Locations</strong> - The network requires a certain geographic overlap between the networks so they can hand off traffic most efficiently and save bandwidth within their own network</li> <li><strong>Customers of an existing peer</strong> - If your network is a customer of an existing peer to your peering prospect, your traffic is already on a free connection for them and your traffic might be needed in a potential ratio-relationship between your peering prospect and your provider</li> </ul> <p>With requirements from the peering policy in mind, it’s time to look at the traffic.</p> <h3 id="analysis">Analysis</h3> <p>Items to look out for:</p> <ol> <li><strong>Routing</strong> - Is the traffic flowing as both parties want it to be?</li> <li><strong>Quality</strong> - Is the traffic of even quality or do you see any connection or any destinations where the traffic is not performing as well as expected?</li> <li><strong>Volume</strong> - Do you meet the requirements in the policy? What is the expected growth of traffic?</li> <li><strong>Cost</strong></li> </ol> <h3 id="1-routing">1. Routing</h3> <p>The peering policy gives a good indication of how the partner wishes the traffic to flow, but sometimes they have other requirements or interests. For example, a content distribution network (CDN) might not be interested in hot-potato routing (see below) at all. On the contrary, their business is to place the content as close as possible to the content consumers.</p> <p>So, as a rule of thumb:</p> <ul> <li>Networks that transit traffic or networks that sell internet access favor hot-potato routing.</li> <li>CDNs prefer the shortest path possible from the interconnection to the consumers of the content.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/6JpargfNmz8FkLr8v42NCY/52e4cc6671e629624dc42b9dcc6b72e6/hot-potato-routing.png" style="max-width: 400px; margin-bottom: 15px;" class="image center no-shadow" thumbnail alt="hot-potato routing" /> <div class="caption" style="max-width: 600px;" >This shows hot-potato routing of traffic between X and Y, where X is a customer of ISP B, and Y of ISP A. The ISPs have two peerings, one in each region. Each ISP hands off the traffic as soon as possible, which means traffic from X to Y travels most of the way in ISP A, while traffic from Y to X travels most of the way in ISP B.</div> <p>In this analysis we use the <a href="https://www.kentik.com/blog/visualizing-traffic-exit-points-in-complex-networks/">Kentik BGP Ultimate Exit</a> to see how the traffic flows in and out of our network. 
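</p> <p>Conceptually, a query like this is a group-and-sum over enriched flow records. A toy illustration of the idea, with field names as hypothetical stand-ins for Kentik’s dimensions:</p> <pre><code>// Toy illustration of an ingress-site / exit-site breakdown over flow records.
// The field names are hypothetical stand-ins for Kentik's query dimensions.
interface Flow {
  srcAsn: number;
  ingressSite: string;
  exitSite: string;      // the "ultimate exit": where traffic leaves our network
  nextHopAsn: number;
  bytes: number;
}

function breakdown(flows: Flow[]) {
  const totals = new Map();
  for (const f of flows) {
    const key = [f.srcAsn, f.ingressSite, f.exitSite, f.nextHopAsn].join("|");
    totals.set(key, (totals.get(key) ?? 0) + f.bytes);
  }
  return totals; // e.g. "1299|ORD1|NYC1|2906" mapped to bytes on that path
}
</code></pre> <p>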
We are checking both directions for the peer.</p> <p>In our query we split the traffic by Source ASN, ingress site, Ultimate Exit site, destination, next hop ASN and destination ASN.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7pXmjT1P55T9Tl59CCaqNM/a5a86c5c82817b2bc890102f5ff15e85/visualization-dimensions.png" style="max-width: 400px;" class="image center" alt="Query options" /> <p>When looking at traffic from 1299 to destinations through our network, we see that most traffic enters in ORD1, but exits in ORD1 and in NYC1 and we can clearly see which next hop ASNs receive the traffic where.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4zlDkhHLhMrJyKB84wHT4r/28e73085d570336ad146dffc14698ab8/sankey1299.png" style="max-width: 800px;" class="image center" thumbnail /> <p>Looking at the traffic going to 1299 via our network, we see some traffic from some of our customers ingress our network in NYC1 and flows to 1299 in ORD1.</p> <img src="//images.ctfassets.net/6yom6slo28h2/53J5LH4j7DiuPqew6raail/483cd894d7b46242f6422a84039a9fae/sankey1299-nyc-ord.png" style="max-width: 800px;" class="image center" thumbnail /> <p>Based on this analysis it would be great for us if 1299 would agree to adding a connection in NYC1, in addition to the one we have in ORD. We would hand off the traffic from our customer connection in NYC1 to 1299 in NYC1 and not carry it to ORD1.</p> <p>But what does it look like on our partner’s side?</p> <p>Adding geography for the source of the traffic to the dimensions, we learn that most of the traffic we ingress from 1299 is coming from New York City so it is likely the partner will agree to set up a connection there. The traffic will be handed off closer to the source and that way travels a shorter path.</p> <img src="//images.ctfassets.net/6yom6slo28h2/11qEtHA9H1FNQX19l2C1LC/cffdbf950170ddec20a3e904c5c5cb3e/sankey1299-closer-handoff.png" style="max-width: 800px;" class="image center" thumbnail /> <p>If we look at the traffic in the other direction - traffic from our customers to 1299, we see that we would fix a hairpin issue for our customer 2906.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2CRMmeuBmOAr9tAuLsvY2h/cf4496ae4e263dc05fe1f767b1631cc0/sankey-fix-hairpin.png" style="max-width: 800px;" class="image center" thumbnail /> <p>We can translate the diagrams above to a network sketch of before and after.</p> <img src="//images.ctfassets.net/6yom6slo28h2/KDyEzBzBjVPuybWnOOhQd/f25ac14472f08bd559f6cb376d9e8179/hairpin-fix.png" style="max-width: 600px;" class="image center no-shadow" thumbnail /> <h3 id="2-quality">2. Quality</h3> <p>In general, synthetic tests enhance your flow-based connectivity monitoring and analysis. Synthetic tests are performed using Kentik’s platform of global agents in combination with private agents installed in your network. Synthetic tests help you see your network how others see it:</p> <ul> <li>Continuous path monitoring with alerts shows your connectivity works as planned and alerts you when it does not.</li> <li>Continuous monitoring of packet loss, latency and jitter with alerting means you are already on it before your customers experience any degradation. 
<ul> <li>Automation triggered by alerts from the tests could even take that action for you.</li> </ul> </li> <li>State of the internet measurements can help you quickly determine if an alert from a test from your network to an internet destination is due to internet weather or if you need to take action inside your network.</li> </ul> <img src="https://images.ctfassets.net/6yom6slo28h2/5kU5W8pAZH1XuZC83Q8scU/11b3a0d5f32d8090b69b267b01edac59/synthetics-peering.png" style="max-width: 400px;" class="image center no-shadow" /> <p>Monitoring the quality of the traffic for an individual peer is easy once you have found the good targets in the network. Combining your flow and synthetic platforms, like Kentik does, makes it easy to find the right targets to monitor.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5fbHe5aOB8zvBUjcNhXnwJ/29dfe1a57b899b041cbb9739bee75f56/synthetics-path-hops.png" style="max-width: 800px;" class="image center" /> <img src="//images.ctfassets.net/6yom6slo28h2/wYTWSZuOZq2v8jooiNmmg/103bb3eadffaddddef410954c805b8b1/synthetics-path-view-latency.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" /> <div class="caption">The test shows the average latency between the customer in NYC and their end users as between 20 and 30 ms.</div> <h3 id="3-volume">3. Volume</h3> <p>You will need to have a simple check of the ratio and the volume of traffic if your peering partner’s policy requires it. However, it’s more important to have a traffic forecast. What is the expected traffic growth? When you know this, you can discuss the capacity of the connections with the peering partner.</p> <p>In the <a href="https://www.kentik.com/solutions/usecase/network-capacity-planning/">Capacity Planning workflow</a>, this is an easy task.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6MFvM1mZ9kbN3fNTOETaIE/afada111e390a2bfa8e86b5a1b03e451/capacity-planning1-2.png" style="max-width: 800px;" class="image center" thumbnail /> <p>We can see there is quite moderate growth, and if that growth continues, there will still be enough capacity for more than a year ahead.</p> <img src="//images.ctfassets.net/6yom6slo28h2/53SANRFE6EGZdGfHp5y4yO/1b92cd9193fa98ce1363672bff4a43cb/capacity-planning3-4.png" style="max-width: 800px;" class="image center" thumbnail /> <p>Since we know from the routing analysis that a good part of the traffic is due to our customer AS 2906, it’s a good idea to check their interface and see that their growth is consistent with the growth on the interface to the peer.</p> <p>Now we know that we do not have capacity as a driver for the expansion of the connectivity to this peer, only the improvement for both parties by adding a connection in the metro where there is local traffic between our networks that is currently hauled to another metro and back.</p> <h3 id="4-cost">4. Cost</h3> <p>Knowing your cost for each of your peers is key to a good relationship.
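</p> <p>As a back-of-the-envelope illustration of the unit-cost math involved (all figures are invented for the example):</p> <pre><code>// Back-of-the-envelope unit cost for a peering connection.
// All figures are invented for the example.
const portFeeUsd = 1000;       // monthly port or IXP membership fee
const crossConnectUsd = 300;   // monthly cross-connect cost
const p95TrafficMbps = 8000;   // 95th-percentile traffic on the connection

const usdPerMbps = (portFeeUsd + crossConnectUsd) / p95TrafficMbps;
console.log(usdPerMbps.toFixed(4)); // "0.1625" USD per Mbps per month
</code></pre> <p>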
We discussed this in detail in the first part of our blog series, “<a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1/">The peering coordinator’s tool box</a>.”</p> <p>Given that the traffic is not projected to grow much in this example, you can use this information to calculate what adding an extra connection will do to your cost for the peering with 1299 and decide if the improvement of the routing is worth it.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5F091ZTeobxCNnXE9YMOvx/ffa0b71a3ba3e6a3dfd2bfb886f02839/connectivity-costs-telia.png" style="max-width: 800px;" class="image center" thumbnail /> <h4 id="interconnection-choices-and-technology">Interconnection choices and technology</h4> <p>We have not yet discussed how you are connected to the peer in this example.<br> Public peering is when you’re connected via an internet exchange. It is often the case that when traffic to a single peer over an internet exchange reaches a certain level, the connection is moved to a private connection. So, if the growth of this traffic alone will trigger the need to upgrade the port on the exchange, it can make more sense to move. However, sometimes this is only the case for one of the two parties in the relationship. You can often find guidelines in the peering policy for when traffic should be moved from public to private peering.</p> <p>Private peering connections are connections where the traffic flows via either cross connects in a data center or sometimes via metro connections from one data center to another. It is common to share the cost of the cross connects. Often the requester pays for the first connection and then you take turns. But this is always part of the negotiations.</p> <p>Try to predict whether the changes you would like to implement are reasonable from a cost perspective for your partner. Things to consider here are the evolution of interface speeds. Maybe your network is transforming from 100G to 400G, but your partner takes a bit longer, so they will prefer adding more interfaces to the bundles – or vice versa. Try to have a solution ready for that discussion beforehand: Can we move to sites where the right equipment is in place? Can we split the cost of cross connects differently so the party that insists on bundles takes care of the cost alone?</p> <p>In the case of our example, a potential solution to covering the extra cost of adding a connection in a new location, even though the traffic is not growing much, could be to set up sessions on an IXP in the metro area. Another approach might be to offer to pay the cost of the cross connects no matter whose turn it is to cover it.</p> <h3 id="define-your-batna">Define your BATNA</h3> <p>The last thing to consider is your Best Alternative to a Negotiated Agreement (BATNA). What will happen if your peering partner does not want to accommodate the needs that your analysis has uncovered?
Is the as-is situation acceptable from your point of view, or would it be better to find alternative paths for your traffic by moving it to transit?</p> <p><a href="https://www.kentik.com/product/market-intelligence/">Kentik Market Intelligence (KMI)</a> can help you find a good transit provider for the NYC metro area.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6YRqFCuUyPUIeVNx9Ujh8s/b581477842157aac2fd6922a7f5319f4/kmi-rankings-nyc.png" style="max-width: 800px;" class="image center" thumbnail /> <p>Picking the number one or two provider in the rankings will likely secure good access for your customers to the consumers in the area, in this example New York.</p> <h3 id="conclusion">Conclusion</h3> <p>Now the game plan is set:</p> <ul> <li>We have identified and documented a hairpin routing issue in the current setup. </li> <li>The solution is a new connection in the NYC metro. </li> <li>Since traffic is growing very slowly, there is no immediate need for more capacity. So, if fixing the hairpin is not important to the peer, we can: <ul> <li>Offer to pay for the cross connects, or </li> <li>Suggest a session on an IXP in the metro. </li> </ul> </li> <li>If the status quo is unacceptable to us, our BATNA is to buy local transit and move the traffic to that connection. </li> </ul> <p>Want to know more about this topic? Tune in to my new webinar: <a href="https://www.kentik.com/go/webinar/peering-partner-business-review/">How to prepare for a peering partner business review</a>.</p><![CDATA[Using synthetics to get the big picture]]><![CDATA[When something goes wrong with a service, engineers need much more than legacy network visibility to get a complete picture of the problem. Here’s how synthetic monitoring helps.]]>https://www.kentik.com/blog/using-synthetics-to-get-the-big-picturehttps://www.kentik.com/blog/using-synthetics-to-get-the-big-picture<![CDATA[Phil Gervasi]]>Thu, 05 May 2022 04:00:00 GMT<p>Nobody actually cares about the network. Provocative words coming from a network visibility company, you might be thinking. However, consider what you’re doing right now. You’re reading a blog on a website, maybe clicking around other tabs, possibly streaming some music, and likely keeping an eye on your work chat. These are all applications, and <em>that’s</em> what we all truly care about, not the plumbing that delivers them.</p> <p>So when something goes wrong with a service we’re delivering to our customers, usually an application, engineers need much more than legacy network visibility information to get a complete picture of the problem. Interface statistics we get from SNMP help, and so does path information we get from flow data and routing tables. There’s definitely a good amount to gather from these methods, but engineers need the big picture.</p> <h3 id="putting-things-in-context">Putting things in context</h3> <p>The context of those interface statistics, flow logs, and routing tables (in other words, the context of visibility) is the application — and that’s the point of network observability. 
That means fixing a routing problem isn’t just an exercise in making the network hum again — it’s about getting back to delivering optimally performing applications to customers as soon as possible and without any more interruptions.</p> <p><strong>Synthetics</strong>, sometimes referred to as <a href="https://www.kentik.com/kentipedia/what-is-synthetic-monitoring/">synthetic testing or synthetic monitoring</a>, provides more specific application-focused visibility information than traditional MELT (metrics, events, logs and traces). Using synthetics, you can learn about an application’s page-load time, you can simulate a person logging into a website, or you can monitor specific SaaS application performance.</p> <p>A synthetic test is a way to generate traffic artificially to analyze a specific problem. That can be something very simple, like pinging application servers to monitor availability. Synthetic tests can also be network-focused, such as <a href="/blog/introducing-bgp-monitoring-from-kentik/">monitoring BGP peerings</a> in a geographic region. But the real power is when synthetic testing is application-focused, such as monitoring how long it takes for an application server to respond to a request, or simulating a person logging into an ecommerce site to make a purchase.</p> <p>Imagine trying to figure out why your car isn’t running well. It’s making a funny sound and it doesn’t drive smoothly. There’s a check-engine light to tell you there’s a general problem with the engine. There’s a dashboard indicator to tell you that you’re low on fuel. And in some cars there’s even an alert that your tires are low on air.</p> <p>But <em>why</em> is the car running poorly? Those few monitoring systems give us some helpful information, but it’s not until a mechanic test drives the car and runs specific and deliberate diagnostic tests that we can understand what the underlying problem is.</p> <h3 id="the-big-picture">The big picture</h3> <p>Since synthetic testing gives you greater context for your other metrics, it’s part of an overall visibility solution, not simply a feature you bolt on and forget about. Traffic flow data, device information, cloud logs, and routing tables are all very important, and each provides its own specific value. Synthetics then adds the context missing from simply looking at routing tables and cloud logs.</p> <p>For example, picture a poorly running application hosted in AWS and load-balanced across several regions. Your traditional network visibility information alerts you that the inbound interfaces on all but one AWS region are maxed out. As a result, all traffic is being redirected to one AWS region. Now you know there’s a problem, and you have a place to start troubleshooting.</p> <p>But just like the mechanic running diagnostics on your car, synthetics is powerful for troubleshooting to help you understand <em>why</em> you’re getting those alerts. Using <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a>, you can simulate a person logging into the application, see the DNS request and response, the application’s response time, and trace the traffic path. 
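</p> <p>Stripped to its essence, such a check is just a scripted DNS lookup plus a timed request. The short sketch below, using only the Python standard library, illustrates the idea; the hostname is a hypothetical placeholder for your application endpoint:</p> <pre><code>
import socket
import time
import urllib.request

HOST = "app.example.com"   # hypothetical application endpoint

# Step 1: DNS - which address (and therefore which region) is this
# vantage point being sent to?
addr = socket.gethostbyname(HOST)

# Step 2: HTTP - how long does the application take to respond?
start = time.monotonic()
with urllib.request.urlopen(f"https://{HOST}/", timeout=5) as resp:
    status = resp.status
elapsed_ms = (time.monotonic() - start) * 1000

print(f"{HOST} -> {addr}: HTTP {status} in {elapsed_ms:.0f} ms")
</code></pre> <p>Run the same probe from vantage points in different regions and compare the resolved addresses and timings.</p> <p>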
Then you can discover that DNS responses for the application are directing everyone to a single region, causing the application to feel slow.</p> <p>In a real-life network there would be more digging to do here, but this is an example of the application-specific visibility that gives context to flow data, cloud logs, and mundane information like interface statistics, so it all comes together.</p> <p>Whether it’s as simple as a ping test to check the availability of a resource or a complex test to simulate a person making a purchase on an ecommerce site, Kentik Synthetics provides a programmatic and deliberate method for getting beyond the alert and understanding <em>why</em> something is going on. Kentik Synthetics helps us focus on what we truly care about — the big picture.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p>Want to learn more? Start a 30-day trial and give Kentik Synthetics a try.</p> <p><button as="SignupButton">TRY KENTIK</button></p><![CDATA[4 steps to bring network observability into your organization]]><![CDATA[If you’re on the hook for the network that powers your organization, you may be hearing about network observability. This blog will help answer how network observability should become a part of your plans and what steps you can take.]]>https://www.kentik.com/blog/4-steps-to-bring-network-observability-into-your-organizationhttps://www.kentik.com/blog/4-steps-to-bring-network-observability-into-your-organization<![CDATA[Kevin Woods]]>Wed, 27 Apr 2022 04:00:00 GMT<p>The vast majority of corporate IT departments have a network monitoring solution. Typically that solution is built on standalone software platforms. If that’s you, this post is for you.</p> <p>You’re probably hearing a lot about “<a href="https://www.kentik.com/kentipedia/what-is-network-observability/">observability</a>” these days. Generally, that’s the ability to answer any question and explore unknown or unexpected problems to deliver great digital experiences to your users. Isn’t that what your traditional network monitoring solution is already doing? Well, not exactly.</p> <p>Making the network observable means gathering a wide range of network telemetry, organizing that data in a capable platform and building the analytics tools on top.</p> <div class="pullquote left" style="max-width: 220px;">“If you can’t observe it, you can’t manage it.”</div> <p>For many organizations, however, building network observability is a different type of challenge. For most, there is a need to evolve from or build on top of an existing monitoring practice, and to change or expand the monitoring technology. Just as importantly, it may also require a change in practices and even in the organization itself. One of my Kentik colleagues says, “If you can’t observe it, you can’t manage it.”</p> <p>So, how can you take on network observability? Where should you even start? Here are a few essential steps.</p> <h3 id="step-1-embrace-saas-based-network-observability">Step 1: Embrace SaaS-based network observability</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/4ViEdYvikOJ76xexX2PeA3/3390100790d4d66b883c453390a53af1/10-steps-saas.png" class="image right no-shadow" style="max-width: 300px;" alt="SaaS-based network observability" /> <p>There’s nothing wrong with a network monitoring solution built on standalone software platforms. In some environments, a non-cloud-based solution may even be necessary for good reasons. 
However, there are a lot of benefits to SaaS worth considering.</p> <p>A SaaS approach to network observability makes it easy to connect your observability tools to all networking components, including virtual assets that are difficult to monitor using physical appliances. SaaS-based network observability can scale seamlessly as your network grows in scope and complexity. And SaaS allows you to pay as you go.</p> <h3 id="step-2-broaden-your-network-horizon">Step 2: Broaden your network horizon</h3> <p>The productivity of your users is now dependent on far more than your in-house servers and devices. Your users are using SaaS applications running in clouds. More of your organization’s applications are running in the cloud. Your users are increasingly dependent on the performance of internet service providers, wireless mobile carriers, transit providers and the internet itself. As the network pro on the hook for the productivity of these users, you must now keep tabs on all of these — clouds, SaaS providers and network service providers. They all play a role in your success.</p> <h3 id="step-3-embrace-netops">Step 3: Embrace NetOps</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/1O73dqDxSjAe4cVxW3nNbn/638dab11e55d138424720eea2cb16ca4/netops-devops.png" class="image right no-shadow" style="max-width: 280px;" alt="NetOps" /> <p>Network management teams can’t operate in silos. They must continuously coordinate and communicate with software engineers, IT engineers, DevOps teams and other stakeholders. Through a collaborative strategy, different engineers can work together to ensure that network operations reinforce other IT operations and vice versa.</p> <p>NetOps (also known as DevNetOps) is an approach to network management that facilitates this type of collaboration. By using shared methodologies like observability — which is just as important to software developers and DevOps teams as to network management teams — network engineers and other technical stakeholders should speak the same language and work toward common goals.</p> <h3 id="step-4-join-me">Step 4: Join me</h3> <p>I’ve summarized a few directional items above, but the question remains: How do you put this into practice? I’d love to share some of the approaches you should take to move your IT organization toward network observability. And I’d like to show you some examples of how Kentik’s platform can help, too. Check out my webinar: <a href="https://www.kentik.com/go/webinar/10-steps-modernize-network-monitoring/">10 steps IT should take to modernize network monitoring</a> or <a href="https://www.kentik.com/get-demo/">reach out to us for a demo</a> to see for yourself.</p><![CDATA[Measuring RPKI ROV adoption with NetFlow]]><![CDATA[RPKI is the internet’s best defense against BGP hijacks. What is it? And how does it protect the majority of your outbound traffic from accidental BGP hijacks without posing a risk to legitimate traffic? 
]]>https://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflowhttps://www.kentik.com/blog/measuring-rpki-rov-adoption-with-netflow<![CDATA[Doug Madory, Job Snijders]]>Mon, 25 Apr 2022 04:00:00 GMT<div class="pullquote right" style="font-style: normal; font-weight: normal; font-size: 96%; max-width: 320px; margin-top: 10px; line-height: 140%;"> <b>Resource Public Key Infrastructure (RPKI)</b> is a routing security framework that provides a mechanism for validating the correct originating autonomous system (AS) and prefix length of a BGP route.<br /><br /> <b>Route Origin Authorization (ROA)</b> is a cryptographically signed object within the RPKI that asserts the correct originating AS and prefix length of a BGP route.</div> <p>For as long as the internet has existed, the challenge of securing its underlying protocols has persisted. The resulting lack of routing security, for example, has led to numerous BGP incidents such as hijacks and routing leaks that regularly result in misdirected traffic, dropped packets and increased latencies.</p> <p>Of the proposed technical solutions that have surfaced over the years, Resource Public Key Infrastructure (RPKI) has emerged as the internet’s best defense against BGP hijacks due to typos and other routing mishaps. While it doesn’t defend against every adverse routing event, RPKI reduces the impact of accidental originations, a fairly common occurrence.</p> <p>The challenge with any distributed security mechanism such as this one is that it only yields benefits when there is broad adoption. Many networks need to independently elect to create Route Origin Authorizations (ROAs) for their prefixes and configure their routers to reject RPKI-invalids.</p> <p>Networks deciding whether to participate face a “chicken or the egg” dilemma: why bother rejecting RPKI-invalid routes if few networks are creating ROAs, and why create ROAs if no one is rejecting invalids? However, the data presented in the following analysis suggests that we may have finally reached critical mass with RPKI — the point at which the benefit of participating outweighs the cost of implementation combined with the risk of inaction.</p> <div as="Promo"></div> <p>In just the last couple of years, the list of tier-1 network service providers who now reject RPKI-invalids has grown to include NTT, GTT, Arelion (Telia), Cogent, Telstra, PCCW and Lumen. In other words, an RPKI-invalid route has a difficult time propagating across the internet these days.</p> <p>At the same time, the number of ROAs, which assert the rightful origin and prefix length of a route, has been growing in recent years. As evidence of this, we can take a look at the chart below from <a href="https://rpki-monitor.antd.nist.gov/">NIST’s RPKI Monitor</a>. In this chart, we can see that the number of IPv4 BGP routes evaluated as RPKI-valid started increasing at a steady rate at the end of 2018.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1vN8nbIzIk1av0iS3SvzD3/7739c48c256f76e0b10a77bffa0fadbb/rpki-rov-analysis-history.png" style="max-width: 600px;" class="image center" /> <p>Every time a ROA is created for a BGP route, the count of RPKI-valid routes increases (green line) and the count of RPKI-unknown routes decreases (yellow line). 
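</p> <p>The evaluation behind those counts is simple enough to sketch. Under RFC 6811 origin validation, a route is valid if at least one covering ROA matches its origin AS and its prefix length does not exceed the ROA’s maximum length, invalid if one or more ROAs cover the prefix but none match, and unknown if no ROA covers it at all. A simplified illustration in Python, using hypothetical ROA entries:</p> <pre><code>
import ipaddress

# Hypothetical ROAs: (prefix, maxLength, authorized origin ASN).
ROAS = [
    (ipaddress.ip_network("192.0.2.0/24"), 24, 64500),
    (ipaddress.ip_network("198.51.100.0/22"), 24, 64501),
]

def rov_state(prefix_str, origin_asn):
    """Classify a route per the RFC 6811 origin validation states."""
    prefix = ipaddress.ip_network(prefix_str)
    covering = [roa for roa in ROAS if prefix.subnet_of(roa[0])]
    if not covering:
        return "unknown"    # no ROA covers this prefix at all
    for _, max_len, asn in covering:
        if asn == origin_asn and prefix.prefixlen <= max_len:
            return "valid"
    return "invalid"        # covered, but origin or length mismatch

print(rov_state("192.0.2.0/24", 64500))     # valid
print(rov_state("192.0.2.128/25", 64500))   # invalid: exceeds maxLength
print(rov_state("203.0.113.0/24", 64500))   # unknown: no covering ROA
</code></pre> <p>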
The chart also shows a red line hugging the x-axis, representing a minuscule number of persistently RPKI-invalid routes due to misconfigurations.</p> <p>There are two steps that must take place before an RPKI-invalid route can be rejected:</p> <ol> <li>The owner of the address space must create a ROA,</li> <li>Networks on the internet must reject routes that are evaluated as invalid.</li> </ol> <p>This analysis focuses on measuring how much progress has been made in that first step alone.</p> <p>As referenced above, there are several public tools for measuring ROA creation, such as NIST’s RPKI Monitor and RIPE’s <a href="https://stat.ripe.net/app/launchpad/">RIPEstat</a> utility. However, these tools are strictly based on BGP data, and those of us who regularly work with BGP know that not all routes are equivalent in terms of traffic.</p> <p>NIST’s RPKI Monitor presently reports that 34.89% of IPv4 routes and 34.28% of IPv6 routes are RPKI-valid based on published ROAs. The percentages of RPKI-unknown routes are 63.47% and 62.37% for IPv4 and IPv6 respectively. Regardless of protocol, the ratio of RPKI-unknown routes to RPKI-valid routes is roughly 2:1.</p> <img src="//images.ctfassets.net/6yom6slo28h2/u03eFPntpuxMR3lx1NLix/81962b01c0a752abe1d3ae8f08db0edb/rpki-rov-analysis.png" style="max-width: 600px;" class="image center" /> <p>But then the question becomes: <em>What proportion of overall traffic is safeguarded by that 34%?</em></p> <h2 id="a-quick-detour">A quick detour</h2> <p>A long, long time ago, long before the pandemic, my co-author Job Snijders (Principal Engineer at Fastly) was working to allay an initial hesitation around RPKI adoption. The worry was that rejecting RPKI-invalid routes would lead to the loss of important customer traffic due to persistent RPKI-invalid routes and would thus be unacceptable to a business.</p> <p>In February 2019, Job worked with Paolo Lucente (they were both at NTT at the time) to extend the <a href="http://www.pmacct.net/">pmacct</a> network analysis tool to combine NetFlow analysis with RPKI evaluation to remove any guesswork about what traffic might actually be dropped. Job summarized their work in <a href="https://mailman.nanog.org/pipermail/nanog/2019-February/099522.html">an email</a> to the NANOG list. The email ended with a challenge to Kentik to incorporate this functionality into its product for the benefit of Kentik’s customers as well as the greater internet.</p> <p>Kentik heeded the challenge and within a few months <a href="https://www.kentik.com/blog/bgp-and-rpki-a-path-made-clear-with-kentik/">added RPKI evaluation</a> to its NetFlow analysis platform, enabling a user to explore what specific traffic would be evaluated by RPKI as valid, unknown and invalid. The conclusion of users following this line of inquiry is that dropping persistently RPKI-invalid routes does not lead to loss of important traffic.</p> <p>In March 2019, <a href="https://events.dknog.dk/event/4/contributions/37/attachments/15/25/DKNOG9_Snijders_Routing_security_roadmap1.pdf">Job presented their preliminary findings</a> based on NTT traffic at DKNOG-9 and included the following chart showing that the majority of traffic was destined for RPKI-unknown routes (blue), while traffic to RPKI-valid routes (orange) was a distinct minority. 
As expected, traffic to RPKI-invalid routes (red) was minuscule.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2VkQPU4nODWuVppAF3c7dJ/1329ca174a42c8f18ba26cebba76b974/rpki-routes-ntt-traffic.png" style="max-width: 600px;" class="image center" /> <p>Job’s provocative claim in that presentation was that “not everyone needs to do RPKI.” His point was that given the consolidation of the internet industry, only a few major players (content providers and eyeball networks) needed to deploy RPKI before we started seeing large benefits. Keep this claim in mind as we review the following results.</p> <div as="WistiaVideo" videoId="qpfh2174t5" audio></div> <h2 id="kentiks-perspective">Kentik’s perspective</h2> <p>How can Kentik help our understanding of RPKI adoption? Kentik has hundreds of customers providing live feeds of NetFlow, and almost half of those have opted in to allow their data to be used as part of aggregate analysis.</p> <p>It is important to note that any resulting analysis based on this data is subject to the biases of Kentik’s customer set, which includes network service providers, content delivery networks and large digital enterprises. It’s also skewed towards the United States, where the majority of our customers are based. These caveats aside, this large NetFlow dataset is invaluable for understanding broader developments on the internet, whether they be <a href="https://www.kentik.com/blog/myanmar-goes-offline-during-military-coup/">military coups</a> or <a href="https://www.kentik.com/blog/facebooks-historic-outage-explained/">historic social media outages</a>.</p> <p>As mentioned earlier, in response to Job’s challenge, Kentik incorporated RPKI evaluation into the intake process of NetFlow to allow our users to answer the question about dropped RPKI-invalid traffic. However, the flexibility of Kentik’s analysis platform enables us to answer other questions, such as the one from above…</p> <h2 id="what-proportion-of-traffic-goes-to-rpki-valid-routes">What proportion of traffic goes to RPKI-valid routes?</h2> <p>Just like the pmacct extension that Job and Paolo built, Kentik tracks four outcomes of RPKI evaluation:</p> <ol> <li>Valid</li> <li>Unknown</li> <li>Invalid</li> <li>Invalid - but covered by valid/unknown</li> </ol> <p>Case #4 is when an RPKI-invalid route has a covering prefix that wouldn’t be rejected because it was evaluated as either valid or unknown. The result is that the destination address space is reachable through the covering prefix. This case only exists in the analysis plane and is not part of any IETF standard on BGP or RPKI.</p> <p>Over a week-long period of analysis at the beginning of February, we observed the following breakdown of all of our aggregate NetFlow, measured in bits/sec, as shown here:</p> <img src="//images.ctfassets.net/6yom6slo28h2/RXBfaD7hWQfl2gp54lNZU/ac55ead76c8363766ec6a6fb81e60626/internet-traffic-rpki.png" style="max-width: 600px;" class="image center" /> <p>If this breakdown is truly representative of the broader internet, then the majority of the internet’s traffic goes to BGP routes secured by ROAs. 
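</p> <p>Mechanically, a breakdown like this is a weighted aggregation: each flow record is tagged with the RPKI evaluation of its destination route, and the outcomes are then summed by bits/sec rather than counted by route. The toy illustration below uses fabricated records to show how a minority of routes can still carry a majority of bits:</p> <pre><code>
from collections import defaultdict

# Fabricated flow records: (destination prefix, ROV outcome, bits/sec).
# "invalid_covered" is the analysis-plane case #4 described above.
FLOWS = [
    ("198.51.100.0/24", "valid", 620_000_000),
    ("203.0.113.0/24", "valid", 410_000_000),
    ("192.0.2.0/24", "unknown", 55_000_000),
    ("100.64.8.0/22", "unknown", 18_000_000),
    ("172.16.4.0/24", "invalid_covered", 3_000_000),
    ("10.99.0.0/16", "invalid", 400_000),
]

bits = defaultdict(int)
routes = defaultdict(set)
for prefix, outcome, bps in FLOWS:
    bits[outcome] += bps
    routes[outcome].add(prefix)

total = sum(bits.values())
for outcome in sorted(bits, key=bits.get, reverse=True):
    share = bits[outcome] / total
    print(f"{outcome:16s} {len(routes[outcome])} routes, {share:6.1%} of bits")
</code></pre> <p>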
That real-world breakdown paints a far rosier picture than the NIST RPKI Monitor stats and suggests we’ve come a long way from the NTT traffic chart in Job’s <a href="https://events.dknog.dk/event/4/contributions/37/attachments/15/25/DKNOG9_Snijders_Routing_security_roadmap1.pdf">DKNOG-9 presentation</a>.</p> <h2 id="digging-deeper-into-the-data">Digging deeper into the data</h2> <p>So how is it that a <em>majority of traffic</em> can be destined to <em>a minority of RPKI-valid BGP routes</em>? To answer this question, we need to identify the companies responsible for these RPKI-valid routes.</p> <p>Let’s explore the internet’s largest market, the United States. According to <a href="https://stat.ripe.net/app/launchpad/">RIPEstat</a>, RPKI-valid routes account for 25.15% of U.S. IPv4 address space and 19.99% of U.S. IPv6 address space. When we look at our aggregate NetFlow destined to the U.S., we see that 58.5%, a clear majority, of bits/sec goes to RPKI-valid routes.</p> <p>If we decompose that 58.5% further, we can see that the companies that account for the most traffic are a collection of major U.S. eyeball networks (Comcast and Spectrum), as well as content providers (Amazon, Google and Cloudflare). All of these companies completed major RPKI deployments in recent years.</p> <p>These companies might not account for the majority of U.S. BGP routes, but they do account for a large portion of U.S. traffic. Not everyone needed to deploy RPKI before we started seeing benefits.</p> <h2 id="conclusion">Conclusion</h2> <p>Deploying RPKI is a best current practice — both creating ROAs for your routes and rejecting RPKI-invalid routes. Based on the analysis above, rejecting RPKI-invalid routes likely protects the <em>majority of your outbound traffic</em> from accidental BGP hijacks without posing a risk to legitimate traffic.</p> <p>In addition, another best current practice is to avoid modifying LOCAL_PREF or BGP communities based on validation states. The risk here is that if a validator were to crash, all RPKI-valid states would be re-classified as RPKI-unknown, and potentially thousands of new BGP routes would be simultaneously announced, causing a dangerous level of BGP churn. This is a scenario Job expounded upon in a <a href="https://sobornost.net/~job/Snijders_CERN_breaking_internet.pdf">recent talk</a> at CERN.</p> <p>As always, if you’d like to explore how your company’s traffic would fare after deploying RPKI, <a href="https://www.kentik.com/go/get-started/">sign up for a trial</a> and take a look at the results.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/dLd27cJo8Ds?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> </div> <p>This blog post was based on a recent presentation at NANOG 83. Watch the full video.</p><![CDATA[The evolution of network visibility]]><![CDATA[In the old days, we used a combination of several data types, help desk tickets, and spreadsheets to figure out what was going on with our network. Today, that’s just not good enough. The next evolution of network visibility goes beyond collecting data and presenting it on pretty graphs. 
Network observability augments the engineer and finds meaning in the huge amount of data we collect today to solve problems faster and ensure service delivery.]]>https://www.kentik.com/blog/the-evolution-of-network-visibilityhttps://www.kentik.com/blog/the-evolution-of-network-visibility<![CDATA[Phil Gervasi]]>Wed, 20 Apr 2022 04:00:00 GMT<p>In the old days, it took a bunch of help desk tickets for an engineer to realize there was something wrong with the network. At that time, troubleshooting meant logging into network devices one by one to pore over logs.</p> <p>In the late 80s, SNMP was introduced, giving engineers a way to manage network devices remotely. It quickly became a way to also collect and manage information about devices. That was a big step forward, and it marked the beginning of network visibility as we know it today.</p> <p>In 1995, <a href="https://www.kentik.com/kentipedia/netflow-overview/">NetFlow</a> was released, giving us a way to send network traffic information directly from our routers and switches to a centralized flow collector. Flow data was another big step forward in the type of data we could collect, especially because we could get information about application activity rather than just CPU and link utilization.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2rRbHkpFTVYByOUlx91OA/a5d6f243770134ee743ad8ccf5832874/bob-dylan-times.jpg" style="max-width: 200px; border: 1px solid #efefef;" class="image right no-shadow" /> <p>Then we saw the first dashboards display SNMP alerts and network flow data on colorful graphs and charts. The thing is, this is where the industry stopped for a while. We got better at SNMP and analyzing flows, graphs got prettier, and menus got snappier. But behind the scenes it was still just a collection of SNMP traps and flow collectors. To be fair, it worked well enough for the time period, but as the great networking guru Bob Dylan once said, “the times they are a-changin’.”</p> <h3 id="network-visibility-had-to-evolve">Network visibility had to evolve</h3> <div class="pullquote right" style="font-style: normal;">Network visibility needed to evolve because the technology we need visibility into changed.</div> <p>Today’s needs are just a little bit different from the needs we had in the mid 90s. OK, so maybe that’s a bit of an understatement. It was only a few years ago when public cloud and SaaS took off. Network overlays like <a href="/kentipedia/sd-wan-software-defined-networking-defined-and-explained/">SD-WAN</a> became standard rather than corner-case, and hipster tech like containerization became mainstream. Network visibility needed to evolve because the technology we need visibility into changed.</p> <p>So how do we get visibility into parts of the network that don’t provide us much in the way of SNMP or flows? How do we get visibility into parts of the network that we don’t even manage or own? And how does a mere mortal make sense of this enormous collection of data?</p> <p>Today, we need to collect everything and anything. Old school SNMP information and flow data, public cloud logs from AWS, Azure and Google, streaming telemetry direct from our beloved routers, and metadata gathered by looking at DNS responses make up the bulk of our visibility databases. And once we have it all in one place, we can smush it all together to analyze programmatically.</p> <h3 id="the-next-step-is-network-observability">The next step is network observability</h3> <p>“Programmatic” is a loaded term. 
It conjures thoughts of Python scripts and Ansible playbooks, but in terms of visibility, it means much more than that. The next evolution of network visibility is a programmatic approach to <em>finding meaning</em> in huge amounts of diverse data. “Finding meaning” may sound like a spiritual pilgrimage, but it really just boils down to helping engineers find the root cause of a service delivery problem faster and automatically.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7ciU3jCvUp4vTk2RacGEXc/e7470dfa3dec0f78e99e2b3b773ec1bc/finding-meaning-in-data.jpg" style="max-width: 500px;" class="image center" /> <p>Think about how many data points exist in a database of network metrics, especially if it’s data collected over time. There are probably millions of timestamps, alerts on CPU utilization, flow logs, configuration files, AWS VPC flow logs, Google system logs, packet information, ephemeral information like interface statistics, and information derived from synthetic tests. That’s a big list. Understanding how everything relates to everything else is an overwhelming task for an engineering team.</p> <p>The next step in network visibility is <a href="https://www.kentik.com/kentipedia/what-is-network-observability/">network <em>observability</em></a>, which means finding correlation among data points, inferring visibility where it’s difficult or impossible, and deriving meaning from the data beyond associating a problem with a timestamp. A dedicated team of engineers and data scientists might be able to do this given enough time, but network observability gets us there faster and, in theory, with more insight than even a skilled engineer could provide. It solves the problem of how to handle today’s huge amount and variety of network data.</p> <h3 id="augmenting-the-engineer">Augmenting the engineer</h3> <div class="pullquote right" style="max-width: 350px; font-style: normal;">Network observability answers <em>why</em> something is happening on the network, not just that it <em>is</em> happening.</div> <p>This means network observability doesn’t <em>replace</em> legacy network visibility. Instead, network observability is built on the foundation of years of network visibility technology alongside the latest data types and methods. It automates correlation, performs anomaly detection, provides actionable insight, and answers <em>why</em> something is happening on the network, not just that it <em>is</em> happening.</p><![CDATA[Working with Cloudflare to mitigate DDoS attacks]]><![CDATA[Kentik and Cloudflare have a new integration. Read why and how we've expanded our partnership for on-demand DDoS protection.]]>https://www.kentik.com/blog/cybersecurity-cloudflare-and-kentik-mitigate-ddos-attackshttps://www.kentik.com/blog/cybersecurity-cloudflare-and-kentik-mitigate-ddos-attacks<![CDATA[Stephen Condon]]>Wed, 13 Apr 2022 04:00:00 GMT<p>The rolling thunder of cybersecurity warnings has built to a crescendo this year. <a href="https://www.helpnetsecurity.com/2022/03/28/ddos-attacks-2021/">According to HelpNetSecurity</a>, cybercriminals launched over 9.75 million DDoS attacks in 2021. The <a href="https://blog.cloudflare.com/ddos-attack-trends-for-2022-q1/">Cloudflare Attack Trends 2022 Q1 Report</a> published yesterday shows an alarming increase in application-layer DDoS attacks. 
And our own Doug Madory has been <a href="https://www.kentik.com/analysis/">sharing analysis</a> on the impact of cyberattacks, too.</p> <p>In May of 2020, Kentik and Cloudflare jointly announced that we’d be integrating Kentik Protect with Cloudflare Magic Transit. The success of this integration and growing customer interest has led us to strengthen that relationship. Starting today, the Cloudflare team will now be able to resell Kentik directly, offering customers the ability to buy and deploy a fully integrated on-demand DDoS mitigation solution.</p> <p>The Kentik Network Observability Platform monitors our customers’ infrastructure and, when a DDoS attack is detected, activates Magic Transit DDoS protection. Cloudflare ingests all traffic directed to customer networks, filters out DDoS traffic at the edge, and delivers clean IP traffic to the customer, preventing adverse impact to users.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2Hz9bgKP9PKMh3n8qncz4U/c419d016af064dba9a4f675e24802a33/kentik-cloudflare-2022.png" style="max-width: 700px;" class="image center" alt="Kentik and Cloudflare mitigate DDoS attacks" /> <p><a href="https://www.cloudflare.com/magic-transit/">Cloudflare Magic Transit</a> and <a href="https://www.kentik.com/product/protect/">Kentik Protect</a> communicate via a REST API to trigger on-demand mitigation. Cloudflare Magic Transit has data centers in 250+ cities and over 120 Tbps of edge capacity, and can absorb and neutralize even the largest volumetric DDoS attacks, protecting any network from collateral damage and ensuring application availability for a business’ customers and users.</p> <h3 id="advancing-cybersecurity">Advancing cybersecurity</h3> <p>Kentik Protect leverages powerful network analytics to deliver alerts and insights so you can mitigate DDoS attacks in real time. We have the most accurate and observable service for detecting volumetric and application-based attacks.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/29hP0hO5PxVB21QoI6A5Ax/3e7ca71dcfe8f3b7d02c6db94a5017f7/security-protect-against-botnets.png" style="max-width: 800px; margin-bottom: 15px;" thumbnail class="image center" /> <div class="caption">Analyze botnet traffic characteristics and trends over time.</div> <h3 id="bgp-monitoring">BGP monitoring</h3> <p>It’s not just DDoS attacks that concern those charged with protecting the network from cyber threats. Another threat that never seems to go away is BGP routing insecurity. In fact, just last month, Doug Madory posted about the latest BGP <a href="https://www.kentik.com/analysis/bgp-hijack-of-twitter-by-russian-isp/">hijack of Twitter</a>.</p> <p>Although the recent successes in the adoption of Resource Public Key Infrastructure (RPKI) have helped to limit the impacts of BGP hijacks, monitoring is necessary to remediate any routing mishaps that can arise. BGP hijacks and leaks can misdirect internet traffic causing serious disruptions to application and service delivery.</p> <p>Today’s BGP monitoring services rely almost exclusively on public data sources like RouteViews and RIPE RIS. However, Kentik is uniquely positioned to take BGP monitoring to the next level by extending those public data sources with thousands of private (anonymized) BGP feeds from our customers. The coverage offered by the BGP data used in our service will be unmatched by any BGP monitoring service in existence. 
We are currently investing considerable resources in building our BGP monitoring service to use this larger data set to detect routing anomalies like hijacks and route leaks, check RPKI statuses of routes, as well as offer an innovative visualization technique that will allow new types of analysis:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1rTqgMEi0oI2em4VDIDQCe/be7fb944c6ab8ff7c6f397f98a6ac90e/bgp-visualization.png" style="max-width: 800px;" class="image center" thumbnail /> <p>You can read more about our current BGP monitoring capabilities <a href="https://www.kentik.com/blog/introducing-bgp-monitoring-from-kentik/">in this post</a>.</p> <h3 id="simplifying-network-security">Simplifying network security</h3> <p>Kentik and Cloudflare both aim to simplify network security for our customers. Magic Transit makes it easier for network professionals to know their network is protected. Now, we have extended that simplicity not only to detecting and mitigating DDoS attacks, but also to purchasing and integrating Magic Transit with Kentik. Combining Kentik’s BGP monitoring and DDoS detection with Cloudflare Magic Transit DDoS mitigation will go a long way to ensuring your infrastructure has the best protection possible.</p> <p>We’ll be joining Cloudflare for the webinar <a href="https://gateway.on24.com/wcc/eh/2153307/lp/3735104/?partnerref=Kentik">DDoS attack trends and predictions for 2022</a> on April 28 at 10am PDT. For more information on the partnership and technical integration <a href="https://www.kentik.com/get-demo/">get in touch</a>.</p><![CDATA[Network AF, Episode 13: Talking networking and PR with Ilissa Miller ]]><![CDATA[In this episode of the Network AF podcast, Avi Freedman connects with Ilissa Miller, network whisperer and PR industry veteran. ]]>https://www.kentik.com/blog/network-af-episode-13-talking-networking-and-pr-with-ilissa-millerhttps://www.kentik.com/blog/network-af-episode-13-talking-networking-and-pr-with-ilissa-miller<![CDATA[Maria Martinova]]>Fri, 08 Apr 2022 04:00:00 GMT<p><CastedPlayer episodeId="789f5c39" smallPlayer></CastedPlayer></p> <hr> <p><a href="https://www.kentik.com/network-af/"><img src="https://images.ctfassets.net/6yom6slo28h2/7zWckfdyC5g6QDbYQPMFxW/5886d9d4decda1353ed8b99067ab1297/ilissa-miller.jpg" style="max-width: 200px;" class="image right" alt="Ilissa Miller" /></a></p> <p>In this episode of the <a href="https://www.kentik.com/network-af/">Network AF podcast</a>, Avi Freedman connects with Ilissa Miller, network whisperer and PR industry veteran. Ilissa and her team translate technology into business terms by helping clients understand the value and functionality of a company. Avi asks Ilissa how she got into the field, her biggest takeaways that helped launch her own business and what’s important in today’s networking world.</p> <p>Today, Ilissa is the CEO of iMiller Public Relations and president of NEDAS, a company that covers convergence for wireline and wireless communications infrastructure. 
She also works for a company called DE-CIX, an internet-exchange operator, supporting their partner and marketing initiatives in North America.</p> <p>Ilissa helps businesses grow and market their products, educate the masses on new technologies and provide context and understanding on what matters in a constantly changing industry.</p> <p>Topics Avi and Ilissa cover:</p> <ul> <li>Gaining wisdom: The difference between watching, learning and executing</li> <li>Which is the more valuable set of learning in a company: Technology or people?</li> <li>The biggest problem in the networking industry today</li> </ul> <h3 id="gaining-wisdom-the-difference-between-watching-learning-and-executing">Gaining wisdom: The difference between watching, learning and executing</h3> <p>Ilissa started out as a consultant in the networking space, under the guidance of her mentor, David Mayer. She learned by watching how strategic relationships are formed and how deals are structured. She talks about striking out on her own and the trial by fire of being an individual executive responsible for business development and marketing.</p> <p>Timestamp: 10:52</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/efc9b4e0"></iframe> <p>Ilissa shares that her past experience as a product manager really helped her navigate the inner workings of a company. As a product manager, you have to work with all the departments (operations, legal, sales), and you cannot bring a product to market without all of those key functions. She wanted to understand how everything worked in symphony because that is the key to helping companies. If you know how companies work, you can get the right team aligned to help move the business forward.</p> <h3 id="which-is-the-more-valuable-set-of-learning-in-a-company-technology-or-people">Which is the more valuable set of learning in a company: Technology or people?</h3> <p>What actually drives a company forward? Is it the people and how they organize, or is it being well versed in engineering? Ilissa believes that becoming fluent in the dynamics of people and understanding how to position your needs is what works best in today’s society.</p> <p>In earlier industry days, people would hoard information because they felt that was their value. Being a living legacy system was job security. Ilissa points out that in today’s paradigm, it takes a certain psychological finesse to share information quickly and be understood.</p> <p>Avi shares a similar story from the startup world. People are afraid to share their idea for fear that someone will steal it, not realizing that the clincher is in the execution. It requires grit and human beings coming together to make an idea real.</p> <p>Timestamp: 15:07</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/6b78d210"></iframe> <h3 id="the-biggest-problem-in-the-networking-industry">The biggest problem in the networking industry</h3> <p>Ilissa cautions that assumptions are a detriment to both business and progress. Case in point: we can assume that Akamai is a CDN and they do X, Y and Z, but the assumptions that we have may be based on outdated information or constructs that no longer exist. This grinds communication to a halt and calcifies new opportunities.</p> <p>There is also a translation that needs to happen within teams. Practitioners speak in features and technical know-how, and executives speak in business value. 
When the two are not speaking the same language, mutual curiosity is necessary to gain perspective and break new ground.</p> <p>Enter the importance of a PR firm that knows how to translate technical terms into business value. Ilissa traverses the gap between technology and understanding by valuing the evolution of the industry and training teams to grasp the bigger picture.</p> <p>A company has to decide on and maintain its messaging, even through the inevitable ebbs and flows of growth. Ilissa focuses on balancing what the company thinks might be important and what the marketplace actually wants. Therein lies the secret weapon of good PR.</p> <p>Tune in to this episode to learn more about insider PR strategy, the biggest problem the industry faces today and the key to moving your company forward in today’s world.</p><![CDATA[Anatomy of an OTT traffic surge: 2022 Men's NCAA Basketball Championship]]><![CDATA[Last night, Kansas topped the University of North Carolina in a thrilling come-from-behind victory to win their fourth championship in men's college basketball. It was also notable in how viewers saw the game. Instead of being aired on network television like in the past, the game was carried on TBS, requiring viewers to have either a cable TV package or a streaming service to watch the game. Here's what we saw.]]>https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-2022-mens-ncaa-basketball-championshiphttps://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-2022-mens-ncaa-basketball-championship<![CDATA[Doug Madory]]>Tue, 05 Apr 2022 16:00:00 GMT<p>Last night, Kansas topped the University of North Carolina in a thrilling come-from-behind victory to win their fourth championship in men’s college basketball. It was also notable in how viewers saw the game. Instead of being aired on CBS (network television), the game was carried on TBS, requiring viewers to have either a cable TV package or a streaming service to watch the game. Here’s what we saw.</p> <h3 id="ott-service-tracking">OTT Service Tracking</h3> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/" title="Learn more about the Kentik Service Provider Analytics product">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/" title="Gaming as an OTT service: Virgin Media reveals that Call Of Duty: Warzone has the “biggest impact” on its network">Call of Duty update</a> or a <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday/">Microsoft Patch Tuesday</a>, these OTT traffic events can put a lot of load on a network, and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. Knowing the source and destination IPs in the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. 
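</p> <p>In sketch form, the missing piece is a join: DNS answers tell you which service a given server IP was handed out for, and flow records are then attributed to services through that mapping. A simplified illustration of the idea follows; every service name, address, and rate in it is fabricated:</p> <pre><code>
from collections import defaultdict

# Built from observed DNS answers: which OTT service (and which CDN) a
# given server IP was handed out for. Every entry here is fabricated.
DNS_IP_TO_SERVICE = {
    "198.51.100.10": ("StreamingServiceA", "CDN-1"),
    "198.51.100.11": ("StreamingServiceA", "CDN-2"),
    "203.0.113.7": ("StreamingServiceB", "CDN-1"),
}

# Flow records: (source IP of traffic toward subscribers, bits/sec).
FLOWS = [
    ("198.51.100.10", 800_000_000),
    ("198.51.100.11", 350_000_000),
    ("203.0.113.7", 120_000_000),
    ("192.0.2.99", 10_000_000),    # no DNS match: left unattributed
]

traffic = defaultdict(int)
for src_ip, bps in FLOWS:
    service, cdn = DNS_IP_TO_SERVICE.get(src_ip, ("unattributed", "n/a"))
    traffic[(service, cdn)] += bps

for (service, cdn), bps in sorted(traffic.items(), key=lambda kv: -kv[1]):
    print(f"{service:18s} via {cdn:5s} {bps / 1e6:7.0f} Mbps")
</code></pre> <p>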
This is why DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services and answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p><a href="https://www.kentik.com/resources/kentik-true-origin/" title="Learn more about Kentik True Origin">Kentik True Origin</a> is the engine that powers the OTT Service Tracking workflow. True Origin detects and analyzes the DNA of over 540 categorized OTT services and providers and more than 50 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p> <h3 id="kansas-tops-unc-in-a-thriller">Kansas tops UNC in a thriller</h3> <p>Many viewers watched last night’s men’s championship game using one of several OTT streaming services. One of the more popular options, based on Kentik data, was Sling TV.</p> <p>As illustrated below in a screenshot from Kentik’s Data Explorer view, traffic to Sling TV surged as hoops fans used the streaming service to watch the game. As soon as the game ended, traffic dropped precipitously as viewers signed off and headed to bed. There was even a little dip during halftime. Fastly and Akamai were the top CDNs handling delivery of Sling TV’s traffic to users.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6EDr8vsYGi8fZBRK0nIFWB/145dae7df6c455885dab65242d346006/Sling_TV_NCAA_Men-s_Bball_Championship.png" style="max-width: 500px; margin-bottom: 15px;" class="image center" thumbnail alt="Sling TV OTT traffic analyzed with Kentik" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Sling TV OTT traffic analyzed with Kentik</div> <p>When broken down by connectivity type, we see that Kentik customers delivered the basketball game on Sling TV to their subscribers from a variety of sources, including transit (45.4%), private peering (39.4%), embedded cache (14.2%), and IXP (0.9%). 
Usually, CDNs with a last-mile cache embedding program heavily favor embedded caches over other connectivity types, since embedding allows:</p> <ul> <li>The ISP to save transit costs</li> <li>The subscribers to get demonstrably better last-mile performance</li> </ul> <p>In this case, the fact that embedded cache traffic is a small proportion of the overall delivery implies that some ISPs from this dataset have maxed out their embedded caches.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3jjcahdnA0DcLPWxFb4HfX/778e3ab4b43adbc2ef8c2e2b650f74cf/Sling_TV_NCAA_Mens_Bball_Championship_by_connectivity_type.png" style="max-width: 500px; margin-bottom: 15px;" class="image center" thumbnail alt="Sling TV OTT traffic analysis by source" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Sling TV OTT traffic analysis by source</div> <p>In addition to source CDN and connectivity type, users of Kentik’s OTT Service Tracking are also able to break down traffic volumes by subscribers, specific router interfaces and customer locations.</p> <h3 id="how-does-ott-service-tracking-help">How does OTT Service Tracking help?</h3> <p>In July, my colleague Greg Villain <a href="https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update/" title="Learn more about recent OTT service tracking enhancements">described the latest enhancements</a> to our OTT Service Tracking workflow, which allows providers to plan and execute on what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like the release of a blockbuster movie on streaming can have impacts in all three areas. OTT Service Tracking is the key to understanding and responding when they occur. Learn more about the application of Kentik for <a href="https://www.kentik.com/solutions/usecase/network-business-analytics/" title="Network Business Analytics Use Cases for Kentik">network business analytics here</a>.</p> <p>Ready to improve over-the-top service tracking for your own networks? Start a <a href="#signup_dialog" title="Start a free trial of Kentik">free trial</a> of Kentik today.</p><![CDATA[Synthetics 101 - Part 2: Protecting and growing revenue with proactive monitoring]]><![CDATA[In the second post of this multi-part series, Kentik’s Anil Murty will share how Kentik Synthetics helps you proactively monitor applications and services and troubleshoot the root cause of issues faster.]]>https://www.kentik.com/blog/synthetic-monitoring-101-part-2https://www.kentik.com/blog/synthetic-monitoring-101-part-2<![CDATA[Anil Murty]]>Wed, 30 Mar 2022 04:00:00 GMT<p>In <a href="https://www.kentik.com/blog/synthetics-101-how-to-drive-better-business-outcomes/">part 1</a> of our synthetics series, we looked at tracking network performance to drive better business outcomes. Here in part 2 of our series, we’ll dig into the very first and most basic business outcome of using <a href="https://www.kentik.com/kentipedia/what-is-digital-experience-monitoring/">digital experience monitoring (DEM)</a>. 
That is, we’ll look at how to protect and grow revenue by proactively monitoring the health, availability and uptime of your critical applications and services, so you can fix issues before your customers’ experience suffers.</p> <h3 id="every-type-of-business-benefits-from-proactive-monitoring">Every type of business benefits from proactive monitoring</h3> <p>Whatever type of business you run these days, it likely has one or more digital components to it. And if these digital components involve interactions with customers, then your business directly depends on them being responsive and performant. If your “customers” are internal employees, then slow applications or services lead to frustration and interruptions, impacting business operations and productivity. If they’re external customers, slowness can lead to direct loss of revenue. Let’s look at some examples of how Kentik’s customers (network engineers, operators and administrators at corporations, digital businesses and service providers) use <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a> to ensure their customers get the best digital experience.</p> <h3 id="corporate-it-teams">Corporate IT teams</h3> <h4 id="monitoring-saas-applications-proactively">Monitoring SaaS applications proactively</h4> <p>Think about the various applications you use today for your daily work: two out of three (if not more) are likely software-as-a-service (SaaS) applications. Apart from reducing the capital expenditure (CapEx) a company needs to take on, the SaaS delivery model has done amazing things for worker productivity. It significantly shortens the time it takes to onboard new employees to the organization, ensures the right level of access and security (through SSO), and provides the best features (through auto-app updates). The challenge, though, is that for the network teams that have to support the organization, the burden of ensuring connectivity and quality of service shifts from the local, controlled network to that network plus the “general internet.” Throw remote work into the mix, and the challenge becomes even more daunting.</p> <p>When an issue occurs, the first question on the corporate IT team’s mind is: “Is it us? Or is the <b>&#x3C;insert SaaS application name></b> down for everyone?”</p> <p>To make this question simple to answer, Kentik runs synthetic performance tests to the most common SaaS applications, from 15 global locations, every minute and makes this data available to all Kentik customers for free!</p> <img src="//images.ctfassets.net/6yom6slo28h2/6bEEk8RoXao3nZ7Sm1h52C/9ae3923c665c738a36049f9de159bf65/state-of-the-internet-saas-apps.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="State of the Internet dashboard"/> <div class="caption">This “State of the Internet” dashboard shows availability, connectivity and performance of common SaaS applications from all parts of the world.</div> <p>Each test above performs a GET request to the HTTP endpoint that the user (employee) would access for the specific application. In addition, the test also performs a network-layer test (ICMP ping + traceroute) to the IP address resolved for the host that the HTTP server is running on or behind. 
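</p> <p>That layering is what makes the result actionable. A stand-alone sketch of the same idea is below; a TCP connect stands in for the ICMP probe (which requires raw sockets), and the endpoint is a hypothetical placeholder:</p> <pre><code>
import socket
import time
import urllib.request

def check(host):
    """Probe one endpoint a layer at a time: DNS, then network, then HTTP."""
    try:
        addr = socket.gethostbyname(host)              # DNS layer
    except OSError:
        return "DNS failure"
    try:
        t0 = time.monotonic()
        socket.create_connection((addr, 443), timeout=3).close()
        net_ms = (time.monotonic() - t0) * 1000        # network layer
    except OSError:
        return f"network failure reaching {addr}"
    try:
        t0 = time.monotonic()
        urllib.request.urlopen(f"https://{host}/", timeout=5).close()
        http_ms = (time.monotonic() - t0) * 1000       # HTTP layer
    except OSError:
        return f"HTTP failure (network OK at {net_ms:.0f} ms)"
    return f"OK: connect {net_ms:.0f} ms, HTTP {http_ms:.0f} ms"

print(check("app.example-saas.com"))   # hypothetical SaaS endpoint
</code></pre> <p>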
If any part of that stack (network, DNS or HTTP layer) slows down or fails, the test catches it immediately and is able to show you whether the failure was due to the network or at a higher layer.</p> <h4 id="monitoring-vpn-connectivity-proactively">Monitoring VPN connectivity proactively</h4> <p>With the rise of remote work, VPN connections have become critical to connecting employees to mission-critical applications. Suboptimal network paths to a VPN gateway can impact network performance, which in turn can cause employee productivity to suffer.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4XJFlOvIEyTUd4VWoJuFWv/22ea347bb92d460d36e5da23a9b96487/vpn-connectivity-paths.png" style="max-width: 600px;" class="image center" thumbnail /> <p>Kentik offers both private and public agents that can be used to test connectivity from the VPN gateway to the upstream services, as well as from the VPN gateway downstream to agents located in the broadband ISP networks where the employees reside. In addition to running IP connectivity checks, these tests also provide a view into the paths that the packets take and show points of failure or slowness (grouped by ASN) so that you know who to file that ticket for!</p> <img src="//images.ctfassets.net/6yom6slo28h2/11cakOEwVrNAl7DZDUYOje/f7a89da354834c63d2626facb8e5f21f/path-view-packets.png" style="max-width: 800px; margin-bottom:15px;" class="image center" thumbnail /> <div class="caption">Traceroute-driven path visualization lets you quickly identify slow links and then lets you zoom out and see the same view at the service-provider level, so you know who to file a ticket with.</div> <h3 id="digital-enterprises">Digital enterprises</h3> <h4 id="tracking-and-optimizing-page-load-time">Tracking and optimizing page-load time</h4> <div class="pullquote right" style="margin-top: 10px; max-width: 270px;">A mere two seconds of slowness could potentially be costing you one-third of your potential new customers.</div> <p><a href="https://www.thinkwithgoogle.com/marketing-strategies/app-and-mobile/mobile-page-speed-new-industry-benchmarks-load-time-vs-bounce/">Research from Google</a> indicates that as page load time increases from one second to 10 seconds, the probability of a mobile site visitor bouncing increases by 123%. Further, the probability of bounce increases by 32% when page-load time increases from one second to three seconds. Think about that. A mere two seconds of slowness could potentially be costing you one-third of your potential new customers. Page-load delays cost you money! Stay ahead of them with synthetic monitoring:</p> <img src="//images.ctfassets.net/6yom6slo28h2/4AuUWJn0x9D1kUPbrrgAd1/234a639c342c33cd4b246380a42ec2d1/cnn-ping-trace.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">Full-browser page-load tests continuously check the performance of your application like a user would, instantly alerting you when issues occur and giving you sufficient detail to know if it’s the application (web) layer or the network layer that has a problem.</div> <h4 id="troubleshooting-root-cause-of-end-user-latency">Troubleshooting root cause of end-user latency</h4> <p>Large web-scale companies like <a href="https://www.kentik.com/resources/case-study-dialpad/">our customer Dialpad</a> rely on a global network of data centers to ensure that their end users experience the minimum amount of latency possible when connecting to their voice and video services. 
They use synthetic tests to monitor circuits from their data centers to customer locations. Doing so not only lets them get alerted as soon as latency increases between their data centers and customers, but the traceroute-based path view lets them quickly get to the root cause of the problem. In Dialpad’s case, one of their carriers was routing traffic from Hong Kong all the way to the U.S. before sending it back to India, which significantly increased the latency their users experienced getting to Dialpad’s service.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3WbH02vWJiTisGFmE0Q77Y/61765ab083a8fe454d79ce09ff636434/troubleshoot-root-cause.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">Traceroute-driven path experience not only helps identify misrouted traffic, but it can also help root cause device issues by showing device and site information in the same view, hop-by-hop.</div> <h3 id="service-providers">Service providers</h3> <h4 id="monitoring-pop-to-pop-connectivity-and-performance">Monitoring PoP-to-PoP connectivity and performance</h4> <p>Service providers with tens if not hundreds of global points-of-presence (PoPs) need to know when connectivity between any of those PoPs suffer and, if so, they need to be able to quickly know where the point of failure is and who is responsible for it (what device, which network, etc.). Traditionally, the challenge has been that setting up such tests can be tedious and the results (which can be hundreds of test results) can be overwhelming. High-density meshes, like the one shown below, are a great way to get a bird’s-eye view of PoP-to-PoP performance across all PoPs while quickly being able to zoom into points of failure.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/w55ieR8cePiqziNFNtVgM/44ca7e7082bd394140533328f3cb5e21/multi-cloud-performance.png" style="max-width: 800px;" class="image center" thumbnail lat="High-density mesh showing PoP to PoP performance" /> <h4 id="monitoring-performance-based-on-traffic-autonomous-testing">Monitoring performance based on traffic (autonomous testing!)</h4> <p>As a service provider with “on-net” CDNs, transit and peering routes to other ASNs, and user traffic to various parts of the world, it can be overwhelming to know where to focus your proactive-monitoring tests. Being able to see traffic patterns to know where the most user traffic is coming from and going to can help in this situation.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6QqDLrKauD95TIHmbFUTpL/78b23442f1a9e691ec8d7a2a82d4e42d/autonomous-tests-cdn.png" style="max-width: 600px; margin-bottom: 15px;" class="image center" /> <img src="//images.ctfassets.net/6yom6slo28h2/2Bl9lF8MhWpk6aZjyThAo4/484ceb1f5d775007b82736307c03ca45/test-a-cdn.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">Autonomous tests display a sorted list of destinations (ASNs, CDNs, countries, cities or regions) based on real traffic patterns from your network (ASN) and then suggest locations in your network to install synthetic agents in to get the most out of your testing.</div> <p>While that in itself reduces the time needed to plan tests, imagine if the platform was smart enough to not just recommend setting up tests to specific locations, but to also help set up and maintain such tests? This is essentially the idea with autonomous testing. 
Kentik’s platform identifies pingable targets within the target network destination you choose, sets up tests automatically, and then refreshes the tests periodically as hosts change.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1eUsOF5TpU4jPC0Wctcsnv/5bc2eef3c857079c03b3eb45419f7583/test-options.png" style="max-width: 400px; margin-bottom: 15px;" class="image center" thumbnail /> <div class="caption">Autonomous testing finds IP addresses for you in the chosen destinations (ASN, CDN, country, city or region), sets up tests to them automatically, and refreshes them periodically &mdash; all while giving you the knobs to control how many IPs to test, how many providers to track, and how often to scan flows for new IP addresses.</div> <p>It’s very likely that one of the proactive monitoring use cases applies to your business, but this is just one of the many high-level use cases for synthetic monitoring. There is a lot more to come in this series, so stay tuned.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p>If you don’t want to wait, reach out to our team <a href="https://www.kentik.com/get-demo/">for a demo</a> or start your <a href="https://www.kentik.com/get-started/">30-day free trial</a>, or sign up for our educational sessions on synthetics:</p> <ul> <li><a href="https://www.kentik.com/go/webinar/virtual-design-clinic/">Virtual Design Clinics</a> &mdash; sessions offered every two weeks: <ul> <li>Thursday, April 7 at 8am PT - Synthetics: PoP edition</li> <li>Thursday, April 28 at 8am PT - Synthetics: Latency edition</li> </ul> </li> </ul> <p><button href="https://www.kentik.com/get-started/">FREE 30-DAY TRIAL</button></p><![CDATA[Network AF, Episode 11: The art of connecting with PacketFabric’s Jezzibell Gilmore ]]><![CDATA[Jezzibell Gillmore is co-founder and chief commercial officer of PacketFabric. She’s also an expert at learning by doing and connecting by keeping it simple. Hear her story from the Network AF podcast.]]>https://www.kentik.com/blog/network-af-episode-11-art-of-connecting-packetfabric-jezzibell-gilmorehttps://www.kentik.com/blog/network-af-episode-11-art-of-connecting-packetfabric-jezzibell-gilmore<![CDATA[Maria Martinova]]>Wed, 16 Mar 2022 04:00:00 GMT<p><CastedPlayer episodeId="0c3bbf82" smallPlayer></CastedPlayer></p> <hr> <p><a href="https://www.kentik.com/network-af/"><img src="https://images.ctfassets.net/6yom6slo28h2/5x9orjh7by7IKpPtGcQprx/fadb6254198fbcfaebcd9b94bc47adf6/jezzibell-gilmore.jpg" style="max-width: 200px;" class="image right" alt="Jezzibell Gilmore" /></a></p> <p>In episode 11 of <a href="https://www.kentik.com/network-af/">Network AF</a>, Avi talks with Jezzibell Gilmore, co-founder and chief commercial officer (CCO) of PacketFabric. Jezzibell is a powerful woman in networking who is modernizing and paving the way for infrastructure in the digital universe. In the conversation, she shares how she’s weaving together technology, business drivers and cutting-edge innovation, while keeping her foot firmly on the ground.</p> <p>Packet Fabric is a network-as-a service platform providing data center interconnection and <a href="https://www.kentik.com/kentipedia/what-is-cloud-networking/">cloud connectivity</a>. As the company’s co-founder and CCO, Jezzibell is in charge of business revenue, which means finding customers and making them happy.</p> <p>Like many network enthusiasts, Jezzibell got into networking totally by chance. 
The CEO of AboveNet heard she was leaving her previous job and offered her an opportunity to come on board. At first, what AboveNet did sounded like a foreign language to Jezzibell, but she quickly realized many people were still learning and navigating the fast-growing networking industry, just like her. Realizing this sparked a love for figuring out complex things and making them better, a quality she carries into her work today.</p> <p>On the episode, Avi and Jezzibell talk about:</p> <ul> <li>Knowledge exchange</li> <li>Learning by doing</li> <li>Preventing glazed eyes: Translating tech into bite-sized pieces</li> <li>The challenge of buying a service as a large enterprise</li> </ul> <h3 id="knowledge-exchange">Knowledge exchange</h3> <p>For Jezzibell, learning technology on the fly has been a formative experience in her career. In the beginning, she was happy to make dinner for her engineer friends in exchange for valuable technical know-how that wasn’t being taught in school. Going to meetups and participating in community events helped her understand how to create and use technology in real time. She shares that the key to her success is learning by doing.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/58f3c43d"></iframe> <h3 id="learning-by-doing">Learning by doing</h3> <p>Avi counters that not waiting around or asking for permission is the building block of the DIY culture that feeds the willingness to innovate. Though limited in how much “figuring it out” can scale, he admits it does provide the curiosity necessary to keep going to the next level.</p> <p>Jezzibell agrees and adds that she makes it a point to nurture training and touch points for various teams in her organization, creating a knowledge-ouroboros that feeds itself. When engineers are able to translate product features succinctly with their sales counterparts in mind, it empowers the sales team to support the customer base better.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/4530d4ce"></iframe> <h3 id="preventing-glazed-eyes-translating-tech-into-bite-size-pieces">Preventing glazed eyes: Translating tech into bite-size pieces</h3> <p>Breaking down technical parts into easy-to-digest pieces that help people understand value is Jezzibell’s gift. She jokes about noticing when people’s eyes glaze over once someone launches into technical jargon. She shares that the art of connecting and being understood is not done by trying to prove yourself, with dropping that jargon, but rather more simply by leading with service.</p> <p>Jezzibell also talks about experiencing the luxury of seeing people put their guard down and having open, grounded conversations with them. She creates this by asking so-called “foundational” questions, cutting past the hubris early on. Even in working in large organizations, she champions keeping it simple.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/0f457a51"></iframe> <h3 id="the-challenge-of-buying-a-service-as-an-enterprise">The challenge of buying a service as an enterprise</h3> <p>Having previously worked at Akamai and today running business revenue at PacketFabric, Jezzibell marvels at the minefield that a large organization navigates when buying a service. 
She has compassion for their plight and, in turn, makes it easy on them by providing the necessary human element.</p> <p>Both agree there is a human component in buying technology that can never be automated, no matter how big you are. Avi quips that this is the so-called “customs of the customer,” which are the characteristics that require flexibility and meeting the customer where they’re at. Jezzibell sees this as balancing two diverse sides of the organization: the consumptive nature of technology, and the slower, creative process of procurement. It takes patience and appreciation of both sides, in order to create the engineering solution the enterprise needs.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/6a490544"></iframe> <p>Tune in to Network AF to find out about the main feature that Jezzibell invests in as CCO that she says pays back in dividends. Hear how she finds acceptance in techy cabals, and listen to more hilarious stories from Avi about networking and learning on the fly in this latest episode.</p><![CDATA[Updated: Cogent and Lumen curtail operations in Russia]]><![CDATA[In our updated blog post, read why international telcos Cogent and Lumen say they are taking action against Russia.]]>https://www.kentik.com/blog/cogent-disconnects-from-russiahttps://www.kentik.com/blog/cogent-disconnects-from-russia<![CDATA[Doug Madory]]>Mon, 07 Mar 2022 05:00:00 GMT<p><strong>Reader’s note</strong>: On Friday, March 4, we published this blog post to comment on Cogent’s decision to terminate their commercial relationships with their Russian customers. Today, March 7, another international telecom, Lumen, <a href="https://news.lumen.com/RussiaUkraine">announced</a> that it will also take action. We’ve updated this blog post to reflect the latest information we have.</p> <h3 id="on-cogent">On Cogent</h3> <p>Despite Ukraine’s <a href="https://eump.org/media/2022/Goran-Marby.pdf">recent request</a> that it be disconnected from the global internet, Russia remains online. However, as of 5pm GMT on March 4, Russia became connected to the world via one less international telecom.</p> <p>On March 3, Cogent sent emails to their customers in Russia advising them about the immediate termination of their commercial relationships due to Russia’s invasion of Ukraine.</p> <blockquote style="margin-left: 5%; border-left: 4px solid #53b8e2; font-family: courier; max-width: 80%; font-size: 96%; margin-bottom: 25px;">In light of the unwarranted and unprovoked invasion of Ukraine, Cogent is terminating all of your services effective at 5 PM GMT on March 4, 2022. The economic sanctions put in place as a result of the invasion and the increasingly uncertain security situation make it impossible for Cogent to continue to provide you with service. <br /><br /> All Cogent-provided ports and IP Address space will be reclaimed as of the termination date. For any colocation customers, your equipment will be powered off and kept in the rack for you to collect. <br /><br /> If not collected within thirty days, the equipment will be removed from the rack and stored. For any utility computing customers, you will not have access to your servers after the termination of service. 
The servers will be disconnected and kept in storage by Cogent for an indeterminate period.</blockquote> <p><strong>Disconnecting their customers in Russia will not disconnect Russia</strong>, but it will reduce the amount of overall bandwidth available for international connectivity. This reduction in bandwidth may lead to congestion as the remaining international carriers try to pick up the slack.</p> <p>Cogent is one of the biggest internet backbone carriers in the world. In fact, it is a member of the rarified Transit-Free Zone which is a small group of global telecoms so large that they don’t pay anyone else for transit (international internet bandwidth). TFZ telecoms exchange internet traffic for free with other TFZ members and charge smaller networks for bandwidth.</p> <p>Based in the United States, Cogent sells transit all over the world including the Russian market. If we pull up the list of Cogent’s Russian customers in <a href="https://www.kentik.com/blog/launching-a-labor-of-love-kentik-market-intelligence/">Kentik Market Intelligence</a> we can get a sense for the potential impact. Below is a screenshot from KMI of just the top five Russian customers of Cogent based on transited IPv4 space. It includes both Russian state telecom Rostelecom (AS12389) as well as Russia’s other national fiber backbone operator Transtelecom (as known as TTK, AS20485). Russia’s mobile market is dominated by a “big three” of MTS (AS8359), Megafon (AS31133) and VEON (formerly Vimpelcom, AS3216). Of these, Cogent has two of the three Russian mobile carriers as customers.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4sSv317VI0C73SOaxqwcpf/13886c3ac5ae5154ce63f32b942fddee/top-cogent-customers-ru.png" class="image center" style="max-width: 400px;" alt="Top Cogent customers in Russia" /> <p>If we take what KMI estimates to be Cogent’s largest customer in Russia (Rostelecom) and pivot to see who they buy transit from, we see the view below. The loss of Cogent would leave Rostelecom with several other options for transit including Vodafone (AS1273), Telecom Italia Sparkle (AS6762), Lumen (AS3356), and Arelion (formerly Telia, AS1299).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7pfWGskEiUWsJtIJlKw92K/2fdda6f4d56076c32b825e67c592c140/rostelecom-providers.png" class="image center" style="max-width: 300px;" alt="Rostelecom providers" /> <p>In addition to the telecoms listed above, Cogent provides service to Russia’s search engine Yandex (AS208722) as well as StormWall (AS59796), a Russian DDoS mitigation firm defending three of the 31 prominent Russian websites on Ukrainian Vice Prime Minister Fedorov’s <a href="https://www.reuters.com/world/europe/ukraine-launches-it-army-takes-aim-russian-cyberspace-2022-02-26/">target list</a>.</p> <p>As Russia becomes increasingly disconnected from the financial system, Russia’s communication companies may have difficulty paying foreign transit providers for service.</p> <p><a href="https://www.internetsociety.org/author/sullivan/">Andrew Sullivan</a>, president of the Internet Society, recently responded to requests to remove Russia from the Domain Name System (DNS), with an <a href="https://www.internetsociety.org/blog/2022/03/why-the-world-must-resist-calls-to-undermine-the-internet/">impassioned plea</a>. 
“Cutting a whole population off the Internet will stop disinformation coming from that population—but it also stops the flow of truth,” he wrote.</p> <p>A backbone carrier disconnecting its customers in a country the size of Russia is without precedent in the history of the internet and reflects the intense global reaction that the world has had over the invasion of Ukraine.</p> <h3 id="update-7-march">UPDATE (7 March)</h3> <h4 id="visible-impacts-of-the-disconnection">Visible impacts of the disconnection</h4> <p>Russian telecoms VEON (formerly Vimpelcom, AS3216) and Transtelecom (TTK, AS20485) lost Cogent transit service <a href="https://twitter.com/DougMadory/status/1500837678119624708">at 17:05 UTC</a> on Friday, 4 March, minutes after the deadline. However, as of today, state telecom Rostelecom (AS12389) and mobile operator Megafon (AS31133) are still using Cogent transit.</p> <p>The change in inbound traffic for VEON is pictured below.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3Zgr0gjarJZTagABfEvPNy/81e7d5b22e1cfe0dc790ac3a28983719/veon-inbound-traffice-change.png" style="max-width: 750px;" class="image center" thumbnail alt="Inbound traffic change for VEON" /> <p>It is worth noting that VEON and TTK also provide transit into the countries of Central Asia and Mongolia. Therefore the disconnection of Cogent from the Russian market has some <a href="https://twitter.com/DougMadory/status/1500842532510515206">downstream impacts</a> into Kazakhstan, Tajikistan and Uzbekistan.</p> <p>When Cogent disconnects from Rostelecom, it will also impact the internet in Iran, Azerbaijan, Belarus, and the contested geographies of Crimea and Abkhazia.</p> <p>As a result of the loss of customers VEON and TTK, Cogent has dropped one position in the KMI overall Customer Base rankings (pictured below) in Russia. When they lose Rostelecom and Megafon they will be out of the Top 10.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/20VocMB97kpcmzy76atQTJ/6de0c61287c00fd2908f72db5b335ddc/kmi-customer-base-cogent1.png" style="max-width: 500px;" class="image center" alt="Cogent drop in KMI rankings" /> <h3 id="lumens-announcement">Lumen’s announcement</h3> <p>In the wake of Cogent’s decision to terminate services to its Russian customers, Lumen <a href="https://news.lumen.com/RussiaUkraine">published a statement</a> on Monday concerning its business in Russia.</p> <p>Among the announcements, Lumen stated that it will stop the sale of any new services to either Russia-based companies as well as non-Russia-based companies that were providing services in Russia. Additionally, Lumen states that it “terminated an agreement to provide services to an existing Russian financial institution and we have withdrawn our name from consideration for new business with another Russian financial institution.”</p> <p>Lumen is already the top international transit provider to Russia boasting Rostelecom, TTK, as well as all three major mobile operators (MTS, Megafon and VEON) as customers. 
While they may not take on any new customers in Russia, they may be carrying more traffic with the loss of Cogent as a transit alternative.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p><em>This blog post was originally published on March 4, 2022.</em></p><![CDATA[The peering coordinators’ toolbox - part 4]]><![CDATA[In part four of our series, you’ll learn how to find the right transit provider for your peering needs.]]>https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-4https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-4<![CDATA[Nina Bargisen]]>Thu, 03 Mar 2022 05:00:00 GMT<ul> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1/">Part 1 - Economics: How to understand the cost of your connectivity</a></li> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-2">Part 2 - Performance: How to understand the services used by your customers</a></li> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-3/">Part 3 - Analyze: How to improve your CDN and peering connectivity</a></li> <li><strong>Part 4 - Select: How to find the right transit provider</strong></li> </ul> <p>The peering coordinator’s toolbox is a blog series where we dive deep into the peering coordinator workflow and show how the right tools can help you be successful in your role.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 0px; margin-top: 40px;"></div> <h2 id="part-4-selection">Part 4: Selection</h2> <h3 id="how-to-find-the-right-transit-provider">How to find the right transit provider</h3> <p>In this blog series, we discussed how to peer off or embed IP traffic to optimize your service quality and cost of interconnection. In this final part of the series, we’ll discuss how to select the right transit provider(s).</p> <p>The requirements for a good transit provider are typically one or more of the following:</p> <ul> <li>Short path to selected services or networks</li> <li>Geographic scope</li> <li>Latency</li> <li>Packet loss</li> <li>Price</li> <li>Resilience</li> </ul> <p>Remember, your transit is not only a solution to reach some selected services or networks on the internet, but it’s also the way to reach any destination on the whole internet. It’s also your last resort if there are no other paths in your network to the destination.</p> <h4 id="optimal-connectivity-short-path-to-or-from-selected-destinations">Optimal connectivity: Short path to or from selected destinations</h4> <p>If you’ve followed along in previous parts of this blog series, you know that you might have an ASN where peering doesn’t make sense or is not possible. That’s where you know you’ll want to secure the best possible path between you and them.</p> <p>With <a href="https://www.kentik.com/product/market-intelligence/">Kentik Market Intelligence (KMI)</a>, you can look up which providers the target ASN uses and which peers they have. From this list, you can find your potential providers.</p> <p>The next thing you’ll want to determine is: Where do my potential providers connect to my target ASN?</p> <p>This is one of the hardest questions to answer. 
And while many transit providers offer features that can help you answer this question, the two necessary features needed are a BGP community setup and a looking glass.</p> <p><strong>BGP communities</strong></p> <blockquote style="border-left: 4px solid #53b8e2; margin-bottom: 30px;">BGP communities are BGP attribute tags that can be applied to incoming or outgoing routes. This method is often used by transit providers to allow customers to influence the routing policy of their own routes and to communicate information about the routes they send to the customer. The community setup is sometimes publicly available and sometimes only available on request.</blockquote> <blockquote style="border-left: 4px solid #53b8e2">A looking glass is a common tool provided by transit providers to their customers. It’s often publicly available as well. The most common features are SNMP pings, traceroutes and queries into the BGP routing table from selected routers in the provider's network.</blockquote> <p>You will need several types of BGP communities to be available:</p> <ol> <li><strong>Geographical communities</strong>: These are communities that are set by the transit provider on routes they receive from their peers, customers and providers (if they have any). The typical granularity is city, country and continent.</li> <li><strong>Propagation control communities</strong>: These are communities that you can set on your routes when you announce them to your provider. Based on these, you have some control over how the provider announces your prefixes to their customers, peers and providers.</li> </ol> <p>With this in hand, you can check in the looking glass to see which communities are tagged to the routes from the ASN of your interest.</p> <p>Below is an example of information about a route from AS 9121 in AS 2914’s looking glass from a query done on a AS 2914 router in Dusseldorf, Germany.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6JyqpFCwNcInvcFx8SL1Af/125ee0644513d49febc7085e1a3d6bb3/route-info.png" style="max-width: 700px;" class="image center no-shadow" alt="route information" /> <p>The community list is the last line and we can see there are communities set by AS 2914 and by AS 9121. AS 2914 makes their community setup available to the public <a href="https://www.gin.ntt.net/support-center/policies-procedures/routing/">on their website</a>, so a simple look-up of the list shows:</p> <p>2914:1201 - route is learned in Dusseldorf<br> 2914:2202 - route is learned in Germany<br> 2914:3200 - route is learned in Europe</p> <p>It’s then fair to assume that AS2914 and AS 9121 are connected in Dusseldorf, Germany.</p> <p>This is one way you can check whether the potential provider has a connection to your target ASN in locations that make sense with respect to your desired route for the traffic.</p> <p>Traceroutes from the potential provider’s looking glass can confirm and give information about the path <strong>to</strong> the target ASN. 
So now you know what will happen to your <strong>outbound</strong> traffic to the target ASN.</p> <p><strong>Inbound traffic</strong><br> Be aware that transit providers usually use “hot-potato routing.” This means that they will hand off traffic as soon as possible after receiving it.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6JpargfNmz8FkLr8v42NCY/52e4cc6671e629624dc42b9dcc6b72e6/hot-potato-routing.png" style="max-width: 450px;" class="image center no-shadow" alt="hot-potato routing" /> <div class="caption">This shows hot-potato routing of traffic between X and Y, where X is a customer of ISP B, and Y of ISP A. The ISPs have two peerings, one in each region. Each ISP hands off the traffic as soon as possible, which means traffic from X to Y travels most of the way in ISP A, while traffic from Y to X travels most of the way in ISP B.</div> <p>This means that your <strong>inbound</strong> traffic might take a completely different route from the target ASN to your network (or to the location of your potential new transit connection).</p> <p>If the target ASN also provides a looking glass, it is straightforward to check the return path so you can see the path for your <strong>inbound</strong> traffic from the target ASN to the location of your potential transit connection.</p> <p>If they do not, the <a href="https://www.kentik.com/product/global-agents/">Kentik Global Synthetic Network</a> includes hundreds of strategically positioned global agents in internet cities around the world and in every cloud region within AWS, Google Cloud, Microsoft Azure and IBM Cloud. If there is a global agent in your target ASN, you can use this to see the path from the target ASN for your inbound traffic.</p> <p>The propagation communities are needed in the case where you want to use a transit provider to reach a few selected ASNs, but not as a full transit provider for the entire internet to reach your network. For example, if your target ASN is a customer of the provider, using a do-not-announce-to-peers community and a do-not-announce-to-upstreams community will restrict your visibility on the internet via this provider.</p> <p>When you use propagation communities, always make sure that your routes are available to the entire internet via other providers.</p> <h4 id="regional-connectivity">Regional connectivity</h4> <p><a href="https://www.kentik.com/product/market-intelligence/">Kentik Market Intelligence (KMI)</a> rankings can help you select the right provider if you need to secure good connectivity to a regional market — be it state, country or selected region.</p> <p>For example, if you needed to be sure you reached most of Argentina, KMI’s country ranking for Argentina can help you select the best provider for this country.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4k8CoAkwRYKDSSCC6yop91/ade4f2e4222d8e63c3f3220921874a6e/kmi-argentina.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" alt="Kentik Market Intelligence ranking" thumbnail /> <h4 id="global-connectivity">Global connectivity</h4> <p>When choosing your providers for global connectivity, KMI global ranking is a good guidance of who will be able to provide good connectivity to the entire internet. This will ensure your services are available to everyone. However, sometimes smaller regional providers will make more sense — either because of price or simply availability. 
The detailed view in KMI will help you understand the connectivity of this provider.</p> <p>KMI’s “rankings” tab will show how the provider is ranked.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4aGGgbb76BvdYbzY4dzUyK/2a65a612abde7ff7cde1c43e3e9d844f/kmi-ranking-uab-bite.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" alt="Kentik Market Intelligence -global connectivity" thumbnail /> <p>KMI’s “customers and providers” view will show you their connectivity information.</p> <p>If global reach is essential for your network, maybe a provider who is single homed to another small provider is not the best choice. Someone who has several global networks from the top of the rankings as their providers will likely be a more acceptable option for you.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2U7PzYew2ybar42l3mPUmJ/4d38098e500198aa92e3cb646cc46c64/kmi-options.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <p>With that, we conclude our peering blog series. You now have the basics of the economics, analysis, performance and selection processes of peering. Want to know more? Reach out to us <a href="https://www.kentik.com/go/get-started/demo/">for a demo</a> or <a href="https://www.kentik.com/go/get-started/">sign up for a free trial</a> now.</p> <p><button href="https://www.kentik.com/go/get-started/">TRY KENTIK NOW</button></p><![CDATA[Network observability, now publicly shareable]]><![CDATA[What fun is network observability if you can’t share what you see? Now you can with public link sharing from Kentik. See what’s new.]]>https://www.kentik.com/blog/network-observability-now-publicly-shareablehttps://www.kentik.com/blog/network-observability-now-publicly-shareable<![CDATA[Stephen Condon]]>Tue, 01 Mar 2022 05:00:00 GMT<p>What fun is network observability if you can’t share what you see? That’s why we’ve added public link sharing to the Kentik platform.</p> <p>One of the greater missions of network observability is to break the boundaries of conventional monitoring. At Kentik, we focused our initial efforts on making complex infrastructure problems easy to visualize, understand and resolve. Now we’re tackling a follow-up mandate: to democratize network observability.</p> <p>With public link sharing, you can share secured public links to visualizations of your Data Explorer and Synthetics workflow — with anyone you want. Just look for that familiar “Share” button to get started!</p> <img src="//images.ctfassets.net/6yom6slo28h2/5QvU0wN9x0lXXMD8rWQYuh/debd59538d85659ec01673fae75bcf39/link-sharing-toolbar.png" style="max-width: 450px;" class="image center no-shadow" alt="Share option in the toolbar" /> <p>Public link sharing also comes with the bells and whistles, including:</p> <ul> <li>A public, secure website to display your visualizations</li> <li>The ability to customize your sharable link using human-readable URLs</li> <li>The ability for your public viewers to change visualization types or granularly, or select time-series being displayed — and even drill into the visualization at hand</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/5pPdGIR645xdDgHIpGJKqe/ceefae3685fb3a724fde9771aa4a9311/link-sharing-example.png" style="max-width: 800px;" class="image center" alt="Link sharing example" /> <p>Think about the possibilities unlocked by link sharing your visualizations. 
For example, the infrastructure team’s “Mean Time to Innocence” (you’ll want to call it MTTI) will greatly benefit from it.</p> <p>Being able to share a specific slice of what you can see in Kentik will help foster collaboration and transparency and help break down silos with external vendors, the media, other internal parties, and even customers and prospects. That includes being able to:</p> <ul> <li>Efficiently instrument troubleshooting sessions internally and externally</li> <li>Show evidence of outages and performance issues to vendors</li> <li>Augment support responses to customers with actual, interactive evidence</li> <li>Build brand confidence by transparently communicating on outages and their impact</li> <li>Empower your product and marketing teams with tools to showcase the performance of your infrastructure</li> </ul> <div class="pullquote right" style="max-width: 320px; margin-top: 15px;">In the spirit of sharing, share the love, and tell us how you’re using it!</div> <p>Beyond these use cases, we’re looking forward to seeing what other creative uses you’ll find for public link sharing. So, in the spirit of sharing, share the love, and tell us how you’re using it!</p> <p>While testing this new feature, the team here at Kentik came up with a few creative ideas of our own for public link sharing. Here are a few for inspiration:</p> <ul> <li>Displaying a global latency/jitter/loss matrix to connectivity prospects</li> <li>Show your app team hard evidence that the network is not responsible for a degraded performance</li> <li>Help cache-embedding content providers troubleshoot performance issues on a specific OTT service</li> <li>Augment peering requests/responses with traffic graph and traffic localization</li> <li>Support discussions with peers around sudden traffic shifts</li> </ul> <p>Our customers have been asking for this feature, but enabling the sharing of selected-Kentik views to those who don’t have access to the service isn’t trivial. We understand that certain data will always be confidential and shouldn’t be shared. That data has been removed from shared views.</p> <h3 id="our-journey-to-democratize-network-observability-has-just-begun">Our journey to democratize network observability has just begun.</h3> <img src="//images.ctfassets.net/6yom6slo28h2/6KPMON9vEgdBX9gCw22xA6/73c5798d311dd368d83031ff94d48c12/portal-menu-share-option.png" style="max-width: 300px;" class="image right no-shadow" alt="Public sharing menu option" /> <p>With this first feature set around public link sharing hitting your browsers, our minds are already full of ideas to take this further. And there’s already a long slate of iterations to come, so stay tuned.</p> <p>To keep making Kentik your go-to network observability platform, we’d love to hear your feedback once you’ve kicked the tires. Or, as our product team always says, “The future of this feature set is mostly in your hands, so tell us what you’d like!”</p> <p>More information on public link sharing is available in the <a href="https://kb.kentik.com/v4/Cb29.htm">Kentik Knowledge Base</a>.</p> <p>Need network observability that’s sharable? 
Reach out to our team for a <a href="https://www.kentik.com/go/get-demo/">demo</a> or to <a href="https://www.kentik.com/go/get-started/">start your trial</a>.</p><![CDATA[Network AF, Episode 10: Navigating venture capital and networking with Alan Cohen]]><![CDATA[Alan Cohen, partner at venture capital firm DCVC, sits down with Avi to talk about his experience working in networking and security, the advent of virtualization and multi-cloud, and strategies he has learned throughout his venture capital days to reach and grow entrepreneurs' businesses.]]>https://www.kentik.com/blog/network-af-episode-10-navigating-venture-capital-networking-alan-cohenhttps://www.kentik.com/blog/network-af-episode-10-navigating-venture-capital-networking-alan-cohen<![CDATA[Maria Martinova]]>Fri, 25 Feb 2022 05:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="//images.ctfassets.net/6yom6slo28h2/7DWgsHJUE1ws0KGJbAW0GU/f242c416f8b409a905c058462ca79fe5/alan-cohen.jpg" style="max-width: 200px;" class="image right" alt="Alan Cohen" /></a></p> <p>In this episode of <a href="https://www.kentik.com/network-af/">Network AF</a>, your podcast host Avi Freedman chats with networking investor, advisor and VC partner, Alan Cohen. Alan brings a hilarious, witty and nonconformist attitude to the talk, exploring Silicon Valley in the 90s, the joy of moving from large enterprises to small disruptors, and generously sharing secrets of the trade with Avi and podcast listeners.</p> <p>An operational expert in product development, go-to-market strategies and growth, Alan cut his teeth in the 90s at IBM, then was VP and head of marketing for Cisco’s $25 billion enterprise business, managing over 400 people and 25 product lines.</p> <p>A self-described “technology-Benjamin Button,” Alan has selectively moved from larger organizations to smaller, disruptor startups, most notably serving as Nicira’s VP of marketing, which was acquired by VMware for $1.3 billion in 2012. After Nicira, Alan pursued his interest in mastering massive problems through computing, by joining Illumio as chief commercial officer and board member.</p> <p>Currently, he mentors the new generation of deep-tech ideas and industry upstarts by serving as partner at venture capital firm DCVC.</p> <p>In the episode, Avi and Alan cover:</p> <ul> <li>Alan’s unique journey from marketing leader to venture capitalist</li> <li>Getting noticed and getting acquired</li> <li>The marketing of technology</li> <li>Why networking people are so important right now</li> </ul> <h3 id="from-marketing-leader-to-venture-capitalist">From marketing leader to venture capitalist</h3> <p>Alan refers to himself as an “accidental tourist” in technology. He admits he’s a trained economist and not an engineer, having started his career as a consultant for US West’s cellular and broadband networks division.</p> <p>He traverses his journey in, out and around networking, from Cisco to startups, retelling anecdotes of how McKinsey scoffed at cell phones and computers, how “interactive TV” was the calling card of the future, and recounts various missteps and big wins along the way to cloud computing. 
Then, in his own words, he talks of “doing what every washed-up technologist does,” by making his way into venture capital.</p> <p>At DCVC today, Alan nurtures the next generation of ideas and people, by “holding their beer, giving them money and not getting in their way.”</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/9be9bcca"></iframe> <h3 id="getting-noticed-and-getting-acquired">Getting noticed and getting acquired</h3> <p>Avi and Alan discuss the seminal heyday of networking when Cisco co-founders Sandy Lerner and Len Bosack were just starting out running the Stanford computer labs, trying to connect all the machines and printers at Stanford’s Graduate School of Business. This project would later seed the building blocks of Cisco IOS, kicking off an avalanche of growth and innovation.</p> <p>Recounting the energy of the times, Alan shares a story when he was still a newbie exec at IBM. He was asked to speak at the lofty Vortex conference put on by famed entrepreneur and engineer Bob Metcalfe. He was slated to speak after John Chambers, Cisco’s CEO at the time, and before economist and well known author George Gilder. Alan likens it to going on after the Stones and before the Beatles. But luckily, he was able to negotiate being on a panel of his industry peers instead.</p> <p>This panel conversation incited so much attention that John Chambers ended up watching it, helping Alan get acquired not long after.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/f2521a80"></iframe> <h3 id="the-marketing-of-technology">The marketing of technology</h3> <p>As a technologist, Avi says he often finds himself needing to decloak the marketing on a website to fully understand what something does. He confides that all the wonderful marketing messages about values, benefits and positioning are nice, but he adds, “I can’t reason about something unless I know what it does.” Alan, who was previously responsible for delivering marketing messages across $24 billion of Cisco’s revenue, jokes that together they could have been the Brundlefly of supreme power in networking.</p> <p>Avi laments spending eight years trying to sell the cloud vision before its time when, he says in hindsight, what was needed was a strong focus on concrete product marketing. Alan countered that the beauty of Nicira at its time was that it shifted the locus of control away from enterprises and into the hands of network administrators, creating a virtualized infrastructure — an actual product that solved the problem of having to run and monitor the new cloud phenomenon.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/e7462c0d"></iframe> <h3 id="why-networking-people-are-so-important-right-now">Why networking people are so important right now</h3> <p>As the world moves well beyond virtualized infrastructure into “multi-cloud, multi-everything,” in Alan’s words, it’s actually not as simple to make things work — and work well. Automation doesn’t mean simplicity, as someone still needs to make it go. That’s why Alan notes that networkers are very important right now. 
In the new post-Covid world, “if you don’t have connectivity, you can’t learn, work or participate in society.”</p> <p>With the recent news about <a href="https://www.kentik.com/go/webinar/outages-2022-02/">network outages at Facebook and AWS</a>, Avi and Alan both marvel at the multiplication effect of how everything is connected nowadays and how, once again, it all comes back to the importance of networking folks working tirelessly to make the whole world go through global communication networks and applications.</p> <p>Acknowledging just how complex these systems are, Avi laughingly admits, “The thing that’s amazing about the internet is not that it does break sometimes, but that it works in the first place.”</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/572a7b73"></iframe> <p>Tune in to the <a href="https://www.kentik.com/network-af/">full episode</a> to find out Alan’s tried-and-true rule of a successful go-to-market strategy, and get Avi’s foolproof way of knowing if you’ll win someone’s business.</p><![CDATA[Introducing BGP Monitoring from Kentik]]><![CDATA[Kentik’s BGP monitoring capabilities address root-cause routing issues across BGP routes, BGP event tracking, hijack detection, and other BGP issues.]]>https://www.kentik.com/blog/bgp-monitoring-from-kentikhttps://www.kentik.com/blog/bgp-monitoring-from-kentik<![CDATA[Anil Murty]]>Thu, 24 Feb 2022 05:00:00 GMT<h2 id="understanding-the-border-gateway-protocol-bgp">Understanding the Border Gateway Protocol (BGP)</h2> <p>Designed at the dawn of the commercial internet, the <a href="https://www.kentik.com/kentipedia/what-is-bgp-border-gateway-protocol/" title="Kentipedia: What Is BGP? Understanding Border Gateway Protocol">Border Gateway Protocol (BGP)</a> is a policy-based routing protocol that has long been an established part of the internet infrastructure. BGP (and BGP routing) was created to enable the internet to scale and accommodate a growing number of autonomous systems (ASes) that needed to exchange routing information with each other. As the backbone of the global internet routing system, BGP is responsible for directing traffic between ASes, which are essentially networks operated by different organizations such as internet service providers (ISPs), data centers, and large enterprises.</p> <p>BGP is crucial because it maintains the stability and reliability of the internet by ensuring that traffic is routed efficiently across various networks. As the internet has evolved, so too have the complexity and demands placed on BGP, making its monitoring and management increasingly essential. This has given rise to the need for BGP monitoring, a process that helps network operators detect and troubleshoot issues in their routing infrastructure. By understanding and analyzing BGP data, operators can optimize network performance, minimize downtime, and maintain the overall health of their networks.</p> <h2 id="what-is-bgp-monitoring">What is BGP monitoring</h2> <p>BGP monitoring refers to the process of monitoring the Border Gateway Protocol (BGP) to detect and troubleshoot issues in a network’s routing infrastructure. BGP is a protocol used by internet service providers (ISPs) to exchange routing information between autonomous systems (ASes), and it plays a critical role in ensuring that data is routed efficiently and reliably across the internet. 
BGP route monitoring is historically of interest primarily to ISPs and hosting service providers whose revenue depends on delivering traffic.</p> <p>Additionally, the threat landscape has evolved, with bad actors exploiting BGP vulnerabilities to carry out attacks such as route hijacking and leaks. These incidents can lead to traffic being redirected to malicious networks, causing significant disruptions to the affected organizations. BGP monitoring helps network operators detect and remediate such incidents and, in turn, protect their networks and ensure the security of their data. BGP hijack detection is one of the essential aspects of monitoring BGP today.</p> <div as="Promo"></div> <p>As we saw with <a href="https://www.kentik.com/blog/facebooks-historic-outage-explained/" title="Kentik Blog: Facebook&#x27;s historic outage, explained">Facebook’s historic outage</a>, monitoring BGP proactively has become equally important for digital enterprises and web businesses. That’s because their user experience and revenue streams depend on reliable, high-performance internet traffic delivery. To help our customers manage this critical element of network performance, Kentik now includes BGP performance monitoring as part of the <a href="https://www.kentik.com/product/kentik-platform/" title="Learn more about the Kentik platform">Kentik Network Observability Platform</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1gz3gX6oYvdnbiYe82Stgg/8e82909326a193ff0b50666d3109eee6/bgp-performance-monitoring-in-platform.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="BGP monitoring in the Kentik platform" /> <div class="caption">Path Visualization is a part of the BGP Monitor test results. You can see the AS paths currently and at any point in time.</div> <h2 id="why-choose-kentik-as-your-bgp-monitoring-solution">Why choose Kentik as your BGP monitoring solution?</h2> <p>While free and commercial solutions for monitoring BGP have existed for several years, there was something missing that compelled many of our customers to nudge us towards building our own. Part of this had to do with our approach to <a href="https://www.kentik.com/kentipedia/what-is-network-observability/" title="Kentipedia: What is Network Observability?">network observability</a>. Our customers give us great reviews for user experience and for our approach that enables users to answer any question about their network. They wanted BGP monitoring to be a part of the solution.</p> <p>The other big reason to use Kentik’s BGP monitoring solution is that it addresses many of the limitations of the current “best of breed” alternatives with features including:</p> <h3 id="1-large-number-of-data-sources">1. Large number of data sources</h3> <p>Most solutions on the market today are solely reliant on publicly available BGP monitors. While these are excellent sources of data, Kentik is uniquely positioned to take this to the next level by leveraging our rich BGP data sets that include both public and private (anonymized) sources of BGP data.</p> <h3 id="2-immediate-data-retrieval">2. Immediate data retrieval</h3> <p>Given the size of the datasets that a BGP monitoring solution needs to work with, some solutions have a delay (up to several hours, sometimes) before they present you with data. Our design goal has been to make this near-instantaneous.</p> <h3 id="3-instant-alerts">3. 
Instant alerts</h3> <p>What good is a BGP monitoring solution if it alerts you after the whole world has found out about the issue, for example, on Twitter? Kentik alerts you nearly instantaneously.</p> <h3 id="4-clean-user-experience">4. Clean user experience</h3> <p>Collecting and presenting the data is one thing, but doing it in a way that makes it a pleasure to use is a whole different story. Kentik already has information about customer networks, ASes, and prefixes which reduce the time needed to start surfacing up BGP data.</p> <h3 id="5-multitude-of-apis-and-integrations">5. Multitude of APIs and integrations</h3> <p>It’s 2022, you want the data you need, where you need it.</p> <h3 id="6-benefits-of-a-single-pane-of-glass">6. Benefits of a single pane of glass</h3> <p>While <a href="https://www.kentik.com/kentipedia/what-is-synthetic-monitoring/" title="Kentipedia: What is Synthetic Monitoring?">synthetic monitoring</a> is a crucial part of managing services in production, correlating test failures to internet routing issues caused by BGP changes completes the picture.</p> <p>The limitations of competing solutions, coupled with our customers’ need for true <a href="https://www.kentik.com/kentipedia/what-is-network-observability/" title="What is Network Observability?">network observability</a>, were the key reasons for why we embarked on the journey to creating a next-generation BGP monitoring solution, as part of the Kentik Network Observability Platform. While Kentik BGP Monitor is already more feature-rich than many existing solutions, we can’t wait for you to start using it so we can continue to build on what we have.</p> <h2 id="bgp-monitoring-use-cases-and-features">BGP monitoring use cases and features</h2> <p>Kentik’s BGP Monitor tool addresses the most common use cases around monitoring BGP state as well as root-causing routing issues when they occur. These include:</p> <h3 id="event-tracking">Event tracking</h3> <p>See route announcements and withdrawals over time and filter the data by day, hour, AS, prefix and announcement type. This is a crucial part of the day-to-day observation of <a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-1/" title="Kentik Blog: BGP Routing Tutorial, Part 1">BGP routing</a> infrastructure and policies.</p> <img src="//images.ctfassets.net/6yom6slo28h2/77Dr7LJb8sYkPQmGkMOXAw/20ee7e217bf203f4673616f000faa418/event-tracking-facebook.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="BGP event tracking" /> <div class="caption">BGP Monitor “Events” tab showing BGP announcements for Facebook’s /23 and /24 prefixes, before, during and after the historic outage.</div> <h3 id="bgp-hijack-detection">BGP hijack detection</h3> <p>Malicious exploits of BGP’s vulnerabilities can cause routes between the internet’s tens of thousands of Autonomous Systems (ASes) to change, resulting in disruptions to application and service delivery. 
Being able to alert as soon as these happen is one of the primary use cases of BGP monitoring.</p> <h3 id="route-leak-detection">Route leak detection</h3> <p><a href="https://www.kentik.com/blog/new-year-new-bgp-leaks/" title="Kentik Blog: New year, New BGP Route Leaks">Route leaks</a> are similar to the malicious hijacking of BGP routes, but caused by inadvertent misconfiguration (for example, human error).</p> <h3 id="rpki-status-check">RPKI status check</h3> <p><a href="https://www.kentik.com/blog/bgp-and-rpki-a-path-made-clear-with-kentik/" title="Kentik Blog: BGP and RPKI, A Path Made Clear with Kentik">Resource Public Key Infrastructure (RPKI)</a> is a best practice for securing BGP route announcements, but the improper configuration of ROAs can cause reachability issues. Knowing when these occur and getting alerted is a crucial part of monitoring BGP.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1LIeFLA8onowPTEzY6tbSA/7adb903c8bb9611239ed871ad5ccc101/rpki-alert.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="RPKI status checking in Kentik" /> <div class="caption">The BGP Monitor Test in Kentik Synthetics, enables you to detect and get alerted on both unexpected origins as well as invalid RPKI status. Both of these conditions can be set up to notify you at your chosen channels (Slack, email, PagerDuty, etc.).</div> <h3 id="reachability-tracking">Reachability tracking</h3> <p>We help you track changes in the reachability of your prefixes from hundreds of vantage points all over the internet and will alert you when any of them become unreachable. You need to be sure that traffic from your ASes can make its way to your customers and the service providers you depend on.</p> <img src="//images.ctfassets.net/6yom6slo28h2/A51T5b5lxlWIsB2z6ybrv/b0593b931662035594a6a37e1a76ccc8/reachability-tracking.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="visibility of prefixes from hundreds of BGP vantage points"/> <div class="caption">Time-series chart showing visibility of prefixes from hundreds of BGP vantage points. Filters show the visibility per prefix by origin AS.</div> <h3 id="as-path-change-tracking">AS path change tracking</h3> <p>Frequent changes in the path that BGP route announcements take between ASes can be a sign of instability. Monitoring for these changes and getting alerted as soon as they occur is a key part of ensuring service reliability.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6FVofxOaNPea3PZEQxV6Kc/2e1a93268fda912722673f32c1251ccb/asn-path-change-tracking.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="changes in AS path over time"/> <div class="caption">Time-series chart showing average number of changes in AS path over time. </div> <h3 id="as-path-visualization">AS path visualization</h3> <p>Fast troubleshooting of issues requires being able to visualize data to find trouble spots quickly. 
We give you a 10,000-foot view of changes in BGP routes over time — an indispensable tool!</p> <img src="//images.ctfassets.net/6yom6slo28h2/3h1CEpfIg1lK72i6AHUYNA/fe46381c7b269d4bb4983a16d1802d42/as-path-visualization.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="AS path visualization in Kentik" /> <div class="caption">AS path visualization showing a hop-by-hop view of routes that is “scrubbable” across time.</div> <h3 id="convenient-notifications">Convenient notifications</h3> <p>Last but not least, all of the above metrics can be set up to alert within the product and can be tied to the most common notification channels including:</p> <ul> <li>Slack</li> <li>Microsoft Teams</li> <li>JSON</li> <li>OpsGenie</li> <li>Pagerduty</li> <li>Servicenow</li> <li>Splunk</li> <li>Syslog</li> <li>VictorOps</li> <li>Xmatters</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/6854CVDCdiLst9UQMlm59K/fa6c94a1018ad1092877787e2f864d0d/alert-setup.png" style="max-width: 800px;" class="image center" thumbnail alt="BGP monitoring notifications" /> <h2 id="additional-advantages-of-proactive-bgp-monitoring">Additional advantages of proactive BGP monitoring</h2> <p>Proactively monitoring BGP is essential to verify route changes during regular network operations. For instance, when altering a service provider relationship, it is crucial to ensure that your routes are accurately advertised and reachable post-change.</p> <h3 id="potential-bgp-issues">Potential BGP issues</h3> <p>With many entities and devices involved in BGP networks, there are numerous points where issues can arise. BGP’s known weaknesses in authentication and verification of routing claims can lead to problems. Here are some common challenges businesses face:</p> <h4 id="bgp-route-misconfigurations">BGP route misconfigurations</h4> <p>Advertising routes that cannot carry traffic is called “blackholing”. If you advertise a part of the IP space owned by someone else, and your advertisement is more specific than the owner’s, internet data intended for that space will be directed to your border router. This will effectively disconnect the black-holed address space from the rest of the internet.</p> <h4 id="bgp-route-hijacking">BGP route hijacking</h4> <p>Route hijacking involves using another network’s valid prefix as your own, potentially causing severe network disruptions. Most route hijacking on the internet results from unintentional misconfigurations. While malicious intent is possible, a simple configuration typo is often the cause.</p> <h4 id="bgp-route-flapping">BGP route flapping</h4> <p>Route flapping happens when a router initially advertises a destination network through one route, then quickly switches to another or alternates between “available” and “unavailable” status. This forces other routers to recalculate routes, consuming processing power and potentially affecting service.</p> <h4 id="infrastructure-failures">Infrastructure failures</h4> <p>Hardware and software errors, configuration mistakes, and communication link failures (e.g., unreliable connections) can result in route flapping and other issues. Reachability information may be repeatedly advertised and withdrawn. 
<h4 id="infrastructure-failures">Infrastructure failures</h4> <p>Hardware and software errors, configuration mistakes, and communication link failures (e.g., unreliable connections) can result in route flapping and other issues. Reachability information may be repeatedly advertised and withdrawn. A frequent failure scenario is when a router interface experiences a hardware problem, causing the router to alternate between “up” and “down” announcements.</p> <h4 id="bgp-and-ddos-attacks">BGP and DDoS attacks</h4> <p>BGP hijacking can facilitate DDoS attacks, in which an attacker impersonates a legitimate network by using another network’s valid prefix as their own. If successful, traffic may be redirected to the attacker’s network, effectively denying service to the user.</p> <h2 id="learn-more-about-kentiks-bgp-monitoring-tools">Learn more about Kentik’s BGP monitoring tools</h2> <p>Visit our <a href="https://www.kentik.com/solutions/bgp-route-monitoring/" title="BGP Route Monitoring Solutions page at Kentik">BGP route monitoring solutions</a> page to learn more about how Kentik’s BGP monitoring features can help you visualize, optimize, and secure BGP routing for your networks.</p> <h2 id="conclusion">Conclusion</h2> <p>This blog post is just a preview of some of the features of Kentik’s new BGP monitoring solution. We’d love to hear from you about other use cases we can solve for. Please reach out (<a href="https://www.kentik.com/contact/">here</a> or via your account team) today if you’d like to set up a conversation with our product and engineering team.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 30px;"></div> <p>Sign up for a free trial today to start using BGP monitoring capabilities in Kentik.</p> <p><button as="SignupButton">START NOW</button></p><![CDATA[The peering coordinators’ toolbox - part 3]]><![CDATA[In part three of our series, you’ll see how to improve CDN and peering connectivity. Learn about peering policies and see how to use data to support your peering decisions. ]]>https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-3https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-3<![CDATA[Nina Bargisen]]>Fri, 18 Feb 2022 05:00:00 GMT<ul> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1/" title="Part 1 - Economics: How to understand the cost of your connectivity">Part 1 - Economics: How to understand the cost of your connectivity</a></li> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-2" title="Part 2 - Performance: How to understand the services used by your customers">Part 2 - Performance: How to understand the services used by your customers</a></li> <li><strong>Part 3 - Analyze: How to improve your CDN and peering connectivity</strong></li> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-4/" title="Part 4 - Select: How to find the right transit provider">Part 4 - Select: How to find the right transit provider</a></li> </ul> <p>The peering coordinator’s toolbox is a blog series where we dive deep into the peering coordinator workflow and show how the right tools can help you be successful in your role.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 0px; margin-top: 40px;"></div> <h2 id="part-3-analyze">Part 3: Analyze</h2> <h3 id="how-to-improve-cdn-and-peering-connectivity">How to improve CDN and peering connectivity</h3> <p>In <a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1/">part 1</a> and <a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-2/">part 2</a> of our series, we focused on how to understand the most important services on your network.
We also identified the sources and destinations for these services by looking at which CDNs are in the digital supply chain. In this post, we’ll assume you’ve already identified which CDNs to embed and which ones to peer with. That means you already have the beginnings of your peering-prospect list. The next step is to find the peering prospects for the rest of your traffic.</p> <h3 id="finding-peers">Finding peers</h3> <p>Kentik’s Discover Peers tool shows you the top potential peers based on traffic volume. You can restrict the tool to only look at traffic from your transit interfaces or you can include all your external-facing interfaces. You can also exclude ASNs that are not interesting for the analysis (for example, those you have already found a solution for).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5w7FBgv6Jjj5z5KzHAlP1k/e7c3db1927a52553244645bfec71985a/edge-data-driven-peering.png" style="max-width: 800px;" class="image center" thumbnail alt="Discover peers" /> <p>There are a couple of dimensions to use when looking for peering prospects. In <a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-2/">part 2</a> of the series, we talked about the need to balance cost and quality. In that example, you’ll want to see if you can move traffic to a more cost-effective connection according to the analysis you have from <a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1/">part 1</a> or create a shorter path for the traffic by peering. But how do you get that level of detail?</p> <p>Once you have your raw prospect list, you can start evaluating each of the networks on the list.</p> <p>Using the Discover Peers tool, you can filter the list of prospects based on whether they have an open, selective or restrictive peering policy. This is based on registration information in <a href="https://www.peeringdb.com/">PeeringDB</a>, the global community database where networks that peer register and maintain their peering details.</p> <h4 id="the-basics-of-peering-policies">The basics of peering policies</h4> <p>A peering policy is a declaration of a network’s intentions to peer. A network can state if it has the following:</p> <ul> <li><strong>Open peering policy</strong> - will peer with everyone and everywhere possible</li> <li><strong>Selective peering policy</strong> - will generally peer, but has a set of requirements that define how mutual benefit can be gained from peering</li> <li><strong>Restrictive peering policy</strong> - will peer, but is not seeking new peers and will generally decline any requests</li> </ul> <p>The peering prospects with an open peering policy are straightforward — it’s just a question of reaching out and agreeing on where and when.</p> <p>It’s most common that networks agree to peer in all mutual locations. That’s because, well, it’s mutually beneficial. Most networks will hand off the traffic as soon as possible. And peering in all mutual locations makes it more likely that the burden of carrying the traffic is fairly shared. When that is not the case — for example, when an eyeball network peers with a content network — the parties work out an agreement on where to peer so it makes sense for both parties.</p> <p>As for the networks with a selective peering policy, they usually document their requirements and lay out their justification for saying no to peering.
The most common requirements or restrictions are:</p> <ul> <li><strong>Ratio</strong> - the network requires a certain balance between the sent and received traffic from the potential peer</li> <li><strong>Volume</strong> - the network requires a certain volume to justify the increased workload involved in setting up and maintaining connections</li> <li><strong>Locations</strong> - the network requires a certain geographic overlap between the networks so they can hand off traffic most efficiently and save bandwidth within their own network</li> <li><strong>Customers of an existing peer</strong> - if your network is a customer of an existing peer of your peering prospect, your traffic already reaches them over a settlement-free connection, and it might be needed to maintain the traffic ratio between your peering prospect and your provider</li> </ul> <h4 id="how-to-manage-selective-peering-policies">How to manage selective peering policies</h4> <p>You may be wondering: Can I do anything so that networks with a selective peering policy (where I don’t meet all the requirements) will agree to peer with me?</p> <p>The thing to remember is that the policy creates a basis for the prospect to say no. That doesn’t necessarily mean that a peering relationship won’t be of mutual benefit. It just makes it your job to prove it. You might also need to implement some changes in your network and your routing to meet their criteria.</p> <p><a href="https://www.kentik.com/product/edge/">Flow tools</a>, like those that Kentik offers, are essential for documenting exactly what the volume and ratio of the <em>current</em> traffic between you and your peering prospect are. And you already did this when you identified this network as a desirable peer.</p> <p>But how can you identify potential opportunities to increase the volume in one direction to fix a ratio or to simply boost the traffic? To do this, you need to identify traffic flows that could move to the potential direct connection between you and the prospect, but that today are not routed through the prospect’s network at all.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4ujN2cr2w6bi8foMMOOIHR/04490be09b390709f17d7df3268f55df/peering-prospects.png" style="max-width: 450px;" class="image center no-shadow" alt="Identify potential peers" /> <p><strong>You need to be able to answer: “Who are the customers of my peering prospect?”</strong></p> <p><a href="https://www.kentik.com/product/market-intelligence/">Kentik Market Intelligence (KMI)</a> can answer this question for you. With KMI, you can find information about the customers of a network. Using the Kentik Data Explorer, you can also see the traffic to and from your network to each of the multi-homed customers. With that, you should look for traffic that flows via a path outside of your potential peer’s network.</p>
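<p>Conceptually, this is a simple join over flow data: take the traffic destined to the prospect’s customers and keep only what does not already traverse the prospect. The Python sketch below illustrates the idea; the flow records, ASNs and customer list are all invented for the example, and this is not a Kentik API:</p> <pre>
PROSPECT_ASN = 64500
PROSPECT_CUSTOMERS = {64510, 64511}   # e.g., sourced from KMI (invented ASNs)

# simplified flow records: destination ASN, AS path seen in BGP, average Mbps
flows = [
    {"dst_asn": 64510, "as_path": (64496, 3356, 64510), "mbps": 600.0},
    {"dst_asn": 64510, "as_path": (64496, 64500, 64510), "mbps": 250.0},
    {"dst_asn": 64511, "as_path": (64496, 3356, 64511), "mbps": 400.0},
]

# traffic to the prospect's customers that bypasses the prospect today;
# this is volume a new peering session could pull onto the direct link
shiftable = sum(
    f["mbps"]
    for f in flows
    if f["dst_asn"] in PROSPECT_CUSTOMERS and PROSPECT_ASN not in f["as_path"]
)
print(f"~{shiftable / 1000:.1f} Gbps could shift onto a direct peering")
</pre> <p>The worked example below does the same analysis with KMI and the Kentik Data Explorer.</p>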
<p>In the example below, we’d like to peer with AS12345, which has a selective peering policy, but our current traffic to them is not high enough for us to meet their traffic volume requirement.</p> <p>With KMI, we can see their customer AS4567 after we exclude our mutual customers and their single-homed customers:</p> <img src="//images.ctfassets.net/6yom6slo28h2/38RbZTBZ4WexkWC5SSDkky/75ac2e7d529a6f498e4f71c7adf0ba4f/peering-prospects-kmi.png" style="max-width: 800px;" class="image center" thumbnail alt="Find peering prospects with Kentik Market Intelligence" /> <p>A quick glance at Customer 1 shows us their provider list:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5tsECPupz4YvFjyP51XNnz/7e142a5d382df9a5af93ed61f50a4b9e/peering-customer1-providers.png" style="max-width: 600px;" class="image center" thumbnail alt="List of potential peers" /> <p>We then use the Kentik Data Explorer to investigate if we have traffic to AS4567 that can help us. In this example, we search for traffic going to AS4567.</p> <p>If we break that out by the BGP next-hop and the second hop ASN, we find traffic going through our transits to their transit AS3356:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5NpXpi1lW5qsOLfp8bNpxt/e2c29788e4c7a03499308dfc05e6ab6b/peering-bgp-next-hop.png" style="max-width: 800px;" class="image center" thumbnail alt="Use Data Explorer to find peers" /> <p>We can now use this data to argue that a new peering with prospect AS12345 will increase the total traffic from us to them by ~1 Gbps to better meet their traffic requirement. In addition, they might increase their revenue from the customer.</p> <p>Now you might be asking: How can I select the best transit provider to support my peering strategy? And how can I minimize my total network cost? Stay tuned for part 4 of our peering coordinator’s series, where we’ll answer those questions.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 30px;"></div> <p>Ready to try it yourself? Start a 30-day free trial.</p> <p><button href="https://www.kentik.com/go/get-started/">START NOW</button></p><![CDATA[The peering coordinators’ toolbox - part 2]]><![CDATA[In part two of our peering series, we look at performance. Read on to see how to understand the services used by your customers. 
]]>https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-2https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-2<![CDATA[Nina Bargisen]]>Fri, 11 Feb 2022 05:00:00 GMT<ul> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1/">Part 1 - Economics: How to understand the cost of your connectivity</a></li> <li><strong>Part 2 - Performance: How to understand the services used by your customers</strong></li> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-3/">Part 3 - Analyze: How to improve your CDN and peering connectivity</a></li> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-4/" title="Part 4 - Select: How to find the right transit provider">Part 4 - Select: How to find the right transit provider</a></li> </ul> <p>The peering coordinator’s toolbox is a blog series where we dive deep into the peering coordinator workflow and show how the right tools can help you be successful in your role.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 0px; margin-top: 40px;"></div> <h2 id="part-2---performance">Part 2 - Performance</h2> <h3 id="how-to-understand-the-services-used-by-your-customers">How to understand the services used by your customers</h3> <p><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1/">In part 1 of this series</a>, we created an overview of our interconnection costs. You’re now ready for the next step in the peering-coordinator workflow: understanding traffic on the network.</p> <p>No matter which kind of network you have, you’ll most likely have a mix of video, SaaS, games and software updates running over your network. And you need to understand what they are and how they’re served to your customers — both internal and external.</p> <p><a href="/blog/ott-service-tracking-gets-a-major-facelift-and-update/">Kentik OTT Service Tracking’s classification engine</a> has classified more than 350 different OTT (over-the-top) services and over 50 CDNs (content delivery networks) — and is continuing to add more. This provides you with an easy way to get started.</p> <h3 id="which-services-are-the-most-important-for-my-network">Which services are the most important for my network?</h3> <p>In the image below, you can see an overview of services running on your network and see how much traffic each of the services generates. You can also see which networks are delivering the services to you.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4PWJ22RUGEBRUkwYqnIrJq/1a600a4c843c0db2850e3150fecb9931/ott-services-overview.png" style="max-width: 800px;" class="image center" thumbnail alt="OTT service tracking" /> <p>You can also drill down on each of the categories to see how each of the services is served in detail. Let’s take, for example, Zoom, one of the most popular video conferencing solutions. In the example below, you’ll notice that the majority of this traffic runs over your transit connection (Off-Net Transit).</p> <img src="//images.ctfassets.net/6yom6slo28h2/47FLI8xW95K5s9GlZkpeWu/0e88077751d38c4122151ba8699b8448/ott-traffic-zoom.png" style="max-width: 800px;" class="image center" thumbnail alt="Video conferencing network traffic" />
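<p>Under the hood, a view like this is an aggregation over flow records that have been enriched with a service label. The Python fragment below is a rough illustration only; real OTT classification is far more involved, and the records and labels here are invented:</p> <pre>
from collections import defaultdict

# flow records already enriched with an OTT service and a connectivity type
flows = [
    {"service": "Zoom", "via": "off-net transit", "mbps": 800.0},
    {"service": "Zoom", "via": "private peering", "mbps": 150.0},
    {"service": "Netflix", "via": "embedded cache", "mbps": 2400.0},
]

totals = defaultdict(float)
for f in flows:
    totals[(f["service"], f["via"])] += f["mbps"]

for (service, via), mbps in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{service:8s} via {via:16s} {mbps:7.1f} Mbps")
</pre>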
<h3 id="evaluate-the-quality-of-the-service">Evaluate the quality of the service</h3> <p>There is a chance that we can improve the price and/or the quality of this traffic by moving more to peering. Drilling into the performance analysis per subscriber below, you can see that the bits per subscriber are significantly higher when the traffic is served via private peering. This indicates that moving the Zoom traffic from your transit to private peering will improve the quality of the Zoom service for your end users.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5RTaQfafq3M1VQ7XCNFTHs/f27399eb37fa485d0ec42f5941c0a195/subscriber-drilldown.png" style="max-width: 800px;" class="image center" thumbnail alt="Performance analysis per subscriber" /> <h3 id="ott-service-and-cdns">OTT service and CDNs</h3> <p>Below you’ll see the digital supply chain view for the Zoom traffic, which will show you how the traffic is served. In this example, the traffic is served via a CDN called Big CDN, so let’s focus on how we can improve all the traffic from Big CDN, including Zoom traffic, by moving away from using transit for this CDN.</p> <img src="//images.ctfassets.net/6yom6slo28h2/B7AoPk9fgMZ7CdZW5rVPS/bbdf324ca92fdfb9bf726f75a4caf1da/digital-supply-chain.png" style="max-width: 800px;" class="image center" thumbnail alt="Digital supply chain" /> <h3 id="designing-a-solution">Designing a solution</h3> <p>We will first need to think about how the traffic flows internally in your network. The challenge for this exercise is to identify the traffic destinations.</p> <p>Kentik offers a number of options so that you can configure the view based on the choices your architects have made. <a href="https://kb.kentik.com/v3/Cb06.htm">Custom Dimensions</a> is the key feature to use. Here you can define and save your custom dimensions and then use them for filtering. You can create dimensions that identify the customers in regions of your network. The options could be IP prefixes, ports, BGP attributes, or even devices if you collect flow from the customer edge.</p> <p>In our example, we have four regions in the network (A, B, C and D), and the external connectivity is located at the sites 1, 2 and 3 from the digital supply chain above. Customers in the four regions are identified by their IP addresses.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3GH2eKvIgMToAELovDLXO4/7e29143567e7c0eba5a3e1504112d2d6/connectivity-4-regions.png" style="max-width: 500px;" class="image center no-shadow" /> <p>The result of the traffic analysis from Big CDN shows:</p> <table style="width: 50%;"> <tbody> <tr> <td>Region A</td> <td>15 Gbps</td> </tr> <tr> <td>Region B</td> <td>21 Gbps</td> </tr> <tr> <td>Region C</td> <td>10 Gbps</td> </tr> <tr> <td>Region D</td> <td>35 Gbps</td> </tr> </tbody> </table> <p>In this case, it will make sense to explore whether private peering to Big CDN in each of the four regions is possible. In part 3 of this blog series, we’ll focus on the process of identifying potential peers and discuss how to prepare a request to peer. But for now, let’s just go with the notion: “We request private peering to Big CDN”.</p> <p>Let’s assume Big CDN replies back that they are present in the same data center where our Site 3 is located, but they’re not in any other locations where we have a PoP.</p> <p>Big CDN instead offers servers from their CDN that we can install in our PoPs.</p> <p>A number of CDNs, in particular those built and operated by content providers, offer servers free of charge to install and connect directly to ISPs’ networks. So, here is another consideration to make: Will the cost of power and space for servers be higher or lower than the cost of building out your network to carry the traffic from your peering or transit edge to your customers?</p>
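<p>That question is worth a quick back-of-the-envelope pass before any deeper modeling. In the Python sketch below, every figure (power, space, port and backbone cost per Gbps) is an invented placeholder, so treat it as the shape of the comparison rather than real numbers:</p> <pre>
def embed_cost_per_month(servers, power=120.0, space=80.0, port=50.0):
    """Monthly cost of hosting embedded cache servers in one region
    (all figures are invented placeholders; substitute your own)."""
    return servers * (power + space + port)

def backbone_cost_per_month(gbps, cost_per_gbps=200.0):
    """Monthly cost of hauling the same traffic across the backbone
    (cost per Gbps is an invented placeholder)."""
    return gbps * cost_per_gbps

# region D from the example table: 35 Gbps of Big CDN traffic, four caches
embed = embed_cost_per_month(servers=4)
haul = backbone_cost_per_month(gbps=35)
print(f"embedded caches ${embed:,.0f}/mo vs backbone haul ${haul:,.0f}/mo")
</pre>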
<blockquote style="border-left: 4px solid #53b8e2">A content delivery network, or CDN, is a network of distributed servers with content, a system to distribute the content to where it’s needed, and a system to steer the consumers to the closest server with content. Most steering systems, however, will prefer servers embedded in a network over those reached via peering, so in the case of our example, we need to make sure that Big CDN supports serving some users via peering and some via the embedded servers.</blockquote> <img src="//images.ctfassets.net/6yom6slo28h2/43hoxgMnv8KnN11gg1zXaq/db0dbe2815fcc5d1f5b4dc9a3711429b/cdn-diagram.png" style="max-width: 500px;" class="image center no-shadow" /> <p>In the case of our example, installing servers in regions A, B and D will remove traffic from Big CDN from the backbone links connecting the four regions.</p> <h3 id="evaluate-the-solution">Evaluate the solution</h3> <p>Now we’re able to compare:</p> <ul> <li>The cost of the transit where Zoom currently is served</li> <li>The cost of the private peering to Big CDN</li> <li>The cost of power, space and card/router share for the embedded servers in three regions</li> <li>Estimated savings on the needed capacity of the backbone links connecting the four regions</li> </ul> <p>Given that we expect to improve the quality of the service for our customers, we’re now ready to decide whether the savings, or the potential extra cost of the service, make the solution worth implementing. And once implemented, <a href="/blog/why-and-how-to-track-connectivity-costs/">Kentik Connectivity Costs</a> will make sure you can track the impact of the solution on your cost.</p> <p>Stay tuned for the next part of this series, where you’ll learn about how to peer off the right amount of the rest of your traffic, work out where you have mutual sites with the peering prospects, and see how you can provide evidence of the value of the peering relationship.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 30px;"></div> <p>Ready to try it yourself? Start a 30-day free trial.</p> <p><button href="https://www.kentik.com/go/get-started/">START NOW</button></p><![CDATA[Network AF, Episode 9: Learning from great mentors and by breaking things with Hank Kilmer]]><![CDATA[In this episode of Network AF, Avi talks to Hank Kilmer, VP of IP engineering at Cogent. Hear more about networking in the 80s and learning by breaking things. ]]>https://www.kentik.com/blog/network-af-episode-9-great-mentors-and-breaking-things-hank-kilmerhttps://www.kentik.com/blog/network-af-episode-9-great-mentors-and-breaking-things-hank-kilmer<![CDATA[Michelle Kincaid]]>Fri, 11 Feb 2022 05:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="https://images.ctfassets.net/6yom6slo28h2/3Ycw5nmAUkApLmMGlAwBWr/818f6bc141b82e08cd9b0d2599780448/hank-kilmer.jpg" style="max-width: 130px;" class="image right" alt="Hank Kilmer" /></a></p> <p>In a <a href="https://www.kentik.com/network-af/">new episode</a> of the Network AF podcast, your host Avi Freedman interviews Hank Kilmer, VP of IP engineering at <a href="https://www.cogentco.com/en/">Cogent</a>.</p> <p>Hank has been running major internet backbones since the early 90s.
He joined Cogent in 2011, and prior to that, held leadership positions with UUNET (now Verizon), Sprint, Digex, Abovenet and Terrapin Communications.</p> <p>In this episode, Avi and Hank discuss:</p> <ul> <li>Networking in the 80s</li> <li>Mentorship and giving feedback</li> <li>Learning by breaking things</li> <li>Cogent and its amazing customer service</li> <li>Advice to Hank’s younger self</li> </ul> <h3 id="networking-in-the-80s">Networking in the 80s</h3> <p>In the late 80s, Hank was studying at Rutgers University and needed a way to pay for college. That’s when he started making money writing device drivers. And eventually, that led him to switch from studying electrical engineering to a degree in computer science.</p> <p>With the switch, Hank says he soon noticed that university staff were struggling to interlink computers and printers while setting up networking infrastructure. So he asked if he could help. The Rutgers team said if he could fix what they were working on before the end of the day, then he had a job. Hank sat down, opened the manual, and by the end of the day, he’d fixed the problem and got himself hired.</p> <p>Hank says the role itself was a mixture of running the university’s networking, getting into connectivity with other universities and the ARPANET, repairing hardware, and other miscellaneous tasks. “General IT,” Avi jokes.</p> <p>Hank says he worked with DECnet, VAX, Apollos, and other systems and protocols. This prepared him for his next role as a CIS admin for a software development house, followed by an opportunity to join UUNET.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/107878c2"></iframe> <h3 id="mentorship-and-giving-feedback">Mentorship and giving feedback</h3> <p>In a conversation about mentorship, Hank talks about the people who gave him a chance at Rutgers, including Alex Latzko, Rick Crispin, Mel Pleasant, and others. He says he wouldn’t have been successful without them.</p> <p>Hank talks a bit about how we’re taught in school — with lessons and assignments. He says not enough attention is paid to teaching soft skills, like how to talk to another person. Specifically, with feedback, Hank mentions that depending on how feedback is delivered, a person can take it several different ways. He believes we need to have a better understanding of how to communicate, with each person’s feedback preferences in mind, to more successfully give and receive feedback.</p> <p>“The point is: how you interact with people matters. More important than what you know,” he says.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/8fe9b33d"></iframe> <h3 id="learning-by-breaking-things">Learning by breaking things</h3> <p>Hank shares a story from his UUNET days, locking everyone out of a router, owning up to the mistake, and then putting in the hard weekend work to fix a problem he created.</p> <p>Avi asks if it was easier to break things in the past and learn from the experience, versus now, with today’s internet that is much more critical.
Hank says there are definitely things we’re not able to do now, but that companies these days are also better at testing – figuring out how to miniaturize and reproduce entire networks in lab environments.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/7272e7df"></iframe> <h3 id="cogent-and-its-amazing-customer-service">Cogent and its amazing customer service</h3> <p>In a conversation about Cogent’s business and what its infrastructure teams look like, Hank talks about Cogent’s growth, but says that at its heart, the company still runs an old-school ISP. A successful one at that: Cogent says that 20% of all internet traffic is on their networks!</p> <p>Hank mentions that a large part of Cogent’s success is driven by the company’s continued focus on delivering amazing service to its customers. This leads to Avi and Hank talking about some of the past companies they’ve watched or been a part of — companies that no longer exist because they tried to solve too many problems and spent more money than they could afford.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/64669384"></iframe> <h3 id="advice-to-hanks-younger-self">Advice to Hank’s younger self</h3> <p>Hank says if he could tell his younger self something, it would be to pay closer attention to the people around him earlier in his life. He talks about how, eventually, he learned this. But he says knowing that earlier would have helped him tap into more people’s experiences, personalities and knowledge.</p> <p>Connect with Hank on <a href="https://www.linkedin.com/in/hank-k-a70101/">LinkedIn</a>, and listen to his <a href="https://www.kentik.com/network-af/">full episode</a>.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/848f7666"></iframe><![CDATA[Launching a labor of love, Kentik Market Intelligence]]><![CDATA[Just in time for Valentine’s Day, we’re announcing our newest labor of love: Kentik Market Intelligence (KMI), a new product that spells out the internet ecosystem.]]>https://www.kentik.com/blog/launching-a-labor-of-love-kentik-market-intelligencehttps://www.kentik.com/blog/launching-a-labor-of-love-kentik-market-intelligence<![CDATA[Doug Madory]]>Thu, 10 Feb 2022 05:00:00 GMT<p>When it comes to the internet, understanding the global ecosystem can be tough. There’s a lot of manual work that service providers and digital businesses have traditionally put into finding the best way to reach customers over IP networks. And more work is needed for benchmarking against competitors and finding the best relationships for peering.
But it doesn’t have to be complicated…</p> <p>Just in time for Valentine’s Day, we’re announcing our newest labor of love: <a href="/product/market-intelligence/">Kentik Market Intelligence (KMI)</a>, a new product that spells out the internet ecosystem.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2ueC1vs6kZLsMLRMNX0PUP/c243b7af1734019f1a5283289e8d68f9/kmi-relationship-status.png" style="max-width: 500px;" class="image center" alt="Peering relationships" /> <h3 id="what-is-kmi">What is KMI?</h3> <p>KMI enables users to easily navigate the insights and dynamics hidden in the global routing table by identifying the ASes operating in any given geography, determining their providers, customers and peers, and alerting you when there are changes to those relationships.</p> <p>Additionally, KMI produces rankings based on the volume of IP space transited by ASes in different geographies. Using tables and charts, KMI offers a global view of the internet out-of-the-box without any configuration or setup.</p> <p>KMI uses public BGP routing data to rank ASes based on their advertised IP space. Rankings are updated daily and come in various forms: total customer base, customer type (retail, wholesale, backbone), as well as peering. Your AS is always highlighted for easy comparison in each geography.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3DMEL2UEBDyxFqRflNlBes/40921a614134a836327570d607031c47/kmi-belgium-short.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">KMI has uses for multiple roles within internet service providers and digital businesses.</div> <h3 id="key-benefits-for-netops-and-peering-coordinators">Key benefits for NetOps and peering coordinators</h3> <p>KMI removes the guesswork in planning discussions around extending the reach of the network. Now the top retail, wholesale or backbone providers are known in any country, region or continent. Interconnection managers can make better decisions based on data they did not have easy access to previously.</p> <ul> <li>Determine which service providers connect to the most customer networks, both directly and through peering relationships.</li> <li>Quickly understand who are the customers, providers, and peers of any AS in any country.</li> <li>Observe the reach and relative market share of each ASN in geographies of interest.</li> </ul> <h3 id="key-benefits-for-marketing-and-sales">Key benefits for marketing and sales</h3> <p>Sales prospecting used to rely heavily on parsing routing registry records and word-of-mouth. KMI reveals sales opportunities by listing the transit relationships in a target market. For example, you can discover new prospects by analyzing the current customers of wholesale or backbone operators that provide service in a particular market. 
Additionally, KMI will mark ASes that are critically dependent on a single upstream provider (i.e., single-homed) as well as which ASes are presently classified as customers of your AS (i.e., mutual).</p> <ul> <li>Know how you compare to other providers in selected geographies.</li> <li>Identify sales opportunities by knowing your competitors’ customers in any market, including whether or not they are critically dependent on a single upstream provider.</li> <li>Market your network connectivity based on objective ranking data in a geographical market.</li> <li>Monitor how your competitors’ rankings are evolving.</li> </ul> <h3 id="how-is-kmi-calculated">How is KMI calculated?</h3> <p>KMI rankings are based on the relative amount of IP space transited by each AS for a given market. While we cannot directly measure the volume of internet traffic routed by any given AS on the internet, we can estimate the size of the customer base by determining how much IP address space it transits relative to other ASes.</p> <p>All models are based on sets of assumptions, and the key assumption for this model is that the volume of transited internet traffic is correlated to the amount of transited IP address space. In other words, the more transited IP address space, the more potential for transited internet traffic and the higher the transit score.</p> <p>The following computation is done on a regular basis (initially once a day, more frequently in the future) using BGP data collected from the public <a href="http://www.routeviews.org/routeviews/">RouteViews</a> project, which collects BGP data from global vantage points around the world. There are four steps:</p> <ol> <li>AS-AS relationship classification: We classify every AS-AS adjacency as either transit or peering through a machine learning algorithm that takes several factors into account. More details on our methodology can be requested.</li> <li>Routed prefixes are assigned a score based on prefix length.</li> <li>Prefix scores are summed by geolocation and score type.</li> <li>Rank ASes using transit scores.</li> </ol> <p>Every prefix in the global routing table is analyzed and contributes to a set of transit scores by geolocation and score type. Therefore, every AS that is either originating a prefix or transiting one will have some type of transit score for at least one geography. More details on how KMI is calculated are available upon request.</p>
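<p>To give a feel for steps 2 through 4 (assuming step 1, the transit/peering classification, has already been done), here is a toy version of the scoring in Python. The prefix-length weighting below, which scores a route by the number of addresses it covers, is a simplifying assumption made for illustration and not the production KMI model; the prefixes and ASNs are invented:</p> <pre>
from collections import defaultdict

def prefix_score(prefix):
    """Toy stand-in for step 2: weight a route by the address space it covers."""
    length = int(prefix.split("/")[1])
    return 2 ** (32 - length)

# (prefix, transiting ASN, geolocation) observations from BGP data (invented)
transited = [
    ("10.0.0.0/16", 64500, "BE"),
    ("10.1.0.0/24", 64500, "BE"),
    ("10.2.0.0/24", 64501, "BE"),
]

scores = defaultdict(int)               # step 3: sum scores per (ASN, geo)
for prefix, asn, geo in transited:
    scores[(asn, geo)] += prefix_score(prefix)

# step 4: rank ASes by transit score within each geography
for (asn, geo), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"AS{asn} ({geo}): transit score {score:,}")
</pre>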
<p>KMI is now available in our Service Provider Analytics package. Please do not hesitate to reach out to our team <a href="https://www.kentik.com/go/get-demo/">for a demo</a> or to start your <a href="https://www.kentik.com/go/get-started/">30-day free trial</a>.</p> <p><button href="https://www.kentik.com/go/get-started/">START NOW</button></p><![CDATA[The peering coordinators’ toolbox - part 1]]><![CDATA[In this blog series, we dive deep into the peering coordinator workflow and show how the right tools can help you be successful in your role. In part 1, we discuss the economics of connectivity. ]]>https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-1<![CDATA[Nina Bargisen]]>Fri, 04 Feb 2022 05:00:00 GMT<ul> <li><strong>Part 1: Economics: How to understand the cost of your connectivity</strong></li> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-2/">Part 2: Performance: How to understand the services used by your customers</a></li> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-3/">Part 3 - Analyze: How to improve your CDN and peering connectivity</a></li> <li><a href="https://www.kentik.com/blog/the-peering-coordinators-toolbox-part-4/" title="Part 4 - Select: How to find the right transit provider">Part 4 - Select: How to find the right transit provider</a></li> </ul> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p>The peering coordinator’s toolbox is a blog series where we’ll dive deep into the peering coordinator workflow and show how the right tools can help you be successful in your role.</p> <p>A peering coordinator is the person responsible for the way an IP network connects to the rest of the internet. “Peering” is the term used for interconnections between two IP networks where the two networks provide access to each other’s networks and customers but do not allow traffic to transit to the rest of the internet.</p> <p>This series will walk you through each step of the flow, from getting the full overview of your interconnection-related cost, to learning how to determine which networks you need to seek peering with, to understanding which transit provider will best suit your needs. The series will give readers insight into why the internet is the network of networks.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 0px; margin-top: 40px;"></div> <h2 id="part-1---economics">Part 1 - Economics</h2> <h3 id="how-to-understand-the-cost-of-your-connectivity">How to understand the cost of your connectivity</h3> <p>One of the basic elements of a successful peering strategy is to understand internet connectivity costs. Comparing your transit price with the cost of a port to an internet exchange (IX) will not give an adequate picture since a number of other factors need to be taken into account.</p> <h3 id="cost-elements">Cost elements</h3> <p>The most straightforward elements of your interconnection costs are:</p> <ul> <li>Traffic costs, which often are divided into a fixed part (the commit) and a variable part (the burst), the latter of which is determined by the volume that exceeds the commit. (It’s also common to not have a commit.)</li> <li>Port fee</li> <li>Cross-connect recurring fee</li> <li>Membership fee in the case of an IX port</li> <li>Transport costs if you are connecting using metro fiber or waves</li> </ul> <p>All of the costs above are easy to collect on a monthly basis. For the traffic cost, you will need to collect the traffic data from your network. The most common method of charging traffic volume is to charge per Mbps of the 95th percentile of the highest direction. This way, you are not paying for short bursts of traffic, and the provider is getting paid for the network they need to secure to carry your traffic. The charging method will be part of your contract, and the mix of fixed and variable costs will depend on your traffic profile and your negotiation skills.</p>
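<p>The 95th percentile calculation itself is mechanical and worth seeing once. The Python sketch below takes a month of 5-minute samples (the higher of in/out per interval), discards the top 5%, and bills the next-highest sample against a commit-plus-tiers rate card. The rates mirror the example table later in this post; the sample data is invented:</p> <pre>
def p95_mbps(samples_mbps):
    """Billable rate: drop the top 5% of 5-minute samples, keep the next one."""
    ordered = sorted(samples_mbps, reverse=True)
    return ordered[int(len(ordered) * 0.05)]

def monthly_charge(p95, commit_mbps=1000, commit_rate=0.20,
                   tier1_limit_mbps=5000, tier1_rate=0.15, tier2_rate=0.10):
    """Commit plus two burst tiers, priced per Mbps (illustrative rate card)."""
    charge = commit_mbps * commit_rate
    if p95 > commit_mbps:
        charge += (min(p95, tier1_limit_mbps) - commit_mbps) * tier1_rate
    if p95 > tier1_limit_mbps:
        charge += (p95 - tier1_limit_mbps) * tier2_rate
    return charge

# 30 days x 288 samples/day = 8,640 intervals; 400 bursty ones (invented data)
samples = [1200.0] * 8240 + [6000.0] * 400
rate = p95_mbps(samples)
print(f"95th percentile: {rate:.0f} Mbps -> ${monthly_charge(rate):,.2f}/month")
</pre> <p>With 8,640 five-minute intervals in a 30-day month, the top 432 samples are discarded, so roughly 36 hours of bursting per month does not affect the bill under this scheme.</p>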
<p>Other costs associated with a transit, peering or IX connection are the housing costs for the point of presence (PoP) where the port is terminated, the hardware costs and, in some cases, the backbone costs to connect the PoP to the rest of your network. To estimate these, you need to understand how your planning and finance colleagues model the network cost. You also need to understand how to convert the CAPEX, i.e., the one-time costs when you buy the hardware, into OPEX, which are recurring costs that you can use to calculate a monthly cost for your interconnections.</p> <p>When you buy hardware for the network, the value is entered into the books as an asset, but this asset will immediately start to lose value. The depreciation period for hardware is typically three to five years.</p> <p>Say you pay $10,000 for a line card. Your finance team assumes it will be worth $500 when it is fully depreciated after three years. The monthly depreciation rate can then be calculated like this:</p> <p><strong>($10,000 - $500) / 36 ≈ $264</strong></p> <p>The planning team can give you the fraction of the monthly depreciation rate that an interface on a router should be assigned. Use this number as an input to calculate the total cost of the connection.</p> <p>If your network is very uniform and your interconnection choices do not make much of a difference to your hardware spend, you can make the necessary comparisons without adding the hardware and transport costs to the connections in the tool.</p> <h3 id="creating-an-overview">Creating an overview</h3> <p>Once you have determined which cost to track for each connection, we recommend using the <a href="https://www.kentik.com/blog/why-and-how-to-track-connectivity-costs/">Kentik Connectivity Cost workflow</a> to register and track your total interconnection cost.</p> <p>In the example below, we have a monthly cost of $50 for the share of the line card, and $30 for the share of the router, that can be assigned to the “my best transit” interfaces.</p> <table> <thead> <tr> <th>My best transit</th> <th>Monthly cost</th> <th>Kentik cost element</th> </tr> </thead> <tbody> <tr> <td><strong>CAPEX</strong></td> <td></td> <td></td> </tr> <tr> <td>Share of line card per interface</td> <td>$50</td> <td>Interface cost</td> </tr> <tr> <td>Share of router per interface</td> <td>$30</td> <td>Interface cost</td> </tr> <tr> <td><strong>OPEX</strong></td> <td></td> <td></td> </tr> <tr> <td>Commit 1G $/Mbps</td> <td>$0.20</td> <td>Cost group commit</td> </tr> <tr> <td>1G &#x3C; burst &#x3C; 5G $/Mbps</td> <td>$0.15</td> <td>Cost group Tier 1</td> </tr> <tr> <td>5G &#x3C; burst $/Mbps</td> <td>$0.10</td> <td>Cost group Tier 2</td> </tr> <tr> <td>Cross-connect</td> <td>$100</td> <td>Interface cost</td> </tr> <tr> <td>Share of housing</td> <td>$5</td> <td>Interface cost</td> </tr> </tbody> </table> <p>The Kentik Connectivity Cost workflow begins by suggesting the potential providers based on your interface classifications. GTT is already configured in the tool as one of your providers.
“My best transit” was suggested based on four interfaces in two different PoPs.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4EStwMVjxYv07TdtCMd4Xf/2fd3fbbeae0b8c80ebda80753ba55930/configure-providers.png" style="max-width: 800px;" class="image center" thumbnail /> <img src="//images.ctfassets.net/6yom6slo28h2/6vLP8NESdOI9ZTVAnpR6T6/379a7bffca03733cef2eeb000e5eb071/my-best-transit.png" style="max-width: 800px;" class="image center" thumbnail /> <p>The next step is to add the contract details, then edit the auto-generated details and the empty cost group.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3WDDwWrDUCIoHVjUoQci2z/236698801975c33bac227245b151272d/my-best-transit-cost-group.png" style="max-width: 800px;" class="image center" thumbnail /> <p>The cost group is where we add the details we’ve gathered in the cost overview table. The “my best transit” cost structure is blended and consists of a commit and two tiers for the burst traffic. The tool supports different computational methods for volume-based accounting.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/pcQnHn5NjKnRFF6l8LCvu/dce8c118c48fc8c1b1ff5346d08e37dd/my-best-transit-cost-group2.png" style="max-width: 800px;" class="image center" thumbnail /> <p>In the interfaces section, we can add the per-interface-based CAPEX and OPEX from the table.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2OKhQZ9ScTHQxJaNlmxKb5/91922a906f3511ba6cdeb91198c0560d/my-best-transit-cost-group3.png" style="max-width: 800px;" class="image center" thumbnail /> <img src="//images.ctfassets.net/6yom6slo28h2/2R74bXyQfMCA2HCIormwQ2/1937f50405fc1958e168a1475ecdfab1/my-best-transit-cost-group4.png" style="max-width: 800px;" class="image center" thumbnail /> <p>Once everything is added, the workflow presents an overview of the minimum monthly spend for this provider — that is, the fixed costs that will occur even if no traffic is running via this provider.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4XKBJZEeF42CahU41oU3m7/4c85c0a522c8bb6d8e7b355ea63b7642/my-best-transit-cost-group5.png" style="max-width: 800px;" class="image center" thumbnail /> <p>When all of your connections are added to the workflow, you’ll be able to track the trends of cost over time and analyze the cost based on different dimensions like sites, connectivity types and providers.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4E6z87sr0BOC7mGoTUk9Wb/db786cc5e0499e1c956b730289114b88/connectivity-costs.png" style="max-width: 800px;" class="image center" thumbnail /> <p>The tool will give you an estimate of the cost for the current billing cycle and remind you when the next invoices from your providers should be paid.</p> <p>With that, you can easily spot which connections contribute the most to your interconnection spend. And you are now in the best possible position to start optimizing your cost.</p> <p>Stay tuned for part 2 of our “peering coordinator’s toolbox” series, where we’ll explore how you can evaluate and optimize your interconnections so you have the right balance between cost and performance.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 30px;"></div> <p>Ready to try it yourself? Start a 30-day free trial.</p> <p><button href="https://www.kentik.com/go/get-started/">START NOW</button></p><![CDATA[Network capacity planning made easy]]><![CDATA[We’ve overhauled our Capacity Planning workflow. Read how our newest developments make capacity planning easier and more intuitive than ever. 
]]>https://www.kentik.com/blog/network-capacity-planning-for-2022-made-easyhttps://www.kentik.com/blog/network-capacity-planning-for-2022-made-easy<![CDATA[Stephen Condon]]>Thu, 20 Jan 2022 05:00:00 GMT<p>Didn’t quite get to that task of capacity planning in December? Well, not to worry, this month Kentik has overhauled our <a href="https://www.kentik.com/solutions/usecase/network-capacity-planning/">Capacity Planning</a> workflow and is introducing a slew of new features to make capacity planning easier and more intuitive than ever before.</p> <p>Those who are familiar with Kentik will know that one of our core offerings is the ability to monitor and plan actual network and interface utilization. Network capacity planning involves tracking interface utilization and figuring out when interfaces are approaching the limits of their capacity. Traditionally, capacity planning has been a completely manual process with complex spreadsheets to pull in the data, aggregate it, run statistics on it, and trend it over time.</p> <p>With Kentik, the Capacity Planning workflow (Core » Capacity Planning) enables you to quickly assess the utilization of links relative to link capacity, leveraging SNMP data, so that you can take action before links become overloaded. You’ll see not only current traffic volume and daily forecasting, but also utilization trends and how soon your links are likely to reach capacity. (Note: we go beyond linear regression and use exponential regressions; a small sketch of the idea follows the use case list below.)</p> <p>You can assign interfaces to multiple groups and evaluate utilization independently for each group according to custom criteria. Kentik also enables you to set custom utilization indicators and alarms. These capabilities enable our customers to optimize the use of their infrastructure, delivering the best possible service while avoiding overutilization and overspending on capacity.</p> <h3 id="capacity-planning-use-cases">Capacity planning use cases</h3> <p>Besides keeping a close watch on utilization rates, the most common use cases for capacity planning include:</p> <ul> <li>Determining when to purchase or design upgraded bandwidth</li> <li>Viewing capacity to individual network providers to determine when to make network upgrades or decommission underutilized capacity for a given provider</li> <li>Viewing capacity of groups of customer interfaces to determine when to recommend network upgrades to a given customer</li> </ul>
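<p>On the forecasting note above: an exponential trend can be fit with ordinary least squares on the logarithm of utilization and then projected forward to the day the link fills. The Python sketch below is a bare-bones illustration of that idea, not Kentik’s forecasting model, and the utilization samples are invented:</p> <pre>
import math

def days_until_full(daily_pct, limit_pct=100.0):
    """Fit util = a * exp(b * t) by least squares on log(util)."""
    n = len(daily_pct)
    xs = range(n)
    ys = [math.log(u) for u in daily_pct]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    b = cov / var
    a = mean_y - b * mean_x
    if b > 0:
        t_full = (math.log(limit_pct) - a) / b   # solve a + b*t = log(limit)
        return t_full - (n - 1)                  # days past the last sample
    return None  # a flat or shrinking trend never fills the link

utilization = [52.0, 54.1, 56.3, 58.5, 60.9]     # daily peak %, invented
print(f"Link projected to fill in ~{days_until_full(utilization):.0f} days")
</pre>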
<h3 id="enhancements-to-our-capacity-planning-workflow">Enhancements to our Capacity Planning workflow</h3> <p>When we get recommendations from customers on how our products can be improved, we act! Our recently overhauled Capacity Planning workflow boosts network observability, enhances reporting, and improves consistency with other Kentik workflows.</p> <p>For those who are familiar with the Kentik service, what follows is a summary of the enhancements and new features of the Capacity Planning workflow.</p> <p><strong>Enhancements</strong>:</p> <ul> <li>A new configuration interface, now similar to our Connectivity Cost UI</li> <li>Improved visibility into the configuration for Dynamic Groups</li> <li>Clear iconography to highlight whether the Capacity Plan Group is static or dynamic</li> <li>Enhanced landing status screen for each Capacity Plan that gives a status overview and a summary that highlights capacity issues</li> <li>More filtering options and details; now users can zero in on the exact interface they want to investigate via the table header, filter text field, or by severity</li> <li>The ability to export capacity plans as a CSV individually and globally</li> <li>Improved visualizations including a utilization gauge for each interface listed in the plan and inline traffic graphs for any interface</li> <li>The ability to show the traffic graph with the utilization thresholds configured for each interface in the plan’s details</li> </ul> <p><strong>New features</strong>:</p> <ul> <li>Filtering options that are intuitive, including the ability to feature Capacity Plans on the landing page based on severity</li> <li>An SNMP configuration warning if any router-type devices don’t have SNMP configured</li> <li>New reports, covering either the entire set of Capacity Plans or individual plans, easily accessible as PDF, email and/or CSV</li> </ul> <h3 id="capacity-planning-in-kentik">Capacity Planning in Kentik</h3> <p>For those who aren’t customers or haven’t spent time in the Kentik portal, what follows are some visualizations of the Capacity Planning workflow. You can experience it yourself by <a href="#signup_dialog" title="Start a free trial">starting a trial at any time</a>.</p> <p>The configuration allows you to create groups of interfaces and place a capacity/cost configuration onto them. You can also set <em>warning</em> and <em>critical</em> ranges using the sliding scale.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1V6KcczNQR2Y31fG8yUhCa/77475138135024d7f93f3010b1adda48/capacity-planning-peering.png" style="max-width: 800px;" class="image center" alt="Capacity planning: Configuration" thumbnail /> <p>Dynamic groups can rely on interface attributes such as the device they are on, the site they are part of, the provider/customer they’re connected to, as well as the type of connection they are serving (transit, peering, backbone, customer, etc.).</p> <img src="//images.ctfassets.net/6yom6slo28h2/6bEyQXewoD9gbFhScikQcH/6b1b9456fb898131318cd7809a8ade2b/capacity-planning-peering2.png" style="max-width: 800px;" class="image center" alt="Capacity planning: Configuration" thumbnail /> <p>The capacity plan details page can be sorted by filters, column or severity.</p> <img src="//images.ctfassets.net/6yom6slo28h2/IAAIUKNHZbJqWuayqUMqw/f6109708bfa6a456afc8d1c0dea24833/capacity-planning-upgrades.png" style="max-width: 800px;" class="image center" alt="Capacity planning - sorting" thumbnail /> <p>Clicking on any interface row in the Capacity Plan’s details page unveils a traffic graph inline for the interface.
This traffic chart includes the utilization thresholds and colors the areas above them with the relevant severity color.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6tQTvjHndP873kvQEpUAuv/6dc13e9ae24fa44487bcf72b289fe1b5/capacity-planning-traffic-graph.png" style="max-width: 800px;" class="image center" alt="Capacity planning: traffic graph" thumbnail /> <p>Here’s what your Capacity Plan landing page will look like once you’ve set your parameters and groups. Hopefully, there will be less red and orange!</p> <img src="//images.ctfassets.net/6yom6slo28h2/5caWEe8TnOrERsjro8WvwM/c631216f915a55a7044607bc77bbcbb5/capacity-plans-overview.png" style="max-width: 800px;" class="image center" alt="Capacity planning overview" thumbnail /> <p>If you’d like to learn more about how to plan and monitor your network capacity using Kentik, <a href="https://www.kentik.com/go/get-demo/">reach out to our team</a> for a demo or to start your evaluation.</p><![CDATA[Tonga downed by massive undersea volcanic eruption]]><![CDATA[On Saturday, the Pacific island nation of Tonga was decimated by a massive volcanic eruption that was visible from space. At 5:27pm local time, the underwater volcano Hunga Tonga-Hunga Ha’apai unexpectedly erupted, sending ash and debris for hundreds of miles. As of this writing, all internet and telephone communications between Tonga and the rest of the world are still down.]]>https://www.kentik.com/blog/tonga-downed-by-massive-undersea-volcano-eruptionhttps://www.kentik.com/blog/tonga-downed-by-massive-undersea-volcano-eruption<![CDATA[Doug Madory]]>Tue, 18 Jan 2022 17:00:00 GMT<p>On Saturday, the Pacific island nation of Tonga was decimated by a <a href="https://www.washingtonpost.com/world/tonga-issues-tsunami-warning-after-undersea-volcano-erupts/2022/01/15/a0ea00b2-75d6-11ec-a26d-1c21c16b1c93_story.html">massive volcanic eruption</a> that was <a href="https://twitter.com/NWSHonolulu/status/1482255559072096256">visible from space</a>. At 5:27pm local time, the underwater volcano <a href="https://en.wikipedia.org/wiki/Hunga_Tonga">Hunga Tonga-Hunga Ha’apai</a> unexpectedly erupted, sending ash and debris for hundreds of miles.
As of this writing, all internet and telephone communications between Tonga and the rest of the world are still down.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2zMn0JbODQTdBUz6Lq8qlp/d9655536f52dec00311780740234963c/map_from_philbe.jpeg" style="max-width: 450px;" class="image center" thumbnail alt="Tonga Cable map" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Image credit: https://twitter.com/philBE2 </div> <p>As I <a href="https://twitter.com/DougMadory/status/1482377945318625294">tweeted out on Saturday</a>, based on Kentik aggregate NetFlow data, we observed a steep dropoff in traffic to Tonga at 5:30pm local time (just a couple of minutes after the eruption) and later a complete disconnection at 6:40pm local time.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2TJPiU9CS5vawojC86GIjm/5655364baa16c391a4c44c46e667900c/tonga.png" style="max-width: 450px;" class="image center" thumbnail alt="Tonga Disconnection" /> <p>In the days since the volcanic eruption, it <a href="https://www.stuff.co.nz/world/south-pacific/127517682/cable-repair-ship-to-set-sail-from-png-to-restore-communications-to-tonga?rm=a">has been confirmed</a> that the <a href="https://www.submarinecablemap.com/submarine-cable/tonga-cable">submarine cable connecting Tonga</a> to the global internet was broken approximately 37 km from shore. The submarine cable repair ship that will likely be tasked with making the repair will be the <a href="https://twitter.com/philBE2/status/1483513562115026944">CS Reliance</a>. Given that the Reliance is still docked at Port Moresby in Papua New Guinea, the repairs might not be made until the end of the month.</p> <p><strong>Activation of the Tonga Cable</strong></p> <p>Connecting remote island nations with high-speed access to the global internet is one of the great challenges of the modern internet. Without a submarine cable, these locations are reliant on satellite service, which is expensive, suffers from higher latency (at least for geostationary satellite service), and is capacity constrained (as compared to fiber optic cable).</p> <p>Satellite service in the Pacific is expensive because the coverage area is vast and the populations are relatively small. There isn’t an economy of scale to bring down the price of service.</p> <p>In the submarine cable industry, connections to these locales are referred to as “thin routes” - meaning the return on investment for constructing a submarine cable would likely be very small or thin. Given the risk involved in funding, fabricating and installing a submarine cable, private investors generally pass.</p> <p>Several attempts at cost control have been made to connect islands in the Pacific Ocean, including using a “donor cable”. In this scenario, a submarine cable is carefully retrieved off the ocean floor and relaid (“donated”) to connect another island. The old cable may no longer have the capacity for its previous route, but might be more than enough for its new one. The donated cable eliminates fabrication costs, which can be half of the cost of a new submarine cable.</p> <p>In 2011, the World Bank Group and the Asian Development Bank announced that <a href="https://www.worldbank.org/en/news/press-release/2011/08/15/world-bank-and-asian-development-bank-to-support-high-speed-internet-in-tonga">they would fund</a> a new submarine cable to Tonga. In a press release, World Bank Group President Robert B.
Zoellick <a href="https://www.worldbank.org/en/news/press-release/2011/08/15/world-bank-and-asian-development-bank-to-support-high-speed-internet-in-tonga">stated that</a>:</p> <blockquote> <p>“Access to high-speed internet links will vastly improve opportunities for the people of Tonga to connect to the world, provide information needed by business to expand jobs, and allow people to more easily and inexpensively keep in contact with families overseas…</p> <p>This critical link will connect Tonga firmly with the rest of the world, generating huge economic opportunities from early 2013 when the cable should be in place and marking a key step in Tonga’s international connectivity.”</p> </blockquote> <p>At 5:56 UTC on August 5, 2013, I <a href="https://twitter.com/DougMadory/status/1482414748666941445">spotted the activation</a> of the Tonga cable based on traceroute measurements. My <a href="https://www.linkedin.com/in/dougmadory/detail/overlay-view/urn:li:fsd_profileTreasuryMedia:(ACoAAABgDfMBQqg6K3WmEobvLYAvoetjdTSF-R0,1635480018483)/">presentation the following month</a> at Submarine Networks World in Singapore included the graphic below showing the dramatic improvement in latency from geostationary satellite to subsea cable. Like <a href="https://www.reuters.com/article/cuba-internet/cubas-mystery-fiber-optic-internet-cable-stirs-to-life-idUKL1N0AR9TQ20130122">Cuba earlier that year</a>, Tonga was now connected via submarine cable and its reliance on slow, expensive satellite service was over.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/49Dl2n3FHdquAO3OfWHNu2/9b6f72f647b24d1f70b6852fec0799c5/tonga_cable_activation.png" style="max-width: 400px;" class="image center" thumbnail alt="Tonga Cable Activation" />
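<p>The physics behind that latency improvement is easy to work out. A geostationary satellite sits at roughly 35,786 km, and a round trip crosses that gap four times (up and down in each direction), while light in fiber covers a comparatively short cable run at about two-thirds the speed of light. In the Python arithmetic below, the cable length and fiber factor are rough assumptions for illustration:</p> <pre>
C_KM_PER_MS = 299.792        # speed of light in vacuum, km per millisecond
GEO_ALT_KM = 35_786          # geostationary orbit altitude above the equator
CABLE_KM = 800               # rough Tonga-Fiji cable run (assumption)
FIBER_SPEED = 0.68           # light in fiber moves at about 68% of c

# satellite round trip crosses the up/down gap four times (out and back)
sat_rtt_ms = 4 * GEO_ALT_KM / C_KM_PER_MS

# cable round trip: out and back at fiber speed
cable_rtt_ms = 2 * CABLE_KM / (C_KM_PER_MS * FIBER_SPEED)

print(f"GEO satellite RTT floor: {sat_rtt_ms:.0f} ms")   # about 477 ms
print(f"Submarine cable RTT:     {cable_rtt_ms:.0f} ms") # about 8 ms to Fiji
</pre> <p>Propagation alone, before any routing or queuing delay, accounts for the order-of-magnitude difference in the chart above.</p>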
The <a href="https://www.linkedin.com/feed/update/urn:li:activity:6884116022702567425">earthquake caused</a> 21 faults in 9 cables and took 11 ships to restore everything back to normal.</p> <p>Then again in March 2011, a magnitude 9.1 earthquake off the coast of Japan led to underwater landslides that carried one submarine cable 2km, according to a panel discussion at <a href="https://subtelforum.com/22suboptic-2013-archive-now-available-online/">Suboptic in April 2013</a>.</p> <p>According to our data, we observed the complete disconnection of Tonga at 6:40pm local time. This was over an hour after the initial drop in traffic that immediately followed the eruption. We will soon learn whether this was the time necessary for an underwater landslide to reach the Tonga cable, causing the break.</p> <p>We will be following the situation in Tonga and hoping for a quick and safe recovery for everyone there.</p><![CDATA[Network AF, Episode 8: Staying curious with Ron Winward]]><![CDATA[In this episode of Network AF, Avi talks to INAP’s Ron Winward about staying curious in networking. The two discuss everything from automation to programming languages to the Mirai botnet and more.]]>https://www.kentik.com/blog/network-af-episode-8-staying-curious-with-ron-winwardhttps://www.kentik.com/blog/network-af-episode-8-staying-curious-with-ron-winward<![CDATA[Michelle Kincaid]]>Fri, 14 Jan 2022 05:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="//images.ctfassets.net/6yom6slo28h2/289J9hmbCTySMnYRSh2Uhm/33a23078bc2166556766fb487e436e9f/ron-winward.jpg" style="max-width: 200px;" class="image right" alt="Ron Winward" /></a></p> <p>In the first podcast episode of 2022, Avi welcomes <a href="https://www.kentik.com/network-af/">Ron Winward to Network AF!</a></p> <p>Ron is the vice president of network services at INAP, a global provider of secure, performance-oriented, hybrid infrastructure. Like Avi, Ron grew up in Pennsylvania and is a member of the East Coast Access of Infrastructure.</p> <p>On the podcast, the duo discuss:</p> <ul> <li>An early curiosity for technology</li> <li>The path to networking</li> <li>Showing curiosity and readiness in networking</li> <li>Moving from network infrastructure to security — and back again</li> <li>A bit about automation</li> <li>Advice to Ron’s younger self</li> </ul> <h3 id="an-early-curiosity-for-technology">An early curiosity for technology</h3> <p>Ron tells Avi he’s been fascinated with computers and technology since before the days of the mainstream internet. Part of Ron’s curiosity stemmed from his father working at Bell of Pennsylvania and his uncles working at AT&#x26;T. Then, in college, Ron focused on business, but he says he stayed closely aligned with tech as he got a job doing outside sales for dial-up internet access at a CLEC.</p> <p>Ron reflects on the evolution from dial-up to T1s to DSL and, eventually, fiber and the growth of ISPs from local to nationwide businesses. “Think about where we are now from dial-up internet access to high-definition TV streamed wirelessly into our houses on a device that is no bigger than a matchbox car (like an Amazon Fire Stick),” Ron says.
Or like “doing a gigabit through your phone,” Avi adds.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/7b5f9d58"></iframe> <h3 id="rons-path-to-networking">Ron’s path to networking</h3> <p>Ron says he gravitated towards network infrastructure because he’s always been fascinated by the capabilities and innovations that networks bring to people. That includes innovations like VoIP phones, high-definition streaming, and even the ability for entire companies to pivot during a pandemic to completely remote workforces.</p> <p>In the early 2000s, Ron took out loans to earn CCNA and CCNP certifications, followed by a Juniper cert later on. He progressed from sales and sales engineering to the engineering side of networking infrastructure, and says that curiosity to learn continues today.</p> <p>While there are a variety of certifications and programs available online to learn these skills today, Ron doesn’t really see a formal education track for network engineering yet. He stresses that people “don’t have to have a degree at all, let alone a degree in computer science to be a networker,” but he emphasizes that “you just have to have curiosity.”</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/c841f6fd"></iframe> <h3 id="showing-curiosity-and-readiness-in-networking">Showing curiosity and readiness in networking</h3> <p>Avi asks Ron how someone can effectively demonstrate their intellectual curiosity and readiness about networking to inspire trust in their skills. Ron says there are many ways to do this.</p> <p>What Ron looks for in someone moving into the industry is a person who shows interest in a specific field and technology. He says he isn’t necessarily looking for certifications, but rather someone who can say, “I’ve done this, and I want to do this. Here’s why I want to do this. Here is why I think I’ll be good at it.” He adds that having visibility into someone’s prior work helps him see how that person would be valuable to a team.</p> <p>Avi says he would also encourage those traversing this terrain for the first time to, as much as possible, give up embarrassment. “Humble yourself about how you ask questions and who you learn from,” he says. Ron agrees and admits that as a 20-year networking veteran, he still learns every day through his own experiences and from the networking community.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/7748c515"></iframe> <h3 id="from-network-infrastructure-to-security--and-back-again">From network infrastructure to security — and back again</h3> <p>Avi and Ron also talk about how the roles in networking have become hyper-specific and often require an expert to be totally dedicated to a particular language or aspect of maintaining a network. This leads to a discussion around Ron’s decision to transition to network security.</p> <p>Ron says this shift allowed him to do research on things like the Mirai botnet and present how each one of those attack vectors in the botnet impacted networks. He was also able to share what networkers would need to look for to prevent something like it again. 
Ron compares discovering vulnerabilities to a game of cat-and-mouse, and talks about how much he loved that opportunity.</p> <p>Ron has now returned to a focus on network infrastructure, but despite working in security, he says he sees it all in the same thread of resilient networking to keep services up.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/59878bd5"></iframe> <h3 id="a-bit-about-automation">A bit about automation</h3> <p>Avi asks Ron where he sees the industry going in terms of automation, intent and infrastructure as code. Ron replies that we’ve had nearly 30 years to understand things that break networks.</p> <p>“Now, with innovation being what it is, we continue to look for better ways to keep uptime, maintain uptime, drive efficiency in terms of deployment and configuration and time-to-build,” he says. “Things that took us two, three weeks before can take us two to three minutes with automation now.”</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/0a5c20f9"></iframe> <h3 id="advice-to-rons-younger-self">Advice to Ron’s younger self</h3> <p>In addition to learning more programming, Ron says he would advise his earlier self to get started sooner on his personal network and building relationships. He says this is such a key component to both learning and understanding your career path.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/1397e453"></iframe> <p>Listen to Ron and Avi’s <a href="https://www.kentik.com/network-af/">full conversation</a>. And follow Ron on <a href="https://www.linkedin.com/in/ronwinward/">LinkedIn</a> and <a href="https://twitter.com/ronwinward">Twitter</a>.</p><![CDATA[How to measure the performance of a website]]><![CDATA[In this post, learn how to isolate the root cause of a poor digital experience. See why it’s important to measure performance of a website from multiple different locations and observe the metrics over time. ]]>https://www.kentik.com/blog/how-to-measure-the-performance-of-a-websitehttps://www.kentik.com/blog/how-to-measure-the-performance-of-a-website<![CDATA[Michael Patterson]]>Tue, 11 Jan 2022 05:00:00 GMT<p>If you’re a person who works from home, you almost certainly have to deal with occasional internet connection issues. More often than complete outages, you’re likely dealing with occasional slowness. And you know from experience that any one of dozens of devices and services along the path can cause latency. To be more specific: slowness can be introduced as your digital connection traverses your PC, the local wifi/wired connection, the local ISP, the Tier 1 or Tier 2 provider, or the CDN that provides the hardware which hosts the web server running the application.</p> <p>Given all that can impact the perceived speediness of the website that you are trying to reach, you can’t simply measure the performance of a website without simultaneously keeping track of all the other factors that could be impacting a connection. What’s more, you also can’t assume that a single test instance will provide an accurate picture of a possible problem. 
One-shot tests don’t take into account the hiccups.</p> <h3 id="a-case-of-the-hiccups">A case of the hiccups</h3> <p>Hiccups, for example, can occur when something happens on your computer that causes the operating system to respond a little slower, albeit maybe by just a few hundred milliseconds. As a result, this event introduces additional slowness. This isn’t a big deal if you run a test every minute and evaluate the results over time.</p> <p>Or, perhaps one or more of the 10-20 routers in the path needed to reach the destination gets busy for a hundred milliseconds or so. A hiccup can even result from other services hosted on the same machine as your application competing for the same compute resources.</p> <h3 id="false-alarms">False alarms</h3> <p>The above events happen nearly every minute of every connection you maintain. Because of the constant bombardment of requests hitting one or more devices, tests will frequently trigger alarms. Yet, few of these alarms may represent a service-impacting condition. An algorithm is needed to ascertain the overall digital experience and keep frequent notices in check. If the test alerted ops for every threshold violation, false positives would be routine. False alarms occur when you receive alerts of a performance problem, but there isn’t a digital experience issue. Remember the story of Chicken Little?</p> <h3 id="multiple-factors-impact-the-performance-of-a-website">Multiple factors impact the performance of a website</h3> <p>Accurately measuring a website’s performance requires monitoring several metrics at once. These include, but are not limited to:</p> <p><strong>Response size</strong>: This is the volume of data your browser is trying to download to display a page. This includes all HTML, style sheets, cookies, any javascript that has to execute, etc. And, of course, whether or not any of this is cached locally in your browser. For a test to be considered accurate, it should be hitting the same page(s) in question.</p> <p><strong>Domain lookup time</strong>: When a connection is made to a domain, the local operating system must first reach out to the DNS to resolve that hostname to an IP address. Where is the DNS? Is your PC using a DNS on the same local area network as your computer? More and more often, end users are leveraging a service provider’s DNS, which adds a measurable amount of latency. Ask yourself, do you want your test to query the DNS every time the test executes, or should it use the same IP address from the last test? This would cut time off the test.</p> <p><strong>Connect time</strong>: Prior to downloading the entire web page, a TCP connection is established between your PC and the application. Your PC sends out what is called a “SYN” packet to the destination’s IP address. Upon receiving the SYN, the destination will respond with a “SYN-ACK,” which your PC then answers with an “ACK” to complete the handshake. The speed of this handshake can be measured and stored in milliseconds.</p> <p><strong>Response time</strong>: This is the difference in timestamps between when the first byte of the page is received and when the last byte of the page is received. Look above at the response size. The HTML, the CSS, etc. all have to be downloaded.</p> <p><strong>Average HTTP latency</strong>: This metric is the sum of domain lookup time, connect time and response time.</p>
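<p>To make the first few of these metrics concrete, here is a minimal sketch of how domain lookup time, connect time and response time could be sampled with nothing but Python’s standard library. The hostname is a placeholder, plain HTTP is used for brevity, and a real monitor would add HTTPS, retries and far more careful error handling:</p> <pre><code>import socket
import time

HOST = "www.example.com"  # placeholder; point at the page under test
PORT = 80                 # plain HTTP, for brevity

t0 = time.perf_counter()
addr = socket.getaddrinfo(HOST, PORT, proto=socket.IPPROTO_TCP)[0][4]
t1 = time.perf_counter()  # domain lookup time: t1 - t0

sock = socket.create_connection(addr, timeout=5)
t2 = time.perf_counter()  # connect time: t2 - t1

sock.sendall(f"GET / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n".encode())

first_byte = None
total_bytes = 0
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    if first_byte is None:
        first_byte = time.perf_counter()  # time the first byte arrived
    total_bytes += len(chunk)
t3 = time.perf_counter()  # response time: t3 - first_byte
sock.close()

print(f"domain lookup: {(t1 - t0) * 1000:.1f} ms")
print(f"connect:       {(t2 - t1) * 1000:.1f} ms")
print(f"response:      {(t3 - first_byte) * 1000:.1f} ms ({total_bytes} bytes)")
</code></pre> <p>Run something like this once a minute and you have the raw samples the rest of this post talks about evaluating over time.</p>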
<p><strong>Latency</strong>: Putting the application, the server and the DNS response times aside for a moment, obviously the network can be a big factor when measuring performance. Wifi strength, multiple routers in the path, and cable/fiber quality can and do play a major role in the overall perceived performance of a website. Pinging the IP address the application resides on can give a reasonably accurate measurement of network response time. The use of ICMP vs. TCP or UDP to carry the ping can also be a factor. Some vendors like to compare the latency metric with the response time measurement as both of these metrics are reasonably good numbers for measuring overall network slowness.</p> <p><strong>Jitter</strong>: When end systems and servers aren’t overly burdened with workloads, the timing in which successive packets are sent for a single file is generally consistent. If the inter-arrival time of datagrams is not consistent, the deviation in inter-arrival times is considered jitter. Although first introduced in application monitoring over concerns with voice and video quality, jitter ends up being a fantastic metric to pay attention to when measuring a website’s performance.</p> <p><strong>Packet loss</strong>: Unlike voice and video, conversations in file transfers cannot continue if pieces of a file are missing. If a packet containing data is dropped, the other end, depending on the session layer protocol, will definitely pick up on it and hold off on processing the entire download until the missing datagram is resent. Missed packets cause latency because they force the receiving side to wait until the missing piece is present. A second way packet loss is measured is with ping or traceroute. These applications send out hello messages to a destination and measure the response time. When a response is not received, it is considered dropped. That doesn’t always mean there is a congestion issue, but these protocols are a great latency measurement tool and should be kept in your toolkit.</p> <p>A summary diagram of the above metrics is shown below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/7EjXpnhV6mG9C4gHcJD6p4/180c77c6bb67e7b7996b79f05579d02b/how-to-measure-the-performance-of-a-website.png" style="max-width: 600px;" class="image center" alt="How to measure website performance" /> <p>Notice the “Status Code” in the upper left of the image above. This is a value sent back from the web page and generally shouldn’t ever change.</p> <h3 id="different-angles">Different angles</h3> <p>After your test is up and running, even with all these layers of measurement, it still should not be relied upon as the sole <a href="https://www.kentik.com/solutions/usecase/digital-experience-monitoring/">measure of digital experience</a>. Why is this? Because it is a single perspective. Think about it. When you’re faced with a big decision, you might get advice from multiple people in order to make sure you are making the right choice. The same holds true when it comes to determining if a digital service is slow. You may ask others if they are experiencing the same slowness.</p> <p>And, although asking others if something is slow can be helpful, it is hardly an accurate measurement. For this reason, it is a good idea to introduce <a href="https://www.kentik.com/product/global-agents/">synthetic probes</a> that will test the performance of a website from geographically dispersed locations.</p>
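<p>Observing samples over time is also how the false-alarm problem described earlier gets tamed: instead of alerting on any single bad measurement, a monitor can require a violation to persist across a window of recent samples. A minimal sketch of that idea, with the window size and thresholds picked arbitrarily for illustration:</p> <pre><code>from collections import deque

WINDOW = 10         # evaluate the last 10 samples (e.g., one per minute)
THRESHOLD_MS = 500  # illustrative latency budget
MIN_BAD = 7         # require 7 of 10 violations before alerting

samples = deque(maxlen=WINDOW)

def record(latency_ms):
    """Add a sample; return True only when degradation is sustained."""
    samples.append(latency_ms)
    bad = sum(1 for s in samples if s > THRESHOLD_MS)
    return len(samples) == WINDOW and bad >= MIN_BAD

# A single 900 ms hiccup in an otherwise healthy stream never fires.
for latency in (120, 130, 900, 125, 140, 135, 128, 131, 122, 127):
    if record(latency):
        print("alert: sustained latency degradation")
</code></pre>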
<img src="//images.ctfassets.net/6yom6slo28h2/2RV2IklYWTPvHXWLlazKzV/a841498d8de5072736e548b28ee88c97/synthetic-application-monitoring.png" style="max-width: 600px;" class="image center" alt="Synthetic application monitoring" /> <h3 id="synthetic-application-monitoring">Synthetic application monitoring</h3> <p>By measuring the performance of a website from multiple locations and observing the metrics over time, the level of certainty can be improved. This can be used to isolate the root cause of a poor digital experience.</p> <p>If you would like to learn how to move traffic around a trouble spot on the internet or learn more about how to measure the performance of a website using synthetic monitors, <a href="https://www.kentik.com/go/get-demo/">reach out to the team at Kentik</a>.</p><![CDATA[Wellness and surfing… with Kentik!]]><![CDATA[Backend Engineering Manager Aaron Kagawa shares how he finds wellness in surfing -- with a little help from Kentik’s employee benefits. ]]>https://www.kentik.com/blog/wellness-and-surfing-with-kentikhttps://www.kentik.com/blog/wellness-and-surfing-with-kentik<![CDATA[Aaron Kagawa]]>Thu, 23 Dec 2021 05:00:00 GMT<p>It’s an exciting time here at Kentik. We’re expanding into new markets, working closely with amazing customers, and continuing to hire talented people to grow our team. We announced a <a href="https://www.kentik.com/press-releases/kentik-raises-usd40-million-to-end-the-era-of-fragmented-monitoring/">new round of funding</a> in the fall. We’re continuously building on our product. And we have our sights set on scaling our systems to large numbers. As Kentik’s backend engineering manager, I know this has kept our team very busy. We have worked very hard to get to this point, and it’s exciting to think about what’s ahead.</p> <p>One thing that I love about Kentik and the culture of the company (aside from the fact that everyone is focused on creating the best product we can for our customers) is that people care deeply about each other, our fellow employees, and our families, too. And this is a top-down thing!</p> <p>One way our leadership supports us is by providing really awesome benefits. Of course, we have all the standard stuff: 401k, flexible vacation, full health insurance, and fully remote positions, which is really amazing. Kentik also provides a health reimbursement account (up to $4,500 a year for families) for health expenses like eyeglasses, LASIK surgery, braces for my kids, and so much more. With my growing family, this kind of benefit definitely gives me peace of mind knowing that I can give my family the best health care possible. And by far, my favorite benefit is the wellness fund. Kentik provides its employees with $100 per month for this. Meaning we can use this fund for gym memberships, meditation classes, exercise equipment, and SURFBOARDS!</p> <p>I live in Hawaii (another super awesome perk right there), and Kentik pays for my surfboards! Every single surfer I talk to about this is super jealous. Not only do I get free surfboards, I buy the best ones. Surfboards that I never thought I would ever have and that I probably would never buy myself with my own money. I think that’s the point that is so great about working at Kentik. Wellness is a priority.
And leadership makes it a point to provide money and flexible time off for their employees to actually have wellness.</p> <img src="//images.ctfassets.net/6yom6slo28h2/vtGKxJH42cK7LviMnP8CA/5b7ce9667768f1ecea26728af368b50a/ak-aurelia.jpg" style="max-width: 650px" class="image center" /> <img src="//images.ctfassets.net/6yom6slo28h2/2PhiLk5929LxIthX1iTXP0/ad9962334f1b42cd1fd52c450e0c4345/ak-surfboard.jpg" style="max-width: 650px; margin-bottom: 15px;" class="image center" /> <div style="max-width: 650px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Aaron’s daughter Aurelia recently went out for her first surf session.</div> <p>Surfing is a huge part of my life. I grew up surfing with my dad, uncles, cousins and friends. I still surf with my dad every week. My kids are also starting to surf. I’m literally having the surf sessions of my life on my Kentik-funded surfboards!</p> <img src="//images.ctfassets.net/6yom6slo28h2/37Z6yzenItwzbWuL1WgDy5/97452045b978775f0f7ba5bdeb08abf2/ak-sunrise2.jpg" style="max-width: 650px" class="image center" /> <img src="//images.ctfassets.net/6yom6slo28h2/4mr4edUxG9BpIlQShEfrPO/68443627661e22bcd8e50cb9a58b6f1c/ak-sunrise1.jpg" style="max-width: 650px; margin-bottom: 15px;" class="image center" /> <div style="max-width: 650px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Photos from Aaron’s morning surf sessions. </div> <p>Every morning I get up to check the surf report and plan my surf sessions. And every morning I’m energized to help drive Kentik to success.</p> <p>If you have a wellness passion, love building awesome tech, and want to make waves, then check out our <a href="https://www.kentik.com/careers/">careers page</a>. We’re hiring! I’ll loan you one of my Kentik boards when you are in Hawaii and take you out for a “board meeting”… because, BTW, our next company offsite is in Waikiki, Hawaii. It’s going to be awesome.</p> <p>Aloha!</p><![CDATA[Network AF, Episode 7: From Juilliard to bare metal with Zac Smith]]><![CDATA[On this episode of Network AF, Avi talks with Zac Smith, Bare Metal Managing Director at Equinix. Zac is a graduate of Juilliard, has started multiple networking companies and is an operating board member of Pursuit.]]>https://www.kentik.com/blog/network-af-episode-7-from-juilliard-to-bare-metal-with-zac-smithhttps://www.kentik.com/blog/network-af-episode-7-from-juilliard-to-bare-metal-with-zac-smith<![CDATA[Michelle Kincaid]]>Mon, 20 Dec 2021 05:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="https://images.ctfassets.net/6yom6slo28h2/764nd0Ct3QgkxvZrkw3LLF/39e30b74361798b91154bfc815ea1aa8/zac-smith.jpeg" style="max-width: 200px;" class="image right" alt="Zac Smith" /></a></p> <p>In the latest episode of Network AF, your host <a href="https://www.kentik.com/network-af/">Avi Freedman chats with Zac Smith</a>.</p> <p>Zac is a 20-year networking veteran, the managing director of Equinix Metal, and a double bass player. Throughout Zac’s career, he’s focused on using software to build automated infrastructure platforms. That includes growing Voxel, the Linux-based hosting platform that sold to Internap in 2011, into one of the early, leading cloud-hosting companies.</p> <p>In 2014, Zac co-founded Packet to empower technology-enabled enterprises with automated bare metal infrastructure. 
Packet was acquired by Equinix in March 2020, and Zac now leads the strategy, product and go-to-market for Equinix’s bare metal platform.</p> <p>His current projects involve helping Equinix’s customers access the global reach of its 248 data centers across the world — all with less friction than traditional colocation. He is also focused on building an automated colocation product attached to Equinix’s interconnection capabilities.</p> <p>Topics Avi and Zac cover include:</p> <ul> <li>Zac’s shift from Juilliard to networking</li> <li>Mentorship and “doing well by doing good”</li> <li>Widening infrastructure accessibility and education</li> <li>Zac’s goals with Packet and bare metal</li> </ul> <h3 id="the-shift-from-juilliard-to-networking">The shift from Juilliard to networking</h3> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/5a5759c0"></iframe> <p>Zac tells Avi that his interest in tech grew out of a practical approach to pay for college: by doing PC repairs for wealthy people on 5th Avenue in New York. That’s where he learned he could make more money getting people connected to the internet via AOL discs than in working for minimum wage at the box office at Juilliard.</p> <p>Back then, Zac says he never really thought about networking as a career because he was studying music. But with forward-thinking, he realized that starting a tech business might make more sense than trying to get hired for his music background.</p> <p>Family friend John LeRoux advised Zac to avoid creating a phone company, but rather to start something with recurring revenue. LeRoux said, “If you charge people, sell to them once, and then don’t mess up. Just treat them well, and they’ll usually pay you every month. And that’s a good business model.”</p> <p>So, Zac decided to start a web-hosting business that catered to musicians. He says his curiosity gave him the space to learn all about running data centers and to evolve alongside networking. And once he was in it and observing others building networks he couldn’t get out.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/e385fcc2"></iframe> <p>Zac jokes that he was the absolute worst musician at Juilliard, but it taught him that “you have to be okay being the stupidest person in the room” and be humble about it. When he first got into networks and hosting he didn’t know anything, but says as long as you’re willing to ask questions, people are really interested to share and help. He loves saying, “I don’t know yet,” and admitting when these technologies are confusing to create a safe space for people just coming into networking.</p> <h3 id="mentorship-and-doing-well-by-doing-good">Mentorship and “doing well by doing good”</h3> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/6798f8a4"></iframe> <p>A philosophy Zac tries to live by is “doing well by doing good.” That’s, in part, because he knows how much the notion helped him when he arrived in the industry. He tells Avi that there were many people who helped him along the way.</p> <p>“Raj [Dutt] would say, ‘Let me show you how to do this.’ And Adam Rothschild would always answer my questions. And then Alex would give us a colo rack and say, ‘It’s cool, let me show you.’” The help from his peers opened doors from a technical standpoint.</p> <p>Zac loves not only creating products, but also creating outcomes for people. 
As the chair of an operating board for a nonprofit called Pursuit in Queens, New York, he helps bring underrepresented communities into the tech ecosystem, mainly focusing on full-stack developers.</p> <p>He says the average person who applies for the program is earning around $18,000 per year, usually supporting two to three generations. After finishing a one-year program through nights and weekends (where these folks get to participate and learn about internet access with quiet places to study and computing resources), their average salary is $88,000 per year. This creates a ripple effect of possibilities for others in these communities, too.</p> <p>Zac says they’ve graduated several thousand people into tech jobs at this point.</p> <h3 id="zacs-advice-to-his-younger-self">Zac’s advice to his younger self</h3> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/7bd777dc"></iframe> <p>Thinking back, Zac says he would still tell his younger self to go to Juilliard, reflecting that it was his best-ever decision. It got him to New York, allowed him to meet inspiring people, and helped him learn to be self-critical.</p> <p>His advice to his younger self is to ask more questions because “it’s just always helpful.” He says we all have our own personal egos, however big or small, and asking questions helps chip this down a bit. Lastly, he says he would tell those new to their careers to take some time to enjoy while still learning and growing.</p> <p><a href="https://www.kentik.com/network-af/">Listen to Zac and Avi’s full conversation</a>.</p><![CDATA[Channel partner spotlight: Sciens]]><![CDATA[Sciens is one of Kentik’s newest channel partners, serving the Korean market. In this Q&A, we highlight a bit about the team behind Sciens, who they are and what they do.]]>https://www.kentik.com/blog/channel-partner-spotlight-scienshttps://www.kentik.com/blog/channel-partner-spotlight-sciens<![CDATA[Jim Frey]]>Thu, 16 Dec 2021 05:00:00 GMT<p>Headquartered in Seoul, Korea, Sciens Co., Ltd. is one of the newest partners to join <a href="https://www.kentik.com/partners/channel-partners/">our program</a>, supporting us in serving the Korean market. We reached out to the Sciens team recently with this Q&#x26;A to share a little about the team and their services.</p> <h3 id="who-is-sciens">Who is Sciens?</h3> <p>Sciens is a network-specialized IT company, serving the enterprise and service provider markets in Korea. Since 2013, we have been providing custom network solutions and services for our customers.</p> <p>We strive to be the leader in offering the most advanced and cutting-edge network technology, and we use our business experience, technology, solutions and services to do that.</p> <h3 id="what-services-do-you-provide">What services do you provide?</h3> <p>At Sciens, we offer <strong>total network system services</strong>, including everything from network diagnosis and analysis to consulting, design, construction, maintenance and more.
In these areas, we specifically offer:</p> <p><strong>Network diagnosis and analysis:</strong></p> <ul> <li>Analyzing environments and operating statuses of customers’ current networks</li> <li>Consulting on network diagnostics, problem-finding and improvement</li> <li>Converging customer needs</li> <li>Network planning</li> </ul> <p><strong>Network construction:</strong></p> <ul> <li>Fast and reliable network installation and deployment</li> </ul> <p><strong>Wireless security:</strong></p> <ul> <li>Rogue access point detection and blocking</li> <li>Ad-hoc detection and blocking</li> <li>Illegal wireless connection blocking</li> <li>Wireless DDoS attack detection</li> <li>Wireless public management shutoff</li> <li>WEP cloaking</li> </ul> <p><strong>Design:</strong></p> <ul> <li>Wired/wireless integrated network design</li> <li>Designing a management network through an integrated management system</li> </ul> <p><strong>Maintenance:</strong></p> <ul> <li>Fast and accurate processing to meet customer needs</li> <li>Operation of stable networks through regular and timely inspections</li> </ul> <p>We also offer <strong>internet infrastructure construction</strong>. This part of our business helps our customers build new network access for their end users, with specialized, tailored IT infrastructure solutions.</p> <p>As part of this service, we provide solutions with new technologies and capabilities to overcome the limitations of existing access networks (ISDN, PSTN, xDSL), including options for added security and QoS for companies that have completed their infrastructure construction. Special application solutions we offer include:</p> <ul> <li> <p><strong>CDN</strong> - We provide load-balanced application delivery and backup solutions that leverage nationwide networks provided by large ISPs, IDC centers, distributed by region.</p> </li> <li> <p><strong>VPN construction</strong> - We build your network across the internet and provide enhanced corporate security with data encryption and QoS delivery solutions.</p> </li> </ul> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 0px; margin-top: 30px;"></div> <h3 id="want-to-know-more">Want to know more?</h3> <p>For more information on Sciens, <a href="http://www.sciens21.com/en/sub/product/business.asp">check out their site and services</a>. You can also learn more about <a href="https://www.kentik.com/partners/channel-partners/">Kentik’s Channel Partner Program</a>.</p><![CDATA[Why is my SaaS application so slow?]]><![CDATA[When the problem is outside of your network, solving it can become much more complicated. BGP, traceroute, packet analysis and synthetic monitors all come into play. ]]>https://www.kentik.com/blog/why-is-my-saas-application-so-slowhttps://www.kentik.com/blog/why-is-my-saas-application-so-slow<![CDATA[Michael Patterson]]>Tue, 14 Dec 2021 05:00:00 GMT<p>Many companies today rely on SaaS connections in order for the business to function. Some users simply can’t do their jobs when an application becomes unavailable. When hundreds of users are impacted, this can cost a company serious money.</p> <p>That’s why keeping a proverbial finger on the pulse of application performance is generally worth the effort. But, it isn’t easy. Many popular SaaS applications are delivered from hundreds of locations around the world. To properly monitor them, it’s paramount to find out which servers your end users are connecting to.</p>
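<p>A quick way to start is to see which addresses the service’s hostname resolves to from your vantage point. A small sketch using Python’s standard library (the hostname is a placeholder; run it from different networks and the answers will often differ):</p> <pre><code>import socket

HOSTNAME = "app.example-saas.com"  # placeholder; use your SaaS hostname

# Each getaddrinfo() answer is (family, type, proto, canonname, sockaddr);
# the address your browser actually connects to is sockaddr[0].
addresses = set()
for info in socket.getaddrinfo(HOSTNAME, 443, proto=socket.IPPROTO_TCP):
    addresses.add(info[4][0])

print(f"{HOSTNAME} currently resolves to:")
for address in sorted(addresses):
    print("  ", address)
</code></pre>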
<h3 id="start-with-the-desktop">Start with the desktop</h3> <p>Because SaaS applications are web-based, browsers can introduce problems. Most companies have a preferred platform (e.g., Chrome, Internet Explorer, Firefox, Opera). Make sure you are using the corporate standard. However, sometimes I try other browsers at the beginning of the troubleshooting effort. It helps me gain confidence that the problem isn’t the browser (or reveals that it is).</p> <p>How many browser extensions do you currently have running? Some extensions consume resources and slow down web browsers. Try temporarily uninstalling all of them to see if performance improves.</p> <p>Do you have any active tabs that are consuming resources even when you aren’t on the tab? Just for testing, close them. The more things we can eliminate, the faster we can overcome the problem and get back to work.</p> <p>While you’re at the browser, clear cached files and cookies. Sometimes these files get corrupt or cause wacky problems. If you delete them, the next time you connect to any SaaS, the browser will just download them again, so have no fear. Delete them and eliminate another possible source of the problem.</p> <p>It might seem like a “d’oh!” question, but: is it just one SaaS application that is slow, or do all the sites that you visit seem to be crawling? If it’s all sites, the problem is probably not the application. Let’s check some other basics. Are you running the OS with the latest patches? Run Task Manager (or Activity Monitor for macOS), and look for applications that are consuming excessive resources. Shut them down and try working with your SaaS application after making some or all of the above changes.</p> <p>This is sort of a long shot, but are you using a corporate DNS server? You might try temporarily switching to a public DNS like Google’s 8.8.8.8. DNS lookups can introduce significant latency on new connections.</p> <p>Before you leave the desktop, launch a command prompt and execute a tracert to see if a particular hop in the path is introducing excessive latency.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3GwhxJMXQb0YPTd764AELq/3e7348e4827dfaa9f169a20c16936bf6/why-is-my-saas-application-so-slow.png" style="max-width: 700px; padding: 0" class="image center" alt="Traceroute - why is my SaaS app so slow?" /> <p>Notice above that the routers used in the connection are looking pretty snappy. I’m not seeing any glaring issues in the path.</p> <p>Next, point your browser to <a href="http://www.speedtest.net/">www.speedtest.net</a> and run the test from at least two computers sharing the same internet connection.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6Itms8WblThIrukCuAzLQM/e2cf6c380db2f0c38df79db01b80050c/troubleshooting-saas-connections.png" style="max-width: 500px; padding: 0" class="image center" alt="Troubleshooting SaaS Connections" /> <p>Make sure that both devices are using the same server. Notice the arrow above in red. When troubleshooting SaaS connections, we want to make sure we’re staying consistent, especially since internet traffic is bursty in nature.</p> <p>If you “ain’t afraid of getting dirty,” use the Wireshark packet analyzer to see if your connection to the SaaS application’s IP address is suffering from any packet loss. Even a 3-6% loss will lead to lots of retransmits and could be noticeable in your end-user experience. But if you are afraid, maybe find a good YouTube tutorial on Wireshark, as you may find that it isn’t that tough to familiarize yourself with these types of tools.</p>
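<p>If packet capture still feels heavy, repeated TCP connection attempts can give a rough, first-order read on failure rate and latency spread. This is a crude proxy, not a substitute for Wireshark or a proper ICMP ping, and the hostname below is a placeholder:</p> <pre><code>import socket
import time

HOST = "app.example-saas.com"  # placeholder; the SaaS hostname or IP
PORT = 443
ATTEMPTS = 20

times_ms = []
failures = 0
for _ in range(ATTEMPTS):
    start = time.perf_counter()
    try:
        sock = socket.create_connection((HOST, PORT), timeout=2)
        sock.close()
        times_ms.append((time.perf_counter() - start) * 1000)
    except OSError:
        failures += 1  # a timeout or refusal counts as a failed probe
    time.sleep(0.5)    # pace the probes politely

print(f"failed probes: {failures}/{ATTEMPTS}")
if times_ms:
    avg = sum(times_ms) / len(times_ms)
    print(f"min/avg/max: {min(times_ms):.0f}/{avg:.0f}/{max(times_ms):.0f} ms")
</code></pre>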
<h3 id="check-your-local-network">Check your local network</h3> <p>If you’re working from home, could it be possible that you’re competing with other local devices for that precious, limited bandwidth? If you don’t have fancy tools that tell you who is communicating through the router, take all other internet capable devices offline. Nowadays, so much is connected to the web (phones, tablets, computers, TVs, watches, radios, vacuums, security systems, refrigerators, small IoT devices, neighbors, even boat radios down at the dock). Most of these devices phone home and auto-receive software updates. All of these devices could be introducing a calamity of problems.</p> <p>If you have access to your local router, log in and see if you can give connections to your SaaS application priority. I would avoid giving your entire device high priority, as other applications on your device could be consuming the pipe, and you don’t want them competing with high-priority, work-related applications.</p> <h3 id="communicate">Communicate</h3> <h4 id="check-with-the-saas-application">Check with the SaaS application</h4> <p>In the effort to exhaust all possible reasons that it could be slow, it’s probably wise to take a few seconds to check <a href="https://downdetector.com/">Downdetector</a>. It’s just user status “up” or “down” reports that don’t share all of the juicy details that we’d like to see (e.g., CPU, memory, latency, packet loss, jitter, etc.), but it can give you a clue if it’s a local or widespread problem; or in other words, something you can do something about or not.</p> <h4 id="check-with-coworkers">Check with coworkers</h4> <p>After you’ve completed some of the above, maybe it’s time to send a message out on Slack to see if other internal users are suffering from the same performance issues as you. Your coworkers may have some suggestions that aren’t listed above. If they respond with something like “yeah, I’m noticing it, too,” make note of this because if the problem keeps recurring, it could be time to contact IT to see if maybe you can connect to a different server.</p> <h3 id="what-next">What next?</h3> <p>Here are two things to keep in mind as you step through the above tactics for trying to determine why your SaaS application is so slow:</p> <ol> <li>The investigative techniques provided above will often uncover issues that you were not aware of (e.g., excessive browser extensions, unknown devices on your network, patches that should be applied, etc.)</li> <li>The problem may not be yours to solve. As stated earlier, repeated slowness issues should be reported to the help desk.</li> </ol> <p>When the source of the problem is outside of your network, solving the issue can become much more complicated. For example, after a thorough review of the traceroute and the BGP topology, a peering connection at the local IX could resolve the issue.</p> <p>In other cases, synthetic monitors can be deployed to collect historical data on issues related to slowness.
Armed with historical information on which devices in the path introduce latency, packet loss and jitter, you can contact service providers and urge them to update the hardware and connections at the specific locations where the problems are being introduced.</p> <p>These advanced troubleshooting techniques require special tools that help ensure that the entire, end-to-end cloud infrastructure is capable of delivering the quality of experience you’re looking for when logged into a SaaS application. To learn more, <a href="https://www.kentik.com/contact/">reach out</a> and ask about the Kentik Network Observability Cloud.</p><![CDATA[Network AF, Episode 6: Cat Gurinsky on mentorship and the shared languages of network engineering]]><![CDATA[On this episode of Network AF, Avi is joined by senior network engineer Cat Gurinsky to share her journey through networking. Cat found a passion for automating deployments and troubleshooting and is the current chair for the NANOG Program Committee.]]>https://www.kentik.com/blog/network-af-episode-6-cat-gurinsky-mentorship-network-engineeringhttps://www.kentik.com/blog/network-af-episode-6-cat-gurinsky-mentorship-network-engineering<![CDATA[Michelle Kincaid]]>Thu, 02 Dec 2021 05:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="https://images.ctfassets.net/6yom6slo28h2/5vEoPmn2Py6SYuCutYAmR2/3bb89926604062250d17ff1fa5d1bcf6/cat-gurinski.jpeg" style="max-width: 200px;" class="image right" alt="Cat Gurinski" /></a></p> <p>In the latest episode of the Network AF podcast, your host Avi Freedman welcomes his friend and networking pro <a href="https://www.kentik.com/network-af/">Cat Gurinsky to the show</a>.</p> <p>As a senior network engineer with loads of experience, Cat is most passionate about automation and troubleshooting, and especially loves to use Python and Arista’s pyeapi frameworks in her pursuits. She’s also the current chair of the <a href="https://www.nanog.org/about/who-we-are/program-committee/">NANOG Program Committee</a>, and previously worked for companies like Best Buy, Switch and Data, and Equinix. She lives in Austin, Texas with her family, where, outside networking, she owns and teaches at her dojo, <a href="https://www.immortaltiger.com/">Immortal Tiger Kenpo Karate</a>.</p> <p>Cat joins Network AF to discuss:</p> <ul> <li>Her path to networking</li> <li>Mentorship experiences</li> <li>The pros and cons of automation</li> <li>Martial arts and operating her dojo</li> <li>Advice to network engineering newcomers</li> </ul> <h3 id="the-path-to-networking">The path to networking</h3> <p>Cat credits her father, who was a computer lab teacher, for introducing her to tech. From there, she attended technical school after her daily high school classes were done, and she says that helped her break into the industry with an early job repairing broken hard drives and computer equipment.</p> <p>In her college days, Cat went to Valparaiso University to study Japanese, and she used her early tech foundation to land a student job in the school’s IT department. As a post-grad, she was encouraged by a friend to apply for an open network engineer position.
This experience became more like an apprenticeship, and Cat’s skillset, especially with programming languages, quickly grew.</p> <p>Throughout her career, Cat has focused on several core areas: Arista BGP deployments for large data centers; writing scripts to make operating switches more efficient and easier for her teams; and automating network inventory tracking.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/fa58b7a5"></iframe> <h3 id="on-mentorship">On mentorship</h3> <p>Mentorship means a great deal to Cat, and she thinks having someone to bounce ideas off of and to learn from is how we can all become more approachable experts. During the episode, she talks with Avi about how NANOG is putting together a formal program for creating more mentorship opportunities to support learning complex network engineering skills. And because there is no complete guidebook for the types of problems one will encounter while managing networks and infrastructure, Cat says that having a safe space with a mentor to break and unbreak things offers a better ability to understand processes.</p> <p>Cat reflects on the mentorship she’s received and notes that now is the time to go find mentors. This will also help in finding the next generation of engineers.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/29ce5f61"></iframe> <h3 id="the-pros-and-cons-of-automation">The pros and cons of automation</h3> <p>As the topic of automation comes into the conversation, Avi asks Cat about the biggest outage she’s caused without automation. She says that shortly before the Switch and Data merger with Equinix, she was tasked with a week’s worth of collecting information from show commands. During that process, one of the switches rebooted and came back without a config, which Avi responds to with shocked eyes, describing it as a RAID erase.</p> <p>Cat says she repeatedly did the same commands until realizing that automating the process would be much faster. However, for all the good it can do in considerate hands, she offers a reminder that your links and work need to be correct before wide deployment. She adds that even changes like incorrect cabling can confuse other network engineers when running scripts causes incorrect labels.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/c55ec65f"></iframe> <h3 id="advice-to-network-engineering-newcomers">Advice to network engineering newcomers</h3> <p>What advice does Cat have for those interested in learning about automation and network deployment? She tells Avi that she knows it isn’t for everyone, but she recommends taking an intro class to understand the framework so that you can pursue self-taught lessons more successfully.</p> <p>She also recommends virtual labs as a way to become acquainted with fundamentals without the risk of disrupting services.
Regardless of the road taken, become familiar with the different languages the industry works with, she advises.</p> <p>Cat says she learned a lot of PHP and C++ before eventually needing to learn Python to interact with Arista.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/9c5a741b"></iframe> <h3 id="martial-arts-and-operating-a-dojo">Martial arts and operating a dojo</h3> <p>Outside of her work world, Cat notes that operating a dojo implies serious commitment to the craft. She laughs about how it all began with a friend convincing her to “take a sword class to learn to be like Kenshin,” referencing the anime Rurouni Kenshin.</p> <p>Curious about potential similarities between her karate and networking passions, Cat says each has its own language. “Martial arts is a language of the body. Networking is a language of the switches. And programming is a language of everything to interact with each other.”</p> <p>But more seriously, Cat says the dojo gives her a break from her day job and gets her out of her office chair.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/804028da"></iframe> <h3 id="parting-advice">Parting advice</h3> <p>In the closing minutes of the episode, Avi asks what advice Cat has for her younger self, looking back at everything from anime sword fighting to her early career. She says she would tell herself not to be in a rush to grow up — to take her time. She never took the chance to study abroad in Japan and regrets how the rush to graduate early came at the cost of traveling abroad.</p> <p>During adulthood and growing a career, it’s harder to get the chance to do something like this for an extended period, she says. Her parting advice? “If you’re thinking about it, do it. There will not be time later.”</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/b9a587fd"></iframe> <p><a href="https://www.kentik.com/network-af/">Listen to Avi and Cat’s full episode</a>. You can also follow Cat across the web on <a href="https://twitter.com/shimamizu">Twitter</a>, <a href="https://www.instagram.com/shimamizucat/">Instagram</a>, and <a href="https://www.linkedin.com/in/shimamizu/">LinkedIn</a>.</p><![CDATA[The 10 top network outages of 2021]]><![CDATA[Every year the internet suffers numerous disruptions and outages, and 2021 was no exception. Kentik's Doug Madory recaps the top 10. And now the world’s network engineers deserve a load of #HugOps in 2021.]]>https://www.kentik.com/blog/the-10-top-network-outages-of-2021https://www.kentik.com/blog/the-10-top-network-outages-of-2021<![CDATA[Doug Madory]]>Wed, 01 Dec 2021 05:00:00 GMT<p><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><img src="https://images.ctfassets.net/6yom6slo28h2/3Kvs82VAgztd7Kfr56KZy8/25a77a43815bb53047ef65b183ea57c5/hugops-kentik.png" style="max-width: 420px;" class="image right no-shadow" alt="HugOps Kentik" /></a></p> <p>Every year the internet experiences numerous disruptions and outages, and 2021 was certainly no exception. This year we documented outages, including multiple government-directed shutdowns, as well as what might be the internet’s biggest outage in history. In this post, I run through 10 of the top outages that we covered in 2021.
Needless to say, the world’s network engineers deserve a load of <a href="https://twitter.com/search?q=%23HugOps" title="Search hashtag HugOps on Twitter">#HugOps</a> in 2021.</p> <h3 id="famous-internet-outages">Famous internet outages</h3> <p>Back in February, my friend <a href="https://twitter.com/jtkristoff">John Kristoff</a> kicked off a <a href="https://seclists.org/nanog/2021/Feb/255">lively discussion</a> on the <a href="https://www.nanog.org/resources/nanog-mailing-list/nanog-mailing-lists/">NANOG listserv</a> by asking subscribers to rank their top three most famous internet outages. The subsequent exchange of internet war stories inspired the creation of the lead-off panel at <a href="https://www.nanog.org/events/nanog-83/">NANOG 83</a> entitled “<a href="https://www.youtube.com/watch?v=QwvNIQecYn4">Famous Internet Outages</a>.” Kentik Co-founder and CEO Avi Freedman and I were two of the panel’s four speakers.</p> <p>The following is our list of the 10 top internet outages that occurred in 2021 listed in rough chronological order.</p> <p><a id="1" style="margin-bottom: 10px;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="1-uganda-election-shutdown">1. Uganda election shutdown</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/229uuFmk8oKZ0GlIX2pT32/598cdc3a768bd1b0f7fb4125529d1277/uganda-election-shutdown.png" style="max-width: 800px;" class="image center" thumbnail alt="Uganda election internet shutdown" /> <p>The first major outage on our list occurred in January. That was when the government of Uganda cut the country’s internet services in the days around a national presidential election. This outage took place almost 10 years to the day after the internet shutdown in Egypt during the Arab Spring.</p> <p>Egypt’s shutdown was a watershed moment for the internet community. It signaled the beginning of the era of the large-scale, government-directed internet shutdown that we presently find ourselves in.</p> <p>Uganda was almost completely offline for five days around the day of the vote. A vote which resulted in the re-election of President Museveni, extending his 36-year rule over the East African country. Sadly, the disruption came as <a href="https://www.accessnow.org/the-world-is-watching-uganda-elections/" title="The World is Watching Uganda Elections">little surprise</a> to observers, who anticipated and tried to prevent the internet blockage.</p> <p><a href="https://twitter.com/lawyerpants" title="Peter Micek on Twitter">Peter Micek</a> of <a href="https://www.accessnow.org" title="Access Now">Access Now</a> and I co-wrote a <a href="https://www.kentik.com/blog/from-egypt-to-uganda-a-decade-of-internet-shutdowns/" title="Kentik Blog: From Egypt to Uganda, A Decade of Internet Shutdowns">blog post</a> stressing that the campaign to combat shutdowns is far from over and needs more support. It’s especially worrisome that, from this, other embattled authoritarian regimes may draw the conclusion that shutdowns work, given the president’s re-election.</p> <div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <p><a id="2" style="margin-bottom: 0;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="2-verizon-fios-outage">2. 
Verizon Fios outage</h3> <img src="//images.ctfassets.net/6yom6slo28h2/3bLopsu537TGUgMhgPN6QC/a4110f6188fbc65c21350e7e43e1ca6d/verizon-fios-outage.png" style="max-width: 800px;" class="image center" thumbnail alt="Verizon Fios internet outage" /> <p>In the middle of the first Covid-19 pandemic winter, the Verizon Fios service suffered a major outage. According to Kentik data, Verizon experienced a <a href="https://twitter.com/DougMadory/status/1354127284870017027" title="Doug Madory on Verizon traffic drop, Twitter">12% drop in traffic volume</a> nationally while the service was down.</p> <p>While it was <a href="https://www.washingtonpost.com/technology/2021/01/26/internet-outage-east-coast/" title="Washington Post: Big Internet outages hit the East Coast, causing issues for Verizon, Zoom, Slack, Gmail">initially attributed</a> to a <a href="https://twitter.com/VerizonSupport/status/1354109889572982786" title="Verizon Support notice on Twitter">fiber optic cut</a> in New York City, later it was clarified that the fiber cut was <a href="https://www.wsj.com/articles/verizon-internet-outage-disrupts-usage-in-northeast-11611685745" title="WSJ: Verizon Internet Outage Disrupts Usage in Northeast">a separate issue</a>. The outage lasted over an hour and disrupted the midday activities of thousands of remote workers and online students along the East Coast.</p> <div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <p><a id="3" style="margin-bottom: 0;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="3-military-coup-in-myanmar">3. Military coup in Myanmar</h3> <img src="//images.ctfassets.net/6yom6slo28h2/3i4Tb76QB9arJLWdnDNvwB/cb776bba0a244e8a047f23c94caac039/military-coup-myanmar.png" style="max-width: 800px;" class="image center" thumbnail alt="Internet outage from military coup in Myanmar" /> <p>Only weeks after the shutdown in Uganda came a military coup in Southeast Asia. 
On February 1, the Myanmar military seized control of the country through a <a href="https://en.wikipedia.org/wiki/2021_Myanmar_coup_d%27%C3%A9tat" title="Wikipedia: 2021 Myanmar coup d&#x27;état">coup d’état</a> and ordered a shutdown of most of the country’s internet services for several hours.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/ABcswfBO2RY?rel=0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> </div> <p>During the coming months, disruptions to Myanmar’s internet <a href="https://twitter.com/DougMadory/status/1368595073445883907" title="Doug Madory on Myanmar network disruptions, Twitter">ran the gamut</a>: a total shutdown of all services, nightly internet blockages, extended shutdown of mobile internet services, and even a leaked <a href="https://twitter.com/DougMadory/status/1357776166820728840" title="Doug Madory on Myanmar BGP blackhole, Twitter">BGP hijack of Twitter</a>.</p> <p>In an effort to understand and document the complex situation in Myanmar, we joined forces with CAIDA’s <a href="https://ioda.caida.org/" title="Internet Outage Detection and Analysis group">Internet Outage Detection and Analysis</a> (IODA) group, the Tor Project’s <a href="https://ooni.org/" title="OONI: Open Observatory of Network Interference">Open Observatory of Network Interference</a> (OONI) team and <a href="https://censoredplanet.org/" title="University of Michigan&#x27;s Censored Planet platform">Censored Planet</a> from the University of Michigan, to combine our technologies and produce a comprehensive analysis of the internet blockages that took place in Myanmar this spring.</p> <p>This collaboration culminated in <a href="https://dl.acm.org/doi/pdf/10.1145/3473604.3474562" title="ACM Paper: A multi-perspective view of Internet censorship in Myanmar">a paper</a> published in August in SIGCOMM’s <a href="https://conferences.sigcomm.org/sigcomm/2021/workshop-foci.html" title="ACM SIGCOMM 2021 Workshop on Free and Open Communications on the Internet (FOCI 2021)">2021 Workshop on Free and Open Communications on the Internet</a>.</p> <div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <p><a id="4" style="margin-bottom: 0;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="4-russia-blocking-twitter">4. 
Russia blocking Twitter</h3> <img src="//images.ctfassets.net/6yom6slo28h2/QSJRM6RhwlKLM9dW1IGl1/acc68dd0c2ef39d9c4bec8aeb15a9a8a/russia-blocking-twitter.png" style="max-width: 800px;" class="image center" thumbnail alt="Russia blocking Twitter" /> <p>On March 10, Russia’s agency for regulating the country’s communications (Roskomnadzor) attempted to slow traffic to Twitter, but it <a href="https://twitter.com/DougMadory/status/1369648537634545673" title="Doug Madory on Russian attempt to slow Twitter traffic, Twitter">inadvertently disrupted</a> much of the country’s mobile internet service instead.</p> <p>In a mistake that IT professionals around the world can relate to, they were stung by a <a href="https://twitter.com/DougMadory/status/1369665894494900224" title="Doug Madory on Russian internet outage, Twitter">bad substring match</a> from a poorly formed <a href="https://en.wikipedia.org/wiki/Regular_expression" title="Wikipedia: Regular expression (regex)">regular expression</a>. Intending to block Twitter’s link shortener t.co, Russia blocked traffic associated with all domains containing the string t.co, such as Microsoft.com and Reddit.com (a short sketch at the end of this section shows how the over-match works).</p> <img src="//images.ctfassets.net/6yom6slo28h2/1YGCg1ndaEw0Y7d6WYC2PU/e71745c4e42fcdd077cf36b9987957d5/outage-cartoon.png" style="max-width: 600px;" class="image center" thumbnail alt="Internet outage cartoon" /> <div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <p><a id="5" style="margin-bottom: 0;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="5-fastly-outage">5. Fastly outage</h3> <img src="//images.ctfassets.net/6yom6slo28h2/4d2wi7tP1XxWGJ3VgJyap0/f162d8089c6b36a78f29308ad839f9b0/fastly-outage-doug-madory-bbc.jpg" style="max-width: 600px;" class="image center" thumbnail alt="Doug Madory on the BBC after the Fastly outage" /> <p>In early June, content delivery network Fastly <a href="https://www.kentik.com/analysis/fastly-outage-knocks-major-websites-offline/" title="Fastly Outage Knocks Major Websites Offline, Kentik Network Analysis Center">experienced a major outage</a> when a faulty configuration push caused thousands of high-profile websites to become unreachable. According to Kentik data, Fastly saw a 75% drop in traffic volume during the outage.</p> <p>Fastly’s downed services began returning to normal within 50 minutes, which led me to take a glass-half-full perspective on the incident. Despite our best efforts, outages will happen, and we should take comfort in the speed at which Fastly was able to restore its services following the disruption.</p> <p>“There is no error-free internet, it doesn’t exist,” I told <a href="https://www.nytimes.com/2021/06/08/business/fastly-internet-outage.html" title="The New York Times: What is Fastly, the company behind the worldwide internet outage?">The New York Times</a>. I was also invited on a live BBC World News broadcast to discuss the outage — mostly to define CDN for the audience. :-)</p>
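<p>For the technically curious, here is a minimal Python sketch of the over-match behind the Russia story above. The domain list is invented for illustration, and this is of course not Roskomnadzor’s actual filter:</p> <pre><code>import re

domains = ["t.co", "microsoft.com", "reddit.com", "example.org"]

# Naive: the unescaped "." matches any character and the pattern is
# unanchored, so it also hits the "t.com" tail of microsoft.com.
naive = re.compile("t.co")

# Stricter: escape the dot and require t.co to be the registered
# domain itself (or have a label boundary right before it).
strict = re.compile(r"(^|\.)t\.co$")

for d in domains:
    print(d, bool(naive.search(d)), bool(strict.search(d)))

# t.co           True  True
# microsoft.com  True  False
# reddit.com     True  False
# example.org    False False
</code></pre>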
<div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <p><a id="6" style="margin-bottom: 0;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="6-syria-student-exam-outages">6. Syria student exam outages</h3> <img src="//images.ctfassets.net/6yom6slo28h2/2yobDGRTiJXjoCKz53hepR/3c8b36fb79694011d6ac1d2e62d725a2/syria-student-exam-outage.png" style="max-width: 800px;" class="image center" thumbnail alt="Syria internet blackouts during exams" /> <p>Syria experienced multiple hours-long <a href="https://www.kentik.com/analysis/internet-blackout-in-syria-due-to-student-exams/" title="Internet Blackout in Syria Due to Student Exams, Kentik Network Analysis Center">national internet blackouts</a> in May and June as part of an ongoing effort to combat cheating on student exams. Continuing a practice that began in the <a href="https://web.archive.org/web/20160815064635/http://research.dyn.com/2016/08/syria-goes-to-extremes-to-foil-cheaters/" title="Syria goes to extremes to foil cheaters, Dyn Research, 2016, retrieved via Internet Archive">summer of 2016</a>, the government of Syria ordered internet service cut for 4.5 hours on multiple days while high school final exams were administered.</p> <p>The outages began at 1:00 UTC (4am local) and lasted until 5:30 UTC (8:30am local), during which time the exams were physically distributed around the country. A similar practice has happened in <a href="https://www.voanews.com/a/iraqi-internet-experiencing-strange-outages/2921135.html" title="Voice of America: Iraqi Internet Experiencing &#x27;Strange&#x27; Outages">Iraq since 2015</a>. However, there were no exam-related national outages in Iraq this year. Perhaps this should give us hope that common sense might prevail in Syria with respect to these shutdowns.</p> <div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <p><a id="7" style="margin-bottom: 0;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="7-protests-in-cuba">7. Protests in Cuba</h3> <img src="//images.ctfassets.net/6yom6slo28h2/4qK7kJJuP3LarCgqIPJmKW/d7c096768725c7851340f7c79741313d/protests-cuba.png" style="max-width: 800px;" class="image center" thumbnail alt="Cuba internet outages due to protests" /> <p>On Sunday, July 11, Cubans in cities across the island nation spilled out into the streets in unprecedented numbers to protest an authoritarian government that has been in power since 1959. As these protests grew, the country’s internet went completely offline. At first, it was just <a href="https://twitter.com/DougMadory/status/1414327987525275659" title="Doug Madory on Cuba internet outages, Twitter">down for 30 minutes</a>; then we observed <a href="https://twitter.com/DougMadory/status/1414356912410353665" title="Doug Madory on Cuba internet outages, Twitter">sporadic and limited outages</a> over the next several hours.
On the following day, the <a href="https://ooni.org/" title="OONI: Open Observatory of Network Interference">Open Observatory</a> documented <a href="https://twitter.com/OpenObservatory/status/1414622433156476930" title="OONI on Cuba blockage of social media services, Twitter">blockages of social media</a> that continued throughout the week.</p> <p>While 2021 was the year internet shutdowns came to Cuba, <a href="https://twitter.com/DougMadory/status/1414327987525275659" title="Doug Madory on Cuba internet outages, Twitter">July’s outage</a> wasn’t the country’s first of the year. There was a mobile service <a href="https://twitter.com/DougMadory/status/1354581571840389127" title="Doug Madory on Cuba internet outages, Twitter">outage in January</a> that appeared to be government-directed following additional protests motivated by the <a href="https://www.theartnewspaper.com/2021/01/28/tania-bruguera-and-members-of-cuban-artist-activist-group-27n-arrested-in-havana" title="The Art Newspaper: Tania Bruguera and members of Cuban artist-activist group 27N arrested in Havana">27N movement</a>, as well as a <a href="https://twitter.com/DougMadory/status/1360311938039840774" title="Doug Madory on Cuba internet outages, Twitter">complete shutdown</a> for several hours in February.</p> <p>Despite the 2013 activation of the <a href="https://en.wikipedia.org/wiki/ALBA-1" title="Wikipedia: ALBA-1 submarine communications cable">ALBA-1 submarine cable</a>, internet access in Cuba has historically been very limited due to a combination of the effects of the U.S. embargo and Cuban domestic policy that has restricted access to the internet.</p> <p>Following the protests and outages, there was a <a href="https://www.datacenterdynamics.com/en/news/us-politicians-ask-president-biden-to-beam-internet-into-cuba-amid-protests-and-social-media-blackouts/" title="Datacenter Dynamics: US politicians ask President Biden to beam Internet into Cuba amid protests and social media blackouts">push by some U.S. politicians</a> to find a way to provide internet service to the Cuban people via balloon or other unorthodox means. Citing the success of Google’s <a href="https://techcrunch.com/2021/01/21/google-alphabet-is-shutting-down-loon-internet/" title="TechCrunch: Alphabet shuts down Loon internet balloon company">shuttered Project Loon</a> to extend balloon-based mobile internet service to Puerto Rico and Kenya, they proposed using the same technology to provide an alternative source of mobile internet service in Cuba.</p> <p>Of course, those Project Loon achievements were done in cooperation with local providers, not in contention with them, as would be the case in Cuba. Additionally, there are numerous unresolved challenges to what I termed a “Hollywood scenario” in the <a href="https://www.washingtonpost.com/world/2021/07/22/cuba-internet-balloons/" title="Washington Post: Cuba censored the Internet amid protests. Florida leaders want Biden to respond with balloon-based wireless.">Washington Post</a>.</p> <div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <p><a id="8" style="margin-bottom: 0;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="8-facebook-outage">8.
Facebook outage</h3> <img src="//images.ctfassets.net/6yom6slo28h2/372nlG30vxuYmr4iXz1LH7/ce8116b6e9d19f98c1468b04efc9c96e/facebook-outage.png" style="max-width: 800px;" class="image center" thumbnail alt="Facebook outage" /> <p>On October 4, the world’s largest social media platform <a href="https://en.wikipedia.org/wiki/2021_Facebook_outage" title="Wikipedia: 2021 Facebook outage">suffered a global outage</a> of all of its services for nearly six hours, during which time Facebook and its subsidiaries, including WhatsApp, Instagram and Oculus, were unavailable. While Facebook and Instagram had <a href="https://www.kentik.com/analysis/facebook-and-instagram-suffer-outage-on-april-8th-2021/" title="Facebook and Instagram Suffer Outage on April 8th, 2021, Kentik Network Analysis Center">suffered a brief outage on April 8</a>, the big outage for the social media giant in 2021 was clearly October 4.</p> <p>With a claimed 3.5 billion users of its combined services, Facebook’s downtime of at least <a href="https://www.kentik.com/analysis/facebook-suffers-global-outage/" title="Facebook suffers global outage, Kentik Network Analysis Center">five and a half hours</a> comes to roughly 1.2 trillion person-minutes of service unavailability, a so-called “1.2 tera-lapse,” or the largest communications outage in history.</p> <p>According to <a href="https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/" title="Facebook Engineering: More details about the October 4 outage">Facebook’s official explanation</a>, it was a routine maintenance job that took down the entire platform by issuing a command to “assess the availability of the global backbone capacity which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.”</p> <p>There were numerous ancillary effects from the loss of the Facebook platform. After <a href="https://twitter.com/Twitter/status/1445078208190291973">initially welcoming</a> the hordes of Facebook refugees, Twitter <a href="https://www.news9live.com/technology/app-news/twitter-briefly-goes-down-after-traffic-overload-following-facebook-instagram-outage-124225" title="News9: Twitter servers faced a minor slowdown after traffic overload following Facebook, Instagram outage">began to feel</a> the crunch as one of the few remaining social media options. <a href="https://twitter.com/signalapp/status/1445062426739855366" title="@signalapp on Twitter: signups are way up">Signal reported millions</a> of new signups in the wake of the temporary loss of WhatsApp. And users accustomed to authenticating via Facebook for various apps or websites were unable to access them.</p> <p>Lastly, the outage highlighted the confusion that average users have in understanding the difference between their internet service and the apps that run on it.
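</p> <p>For the curious, the back-of-the-envelope arithmetic behind the “tera-lapse” figure above is simple enough to check (a sketch, taking the claimed user count at face value):</p> <pre><code># Rough person-minutes of unavailability for the October 4 outage
users = 3.5e9              # claimed users across Facebook services
minutes_down = 5.5 * 60    # at least five and a half hours
print(users * minutes_down / 1e12)  # 1.155 -- roughly 1.2 trillion
</code></pre> <p>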
During the outage, Downdetector reported spikes in complaints about nearly every company in the internet ecosystem when people could not reach the Facebook platform.</p> <img src="//images.contentful.com/6yom6slo28h2/1g5F5SwZbw7JqcQOjqNi7o/6a2fad0042d556257ca49f6b85f2d603/facebook-outage-down-detector.png" style="max-width: 550px;" class="image center" thumbnail alt="Down Detector during Facebook outage" /> <div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <p><a id="9" style="margin-bottom: 0;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="9-comcast-outages">9. Comcast outages</h3> <img src="//images.ctfassets.net/6yom6slo28h2/4do7p4L5YSDQbSJpnmjm68/a34da3a6f48de53c295d3bfc876c0afc/comcast-outages.png" style="max-width: 800px;" class="image center" thumbnail alt="Comcast outages" /> <p>Beginning on Monday, November 8, and continuing into the next day, Comcast <a href="https://arstechnica.com/information-technology/2021/11/comcast-admits-widespread-outage-as-tens-of-thousands-of-users-report-problems/" title="ars Technica: Comcast admits “widespread” outage as tens of thousands of users report problems">suffered several large outages</a> affecting thousands of customers across the country. Beginning on the West Coast and continuing to the East, Comcast users experienced outages lasting anywhere from a few minutes to hours.</p> <p>Most large outages begin with a single precipitating event that takes down services until they are later restored. What makes these outages unusual is that different parts of the Comcast network went down at different times, sometimes hours apart.</p> <p>The graphic above shows two drops in traffic to AS33651 of Comcast beginning at 22:00 UTC on November 8 and again just before 6:00 UTC the following day. Meanwhile, AS33491 experienced a <a href="https://twitter.com/DougMadory/status/1458133562260140034" title="Doug Madory on Comcast outages, Twitter">total outage</a> beginning just after 13:00 UTC on November 9.</p> <p>As of this writing, Comcast has yet to publish an explanation for the outages.</p> <div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <p><a id="10" style="margin-bottom: 0;"> </a></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="10-military-coup-in-sudan">10. Military coup in Sudan</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/6vPhIroaX5miLyZTwZTKdg/3f751ac2645dac6ea1c70ce3215040f5/sudan-outage.png" style="max-width: 800px;" class="image center" thumbnail alt="Internet outage due to coup in Sudan" /> <p>On October 25, we saw another military coup d’état take down a country’s internet - this time in the North African <a href="https://www.cbsnews.com/news/sudan-coup-military-protests-deaths-us-blinken-warning/" title="CBS News: Clashes rock Sudan&#x27;s capital after deadliest day since military coup">country of Sudan</a>. 
While Sudan remained connected to the global internet, this shutdown targeted <a href="https://www.kentik.com/analysis/internet-blackout-in-sudan-in-fifth-day-following-military-coup/" title="Internet blackout in Sudan now in fifth day following military coup, Kentik Network Analysis Center">mobile internet services</a> rendering them inoperable for more than 24 days. Before restoring services on November 18, Sudanese military forces <a href="https://www.reuters.com/world/africa/mobile-phone-lines-inside-sudan-are-cut-before-planned-protests-2021-11-17/" title="Reuters: At least 15 people shot dead in anti-coup protests in Sudan, medics say">violently cracked down</a> on anti-coup protesters in the streets of Khartoum.</p> <p>Sudan is no stranger to shutdowns. The country has sadly experienced numerous blackouts in recent years, starting as far back as an incident in <a href="https://www.washingtonpost.com/news/the-switch/wp/2013/09/25/sudan-loses-internet-access-and-it-looks-like-the-government-is-behind-it/" title="Washington Post: Sudan loses Internet access — and it looks like the government is behind it">September 2013</a>, as well as <a href="https://twitter.com/DougMadory/status/1406332802442924037" title="Doug Madory on Sudan internet shutdown, Twitter">blackouts this summer</a> to prevent student cheating on exams.</p> <div style="font-size:16px;display:flex;flex-wrap:wrap;justify-content:space-between; margin-bottom: 40px;"> <div><a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936"><b>CAST YOUR VOTE</b></a></div> <div><a href="#top"><b>BACK TO TOP</b></a></div> </div> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <h3 id="honorable-mentions">Honorable mentions</h3> <p>This year’s outages that didn’t make our top-10 list included <a href="https://www.kentik.com/analysis/internet-outage-in-new-zealand-caused-by-cyberattack-response/" title="Internet outage in New Zealand triggered by cyberattack response, Kentik Network Analysis Center">September’s incident in New Zealand</a> caused by a faulty DDoS attack mitigation and the slow restoration of internet service <a href="https://www.kentik.com/analysis/hurricane-ida-causes-sustained-drop-in-internet-traffic-to-new-orleans/" title="Nine days after Hurricane Ida, internet traffic to New Orleans still recovering, Kentik Network Analysis Center">in New Orleans</a> as the region recovered from Hurricane Ida.</p> <p>February’s freeze that <a href="https://www.npr.org/2021/02/16/968230163/millions-without-power-in-texas-northern-mexico-as-blackouts-and-bitter-cold-con" title="NPR: Millions Lose Power In Texas, Northern Mexico As Blackouts And Bitter Cold Continue">knocked out electricity</a> for millions of customers in Texas also led to power outages impacting <a href="https://www.reuters.com/article/us-mexico-energy/power-outage-hits-northern-mexico-after-cold-snap-roils-texas-idUSKBN2AF1G4" title="Reuters: Power cuts hit 4.7 million users in northern Mexico after cold snap">4.7 million people in northern Mexico</a>. Without electricity, internet services were also knocked out, as <a href="https://twitter.com/DougMadory/status/1361420909244915716" title="Doug Madory on internet outage in northern Mexico, Twitter">we observed</a> in Kentik data.</p> <p>This summer saw the departure of U.S. military forces from Afghanistan after nearly 20 years of conflict. 
In a <a href="https://www.kentik.com/blog/whats-next-for-the-internet-in-afghanistan/" title="Kentik Blog: What’s next for the internet in Afghanistan?">blog post</a>, I posed the question of how the departure would impact the Afghan domestic internet, which had grown dramatically in the past decade.</p> <p>While we have yet to see any major internet disruptions in Afghanistan, it’s fascinating to observe the military withdrawal from a BGP perspective in the global routing table. The decline of prefixes originated by AS5800, the primary ASN used by the Department of Defense in Afghanistan, mirrors the troop drawdown. Below is a timeline of <a href="https://bgp.he.net/AS5800" title="HE: originations from AS5800">originations from AS5800</a> this year.</p> <img src="//images.ctfassets.net/6yom6slo28h2/631H6v8yFZDLlLLdPR55FL/b029d52df0d67d41a8daeb85f0fa9415/as5800-ipv4.png" style="max-width: 550px;" class="image center" alt="AS5800 timeline" /> <p>Another ASN that stopped announcing DoD prefixes in 2021 was the <a href="https://www.kentik.com/blog/wait-did-as8003-just-disappear/" title="Kentik Blog: Wait, Did AS8003 Just Disappear?">mysterious AS8003</a>. That was the obscure ASN that appeared out of nowhere in the final seconds of the previous administration and started announcing more IPv4 space than any ASN in history leading to headlines in the <a href="https://www.washingtonpost.com/technology/2021/04/24/pentagon-internet-address-mystery/" title="Washington Post: Minutes before Trump left office, millions of the Pentagon’s dormant IP addresses sprang to life">Washington Post</a> and <a href="https://apnews.com/article/technology-business-government-and-politics-b26ab809d1e9fdb53314f56299399949" title="AP: The big Pentagon internet mystery now partially solved">Associated Press</a>.</p> <h3 id="what-makes-your-list">What makes your list?</h3> <p>That’s our list of the 10 top internet outages that occurred in 2021. Would any of them make your all-time, top-three most famous outages? Did we leave any out? <a href="https://www.linkedin.com/feed/update/urn:li:activity:6871908532376231936">Let us know</a>.</p><![CDATA[Synthetics 101 - Part 1: How to drive better business outcomes]]><![CDATA[Synthetic monitoring can help you proactively track the performance and health of your networks, applications and services. In the first post of this multi-part series, Kentik’s Anil Murty will share how synthetic monitoring can also help you drive better business outcomes.]]>https://www.kentik.com/blog/synthetics-101-how-to-drive-better-business-outcomeshttps://www.kentik.com/blog/synthetics-101-how-to-drive-better-business-outcomes<![CDATA[Anil Murty]]>Wed, 01 Dec 2021 04:00:00 GMT<p>In this multi-part blog series, I’ll help you learn how you can use synthetic monitoring to proactively track the performance and health of your networks, applications and services — with the ultimate goal of helping you drive real business outcomes.</p> <p>I’ll start with a high-level overview of what synthetic monitoring is and explain how it has seen a growth in adoption as a direct consequence of changes in the way applications and services are delivered and where they are consumed. 
I’ll also get into the most common use cases for synthetic monitoring (gleaned from hundreds of customer conversations), and I’ll share how the analytics gained from it can be directly tied to Key Performance Indicators (KPIs) in each of those cases.</p> <p>Finally, I will follow up with several individual posts in the weeks and months ahead to delve into detail for some of the use cases outlined here. Where applicable, I’ll also show you how to use the Kentik platform to meet objectives in each situation.</p> <h3 id="what-is-synthetic-monitoring">What is Synthetic Monitoring?</h3> <p><a href="https://www.merriam-webster.com/dictionary/synthetic">Merriam-Webster</a> defines the word “synthetic” as something that is “produced artificially” and “devised, arranged or fabricated to imitate or replace usual realities.” I find that definition to reflect the general idea of synthetic monitoring pretty well, in terms of its high-level implementation. Specifically, in the networking context, synthetic monitoring is all about imitating different network conditions and/or simulating different user conditions and behaviors.</p> <div style="background-color: #f8f8f8; max-width: 450px; padding: 25px; margin: 0 auto; margin-bottom: 20px;"><p style="color: #0f97C4;"><b>Synthetic monitoring imitates:</b></p> <ul> <li>Network traffic types <ul> <li>Layer-2/3, web, audio, video…</li> </ul> </li> <li>Network conditions <ul> <li>Fully loaded, partially loaded</li> <li>One route vs. another</li> </ul> </li> <li>User locations <ul> <li>Works from San Francisco, but does it work from Auckland, NZ?</li> </ul> </li> <li>User actions <ul> <li>Reading a news site</li> <li>Logging into an application</li> <li>Checking out a shopping cart</li> </ul> </li> </ul> </div> <p>Synthetic monitoring achieves this by generating different types of traffic (e.g., network, DNS, HTTP, web, etc.), sending it to a specific target (e.g., IP address, server, host, web page, etc.), measuring metrics associated with that “test” and then building KPIs using those metrics.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4SSt6HlCpYiyxCeUQ4bnn8/f2e30e19b446d024b15d29e0b352a304/target-agent.png" style="max-width: 700px;" class="image center" alt="Synthetic Test and Target" /> <h3 id="why-imitate-or-why-monitor-synthetically">Why Imitate (Or Why Monitor Synthetically)?</h3> <p>To understand the need to simulate network and user conditions, it helps to consider the underlying broader market changes that have created that need. Over the last decade, the way that applications and services are delivered and consumed has been significantly disrupted by four key trends:</p> <ol> <li>The mainstream adoption of public cloud services driven by the increased flexibility, low capex costs and overall quality of service (in terms of availability and uptime) they provide.</li> <li>The move from delivering applications as binary images installed on hardware owned by customers to applications delivered as services (SaaS).</li> <li>The desire to consume applications and services from different parts of the world and on-the-go, driven by advances in technology enabling high-bandwidth applications on consumer devices.</li> <li>The recent trend of more and more companies offering the flexibility to work from home (or from anywhere in the world).</li> </ol> <p>If there is one thing that those four shifts have in common, it is an increased dependence on the network and the general internet in order to achieve their objective.
To drive this point home, I like to use a quote I heard during a recent customer conversation:</p> <p style="font-size: 130%; text-align: center;"><b><i>“Latency is the new outage.”</i></b></p> <p>When I heard that the first time, I went to google it and was pleasantly surprised that not only are some pretty big-name companies talking about it, but more importantly, latency as a metric is something that even Google (as dominant as it is in the search business) is paranoid enough about to list right there in its search results.</p> <img src="//images.ctfassets.net/6yom6slo28h2/SvcSL8YI10FopX21MyybG/5028326b25c46ea32673b2b7dc3f7a20/latency-new-outage.png" style="max-width: 400px;" class="image center" alt="Latency" /> <p>If I had to sum all that up in one sentence, I’d say that the shift to delivering “Anything as-a-Service” to anyone, anywhere in the world, and expecting it to perform the way it would if it were being delivered within the same building, is what has been the biggest driver of the reliance on networks.</p> <div class="pullquote right" style="margin-top: 15px;">“Anything-as-a-Service” has significantly raised the bar for network performance, and latency is the most important proxy for it.</div> <p>As a consequence, it has led to the growth in the use of synthetic monitoring. Specifically, being able to simulate dynamic and varied network conditions that reflect the nature of today’s network traffic and being able to visualize, debug and pinpoint the root of issues when they occur is where synthetic monitoring earns its money.</p> <p>Now let’s get into the value of synthetic monitoring to specific business use cases with the help of some real-world examples.</p> <h3 id="tying-it-to-business-outcomes">Tying It to Business Outcomes</h3> <p>In building <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a>, our team spoke with well over one-hundred customers who included service providers, digital enterprises (SRE and DevOps teams) and corporate IT teams. I’ve distilled our learnings from those conversations into the following business-impacting use cases for synthetic monitoring:</p> <ul> <li> <p><strong>Proactive monitoring</strong>: Regardless of the type of business you run, you never want to have a customer tell you something in your product or service experience is broken or unavailable. To that end, a tool that can give you a view of the customer’s experience on an ongoing basis and notify you the moment it suffers is invaluable. This is the fundamental use case that synthetic monitoring addresses - tracking the uptime, availability, reachability and overall end-user experience of critical services that your customers pay you for.</p> </li> <li> <p><strong>Holding third-party vendors / service providers / partners accountable</strong>: If you are an ecommerce company providing a shopping-cart checkout experience, you want to make sure there is as little friction in users going through it as possible. Unfortunately, today’s web services rely on tens, if not hundreds of calls to external services (like payment processors, third-party sellers, shopping platforms, etc.) as part of their implementation. 
Synthetic monitoring can help you disambiguate this by telling you if the issue is with the network or the application, and further, if it is in a service owned by you or by an external service provider.</p> </li> <li> <p><strong>Benchmarking and baselining performance</strong>: Service providers compete based on the performance they provide to their customers. What better way to prove your level of service than to compare your performance with that of your competitors and present it with hard numbers? Another use case for baselining performance could be for comparing network performance in evaluating different SD-WAN solutions.</p> </li> <li> <p><strong>Preparing for significant network changes or transitions</strong>: Has your company made a strategic decision to enter a new market? What if it isn’t a strategic decision but rather an acquisition that is causing consolidation of infrastructure? Wouldn’t it be nice to have the confidence before you “flip the switch” to move real traffic over so that things won’t go completely haywire? Alternatively, if your company has made a strategic decision to move to the public cloud, wouldn’t it be nice to know what the impact of moving a specific service or a portion of it from your on-premises data center to the public cloud would be - before you switched your user traffic to it?</p> </li> <li> <p><strong>Measuring and adhering to SLAs / SLOs</strong>: Are you a service provider delivering hosting services through a worldwide network of points-of-presence (PoPs)? Do you care about monitoring connectivity and performance between these sites? Do your customers ask you about the level of service you are providing them? These are things you can achieve using synthetic monitoring.</p> </li> <li> <p><strong>Intelligent route optimization</strong>: Service providers and network operators often want to be able to make routing and interconnection decisions based on cost and performance. Running synthetic monitors along transit and peering routes and analyzing the results programmatically can help you automate some of this decision-making.</p> </li> </ul> <p>Are one or more of the above use cases crucial to your business? If so, definitely stay tuned for the follow-up posts in this series. I will delve into each use case in detail and show you how Kentik can help you address your key pain points along the way. Are there other use cases that we have not outlined? If so, we would love to work with you to address those with synthetics.</p> <h3 id="kentik-synthetics">Kentik Synthetics</h3> <p>Kentik Synthetics provides you <a href="https://www.kentik.com/product/global-agents/">hundreds of hosted agents</a> located across different geographic regions and provider networks (ASes) that are capable of performing the tests that enable you to address the above use cases. 
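</p> <p>To make the idea of a “test” concrete, here is a minimal sketch of the kind of measurement a synthetic agent might perform, using only Python’s standard library. The target URL, sample count and timeout are illustrative assumptions, not Kentik’s implementation:</p> <pre><code>import time
import urllib.request

TARGET = "https://www.example.com/"   # hypothetical service to monitor
SAMPLES = 5

latencies_ms = []
for _ in range(SAMPLES):
    start = time.perf_counter()
    # Fetch the page and time the full request/response cycle.
    with urllib.request.urlopen(TARGET, timeout=5) as resp:
        resp.read()
        status = resp.status
    latencies_ms.append((time.perf_counter() - start) * 1000)

avg = sum(latencies_ms) / len(latencies_ms)
print(f"status={status} avg={avg:.1f}ms max={max(latencies_ms):.1f}ms")
</code></pre> <p>A real agent would run a loop like this continuously, from many vantage points, and feed the resulting metrics into alerting and KPIs.</p> <p>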
You also have the option of running these agents on your own infrastructure and/or within your own networks (as private agents).</p> <img src="//images.contentful.com/6yom6slo28h2/5jCznElJfO5lyQxQDl2IBx/60db0f8ecc0a64c4fdab993ff1cfce49/agent-management.png" style="max-width: 800px;" class="image center" alt="Synthetic Agents" /> <p>If you don’t want to wait for the next post in this series to learn more, you can <a href="#signup_dialog" title="Start a Free Kentik Trial">get started with Kentik</a> or <a href="/contact/">reach out to us</a> and we’ll show you a demo of the product.</p><![CDATA[Anatomy of an OTT traffic surge: Dune release on HBO Max]]><![CDATA[Last month, the long-awaited film adaptation of Frank Herbert's sci-fi epic Dune was released in theaters and on HBO Max. Directed by Canadian filmmaker Denis Villeneuve, the movie was a hit at the box office as well as via streaming, leading to another OTT traffic surge.]]>https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-dune-release-on-hbo-maxhttps://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-dune-release-on-hbo-max<![CDATA[Doug Madory]]>Wed, 17 Nov 2021 16:00:00 GMT<p>Last month, the long-awaited film adaptation of Frank Herbert’s sci-fi epic Dune was released in theaters and on <a href="https://www.hbomax.com/">HBO Max</a>. Directed by Canadian filmmaker Denis Villeneuve, the movie was a hit at the box office as well as via streaming, leading to another OTT traffic surge.</p> <h3 id="ott-service-tracking">OTT Service Tracking</h3> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/" title="Learn more about the Kentik Service Provider Analytics product">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered — an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/" title="Gaming as an OTT service: Virgin Media reveals that Call Of Duty: Warzone has the “biggest impact” on its network">Call of Duty update</a> or a <a href="https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday/">Microsoft Patch Tuesday</a>, these OTT traffic events can put a lot of load on a network, and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. Knowing the source and destination IPs of the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p><a href="https://www.kentik.com/resources/kentik-true-origin/" title="Learn more about Kentik True Origin">Kentik True Origin</a> is the engine that powers the OTT Service Tracking workflow.
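</p> <p>As a toy illustration of why flow records alone can’t name a service, consider joining the hostnames subscribers resolved in DNS against the destinations of their flows. The IPs, hostnames and byte counts below are invented, and this sketch bears no relation to how True Origin is implemented internally:</p> <pre><code># Hypothetical DNS answers observed for subscribers: dest IP to hostname
dns_answers = {
    "203.0.113.10": "vod.streaming-service.example",
    "203.0.113.99": "cdn.software-updates.example",
}

# Hypothetical flow records: destination IP and byte counts
flows = [
    {"dst": "203.0.113.10", "bytes": 9_400_000},
    {"dst": "203.0.113.99", "bytes": 2_100_000},
    {"dst": "198.51.100.7", "bytes": 800_000},
]

# Enrich each flow with the service name its destination resolved to.
for flow in flows:
    service = dns_answers.get(flow["dst"], "unknown")
    print(flow["dst"], flow["bytes"], service)
</code></pre> <p>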
True Origin detects and analyzes the DNA of over 540 categorized OTT services and providers and more than 50 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p> <h3 id="the-spice-and-traffic-must-flow">The spice (and traffic) must flow</h3> <p>Villeneuve’s Dune was the <a href="https://www.mediaplaynews.com/dune-you-again-top-weekly-whip-media-streaming-charts-nov-7/?hilite=dune">most streamed movie</a> for three weeks following its debut on October 21, a day ahead of its theatrical release.</p> <p>As illustrated below in a screenshot from Kentik’s Data Explorer view, HBO Max traffic experienced a 66% jump in volume following the release of Dune on streaming. When broken down by CDN, we saw that HBO Max traffic was mostly via Akamai (46.8%), Amazon/AWS (22.1%) and Limelight (21.1%).</p> <img src="https://images.contentful.com/6yom6slo28h2/1DkltJdakzezxopZBY6Bex/b932bcbb1416b09b5d3a7124c16940c8/HBO_Max_by_source_CDN.png" style="max-width: 500px; margin-bottom: 15px;" class="image center" thumbnail alt="HBO Max OTT traffic analysis with Kentik" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">HBO Max OTT traffic analyzed with Kentik</div> <p>When broken down by Connectivity Type (below), Kentik customers delivered HBO Max to their subscribers from a variety of sources including private peering (83.2%), transit (22%), embedded cache (17.4%), and IXP (7.1%). Usually, CDNs with a last-mile cache embedding program heavily favor embedded caches over the other connectivity types, since embedding allows both:</p> <ul> <li>The ISP to save transit costs</li> <li>The subscribers to get demonstrably better last-mile performance</li> </ul> <p>In this case, the fact that embedded cache traffic is a small proportion of the overall delivery implies that some ISPs from this dataset have maxed out their embedded caches.</p> <img src="https://images.contentful.com/6yom6slo28h2/rFwyms3lpC49fYhp9lzqy/4ca828c06636031c9ea15b9df9ffa154/HBO_Max_by_connectivitity_type.png" style="max-width: 500px; margin-bottom: 15px;" class="image center" thumbnail alt="HBO Max OTT traffic analysis by source" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">HBO Max OTT traffic analysis by source</div> <p>In addition to source CDN and connectivity type, users of Kentik’s OTT Service Tracking are also able to break down traffic volumes by subscribers, specific router interfaces and customer locations.</p> <h3 id="how-does-ott-service-tracking-help">How does OTT Service Tracking help?</h3> <p>In July, my colleague Greg Villain <a href="https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update/" title="Learn more about recent OTT service tracking enhancements">described the latest enhancements</a> to our OTT Service Tracking workflow, which allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like the release of a blockbuster movie on streaming can have impacts in all three areas. OTT Service Tracking is the key to understanding and responding when they occur.
Learn more about the application of Kentik for <a href="https://www.kentik.com/solutions/usecase/network-business-analytics/" title="Network Business Analytics Use Cases for Kentik">network business analytics here</a>.</p> <p>Ready to improve over-the-top service tracking for your own networks? Start a <a href="#signup_dialog" title="Start a free trial of Kentik">free trial</a> of Kentik today.</p><![CDATA[Network AF, Episode 5: Building relationships as an internet analyst with Doug Madory]]><![CDATA[Listen to our host Avi and guest Doug Madory's conversation around internet analysis. Doug shares how he got into technology, recounts his career in the Air Force, and dives into what it's like building relationships with the press and working with them as an internet analyst. ]]>https://www.kentik.com/blog/network-af-episode-5-building-relationships-internet-analyst-doug-madoryhttps://www.kentik.com/blog/network-af-episode-5-building-relationships-internet-analyst-doug-madory<![CDATA[Michelle Kincaid]]>Wed, 10 Nov 2021 04:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="//images.ctfassets.net/6yom6slo28h2/gL7AWUQKgRM7EvWtG5ehq/0173ecb6ca6446ebc46275834deea0ba/InternetIntelligence_DougMadory.jpg" style="max-width: 200px;" class="image right" alt="Doug Madory" /></a></p> <p>Network AF welcomes <a href="https://www.kentik.com/network-af/">Doug Madory to the podcast</a>.</p> <p>Doug is a veteran, a researcher, a writer and Kentik’s director of internet analysis. With his start in the U.S. Air Force within its Information War Center, Doug has now been working in the networking industry for 12 years.</p> <p>After the Air Force, Doug went on to work for Renesys, which was acquired by Dyn, which was later acquired by Oracle. At each of these companies, he served as an expert network researcher, using data to help interpret events, human behavior and internet incidents. At Kentik, Doug focuses on turning networking data into meaningful insights for customers and sharing his research with the industry and curious, everyday digital citizens.</p> <p>During this episode of the podcast, Avi and Doug discuss:</p> <ul> <li>How Doug’s interests led him to internet analysis</li> <li>How Doug builds relationships while communicating technical networking data</li> <li>How analyzing traffic flow presents opportunities for investigative journalists to tell stories</li> <li>And the effects of internet shutdowns</li> </ul> <h3 id="getting-his-start">Getting his start</h3> <p>From Poughkeepsie, New York, Doug gained an interest in technology at a young age. His father worked at IBM for 30 years, writing code and working on mainframes. As a result, there were always computers around the family home for Doug and his brother to learn on.</p> <p>Doug joined the Air Force as a network engineer on a team running the IT infrastructure for intelligence operations. He recounts his appreciation for Cisco’s certification programs, which he says strengthened his skills and technical understanding. Eventually, he made his way to head of the Solaris Administration for the Information War Center, in charge of 55 airmen who needed to be able to deploy at a moment’s notice.</p> <h3 id="bgp-as-a-professional-gateway">BGP as a professional gateway</h3> <p>Doug credits much of his professional growth and learning to his early days at Renesys, where he says he did a lot of BGP analysis.
He jokes that he “started off as the guy who would just write reports that nobody wanted to write, and do data QA that people didn’t really want to do.” This turned into writing opportunities that strengthened his ability to translate this information clearly, and became a major part of his professional path.</p> <p>On the subject of how this was all possible, Doug talks about being mentored as an “understudy” to Renesys’ founder Jim Cowie, collaborating with Jim and adding more detail to Jim’s materials.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/c293fd2d"></iframe> <p>Avi also asks Doug about his focus on <a href="https://www.kentik.com/analysis/">internet outages, shutdowns and internet development</a>. Doug says he writes about “gory routing incident autopsies.” He says he’s also fascinated by a trend in the world where governments will shut down their internet to prevent students from cheating on exams.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/18e62ea4"></iframe> <h3 id="on-internet-shutdowns">On internet shutdowns</h3> <p>Given the nature of this type of work, Doug’s orbit becomes entangled with relevant crises like the Arab Spring and other internet shutdowns that governments like Egypt imposed on their citizens. At Renesys, he had the tools to dig into BGP data and interpret what was happening. He says that the company didn’t see its role as being a flag carrier for human rights so much as having the ability to give technical analysis and make sure everybody got the story right. In a stroke of luck, Doug received a tip about the Egyptian internet being shut off from a network engineer a day in advance, giving him a day’s notice on the Egyptian revolution.</p> <p>He also references some of his favorite journalism projects, including working with The New York Times on the North Korean and Myanmar outages.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/35239a45"></iframe> <h3 id="whats-ahead">What’s ahead</h3> <p>Doug is currently excited about learning new skills and growing his understanding of synthetic monitoring research for performance measurement. This will help him help Kentik customers and interested journalists to easily interpret and tell their own stories of what’s happening across the internet.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/84e866a3"></iframe> <p>If he could tell his younger self something, Doug says he would’ve never been able to predict this amazing internet data analysis and interpretation job could exist or that there would be so many in-depth analysis tools. In addition to sharing general advice on building relationships and learning, Doug tells the audience to always ask questions from a place of humility and trust that the intrinsically passionate people in this space will respond in kind.</p> <p>Doug’s parting advice is to be someone who helps others and to try to find people at the right moment. You can find Doug on <a href="https://twitter.com/dougmadory">Twitter</a>, <a href="https://www.linkedin.com/in/dougmadory/">LinkedIn</a> and the <a href="https://www.kentik.com/blog/author/doug-madory/">Kentik blog</a>.</p><![CDATA[How to determine the source of SaaS latency]]><![CDATA[Monitoring your SaaS applications requires thinking creatively.
Traditional tactics lack insight. It’s time to consider synthetic monitoring. ]]>https://www.kentik.com/blog/how-to-determine-the-source-of-saas-latencyhttps://www.kentik.com/blog/how-to-determine-the-source-of-saas-latency<![CDATA[Michael Patterson]]>Mon, 08 Nov 2021 04:00:00 GMT<p>One of the positive things that came out of events in 2020 was that many of us started working from home. At first, it was kind of weird. But once we realized that what we needed was available online, it became easier. All we had to do was figure out a few new apps, like Slack, Asana and Google Docs. Then, after a couple of weeks of working from home, many of us started having thoughts like, “I wonder if I could wear shorts and my favorite slippers? I mean, no one sees me from the waist down.”</p> <img src="//images.contentful.com/6yom6slo28h2/6FCmzCUFOmhTWejf3snreH/77b2f060608a1c38a53c8f55393b5383/remote-worker-undies.png" style="max-width: 350px;" class="image center no-shadow" /> <p>Yeah, it’s awesome, isn’t it? But, we also learned something else: When there is a network problem, we have to figure it out ourselves, which is less awesome.</p> <p>Leading up to SaaS offerings, everything we communicated with over the network to do our jobs (other than surf the web and send email) was hosted on-premises. The mail server, the CRM, the file servers, etc. were all hosted in the corporate data center. If we had a connection problem, we’d call over the cubicle something like, “Hey Darla, can you see the Skywalker server? I’m having trouble connecting.” Depending on how that went, we’d try a few things and then call the help desk. Loaded with SNMP tools like HPOV, Wireshark and Telnet, the network team could figure out the connectivity problem and get us back up and running.</p> <p>Sitting at home and struggling to connect to Chorus or Monday.com, or having intermittent issues with a Zoom connection, we’ve all pretty much settled in on doing our own troubleshooting or reshuffling our workloads until internet connectivity issues clear up. Is this really the way it’s going to be going forward? Are those who work from home forced to solve their own network-related problems? Is there no other way?</p> <h3 id="its-time-to-get-creative">It’s time to get creative</h3> <p>While the internet is just a network, the problem is that your company cannot get end-to-end visibility with traditional network monitoring tools. The NetOps team can’t Telnet to routers and look at routing tables, and they can’t change routes or span a port to collect packets. This all means that traditional tools won’t provide the value they once did. As a result, we have to get creative and come up with new tools to regain visibility. The SaaS apps that many of us are using are likely hosted somewhere across the internet. And if the problem isn’t in your home, it’s somewhere between your home and a server located in a data center — who knows where?</p> <h3 id="should-you-tolerate-interruptions">Should you tolerate interruptions?</h3> <p>There are ways to <a href="https://www.kentik.com/blog/visibility-into-the-cloud-path-of-application-latency/">gain visibility into the cloud path of application latency</a>. But, even if you can identify the service provider introducing the problem, what can you do about it?</p> <p>And, wait, back up a minute. Aren’t intermittent connectivity issues expected and even tolerated so long as they aren’t excessive? How do you know when a connection issue is excessive?
This is when you have to decide whether to do something about the problem or go on tolerating the interruptions.</p> <h3 id="strength-in-numbers">Strength in numbers</h3> <p>If you are using an application that hundreds or thousands of others in your organization are using, there is strength in numbers. Reporting the issue to the help desk or even your peers is a great starting point. It’s like voting; it’s how individual contributors can make a difference. When enough people complain about the same SaaS, it gets attention. Someone will take notice, pore through the complaints, and do some homework. They will determine if the problem is the SaaS or something along the internet path, and they will need evidence to take action. This means that it’s time to get creative about how we measure performance.</p> <h3 id="synthetic-monitoring">Synthetic monitoring</h3> <p>When companies have many employees working remotely and relying on the same SaaS applications, they like to reinforce complaints with supporting evidence. In the post on “<a href="https://www.kentik.com/blog/how-to-monitor-packet-loss-and-latency-in-the-cloud/">How to monitor packet loss and latency in the cloud</a>,” the topic of using synthetic monitors was outlined. These are small software agents that measure performance (e.g., latency, packet loss, jitter and more) from distributed locations, much like work-from-home employees.</p> <h3 id="hop-by-hop-diagnostics">Hop-by-hop diagnostics</h3> <p>What needs to be emphasized here is that synthetic monitors can also perform diagnostics that allow NetOps to learn where in the internet path (hop-by-hop) potential problems are being introduced. See the example below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2i45VltyoRDY5o1lGigBb3/84d53a6dda2d6b1049792ea5f5b266f8/source-of-saas-latency-in-the-path.png" style="max-width: 700px;" class="image center no-shadow" thumbnail alt="Source of SaaS latency in the path"/> <p>If you have several probes all reporting issues with the same isolated router hop, there is strength in numbers. To take advantage of this, the data can be trended over time. This is where repeat offenders become obvious. Once you have collected the clear indicators of a problem, the data can be shared with the service provider that is responsible for the router in question. Hopefully, they can replace the router or route your traffic around it.</p> <h3 id="proactively-monitoring-saas-applications">Proactively monitoring SaaS applications</h3> <p>Ideally, the applications we depend on are constantly monitored from our perspective. Synthetic monitors can help here as well. These tiny applications can run on small computers such as a Raspberry Pi or even a container on your computer. If you are working with a vendor like Kentik, you’ll have <a href="https://www.kentik.com/product/global-agents/">hundreds of these agents</a>, already deployed around the world, which can be used to monitor the applications you care about. Take, for example, the list of monitored applications below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/34UbKpsZ6bgnjyHT8WrAnm/ef7474e2a6a72f9aa4a271438395835d/saas-latency-monitoring.png" style="max-width: 600px;" class="image center no-shadow" thumbnail alt="SaaS latency monitoring" /> <p>Above, all of the critical SaaS applications that the business depends on are monitored. Each of the applications can be monitored by one or more synthetic agents.
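</p> <p>Under the hood, each of those status tiles reduces to a few numbers. Here is a minimal sketch, with invented sample data, of how latency, jitter and packet loss fall out of a series of probe round-trip times:</p> <pre><code># Invented RTT samples in milliseconds; None means the probe was lost.
rtts_ms = [21.3, 22.1, None, 20.8, 35.6, 21.0]

answered = [r for r in rtts_ms if r is not None]
loss_pct = 100 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
avg_latency = sum(answered) / len(answered)

# Jitter as the mean delta between consecutive answered probes
# (a simplification of the RFC 3550 interarrival jitter estimator).
deltas = [abs(b - a) for a, b in zip(answered, answered[1:])]
jitter = sum(deltas) / len(deltas)

print(f"latency={avg_latency:.1f}ms jitter={jitter:.1f}ms loss={loss_pct:.1f}%")
</code></pre> <p>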
For example, if Office 365 were to incur latency, it would change color, and clicking on it would bring up a map indicating the area of the world that is experiencing trouble with the application.</p> <img src="//images.ctfassets.net/6yom6slo28h2/45PmsHUPu9IHoYfje6Fhy5/c27771e590799f635c8df6250d541da4/how-to-monitor-saas-applications.png" style="max-width: 700px;" class="image center no-shadow" thumbnail alt="How to monitor SaaS applications" /> <p>By clicking on “Path View” or the “Timeline” button (not shown), it becomes clear how frequently the problem is occurring, as well as which service provider is introducing it. The workflow makes it simple to home in on the source of the problem.</p> <h3 id="in-summary">In summary</h3> <p>Working from home is great, but in some ways it makes us our own IT department. That’s not practical at scale. A better approach is to have your company centrally monitor SaaS applications using globally distributed synthetic testing agents, like those provided by Kentik.</p> <p>If you’d like to learn more about how to monitor your SaaS applications with the Kentik Network Observability Cloud, <a href="https://www.kentik.com/go/get-demo/">reach out to our team</a> for a demo or to start your evaluation.</p><![CDATA[Network AF, Episode 4: Untangling business in the ISP industry with Elliot Noss]]><![CDATA[On today's episode of the Network AF podcast, Avi welcomes Elliot Noss, President and CEO of Tucows. Elliot has a love and passion for the internet that started the moment he was introduced to it. This passion comes through as he discusses his goals in networking and the positive change he wants to make in solving cybercrime issues at the DNS level.]]>https://www.kentik.com/blog/network-af-episode-4-untangling-business-in-the-isp-industry-with-elliot-nosshttps://www.kentik.com/blog/network-af-episode-4-untangling-business-in-the-isp-industry-with-elliot-noss<![CDATA[Michelle Kincaid]]>Tue, 26 Oct 2021 04:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="https://images.ctfassets.net/6yom6slo28h2/23j2euM6HVeFYA9wtvyUIO/aa8114d5cd1eaf20538b2955fe9f6ad1/elliot-noss-400.png" style="max-width: 200px;" class="image right" alt="Elliot Noss" /></a></p> <p>Today on <a href="/network-af/" title="Network AF: the network engineering podcast">episode 4 of the Network AF podcast</a>, host Avi Freedman welcomes his longtime friend Elliot Noss. For 25 years, Elliot has been the CEO of Tucows, the internet services company with the second-largest domain registrar in the world.</p> <p>Elliot is considered an outlier in the ISP industry, largely due to his transparency and the stellar customer experiences he encourages through Tucows. On the podcast, he and Avi discuss:</p> <ul> <li>Life in the domain-name space and beyond</li> <li>How to provide exceptional customer service</li> <li>The culture of passionate people at Tucows</li> </ul> <h3 id="elliots-start-in-the-space">Elliot’s start in the space</h3> <p>Elliot says he’s always been a businessperson first and a geek second. At university, he would allow himself to take a “fun” course once per semester. One year that involved computer programming, where he says he spent a lot of time filling out punch cards with a pencil and putting them into a card reader.</p> <p>While working in Toronto, he joined the ISP Infonautics, and he couldn’t believe they owned the TUCOWS domain name.
That is, “The Ultimate Collection of Winsock Software,” something Elliot calls an “anachronistic acronym.” The company understood the importance of software downloads as a primary lead, and they bought the website from Scott Swedorski. However, they didn’t have anyone running it. When Elliot asked about it, building the business became his job. Eventually, Infonautics would rebrand as Tucows with Elliot as its figurehead.</p> <p>Elliot has seen the entire computing and networking industry change during his 25 years at Tucows, including how people connect. Since the shareware days, the company has become an expert provider of domain names, network solutions, fiber internet service, mobile virtual network operations and SaaS. This growth stems from its origins in inbound marketing and, as Elliot says, being smart enough to take people’s money when they wanted to advertise casinos and software. It was the best place in the world to advertise your shareware. Now Tucows has grown into a network of services, including Ting, Hover, OpenSRS, Enom, Epag and Ascio. Avi and Elliot discuss that growth in-depth.</p> <h3 id="providing-exceptional-customer-experience">Providing exceptional customer experience</h3> <p>Elliot says Tucows’ growth and evolution are due in part to its emphasis on excellent customer support. Inspired by the customer-first ethos from dial-up ISPs back in the 90s, he adopted a stance to avoid what he calls the industry’s “dirty secret.” That is, half of support calls are billing-related.</p> <p>To avoid that for Tucows, Elliot has always aimed to keep the company’s services simple and reasonably priced, and to make billing as easy as possible. In turn, this significantly helps customers have a better experience, he says.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/3cab365a"></iframe> <h3 id="cybercrime-and-the-internet-as-entertainment">Cybercrime and the internet as entertainment</h3> <p>Since the second he interacted with it in the late 80s, Elliot has loved the digital world. In his words, he has “always felt like he owed a debt to the internet” for what it’s brought him.</p> <p>As a result, he says he wants to fix some of the low-hanging fruit that makes the internet a dangerous place. He talks with Avi about some cybercrimes involving phishing, pharming and spam, which he believes are largely solvable at the DNS level.</p> <p>By solving this, Elliot hopes it might inspire people to tackle bigger and broader problems. He states the only reason we haven’t yet begun to see these changes is because “big telecom and big media” have traditionally only seen the internet for its ability to deliver entertainment, and not its utility and infrastructure value.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/e641a397"></iframe> <h3 id="the-culture-of-tucows">The culture of Tucows</h3> <p>Tucows has a ton of great employee reviews on Glassdoor that highlight the company’s impressive culture. Avi asks Elliot about it and what Tucows does to make people happy.</p> <p>Elliot says culture is always in the little things: “It is the things you do when no one is looking, and it’s amazing that people complicate it beyond the golden rule.
There are those for whom we will be the best work experience they will have in their life.”</p> <p>Elliot says Tucows isn’t really for people who are only money-driven or status-driven, or for those looking to be upwardly mobile with titles that change all the time. At Tucows, people sometimes hold the same job title for eight or 10 years. The job responsibilities and scope increase every year, but the title stays the same.</p> <p>He also always asks, “Do you love the internet?” It’s the kind of filter and prerequisite they’re looking for at Tucows. He adds that great people who are happy are necessary for great success.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/7301ba4b"></iframe> <h3 id="elliots-parting-advice">Elliot’s parting advice</h3> <p>Elliot’s career advice for his younger self includes two things he tells his children. The first is to “make sure you do what you love.” He says in today’s hyper-connected world, you’re competing with everyone, everywhere. “If you don’t love what you do, you won’t work hard. And if you don’t work hard, you won’t succeed. Full stop.”</p> <p>The second comes from a story his daughter hates, about her first real job and not wanting people to dislike her. Elliot says, “This is really simple. If you make the people who you work with… if you make their jobs easier, they’re going to like you no matter how big of an <em>expletive</em> you are. If you make their jobs harder, they’re going to dislike you no matter how kind and sweet you are. Real simple.”</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/6823e6c3"></iframe> <p>You can find Elliot on <a href="https://twitter.com/enoss" title="Elliot Noss, @enoss, on Twitter">Twitter</a> and <a href="https://www.linkedin.com/in/elliot-noss-39a2/" title="Elliot Noss on LinkedIn">LinkedIn</a>.</p> <p>Watch the <a href="/network-af/" title="Network AF: the network engineering podcast">full episode here</a>… Or follow us on <a href="https://podcasts.apple.com/us/podcast/network-af/id1584877668?itsct=podcast_box_link&#x26;itscg=30200&#x26;ls=1" title="Network AF - network engineering podcast - on Apple Podcasts">Apple Podcasts</a>, <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9mZWVkcy5jYXN0ZWQudXMvOTEvTmV0d29yay1BRi02MDZjZGY0Ny9mZWVk" title="Network AF - network engineering podcast - on Google Podcasts">Google Podcasts</a>, or <a href="https://open.spotify.com/show/1xbEIFBRbilSvwOYuWtJ1Q" title="Network AF - network engineering podcast - on Spotify">Spotify</a> so you never miss an episode!</p><![CDATA[NPM, encryption, and the challenges ahead: Part 2]]><![CDATA[Encryption forced NPM vendors to evolve. In part 2 of this series, let’s discuss NPM’s evolution, including synthetic testing as one recent advancement. ]]>https://www.kentik.com/blog/npm-encryption-and-the-challenges-ahead-part-2https://www.kentik.com/blog/npm-encryption-and-the-challenges-ahead-part-2<![CDATA[Michael Patterson]]>Thu, 21 Oct 2021 04:00:00 GMT<p>In <a href="https://www.kentik.com/blog/npm-encryption-and-the-challenges-ahead-part-1-of-2/">part 1</a> of this series, I talked a bit about how encryption is shaping <a href="https://www.kentik.com/kentipedia/network-performance-monitoring/">network performance monitoring (NPM)</a>.
Let’s dive in deeper now…</p> <p>Most NetOps and DevOps professionals today hear complaints about network performance when employees work from home. Unless the complaint is coming from all remote users of an application, individuals suffering from slowness are on their own to figure out how to optimize connection speeds. Even users with the technical expertise to isolate the performance problem often admit there is nothing they can do to fix a sluggish connection.</p> <p>SNMP and packet capture will be used for many years to come to manage the on-premises network. These technologies, however, are generally not effective at gaining visibility into traffic patterns traversing the internet. When it comes to the World Wide Web, we simply can’t put a packet analyzer in every ISP in the path of a connection. We can’t telnet to the routers in the path because different companies in different countries own them. How then can we gain visibility into why a connection is slow for a given user?</p> <p>Tools like Microsoft’s <a href="https://techcommunity.microsoft.com/t5/networking-blog/introducing-packet-monitor/ba-p/1410594">Packet Monitor</a> (Pktmon) can be installed on work-from-home computers to regain packet visibility. But this doesn’t scale, and it is a reactive troubleshooting tactic. Ideally, we should stay proactive and uncover problems like brownouts before they surface as blackouts.</p> <h3 id="npm-must-evolve">NPM must evolve</h3> <p>We have to think differently. In <a href="https://www.kentik.com/blog/npm-encryption-and-the-challenges-ahead-part-1-of-2/">part 1</a> of this blog, I made the case that vendors such as Google and standards bodies such as the IETF are forcing a rethink of how NPM solutions are designed and deployed. This means backing up a bit and thinking about the goal of optimizing network connections to make sure the applications we depend on stay responsive. By focusing on the goal, we start to think outside the box of traditional monitoring techniques. We must consider new ways of gaining performance insights.</p> <p>What do we have access to that hasn’t changed and isn’t impacted by the IETF’s mission to stop pervasive monitoring?</p> <ul> <li><strong>BGP monitoring and correlation with flow-based telemetry</strong> allow companies to gain clear visibility into how corporate and remote-user traffic is traversing the internet. Using BGP, public IP addresses can be associated with a business’s Autonomous System, but this doesn’t always reveal the actual application (e.g., AWS currently hosts Intuit’s TurboTax).</li> <li><strong>By properly labeling interfaces</strong> (e.g., transit, public peer, backbone, etc.) and using flow data, traffic taking unoptimized paths can be identified and even rerouted.</li> <li><strong>Locating synthetic monitors</strong> in the locations where an organization has a significant user base provides performance insights that most organizations have never had in the past. These tiny agents measure latency, packet loss and jitter, and conduct traceroutes to identify hops and potential trouble spots in the path to cloud-hosted applications.</li> </ul> <h3 id="synthetic-testing-as-a-recent-advance-in-npm">Synthetic testing as a recent advance in NPM</h3> <p><a href="https://www.kentik.com/product/synthetics/">Synthetic testing</a> is a more recent development in the NPM space and has the advantage of using synthetically generated traffic to measure connectivity performance to cloud offerings. Synthetic testing allows for deterministic performance measurement without involving actual user traffic. Although synthetics is considered a more recent advance in network monitoring, its roots go back several decades to utilities such as ping and traceroute.</p>
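<p>To make the idea concrete, here is a minimal Python sketch of what a single synthetic HTTP check might look like. This is a toy illustration, not Kentik’s agent code; the target URL is a placeholder, and a real agent would also collect metrics like jitter, packet loss and per-hop traceroute data.</p> <pre>
import time
import urllib.request

# Placeholder target; a real deployment would probe the SaaS endpoints
# its users depend on, from agents located in each user region.
TARGET = "https://example.com/"

def probe(url):
    """Run one synthetic test: fetch the URL and time the full request."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=5) as resp:
        resp.read()
        status = resp.status
    return status, (time.monotonic() - start) * 1000.0  # latency in ms

status, latency_ms = probe(TARGET)
print(f"HTTP {status} in {latency_ms:.1f} ms")
</pre>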
<h4 id="distributed-network-performance-testing">Distributed network performance testing</h4> <p>Below is an example of locating synthetic monitors everywhere that users are accessing business-critical applications. In the Kentik solution, multiple customers can share monitors securely, making it easy to bring additional viewpoints online.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3kJrbnkA2xitLntQx8gob9/f90099b234ce8efd164ae56ca8e9c6bc/distributed-network-performance-testing.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="Global synthetic monitors" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Kentik deploys a global fleet of synthetic testing agents that can help organizations measure the performance of the network and applications continuously for their users.</div> <h4 id="api-testing">API testing</h4> <p>Beyond the tests mentioned above, synthetics can also test APIs to verify the availability and response time of the applications employees connect to in order to complete their work. The theory is that if the synthetic monitor has issues, then the end users in the same region are probably suffering. Customers are finding that this is precisely the case.</p> <p>Below is an example of some of the SaaS applications that are actively monitored with synthetics. Drill in, of course, for further details such as historical trends.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7ifPInxRU3j4ha2pIIzfqp/795ba596e17f85ae9b46096a931c1e13/saas-apps-active-monitoring-synthetics.png" style="max-width: 800px;" class="image center" thumbnail alt="SaaS applications being actively monitored with synthetic testing" /> <p>Synthetics also perform traceroutes, allowing NetOps and DevOps to identify specific routers within service providers that are consistently causing problems. Contacting the service provider may help with rerouting connections around the trouble spot. If not, peering or new transit can be considered.</p> <h4 id="npm-wrap-up">NPM wrap-up</h4> <p>Just as electricity eventually became much more reliable, so has network connectivity. Twenty years ago, we needed to constantly ping servers, switches and routers to make sure that they were still up and running. Hardware and operating systems have since become much more reliable. With connectivity a lesser issue, today we need to focus on performance brownouts, as they indicate the areas where we could eventually have blackouts.</p> <p>It’s hard to let go of our traditional tools and familiar telemetry. This is why we need to focus on the goal: to optimize network connections and ensure the applications we depend on stay responsive. The use of synthetics stays in line with this objective.</p> <p>If you’d like to learn more about how to get involved with an NPM solution that is evolving to overcome the challenges in today’s network traffic visibility space, <a href="https://www.kentik.com/go/get-demo/">reach out to our team</a> for more information.</p><![CDATA[Network AF, Episode 3: Uniting networking pros with Salesforce’s Janine Malcolm ]]><![CDATA[In the latest episode of Network AF, you'll meet Janine Malcolm, director of network engineering at Salesforce.
Here's a recap of the podcast, where Janine talks about her journey to get to where she is today and how she became interested in networking itself.]]>https://www.kentik.com/blog/network-af-episode-3-uniting-networking-pros-with-salesforces-janine-malcolmhttps://www.kentik.com/blog/network-af-episode-3-uniting-networking-pros-with-salesforces-janine-malcolm<![CDATA[Michelle Kincaid]]>Fri, 15 Oct 2021 04:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="https://images.ctfassets.net/6yom6slo28h2/7GuvFIqcQt0m1z4ENJNMQw/80aaeafc3ffcf7a8cb869e73149258ab/janine-malcolm.jpg" style="max-width: 200px;" class="image right" alt="Janine Malcolm" /></a></p> <p>If you’ve ever thought networking is bewildering as a newcomer, you’re not alone. In <a href="/network-af/">episode 3 of Network AF</a>, meet Janine Malcolm, the director of network engineering at Salesforce. She joins podcast host Avi Freedman to chat about some of the experiences she’s had throughout her career and how to make network engineering a more accessible profession.</p> <p>At Salesforce, Janine is currently focused on uniting groups of people as one overall network engineering team. On the podcast, she and Avi cover this topic, as well as mentorship opportunities, how to make the workplace a more respectful and equal place regardless of a person’s background, and lessons learned through experimentation and failure.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/80e51ad4"></iframe> <p>Before her days at Salesforce, Janine worked her way up as a customer service representative at DIGEX (in dial-up), helping people understand their web hosting and ISP problems. Gradually moving up from T1 to T2 to T3 support lines, she eventually learned enough to move server-side. To her, this was an eye-opening experience. And two years later, she moved to the network engineering side, proclaiming how cool it was to learn to code in Perl and get to move so much data around.</p> <p>It’s also worth pointing out that the original catalyst to jump-start Janine’s interest in technology was none other than her grandfather. She mentions him introducing her to computers around age four or five.</p> <h3 id="on-mentorship">On mentorship</h3> <p>Reinforcing that her path to success was filled with a strong support system, Janine says mentoring and pairing up with colleagues to solve networking problems was one of her greatest assets. She also credits DIGEX for having many fantastic women who acted as wonderful role models and mentors.</p> <p>Having the ability to set up buddy systems where people sit right next to each other setting up routers, troubleshooting problems and learning together gave her greater opportunity to ask questions in an open environment. She still feels that 70% of how people learn is simply sitting together, experiencing problems, sharing knowledge and helping with an open mind. It’s something that she tries to emulate with the teams she manages and builds, in addition to cross-training to fill gaps in knowledge. (It’s at this point in the podcast that several jokes are made at the expense of FORTRAN and how white space is not syntax.)</p> <p>On the importance of networks, which many say are just hardware, Janine and Avi both share anecdotes arguing that, in the end, it’s all software. Networks, of course, are constituted of hardware, but there’s always underlying software that ensures the hardware functions as intended.
That hidden complexity is a huge barrier to entry for someone in college, or anyone interested in network engineering, when it comes to understanding the underlying technology that powers it all. Avi shares that a friend at IBM likes to say, “Without the network, there is no cloud,” or the Zen version, “Networks are the water of the clouds, the connective tissue, what clouds are made of.”</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/a825a642"></iframe> <h3 id="on-blowing-up-the-internet-and-bubble-gumming">On “blowing up the internet” and “bubble-gumming”</h3> <p>Avi and Janine also discuss “blowing up the internet” and how this perspective has changed over time. For instance, in the past, a network engineer may have been encouraged to test and experiment, and less concerned about a minor interruption of service. It was less damaging then. But now, in the SaaS world we live in, reliability and hyper-available services are key. That lapse in service could cause a loss of trust when the end-user needs the functionality most. Avi jokes that if someone at Kentik said, “Hey I might try this thing and it might explode the internet, it might take our service down,” he would laugh while definitively replying “Oh no.”</p> <p>This isn’t to say that experimentation is dead. Far from it, actually, but now different methods are needed to ensure reliability, whether that means attempting new automations that can prevent service interruptions or setting up test environments in smaller batches.</p> <p>To this point, Avi discusses with Janine how, when they started, there was a lot of “bubble-gumming” static routes, and nobody knew exactly what everything did or why it was part of the network. Now, though, services like Kentik and others are available so that people know all the details of their network operations and traffic, understanding that if they turn off a route, they know what part of the service it will impact. In the past this was a much more laborious process involving turning off services, rebooting and digging for information manually. So experimenting is still an active element of network engineering, but now it happens in ways with less potential to hurt the business’ reputation.</p> <h3 id="on-salesforce">On Salesforce</h3> <p>While Janine works with many vendors and thinks it is fascinating to solve customer problems, she is doing slightly different things in her network engineering role at Salesforce these days.</p> <p>She’s focused on making sure the network is not only running efficiently, but that it provides a comfortable layer upon which to build applications that run smoothly over the network. She’s helping to fine-tune their networks based on what is and is not working for their customers.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/b3d9d4c7"></iframe> <p>Janine and Avi also discuss Salesforce and the company’s approach to network management, like whether they use the cloud and more traditional on-premises routers and switches. And Janine talks about Salesforce employing a hybrid approach to managing their network.</p> <p>When Avi asks Janine what trends and buzzwords she is tired of hearing, or by contrast is fascinated by, she immediately and enthusiastically responds with the phrase “software-defined,” saying she is really sick of it.
As far as both Avi and Janine are concerned, software defines anything and everything; it’s all software.</p> <p>There’s a lot more packed into this podcast episode, including Janine’s advice for future generations of network engineers, so be sure to <a href="/network-af/">give it a listen</a>.</p><![CDATA[NPM, encryption, and the challenges ahead: Part 1 of 2]]><![CDATA[Encryption and cloud adoption are creating hurdles for network performance monitoring vendors. To survive, solutions must evolve or die.]]>https://www.kentik.com/blog/npm-encryption-and-the-challenges-ahead-part-1-of-2https://www.kentik.com/blog/npm-encryption-and-the-challenges-ahead-part-1-of-2<![CDATA[Michael Patterson]]>Wed, 13 Oct 2021 04:00:00 GMT<p>It’s interesting to observe how encryption and <a href="https://www.kentik.com/kentipedia/network-performance-monitoring/" title="Kentipedia: Network Performance Monitoring definition">network performance monitoring (NPM)</a> have evolved over time. When I first entered the networking industry right out of college, many applications sent passwords over the network in clear text, unencrypted. Since just about everyone’s PC was wired back to a repeater (i.e., not a switch), we could observe each other’s traffic with free packet analyzers and laugh. Once you saw a person’s password to any given application, you knew they were generally using the same one for all of their other applications — email, the ticketing system, the FTP and Novell servers, etc. Well, that didn’t last long.</p> <p>Encrypted passwords came along, as did token authentication. Then came TLS, HTTPS, SNMPv3, and the list goes on. It’s almost as if someone out there in the ether is determined to end all passive or pervasive “unwanted” monitoring.</p> <p>When companies started outsourcing the hosting of their websites to the likes of Akamai and AWS, network teams learned quickly that many work- and non-work-related applications shared the same IP address. This made accurately monitoring latency and packet loss to some applications impossible.</p> <p>Some NPM vendors started pairing DNS lookup records with flow data in order to separate business applications from non-business applications hosted on the same IP address. The problem is that many companies have several DNS servers spread out in far-reaching locations, and not all DNS vendors allow access to the logs. Getting all the DNS logs is often impractical.</p> <p>These problems have, of course, created opportunities in the NPM market. The changes have forced vendors to evolve their tools in order to keep NetOps teams informed on how applications are performing for end users.</p> <h3 id="what-is-npm">What is NPM?</h3> <p>As a refresher, NPM solutions provide historical and real-time predictive views on the availability and performance of the network and the applications that travel over the network. Traditionally, this is done using flow analysis, SNMP, packet capture and other forms of infrastructure telemetry. However, as explained above, these techniques are often not helpful when trying to measure SaaS application performance.</p> <p>Next-generation NPM solutions ingest new forms of telemetry, which allow them to overcome their dependency on older collection methods. The NPM goal is still to improve the overall end-user experience with applications. But this goal gets harder and harder as more and more encryption gets introduced and more services move to the cloud.
Because of this, NPM solutions must evolve.</p> <p>In this short video, Kentik CEO Avi Freedman discusses the many types of <a href="https://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources/" title="Kentik Blog: The Network Also Needs to be Observable, Part 2: Network Telemetry Sources">network telemetry</a> data and integrations that are important to improving the value of network performance monitoring, and network observability in general:</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 64.75507765830346%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/iqkp9s7pu4" title="Network Performance Monitoring: Types of Data and Integrations, Avi Freedman, Kentik Video" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" allowfullscreen="" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">This video is a brief excerpt from "5 Problems Your Current Network Monitoring Can’t Solve (That Network Observability Can)"—you can <a href="https://www.kentik.com/go/webinar-5-problems-solved-by-network-observability/" title="Webinar: 5 Problems Your Current Network Monitoring Can’t Solve (That Network Observability Can)">watch the entire presentation here</a>.</div> <h3 id="is-google-hijacking-dns-with-doh">Is Google hijacking DNS with DoH?</h3> <p>Probably one of the more controversial evolutions in security that we are seeing in the pervasive monitoring space is <a href="https://en.wikipedia.org/wiki/DNS_over_HTTPS">DNS over HTTPS</a> (DoH). Google has service providers and enterprises in an uproar over its intention to have the Chrome web browser circumvent local DNS servers. This migration will allow Google to know all the websites its users are visiting, but it breaks many enterprise security and policy platforms that depend on the collection of DNS lookups. Some NPM platforms that rely on DNS logs will have a serious challenge if Google gets its way. Firefox already supports DoH by default.</p> <p>Consider SD-WAN as another example of DoH causing problems. The SD-WAN controller grants permission to connections based on the top-level domain (e.g., facebook.com) being visited. If Google has its druthers, a lot of SD-WAN vendors are going to be in a tough spot and will have to figure out another way to enforce policy. Some have stated that it could be done by passively monitoring for the <a href="https://en.wikipedia.org/wiki/Server_Name_Indication">SNI</a> in the initial HTTPS connection handshake, but maybe not for long. There is a movement to encrypt that as well.</p>
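<p>To see why DoH is so opaque to on-path tools, here is a minimal Python sketch that resolves a name through Google’s public DoH JSON API. The lookup rides inside an ordinary HTTPS session, so a monitoring tool in the path sees only encrypted traffic to dns.google, never the query itself. (This is an illustration with error handling omitted, not an endorsement of any particular resolver.)</p> <pre>
import json
import urllib.parse
import urllib.request

# Resolve example.com over DNS-over-HTTPS via Google's public JSON API.
params = urllib.parse.urlencode({"name": "example.com", "type": "A"})
url = "https://dns.google/resolve?" + params

with urllib.request.urlopen(url, timeout=5) as resp:
    reply = json.load(resp)

# The answer arrives as JSON inside the HTTPS payload.
for record in reply.get("Answer", []):
    print(record["name"], record["data"])
</pre>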
<p>Google isn’t the only one forcing changes in the NPM space. The Internet Engineering Task Force (IETF) is also working toward changes that will impact pervasive monitoring.</p> <h3 id="the-ietf-declares-pervasive-monitoring-is-an-attack">The IETF declares “pervasive monitoring is an attack”</h3> <p>Check it out for yourself in <a href="https://datatracker.ietf.org/doc/html/rfc7258">RFC7258</a>. The IETF community’s technical assessment is that pervasive monitoring (PM) is an attack on the privacy of internet users and organizations. If you’ve never read an RFC, read this one, as it’s only three pages of actual content. Read page two or just check this out:</p> <blockquote> <p>The term “attack” is used here in a technical sense that differs somewhat from common English usage. In common English usage, an attack is an aggressive action perpetrated by an opponent, intended to enforce the opponent’s will on the attacked party. The term is used here to refer to behavior that subverts the intent of communicating parties without the agreement of those parties. An attack may change the content of the communication, <em>record the content or external characteristics of the communication, or through correlation with other communication events</em>, reveal information the parties did not intend to be revealed.</p> </blockquote> <h3 id="other-npm-challenges">Other NPM challenges</h3> <p>The NPM market faces several other challenges that have been brought about by industry changes.</p> <ul> <li> <p><strong>HTTP is the new TCP</strong>. Fifteen years ago, NetOps could start up a packet analyzer and see all kinds of different unencrypted protocols such as FTP, SMTP, POP3, HTTP, DNS, etc. These protocols helped NPM tools identify applications. Today, many downloads are done over HTTPS, and the same holds true for checking email. It looks like DNS is heading in the HTTPS direction as well. Start up a packet or flow analyzer and make note of the most popular protocol. It’s encrypted HTTPS. Encrypted traffic means limited visibility (the toy breakdown after this list makes the point). In ten years, QUIC (Quick UDP Internet Connections) will likely become the protocol of choice for most websites, especially if Google encourages it like it did with HTTPS back in 2015. That was the year Google announced that its search engine would favor websites supporting HTTPS. In less than a year, most companies (~70%) serious about market competition made the switch. Increased HTTPS traffic made it difficult for routers to provide insights beyond traditional NetFlow v5 metrics. Flow collectors and packet capture probes are finding it more difficult to provide the traditional insight they have been serving up for years.</p> </li> <li> <p>Migration to the cloud has also meant that <strong>much less data is staying on-premises</strong>. Due to the global pandemic, employees are working from home and connecting only to third-party SaaS applications to do their jobs. This means their data never goes on-premises! As a result, the big investments made in powerful NPM appliances installed in the data center for passive “pervasive” monitoring are becoming less useful.</p> </li> <li> <p>With more applications natively built in the cloud, <strong>the increased use of containers and microservices</strong> has fundamentally changed the visibility and flow of network traffic. Because of this, monitoring tools are forced to search out alternative methods to see into container-to-container traffic. To date, the hope is to let NPM tools leverage APIs to recover some of this visibility, as the cost to export flow data has been prohibitive for many companies.</p> </li> <li> <p><strong>NPM tools are becoming commoditized</strong>. Some vendors are noticing that their differences are less significant. The largest differentiator for some is scalable collection rates, which are expensive and difficult to achieve. All vendors tout impressive collection rates; few can deliver.</p> </li> </ul>
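<p>The “HTTP is the new TCP” point is easy to demonstrate with a toy flow breakdown like the Python sketch below. The records are made up for illustration, and this is not real NetFlow parsing; the takeaway is that once nearly everything rides TCP/443, grouping traffic by port says almost nothing about the applications behind it.</p> <pre>
# Toy flow records: (src_ip, dst_ip, dst_port, bytes). Addresses and
# byte counts are hypothetical.
flows = [
    ("10.0.0.5", "93.184.216.34", 443, 120_000),
    ("10.0.0.7", "142.250.80.46", 443, 950_000),
    ("10.0.0.9", "151.101.1.6", 443, 640_000),
    ("10.0.0.5", "10.0.0.53", 53, 2_000),
]

by_port = {}
for _src, _dst, port, nbytes in flows:
    by_port[port] = by_port.get(port, 0) + nbytes

total = sum(by_port.values())
for port, nbytes in sorted(by_port.items(), key=lambda kv: kv[1], reverse=True):
    print(f"port {port}: {100 * nbytes / total:.0f}% of bytes")
# Nearly all bytes land on port 443: encrypted HTTPS that reveals little.
</pre>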
<p>Customers looking for a network performance monitoring solution need to have a clear understanding of the traffic patterns of their organization. These questions need to be answered before purchasing a solution:</p> <ul> <li>What percent of employees are working from home?</li> <li>What percent of employee traffic is headed to SaaS applications maintained by third parties (Salesforce, Google Docs, Slack, Asana, Gmail, etc.)?</li> </ul> <p>Based on the answers to the above, customers may find themselves excluding a lot of NPM vendors, because SNMP polling, collecting DNS logs and packet capture aren’t going to cut it.</p> <p>Stay tuned for part 2 of this blog post, where we’ll discuss how NPM is evolving.</p><![CDATA[Network AF, Episode 2: Backbone engineering and interconnection with Nina Bargisen]]><![CDATA[In episode 2 of Network AF, Nina Bargisen joins Avi to discuss network interconnection and peering, her career in networking, and mentorship and diversity issues in network engineering.]]>https://www.kentik.com/blog/network-af-episode-2-backbone-engineering-interconnection-nina-bargisenhttps://www.kentik.com/blog/network-af-episode-2-backbone-engineering-interconnection-nina-bargisen<![CDATA[Michelle Kincaid]]>Mon, 11 Oct 2021 04:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="https://images.ctfassets.net/6yom6slo28h2/6vEDn7vQPcyfJHCd8cRhR2/73c8cf475403567ea51032d87117232d/nina-bargisen.jpeg" style="max-width: 200px;" class="image right" alt="Nina Bargisen" /></a></p> <p>In episode 2 of Network AF, meet Nina Bargisen.</p> <p>Nina has spent over two decades in network engineering and talks to podcast host Avi Freedman about her history in interconnection and peering.</p> <p>She’s worked for companies like Netflix and TDC (formerly Tele Danmark Communications). Now Nina is Kentik’s director of GTM strategy focused on supporting service providers.</p> <p>Nina joins Avi on the <a href="https://www.kentik.com/network-af/" title="Network AF podcast on kentik.com. Find all episodes here!">Network AF podcast</a> to discuss:</p> <ul> <li>Mentorship experiences</li> <li>Ways the networking community can improve how it shares tribal knowledge with the increasingly diverse range of people joining the industry</li> <li>And some of the lessons learned throughout the years</li> </ul> <p>Nina’s entry into networking was non-standard. She began her career as a mathematician, going to university and doing research. After graduating, she took time away from her professional career to focus on the work it takes to raise her children, and she talks about the process of going back to work outside her home in 1999.</p> <p>Nina applied for an intro position for new graduates, explaining that she hadn’t used her degree and would love to be considered. She talks about how lucky she was to be accepted at the time, not knowing that the near future held the dot-com bubble.</p> <p>Quickly learning new skills and responsibilities, Nina grew her experience in telephony and value-added services. Eventually this took her to the internet division of TDC, where she worked on products for enterprises looking to host video advertising, shared hosting of video files, and eventually livestreaming services.</p> <p>In fact, she and Avi discuss how one of her early projects at TDC occurred because of the September 11th terrorist attacks, when she worked with a broadcasting client to stream news of the tragedy.
The two also talk about Nina’s trajectory through TDC over nearly 14 years, including becoming its peering coordinator after coming home from a trip to NANOG.</p> <p>Nina also speaks to how the lessons she learned at TDC building out its infrastructure and backbone influenced her role at Netflix. She says the biggest difference between her ISP job and Netflix was that at Netflix they would peer with everyone, except for a very few networks whose behavior might mess up traffic.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/36e9b194"></iframe> <p>When it comes to what’s hot and what’s hype, Nina is all about the new edge being hype. “For years everyone has been saying it is all slideware. 5G is going to be another 4G. But it looks like people are putting the computer closer to the edge, and people think content will come to the edge.” She doubts video on-demand will come to the edge, but livestreaming probably will, explaining that the edge doesn’t have enough file storage space for on-demand.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/ae953157"></iframe> <p>She also thinks non-latency-sensitive services are going to stay relatively central, though not entirely, because things are already decentralized in the cloud. Real-time is going to be on the edge, she thinks. And overall, Nina is optimistic about the edge because of how much the cloud has changed the internet.</p> <p>Avi and Nina also discuss Nina’s advice for those who want to enter the internet and infrastructure world, early career development, and the importance of diversity and inclusion in network engineering.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/82b08388"></iframe> <p>Nina’s parting words are to try to be welcoming if people seem like they might want to talk, and to be cognizant of making those conversations happen online and offline. “Everyone who is curious about how s#!* works should be told. Feed those who are curious.”</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p><a href="https://listen.casted.us/public/91/Network-AF-606cdf47/b7746141" title="Network AF, Episode 2: Backbone engineering and interconnection with Nina Bargisen">Watch and listen to the episode here</a>, or check out some of the highlights below. Follow us on <a href="https://podcasts.apple.com/us/podcast/network-af/id1584877668?itsct=podcast_box_link&#x26;itscg=30200&#x26;ls=1" title="Network AF on Apple Podcasts">Apple Podcasts</a>, <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9mZWVkcy5jYXN0ZWQudXMvOTEvTmV0d29yay1BRi02MDZjZGY0Ny9mZWVk" title="Network AF on Google Podcasts">Google Podcasts</a>, or <a href="https://open.spotify.com/show/1xbEIFBRbilSvwOYuWtJ1Q" title="Network AF on Spotify">Spotify</a> so you never miss an episode!</p><![CDATA[Facebook’s historic outage, explained]]><![CDATA[Facebook suffered a historic and nearly six-hour global outage on October 4. In this post, we look at Kentik’s view of the outage.
]]>https://www.kentik.com/blog/facebooks-historic-outage-explainedhttps://www.kentik.com/blog/facebooks-historic-outage-explained<![CDATA[Doug Madory]]>Tue, 05 Oct 2021 04:00:00 GMT<p>Yesterday the world’s largest social media platform <a href="https://www.kentik.com/analysis/facebook-suffers-global-outage/">suffered a global outage</a> of all of its services for nearly six hours, during which time Facebook and its subsidiaries, including WhatsApp, Instagram and Oculus, were unavailable. With a claimed 3.5 billion users of its combined services, Facebook’s downtime of at least five and a half hours comes to more than 1.2 trillion person-minutes of service unavailability, a so-called “1.2 <a href="https://www.oxfordreference.com/view/10.1093/acref/9780199571444.001.0001/acref-9780199571444-e-1393">tera-lapse</a>,” or the largest communications outage in history.</p> <p>When we first heard about problems with Facebook, we checked Kentik’s OTT Service Tracker, which analyzes traffic by over-the-top (OTT) service. The visualization below shows the dropoff in the volume of traffic served up by each individual Facebook service beginning at roughly 15:39 UTC (or 11:39 AM ET).</p> <img src="//images.ctfassets.net/6yom6slo28h2/53idPzZxZbyJdgOUrgnJYO/02edecd8129099f31262bbfb5a179687/facebook-outage-traffic-dropoff.png" style="max-width: 800px;" class="image center" alt="Facebook outage - traffic dropoff" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Kentik Data Explorer showing a drop in global traffic volume for Facebook services, October 4, 2021. <a href="http://creativecommons.org/licenses/by/4.0/" target="_blank" title="Creative Commons License CC-BY 4.0">CC-BY 4.0</a></div> <p>Facebook Video accounts for the largest share of the bits per second that the Facebook platform delivers, whereas its messaging app WhatsApp constitutes a much smaller volume of traffic.</p> <p>So what happened? According to a <a href="https://engineering.fb.com/2021/10/04/networking-traffic/outage/">statement published</a> last night, Facebook Engineering wrote, “Configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication.” The result of this misconfiguration was that Facebook inadvertently downed some of its links to the outside world, resulting in the withdrawal of dozens of the 300+ IPv4 and IPv6 prefixes it normally originates.</p> <p>Included in the withdrawn prefixes were the IP addresses of Facebook’s authoritative DNS servers, rendering them unreachable. As a result, users worldwide were unable to resolve any domain belonging to Facebook, effectively rendering all of the social media giant’s various services completely unusable.</p> <p>For example, in IPv4, Facebook authoritative server a.ns.facebook.com resolves to the address 129.134.30.12, which is routed as 129.134.30.0/24 and 129.134.30.0/23.</p>
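<p>A quick way to reason about why the withdrawals were so devastating is to check prefix coverage. The sketch below, a toy illustration using Python’s standard ipaddress module rather than how Kentik or Facebook compute reachability, shows that once no announced prefix covers the name server’s address, there is simply no route to it:</p> <pre>
import ipaddress

ns_addr = ipaddress.ip_address("129.134.30.12")      # a.ns.facebook.com
announced = [ipaddress.ip_network("129.134.30.0/24"),
             ipaddress.ip_network("129.134.30.0/23")]

def reachable(addr, table):
    # An address is reachable only if some announced prefix covers it.
    return any(addr in prefix for prefix in table)

print(reachable(ns_addr, announced))  # True: covering prefixes announced
print(reachable(ns_addr, []))         # False: both prefixes withdrawn
</pre>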
<p>The withdrawal of the latter prefix is shown below in Kentik’s BGP Monitor (part of <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a>), beginning with a spike of activity at 15:39 UTC before the prefix returned at 21:00 UTC.</p> <img src="//images.ctfassets.net/6yom6slo28h2/646oVQAFtdO0e4Y4r8myWX/ada00faf8bd815b876b7ef4d227f83c7/facebook-outage-bgp-monitor.png" style="max-width: 700px;" class="image center" alt="Facebook outage - BGP monitoring" /> <p>Below, Kentik’s versatile Data Explorer illustrates how traffic from Facebook’s platform changed over the course of the day when broken down by protocol. Before the outage, UDP delivering traffic-intensive video dominated the traffic volume, while TCP constituted a minority. During the outage, the TCP traffic went away entirely, leaving a small amount of UDP traffic consisting of DNS retries from Facebook’s 3.5 billion users attempting in vain to reconnect to their services.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3TmGwxHVOG2f7D6XKcyKbD/03f58bcda5e1e143920cfbf71f5322fe/facebook-outage-data-explorer.png" style="max-width: 800px;" class="image center" alt="Facebook's traffic change during the day" /> <p>In addition to our NetFlow analysis product that hundreds of our customers currently know and love, Kentik also offers an active network performance measurement product called <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a>. Our <a href="https://www.kentik.com/product/global-agents/">global network of synthetic agents</a>, located around the world, immediately alerted us that something had gone terribly wrong at Facebook.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3D5LRSHCe3pdeIT18MizMx/4770d2ae1a841fed085f25a420a3b3c7/synthetics-shows-outage.png" style="max-width: 700px;" class="image center" alt="Global agents alert Kentik of Facebook outage" /> <p>Here’s how the Facebook outage looked from the perspective of synthetic monitoring, which includes various timing metrics such as breakdowns of HTTP stages (domain lookup time, connect time, response time) and average HTTP latency, as shown below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3EhSCvGpVnuGwa9OoP05yg/272358275d98ec9337c3a0e176a422b5/synthetic-monitoring-view-of-outage.png" style="max-width: 800px;" class="image center" alt="Synthetic monitoring shows Facebook outage" /> <p>Kentik Synthetics can also report on HTTP status code and average latency, as well as packet loss. Given that yesterday’s outage was total, all of these metrics were in the red for the duration of the outage.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2BMwsLWGgweMLiASVf231x/969bc6f088451395b18b143765258754/synthetics-metrics.png" style="max-width: 800px;" class="image center" alt="Synthetics metrics during Facebook outage" /> <p>The “Path View” in Kentik Synthetics can run continuous traceroutes to a target internet resource. This helps us see how an outage manifests itself on the physical devices that carry traffic across the internet for users of Facebook. Shown below, this view enables hop-by-hop analysis to identify problems such as increased latency or packet loss. In the case of yesterday’s Facebook outage, the tests could no longer execute once Facebook.com stopped resolving, but they did briefly report packet loss before stopping.</p>
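<p>Conceptually, the hop-by-hop analysis in the screenshot that follows boils down to aggregating many traceroute runs and counting, per hop, how often probes go unanswered. The sketch below is a toy aggregation over made-up runs (the hostnames are illustrative), not Kentik’s implementation:</p> <pre>
from collections import defaultdict

# Each traceroute run maps hop number to the responding host,
# or None when the probes for that hop timed out.
runs = [
    {1: "gw.example.net", 2: "po101.psw01.sin6.tfbnw.net", 3: None},
    {1: "gw.example.net", 2: None, 3: None},
    {1: "gw.example.net", 2: "po101.psw01.sin6.tfbnw.net", 3: None},
]

timeouts = defaultdict(int)
for run in runs:
    for hop, host in run.items():
        if host is None:
            timeouts[hop] += 1

for hop in sorted({h for run in runs for h in run}):
    loss = 100.0 * timeouts[hop] / len(runs)
    print(f"hop {hop}: {loss:.0f}% probe loss")
</pre>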
<img src="//images.ctfassets.net/6yom6slo28h2/48wQ671NlUtCJsDlPf0YdL/550bd8f47d9f40a9ec8e23913390f5cd/facebook-outage-in-path-view.png" style="max-width: 800px;" class="image center" alt="Facebook outage in path view" /> <p>Facebook-owned WhatsApp didn’t fare much better. Synthetic agents performing a continuous PageLoad test (where the entire webpage is loaded into a headless Chromium browser instance running on the agents to measure things like domain lookup time, SSL connection time and DOM processing time) to <a href="https://www.whatsapp.com">https://www.whatsapp.com</a> started failing completely shortly after 15:40 UTC.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7lZViAMWoUJ7uCHw6RLzKS/9e7e0e904c99f149e21863e50e63c008/whatsapp-failure.png" style="max-width: 700px;" class="image center" thumbnail alt="WhatsApp failure" /> <p>Notice how nearly all the agents stopped seeing results for any of the metrics within a few minutes.</p> <img src="//images.ctfassets.net/6yom6slo28h2/uesw0lV13MsDTTImMOsYX/1544d7828791a27ab58860a685a65702/whatsapp-failure-agents.png" style="max-width: 700px;" class="image center" thumbnail alt="Global agents during WhatsApp failure" /> <p>A similar path analysis using continuous traceroutes for WhatsApp appears below. Before the outage, all the Kentik Synthetics agents are able to complete traceroutes (our agents run three every minute and increment TTL values to build this visualization) to WhatsApp servers. You can also see that one of the devices, with the hostname po101.psw01.sin6.tfbnw.net, located in AS32934 (which belongs to Facebook), is dropping no packets (0% packet loss).</p> <img src="//images.ctfassets.net/6yom6slo28h2/7rS2xVCtcqKESyoUaeSA92/d4507fd04119bf0793c5905f1aab33dc/whatsapp-traceroutes.png" style="max-width: 700px;" class="image center" thumbnail alt="Path analysis using traceroutes for WhatsApp" /> <p>Now let’s look at the same traceroute visualization in the thick of the outage. Notice that all global agents that are part of this synthetic test are unable to complete traceroutes to WhatsApp servers except one in San Francisco and one in Ashburn. And even for those agents, the San Francisco one is losing about half the packets (~50% packet loss) when it hits the device po105.psw02.lax3.tfbnw.net — which is another router inside AS32934 that belongs to Facebook.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5vwbepPTsig5WJme9nYwaU/c0babd70e6f0c9138472a7f50782a4f7/whatsapp-traceroutes-synthetic-agents.png" style="max-width: 700px;" class="image center" thumbnail alt="WhatsApp traceroute visualization" /> <p>One difference in the graphic below is the surge in latency as Facebook was bringing its service back online. At that moment, upwards of a billion users were likely trying to reconnect to WhatsApp at the same time, causing congestion that led to dramatically higher latencies.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7lgbZGzbBbjRI7oAmQzNo8/b3cc80ceb8611d175bd19d3d1a221aae/surge-in-latency.png" style="max-width: 800px;" class="image center" alt="Back online - surge in latency" /> <h3 id="what-now">What now?</h3> <p>Facebook’s historic outage showed that network events can have a huge impact on the services and digital experience of users.
It’s more important than ever to make your network observable so that you can see and respond to network failures immediately.</p> <p>Yes, getting alerts from your monitoring systems helps, but the challenge then becomes determining and resolving the cause of the problem.</p> <p>That’s where network observability comes in. Being able to gather telemetry data from all of your networks, the ones you own and the ones you don’t, is critical.</p> <p>Kentik provides a cloud-based network observability platform that enables you to identify and resolve issues quickly. That’s why market leaders like Zoom, Dropbox, Box, IBM and hundreds more across the globe rely on Kentik for network observability. <a href="https://www.kentik.com/go/get-started/">See for yourself</a>.</p><![CDATA[Listen up: The Network AF podcast is here]]><![CDATA[Today our Co-founder and CEO Avi Freedman launches Network AF, his new podcast. Come and listen to all things networking, cloud, the internet and more. The first episode is out today!]]>https://www.kentik.com/blog/listen-up-the-network-af-podcast-is-herehttps://www.kentik.com/blog/listen-up-the-network-af-podcast-is-here<![CDATA[Michelle Kincaid]]>Tue, 28 Sep 2021 04:00:00 GMT<p><a href="https://www.kentik.com/network-af/"><img src="https://images.ctfassets.net/6yom6slo28h2/8Ag7NulP1kDWMup42K7B0/48270153c0688a35aa6c436f00d1c5f5/network-af-square.png" style="max-width: 260px;" class="image right" alt="Network AF Podcast" /></a></p> <p>Hear here! Today we’re very excited to share that our Co-founder and CEO Avi Freedman launched a new podcast, <a href="https://www.kentik.com/network-af/">Network AF</a>.</p> <p>If you like nerding out on all things networking, cloud and the internet, this podcast is for you.</p> <p>If you like networking how-tos, best practices and biggest mistakes, this podcast is for you.</p> <p>If you want to up your poker game, well… this podcast might also be for you. (After all, Avi has played at the World Series of Poker and the Ultimate Poker Challenge.)</p> <p>As a self-proclaimed internet plumber, Avi plays host to top network experts from around the world for in-depth, honest and freewheeling banter on all things network.</p> <p>Network AF is a show about inspiring the next generation of network engineering professionals. In each episode you’ll learn a little about industry changes and trends and what lies ahead. Without further ado…</p> <h3 id="network-af-episode-1">Network AF, Episode 1:</h3> <h4 id="career-and-networking-evolution-with-bgpmon-founder-andree-toonk">Career and networking evolution with BGPMon Founder Andree Toonk</h4> <p>In the inaugural episode of Network AF, Avi chats with Andree Toonk, founder of the network monitoring and routing security tool BGPMon. The two discuss Andree’s career and some of the current projects he’s working on.</p> <p>As a 20-year networking professional based in Vancouver, Canada, Andree says he’s always had one foot in what is now called DevOps and another in networking.
During the podcast, he shares a bit about working for various ISPs, OpenDNS and Cisco through the acquisition of BGPMon.</p> <p>Andree and Avi cover a bunch of topics, including:</p> <ul> <li>How to overcome intimidation to learn more advanced networking skills</li> <li>How to stay connected to mentors to provide a path to greater success</li> <li>How to drive growth in your network engineering career</li> <li>Trends in SaaS, zero trust, automation and more</li> </ul> <p><a href="https://networkaf.kentik.com/public/91/Network-AF-606cdf47/79bb9840" title="Network AF Episode 1: Career and Networking Evolution with BGPMon&#x27;s Founder Andree Toonk">Watch and listen to the episode here</a>, or check out some of the highlights below. Follow us on <a href="https://podcasts.apple.com/us/podcast/network-af/id1584877668?itsct=podcast_box_link&#x26;itscg=30200&#x26;ls=1" title="Network AF on Apple Podcasts">Apple Podcasts</a>, <a href="https://www.google.com/podcasts?feed=aHR0cHM6Ly9mZWVkcy5jYXN0ZWQudXMvOTEvTmV0d29yay1BRi02MDZjZGY0Ny9mZWVk" title="Network AF on Google Podcasts">Google Podcasts</a>, or <a href="https://open.spotify.com/show/1xbEIFBRbilSvwOYuWtJ1Q" title="Network AF on Spotify">Spotify</a> so you never miss an episode!</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p>And here’s a bit more from the episode…</p> <h4 id="andrees-early-networking-career-and-mentorship">Andree’s early networking career and mentorship</h4> <p>Andree discovered networking during his time at university because of a Cisco-sponsored lab on campus. The company provided the newest infrastructure equipment, Flash-based tutorials, and Excel materials that were instrumental in growing Andree’s networking skills. Often he would study at that lab until midnight when it would close, meeting new friends and building an obsession for tinkering with networks. Eventually he got a gig at AMS-IX (the Amsterdam Internet Exchange), one of the larger internet exchanges in the world at the time. For him, it’s where the door opened beyond the default routes. He discovered a whole world behind the network, and never left that world.</p> <p>On hearing of Andree’s past at AMS-IX, Avi traded stories with him about the organization, including its interesting approach to layer zero using Glimmerglass. This prompts Avi to ask Andree about mentorship, and about who helped him along the way to learn such complex ideas and information. While Andree had several mentors, he jokes that at the beginning it was his time at SurfNet Communications that really helped him grow his automation expertise. As a new engineer, he often received all the “crappy jobs,” like being told to collect all the serial numbers from all the devices. He realized it was a three-week job if he went to each device manually, or a two-day job writing a program and executing it to do the same thing.</p> <p>To the point of mentorship, Andree says even if you’re just getting coffee with a new colleague or having a one-on-one with a peer in the industry, it can be useful to hear their approaches to problem solving and what they’re currently working on. He encourages everyone to take the chance to meet with people outside their own company’s bubble, whether it’s reaching out to a professional a few years your senior whom you respect, or someone you’ve enjoyed working with previously.
He says it can be scary, but find someone to have coffee or meet with regularly, once a week or once a month, because “you never know where you’ll find gold.”</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/e85e9a1b"></iframe> <p>Avi reinforces this idea that you never really know what will happen from staying connected with people and demonstrating that you’re interested in learning and getting involved in projects. He says that in a healthy company and healthy environment, people will recognize that effort and give you the opportunity to grow.</p> <h4 id="the-world-of-saas-networking-and-mysocketio">The world of SaaS networking and Mysocket.io</h4> <p>When asked about his interest in pursuing high-speed networking and the evolution of Linux, Andree talks about the history of his work being a main influence. Seeing “SaaS-y” options sprouting up that packaged traditional networking services made him curious how to build those in a cloud-native way in a virtual environment, both reliably and quickly. He liked the two main problems involved: solving a distributed computing problem without incurring prohibitive costs, and building hybrid virtual environments fast enough to offer the best of both worlds. He jokes that at the time, the main complaint he kept hearing was how slow Linux was, so he decided to answer for himself: “What is slow?”</p> <p>Andree and Avi then follow this path to discuss custom solutions versus vendor solutions, and how both have different merits, but moving toward more custom solutions often means losing vendor support and integrations.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/9a44d57d"></iframe> <p>Avi then wants to know what motivates Andree, leading to discussion of Andree’s new service, Mysocket.io. This service offers a private access solution that replaces VPNs with an identity-based zero-trust alternative. The solution allows more control over asset and environment access, so that a wiki server can point directly to a particular port, acting as a bouncer that sees whether someone is allowed in and what they’re doing within an organization’s framework. From a technical perspective, he is focused on how to build this in a clientless way where an individual can be anywhere.</p> <h4 id="getting-started-in-network-engineering-and-advice-to-the-younger-networker">Getting started in network engineering and advice to the younger networker</h4> <p>Avi and Andree discuss the idea of getting started despite some of the intimidation that surrounds networking.</p> <p>“The other thing with networking I’ve found is that it can be hard until you’ve put your hands on it. You think there is complexity that there isn’t,” Avi mentions. He adds that when he was learning BGP, he was always second-guessing himself about the possibility of performance-based speed issues, eventually realizing this was not the case. To this point, Andree agrees and mentions that in his perspective there is no better way to demystify networking than simply getting started and keeping your hands dirty so that those skills are always growing or staying in shape. While he admits that these practices are time-intensive and not everybody has that time, he urges anyone interested to start at whatever capacity their available time and work balance allow.</p> <p>As for advice Andree has for his younger self?
He feels very lucky to have stumbled into the right environment with people who were able to guide him through the right risks. But when he thinks about new people entering networking, he advises them to spend time with people they think they can learn from, and to find an environment and company where experimentation is encouraged. Without fear, you’ll never learn much. Read and lurk during presentations. And most importantly, learn how to code, even if it is just basic scripting and Python. It will set you apart and will make your job a lot more fun; it will “take the handcuffs off,” he says.</p> <iframe width="100%" height="230px" scrolling="no" style="border: none" src="https://networkaf.kentik.com/player/393df842"></iframe> <p>Avi adds that as you learn, make sure to document and teach! It’s helpful to the community, and somebody else will have experienced the same problem or will benefit from you doing that. Take the time to do what you think is interesting, and demonstrate passion in your area of interest.</p> <p>If you’d like to follow Andree’s work or connect, he’s on Twitter <a href="https://twitter.com/atoonk">@atoonk</a>, or check out his website <a href="https://www.toonk.io/">toonk.io</a>, where he offers insights and writes about adventures in networking and software. And don’t forget to subscribe to Network AF.</p><![CDATA[Anatomy of an OTT traffic surge: Microsoft Patch Tuesday]]><![CDATA[Last Tuesday, September 14th was the second Tuesday of the month, and for anyone running a network or working in IT, you know what that means: another Microsoft Patch Tuesday. Doug Madory looks at how the resulting traffic surge can be analyzed using Kentik’s OTT Service Tracking.]]>https://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesdayhttps://www.kentik.com/blog/anatomy-of-an-ott-traffic-surge-microsoft-patch-tuesday<![CDATA[Doug Madory]]>Wed, 22 Sep 2021 16:00:00 GMT<p>Last Tuesday, September 14th, was the second Tuesday of the month, and for anyone running a network or working in IT, you know what that means: another Microsoft Patch Tuesday.</p> <p>Years ago, in an effort to regularize the deployment of patches and updates to its software, Microsoft designated this day of the month as the one when patches get pushed out globally to computers, servers, and other devices running Microsoft’s operating systems.</p> <p>It is also a traffic surge that can be analyzed using Kentik’s OTT Service Tracking.</p> <p><strong>OTT Service Tracking</strong></p> <p>Kentik’s <a href="https://www.kentik.com/product/service-provider-analytics/" title="Learn more about the Kentik Service Provider Analytics product">OTT Service Tracking (part of Kentik Service Provider Analytics)</a> combines DNS queries with NetFlow to allow a user to understand exactly how OTT services are being delivered - an invaluable capability when trying to determine what is responsible for the latest traffic surge. Whether it is a <a href="https://stealthoptional.com/feature/virgin-media-explains-that-call-of-duty-warzone-is-the-biggest-strain-on-their-network-but-how-could-that-be-fixed/">Call of Duty update</a> or a Microsoft Patch Tuesday, these OTT traffic events can put a lot of load on a network, and understanding them is necessary to keep a network operating at an optimal level.</p> <p>The capability is more than simple NetFlow analysis. Knowing the source and destination IPs of a traffic surge’s NetFlow isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p>
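<p>Conceptually, the join is straightforward: recent DNS answers map server IPs to the service hostnames subscribers looked up, and those IPs then label the flow records. The sketch below is a toy version of that association, with hypothetical records; it is not Kentik’s True Origin engine.</p> <pre>
# Toy DNS answers: server IP -> hostname the subscriber looked up.
dns_cache = {
    "203.0.113.10": "download.windowsupdate.com",
    "203.0.113.20": "cdn-edge.example.net",
}

# Toy flow records: (subscriber_ip, server_ip, bytes).
flows = [
    ("100.64.1.20", "203.0.113.10", 42_000_000),
    ("100.64.1.31", "203.0.113.10", 17_500_000),
    ("100.64.1.20", "203.0.113.20", 3_200_000),
]

by_service = {}
for _subscriber, server, nbytes in flows:
    service = dns_cache.get(server, "unknown")
    by_service[service] = by_service.get(service, 0) + nbytes

for service, nbytes in sorted(by_service.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{service}: {nbytes / 1e6:.1f} MB")
</pre>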
Knowing the source and destination IPs of the NetFlow of a traffic surge isn’t enough to decompose a networking incident into the specific OTT services, ports, and CDNs involved. DNS query data is necessary to associate NetFlow traffic statistics with specific OTT services in order to answer questions such as, “What specific OTT service is causing my peering link with a certain CDN to become saturated?”</p> <p>Kentik <a href="https://www.kentik.com/resources/kentik-true-origin/">True Origin</a> is the engine that powers OTT Service Tracking workflow. True Origin detects and analyzes the DNA of over 540 categorized OTT services and providers and more than 50 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port at the edge of the network.</p> <p><strong>Microsoft Patch Tuesday</strong></p> <p>Last week, Kentik customers were experiencing another Patch Tuesday. As illustrated below is a screenshot from Kentik’s Data Explorer view, Microsoft Update traffic experienced a peak that was almost 7.5 times that of the previous day. The update traffic was delivered via a variety of content providers including Akamai (38%), Stackpath (17%) and Edgecast (16%).</p> <img src="https://images.contentful.com/6yom6slo28h2/50Gxp8ueIFnnf0XAu6b90C/849c0665bb59346badd9a064508185b6/Patch_Tuesday_September_2021b.png" style="max-width: 500px; margin-bottom: 15px;" class="image center" thumbnail alt="Microsoft Patch Tuesday/Windows Update traffic analysis with Kentik" /> <p>When broken down by Connectivity Type (below), Kentik customers received Microsoft’s latest round of patches and updates from a variety of sources including Private Peering (54%), Transit (22%), Embedded Cache (17.4%), and IXP (7.1%).</p> <img src="https://images.contentful.com/6yom6slo28h2/Bardz9q6yejw5d9VdisZL/ef847cce9a2a609e87bb52b1060aba31/Patch_Tuesday_September_2021c.png" style="max-width: 500px; margin-bottom: 15px;" class="image center" thumbnail alt="Microsoft Patch Tuesday/Windows Update OTT traffic analysis by source" /> <p>In addition to source CDN and connectivity type, users of Kentik’s OTT Service Tracking are also able to break down traffic volumes by subscribers, specific router interfaces and customer locations.</p> <p><strong>How does OTT Service Tracking help?</strong></p> <p>In July, my colleague Greg Villain <a href="https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update/">described the latest enhancements</a> to our OTT Service Tracking workflow which allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive costs</li> <li>Anticipating and fixing subscriber OTT service performance issues</li> <li>Delivering sufficient inbound capacity to ensure resilience</li> </ul> <p>Major traffic events like Microsoft’s Patch Tuesday can have impacts in all three areas. OTT Service Tracking is the key to understanding and responding when they occur. Learn more about the application of Kentik for <a href="https://www.kentik.com/solutions/usecase/network-business-analytics/" title="Network Busines Analytics Use Cases for Kentik">network business analytics here</a>.</p><![CDATA[Why latency is the new outage]]><![CDATA[Adding more bandwidth from the business to the cloud is like adding more cowbell to Grazing in the Grass. 
In many cases, it won’t improve the end-user experience when latency is the real problem.]]>https://www.kentik.com/blog/why-latency-is-the-new-outagehttps://www.kentik.com/blog/why-latency-is-the-new-outage<![CDATA[Kevin Woods, Michael Patterson]]>Tue, 21 Sep 2021 04:00:00 GMT<p>Latency is not a new problem in networking. It has been around since the very first connections were being made between computer devices. It’s actually a relic of an issue that was recognized as a problem before the introduction of the <a href="https://en.wikipedia.org/wiki/Ping_(networking_utility)">ping utility</a>, which was invented by Mike Muuss back in 1983. Yeah, that’s approaching 40 years.</p> <p>The challenge with fixing the latency problem is that it is difficult. Not as difficult as time travel, but it’s difficult enough so that for 30+ years IT professionals have tried to skirt the issue by adding more bandwidth between locations or by rolling out faster routers and switches. Application developers have also figured out clever ways of working around network latency problems, but there are limits to what can be done.</p> <p>Given that almost all traffic is headed to the cloud these days, adding bandwidth to a business or a home office is like adding more cowbell to <em>Grazing in the Grass</em> (a popular song from 1968). In many cases, it won’t improve the end-user experience.</p> <p><a href="https://www.youtube.com/watch?v=qxXZF60EPdM"><img src="https://images.ctfassets.net/6yom6slo28h2/4lIrszrwKy6IsWZBMvfF2j/cc14e449f512309d7b9dafe4de2c6de8/grazing-hugh-maskela-play.jpg" style="max-width: 400px; margin-bottom: 15px;" class="image center" /></a></p> <div style="max-width: 400px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Listen to <i>Grazing in the Grass</i> by Hugh Masekela</div> <p>Latency is a difficult problem to solve. Over the last few decades network managers have focused on adding bandwidth and reducing the network outages. Fault tolerance, dual homing, redundant power supplies and the like have all led to improved availability. Fast forward to today, most networks enjoy five nines (99.999%) of uptime. Because the downtime problem has largely been overcome, the industry is finally coming around to addressing the old slowness problem. That’s why latency is the new outage.</p> <h3 id="what-causes-latency">What causes latency?</h3> <div class="pullquote right" style="max-width: 340px;">Because the downtime problem has largely been overcome, the industry is finally coming around to addressing the old slowness problem. That’s why latency is the new outage.</div> <p><strong>Cabling and connectors</strong>: This is the summation of physical mediums used between the source and the destination. Generally this is a mix of twisted-pair cabling and fiber optics. The type of medium can impact latency, with fiber being the best for reducing latency. A poor connection can introduce latency or worse, packet loss.</p> <p><strong>Distance</strong>: The greater the physical distance, the greater the possibility of introducing latency.</p> <p><strong>Routers and switches</strong>: The internet is largely a mesh configuration of network devices that provide connectivity to everywhere we want to connect. Each switch and router we pass through introduces a bit of latency that adds up quickly. Router hops introduce the most.</p> <p><strong>Packet loss</strong>: If the transmission ends up traveling over a poor connection or through a congested router, this could introduce packet loss. 
Packet loss in TCP connections results in retransmissions which can introduce significant latency.</p> <h3 id="how-latency-is-measured">How latency is measured</h3> <p>Network latency is the time required to traverse from the source to the destination. It is generally measured in milliseconds. Here we will explore how latency can be measured and the factors that can introduce latency. This in turn will help us understand how optimizations can be made that will minimize the problems introduced by it.</p> <p>Below is the beginning of a TCP handshake. After the DNS lookup and the ARP, the host reaches out to the IP address of the destination using a SYN in order to open a connection. The destination then responds with an ACK. The timestamps between these two packets can be used to measure the latency. The problem is that the measurement needs to be taken as close to source hosts as possible. This would be point A.</p> <img src="//images.ctfassets.net/6yom6slo28h2/LBclmMeMrsS7jV2gJZOZi/5f5e344ed34e627acd787ecb3d354738/network-round-trip-latency.png" style="max-width: 600px; margin-bottom: 15px;" class="image center no-shadow" /> <div style="max-width: 600px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">The basics of network latency and round-trip latency</div> <p>The best place for point A (shown above) is on the host itself. However, this is not always possible. To get an idea of what I’m talking about, most of us have executed ping on the command line in order to see if a host is responding. Although the ping utility on our computers uses ICMP, which is the first to see latency, other utilities similar to ping can execute tests similar to ping on top of TCP and UDP.</p> <p>Below is a comparison between a TCP ping using <a href="https://docs.microsoft.com/en-us/sysinternals/downloads/psping">psping</a> and an ICMP ping.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1yxN6h2m8ulMfYUxSowCAS/3dd12ff448fe88c71b71881ec0013a39/psping.png" style="max-width: 500px; margin-bottom: 15px;" class="image center" alt="PsPing screen capture" /> <div style="max-width: 500px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">PsPing screen capture</div> <p>Notice in the image above how the differences in responses between psping and ping are not terribly significant. However, if any of the connections or devices in the path to youtube.com were congested, the ICMP ping would likely show higher millisecond (ms) rates compared to the psping measurements.</p> <p>If you are curious about how a UDP ping would compare, psping can be used to perform this latency test as well. <a href="https://documentation.help/PsTools/PsPing.htm">Read the documentation</a> for further details. When I ran the test, the results came back with pretty much the same response times as ICMP and TCP. If you google <a href="https://www.google.com/search?q=udp+ping+utility">udp ping utility</a> you’ll find several executables that are fun to test with.</p> <p>Some vendors might tell you that latency doesn’t apply to UDP because it is connectionless (i.e., one way). Tell this to the user trying to conference with someone who’s video is all choppy. Or, try telling this to the folks at Google who developed Quic over UDP. Speaking of which…</p> <p>I was looking for a command line ping utility that used the Quic UDP protocol against websites. 
Although I wasn’t able to find one, I did come across a web page called <a href="https://http3check.net/?host=youtube.com">http3check.net</a> that allows you to enter a domain and test the Quic connection speed. Below I ran it for youtube.com from my home office:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">Connection ID PKT RX Hand Shake ------------- ------ ---------- CB8FB44ABF... 4.185ms 12.457ms 43237B1614... 9.604ms 10.629ms</code></pre></div> <p>These ping utilities are fine for point troubleshooting, but for ongoing tests against mission critical networks and websites, synthetic monitoring is the best option to <a href="https://www.kentik.com/go/webinar-synthetics101-2021-03/">proactively monitor network performance</a>.</p> <p><em>NOTE: Some tools (e.g., ping, iperf and netperf) measuring with the same protocol for latency to the same destination will provide different results. If you are serious about testing, it is best to try multiple.</em></p> <h3 id="how-to-reduce-latency">How to reduce latency</h3> <p>Reducing latency over the network first requires an understanding of where the points of slowness exist — and there could be multiple sources. On the command line, TCP Traceroute is my favorite turn-to utility for mapping out the hop-by-hop connection between my device and the destination.</p> <h3 id="what-is-traceroute-and-how-does-it-work">What is traceroute and how does it work?</h3> <p>Traceroute returns the router for each hop in the path to the destination as well as the corresponding latency. If one of the on-prem routers is introducing the latency, I can verify that there is ample bandwidth on each connection between routers. If this looks good, I could look at upgrading routers. If the latency is the destination server, I might be out of luck. However, if the latency is being introduced by my service provider, I can look to switch ISPs. If the latency is beyond my ISP, the problem could become more difficult to resolve. I might be able to find a local content delivery network (CDN) that has the resource I need a connection to. In this case, I might be able to persuade my ISP to peer with another service provider or participate in an internet-exchange point (IX) to get a faster connection to the resource.</p> <h3 id="what-is-synthetic-monitoring">What is synthetic monitoring?</h3> <p>The utilities discussed above are fine for one-off troubleshoots; however, companies need to be more proactive and run these tests continuously. This is because they need historical information on connections that allow them to be decisive when it comes time to switch service providers or add additional transit.</p> <p>For many companies with large-scale infrastructure, synthetic monitoring is incorporated as part of the network observability strategy. These tools execute many of the same tests I shared earlier to similar types of destinations with several big differences.</p> <p>With <a href="/kentipedia/what-is-synthetic-monitoring/">synthetic monitoring</a>:</p> <ul> <li>Tests are run against the same destinations from multiple, even dozens of different locations.</li> <li>The data is often enriched with other information, such as flows, BGP routes and autonomous systems (AS). This allows NetOps teams to identify the specific IP addresses the tests need to be run against, as well as the locations where the tests need to be executed from. 
Context like this helps ensure the results from the tests best represent the user base.</li> </ul> <img src="https://images.ctfassets.net/6yom6slo28h2/5pXAQZ4MM6SoAzKA56PhCr/1057523bb60653d6d886284f3a3e1f5d/synthetics-performance-mesh-with-latency.png" style="max-width: 600px; margin-bottom: 15px;" class="image center no-shadow" alt="Synthetics performance mesh user interface showing latency" /> <div style="max-width: 600px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Example of Kentik Synthetics’ performance mesh user interface that can report latency continuously, even in complex mesh topologies</div> <p>Above we can see a full mesh of synthetic monitoring agents that are performing sub-minute tests against one another. The proximity of the testing agents to the end users targeting the tested resources was intelligently determined by collecting additional telemetry data. Latency, packet loss and jitter metrics are all collected for historical trending and alerting.</p> <h3 id="stop-putting-up-with-latency-as-the-new-outage">Stop putting up with latency as the new outage</h3> <p>To learn about synthetic monitoring, <a href="https://www.kentik.com/go/get-started-synthetic-network-monitoring/">start a free, 30-day trial</a> and see how the <a href="https://www.kentik.com/product/kentik-platform/">Kentik Network Observability Cloud</a> can help you stop accepting that latency is the new outage.</p><![CDATA[Wait, Did AS8003 Just Disappear?]]><![CDATA[There has been a new development in [The Mystery of AS8003](https://www.kentik.com/blog/the-mystery-of-as8003/). As you may recall, this was the AS number registered to a defunct company in Florida that appeared earlier this year in the global routing table announcing over 175 million IPv4 addresses belonging to the US Department of Defense. Well, that just changed.]]>https://www.kentik.com/blog/wait-did-as8003-just-disappearhttps://www.kentik.com/blog/wait-did-as8003-just-disappear<![CDATA[Doug Madory]]>Fri, 10 Sep 2021 20:00:00 GMT<p>There has been a new development in <a href="https://www.kentik.com/blog/the-mystery-of-as8003/">The Mystery of AS8003</a>.</p> <p>As you may recall, this was the AS number registered to a defunct company in Florida that appeared earlier this year in the global routing table announcing over 175 million IPv4 addresses belonging to the US Department of Defense.</p> <p>Well, that just changed.</p> <p>At 17:05 UTC (1:05pm) on Tuesday, 7 September 2021, all but one of the 765 prefixes previously announced by AS8003 were moved to a new origin, AS749 belonging to the US Department of Defense.</p> <p>However, despite the new origin (AS749) these prefixes are still routed exclusively through Hurricane Electric (AS6939) and traceroutes to the address space die at a router in Ashburn, Virginia (72.52.92.226, 100ge5-1.core2.ash1.he.net) as they did previously when AS8003 was the origin.</p> <p>Old AS paths:</p> <p>… 6939 8003</p> <p>New AS paths:</p> <p>… 6939 749</p> <p>At the beginning of the day on 7 September 2021, AS749 wasn’t announcing any routes to the global internet. One can verify this fact in either <a href="http://www.routeviews.org/routeviews/">Routeviews</a> public data, or with the RIPEstat tool’s <a href="https://stat.ripe.net/widget/routing-history#w.resource=749">Routing History view for AS749</a>. 
Curiously, Hurricane Electric’s invaluable public tool, bgp.he.net incorrectly <a href="https://bgp.he.net/AS749#_asinfo">shows AS749</a> announcing these prefixes for at least the past six months.</p> <img src="https://images.contentful.com/6yom6slo28h2/6hsXfAgDYWdkDZ2oD8TJSh/3ba32f8ddc0e5d9239dce85bb64484c8/bgp_he_net_AS749_originations.png" style="max-width: 300px; margin-bottom: 15px;" class="image center" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">bgp.he.net origin timeline for AS749</div> <p>Earlier today, the Pentagon <a href="https://www.washingtonpost.com/technology/2021/09/10/pentagon-internet-protocol-addresses-trump/">told the Washington Post</a>:</p> <p>“The Department of Defense has transitioned the advertisement, or announcement, of DoD Internet Protocol Version 4 (IPV4) addresses, previously advertised under Global Resource Systems LLC, to the DoD’s traditional operations and mature network security processes.”</p> <p>If the DoD had transitioned these routes to be announced by the “DoD’s traditional operations”, I would have expected them to be transited via AS721 which is how virtually all of the DoD IP address space is routed on the internet.</p> <p>So it would seem this DoD activity, which may be the world’s largest <a href="https://en.wikipedia.org/wiki/Honeypot_(computing)">honeypot</a>, simply swapped out AS8003 for AS749 in BGP without changing much else.</p> <p>Once again as a final note: your corporate network may be using the formerly unused DoD space internally, and if so, there is a risk you could be leaking it out to a party that is actively collecting it. How could you know? Using Kentik’s Data Explorer, you could quickly and easily view the stats of exactly how much data you’re leaking to AS8003 (now AS749). May be worth a check, and if so, <a href="https://www.kentik.com/">start a free trial of Kentik</a> to do so.</p> <p>Additionally, we’ve got a short video Tech Talk about this topic as well—see <a href="https://www.kentik.com/resources/kentik-tech-talks-how-to-use-kentik-to-check-if-youre-leaking-data-to-as8003/" title="Kentik Tech Talks, Episode 9: How to Use Kentik to Check if You&#x27;re Leaking Data to AS8003 (GRS-DoD)">Kentik Tech Talks, Episode 9: How to Use Kentik to Check if You’re Leaking Data to AS8003 (GRS-DoD)</a>.</p><![CDATA[Visibility into the cloud path of application latency]]><![CDATA[A company with thousands of remote employees connecting to the same SaaS applications will randomly experience slowness. How can we troubleshoot this sluggishness?]]>https://www.kentik.com/blog/visibility-into-the-cloud-path-of-application-latencyhttps://www.kentik.com/blog/visibility-into-the-cloud-path-of-application-latency<![CDATA[Kevin Woods, Michael Patterson]]>Fri, 10 Sep 2021 04:00:00 GMT<p>Ever find yourself sitting in your home office wondering why your connection to Google Docs or Salesforce is uncharacteristically slow? When this happens to me, I start with my troubleshooting basics. Opening a few additional browser tabs and visiting other websites such as ford.com or amazon.com is one of my favorite tactics. I click around on these sites and ask myself if they feel slowish. Sometimes I’ll even launch a different browser. If I’m using Chrome then I’ll launch Firefox and run the same basic tests. I then think about the applications I have running on my old laptop. 
I think to myself, “Hey self, maybe something is slowing down your computer.” I shut down apps and unneeded tabs in my browser to see if this helps and it usually doesn’t.</p> <p>If I don’t like the results from the above, I might launch a ping or a TCP traceroute to see if I get any packet loss or high-latency responses.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6FKdvfVyDh0FU2cipaorXt/78caf440c195074b3431c744164b0b0e/visibility-into-the-cloud-path-of-application-latency.png" style="max-width: 700px;" class="image center" alt="Visibility in the cloud path of application latency" /> <p>If the response comes back like the above, I’ll probably reboot my computer, my internet router or maybe both and hope that this fixes the problem. If, however, the command-line tests show significant latency then I know it isn’t me. “There’s nothing I can do,” I tell myself. I press on with my work and try not to let the poor connection speed to the cloud slow me down. What if I wasn’t the only one suffering from this latency and it was happening nearly every day? What if my home was actually a large, remote office building containing hundreds of others who work for the same company using the same cloud apps? Since everyone trying to do their job in the same application is suffering, the issue is much more consequential. Something has to be done!</p> <h3 id="great-netops-teams-anticipate-saas-application-latency">Great NetOps teams anticipate SaaS application latency</h3> <p>It would be silly to think, in a company with thousands of remote employees, that connections to the same SaaS applications won’t periodically be slow. With all the vendors involved in making cloud connections possible and the anticipated, random bursts in internet traffic, intermittent sluggishness is just going to happen. We know this.</p> <p>Proactive NetOps teams anticipate these problems and build out the cloud infrastructure in a way that minimizes slowness issues. To do this, it requires a non-cloudy understanding in two areas:</p> <ul> <li>Historical response-time awareness from remote locations to the targeted SaaS applications</li> <li>BGP route comprehension</li> </ul> <p>By having a baseline of historical latency from remote offices to selected SaaS applications (e.g., Google Docs, Salesforce, etc.), NetOps teams are aware of what acceptable performance patterns look like. When this data is paired with how BGP is directing traffic through the cloud, an aha-like moment can occur where the source of the latency through the cloud path becomes obvious.</p> <h3 id="digital-experience-monitoring">Digital experience monitoring</h3> <p>In order to build a baseline of end-user experience to SaaS applications, we need a <a href="https://www.kentik.com/solutions/usecase/digital-experience-monitoring/">digital experience monitoring</a> mechanism for data collection. One of the best ways to do this is through the deployment of synthetic transaction monitors (STM). STMs come in the form of <a href="https://www.kentik.com/product/global-agents/">small software agents</a> that can reside on-prem or in the cloud. They are given instructions like connect to Google Docs every second or every minute, record the latency, and send the metric off to the collection platform. These agents will also perform traceroutes similar to what I showed above and send this data off to the collection platform as well. 
The data ingested from these agents provide us with two important pieces of information:</p> <ul> <li>The connection latency to the SaaS over an extended period of time</li> <li>The path through the cloud that is taken to the SaaS over an extended period of time</li> </ul> <p>This information can answer a few questions:</p> <ol> <li> <p>When people in a remote location complain about slowness, is the on-prem synthetic transaction monitor indicating the same latency pattern? If not, the problem could be local. If yes, move on to question 2.</p> </li> <li> <p>When the latency occurred, did the path through the cloud change? How did it change? And which router or autonomous system introduced the latency?</p> </li> </ol> <p>Like any conundrum we face, once we know where the problem is being introduced, we can start the process of going about fixing it.</p> <h3 id="the-cloud-path-of-application-latency">The cloud path of application latency</h3> <p>To visualize the path and the location where the latency was introduced, we need an observability platform. Below you can see where I moused over the timeline, indicated by the orange arrow. When I do this, the hop-by-hop path updates below and indicates where latency was introduced. Notice that it happened at a time when the physical distance between the communicating systems wasn’t out of the ordinary.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1p5dH15vce3wvuPWQ8gd7s/b069a4364a6666fdf2c1ede77b97fe06/digital-experience-monitoring.png" style="max-width: 700px;" class="image center" thumbnail alt="Observability platform showing latency" /> <p>In the picture above, by mousing over the little green circles at the bottom, the IP address of the hops are listed. If you mouse over any of the red connections, it provides details on the variance of the expected latency.</p> <p>Loaded with the above information, we can investigate whether or not this is a consistent problem. If we ascertain that the issue is recurring, we can explore options that allow us to mitigate the issue (i.e., go around the problem hop). How do we do this?</p> <h3 id="how-to-reduce-cloud-application-latency">How to reduce cloud application latency</h3> <p>In order to get around the problem hop, a company has a few options. These solutions depend on the location of the SaaS in respect to the users suffering from the connection problem. NetOps could:</p> <ul> <li>Speak with the service provider and ask them to find a new path for the connections to the targeted SaaS</li> <li>Look to adjacent service providers to see if peering would get the users closer to the CDN serving up the needed content</li> <li>Explore new transit that avoids the problem hop</li> </ul> <p>Given the cost to make a change and the benefits of making it, the best course of action can be implemented. Without access to STMs, a good understanding of BGP and a network observability platform that ties it all together, identifying the trouble spot is much more difficult.</p> <p>If you would like to learn more about visibility into the cloud path of application latency and the use of synthetic transaction monitors to enhance your digital experience monitoring efforts, <a href="https://www.kentik.com/contact/">reach out to the team at Kentik</a>.</p><![CDATA[What’s next for the internet in Afghanistan?]]><![CDATA[As the last US military aircraft depart Kabul, so closes a chapter of U.S. involvement in the country. 
Will it also end a period of growth of the domestic internet in Afghanistan?]]>https://www.kentik.com/blog/whats-next-for-the-internet-in-afghanistanhttps://www.kentik.com/blog/whats-next-for-the-internet-in-afghanistan<![CDATA[Doug Madory]]>Tue, 31 Aug 2021 16:00:00 GMT<p>As the <a href="https://www.cnn.com/2021/08/30/politics/us-military-withdraws-afghanistan/index.html" title="CNN: The last US military planes have left Afghanistan, marking the end of the United States&#x27; longest war">last U.S. military aircraft</a> departed Kabul, so closed a chapter of U.S. involvement in the country. Will it also end a period of development of the domestic internet in Afghanistan?</p> <h3 id="growth-of-afghan-domestic-internet">Growth of Afghan domestic internet</h3> <p>In 2012, my former colleague and Renesys co-founder <a href="https://twitter.com/jimcowie" title="Jim Cowie on Twitter">Jim Cowie</a> wrote a blog post, <a href="https://web.archive.org/web/20121201001304/http://www.renesys.com/blog/2012/11/could-it-happen-in-your-countr.shtml" title="Renesys: Could it happen in your country?">“Could It Happen In Your Country?”</a>. Inspired by our coverage of the internet shutdowns of the Arab Spring the previous year, Jim attempted to do a back-of-the-envelope analysis that would gauge how easily a country-level shutdown could occur based on the number of unique international links that connected a country to the internet. The supposition being that more international links makes a country more “resistant” to a complete disconnection.</p> <p>In the analysis, the United States and most of Europe were rated as resistant, while countries with a single state telecom controlling international connectivity, like Cuba, North Korea and Syria, were rated as “severe risk.” Perhaps one of the more counter-intuitive ratings from that analysis was Afghanistan’s classification as “low risk.” The fragmented nature of the country’s internet, driven by its austere mountainous geography, meant that shutting down service in Afghanistan would require more than a single kill switch.</p> <p>The diagram below illustrates the topology of the internet of Afghanistan based on BGP data from 10 years ago. Six ASNs represented its domestic internet with international bandwidth coming from either satellite providers or its neighboring countries, Pakistan, Tajikistan, Uzbekistan and Iran. Service from the latter two countries <a href="https://web.archive.org/web/20150704005510/http://research.dyn.com/2010/09/iran-exporting-the-internet-pa/" title="Renesys: Iran - Exporting the Internet">began in the spring of 2010</a>. (Note the satellite providers shown without arrows originated IP space geolocated to Afghanistan.)</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5E8FyZSUBsVCa0oWeEZglB/76239af7063e31a635f93380e462078f/Afghanistan-Internet-2011.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Afghanistan Internet, 2011</div> <p>The geographic diversity of Afghanistan’s international gateways was born of necessity. Provincial population centers often had closer ties with neighboring countries than with the nation’s capital. For example, Herat in western Afghanistan receives services such as electricity and internet service from neighboring Iran.</p> <p>Fast forward 10 years and note the internet in Afghanistan has grown substantially - from six domestic ASNs to 40! 
AFTEL (AS55330) still gets connectivity from neighboring countries and still serves a central role operating the country’s fiber backbone, such that it is. Most Afghans access the internet through mobile handsets using wireless carriers such as MTN Afghanistan (AS132471), Etisalat Afghanistan (AS131284), Afghan Wireless (AS38742) and Roshan (AS45178).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5eCKP1eqJdoG0X7xb22I85/9816555cfd554ac508b039b6014acdd4/Afghanistan-Internet-2021.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Afghanistan Internet, 2021</div> <p>In the above diagram, international providers without a clear country geographic route are grouped in a cluster called “Unknown.” Despite this uncertainty, Rostelecom of Russia and China Telecom very likely provide service via northern routes while many others likely provide service via Pakistan.</p> <h3 id="risks-of-disruption">Risks of disruption</h3> <p>Despite recent events in Afghanistan, we have yet to see any national internet disruption. However, the risks to the country’s domestic internet sector remain stark. One of the catalysts of Afghanistan internet growth has been the development of a fiber optic backbone depicted in the diagram below from a <a href="https://www.unescap.org/sites/default/files/Presentation%20by%20MCIT%20On%20fiber%20connectivity%20in%20Afghanistan.pdf" title="UNESCAP Presentation on Afghanistan (PDF)">UNESCAP presentation</a> from 2015. This fiber backbone is a vulnerable piece of physical infrastructure that would be difficult to repair if it were damaged due to sabotage or a natural event.</p> <img src="https://images.contentful.com/6yom6slo28h2/2YRCeX4oQp8ijgu8hBlF1T/5a6e2d94292e7a95a4fdd30631f607ad/AF_Fiber_Ring.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Afghan Fiber Ring</div> <p>Another risk comes from the state of Afghanistan’s <a href="https://www.ft.com/content/504db1d2-239f-4388-a4db-4e9cd155d312" title="FT.com: Afghanistan confronts economic meltdown after Taliban takeover">crumbling economy</a>. New international restrictions on the ability of the country’s new government to move money in and out of the country will hamper its ability to pay for international bandwidth, source replacement parts or hire technical specialists to maintain all the moving parts of a national internet. However, Syria’s state telecom has been the <a href="https://www.latimes.com/business/la-xpm-2012-apr-23-la-fi-obama-tech-sanctions-20120424-story.html" title="LA Times: U.S. puts sanctions on telecom firms in Syria, Iran">subject of sanctions</a> for years but has managed to keep its services running.</p> <p>MTN had already said <a href="https://www.reuters.com/article/mtn-group-results/update-3-south-africas-mtn-to-exit-middle-east-to-focus-on-africa-idUSL8N2F813B" title="Reuters: UPDATE 3-South Africa&#x27;s MTN to exit Middle East to focus on Africa">last year</a> that they were planning to depart Afghanistan (along with the markets of Iran and Yemen) to focus on their business in Africa. 
It would not be surprising to see one or more foreign operators make the business decision to cease operations—a shutdown based on simple economics.</p> <p>Finally, the threat of censorship and communications blackout are a very real possibility given the track record of the new Afghan government. As we saw in the case of Myanmar earlier this year, government shutdowns don’t require a <a href="https://dl.acm.org/doi/10.1145/3473604.3474562" title="ACM: A multi-perspective view of Internet censorship in Myanmar">centralized kill switch</a>, despite a “low risk” rating from Jim’s 2011 analysis.</p> <h3 id="conclusion">Conclusion</h3> <p>The internet of Afghanistan is in a precarious state with the growth of the past decade at risk. Could the <a href="https://www.cfr.org/article/pakistans-support-taliban-what-know" title="Council on Foreign Relations: Pakistan’s Support for the Taliban: What to Know">alignment</a> of Pakistan’s Prime Minister Imran Khan and the new Afghan government herald a greater role for Pakistan in connecting Afghanistan? Or will we watch a slow decline of internet service marked by shutdowns and increased censorship? We will be watching the situation closely.</p> <p><strong>In Memoriam: <a href="https://twitter.com/DougMadory/status/1426261993988755465">Marc Lipton</a></strong></p> <p>Marc was one of the people I used to turn to when trying to understand the telecom sectors of Iraq and Afghanistan. After working 30 years as a lawyer for AT&#x26;T handling regulatory issues, Marc retired and took a job advising the government of Iraq on building the legal structure to foster a healthy telecom sector. After several years there, he moved on to work in a similar role for the government of Afghanistan.</p> <p>While working in Afghanistan, his hotel in Kabul had been bombed - thankfully he was away at the time. I told him that I hoped he was paid well for the risks he was taking. He replied that actually he wasn’t paid much. “Why do it then?,” I asked. He replied that he wanted to make things better in Afghanistan so that his grandchildren didn’t end up getting sent to fight there.</p> <p>Marc <a href="https://www.legacy.com/us/obituaries/chicagotribune/name/marc-lipton-obituary?id=2797834">died of a heart attack</a> in 2018 at age 66. Rest in peace, Marc.</p><![CDATA[Network observability: Hype or reality?]]><![CDATA[If you haven’t heard of network observability, you soon will, and you’ll be hearing it a lot. Some say it is just marketing hype and that networks have always been observable. This post will explore why that’s not the case.]]>https://www.kentik.com/blog/network-observability-hype-or-realityhttps://www.kentik.com/blog/network-observability-hype-or-reality<![CDATA[Kevin Woods]]>Tue, 31 Aug 2021 04:00:00 GMT<p>If you haven’t yet heard the term “network observability,” you will be hearing it soon. And I predict you’ll be hearing it a lot. Some say that network observability is just marketing hype from vendors. They say, “networks have always been observable, so there’s nothing new here.” I say network observability is not just vendor hype, and this blog will make the case.</p> <h3 id="what-is-observability">What is observability?</h3> <p>Let’s back up for a minute and talk about what “observability” means. The concept of observability has taken hold in the DevOps, SRE and application performance monitoring (APM) space. 
Thought leaders, especially <a href="https://www.honeycomb.io/">Honeycomb</a>, do a great job teaching the industry why observability as a concept is unique and important. The term has a literal engineering definition, that, in a nutshell, means the internal state of any system is knowable solely by external observation. Such a system is said to be “observable.”</p> <p>In networking, this would mean that you can understand what’s going on in your network by interpreting the network’s telemetry data. In the practice, that would mean answering questions such as: What is causing a drop in traffic? Why is my bandwidth bill so high? And what configuration change caused this behavior?</p> <h3 id="but-weve-always-been-able-to-answer-these-questions-right">But we’ve always been able to answer these questions, right?</h3> <p>Observability is not a binary attribute. There are degrees to which a network can be observable. Indeed, classic NPM tools have allowed some investigation of problems. And with yesterday’s relatively simple fixed-configuration networks, it was possible to explore and sometimes find the cause of unexpected problems — some degree of network observability.</p> <p>But the critical point here is the trend toward <a href="https://www.kentik.com/solutions/usecase/clouds-and-hybrid/">cloud networking</a>, and related trends such as SD-WAN have changed the game. There has been a significant loss of observability for the network, and the classic NPM (<a href="https://www.kentik.com/kentipedia/network-performance-monitoring/">network performance monitoring</a>) tools have not kept up. Here are some of the problems of classic NPM:</p> <ul> <li>They can’t handle cloud-scale. It’s typical for a cloud customer to produce many terabytes of VPC flow logs per month. A SaaS-based solution is required to achieve this scale.</li> <li>Port numbers and IP addresses are less useful in traffic analytics.</li> <li>Maps and data records built by NPM software can be highly inaccurate or completely wrong.</li> <li>Traditional NPM tools do not include the metadata on network security and routing policy, cloud/container orchestration information, critical to understanding how container instances are networked.</li> <li>The value of packet-capture technology diminishes significantly in the cloud.</li> </ul> <p>So, at a minimum, networks are less observable than they used to be.</p> <h3 id="what-has-changed">What has changed?</h3> <p>Let’s face it, the game for networking has changed. It has changed in two major ways:</p> <ol> <li> <p><strong>Cloud changes everything</strong>. Cloud networking is not simply a re-implementation of an existing <a href="https://www.kentik.com/kentipedia/network-architecture/" title="Network Architecture Explained: Understanding the Basics of Modern Networks">network architecture</a> within a cloud provider’s domain. Cloud-native applications are fundamentally different, and cloud computing creates new and unexpected challenges for networking. Commonly cited problems are a loss of visibility or understanding of the <a href="https://www.kentik.com/kentipedia/what-is-network-topology/" title="Kentipedia: What is Network Topology?">network topology</a>, loss of control over network policies (because developers can now create network constructs on their own), and new networking tools from the cloud providers that are often siloed and shallow in features.</p> </li> <li> <p><strong>Networking is intersecting the APM domain</strong>. 
Network practitioners have always known this, but the fact that the network plays a major role in application performance has become apparent to application developers in the last couple of years. And the vendor community has responded. Cisco/AppDymanics acquired ThousandEyes, Splunk acquired Flowmill, and <a href="https://www.kentik.com/press-releases/kentik-new-relic-expand-partnership-unified-application-network-observability/">New Relic has partnered with Kentik</a> as its network observability solution. Datadog, Dynatrace and other market leaders have added significant network observability capabilities to their platforms.</p> </li> </ol> <h3 id="we-need-to-reframe-the-networking-problemsolution">We need to reframe the networking problem/solution</h3> <div class="pullquote right">Network observability is the right idea at the right time.</div> <p>So, now that application developers, SREs, DevOps and cloud infrastructure engineers all have a growing interest in networking, how are we networking veterans going to help and collaborate with our new “network-very-interested” co-workers? Pull out our 15-year-old textbooks on NPM? Teach them how SNMP works? Give them a training class on packet capture and analysis?</p> <p>Maybe, but I wouldn’t try that. A lot of the underlying networking technology has not changed, except in the cloud where the physical and data link layers are totally different. I don’t want to disrespect anything that we’ve done in networking in the last two decades, but in digitally-evolved organizations, networking needs a new face. Networking needs a new context - and network observability is the right idea at the right time.</p> <h3 id="networking-in-an-application-context">Networking in an application context</h3> <p>The most important context for network observability is application performance and the digital experience of users. And since observability tools are used to measure and diagnose problems with application performance, network observability makes perfect sense as the moniker for the networking part of the discipline.</p> <p>As networkers, there are steps we can take to increase the importance of network observability as a critical part of the picture. One step is to always talk about network issues in the context of application performance. For example:</p> <ul> <li>Network latency will impact users’ digital experience and can have negative consequences, such as users closing a service or dumping their shopping cart.</li> <li>Network jitter can hurt or even disable the use of audio or video applications, for example, on a Zoom or WebEx conference call.</li> <li>Network security breaches, such as unauthorized access, put user data and confidential information at risk.</li> <li>Network-borne attacks, such as botnets, can paralyze the users’ application performance and response times.</li> </ul> <h3 id="how-does-kentik-use-network-observability">How does Kentik use network observability?</h3> <p>For Kentik, network observability is a theme that aligns with our current solution and our product roadmap. However, just as importantly, network observability is a reframing of the problem that better aligns with the new challenges seen, particularly in cloud networking. And, observability as a concept resonates with the way application, DevOps and SRE teams see their challenges.</p> <p>As a business, it is important for us to help companies understand what problems we solve, how those problems are impactful to their business and how Kentik’s solution can help. 
To us, network observability is not a marketing gimmick or just a new label on network monitoring. It is representative of a change in the way networking is understood as a part of the application and the lens through which modern infrastructure and cloud teams see planning, running and fixing the network.</p> <p>At Kentik, we won’t change our minds about wanting to better explain our solution and have that better resonate with customers. And…if you want to call this good marketing…thanks for the compliment!</p><![CDATA[Is it a network or application problem? The answer is here]]><![CDATA[With network observability from Kentik in New Relic One, you can correlate and analyze all of your telemetry data in one place: your applications, infrastructure, digital experience, and network data.]]>https://www.kentik.com/blog/is-it-a-network-or-application-problem-the-answer-is-herehttps://www.kentik.com/blog/is-it-a-network-or-application-problem-the-answer-is-here<![CDATA[Nick Stinemates]]>Wed, 25 Aug 2021 04:00:00 GMT<p>At FutureStack in May <a href="https://www.kentik.com/blog/unifying-application-and-network-observability-with-new-relic/">we announced</a> our early access program for a commercial partnership with New Relic. Today we are excited to announce general availability. Network observability is now available to all free and paying users of New Relic One. Starting today, all users have the ability, out-of-the-box, to add network context to their traditional application monitoring environments.</p> <p>Since the early access program launched, we’ve seen customers deploy this solution to combine their network data with their other observability data and answer the important question: “Is it a network or an application problem?”</p> <p>Together with New Relic, we’re helping network and development teams quickly identify and troubleshoot application performance issues correlated with network traffic, performance and health data. Ultimately, this makes services more reliable.</p> <p>As Buddy Brewer, New Relic’s group VP of strategic partnerships, notes in his <a href="https://newrelic.com/blog/nerd-life/network-performance-monitoring">latest blog post</a>:</p> <blockquote>“Now with NPM [from Kentik] in New Relic One, you can correlate and analyze all your telemetry data in one place—your applications, infrastructure, digital experience, and network data. This way, you can engage the right team at the right layer of the stack faster than ever before. And when it is the network, you can provide network engineering teams with proper context for faster resolution.”</blockquote> <p>Take a look at <a href="https://www.kentik.com/go/newrelic/">kentik.com/newrelic</a> for more information and to try it for yourself today.</p><![CDATA[Announcing Kentik Labs]]><![CDATA[Today we announce the launch of Kentik Labs, our new hub for the developer, DevOps and SRE community. With the tools we’re open sourcing, you’ll be able to observe key network telemetry for a competitive advantage. ]]>https://www.kentik.com/blog/announcing-kentik-labshttps://www.kentik.com/blog/announcing-kentik-labs<![CDATA[Nick Stinemates]]>Thu, 19 Aug 2021 04:00:00 GMT<img src="https://images.ctfassets.net/6yom6slo28h2/288QdDFrn2X2RRbgav8xmv/d4e9d010301b8d4a0d15e3fbd19d5326/kentik-labs.png" style="max-width: 170px; padding-top: 20px" class="image right no-shadow" alt="Kentik Labs" /> <p>Distributed applications, by nature, rely on the network to function. 
As applications go from single-host, single software stack to multi-host, heterogeneous environments, being able to observe key network telemetry becomes a competitive advantage. That’s why we’re announcing the launch of <a href="https://kentiklabs.com/">Kentik Labs</a>.</p> <p>Using the tools we’re open sourcing today, you can generate network metrics using our agents, like <a href="https://github.com/kentik/convis">convis</a> (eBPF) or <a href="https://github.com/kentik/kprobe">kprobe</a> (packet capture), and convert them to a common format using <a href="https://github.com/kentik/ktranslate">kTranslate</a>. These metrics can then be stored and leveraged in the observability tools you already have deployed, including New Relic, Kafka, Influx and Prometheus. From there, you can use your favorite visualization tool like Grafana or the InfluxDB UI.</p> <p><a href="https://github.com/kentik/convis">Convis</a> (container visualization) is a small eBPF and Rust tool showing how to use eBPF to track TCP connections on a Linux host. It’s small enough to get into as a tutorial but also provides useful data about who every process on your system is talking to. Watch our <a href="https://ebpf-summit-2021.sessionize.com/session/276419">eBPF Summit video</a> where we explore how to output network traffic statistics to JSON.</p> <p>This is just the beginning - we’re looking to work with the community to expand the different types of telemetry that kTranslate accepts and different backends that it supports, including <a href="https://opentelemetry.io/">OpenTelemetry</a>.</p> <p>Curious? Come kick the tires. Check out the quickstart guides for listening to <a href="https://github.com/kentik/ktranslate/wiki/SNMP-Quickstart">SNMP</a> and <a href="https://github.com/kentik/ktranslate/wiki/New-Relic-Flow-Collection-Quickstart">NetFlow</a>. These will get you running collecting passive information about how devices on your network are doing. Then go further and create some alerting around things like when your NAS disk is getting full or a non-white listed IP is sending data from inside your house.</p> <p>If you want to get started with eBPF, you can build and run convis with:</p> <p><code>cargo build —release <br /> sudo target/release/convis -v</code></p> <p>…And then watch the world of network connections fly by.</p> <p><a href="mailto:[email protected]">We want to talk to you</a>. Come hack with us! We’re looking for people interested in Go, Rust and C (eBPF) languages. Code will all be open sourced and we’re working to shape how the next generation of networks are run. Learn more at <a href="https://kentiklabs.com/">kentiklabs.com</a>.</p><![CDATA[Channel Partner Spotlight: Edgeworx Solutions Inc.]]><![CDATA[Edgeworx is one of Kentik's largest and most active channel partners, specialized in the Canadian region. In this Q&A, we highlight a bit about who they are and what they do.]]>https://www.kentik.com/blog/channel-partner-spotlight-edgeworx-solutions-inchttps://www.kentik.com/blog/channel-partner-spotlight-edgeworx-solutions-inc<![CDATA[Jim Frey]]>Wed, 18 Aug 2021 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/5bHRfMO8G5Weclws4nT8un/e4649d2dcc3e5d1cb03681f97ef942bb/edgeworx.png" style="max-width: 180px;" class="image right no-shadow" alt="Edgeworx" /> <p>Based in Ontario, Canada, Edgeworx Solutions Inc. is one of Kentik’s longest-standing and most active channel partners. 
Their team has extensive relationships with some of the biggest companies across Canada and also has an actively growing client base in the US as well. This partnership helps ensure growing visibility for Kentik across multiple markets and allows our shared customers to maximize the value of the <a href="https://www.kentik.com/product/kentik-platform/">Kentik Network Observability Platform</a>.</p> <p>We reached out to the Edgeworx team recently with this Q&#x26;A to highlight a bit about who they are and what they do.</p> <div style="border-top: 1px solid #d5dee0; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>Who is Edgeworx?</strong></p> <p>Edgeworx is a systems integrator and value-added reseller (VAR) helping customers drive digital transformation, providing industry-leading services and solutions. Edgeworx has more than 170 customers across Canada and the United States, including leading telecommunication service providers, several of the largest financial institutions in North America and many leading global enterprises.</p> <ul> <li> <p><strong>Kentik and Edgeworx</strong>: We have been offering joint solutions with Kentik for over four years, solving business issues associated with network observability, application performance, scalability, network planning and business transformation to the cloud.</p> </li> <li> <p><strong>New Technologies</strong>: Edgeworx Solutions prides itself in bringing forward expert tools that are simple and provide high value results. With a focus on visibility, synthetic testing, and network analytics Edgeworx is able to provide various solutions to their clients that will also enhance their business observability.</p> </li> <li> <p><strong>The Edgeworx Mission</strong>: Our mission is to assist our customers in maximizing the ROI in infrastructure and security investments. This allows our customers to avoid significant investments while providing a clear path to grow and develop their services.</p> </li> </ul> <p><strong>What markets do you serve?</strong></p> <p>We serve a variety of industry verticals from large financial institutions to leading service providers and enterprise markets across North America.</p> <p><strong>What services do you provide?</strong></p> <ul> <li>Synthetic testing</li> <li>Network analytics</li> <li>Visibility as a service</li> <li>Data protection solutions</li> <li>Disaster recovery solutions</li> <li>Networking, connectivity and infrastructure</li> <li>Security assessments</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/2EsYxBhEK0EZZ1qdJYNpWp/f74bfdd74d81ee5724a9f3efd3b3a181/edgeworx-menu.png" style="max-width: 800px;" class="image center" alt="Edgeworx" thumbnail /> <p><strong>Want to know more?</strong></p> <p>For more information on Edgeworx, check out their site at <a href="http://www.edge-worx.com/">www.edge-worx.com</a>. You can also learn more about <a href="https://www.kentik.com/partners/channel-partners/">Kentik’s Channel Partner Program</a>.</p><![CDATA[Network Observability Can Enable Hybrid Multi-Cloud Network Operations]]><![CDATA[Network observability is a concise way to describe how NPM solutions are evolving to support IT organizations that are embracing DevOps and the cloud. 
In this post, EMA Analyst Shamus McGillicuddy takes a deeper dive into network observability.]]>https://www.kentik.com/blog/network-observability-can-enable-hybrid-multi-cloud-network-operationshttps://www.kentik.com/blog/network-observability-can-enable-hybrid-multi-cloud-network-operations<![CDATA[Shamus McGillicuddy]]>Wed, 11 Aug 2021 04:00:00 GMT<p>Network operations teams that are struggling to align with a DevOps-centric, multi-cloud future should investigate how the concept of network observability can help them.</p> <p>NetOps professionals will find that many of their vendor partners are starting to talk about network observability as an evolution and advancement beyond traditional network performance management.</p> <p>DevOps devotees are familiar with the term “observability.” They will tell you it refers to the ability to collect, aggregate and analyze metrics, logs and traces to understand the health and performance of applications that are distributed across multiple clouds and data centers. NetOps teams that are collaborating with DevOps will find that their own tooling requirements are moving in a similar direction.</p> <h3 id="defining-network-observability">Defining Network Observability</h3> <div class="pullquote right">Network observability is a concise way to describe how NPM solutions are evolving to support IT organizations that are embracing DevOps and the cloud.</div> <p>What is network observability? First, think of it as the convergence of network performance management and application performance management. In a 2019 survey of NetOps pros, EMA found that 28% had integrated their NPM tools with their company’s APM tools. This convergence isn’t necessarily new. NPM vendors and APM vendors were talking about this convergence a decade ago. Many NPM vendors started trying to differentiate their products by rebranding them as application-aware network performance management. One can think of network observability as a mature approach to this convergence. For example, Kentik has <a href="https://www.kentik.com/go/newrelic/">integrated its NPM solution with APM leader New Relic</a>. This allows DevOps and NetOps teams to correlate performance across network infrastructure and applications.</p> <p>But network observability also involves the evolution and expansion of NPM solutions. For instance, NPM has long focused on analyzing device metrics and traffic. However, many NPM vendors are expanding them to address critical blind spots. Active monitoring via synthetic traffic is a great example. Synthetic traffic can reveal network and application performance in places where passive traffic monitoring isn’t possible or practical. For example, an active tool can send synthetic flows to a SaaS application to measure reachability and availability of the application. Active monitoring can expand the visibility and intelligence of a traditional NPM solution, advancing it toward true network observability.</p> <p>Network observability is also a marketing term that can be helpful to NetOps teams. DevOps professionals are familiar with the concept of observability. Armed with a network observability solution, NetOps can bridge the gaps that exist between them and DevOps and start to talk about building a unified toolset for network and application operations. This gap in understanding is a real problem that exists in many enterprises. A network architect with a large retailer once told me: “Part of the challenge has been political. 
The cloud team isn’t interested in paying for anything else. They want to use cloud-native tools, and they’re not concerned about IT operations. They’re engineering-focused.”</p> <h3 id="are-you-ready-for-network-observability">Are You Ready for Network Observability?</h3> <p>Network observability is not for every company. EMA speaks to plenty of NetOps managers who say their area of responsibility doesn’t rise past Layer 4 on the OSI stack. For instance, a network engineer for a large government agency once told me: “We asked ourselves as a network team, are we in the business of troubleshooting application transactions and sessions, or do we terminate our responsibility at the TCP layer? We decided that as long as the TCP handshake is done in a reasonable timeframe, then we are done. Everything else is an application issue. That has to do with how our IT organization works.”</p> <p>If your IT organization takes this siloed view of operations, perhaps network observability isn’t relevant to you. On the other hand, maybe you see DevOps and hybrid multi-cloud lurking on the horizon, and you know that things need to change. If that’s the case, <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/">study up on what network observability can do</a>.</p> <p>Other NetOps teams will quickly recognize the value of a network observability solution that provides unified insight into application and network performance. As a director of network operations at a large financial services company said: “My mindset is, I don’t manage the network. I manage the applications that run on it. I need to look down the stack into the network itself. We take a top-down approach, rather than a bottom-up approach. When I have a network with 4,000 ports going down, I don’t care about them unless there is traffic running on them. If something is broken but not affecting anything, it’s not important.”</p> <p>This operations director would see the value of a solution that can correlate application performance and network performance. In such an organization, network observability is a potential path toward operational excellence.</p> <p>Network observability is a concise way to describe how NPM solutions are evolving to support IT organizations that are embracing DevOps and the cloud. If you have a requirement to support such a transition, ask your vendors to share their vision of how they are aligning with this concept.</p><![CDATA[Concerns Grow As AFRINIC’s Funds Are Frozen Over IPv4 Dispute]]><![CDATA[Africa’s regional internet registry, AFRINIC, is involved in a legal dispute over an increasingly valuable commodity: IPv4 address space. Kentik’s Doug Madory takes a deeper look at what’s happening. ]]>https://www.kentik.com/blog/concerns-grow-as-afrinics-funds-are-frozen-over-ipv4-disputehttps://www.kentik.com/blog/concerns-grow-as-afrinics-funds-are-frozen-over-ipv4-dispute<![CDATA[Doug Madory]]>Fri, 06 Aug 2021 04:00:00 GMT<p>An evolving legal dispute involving AFRINIC, Africa’s regional internet registry, has led to the <a href="https://mybroadband.co.za/news/internet/407770-afrinic-bank-accounts-frozen-after-r740-million-damages-claim.html">freezing</a> of the organization’s bank account and <a href="https://www.internetsociety.org/news/statements/2021/internet-society-statement-on-potential-destabilization-of-the-internet-in-africa/">concerns</a> about the stability of the continent’s internet operations. 
How did we get here?</p> <h3 id="what-are-the-rirs">What Are the RIRs?</h3> <p>The global internet is administered by five <a href="https://www.apnic.net/about-apnic/organization/history-of-apnic/history-of-the-regional-internet-registries/">regional internet registries</a> (RIRs): APNIC, ARIN, RIPE NCC, LACNIC and AFRINIC. These registries are nonprofit companies charged with managing the distribution and assignment of the IP addresses and AS numbers allocated to their respective geographic regions as illustrated in the graphic below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/23whA2PWH7RvIRhcqh3V0o/4947907d001f659d72ff69691298bcd7/RIR_Regions.png" alt="regional internet registries" style="max-width: 500px;" class="image center" /> <p>South African tech publication MyBroadband <a href="https://mybroadband.co.za/news/internet/407770-afrinic-bank-accounts-frozen-after-r740-million-damages-claim.html">explained</a> it this way, “If IP addresses are internet real estate, then AFRINIC’s Whois database is like the deeds office for all of Africa and the Indian Ocean region.”</p> <p>Last month, the internet “deeds office” of Africa attempted to reclaim <a href="https://lists.afrinic.net/pipermail/community-discuss/2021-July/004122.html">IPv4 addresses</a> that AFRINIC had determined to be inappropriately allocated. The company with the disputed addresses filed a claim for damages against AFRINIC and a court in Mauritius (where AFRINIC is based) <a href="https://defimedia.info/litige-commercial-afrinic-en-difficulte-internet-menace">ordered the freezing</a> of the registry’s bank account up to $50 million.</p> <p>In response, AFRINIC CEO Eddy Kayihura published <a href="https://www.youtube.com/watch?v=VmJNnVS-lo4">this video</a> on the organization’s YouTube channel assuring the continued operation of the registry despite the unprecedented legal situation.</p> <h3 id="ipv4-an-increasingly-valuable-commodity">IPv4: An Increasingly Valuable Commodity</h3> <p>In 2015, I authored one of the <a href="https://blogs.oracle.com/internetintelligence/ipv4-address-market-takes-off-v3">first analyses</a> of the new private market for IPv4 address space. As the available supply of IPv4 addresses became exhausted and IPv6 adoption lagged, the demand for IPv4 naturally increased. Markets were established and what had once simply been a network configuration setting was now an increasingly valuable virtual commodity.</p> <p>The dramatic increase in value led to a burgeoning industry of <a href="https://www.ripe.net/manage-ips-and-asns/resource-transfers-and-mergers/brokers">IPv4 brokers</a> as well as those who would attempt to obtain IPv4 directly from registries through duplicitous means.</p> <p>In 2019, the Department of Justice <a href="https://www.justice.gov/usao-sc/pr/charleston-man-and-business-indicted-federal-court-over-9m-fraud">indicted</a> a man in South Carolina for fraudulently obtaining over 700,000 IPv4 addresses from North American registry ARIN estimated to be worth between “$9,850,880 and $14,397,440.”</p> <p>In a <a href="https://www.youtube.com/watch?v=rYLEWCCjpao&#x26;t=235s">Lightning Talk</a> at LACNIC in 2018, <a href="https://twitter.com/etiennesharp">Etienne Sharp</a> blamed “ghost companies” for spiriting into Latin America, setting up virtual offices to obtain precious IPv4 address space at little cost, and then proceeding to use the address space outside the LACNIC region. 
Russian security company DDoS-Guard (aka Dancom) was <a href="https://twitter.com/DougMadory/status/1351686933248933894">one of those ghost companies</a> in Etienne’s presentation, and earlier this year, <a href="https://krebsonsecurity.com/2021/01/ddos-guard-to-forfeit-internet-space-occupied-by-parler/">LACNIC revoked its right</a> to use the address space it obtained from Belize.</p> <p>And finally, in 2019, internet researcher Ron Guilmette’s analysis uncovered the “<a href="https://krebsonsecurity.com/2019/12/the-great-50m-african-ip-address-heist/">Great $50M African IP Address Heist</a>,” in which an AFRINIC employee had been selling large amounts of the region’s IPv4 address space for personal profit. It was also Ron’s analysis that contributed to LACNIC’s decision to revoke DDoS-Guard’s address space.</p> <p>Since that “heist,” AFRINIC had a <a href="https://afrinic.net/2019-10-25-afrinic-appoints-a-new-ceo">change in leadership</a> and has been working to tighten up its controls around IPv4 allocation. It was this work that led it to <a href="https://lists.afrinic.net/pipermail/community-discuss/2021-July/004122.html">revoke the address space</a> of Cloud Innovation Ltd. That Cloud Innovation’s IPv4 space is used outside the AFRINIC region is clear and verifiable — <a href="https://bgp.he.net/AS2386#_prefixes">several</a> <a href="https://bgp.he.net/AS7018#_prefixes">major</a> <a href="https://bgp.he.net/AS22773#_prefixes">U.S. telecoms</a> announce the space on behalf of their U.S. customers. The case hinges on whether the allocation of the address space violated AFRINIC policies as they were written years ago.</p> <h3 id="statements-of-concern">Statements of Concern</h3> <p>The Internet Society issued a <a href="https://www.internetsociety.org/news/statements/2021/internet-society-statement-on-potential-destabilization-of-the-internet-in-africa/">statement this week</a> on the AFRINIC situation expressing its concern that “any interruption in AFRINIC’s operations due to ongoing litigation could have a significant negative impact on not only the stability of the African internet registry, but also on the billions of people who use the Internet worldwide.”</p> <p>The Internet Service Providers’ Association of South Africa put out its <a href="https://tech.africa/ispa-statement-afrinic-legal/">own statement</a>, adding that “a stable, reliable, and efficient regional internet registry that engages with the community in an honest and transparent manner is in the best interests of all African internet users. ISPA hopes that any litigation that threatens this can be swiftly resolved, so that AFRINIC can continue to provide services to its members, internet resource holders and the broader community.”</p> <p>There is more at stake beyond the particulars of the IPv4 address space in question. Specifically, can AFRINIC enforce its tighter controls governing IP address distribution for Africa and the Indian Ocean without risking a court order to freeze its operating budget? The fate of AFRINIC and the administration of the internet in Africa are now in the hands of the Mauritian courts. Stay tuned.</p><![CDATA[Hybrid vs. Multi-cloud: The Good, the Bad and the Network Observability Needed]]><![CDATA[How do you pinpoint latency problems between systems in a hybrid or multi-cloud environment? It requires insight into the complete path, end-to-end, hop-by-hop.
]]>https://www.kentik.com/blog/hybrid-vs-multi-cloud-the-good-the-bad-and-the-network-observability-neededhttps://www.kentik.com/blog/hybrid-vs-multi-cloud-the-good-the-bad-and-the-network-observability-needed<![CDATA[Kevin Woods, Michael Patterson]]>Tue, 03 Aug 2021 04:00:00 GMT<p>Understanding the difference between hybrid cloud and multi-cloud is pretty simple. Though, if you’re still a bit cloudy on the topic, I’ll use just a few words to clear it up. Below is a hypothetical company with its data center in the center of the building. The public clouds (representing Google, AWS, IBM, Azure, Alibaba and Oracle) are all readily available. Outlined in light blue is the hybrid cloud which includes the on-premises network, as well as the virtual public cloud (VPC) in the AWS public cloud.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7ph8t4lKQxDY4G3hwgE0Kp/29c005fb8b272589da652e271d67a24e/hybrid-cloud-vs-multi-cloud.png" style="max-width: 600px;" class="image center no-shadow" alt="Hybrid Cloud vs. Multi-cloud" /> <p>If a company has two or more VPCs, either in the same cloud or in different clouds, this is considered a multi-cloud, as outlined above in green.</p> <h3 id="hybrid-cloud-benefits">Hybrid Cloud Benefits</h3> <p>Some of the biggest benefits when adopting a hybrid-cloud configuration are:</p> <ol> <li> <p>Applications in the cloud often have greater redundancy and elasticity. This allows DevOps teams to configure the application to increase or decrease the amount of system capacity, like CPU, storage, memory and input/output bandwidth, all on-demand.</p> </li> <li> <p>Moving to the cloud can also increase performance. Many companies find it is frequently CAPEX-prohibitive to reach the same performance objectives offered by the cloud by hosting the application on-premises.</p> </li> </ol> <h3 id="multi-cloud-benefits">Multi-cloud Benefits</h3> <p>Companies take advantage of <a href="https://www.kentik.com/kentipedia/multicloud-networking/" title="Kentipedia: Multicloud Networking">multiple clouds</a> for a few reasons:</p> <ul> <li>Different cloud providers are better at different services. For example, some DevOps teams feel that AWS is more ideal for infrastructure services such as DNS services and load balancing. Google, on the other hand, might be a better platform for machine-learning computations.</li> <li>CAPEX fees and proximity to end-users can also be a factor.</li> <li>Organizations may intentionally use multiple cloud providers to mitigate risk, in the event that one provider has a major outage.</li> </ul> <h3 id="vpcs-and-security">VPCs and Security</h3> <p>Cloud does not equal internet. In both hybrid and multi-cloud configurations, all of the customer data stays private and cannot be accessed via the internet unless the network team chooses to do so. Below you can see how easy it is in AWS to select a VPC and then click a button to “Create internet gateway” in order to grant internet access.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1QRHedAd9jNKmnzp9o1qzu/dd5c3c8b7a2aaabb9b26ad9aeb8ab708/aws-vpc-internet-gateway.png" style="max-width: 600px;" class="image center" alt="AWS VPC Internet Gateway" /> <p>If a customer-facing application is moved to the cloud in order to improve performance, a direct internet connection using a gateway can be put in place. This conserves bandwidth on the corporate internet connection.</p> <p>It is, however, sometimes necessary to set up a connection between VPCs in order to allow workloads to communicate. 
For example, perhaps the systems in AWS need to send data to the VPC in Google Cloud in order to perform AI operations. When this connection is provisioned, can we just trust the connection? How can we be sure that there is ample bandwidth? This connection involves networks that we have no control over and no network traffic observability. There are, however, ways to monitor the performance.</p> <h3 id="but-cloud-adoption-can-introduce-problems">But Cloud Adoption Can Introduce Problems</h3> <p>The adoption of these new cloud infrastructures has helped many companies improve the availability and response times of their applications for end-users. Cloud providers have made it easy to configure infrastructure-as-a-service, including the network constructs. Application developers can easily change network configurations. This ease can create problems such as unintentionally routing traffic to the internet, introducing unnecessary risks, costs and performance reductions. And sometimes, the application developers don’t involve the network teams until there is a problem. When the network team investigates, they may see all types of problems such as ingress and egress points with no security policies, internal communications routed over internet gateways, abandoned gateways, abandoned subnets with overlapping IP address space, or VPC peering connections with asymmetric routing policies.</p> <div class="pullquote right" style="max-width: 260px;">The move to cloud has impaired the ability of legacy traffic monitoring tools to troubleshoot latency problems between systems.</div> <p>At the same time, the move to cloud has impaired the ability of legacy traffic monitoring tools to troubleshoot latency problems between systems. This is because much like an on-premises network, each cloud is a restricted network full of routers, switches and servers that can span the entire globe. Without the use of a cloud-aware traffic observability platform, how your organization’s traffic traverses each vendor’s cloud is a mystery.</p> <p>Regaining hop-by-hop network observability within hybrid and multi-clouds can be done through the use of synthetic monitors. These lightweight and efficient software agents are great for proactively keeping tabs on cloud network performance by providing details on things like latency, packet loss and jitter. Furthermore, they provide IT operations with an understanding of the routed paths taken across hybrid and multi-cloud networks alike.</p> <h3 id="network-observability-for-all-clouds">Network Observability for All Clouds</h3> <p>A network observability platform optimized for hybrid and multi-cloud networks will combine data from synthetic monitors with routes received from BGP. The marriage between these two forms of telemetry allows NetOps to isolate congestion issues for a specific connection down to individual peers, internet exchange points or content delivery networks (CDNs) that are causing unacceptable latency issues.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5fqTxt0ultxyStnP6wJTaS/c9c085dc967257142e1c76408c48d336/hybrid-cloud-and-multi-cloud-traffic-visibility.png" style="max-width: 800px;" class="image center" alt="Hybrid Cloud and Multi-cloud Traffic Visibility" /> <p>With the right platform for <a href="https://www.kentik.com/product/cloud/">proactive monitoring of hybrid and multi-cloud networks</a>, details on exactly where the traffic was impacted can be identified.
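<p>To make that pairing concrete, here is a rough sketch of the kind of correlation involved. Everything in it is invented for illustration; the per-hop timings, the IP-to-ASN table and the output are stand-ins rather than Kentik’s data model:</p> <pre>
# Toy correlation of per-hop synthetic measurements with BGP-derived ASNs
# to estimate which AS a latency jump occurs in. All data here is made up.

hops = [("10.0.0.1", 2.1), ("203.0.113.9", 4.8),
        ("198.51.100.7", 95.3), ("192.0.2.30", 97.0)]  # (hop IP, RTT in ms)

ip_to_asn = {"203.0.113.9": 64496, "198.51.100.7": 64511, "192.0.2.30": 64511}

def worst_latency_jump(hops, ip_to_asn):
    """Return the ASN where round-trip time increases the most hop-over-hop."""
    worst_asn, worst_jump = None, 0.0
    for (_, prev_rtt), (ip, rtt) in zip(hops, hops[1:]):
        if rtt - prev_rtt > worst_jump:
            worst_asn, worst_jump = ip_to_asn.get(ip), rtt - prev_rtt
    return worst_asn, worst_jump

asn, jump = worst_latency_jump(hops, ip_to_asn)
print(f"Largest latency jump ({jump:.1f} ms) appears entering AS{asn}")
</pre> <p>Attributing each hop’s timing to an AS in this fashion makes the segment responsible for a slowdown stand out.</p>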
This, in turn, allows members of the NetOps team to take corrective measures.</p> <p>To learn more about network observability into your hybrid and multi-cloud network, <a href="#signup_dialog" title="Start a Free Kentik Trial">start a free trial</a> today.</p><![CDATA[Latency and Packet Loss Monitoring Within the Cloud]]><![CDATA[Maximize your network's performance with a reliable packet loss monitor. Learn how to proactively identify lost data packets and latency within the cloud. ]]>https://www.kentik.com/blog/how-to-monitor-packet-loss-and-latency-in-the-cloudhttps://www.kentik.com/blog/how-to-monitor-packet-loss-and-latency-in-the-cloud<![CDATA[Kevin Woods, Michael Patterson]]>Wed, 28 Jul 2021 04:00:00 GMT<p>NetOps teams have quickly learned the benefits of hosting applications in the cloud. But before they migrated or adopted a few SaaS applications, they knew in the back of their minds that monitoring performance would be difficult. A tiny voice was asking, “How will we monitor packet loss and connection latency, hop-by-hop, when using cloud applications?”</p> <h2 id="what-is-packet-loss">What is Packet Loss?</h2> <p>Packet loss is a network performance issue that occurs when data packets fail to reach their intended destination. It can happen due to several reasons, such as network congestion, hardware failures, and overloaded devices.</p> <p>When a packet is lost, the receiving device sends a request for retransmission. The time taken for the sender to detect a lost packet and retransmit it is called the retransmission timeout (RTO). Packet loss is measured by calculating the percentage of packets that fail to reach their destination, which is known as the packet loss rate.</p> <h2 id="examples-of-packet-loss">Examples of Packet Loss</h2> <p>Most of us have experienced packet loss, like choppy video in a conference call. Or standing in line at the bank or a department store and when the clerk says, “The system is really slow today.”</p> <h3 id="routers-and-switches-bottlenecking">Routers and Switches Bottlenecking</h3> <p>Routers and switches are an integral part of a network infrastructure, allowing data packets to be efficiently transmitted across the network. However, when the number of data packets being transmitted exceeds the capacity of the router or switch, it can cause network congestion and lead to packet loss. This is known as “bottlenecking”, and it can occur when the network infrastructure is unable to handle the volume of network traffic. Routers and switches can be overwhelmed due to outdated hardware, misconfigured settings, or a lack of bandwidth.</p> <h3 id="packet-loss-resulting-in-choppy-video-conferences">Packet Loss Resulting in Choppy Video Conferences</h3> <p>Video conferencing is a vital tool for remote collaboration and communication in modern workplaces. However, packet loss can lead to a poor video conferencing experience, with choppy or pixelated video, lagging audio, and delays in communication. This can be caused by a variety of factors, such as network congestion, slow internet connections, or outdated hardware. Regular monitoring of network performance can help to identify and prevent packet loss in video conferencing.</p> <h3 id="packet-loss-causes-problems-with-cloud-applications">Packet Loss Causes Problems with Cloud Applications</h3> <p>It isn’t so much that packet loss itself is a huge problem — TCP and QUIC were engineered in anticipation that lost packets would be inevitable. 
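<p>As a quick aside, the loss-rate arithmetic described above is simple enough to sketch in a few lines of Python. The packet counts, RTT and RTO below are invented numbers for illustration, not measurements:</p> <pre>
# Toy numbers: the loss-rate percentage described earlier, and why
# retransmits turn lost packets into added latency.

sent, received = 10_000, 9_850
loss_rate = (sent - received) / sent * 100     # percent of packets lost
print(f"Packet loss rate: {loss_rate:.2f}%")   # 1.50%

base_rtt_ms, rto_ms = 40, 200  # hypothetical round trip and retransmission timeout
# A lost packet costs roughly one RTO before its retransmit completes.
avg_delivery_ms = base_rtt_ms + (loss_rate / 100) * rto_ms
print(f"Average delivery time with retransmits: {avg_delivery_ms:.1f} ms")
</pre>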
If there is a problem, it rests primarily in two areas:</p> <ul> <li><strong>Lost data</strong>: Older technologies (syslogs, SNMP traps, NetFlow, RTP, etc.) that run over UDP with no acknowledgement that the messages sent arrived at the destination. When using these technologies, we hope the data makes it.</li> <li><strong>Latency</strong>: Missing packets in connection-oriented protocols such as TCP and QUIC result in lost packets getting resent. These packet retransmits introduce latency.</li> </ul> <h2 id="what-causes-packet-loss">What Causes Packet Loss?</h2> <p>Packet loss can occur due to various factors, such as network congestion, bad network hardware, overwhelmed devices, mismatched duplexing, expired TTL, poor connectors, incorrect firewall configuration, and network security threats.</p> <h3 id="network-congestion">Network Congestion</h3> <p>Network congestion occurs when there is a high volume of network traffic, and the network infrastructure is unable to handle it. This can lead to packet loss as data packets get dropped due to limited network bandwidth. Network monitoring and management tools can be used to identify and manage network congestion, and network administrators can take proactive steps to optimize network performance and bandwidth allocation.</p> <h3 id="bad-network-hardware">Bad Network Hardware</h3> <p>Bad network hardware, such as faulty network devices or poor network connectivity, can also cause packet loss. This can be due to damaged cables, misconfigured devices, or outdated hardware that cannot handle the volume of network traffic. Regular maintenance and upgrades of network hardware can help to prevent packet loss caused by bad network hardware.</p> <h3 id="overwhelmed-devices">Overwhelmed Devices</h3> <p>Overwhelmed devices, such as routers, switches, and firewalls, can also lead to packet loss when they are unable to process the volume of network traffic. This can result in delays in packet transmission and increased packet loss. Increasing network bandwidth or upgrading to more powerful devices can help to prevent overwhelmed devices.</p> <h3 id="mismatched-duplexing">Mismatched Duplexing</h3> <p>Mismatched duplexing occurs when network devices are set to different duplex modes, such as half-duplex or full-duplex, resulting in packet loss. Network administrators can use ping tests to identify mismatched duplexing and reconfigure the devices accordingly to prevent packet loss.</p> <h3 id="expired-ttl">Expired TTL</h3> <p>Expired time to live (TTL) values can also cause packet loss as routers drop packets with expired TTL values. This can happen when packets are routed for too long or when they are stuck in a routing loop. Regular maintenance and monitoring of network performance can help to prevent packet loss caused by expired TTL.</p> <h3 id="poor-physical-connections">Poor Physical Connections</h3> <p>Poor connectors, such as poorly assembled cables or loosely plugged-in devices, can cause packet loss due to unstable network connections. Regular checks and maintenance of network connections can help to prevent packet loss caused by poor connectors.</p> <h3 id="incorrect-firewall-configuration">Incorrect Firewall Configuration</h3> <p>Incorrect firewall configurations that drop packets, such as all ICMP being dropped, can also lead to packet loss. 
Network administrators should regularly review and optimize their firewall configurations to prevent packet loss due to incorrect firewall settings.</p> <h3 id="network-security-threats">Network Security Threats</h3> <p>Lastly, network security threats, such as DDoS attacks, can cause packet loss by overwhelming the network with malicious traffic or by causing network downtime. Implementing robust security measures and regularly monitoring network performance can help to prevent packet loss caused by network security threats.</p> <h2 id="detecting-packet-loss">Detecting Packet Loss</h2> <p>Trying to source the location that is causing the packet loss is not always a trivial practice. For example, what if the destination of the connection is somewhere in the cloud? How do you identify where the packets are getting dropped?</p> <p>Here’s a list of utilities and techniques that many of us are still using today to detect and isolate network packet loss:</p> <ol> <li><strong>Ping</strong>: Using ping on the command prompt to measure packet loss is generally done to verify connectivity of a host, but “request timed out” could indicate if there is any packet loss. You have to remember that ping rides on top of ICMP, and in a congested network it is one of the first protocols to get dropped by a busy router. For this reason, ICMP is not a reliable protocol, and more importantly, it doesn’t tell you where in the path the packets were dropped.</li> <li><strong>SNMP</strong>: By polling all the SNMP devices on the network, packet loss details can be collected and a threshold can be set that notifies the NetOps team. Since we are focused on the cloud in this article, we find that SNMP is great for LANs and WANs, but we can’t use it to see inside devices within the cloud.</li> <li><strong>Packet capture</strong>: By strategically locating one or more probes off of mirrored ports on the network, sessions can be monitored, but if the loss occurred in the cloud, we will have no idea where it occurred. In production networks, packet probes are great for deep troubleshooting, but they are expensive and simply can’t be located everywhere we need them.</li> <li><strong>TCP traceroute</strong>: A TCP trace reaches out to every router in the path to a target destination. The log generated by the trace reports on each router in the path as well as any corresponding packet loss.</li> </ol> <img src="https://images.ctfassets.net/6yom6slo28h2/77EdHy5grnoq8qMdhJW8Zo/ffda689ad938cf79d577d5b3330c89ec/how-to-monitor-packet-loss-and-latency.png" style="max-width: 650px; margin-bottom: 15px;" class="image center" alt="How to monitor packet loss and latency" /> <div style="max-width: 650px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Detecting packet loss with TCP: Notice that TCP traceroute reports on the latency of each round trip connection, making it a good diagnostic tool when it is deployed at scale. </div> <p>After reviewing the above four ways to monitor for packet loss, you might think that TCP traceroute will solve the observability riddle, but there’s a problem. The cloud is made up of thousands of routers. This means we would need a massive global network of traceroute probes to help us test connections to the business applications we depend on.</p> <p>Remember, with applications in the cloud, we need to test from all the different locations where we have employees and customers. 
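<p>To see what that entails, here is a minimal sketch of a single probe. It shells out to the system ping, so it assumes a Unix-like environment, and the target host is a placeholder:</p> <pre>
import re
import subprocess

def measure_loss(target, count=20):
    """Rough sketch: run the system ping and parse the reported packet loss.
    A production probe would use raw sockets or TCP-based tests instead."""
    out = subprocess.run(
        ["ping", "-c", str(count), target],
        capture_output=True, text=True
    ).stdout
    match = re.search(r"([\d.]+)% packet loss", out)
    return float(match.group(1)) if match else None

# Placeholder target; a real deployment runs this from every vantage point
print(measure_loss("example.com"))
</pre> <p>That is one measurement, against one target, from one location.</p>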
Imagine the amount of deployment work to get them all set up — not to mention the ongoing maintenance!</p> <h2 id="how-to-monitor-packet-loss-and-latency-in-the-cloud">How to Monitor Packet Loss and Latency in the Cloud</h2> <p>Kentik solved the deployment and maintenance conundrum by setting up a global network of agents used for synthetic testing — see the <a href="https://www.kentik.com/product/global-agents/">Kentik Global Synthetic Network</a>. These lightweight devices are located all over the world, in every major virtual public cloud (VPC) and service provider. These synthetic testing agents can be configured in the Kentik Network Observability Cloud to monitor any business application such as Salesforce, Office365 and more.</p> <div class="pullquote" style="max-width: 78%; font-weight: 400;">"[Kentik] is actually showing you a visual representation of how your physical equipment connects via those virtual private connections in the cloud. It also gives us information like latency, throughput, and jitter, across on-prem, through the cloud, and back on-prem."<br />&mdash; <a href="/resources/casestudy-major-league-baseball/">Jeremy Schulman with Major League Baseball</a></div> <p>Performing a TCP traceroute to IP addresses and host names is just the beginning. Kentik Synthetics can be used to verify availability of specific content in web pages, test DNS servers to make sure they are responding in a timely manner, monitor the availability and responsiveness of API endpoints, and much more. These tests are performed in an automatic and periodic way, with testing intervals as low as the sub-minute range.</p> <h3 id="establishing-baselines-and-setting-thresholds-for-packet-loss-and-latency">Establishing Baselines and Setting Thresholds for Packet Loss and Latency</h3> <p>When the tests are all configured, you can trend the data collected and set thresholds at levels where you know the business will be impacted.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2VnVe2wndckoLbG0asCkVw/95a6e3f97d91944f332e3fb4c70143db/pack-loss-latency-jitter-monitoring-with-the-cloud.png" style="max-width: 600px; margin-bottom: 15px;" class="image center" alt="Packet Loss, Latency, Jitter Monitoring with the Cloud" /> <div style="max-width: 600px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Measuring packet loss and latency in the cloud with Kentik Synthetics</div> <p>With the Kentik Network Observability Cloud, you gain the benefits of being able to baseline the performance of applications, websites and networks. You can hold vendors accountable when you know darn well that they are introducing 80% of the delay, which is annoying your customers and could be forcing them to abandon their online checkout. Doing nothing is costing your company money.</p> <p>If the service provider doesn’t fix a packet loss or latency problem, you can find new transit and then verify that your new path is avoiding that troublesome network. 
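<p>One crude way to spot-check that avoidance is to resolve each measured hop to its origin ASN and assert that the offending network no longer appears. In this sketch, the hop IPs, the IP-to-ASN table and the ASNs are invented examples (RFC 5737 documentation addresses), not live data:</p> <pre>
def path_avoids(hop_ips, ip_to_asn, bad_asn):
    """Return True if no hop on the measured path resolves to bad_asn."""
    return all(ip_to_asn.get(ip) != bad_asn for ip in hop_ips)

# Invented example data; real hops come from traceroute, ASNs from BGP data
ip_to_asn = {"203.0.113.1": 64496, "198.51.100.2": 64497, "192.0.2.3": 64511}
new_path = ["203.0.113.1", "198.51.100.2", "192.0.2.3"]
print(path_avoids(new_path, ip_to_asn, bad_asn=64500))  # True: reroute held
</pre>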
To learn more, read “<a href="https://www.kentik.com/blog/how-to-monitor-traffic-through-transit-gateways/">How to Monitor Traffic Through Transit Gateways</a>.”</p> <p>If you want to be proactive, you can deploy or make use of Kentik Synthetics agents in a remote geographic location and then run tests to see if a remote location like Inuvik, Canada can support an application at the necessary service levels.</p> <p>If you’d like to learn more about our network monitoring solution or how to monitor packet loss, <a href="#demo_dialog" title="Request a Personalized Kentik Demo">request a demo</a> or <a href="#signup_dialog" title="Start a Kentik Trial">start a free trial of Kentik</a>.</p><![CDATA[Employee Spotlight: Akshay Dhawale - Solutions Engineer]]><![CDATA[In our employee spotlight series, we highlight members of the Kentik team, what they’re working on, and their most memorable moments within the company. In this Q&A, meet Solutions Engineer Akshay Dhawale.]]>https://www.kentik.com/blog/employee-spotlight-akshay-dhawale-solutions-engineerhttps://www.kentik.com/blog/employee-spotlight-akshay-dhawale-solutions-engineer<![CDATA[Michelle Kincaid]]>Thu, 22 Jul 2021 04:00:00 GMT<div style="float: right; width: 170px; margin-left: 25px;"><img src="https://images.ctfassets.net/6yom6slo28h2/5YRDNaA4oziuHxrUhhq4c/4f8efb8a0b0046323ea723695d06c016/akshay-2021.jpg" alt="Akshay Dhawale" class="no-shadow" style="max-width: 170px; border-radius: 85px; padding: 0; border: 4px solid #f37021;" /><p style="text-align: center; font-size: 98%;"><b>AKSHAY DHAWALE</b><br />Solutions Engineer</p></div> <p>In this blog series, we highlight members of the Kentik team, what they’re working on, and their most memorable moments within the company. In this Q&#x26;A, meet Akshay Dhawale, a senior solutions engineer here at Kentik.</p> <div style="border-top: 1px solid #d5dee0; width: 70%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What is your role at Kentik?</strong></p> <p><strong>Akshay:</strong> I am a senior solutions engineer with Kentik and focus on supporting digital businesses. I spend most of my time helping customers up-level their #observability game and making network data more informative and actionable.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>How long have you been at Kentik?</strong></p> <p><strong>Akshay:</strong> 4 years and several code builds.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>Where do you live?</strong></p> <p><strong>Akshay:</strong> Dallas, Texas — but mostly at <a href="https://portal.kentik.com/">portal.kentik.com</a>!</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What made you want to work at Kentik?</strong></p> <p><strong>Akshay:</strong> I met the co-founders and the sales teams at a NANOG years ago and thought that every single person I spoke with was so passionate about the technology and the opportunity to build a network engineer’s dream platform. This excitement brushed on me throughout my interviews and it continues till today as we speak to new potential candidates.</p> <p>I will also say, rocking the cool(est) Kentik T-shirts might have had a particularly significant weightage in wanting to be a part of this kClub. 
<img src="//images.ctfassets.net/6yom6slo28h2/BYRpJ4VQD6pTDfanIbbeS/1a8b2d99fec94a2dfae590921a196daf/3shirts-higher-res.jpg" style="max-width: 600px; margin-bottom: 15px;" class="image center" thumbnail /></p> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;"><i>"Vintage" Kentik Tees!</i></div> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What’s the coolest thing you’ve been able to do at Kentik?</strong></p> <p><strong>Akshay:</strong> <a href="https://www.kentik.com/resources/nfd22-the-kentik-experience-an-overview-demo/">Presenting the Kentik Experience at Tech Field Day</a> for our Insights launch was probably the coolest thing I’ve been able to do at Kentik. Helping a customer monetize their network data was kinda cool, too.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What keeps you at Kentik?</strong></p> <p><strong>Akshay:</strong> We have a great team of diverse people who you can learn so many things from (like home automation for a chicken coop, or dishing out a perfect Roy Choi sandwich!). Not to mention, the underlying tech and being led by a great CEO!</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>How would you describe your coworkers?</strong></p> <p><strong>Akshay:</strong> “Kind tinkerers.” That’s almost a prereq for Kentikians!</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What’s your most memorable or favorite company event or offsite?</strong></p> <p><strong>Akshay:</strong> Our company offsite in Park City, Utah and after-dinner parties.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What’s your favorite Kentik Slack topic?</strong></p> <p><strong>Akshay:</strong> The #random channel.</p> <p><em>(Kentik sidenote: This Slack channel is labeled for “non-work banter, watercooler conversation and interesting hobbies (mostly trees).” True story: In the last five photos uploaded to #random, all five photos had at least one tree featured. We stay ‘logged’ into this channel!)</em></p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What are you most excited about over the next year or two here?</strong></p> <p><strong>Akshay:</strong> Kentik’s total addressable market (TAM) continues to grow as we expand our technology to support major public clouds, synthetic performance monitoring, and our battle-tested edge and data center core visibility solutions.</p> <p>I most look forward to how we go-to-market with solutions that integrate these three realities of operating any networks today and everything that follows from <a href="https://www.kentik.com/press-releases/kentik-expands-gtm-leadership-team-to-drive-greater-revenue-growth/">this recent announcement</a>.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p>Want to work with Akshay and the rest of the Kentik team? We’re hiring! Check out our <a href="/careers/#postings">open positions</a>.</p><![CDATA[OTT Service Tracking Gets a Major Facelift and Update!]]><![CDATA[Today we’re introducing the most important update to OTT Service Tracking for ISPs since its inception. 
To understand the power of this workflow, we’ll review the basics of OTT service monitoring and explain why our capability is so popular with our broadband provider customers.]]>https://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-updatehttps://www.kentik.com/blog/ott-service-tracking-gets-a-major-facelift-and-update<![CDATA[Greg Villain]]>Fri, 16 Jul 2021 04:00:00 GMT<p>Today we’re introducing the most important update to OTT Service Tracking for ISPs since its inception. To understand the power of our workflow, I’ll first discuss the basics of OTT (over-the-top) service monitoring and then explain why our enhanced capability is so popular with our broadband provider customers.</p> <p>For broadband providers, commonly referred to as “<a href="https://www.kentik.com/blog/building-and-operating-networks-assuring-what-matters/">eyeball networks</a>,” it’s difficult to clearly visualize their digital supply chain. Case in point, below is a real-life, anonymized example of a video OTT service which shows how content is handed over to the broadband provider’s network:</p> <img src="//images.ctfassets.net/6yom6slo28h2/e9fdm9krEIWpmO2EtWVDm/2f2f393d056a0f0ece9bc9d78665fa25/ott-facelift-sankey.png" style="max-width: 800px;" class="image center" alt="Video OTT Service Sankey Diagram" thumbnail /> <p>Notice how a single OTT content provider leverages all methods at their disposal to deliver traffic to this one ISP, including:</p> <ul> <li>Embedded caching, private peering and transit types of edge connectivity</li> <li>Leveraging at least four commercial CDNs</li> </ul> <p>Mastering the cost-versus-performance balance made our OTT Service Tracking workflow so popular with broadband ISPs for a couple of reasons:</p> <ol> <li>There are hundreds of different OTT content providers out there, and each one comes with a complex arrangement of connectivity types and CDN providers.</li> <li>Each combination of entry site, connectivity type and CDN comes with its own variations of: <ul> <li>Cost at the interconnection edge </li> <li>Performance perceived by the subscriber consuming the content</li> <li>Interfaces on the edge of the network that need enough capacity to deliver that traffic</li> </ul> </li> </ol> <p><a href="/resources/kentik-true-origin/">Kentik True Origin</a> is the engine (or secret sauce) that powers OTT Service Tracking workflow (and our CDN Analytics workflow, too). 
True Origin detects and analyzes the DNA of over 350 categorized OTT services and providers and more than 50 CDNs in real time, all without the need to deploy DPI (deep packet inspection) appliances behind every port in the network.</p> <p>The OTT Service Tracking workflow allows providers to plan and execute what matters to their subscribers, including:</p> <ul> <li>Maintaining competitive <strong>costs</strong></li> <li>Anticipating and fixing subscriber OTT service <strong>performance</strong> issues</li> <li>Delivering sufficient inbound capacity to ensure <strong>resilience</strong></li> </ul> <h3 id="better-faster-stronger">Better, Faster, Stronger</h3> <p>Whether it’s a new video-game release, network edge interfaces running hot, or subscribers complaining about rebuffering, when broadband providers face a content event, two questions always get raised:</p> <div class="pullquote right">If you’re interested in seeing, from the Kentik OTT Service Tracking perspective, how such an event can impact performance, check out <a href="https://www.kentik.com/blog/maxing-out-network-content-delivery-baldurs-gate-3-case-study/">this blog post</a> I wrote about the release of a highly anticipated video game. It’s a pretty fun read!</div> <ol> <li>Is there a noticeable performance impact?</li> <li>How many subscribers are impacted by the event?</li> </ol> <p>With our latest updates to our OTT Service Tracking workflow, we’re happy to say that we can help you better answer these questions. You’ll find these improvements in our new “Subscribers” tab of any “OTT Service Details” screen.</p> <p>The data needed to answer the two questions above are not a native part of network telemetry, and are not easy to inspect at an OTT service level. Yet, there are reasonable proxies to subscribership count and performance measurements that Kentik uses to help.</p> <p><strong>Performance</strong> can be observed by looking at the average/95th percentile/max value of Mbps, per unique destination IP — as long as you are able to isolate traffic destined to subscribers for a specific OTT service. As mentioned in the previous section, performance variations also need to be observed by delivery site, connectivity type or provider, now offered in the “Subscribership” tab for any OTT service.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/75Ih04XtswfPZTcoSjjX1/a88a7c7c1d16f8f16779f892aa5f866f/ott-subscriber-performance.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" alt="Slice and Dice Subscriber Performance" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">The redesigned OTT Service Tracking and True Origin interface will help users slice and dice subscriber performance to try and answer these questions.</div> <p>A good <strong>subscribership</strong> indicator can also be inferred by using the maximum number of destination IPs as a proxy. 
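<p>Both proxies are easy to picture in miniature. The sketch below derives throughput statistics per destination IP and a unique-IP subscriber estimate from a handful of made-up flow records; the record layout is illustrative, not Kentik’s schema:</p> <pre>
from statistics import quantiles

# Invented flow records for one OTT service: (destination IP, Mbps)
flows = [("192.0.2.10", 4.1), ("192.0.2.10", 5.6), ("192.0.2.11", 2.3),
         ("192.0.2.12", 7.9), ("192.0.2.11", 3.0), ("192.0.2.13", 6.2)]

per_ip = {}
for ip, mbps in flows:
    per_ip.setdefault(ip, []).append(mbps)

# Performance proxy: Mbps distribution across unique destination IPs
rates = [max(v) for v in per_ip.values()]       # peak rate seen per subscriber IP
p95 = quantiles(rates, n=100)[94]               # 95th percentile
print(f"avg={sum(rates)/len(rates):.1f}  p95={p95:.1f}  max={max(rates):.1f} Mbps")

# Subscribership proxy: count of unique destination IPs
print(f"estimated subscribers: {len(per_ip)}")
</pre>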
Granted, it will not help identify the number of members of a household with separate accounts on the same OTT service, but it is largely accepted as a sufficient proxy.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2uP3jmQcufsmasWxXoGWTb/36d98b51d32979ed20b51fb7d649f8f1/ott-count-funtion-of-time.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" alt="Subscriber Count as a Function of Time" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">This screenshot shows the previously famous subscriber count for any given OTT as a function of time, together with a ranking within the same category, in this case it’s video OTTs.</div> <h3 id="tracking-edge-capacity-for-individual-ott-services">Tracking Edge Capacity for Individual OTT Services</h3> <p>Let’s look at the most exciting part of this update to our OTT workflow. As we’ve hinted in the past when talking about the vision behind Kentik, there are three standard aspects of a network on which any infrastructure engineer is usually focused.</p> <ol> <li> <p><strong>Cost</strong>: We’ve started addressing this side of the equation in the previous quarter’s highly anticipated <a href="https://www.kentik.com/product-updates/may-2021-connectivity-cost-workflow/">Connectivity Costs update</a>. (Stay tuned, we have more coming!)</p> </li> <li> <p><strong>Performance</strong>: We’ve added a wealth of functionality in multiple areas of the product in the previously mentioned update and through our <a href="https://www.kentik.com/product/synthetics/">autonomous synthetic testing for CDNs</a>.</p> </li> <li> <p><strong>Resilience / Capacity</strong>: While we already offer a pretty popular workflow for capacity planning, we heard from a number of broadband provider customers another request: <em>“What if I want to scorecard an OTT service based on how much capacity I have on the edge?”</em></p> </li> </ol> <p>As we’ve learned early in this blog post, any given OTT service can leverage a myriad of interfaces to enter the broadband provider’s network. The more interfaces involved in traffic handover, the more difficult it is for the network operations teams to achieve a complete picture of capacity at any point in time.</p> <p>The aggravating factor here is that the set of inbound interfaces receiving this specific OTT traffic are constantly changing because of routing policy changes from upstream players in the digital supply chain.</p> <p>Together with this new requirement from our ISP customers, we have also heard these requests:</p> <ol> <li>“I want to be able to quickly evaluate/scorecard capacity at an OTT service level.”</li> <li>“In case deeper inspection is warranted, I want to be able to assess whether there is a performance impact on my subscribers.”</li> <li>“In the event of a meaningful performance impact, I want to be able to measure the ‘blast radius’ in terms of number of subscribers affected.”</li> </ol> <p>With our latest OTT Service Tracking update, broadband providers get an improved ability to track capacity closely for any of their top-100 ranked OTT services, and get a better shot at zeroing in on potential issues.</p> <p>The workflow has been extended with a “Capacity” tab on the details screen of any OTT Service. 
In this screen, an easy-to-parse Treemap visualization gives the operator an instant view of all the interfaces involved in delivering the OTT service, together with a visual representation of the state of utilization and the weight of each in terms of traffic. The user will get an immediate sense of what’s bad and how bad it is, as shown below:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/NoaHJ8FiPpJTyg2YYWcUl/dfab0f74714b2a9acea97edb8e80e881/youtube-treemap.png" style="max-width: 800px;" class="image center" alt="YouTube - Capacity Treemap" thumbnail /> <ul> <li>Each block represents an interface.</li> <li>The color of each block represents the state of utilization of the interface (utilization thresholds are configurable globally for the workflow).</li> <li>The size of each interface block represents the share of this OTT service’s traffic that the interface contributes to the overall delivery.</li> </ul> <p>Below this capacity overview is a list of network devices that these interfaces belong to. Each one is expandable, providing a quick reference to all interfaces on the device, their state of utilization, and their contribution to this OTT service on that device.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3gc47RbFPFKJr9K4ETUygV/46a690fd511b1898014771ec592de08d/per-device-capacity-analysis.png" style="max-width: 800px;" class="image center" alt="List of Network Devices" thumbnail /> <p>The user is encouraged to drill down into a device exhibiting high utilization to identify the over-utilized interfaces:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2IRQ1MtrSe6oPRubC3x7IR/65c8d0ff7e44e0aa3f5ae4343b02a64e/capacity-drilldown-1149w.png" style="max-width: 800px;" class="image center" alt="Identify Over-used Interfaces" thumbnail /> <p>Now that the problematic interfaces are identified, you can further drill down to measure the potential impacts, as well as the number of subscribers behind them:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5bX9W64V7jLXy20olnhWad/05fd69a980c3d762b12bac5041155e4a/ott-dashboard.png" style="max-width: 800px;" class="image center" alt="Drill-down - Potential Impacts" thumbnail /> <p>Voilà! That’s our improved OTT Service Tracking tool!</p> <h3 id="whats-next">What’s Next?</h3> <p>At Kentik, we design features and workflows with two key ingredients: the time-tested framework of cost, performance and resilience, and the inputs from our customers.</p> <p>We start with ideas for workflows that you suggest and help us design. We put them in our customers’ hands to play with and improve them with each consecutive release. So, in a nutshell, you tell us what comes next.</p> <p>Would you like to know how much any OTT service costs you? Or does something else keep your teams up at night? <a href="https://www.kentik.com/contact/">Just tell us</a>. Our product team is thrilled to help you solve these problems.</p><![CDATA[Kentik Engineering: An Introduction]]><![CDATA[VP of Engineering Mike Ho kicks off a blog series to talk about Kentik’s engineering team, where they’ve been, decisions they’ve made, problems solved, and fun facts about the team and the platform.]]>https://www.kentik.com/blog/kentik-engineering-an-introductionhttps://www.kentik.com/blog/kentik-engineering-an-introduction<![CDATA[Mike Ho]]>Thu, 15 Jul 2021 04:00:00 GMT<p>Over the years, many people have asked us to author an engineering blog series at Kentik.
It’s taken us a while to get our act together, but we’re finally here.</p> <p>As I contemplated how to kick this series off over the past couple of months, I reflected a lot on our journey as a team and a company. I recall so many memories and see so much potential, so much excitement within our engineering team. I wonder what people imagine when they think about Kentik engineering, who we are and what we do.</p> <div class="pullquote right">Who do you think we are? Network nerds? Ok, guilty.<br />Data dreamers? Definitely guilty.</div> <p>Who do you think we are? Network nerds? Ok, guilty. Data dreamers? Definitely guilty. Really cool, fun people that enjoy building software? Guilty, guilty, guilty. But, let me pull back the curtains a bit, and you will see so much more.</p> <p>As I look around me, I see a team of people I’ve gained so much respect for as we’ve built a crazy scalable SaaS platform and products together, fought back production fires together, created fun easter eggs and hackathon projects together, and enjoyed team events together, including blacksmithing/MIG welding, escape rooms, and swamp tours in the Louisiana Bayou.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1sboDu4lRuqGLMlIJ0LZDv/41f4a6cd53e49b9cacf488a6ec8b1d34/eng-nola.png" style="max-width: 500px;" class="image center" alt="Kentik Engineers at Play" /> <p>I see people I consider friends and treat like family. I see people I would want to work with again and again.</p> <p>I see hard workers, communicators, fun-lovers and parents (and so many future Kentikians entering the world!). I see many different and intriguing hobbies with people spread around the world (we were built as a remote team before 2020).</p> <p>But most importantly, I see a team. A team that cares deeply about each other and our mission to unlock the mysteries that only network data can solve. A team that has demonstrated expertise in distributed systems, data processing, and modern user experience. I see one of the most talented teams, top to bottom, I’ve ever been a part of.</p> <p>I have been at Kentik for just over six years. We’re about to hit our seventh birthday as a company. It’s both awe inspiring and mind boggling to me as I think about all the things we’ve done together in this time.</p> <p>I remember starting with two “simple” engineering goals: store every data record we receive, and be able to query this data at subsecond speeds… all while being able to scale horizontally.</p> <div class="pullquote right">Today, our U.S. SaaS deployment currently receives almost 8 million records per second ... think about that for a minute while we process another 480 million records.</div> <p>Today, our U.S. SaaS deployment currently receives almost 8 million records <strong>per second</strong>. To save you from the math — that’s 691 billion records per day, or 252 trillion records per year. Think about that for a minute while we process another 480 million records.</p> <p>When users want to query this data, they can ask any question they want across hundreds of millions to billions of records (I take care to say records instead of rows because we don’t store the data in row format), and we will typically answer it within seconds. This year, we expect to run somewhere around <strong>1.5 billion queries</strong> for our users. I think we’ve knocked those goals out of the park.</p> <p>And it didn’t stop there. We have thousands of BGP peers and hundreds of customers that have leveraged what we’ve built to make DDoS attacks an afterthought. 
We’ve launched several major products in the last two years, and are always looking at how to paint an even more complete picture to network observers around the world, all within a state-of-the-art UI.</p> <p>Did I mention that we have done all of this with under 30 engineers to date? I’m so proud of our team and what we’ve accomplished with so few hands on keyboards. Today, we’re growing, and growing fast. We plan to almost double in headcount this year and don’t plan to stop there. There are so many new and exciting challenges ahead of us. Check out <a href="https://www.kentik.com/careers/#postings">our job openings</a>!</p> <p>As we grow, this blog series will grow with us, and I’m very excited to take you on our journey. Our team will share where we’ve been, the decisions we’ve made, the problems we’ve solved, and fun facts about Kentik and our platform. We will talk about everything engineering, from technical solutions, to organizational decisions, to team culture, and how all of the tradeoffs we’ve made have paid off (or not) over time. Our hope is that sharing our experiences will help others feel the joy we feel as they build out their teams and products.</p> <p>For now, I’ll leave you with a little present from the ghosts of Kentik past.</p> <p style="text-align: center;"><b>V1</b></p> <img src="//images.ctfassets.net/6yom6slo28h2/78FDLOuaZ8LvuuVwAUEaaR/d6ac2614bb633e9467584b1c8ee2a433/eng-v1.png" style="max-width: 800px; margin-top: 10px;" class="image center" alt="Kentik v1" /> <p style="text-align: center;"><b>V2</b></p> <img src="//images.ctfassets.net/6yom6slo28h2/5qLX9HJOnH8ZCjAb6hoHNa/8b00e7c6fd1ee05d29908e5ee2549250/eng-v2.png" style="max-width: 800px; margin-top: 10px;" class="image center" alt="Kentik v2" /> <p style="text-align: center;"><b>V3</b></p> <img src="//images.ctfassets.net/6yom6slo28h2/20WaU04rIiRRDfRdIhKIt7/33efccb927ea9768cc2c9704e4b943de/eng-v3.png" style="max-width: 800px; margin-top: 10px;" class="image center" alt="Kentik v3" /> <p style="text-align: center;"><b>V4</b></p> <img src="https://images.ctfassets.net/6yom6slo28h2/4JH6kFbnm0cEXN82d2bV4u/477efd1191c92a8c05107c5df2e07f34/eng-v4-1.png" style="max-width: 800px; margin-top: 10px;" class="image center" alt="Kentik v4" /> <p>Until next time — stay healthy out there!</p><![CDATA[Why You Need to Monitor BGP]]><![CDATA[BGP is a critical network protocol, and yet, BGP is often not monitored. BGP issues that go unchecked can turn into major problems. This post explains how Kentik can help you to easily monitor BGP and catch critical issues quickly.]]>https://www.kentik.com/blog/why-you-need-to-monitor-bgphttps://www.kentik.com/blog/why-you-need-to-monitor-bgp<![CDATA[Kevin Woods]]>Fri, 09 Jul 2021 04:00:00 GMT<p>The <a href="https://www.kentik.com/kentipedia/what-is-bgp-border-gateway-protocol/" title="Kentipedia: What Is BGP? Border Gateway Protocol Explained">Border Gateway Protocol</a> (BGP) is a fundamental part of sending data over the internet. That’s why the team here at Kentik just introduced BGP monitoring. BGP exchanges routing and reachability information for connecting autonomous systems. Without it, there would be no way to scale the internet or even make it work at all.</p> <p>Monitoring BGP to ensure proper operation is critical for any organization. At Kentik, we’ve been helping our customers <a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-1/">monitor flows in BGP networks</a> for years. 
We’ve also helped customers understand the origination and destination of BGP flows that cross their networks. We help enterprises and service providers see the relationship between traffic flows and applications, originating ASNs and CDNs. We provide workflows that help improve the costs, performance and plan capacity for BGP networks. We’ve also helped our customers <a href="https://www.kentik.com/blog/bgp-and-rpki-a-path-made-clear-with-kentik/">secure their networks with RPKI</a>. Now, we’re moving to proactive measures to keep you ahead of the game. Our first capability in this area, of many to come, is something we call BGP monitoring.</p> <p>But before we share what we released, let’s first look at why BGP monitoring is important.</p> <h3 id="proactive-monitoring">Proactive Monitoring</h3> <p>Before we go into what can go wrong, it is good to monitor BGP to verify route changes that you make in the ordinary course of network operations. For example, you may have changed a service provider relationship. In this case, you want to be sure that your routes are correctly advertised and reachable following the change.</p> <h3 id="what-can-go-wrong">What Can Go Wrong?</h3> <p>There are many entities and devices that participate in BGP networks, so there are numerous potential points of failure or problems. It is well-known that BGP has weaknesses around authentication and verification of routing claims. Here are some of the most common issues.</p> <p><strong>BGP Route Misconfigurations</strong><br> Advertising routes that cannot deliver traffic is known as “blackholing.” If you advertise some part of the IP space owned by someone else, and that advertisement is more specific than the one made by the owner of that IP space, then all of the data on the internet destined for that space will flow to your border router. This will effectively disconnect that black-holed address space from the rest of the internet.</p> <p><strong>Route Hijacking</strong><br> Route hijacking is using another network’s valid prefix as your own. This can cause severe problems across the entire network. The majority of the route hijacking on the internet is due to unintentional misconfiguration. That doesn’t mean that someone couldn’t be attempting to disrupt service or intercept packets, but a common cause is simply a typo in the config file.</p> <p><strong>Route Flapping</strong><br> Route flapping occurs when a router advertises a destination network via one route, then changes to another (or as “unavailable,” and then “available” again) quickly. This can cause other routers to recalculate routes, consuming processing power and potentially disrupting service.</p> <p><strong>Infrastructure Failures</strong><br> Route flapping and other problems can be caused by hardware errors, software errors, configuration mistakes, and failures in communications links such as unreliable connections. These can cause reachability information to be repeatedly advertised and withdrawn. A common failure occurs when an interface on a router has a hardware problem that will cause the router to announce it alternately as “up” and “down.”</p> <p><strong>DDoS Attacks</strong><br> BGP hijacking can be used to launch a DDoS attack where the attacker poses as a legitimate network by using another network’s valid prefix as their own. 
If such a hijack succeeds, traffic can be redirected to the attacker’s network, thus denying service to the user.</p> <h3 id="kentik-introduces-bgp-monitoring">Kentik Introduces BGP Monitoring</h3> <p>To help customers proactively monitor BGP and avoid these common problems with BGP networks, Kentik has <a href="https://www.kentik.com/blog/bgp-monitoring-from-kentik/" title="Kentik Blog: Introducing BGP Monitoring from Kentik">introduced BGP monitoring</a>. In response to customer requests and feedback, we have developed a comprehensive roadmap for BGP monitoring, and we believe our solution will have significant performance advantages over alternative solutions.</p> <p>The first part of Kentik’s solution is BGP Route Viewer. BGP Route Viewer appears as a tab along with the existing SaaS and Cloud Performance tabs. For customers who have entered prefixes in their Network Classification settings, we will automatically load BGP update data for those prefixes in this tab. For customers who have not entered any prefixes in their Network Classification settings, we will show an interface that allows you to do so and give you the option to save the entered prefixes to the Network Classifications page.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1gfiT1eV34pyd2npRCojEE/8682fde9fb1ed042ee694a3279cb6798/bgp-route-viewer.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="BGP Monitoring" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Kentik’s BGP Route Viewer</div> <p>Users will see all BGP announcements and withdrawals observed by Vantage Points (VPs) over the most recent 12-hour period, presented as bar charts as well as a table. Each bar represents five minutes of data. Clicking on a specific bar filters the updates to that five-minute time window. For each event in the table, we indicate the type (announcement vs. withdrawal), the origin AS (the one that the prefix belongs to), the prefix, the AS path (from the VP’s AS to the origin AS), the length of the path (number of hops) and the data set (RouteViews data or Kentik private peer data).</p> <h3 id="advantages-of-kentiks-approach">Advantages of Kentik’s Approach</h3> <p>Kentik’s architecture delivers frequent updates, as often as each minute, giving users the most up-to-date information quickly and minimizing the delay needed to identify and respond to issues.</p> <p>Kentik uses hundreds of BGP feeds to ensure better network coverage, accuracy and performance. Kentik users will automatically get a 3x to 5x advantage over other solutions in terms of coverage and performance. Kentik’s BGP Route Viewer is easy to use, relying on the ASes and prefixes you already report or can easily add. Kentik’s BGP Route Viewer is available today and is free to use for all Kentik customers and trials.</p> <h3 id="and-much-more-coming-soon">And Much More Coming Soon!</h3> <p>BGP Route Viewer is just the start. We have many additional features coming soon on the product roadmap, including the ability to get immediate notifications for hijacks, data to help you better understand BGP performance, visualizations, and the ability to monitor reachability from your choice of vantage points.</p> <h3 id="start-bgp-monitoring-today-with-kentik">Start BGP Monitoring Today with Kentik</h3> <p>If you own ASes and advertise routes to the internet directly or via a service provider, BGP is one of the most crucial components of your network infrastructure.
Yet, as we have described, many things can go wrong. Kentik’s unique approach provides essential coverage and performance advantages not available in other solutions. And we have a complete roadmap with many more features coming soon.</p> <p>BGP monitoring is essential and is an integral part of network observability — the ability to answer any questions about your network. <a href="#signup_dialog" title="Start a Free Kentik Trial">Start monitoring BGP with Kentik today</a>!</p><![CDATA[Automated, Accurate, Flexible DDoS Detection and Mitigation]]><![CDATA[From a threat actor’s side, launching a DDoS attack is both easy and cheap. For a business, it's costly, disruptive and comes with big hurdles for effective detection and mitigation. Kentik Protect can help.]]>https://www.kentik.com/blog/automated-accurate-flexible-ddos-detection-and-mitigationhttps://www.kentik.com/blog/automated-accurate-flexible-ddos-detection-and-mitigation<![CDATA[Daniella Pontes]]>Wed, 30 Jun 2021 04:00:00 GMT<p>Gaming, finance, hosting, you name it: many digital businesses today are under DDoS attack. Why? From the threat actor’s side, launching one of these attacks is both easy and cheap. On the Dark Web, DDoS-as-a-service can be purchased for as little as $10 per hour and requires almost no technical sophistication to deploy. Not to mention, the reward (usually in the form of a service disruption) can quickly be reaped.</p> <p>On the other hand, it can be quite a costly experience for a business. One study suggests the loss is anywhere from $20,000 to $40,000 per hour as a business’ sites and services are taken down. That hardly scratches the surface when you add in the toll taken on reputation. So, what can be done?</p> <h3 id="challenges-with-current-ddos-mitigation-approaches">Challenges with Current DDoS Mitigation Approaches</h3> <p>DDoS attacks are hardly new, and there are many different approaches to mitigation. Traditionally, enterprises “scrub” DDoS traffic, but that requires investment in dedicated, costly appliances and specialized knowledge in attack mitigation, so it has become less common.</p> <p>Instead, many companies choose to outsource the entire mitigation process. While this approach eliminates the need for on-premises hardware, there are two major drawbacks:</p> <ol> <li> <p><strong>Limited flexibility</strong>: When DDoS protection is outsourced to a single vendor, organizations cannot benefit from integrations with solutions that are better suited for specific attacks and mitigation needs.</p> </li> <li> <p><strong>Poor analytics</strong>: The outsourced service focuses on identifying and triggering DDoS mitigation. That means you’ll have little visibility into the nature of the attack (e.g., type, origin, volume, parallel threats… was it really an attack?). Without analytics, little can be learned from the attack to understand its impact and prevent future or persistent ones.</p> </li> </ol> <p>Without the ability to dig deeper into the incident, you won’t be able to assess your detection and mitigation strategy, know if it was actually an attack and not caused by something like a misconfiguration, make changes and/or deploy new techniques.</p> <h3 id="using-kentik-to-detect-and-mitigate-ddos-attacks">Using Kentik to Detect and Mitigate DDoS Attacks</h3> <p><a href="https://www.kentik.com/product/protect/">Kentik Protect</a> is the automated, flexible and insights-driven solution for DDoS attacks.
We built this solution for enterprises and service providers to effectively monitor infrastructure and reliably detect and mitigate DDoS attacks at their outset. Just a few of the benefits include:</p> <ul> <li> <p><strong>Accuracy</strong>: The ability to accurately differentiate actual attacks from benign traffic is a key advantage of Kentik Protect. Simplistic DDoS detection policies based on static thresholds will not maintain accuracy over time, while behavior-based approaches suffer from false positives. Kentik Protect uses fine-grained policies, ML-based detection and timely updates of new threats to deliver higher accuracy and effectiveness in the detection of attacks.</p> </li> <li> <p><strong>Automation</strong>: Once an attack is identified, Kentik Protect supports either automated or manually triggered traffic redirection to scrubbing infrastructure, or it can signal upstream infrastructure to drop traffic completely.</p> </li> <li> <p><strong>Informed decision making</strong>: Context-rich detection also enables better decisions on the type of mitigation applied. For example, a high-volume attack should be directed to a mitigation service that can handle it, while other lower-volume attacks might be handled locally.</p> </li> <li> <p><strong>User-experience understanding</strong>: When combined with our <a href="https://www.kentik.com/product/synthetics/">synthetic testing solution</a>, Kentik Synthetics, Kentik Protect allows you to evaluate the actual application user experience worldwide both during the attack and as mitigation actions are implemented.</p> </li> </ul> <h3 id="how-does-kentik-protect-work">How Does Kentik Protect Work?</h3> <p>Kentik Protect uses standards-based protocols (RFC-3882 RTBH or RFC-5575 Flowspec) and ubiquitous network services (e.g., NetFlow, IPFIX, BGP) to integrate with a wide variety of DDoS mitigation solutions and capabilities. Crucially, it also provides detailed traffic analytics to understand the attack and the effects of mitigation tactics in real time.</p> <p>DDoS mitigation using Kentik Protect is a straightforward process, as shown below. Kentik Protect supports the policy management, traffic flow data analysis and <a href="/solutions/usecase/ddos-detection-and-network-security/">DDoS detection</a>, and relies on scrubbing or network infrastructure to execute the mitigation policy:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5ip7XvVoODYrx3dMfxkAXS/cfc5aabaddd8c8cf1e5e09e9e34e7fe4/kentik-protect-ddos.png" alt="DDoS Protection and Mitigation" style="max-width: 800px;" class="image center no-shadow" /> <p>Kentik Protect is simple, efficient and effective. The solution delivers quick time-to-value. It neutralizes attacks at the infrastructure edge and provides visibility of network flows, trends and capacity bottlenecks. DDoS policies can be automated or manual depending on the operational model of the organization.</p> <p>Unlike virtually all alternatives, Kentik Protect also enables the discovery of “sneak attacks” that may be using DDoS to hide, and doesn’t produce false positives based on inaccurate behavioral analysis. And as a SaaS solution, deployment, threat updates and ongoing management of the solution itself are extremely simple.</p> <p>Sleep better at night.
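</p> <p>For readers curious what the RFC-3882-style signaling mentioned above looks like in practice, here is a minimal sketch of remote-triggered blackholing (RTBH). It is not Kentik’s implementation: it assumes an ExaBGP process reading API commands from this script, and the victim address and next-hop are hypothetical placeholders.</p> <pre><code>BLACKHOLE_COMMUNITY = "65535:666"  # RFC 7999 well-known BLACKHOLE community

def rtbh_announce(victim_ip: str, next_hop: str = "192.0.2.1") -> str:
    """Build an ExaBGP API command asking upstreams that honor RTBH
    to drop traffic destined for the attacked host."""
    return (f"announce route {victim_ip}/32 next-hop {next_hop} "
            f"community [{BLACKHOLE_COMMUNITY}]")

# Printed to stdout for an ExaBGP process wired to this script
print(rtbh_announce("198.51.100.7"))
</code></pre> <p>The same signaling idea extends to RFC-5575 Flowspec, which can match on ports and protocols instead of sacrificing reachability to an entire host.</p> <p>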
Check out Kentik Protect for yourself with this <a href="#signup_dialog" title="Start a free Kentik trial">free trial</a> or <a href="#demo_dialog" title="Request your Kentik demo">get a demo</a> with a Kentik expert.</p><![CDATA[Channel Partner Spotlight: TechEnabler]]><![CDATA[TechEnabler is a Kentik Channel Partner headquartered in Sao Paulo, Brazil. In this Q&A, VP of Channel Sales Jim Frey highlights a bit about the partner, who they are and what they do.]]>https://www.kentik.com/blog/channel-partner-spotlight-techenablerhttps://www.kentik.com/blog/channel-partner-spotlight-techenabler<![CDATA[Jim Frey]]>Thu, 24 Jun 2021 04:00:00 GMT<p>TechEnabler, headquartered in Sao Paulo, Brazil, is one of Kentik’s most successful Channel Partners serving the Latin American region. We reached out to the TechEnabler team recently with this Q&#x26;A to look behind that success, and to highlight a bit about who they are and what they do.</p> <div style="border-top: 1px solid #d5dee0; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>Who is TechEnabler?</strong></p> <p><strong>TechEnabler</strong>: The TechEnabler team has multidisciplinary expertise and over 25 years of executive experience in the telecommunications and IT sector.</p> <p>We sell, we connect, and we integrate technologies to reduce operational costs. The result is that we increase efficiency and enable new revenue on customers’ existing infrastructure. More specifically:</p> <ul> <li><strong>We sell</strong>: We strengthen your business team through management and sales actions.</li> <li><strong>We connect</strong>: We connect networks and services to deliver agile and reliable platforms.</li> <li><strong>We integrate</strong>: We integrate different systems and technologies to offer innovative and competitive solutions.</li> </ul> <p>We also have the subsidiaries WeSellTech Ltda. and OutsourceMe Ltda.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2MZG7b1V8MfD44WB86nGrQ/c2b5b786a23eaf97f53b64aae136517d/tech-enabler-org.png" style="max-width: 600px" class="image center" /> <div style="border-top: 1px solid #d5dee0; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What markets do you serve?</strong></p> <p><strong>TechEnabler</strong>: Today we serve Brazil and the Southern Cone, and we’re growing operations in LATAM countries. We’re focused on the enterprise and service-provider markets.</p> <div style="border-top: 1px solid #d5dee0; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What are your areas of expertise?</strong></p> <p><strong>TechEnabler</strong>: At TechEnabler, we seek to establish technological partnerships with relevant players in the market (like Kentik!). We’re always seeking the state of the art in innovation in order to make a difference by adding value to our customers. We do that in a few ways:</p> <ul> <li> <p><strong>New technologies</strong>: We propose innovative technological solutions that allow our customers to optimize existing infrastructures, reduce costs, and increase quality and efficiency.
In other words: infrastructure modernization.</p> </li> <li> <p><strong>Adding value</strong>: We enable cutting-edge technologies that allow our clients to introduce new value-added services (VAS), increasing their revenues by utilizing as much of their existing infrastructure as possible, avoiding significant investments, and providing a migration path to grow new OTT business services.</p> </li> <li> <p><strong>SOCaaS (Service Operation Center)</strong>: We’re pioneers in our region in introducing the technological concept of SOC from the “supplier business view.” That is, we provide integrated solutions in a predominantly as-a-service (OPEX) business model, along with the capacity for “end-to-end” network design: site infrastructure, installation material, hardware and software (HW &#x26; SW) solutions, deployment services, system powering-up, network configurations and commissioning, acceptance testing, end-user services connection and delivery, system integration, end-user product development and implementation, and technical and sales training.</p> </li> </ul> <div style="border-top: 1px solid #d5dee0; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What services do you offer?</strong></p> <p><strong>TechEnabler</strong>: We offer a variety of services, including:</p> <ul> <li>Networking, connectivity, and infrastructure</li> <li>Application network and performance monitoring (NPM&#x26;D)</li> <li>Unified communications platform for operators and enterprises</li> <li>Telecom asset management</li> <li>MVNO and M2M solutions</li> <li>IT and systems development</li> <li>Outsourcing (i.e., services for companies that want to establish a presence in Brazil)</li> </ul> <p>Going into more detail, our portfolio at TechEnabler is based on four pillars:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2IXIqwgY5MOrmFG8Imf9ri/72aafcd6dca8234caad560d8e2d3746d/tech-enabler-soc.png" style="max-width: 800px" class="image center" /> <ul> <li> <p><strong>SOD (Services On Demand)</strong>: These are on-demand services based on service agreements: warehousing &#x26; logistics, network infrastructure (HW &#x26; SW), deployment services, configuration, system activation, and network acceptance testing, all based on our TechEnabler PMO program.</p> </li> <li> <p><strong>TSC (Technical Support Center)</strong>: Network multi-technology and multi-vendor technical support services are available 8x5 or 24x7, and can be customized according to customer needs, with the option of being delivered remotely via SOC or at customer premises by a full-time dedicated specialized resource.</p> </li> <li> <p><strong>NPMD (Network Performance Monitoring &#x26; Diagnostic)</strong>: NPMD services include providing full end-to-end network visibility 24x7 from Layer-2 to Layer-7, automated and customized reports via SOC, and consulting services for data correlation and online network troubleshooting.</p> </li> <li> <p><strong>VAS (Value Added Services)</strong>: We can also apply technologies and financial analysis towards the development of new product offerings to network end-users.
This helps customers generate new revenue and improve competitiveness, while maintaining their existing infrastructure with minimal, affordable investments.</p> </li> </ul> <div style="border-top: 1px solid #d5dee0; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>Want to know more?</strong></p> <p>For more information on TechEnabler, check out their site at <a href="https://www.techenabler.com.br/">www.techenabler.com.br</a>. You can also learn more about <a href="https://www.kentik.com/partners/channel-partners/">Kentik’s Channel Partner Program</a>.</p><![CDATA[How to Monitor Cloud Traffic Through Transit Gateways]]><![CDATA[In this post, we offer insight into how to monitor traffic through transit gateways by understanding routing dynamics and cloud architecture. Learn more.]]>https://www.kentik.com/blog/how-to-monitor-traffic-through-transit-gatewayshttps://www.kentik.com/blog/how-to-monitor-traffic-through-transit-gateways<![CDATA[Daniella Pontes]]>Tue, 22 Jun 2021 04:00:00 GMT<p>Cloud traffic is expanding in every direction: east-west, north-south, inter-regions, across-clouds, to the edge, sites, and more. This is a direct reflection of business strategy and explosive growth in workloads and third-party services accessed by multiple user, consumer, and device types. The end result is complex and often brittle networking environments, and cloud professionals are left in the dark.</p> <h2 id="what-is-a-transit-gateway">What is a Transit Gateway?</h2> <p>A Transit Gateway (TGW) is an AWS routing construct that you can think of as a cloud router. It is meant to scale the management of VPC connectivity within the AWS cloud and to on-premises networks.</p> <p>For AWS cloud networks, the <a href="https://www.kentik.com/kentipedia/aws-transit-gateway-explained/" title="Kentipedia: AWS Transit Gateway: Everything You Need to Know">Transit Gateway</a> provides a way to route traffic to and from <a href="https://www.kentik.com/kentipedia/what-is-a-vpc-virtual-private-cloud/" title="Kentipedia: What is a VPC (Virtual Private Cloud)?">VPCs</a>, AWS regions, VPNs, Direct Connect, SD-WANs, etc. However, AWS offers no easy way to gain visibility into traffic that crosses these devices — unless you know how to monitor Transit Gateways.</p> <p>Without knowing who, what, where, and how, network and cloud engineers have no good way to plan, troubleshoot, or migrate cloud networks. In this post, we’ll offer insight into how to monitor your traffic through Transit Gateways by understanding routing and traffic dynamics through every hop on the end-to-end path.</p> <p>For a high-level overview of AWS Transit Gateway, see our Kentipedia article, “<a href="https://www.kentik.com/kentipedia/aws-transit-gateway-explained/" title="Kentipedia: AWS Transit Gateway: Everything You Need to Know">AWS Transit Gateway: Everything You Need to Know</a>”.</p> <h2 id="building-transit-gateways-to-interconnect-vpcs">Building Transit Gateways to Interconnect VPCs</h2> <p>Applications often need connectivity between workloads in different VPCs and regions as well as to and from on-premises environments (e.g., data centers, offices, branches, etc.). Cloud migrations, microservices architectures, third-party services, and business growth, among other factors, are causing an increase in the number of VPCs deployed — and increasing the need for cloud network visibility.
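</p> <p>The arithmetic behind that growth is unforgiving: a full mesh of one-to-one VPC peerings grows quadratically with the number of VPCs (n(n-1)/2 connections), which is exactly the pressure that pushes teams toward a hub-and-spoke design. A quick back-of-the-envelope sketch:</p> <pre><code>def full_mesh_peerings(n_vpcs: int) -> int:
    """One-to-one peering connections needed to fully mesh n VPCs."""
    return n_vpcs * (n_vpcs - 1) // 2

for n in (5, 50, 500):
    print(f"{n} VPCs -> {full_mesh_peerings(n)} peerings (vs. {n} TGW attachments)")
# 5 VPCs -> 10 peerings (vs. 5 TGW attachments)
# 50 VPCs -> 1225 peerings (vs. 50 TGW attachments)
# 500 VPCs -> 124750 peerings (vs. 500 TGW attachments)
</code></pre> <p>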
From the naive initial thought of one VPC per organization, we now see organizations deploying hundreds and thousands of these virtual networks.</p> <p>Networking in public clouds can be done via different constructs: gateways, endpoints, direct connections, VPNs, <a href="https://www.kentik.com/kentipedia/what-is-vpc-peering/" title="Kentipedia: What is VPC Peering?">VPC peerings</a>, subnet routing tables, etc. Each construct addresses specific networking capabilities, security, capacity, and performance requirements. (Check out <a href="https://www.kentik.com/resources/ebook-network-pros-guide-to-the-public-cloud/" title="Download the Network Pro’s Guide to the Public Cloud">The Network Pro’s Guide to the Public Cloud</a> ebook for a quick tutorial on AWS cloud networking.)</p> <p>There’s a typical pattern in cloud environments: developers usually start with a one-to-one tactical approach to connect workloads in their applications, peering one VPC with another, for instance. See diagram 1 below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/665Wb1PV7r5Y3eGICOxP2l/8676e8bc1310cc37f4e70e2634281481/aws-vpc-peering1.png" style="max-width: 700px; margin-bottom: 0px;" class="image center no-shadow" thumbnail alt="VPC Peering" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Diagram 1: VPC peering mesh architecture</div> <p>When the number of individual connections creates operation and governance challenges, network engineers often shift their organizations into more scalable designs using the Transit Gateways in a hub-and-spoke architecture. See diagram 2 below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6IELvVtfAzU1Mg2XosJG5B/11deaebbe5ee722c328f1ef7226c68c6/aws-transit-gateway-routing.png" style="max-width: 600px; margin-bottom: 15px;" class="image center no-shadow" thumbnail alt="VPCs Attach to Transit Gateway (TGW)" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Diagram 2: VPCs attach to a Transit Gateway in a hub and spoke architecture</div> <p>Transit Gateways are critical for scalable VPC interconnectivity. Transit Gateways are also fundamental for global cloud and hybrid networking. They address the growth in intra- and inter-region connectivity, as well as to/from on-premises infrastructures. 
Transit Gateways function as cloud routers and enable, among other use cases:</p> <ul> <li>Transit Gateway peerings</li> <li>Interconnection of multiple VPCs in an any-to-any hub-and-spoke design</li> <li>Centralized inter-VPC networking</li> <li>Static and dynamic (BGP) routing for VPN and Direct Connect attachments</li> <li>Entry point for on-premises SD-WAN connections</li> </ul> <p>See diagram 3 below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2bHLTZkYl2rcl18Ff80sqn/e99a5f0135e27cb7dbae962aa7d3b6e1/aws-transit-gateway-global-hybrid-network.png" style="max-width: 850px; margin-bottom: 15px;" class="image center no-shadow" alt="Inter-regional Transit Gateway peering architecture" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Diagram 3: Inter-regional Transit Gateway peering architecture</div> <div class="pullquote right" style="max-width: 270px; margin-top: 20px;">How can networkers know what applications require connectivity without the ability to see the traffic?</div> <h3 id="how-transit-gateways-affect-network-visibility">How Transit Gateways Affect Network Visibility</h3> <p>However, moving from ad hoc, one-to-one peering connections to a planned and optimized architecture with Transit Gateways cannot be achieved without visibility. After all, how can networkers know what applications require connectivity without the ability to see the traffic? And how will they know when they can deactivate the old peering connections?</p> <p>Unfortunately, Transit Gateways are natively a black box in the cloud. Managing cloud networking using current tools such as CloudWatch or third-party SIEM solutions is not scalable, viable, or even effective. As a consequence, most organizations run in the dark — enduring high MTTR, costs, and risks.</p> <h2 id="using-cloud-monitoring-tools-to-troubleshoot-issues-with-transit-gateways">Using Cloud Monitoring Tools to Troubleshoot Issues with Transit Gateways</h2> <p>As mentioned in the previous section, Transit Gateways are central constructs to <a href="https://www.kentik.com/kentipedia/what-is-cloud-networking/" title="Kentipedia: What is Cloud Networking?">cloud networking</a>, and yet, no visibility into their traffic is provided. For instance, in a case where a client application that uses data from the cloud shows delays, some questions come up immediately:</p> <ul> <li>Is the network the cause or a contributing factor?</li> <li>How does the data leaving the ENI (of the EC2 instance) reach the application client?</li> <li>How is the traffic routed through a Transit Gateway on the way to the destination host? Does it take an additional hop through a peered Transit Gateway before being forwarded to a Direct Connect connection to the on-premises network?</li> <li>Does it go through a site-to-site VPN attached to the Transit Gateway?</li> <li>Was the traffic blackholed by a route being withdrawn from an adjacent device?</li> </ul> <p>It is very difficult to answer these questions, or gain any insight into the traffic path, using native-cloud monitoring, SIEM, or legacy network monitoring tools.</p> <p>To operate in hybrid cloud networks or cloud environments with a growing number of VPC interconnections using siloed information, engineers have to collate data from multiple sources.
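</p> <p>Even with the AWS SDK, stitching those sources together means writing glue code. As a rough illustration (a sketch, not a Kentik feature; the region and VPC ID are hypothetical placeholders), this boto3 snippet automates only the first hop of the investigation: finding which Transit Gateways a given VPC attaches to.</p> <pre><code>import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical VPC under investigation
VPC_ID = "vpc-0123456789abcdef0"

# Which Transit Gateway attachments involve this VPC?
resp = ec2.describe_transit_gateway_attachments(
    Filters=[{"Name": "resource-id", "Values": [VPC_ID]}]
)

for att in resp["TransitGatewayAttachments"]:
    print(att["TransitGatewayId"],
          att["TransitGatewayAttachmentId"],
          att["State"])
</code></pre> <p>That covers one VPC in one region, and the routing questions still require walking the Transit Gateway route tables from there. In practice, engineers fall back to the consoles.</p> <p>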
They must search through various tables, browse different application UIs, and learn the local query language to search through storage buckets or data lakes — all while manually keeping mental track of what IP addresses correlate to VM instance names, VPC IDs, etc. This gets old fast — it’s hard to do quickly and certainly doesn’t scale.</p> <p>Let’s step into the shoes of those trying to resolve issues using siloed data and views. For every analysis, network engineers and cloud professionals would need to:</p> <ol> <li>Get the VPC ID from the instance’s EC2 UI</li> <li>Get the Transit Gateway ID associated with that VPC ID by browsing the route tables in the VPC UI</li> <li>Now shifting to the Transit Gateway UI, browse to the same Transit Gateway ID, change to the Attachment UI, and find the Attachment ID associated with the VPC ID where your instance is deployed</li> <li>Find the Transit Gateway route table associated with the Attachment ID (associated with the source VPC)</li> <li>Look up the attachments associated with the destination IP/CIDR. The attachments could be to another Transit Gateway (in this case, investigation continues) or to the final on-premises connections (e.g., Direct Connect, site-to-site VPN)</li> </ol> <p>Now think of this tedious workflow, while considering that a given VPC can attach to up to 5 Transit Gateways, and each Transit Gateway supports tens of routing tables, 50 peering connections to other Transit Gateways, and 5000 VPC attachments. The scenario can turn into a connectivity maze without any visibility into how the traffic is flowing through the cloud.</p> <p>With this kind of workflow in complex environments, MTTR can easily rise, and hardly any effective proactive measures can be taken along the way to keep MTTR within objectives.</p> <h2 id="gain-live-visibility-into-transit-gateways-and-network-monitoring-with-kentik-cloud">Gain Live Visibility into Transit Gateways and Network Monitoring with Kentik Cloud</h2> <p>A holistic visualization of the network environment is fundamental to understanding it. With <a href="https://www.kentik.com/product/cloud/" title="Learn more about Kentik Cloud">Kentik Cloud</a>, all manual steps collapse into an intuitive live map with all data gathered, automatically processed, and catered to answer routine and investigative questions. Cloud architects and network engineers can interact directly on the map to view, hop by hop, metrics and details on traffic, routing, and metadata. Anyone can get answers about intra- and inter-clouds, on-premises networks, and the internet with a few clicks.</p> <p>See below how easy it is to figure out what is going through your Transit Gateways:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5bYntPrPVn1BNPsG0hQgvy/ff5d2cf2d9a5c408462d76c27c5b38a1/kentik-map-details.png" style="max-width: 850px; margin-bottom: 15px; border: 1px solid #eeeeee;" thumbnail class="image center" alt="Kentik Map - View into Total Cloud Network" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Kentik Map gives you a detailed view into your total cloud network</div> <p>From a map UI, a selected VPC has all its traffic relationships displayed. Clicking on any element of the map will provide traffic data and context.
You can dig into traffic using the Explorer view (on the right) and apply popular filters, such as top n, source/dest IP address, instance name, gateway type, etc.</p> <p>Don’t miss <a href="https://www.kentik.com/resources/aws-transit-gateways-visualizing-cloud-routing-tech-talk-10/">Dan Rohan’s Tech Talk on Transit Gateways</a> for a short demo to experience how easy it is to visualize and troubleshoot hybrid cloud networks with Kentik.</p> <h2 id="how-kentiks-network-observability-platform-can-answer-your-cloud-architecture-questions">How Kentik’s Network Observability Platform Can Answer Your Cloud Architecture Questions</h2> <p>Complex public and hybrid cloud networking environments cannot be understood in pieces, let alone troubleshot under the pressure of mounting business losses and user complaints. Holistic understanding is required, and that is what Kentik’s network observability delivers. What it means to network engineers, SREs, developers, and cloud professionals is that you can answer any network question at any time.</p> <p>Questions such as:</p> <ul> <li>What composes the traffic flows between VPCs and clouds?</li> <li>What are the paths and network constructs that the traffic flows through?</li> <li>Where are applications dependent on the network? And how are these traffic flows performing?</li> <li>How can my cloud networks be optimized for cost and security posture?</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/Fzrbkx2kVPCLH5W8jEBOr/e0dd10fe139f01b450d4881d93f108bc/aws-visibility-benefits.png" style="max-width: 750px; margin-bottom: 15px;" class="image center no-shadow" alt="Benefits of Kentik Cloud" /> <p>With Kentik, you can plan, run, and fix safe, scalable, and cost-effective hybrid cloud networks. <a href="#signup_dialog" title="Start a Free Kentik Trial">Start a trial</a> and experience how easy it is to manage cloud networks with Kentik Cloud. Or, <a href="#demo_dialog" title="Request a Kentik Demo">schedule a demo</a> with the network observability experts at Kentik.</p><![CDATA[Employee Spotlight: Steven Reynolds - UI Engineer]]><![CDATA[In our employee spotlight series, we highlight members of the Kentik team, what they’re working on, and their most memorable moments within the company. In this Q&A, meet Steven Reynolds, a Kentik UI engineer.]]>https://www.kentik.com/blog/employee-spotlight-steven-reynolds-ui-engineerhttps://www.kentik.com/blog/employee-spotlight-steven-reynolds-ui-engineer<![CDATA[Michelle Kincaid]]>Thu, 17 Jun 2021 04:00:00 GMT<div style="float: right; width: 170px"><img src="https://images.ctfassets.net/6yom6slo28h2/6AR6dGQYYEQZfGV97RZxQt/ea7bdcaabdb815319f103c5f82965d26/stephen-reynolds.png" alt="Steven Reynolds" class="no-shadow" style="max-width: 170px;" /><p style="text-align: center; font-size: 98%;"><b>STEVEN REYNOLDS</b><br />UI Engineer</p></div> <p><strong>What is your role at Kentik?</strong></p> <p><strong>Steven:</strong> I’m a UI engineer.
I build the interfaces and data visualizations that our customers use to understand their networks.</p> <div style="border-top: 1px solid #d5dee0; width: 70%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>How long have you been at Kentik?</strong></p> <p><strong>Steven:</strong> 8 months</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>Where do you live?</strong></p> <p><strong>Steven:</strong> Central Texas</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What made you want to work at Kentik?</strong></p> <p><strong>Steven:</strong> A recruiter reached out to me and got me interested. However, it was my interactions with managers and the team’s VP during my interview that really sealed the deal. During these conversations, I got a real representation of what work at Kentik would be like, and I was eager to join the group.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What’s the coolest thing you’ve been able to do at Kentik?</strong></p> <p><strong>Steven:</strong> Building cost breakdown visualizations has been pretty cool! This part of our UI helps make it easy for network teams to manage and optimize their connectivity costs, and you can see a few of the visualizations in <a href="https://www.kentik.com/product-updates/may-2021-connectivity-cost-workflow/">this product update</a>.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What keeps you at Kentik?</strong></p> <p><strong>Steven:</strong> Beyond a close-knit and brilliant team, two things stick out: the variety of work and the incredible benefits (seriously, we still cannot believe the benefits).</p> <p>First, the leadership at Kentik has allowed and encouraged me to spend time with my family rather than feeling like work and family are in conflict. Second, the level of transparency within the company from the CEO down is refreshing and makes me feel integral to what’s going on here.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>How would you describe your coworkers?</strong></p> <p><strong>Steven:</strong> Brilliant engineers who don’t take themselves too seriously.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What’s your most memorable or favorite company event or offsite?</strong></p> <p><strong>Steven:</strong> Zoom Happy Hours!
I’m also looking forward to making more Kentik memories as things open back up and the team can get together.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What’s your favorite Kentik Slack topic?</strong></p> <p><strong>Steven:</strong> #wtf-parenting</p> <p><em>(Kentik sidenote: This Slack channel includes everything from parenting advice and friendly memes to questions like “Should I get a minivan?” and real-life moments like, “OMG, my kid just cut her own hair!”)</em></p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p><strong>What are you most excited about over the next year or two here?</strong></p> <p><strong>Steven:</strong> Kentik’s growth plans are truly exciting, as is the thought of being able to finally meet some teammates face-to-face!</p> <div style="border-top: 1px solid #d5dee0; width: 100%; margin-top: 30px; padding-bottom: 30px;"></div> <p>Want to work with Steven and the rest of the Kentik team? We’re hiring! Check out our <a href="https://www.kentik.com/careers/#postings">open positions</a>.</p><![CDATA[5 Things to Know About Monitoring Your Cloud Network]]><![CDATA[Networking in the cloud can be like a black box. In this blog we discuss five essential properties of network observability for cloud, giving you the ability to answer any question about your cloud network.]]>https://www.kentik.com/blog/5-things-to-know-about-monitoring-your-cloud-networkhttps://www.kentik.com/blog/5-things-to-know-about-monitoring-your-cloud-network<![CDATA[Kevin Woods]]>Tue, 15 Jun 2021 04:00:00 GMT<p><em>Understanding network performance in your cloud environment is essential for maintaining cloud application performance and reliability. However, most organizations find that they lose network visibility when they move to the cloud. This blog highlights five critical cloud network monitoring imperatives for cloud engineers to put in place to ensure the health of public and hybrid cloud environments.</em></p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p>Moving to the cloud makes many things easier. The cloud’s flexibility and elasticity allow you to add compute, storage and other resources rapidly, and to scale up and down as your needs change.</p> <p>Yet to leverage the cloud’s elasticity to the greatest effect, you must be able to manage your network configurations to ensure that performance remains optimal even as your cloud environment scales up and down.</p> <p>That’s why network observability is critical in the cloud. The cloud makes it easy to change configurations, but not so easy to verify that changes are optimal — especially when it comes to cloud networking, which can involve a multitude of VPCs, gateways and related services.</p> <p>How do cloud engineers meet these challenges? They start by focusing on five key best practices for achieving network observability across their entire cloud environment — whether it includes just one public cloud, multiple public clouds or a hybrid combination of public and private resources.</p> <h3 id="cloud-network-observability-requires-multiple-data-sources">Cloud Network Observability Requires Multiple Data Sources</h3> <p>Achieving consistent visibility into your cloud networks requires more than just one data source or set of data points. 
You need to be able to collect data from multiple facets and layers of your cloud environment, including:</p> <ul> <li><strong>Flow logs</strong>: Flow logs record the granular movement of traffic as it travels between instances, gateways and endpoints within your cloud environment.</li> <li><strong>Network metrics</strong>: Metrics like throughput and utilization allow you to measure the health and reliability of the elements composing your cloud network.</li> <li><strong>Metadata</strong>: Correlated tags and business metadata such as names or cost information make it easier to contextualize information and understand what you’re observing.</li> <li><strong>Synthetic tests</strong>: Synthetic testing, which is an efficient means to track performance, helps you find and investigate issues before they grow to become problems for end-users.</li> <li><strong>Application metrics</strong>: Although applications are not the focus of network observability, they provide important metrics like response duration and error rate, which can be correlated with other data to evaluate the scope and user impact of networking issues.</li> </ul> <h3 id="business-metadata-is-key">Business Metadata is Key</h3> <p>Simply collecting data is only the first step toward network observability for cloud. Just as important is the correlation of data to gain context into what is happening — to be able to understand the data you are seeing.</p> <p>This means not just correlating which applications and workloads are generating the traffic, but also bringing in names, cost information and locations.</p> <p>For example, you need to know quickly whether a network slowdown impacts production environments or is limited to dev/test. The difference will tell you how to prioritize the incident. Likewise, mapping IP addresses to specific business units or users helps you understand network observability trends not just from the perspective of machine data, but from the business.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1aIs7FVIjVCuaC1JJeZNkL/f27b35b0b85fb612073433e04c0e34db/observation-deck-choose-components1.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div style="max-width: 700px; font-size: 96%; text-align: left; margin: 0 auto; margin-bottom: 25px;">Observation Deck allows people to choose their favorite components (which are pre-built as “widgets”) and places them side-by-side with their custom views, including their Data Explorer, dashboards, etc. This puts the information you need front-and-center.</div> <h3 id="traffic-flows-in-many-directions-observe-them-all">Traffic Flows in Many Directions. Observe Them All.</h3> <p>In a world where the majority of businesses that use the cloud have or are moving to multiple clouds, and many use a hybrid model, gaining visibility into cloud networks requires a multifaceted approach that tracks traffic flows within and between clouds. Not only that, but you must also be able to monitor traffic flows between different cloud regions and to on-premises.</p> <p>And again, you need to be able to contextualize all of this traffic flow in terms of business operations. Maybe you use one of your regions in one of your clouds only to host data backups, for instance, whereas another region hosts production data. A slowdown in traffic to the backup region may be less harmful to the business than a networking performance issue with the production region.
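</p> <p>One way teams make that distinction automatic is to join each flow record against a metadata inventory before analyzing it. Here is a minimal sketch of the idea; the field names, interface IDs, and tags are hypothetical, not a specific product schema:</p> <pre><code># Hypothetical inventory mapping network interfaces to business context
ENI_TAGS = {
    "eni-0abc": {"env": "production", "team": "payments"},
    "eni-0def": {"env": "backup", "team": "payments"},
}

def enrich(flow: dict) -> dict:
    """Attach business metadata to a raw flow log record."""
    tags = ENI_TAGS.get(flow.get("interface_id"), {})
    return {**flow, **tags}

flow = {"interface_id": "eni-0abc", "dst": "10.1.2.3", "bytes": 1_048_576}
print(enrich(flow))  # the env tag tells you this slowdown touches production
</code></pre> <p>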
Context like this is critical for understanding the impact of traffic flow patterns on business operations.</p> <p>In addition to conventional east-west traffic within a single cloud (or cloud region), you must also measure and analyze north-south traffic as it flows from one cloud data center or your private data center into others.</p> <p>For new applications developed by engineers who are less familiar with networking practices, it is imperative to spot sub-optimal traffic flows that may introduce unnecessary traffic delays, costs or security risks.</p> <h3 id="ask-questions-and-visualize-answers">Ask Questions and Visualize Answers</h3> <p>As you measure all of these traffic flows, look not just for minimum and maximum metrics thresholds in order to establish a baseline of normal activity, but also for anomalies that could reveal a network performance or security issue. You’ll want to be able to answer questions such as:</p> <ul> <li>Why did a spike in traffic occur? Is it congestion, an attack, an application issue or something else?</li> <li>How did a change in a cloud service configuration impact traffic flows associated with that service?</li> <li>How does a networking configuration change impact application response time and error rates?</li> <li>How much am I spending on cloud egress, and how do networking changes correlate with my egress bill?</li> <li>Which clouds or regions are generating and receiving the most traffic?</li> </ul> <p>Remember, too, that because egress plays a big role in your cloud computing bill, you’ll want to identify top talkers and top spenders — meaning resources that generate the highest amounts of traffic and associated egress costs — in order to help manage your network spend. As you analyze cloud network data, take a question-and-answer approach to interpret the data and understand its impact on the business.</p> <p>For a longer discussion of questions to ask yourself as you observe your network, and tips on where to find the answers, check out our <a href="/blog/the-network-also-needs-to-be-observable-part-3-network-telemetry-types/">blog on network telemetry types</a>.</p> <h3 id="a-data-platform-capable-of-supporting-the-data-that-you-need">A Data Platform Capable of Supporting the Data That You Need</h3> <p>Using a SaaS-based solution for cloud network observability has some distinct advantages. SaaS gives you flexibility in accessing data from the services you need and deploying any network agents that you need to gather telemetry data quickly and easily. SaaS also gives you the ability to gather all of your network telemetry data into a single platform optimized for network data. And, because the service is cloud-based, scalability and performance can grow easily with the need.</p> <p>Delivering a capable network observability service for cloud requires some unique attributes. First, the data platform needs to be optimized for high-cardinality data. Network telemetry comes in dozens of types and hundreds of attributes, and each attribute can have millions of unique values. Second, the SaaS-based platform needs to support massive multi-tenancy. Operational data telemetry platforms need to be able to support dozens of queries, not just from interactive users via the UI or API, but from other operational systems within the network stack and across an organization’s operational systems.
Finally, users need to be able to ask complex questions and get answers in seconds, not hours, to support modern ‘train of thought’ diagnostic workflows — thus the need for immediate query results.</p> <h3 id="conclusion">Conclusion</h3> <p>Maintaining network observability in the cloud requires a fundamentally different approach than engineers take to monitoring networks within a single data center. Not only do cloud teams need to collect more types of data, but they must also be able to correlate and contextualize them in order to associate performance issues with different parts of their networks. Visualization is critical, too, for making sense of the complex flows and layers that shape cloud network performance. And finally, you need a data platform that is capable of supporting the unique requirements of network observability data.</p><![CDATA[Customer Spotlight: How EdgeUno Delivers Leading Internet Services in LATAM]]><![CDATA[Read how this leading internet services provider in Latin America uses Kentik for peering and capacity planning and a real-time view of performance and availability throughout the network.]]>https://www.kentik.com/blog/customer-spotlight-how-edgeuno-delivers-leading-internet-services-in-latamhttps://www.kentik.com/blog/customer-spotlight-how-edgeuno-delivers-leading-internet-services-in-latam<![CDATA[Michelle Kincaid]]>Wed, 09 Jun 2021 04:00:00 GMT<p>EdgeUno was founded with a central focus: to improve the availability and quality of internet services in countries with emerging economies. With 20 PoPs and more than 2,500 internetwork connections, if you stream content or look at social media in Latin America today, you’re probably using the EdgeUno network.</p> <p>Having network observability is essential for EdgeUno to effectively manage its multinational network and deliver optimal performance and reliability of services to customers. That’s why the service provider turned to Kentik.</p> <p><a href="/resources/edgeuno-delivers-leading-internet-services-in-latin-america-with-network/">Read our newest case study</a> to see how EdgeUno uses Kentik for peering and capacity planning and to unlock a comprehensive, real-time view of performance and availability throughout the network. With Kentik, EdgeUno:</p> <ul> <li>Optimizes costs</li> <li>Reduces MTTR</li> <li>Responds faster with cross-team collaboration</li> </ul> <p>“With Kentik, we can detect issues, performance degradation and bandwidth requirements before they become problems,” says Marco Cabral, manager of EdgeUno’s network operations center (NOC). “Kentik helps us guarantee the best digital experience for our customers.”</p> <p>Download the <a href="/resources/edgeuno-delivers-leading-internet-services-in-latin-america-with-network/">EdgeUno case study</a> or check out our <a href="/product/service-provider-analytics/">Service Provider Analytics</a>.</p><![CDATA[Unifying Application and Network Observability with New Relic]]><![CDATA[Today we expand our partnership with New Relic. Together, we’re deepening New Relic’s full-stack observability into the network layer and giving IT operations, SREs and development teams shared context to resolve issues quickly.]]>https://www.kentik.com/blog/unifying-application-and-network-observability-with-new-relichttps://www.kentik.com/blog/unifying-application-and-network-observability-with-new-relic<![CDATA[Nick Stinemates]]>Wed, 26 May 2021 04:00:00 GMT<p>Today is an exciting day here at Kentik.
We’re officially expanding our <a href="https://www.kentik.com/go/newrelic">partnership with New Relic</a> to unify application and network observability.</p> <p>You may recall, back in December, we announced the <a href="https://www.kentik.com/press-releases/kentik-partners-with-new-relic-network-enriched-application-insights/">start of our partnership with New Relic</a> to provide our joint-customers with a way to combine DevOps and NetOps observability using the <a href="https://www.kentik.com/resources/kentik-firehose/">Kentik Firehose</a>. Starting today, the expansion of our partnership grants New Relic users the ability out-of-the-box to add network context to their traditional application monitoring environments directly within New Relic One.</p> <p>This is a big deal for a couple of reasons, which Kentik Co-founder and CEO Avi Freedman and New Relic’s Buddy Brewer, Group VP of Strategic Partnerships, discussed on stage this morning at <a href="https://newrelic.com/futurestack">FutureStack</a>, New Relic’s annual conference. Buddy also shared a good summary of why we’re doing this via his <a href="https://newrelic.com/blog/how-to-relic/network-observability">latest blog post</a>:</p> <blockquote><em>When software fails, it happens in unexpected ways and at the worst time. And when there is a problem in the middle of the night, software engineers often suspect a network failure. But usually they don’t have the tools or context to diagnose the problem. So they wake up the on-call network engineer, even when they're unsure if there is a network issue at all. As you can imagine, this leads to a lot of frustration, finger-pointing, and slow issue resolution.<br /><br /> <p>DevOps and NetOps teams know that there must be a better way. Since modern architectures have eliminated many of the technical boundaries, the time has come to help eliminate the boundaries between teams.<br /><br /></p> <p>With network context from Kentik in New Relic One, teams will have even better information to help “rule out” or “rule in” the network as the cause of an issue. They can more often let their NetOps team sleep when the network isn’t to blame, and provide better context to them when the network is implicated in an issue.</em></blockquote></p> <p>Together with New Relic, we’re helping network and development teams quickly identify and troubleshoot application performance issues correlated with network traffic, performance and health data. Ultimately, this makes services more reliable.</p> <p>Learn more about our expanded partnership and sign up for the early-access program at <a href="/go/newrelic">kentik.com/newrelic</a>.</p><![CDATA[Why and How to Track Connectivity Costs]]><![CDATA[As today’s economy goes online, network costs can be a determinant factor to business success. Failure to strategize and optimize connectivity expenses will naturally result in a loss of competitiveness. 
Addressing customer needs, Kentik launched a new automated workflow to manage connectivity costs, instrument contract-term negotiations in a timely manner, and stay on top of optimization opportunities, all in Kentik’s user-friendly style.]]>https://www.kentik.com/blog/why-and-how-to-track-connectivity-costshttps://www.kentik.com/blog/why-and-how-to-track-connectivity-costs<![CDATA[Greg Villain]]>Mon, 24 May 2021 04:00:00 GMT<p><em>Network Interconnection: Part 2 of our new “Building and Operating Networks” series</em></p> <h3 id="previously-on-network-interconnection">“Previously on Network Interconnection…”</h3> <p>This article is the second in a series around building and operating networks. The previous article in the series (<a href="https://www.kentik.com/blog/building-and-operating-networks-assuring-what-matters/">Building and Operating Networks: Assuring What Matters</a>) sets the stage and defines the specifics of the most common networks out there, and it lays out a framework that helps benchmark, track and plan for building and operating them.</p> <p>One of the foundational ambitions of <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/">network observability</a> is to provide actionable answers to any question related to your network and cloud infrastructure. These answers are also required to be consumable by a broader audience than the network expert, breaking the boundaries of usual Net/Dev/Ops teams to support modern software engineering, finance and executive teams, to name a few.</p> <p>As explained previously, all network and cloud infrastructure requires cohesive instrumentation across these three dimensions:</p> <ul> <li>Performance</li> <li>Cost</li> <li>Reliability</li> </ul> <p>As a follow-up, today we’re going to focus on the cost aspect of interconnection.</p> <h3 id="why-connectivity-costs-are-important">Why Connectivity Costs Are Important</h3> <p>As today’s economy continues to move online, network and cloud infrastructure become an increasing part of any enterprise’s cost of goods and services (COGS). With this in mind, failure to track, plan and optimize them naturally results in loss of competitiveness.</p> <p><strong>An eyeball network</strong> is an ISP with residential subscribers at the content-consuming edge of the internet. This type of ISP offers internet access plans to residential subscribers in markets that can be so competitive that the list price for a plan is often dangerously close to the cost of the bandwidth to serve content to the subscribers, and the margins are paper thin.</p> <p><strong>A content provider network</strong>’s profitability almost exclusively relies on its ability to serve subscribers at a cost offset by the monetization of their content. Managing cost versus performance tradeoffs across a potentially global footprint of end users is a key challenge: any investment in local performance must be carefully weighed against the direct ability to monetize it.</p> <p><em>For the purpose of this article, we’re only interested in <a href="https://www.kentik.com/product/edge/">edge connectivity costs</a>.
We’ll save backbone infrastructure costs for a later day.</em></p> <h3 id="the-hard-things-about-connectivity-costs">The Hard Thing(s) About Connectivity Costs</h3> <p>Before we dive into how network observability solves for connectivity costs, let’s take a moment to list the roadblocks standing in the way of an efficient cost strategy in today’s networks.</p> <h4 id="silos-and-tools">Silos and Tools</h4> <img src="https://images.ctfassets.net/6yom6slo28h2/3LbxqfRhUGjmge1YyENhCQ/f2491c4514b64ec70790a5ed0a3ddfc2/siloes-tools.png" style="max-width: 550px;" class="image center no-shadow" thumbnail alt="connectivity costs" /> <p>One of the fundamental issues with connectivity costs is that they don’t strictly belong anywhere in the company. The network engineering team plans for capacity and provisions interconnections, but the finance team holds the keys to the chest, with the executives demanding clear visibility on COGS.</p> <p>This chasm is aggravated by the lack of a unified tool that can handle all that is involved in managing interconnection costs. In most cases, repositories for the necessary information are separate and rarely interconnected, and any useful computed information requires heavy spreadsheet wrangling to bridge these gaps.</p> <p>So what do we need to track towards managing connectivity costs? Technically not much. Connectivity is metered by providers based on the maximum of in- or out-traffic on a set of interfaces. With the price per Mbps defined in a contract, just multiply the max traffic value by the unit cost at the end of the month and you’re done.</p> <p>While theoretically simple, in practice, calculating connectivity cost can become quite the epic quest. The roadblocks for extracting and munging the required data points across different silos can be tedious to overcome due to the following:</p> <p><strong>1. Documentation of all external interfaces’ traffic counters and providers</strong><br> SNMP traffic counters from edge devices are used by providers to bill for connectivity. (Note that flow-based billing is usually not used due to imprecisions caused by sampling.)</p> <p>The usual place for this is either:</p> <ul> <li>Router configuration repositories</li> <li>Any SNMP monitoring system</li> </ul> <p><strong>2. Documentation of all types of interconnections to be able to slice and dice on coherent traffic datasets</strong><br> Costs need to be evaluated within categories. While it is useful to track an overall price per Mbps, optimizations are done differently based on the type of interconnectivity (e.g., private or public peering, transit, etc.). This implies the need for additional metadata to qualify interconnection interfaces.</p> <p><strong>3. Telemetry for all of the edge devices’ external interfaces</strong><br> Since connectivity costs are mainly measured by traffic volume in or out of the network edge devices, we need traffic information for every port involved in external connectivity. As mentioned, SNMP counter data is the preferred method to meter traffic volume, keeping in mind that 5-minute samples are the accepted practice for invoicing.</p> <p><strong>4. Cost model documentation for every provider contract</strong><br> Connectivity contracts rely on different rates and models. Some are flat rates for a certain port, some are 9Xth percentile-based, and some are tiered with a wide range of subtle variations.
Each of these models usually sits in a contract stored somewhere in the finance silo, most likely inaccessible in any programmatic way.</p> <p><strong>5. Calendar information such as monthly invoice date or contract term</strong><br> Lastly, as invoices arrive every month, the potential for incorrect provider invoices grows with the number of providers and contracts, aggravated by the difficulty of independently computing what each invoice should be.</p> <p>As a byproduct, connectivity contract renewals rarely get the timely attention and traffic-history analysis they deserve, resulting in new rates being negotiated without proper instrumentation or target price goals.</p> <h4 id="cost-models-galore">Cost Models Galore</h4> <p>As explained previously, the usual repository for documenting a connectivity provider’s cost model is the order form sitting in the legal or finance department. Extracting this metadata to overlay on top of interface traffic data usually goes through the magic of spreadsheets. While formulas will help those with strong spreadsheet-fu, turning each contract into a formula is a chore that soon becomes a burden to maintain and use. Moreover, each model in use in the industry can come with significant complexity that just cannot be documented in a formula. Here are some examples (a small worked sketch follows the list):</p> <ul> <li> <p><strong>Type of percentile</strong> (95th, 98th, …) used on the interface to meter the traffic: The most commonly used percentile is the 95th, but certain exceptions exist with more or less tolerant percentiles.</p> </li> <li> <p><strong>Bundled interfaces computation method</strong>: The percentile computation model gets more complex when multiple interfaces are under the same contract. Using the “Peak of Sums” method vs. the “Sum of Peaks” method to compute it can lead to very different results, and both are in use in the industry.</p> </li> </ul> <img src="https://images.ctfassets.net/6yom6slo28h2/VZT2cOOmS5cVKI7H71Z6u/9c1c8b5492af4f5d9ca806c67995295d/peak-of-sums_2.png" style="max-width: 800px;" class="image center no-shadow" thumbnail alt="Peak of Sums" /> <ul> <li> <p><strong>Tier computation method</strong>: This is when connectivity is contracted using traffic tiers with a different price per Mbps for each. Two distinct methods are available to compute monthly costs: adding the costs for each tier completed (blending mode), or only taking into account which tier the interfaces land in at the end of the month (parallel tiering).</p> </li> <li> <p><strong>Additional charges</strong>: Often left out of connectivity cost modelling are individual charges that can apply to a given contract or to interfaces within it. While the price of cross-connects to a peered network may not outweigh the cost of traffic, these are still OPEX entries that add up over the entire edge and as such need to be documented, computed, and tracked accordingly.</p> </li> </ul>
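<p><em>To make these variations concrete, the sketch below shows how the main metering models differ when applied to the same month of 5-minute samples. This is a minimal, illustrative example: every number, tier boundary, and rate in it is invented, and real contracts add further subtleties.</em></p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Illustrative-only sketch of common connectivity billing models.
# Input: one month of 5-minute SNMP samples per interface, in Mbps.

def percentile(samples, pct=95):
    """Nth-percentile metering: rank the samples and discard the top (100 - pct)%."""
    ordered = sorted(samples)
    return ordered[max(0, int(len(ordered) * pct / 100) - 1)]

def billable_mbps(in_samples, out_samples, pct=95):
    """Providers generally bill on the greater of the in/out percentile values."""
    return max(percentile(in_samples, pct), percentile(out_samples, pct))

def sum_of_peaks(bundle, pct=95):
    """Meter each interface in the contract separately, then add the results."""
    return sum(billable_mbps(i, o, pct) for i, o in bundle)

def peak_of_sums(bundle, pct=95):
    """Add the interfaces together sample-by-sample, then take one percentile."""
    total_in = [sum(t) for t in zip(*(i for i, o in bundle))]
    total_out = [sum(t) for t in zip(*(o for i, o in bundle))]
    return billable_mbps(total_in, total_out, pct)

def tiered_cost(mbps, tiers, blended=True):
    """tiers: ascending list of (tier_ceiling_mbps, price_per_mbps).
    Blending mode prices each slice of traffic at its own tier's rate;
    parallel tiering prices all traffic at the rate of the tier you land in."""
    if not blended:  # parallel tiering
        for ceiling, price in tiers:
            if ceiling >= mbps:
                return mbps * price
        raise ValueError("traffic exceeds the highest contracted tier")
    cost, floor = 0.0, 0.0
    for ceiling, price in tiers:
        if mbps > floor:
            cost += (min(mbps, ceiling) - floor) * price
        floor = ceiling
    return cost

# Example: 300 Mbps against $2.00 / $1.50 / $1.00 tiers of 100/500/1000 Mbps:
tiers = [(100, 2.00), (500, 1.50), (1000, 1.00)]
print(tiered_cost(300, tiers, blended=True))    # 500.0 (blending mode)
print(tiered_cost(300, tiers, blended=False))   # 450.0 (parallel tiering)</code></pre></div> <p>With the same bundle of interfaces, “Peak of Sums” typically comes out below “Sum of Peaks” whenever the member interfaces don’t peak at the same time, which is one more reason why the exact computation method named in the contract matters.</p>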
<p>All of the above results in the following inefficiencies, ensuring that the cost review exercise will never be performed frequently enough:</p> <ul> <li>The formula used for each provider is difficult to understand at a glance without reading the contract.</li> <li>There is no purpose-built tool to document such cost models and keep them up to date.</li> <li>It’s difficult to estimate minimum monthly spend for each provider, and it worsens as their number increases.</li> <li>An engineer or finance person always needs to manually merge cost data with traffic data.</li> </ul> <p>Network observability allows you to overlay configurable financial models over traffic accounting data without any additional effort needed beyond configuring these models in a very intuitive interface.</p> <p><em>The screenshots below, from the Connectivity Costs workflow update, show models that are easy to understand at a glance.</em></p> <img src="https://images.ctfassets.net/6yom6slo28h2/1Cfd6NkIbSnWjVuA9WuyWE/b5a5eb036d2b4cb2cecbeda4d2f51042/cost-groups.png" style="max-width: 800px;" class="image center" thumbnail alt="Cost Groups" /> <h4 id="lack-of-a-baseline-metric-to-track-and-tune">Lack of a baseline metric to track and tune</h4> <p>A solid interconnection strategy largely relies on:</p> <ul> <li>Being able to set goals</li> <li>Being able to track progress towards these goals</li> </ul> <p>Both of these require taking the “art” out and keeping the science via simple analytics. Analytical interconnection decision-making requires routinely answering questions such as:</p> <ul> <li>How does our overall transit cost compare to our peering costs?</li> <li>Is it worth it for us to pay for this specific local provider vs. our current transit upstream?</li> <li>Cost-wise, how do these two transit providers compare? Should I cancel one? Should I add one?</li> <li>How have my costs evolved for this provider over the past X months?</li> </ul> <p>The above can only be achieved by being able to formulate cost per Mbps over any two of these three dimensions: connectivity type (transit, IX, paid peering, free peering, direct connect…), provider, and points of presence (sites).</p> <p>Efficient planning and goal setting requires not only being able to ask the system at any moment what the spend and cost per Mbps are for any of these dimensions (connectivity type, provider, and sites), but also being able to evaluate their evolution against strategic goals.</p> <p><em>The below screengrab illustrates how Kentik’s Connectivity Cost workflow provides total monthly spend and cost per Mbps benchmark metrics.</em></p> <img src="https://images.ctfassets.net/6yom6slo28h2/5hjdUabejaRHRWVSaEQcNE/c4170608efdbf3df47f388498f646f43/total-monthly-spend.png" style="max-width: 800px;" class="image center" thumbnail /> <p>It is also worth noting that while the retention of network telemetry (flow) data is defined by a given contractual window, connectivity costs are not bound by this limitation. Connectivity costs are computed every month and are stored pre-calculated virtually forever, making sure that useful cost history is not lost when it falls out of the flow retention window. This allows for planning and goal tracking over much longer periods, as is usually required.</p>
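<p><em>For readers who like to see the arithmetic behind the benchmark, here is a tiny sketch that derives spend and cost per Mbps along any one of these dimensions. All providers, sites, and figures are invented for illustration.</em></p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">from collections import defaultdict

# Hypothetical monthly records: (connectivity type, provider, site, spend USD, billed Mbps)
invoices = [
    ("transit",      "CarrierA",   "NYC1", 12000, 40000),
    ("transit",      "CarrierB",   "AMS2",  9500, 25000),
    ("ix",           "IX-East",    "NYC1",  2000, 18000),
    ("paid peering", "EyeballNet", "AMS2",  3000,  9000),
]

def benchmark(records, dim):
    """Group spend and billed Mbps by one dimension (0=type, 1=provider, 2=site),
    then derive the $/Mbps benchmark for each group."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for rec in records:
        totals[rec[dim]][0] += rec[3]  # spend
        totals[rec[dim]][1] += rec[4]  # billed Mbps
    return {key: (spend, spend / mbps) for key, (spend, mbps) in totals.items()}

# e.g., transit vs. peering: how does the blended unit cost compare?
for key, (spend, unit) in benchmark(invoices, dim=0).items():
    print(f"{key}: ${spend:,.0f}/month at ${unit:.3f}/Mbps")</code></pre></div> <p>The same grouping run by provider or by site, month over month, is what turns the cost review from an art into a trackable metric.</p>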
<h4 id="global-connectivity-sourcing--currencies">Global connectivity sourcing &#x26; currencies</h4> <p>As interconnection grows on the edge of the network and the footprint becomes increasingly global, an additional complexity gets thrown into the mix as a direct byproduct of global connectivity sourcing: currencies and exchange rates.</p> <p>When the need to source connectivity from local providers arises, procurement may be done in the local currency. Bandwidth procured in bulk then becomes sensitive to exchange-rate variations. Tracking exchange rates adds yet another layer to the software stack needed to track costs, on top of traffic accounting, cost model documentation, and traffic breakdowns.</p> <p>Kentik’s newly updated Connectivity Costs workflow relies on an exchange-rate datasource covering all available currencies, ensuring that cost estimates for the ongoing month, as well as cost history for the past months, will always:</p> <ul> <li>Allow financial planning to consolidate all cost metrics in a single currency, and</li> <li>Keep both current-month estimates and history truthful to the exchange rates at the time they are computed.</li> </ul> <p><em>As shown below, Kentik’s Connectivity Costs workflow now natively supports multiple currencies.</em></p> <img src="https://images.ctfassets.net/6yom6slo28h2/VDh8pvtd8k3BKWnvwZ9gI/0f7e94bc51ea5e75230d0c272b204108/cir-model.png" style="max-width: 800px;" class="image center" thumbnail alt="" /> <h4 id="a-cross-functional-tool">A cross-functional tool</h4> <p>As connectivity costs are metrics consumed by teams other than network operations, the resulting data is rarely easy for the non-technical crowd to ingest and parse (whether the underlying cost models or the actual metrics). More often than not, a network engineer has to insert themselves into the path to executives or finance to massage the data and make it digestible for non-technical audiences, once more diverting them from building and operating the best infrastructure possible to deliver a given service.</p> <p><code>&#x3C;dramatic_backdrop></code><br> Because of this, network engineers have become the unfortunate gatekeepers of an occult art that they are the only ones to be versed in, requiring their time for endless tedious tasks, while at the same time being on the critical path of information needed to operate the company efficiently.<br> <code>&#x3C;/dramatic_backdrop></code></p> <p>The good news is that network observability exists specifically to solve these types of problems, and if you go back to its defined mission from the previous blog post, you’ll read:</p> <blockquote><em>Network observability’s mandate is to remove the boundaries between teams, tools and methodologies as it pertains to the network. Its ambition is to make it trivial to answer virtually any question about the network and easily understand the answers. Lastly, network observability streamlines the activity of building and running networks.</em></blockquote> <h3 id="the-future-of-connectivity-costs-is-here">The Future of Connectivity Costs is Here</h3> <p>Let’s say you are now reaping the benefits of our new Connectivity Costs workflows. You now review connectivity costs regularly, and have even set goals to track. Congratulations!</p> <p>Think about this: These interconnection cost models are now available to the rest of the workflows. The future is quite bright.
All the necessary ingredients are now in place to be able to compute the cost of any slice of traffic coming in or out of your network: It can be different services that you run or offer; it can be different classes of users or customers that you identify with your own Custom Dimensions logic; or even any CDN, OTT traffic…</p> <p>If, like me, you can’t help but daydream about what comes next to support the performance, cost, or reliability of your infrastructure, <a href="#signup_dialog" title="Start a Free Kentik Trial">start a trial</a> and <a href="https://www.kentik.com/contact/">let’s chat</a>!</p><![CDATA[Kentik’s Journey to Deliver the First Cloud Network Observability Product]]><![CDATA[With today's launch of Kentik Cloud, we're making it easy for network and cloud engineers to visualize, troubleshoot, secure and optimize cloud networks. In this post, we share more about Kentik Cloud and why we thought it was important for us to build it.]]>https://www.kentik.com/blog/kentiks-journey-to-deliver-the-first-cloud-network-observability-producthttps://www.kentik.com/blog/kentiks-journey-to-deliver-the-first-cloud-network-observability-product<![CDATA[Dan Rohan]]>Tue, 18 May 2021 04:00:00 GMT<h2 id="why-1000-different-cloud-monitoring-solutions-dont-help-network-engineers">Why 1,000 different cloud monitoring solutions don’t help network engineers</h2> <p>Today, we are excited to <a href="/press-releases/kentik-brings-network-observability-to-public-cloud-environments/">announce the launch of Kentik Cloud</a>, our latest innovation enabling network and cloud engineers to easily visualize, troubleshoot, secure, and optimize cloud networks. We are super excited to tell you about all the things that make <a href="https://www.kentik.com/product/cloud/">Kentik Cloud</a> special (and there are a lot!), but first, we thought it’d be good to tell you why we thought it was important for us to build Kentik Cloud and share a bit of the journey that we took to build it.</p> <h3 id="cloud-questions">Cloud Questions</h3> <p>We’d been hearing for some time that managing cloud networks had become a real pain. Many of our customers already running large on-prem networks noted a trend: As their companies settled into the cloud, network teams were asked to step up to help configure network architectures, establish connectivity to their data and remote offices, and solve connectivity issues for the app teams. But, when they peeled the covers back to gaze at their newly inherited clouds, what they found was not pretty: dozens of ingress and egress points with no security policies. Internal communications routed over internet gateways and driving up costs. Abandoned gateways and subnets configured with overlapping IP space. A Gordian knot of VPC peering connections with asymmetric routing policies. In short, these cloud networks weren’t being managed by network engineers. One of our partners put it this way:</p> <div class="pullquote center" style="max-width: 65%; text-align: center;"> “Cloud networks developed by software engineers are technically feasible &mdash; but administratively nightmarish!”<br />&mdash; Network Engineer at a Fortune 500 Company</div> <p>But a lot of these teams couldn’t just roll up their sleeves and start fixing things. Many network engineers are still getting their feet wet when it comes to cloud networking. Some aren’t fluent with cloud terminology or haven’t been exposed to cloud consoles and APIs. 
Cloud networking concepts feel familiar but are just different enough to cause confusion. And the cloud vendors’ management interfaces don’t make it easy to do simple things like figuring out traffic paths or visualizing traffic going over a VPN.</p> <p>In short, running cloud networks is slowing network teams down. This is happening at the same time that SRE and DevOps teams are actually speeding up and enjoying the agility of the cloud. The problems lurking inside the cloud network are demotivating teams — and making them wonder if others are noticing. They worry that others might think the network team is purposely moving slowly.</p> <p>When we heard all this, we were puzzled: Weren’t cloud networks supposed to be easy? After all, we knew tons of folks — many of whom know nothing about networks — happily running little cloud environments without trouble. And it’s not like the network teams had to worry about fixing failed line cards or erroring ports. Also, we wondered, wasn’t there already a well-developed market of tooling to help here? So we asked hundreds of questions to try to understand what was going on. And what we learned was that the advances that cloud providers have introduced into networking are very subtle but also very impactful. Let’s dig in and talk about what we’ve learned.</p> <h3 id="cloud-democratized-the-network-management-plane">Cloud Democratized the Network Management Plane</h3> <p>This is more of a starting place than a new insight, but we should begin with the obvious. Cloud providers allow anyone to create and change networks, instantly. This is, in many ways, one of the main reasons that the cloud has been so successful. Companies that are getting started in the cloud no longer need to understand networking very well to have an app running. And the benefits of a democratized network carry forward even as cloud networks mature. Cloud providers have done such a good job of building resilient networks, with layers of amazing virtualization on top, that network hardware failures rarely become the problem of the network engineer.</p>
The SD-WAN systems, the DDoS scrubbers, and the intrusion detection appliances. And then we back up these hardware investments by hiring smart, highly qualified individuals to run it all. We don’t hand out the passwords to these devices to just anyone.</p> <p>But when cloud providers supply the tools that allow anyone to set up network infrastructure — they’ve done exactly that. Most organizations don’t have policies in place that prevent accounts from setting up new internet gateways, configuring new security groups, or routing policies. In fact, many organizations encourage this because it allows these teams to move fast.</p> <h3 id="network-teams-are-called-in-too-late">Network Teams Are Called in Too Late</h3> <p>Organizations can bury their heads in the sand for one or two iterations as they move into the cloud, but eventually the problems they are facing become acute: connectivity issues are slowing deployments. Security has no visibility. Network costs are surging. Time to call in the pros.</p> <p>When these teams come in to look, the problems are already baked into the environment and need careful extraction. Teams need to understand the traffic flows and architectural patterns before they can start to craft a plan to address the issues and methodically solve the problems. But how? The teams need tools.</p> <h3 id="the-old-tools-arent-up-to-the-job">The Old Tools Aren’t Up to the Job</h3> <p>Naturally, network teams try to solve these problems by first looking at the tools that they have been using to monitor their on-prem networks. But this isn’t cutting it. Built primarily as simple metrics warehouses, most “legacy” network monitoring vendors modified their platforms to support the cloud by simply ingesting a few cloud metrics from services like AWS CloudWatch or adding support for simple, unenriched VPC flow logs. The end products have limited value. The metrics collected help paint a vague picture of the cloud but don’t address any of the larger problems at play. Ingesting VPC flow logs is a good start, but what good does it do when containers spin up and down in minutes, workloads are transportable and everything runs on some kind of overlay network?</p> <p>This approach also can’t handle cloud scale. Some of the customers we chatted with were routinely generating 90 TB of flow logs a month. Try throwing that at a virtual appliance. Try throwing that into a massive cluster and keeping your queries running fast. It doesn’t work.</p> <p>And frankly, most people aren’t interested in making it work, because our thinking has shifted away from a must-build-everything mentality towards one where teams want to focus on what they are good at — engineering networks. Network engineers want to add value by exercising their core competencies, not oversee the care and feeding of storage and compute. We don’t think like that anymore.</p> <h3 id="and-the-new-tools-arent-up-for-it-either">And the New Tools Aren’t Up for It Either…</h3> <p>With “legacy” tools off the table, many teams started trying to solve these problems using the tools that come bundled with their cloud providers. This, too, has largely been an exercise in frustration, as network engineers realized quickly that they were not the consumers these tools were designed for. Yes, they can get a list of top talkers on an interface, but first, they have to learn to write SQL queries. Yes, they can get metrics from their gateways and load balancers, but setting up thresholds and baselines requires a degree in data science. 
Yes, they can configure VPC flow logs, but then they need to figure out how to enrich the records to make them meaningful. The bottom line is that cloud providers wrote these tools as primitives designed to be used by developers, primarily for monitoring applications — not cloud networks. And of course, these tools don’t pay any attention to the fact that 72 percent of cloud networks use hybrid and/or multi-cloud architectures<sup><b><a href="https://enterprisersproject.com/article/2020/7/hybrid-cloud-10-statistics" target="_blank">1</a></b></sup> — leaving serious visibility gaps on the table.</p> <p>So what about the new full-stack observability tools? Despite being super interesting and full of promise, the reality is that most of these agent-based solutions don’t take a network-centric approach. Designed largely for developer-centric use cases like application performance monitoring or SRE use cases of logs, traces and metrics, these tools don’t come close to helping teams detangle connectivity issues, address routing problems or find performance problems.</p> <h3 id="what-makes-kentik-cloud-different">What Makes Kentik Cloud Different?</h3> <p>Kentik Cloud provides enterprises with the ability to observe their public and hybrid cloud environments and understand how cloud networks impact user experience, application performance, costs, and security. By showing a unified end-to-end network view to, from, and across public clouds, the internet, and on-premises infrastructures, Kentik Cloud helps network engineers to quickly solve problems and dramatically improve their cloud networks. Networking teams that use Kentik Cloud will love the rich visualizations, lightning-quick speed, and thoughtful workflows that illuminate the paths, performance, traffic, and inter-connectivity that comprise their cloud networks.</p> <p>The solution introduces several new exciting features and capabilities.</p> <h3 id="observation-deck">Observation Deck™</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/46CUEu6SVouODdEbbcZS1P/6d0a1658522adf0f137099f65f38358d/kentik-cloud-dashboard2.png" alt="Cloud Traffic Overview" thumbnail style="max-width: 800px;" class="image center" /> <p>The Observation Deck™ was built on the idea that customers often use Kentik to build dashboards that meaningfully represent the infrastructure and use cases most relevant to them and then switch back and forth between these views and the Kentik workflows they love. The Observation Deck allows people to choose their favorite Kentik components (which are pre-built as “widgets”) and places them side-by-side with their custom views, including their Data Explorer, dashboards, etc. This puts the information our customers need front-and-center.</p> <p>Although officially part of the Kentik Cloud release, the Observation Deck isn’t limited to only our cloud subscribers — everyone using our platform will be able to enjoy it. Cheers!</p> <h3 id="kentik-map-enhancements-for-cloud">Kentik Map Enhancements for Cloud</h3> <img src="//images.ctfassets.net/6yom6slo28h2/2sfe7Hgut758D2KIZekDjw/bb84090609810ec9ab7e628dddeaa055/kentik-map-cloud-enhancements.png" alt="Cloud traffic data" style="max-width: 800px;" thumbnail class="image center" /> <p>The Kentik Map is all about helping users ask and answer questions about their data in the context of their architectures. 
These latest enhancements truly deliver on this promise for cloud users by highlighting the most important elements that cloud networkers care about — the data paths, gateways, route tables, and traffic data — plus the metadata that puts it all into perspective — in one single, easy-to-use, and beautiful interface.</p> <p>People can use the map to discover misrouted, rejected, or unwanted traffic patterns. Using the map to understand the flow of traffic around your cloud environment is so easy that you’ll never want to use your cloud console again. Best of all, the map is integrated with your on-prem data so you can now enjoy a complete and seamless experience as you troubleshoot issues in your data center through to your VPC architectures.</p> <h3 id="cloud-performance-monitor">Cloud Performance Monitor</h3> <img src="https://images.ctfassets.net/6yom6slo28h2/5MoQ8xJxOlHC7Ml4xFhlI9/a1ef999f4376788d9e6833a0ebd6f1e5/cloud-performance-monitor-blog1.png" alt="Cloud Performance Monitoring" style="max-width: 800px;" thumbnail class="image center" /> <p>Cloud Performance Monitor extends the power of <a href="https://www.kentik.com/product/synthetics/">Kentik Synthetics</a> by helping cloud users understand the paths that traffic is taking through their network so they can set up our <a href="https://www.kentik.com/product/global-agents/">synthetics agents</a> in the most appropriate places. The workflow is broken down into two components.</p> <p>The Interconnects tab helps users gain a complete picture of how their data flows across the critical infrastructure “gluing” their cloud environment to their data centers and remote offices. The Conversations tab uses flow data to identify the inter-VPC communication paths inside cloud environments so that users can pinpoint the parts of their network that would readily benefit from active performance monitoring. Without Kentik Cloud, understanding these architectures and finding these traffic patterns can be difficult, making performance monitoring challenging. With Kentik Cloud, it’s a breeze; just point, click, and register a synthetic agent in seconds.</p> <h3 id="automated-configuration-and-onboarding-improvements">Automated Configuration and Onboarding Improvements</h3> <p>This release of Kentik Cloud also features several small but mighty enhancements that we’re proud to share with you.</p> <p>First, we’ve dramatically improved our onboarding experience by giving users two paths to get up and running on Kentik Cloud — an automated path based on Terraform, and a manual path with easy validation. The Terraform path builds a Terraform configuration template based on user preferences, making it simple to configure your AWS environment in seconds. The generated configuration relies on the popular AWS Terraform provider to enable flow logs on your VPCs and configure the collection buckets and required access policies. Then, the configuration uses our brand new Kentik Terraform provider to automatically register each VPC from every monitored account in the Kentik platform.</p> <img src="//images.ctfassets.net/6yom6slo28h2/8nn3bniLYY8GE71X4gN6R/1f37155bb624eb5c3c0fae04ddf82e2b/cloud-terraform-setup.png" thumbnail alt="Onboarding" style="max-width: 800px;" class="image center" /> <p>Our manual onboarding improvements are real time-savers as well. In previous setup screens for AWS, we asked users to input role ARNs, buckets, and regions before making any attempt to validate that we could successfully ingest cloud flow logs.
The result was that users who had a misconfiguration weren’t sure what to fix. We’ve improved this experience by adding validation buttons for each step along the way.</p> <h3 id="new-data-explorer-dimensions">New Data Explorer Dimensions</h3> <img src="//images.ctfassets.net/6yom6slo28h2/21tTHa55vBOEZ5J3R9wx6h/575846cd2fd40fd138d2a4c19b40f618/data-explorer-dimension-869w.png" style="max-width: 700px;" class="image center" /> <p>We’ve also added a few new dimensions to the Data Explorer that are super useful for AWS users.</p> <p><strong>Packet Address</strong>: AWS recently added the ability to see through network overlays (GRE, etc.) to the raw source and destination IP addresses in your VPCs. This is useful when you’re troubleshooting transit gateway Connect attachments, NAT gateways, or any kind of traffic with unencrypted overlays.</p> <p><strong>Gateway ID/Gateway Type</strong>: This is one of our most exciting AWS data dimensions, allowing you to see exactly what traffic crossed various gateways. This is useful when you’re trying to understand how traffic is flowing through your network, or are working to implement new gateways or retire old ones.</p> <p><strong>Forwarding State</strong>: This dimension enriches flow records with the route state of the destination prefix. If traffic is flowing towards a destination with an active route, the forwarding state will be marked as “active.” However, if traffic is destined towards a blackhole route, the state will reflect this with a value of “blackholed.”</p> <h3 id="next-steps">Next Steps</h3> <p>In addition to polishing what we’ve already started, we’ve got so much more exciting work planned over the next few quarters: connectivity troubleshooting workflows, new cloud widgets, map enhancements, and support for new authentication features in AWS. We’re also excited to extend capabilities in our maps to Azure, Google Cloud and IBM clouds in the coming quarters. Stay tuned!</p> <h3 id="get-started">Get Started</h3> <p>If you’d like to learn more, you can <a href="#signup_dialog" title="Start a Free Kentik Trial">spin up a 30-day free trial</a> in minutes. If you’d like more information, <a href="https://www.kentik.com/contact/">contact us</a> or join us on <a href="https://www.kentik.com/go/kentik-community-slack-signup/">Kentik Users Community Slack</a> where you can chat live with product experts and get your questions answered in real time.</p> <h3 id="or-see-for-yourself-during-our-upcoming-webinar">…Or See for Yourself During Our Upcoming Webinar</h3> <p>If you’d like to see Kentik Cloud in action, using real data to attack specific use cases, I’d also like to invite you to <a href="https://www.kentik.com/go/webinar-troubleshooting-aws-environment/">join our upcoming webinar this Thursday, May 20</a> on “How to Troubleshoot Routing and Connectivity in Your AWS Environment.”</p><![CDATA[Are Blind Spots in Your Hybrid Cloud Keeping Your Network Pros Up at Night?]]><![CDATA[With increasingly complex on-premises and cloud hybrid environments, managing the network has never been so tough.
And that keeps network pros up at night.]]>https://www.kentik.com/blog/are-blind-spots-in-your-hybrid-cloud-keeping-your-network-pros-up-at-nighthttps://www.kentik.com/blog/are-blind-spots-in-your-hybrid-cloud-keeping-your-network-pros-up-at-night<![CDATA[Avi Freedman]]>Mon, 17 May 2021 04:00:00 GMT<p>In today’s digitally transformed world, everyone from internet service providers to enterprises depends upon high-performing and secure networks for business success. It doesn’t matter if you’re <a href="https://www.kentik.com/resources/how-zoom-uses-kentik-for-network-visibility-and-performance/">Zoom</a> or <a href="https://www.kentik.com/resources/casestudy-major-league-baseball/">Major League Baseball</a>, if you’ve got a problem with your network… you’ve got a problem with your business.</p> <p>But with increasingly complex hybrid or multi-cloud models, monitoring the network has never been so tough. Unfortunately, many of today’s more archaic network monitoring tools fail to provide the holistic visibility needed to combine information from on-premises and cloud environments into a single view. This inability to tie data together — let alone view it — inevitably causes dangerous blind spots.</p> <p>And that keeps network pros up at night.</p> <p>Because of the growing adoption of cloud infrastructure, more companies need a network observability solution that fuses traffic and telemetry data along with business, security and internet metadata to help customers get increased visibility, improved performance and better cost control of the networking that connects their total cloud environment.</p> <p>This converged platform functionality provides network professionals, such as network, cloud and site reliability engineers, visibility into traffic to and from cloud services, the internet, and on-premises network infrastructure — and beyond.</p> <h3 id="hail-to-the-network">Hail to the Network</h3> <p>As cloud adoption expands with various cloud and on-premises permutations, the network has become increasingly difficult to manage. Fortunately, whether you’re using AWS, traditional data centers, LAN/WAN/SD-WANs, or container environments, modern monitoring solutions are empowering practitioners to see and manage network data like never before.</p> <p>A network observability solution that is architected to visualize and map traffic in complex cloud and on-premises environments offers a distinct advantage.</p> <h3 id="silos-good-for-storing-corn-bad-for-observability">Silos: Good for Storing Corn, Bad for Observability</h3> <p>Let’s face it: network pros are tired of toggling between siloed tools that — ironically — can actually impede problem-solving and visibility. Silos are great for storing corn but bad for managing the cloud. Just as sales teams embrace the simplicity of Salesforce, more network pros want simple monitoring solutions that use built-in workflows to better visualize cloud networking performance, troubleshoot connectivity problems, control and reduce cloud traffic costs and migrate apps to public cloud with confidence.</p> <h3 id="ai-driven-insights">AI-Driven Insights</h3> <p>Another benefit of network observability is AI-driven insights, which ensures intelligent monitoring keeps getting smarter and recognizes security risks and performance problems faster with greater accuracy.</p> <p>You wouldn’t put a blindfold on an air traffic controller. Similarly, more companies realize it’s a mistake to burden network professionals with archaic, siloed monitoring solutions.
Fortunately, the next generation of monitoring solutions ensures that — when it comes to monitoring network traffic in the data center, cloud or on the ground — your network practitioners will no longer fly blind.</p> <h3 id="join-the-webinar-on-may-20">Join the Webinar on May 20</h3> <p>Do you want to learn more about the benefits of network observability? Sign up today for our upcoming webinar, <a href="https://www.kentik.com/go/webinar-troubleshooting-aws-environment/">How to Troubleshoot Routing and Connectivity in Your AWS Environment</a>, and learn how network observability eliminates visibility gaps in network monitoring.</p><![CDATA[The Mystery of AS8003]]><![CDATA[On January 20, 2021, a great mystery appeared in the internet's global routing table. An entity that hadn’t been heard from in over a decade began announcing large swaths of formerly unused [IPv4 address space](https://en.wikipedia.org/wiki/IPv4) belonging to the U.S. Department of Defense. Registered as GRS-DoD, AS8003 began announcing 11.0.0.0/8 among other large DoD IPv4 ranges.]]>https://www.kentik.com/blog/the-mystery-of-as8003https://www.kentik.com/blog/the-mystery-of-as8003<![CDATA[Doug Madory]]>Sat, 24 Apr 2021 12:00:00 GMT<p>On January 20, 2021, a great mystery appeared in the internet’s global routing table. An entity that hadn’t been heard from in over a decade began announcing large swaths of formerly unused <a href="https://en.wikipedia.org/wiki/IPv4">IPv4 address space</a> belonging to the U.S. Department of Defense. Registered as <a href="https://bgp.he.net/AS8003#_whois">GRS-DoD</a>, AS8003 began announcing 11.0.0.0/8 among other large DoD IPv4 ranges.</p> <p>According to data available from University of Oregon’s <a href="http://www.routeviews.org/routeviews/">Routeviews</a> project, one of the very first BGP messages from AS8003 to the internet was:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">TIME: 01/20/21 16:57:35 TYPE: BGP4MP/MESSAGE/Update FROM: 62.115.128.183 AS1299 TO: 128.223.51.15 AS6447 ORIGIN: IGP ASPATH: 1299 6939 6939 8003 NEXT_HOP: 62.115.128.183 ANNOUNCE 11.0.0.0/8</code></pre></div> <p>The message above has a timestamp of 16:57 UTC (11:57am ET) on January 20, 2021, moments after the swearing in of Joe Biden as the President of the United States and minutes before the statutory end of the administration of Donald Trump at noon Eastern time.</p> <div as="Promo"></div> <p>The questions that started to surface included: Who is AS8003? Why are they announcing huge amounts of IPv4 space belonging to the U.S. Department of Defense? And perhaps most interestingly, why did it come alive within the final three minutes of the Trump administration?</p> <p>By late January, AS8003 was announcing about 56 million IPv4 addresses, making it the sixth largest AS in the IPv4 global routing table by originated address space. By mid-April, AS8003 dramatically increased the amount of formerly unused DoD address space that it announced to 175 million unique addresses.</p> <p>Following the increase, AS8003 became, far and away, the largest AS in the history of the internet as measured by originated IPv4 space.
By comparison, AS8003 now announces 61 million more IP addresses than the now-second biggest AS in the world, China Telecom, and over <em>100 million more addresses than Comcast</em>, the largest residential internet provider in the U.S.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7hruFJtsbnwagNi87rOOYG/99533d12fe3637a7e8c1f584c4e732fb/8003.png" style="max-width: 400px; margin-top: 0;" class="image center no-shadow" /> <p>In fact, as of April 20, 2021, AS8003 is announcing so much IPv4 space that 5.7% of the entire IPv4 global routing table is presently originated by AS8003. In other words, more than one out of every 20 IPv4 addresses is presently originated by an entity that didn’t even appear in the routing table at the beginning of the year.</p> <h2 id="a-valuable-asset">A valuable asset</h2> <p>Decades ago, the U.S. Department of Defense was allocated <a href="https://en.wikipedia.org/wiki/List_of_assigned_/8_IPv4_address_blocks#List_of_assigned_/8_blocks_to_the_United_States_Department_of_Defense">numerous massive ranges</a> of IPv4 address space - after all, the internet was conceived as a Defense Dept project. Over the years, only a portion of that address space was ever utilized (i.e. announced by the DoD on the internet). As the internet grew, the pool of available IPv4 dwindled until a <a href="https://web.archive.org/web/20150416174505/http://research.dyn.com/2015/04/ipv4-address-market-takes-off/">private market emerged</a> to facilitate the sale of what was no longer just a simple router setting, but an increasingly precious commodity.</p> <p>Even as other <a href="https://www.washingtonpost.com/business/technology/iran-positions-itself-to-be-a-tech-leader-in-turbulent-middle-east/2015/04/06/e52ad4ea-d948-11e4-8103-fa84725dbf9d_story.html">nations began purchasing IPv4</a> as a strategic investment, the DoD sat on much of their unused supply of address space. In 2019, Members of Congress attempted to force the sale of all of the DoD’s IPv4 address space by proposing the following provision be added to the <a href="https://www.congress.gov/116/crpt/hrpt120/CRPT-116hrpt120-pt2.pdf">National Defense Authorization Act for 2020</a>:</p> <blockquote> <p><strong>Sale of Internet Protocol Addresses.</strong> Section 1088 would require the Secretary of Defense to sell at fair market value all of the department’s Internet Protocol version 4 (IPv4) addresses over the next 10 years. The proceeds from those sales, after paying for sales transaction costs, would be deposited in the General Fund of the Treasury.</p> </blockquote> <p>The authors of the proposed legislation used a Congressional Budget Office estimate that a /8 (16.7 million addresses) would fetch $100 million after transaction fees. In the end, it didn’t matter because this provision was stripped from the final bill that was signed into law - the Department of Defense would be funded in 2020 without having to sell this precious internet resource.</p> <h2 id="what-is-as8003-doing">What is AS8003 doing?</h2> <p>Last month, astute contributors to the NANOG listserv <a href="https://seclists.org/nanog/2021/Mar/186">highlighted</a> the oddity of massive amounts of DoD address space being announced by what appeared to be a shell company. While a BGP hijack was ruled out, the exact purpose was still unclear. 
<em>Until yesterday</em>, when the Department of Defense provided an explanation to <a href="https://www.washingtonpost.com/technology/2021/04/24/pentagon-internet-address-mystery/">reporters from the Washington Post</a> about this unusual internet development. Their statement said:</p> <blockquote> <p>Defense Digital Service (DDS) authorized a pilot effort advertising DoD Internet Protocol (IP) space using Border Gateway Protocol (BGP). This pilot will assess, evaluate and prevent unauthorized use of DoD IP address space. Additionally, this pilot may identify potential vulnerabilities. This is one of DoD’s many efforts focused on continually improving our cyber posture and defense in response to advanced persistent threats. We are partnering throughout DoD to ensure potential vulnerabilities are mitigated.</p> </blockquote> <p>I interpret this to mean that the objectives of this effort are twofold: first, to announce this address space to scare off any would-be squatters, and second, to collect a massive amount of background internet traffic for threat intelligence.</p> <p>On the first point, there is a <a href="https://web.archive.org/web/20170122135805/https://dyn.com/blog/vast-world-of-fraudulent-routing/">vast world of fraudulent BGP routing</a> out there. As I’ve documented over the years, various types of bad actors use unrouted address space to bypass blocklists in order to send spam and other types of malicious traffic.</p> <p>On the second, there is a lot of background noise that can be scooped up when announcing large ranges of IPv4 address space. A recent example is Cloudflare’s announcement of 1.1.1.0/24 and 1.0.0.0/24 in 2018.</p> <p>For decades, internet routing operated with a widespread assumption that ASes didn’t route these prefixes on the internet (perhaps because they were canonical examples from networking textbooks). According to their <a href="https://blog.cloudflare.com/fixing-reachability-to-1-1-1-1-globally/">blog post</a> soon after the launch, Cloudflare received “~10Gbps of unsolicited background traffic” on their interfaces.</p> <p>And that was just for 512 IPv4 addresses! Of course, those addresses were very special, but it stands to reason that 175 million IPv4 addresses will attract orders of magnitude more traffic. Much of it will come from misconfigured devices and networks that mistakenly assumed all of this DoD address space would never see the light of day.</p> <h2 id="conclusion">Conclusion</h2> <p>While yesterday’s statement from the DoD answers some questions, much remains a mystery. Why did the DoD not just announce this address space themselves instead of directing an outside entity to use the AS of a long-dormant email marketing firm? Why did it come to life in the final moments of the previous administration?</p> <p>We likely won’t get all of the answers anytime soon, but we can certainly hope that the DoD uses the threat intel gleaned from the large amounts of background traffic for the benefit of everyone. Maybe they could come to a NANOG conference and present about the troves of erroneous traffic being sent their way.</p> <p>As a final note: your corporate network may be using the formerly unused DoD space internally, and if so, there is a risk you could be leaking it out to a party that is actively collecting it. How could you know? Using Kentik’s Data Explorer, you could quickly and easily view the stats of exactly how much data you’re leaking to AS8003.
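</p> <p><em>As a complementary offline sanity check, a few lines of Python can test an internal addressing plan against DoD ranges. 11.0.0.0/8 comes from this story; the other blocks below are illustrative stand-ins, so verify the current announcements before relying on any such list.</em></p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python">import ipaddress

# Illustrative list of large DoD blocks -- not exhaustive, verify independently.
DOD_BLOCKS = [ipaddress.ip_network(n) for n in ("11.0.0.0/8", "22.0.0.0/8", "26.0.0.0/8")]

def in_dod_space(addr):
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in DOD_BLOCKS)

# Run candidate internal addresses (or sampled flow records) through the check:
for candidate in ("11.34.12.9", "10.1.2.3", "192.168.0.5"):
    print(candidate, "->", "DoD space" if in_dod_space(candidate) else "ok")</code></pre></div> <p>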
May be worth a check, and if so, <a href="https://www.kentik.com/">start a free trial of Kentik</a> to do so.</p> <p>We’ve got a short video Tech Talk about this topic as well—see <a href="https://www.kentik.com/resources/kentik-tech-talks-how-to-use-kentik-to-check-if-youre-leaking-data-to-as8003/" title="Kentik Tech Talks, Episode 9: How to Use Kentik to Check if You&#x27;re Leaking Data to AS8003 (GRS-DoD)">Kentik Tech Talks, Episode 9: How to Use Kentik to Check if You’re Leaking Data to AS8003 (GRS-DoD)</a>.</p><![CDATA[Building and Operating Networks: Assuring What Matters]]><![CDATA[In part 1 of this blog series, Kentik director of technical product management Greg Villain discusses what matters with network interconnection and their cost considerations. Greg examines the different types of interconnections, necessary operational measures, and applicable elements of network observability.]]>https://www.kentik.com/blog/building-and-operating-networks-assuring-what-mattershttps://www.kentik.com/blog/building-and-operating-networks-assuring-what-matters<![CDATA[Greg Villain]]>Fri, 09 Apr 2021 04:00:00 GMT<p><strong>Network Interconnection: Part 1 of a new series <em>Building and Operating Networks</em></strong></p> <h2 id="key-attributes-of-a-network">Key Attributes of a Network</h2> <p>Networks come in all shapes and sizes, and they all have different purposes. Whether you’re a network service provider or an enterprise running your own hybrid, global network, each network scenario has a unique set of constraints and optimal operating conditions.</p> <p>At Kentik, we pride ourselves as a pioneer and leader in network observability. Our platform caters to both standard and custom use cases depending on the unique needs of our customers.</p> <p>This blog examines several different network interconnection models, essential considerations for each, and how “what matters” can be optimized with <a href="https://www.kentik.com/resources/five-steps-to-network-observability-nirvana-webinar/">network observability</a>.</p> <h3 id="the-basic-dynamics-of-network-interconnection">The Basic Dynamics of Network Interconnection</h3> <p>First, let’s lay out a few simple principles that govern the way networks are interconnected.</p> <p>As trivial as it seems, no network originates and terminates traffic within itself as a closed environment. Hence the importance of the name “internet,” meaning between and among networks.</p> <p>Of course, there are corner cases, such as residential ISPs that offer IPTV services living within their own boundaries. But for this blog, let’s set aside any corner cases.</p> <p>For a given network, traffic sourced from within the network is considered free. However, content sourced from outside needs to be carried over to the eyeball network’s (defined below) subscribers in the local loop at a cost.</p> <p>Handovers are configured on the <strong>edge of a network</strong>, at locations where multiple networks meet and are usually described as <strong>interconnects</strong>. These interconnects have a cost associated with them (explicit or hidden). These costs are directly related to the amount of traffic flowing through them.</p> <p>There is a hierarchy of interconnection methods based on technical and financial attributes.</p> <p>Let’s summarize the different types of interconnects:</p> <ul> <li><strong>Transit</strong> is a default path for any traffic that comes and goes from outside of the network. 
Often referred to as “full routes,” these connections hold full reachability for any and all destinations on the internet.</li> <li><strong>Peering</strong> is a supplemental, narrower set of sources or destinations to go in or out of the network to a specific source or destination network. They come in multiple flavors: <ul> <li><strong>Free peering</strong>: the relationship with the interconnected network is neither metered nor paid, relying on a mutual interest for both parties interconnecting</li> <li><strong>Paid peering</strong>: one of the two interconnecting networks pays the other to interconnect, usually based on the amount of traffic flowing between them in either direction</li> <li><strong>IX (Internet eXchange)</strong>, also known as “peering fabrics.” Peering fabrics are facilities where networks can meet to exchange traffic on a shared switching platform. An IX charges member networks for access ports but offers the benefit of the availability of multiple potential peering partners through the same infrastructure. </li> </ul> </li> </ul> <p>Given these different types, defining a sensible interconnection policy is not a trivial task. Network operators need to track multiple metrics, but we’ll come back to this later.</p> <h3 id="different-types-of-networks">Different Types of Networks</h3> <p>Let’s start with a simple categorization of the types of networks out there. This basic classification will help us to understand the underlying goals of each.</p> <h4 id="eyeball-networks">Eyeball Networks</h4> <p>This term refers to residential or enterprise ISPs. A simpler description is that these networks have users directly attached to them who consume content traffic (hence “eyeball”).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7FcvRoSekiLCTVQ1A7jaEs/53f8df95a03f86f5b76a3a9a94d583d0/eyeball-networks.png" style="max-width: 800px;" class="image center no-shadow" alt="Eyeball Networks" /> <p>Here are some of the main properties of eyeball networks:</p> <ul> <li>Their traffic profile is mostly <strong>INBOUND</strong> — users consume content pulled from outside of the network boundary. In other words, their outbound-to-inbound traffic ratio is 1:N, where N>>1.</li> <li>Their footprint is mostly local (country specific in most cases). They are not in the business of cross-continental telecommunications.</li> <li>From a routing perspective, being heavy inbound, a big challenge for eyeball networks is the difficulty of influencing inbound traffic routing. BGP really isn’t tailored for this.</li> <li>Subscriber satisfaction in an eyeball network is derived mostly from “download” performance, i.e., the speed at which they can obtain their favorite content.</li> <li>Depending on the locale, the eyeball network ISP may be in a very competitive market where an internet subscription is considered a commodity. This creates a difficult challenge for these operators, needing to balance <a href="https://www.kentik.com/resources/case-study-limelight/">performance and cost</a>.</li> </ul> <p>A good example of eyeball-network market dynamics is the unbundling of the copper local loop in some countries, where the incumbent is legally obligated to offer other ISPs the possibility of renting the last-mile copper pair to deliver DSL-based internet access.
In these situations, this legal aspect drove internet plan prices down and made these markets very competitive.</p> <h3 id="content-networks">Content Networks</h3> <p>Content networks are the reverse of the eyeball networks: these networks originate the majority of traffic, destined to be consumed by end-users.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/67zV16r83EVVt3hYXAMOz4/77f933127c2878b646959552fb98767e/content-networks.png" style="max-width: 800px;" class="image center no-shadow" alt="Content Networks" /> <p>Content networks’ main attributes are:</p> <ul> <li>Their traffic profile is mostly <strong>OUTBOUND</strong> — they push content to users who are not located within their network. In other words, their outbound-to-inbound traffic ratio is N:1, where N>>1.</li> <li>Typical examples of content networks are gaming companies, social networks, hosting providers, content delivery networks (CDNs).</li> <li>Their footprint is varied and dependent on the monetization of their content in multiple locales.</li> <li>From a routing perspective, their life is much easier than that of eyeballs since BGP makes it trivial to select exit points for traffic.</li> <li>Financial dynamics can be diverse: subscription-based, advertisements, revenue sharing.</li> <li>One of their main objectives is to achieve high performance towards consumers, even while the destination consumer may be multiple networks away. Obviously, the fewer the hops, the better. Complexities are usually proportional to the geographical spread of their user base, as maintaining maximum performance across virtually every ISP in every locale is unrealistic.</li> </ul> <h3 id="carrier-networks">Carrier Networks</h3> <p>These networks are the providers of all the other networks: in essence, they transport packets from content networks to eyeball networks across the globe and are “the backbone of the internet.”</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7agJFcJBfA884A9tj40sZd/49b75fbfad1cfc37fa4888750a83f3f1/carrier.png" style="max-width: 800px;" class="image center no-shadow" alt="Content Networks" /> <p>Here are their main characteristics:</p> <ul> <li>Carriers make money by transporting content traffic to eyeball users, i.e., by selling traffic capacity to/from any destination/source anywhere in the world. In other words, their traffic ratio is considered 1:1.</li> <li>Their main challenge is to remain attractive to both content providers and eyeball networks: it means they need to maintain balanced demand on both sides of the equation. They need to have enough eyeballs connected to attract content providers and enough content providers to attract eyeballs.</li> <li>Most carriers don’t pay for traffic. Selling access to traffic is the core business they are in. In exchange, they offer a global footprint and connect content to eyeballs over long distances. A carrier without any paid interconnect is usually called a tier 1.</li> <li>Carriers operate in a largely commoditized industry. Their margins are usually low, i.e. their success depends on financial efficiency.</li> </ul> <h2 id="what-dimensions-matter">What Dimensions Matter?</h2> <p>There’s an interesting parallel to be drawn between building and operating a network and engineering a distributed datastore. Some may already be familiar with the <a href="https://en.wikipedia.org/wiki/CAP_theorem">CAP Theorem</a>, formulated by Eric Brewer.
In this theorem, <strong>C stands for Consistency, A for Availability, and P for Partition Tolerance</strong> — it lays out this simple rule:</p> <p style="padding-left: 20%; padding-right: 20%; text-align: center;"><strong>“When building a distributed datastore, you can only select two of these three as design constraints.”</strong></p> <p>On the consumption side, when selecting a datastore for a given task, one must evaluate which two of the three are most important for the underlying storage task and choose the datastore that fits accordingly.</p> <p>The dynamic is very similar for each network. As listed in the quick shorthand summary in the previous section, each network comes with its own challenges, but building and operating a network always requires an understanding and observation of three well-known dimensions:</p> <ul> <li>Cost</li> <li>Performance</li> <li>Reliability</li> </ul> <p>While less drastic than the CAP theorem (you can effectively get a variable amount of the three without having to exclude one completely), the art of the game for any network is to be able to make the right decisions, according to its constraints in each of these categories.</p> <p>Since we’ve determined that interconnection is one of the defining elements of any network, it only makes sense to use these three dimensions to evaluate the constraints of any interconnection.</p> <h3 id="cost">Cost</h3> <p>Generally, a transit interconnect is more expensive than peering on a cost per Mbps basis. Moreover, peering at an IX mutualizes the cost of interconnection over multiple peerable networks. Both require common locations and cross-connects that need to be factored into the eventual cost per Mbps.</p> <p>Cost is an important factor for eyeball networks, as they usually operate in a competitive market.</p> <blockquote>To give the reader an idea: a pretty common plan in the US can be found for $40 for 100Mbps, which means $0.4/Mbps, without even considering the cost of labor and maintenance. This traffic cost needs to be put in perspective with the interconnection cost at the edge, more specifically, the cost of transit.</blockquote> <p>The diagram below illustrates the different modes of interconnection together with the associated money-flows that can take place between a variety of networks.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6kCh1KKlLcP62qjiqVmmkq/fa8c7af6018176c215256d47cb885919/interconnection-cost.png" style="max-width: 800px;" class="image center no-shadow" alt="Interconnection Cost" />
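<p><em>To see how quickly that comparison plays out, here is a deliberately simplistic, hypothetical calculation based on the plan price quoted above. The transit rate and the oversubscription ratio are assumptions for illustration, not market data.</em></p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Entirely hypothetical eyeball-ISP economics, for illustration only.
plan_price_usd = 40.0    # monthly price of a 100 Mbps plan (from the example above)
plan_rate_mbps = 100.0
print(plan_price_usd / plan_rate_mbps)   # revenue per plan Mbps: $0.40

transit_cost_per_mbps = 0.50   # assumed transit rate at the edge, $/Mbps billed
oversubscription = 20          # assumed ratio of sold plan capacity to actual peak usage

# Edge traffic a single subscriber actually contributes at the monthly peak:
edge_mbps_per_sub = plan_rate_mbps / oversubscription            # 5 Mbps
transit_cost_per_sub = edge_mbps_per_sub * transit_cost_per_mbps
print(f"transit per subscriber: ${transit_cost_per_sub:.2f}/month vs. ${plan_price_usd:.2f} revenue")
# Heavier usage (a lower oversubscription ratio) or a pricier transit rate quickly
# erodes what is left to pay for the local loop, support, and labor.</code></pre></div> <p>In other words, the business only works because subscribers don’t all use their full plan at once, which is exactly why peering and other cheaper interconnects become so attractive as usage grows.</p>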
<h3 id="performance">Performance</h3> <p>Performance depends, among other factors, on the number of hops to reach a specific destination. A blunt estimate is that performance decreases as the number of hops between a source and a destination increases.</p> <p>With that in mind, a direct peering connection allows content to go directly from the content provider to the subscriber. Additionally, the distance between the content and the subscriber is a crucial factor: because of the limits of the speed of light, the farther the source is from the destination, the higher the latency.</p> <blockquote>These two aforementioned aspects are the key reasons behind the rise of content delivery networks in the early 2000s, whose value add is to both reduce the number of hops between content and consumer as well as the geographical distance, across the broadest footprint possible. For a content provider with a user base across the world, performance to users becomes problematic because of the need to optimize for a large set of local eyeball networks.</blockquote> <h3 id="reliability">Reliability</h3> <p>As a rule of thumb, the more available paths to choose from between a source and a destination, the more reliable the eventual service rendered between the two. For example, if the primary path fails, a backup path is immediately available.</p> <p>With the diversity of paths come additional costs, as multiplying logical and physical routes implies diversifying providers, i.e. multiplying the number of interconnecting interfaces towards these next or previous hops.</p> <p>With a larger number of interfaces on the edge comes a sizable management cost overhead because of the additional number of moving pieces.</p> <blockquote>Network reliability is one of the areas most prone to the law of diminishing returns. Every additional path added inbound or outbound becomes increasingly costly and decreasingly beneficial. Some experts go as far as saying that, since reliability can never be total, choosing carefully when not to be reliable is a key strategic consideration for a network operator.</blockquote> <p>The diagram below illustrates diverse paths between two given networks, and the associated failover modes (configuration dependent).</p> <img src="//images.ctfassets.net/6yom6slo28h2/1UfcR6qsRktv52vn2hatHh/87b293d1d7dfbe5d1fab8a079751e02c/path-failovers.png" style="max-width: 800px;" class="image center no-shadow" alt="Paths and Failovers" /> <h2 id="enter-network-observability">Enter Network Observability</h2> <p>We’ve established earlier that the top dimensions against which any network should benchmark itself are cost, performance and reliability. Given this, the underlying success of any network-reliant enterprise lies in its ability to do the following for any of these attributes:</p> <ul> <li>Measure</li> <li>Track</li> <li>Fine-tune</li> <li>Improve</li> <li>Plan</li> <li>Fix</li> </ul> <p>In short, these are all the crucial activities that network observability can deliver.</p> <p>Rightly, the reader may ask, “It’s not like anything here is novel. Why does network observability brand itself as a novel concept?”</p> <p>One of the answers to this question relates to the lack, throughout the industry, of a cohesive approach to the tasks of managing networking costs, performance and reliability.</p> <p>While network and infrastructure-heavy companies naturally structure themselves around functions (sales, network engineering, reliability engineering, software engineering, operations, support), they often suffer from the perennial organizational silos when the service they deliver demands horizontal attention across goals (cost, performance, reliability) and tasks (measure, track, tune, fix).</p> <p>Some of the usual symptoms are:</p> <ul> <li>Absence/lack of shared, baselined metrics</li> <li>Absence/lack of shared best practices across teams</li> <li>Absence/lack of end-to-end performance measurements</li> <li>Loss of efficiency when information is required to cross silos</li> <li>Difficulty in mapping executive strategy into technical infrastructure plans</li> </ul> <p>Network observability’s mandate is to bring down the boundaries between teams, tools and methodologies as it pertains to the network.
Its ambition is to make it trivial to <a href="/why-kentik-network-observability-and-monitoring/">answer virtually any question about the network</a> and easily understand the answers. Lastly, network observability streamlines the activity of building and running networks.</p> <p>In the next blog post, we’ll discuss how Kentik instruments and supports tracking and planning of your connectivity costs.</p> <p><button as="SignupButton">TRY KENTIK</button></p><![CDATA[VPC Flow Logs in AWS: How to Monitor Traffic at the Edge of Your Cloud Network]]><![CDATA[VPC Flow Logs are a necessary form of network telemetry to deliver network observability for cloud. Dan Rohan discusses different models for gathering VPC Flow Log data and the advantages of each. ]]>https://www.kentik.com/blog/vpc-flow-logs-aws-monitor-traffic-edge-cloud-networkhttps://www.kentik.com/blog/vpc-flow-logs-aws-monitor-traffic-edge-cloud-network<![CDATA[Dan Rohan]]>Wed, 07 Apr 2021 04:00:00 GMT<p>At Kentik, we’ve been ingesting and analyzing <a href="/resources/aws-vpc-flow-logs-for-kentik/">AWS VPC Flow Logs</a> since 2018. Through the hundreds of customer conversations we’ve had, we’ve heard a widespread (and totally false) belief that AWS VPC Flow Logs must be configured to monitor every single part of your VPC environment — and thus are too expensive to set up as part of a comprehensive monitoring strategy. The truth is that while flow logs do cost money, AWS has provided knobs that you can turn to keep your costs reasonable while still getting the visibility you need. In this blog, I’ll walk you through how you can configure your AWS environment to target precisely what you want to monitor — nothing more, nothing less.</p> <h2 id="the-gimme-everything-approach-to-vpc-flow-logs">The “Gimme Everything” Approach to VPC Flow Logs</h2> <p>Before I get too deep into the technology, I should mention that there’s <em>totally</em> a benefit to setting up logs carte-blanche across your VPCs. If you have flow logs turned on everywhere, you’ll never feel the burning regret of not having traffic logs when you really need them. Think about what information you’ll need to find out which EC2 instance hogged a VPN connection or what service drove up costs on your <a href="https://www.kentik.com/kentipedia/nat-gateway/" title="Kentipedia - NAT Gateways: A Guide to Cost Management and Cloud Performance">NAT gateways</a>, and so on. But turning on VPC Flow Logs everywhere might <em>not</em> fly in larger environments — because these logs cost real money — and no one wants to pay for something unless they understand the value of doing so.</p> <p>So, if you’re not running mega-behemoth VPC infrastructure, then it’s totally possible to turn on logging everywhere without causing your CFO to barf. It also happens to be the easiest way to get started. You can begin producing logs within a few minutes, just by flipping a switch on each of your VPCs. Simply navigate to <a href="https://ap-southeast-1.console.aws.amazon.com/vpc/home">Your VPCs</a>, select a VPC and then hit the “Enable Flow Log” button from the “Flow Logs” tab in the detail pane. Follow the <a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-s3.html">instructions to set up flow logs</a> to publish to an S3 bucket, and away you go.</p>
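<p>If you’d rather script that step, it’s a single API call per VPC. Here’s a minimal sketch using boto3; the VPC ID and bucket ARN are placeholders:</p> <pre><code>
import boto3

ec2 = boto3.client("ec2")

# Publish ALL traffic (accepted and rejected) for one VPC to an S3 bucket.
ec2.create_flow_logs(
    ResourceIds=["vpc-0123456789abcdef0"],   # placeholder VPC ID
    ResourceType="VPC",
    TrafficType="ALL",
    LogDestinationType="s3",
    LogDestination="arn:aws:s3:::my-flow-log-bucket/vpc-logs/",  # placeholder
)
</code></pre> <p>Loop that over your VPC IDs (or pass several in one call) and you have the “gimme everything” approach in a few lines.</p>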
<div as="Promo"></div> <h2 id="monitoring-your-cloud-edge-easier-said-than-done">Monitoring Your Cloud Edge: Easier Said Than Done</h2> <p>While setting up global VPC Flow Logs takes just a few minutes, building logs that only capture inter-VPC and internet flows can take a bit more time and thought. Gaining visibility here is one of the most powerful steps you can take to optimize and secure your cloud — helping you defend against cyber attacks, improve your customers’ digital experience, and save money on your data transfer bill.</p> <p>But we have a small problem. AWS doesn’t easily allow you to configure flow logs for this use case. You simply can’t configure flow logging on internet gateways, which would seem like an obvious place to do so. Internet gateways aren’t manageable or monitorable constructs in your VPC; they just <em>exist</em> as route targets in your VPC’s route table. Flow logs are generated <em>only</em> from VPCs, subnets, and network interfaces.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4CBWQLqrumY2rU0Wc58PY7/75c94a0b593aed4204e358f9d9ef1025/aws-no-flow-internet-gateways.png" style="max-width: 500px; margin-bottom: 15px;" class="image center no-shadow" thumbnail /> <div style="max-width: 500px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">AWS does not let you configure flow logging for internet gateways</div> <p>Nonetheless, this limitation provides us with enough flexibility to monitor almost any traffic flow, so let’s dig in.</p> <h2 id="configuring-a-transit-vpc">Configuring a Transit VPC</h2> <p>At a high level, our goal is to insert a flow log source into the data path between your VPCs and the internet. This is sometimes called a “transit VPC” or “transit hub”: essentially a VPC that acts as an aggregation point for your other VPCs and infrastructure like site-to-site VPNs and Direct Connects. Once the transit VPC is configured, we just need to turn on flow logging on the specific peering interfaces connecting the transit hub with your workload VPCs.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3AmiVLymrdcLwl4Zekfw6t/cee2223f87bc7c63142b6cb1aae313ec/aws-transit-vpc.png" style="max-width: 800px; margin-bottom: 15px;" class="image center no-shadow" thumbnail /> <div style="max-width: 600px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Setting Up a "transit VPC" or "transit hub" is one way to insert a flow log source into the data path between your VPCs in AWS.</div> <p>Here are the steps we’ll want to follow:</p> <p><strong>Build a new VPC</strong>. Build your new transit VPC in the same account or a different account from your original VPC(s). If you choose to build this in a different account, then you’ll have to also address cross-account IAM policies (which are beyond the scope of this post). Don’t use the Launch VPC Wizard from the VPC Dashboard — instead, head over to “Your VPCs” and click on “Create VPC” to ensure that AWS doesn’t try to preconfigure needless cruft that you’ll have to delete later. Lastly, attach an internet gateway to the VPC.</p> <p><strong>Add subnets</strong>. Add public and private subnets to the new transit VPC. Private subnets are useful in transit hub architectures as these are great places for cloud and network operations teams to place bastion hosts and other shared services that need to be in the center of the action. Ensure that the subnets you configure here don’t overlap with any of the subnets already configured in your existing VPCs.</p> <p><strong>Establish VPC peering from your new VPC to your existing VPC(s)</strong>. VPC peering allows you to establish private connections between one or more VPCs. It takes just a few seconds, and the <a href="https://docs.aws.amazon.com/vpc/latest/peering/peering-configurations.html">AWS docs</a> to set this up are easy to follow. If your existing VPCs already have internet gateways, don’t delete them! Keep them around until you’re sure that you’ve got everything routing properly.</p> <p><strong>Configure a NAT gateway (optional)</strong>. Adding a NAT Gateway to the public subnet of your transit hub VPC is a critical step in allowing private instances access to the internet. If you don’t have any private subnets, you can skip this step.</p> <p><strong>Configure routing</strong>. The secret sauce to making a transit hub architecture work is in how you configure your routing. First, you’ll want to set up routes in the transit VPC. Configure a default route in your public subnet to point all traffic towards the internet gateway. Then, in the private subnet, you’ll want to configure a default route to point at the NAT gateway that sits in the public subnet. I’d recommend spinning up an instance in each to make sure that public and private connectivity are working.</p> <p>Once you’ve set up your transit VPC, then move down into your existing VPCs. I’d recommend proceeding with caution here — it might be worth setting up a test VPC with a test instance inside it to ensure you’ve got everything nailed down just so. In these VPCs, you’ll need to set up a new default route that points at the peering interface installed previously.</p>
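<p>If you’re scripting the build, the routing step boils down to three default routes. A rough sketch with boto3; all route table, gateway, and peering IDs are placeholders for the resources created in the earlier steps:</p> <pre><code>
import boto3

ec2 = boto3.client("ec2")

# Transit VPC, public subnet: default route to the internet gateway.
ec2.create_route(RouteTableId="rtb-public-placeholder",
                 DestinationCidrBlock="0.0.0.0/0",
                 GatewayId="igw-placeholder")

# Transit VPC, private subnet: default route to the NAT gateway.
ec2.create_route(RouteTableId="rtb-private-placeholder",
                 DestinationCidrBlock="0.0.0.0/0",
                 NatGatewayId="nat-placeholder")

# Each workload VPC: default route to its peering connection with the hub.
ec2.create_route(RouteTableId="rtb-workload-placeholder",
                 DestinationCidrBlock="0.0.0.0/0",
                 VpcPeeringConnectionId="pcx-placeholder")
</code></pre> <p>The same caution applies whether you click or script: test with a throwaway instance before repointing production routes.</p>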
<p><strong>Configure flow logs</strong>. Setting up flow logs now is just a simple matter of choosing the interfaces through which traffic is aggregated. In this example, we’d choose one of the two peering interfaces that connect the Application VPC to the Transit Hub VPC and, optionally, the NAT gateway interface.</p> <h2 id="monitoring-transit-gateways-with-vpc-flow-logs">Monitoring Transit Gateways with VPC Flow Logs</h2> <p>Many companies have moved away from plain old VPC peering towards the <a href="https://aws.amazon.com/transit-gateway/">AWS Transit Gateway</a>. For a nominal hourly charge (plus data transfer costs), Transit Gateways replace transit hub architectures with a more scalable service that doesn’t require users to set up a mesh of peering links. Instead, Transit Gateways act as cloud routers by maintaining their own routing tables and attachments. Transit Gateways can interconnect not only your VPCs, but also support attachments to other Transit Gateways in different regions, as well as Direct Connects and site-to-site VPN connections. Taken together, this means that Transit Gateways are a game-changer for organizations that are building out their <a href="https://www.kentik.com/resources/hybrid-cloud-network-observability-gap-whitepaper/">cloud infrastructure</a> at a large scale.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2rnKZMNVR405j4GcehRl6F/6fd7b2a069ac426b3305e69692da64a0/aws-transit-gateways.png" style="max-width: 900px; margin-bottom: 15px;" class="image center no-shadow" thumbnail /> <div style="max-width: 600px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Transit Gateways replace transit hub architectures with a more scalable service that doesn’t require users to set up a mesh of peering links.</div> <p>To set up north-south and inter-VPC Flow Logs with a Transit Gateway, all you need to do is set up flow logs on the attachment links that connect your VPCs to your Transit Gateway. One way to quickly instrument this change is to configure flow logs on an interface basis. To do so, just navigate to the “EC2 > Network Interfaces” page in the AWS console and search for the string “gateway.” Select your gateway interfaces and then choose “Create Flow Log” from the “Actions” menu.</p>
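<p>The same console trick can be scripted. A minimal sketch with boto3: the description match is just a heuristic (adjust it to however your attachment ENIs are actually described), and the bucket ARN is a placeholder:</p> <pre><code>
import boto3

ec2 = boto3.client("ec2")

# Heuristic equivalent of searching "gateway" on the Network Interfaces page:
# match ENIs whose description mentions a gateway attachment.
enis = ec2.describe_network_interfaces(
    Filters=[{"Name": "description", "Values": ["*Gateway*", "*gateway*"]}]
)["NetworkInterfaces"]

if enis:
    ec2.create_flow_logs(
        ResourceIds=[eni["NetworkInterfaceId"] for eni in enis],
        ResourceType="NetworkInterface",
        TrafficType="ALL",
        LogDestinationType="s3",
        LogDestination="arn:aws:s3:::my-flow-log-bucket/tgw-attachments/",
    )
</code></pre> <p>Review the matched interfaces before creating the logs; description text is free-form, so a broad wildcard can catch more ENIs than you intend.</p>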
<p>Note that while you can’t configure flow logs on the interfaces that connect the transit gateway to other Transit Gateways (or Direct Connects and site-to-site VPNs), this isn’t usually a big deal. Flows through this infrastructure generally originate or terminate in one of your attached VPCs, which means that traffic logs will still be created. This can become slightly problematic if you want to capture traffic to public AWS services that are carried over a public virtual interface through your Direct Connect infrastructure (i.e., traffic from your on-prem infrastructure to public AWS S3 IP addresses or other publicly available AWS services). But even this has a workaround: you can usually set up NetFlow, sFlow or IPFIX on your on-prem routers or VPN concentrators to log this traffic.</p> <h2 id="analyze-your-vpc-flow-logs">Analyze Your VPC Flow Logs!</h2> <p>Now that you’ve set up flow logs and they are accumulating in an S3 bucket, it’s time to start digging in. You can certainly build your own analysis tool (<a href="https://docs.aws.amazon.com/athena/latest/ug/vpc-flow-logs.html">and AWS has a good primer</a>), but I recommend elevating your game by adopting a comprehensive network observability strategy with Kentik. Just <a href="#demo_dialog" title="Get a Kentik demo">sign up now for a short demo</a> and learn more about how you can use Kentik to secure and optimize your cloud.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p>To go further, watch Dan’s webinar <a href="https://www.kentik.com/go/webinar-troubleshooting-aws-environment/">How to Troubleshoot Routing and Connectivity in Your AWS Environment</a>.</p><![CDATA[How to Meet the Need for APM + NPM]]><![CDATA[Integrated application and network observability enables network teams, SREs, and developer teams to deliver reliable services with great user experience.]]>https://www.kentik.com/blog/how-to-meet-the-need-for-apm-npmhttps://www.kentik.com/blog/how-to-meet-the-need-for-apm-npm<![CDATA[Nick Stinemates, Daniella Pontes]]>Wed, 31 Mar 2021 12:00:00 GMT<p>In our recent blog post, “<a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/">The Network Also Needs to be Observable</a>,” we made a case for network observability as an important facet of observability platforms.
Here we will dive into the marriage of application and network observability as the means to keep up with today’s “always-great” experience expectations.</p> <p>Networks are increasingly complex and critical to the success of the applications and services that run on them. Add to that customer expectations at an all-time high — the network needs to be up, running, and performing well. Always. Maintaining fast and reliable networks is a must, but also a growing challenge.</p> <p>As an industry, we started addressing performance at the top of the tech stack:</p> <ul> <li>Software systems on cloud or cloud platforms</li> <li>Applications on containers or container platforms like Rancher or OpenShift</li> </ul> <p>Interestingly, while even business operations have been addressed to improve reliability, not enough attention has been given to the network. Yet small network performance degradations can wreck user experience, not to mention cause downtime. It is clear: making the network observable has to be part of the solution for providing a great experience.</p> <h4 id="so-what-is-network-observability-the-ability-to-answer-any-question-about-your-network">So what is network observability? The ability to answer <em>any</em> question about your network.</h4> <p>As the industry seeks to address the issues of network complexity, criticality, and customer expectations, at Kentik, we believe the answer is in knowledge: the ability to answer any question about your network, whether corporate, cloud, or internet. Wherever your traffic goes or your data resides, you need to see and know about it. And to achieve that, you need granular, context-rich, comprehensive network telemetry.</p> <h3 id="kentik-firehose-enables-network-observability-data-integration">Kentik Firehose enables network observability data integration</h3> <p><a href="https://www.kentik.com/product/firehose">Kentik Firehose</a> is an essential piece of the APM + NPM puzzle. As Kentik’s streaming API to access enriched network observability data, Kentik Firehose enables organizations to surface network data in traditional observability tools. It also gives organizations the ability to integrate network data into their observability platforms for business and DevOps intelligence.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7jy0RpkzHLrTyxw67uUHji/7b525692257c153fe703adcc86117647/firehose-20201210.png" style="max-width: 800px; margin-bottom: 15px; padding: 20px 20px 0px 20px;" class="image center" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Kentik Firehose streaming API</div> <p>Kentik takes any network telemetry, such as:</p> <ul> <li>Traditional flow, SNMP, and BGP</li> <li>Modern cloud VPC Flow Logs and streaming telemetry</li> <li>Host-based packet capture (eBPF for containerized environments)</li> </ul> <p>And enriches that data with context:</p> <ul> <li>Application</li> <li>User</li> <li>Geo</li> <li>Threat feeds</li> <li>Or anything else you could want</li> </ul> <p>And uses it to provide proactive insights, quick answers, and automated workflows.</p>
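<p>To picture what “enriched” means in practice, here is a hypothetical flow record before and after enrichment. The field names are illustrative only, not Kentik’s actual schema:</p> <pre><code>
raw_flow = {
    "src_ip": "10.20.1.15", "dst_ip": "198.51.100.7",
    "src_port": 55612, "dst_port": 443,
    "protocol": "tcp", "bytes": 183422, "packets": 161,
}

# Hypothetical enrichment: the same record with business and routing
# context joined in, so "which app? which user? which geo?" can be
# answered directly from the record itself.
enriched_flow = dict(
    raw_flow,
    application="checkout-api",   # from orchestration metadata
    user="store-frontend",        # from service identity
    dst_geo="DE",                 # from geolocation
    dst_as_name="ExampleCDN",     # from BGP/routing data
    threat_match=None,            # from threat feeds
)
</code></pre>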
<p>All of this data gets normalized and stored in our platform, the <a href="https://www.kentik.com/product/kentik-platform/" title="Learn More About Our Platform: the Kentik Network Observability Cloud">Kentik Network Observability Platform</a>, where you can ask any question and get the answers you need about your network, fast! Plus, Kentik does this at scale: at peak load, a Kentik cluster can ingest and enrich 4 million records per second while maintaining sub-second query response times.</p> <p>In addition to storing the data for use in Kentik, network observability data can be streamed to additional systems and service endpoints using Kentik Firehose. Below is an example of integrated application and network observability, with Kentik data surfaced in New Relic’s dashboards.</p> <img src="//images.ctfassets.net/6yom6slo28h2/21aqUQZ74WY8EnCCs58EMk/b81b363b5bd22ccc23f9cf7efbef8bc5/newrelic-dashboard-kentik-data.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">New Relic dashboard with Kentik network data</div>
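<p>What you do on the receiving end is up to you. Purely as an illustration (the local endpoint, JSON delivery format, and field names below are assumptions for this sketch, not Kentik’s documented interface), a tiny receiver for streamed records might look like this:</p> <pre><code>
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class FlowReceiver(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        for record in json.loads(body):  # assumes a JSON array of records
            # Route enriched records wherever they add value:
            # dashboards, a message bus, a data warehouse, alerting...
            print(record.get("application"), record.get("bytes"))
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), FlowReceiver).serve_forever()
</code></pre>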
<h3 id="netops-and-devops-collaboration">NetOps and DevOps Collaboration</h3> <p>One of the key benefits of integrating application and <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/" title="Kentik founder and CEO, Avi Freedman explains Network Observability">network observability</a> comes to light when NetOps and DevOps teams need to collaborate to pinpoint network issues affecting application performance.</p> <p>This cross-functional collaboration is crucial in today’s complex application environments with many services and transactions involved in a successful application experience. Success is basically defined in one way: when every part works as expected. But what if that’s not the case? Combinations of possible failing or degrading parts can be overwhelming. Is the network slow somewhere in the path? Or is the problem somewhere in the system or application service components? What if it’s the database that is slow and transactions are queuing up and creating a cascading effect on latency? The reality is that the problem can be anywhere: application, service mesh, API gateway, authentication, Kubernetes, backbone networking, WAN, cloud networking, firewall, etc.</p> <p>Being able to correlate network data with application and system data leads to a much faster journey to the root cause and resolution.</p> <h3 id="use-cases-and-benefits-are-limitless">Use Cases and Benefits Are Limitless</h3> <p>Network observability can be used for performance troubleshooting, capacity planning, security protection, and business analytics. Today’s applications are all digital, and they run across all networks.</p> <p>With <a href="https://www.kentik.com/resources/kentik-firehose/">Kentik Firehose</a>, organizations have full access to network telemetry data to use and extract value in their unique ways, including to:</p> <ul> <li>Build better products</li> <li>Support better workflows</li> <li>Implement more reliable automation</li> <li>Provide great user experience</li> </ul> <p>The benefits of integrating network observability data into other systems grow with an organization’s vision to use data as a foundation to achieve its goals.</p> <p>If you are interested in learning more about how to use Kentik Firehose, please check our webinar <a href="/go/webinar-firehose/" title="Webinar Replay: Enhance Application Performance with Network Observability">Enhance Application Performance with Network Observability - Introducing Kentik Firehose</a>.</p> <p>You can also <a href="https://www.kentik.com/go/newrelic/">sign up for early access to New Relic One</a>, Kentik and New Relic’s partnership to deliver network performance telemetry and full network observability.</p><![CDATA[How G-Core Labs Uses Kentik Synthetics to See Across All Networks]]><![CDATA[G-Core Labs uses Kentik Synthetics. With continuous, proactive monitoring in place, find out how G-Core Labs achieves fast, accurate visibility, reduced infrastructure costs, automation to save time, and the ability to uphold SLAs and maintain quality of service for customers.]]>https://www.kentik.com/blog/how-g-core-labs-uses-kentik-synthetics-to-see-across-all-networkshttps://www.kentik.com/blog/how-g-core-labs-uses-kentik-synthetics-to-see-across-all-networks<![CDATA[Michelle Kincaid]]>Wed, 31 Mar 2021 04:00:00 GMT<p>Kentik customer G-Core Labs is a global cloud and edge provider. The company has more than 100 points of presence (PoPs) located in data centers in more than 65 cities around the world.</p> <p>With a distributed infrastructure on every continent, it is absolutely critical for G-Core Labs to have end-to-end, global network monitoring. They not only need to know what is happening across their own network, but also need to be aware of incidents that occur outside of their network, for example, with upstream operators, whose performance could impact their own customers’ experiences.</p> <p>In a <a href="/resources/case-study-g-core-labs/">new case study</a> published today, we detail how G-Core Labs tried three other commercial synthetic monitoring solutions before turning to Kentik Synthetics to effectively, proactively and continuously monitor its distributed infrastructure.</p> <p>Read the <a href="/resources/case-study-g-core-labs/">full case study</a> to see how G-Core Labs uses Kentik Synthetics to achieve:</p> <ul> <li>Continuous, proactive monitoring for fast, accurate visibility</li> <li>Reduced infrastructure costs</li> <li>Automation to save the network team’s time</li> <li>The ability to uphold SLAs and maintain quality of service for customers</li> </ul> <p>You can also find out more about <a href="https://www.kentik.com/product/synthetics">Kentik Synthetics</a> or watch our recent <a href="https://www.kentik.com/go/webinar-synthetics101-2021-03/">Synthetic Testing 101 webinar</a> to see Kentik Synthetics in action.</p><![CDATA[Troubleshoot and Secure Your Cloud with AWS VPC Traffic Mirroring and Kentik]]><![CDATA[AWS announced support for VPC Traffic Mirroring to additional AWS instance types.
The Kentik Network Observability Platform provides visibility and insights into AWS mirrored traffic.]]>https://www.kentik.com/blog/see-understand-aws-vpc-mirrored-traffic-with-kentikhttps://www.kentik.com/blog/see-understand-aws-vpc-mirrored-traffic-with-kentik<![CDATA[Daniella Pontes, Ted Turner]]>Wed, 10 Mar 2021 05:00:00 GMT<p>AWS announced support for <a href="https://aws.amazon.com/blogs/networking-and-content-delivery/using-vpc-traffic-mirroring-to-monitor-and-secure-your-aws-infrastructure/">VPC Traffic Mirroring</a> across a broader range of their infrastructure beyond Nitro-based EC2 instances. VPC Traffic Mirroring is now also supported on instance types such as C4, D2, G3, G3s, H1, I3, M4, P2, P3, R4, X1, and X1e. At Kentik, we are very excited about AWS’s commitment to providing direct access to VPC network traffic, addressing the growing need to observe traffic as a foundation for delivering fast and secure services.</p> <p>One key benefit of VPC Traffic Mirroring is that it provides visibility into VPC traffic without requiring packet-forwarding agents. With this feature, customers can send a copy of the inbound and outbound traffic passing through an instance’s network interfaces to a <a href="https://www.kentik.com/resources/five-steps-to-network-observability-nirvana-webinar/">network observability</a> solution to troubleshoot performance, connectivity, and security issues.</p>
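<p>Mechanically, enabling this means pointing a mirror session at the ENI you care about. A minimal sketch with boto3, assuming a mirror target (where copies go) and a mirror filter (what gets copied) already exist; all IDs are placeholders:</p> <pre><code>
import boto3

ec2 = boto3.client("ec2")

# Mirror traffic from one instance ENI to an existing mirror target,
# keeping only the packets selected by the mirror filter.
ec2.create_traffic_mirror_session(
    NetworkInterfaceId="eni-0123456789abcdef0",     # source ENI (placeholder)
    TrafficMirrorTargetId="tmt-0123456789abcdef0",  # placeholder target
    TrafficMirrorFilterId="tmf-0123456789abcdef0",  # placeholder filter
    SessionNumber=1,  # priority when several sessions share a source
)
</code></pre>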
<img src="//images.ctfassets.net/6yom6slo28h2/nUCKrPVgJAn20fEKaMA8W/2ca982fac71994dec5576f60c9863437/aws-mirrored-traffic-data-explorer.png" style="max-width: 800px; margin-bottom: 10px;" class="image center" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">AWS Mirrored Traffic Granular View using Kentik Data Explorer</div> <h2 id="kentik-network-observability-platform-provides-the-answers-you-need-regarding-your-aws-vpc-traffic">Kentik Network Observability Platform provides the answers you need regarding your AWS VPC traffic</h2> <p>Network observability is Kentik’s DNA. With Kentik, you can <a href="https://www.kentik.com/why-kentik-network-observability-and-monitoring/" title="Why Kentik?">answer any question related to any network</a>. To deliver network observability, Kentik ingests granular data of multiple types and sources from public clouds and private infrastructures. Kentik takes not only mirrored packets from AWS instances but also <a href="https://www.kentik.com/resources/aws-vpc-flow-logs-for-kentik/" title="AWS VPC Flow Logs for Kentik">VPC flow logs</a>, metrics, events, metadata, and more. In hybrid deployments, Kentik also collects flow data (e.g., NetFlow, sFlow), BGP routing, streaming telemetry, SNMP, and context data. We bring in all of this data to enable fast, streamlined troubleshooting and analysis of events both in real time and on historical data.</p> <div as="Promo"></div> <p>Kentik gains deep insight from packet flows without applying compute-intensive DPI processing. This more efficient approach supports important use cases:</p> <ul> <li>Executing root-cause analysis on a performance issue</li> <li>Understanding sophisticated network attacks</li> <li>Detecting and stopping compromised workloads</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/1nyLfwAMLZcABsjss8uvbP/2ec1b252d37a0360c7c4277ef912308b/aws-mirrored-traffic-details.png" style="max-width: 800px;margin-bottom: 10px;" class="image center" thumbnail /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">AWS Mirrored Traffic Details Visualization with Kentik.</div> <p>If you had a platform that could show your traffic’s specific characteristics and attributes, what would you do with it? Security evaluation? Network access rules validation? Performance evaluation to get quickly to the root cause of a degradation? Architecture auditing? Capacity planning?</p> <p>These and many other use cases are supported with the <a href="https://www.kentik.com/product/kentik-platform/" title="Learn More About Our Platform: the Kentik Network Observability Cloud">Kentik Network Observability Platform</a> because we ingest all your network telemetry and process it at a massive scale and high speed. We gather, store, enrich, and correlate granular data about on-premises and cloud networks, host and container flow, internet routing, performance tests, device metrics and more. When you need to know anything about the network or investigate any issue regarding your traffic, we use this data to give you timely, reliable answers in a meaningful way.</p> <p><a href="https://www.kentik.com/get-demo/">Give us a shout</a> and let us know your use cases for Amazon VPC Traffic Mirroring and what answers you are looking for.</p> <p>#networkobservability is what Kentik is all about.</p><![CDATA[Cascaded Lag: The Cumulative Impact of Latency on Applications and User Experience]]><![CDATA[Cloud solutions architect Ted Turner describes how Kentik Synthetics easily uncovers network-related latencies that often cascade into cloud application performance problems -- DevOps teams take note. ]]>https://www.kentik.com/blog/cascaded-lag-impact-of-latency-on-applications-user-experiencehttps://www.kentik.com/blog/cascaded-lag-impact-of-latency-on-applications-user-experience<![CDATA[Ted Turner]]>Mon, 08 Mar 2021 05:00:00 GMT<p>Latency is becoming more visible. <a href="https://www.kentik.com/go/webinar-cloud-performance/">My friend Anil</a> just posted this graphic:</p> <img src="//images.ctfassets.net/6yom6slo28h2/4LK8EsJTsEcTiyObeToA9b/e75ccefe48e37044e507d12379d54985/latency-search.png" style="max-width: 800px" class="image center" /> <p>I shared a few words on this with my team, and they suggested this blog post. I’ve been addressing latency as part of my job description for as long as I can remember. When I onboarded here at Kentik, they said we had a cool new product coming out to help uncover network-related latency. I was ecstatic! Over the last 3 years of my last gig, I built a little bundle we could deploy with tools across public cloud instances or in our data center.</p> <p>The idea of that little bundle was to figure out where latency was occurring. Was it network or application or OS related? Latency is what we were always trying to figure out.</p> <p>Kentik announced its <a href="https://www.kentik.com/resources/synthetic-monitoring/">Synthetic Monitoring</a> solution shortly after I arrived, and it’s been a hit across the Kentik user community.
Necessity is the mother of invention, and I fell in love with Kentik Synthetics right at first release. As the product has evolved month over month, my love has grown stronger.</p> <p>Kentik Synthetics was built from lots of feedback from our network-savvy community, and it has proven useful for these network-specialist teams. However, we are now finding that our integrations with APM suites also add value to the DevOps side of the world. These teams are really two sides of the same coin, with APM (and application observability) on heads and network observability on tails. (Over the years I’ve used the same analogy for network and security teams.)</p> <div class="pullquote right" style="max-width: 340px">If the network is slow, the apps and user experience are going to be slow. But what if the network is not slow? What if it’s the database that’s slow?</div> <p>If the network is slow, the apps and user experience are going to be slow. But what if the network is not slow? What if it’s the database that’s slow? Transactions typically taking 20ms on a database can become slower. Let’s consider the database scenario for now, but this scenario can be applied to any tier component: application, service mesh, API gateway, authentication, Kubernetes, network backbone, WAN, LAN, switch, firewall, etc.</p> <p>When a single component becomes slower — i.e., it has a longer transaction time than calculated by the design or engineering team — the impact on all upstream components is usually plainly apparent. The key is to examine each component in the stack and see which became slower first.</p> <p>Going back to the database scenario, a 20ms transaction that slows to 40ms will not only add latency, but will usually cause the rest of the stack to queue up additional transactions in parallel, a kind of cascading effect. Queued transactions quickly become the failure point in the application stack, and they usually take out multiple components along the way — firewalls, service mesh, API gateway, etc.</p> <p>If the database were designed to handle 100 concurrent transactions at 20ms per transaction, then when the majority of the transactions lag to 40ms, we can predict that there will be 200 concurrent transactions, simply due to latency. These 200 concurrent transactions are where the cascading effect occurs and often blinds teams looking to make a change and fix something. Stated more simply, the latency cascade effect is concurrency. The concurrency cascade effect is overconsumption of downstream resources.</p>
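<p>That back-of-the-envelope prediction is just Little’s Law: average concurrency equals arrival rate times latency. A quick sketch:</p> <pre><code>
def concurrency(arrivals_per_second, latency_seconds):
    """Little's Law: average transactions in flight = arrival rate x latency."""
    return arrivals_per_second * latency_seconds

# 5,000 transactions/sec at 20 ms each: the designed-for 100 in flight.
print(concurrency(5000, 0.020))   # 100.0
# Same demand at 40 ms: concurrency doubles without any change in load.
print(concurrency(5000, 0.040))   # 200.0
</code></pre> <p>Nothing about demand changed; only latency did, and the in-flight count (and the pressure on every downstream resource) doubled with it.</p>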
<p>In my past work life, we would find latency at the transaction or packet level, and start to backchain through the systems, until we found no more latency. Sometimes the database was the cause all the way in the backend. Sometimes it was the application components in the stack. Sometimes it was the network components. No matter where the latency occurred, we would typically find a cascade of concurrent connections rising, with a correlating overconsumption of a finite resource. Cascading failures are definitely a thing.</p> <p>At my last gig at a big enterprise, we tested our whole application stack every week. We had thousands of changes committed every week, and needed to re-calibrate the entire stack and understand latency, concurrency, and constraints on all of our finite resources. We set a two second response SLO for a web page load (for the entire page) for 90% of the transactions. This means all operations — across the front-end web tier, all networks, application stack, authentication, authorization, and database transactions — had to complete in &#x3C; 2 seconds, or the time budget would be blown. We generally had a second rule as well: 100% of transactions needed to complete in &#x3C; 4 seconds. If these general guidelines were broken, all teams had to go back and start re-understanding their latencies, concurrencies, and constraints.</p> <p>Here at Kentik, we started our comprehensive understanding of networks via NetFlow, *flow variants, and consumption data. Over the years, we have added SNMP, streaming telemetry, and now synthetic testing. Together, this is what we call network observability: the ability to get fast answers to any question about your network, across all networks and <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-4-telemetry-data-platform/">all telemetry</a>.</p> <p>A testament to the need to tackle cascading latency across the stack is that <a href="https://www.kentik.com/press-releases/kentik-partners-with-new-relic-network-enriched-application-insights/">we’re now partnered with New Relic</a> to make it easier to correlate application telemetry with network telemetry. Together with New Relic and other APM vendors, Kentik can help figure out where the latency originated first, and quickly identify what really needs to be fixed in the stack, out of everything that can break in a cascading outage.</p> <h3 id="final-thoughts">Final Thoughts</h3> <p>For network teams, the Kentik #NetworkObservability platform reduces mean time to innocence. Many times in a cascading failure, the network is not to blame. However, when the network is the issue, we’re usually able to pull telemetry from all network components and help point to the network component at fault. This is a win not only for network teams, but also for application and DevOps teams.</p> <p>If you’re not using Kentik yet, I urge you to <a href="https://www.kentik.com/go/get-started/">sign up for a free trial</a>. We can help with easy-to-start synthetic monitoring, without touching your network or application stack. And if you like what you see, we can help with extending to your physical networking, virtual networking, and cloud networking. When you want to correlate issues with APM suites, we do that too.</p> <p>Take us for a spin!</p> <p><button as="SignupButton">TRY KENTIK</button></p><![CDATA[Monitoring vs. Observability: Understanding the Role of Each]]><![CDATA[Explore the key differences between monitoring and observability and choose the right approach for effective system insights and troubleshooting.]]>https://www.kentik.com/blog/monitoring-vs-observability-understanding-the-role-of-eachhttps://www.kentik.com/blog/monitoring-vs-observability-understanding-the-role-of-each<![CDATA[Kevin Woods]]>Tue, 16 Feb 2021 05:00:00 GMT<p>Do you work with distributed software systems? Designed well, they’re normally more robust and reliable than single systems, but they have a more complex <a href="https://www.kentik.com/kentipedia/network-architecture/" title="Network Architecture Explained: Understanding the Basics of Modern Networks">network architecture</a>. Many teams spend long hours at the keyboard querying different tools and nodes, trying to figure out why things have failed — and we’re sure you’ve been there too.
And while it’s great that cloud providers often hide much of this complexity, they fail differently from the compute and network you control.</p> <p>You probably already use tools to monitor your network, often at an individual service level or networking layer. You may be monitoring a particular internet gateway or load balancer, or only seeing device metrics, or focusing on only flow or synthetic measurements. Additionally, if your service is cross-platform, you’ll waste even more time debugging between the various providers. Since you’re supporting complex, distributed applications and users, once you cover the basics, getting visibility synchronized between the app and network layers is important as well. The emerging principles and practices of observability help you understand what’s going on in your system to speed up your debugging and make the best decisions.</p> <p>This post explains the emergence of observability and how it relates to traditional monitoring. In addition, it will cover how network observability is a critical requirement for gaining complete visibility.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5CzNoWux53atNSynNiWoD8/6664095b0be55b7fb0c16bfaa59585b7/binoculars.jpg" style="max-width: 500px;" class="image center" /> <h2 id="distributed-systems-are-complex">Distributed Systems are Complex</h2> <p>Although distributed systems are more robust, they come with added complexity. You can debug each component individually, but network issues between services often cause problems in these complex systems. The problem you’re usually seeing is triggered by root causes a few layers or services over. This complexity is magnified if communication spans various cloud providers or on-premises machines. To better support these systems, you need a more effective way of understanding how the different components communicate. More specifically, when something goes wrong, you need to figure out the root cause as fast as possible. You achieve this by having the best possible understanding of your system without wasting time debugging each node.</p> <h2 id="what-is-observability">What is Observability?</h2> <p>Observability and monitoring are entirely different concepts. However, you may often hear the terms mixed up or used interchangeably. And you’d be forgiven if you thought the two meant the same thing. <a href="https://en.wikipedia.org/wiki/Observability">Observability measures how well you understand your system from only its external outputs</a>. It’s important to note the definition specifies observability as a measure, not a final state or an activity.</p> <p>The concept has roots in control theory, where observability describes how well a system’s internal state can be inferred from its external outputs. Unlike traditional monitoring, which is reactive, observability is proactive: it enables teams to understand system health and behavior from the outside, without tampering with the system’s internals. Tools that foster observability capture, analyze, and visualize data to enhance our understanding of complex systems.</p> <h3 id="the-three-pillars-of-observability">The Three Pillars of Observability</h3> <p>At the core of observability lie three crucial elements: logs, metrics, and distributed tracing. Logs offer a detailed record of events, helping in understanding specific actions within IT infrastructures.
Metrics are measurements collected at regular intervals, presenting the performance and health of applications, especially in microservices or distributed cloud environments. Finally, distributed tracing is essential in visualizing requests as they traverse various services, unveiling bottlenecks or failures. Together, these three pillars offer comprehensive insight into system behavior and performance.</p> <div as="Promo" index=0></div> <h2 id="understanding-observability-vs-monitoring-vs-telemetry">Understanding Observability vs. Monitoring vs. Telemetry</h2> <p>The meaning of “external outputs” is often described in the application-centric world by the three pillars of observability: metrics, logs, and distributed tracing (or sometimes MELT, when including “events”). Specifically for network observability, the output is a broad set of telemetry and metadata. Network telemetry usually includes device metrics, traffic telemetry, and synthetic telemetry as the core. We see advanced solutions combining these and other sources. Metadata for infrastructure-focused observability usually includes routing, customer, application, user, cost, DNS, IPAM, and other orchestration data. We described network telemetry (and its relationship to observability) in detail in our recent blog <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-3-network-telemetry-types/">The Network Also Needs to be Observable, Part 3: Network Telemetry Types</a>.</p> <p>Observability gives you full access to enriched data to see the inputs and activity in your infrastructure and application systems. With the right implementation, you can interact with the underlying data and signals to detect, diagnose, and repair issues as they occur.</p> <h3 id="why-is-visibility-into-your-infrastructure-important">Why Is Visibility Into Your Infrastructure Important?</h3> <div class="pullquote right" style="margin-top: 15px;">Observability increases your understanding and visibility of different components of your network and infrastructure.</div> <p>Observability increases your understanding and visibility of different components of your network and infrastructure. You might be wondering why visibility into your infrastructure is essential. Well, we’re sure you’ll agree that maintaining and updating components of a production system is a huge pain. Changing even the most minor section of the network infrastructure may cause you to feel a little sick with worry in the pit of your stomach. When you look deeper into why you felt this way, you may find it’s because, at the time, you had no idea what was happening in different parts of the system. Most documentation and fancy diagrams trying to explain how a system works are nearly always out of date. The only way to understand how information flows through your system is by observing what’s happening — right now.</p> <h2 id="what-is-monitoring">What is Monitoring?</h2> <p>Monitoring refers to the activity of capturing data and querying it in known ways. Traditional monitoring often focuses purely on data collection and query, without the combination of telemetry types and metadata to help achieve observability. These data types include metrics like network bandwidth, CPU utilization rates, memory, cache hit rates, and others. And monitoring tools are used to detect abnormal behaviors that might indicate problems.</p>
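<p>In its simplest form, monitoring is a known question asked on a schedule. Here is a toy sketch of that idea; the metric source and thresholds are purely illustrative:</p> <pre><code>
import random
import time

THRESHOLDS = {"if_utilization_pct": 90.0, "if_errors_per_min": 10.0}

def read_metrics():
    # Stand-in for a real collector (an SNMP poll, streaming telemetry, etc.).
    return {"if_utilization_pct": random.uniform(0, 100),
            "if_errors_per_min": random.uniform(0, 20)}

while True:
    sample = read_metrics()
    for name, limit in THRESHOLDS.items():
        if sample[name] > limit:
            print(f"ALERT: {name}={sample[name]:.1f} exceeds {limit}")
    time.sleep(60)  # the known question, asked once a minute
</code></pre> <p>The catch is that a check like this only ever answers the questions someone thought to encode ahead of time, which is precisely the gap observability sets out to close.</p>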
<p>Usually, these queries present as dashboards and alerts that look for well-known patterns, such as interfaces with errors or poorly performing links.</p> <p>As organizations embrace DevOps cultural principles to extend their operations maturity, retrospectives often wind up with additional monitoring deliverables as various failure modes become known patterns.</p> <p>Modern observability platforms can also support monitoring techniques and allow proactive notification and interactive analysis, turning those successful investigations into saved queries and alerts.</p> <p>As with observability platforms, there are different monitoring platforms, including user activity monitoring, application monitoring, network monitoring, and event monitoring. The type of monitoring you choose to focus on often depends on where it’s going to give your organization the most value. For example, if network monitoring is the space that causes you the most pain, you are more likely to focus on that first.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3VmETeWxCriSP5jKE26IZq/12e5f0fc7282f1f1029409ee0459729d/coin-op-viewers.jpg" style="max-width: 500px;" class="image center" /> <h2 id="optimizing-observability-in-practice">Optimizing Observability in Practice</h2> <h3 id="tracing-doesnt-always-give-you-visibility-of-your-network">Tracing Doesn’t Always Give You Visibility of Your Network</h3> <p>The three common elements of compute observability are metrics, logs and distributed tracing. However, when you’re trying to get a clearer understanding of how your network is functioning, distributed tracing and metrics alone don’t provide enough visibility.</p> <p>This is where network observability fits in. You can learn more about the specifics of network observability from <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/">The Network Also Needs to be Observable, Part 1</a>. The aim is to gather all types of telemetry from all networks, along with business metadata, and to use it to provide the most valuable insights and action-focused workflows to help the teams on the networking front lines.</p> <h3 id="observability-is-useful-for-more-than-debugging">Observability is Useful for More than Debugging</h3> <div class="pullquote right" style="margin-top: 15px; max-width: 330px;">Observability is very useful for finding the root cause of issues &mdash; fast.</div> <p>Observability is very useful for finding the root cause of issues—fast. However, after implementing different levels of observability in multiple systems, there are a few other benefits just as valuable. First, it’s great for human-assisting workflows such as <a href="https://www.kentik.com/blog/capacity-planning-done-right/">capacity planning done right</a>. One of the hardest parts of designing a system is planning for capacity. If you understand how your capacity requirements have grown over time, you can make a more informed decision. No more guessing.</p> <h3 id="you-need-to-collect-the-data-and-use-it">You Need to Collect the Data and Use It</h3> <p>Worrying less about making changes and being able to solve issues quickly sounds great. However, there’s a catch — you need to collect all the telemetry data and use it. Tools like Kentik make this process easier by <a href="https://www.kentik.com/product/kentik-platform/">automating most of the collection of the data</a> you need.
And once you have the data, Kentik can alert you to any abnormal behavior and give you relevant visualizations and <a href="https://www.kentik.com/kentipedia/network-performance-monitoring/">metrics to understand</a> what is going on.</p> <h2 id="the-relationship-between-monitoring-and-observability-information-vs-insight">The Relationship Between Monitoring and Observability: Information vs. Insight</h2> <p>Monitoring provides information and visibility (especially around questions you knew were important to ask). But observability brings you deep insights into how your application, infrastructure, and network perform, all from external outputs, and available for both novice and expert humans to explore. The classic application and DevOps-focused outputs are metrics, logs, and distributed tracing. However, you need to know about orchestration and control planes and other business metadata for a more complete understanding. Adding network observability and seeing a wide variety of infrastructure telemetry along with the required context makes this even more valuable, and not just to network teams. Instead of querying each part of the system to debug issues, all the information you need is easy to query and integrate to support your regular and unscheduled designs, plans, and operational workflows. It may take some trial and error to get the right amount of data with the right amount of detail.</p> <p>But it’s up to you to collect this data and use it to help with your understanding. Tools like Kentik at the network layer, and New Relic at the application layer, can help collect and organize this data so you can focus on making the right decisions instead of spending all your time collecting data and building visualizations.</p> <h2 id="monitoring-vs-observability-which-is-better">Monitoring vs. Observability: Which is Better?</h2> <p>The choice between monitoring and observability largely depends on the specific needs of DevOps teams and the nature of the IT environment. Traditional monitoring tools excel in environments where IT teams know what issues to expect and can define explicit alerting thresholds. They’re straightforward, offer clear data points, and work best when the IT infrastructure operates under predictable conditions.</p> <p>On the other hand, an observability platform is indispensable for complex, dynamically changing environments like today’s enterprise and service provider networks. Observability provides deeper insights and holistic understanding, especially in cloud-native landscapes or microservices architectures. So, while monitoring informs about known issues, observability shines in diagnosing unexpected or unknown challenges. Ideally, a balanced approach integrating both offers the best of both worlds to NetOps and DevOps teams.</p> <h2 id="get-the-benefits-of-network-observability-with-kentik">Get the Benefits of Network Observability with Kentik</h2> <p>The Kentik Network Observability Cloud empowers network pros to plan, run, and fix any network.
To see how Kentik can bring the benefits of network observability to your organization, start a <a href="#signup_dialog" title="Request a Free Trial of Kentik">free trial</a> or <a href="#demo_dialog" title="Request a Demo of Kentik">request a personalized demo</a> today.</p> <p><button as="SignupButton">TRY KENTIK</button></p><![CDATA[Myanmar Goes Offline During Military Coup]]><![CDATA[In the past 24hrs, Myanmar experienced a military coup d'état during which prominent government figures like Nobel laureate Aung San Suu Kyi were detained and internet service was almost entirely blacked out. Here is what the shutdown looked like in our data (with annotations for timing) for Myanmar's big four: Myanma Posts and Telecommunications (MPT), Telenor Myanmar, Ooredoo Myanmar, and Mytel.]]>https://www.kentik.com/blog/myanmar-goes-offline-during-military-couphttps://www.kentik.com/blog/myanmar-goes-offline-during-military-coup<![CDATA[Doug Madory]]>Mon, 01 Feb 2021 21:00:00 GMT<p>In the past 24 hours, Myanmar experienced a military coup d’état during which prominent government figures like Nobel laureate <a href="https://www.npr.org/2021/02/01/962758188/myanmar-coup-military-detains-aung-san-suu-kyi-plans-new-election-in-2022">Aung San Suu Kyi</a> were detained and internet service was almost entirely disabled. It was a shocking turn in a once-promising story of a country’s journey from pariah state to accepted member of the international community.</p> <p>Myanmar (also called Burma) is no stranger to internet shutdowns. Prior to the infamous <a href="https://www.kentik.com/blog/from-egypt-to-uganda-a-decade-of-internet-shutdowns/">Egyptian internet shutdown</a> a decade ago, Myanmar’s shutdown of 2007 (detailed <a href="https://opennet.net/research/bulletins/013">here</a>) was perhaps the most notable instance of a government using a large-scale communications blackout as an act of political repression.</p> <p>However, in 2011, the government of Myanmar <a href="https://www.nytimes.com/2011/11/30/world/asia/in-myanmar-government-reforms-win-over-countrys-skeptics.html">implemented political reforms</a> that sought to restore democracy and relations with the international community. A liberalization of the telecom sector followed, and Myanmar held an <a href="https://www.pri.org/stories/2013-06-15/burma-auction-two-mobile-phone-licenses-largely-offline-country">open international auction</a> for telecom licenses which resulted in two new mobile operators: Telenor of Norway and Ooredoo of Qatar. Later, a <a href="https://en.nhandan.com.vn/scitech/item/7464402-mytel-becomes-third-biggest-telecoms-operator-in-myanmar.html">third mobile operator won</a> a concession to operate in Myanmar: Mytel, backed by Vietnamese military-owned Viettel. The new mobile operators helped build an <a href="https://blogs.oracle.com/internetintelligence/telenor-activates-historic-link-in-myanmar">extensive telecom infrastructure</a>, resulting in a transformative internet boom for the country.</p> <p>Around this time I would often use Myanmar as a model that Cuba could follow to rapidly liberalize its internet and telecom industry.
In 2014, I was interviewed on NPR where I <a href="https://www.pri.org/stories/2014-12-19/few-suggestions-cuba-how-get-real-about-access-internet">made that case</a>, and then, with a nod to the Cuban government’s origins, <a href="https://blogs.oracle.com/internetintelligence/what%e2%80%99s-next-for-cuba">I wrote</a>,</p> <blockquote> <p>there is reason to believe that the impressive growth being experienced currently in Myanmar could be replicated in Cuba — but it would require a capitalist approach for one of the world’s last remaining communist countries. In Cuba, such a mind shift would be … revolutionary.</p> </blockquote> <p>Last week, Peter Micek of Access Now and I wrote in <a href="https://www.kentik.com/blog/from-egypt-to-uganda-a-decade-of-internet-shutdowns/">a blog post </a>marking the tenth anniversary of the shutdown in Egypt,</p> <blockquote> <p>With ten years of hindsight, we can see that the Egyptian internet shutdown was not an anomaly, but rather a harbinger of things to come, namely, an era where internet communications would be directly threatened by repressive governments in an effort to control their own people.</p> </blockquote> <p>Today’s incident in Myanmar joins internet shutdowns in <a href="https://twitter.com/DougMadory/status/1351285694380060694">Uganda</a> and <a href="https://twitter.com/DougMadory/status/1354581571840389127">Cuba</a> just this month. When taken together, these episodes further underscore that the battle against internet shutdowns is far from over.</p> <p>Here is what the shutdown looked like in our data (with annotations for timing) for Myanmar’s big four: Myanma Posts and Telecommunications (MPT), Telenor Myanmar, Ooredoo Myanmar, and Mytel. Some providers in Myanmar managed to stay online during the blackout, such as 5BB Broadband (AS133384) and Myanmar Broadband Telecom (MBT, AS135300).</p> <img src="//images.ctfassets.net/6yom6slo28h2/7p8ROXrRhLlleRUhM1RDow/7c36792f59da9ea985db63813da9a7e2/MPT_revised2.png" style="max-width: 600px;" class="image center" /> <img src="//images.ctfassets.net/6yom6slo28h2/CEsWs0ykSxgjCdijHGKkp/84014c795a82aebcd727149006897268/Ooredoo_revised.png" style="max-width: 600px;" class="image center" /> <img src="//images.ctfassets.net/6yom6slo28h2/01GQ12QLLRpdLAMTV4RehH/cc70dc2611ade8d71eaacf6c35bfa91a/Telenor_revised.png" style="max-width: 600px;" class="image center" /> <img src="//images.ctfassets.net/6yom6slo28h2/7taBd3ScG3P44ITpRcIP4Q/453494d2aad1972cd3066c0f65ed93c8/Mytel_revised.png" style="max-width: 600px;" class="image center" /><![CDATA[The New Normals of Network Operations in the Year Ahead]]><![CDATA[Last week I had the honor to participate in the PTC 2021 conference. Held in Hawaii every January, PTC's annual conference is the Pacific Rim's premier telecommunications event. Although this year’s conference was all virtual (no boondoggles to Honolulu!), it was no less important as the theme this year was New Realities. In the following blog post, I summarize what I presented in my PTC panel entitled Strategies to Meet Network Needs.]]>https://www.kentik.com/blog/the-new-normals-of-network-operations-in-the-year-aheadhttps://www.kentik.com/blog/the-new-normals-of-network-operations-in-the-year-ahead<![CDATA[Doug Madory]]>Fri, 29 Jan 2021 05:00:00 GMT<p>Last week I had the honor to participate in the <a href="https://www.ptc.org/ptc21/">PTC 2021 conference</a>. Held in Hawaii every January, PTC’s annual conference is the Pacific Rim’s premier telecommunications event. 
Although this year’s conference was all virtual (no boondoggles to Honolulu!), it was no less important as the theme this year was New Realities. In the following blog post, I summarize what I presented in my PTC panel entitled Strategies to Meet Network Needs.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5xbHfsiEZqlisMw8cR67PN/08f9997fc00504233362341c18027387/doug_at_ptc.png" style="max-width: 400px;" class="image center" /> <h4 id="new-normals-of-network-operations-in-the-year-ahead">New Normals of Network Operations in the Year Ahead</h4> <p>While it is impossible to predict exactly how and when we’ll come out of the pandemic, we believe that there are going to be some lasting changes to how companies in the internet and cloud space operate heading into the future. Here are five “new normals” that Kentik customers are considering as they pandemic-proof their businesses to survive this year’s New Realities.</p> <h4 id="new-normal-1-expanded-disaster-recovery-planning">New Normal #1: Expanded Disaster Recovery Planning</h4> <p>Not many (did any?) Disaster Recovery plans included a scenario of the magnitude that we’re dealing with now. DR scenarios can be either long-lasting or short-lived. They can have either regional or global impact. But unlike an earthquake or hurricane, the pandemic is a truly global event that, as we are well aware by now, is of indefinite duration, thus making it uniquely difficult to prepare for.</p> <p>Aside from the unprecedented changes to our day-to-day work life, companies have faced profound challenges in their supply chain. How does a company continue to operate when replacement equipment can’t be shipped? Acquiring servers was a big problem in the beginning of the pandemic: fans from one country, motherboards from another, hard disks from yet another. With countries opening and closing, equipment acquisition became a nightmare.</p> <p>One lesson we learned was to understand all the downstream dependencies and consider forward stockpiling. We learned that you must assume you could lose physical access to multiple geographically separated data centers, whereas previous DR scenarios only included the loss of access to one or two data centers in a single geography.</p> <h4 id="new-normal-2-zero-trust-by-necessity">New Normal #2: Zero Trust by necessity!</h4> <p>Zero Trust is a security model that moves away from depending on a traditional security perimeter to a corporate compute environment that is segmented into many secure zones. By isolating services and resources, the hope is that they can be better protected individually, limiting the impact of a potential security failure.</p> <p>However, Zero Trust takes on new meaning during a pandemic when much of the workforce is working from home using untrusted networks. Work-from-home means there essentially is no trusted network and nearly all connections are over untrusted <a href="https://www.kentik.com/go/assessment-network-readiness/">networks</a>.</p> <h4 id="new-normal-3-advanced-provisioning">New Normal #3: Advanced Provisioning</h4> <img src="//images.ctfassets.net/6yom6slo28h2/21WiBXR0hc0RtcPKYORf0d/50949949fb22629dcc443193a8f94cda/Screen_Shot_2021-01-29_at_3.14.58_PM.png" style="max-width: 300px;" class="image right" /> <p>When traffic volumes ramped up dramatically in March 2020, global networks could handle the additional load due to advanced provisioning: intentional slack built into the system for future capacity.
Outside of some last-mile exceptions, it was this advanced provisioning that really saved us as much of our lives moved to an exclusively online existence.</p> <p>But traffic levels continue to grow, and some of our new online habits (including working from home) will remain with us after the pandemic ends. Kentik customers are planning to continue to increase capacity into 2021.</p> <h4 id="new-normal-4-edgecloud-native-development">New Normal #4: Edge/cloud native development</h4> <p>Heading into 2021, we expect to see more <a href="https://www.kentik.com/blog/see-understand-aws-vpc-mirrored-traffic-with-kentik/">cloud</a> native development. No app gets approved that isn’t cloud native (i.e., containerized). In 2020, we didn’t see much in the way of mass deployment of edge architecture, but in 2021, we expect to see the first of these “edgier” deployments start to take place around gaming, manufacturing/logistics, and distributed <a href="https://www.kentik.com/go/webinar-synthetics101-2021-03/">performance monitoring</a>.</p> <h4 id="new-normal-5-evolving-regulatory-environment">New Normal #5: Evolving Regulatory Environment</h4> <p>The pandemic isn’t the only source of challenges for the internet and <a href="https://www.kentik.com/resources/idc-innovators-cloud-managed-network-monitoring/">cloud</a> space. The regulatory environment continues to be an active source of worry. GDPR and CCPA (California Consumer Privacy Act) are here today, but what will businesses have to contend with tomorrow?</p> <p>Section 230 of the Communications Decency Act has become a focus of political debate. The law protects internet businesses from liability if a user posts something illegal, but as social media platforms become a central part of our politics, this protection may erode or its interpretation may be revised. Will there be changes in what is required to police content from our customers?</p> <p>Companies may need to form a more developed opinion about their stance on the law. It means even more meetings with the legal team. Kentik has 20+ customers in the cloud and CDN space that are very concerned about this.</p> <h4 id="conclusion">Conclusion</h4> <p>Those are five of the New Normals we see facing our customers in the cloud and internet space heading into 2021. This is a very challenging time, but if we can get through it, we will be stronger and more resilient companies. From everyone at Kentik, stay safe and healthy!</p><![CDATA[My Kentik Portal Gets a New Skin!]]><![CDATA[My Kentik Portal, Kentik’s white-label network observability portal, is now available to all Kentik customers. We've lifted access restrictions, so you can take advantage of its features without complex activation. Plus you'll find numerous enhancements based on interviews with existing users.]]>https://www.kentik.com/blog/my-kentik-portal-gets-a-new-skinhttps://www.kentik.com/blog/my-kentik-portal-gets-a-new-skin<![CDATA[Greg Villain]]>Thu, 28 Jan 2021 05:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/42ZgPsFnRTRg8IOmL3ZaxE/4c09d12a49ae56a57f5abc86dde95770/mkp-menu.png" style="max-width: 650px;" class="image right" /> <p>Users with a sharp eye may have noticed a new item making its way into Kentik’s navigation menu: My Kentik Portal (MKP).</p> <p>Existing users may be familiar with this Kentik function, as it was previously available to a subset of customers.
Now we are making My Kentik Portal available to the wider customer base for adoption.</p> <p>The new features of MKP are:</p> <ul> <li>An updated My Kentik Portal experience with new v4 portal UX standards</li> <li>Access restrictions to <a href="https://www.kentik.com/resources/my-kentik-portal/">My Kentik Portal</a> lifted, which means:<br> <ul> <li>All users can take advantage of the features without complex activation</li> <li>Default usage limits are set higher to allow prospective users to test MKP in a meaningful way</li> </ul> </li> <li>Numerous feature enhancements based on interviews with existing users</li> </ul> <p>The product team believes we have largely delivered on these goals, but see for yourself and let us know what you think!</p> <h3 id="my-kentik-portal-in-a-nutshell-if-youve-missed-the-previous-episodes">My Kentik Portal in a Nutshell (if you’ve missed the previous episodes)</h3> <p>For those who are unfamiliar with <a href="https://www.kentik.com/resources/my-kentik-portal/">My Kentik Portal</a>, let’s rewind and spend a moment to explain what it is. In a nutshell, <strong>My Kentik Portal is Kentik’s white-label network observability portal</strong>.</p> <p>It allows Kentik customers (they’ll be referred to as “landlords”) to offer network observability and analytics to their own customers (who’ll then be referred to as “tenants”).</p> <p>My Kentik Portal comes with a dual interface:</p> <ul> <li> <p><strong>The landlord interface</strong>, allowing landlords to provision and configure tenants, the partitions of the infrastructure data they have access to, and the visualizations and alert policies that each one of them will be offered. The landlord interface is also the place where landlords will configure general attributes of the tenant interface such as branding, or the URL at which the latter is available.</p> </li> <li> <p><strong>The tenant interface</strong> is a standalone branded portal, separate from the well-known Kentik Portal that the landlords know and love. In this interface, the tenants are able to consume the analytics and alerts configured for them by landlords.</p> </li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/1GluMCpItOJ7rZD8y4a1PF/2c9fbe01e691d60eb271ad3161c9ded7/mkp-acme-login.png" style="max-width: 600px;" class="image center" /> <h3 id="do-you-need-my-kentik-portal">Do You Need My Kentik Portal?</h3> <p>If your company sells infrastructure services, chances are you will want to give My Kentik Portal a test drive. Go for it! Kick the tires and let us know what you think; we’re just getting started, and are more than happy to <a href="https://www.kentik.com/contact/">hear your feedback</a> and incorporate it into future releases.</p> <p>Here are some reasons to check out My Kentik Portal:</p> <ul> <li> <p>In a competitive marketplace, My Kentik Portal will allow you to add differentiated and visible value to your customers by delivering tailored analytics on top of your existing portfolio of infrastructure services.</p> </li> <li> <p>As soon as you start monitoring your infrastructure with Kentik, the network data is immediately available to partition and can be offered to your customers without any additional service fee.
My Kentik Portal is a great way to put your network observability data to work and monetize it.</p> </li> </ul> <h3 id="my-kentik-portal-at-a-glance">My Kentik Portal at a Glance</h3> <p>If what you’ve read so far makes you curious, read on as we go through a few of the noteworthy features that come with this release.</p> <h4 id="flexible-tenant-data-partitioning-and-visualizations">Flexible Tenant Data Partitioning and Visualizations</h4> <p>For every configured tenant, landlords can leverage a comprehensive list of filters to determine how to partition the network data for customers.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4d8MKGLmuzlUOT7DHZgVpq/946c7685ad1684d15aa78f34c18df1be/mkp-landlord-tenant1.png" style="max-width: 600px;" class="image center no-shadow" /> <img src="//images.ctfassets.net/6yom6slo28h2/59fmuynbam6e9K2D9GCehR/3478fb8b8120c89885852346e5a13256/mkp-landlord-tenant2.png" style="max-width: 600px;" class="image center no-shadow" /> <p>From simple cases where all the telemetry data from one or more network devices can be assigned to a tenant, to the more complex cases where tenants can only be defined by a set of interfaces, or IP blocks or ASNs assigned to them, all the way down to your own custom dimensions to help assign traffic data:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2UWkUB4MGZNctkEiuWXgmF/5a61cdd4af5201107a2872caf073123b/mkp-edit-tenant-pear.png" style="max-width: 800px;" class="image center" /> <p>Additionally, every tenant can be assigned a different set of visualizations, dashboards and alert policies, to best fit their precise analytics requirements:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Yjc37jEyPQz35NcEpGZYR/ff14188606f60be40f1be29b02749a58/mkp-view-dashboards.png" style="max-width: 800px;" class="image center" thumbnail /> <h4 id="tenant-packages">Tenant Packages</h4> <p>To simplify the provisioning of tenants, landlords can build “packages.” Packages are the literal SKUs for their network analytics offer, allowing them to bundle multiple dashboards, network visualizations and alert policies into groupings of content that they can directly assign to tenants. The Packages UI also allows you to keep an eye on the metrics behind each package, and package usage across tenants.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4XlnnFQ9r4gJj7JA9coota/499e19836abefbb96912b1c0e7b46f72/mkp-packages.png" style="max-width: 800px; margin-bottom: 10px;" class="image center" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Package management screen</div> <img src="//images.ctfassets.net/6yom6slo28h2/3Rp23Zb2G7JXlejNWuf0OU/0bb4953258847ee258a7383101ea5e56/mkp-edit-package.png" style="max-width: 800px; margin-bottom: 10px;" class="image center" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Customizing a package appearance</div> <h4 id="helpers-to-build-tenant-content">Helpers to Build Tenant Content</h4> <p>While moving My Kentik Portal over to the new v4 experience and interviewing existing users, we realized that we needed to help landlords more efficiently produce visualizations and dashboards for their tenants. 
In the previous iteration, landlords had to juggle between Data Explorer or dashboard editing and the tenant UX to visualize the content they were creating through the eyes of their tenants.</p> <p>In v4, landlords now have access to the “Preview as Tenant” item at the top of the Actions menu in Data Explorer and the Dashboard Editor, as displayed below. This menu will get populated as soon as the first tenant is provisioned.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3pXvBApOWVqr7eujxsbJ2s/61f4fcfbd0f406f20f3a0722318bb41f/mkp-preview-as-tenant-662w.png" style="max-width: 500px;" class="image center" /> <p>Upon selecting a tenant to display the current dashboard or view, a named filter containing the name of the selected tenant will be appended to the associated query, as depicted below — this additional filter will correspond to the specifics of the data partition configured for this tenant, no matter how complex it is.</p> <img src="//images.ctfassets.net/6yom6slo28h2/puFFomXrNAH7HoAWkawlt/43e985f3ebcbfc1fac80120d81103290/mkp-filtering-400w.png" style="max-width: 350px;margin-bottom: 10px;" class="image center" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">A named filter is appended following a “Preview as Tenant” action</div> <h4 id="a-redesigned-tenant-ux">A Redesigned Tenant UX</h4> <p>Last but not least, the overall tenant experience has been overhauled, as you can see from the screenshot below:</p> <img src="//images.ctfassets.net/6yom6slo28h2/35Re9njgRSWs7jqgm1YahC/aaf57f303a0beb8ce367a2e90d6719e1/mkp-explore-library-647w.png" style="max-width: 600px;" class="image center" /> <p>I’d like to call out a few other improvements in the new v4 tenant experience:</p> <ul> <li>Available visualizations now take the full screen and look more inviting</li> <li>The description coming with the dashboard or visualization is now displayed within the tenant UX homepage.<br /><strong>Tip</strong>: Make sure your tenant content has appealing and explanatory descriptions!</li> <li>Guided Mode dashboards now have a direct prompt on the home screen</li> <li>Views contained in a tenant package now have a bottom label identifying which package you are looking at</li> </ul> <h3 id="wrapping-up">Wrapping up</h3> <p>While already a substantial update, this initial v4 release of My Kentik Portal marks just the beginning of the envisioned product roadmap.</p> <p>Now that we’ve released it, and made it available to everyone without restrictions, our job is done and yours starts. Please help us make My Kentik Portal better by <a href="https://www.kentik.com/contact/">providing your feedback</a> as we plan for future iterations!</p><![CDATA[From Egypt to Uganda, A Decade of Internet Shutdowns]]><![CDATA[The shutdown in Egypt not only shifted the dynamics of protest in the 21st century, it was a watershed moment for the internet community — from technical organizations like Renesys to digital rights advocacy groups like Access Now.
The era of the large-scale government-directed internet shutdown had truly begun.]]>https://www.kentik.com/blog/from-egypt-to-uganda-a-decade-of-internet-shutdownshttps://www.kentik.com/blog/from-egypt-to-uganda-a-decade-of-internet-shutdowns<![CDATA[Doug Madory, Peter Micek]]>Wed, 27 Jan 2021 16:30:00 GMT<p><em>The following is a joint blog post by Doug Madory (Kentik, previously with Renesys) and Peter Micek (Access Now).</em></p> <p>Ten years ago today, a team of computer scientists at internet monitoring firm Renesys began frantically <a href="https://www.smh.com.au/technology/egyptian-internet-cutoff-unprecedented-renesys-20110129-1a8o0.html" target="_blank">documenting</a> and <a href="https://archive.nytimes.com/www.nytimes.com/interactive/2011/02/16/world/middleeast/0212-egypt-internet.html" target="_blank">reporting</a> about a massive internet shutdown in Egypt. At the same time, digital rights organization Access Now was <a href="https://www.esquire.com/news-politics/news/a9674/brett-solomon-interview-access-now-5456069/" target="_blank">scrambling to assist</a> Egyptian activists circumventing surveillance and blockages of internet communications.</p> <p>The Arab Spring had begun and longtime Egyptian President Hosni Mubarak was under siege by protests in Tahrir Square. In an ill-fated move, the government gave the order to cut internet services in an attempt to squelch the uprising. With Egyptians no longer able to follow the protests online, the shutdown brought more people into the streets to see what was happening, adding to the crowds. Ultimately, the internet in Egypt was restored, Mubarak resigned, and little else in Egypt was ever resolved.</p> <p><em>Graphics generated by Renesys (left) and by The New York Times (right) in 2011.</em> <br> <img src="//images.ctfassets.net/6yom6slo28h2/6vEZBsSsV8icEsURVnZCWO/4912bbff4dd4b3febb3683592a05bc1d/egypt_revised.png" style="max-width: 350px;" /> <img src="//images.ctfassets.net/6yom6slo28h2/gLLvSfKMeubTEqE3ZxLQo/43b77c667528df66917fa2d6ed5b264c/0216-for-INTERNETweb.png" style="max-width: 480px;" /></p> <hr> <p>The shutdown in Egypt not only shifted the dynamics of protest in the 21st century, it was a watershed moment for the internet community — from technical organizations like Renesys to digital rights advocacy groups like <a href="https://www.accessnow.org/" target="_blank">Access Now</a>. The era of the large-scale government-directed internet shutdown had truly begun.</p> <p>In the past decade, the scourge of nation-scale internet shutdowns has continued — largely migrating from the Arab Spring to Sub-Saharan Africa as longtime rulers came up for reelection. Used as a technique to quash opposition, shutdowns have the pernicious effects of <a href="https://www.accessnow.org/iamthesudanrevolution-theres-a-direct-link-between-internet-shutdowns-and-human-rights-violations-in-sudan/" target="_blank">hiding human rights abuses</a> as well as <a href="https://cpj.org/2020/05/network-shutdowns-restrict-reporting-during-covid/" target="_blank">limiting the press</a> and free speech. In addition, shutdowns can be costly from <a href="https://www.brookings.edu/research/internet-shutdowns-cost-countries-2-4-billion-last-year/" target="_blank">entire economies</a> down to <a href="https://www.reuters.com/article/uganda-internet-rights/100-hours-in-the-dark-how-an-election-internet-blackout-hit-poor-ugandans-idUSL4N2JU2YQ" target="_blank">average citizens</a>.
But for an embattled ruler, these are often seen as the necessary price of maintaining power.</p> <p>Governments had disrupted the internet prior to Egypt’s blackout in 2011. However, this event drew the world’s attention to the internet’s weak points and bottlenecks where a nation’s telecoms must answer to authorities’ whims. Access Now pressured <a href="https://www.theguardian.com/business/2011/jul/26/vodafone-access-egypt-shutdown" target="_blank">multinational telecoms</a> operating in Egypt to explain their roles in the Mubarak regime’s internet shutdown. Soon, the <a href="https://www.loc.gov/law/foreign-news/article/u-n-human-rights-council-first-resolution-on-internet-free-speech/" target="_blank">UN declared</a> that human rights apply online as they do offline, and telecoms began to <a href="http://www.telecomindustrydialogue.org/" target="_blank">recognize</a> that they impact people’s rights.</p> <p>Yet Egypt’s internet shutdown proved a bellwether. The threats have multiplied, with more than <a href="https://www.accessnow.org/cms/assets/uploads/2020/02/KeepItOn-2019-report-1.pdf" target="_blank">213 shutdowns</a> recorded in 2019. Governments from Myanmar to Cameroon adapted their tactics, blocking for more <a href="https://www.accessnow.org/365-days-and-counting-myanmar/" target="_blank">prolonged periods</a>, in ways <a href="https://www.aljazeera.com/news/2018/1/26/cameroon-internet-shutdowns-cost-anglophones-millions" target="_blank">more targeted</a> to stifle the voices of specific populations, such as refugees living in Bangladesh’s Cox’s Bazar camp, and people in marginalized and vulnerable ethnic, religious, or political groups.</p> <p>As shutdowns rose globally, Access Now and partners launched <a href="https://www.accessnow.org/keepiton/" target="_blank">#KeepItOn</a> in 2016, a campaign and <a href="https://www.accessnow.org/keepiton/#coalition" target="_blank">coalition</a> that has grown to more than 220 organizations in 99 countries.</p> <img src="//images.ctfassets.net/6yom6slo28h2/019rPD4b2rTm4G4tJP6IsI/5fc947748df9fae7f50fcaf7bfc58267/KeepItOn-presentation-social.jpg" style="max-width: 500px;" class="image center" /> <p>Additionally, in the past decade there has been an increasing need for the technical portion of the internet community to use their expertise, data, and tools to support the work of digital rights NGOs like Access Now. This domain expertise helps to differentiate a blackout due to a submarine cable cut from one caused by a government-directed act of repression. In recent years, companies have fielded internet “weather maps” that would <a href="https://blogs.oracle.com/internetintelligence/introducing-the-internet-intelligence-map" target="_blank">democratize internet analysis</a> allowing average users to find evidence of internet interference.</p> <p>Despite these efforts, the past week saw yet another national internet shutdown when the government of Uganda gave the order to cut internet services in the country during a national election. The country was almost completely offline for five days around the day of the vote, which resulted in the reelection of President Museveni extending his 36-year rule over the East African country. 
Sadly, the disruption came as <a href="https://www.accessnow.org/the-world-is-watching-uganda-elections/" target="_blank">little surprise</a> to observers, who anticipated and tried to prevent the internet blocks.</p> <p>The shutdown in Uganda shows that the effort to combat these acts is far from over and needs more support. Especially worrisome is the prospect that, given the president’s re-election, other embattled authoritarian regimes may conclude from Uganda that shutdowns work.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1f55OgFzcpl0CCAzrm9KYk/89431fb65498ea5fc5ba6b81aa361885/Uganda_Shutdown_Jan2021.png" style="max-width: 600px;" class="image center" /> <p>With ten years of hindsight, we can see that the Egyptian internet shutdown was not an anomaly, but rather a harbinger of things to come, namely, an era where internet communications would be directly threatened by repressive governments in an effort to control their own people.</p> <p>Just like Renesys and Access Now did ten years ago, everyone has a role to play in exposing and resisting these disruptions. Legal challenges resulted in several <a href="https://www.accessnow.org/2020-digital-rights-wins/" target="_blank">courtroom victories</a> against shutdowns in 2020. Civil society organizations can <a href="https://www.accessnow.org/keepiton/" target="_blank">join #KeepItOn</a> and deploy tools to <a href="https://ooni.org/" target="_blank">measure internet censorship</a> and <a href="https://docs.google.com/forms/d/e/1FAIpQLSewDTGuARvgghlqLgtQrjVPLp2Pdn5cCz5-_s7-Sg_pMh20xQ/viewform" target="_blank">share stories</a> of living through shutdowns, calling on governments to #LetTheNetWork. Tech companies can <a href="https://globalnetworkinitiative.org/policy-issues/network-disruptions/" target="_blank">collaborate</a> against abusive government requests and governments can <a href="https://www.reuters.com/article/belarus-election-internet/u-s-germany-among-29-nations-condemning-reported-internet-shutdowns-in-belarus-idINKBN2690DO" target="_blank">speak out</a> when their peers violate norms. Join our fight, and you too can help build a stronger net.</p><![CDATA[What Is Port Mirroring? SPAN Ports Explained]]><![CDATA[In this blog, we dive into what SPAN Port Mirroring is, how it works, what it’s good at, and the drawbacks as well as best practices in using port mirroring.]]>https://www.kentik.com/blog/what-is-port-mirroring-span-explainedhttps://www.kentik.com/blog/what-is-port-mirroring-span-explained<![CDATA[Kevin Woods]]>Wed, 20 Jan 2021 05:00:00 GMT<p><a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/" title="Kentik blog: The Network also Needs to be Observable">Network observability</a> is a core part of systems administration. No matter your line of business, and no matter your number of users, you need to know what’s happening on your network. There are a number of good reasons to invest in better network observability. If and when you experience a network outage, high-quality observability means you’ll diagnose the issue more quickly. Network observability also gives you a tool to detect malicious intruders in your environment. Regularly inspecting network traffic can even help you identify network bottlenecks and improve the day-to-day experience for end-users on your network.</p> <p>In this post, we’ll talk about one of the most popular means of network observation: SPAN port mirroring.
We’ll dive into what it is, how it works, what it’s good at, and the drawbacks as well as best practices in using port mirroring.</p> <h2 id="what-is-span-port-mirroring">What Is SPAN Port Mirroring?</h2> <img src="//images.ctfassets.net/6yom6slo28h2/1oawV1f018S2e5ap96RfHw/ddf379a76a3c471fc5533da55acbd64c/on-off-switch.jpg" class="image center" style="max-width: 300px;" /> <p>The concept behind port mirroring is quite simple. When you configure a switch, you reserve one port. Then you configure the switch to “mirror” all traffic that passes through to that reserved port. Whenever the switch processes a packet, it makes a copy and sends it to whatever is connected to the aforementioned port. Usually, this will be some kind of dedicated system set up to monitor the traffic on that switch. SPAN (<a href="https://www.cisco.com/assets/sol/sb/Switches_Emulators_v2_3_5_xx/help/250/index.html#page/tesla_250_olh/span_overview.html" title="Switched Port Analyzer" target="_blank">Switched Port Analyzer</a>) is a Cisco-specific way of handling port mirroring. For the purposes of our discussion, we can use these terms interchangeably, but you should keep in mind that every network vendor provides some sort of port mirroring.</p> <div as="Promo"></div> <h2 id="types-of-span-port-configurations">Types of SPAN Port Configurations</h2> <p>As you dig into SPAN port configurations, it’s important to understand what SPAN setups can and can’t do. For starters, they can’t tell you about any traffic that doesn’t route through the switch you’re configuring. That just makes sense, right? When you’re configuring SPAN ports, it’s also essential to understand how network traffic passes through your network. If you configure SPAN on the wrong switch within your topology, you’re going to wind up missing packets that you want to see. Fortunately, specific SPAN implementations don’t mean that you’re confined to a single physical switch.</p> <p>If your <a href="https://www.kentik.com/kentipedia/what-is-network-topology/" title="Kentipedia: What is Network Topology?">network topology</a> spans multiple switches, SPAN has you covered. There are <a href="https://learningnetwork.cisco.com/s/article/span-rspan-erspan" target="_blank" title="two SPAN variants">two SPAN variants</a> that handle distributed environments effectively: RSPAN and ERSPAN.</p> <h3 id="rspan">RSPAN</h3> <p>RSPAN takes our SPAN configuration from earlier and works across a dedicated VLAN tunnel. Traffic going from one switch to another moves along a dedicated tunnel. When you configure your switch, you dedicate a VLAN (one or more ports) as an RSPAN VLAN. Now all traffic that passes along switches within that tunnel will be copied to the RSPAN VLAN. Much like a traditional SPAN configuration, the switch copies all traffic. The important thing to know about RSPAN is that all the switches involved need to be on the same physical network. RSPAN is an <a href="https://smallbiztrends.com/2013/09/osi-model-layer-networking.html" target="_blank">OSI Layer 2</a> configuration. It doesn’t support routing traffic through Layer 3.</p> <h3 id="erspan">ERSPAN</h3> <p>If you read the previous paragraph, you can probably guess why ERSPAN exists. While RSPAN only supports Layer 2 routing, ERSPAN supports Layer 3. When you enable ERSPAN, you gain the ability to route mirrored traffic across multiple physical networks. This provides a real benefit for organizations with multiple geographically distributed network environments. Unfortunately, ERSPAN is a Cisco-proprietary feature. It’s only available on certain models. Those limits remove a lot of choices if ERSPAN is a feature your team needs.</p>
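<p>To make the configuration side of this concrete, here is a minimal Python sketch of pushing a basic local SPAN session to a switch over SSH. It is an illustration under stated assumptions, not a drop-in script: it assumes a Cisco IOS device and the open-source netmiko library, and the address, credentials, and interface names are placeholders.</p> <pre><code># Sketch: configure a local SPAN session on a Cisco IOS switch.
# Assumes "pip install netmiko"; all device details are placeholders.
from netmiko import ConnectHandler

switch = {
    "device_type": "cisco_ios",
    "host": "192.0.2.10",       # placeholder management IP
    "username": "admin",        # placeholder credentials
    "password": "example-only",
}

# Mirror everything seen on Gi0/1 (both directions) to Gi0/24,
# where the monitoring system is attached.
span_commands = [
    "monitor session 1 source interface GigabitEthernet0/1 both",
    "monitor session 1 destination interface GigabitEthernet0/24",
]

with ConnectHandler(**switch) as conn:
    print(conn.send_config_set(span_commands))
    # Verify that the session looks the way we intended.
    print(conn.send_command("show monitor session 1"))
</code></pre> <p>Because the exact syntax and session limits vary by vendor and platform, treat the two <code>monitor session</code> commands as representative of the workflow rather than universal.</p>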
<h2 id="common-span-port-mistakes-to-avoid">Common SPAN Port Mistakes to Avoid</h2> <p>When setting up and using port mirroring, it’s essential to avoid common SPAN port mistakes. The following issues are frequently encountered when working with SPAN ports:</p> <ul> <li><strong>Incorrect Configuration:</strong> An improper configuration can lead to reduced visibility or even cause network issues. Make sure to carefully follow the steps and best practices when setting up your SPAN port.</li> <li><strong>Limited Bandwidth:</strong> If the monitoring port doesn’t have enough bandwidth, it can cause dropped packets and limit your ability to analyze network traffic. Ensure your monitoring port has adequate bandwidth for the volume of traffic you’re capturing.</li> <li><strong>Overload:</strong> Overloading the switch with too many mirrored ports or excessive traffic can cause performance issues. Be cautious of the number of mirrored ports you set up and consider monitoring only the most critical traffic.</li> <li><strong>Wrong Source Port Selection:</strong> Choosing the wrong source port can lead to missing crucial network traffic. Ensure that you have a clear understanding of your network topology and traffic flows to select the correct source port for monitoring.</li> <li><strong>Monitoring the Wrong Traffic:</strong> Focusing on irrelevant traffic can lead to missed insights and wasted resources. Be selective in the traffic you monitor to ensure you capture the most relevant data for your network observability goals.</li> </ul> <h2 id="span-port-vs-network-tap-whats-the-difference">SPAN Port vs. Network Tap: What’s the Difference?</h2> <p>In order to effectively perform network monitoring, there are two popular methods that provide direct access to the actual <a href="https://www.kentik.com/blog/flows-vs-packet-captures-for-network-visibility" title="Kentik blog: Flows vs. Packet Captures for Network Visibility">packets</a> traveling across networks. Accessing this data at the packet level is essential because it provides the level of detail needed to gain network visibility. These two methods are SPAN port mirroring and the network TAP (Test Access Point).</p> <h3 id="what-is-a-network-tap">What is a Network Tap?</h3> <p>A network TAP is a device that sits in a network segment between two appliances (such as a router, switch, or firewall) and allows you to directly access and monitor the network traffic.
All of the data flows through the TAP, which creates a copy of the data for monitoring while the original data continues to flow through the network, transmitting send and receive data simultaneously on separate channels.</p> <h3 id="what-is-a-span-port">What is a SPAN Port?</h3> <p>SPAN ports, also referred to as port mirroring, are dedicated ports on a switch or router that create copies of selected packets passing through the device and send them to a specific destination port.</p> <h2 id="span-port-vs-network-tap-pros-and-cons-of-each-approach">SPAN Port vs Network Tap: Pros and Cons of Each Approach</h2> <p>When deciding between a SPAN port and a network tap for network monitoring, it’s essential to consider the pros and cons of each approach to determine the best solution for your network observability needs.</p> <h3 id="pros-and-cons-of-span-port">Pros and cons of SPAN Port:</h3> <p>Advantages:</p> <ul> <li>Built into the switch, requiring no additional hardware</li> <li>Cost-effective and easy to configure</li> <li>Can be quickly enabled or disabled as needed</li> <li>Invisible on the network, reducing points of failure</li> </ul> <p>Disadvantages:</p> <ul> <li>Lower priority, potentially dropping mirrored packets during high traffic</li> <li>Limited visibility in case of switch issues or misconfiguration</li> <li>Requires resources on physical or virtual appliances</li> <li>May be less suitable for time-sensitive network observability goals</li> </ul> <h3 id="pros-and-cons-of-network-tap">Pros and cons of Network Tap:</h3> <p>Advantages:</p> <ul> <li>Provides direct access to network traffic for improved visibility</li> <li>Does not introduce additional load on the switch</li> <li>Handles send and receive data on separate channels, reducing latency</li> <li>Can offer more accurate network monitoring and diagnostics</li> </ul> <p>Disadvantages:</p> <ul> <li>Requires additional hardware and potential maintenance</li> <li>Can be more expensive to set up and maintain</li> <li>May introduce additional points of failure</li> <li>Physical installation needed for each switch being monitored</li> </ul> <h3 id="what-are-the-benefits-of-port-mirroring">What are the Benefits of Port Mirroring?</h3> <div class="pullquote right" style="width: 260px;">Let’s start with the most obvious benefit of port mirroring: the functionality is available on your switch.</div> <p>Let’s start with the most obvious benefit of port mirroring: the functionality is available on your switch device. Compared to something like a <a href="https://www.techopedia.com/definition/25311/network-tap" title="network tap" target="_blank">network tap</a>, port mirroring is easy and cheap to configure. You don’t need any additional hardware. This makes port mirroring particularly valuable when your network configuration is constrained by physical space or when you might only need to monitor a VLAN for a short period of time. Instead of needing to get into a cage and physically install or remove hardware, you’ll just need to modify your switch configuration. This ease of configuration and lack of up-front cost makes port mirroring an attractive proposition for organizations taking their first steps toward <a href="https://www.kentik.com/resources/five-steps-to-network-observability-nirvana-webinar/">network observability</a>.</p> <p>As an additional bonus, port mirroring is effectively invisible on your network. If you introduce a dedicated network tap, that’s another device you need to maintain and support.
While dedicated network taps rarely fail, they’re still a potential point of failure. Enabling port mirroring on a switch makes it no more likely to fail than any other switch. And, as noted, port mirroring works across multiple switches. A device like a network tap must be installed and connected to every switch you want to monitor. But by far, the biggest benefit to port mirroring is that it’s so simple and quick to set up.</p> <h3 id="what-are-the-drawbacks-of-port-mirroring">What are the Drawbacks of Port Mirroring?</h3> <div class="pullquote right">The switch treats each mirrored packet as a lower priority than normal network traffic.</div> <p>While port mirroring is cheaper and quicker to set up, it does carry some real drawbacks. The most significant is that the switch will treat mirrored traffic (SPAN data) as a lower priority. While the CPU overhead of copying any individual packet and mirroring it to a destination port or VLAN is low, those costs add up. So the switch treats each mirrored packet as a lower priority than normal network traffic. During low-to-medium traffic periods, this isn’t a big deal. The switch capably handles both the normal traffic and the mirrored traffic. However, when traffic flow loads up, things can get hairy. The switch will drop mirrored packets first. This means that you’re most likely to lose some network observability during the time when you need it the most.</p> <p>Moreover, the reduced priority of mirrored packets makes some network observability goals more difficult to achieve. For instance, if your network observability goals include things like reducing network jitter or <a href="https://www.kentik.com/blog/why-latency-is-the-new-outage/" title="Kentik blog: Why Latency is the New Outage">latency</a>, port mirroring might not be a good fit for your goals. This is because the delay in delivering mirrored packets can make those time-sensitive issues more difficult to detect and resolve.</p> <p>Another downside to port mirroring is that it requires resources on physical or virtual appliances. These can be costly from a hardware and (in the case of commercial solutions) software licensing point of view. As a result, in most cases, it is only fiscally feasible to deploy forms of port mirroring at selected points in the network.</p> <p>A cloud-friendly and highly scalable alternative is to deploy lightweight host-based monitoring agents that export packet capture statistics gathered on servers and open-source proxy servers.</p> <h2 id="port-mirroring-best-practices">Port Mirroring Best Practices</h2> <p>If you’re jumping into port mirroring for the first time, here are some best practices you want to keep in mind:</p> <ul> <li>Know your environment. You need to know where <a href="https://www.kentik.com/blog/gathering-understanding-using-traffic-telemetry-for-network-observability/" title="Kentik blog: Gathering, Understanding, and Using Traffic Telemetry for Network Observability">traffic flows</a> to make sure you’re putting your SPAN port on the correct switch.</li> <li>Focus your filtering. Because of the downsides of a SPAN port, you want to make sure you’re not over-capturing traffic. The switch may drop mirrored packets if network traffic load is too high.</li> <li>Check your logs. Network observability is important, but you don’t do yourself any good if you don’t ever look at the traffic you capture.</li> <li>Don’t over-capture. Every packet you capture is one you have to filter through or parse out.
Understand what you’re looking for, and try to capture only traffic that you need to see.</li> </ul> <h2 id="span-ports-just-one-tool-in-the-toolbelt">SPAN Ports: Just One Tool in the Toolbelt</h2> <p>Port mirroring is a great low-cost option for tapping into a source of <a href="https://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources/" title="Kentik Blog: More about network telemetry">network telemetry</a>. Especially if you’re just starting your journey into high-quality network observability, it’s easy to configure and enable out of the gate. There are no big up-front costs, and enabling port mirroring is something you can do on your existing hardware today.</p> <p>That doesn’t mean you should just jump in with both feet. Before you get started, you want to intimately understand how your network traffic flows. Where is your most critical traffic coming from? Where is it going? Once you answer those questions, it’s easier to know where to set up port mirroring in your existing network.</p> <p>It’s also critical to understand that port mirroring isn’t a silver bullet. As we noted, while it has some real upsides, it also comes with some real downsides. Given that, you want to use port mirroring as just one tool in your toolbelt. Unless your network is very small, don’t expect that setting up port mirroring will solve all your problems. You can’t just set it and forget it. Good networks, like gardens, require tending and maintenance to flourish. If you’re interested in learning how to tend your network so that it runs its best all the time, we’d be happy to show you how <a href="https://www.kentik.com/why-kentik-network-observability-and-monitoring/" title="Why choose Kentik for network observability and monitoring?">Kentik can help</a>.</p><![CDATA[Basic Network Troubleshooting: A Complete Guide]]><![CDATA[The basics of network troubleshooting have not changed much over the years. When you’re network troubleshooting, a lot can be required to solve the problem. You could be solving many different issues across several different systems on your complex, hybrid network infrastructure. A network observability solution can help speed up and simplify the process.]]>https://www.kentik.com/blog/network-troubleshooting-complete-guidehttps://www.kentik.com/blog/network-troubleshooting-complete-guide<![CDATA[Kevin Woods]]>Wed, 13 Jan 2021 05:00:00 GMT<h2 id="the-network-is-the-key">The Network is the Key</h2> <p>“The network is down!” — I’m sure you’ve heard that before.</p> <p>Despite your best efforts as a network engineer, network failures happen, and you have to fix them. Hopefully, you’ve implemented a <a href="https://www.kentik.com/product/kentik-platform/">network observability platform</a> in advance, so you should be collecting a wealth of information about your network, making troubleshooting easier.</p> <div class="pullquote right" style="max-width: 250px; margin-top: 20px;">Network troubleshooting becomes easier if your network is observable.</div> <p>But what happens when it’s time to activate troubleshooting mode?</p> <p>In this post, I’m going to talk about the steps to troubleshoot your network. Then I’ll cover some best practices, as well as examples of troubleshooting with Kentik’s network observability solutions.</p> <h2 id="what-is-network-troubleshooting">What is Network Troubleshooting?</h2> <p>Network troubleshooting is the process of solving problems that are occurring on your network, using a methodical approach.
A simple definition for what can often be a hard task!</p> <p>When users complain, whether they’re internal or external to your organization — or ideally, before they do — you need to figure out what the cause of their problem is. The goal is to troubleshoot and fix whatever issue underlies the problems.</p> <p>Troubleshooting requires taking a methodical approach to resolving the issue as quickly as possible. Unfortunately for you, the user doesn’t care what your service-level objective for fixing the problem is. In today’s “gotta have it fast” culture, more often than not, you need to fix it now — or revenue is affected.</p> <p>Let’s get into some ways you can troubleshoot your network and reduce your mean time to repair (MTTR).</p> <div as="Promo"></div> <h2 id="basic-network-troubleshooting-processes">Basic Network Troubleshooting Processes</h2> <h3 id="identify-the-problem">Identify the Problem</h3> <p>When you’re <a href="https://www.kentik.com/solutions/usecase/troubleshoot-networks/" title="Learn more about using Kentik for network troubleshooting">troubleshooting network issues</a>, complexity and interdependency make it difficult to track down the problem. You could be solving many different issues across several different networks and planes (underlay and overlay) in a complex, hybrid network infrastructure.</p> <p>The first thing you want to do is identify the problem you’re dealing with. Here are some typical network-related problems:</p> <ul> <li><strong>A configuration change broke something</strong>. On a network, configuration settings are constantly changing. Unfortunately, configuration change accidents can happen that bring down parts of the network.</li> <li><strong>Interface dropping packets</strong>. Interface issues caused by misconfigurations, errors, or queue limits lead to network traffic failing to reach its destination. Packets simply get dropped.</li> <li><strong>Physics limitations on connectivity</strong>. Sometimes, your connections don’t have enough bandwidth. Or there’s too much latency between source and destination. These lead to network congestion, slowness, or timeouts.</li> <li><strong>Problems in the cloud</strong>. Intra- or inter-cloud connectivity problems can have their own unique set of causes and challenges. Often they’re driven by someone else’s congestion, oversubscription, or software failures.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/6zkEHYv2diZit52VuA7Px0/40e9390fe29ce6fd24ba8414b1e5ead5/toolset.jpg" style="max-width: 500px;" class="image center" /> <h3 id="find-your-network-troubleshooting-tools">Find Your Network Troubleshooting Tools</h3> <p>Fixing these kinds of troubleshooting problems requires more than identification. To paraphrase French biologist Louis Pasteur — where observation is concerned, chance favors only the prepared mind.</p> <p>No network engineer can troubleshoot without being prepared with their tools and telemetry.
So once you’ve identified that there is a problem, it’s time to use your network troubleshooting tools.</p> <p>Ideally, you have tools and telemetry in advance, so your network observability toolchain is using AI to automatically identify problems and linking you to a jumping-off point so you can drive down both MTTK (Mean Time to Know) and either MTTR (Mean Time to Repair) or MTTI (Mean Time to Innocence).</p> <p>Here are a few examples of basic network troubleshooting tools:</p> <ul> <li>Ping</li> <li>Tracert/Traceroute</li> <li>Ipconfig/ifconfig</li> <li>Netstat</li> <li>Nslookup</li> <li>Pathping/MTR</li> <li>Route</li> <li>PuTTY</li> </ul> <h3 id="the-first-step--ping-affected-systems">The First Step — Ping Affected Systems</h3> <p>When your network is down, slow, or suffers from some other problem, your first job is to send packets across the network to validate the complaint. Send these pings using the Internet Control Message Protocol (ICMP) or TCP to one or more of the network devices you believe to be involved.</p> <p>The ping tool is a utility that’s available on practically every system, be it a desktop, server, router, or switch.</p> <p>There’s a sports analogy that says “the most important ability is availability” for systems. If you can’t reach it, it’s not available to your users.</p> <p>Sending some ICMP packets across the network, especially from your users’ side, will help answer that question, if your platform isn’t presenting the path to you automatically. In some cases, if ICMP is filtered, you can usually switch to TCP (Transmission Control Protocol) and use tcping, telnet, or another TCP-based method to check for reachability.</p> <h3 id="get-the-path-with-traceroute">Get the Path with Traceroute</h3> <p>If you’re not getting any ping responses, you need to find out where the ping is stopping. You can use another ICMP-based tool to help, and that’s traceroute.</p> <p>Your ping could be getting stopped because ICMP isn’t allowed on your network or by a specific device. If that’s the case, you should consider TCP Traceroute on Linux, which switches to TCP packets.</p> <p>From traceroute, since you will see the path of IP-enabled devices your packets take, you will also see where the packets stop and get dropped. Once you have that, you can further investigate why this packet loss is happening. Could it be a misconfiguration of IP addresses or subnet masks? A misapplied access list?</p> <h3 id="test-your-network-with-synthetic-monitoring">Test Your Network with Synthetic Monitoring</h3> <p>Tools such as <a href="https://www.kentik.com/product/synthetics/" title="Kentik Synthetic Monitoring product page">Kentik Synthetic Monitoring</a> enable you to continuously test network performance (via ICMP, TCP, HTTP, and other tests) so you can uncover and solve network issues before they impact customer experience. Ping and traceroute tests performed continuously with public and/or private agents generate key metrics (latency, jitter, and loss) that are evaluated for network health and performance.</p> <p>To get ahead of the game, Kentik also allows you to set up autonomous tests, so there’s already test history for your top services and destinations. You can also run these continuously (every second, like the ping command default) for high resolution.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7HT2BBgX36a3XASaSEqV9I/a885a4ad75d99ad9bc0ac755cf19f383/traceroute-path-view.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" thumbnail alt="Network Troubleshooting: traceroute path view" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Kentik’s traceroute-based network path view provides traffic visualization as it flows between test points and agents</div>
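<p>To illustrate what a TCP-based reachability test actually does, here is a minimal Python sketch of a tcping-style probe. This is only an illustration of the underlying idea, not how Kentik’s agents are implemented; the target host, port, and attempt count are placeholders.</p> <pre><code># Sketch: measure TCP connect latency to a target, tcping-style.
# The host and port below are placeholders for a service you care about.
import socket
import time

HOST, PORT = "www.example.com", 443
ATTEMPTS = 5

for _ in range(ATTEMPTS):
    start = time.monotonic()
    try:
        # A completed three-way handshake proves reachability even
        # where ICMP is filtered.
        with socket.create_connection((HOST, PORT), timeout=2):
            elapsed_ms = (time.monotonic() - start) * 1000
            print(f"{HOST}:{PORT} reachable in {elapsed_ms:.1f} ms")
    except OSError as exc:
        print(f"{HOST}:{PORT} unreachable: {exc}")
    time.sleep(1)
</code></pre> <p>Run from where your users sit, even a loop this simple yields the latency and loss trends that a synthetic monitoring platform automates, stores, and alerts on at scale.</p>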
<h3 id="device-commands-and-database-logs">Device Commands and Database Logs</h3> <p>Now that you’ve identified the network device or group of devices that could be the culprit, log into those devices and take a look. Run commands based on your device’s network operating system to see some of the configuration.</p> <p>Take a look at the running configuration to see what interfaces are configured to get to the destination. You can take a look at system logs that the device has kept for any routing or forwarding errors. You can also look at antivirus logs on the destination systems that could be blocking access.</p> <p>At this point, you may find yourself unable to get enough detail about the problem. Command line tools are telling you how things should work. What if everything’s working the way it should? What now? Or you might be getting overwhelmed by the amount of log data.</p> <h3 id="device-configuration-changes">Device Configuration Changes</h3> <p>Many network outages relate to changes that humans made! Another key step on the troubleshooting path is to see if anything changed at about the same time as issues started.</p> <p>This information can be found in logs of AAA (Authentication, Authorization, and Accounting) events from your devices. Ideally it’s stored centrally, but it’s often also visible by examining the on-device event log history.</p> <h3 id="packets-and-flows">Packets and Flows</h3> <p>The old saying about packet captures is that packets don’t lie! That’s also true for flow data, which summarizes packets.</p> <p>Both packets and flows provide information about the source and destination IP addresses, ports, and protocols.</p> <p>When getting flow data, you’re not as in the weeds as during a packet capture, but it’s good enough for most operational troubleshooting. Whether it’s with <a href="https://www.kentik.com/blog/accurate-visibility-with-netflow-sflow-and-ipfix/">NetFlow, sFlow, or IPFIX</a>, you’ll be able to see who’s talking to whom and how with flow data going to a flow collector for analysis.</p> <p>Capturing packet data is truly getting into the weeds of troubleshooting your network. If it’s unclear from flow, and often if it’s a router or other system bug, you may need to go to the packets.</p> <p>Unless you have expensive collection infrastructure, it’s also often more time-consuming for you than any of the other tools above. Whether it’s tcpdump, Wireshark, or a SPAN port, you’ll be able to get needle-in-the-haystack data with packet captures.</p> <p>One great middle ground is augmented flow data, which captures many of the elements of packets. This can be great if you can get performance data, but not all network devices can watch performance and embed it in flow — in fact, the higher the speed of the device, the less likely it is to support this kind of enhancement.</p> <p>Collecting and analyzing packets and flows is where you start to venture into the next step. You’re using a mix of utility tools (tcpdump) and software (Wireshark, flow collector). If you’re expecting to keep a low MTTR, you need to move up the stack to software systems.</p>
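<p>To make the packets-versus-flows distinction concrete, here is a hedged Python sketch that boils a short tcpdump capture down to rough flow-style talker counts. It assumes a Linux host with tcpdump installed and permission to capture (often root); the interface name is a placeholder, and the line parsing matches tcpdump’s typical <code>-nn</code> output rather than every possible variant.</p> <pre><code># Sketch: summarize a brief packet capture into flow-like talker counts.
# Assumes tcpdump is installed and you are allowed to capture packets.
import subprocess
from collections import Counter

IFACE = "eth0"  # placeholder interface name

# Capture 200 IP packets without name resolution; -l line-buffers output.
proc = subprocess.run(
    ["tcpdump", "-i", IFACE, "-nn", "-l", "-c", "200", "ip"],
    capture_output=True, text=True, check=True,
)

talkers = Counter()
for line in proc.stdout.splitlines():
    fields = line.split()
    # Typical line: "12:00:00.000000 IP 10.0.0.1.443 > 10.0.0.2.51514: ..."
    if len(fields) > 4 and fields[1] == "IP" and fields[3] == ">":
        src, dst = fields[2], fields[4].rstrip(":")
        talkers[(src, dst)] += 1

for (src, dst), pkts in talkers.most_common(10):
    print(f"{src} -> {dst}: {pkts} packets")
</code></pre> <p>This who-talks-to-whom summary is essentially what a flow record gives you for free, without having to store or sift through every packet.</p>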
<h3 id="up-the-stack">Up the Stack</h3> <p>If you can’t find issues using these tools and techniques at the network level, you may need to peek up the stack because it could be an application, compute, or storage issue. We’ll cover more on this cross-stack debugging in a future troubleshooting overview.</p> <h2 id="kentik-network-observability">Kentik Network Observability</h2> <p>Of course, network performance monitoring (NPM) and network observability solutions such as Kentik can greatly help avoid network downtime, detect network performance issues before they critically impact end-users, and track down the root cause of network problems.</p> <p>In today’s complex and rapidly changing network environments, it’s essential to go beyond reactive troubleshooting and embrace a proactive approach to maintaining your network. Network monitoring and proactive troubleshooting can help identify potential issues early on and prevent them from escalating into more severe problems that impact end users or cause downtime.</p> <p>Kentik’s Network Observability solutions, including the Network Explorer and Data Explorer, can be invaluable tools in implementing proactive troubleshooting strategies. By providing real-time and historical network telemetry data and easy-to-use visualization and analysis tools, Kentik enables you to stay ahead of potential network issues and maintain high-performing, reliable, and secure network infrastructure.</p> <h3 id="network-explorer-solution">Network Explorer Solution</h3> <p>Kentik Network Explorer provides an overview of the network with organized, pre-built views of activity and utilization, a Network Map, and other ways to browse your network, including the devices, peers, and interesting patterns that Kentik finds in the traffic.</p> <p>To make NetOps teams more efficient, Kentik provides troubleshooting and capacity management workflows. These are some of the most basic tasks required to operate today’s complex networks, which span data center, WAN, LAN, hybrid and multi-cloud infrastructures.</p> <p>The Network Explorer combines flow, routing, performance, and device metrics to build the map and let you easily navigate. And everything is linked to Data Explorer if you need to really turn the query knobs to zoom way in.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1TFDwPNfZayc1Rw8pnNcSr/6f8a4cc7d83af3e771eca58b4bc14457/troubleshooting-network-explorer.png" style="max-width: 800px; margin-bottom: 15px;" class="image center" alt="Network Troubleshooting with Kentik's Network Explorer view" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">An Essential Network Troubleshooting Tool: Kentik’s Network Explorer</div> <h3 id="data-explorer-solution">Data Explorer Solution</h3> <p>If you can’t find the obvious issue with something unreachable or down, it’s key to look beyond the high level and into the details of your network.</p> <p>Kentik Data Explorer provides a fast, network-centric, easy-to-use interface to query real-time and historic network telemetry data. Select from dozens of dimensions or metrics, 13 different visualizations and any data sources. Set time ranges and search 45 days or more of retained data.
Query results return within seconds for most searches.</p> <p>This lets you see traffic, routing, performance, and device metrics in total, by device, region, customer, application, or any combination of dimensions and filters that you need to zoom in and find underlying issues.</p> <p>Kentik’s Data Explorer provides graphs or table views of network telemetry useful for all types of troubleshooting tasks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2fuCeB1iG0EjvzY7FH19Pu/59a9b25cb17cc09c0603edf3a1682820/data-explorer-view.png" style="max-width: 800px; margin-bottom: 5px;" class="image center no-shadow" thumbnail alt="Network Troubleshooting: Data Explorer view" /> <div style="max-width: 800px; font-size: 96%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Kentik’s Data Explorer provides graphs or table views of network telemetry useful for all types of troubleshooting tasks</div> <h3 id="software-tools-help-facilitate-network-troubleshooting">Software Tools Help Facilitate Network Troubleshooting</h3> <p>Marc Andreessen of Netscape fame once said that “software is eating the world.” But software has made things a lot easier when it comes to network troubleshooting. It has taken over from the manual tools run from a terminal or network device.</p> <p>There are software tools that ping not just one device but multiple devices simultaneously for availability and path. Many are flow and packet data stores with software agents sending network data. All this is done and put on a nice dashboard for you. Network troubleshooting is still hard, but software makes it easier.</p> <div class="pullquote right" style="margin-top: 15px;">Network troubleshooting is still hard, but software makes it easier.</div> <p>However, in this cloud-native and multi-cloud infrastructure era, some software makes it easier than others. For that, you need to move beyond traditional monitoring software because it’s not enough anymore. You need to move to observability software.</p> <p>With software tools like products from Kentik, you can use the devices to send data to observe the state of your network instead of pulling it from the network.</p> <h2 id="network-troubleshooting-best-practices">Network Troubleshooting Best Practices</h2> <p>Whether you’re using network observability tools, or have a network small enough where the other tools are sufficient, here are some best practices you should consider.</p> <h3 id="develop-a-checklist">Develop a Checklist</h3> <p>You should develop a checklist of steps like what I’ve outlined above when troubleshooting.</p> <p>In his book <em>The Checklist Manifesto</em>, Dr. Atul Gawande discusses how checklists are used by surgeons, pilots, and other high-stress professionals to help them avoid mistakes. Having a checklist to ensure that you go through your troubleshooting steps promptly and correctly can save your users big headaches. And save you some aggravation.</p> <h3 id="ready-your-software-tools">Ready Your Software Tools</h3> <p>You want to have already picked the network troubleshooting tools you need to troubleshoot a network problem before you get an emergency call. That isn’t the time to research the best software tool to use. By then, it’s too late.</p> <p>If you run into a network troubleshooting problem that took longer than you hoped with one tool, research other tools for the next time.
<img src="//images.ctfassets.net/6yom6slo28h2/3ziSrcEgTuSROvC6U76UZL/932345a47dfbc78c71ee3da2a29e8619/notes-documentation.jpg" style="max-width: 450px;" class="image center" /> <h3 id="get-documentation">Get Documentation</h3> <p>It’s tough to jump on a network troubleshooting call and not know much about the network you’re going to, well, troubleshoot. IT organizations are notorious for not having enough documentation. At times, you know it’s because there aren’t enough of you to go around.</p> <p>But you have to do what you can. Over time, you should compile what you learn about the network. Document it yourself if you have to, but have some information. Identify who owns what and what is where. Otherwise, you could spend lots of troubleshooting time asking basic questions.</p> <h3 id="prepare-your-telemetry">Prepare Your Telemetry</h3> <p>In addition to having the software to move with speed, you’ll need to be already sending, saving, and ideally detecting anomalies in your network telemetry. For more details on network telemetry, see our blog posts <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable/" title="Kentik Blog: The Network Also Needs to be Observable">“The Network Also Needs to be Observable”</a> and <a href="https://www.kentik.com/blog/the-network-needs-to-be-observable-part-2-network-telemetry-sources" title="Kentik Blog: Network Telemetry Sources">“Part 2: Network Telemetry Sources”</a>.</p> <h3 id="follow-the-osi">Follow the OSI</h3> <p>If you closely follow the toolset above, you may have noticed that I’m moving up the stack with each tool.</p> <p>In some ways, I’m following the <a href="https://www.networkworld.com/article/3239677/the-osi-model-explained-and-how-to-easily-remember-its-7-layers.html" target="_blank">Open Systems Interconnection (OSI)</a> stack. When troubleshooting, you want to start at the physical layer and work your way up. If you start by looking at the application, you’ll mask potential problems lower in the stack: physical issues such as interface errors, forwarding problems at layer 2, or routing issues at layer 3.</p> <p>So follow the stack, and it won’t steer you wrong.</p> <h3 id="preparedness-and-network-troubleshooting">Preparedness and Network Troubleshooting</h3> <p>And there it is. When the network is down, troubleshooting can be a daunting task, especially in today’s <a href="https://www.kentik.com/resources/hybrid-cloud-network-observability-gap-whitepaper/" title="Learn more about the difficulties introduced by hybrid environments in our whitepaper, Hybrid Cloud and the Network Observability Gap">hybrid infrastructure environments</a>.</p> <p>But if you follow the steps I’ve outlined, you can make things easier on yourself. Create your network troubleshooting checklist, decide on your toolset, and get ready. If it’s not down now, the network will likely be down later today.</p> <p>Now that you know this about network troubleshooting, you’ll be ready when the network issues affect traffic in the middle of the night. You won’t like it; nobody likes those 1:00 A.M. calls.
But you’ll be prepared.</p><![CDATA[When DNS Fails in the Cloud: DNS Performance and Troubleshooting]]><![CDATA[Cloud Solution Architect Ted Turner describes the process of migrating applications to the cloud, and gives key takeaways on why building your observability platform to understand both application state and the underlying infrastructure is key to maintaining uptime for customers.]]>https://www.kentik.com/blog/when-dns-fails-in-the-cloud-dns-performance-and-troubleshootinghttps://www.kentik.com/blog/when-dns-fails-in-the-cloud-dns-performance-and-troubleshooting<![CDATA[Ted Turner]]>Fri, 11 Dec 2020 05:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/5uKm4fKkc73VmzWfDeROxG/1619e374d5e9cb0da241f84a909aae97/cloud-stick-figure.jpg" style="max-width: 250px; margin-right: 20px;" class="image right no-shadow" /> <p>A while back I was working for a large-scale <a href="https://www.kentik.com/blog/scaling-bgp-peering-kentik-saas-environment/">SaaS</a> platform. We were in the process of migrating all of our applications into the cloud. Our application stack used tremendous amounts of east-west traffic to help various application services understand state.</p> <p>To facilitate the migration of our applications, we split DNS into three tiers:</p> <ul> <li>Data center-based resolution from the cloud over VPN, to DC-hosted DNS resolvers</li> <li>Cloud-based DNS for our partners and customers (SaaS DNS)</li> <li>Cloud-based DNS for resolution within our cloud service provider (PaaS DNS)</li> </ul> <p><a href="https://www.kentik.com/resources/aws-cloud-adoption-visibility-management/">Cloud</a> providers will rate-limit your resources to ensure their ability to provide service for all of their customers. Given those rate limits, the decision to use a SaaS-based external service made sense: it allowed some security logging and scaled to our performance needs, without the rate limits.</p> <div class="pullquote right" style="max-width: 340px;"><b>TAKEAWAY:</b><br />Architect your DNS to account for any DNS-related outages—CSP-based, SaaS-hosted, internally hosted—while improving network observability.</div> <p>What we learned after our outage is that our SaaS DNS was provisioned from our cloud service provider across a dedicated BGP private peering link. The private peering was shared across two routers per side.</p> <p>The problem unfolded when our APM suite showed massive outages of our application stack. With east-west traffic patterns, we needed to look at both our data center, as well as our <a href="https://www.kentik.com/go/webinar-cloud-performance/">cloud</a> resources. Our APM suites showed our calls between our applications were timing out. Of note, none of our partner-based services was showing any degradation or time-outs. Knowing that our CSP partner-based services were working well, we focused our efforts on two of our application stack legs: DC-related infrastructure and SaaS-hosted infrastructure.</p> <p>We opened up technical tickets with both our <a href="https://www.kentik.com/resources/cloud-visibility/">cloud</a> service provider as well as our SaaS-hosted provider.
We started investigating whether this was a cloud service provider issue (it was), a SaaS partner issue, or a data center issue.</p> <p>About 20 minutes after our application stack impact alarms sounded, we noted resources were starting to come back online.</p> <p>As we were troubleshooting with our <a href="https://www.kentik.com/blog/how-to-maximize-the-value-of-streaming-telemetry-for-network-monitoring-and/">cloud</a> service provider technical team, the cloud service provider identified that there was a single router failure in the CSP, which affected all <a href="https://www.kentik.com/blog/capacity-planning-done-right/">traffic</a> routing towards our SaaS provider for DNS. The fix was an automated re-provisioning of the router connected to the dedicated private BGP peer. As soon as the dead/dying router was removed from service, the previously impacted DNS SaaS provider traffic started flowing as expected. When DNS traffic normalized, the application stacks could start executing recovery operations for each component of the application stack. Teams with more mature applications were able to automatically stabilize. Teams with leaner or less mature applications had to help the software along by hand.</p> <p>While the CSP fixed the issue in about 20 minutes, we still had a gaping hole in our application stack for stability. Every time our CSP or SaaS provider experienced any <a href="https://www.kentik.com/go/webinar-synthetics101-2021-03/">performance</a> or stability issues, our application stack would also be impacted. We needed to figure out where our dependency was built into our application stack (DNS), and figure out a method to allow for fast healing. The end goal is to take care of our customers and avoid impacts.</p> <p>The router on the CSP side died, causing a failed forwarding path. During the post-mortem we identified that 50% of our traffic headed to our SaaS DNS provider was lost upon leaving our cloud service provider. The failed router was dying, but had not failed completely enough to drop its BGP adjacency.</p> <p>The CSP management plane identified a failing condition in the router. At about the 20-minute mark of impact to the application stack, the dying router was removed from production, providing immediate relief. The CSP management plane automatically deployed a new router with associated BGP peering.</p> <p>At this point we started an internal review of the application stack build and identified a single point of failure in our configuration of the OS. Each of our DNS configuration settings was uniquely built, tuned and configured for a very specific slice of our application stack needs.</p> <p>The APM suite is great for instrumented code, but had a hard time identifying that DNS and router path were impacting the application stack. Application stack calls that time out have a large variety of causes.</p>
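<p>One inexpensive guard we could have had in place from the start is a probe that exercises name resolution itself, independent of the application. Below is a minimal sketch (illustrative Rust using only the standard library’s resolver; the hostname is an example, and this is not our production tooling) that records resolution latency and failures, exactly the signal our APM suite couldn’t give us:</p> <div class="gatsby-highlight" data-language="rust"><pre class="language-rust"><code class="language-rust">use std::net::ToSocketAddrs;
use std::time::Instant;

fn main() {
    // Placeholder name; a real probe would cycle through the names your
    // application stack actually depends on, via each of your resolvers.
    let target = "api.example.com:443";
    let started = Instant::now();
    match target.to_socket_addrs() {
        Ok(addrs) =&gt; {
            println!("resolved {} addresses in {:?}", addrs.count(), started.elapsed());
        }
        // A spike in errors or latency here points at DNS, or the path to
        // your resolvers, long before application-level timeouts do.
        Err(err) =&gt; println!("resolution failed after {:?}: {err}", started.elapsed()),
    }
}</code></pre></div>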
<p>Building your observability platform to understand both the application state and the underlying infrastructure is key to building and maintaining uptime for customers.</p> <p>Upon review of our OS configurations, we decided to implement a few changes:</p> <ul> <li>Allow for caching of DNS-resolved queries (dnsmasq, Unbound)</li> <li>Allow for DNS queries to be resolved by two providers: <ul> <li>SaaS DNS provider</li> <li>CSP DNS provider</li> </ul> </li> <li>Remove the JVM’s negative DNS caching TTL (<code class="language-text">networkaddress.cache.negative.ttl</code>)</li> </ul> <p><strong>More Takeaways</strong></p> <ul> <li>APM suites are key to understanding the health of microservices and service mesh implementations</li> <li>Monitor the health of your CSP, SaaS providers, and all paths connecting these resources</li> <li>DNS flows should be 1:1 for day-to-day traffic (i.e., 50% loss in DNS “return” traffic killed the application stack)</li> <li>A reduction in DNS return flows to any of your DNS servers may indicate a failure in routing paths</li> </ul><![CDATA[Kentik Firehose: The Missing Piece in Full-stack Monitoring]]><![CDATA[Kentik Firehose is a way for you to export enriched network traffic data from the Kentik Network Observability Platform. With Firehose, organizations can directly integrate network data into other analytic systems, message queues, time-series databases, or data lakes.]]>https://www.kentik.com/blog/kentik-firehose-the-missing-piece-in-full-stack-monitoringhttps://www.kentik.com/blog/kentik-firehose-the-missing-piece-in-full-stack-monitoring<![CDATA[Aaron Kagawa, Daniella Pontes]]>Wed, 09 Dec 2020 05:00:00 GMT<p>Networks are pervasive. They provide the means for all digital communications. Although often unpretentiously represented by a single line or reduced to their famous canonical cloud bubble form, modern networks are anything but that simple.</p> <p>What is called “the network” is indeed a myriad of paths between infrastructure elements, physical and virtual, dynamic and context-rich, hard to make sense of. So everything network-related is hidden in basic representations. Once all complexity is out of sight, it is easy to take the network for granted and not give a second thought to the cloud bubble or the line in the diagram, assuming that somehow everything will work out to benefit your applications and users.</p> <p>The fact is that businesses cannot count on the network without paying close attention to it. Those responsible for the network need to make, and regularly revisit, smart choices, and fix problems when necessary. In the digital era, being agile is a prerequisite, so you need clear answers fast. The time has come to bring the networks you depend on into a full Hi-Fi view.</p> <h2 id="stakes-are-high-for-digital-businesses">Stakes are high for digital businesses</h2> <p>The digital experience has become make-or-break for the modern business betting on clouds, internet delivery and — with the pandemic — a distributed workforce. The bottom line is: regardless of whether you’re connecting microservices in a cloud-native architecture, workloads in hybrid deployments, or building the digital delivery path to the customer, the reliability, quality, and cost-effectiveness of the network will define the winners.</p> <p>It is time to take a closer look at that cloud in your diagram and the line connecting two dots in your service map. You will see that the networks are heterogeneous. Some network infrastructure may be owned, like in your data centers and corporate WAN/SD-WAN, and others shared, like in public clouds and over the internet.
Furthermore, these diverse, dynamically routed, and context-rich networks are also subject to traffic bursts, outages, degradations, and attacks. Only through data can network pros make sense of, interact with, plan for, and get answers from these highly complex environments.</p> <h2 id="the-network-through-the-lens-of-data">The network through the lens of data</h2> <p>If reality shows that business success depends on highly complex networks, then network observability is a must. To observe the network, you need to collect, enrich, and analyze large, diverse, granular data sets including telemetry and events from devices, traffic flows, synthetic tests, and routing information, together with a variety of context data (e.g., business, applications, users, geolocation, threat, etc.). That is what Kentik is good at. By providing network observability, we help network pros make their networks reliable and fast — despite the odds.</p> <p>What if all this <a href="/blog/the-network-also-needs-to-be-observable-part-4-telemetry-data-platform">network observability data</a> could be used company-wide, benefiting other teams and improving business analytics? It is now possible with Kentik Firehose.</p> <h2 id="what-is-kentik-firehose">What is Kentik Firehose?</h2> <p><a href="https://www.kentik.com/product/firehose/">Kentik Firehose</a> is a way for you to export enriched network traffic data from the <a href="https://www.kentik.com/product/kentik-platform/">Kentik Network Observability Platform</a>. With Firehose, organizations can directly integrate network data into other analytic systems, message queues, time-series databases, or data lakes. Firehose uses a client binary called KTranslate to receive data exported from our platform and perform the desired transformations to deliver the data in the format your systems can ingest, like JSON and AVRO, among others.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/7jy0RpkzHLrTyxw67uUHji/7b525692257c153fe703adcc86117647/firehose-20201210.png" style="max-width: 800px;" class="image center no-shadow" /> <h3 id="what-you-get-from-it">What you get from it</h3> <p>You can send unified, enriched and correlated network data to your application monitoring tool (e.g., New Relic), data sinks (e.g., Splunk, InfluxDB, Elasticsearch, AWS S3, etc.), or publishing platforms (e.g., Kafka, AWS Kinesis, Google Pub/Sub, etc.).</p>
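<p>For a sense of what consuming that export can look like, here is a hypothetical sketch (the field names are invented for illustration and the <code class="language-text">serde_json</code> crate is assumed; consult the Firehose documentation for the real schema) that parses one JSON-formatted record and pulls out the fields another system might correlate:</p> <div class="gatsby-highlight" data-language="rust"><pre class="language-rust"><code class="language-rust">use serde_json::Value;

fn main() -&gt; Result&lt;(), serde_json::Error&gt; {
    // A single record as a downstream consumer might receive it.
    // Field names are hypothetical, not the actual Firehose schema.
    let line = r#"{"src_addr":"10.1.2.3","dst_addr":"203.0.113.9","in_bytes":48210,"app":"checkout"}"#;
    let record: Value = serde_json::from_str(line)?;

    // Correlate network volume with an application tag downstream.
    println!(
        "{} to {}: {} bytes for app {}",
        record["src_addr"], record["dst_addr"], record["in_bytes"], record["app"]
    );
    Ok(())
}</code></pre></div>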
<h3 id="and-what-for">And what for</h3> <ul> <li>Close the network observability gap in your teams’ full-stack monitoring</li> <li>Include streaming analytics and correlate network data with that of other systems to uncover new insights</li> <li>Enhance your business intelligence and predictive analytics with foundational information about traffic distribution, network reliability, and digital experience</li> <li>Extract more value from new classes of services and workflows powered by data analytics</li> </ul> <h2 id="closing-the-network-gap-in-full-stack-monitoring">Closing the network gap in full-stack monitoring</h2> <p>While Kentik users today get answers quickly about what’s happening on the network, their counterparts from other teams are mostly in the dark, lacking critical knowledge of how the network impacts their application performance, transaction volume, customer behavior, and revenue growth, among other technical and business KPIs.</p> <div class="pullquote right">In conversations with customers and IT professionals, we often hear that observability at the network layer is missing from their application monitoring and analytics frameworks.</div> <p>In conversations with customers and IT professionals, we often hear that observability at the network layer is missing from their application monitoring and analytics frameworks. The pain is most acute, they say, when troubleshooting application performance and some services experience high latency. Another pain point is not having insight into the performance implications and traffic costs of rolling out cloud deployments in new regions. Finally, they need network observability to validate that they are meeting the performance goals of existing services or even identify new business opportunities from their customers’ traffic profile, among other scenarios.</p> <h2 id="see-more-and-better-do-more-and-better">See more and better, do more and better</h2> <p>You can get almost limitless value from adding network observability to your existing intelligence. Focusing on the application and digital experience is a great start. Check out other <a href="https://www.kentik.com/resources/kentik-firehose/">Kentik Firehose use cases</a> to improve application performance, troubleshooting processes, workflow optimization, and business intelligence.</p><![CDATA[If Your Network Traffic Is Continuous, Why Isn’t Your Synthetic Testing?]]><![CDATA[In this post, we show how Kentik Synthetic Monitoring supports high-frequency tests with sub-minute intervals, providing network teams with a tool that captures what's happening in the network, including subtle degradations.]]>https://www.kentik.com/blog/if-your-network-traffic-is-continuous-why-isnt-your-synthetic-testinghttps://www.kentik.com/blog/if-your-network-traffic-is-continuous-why-isnt-your-synthetic-testing<![CDATA[Anil Murty]]>Thu, 29 Oct 2020 04:00:00 GMT<p>Reliable networks are the <em>sine qua non</em> of modern business using digital technologies, clouds, and the internet to build and deliver services. In other words, if the network is not doing its part, success becomes unattainable.</p> <p>Synthetic monitoring was adopted to give teams some level of confidence that the network is delivering a good experience.
By actively performing tests at regular intervals, IT teams can get a head start on detecting and addressing problems proactively, and build a baseline of quality metrics such as latency, jitter, and packet loss.</p> <p>If the performance measurements reach unacceptable levels, it is straightforward to conclude that the network is at risk of not supporting a good digital experience. On the other hand, if the metrics collected don’t show degradation, how confident can IT teams be that the digital experience is OK? To answer this question, one must first understand the effectiveness of the tests performed. Are they being run frequently enough to capture events impacting your traffic?</p> <h3 id="legacy-solutions-that-test-infrequently-are-blind-to-subtle-degradations">Legacy solutions that test infrequently are blind to subtle degradations</h3> <p>Here is where legacy synthetic monitoring solutions fail. As a general rule, the less frequent the test, the less effective it is in catching subtle changes in performance. Due to a combination of product limitations and cost inefficiencies, legacy synthetic monitoring solutions keep test intervals in windows of minutes or tens of minutes, making testing too sparse to be meaningful.</p> <p>Monitoring network performance is fundamental to assessing digital experience and service quality. Testing frequency is a primary factor in ensuring that synthetic monitoring provides meaningful insight into the network’s reliability and digital experience quality. So, when we designed Kentik Synthetic Monitoring, we made sure that it can support high-frequency tests with sub-minute intervals, providing network teams with a tool that can capture what is going on in the network, including subtle degradations.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3PsjmC7wpzzqOirDwch0hG/356ca0acb37844cc70ae60ba48e58912/degradations.png" style="max-width: 700px; margin-bottom: 10px;" class="image center" thumbnail /> <div style="max-width: 680px; font-size: 98%; text-align: center; margin: 0 auto; margin-bottom: 25px;">With Kentik you can detect all degradation events on the network</div> <h3 id="continuous-testing-is-also-vital-to-improving-mttr-and-digital-experience">Continuous testing is also vital to improving MTTR and digital experience</h3> <p>Lack of sub-minute test windows was a recurrent complaint we received from companies using legacy synthetic monitoring solutions, especially those running applications that are very sensitive to performance degradation.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4FwuqrdHHB6eyOFzyl7STh/ff3f9aedb4d3b30956a7c5645ca99266/test-frequency-1200w" style="max-width: 600px;margin-bottom: 10px;" class="image center" thumbnail /> <div style="max-width: 580px; font-size: 98%; text-align: center; margin: 0 auto; margin-bottom: 25px;">Only Kentik makes continuous synthetic testing affordable. Continuous testing is critical for a number of use cases, such as machine automation.</div> <p>It is fair to say that network performance has become paramount to the digital experience provided to the users and very hard to control and predict. Network performance is often impacted by traffic or routing dynamics, making performance conditions a fast-moving target and a daunting challenge for quality assurance.</p> <p>Testing the network frequently increases the chances of knowing exactly where and when a threshold was crossed. Short intervals are especially important when dealing with intermittent issues.</p>
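<p>To make the arithmetic concrete, consider the sketch below (illustrative Rust, standard library only; the endpoint and the one-second cadence are assumptions, not a description of how Kentik’s agents work). A once-per-second TCP connect probe like this turns a ten-second brownout into ten data points, while a five-minute test interval would likely miss it entirely:</p> <div class="gatsby-highlight" data-language="rust"><pre class="language-rust"><code class="language-rust">use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    // Example endpoint; a real probe would target your own services.
    let target: SocketAddr = "203.0.113.10:443".parse().expect("bad address");
    loop {
        let started = Instant::now();
        match TcpStream::connect_timeout(&amp;target, Duration::from_millis(900)) {
            Ok(_) =&gt; println!("connect latency: {:?}", started.elapsed()),
            Err(err) =&gt; println!("probe failed: {err}"),
        }
        // One measurement per second; sparser schedules sail past
        // short loss or latency events.
        thread::sleep(Duration::from_secs(1));
    }
}</code></pre></div>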
<p>By providing testing frequency options that go down to per-second tests, Kentik added significant value to IT’s ability to troubleshoot with precision and speed — even those short events hard to detect and identify.</p> <h3 id="higher-trust-in-the-network-also-means-more-cost-effective-business-models">Higher trust in the network also means more cost-effective business models</h3> <p>When we announced support for per-second tests, we immediately heard of companies’ interest in using automation frameworks. Contrary to what one may think, network reliability is not only a determining factor for technical aspects, it is also an important factor for the overall business cost model.</p> <p>Take the example of automation: it relies on constant communication between the control plane application and the clients executing the tasks. Instructions are sent continuously in real time. Any latency or loss peak event must be identified for the network performance to be more accurately measured and profiled, allowing the overall system to be better planned. The more reliable the communication between control and execution, the less redundancy and safety margin the overall system requires.</p> <p>Our customers and prospects told us that more accurate monitoring of network performance would allow them to architect more efficiently, impacting the business, including location and grid concentration viability.</p> <h3 id="synthetic-monitoring-without-compromises">Synthetic monitoring without compromises</h3> <p>When you monitor infrequently, you are, in fact, back to blind conditions. So, Kentik wanted to provide customers with a synthetic solution that would allow them to monitor network performance as frequently as they need to derive real value.</p> <p>Synthetic monitoring is tainted by solutions that are not effective because they force a compromise in what to test and how frequently. Cost-prohibitiveness is a top issue even at large testing intervals, to say nothing of going sub-minute. Organizations just “police” their monitoring rather than concentrating on what needs to be done.</p> <p>Kentik is committed to providing network observability that customers can rely on and afford. Kentik’s next-gen synthetic monitoring brought to market a solution without the usual test frequency and cost compromises.</p> <p>With <a href="https://www.kentik.com/resources/synthetic-monitoring/" title="Learn more about Kentik Synthetic Monitoring in our Solution Brief">Kentik Synthetic Monitoring</a>, not only can IT teams define high-frequency tests — e.g., per second — the pricing model permits it. Organizations can now perform continuous network performance monitoring to ensure they stay on top of digital experience, detecting even subtle intermittent degradations impacting their traffic, thus achieving higher assurance of the network services’ quality and reliability.</p><![CDATA[Maxing Out Network Content Delivery: Baldur’s Gate 3 Case Study]]><![CDATA[Greg Villain, our director of product management and self-described die-hard gamer, dives into the early access launch of Baldur's Gate 3.
He takes a look from a content delivery perspective to determine if this is a good precursor for the future traffic event when the game releases in 2021.]]>https://www.kentik.com/blog/maxing-out-network-content-delivery-baldurs-gate-3-case-studyhttps://www.kentik.com/blog/maxing-out-network-content-delivery-baldurs-gate-3-case-study<![CDATA[Greg Villain]]>Thu, 15 Oct 2020 04:00:00 GMT<p>As a Director of Product Management here at Kentik, I specialize in service provider networks and content delivery over the internet. I am also a die-hard gamer!</p> <p>In my day job, I’ve helped large providers build global content delivery infrastructure and OTT (Over The Top) content delivery. I’ve also worked closely with ISPs to help them gain insight about how their subscribers download content, disambiguating the myriad of combined connectivity and traffic handover methods so that they can manage cost and performance to the best of their ability.</p> <p>I’ve written several Kentik blog posts and articles about this in the past, including <a href="https://www.kentik.com/blog/kentik-true-origin-brings-cdn-content-delivery-insights-to-isps/">Kentik True Origins Brings CDN Insights to ISPs</a> and <a href="https://www.networkcomputing.com/networking/CDNs-Require-New-Network-Visibility-Standards" target="_blank">The CDN Era Requires New Network Visibility Standards</a>.</p> <p>As a product owner, I am focused on anything related to CDN and OTT traffic classification. I regularly dive into internet-based content events. I make sure our OTT/CDN detection engine picks these up and that the engine does the best job possible at helping ISPs disambiguate the “infrastructure visibility gap.”</p> <p>As an avid gamer, I’ve been gaming in any shape or form since the early nineties, when everything was entirely offline. I have accumulated many acquaintances and much knowledge of this industry over the years, trying to stay up to date with what’s new, what’s to come, and how games are delivered to end-users.</p> <h3 id="tuesday-october-6-2020-the-early-access-launch-of-baldurs-gate-3">Tuesday, October 6, 2020: The early access launch of Baldur’s Gate 3</h3> <p>I was excited when I got this email update from the Steam games distribution platform in my inbox:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3E2i4PZqExToKDAnbeBqtV/a12b8b6110a3209e5a52464552ed54f0/baldur3-email.png" style="max-width: 700px" class="image center no-shadow" /> <p>So excited that I had to tweet out loud about it (I’m <a href="https://twitter.com/grrrrreg" target="_blank">@grrrrreg</a> on Twitter)!</p> <img src="//images.ctfassets.net/6yom6slo28h2/42OJ2M9cyagszQK8yXu79J/a7691392b3048046277de136d8d17ae1/greg-baldur-tweet.png" style="max-width: 600px" class="image center no-shadow" /> <p>For those not familiar with it, the <a href="https://en.wikipedia.org/wiki/Baldur%27s_Gate_(series)" target="_blank">Baldur’s Gate series</a> is the most iconic role-playing game franchise that was ever released, based on the Forgotten Realms universe and ruleset from the legendary tabletop role-playing game Advanced Dungeons &#x26; Dragons. I remember clocking hundreds of hours on Baldur’s Gate 1, in 1998, then again in 2000 when the sequel was released.</p> <p>Fast-forward 20 years, and a studio other than the initial <a href="https://www.bioware.com/" target="_blank">BioWare</a> studio, <a href="https://larian.com" target="_blank">Larian Studio</a>, had acquired the license and set itself to develop the third installment of this excellent series. 
I learned that while only 30% of the way completed, the game was opened for early release — something that was unthinkable back in the day (remember, you couldn’t release a game and then update it, since everything was pretty much offline). Call me a die-hard fanboy, but I didn’t think twice about buying a game that was in its early stages of development.</p> <p>Now here’s where things get even more interesting from a content delivery perspective. Look at the game’s system requirements on Steam: OMG!</p> <img src="//images.ctfassets.net/6yom6slo28h2/56bXHV1mPGN2xc1BoAEbrD/e12f9b78d34ae1a500cdb67eae6e9561/baldur-reqs.png" style="max-width: 550px" class="image center no-shadow" /> <p>I’ve never installed a game that was so huge — 150GB of disk space required! Back in the day, it would have been unthinkable. To add some perspective, DVDs, which were the primary distribution medium, could store 4.7GB (single-layer DVD) or up to 8GB (dual-layer DVD). Think about this: Baldur’s Gate 3, back then, would have required a box containing at least 18 dual-layer install DVDs!</p> <p>Today, ISPs have to carry that same payload to every one of their gaming subscribers who wants to give the game a spin. Seeing this further convinced me that OTT visibility is key to any ISP’s need to keep performance high and delivery costs low.</p> <h3 id="through-the-eyes-of-a-kentik-user">Through the eyes of a Kentik user</h3> <p>In 1998, when I was about to enter university as an MSc student, I remember my gaming nerd friends and I were very excited about this game. Now, more than 20 years later, I’m almost 100% certain that any of them, still in the IT industry, will buy and play that game to reminisce fondly about the hours spent staring at our screens exploring the medieval world of the Sword Coast.
Remember, this is just an early release, and many players have no interest in playing an unfinished game.</p> <p>So I wanted to dig in and isolate gaming traffic. While the usual suspects still occupy the top of the gaming traffic ladder, something seems to be worth our attention.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4nLgmzLZO3owPNEpbqJNU/8d88d90d9ea25d9b6c61202f7cede471/steam-games.png" style="max-width: 800px" class="image center" thumbnail /> <p>Notice the bump on that Steam content provider traffic sparkline. The event, even though displayed with other types of traffic, was hiding in plain sight!</p> <p>The Steam bump was even more apparent when switching from a 1-week view to a 30-day view:</p> <img src="//images.ctfassets.net/6yom6slo28h2/JshwiG0wguPZqpppDreI9/6b7bff2839a5fc831b790e0a9e7f5327/callofduty-baldursgate.png" style="max-width: 560px" class="image center" /> <p>I should also mention that Call of Duty: Modern Warfare got an update on September 29, which you can see as a simultaneous bump on Xbox Live, PlayStation and Blizzard. Googling it confirms this hunch (I’m getting good at it): <a href="https://www.androidcentral.com/call-duty-modern-warfare-announces-season-six-adds-operators-guns-maps-and-more-september-29" target="_blank">September 29 marks the release of CoD’s Season 6</a>.</p> <p>Going back to Baldur’s Gate 3, I wanted to measure the impact this release has had on my edge network.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4VkRTpntt6NVWNl3v2MvUx/8b0e270bb84d3b51a0c703478b31c2ea/steam-games-download.png" style="max-width: 800px" class="image center" thumbnail /> <p>What I notice first is that Steam traffic almost tripled compared to the regular peaks. It went from slightly above 10Gbps to nearly 30Gbps. Though we’re not talking about huge volumes, depending on handover capacity, this could cause an issue in interconnection ports.</p> <p>I will also note that 60% of this Steam traffic comes via transit, which is likely the most costly method, and the least performant in terms of the number of hops — as well as the fact that 30% comes from peering at an internet exchange.</p> <p>When I looked closely, I saw that both transit and IX connectivity carry traffic directly from the ASN owned by Valve (the company behind the Steam games distribution platform): AS32590.</p> <div style="font-size: 96%"><em>Note: The screengrab below was modified to remove provider names for CDNs, transit and ASNs.</em></div> <img src="//images.ctfassets.net/6yom6slo28h2/2vlMIOlyAtuRyiuQh38dhf/ff063f249f976d00de7607647295cc4b/steamgames-sankey.png" style="max-width: 800px" class="image center" thumbnail /> <h3 id="what-can-this-isp-do-moving-forward-to-prepare-for-the-2021-release-of-this-monumentally-epic-game">What can this ISP do moving forward to prepare for the 2021 release of this monumentally epic game?</h3> <p>A couple of ways to optimize delivery would be to:</p> <ol> <li><strong>Figure out why Valve's ASN goes both to transit and IX</strong> <ul> <li>Is it because my peering ports on the internet exchange are full?</li> <li>Is it because Valve’s peering ports on the internet exchange are full?</li> </ul> </li> <li><strong>Try and get in touch with CDN#2</strong> <ul> <li>Is there any chance I can get more of this sourced from embedded caches in my network?</li> <li>Am I reaching CDN#2's max embedded cache capacity inside my network?
</li> </ul> </li> </ol> <p>If you are interested in hearing more about my gaming content traffic forensics adventures, please message me; I can nerd out for hours about this!</p> <p>If you are having trouble bridging the infrastructure visibility gap of CDNs and OTT services, <a href="#signup_dialog">take Kentik for a spin</a>. I’ve got your back!</p><![CDATA[Kentik Adds Integrated Synthetic Monitoring to Deliver Next-Gen Network Observability]]><![CDATA[What is synthetic testing, and where do point solutions fail? In this blog post, Kentik Co-founder and CEO Avi Freedman discusses network performance testing and why Kentik designed an autonomous and affordable synthetic monitoring capability that is fully integrated into our network observability platform.]]>https://www.kentik.com/blog/kentik-integrated-synthetic-monitoring-next-gen-network-observabilityhttps://www.kentik.com/blog/kentik-integrated-synthetic-monitoring-next-gen-network-observability<![CDATA[Avi Freedman]]>Wed, 23 Sep 2020 04:00:00 GMT<p>Digital experience drives business today, and it is the brand and storefront for your users and customers. Increasingly, digital experience is also the connector for your employees.</p> <p>On top of the critical baseline of understanding what the demands and activities are on your networks, understanding the performance of the traffic flowing across your infrastructure is critical to delivering a great digital experience.</p> <p>As we <a href="https://www.kentik.com/press-releases/kentik-launches-synthetic-monitoring-solution/">announced</a> in July, Kentik released our Synthetic Monitoring offering, adding integrated, autonomous, pervasive performance testing to our already leading network traffic analytics and observability platform.</p> <h3 id="what-is-synthetic-testing">What is synthetic testing?</h3> <p>Synthetic testing is done by sending synthesized tests that are not from real users to check availability and performance metrics, such as latency, jitter and packet loss. Also called “active” testing in academia, <a href="https://www.kentik.com/resources/synthetic-monitoring/" title="Learn more about Kentik&#x27;s synthetic testing solution, Kentik Synthetic Monitoring">synthetic testing</a> is a way to get an idea of performance when your network elements themselves can’t passively observe it from watching your traffic. There are other ways of getting performance data, either from packets (which our cloud-native and cloud-scale customers don’t do), or by using data from up the stack to shine a light on the network, but this data is often segmented away from network teams.</p> <h3 id="arent-there-already-synthetics-offerings">Aren’t there already synthetics offerings?</h3> <p>Since Kentik started selling network analytics in 2015, most of our large customers have asked us to either implement or integrate with synthetics offerings. Generally, their frustrations have been around:</p> <ul> <li>Standalone systems</li> <li>Excessive manual configuration</li> <li>Noisy results due to a lack of understanding of what the network does or is doing</li> <li>The need to monitor confusing and expensive bills</li> </ul> <p>My personal history with synthetics goes back to the ‘90s with Keynote. 
It was an expensive standalone platform full of noise, where people spent a lot of time monitoring their bills, manually setting up tests, and tuning the network to be close to the agents.</p> <p>Fast-forward 25 years: before <a href="https://www.kentik.com/resources/synthetic-monitoring/" title="Kentik Synthetic Monitoring solution brief">Kentik’s Synthetic Monitoring</a> launch, our customers told us that the vendors they used still required manual setup and tuning, imposed longer debug times to understand whether a synthetic test failure was a real problem, and left them spending more time monitoring bills than maintaining the production stack!</p> <h3 id="whats-needed-an-integrated-autonomous-solution">What’s needed? An integrated, autonomous solution</h3> <div class="pullquote right" style="max-width: 350px;">Kentik has designed an autonomous and affordable synthetic monitoring capability that is fully integrated into Kentik’s network observability platform.</div> <p>For decades performance testing has been done by point solutions, which are not fully integrated, and create massive visibility and intelligence gaps. What’s worse is that these solutions waste time and effort on noisy alerts that have no relevance to actual traffic and also lack the context needed to validate that relevance.</p> <p>In order to be effective for the modern, internet-centric and orchestrated era, a synthetic monitoring solution must have:</p> <ul> <li><strong>No blind spots</strong>. It must monitor everything that matters by understanding the real user traffic.</li> <li><strong>No operations overhead from manual interventions</strong>. It must avoid “stale” tests and errors in repetitive manual tasks while freeing up resources to be assigned to higher-value workflows.</li> <li><strong>Intelligent alerts and diagnostics</strong> with less noise and more actionable insights that reduce alarm fatigue and time-to-resolution (MTTR).</li> <li><strong>Affordable continual testing</strong> to see everything relevant, but focus on monitoring your digital experience, not your monitoring bill.</li> </ul> <p>To deliver on the promise of measuring user and employee experience, Kentik has designed an autonomous and affordable synthetic monitoring capability that is fully integrated into Kentik’s network observability platform.</p> <h3 id="why-are-integration-and-context-so-key">Why are integration and context so key?</h3> <p>Knowledge of what the real traffic is and how it flows in context with the various infrastructures (e.g. internet, cloud, WAN, data centers, etc.) is fundamental. Without knowing what networks you’re depending on and their state and traffic, you have:</p> <ul> <li>The wrong tests</li> <li>Missing context that wastes valuable time on setup and interpretation</li> <li>Alert fatigue due to the lack of context and integration</li> <li>No ability to orchestrate and automate based on your important apps and customers</li> </ul> <h3 id="autonomous-and-proactive">Autonomous and proactive</h3> <p>The answer? Proactive testing, set up autonomously based on your network traffic and business priorities.</p> <p>As infrastructure evolves into increasingly fast, highly complex environments, automation becomes a necessity, and autonomy becomes a requirement for interacting in a timely and effective manner.
At internet scale and orchestration speed, it’s simply not possible to have humans in the loop of identifying and testing all of the important aspects of your digital dependencies across cloud, SaaS, CDN, API gateways, and the service providers that connect your users and employees.</p> <p>The promise of self-driving networks requires self-driving analytics, with full network context, and that’s what we’re delivering in our GA release. Kentik uses the data we have about all of the networks from you to the customer and all the telemetry types, including state, health and traffic. And then, the Kentik platform can autonomously configure tests for the things you depend on, change those tests as your application traffic shifts, and present the results in the context of impact on your users.</p> <p>Autonomous configuration is also necessary to truly know about issues before your users do. Manually configuring tests after users complain puts you behind the support curve. Proactively testing your critical dependencies as traffic shifts enables you to know about – and fix – issues before users notice, and often before they’re seriously impacted.</p> <h3 id="we-invite-you-to-use-kentik-synthetic-monitoring">We invite you to use Kentik Synthetic Monitoring</h3> <div class="pullquote right" style="max-width: 350px;">Kentik monitors and alerts on what truly matters to the digital experience of your users: a fundamental leap in usability and value, designed to keep your traffic flowing smoothly and your ops teams sleeping well.</div> <p>Kentik believes that affordable, autonomous testing integrated with network intelligence and observability is a game changer.</p> <p>Kentik Synthetic Monitoring, with trillions of eyes on the global networks we see from traffic measurements observed every day, automates performance monitoring based on intent and in context with your actual traffic and the infrastructures it flows through.</p> <p>Kentik monitors and alerts on what truly matters to the digital experience of your users: a fundamental leap in usability and value, designed to keep your traffic flowing smoothly and your ops teams sleeping well.</p> <p>Whether you’re an enterprise or service provider, we invite you to frequently, easily and powerfully monitor your customer and team digital experience by measuring performance and availability of essential infrastructure, applications and services, including:</p> <ul> <li>SaaS solutions</li> <li>API gateways</li> <li>Applications hosted in the public cloud</li> <li>Internal applications</li> <li>Transit and peer networks</li> <li>Content delivery networks</li> <li>Streaming video, social, gaming and other content providers</li> <li>Site-to-site performance across traditional WAN and SD-WANs</li> <li>Service provider connectivity and customer SLAs</li> </ul> <p>Find more information about <a href="https://www.kentik.com/go/get-started-synthetic-network-monitoring/">Kentik Synthetic Monitoring</a> and try it for yourself, or <a href="https://www.kentik.com/go/get-demo/">get a personalized demo</a> with a Kentik product expert. 
If you have questions or comments, please email me at <a href="mailto:[email protected]">[email protected]</a>.</p><![CDATA[Using Rust for Kentik’s New Synthetic Monitoring Agent]]><![CDATA[Kentik software engineer Will Glozer gives us a peek under the hood of Kentik's new synthetic monitoring solution, explaining how and why Kentik used the Rust programming language to build its network monitoring agent.]]>https://www.kentik.com/blog/using-rust-for-kentiks-new-synthetic-network-monitoring-agenthttps://www.kentik.com/blog/using-rust-for-kentiks-new-synthetic-network-monitoring-agent<![CDATA[Will Glozer]]>Fri, 28 Aug 2020 14:00:00 GMT<p>We recently launched <a href="https://www.kentik.com/go/get-started-synthetic-network-monitoring/" title="Learn more about Kentik Synthetic Monitoring and try it for free!">Kentik Synthetic Monitoring</a>. The industry-leading Kentik Network Intelligence Platform is now the only fully integrated network traffic and <a href="https://www.kentik.com/resources/synthetic-monitoring/">synthetic monitoring analytics solution</a> on the market, and the only solution to enable autonomous testing ― for both cloud and hybrid networks.</p> <p>A key component of our solution is the set of hosts running <a href="https://kb.kentik.com/v4/Ma01.htm" title="Learn more about ksynth in the Kentik Knowledge Base.">ksynth</a>, Kentik’s software agent for synthetic monitoring. Used in both Global Agents and Private Agents and developed in Rust, ksynth generates synthetic network traffic in the form of ICMP pings, UDP-based traceroutes, and HTTP(S) requests. Performance, reliability, and security are the key reasons why we used <a href="https://www.rust-lang.org/en-US/index.html" target="_blank">Rust</a> to develop the agent.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7DIvEW1fNxXisOcppGAy11/047c1ab55710f3f3e0360686039a118e/agent-management-in-kentik-synthetic-monitoring.png" style="max-width: 800px;" class="image" title="Screenshot: Managing agents in Kentik Synthetic Monitoring" thumbnail /> <p style="font-size: 96%; margin-top: 15px; text-align: center"><em>Managing agents in Kentik Synthetic Monitoring</em></p> <h1 id="design">Design</h1> <p>At any given moment, one of our global agents may be running thousands of traffic generation tasks, so aside from an initial setup process, ksynth execution is almost entirely asynchronous. Kentik has been running Rust in production since 2017 (more on that in this <a href="https://www.kentik.com/blog/under-the-hood-how-rust-helps-keep-kentik&#x27;s-performance-on-high/" title="How Rust Helps Keep Kentik’s Performance on High">previous blog post</a>), but this is our first serious use of <a href="https://rust-lang.github.io/async-book/" title="Asynchronous Programming in Rust">async/await</a>, which was stabilized in late 2019.</p> <p>The asynchronous design allows us to model each ping, traceroute, or HTTP request assigned to an agent as a distinct task. These tasks are primarily I/O bound, waiting on packets or timeouts, so each agent can run many tasks without requiring significant resources. Results are exported at predefined intervals from another task, and the agent periodically polls the backend for new work and submits status reports from yet other tasks.</p>
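<p>As a rough illustration of this task-per-test model (not ksynth’s actual code; it assumes the <code class="language-text">tokio</code> crate with its “full” feature set, and the targets and counts are invented), each probe below is an independent task multiplexed onto the runtime:</p> <div class="gatsby-highlight" data-language="rust"><pre class="language-rust"><code class="language-rust">use std::time::Duration;

#[tokio::main]
async fn main() {
    let targets = vec!["192.0.2.1", "192.0.2.2", "192.0.2.3"];
    let mut handles = Vec::new();

    for target in targets {
        // Each probe is its own lightweight task, parked on timers and
        // I/O, so one runtime can multiplex thousands of them.
        handles.push(tokio::spawn(async move {
            let mut ticker = tokio::time::interval(Duration::from_secs(1));
            for _ in 0..5 {
                ticker.tick().await;
                // A real task would send a probe and record the result.
                println!("probing {target}");
            }
        }));
    }

    for handle in handles {
        handle.await.expect("probe task failed");
    }
}</code></pre></div>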
<p>We use the <a href="https://tokio.rs/" title="Tokio: An asynchronous runtime for Rust">tokio async runtime</a> with its work-stealing, multi-threaded scheduler to efficiently spread task load across <em>N</em> threads.</p> <p>Ksynth uses raw sockets for ping and traceroutes, and supports both IPv4 and IPv6, which adds a considerable amount of complexity. IPv4 raw sockets are able to send the full packet starting with the IP header. IPv4 traceroute probes each hop by setting the Time to Live field. However, IPv6 raw sockets do not have direct access to the IP header; instead, <a href="https://tools.ietf.org/html/rfc3542#page-26" title="Advanced Sockets API for IPv6">ancillary data</a> must be passed to <a href="https://man7.org/linux/man-pages/man2/sendmsg.2.html" title="Linux man-pages: sendmsg">sendmsg</a>. To provide this functionality, I’ve contributed an open-source crate, <a href="https://crates.io/crates/raw-socket" title="raw-socket on crates.io">raw-socket</a>, that supports this and more for both synchronous and asynchronous raw sockets.</p> <p>In addition to efficiency, the async/await model makes for very pleasant and readable code, as demonstrated by the following snippet extracted from a very simple ping implementation (see <a href="https://docs.rs/crate/raw-socket/0.0.1/source/examples/ping.rs">https://docs.rs/crate/raw-socket/0.0.1/source/examples/ping.rs</a>):</p> <div class="gatsby-highlight" data-language="rust"><pre class="language-rust"><code class="language-rust">let mut sock = RawSocket::new(ip4, dgram, Some(icmp4))?;
let ping = IcmpPacket::echo_request(1, 2, b"asdf");
let dst = SocketAddr::new("1.1.1.1".parse()?, 0);

let mut buf = [0u8; 64];
let pkt = ping.encode(&amp;mut buf);
sock.send_to(pkt, dst).await?;

let mut buf = [0u8; 64];
let (n, from) = sock.recv_from(&amp;mut buf).await?;
let pong = IcmpPacket::decode(&amp;buf[..n]);</code></pre></div> <h1 id="alternatives">Alternatives</h1> <img src="//images.ctfassets.net/6yom6slo28h2/20WX4Oj8SUG86saiIYGwU0/801b05c8559788397411e37a0de068d6/rust-logo-blk.svg" class="image right no-shadow" style="max-width: 180px;" alt="Rust"> <p>Go and Rust are Kentik’s primary backend languages, and this design seems well-suited to Go’s concurrency model, so why did we choose Rust instead? Performance-wise, I’d expect the two languages to be quite similar as the agent is mostly waiting for <a href="https://www.kentik.com/resources/gartner-market-guide-npmd-2020/">network</a> and timer events. Rust’s lack of garbage collection and attention to minimizing allocations probably allows the agent to run more tasks—using fewer OS resources—which is beneficial, but not a major factor. It’s <em>security</em> and <em>cross-compilation</em> where Rust really shines for this specific application.</p> <h3 id="security">Security</h3> <p>Private Agents run within customer networks, which makes security a key concern. Ksynth relies on raw sockets to send and receive ICMPv4 and ICMPv6 packets and UDP packets with specific hop limits. Traditionally, raw sockets require root permissions, but on Linux systems we are able to use <a href="https://man7.org/linux/man-pages/man7/capabilities.7.html" title="Linux man-pages: CAPABILITIES(7)">capabilities(7)</a> to grant the agent the <code class="language-text">CAP_NET_RAW</code> capability while running as a non-root user.</p>
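<p>In deployment terms, that means the agent binary can be installed and run without root. As a rough illustration (the install path here is an assumption; adjust it for your system), granting and verifying the capability looks something like this:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"># Grant only the raw-socket capability to the non-root agent binary.
sudo setcap cap_net_raw+ep /usr/local/bin/ksynth

# Verify: prints the path along with cap_net_raw=ep
getcap /usr/local/bin/ksynth</code></pre></div>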
<p>Capabilities are a per-thread attribute so—aside from the well-known benefits of memory safety—Rust is also ideal for setting up this sort of sandboxing prior to starting the async runtime, which will manage spawning new threads as needed.</p> <p>In comparison, the programmer has little control over when the Go runtime will spawn new threads, and threads are spawned before any user-level code is executed. This is not an issue if the binary is started with the correct capabilities, but we want to be able to drop all unnecessary capabilities from all threads, even if a customer runs ksynth as root with unbounded capabilities.</p> <h3 id="cross-compilation">Cross-compilation</h3> <p>We deliver ksynth binaries for a number of different architectures including x86_64, armv7, and aarch64. The Rust toolchain has excellent support for cross-compiling and, for the most part, you simply need to add a target with <code class="language-text">rustup</code> and then pass the <code class="language-text">--target</code> option to <code class="language-text">cargo build</code>. Ksynth and its dependencies are mostly pure Rust; however, a few crates like <a href="https://crates.io/crates/zstd" title="zstd on crates.io">zstd</a> and <a href="https://crates.io/crates/ring" title="ring on crates.io">ring</a> contain C code and require a C compiler. To address this, we use the excellent <a href="https://github.com/rust-embedded/cross" title="cross on GitHub">cross</a> tool, which acts as a drop-in replacement for cargo and performs builds inside a container pre-configured with the necessary C toolchain for each target.</p> <p>Go also has excellent support for cross-compiling pure Go code: simply set the <code class="language-text">GOOS</code> and <code class="language-text">GOARCH</code> environment variables. However, when CGO code is introduced you then need the C cross-compiler toolchain, and I haven’t seen a Go equivalent of <code class="language-text">cross</code>. Additionally, Rust crates that depend on C libraries frequently bundle a copy of the library source and build it at compile time rather than depending on system libraries, pkg-config, etc.</p> <p>A final point in Rust’s favor is strong support for building statically-linked executables even when cross-compiling and linking to C libraries. We leverage this to minimize the complexity of shipping packages for many different architectures and operating system releases.</p> <h1 id="next-steps">Next Steps</h1> <p>We’ve just made Kentik Synthetic Monitoring generally available to our customers and currently have agents running in more than 200 PoPs across 44 countries. By the end of the year, we expect to have hundreds more running and a number of additional synthetic test types. We’re eager to see how ksynth scales and how many tasks each agent can handle.
Keep an eye out for future articles discussing this and more on how we’re using Rust at Kentik!</p> <p>If you’d like to try Kentik Synthetic Monitoring for yourself, you can learn more and request a <a href="https://www.kentik.com/go/get-started-synthetic-network-monitoring/" title="Try Kentik Synthetic Monitoring, free for 30 days.">free trial here</a>.</p><![CDATA[Kentik Bridges the Intelligence Gap for Hybrid Cloud Networks]]><![CDATA[Kentik’s Hybrid Map provides the industry’s first solution to visualize and manage interactions across public/private clouds, on-prem networks, SaaS apps, and other critical workloads as a means of delivering compelling, actionable intelligence. Product expert Dan Rohan has an overview.]]>https://www.kentik.com/blog/kentik-bridges-the-intelligence-gap-for-hybrid-cloud-networkshttps://www.kentik.com/blog/kentik-bridges-the-intelligence-gap-for-hybrid-cloud-networks<![CDATA[Dan Rohan]]>Thu, 20 Aug 2020 14:00:00 GMT<p>As Kentik’s product manager for hybrid cloud, I am always talking to infrastructure and network teams around the world to understand a day-in-their-life. This provides me with an invaluable understanding of the challenges, goals, and priorities that they face today, and also a vision into their future network monitoring needs. Listening to our customers, partners and other infrastructure and network teams delivers critical feedback and excellent insight into their struggles, which helps drive Kentik’s product direction and future strategy to remove these challenges.</p> <p>One common frustration I heard was the difficulty in managing their hybrid cloud environments. IT resources, including those managed by Kubernetes and other container scheduling platforms, can be provisioned or de-provisioned in seconds creating surprising shifts in application delivery traffic. Undermining this scalability were network overlays that enable connectivity while obscuring details necessary for effective troubleshooting.</p> <p>Infrastructure and network teams can no longer rely on internal documentation or legacy tools to help them understand how their applications are delivered and what health or performance issues are getting in their way. They have been left with gaping holes in visibility, preventing them from holistically understanding topology state, traffic flows, network performance and device health status. Until now.</p> <p>This is why, today, Kentik is introducing Kentik’s <a href="/resources/hybrid-cloud-monitoring">Hybrid Map</a>, which lets these teams view the entirety of their network infrastructures, enabling them to dive deep, assess quickly, solve problems and gain insights immediately. 
You can now visualize and manage interactions within and between <em>on-prem</em> infrastructure, <em>cloud</em> infrastructure from Amazon AWS, Google Cloud, IBM Cloud and Microsoft Azure, as well as <em>internet</em> platforms and services — all in a single, unified view.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4IgnrX3Q1RMc4gWAZtbqnq/59c0b27ae3c0eb7c68798d63eb1df372/hybrid-map-kentik-data-081820.png" style="max-width: 800px;" class="image" title="Kentik hybrid cloud network map image" thumbnail /> <p style="font-size: 96%; margin-top: 15px; text-align: center"><em>Go inside your cloud infrastructure to see how applications are communicating inside the cloud, to the internet, and to your on-prem network.</em></p> <h3 id="the-growth-of-hybrid-cloud">The Growth of Hybrid Cloud</h3> <p>We at Kentik know that hybrid cloud — the use of multiple IT infrastructures, both on-prem and in the cloud, to deliver computing services across an organization — has many benefits for infrastructure and networking teams, and hybrid cloud usage is only going to grow as a result of the 2020 pandemic and the move to remote “everything.”</p> <p>In 2018 alone, <a href="https://about.keysight.com/en/newsroom/pr/2019/25mar-nr19044-ixia-c-r-state-cloud-monitoring.pdf" target="_blank">eight out of 10 companies increased cloud-based workloads</a>. And due to the pandemic, according to the <a href="https://info.flexera.com/SLO-CM-REPORT-State-of-the-Cloud-2020" target="_blank">2020 Flexera State of the Cloud Report</a>, 59% of respondents who answered a question about COVID-19 expect cloud use to exceed plans. As businesses combine the governance advantages of on-prem systems with the flexibility, “always-on” availability and cost advantages of the cloud, this ongoing “cloud migration” allows IT managers to scale their network environments to support specific workloads or applications, delivering the best customer experience in a highly secure way.</p> <p>In this new hybrid cloud world, infrastructure is located wherever the business needs it, meaning that <em>data</em> can be located anywhere, too. In <a href="https://www.kentik.com/go/gartner-market-guide-npmd-network-performance-monitoring-and-diagnostics-2020/" target="_blank">Gartner’s Market Guide for Network Performance Monitoring and Diagnostics</a>, analysts predict that, by 2024, 50% of network operations teams will be required to re-architect their network monitoring stacks due to the impact of hybrid networking — a significant increase from 20% in 2019. This makes the network an even more critical resource.</p> <p>As organizations like yours adopt hybrid cloud environments, your network teams are under increased pressure to deliver the visibility required to optimize business operations. In <a href="https://about.keysight.com/en/newsroom/pr/2019/25mar-nr19044-ixia-c-r-state-cloud-monitoring.pdf" target="_blank">The State of Cloud Monitoring</a> report, 99% of those surveyed identified a direct link between visibility and business value and listed “increasing cloud visibility for greater operational control” as a top priority.</p> <p>At Kentik, we can take this huge challenge and flip it on its head, making it an asset for your organization and removing the complexity your infrastructure and networking teams experience. With Hybrid Map, you now have the opportunity to see things you could never see before and use them to your advantage. 
You can close operational gaps, getting actionable intelligence about <em>who</em> is using the network and <em>what</em> they’re using it for. Visibility and insights are hard to come by when the location of resources fluctuates between on-prem and public/private clouds. Not anymore.</p> <h3 id="the-challenges-of-managing-hybrid-networks">The Challenges of Managing Hybrid Networks</h3> <p>Hybrid networks are complex by nature. Using traditional network management tools, network professionals charged with tracking down performance and availability problems can struggle to piece the network together in time to fix critical issues. It’s difficult to discover which devices and interfaces make up a data path. Correlating the traffic, health, and performance of these elements consumes valuable time that could instead be spent understanding and fixing problems.</p> <p>What we found was that no one network performance monitoring and diagnostics (NPMD) vendor offered a single, comprehensive map that contributes to intelligently managing your hybrid network. According to a survey of IT professionals in the report <a href="https://about.keysight.com/en/newsroom/pr/2019/25mar-nr19044-ixia-c-r-state-cloud-monitoring.pdf" target="_blank">The State of Public Cloud Monitoring</a>, 35% of respondents use up to five monitoring tools to keep tabs on hybrid cloud and multi-cloud environments. Each tool has a specific purpose and thus tells only an isolated portion of the story. There are backbone network maps, capacity maps between sites and devices, cloud visualization maps for a specific cloud, and edge maps for edge networks. But no one consolidated, fully-integrated map for hybrid, multi-cloud networks.</p> <p>Automation, orchestration, and software-defined networking further complicate matters by constantly shifting where applications are located and redefining how they connect to one another. Without tooling built to comprehend this new reality, your network — and the team that runs it — is at a disadvantage.</p> <p>Compounding the challenge are business pressures for you to optimize operations across multiple infrastructures, especially when the business is either migrating from data centers to clouds or is acquiring new infrastructure through mergers and acquisitions.</p> <h3 id="the-kentik-solution---solving-the-visibility-and-monitoring-gap">The Kentik Solution - Solving the Visibility and Monitoring Gap</h3> <p>Using all these great insights, we now introduce Kentik’s Hybrid Map — providing the industry’s first solution to visualize and manage interactions across public/private clouds, on-prem networks, SaaS apps, and other critical workloads — to deliver compelling, actionable intelligence. 
Kentik’s platform helps you automatically discover traffic anomalies, determine root causes, and launch automation to mitigate issues.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1KOsyw5Y0TiFD6nA7xfkzx/1eceabed04914e8ec8b2a9f6acc47037/hybridmap-dc-drilldown.png" style="max-width: 800px;" class="image" thumbnail /> <p style="font-size: 96%; margin-top: 15px; text-align: center;"><em>Drill into data center devices to understand how your applications flow through your data center fabric.</em></p> <p>Available now, Kentik’s Hybrid Map has the following key elements to empower network and infrastructure teams:</p> <ul> <li> <p><strong>A single view for network maps</strong>: To visualize every aspect of your network, keep track of changes to your infrastructure, and get a view of historical data in context.</p> </li> <li> <p><strong>Map drill-downs</strong>: To gain insight into physical or cloud-based devices. Also, to investigate problematic network utilization by discovering which application, on which node, in which interface, IP, region, zone, or subnet is the culprit.</p> </li> <li> <p><strong>Network component and traffic detail views</strong>: To observe real-time traffic flows or look at historical data to see how changes in the environment have impacted application delivery or customer experience.</p> </li> <li> <p><strong>Advanced traffic analytics, health and performance data</strong>: To get instant answers to common network questions.</p> </li> </ul> <p>In short, Hybrid Maps are everything you love about Kentik… now with added visibility for your hybrid cloud network. To try these features in your own network, <a href="#signup_dialog" title="Get Started with a Free Trial">start a free trial</a>, or <a href="#demo_dialog" title="Request your Kentik demo">request a personalized demo</a> from the Kentik team.</p><![CDATA[The New Normals of Network Operations in 2020]]><![CDATA[Today we published a report on “The New Normals of Network Operations in 2020.” Based on a survey of 220 networking professionals, our report aims to better understand the challenges this community faces personally and professionally as more companies and individuals taking their worlds almost entirely online. ]]>https://www.kentik.com/blog/the-new-normals-of-network-operations-in-2020https://www.kentik.com/blog/the-new-normals-of-network-operations-in-2020<![CDATA[Michelle Kincaid]]>Tue, 23 Jun 2020 07:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/3jk3SuFDPyBHA11TSed1HB/6d7e4b42a5be29b93d76a31890ebdba4/report-newnormals-cover.jpg" style="max-width: 300px;" class="image right" /> <p>Today we published a report on “<a href="/resources/new-normals-of-network-operations-2020">The New Normals of Network Operations in 2020</a>.” Based on a survey of 220 networking professionals, our report aims to better understand the challenges this community faces both personally and professionally, especially as the global pandemic has intensified the importance of networks, with more companies and individuals taking their worlds almost entirely online. Do network teams feel more pressure than ever before to keep the world connected?</p> <p>Here are just a few of our findings:</p> <ul> <li> <p><strong>Cloud reliance is not as widespread as it seems</strong>. There’s no mistaking that the cloud is in high demand. 
However, despite reports of increased cloud bursting to give networks more capacity, the majority of our survey respondents (54%) said that their organizations have not increased their reliance on the cloud during the global pandemic. This could indicate existing usage of, and a continuing trend toward, hybrid infrastructure.</p> </li> <li> <p><strong>Network budgets remain intact</strong>. Forty percent (40%) of respondents said their organizations’ networking budgets had not changed due to economic uncertainties caused by the pandemic. In fact, for 36% of respondents, budgets had either significantly or somewhat increased.</p> </li> <li> <p><strong>Work/life balance is a top networking challenge</strong>. Over half of respondents (51%) expressed concern about their work/life balance. This is likely, in part, due to the increased pressure that work-from-home and shelter-in-place policies have put on networkers to maintain network performance and reliability. Some respondents also wrote in additional stressors, including the management of business continuity plans, physical access to data centers, and general economic concerns.</p> </li> </ul> <p><a href="/resources/new-normals-of-network-operations-2020">The report</a> additionally includes charts, data and potential rationale behind certain findings, including for infrastructure supply chain delays, concerns over network capacity shortages, and overall productivity levels of network teams. We also break down our results by large enterprise respondents, as well as small and midsize businesses (SMBs) to determine whether company size plays a role in certain networking concerns.</p> <p>Overall, the new normals of network operations in 2020 require agility, flexibility, understanding, and balance — in many more ways than one. Find out why: <a href="/resources/new-normals-of-network-operations-2020">Download the report here</a>.</p><![CDATA[Capacity Planning Done Right]]><![CDATA[Network capacity planning traditionally required a user to collect data from several tools, export all of the data, and then analyze it in a spreadsheet or database. In this post, we highlight Kentik's latest platform capabilities to help understand, manage and automate capacity planning. ]]>https://www.kentik.com/blog/capacity-planning-done-righthttps://www.kentik.com/blog/capacity-planning-done-right<![CDATA[Dan Rohan]]>Thu, 28 May 2020 07:00:00 GMT<p>Kentik launched a <a href="https://www.kentik.com/press-releases/kentik-transforms-network-operations-with-general-availability-of-aiops/">completely redesigned user interface</a> in February with many new <a href="https://www.kentik.com/resources/datasheet-kentik-detect-key-capabilities//">capabilities</a>. As part of that, we focused on leveraging AIOps techniques to drive software-guided workflows that streamline specific, typically tedious tasks. One task we wanted to help <a href="https://www.kentik.com/blog/netops-secops-collaboration-shared-tools-are-essential/">NetOps</a> teams more easily manage is capacity planning. In this post, we highlight some of our latest platform capabilities to help with managing the capacity of various circuits and interfaces.</p> <p>Traditionally, capacity planning required a user to collect data from several tools, export all of the data, and then analyze it in a spreadsheet or database — all on an ongoing basis. Some teams might do this monthly at a minimum, or sometimes every two weeks. 
The goal is to build capacity plans to handle changes in network traffic patterns and understand the load on specific parts of the network and associated infrastructure.</p> <p>With Kentik’s recent platform developments, we’ve created an advanced way to automate capacity planning. When a user logs into the Kentik Platform, we take highlights from our guided workflows and present them directly within the home screen.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1ExThBFoX2hF2jpTfC9CpW/fb0eb19747f30f85ec8b1c6853dc75f5/capacity-plannning-pie.png" style="max-width: 250px;" class="image right" /> <p>The image on the right shows the capacity plans and their status. Upon drilling into the view, the capacity groups and specific capacity plans are listed, and new plans can be added. In each plan, you select or configure capacity groups by selecting interfaces or dynamic groups of interfaces. There is a lot of flexibility as to how the capacity plan is set up and calculated. For example (see image below), it may make sense to have different thresholds, calculations, and runout periods depending on what lead time is necessary to place a hardware or circuit order or provisioning request.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4o3UAPwx7M2u8S4xGqT05g/e3b1322a79af58aabeab940b8a9d608c/capacity-planning-thresholds.png" style="max-width: 800px;" class="image center" /> <p>Once the plans are built, Kentik will begin calculating the runout period and notifying the user. (For a simplified illustration of what a runout calculation involves, see the sketch at the end of this post.) This data is calculated using SNMP metrics to ensure the accuracy of the interface capacity. The dashboard for each capacity plan provides details about the utilization and runout:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1zvbs27v7zonmHDzCIFRP3/def5e7bf4a83751854519c145a18fc70/capacity-planning-interfacedashboard.png" style="max-width: 800px;" class="image center" thumbnail /> <p>Kentik then provides an easily digestible and understandable view into the capacity of the network and when plans for capacity should be made. When clicking on the interface, the overview for that specific interface now includes the capacity plan details:</p> <img src="//images.ctfassets.net/6yom6slo28h2/hpSYwQ2hgcVTLMqmMjVAe/0377563303109a7df7533fd2cd5ed26d/capacity-planning-interface.png" style="max-width: 800px;" class="image center" thumbnail /> <p>In the case of frequent high utilization, the next step is to move traffic off of the interface. The Engineer Traffic button at the top right brings you into the Traffic Engineering workflow to help move traffic off of the link. We’ll address Traffic Engineering in a separate blog post.</p> <p>The workflows in Kentik address network engineering and operational needs for managing capacity, engineering traffic, and helping operate an efficient network. In product updates to come, Kentik plans to add automation to identify the likely cause of capacity changes and whether they are drastic or gradual. This will use the traffic (flow) data to better understand patterns in capacity. Other areas for investigation include storing capacity aggregates at given intervals.</p> <p>There are many other areas of investment we’re considering based on customer feedback and requirements, and lots of exciting new developments coming on our product roadmap.</p>
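<p>For the curious, here is a rough illustration of what a runout calculation involves. This is a deliberately simplified sketch, not Kentik’s actual algorithm: fit a trend line to recent utilization samples and project when it crosses the interface’s capacity.</p> <pre><code class="language-rust">/// One utilization sample: hours since the start of the window and
/// the bits per second observed on the interface (e.g., via SNMP).
struct Sample {
    hours: f64,
    bps: f64,
}

/// Estimate hours until `capacity_bps` is reached by fitting a
/// least-squares line to the samples and extrapolating forward.
/// Returns None when utilization is flat or trending down.
fn hours_to_runout(samples: &amp;[Sample], capacity_bps: f64) -> Option&lt;f64> {
    if samples.len() &lt; 2 {
        return None; // not enough data to fit a trend
    }
    let n = samples.len() as f64;
    let mean_x = samples.iter().map(|s| s.hours).sum::&lt;f64>() / n;
    let mean_y = samples.iter().map(|s| s.bps).sum::&lt;f64>() / n;
    let cov: f64 = samples.iter().map(|s| (s.hours - mean_x) * (s.bps - mean_y)).sum();
    let var: f64 = samples.iter().map(|s| (s.hours - mean_x).powi(2)).sum();
    if var == 0.0 {
        return None; // all samples taken at the same instant
    }
    let slope = cov / var; // bps of growth per hour
    if slope &lt;= 0.0 {
        return None; // no growth means no runout
    }
    let intercept = mean_y - slope * mean_x;
    let hours_at_capacity = (capacity_bps - intercept) / slope;
    Some(hours_at_capacity - samples.last()?.hours)
}</code></pre> <p>A production implementation would weight recent samples more heavily, use percentiles rather than simple averages, and track inbound and outbound utilization separately, but the basic shape is the same: fit, extrapolate, and compare the result against your lead-time thresholds.</p>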
<p>To see Kentik’s capacity planning capabilities and more, <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Why Network Capacity was Critical for the NFL Draft – And Will Be for Other Sporting Events to Come]]><![CDATA[The NFL Draft created a network traffic spike and set an example for other sports leagues considering entirely-online drafts. In this post, we show Kentik’s view of network traffic from the draft and compare it to traffic seen from the Super Bowl. We also discuss why the data shows network capacity is critical for new normals. ]]>https://www.kentik.com/blog/the-nfl-draft-proves-why-networks-need-to-understand-and-add-capacityhttps://www.kentik.com/blog/the-nfl-draft-proves-why-networks-need-to-understand-and-add-capacity<![CDATA[Michelle Kincaid]]>Tue, 05 May 2020 07:00:00 GMT<p>If you’re a big-time football fan or just a sports junkie currently missing live sports, perhaps you tuned into the recent National Football League (NFL) Draft, held entirely online from April 23-25. As one of the first live sporting events to take place during COVID-19, the NFL Draft was somewhat of a test case for other sports leagues considering their own versions of an online draft. The National Basketball Association (NBA) and the National Hockey League (NHL) are just two of the organizations currently considering whether to move forward with draft plans for June.</p> <div class="pullquote right">Network traffic during the NFL Draft was 10x higher than the average rate of network traffic typically being observed from the NFL streaming service.</div> <p>At Kentik, we wanted to see what network traffic might reveal about viewership of the NFL Draft, as the other leagues consider their options, so we took a look at NFL Network, a streaming service produced by the NFL and offered through NFL.com. From the Kentik Platform (with the help of some of our largest service provider customers who agree to let us incorporate their data), we can see that network traffic during the NFL Draft was 10x higher than the average rate of network traffic typically being observed from the NFL streaming service (see graph 1 below).</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1jzobMXdcOgH6NNFl6ZW2g/268169f78c344af42ffe4174089d0a3e/nfd-graph1.png" style="max-width: 800px;" class="image center" thumbnail /> <p style="font-size: 98%; text-align: center;"><em>Graph 1: NFL.com network traffic leading up to and during the NFL Draft</em></p> <p>While 10x traffic growth is significant, we wanted to put the NFL Draft traffic growth into perspective by comparing it to Super Bowl LIV, held on February 2, 2020 (about one month before shelter-in-place orders took effect in many regions).</p> <p>The Super Bowl had an audience of <a href="https://www.foxsports.com/presspass/latest-news/2020/02/03/super-bowl-liv-fox-draws-viewership-102-million-across-television-digital-platforms" target="_blank">102 million people</a>, drawing an average minute audience of 3.4 million on streaming services, including the NFL Network. 
The NFL Draft, also broadcast on the NFL Network and other streaming services, drew an audience of <a href="https://nflcommunications.com/Pages/2020-NFL-DRAFT-MOST-WATCHED-EVER;-SETS-NEW-ALL-TIME-HIGHS-FOR-MEDIA-CONSUMPTION.aspx" target="_blank">55 million people</a>, with 1.72 million people watching on the NFL Network.</p> <div class="pullquote right">The Super Bowl generated about 10x the average traffic over a three-month period whereas, in comparison, the NFL Draft generated about 2x traffic growth.</div> <p>When comparing streaming of the NFL Draft with streaming of the Super Bowl, we get a much different picture (see graph 2 below). We can see that the Super Bowl generated about 10x the average traffic over a three-month period whereas, in comparison, the NFL Draft generated about 2x traffic growth. It’s also interesting to note that around the second week of March, as many shelter-in-place orders started to roll out across the U.S., network traffic started to steadily pick up on NFL.com. As you might expect, the correlation could be due to those at home looking for ways to entertain themselves online.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5n3VS7Hy2AtvuJKcGbKrBm/bfffb9372819d16b830a73fdbd503ecb/nfd-graph2.png" style="max-width: 800px;" class="image center" thumbnail /> <p style="font-size: 98%; text-align: center;"><em>Graph 2: NFL.com network traffic from the Super Bowl to the NFL Draft</em></p> <p>From a networking perspective, one big takeaway from this research is that it’s critical to have a comprehensive understanding of what’s going on on the network at any given point in time. Many service providers, including those who provide broadband to our homes, calendar out big online events, like the Super Bowl, well in advance. This ensures that network capacity is readily available and network performance is high so that viewers can watch these events online, uninterrupted and without latency.</p> <p>As COVID-19 changes the way we consume sporting events, tackle our jobs remotely, educate ourselves from a distance, and find a new normal, network traffic data shows us that “the show must go on.” And when it goes on online, networks need to be prepared.</p> <p>If live sporting events take place with online-only audiences, the data shows that spectators will show up. So if the NBA and NHL move forward with their draft plans, we expect that, as we learned with the NFL Draft, fans are ready to watch. At Kentik, we’ll continue to watch the network traffic (and the sporting events) to see what happens.</p><![CDATA[Kentik Virtual Panel Series, Part 2: How Akamai, Uber and Verizon Media Are Supporting Remote Work and Digital Experience ]]><![CDATA[If you missed Akamai, Uber and Verizon Media talking about networking during COVID-19, read our recap of Kentik’s recent virtual panel. In this post, we share details from the conversation, including observed spikes in traffic growth, network capacity challenges, BC/DR plans, the internet infrastructure supply chain, and more. 
]]>https://www.kentik.com/blog/panel-series-part2-akamai-uber-verizon-mediahttps://www.kentik.com/blog/panel-series-part2-akamai-uber-verizon-media<![CDATA[Michelle Kincaid]]>Wed, 22 Apr 2020 07:00:00 GMT<p><a href="https://www.kentik.com/resources/how-leading-companies-support-remote-work-and-digital-experience-2-webinar/"><img src="https://images.ctfassets.net/6yom6slo28h2/Q9d7aaJgh5thXWI31cWBo/347d13a8fc466fbf2aa13c6a33fed97e/blog-panel2.jpg" style="max-width: 420px;" class="image right" /></a></p> <p>Last week, Kentik hosted its <a href="https://www.kentik.com/resources/how-leading-companies-support-remote-work-and-digital-experience-2-webinar/">second virtual panel</a> on scaling networks and services during COVID-19. As our moderator, Kentik Co-founder and CEO Avi Freedman was joined by network leaders from Akamai, Uber and Verizon Media, namely:</p> <ul> <li>Christian Kaufmann, VP of Network Technology, Akamai</li> <li>Jason “JB” Black, Head of Global Network Infrastructure, Uber</li> <li>Igor Gashinsky, Chief Network Architect, VP, Verizon Media</li> </ul> <h3 id="traffic-growth-and-the-capacity-to-scale">Traffic Growth and the Capacity to Scale</h3> <p>The Kentik team continues to monitor network traffic trends in correlation with COVID-19’s shelter-in-place orders and increased remote work and distance learning across the globe. This was one of the first topics that Avi asked panelists about, too. Has network traffic continued to increase or stabilize?</p> <p>Igor from Verizon Media said they’ve <a href="https://www.verizon.com/about/news/how-americans-are-spending-their-time-temporary-new-normal" target="_blank">observed</a> web traffic growth of about 30 percent. He also noted “gaming traffic is way up,” as are entertainment services in general. Igor said where there has traditionally been a nighttime peak of traffic, with people coming home from work and turning to streaming services and other forms of online entertainment, that peak has now also shifted into the daytime hours. “They’re home all the time, so it looks like they’re doing both,” he added.</p> <p>Christian from Akamai said they’ve also seen network traffic increase about 30 percent over the past month. However, he added that in comparing March 2019 to March 2020, Akamai saw network traffic had doubled year-over-year.</p> <p>In terms of having the capacity to handle higher levels of traffic, all of the panelists agreed with Uber’s Jason Black: “We’ve built our backbone to scale, and we are well-provisioned to be able to handle whatever traffic we need.”</p> <h3 id="bcdr-and-the-internet-infrastructure-supply-chain">BC/DR and the Internet Infrastructure Supply Chain</h3> <p>While many digital enterprises and service providers have recently shared with us that they’ve had the extra capacity and were designed for scale, even before COVID-19, we wanted to know how that fits into our panelists’ business continuity and disaster recovery (BC/DR) plans.</p> <p>Christian from Akamai talked about the importance of network automation for BC/DR. Specifically, he advised on having a good understanding of cluster or server status where, if one cluster or server fails, there is automation in place for another one to take over without network latency or performance issues.</p> <p>Igor shared that Verizon Media had a BC/DR plan in place, which included being able to support all of their employees in a remote setting for three weeks. (They’ve since built upon this plan.) 
Igor also mentioned that the Verizon Media team was prepared for some supply chain delays this year, given the Chinese New Year in February, so they had a deeper buffer built in for an anticipated slowdown.</p> <p>Jason said he’s had weekly calls “down to the chip-level” to understand what Uber’s supply chain looks like. “I am seeing that there are delays of a month to two months, even in some manufacturing processes, which is obviously going to spill over into the OEM switch market and then further and further up the chain,” he said. Jason’s advice? “Don’t put all of your eggs in one basket.” Uber’s network team is vendor-agnostic and aims to avoid any lock-in situations so that they can diversify across vendors and have flexibility in their infrastructure to make changes quickly where needed.</p> <p>The panelists went on to discuss their network architectures, hiring plans for the road ahead, and new levels of productivity observed (while still maintaining their network teams’ work-life balance). If you missed the event, you can watch the <a href="https://www.kentik.com/resources/how-leading-companies-support-remote-work-and-digital-experience-2-webinar/">replay of Kentik’s virtual panel here</a>.</p> <p>We’re already busy planning for our next few virtual panels, so keep an eye out for future invites. Until then, take a minute to learn how Kentik can help you run your network and support your remote workforce now. <a href="https://www.kentik.com/go/enterprise-trial-request/">Get 120 days of the Kentik service, with no subscription fees</a>.</p><![CDATA[The Gartner Market Guide for Network Performance Monitoring and Diagnostics Highlights Network Challenges ]]><![CDATA[The Gartner Market Guide for Network Performance Monitoring and Diagnostics recently published and revealed new network challenges for buyers. In this post, we take a deeper look at the challenges, as well as opportunities for NPMD investment. ]]>https://www.kentik.com/blog/the-gartner-market-guide-for-network-performance-monitoring-and-diagnosticshttps://www.kentik.com/blog/the-gartner-market-guide-for-network-performance-monitoring-and-diagnostics<![CDATA[Dan Rohan]]>Thu, 16 Apr 2020 07:00:00 GMT<img src="https://images.ctfassets.net/6yom6slo28h2/1TIaxjYtT3HIdGOCIYmwte/b3bcacb7cf9482385913df53106bbfe6/blog-datacenter-change.jpg" class="image right" style="max-width: 350px;" /> <p>Last month, we saw the publication of the <a href="https://www.kentik.com/go/gartner-market-guide-npmd-network-performance-monitoring-and-diagnostics-2020/">Gartner Market Guide for Network Performance Monitoring and Diagnostics</a><sup>1</sup> (NPMD). If you’re unfamiliar with this Market Guide, it may be because, starting this year, Gartner moved this research from its previous Magic Quadrant format. We believe the reason for the shift is largely that buyers are still purchasing traditional network monitoring tools. In this blog post, we’ll look at two of the main network challenges that we believe stick out in this Market Guide, as well as investment recommendations that should drive buyers to rethink their traditional NPMD solutions.</p> <h3 id="cloud-migration-and-the-loss-of-network-visibility">Cloud Migration and the Loss of Network Visibility</h3> <p>As cloud migration picks up, one thing is clear: traditional on-premises NPMD tools no longer work for cloud and hybrid environments. On the other hand, some would argue that the use of APM and infrastructure tools is enough for the cloud. 
However, that notion also falls apart, as most (if not all) organizations have traditional data centers and cloud. As enterprises transition towards digital delivery of products and services, they must manage across the infrastructures, and the traditional tools and approaches do not work or scale effectively.</p> <p>We believe Gartner highlights similar challenges in the Market Guide. In Kentik’s view, the report discusses a few key points:</p> <ul> <li> <p><strong>There are challenges introduced by cloud migrations that significantly change the network infrastructure and limit visibility into the network layers</strong>. Kentik provides this visibility in all three major public clouds, unified with the traditional networks.</p> </li> <li> <p><strong>As applications are decomposed into micro-services and deployed on cloud-native architectures, network visibility is a challenge</strong>. These orchestrated environments have different infrastructures and most often run inside containers on Kubernetes, delivered by Envoy. Not only that, but the new networking overlays that come with these open source technologies require new software for visibility. Kentik provides visibility into Kubernetes, Envoy, and down to the container level. Kentik supports Flannel and Calico, which are some of the most popular networking models for Kubernetes. <a href="https://www.kentik.com/blog/the-visibility-challenge-for-network-overlays/">See our previous blog for additional detail</a>.</p> </li> <li> <p><strong>There are new challenges arising with the changes in the edge</strong>. This is where Kentik excels. As the main visibility platform for the companies delivering over edge networks, our customers include data center providers, CDNs, and other network operators. Kentik provides visibility into how traffic flows and is delivered, and we can handle the scale of the largest networks without hardware.</p> </li> </ul> <h3 id="packets-are-difficult-to-collect">Packets are Difficult to Collect</h3> <p>In the Market Guide, Gartner notes, “Network packets are becoming increasingly difficult to collect.” This is something that Kentik believes has been an issue for quite some time on the multi-terabit networks that our platform monitors. Packets were never an option. As most networks continue to scale, these challenges are becoming apparent to a wider audience — not just the large providers. Packet visibility is a major issue in the public cloud, where aggregating and analyzing packets is no longer cost-effective or practical.</p> <h3 id="areas-of-npmd-investment">Areas of NPMD Investment</h3> <p>We believe Gartner points to three main areas of investment occurring in NPMD. The first is better visibility into digital experience via digital experience monitoring (DEM). Synthetic transaction monitoring (STM) supports DEM and is something Kentik is hard at work on right now to complement the path visibility we provide for real traffic across the internet. Kentik ingests more BGP than almost any other company on the internet today. Our head of operations engineering, Costas Drogos, did <a href="https://www.kentik.com/blog/scaling-bgp-peering-kentik-saas-environment/">a talk on this at DENOG 2020</a>.</p> <p>The second area of investment is in AIOps, or the application of machine learning and artificial intelligence techniques to the network data being collected. 
Kentik’s platform incorporates AIOps techniques for the network professional, and we continue to work to create new ways to surface useful information automatically.</p> <p>Gartner also notes, “Future-proof network monitoring by investing in NPMD tools that provide the required level of visibility in your hybrid environments, including edge network and cloud network monitoring.” To us, this makes sense, considering the level of investment in public cloud. Kentik aligns with this trend and can provide the visibility necessary today.</p> <p>For more information on how to rethink your NPMD strategy, read the full <a href="https://www.kentik.com/go/gartner-market-guide-npmd-network-performance-monitoring-and-diagnostics-2020/" title="Read the Gartner NPMD Market Guide, 2020">Gartner Market Guide for Network Performance Monitoring and Diagnostics</a>. With Kentik listed in the report as one of the “representative vendors,” we look forward to seeing how the market and analyst coverage evolve in the year ahead.</p> <p>If you want to see how Kentik can help you rethink your NPMD strategy, <a href="https://www.kentik.com/go/enterprise-trial-request/">sign up by May 31st, 2020</a> to get <strong>120 days of access to Kentik</strong>. No subscription fees.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <div style="font-size: 96%; margin-bottom: 25px;"><sup>1</sup> Gartner, “<a href="https://www.kentik.com/go/gartner-market-guide-npmd-network-performance-monitoring-and-diagnostics-2020/" target="_blank">Market Guide for Network Performance Monitoring and Diagnostics</a>”, Josh Chessman, 5 March 2020.</div> <div style="font-size: 96%;"><em>Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.</em></div><![CDATA[A Critical Piece of Your SD-WAN Visibility]]><![CDATA[SD-WAN promises to boost agility, security, performance, and management and operations. However, without effective network visibility, the path to a successful deployment is difficult to achieve. In this post, we provide considerations for solving SD-WAN visibility challenges. ]]>https://www.kentik.com/blog/a-critical-piece-of-your-sd-wan-visibilityhttps://www.kentik.com/blog/a-critical-piece-of-your-sd-wan-visibility<![CDATA[Aaron Kagawa]]>Wed, 15 Apr 2020 07:00:00 GMT<p>SD-WAN, one of the most disruptive network technologies of the past few years, has a fast rate of adoption. Enterprises adopting SD-WAN are driven by key factors such as WAN cost savings, application performance improvement, management and operation simplification, and more. <a href="https://www.cisco.com/c/dam/en/us/solutions/collateral/enterprise-networks/intelligent-wan/idc-tangible-benefits.pdf">Research shows</a> 95% of enterprises are using or expect to use SD-WAN within 24 months.</p> <p>To achieve the four main promises of SD-WAN – which are agility, security, performance, and management and operations – many practitioners will tell you SD-WAN visibility plays a critical role in a successful deployment. 
In this blog post, we are going to share our thought process and progress in solving visibility challenges for SD-WAN users to help them operate efficiently.</p> <h3 id="why-visibility-matters-in-the-sd-wan-deployment-lifecycle">Why Visibility Matters in the SD-WAN Deployment Lifecycle</h3> <p>The SD-WAN idea sounds simple, but the path to SD-WAN is complicated. Think about how many choices of different vendors and different approaches are out there. To lead a successful SD-WAN deployment journey, users need to look at the entire SD-WAN lifecycle to understand what kind of visibility is most helpful in each phase. Here is some guidance:</p> <ul> <li> <strong>Planning</strong>. It can be scary not knowing where you are starting. Getting visibility before making decisions can help measure the change and prove the results. Best practices include: <ul> <li>Baselining traffic bandwidths </li> <li>Evaluating security traffic patterns</li> <li>Discovering which applications are running between sites, to the internet, and to the data center</li> <li>Understanding your service providers, link utilizations, and traffic patterns </li> </ul> </li> <li><strong>Building</strong>. The rollout and verification step validates whether the deployment performs as expected. Best practices include: <ul> <li>Visualization of SD-WAN fabric, including both underlay and transports</li> <li>Intent verification by auditing traffic against SD-WAN policies</li> <li>Visualize all transport (MPLS, internet, LTE, etc.) links and show traffic traversing on transports to validate SD-WAN traffic policies</li> </ul> </li> <li><strong>Ongoing operation</strong>. Traffic paths in SD-WAN are highly dynamic with constant policy decisions changing based on the current application/network state. Therefore, ongoing verification and operations rely on continuous monitoring and alerting. Best practices include: <ul> <li>Alert on SD-WAN transport (e.g., alerting policies for transport circuits down, policies to notify of abnormal shifts in application transport, policies on high utilization of transport circuits, and alerting on performance metrics per application and transport)</li> <li>Monitor utilization and runout for transport circuits for efficient capacity planning </li> <li>Identify and track network health </li> </ul> </li> </ul> <h3 id="sd-wan-visibility-and-sd-wan-monitoring-from-kentik">SD-WAN Visibility and SD-WAN Monitoring from Kentik</h3> <p>SD-WAN is just one piece of the puzzle. At Kentik, we aim to solve network visibility for all components of your network in one unified view.</p> <ul> <li> <p><strong>SD-WAN + traditional WAN</strong>: Kentik can monitor SD-WAN vendors and traditional WAN sites in a single product. Hence, enterprises with a phased SD-WAN rollout plan can leverage Kentik throughout the lifecycle as they transition from traditional WAN to SD-WAN.</p> </li> <li> <p><strong>SD-WAN + Cloud + Data Center</strong>: <a href="https://www.catonetworks.com/news/confidence-in-sd-wan-shaken-by-digital-transformation/" target="_blank">According to research</a>, 69% of SD-WAN users say cloud connectivity undermines their network confidence. Meanwhile, enterprises need to constantly optimize between on-premises and cloud applications. 
Therefore, combining SD-WAN visibility with cloud and data center can provide end-to-end monitoring of your key traffic.</p> </li> </ul> <h3 id="under-the-hood-kentik-universal-data-records-udr">Under the Hood: Kentik Universal Data Records (UDR)</h3> <p><a href="https://www.kentik.com/blog/going-beyond-the-netflow-introducing-universal-data-records/">Universal Data Records (UDRs)</a> are an architectural element of the Kentik Platform. UDR makes it possible to apply Kentik’s powerful machine learning and analytics across a rich, correlated schema for translation into actionable insights that include business, service, and application context.</p> <p>With UDR, Kentik can quickly add more data sources to the platform to stay ahead and address the always-evolving network visibility challenges faced by our customers. That is how Kentik has been able to quickly add support for specific SD-WAN vendors, including Cisco and Silver Peak SD-WAN, as well as specific firewalls like Cisco ASA, Zone-Based Firewall, Palo Alto Networks Firewalls, and other devices that are applicable to enterprise networking.</p> <p>In the SD-WAN context specifically, UDR gives Kentik the capability to ingest vendor-specific fields, which are very important in the SD-WAN space (e.g., Viptela: VPN Identifier, and Silver Peak: Application, Business Intent Overlay).</p> <h3 id="underlay-and-overlay-visibility">Underlay and Overlay Visibility</h3> <p>Blind spots in the overlay, the underlay, or both can increase deployment and operational difficulty and inefficiency. An automatically generated visualization of SD-WAN underlay and overlay traffic is extremely helpful for understanding the current SD-WAN deployment.</p> <p>Let’s take Silver Peak SD-WAN as an example. The following image captures the visualization of the SD-WAN fabric, with the underlay and transports that connect various sites and data centers, in the “Network Map.” From there, you can drill down into further traffic details for a given site, device, or provider.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6O5LPouAOmKy3PFYhxyZsb/a4c42f63f24485a046d7714cf8a5a10b/silverpeak-networkmap.gif" style="max-width: 900px;" class="image center" thumbnail /> <p>Meanwhile, you can also visualize the overlay traffic via Business Intent Overlay (BIO), a Silver Peak-specific term that specifies how traffic with particular characteristics is handled within the network (see the Sankey diagram below). This also verifies the intent of the traffic and its path.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2uqu0jr6oca5mGS5dkxgo1/acadae5462c17e5d2ba90798e8f040e8/bio-silverpeak.png" style="max-width: 900px;" class="image center" thumbnail /> <p>Moreover, you can visualize all the traffic that traverses all transport links (e.g., MPLS, internet, etc.) and validate currently implemented SD-WAN traffic policies.</p>
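<p>To make the UDR idea from the section above more concrete, here is a hypothetical sketch of what a flow record extended with vendor-specific SD-WAN fields might look like. The field names are purely illustrative, not Kentik’s actual schema:</p> <pre><code class="language-rust">use std::net::IpAddr;

/// Core dimensions shared by every flow record, whatever the source.
struct FlowRecord {
    src_addr: IpAddr,
    dst_addr: IpAddr,
    bytes: u64,
    packets: u64,
    // Vendor-specific extensions ride along with the record; sources
    // that don't export a given field simply leave it as None.
    sdwan: Option&lt;SdwanFields>,
}

/// The SD-WAN fields mentioned above: Viptela exports a VPN
/// identifier; Silver Peak exports an application and a Business
/// Intent Overlay.
struct SdwanFields {
    viptela_vpn_id: Option&lt;u32>,
    silverpeak_application: Option&lt;String>,
    silverpeak_overlay: Option&lt;String>, // Business Intent Overlay (BIO)
}</code></pre> <p>The value of a universal record is that queries, filters, and alerting policies can pivot on these extension fields just like core dimensions, much as the Sankey diagram below pivots on site, VPN, application, and transport.</p>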
<p>The following Sankey diagram shows the details of the traffic that flows out of Branch 1 of the SD-WAN environment, such as source/destination site, source VPN, application, DSCP, destination transport, and more.</p> <img src="//images.ctfassets.net/6yom6slo28h2/74tXDNiW4K6vIdffU4aB9x/343582f2bd99f837e88e790dac640303/wan-traffic-path-details.png" style="max-width: 800px;" class="image center" thumbnail /> <h3 id="going-forward">Going Forward…</h3> <p>Moving forward, we will continue to drive a strong roadmap for SD-WAN visibility, including:</p> <ul> <li>Additional data sources that can give a more complete picture of the SD-WAN deployment (e.g., interface metadata, device metrics, device metadata via SNMP)</li> <li>Network health for SD-WAN</li> <li>First-class support of site-to-site traffic</li> <li>More vendor support (e.g., VeloCloud)</li> <li>An SD-WAN workflow</li> <li>And much more</li> </ul> <p>If you’re ready to dive right in or have feedback, suggestions, or enhancement requests, please <a href="https://www.kentik.com/contact/">reach out to us</a>. Or get started improving your SD-WAN monitoring and visibility today: <a href="#signup_dialog" title="Improve SD-WAN Monitoring &#x26; Visibility with a Free Trial">Start a free trial of Kentik</a>, or <a href="#demo_dialog" title="Request your Kentik demo">request a personalized demo</a> from the Kentik team.</p><![CDATA[Lessons from Google on Converging Network Technologies with SD-WAN]]><![CDATA[Google is investing and innovating in SD-WAN. In this post, Kentik CTO Jonah Kowall highlights what Google has been up to and how Google’s SD-WAN work can apply to the typical organization. ]]>https://www.kentik.com/blog/lessons-from-google-converging-network-technologies-sd-wanhttps://www.kentik.com/blog/lessons-from-google-converging-network-technologies-sd-wan<![CDATA[Jonah Kowall]]>Tue, 31 Mar 2020 07:00:00 GMT<p>Today, SD-WAN is an overlay. However, these systems are evolving, and as they do, we can see the future coming together where SD-WAN becomes increasingly intelligent and driven by multiple signals. The future is, once again, being pioneered by today’s forward-thinking organizations that are building this reality.</p> <p>Google is a perfect example of a company that is often 5 to 10 years ahead of the industry, including for <a href="https://www.kentik.com/blog/a-critical-piece-of-your-sd-wan-visibility/">SD-WAN</a> innovation. After all, they operate one of the largest technology platforms on the planet.</p> <p>The recent <a href="https://www.nanog.org/meetings/nanog-78/" target="_blank">NANOG 78</a> conference, which converged on Kentik’s hometown of San Francisco, provided the opportunity to listen to Google’s networking leadership, as well as other very inspirational and interesting talks.</p> <p>The first day of NANOG kicked off with a <a href="https://www.youtube.com/watch?v=DpO1Tfa4IZ4" target="_blank">keynote from Amin Vahdat, Ph.D.</a>, engineering fellow and vice president at Google. This fascinating talk went into detail about how Google constructed and operates one of the most sophisticated and high-scale networks in the world.
Dr. Vahdat also gave a deep retrospective of an outage in the summer of 2019, including what the team learned and how they evolved based on the outage.</p> <p>In summary, Google has created multiple WANs: one designed for computer-to-computer communications (B2) and then an SD-WAN, called B4, which they have been building over the past decade. Google has also augmented its peering with a sophisticated edge that Dr. Vahdat explained. The platform, known as <a href="https://www.blog.google/products/google-cloud/making-google-cloud-faster-more-available-and-cost-effective-extending-sdn-public-internet-espresso/" target="_blank">Espresso</a>, understands the applications and their importance when making routing decisions to optimize performance. One of the key elements of Espresso is software-defined peering driven by measuring every TCP connection at each server, from which Google builds a sampled aggregate of performance.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5XreOz8QAmn4Nei8J0FY5Y/a836a3ce4803ed3c2782ad4cbdca75f5/espresso-metro.png" class="image center" style="max-width: 620px;" /> <p>Dr. Vahdat also explained that Google does not do this in real time, but rather every “single-digit minutes” (meaning it’s not every minute, but not every 10 minutes either), and the data from that system is fed to the edge of the network to change the routing and paths. Google uses standard MPLS labels to decide on the egress so they can communicate with their peers. This is more easily conveyed by the high-level diagram of the system (featured above).</p> <p>The second day started with a <a href="https://www.youtube.com/watch?v=f2Pe0SHmgyo" target="_blank">keynote from Bikash Koley</a>, vice president of global networking at Google, who was formerly the CTO of Juniper. Koley’s talk highlighted how Google is building a true intent-based network. They are using a modeling language compiled into instructions for the network to enforce these policies. The key to this is using large amounts of monitoring and <a href="https://www.kentik.com/blog/how-to-maximize-the-value-of-streaming-telemetry-for-network-monitoring-and/">streaming telemetry</a> from the network devices. Google has been building this closed-loop system, but they do not have a self-driving network yet. They do, however, have very ambitious future plans to leverage the Tensor Processing Units (TPUs) designed for AI and machine learning to make more sophisticated decisions for the network.</p> <h3 id="how-does-googles-work-apply-to-the-typical-organization">How does Google’s work apply to the typical organization?</h3> <p>If we fast forward, a key construct for the network of the future is the incorporation of telemetry and monitoring data to drive the network and automate network-level decisions. SD-WAN is also a critical way to overlay across multiple vendors and technologies. Ultimately, there must be a platform for real-time decisions required to make the self-driving network a reality.</p>
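<p>As a toy illustration of the measurement-driven egress selection Dr. Vahdat described (deliberately simplified, and in no way Google’s implementation), imagine aggregating sampled per-connection measurements by egress and periodically picking the best-scoring path:</p> <pre><code class="language-rust">use std::collections::HashMap;

/// One sampled TCP connection measurement, taken at a server.
struct ConnSample {
    egress_id: u32,
    rtt_ms: f64,
    retransmit_rate: f64,
}

/// Aggregate samples per egress and return the egress with the best
/// (lowest) average score. Run this periodically (every few minutes,
/// not in real time) and push the winner to the edge as a routing
/// decision, e.g., an MPLS label selecting that egress.
fn pick_egress(samples: &amp;[ConnSample]) -> Option&lt;u32> {
    let mut totals: HashMap&lt;u32, (f64, u32)> = HashMap::new();
    for s in samples {
        // Weight loss heavily relative to latency.
        let score = s.rtt_ms + 1000.0 * s.retransmit_rate;
        let entry = totals.entry(s.egress_id).or_insert((0.0, 0));
        entry.0 += score;
        entry.1 += 1;
    }
    totals
        .into_iter()
        .map(|(id, (sum, n))| (id, sum / n as f64))
        .min_by(|a, b| a.1.partial_cmp(&amp;b.1).unwrap())
        .map(|(id, _)| id)
}</code></pre> <p>Feeding decisions like this back to the edge every few minutes, rather than reacting to each individual sample, is part of what keeps such a control loop stable in the face of measurement noise.</p>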
<p>The beginnings of these components already exist, but many partnerships, acquisitions, and new technologies must be built before this can be something folks can purchase or use.</p> <p>This platform of the future and the concepts being forged by Google will trickle down into our networks over the coming decade. As a result, we will see much more intelligent WANs that leverage telemetry to improve reliability, security, and, ultimately, application performance to drive better user experiences.</p><![CDATA[Kentik Virtual Panel Series, Part 1: How Leading Companies Support Remote Work and Digital Experience]]><![CDATA[Kentik recently hosted a virtual panel with network leaders from Dropbox, Equinix, Netflix and Zoom and discussed how they are scaling to accommodate the unprecedented growth in network traffic during COVID-19. In this post, we highlight takeaways from the event. ]]>https://www.kentik.com/blog/how-leading-companies-support-remote-work-and-digital-experiencehttps://www.kentik.com/blog/how-leading-companies-support-remote-work-and-digital-experience<![CDATA[Jonah Kowall]]>Fri, 27 Mar 2020 07:00:00 GMT<p><a href="https://www.kentik.com/resources/how-leading-companies-support-remote-work-and-digital-experience/"><img src="//images.ctfassets.net/6yom6slo28h2/4C8FX4ztJyI270CxQU9Vwr/52d182de8b129b321a7dc652b8cda430/virtualpanel-remotework800w.jpg" style="max-width: 600px; margin-bottom: 20px;" class="image right" /></a></p> <p>This week, Kentik <a href="https://www.kentik.com/resources/how-leading-companies-support-remote-work-and-digital-experience/">hosted a virtual panel</a> with network leaders from Dropbox, Equinix, Netflix and Zoom to discuss how they are scaling their infrastructure and services to accommodate the <a href="https://www.kentik.com/blog/the-gartner-market-guide-for-network-performance-monitoring-and-diagnostics/">unprecedented growth in network traffic during COVID-19</a>. Avi Freedman, Kentik co-founder and CEO, moderated the discussion. He was joined by:</p> <ul> <li>Alex Guerrero, Senior Manager of SaaS Operations at Zoom</li> <li>Bill Long, SVP of Core Product Management at Equinix</li> <li>Dave Temkin, VP of Network and Systems Infrastructure at Netflix</li> <li>Dzmitry Markovich, Senior Director of Engineering at Dropbox</li> </ul> <h3 id="architectures-and-recent-traffic-growth">Architectures and Recent Traffic Growth</h3> <p>These leaders are in charge of running networking and engineering functions for their organizations. All of them have observed significant changes in network usage and traffic patterns due to the changes in the global workforce. As a result, many of them needed to execute a six-month infrastructure growth plan, or even a 2020 growth plan, in a matter of just weeks.</p> <p>The panelists discussed how their organizations run their networks and data centers. Zoom and Dropbox, for example, leverage public cloud infrastructure and automation tools so their services can quickly burst in size for additional capacity. Their investment in automation is essential for handling the dramatic growth in the demand for their services.</p> <p>Equinix connects the world’s leading businesses, including Kentik, to customers, employees, and partners inside some of the most-interconnected data centers. They have a large global footprint of data centers and exchanges, including all major cloud providers. 
According to Bill Long, Equinix has recently seen anywhere from 10% to over 40% growth in traffic.</p> <p>Netflix and Dropbox both operate edge and content delivery networks designed to handle the increase in traffic volume. According to Dave Temkin, Netflix operates all of its content delivery on its own global infrastructure. According to Dzmitry Markovich, Dropbox operates its own backbone network to process multiple terabytes of data per second. He also mentioned that Dropbox uses Kentik to understand their traffic flows and peering.</p> <h3 id="cloud-bursting-for-fast-growth">Cloud Bursting for Fast Growth</h3> <p>There were several common patterns across most of the organizations on the panel. While the burst in traffic and usage <a href="https://www.kentik.com/resources/infrastructure-service-providers/">required more infrastructure and capacity</a>, most of the organizations noted having plenty of headroom to grow. The panelists from Dropbox, Zoom and Netflix noted how bursting to the cloud supports additional short-term capacity requirements. The flexibility provided by public cloud comes at the cost of consumption-based, variable pricing. It’s harder to predict where the costs will go, but if your business is aligned with the consumption of these resources, it’s easier to absorb the costs. Public cloud allows the provisioning of resources based only on what is being used, versus the typical over-provisioning we see across infrastructures. To save costs, organizations using a lot of public cloud will reserve instances in advance. Those who are bursting are often using “spot” or on-demand instances. It’s important to note that cloud providers can also have capacity constraints, <a href="https://searchcloudcomputing.techtarget.com/news/252480711/Microsoft-Azure-capacity-woes-dont-signal-the-worst" target="_blank">as we saw this past week on Azure</a>.</p> <h3 id="new-projects-to-expand-infrastructure">New Projects to Expand Infrastructure</h3> <p>All the panelists created projects to expand their traditional infrastructures because of the traffic changes. Zoom’s Alex Guerrero mentioned his team is building and establishing new peering and connectivity to handle where the bulk of the traffic is coming from. Alex mentioned Zoom uses the Kentik platform <a href="https://www.kentik.com/blog/new-customer-video-how-zoom-uses-kentik-for-network-visibility-performance-peering-analytics/">for the analytics</a> to drive those decisions. The Zoom team also uses many of the Equinix cloud exchanges to connect across providers. Netflix’s Dave Temkin mentioned needing to find new partners from which to source servers used to move content caches closer to the eyeballs. Getting closer to the end-user is key for all the panelists who deliver content, collaboration technologies, or infrastructure.</p> <p>Bill Long said Equinix is in a position to handle the increased capacity, having spent the previous 2-3 years upgrading from 10Gbps to 100Gbps links in their core infrastructure as part of a normal technology refresh cycle. When remote work and shelter-in-place orders began, Equinix had significant resources and capacity to handle the additional traffic. 
Bill said they are also seeing a lot more connections and peering in their facilities to make the traffic flow more smoothly for collaboration and unified communications, along with moves toward VPN services.</p> <h3 id="remote-work-life-balance">Remote Work-Life Balance</h3> <p>Most of the panelists said they manage distributed, remote teams who are accustomed to working that way. There was a discussion about supporting employees who are now balancing work with their families in the background, and about new ways for teams to socialize. The panelists noted that these are common challenges that their organizations are managing quite well. They also noted the importance of work-life balance, even for the network teams who are working harder than ever to keep us all connected.</p> <p>You can watch the replay of the <a href="https://www.kentik.com/resources/how-leading-companies-support-remote-work-and-digital-experience/">virtual panel on-demand here</a>.</p> <p>We will also be hosting more of these panels on various topics as we all change the way we work. The first step to scaling your infrastructure is being able to measure it.</p><![CDATA[UPDATED 3/26/20: Trends in Network Traffic in Correlation with COVID-19]]><![CDATA[At Kentik, we’re always interested in monitoring trends in network traffic. This helps to inform our community and customers globally about potential opportunities and threats to network operations. In this ongoing blog post, we'll provide updates to trends in traffic in correlation with COVID-19.]]>https://www.kentik.com/blog/network-traffic-trends-covid19https://www.kentik.com/blog/network-traffic-trends-covid19<![CDATA[Avi Freedman]]>Thu, 26 Mar 2020 07:00:00 GMT<h3 id="updated-32620">Updated 3/26/20</h3> <p>We are continuing to monitor traffic trends in correlation with the spread of the coronavirus, COVID-19. After recently observing 200% network traffic spikes in North America and Asia (<a href="#3-6-20">see data in our 3/6/20 update below</a>), we wanted to look specifically at the U.S., focusing on states where partial or complete shelter-in-place orders have been issued.</p> <p>In future updates to this blog post, we will look at traffic by type of delivered service (e.g., gaming, videoconferencing, streaming, and social media), as well as look into traffic trends for other domestic and international locations.</p> <p>While we hope these analyses are helpful, we’re grateful for the infrastructure and key SaaS and content services that connect, inform, and entertain us. Thanks to those behind the scenes working and scaling, and our thoughts are also with those affected by the difficulties as they balance family, work, and economic concerns.</p> <p>Before reviewing the findings below, we’d also like to thank Kentik customers, many of whom give us permission to do aggregate analysis to track traffic trends. The views in this blog post are generated from a cross section of those customers who run service provider networks of various types (e.g., eyeball, cloud, and transit networks).</p> <h4 id="network-traffic-across-the-united-states">Network Traffic Across the United States</h4> <p>Overall network traffic has been trending up across the U.S., with a slight uptick this week. 
While some cities have seen faster growth over the last few weeks, traffic overall is up in the tens of percent, and is not trending toward doubling in the short term.</p> <p>While some of the <a href="https://www.cnbc.com/2020/03/23/coronavirus-facebook-to-reduce-video-streaming-quality-in-europe.html" target="_blank">European voluntary reductions in streaming quality</a> are starting to make their way to the U.S., with <a href="https://www.pcworld.com/article/3534032/youtube-defaults-to-lower-resolution-streams-in-the-us.html" target="_blank">YouTube now issuing a global reset</a> to a lower default streaming bitrate, overall we are not seeing systemic congestion, and services are generally scaling well.</p> <p>We also wanted to look deeper into a couple of states where partial or complete shelter-in-place orders have been issued. The general takeaway is that for states where traffic was already high before the spread of COVID-19 (i.e., California and New York), we will not observe as great a percentage increase in traffic. Below are our findings.</p> <h4 id="network-traffic-in-california">Network Traffic in California</h4> <p>California has the highest traffic to start with, so while we’ve seen substantial growth in traffic, the actual percent increase has been less than in some other regions affected first by work-from-home and then work-from-family restrictions, partly because our service provider index also counts transit traffic exchanged in the region.</p> <p>In future posts, we’ll focus on areas where there have been bigger shifts in traffic in California and other geographies, including time-of-day shifts, views by type of over-the-top services, and traditional work versus residential cities.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3ChWSFum1Q4j7xBQ9esLz3/5df4400e1cccf4e924ea6177ca65f224/traffic-ca.png" style="max-width: 800px; border: 1px solid #dfdfdf;" class="image center no-shadow" thumbnail> <h4 id="network-traffic-in-new-york">Network Traffic in New York</h4> <p>New York is the second-largest destination geography for U.S. traffic in this service provider index. While there is a fair amount of interconnection between networks that we are counting in these measurements, there is more of a mix of residential and corporate traffic as a percentage.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4GyYkpBA0b0B31bZZvmldZ/4247a2f750eb5d137a2eecf529e57051/traffic-ny.png" style="max-width: 800px; border: 1px solid #dfdfdf;" class="image center no-shadow" thumbnail> <h4 id="network-traffic-in-illinois">Network Traffic in Illinois</h4> <p>As we look at geographies where our index sees a higher percentage of residential and corporate networks, including Illinois, we see even greater growth in traffic with the recent shifts to work-from-home and shelter-in-place. 
Historically, the largest traffic spikes, which are more visible in states with lower overall traffic, tend to be dominated by gaming and some other over-the-top (OTT) services, which we’ll explore in future updates to this blog post.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/hjngPLHcnV5r4TFCC7Zsm/c2fa27ce80aa99145c695d83ba94b4d4/traffic-il.png" style="max-width: 800px; border: 1px solid #dfdfdf;" class="image center no-shadow" thumbnail> <h4 id="network-traffic-in-washington">Network Traffic in Washington</h4> <p>Finally, of the states we initially looked at, Washington has the most representation from end networks in our service provider index and also has shown the highest step functions of growth in traffic. This goes back to our earlier point: the more traffic there was before COVID-19, the smaller the percentage growth will be, and Washington, starting from lower traffic, shows the opposite.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4xfhbG3OZxg2Z7BNDGe3bu/5b1c57e4e97dac9cfabe77a0925b6f14/traffic-wa.png" style="max-width: 800px; border: 1px solid #dfdfdf;" class="image center no-shadow" thumbnail> <p>As many service providers and digital enterprises work to adapt their services to be respectful of the access networks, we’ll be watching for the traffic effects on these networks, and over time, we’ll layer in performance data as well to compare it with the traffic trends.</p> <h4 id="can-the-internet-handle-more-traffic">Can the Internet Handle More Traffic?</h4> <p>At the current network traffic levels we’re observing, the internet overall is performing well. In our <a href="/resources/how-leading-companies-support-remote-work-and-digital-experience/">virtual panel yesterday</a> with leaders from Zoom, Netflix, Dropbox, and Equinix, people expressed general contentment with capacity and quality across the network interconnection edge, both using up excess headroom that had been pre-provisioned and growing that capacity with the traffic increases over the last few months.</p> <p>The panel’s take matches what we are hearing from our SaaS, content, and gaming customers: people are watching the supply chains, but generally are able to continue to add capacity to their own infrastructures and leverage the public cloud compute infrastructures to help scale their services.</p> <p>With the current mix of business and consumer traffic growth, we may see continued caution in suggesting lower-bitrate defaults, but the current feeling in the networking community seems to be that things should hold.</p> <p>If new macro issues surfaced that, say, directed large portions of those still consuming via cable and broadcast to internet-streaming properties, we could see different kinds of issues. However, so far, we have not seen a major ratio shift, and the eyeball network providers have long since enabled large content libraries to be delivered via set-top box.</p> <p>The Kentik team will continue to monitor this issue and report back with findings as we have them. For more information on what we are looking at and why, please read the update below or <a href="mailto:[email protected]">reach out to us</a>.</p> <p><a name="3-6-20"><div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px; margin-top: 40px;"></div></a></p> <h3 id="updated-on-3620">Updated on 3/6/20</h3> <p>Here at Kentik, we’re always interested in monitoring trends in network traffic. 
This helps to inform our community and our customers (i.e., enterprises globally, as well as service providers that make the internet and applications available to us all) about potential opportunities and threats to network operations. It is, after all, our mission to provide the analytics, insights and automation to make every network excellent.</p> <div class="pullquote right" style="max-width: 260px;">Kentik saw a 200% increase in video conferencing traffic during 9-5 working hours in North America and Asia.</div> <p>When news of the coronavirus, COVID-19, began to pick up back in January, the team at Kentik wanted to examine what traffic trends might reveal about digital behaviors. Particularly in Asia, where the first infection was reported in China, as well as in North America, where Kentik is headquartered, we wanted to see if there were noticeable traffic spikes during typical daytime 9-5 working hours. With reports that countries, municipalities, companies, and other organizations continue to put precautionary measures in place, such as work-from-home policies to avoid illness, we wanted to see if we might observe an increase in network traffic, specifically for common business applications like video conferencing and messaging services.</p> <h4 id="heres-what-we-saw">Here’s What We Saw</h4> <p>On a daily basis, network traffic spikes are typically observed in evening hours when people tend to be at home online. However, from late January (around the start of news of the coronavirus) to late February: <strong>Kentik saw a 200% increase in video conferencing traffic during 9-5 working hours in North America and Asia.</strong></p> <h4 id="how-does-kentik-get-this-data">How Does Kentik Get This Data?</h4> <p>The Kentik Platform has the ability to apply application context to network traffic and can determine traffic volumes for specific applications. For example, we can distinguish between traffic types such as streaming video, video conferencing and video conferencing that includes messaging.</p> <h4 id="why-the-network-matters">Why the Network Matters</h4> <p>When the normal routines of workforces and society in general are disrupted, technology, applications, and the networks that run them become more important than ever. Behind these critical technologies are the network operations teams, who are on the front lines and must ensure performance, reliability and availability.</p> <p>The Kentik team will continue to monitor this issue and report back with more findings. For more information, <a href="https://www.kentik.com/contact/">reach out to us</a>.</p><![CDATA[Strategies for Managing Network Traffic from a Remote Workforce]]><![CDATA[When more of the workforce shifts to working remotely, it puts new and different strains on the infrastructure across different parts of the network. 
In this post, we discuss strategies for managing surges in network traffic coming from remote employees and share information on how Kentik can help.]]>https://www.kentik.com/blog/strategies-for-managing-network-traffic-remote-workforcehttps://www.kentik.com/blog/strategies-for-managing-network-traffic-remote-workforce<![CDATA[Jonah Kowall]]>Thu, 19 Mar 2020 07:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/4v6BeAQZGh8k5jMH8RJ2R4/abdd90aebed6eb271403114ca6d3a6e2/blog-remote-workers.jpg" class="image right no-shadow" style="max-width: 380px;" /> <p>When more of the workforce shifts to working remotely, it puts new and different strains on the infrastructure across different parts of the network, especially where VPN gateways connect to the network edge. Without proper visibility, there is no way to know where the issue lies or which resources users are accessing. Kentik provides the visibility you need to ensure the productivity of your remote workforce.</p> <p>NetOps teams today are contending with a surge in traffic coming from remote employees, and businesses are relying on them to ensure productivity. In many infrastructures, the inflection points are at the network edge, where VPN gateways authenticate and encrypt remote-access traffic. A potential challenge with remote work here is that these users’ devices may become a bottleneck, but most often, it’s the network and not the devices. This could mean the users are saturating the internet connectivity or that they’ve saturated the LAN (or maybe WAN).</p> <p>One strategy for managing the surge in traffic coming from remote workers is to implement a split-tunneling configuration. As more organizations use SaaS-based conferencing services, such as Zoom, Slack, Microsoft Office 365 (Teams), and Cisco WebEx, as much of this traffic as possible should use the employee’s internet connection directly, rather than sending all employee traffic through the VPN. This is known as a “split-tunnel” configuration and is set up with rules that exclude specific ports, protocols, or networks from the tunnel; more advanced VPNs can do this by application type (a simple sketch of the rule logic appears below). Making this change not only reduces the volume of traffic sent to the enterprise network, but also provides a better user experience for employees, especially as they conduct more video and audio conferences.</p> <div class="pullquote right">Kentik provides an easy way to see not only the entire network but also how it’s being used.</div> <p>Another strategy is to increase visibility into the traffic flowing between the network edge and VPN gateways and optimize performance. At Kentik, we provide the network visibility that today’s most advanced organizations need, especially when managing and optimizing network connectivity across distributed environments. For example, with Kentik, you can easily see what traffic is being passed to the VPN and tune or change the rules to optimize and secure the traffic while also providing the best user experience.</p> <p>Kentik provides an easy way to see not only the entire network but also how it’s being used. The richest data sources are from the VPN devices or firewalls. Most often, these devices can export NetFlow (or related flow types) or Syslog. Leading VPN solutions also export performance data. Kentik can ingest all three sources as traffic data. 
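<p>To make the split-tunnel rules above concrete, here is a minimal sketch of the decision such a policy encodes. The networks and ports are hypothetical examples, not any vendor’s actual configuration syntax:</p> <pre><code class="language-python">import ipaddress

# Hypothetical exclusions: SaaS conferencing traffic goes direct to the
# internet; everything else is sent through the VPN tunnel.
DIRECT_NETWORKS = [
    ipaddress.ip_network("52.112.0.0/14"),  # example conferencing service range
    ipaddress.ip_network("64.69.74.0/24"),  # example SaaS range
]
DIRECT_PORTS = {3478, 3479, 3480, 3481}     # example UDP media ports

def route_for(dst_ip: str, dst_port: int) -> str:
    """Return 'direct' for excluded traffic, 'vpn' for everything else."""
    addr = ipaddress.ip_address(dst_ip)
    if dst_port in DIRECT_PORTS:
        return "direct"
    if any(addr in net for net in DIRECT_NETWORKS):
        return "direct"
    return "vpn"

print(route_for("52.113.10.5", 443))   # direct: matches an excluded network
print(route_for("10.20.30.40", 443))   # vpn: enterprise-bound traffic
</code></pre> <p>Real VPN clients express this decision with route tables or policy rules rather than application code, but the logic is the same: excluded destinations bypass the tunnel.</p>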
Based on the data sources used, these devices can provide deep visibility into individual user context and sessions, including what resources and types of applications users are accessing.</p> <p>If the VPN gateways cannot generate flow or Syslog with traffic, the next richest data comes from the edge routers or switches near the VPN devices. These devices can send NetFlow, sFlow, or other flow types to Kentik. When configuring traffic sources, the flow data is sent directly to Kentik over the internet, or it can go through kProxy, a Kentik client that encrypts the flow or Syslog data before sending it over the internet.</p> <p>Aside from using this traffic data, Kentik also collects information from the network devices using SNMP. This data is used to profile the devices and determine the configuration of the hardware and software. We also collect interface details and metrics using SNMP. This data is useful for Kentik’s automated capacity planning workflows and building topological maps (layer 2 and layer 3 connectivity).</p> <p>Kentik is the only SaaS platform to provide scalable visibility into any network. We can also turn services up without anyone needing to go to an office or data center. This keeps employees safe at home.</p> <p>If you have questions, reach out to us at <a href="mailto:[email protected]">[email protected]</a>. We would be happy to answer them and show you how we can provide the visibility you need so you can ensure the productivity of your remote workforce.</p><![CDATA[Can I Put Kentik in My Data Center?]]><![CDATA[The most complex point of today’s networks is the edge, where there are more protocols, diverse traffic, and security exposure. The network edge is also a place where Kentik provides high value. In this post, we discuss how to implement Kentik in your data center.]]>https://www.kentik.com/blog/can-i-put-kentik-in-my-data-centerhttps://www.kentik.com/blog/can-i-put-kentik-in-my-data-center<![CDATA[Jonah Kowall]]>Tue, 03 Mar 2020 08:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/2D5zTukzzkx7bjedzvKtAz/e6ca7c7b9a96ab0023b7c823b139de38/blog-datacenter1.jpg" class="image right no-shadow" style="max-width: 380px; padding-bottom: 50px;" alt="Data Center" /> <p>The most complex point of today’s networks is the edge, where there are more protocols, diverse traffic, and security exposure. The network edge is also a place where Kentik provides high value. The bulk of Kentik’s customers use our differentiated platform to help them drive decision making on the edge of the network.</p> <p>Kentik can also collect traffic from cloud services ─ complementing the complete picture across on-premises data centers and public cloud environments ─ all from a single platform. We recently added directionality for cloud data, and we are working towards new, more impactful visibility across sites and locations with a map-based user interface.</p> <p>Since the edge is highly complex, our customers implement Kentik there first. The licensing fees for the edge are higher since we charge per edge device based on the total amount of throughput the device is capable of handling with its current configuration. All Kentik devices are licensed in this manner for both the edge and the data center. 
However, our data center pricing is much lower and omits features that edge devices require, such as full BGP support.</p> <p>The challenge is that within the data center you have many more devices (e.g., switches, routers, firewalls), which is why the per-device licensing is lower for data center devices. Thus, if you have many smaller devices, Kentik has a higher-value package called a FlowPak: a bulk-licensed bucket of flows. Data center devices can often provide the required visibility at higher sampling rates, and most carry fewer flows, which makes the FlowPak an economical offering. We’ve seen a lot of customer interest in this alternate licensing option.</p> <p>FlowPaks come in the following sizes: 2,000 FPS, 5,000 FPS, 10,000 FPS, 20,000 FPS, 35,000 FPS, and 60,000 FPS. For anything higher, we provide a custom price and discount. (As a scalable SaaS platform, Kentik can handle unlimited traffic.)</p> <p>Kentik is hard at work building a next-generation set of views and capabilities to drive farther into the data center. Stay tuned!</p> <p>With any comments or questions, please reach out to us at <a href="mailto:[email protected]">[email protected]</a>. You can also <a href="#demo_dialog">sign up for a demo</a> or <a href="#signup_dialog">start a trial</a> now.</p><![CDATA[The Kentik Platform is the Future of Network Operations]]><![CDATA[Today we announced an evolutionary leap forward for NetOps, solving for today’s biggest network challenge: effectively managing hybrid complexity and scale, at speed. In this post, Kentik CTO Jonah Kowall discusses what’s new with the latest release of the Kentik Platform.]]>https://www.kentik.com/blog/kentik-platform-future-of-network-operationshttps://www.kentik.com/blog/kentik-platform-future-of-network-operations<![CDATA[Jonah Kowall]]>Thu, 27 Feb 2020 08:00:00 GMT<p>Today we announced an evolutionary leap forward for network operations (NetOps), solving for today’s biggest network challenge: effectively managing hybrid complexity and scale, at speed.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5SwpgNmllf0aFm5kDi6m2y/b51f124ebec71f1e0bd64a9044be4ad5/laptop-network-explorer.jpg" class="image center no-shadow" style="max-width: 600px" alt="Network Explorer" /> <p>From the very beginning, the Kentik Platform was designed to collect the most granular data from terabit-scale networks. The largest and most sophisticated network teams turn to Kentik to answer their most important questions around building the best networks to meet business demands. Their questions span topics from data about traffic (e.g. which applications are consumers using?), down to architectural decisions around peering and interconnection (e.g. how is content-provider data consumed by end-users?). Kentik, in short, allows networks to run as efficiently as possible for the most demanding service providers, SaaS companies, and digital enterprises.</p> <p>The problems that Kentik solves are highly variable and sophisticated, such as understanding how traffic enters and exits networks, sliced by almost any facet of the data. Kentik’s success to date stems from delivering a truly scalable product that answers these questions in seconds. Our platform’s latest evolution complements our existing capabilities and enables NetOps to manage hybrid networks.</p> <p>Historically, we’ve heard some users say it can be challenging to use, or learn from, network data if you do not know what you are looking for. 
That’s why Kentik introduced new capabilities that make it much easier to browse and understand your network resources without having intimate knowledge of the way the network is architected.</p> <p>New features include Quick Views to provide instant access to common reports, and a Network Map of the logical and physical topology with step-by-step workflows to drive the most common network operations tasks. Initiating a discovery or fact-finding mission is now much faster and easier, while still providing the depth and power of our Data Explorer’s slicing-and-dicing capabilities.</p> <p>The Kentik Platform includes real-time analytics, actionable insights, automation, and added integrations within four core modules:</p> <ul> <li> <p><strong>Kentik Operate</strong> is designed for users responsible for day-to-day network operations. Within the Operate module is Network Explorer, which provides an overview of the network with organized, pre-built views of activity and utilization, a Network Map, and other ways to browse your network, including the devices, peers, and interesting patterns that Kentik finds in the traffic. To make NetOps teams more efficient, we’ve built workflows for troubleshooting and capacity management. These are some of the most basic (but not easy!) tasks required to operate today’s complex networks, which span data center, WAN, LAN, hybrid and multi-cloud infrastructures.</p> </li> <li> <p><strong>Kentik Edge</strong> provides the ability to understand and optimize the edge of the network. Managing peering, interconnection, and the overall state of traffic crossing the network edge to and from other networks is critical for any organization doing business on the internet. We’ve built specific workflows and views to <a href="https://www.kentik.com/kentipedia/network-traffic-engineering/" title="Kentipedia: Network Traffic Engineering">help with traffic engineering</a>, which, up until today, often involved manual effort and spreadsheets. With our traffic-cost capabilities, Kentik automates the management of monthly commitments and costs across multiple providers. Additionally, automation from Kentik helps with managing and advising on peering by leveraging existing and new data sources (such as PeeringDB) to take the guesswork out of selecting interconnection. Finally, like any organization that interconnects, we are all concerned with the health of the internet as an open network. This is why Kentik has been a big proponent of RPKI to ensure the health and accuracy of BGP and routing to avoid hijacking. Kentik’s new capabilities automate the analysis of RPKI and traffic validity.</p> </li> <li> <p><strong>Kentik Protect</strong> is built to help secure your infrastructure and provide rapid understanding of network activity. The threat of DDoS attacks, BGP hijacking, and other malicious activities, coupled with today’s network diversity and complexity, makes it challenging to isolate outages and other denials-of-service against organizations. With newly improved DDoS workflows, and other security-related use cases, Kentik protects your critical network assets.</p> </li> <li> <p><strong>Kentik Service Provider</strong> delivers new ways to analyze traffic to and from CDNs, allowing you to understand cache placement and fill efficiency. Also, Kentik provides a deeper understanding of various types of over-the-top (OTT) applications and their delivery, most often through CDNs. 
Kentik Service Provider also unlocks new revenue opportunities via workflows to analyze customer usage and discover potential new business.</p> </li> </ul> <p>We also introduced <strong>Insights</strong>, which provides new methods to turn data into action. Powerful algorithms that understand normal traffic behavior, along with customizable policies, surface relevant, actionable, and interesting events in real-time. This helps network operations to work more proactively to identify, troubleshoot, and resolve network issues.</p> <p>The four modules work across every type of network from the LAN, WAN, and traditional data center to public cloud and cloud-native environments that produce VPC flow log information. Additionally, visibility into leading SD-WAN solutions has helped round out deeper application performance-based use cases. Aside from traffic and flow, Kentik also collects and analyzes device and interface-level metrics via SNMP and streaming telemetry to round out a complete visibility picture, further complementing our platform capabilities and filling the needs for network teams to overcome the challenges of managing hybrid infrastructures.</p> <p>For more information on the launch, check out our <a href="https://www.kentik.com/product-updates/february-2020-winter-launch/">technical update</a>.</p><![CDATA[SNMP vs. NetFlow]]><![CDATA[What is the difference between SNMP and flow technologies? Who uses the data sets? And what benefits do these technologies bring to network professionals? In this post, we explain how and why to use both.]]>https://www.kentik.com/blog/snmp-vs-netflowhttps://www.kentik.com/blog/snmp-vs-netflow<![CDATA[Jonah Kowall]]>Wed, 29 Jan 2020 08:00:00 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/1m1e3D0AkDd8MDyQEOibFn/9c6067633ef78a9dd3d35b05933c6140/blog-snmp-v-flow.jpg" class="image right" style="max-width: 320px" /> There is a lot of confusion regarding the two primary data sets in network management: SNMP and flow. This post will help define both and provide more context on where and why to use each.</p> <h3 id="what-is-snmp">What is SNMP?</h3> <p>SNMP is used to collect metadata and metrics about a network device. This critical technology is a basic building block of modeling, measuring, and understanding the network. While there are newer technologies such as streaming telemetry (ST), which users have grand thoughts about, unfortunately ST is not a replacement for many aspects of SNMP (but this is a topic for another blog post!).</p> <h3 id="what-is-flow">What is Flow?</h3> <p>Flow technologies, such as <a href="https://www.kentik.com/kentipedia/netflow-overview/">NetFlow</a>, sFlow, jFlow, IPFIX, and others, are used to describe traffic on the network. These technologies export data that describes conversations occurring across the specific network device. This includes source and destination IP addresses, port numbers, and other markings such as quality of service (QoS). Flow incorporates a sampling configuration, which allows a specific percentage of conversations to be exported from the device; a sketch of how a collector works with sampled records appears below.</p> <p>Most organizations have a cloud strategy; it could augment data centers or replace them entirely. Regardless of the organizational goals, the network is increasingly important when cloud is being adopted. Cloud doesn’t just put more reliance on the network, but it creates new data sources which network professionals must incorporate into their network visibility strategy. 
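<p>To make that concrete, here is a minimal sketch of what a sampled flow record might carry and how a collector scales sampled byte counts back up to an estimate of total volume. The fields and sampling rate are simplified for illustration; real NetFlow/IPFIX templates carry many more fields:</p> <pre><code class="language-python">from dataclasses import dataclass

@dataclass
class FlowRecord:
    """Simplified flow record for illustration only."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # e.g., 6 = TCP, 17 = UDP
    bytes: int     # bytes observed in the sampled packets
    tos: int       # QoS marking

SAMPLING_RATE = 1000  # 1-in-1000 packet sampling configured on the exporter

def estimated_bytes(flows):
    """Scale sampled byte counts back up to an estimate of actual volume."""
    return sum(f.bytes for f in flows) * SAMPLING_RATE

flows = [
    FlowRecord("198.51.100.10", "203.0.113.20", 51234, 443, 6, 15_000, 0),
    FlowRecord("198.51.100.11", "203.0.113.21", 51235, 53, 17, 1_200, 0),
]
print(f"~{estimated_bytes(flows) / 1e6:.1f} MB estimated from sampled flows")
</code></pre> <p>Cloud flow logs carry a similar shape of per-conversation record, which is what makes them a natural extension of the same strategy.</p>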
To see into these new cloud environments, the major players in the cloud have introduced flow log technology (e.g., AWS VPC Flow Logs, Google VPC Flow Logs, and Azure NSG Flow Logs), which should be part of all flow strategies.</p> <h3 id="who-uses-flow-and-snmp">Who Uses Flow and SNMP?</h3> <p>Flow and SNMP are used by many teams within the organization. Primarily, SNMP is used by operationally focused users trying to understand the key indicators that show issues in the health and operation of the network. This includes the devices themselves, along with the links between devices. SNMP data is also used by network engineers to troubleshoot reported problems, and by network architects for tasks like capacity planning.</p> <div class="pullquote right">Essentially, while SNMP provides the answer to “what” is happening on the network, flow technologies answer the “where” and “who.” You need both sets of data to make intelligent decisions and optimize the entire network.</div> <p>Flow technologies are used by network engineers, network architects, and security professionals to understand traffic, why congestion or heavy usage is occurring, what causes traffic changes, and how these changes can be mitigated. This could mean finding an application or network owner or blocking malicious traffic, such as a DDoS. Flow analytics are used to make decisions on how traffic is sent to and received from other internet-connected peers via traffic engineering and optimization.</p> <p>Essentially, while SNMP provides the answer to “what” is happening on the network, flow technologies answer the “where” and “who.” You need both sets of data to make intelligent decisions and optimize the entire network.</p> <h3 id="when-is-snmp-used">When is SNMP Used?</h3> <p>SNMP is the oldest of the network management protocols in use today. Without going into the details of implementations and the structure of SNMP itself, which has been written about extensively, we will focus on the use cases.</p> <p>SNMP is used to actively query (poll) a network device to collect information about the device. This is used during discovery to figure out what kind of device it is, and is used to collect data about the device, such as the vendor who made the device, when it was last configured, what kind of hardware is in the device, and many runtime data points about configuration and usage of the device.</p> <p>Aside from general metadata, you can also collect numerical data. This would be used to understand the status and usage of any particular subsystem on the device. Each time the device is polled, this data is returned and allows us to build a time series of the data. For example, you could see the CPU usage over the last 24 hours.</p> <p>The challenge with SNMP is that you can define the polling interval as long or short as required. Most solutions poll at 1-5 minute intervals, but as you can imagine, during a 5-minute interval a lot of spikes and other changes can be missed. Remember that we aren’t just collecting one data point about the device; we could be collecting thousands of data points in each polling interval. This means that as we increase the frequency, we can create load on the device, which in turn could cause performance issues for the primary function. 
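<p>A minimal sketch of that polling loop, using the pysnmp library against a hypothetical device at 192.0.2.1 and reading a single interface counter every 60 seconds:</p> <pre><code class="language-python">import time
from pysnmp.hlapi import (
    SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
    ObjectType, ObjectIdentity, getCmd,
)

POLL_INTERVAL = 60  # seconds; shorter intervals catch spikes but add device load

while True:
    error_indication, error_status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData("public", mpModel=1),      # SNMPv2c community string
        UdpTransportTarget(("192.0.2.1", 161)),  # hypothetical device
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifHCInOctets", 1)),  # interface 1
    ))
    if error_indication or error_status:
        print("poll failed:", error_indication or error_status.prettyPrint())
    else:
        for var_bind in var_binds:
            # Each var_bind pairs an OID with its current value.
            print(f"{time.time():.0f}", " = ".join(x.prettyPrint() for x in var_bind))
    time.sleep(POLL_INTERVAL)
</code></pre> <p>Multiply that single OID by the thousands collected per cycle and it is easy to see how aggressive polling intervals can load the device itself.</p>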
This is why ST was created: streaming data from the device is a more efficient way to send high-frequency data to other systems.</p> <h3 id="when-is-flow-used">When is Flow Used?</h3> <p>Flow technologies are critical data sources when measuring what, where, and how actual traffic is passed through a device. While this can be viewed on a single device, flow becomes much more interesting when you integrate and combine this traffic data with additional sources to be able to measure traffic moving between devices. Extending this enables an understanding of the end-to-end traffic flow and conversation. That includes adding in high-value data such as threat feeds and threat modeling, routing, topology, and other important networking information to model answers to difficult questions.</p> <p>Flow can also be used to understand consumption of bandwidth in a more granular manner. For example, looking per IP, per network and, oftentimes, into virtual constructs, such as MPLS, VPNs, and other virtual router devices, allows for a much deeper understanding of the traffic.</p> <p>Similar to the trends we are seeing on the data center side with cloud adoption, we are seeing new application architectures which are creating new network constructs. As organizations adopt microservices architectures, they are most often deploying on orchestration platforms built on Kubernetes. These include complex networking overlays such as Flannel, Calico, and about a dozen others. Kubernetes architectures also often incorporate proxies like Envoy. Kentik supports enriching flow with these other data sets, which allows for visibility into the application traffic flows all the way down to the clusters, pods, and services, helping the network and application teams work better together.</p> <p>The final advantage of flow is that, with sampling, you can collect a very small set of data to describe high volumes of traffic. This allows flow technologies to scale much better than packet-capture solutions while providing valuable insights into the traffic and traffic patterns on the network. While less accurate than describing and inspecting all of the packets in terms of application depth and performance data, flow can easily monitor terabits of traffic, sampled or unsampled, with far less hardware and fewer potential security issues.</p> <p>With any comments or questions on this topic, please <a href="/contact/">contact our team directly</a> or <a href="#demo_dialog" title="Request your Kentik demo">request a demo</a>.</p><![CDATA[Kentik for Grafana]]><![CDATA[Most DevOps teams use Grafana for reporting and data analysis for operational use cases spanning many monitoring systems. Grafana is a place these teams go to get answers to many questions. In this post, Kentik CTO Jonah Kowall discusses how Grafana became so popular and how the Kentik plugin for Grafana has new enhancements and features to help teams across the IT organization.]]>https://www.kentik.com/blog/kentik-for-grafanahttps://www.kentik.com/blog/kentik-for-grafana<![CDATA[Jonah Kowall]]>Tue, 07 Jan 2020 08:00:00 GMT<p>Kentik’s open platform and extensive open API architecture work for our most advanced customers to integrate data omnidirectionally with Kentik. Customers use our inbound APIs to send us various types of data, including threat feeds, DNS data, scoring data, or even physical topology data. 
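<p>As a purely illustrative sketch of that pattern (the endpoint, payload shape, and auth header below are hypothetical placeholders, not Kentik’s actual API), pushing context in might look like this:</p> <pre><code class="language-python">import requests

# Hypothetical enrichment payload: tag internal subnets with a site name so
# that traffic can later be filtered and grouped by that dimension.
payload = {
    "dimension": "site",
    "mappings": [
        {"cidr": "10.12.0.0/16", "value": "fra1-datacenter"},
        {"cidr": "10.34.0.0/16", "value": "sjc2-datacenter"},
    ],
}

resp = requests.post(
    "https://api.example.com/v1/custom-dimensions",      # hypothetical endpoint
    json=payload,
    headers={"Authorization": "Bearer YOUR-API-TOKEN"},  # hypothetical auth
    timeout=10,
)
resp.raise_for_status()
print("enrichment accepted:", resp.status_code)
</code></pre> <p>The point is the direction of the data: context flows in via API so that traffic can be correlated against it.</p>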
Kentik takes this data and can correlate traffic and metric data with these additional data sources to provide additional relevance and context for customers.</p> <p>Aside from feeding Kentik data, many of our customers use Kentik-enriched and summarized data to feed internal systems with a subset of the richness in the Kentik backend and user interface. These are often internal systems which collect diverse data for multiple internal teams. One common system our customers identified early in Kentik’s history was the heavily used open source product, <a href="https://grafana.com/" target="_blank">Grafana</a>.</p> <p>Grafana’s popularity is due to the rise of many different open source time series systems, such as Prometheus for metric collection. Many, if not most, DevOps teams and others working across multiple tools use Grafana as the primary way they do reporting and analysis of data for operational use cases spanning many monitoring systems. This doesn’t mean that Grafana is the richest data source, but it is a single place to go to get answers to many questions. Power users still spend most of their time in Kentik, but generalists, developers, or application-centric users spend most of their time in Grafana.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4s8URUMSrXDJh5VmkrlZOt/8828384c734d452e5d1eca2b4c9403c4/grafana-integrations.jpg" class="image center" style="max-width: 650px;"/> <p>Until earlier in 2019, Grafana Labs had been developing the <a href="https://grafana.com/grafana/plugins/kentik-app" target="_blank">Kentik plugin</a>, which was on the Grafana Connect Pro marketplace. Recently, our team took over development of the plugin used to connect the two platforms. It is now available as an open source project rather than through the Grafana Connect Pro marketplace. Since then, we’ve been busy building new enhancements and features, along with documentation of how to install the new plugin.</p> <p><strong>The new enhancements include:</strong></p> <ul> <li>Autocompletion of metrics and filters in queries</li> <li>Automatic DNS resolution option for IP addresses in Grafana</li> <li>New installation documentation <a href="https://github.com/kentik/kentik-grafana-app" target="_blank">found here</a></li> <li>Adding webpack for a newer framework</li> <li>Performance improvements via code refactoring</li> <li>Cleaning up files and repository</li> <li>Other bug fixes and error detection</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/65U8WnAOff69XLQegLWMnI/1ccf11fb7e0857b91f80a8236a1e7877/grafana-kentik-top-talkers.jpg" class="image center" style="max-width: 850px;"/> <img src="//images.ctfassets.net/6yom6slo28h2/1GU5kxyC2N5FQGI7eXJ83U/1777b6de0f15fe2306ab22db45864548/grafana-api-driven.gif" class="image center" style="max-width: 850px;"/> <p>Future work includes porting the plugin to React and adding a changelog to the plugin itself. Grafana Labs is pushing plugins towards React, so we are waiting to see how this transition goes, while keeping our eyes out for what will come in Grafana 7.0 as it is built.</p> <p>Please check out <a href="https://github.com/kentik/kentik-grafana-app" target="_blank">Kentik’s GitHub repository</a>, which includes streamlined install documentation. Feel free to leave feature requests or help requests as issues on GitHub. 
<a href="mailto:[email protected]">Kentik Support</a> is also able to help, and our customer Slack channel is another great place to discuss the Grafana plugin.</p><![CDATA[Scaling BGP Peering in Kentik's SaaS Environment]]><![CDATA[Last month at DENOG11 in Germany, Kentik Site Reliability Engineer Costas Drogos talked about the SRE team’s journey during the last four years of growing Kentik’s infrastructure to support thousands of BGP sessions with customer devices on Kentik’s multi-tenant SaaS (cloud) platform.]]>https://www.kentik.com/blog/scaling-bgp-peering-kentik-saas-environmenthttps://www.kentik.com/blog/scaling-bgp-peering-kentik-saas-environment<![CDATA[Crystal Li]]>Thu, 19 Dec 2019 05:00:00 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/6uP7Bk8MEWd1iRqMtBLc8c/1a89777474e57303e1265a1386d5dbcd/denog-hamburg2019.png" class="image right" alt="DENOG" style="max-width: 230px;" />When it comes to analyzing network traffic for tasks like peering, capacity planning, and <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">DDoS attack detection</a>, there are multiple auxiliary sources that can be utilized to supplement flow information. These include SNMP, DNS, RADIUS and streaming telemetry. BGP routing data is another important data source. BGP can enrich network traffic data to visualize traffic per BGP path for peering analytics and also to inject routes that enable DDoS mitigation capabilities such as RTBH and Flowspec.</p> <p>Last month at DENOG11 in Germany, Kentik Site Reliability Engineer Costas Drogos talked about the SRE team’s journey during the last four years of growing Kentik’s infrastructure to support thousands of BGP sessions with customer devices on Kentik’s multi-tenant SaaS (cloud) platform. <a href="https://www.youtube.com/watch?v=uOu0Krn7_HE" target="_blank">Costas shared</a> various challenges the team overcame, the actions the team took, and finally, key takeaways.</p> <h3 id="bgp-at-kentik">BGP at Kentik</h3> <p>Costas started off by giving a short introduction to how Kentik uses BGP, in order to develop the technical requirements, which include:</p> <ul> <li>Kentik peers with customers, preferably with every BGP-speaking device that sends flows to our platform</li> <li>Kentik’s capabilities act like a totally passive iBGP route-reflector (i.e. we never initiate connections to the customers) on servers running Debian GNU/Linux.</li> <li>BGP peering uptime is part of Kentik’s contracted SLA - 99.99%</li> </ul> <p>At Kentik, we use BGP data not only to enrich flow data so we can filter by BGP attributes in queries, but we also calculate lots of other analytics with routing data. For example, you can see how much of your traffic is associated with <strong>RPKI</strong> invalid prefixes; you can do <strong>peering analytics</strong>; if you have multiple sites, you can see how traffic gets in and out of your network (<a href="https://kb.kentik.com/Fc07.htm">Kentik Ultimate Exit</a>™); and eventually, perform <strong>network discovery</strong>. 
Moreover, each BGP session can be used as the transport to push <strong>mitigations</strong>, such as RTBH and Flowspec, triggered by alerting from the platform.</p> <h3 id="scaling-phases">Scaling phases</h3> <p>Costas then shared how the infrastructure has been built out from the beginning to today as Kentik’s customer base has grown.</p> <h4 id="phase-1---the-beginning">Phase 1 - The beginning</h4> <p>Back in 2015, when we monitored approximately 200 customer devices, we started with 2 nodes in active/backup mode. The 2 nodes shared a floating IP that handled HA/failover, managed by <a href="https://github.com/jedisct1/UCarp" target="_blank">ucarp</a> — an open BSD CARP (VRRP) implementation. This setup ran at boot time from a script residing in <code class="language-text">/root</code> via <code class="language-text">rc.local</code>.</p> <p>Obviously, this setup didn’t go very far with the rapid growth of BGP sessions. After a while, one active node could no longer handle all peers, with the node getting overutilized in terms of memory and CPU. With Kentik growing quickly, the solution needed to evolve.</p> <h4 id="phase-2">Phase 2</h4> <p>In order to fit more peers, we had to add extra BGP nodes. Looking at our setup, the first thing we did was to replace ucarp because we observed scaling issues with more than 2 nodes. We developed a home-grown shell script (called ‘bgp-vips’) that communicates with a spawned <a href="https://github.com/Exa-Networks/exabgp" target="_blank">exaBGP</a>. This took care of announcing our floating BGP IP, which was now provisioned on the host’s loopback interface. Each host then announced the route with a different MED so that we had multiple paths available at all times.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4nJc0qBcIaeNhlm78n77Ra/b2476de241596d18230f21a0e8e9500a/phase2-code.png" class="image center" style="max-width: 600px;" /> <p>The next big step was to scale out the actual connections by allowing them to land on different nodes. On top of that, since our BGP nodes were identical, the distribution of sessions needed to be balanced. Given that we only have one active IP on each node, the next step was to have this landing node act as a router for inbound BGP connections, with policy routing as the high-level design. The issue we then had to think about was how to achieve a uniform enough distribution. 
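<p>To illustrate the goal (a sketch of the idea only, not the mechanism we shipped), hashing each peer’s source address and port to pick one of N identical nodes spreads sessions roughly evenly, and the spread gets more uniform as the hash input gains entropy:</p> <pre><code class="language-python">import hashlib
from collections import Counter
from random import randint

NODES = ["bgp-node-1", "bgp-node-2", "bgp-node-3", "bgp-node-4"]

def pick_node(src_ip: str, src_port: int) -> str:
    """Deterministically map a peer connection to one of the identical nodes."""
    digest = hashlib.sha256(f"{src_ip}:{src_port}".encode()).digest()
    return NODES[int.from_bytes(digest[:4], "big") % len(NODES)]

# Simulate 10,000 peer connections and check the distribution.
tally = Counter(
    pick_node(f"203.0.113.{randint(1, 254)}", randint(1024, 65535))
    for _ in range(10_000)
)
print(tally)  # roughly 2,500 sessions per node
</code></pre>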
After testing multiple setups, we ended up using wildcard masks as the sieve to mark connections with.</p> <p>While we were able to scale connections and achieved a mostly uniform distribution among the peering nodes (example below), our setup was still not really IPv6 ready and needed full exaBGP restarts upon any topology modification, resulting in BGP flaps for customers.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5kuUFMdhMI1D4cj2VGdJt7/3618c3bac0566e3f2b63f002896561c2/bgp-flaps.png" class="image center no-shadow" style="max-width: 600px;" /> <p>On top of that, we introduced <a href="https://kb.kentik.com/Gc08.htm#Gc08-RTBH_Mitigation_Methods">RTBH for DDoS mitigation</a>, which immediately raised the importance of having a stable BGP setup — as we were now actively protecting customers’ networks.</p> <p>With the fast growth of Kentik, when we hit the 1,300-peers mark, a few more issues surfaced:</p> <ul> <li>Mask-based hashing was not optimal anymore.</li> <li>The home-grown shell script ‘bgp-vips’ was painful to work with day-to-day: modifying MEDs, health checks, and even the smallest topology change warranted a full restart, dropping all connections.</li> <li>IPv6 peerings were starting to outgrow a single node.</li> </ul> <h4 id="phase-3">Phase 3</h4> <p>Improvement of the Phase 2 setup became imperative. Customers were being onboarded so rapidly that the only way forward was continued innovation.</p> <ol> <li>The first thing to optimize was traffic distribution, to achieve better node utilization. We replaced mask-based routing with hash-based routing (<a href="https://lwn.net/Articles/488663/" target="_blank">HMARK</a>). This offered us greater stability and, due to hashes having higher entropy than IPs, uniform-enough distribution. It also allowed mirroring the setup for our IPv6 fabric.</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/6zM0CoAJmKfoIgWRiXRPGk/7a096c75a3c9cb23ff521a19f93539c6/phase3-1iptables.png" class="image center" style="max-width: 650px;" /> <ol start="2"> <li>The second thing to improve on was our BGP daemon. We replaced the ‘bgp-vips’ shell scripts with a real daemon in Python, called ‘vipcontrol’, that communicates with a side-running exaBGP over a socket. With this new ‘vipcontrol’ daemon, we can now modify runtime configuration and change MEDs on the fly. Another sidecar daemon, called ‘viphealth’, took care of health-checking the processes, handling IPs, and modifying MEDs as needed.</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/7bYybH5mKmZXlIzhCZqkN1/2985ad9fc45be20bb877e65b47ff5477/phase3-2sudo.png" class="image center" style="max-width: 600px;" /> <p>In the meantime, Kentik introduced <a href="https://www.kentik.com/blog/kentik-takes-a-leap-forward-in-ddos-defense/">Flowspec DDoS mitigations</a>, so offering stable BGP sessions became even more important for Kentik.</p> <h4 id="phase-4---today">Phase 4 - Today</h4> <p>Today, Kentik continues to grow, peering with more than 4,000 customer devices. As before, we designed the next phase in the spirit of continuous improvement. We decided to create a new design, building on previous experience. 
We started by setting the requirements, including that we:</p> <ul> <li>Support both IPv4 and IPv6</li> <li>Keep the traffic distribution uniform, as our nodes are all identical</li> <li>Be able to scale horizontally to accommodate future growth demands</li> <li>Persist all state in our configuration management code tree</li> <li>Simplify day-to-day operations such as adding a node, removing a node, and code deploys — and keep all these operations transparent to customers</li> </ul> <p>We tested different designs during an evaluation cycle and decided to go with LVS/DSR (Linux Virtual Server / Direct Server Return), a load-balancing setup traditionally used for website load balancers, but it worked well for BGP connections, too.</p> <p>Here is how it works:</p> <ol> <li>A customer device initiates a connection.</li> <li>The floating IP gets announced to the BGP fabric.</li> <li>It’s then passed on to the load balancer node (which doesn’t run BGP code).</li> <li>The load balancer node rewrites the source MAC of the packet and forwards the packet to the real server that’s running the BGP code.</li> <li>The real server terminates the connection by replying directly to the customer (so we are not passing both directions of the TCP connection through the load balancing node).</li> </ol> <img src="https://images.ctfassets.net/6yom6slo28h2/55XU5NyCKBSnofOM8khnHC/8d3a91c21a7cda2d44d813a16749568d/lvs-dsr-setup.png" class="image center no-shadow" style="max-width: 700px;" /> <p>Under the hood, the new design utilized the following:</p> <ol> <li>Replaced exaBGP with <a href="https://bird.network.cz/" target="_blank">BIRD</a> (BIRD Internet Routing Daemon) and added BFD on top for faster route failover. BIRD is used to announce the floating IP into Kentik’s fabric.</li> <li>Used Keepalived in LVS mode with health checks for pooling/depooling real servers.</li> <li>Load balancing nodes utilize IPVS to sync connection state, so that if a load balancing node fails, TCP connections remain intact as they move to the other load balancing node.</li> <li>All of the configuration is set in Puppet+git, so that we can follow any changes that go into the whole configuration.</li> </ol> <p>Today, we’re testing the new setup in our staging environment, evaluating the pros and cons and tuning it to ensure it’s going to meet future scaling requirements as we begin to support tens of thousands of BGP connections.</p> <h3 id="recap">Recap</h3> <p>With the rapid growth of Kentik over the past four years, we evolved the backend for our BGP route ingestion in four major phases to meet scaling requirements and improve our setup’s reliability:</p> <ol> <li>2 nodes in active-backup, using ucarp</li> <li>4 active nodes with mask-based hashing, using exaBGP in our HA setup</li> <li>10 active nodes with full-tuple hashing, support for balancing IPv6</li> <li>16+ nodes with LVS/DSR and IPVS, now under testing</li> </ol> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p>To watch the complete talk from DENOG, please check out this YouTube video.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 700px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.14285714285714%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/uOu0Krn7_HE?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 
100%; "></iframe> </div> </div> <p>To learn more about Kentik, <a href="#signup_dialog" title="Start a Free Kentik Trial">sign up for a free trial</a> or <a href="#demo_dialog" title="Request your Kentik demo">schedule a demo</a> with us.</p><![CDATA[AIOps Comes of Age in Gartner’s Market Guide for AIOps Platforms]]><![CDATA[Gartner recently published its 2019 version of the Market Guide for AIOps Platforms. In this post, we examine our understanding of the report and discuss how Kentik’s domain-centric AIOps platform is built from the ground up for network professionals. ]]>https://www.kentik.com/blog/aiops-comes-of-age-gartners-market-guide-for-aiops-platformshttps://www.kentik.com/blog/aiops-comes-of-age-gartners-market-guide-for-aiops-platforms<![CDATA[Jim Meehan]]>Tue, 17 Dec 2019 08:00:00 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/6igntQAe8AozgC0etoPkgW/5ee01813afb32cf618660118c3ca48bb/gartner-aiops-market-guide.png" class="image right" style="max-width: 220px; margin-left: 20px;" alt="Gartner Market Guide for AIOps platforms" />Last month, Gartner published the 2019 version of its Market Guide for AIOps Platforms<sup>*</sup>. In case you’re unfamiliar, Gartner Market Guides provide some of the same coverage as Gartner’s well-known Magic Quadrant, but without the per-vendor details and comparison. We believe this year’s Market Guide covers some of the same ground as last year’s guide, but with some significant changes as well, representing the increasing maturation of the AIOps market.</p> <p>Kentik views AIOps as a set of three primary capabilities:</p> <ul> <li>Scalable data <strong>collection and correlation</strong> from a wide variety of sources</li> <li><strong>Real-time and historical analytics</strong> with problem detection based on machine learning</li> <li><strong>Suggestions and workflows</strong> with an option to <strong>initiate an action</strong> or next step</li> </ul> <p>All of these capabilities are centered around a key goal of using big data and machine learning technology to increase the efficiency of IT operations teams. AIOps acts as a force multiplier, allowing teams to spend less time firefighting and more time building new technology capabilities for the organization. We believe Gartner suggests that the ever-increasing volume, velocity and variety of IT operations data will require many more organizations to adopt AIOps technology to maintain the status quo of running reliable IT infrastructure. We also feel that the guide predicts that 40% of IT teams will augment traditional <a href="https://www.kentik.com/blog/kentik-for-grafana/">monitoring systems</a> with AIOps capabilities over the next few years.</p> <p>Kentik believes that this year’s guide makes a new distinction between domain-agnostic and domain-centric AIOps platforms. Domain-agnostic platforms are general purpose products that cover the whole of IT operations, and typically ingest pre-processed data and alerts from other monitoring systems that collect the raw IT operations data. One goal of these systems is to provide a single representation of an event or incident that may be reported by multiple traditional monitoring systems.</p> <p>Domain-centric AIOps platforms cover a subset of the IT operations landscape. Typical focus areas of domain-centric platforms include application performance monitoring (APM), <a href="https://www.kentik.com/blog/the-role-of-predictive-analytics-in-network-performance-monitoring/">network performance monitoring (NPM)</a>, and endpoint monitoring. 
Domain-centric platforms provide all of the core AIOps capabilities discussed above, but focus on a more specific set of use cases.</p> <p>Kentik’s opinion is that organizations should focus initial <a href="https://www.kentik.com/blog/aiops-comes-of-age-gartners-market-guide-for-aiops-platforms/">AIOps efforts</a> on specific use cases. According to Gartner, “Increase the odds of a successful AIOps platform deployment by focusing on a specific use case and adopting an incremental approach that starts with replacing rule-based event analytics and expands into domain-centric workflows like application and network diagnostics.” This makes sense because a narrower scope reduces deployment complexity and makes it easier to see near-term ROI from adopting AIOps technology.</p> <p>We believe Gartner’s guide also references the four stages of IT operations monitoring (ITOM), which is essentially a progression of value as AIOps and monitoring in general become more operationalized within organizations. The phases include:</p> <ul> <li><strong>Descriptive IT</strong>: Visualization and statistical analysis</li> <li><strong>Anomaly Detection and Diagnostics</strong>: Correlation and automated pattern discovery</li> <li><strong>Proactive Operations</strong>: Pattern-based prediction</li> <li><strong>Avoiding High-Severity Outages</strong>: Using analytics to uncover root causes</li> </ul> <p>Here at Kentik, we provide a domain-centric AIOps platform built from the ground up for network professionals. By collecting and correlating large volumes of network data like <a href="https://www.kentik.com/kentipedia/netflow-overview/">NetFlow</a>, SNMP, streaming telemetry and BGP routing data, we provide the deep insight and fast data access that network teams need to stay on top of managing today’s infrastructure — whether that be traditional data centers, public cloud, or anywhere in between.</p> <p>Update March 13, 2020: Kentik no longer distributes Gartner’s Market Guide for AIOps, but you can get the new 2020 <a href="https://www.kentik.com/go/gartner-market-guide-npmd-network-performance-monitoring-and-diagnostics-2020/">Gartner Market Guide for Network Performance Monitoring and Diagnostics (NPMD) here</a>, courtesy of Kentik. If you’d like to check out what Kentik can do for your network, <a href="https://www.kentik.com/contact/">get in touch</a> or <a href="#signup_dialog" title="Start a Free Kentik Trial">start a trial</a>.</p> <p><sup>*</sup><em>Gartner, Inc., Market Guide for AIOps Platforms, by analysts Charley Rich, Pankaj Prasad, Sanjit Ganguli, 7 November 2019</em></p> <p><em>Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, express or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.</em></p><![CDATA[How Superloop Connects APAC to the World with Network Insights from Kentik]]><![CDATA[“My advice to anyone who is considering Kentik is to get into it. Get a trial, import your data, and start playing with it,” says Rick Carter, head of networks at Superloop and Kentik user. 
In this vlog, Rick shares how Kentik helps Superloop bring connectivity to APAC by uncovering important network and business insights.]]>https://www.kentik.com/blog/how-superloop-connects-apac-to-the-world-with-network-insights-from-kentikhttps://www.kentik.com/blog/how-superloop-connects-apac-to-the-world-with-network-insights-from-kentik<![CDATA[Michelle Kincaid]]>Tue, 10 Dec 2019 12:00:00 GMT<div class="pullquote right" style="max-width: 300px;">"My advice to anyone who is considering Kentik is to get into it. Get a trial, import your data, and start playing with it. Dig into that information that you have and you’ll end up finding information that you never even knew about."<br />— Rick Carter, Head of Networks</div> <p>Australia-based network service provider <a href="https://superloop.com/" target="_blank">Superloop</a> supplies carrier-grade metro fibre networks, subsea cables, and fixed wireless networks that connect APAC to the rest of the world. Maintaining a secure, fast and powerful network is critical to Superloop’s success. That’s why the company turned to the Kentik AIOps Platform for network insights to enhance customer experience and boost business growth.</p> <p>“When I get in in the morning, I jump on Kentik and have a look to see how much traffic we’ve done over the peak of the night before. I’m always interested to see whether we’ve broken a record because we seem to be doing that very regularly lately,” says Rick Carter, head of networks at Superloop. “Some of the benefits of Kentik are on the operational cost side to see where our traffic is coming from and going to with various autonomous systems. That gives us the ability to see the cost we are paying for that traffic with our transit providers and see if we have the opportunity to peer with those providers to save on costs there.”</p> <p>Watch Rick talk about how Superloop uses the <a href="https://www.kentik.com/product/kentik-platform/">Kentik AIOps Platform</a> in our new video or read the transcript below.</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/4rHtCsxbxSw?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> </div> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 0px;"></div> <h3 id="q-kentik-who-is-superloop">Q. Kentik: Who is Superloop?</h3> <p><strong>Rick Carter, Head of Networks at Superloop:</strong></p> <p>Superloop is an Australian-based network service provider, we have an employee base of around 300 people. We have offices in Sydney, Brisbane, Melbourne, Adelaide, Perth, Singapore and Hong Kong.</p> <p>Superloop is a provider of dark fiber and we have about 700 kilometers of carrier-grade fiber in the ground on the east coast of Australia, as well as in Singapore and Hong Kong.</p> <p>We’ve also recently brought into production a submarine cable. This is the first submarine cable between Sydney and Perth, and we also have another leg from Perth to Singapore.</p> <h3 id="q-how-does-superloop-use-kentik">Q. How does Superloop use Kentik?</h3> <p>When I get in in the morning, I jump on Kentik and have a look to see how much traffic we’ve done over the peak of the night before. 
I’m always interested to see whether we’ve broken a record because we seem to be doing that very regularly lately.</p> <p>Some of the benefits of Kentik are on the operational cost side to see where our traffic is coming from and going to with various autonomous systems. That gives us the ability to see the costs that we are paying for that traffic with our transit providers and see if we have the opportunity to peer with those providers to save on costs there.</p> <p>Other benefits are around DDoS traffic so we can provide a service to our customers to scrub dirty traffic coming in and provide them with a clean feed.</p> <h3 id="q-who-at-superloop-uses-kentik">Q: Who at Superloop uses Kentik?</h3> <p>The main users of Kentik in our organization are the network engineering and architecture teams. Other groups also use the reporting functions, including the global operations center, as well as the executive team. We provide a lot of reports to the executive team on how traffic trends are changing, and if there are any major incidents that have occurred on the network.</p> <h3 id="q-why-did-superloop-turn-to-kentik">Q: Why did Superloop turn to Kentik?</h3> <p>Before Kentik, when we were using the open source tools, we’d get some information, or it’d take a lot of time to dig through all of the information to get what we actually needed. Kentik now gives us all of that information at our fingertips, makes it easy to search for the information, and find what we need very quickly.</p> <h3 id="q-what-advice-do-you-have-for-something-considering-kentik">Q: What advice do you have for someone considering Kentik?</h3> <p>My advice to anyone who is considering Kentik is to get into it. Get a trial, import your data, and start playing with it. Dig into that information that you have and you’ll end up finding information that you never even knew about.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p><a href="#demo_dialog" title="Request your Kentik demo">Schedule a demo of the Kentik AIOps Platform</a> or <a href="#signup_dialog" title="Start a Free Kentik Trial">sign up for a free trial</a>.</p><![CDATA[Disney+ Launch: What Broadband Providers Need to See]]><![CDATA[Disney+ launched with an impressive debut, especially considering that technical problems, potentially capacity issues, plagued the service shortly after launch. In this post, we look at Disney+ traffic peaks around launch. We also explain why, during high-profile launches like that of Disney+, broadband subscribers are likely to blame their provider for any perceived problems. Network operators must be able to understand the dynamics of new loads placed on the network by new services and plan accordingly.]]>https://www.kentik.com/blog/disney-plus-launch-what-broadband-providers-need-to-seehttps://www.kentik.com/blog/disney-plus-launch-what-broadband-providers-need-to-see<![CDATA[Jim Meehan, Greg Villain]]>Mon, 18 Nov 2019 08:00:00 GMT<p>Disney’s Disney+ service launched on Nov. 12 to much fanfare. In case you’ve been hiding under a rock, Disney+ is a subscription-based, video-streaming service operated by none other than Disney. In the past, Disney licensed their extensive content library to other streaming providers (including Netflix), but they’ve now pulled all that content back in-house to deliver it via their own service. 
With a massive catalog that includes popular content from Disney and Disney-owned Pixar, Marvel, Lucasfilm (Star Wars) and National Geographic, all eyes have been on the company to see how popular its new service will become.</p> <p>Here at Kentik, we thought it would be interesting to dive into the details of the launch from the network traffic perspective. For that purpose, we reached out to a few friendly Kentik customers who were willing to share their data for the analysis. Kentik’s platform includes our True Origin technology, which identifies the OTT service associated with network traffic and also the CDN provider that delivered it. This allows network service providers to understand how subscribers and OTT services are utilizing their network. They can then make more informed network planning decisions to ensure that there is enough capacity to deliver the OTT services subscribers are using and to optimize their performance.</p> <p>First, we wanted to know how much traffic Disney+ was actually delivering to get a sense of just how popular it might become, especially compared to other streaming services.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5k7yt3qBLSF2o7ckCMTgHz/4a780b2378c91a0959cc7fd57f0c4eee/disney1-topott.png" style="max-width: 700px;" class="image center" alt="Disney+ Traffic" thumbnail /> <p>In the graph above, you can see the start of traffic from Disney+ around 18:00 UTC on launch day. It then quickly rose into the top 4 streaming services, nearly equaling Hulu by 23:00. By Thursday evening (Nov. 14), Disney+ traffic had grown to 55 Gbps at peak on this network, and briefly surpassed every streaming service except Netflix.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1vzwpGzorNNpDtzPKEtn0J/6c0e84951c88c96c37bb1c3c65284d13/disney2-topott.png" style="max-width: 700px;" class="image center" alt="Disney+ and other steaming services" thumbnail /> <p>That’s an impressive debut, especially considering that technical problems (potentially capacity issues) were plaguing Disney+ shortly after launch. It’s also interesting to see that Disney+ traffic peaks about an hour before other streaming services — around bedtime for younger viewers.</p> <p>We also thought it would be informative to look at how Disney+ was distributing video content to subscribers. It’s been widely reported that back in 2017, Disney bought BAMTech, a long-time player in streaming video delivery. BAMTech provides the technical prowess, but where exactly do the bits originate?</p> <img src="//images.ctfassets.net/6yom6slo28h2/4hIqeUkSnUgBmlLRLqDBp3/dc61c3c4b0de1168ac8097f0c2da7ce4/disney3-topsrccdn.png" style="max-width: 700px;" class="image center" alt="Origin" thumbnail /> <p>From the perspective of one Kentik customer’s network, we can see that Disney+ utilizes a slew of different CDN providers to deliver traffic to viewers. Traffic volume is roughly evenly split among Level 3, Akamai, Fastly and Limelight — and the split seems to stay fairly consistent as launch day progressed.</p> <p>Looking from the perspective of a different customer’s network, the distribution of traffic among CDN providers is much different. Here we see Verizon Media (formerly EdgeCast) in the mix. Level 3 is delivering the largest share of traffic, with a smaller uneven split among the other providers. CDNs already use various metrics to serve content from the node that will provide the best performance for each user. 
But it appears here that Disney+ is also intelligently choosing the mix of CDN providers that will deliver the best performance based on the geolocation (or other metrics) of the destination viewer’s broadband provider.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5gstXbjYMTap7KnPfyBAHm/271407b9c38b82a40f823545921b8f94/disney4-topsrccdn.png" style="max-width: 700px;" class="image center" alt="CDN providers" thumbnail /> <p>In another customer’s network, we observed a bit of an anomaly shortly after the Disney+ launch. Verizon / EdgeCast (light green on the graph) suddenly dropped out of the mix of CDN providers — and overall traffic volume dropped as well. We know that Disney+ was experiencing some technical problems at launch, and perhaps this shift is evidence of corrective action.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2N1QUxTqMmZpK98oq0qDsK/3a917792d388dbc7077ab686b1a2c3ce/disney5-topott.png" style="max-width: 700px;" class="image center" alt="Anomaly after Disney+ launch" thumbnail /> <p>Another window into traffic delivery is the average per-viewer bitrate, segmented by CDN provider. This is derived by dividing the total traffic volume for a CDN by the number of active Disney+ viewer IP addresses for each CDN. Fastly, Level 3 and Limelight are all neck-and-neck in terms of performance, with Akamai somewhat below those three, and Verizon / EdgeCast well below the rest.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6sAn9G6MNcmLs5xcTC4pYb/7228c45711e38abb0bb7fbf2f9594123/disney6-topsrccdn.png" style="max-width: 700px;" class="image center" alt="Average per-viewer bitrate" thumbnail /> <p style="font-size: 98%; max-width: 700px; margin: 30px auto;"><b>Note:</b> The low values for Verizon’s EdgeCast are likely due to their very low number of samples, as this CDN is barely being used to deliver Disney+ for this ISP.</p> <p>While these graphs and data are interesting to look at to gauge the launch of Disney+, they’re also critical insight for the broadband providers that deliver this traffic to end-subscribers. During high-profile launches like this, broadband subscribers are likely to blame their provider for any perceived problems. Network operators must be able to understand the dynamics of new loads placed on the network by new services and plan accordingly. That includes active management and planning to make sure adequate capacity is in place to the CDN traffic origins, or local caches to bring the traffic origins on-net.</p> <p>If you’d like to see how Kentik can help you understand the dynamics of subscribers and OTT services on your own network, <a href="#demo_dialog" title="Request your Kentik demo">request a demo</a> or <a href="#signup_dialog" title="Start a Free Kentik Trial">sign up for a trial</a>.</p><![CDATA[How Packet Uses Kentik to Make Infrastructure a Competitive Advantage]]><![CDATA[Packet is focused on automating single-tenant bare metal compute infrastructure. On a mission to enable the world’s companies with a competitive advantage of infrastructure, Packet turned to Kentik for network visibility and insights. 
]]>https://www.kentik.com/blog/how-packet-uses-kentik-infrastructure-competitive-advantagehttps://www.kentik.com/blog/how-packet-uses-kentik-infrastructure-competitive-advantage<![CDATA[Michelle Kincaid]]>Thu, 14 Nov 2019 08:00:00 GMT<p><img src="https://images.ctfassets.net/6yom6slo28h2/3UrOtzz6m0H4jvrLLzf1n9/deb5fca290695d6498ecfd70e3927cea/blog-packet383.png" class="image right no-shadow" style="max-width: 180px; margin: 15px 25px 15px 25px;" alt="Packet" />Cloud infrastructure provider <a href="https://www.packet.com/" target="_blank">Packet</a> empowers developer-driven companies to deploy physical infrastructure at global scale. To do that, Packet must ensure their network is always up, with insights and automation that drive performance. That’s why Packet turned to the <a href="https://www.kentik.com/product/kentik-platform/" title="Learn More About Our Platform: the Kentik Network Observability Cloud">Kentik Network Observability Cloud</a> for revolutionary network analytics.</p> <div class="pullquote left" style="max-width: 300px; margin-right: 40px;">“The one thing that is simply not a negotiable aspect these days is your network.”</div> <p>“The one thing that is simply not a negotiable aspect these days is your network, so any investment you can make, such as with tools like Kentik, really can give you the insights you need to know what is worth doing, where you should spend your time, and how you can react when there are issues,” says Zachary Smith, Co-founder and CEO of Packet.</p> <p>“Kentik gives us the visibility we need to address the incidents in a professional manner and without impact to our customers,” reports Adam Rothschild, Co-founder and SVP of Network and Data Center at Packet.</p> <p>Check out all the ways Packet leverages Kentik in our new video or read the transcribed Q&#x26;A from Zac and Adam below.</p> <div style="margin-bottom: 15px;">&nbsp;</div> <div class="gatsby-resp-iframe-wrapper" style="max-width: 560px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.25%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://www.youtube.com/embed/hcy6l6JpxqI?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> </div> </div> <h3 id="q-who-is-packet">Q: Who is Packet?</h3> <p><strong>Zachary Smith, Co-founder and CEO of Packet:</strong></p> <p>Packet is focused on automating single-tenant bare metal compute infrastructure. Our mission is to enable the world’s companies with a competitive advantage of infrastructure. And so to do that, we offer a bare-metal public cloud in over 20 places around the world.</p> <p>Customers like Sprint run their 5G IoT network off of Packet, and we support them in building infrastructure and giving them the fundamental network primitives so that they can deliver their IoT platform. In order to do that, Packet needs to:</p> <ul> <li>Make sure the network is always up;</li> <li>Provide customers with enough insights so that they can optimize, tune, react to things that are happening on the core platform;</li> <li>Automate everything to a level that a user would expect.</li> </ul> <h3 id="q-why-did-packet-turn-to-kentik">Q: Why did Packet turn to Kentik?</h3> <p><strong>Adam Rothschild, Co-founder and SVP of Network and Data Center at Packet:</strong></p> <p>Packet had a lot of needs for deep analytics into our network and where our traffic was going. 
This was a really multifaceted need:</p> <ul> <li>Our account teams needed to understand profitability on a per-customer basis.</li> <li>Our systems and security teams had a need for identifying any sort of denial-of-service attack or any sort of compromised server within the network.</li> <li>Our network engineering teams had a need to understand where our traffic was going and what providers we were exchanging traffic with, to try to drive any sort of network expansion, peering and commercial relationships, and more.</li> </ul> <p>We had looked at various open-source tools that could solve for some of our problems. We also looked at commercial software that could solve other problems. We pretty much did a buy-versus-build calculus and considered writing our own software to solve for some of it. At the end of the day, it was really just the one-stop-shop approach that drew us into Kentik.</p> <h3 id="q-how-does-packet-use-kentik">Q: How does Packet use Kentik?</h3> <p><strong>Adam:</strong> We no longer have engineers that are flying blind and trying to use disparate tooling or manual methods to figure out what’s going on. We have a dashboard in front of us. We immediately know what’s going on with our network at any point in time.</p> <p><strong>Zac:</strong> Recently, we had an issue where our internal observability and metering platform for traffic management had a bug, and we were able ─ using flow records from the Kentik platform and writing custom queries, pretty easily through the GUI and our API ─ to come up with a secondary way to audit our billing for our customers. That saved us days and days of time and potentially lost revenue and resulted in accurate billing for our customers.</p> <p>The other thing we’ve been able to do recently is to help our customers identify where they should be putting more of their infrastructure. We have 23 locations around the world. For some of our customers, it’s hard to determine, “Well, where should I be placing my infrastructure?” That’s a really hard situation because something you might think is, “Oh, I really want to reach Europe, so I should put my infrastructure in Amsterdam.” Except that actually, maybe 50-60% of your flows go to other SaaS companies or clouds or infrastructure partners. Being able to take reports from Kentik and show, on a per-customer basis, the destinations and types of traffic, we’re able to help through our sales process and come to a better answer for our customers.</p> <h3 id="q-how-does-packet-use-kentik-for-network-security">Q: How does Packet use Kentik for network security?</h3> <p><strong>Adam:</strong> As a cloud provider, DDoS is a core focus here at Packet. 
We deal with many incidents per day or per week, and Kentik gives us the visibility we need to address the incidents in a professional manner and without impact to our customers.</p> <p>Beyond basic DDoS, we also use Kentik to identify any sort of internet vulnerabilities or any customer servers that might be compromised and being used in a malicious manner, so that we can remove that from our network in a quick and automated manner as well.</p> <h3 id="q-what-is-the-roi-packet-gets-from-kentik">Q: What is the ROI Packet gets from Kentik?</h3> <p><strong>Zac:</strong> One of the most delightful surprises of using Kentik is that they’ve taken a really, really hard problem around how to analyze billions of flow records, across a super complex thing like a global IP network like ours, and turned it into something that’s really enjoyable to use.</p> <p>We’re able to take users who may be from our customer service team or developers who work on our platform, who have very little network knowledge, and they’re able to leverage this Kentik platform to do their jobs.</p> <p>On the flip side, you can take a 20-year network veteran, who’s literally tried every single tool and built the biggest platforms in the world, and they’re able to find value out of Kentik every single day.</p> <h3 id="the-takeaway">The takeaway</h3> <p><strong>Zac:</strong> The one thing that is simply not a negotiable aspect these days is your network, so any investment you can make, such as with tools like Kentik, really can give you the insights you need to know what is worth doing, where you should spend your time, and how you can react when there are issues.</p> <p>The ROI for us is that, instead of building these tools in a subpar manner, we’re able to leverage a best-in-class product from Kentik, let our engineers get back to work, delivering on product or insights or bringing value to our customers.</p> <hr> <p>To see how Kentik can help your organization, <a href="#demo_dialog" title="Request your Kentik demo">schedule a demo</a> or <a href="#signup_dialog" title="Start a Free Kentik Trial">start a free trial</a>.</p><![CDATA[Beignets and Brainstorms: Work Meets Play in the French Quarter]]><![CDATA[Discover how our engineering team mixed strategy with socializing during an unforgettable offsite in New Orleans. From team-building sessions to airboat adventures and lively evenings in the French Quarter, the trip showcased our commitment to creating a vibrant, connected team culture.]]>https://www.kentik.com/blog/beignets-and-brainstorms-work-meets-play-in-the-french-quarterhttps://www.kentik.com/blog/beignets-and-brainstorms-work-meets-play-in-the-french-quarter<![CDATA[Christina Barr]]>Fri, 18 Oct 2019 05:00:00 GMT<h2 id="strategizing-in-the-heart-of-nola">Strategizing in the heart of NOLA</h2> <p>Recently, our VP of engineering and technical operations, <a href="https://www.kentik.com/blog/kentik-engineering-an-introduction/">Mike Ho</a>, led our team of engineers to New Orleans for an unforgettable work offsite. The days were filled with productive sessions, but the team also dove headfirst into the city’s rich culture and some well-earned after-hours fun.</p> <p>The group, flying in from around the globe, gathered at the historic Hotel Monteleone in the heart of the French Quarter — did you know this hotel has a carousel bar? 
As a remote-first company, Kentik knows how valuable it is to get everyone in the same room (or airboat) for a few days to focus on a curated agenda and some in-person bonding.</p> <h2 id="collaboration-meets-creole-cuisine">Collaboration meets Creole cuisine</h2> <p>But it wasn’t all work: smaller groups ventured out to sample New Orleans’ many flavors in the evenings, with one final group dinner to wrap things up. There were also a few special gifts for the engineers, like the NES Classic and Sega Genesis consoles (which, naturally, saw some immediate action after dinner).</p> <h2 id="alligators-airboats-and-teamwork">Alligators, airboats, and teamwork</h2> <p>This wasn’t just about meetings, though. One of the most memorable activities was an airboat ride through the swamps, a perfect adrenaline-pumping break from the sessions. Imagine: a 30-person airboat (yep, big enough for the entire engineering squad) cutting through the Louisiana bayou, everyone watching wide-eyed as the captain dangled his arms in the water to lure in some gators. Thankfully, no engineers were volunteered to help!</p> <img src="//images.ctfassets.net/6yom6slo28h2/2umzLOa30FyGYCyPIV9wed/1e03a09d7e7874ee13c083087c265941/airboat2019.jpg" style="max-width: 500px;" class="image center" alt="Kentik engineering team NOLA trip" /> <h2 id="teamwork-with-a-twist-of-jazz">Teamwork with a twist of jazz</h2> <p>The New Orleans experience allowed the team to bond in a unique, exciting environment that took them far beyond the usual workplace interactions. From collaborative sessions during the day to jazz-filled nights in the French Quarter, the trip left the team with unforgettable memories and a renewed sense of camaraderie, proving that sometimes, the best ideas come from a little Creole magic.</p><![CDATA[VMworld 2019 Wrap-up: VMware’s Key Focus is Shifting...]]><![CDATA[Kentik's Jim Meehan and Crystal Li share their insights from VMworld 2019, with a focus on news from the worlds of networking, multi-cloud, Kubernetes and security.]]>https://www.kentik.com/blog/vmworld-2019-wrap-up-vmwares-key-focus-is-shiftinghttps://www.kentik.com/blog/vmworld-2019-wrap-up-vmwares-key-focus-is-shifting<![CDATA[Jim Meehan, Crystal Li]]>Fri, 27 Sep 2019 07:00:00 GMT<p>VMworld is VMware’s global conference for virtualization and cloud computing. Started 15 years ago with the theme “Join the Virtual Evolution,” the event now goes far beyond just virtualization, with sessions and vendors covering a broad range of cloud technologies. Many consider it one of the industry’s top digital infrastructure events.</p> <p>This year, Kentik joined as a sponsor and exhibitor for the first time at VMworld US in San Francisco. We got the chance to meet and talk with many of the individuals and businesses who contribute to and influence the digital experience transformation in the IT industry.</p> <p>In this post, we share some of what we learned at the event—from what was announced in the general sessions, to what we heard from attendees during conversations in the Kentik booth.</p> <h2 id="multi-cloud-kubernetes-security-and-more">Multi-Cloud, Kubernetes, Security and more…</h2> <p>From all the key announcements, we can clearly tell that the 20-year-old virtualization pioneer, VMware, is riding the wave of modern containerized applications. VMware’s current strategy focuses on three key areas: <strong>Kubernetes, multi-cloud</strong> and <strong>security</strong>. 
These are closely aligned with where the IT industry is moving in general.</p> <p>Let’s review a few important product announcements for each.</p> <h3 id="kubernetes">Kubernetes</h3> <ul> <li><a href="https://blogs.vmware.com/cloudnative/2019/08/26/vmware-completes-approach-to-modern-applications/" target="_blank" title="VMware Tanzu Portfolio">VMware Tanzu Portfolio</a>: A portfolio of products and services to transform the way the world builds software on Kubernetes. The complete picture includes: <ul> <li>BUILD: helps customers BUILD modern applications</li> <li>RUN: ensures customers RUN a consistent implementation of Kubernetes</li> <li>MANAGE: enables customers to manage their entire Kubernetes estate from a single point of control</li> </ul> </li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/4u0axiz2Opqno9qXkFWsIu/9cf3bbdb4e1ae9eb1816efac9491b355/vmware-tanzu-1.png" class="image center" title="VMware Tanzu" style="max-width: 300px;" thumbnail /> <ul> <li><a href="https://blogs.vmware.com/vsphere/2019/08/introducing-project-pacific.html" target="_blank" title="VMware Project Pacific">Project Pacific</a>: A new architecture for vSphere with Kubernetes deeply integrated that provides the following capabilities: <ul> <li>vSphere with Native Kubernetes</li> <li>App-focused Management that enables app-level control for applying policies, quota and role-based access to developers</li> <li>Dev &#x26; IT Ops Collaboration with a consistent view via Kubernetes constructs in vSphere</li> </ul> </li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/5Y4TcPkDxoe6ebPbbPtT43/14264b50da99c2632d4f802206ef5f0c/vmware-project-pacific-1.png" class="image center" title="VMware Project Pacific" style="max-width: 300px;" thumbnail /> <ul> <li>Acquisition announcement of <a href="https://www.vmware.com/company/news/releases/vmw-newsfeed.VMware-Signs-Definitive-Agreement-to-Acquire-Pivotal-Software.1905769.html" target="_blank" title="VMware to Acquire Pivotal Announcement">Pivotal</a>: Provides a cloud-native platform (Pivotal Cloud Foundry) for building and deploying next-generation applications.</li> </ul> 
<h3 id="multi-cloud">Multi-cloud</h3> <ul> <li>Expansion of recent partnerships in VMware Cloud on AWS: Aims to help customers migrate and modernize applications with consistent Infrastructure and operations.</li> <li>Acquisition announcement of <a href="https://blogs.vmware.com/networkvirtualization/2019/08/avi-networks-same-mission-new-home.html/" target="_blank" title="VMware Acquires Avi Networks">Avi Networks</a>: A multi-cloud application services platform that provides software for the delivery of enterprise applications in data centers and clouds—e.g., load balancing, application acceleration, security, application visibility, performance monitoring, service discovery and more.</li> <li><a href="https://blogs.vmware.com/apps/2019/08/distributed-machine-learning-on-vsphere-leveraging-nvidia-vgpu-and-mellanox-pvrdma.html" target="_blank" title="VMware Nvidia Announcement">Partnership with Nvidia to offer virtualized GPUs</a>: Either on-premise or as part of VMware Cloud on AWS.</li> </ul> <h3 id="security">Security</h3> <ul> <li>VMware’s intention to acquire <a href="https://www.vmware.com/company/news/releases/vmw-newsfeed.VMware-Enters-Definitive-Agreement-to-Acquire-Carbon-Black.1905770.html" target="_blank" title="VMware to Acquire Carbon Black">Carbon Black</a>: A security company that focuses on securing modern cloud-native workloads</li> </ul> <h2 id="keyword-consistent">Keyword: Consistent</h2> <p>“Consistent” is the word we heard many times throughout the conference. VMWare’s <a href="https://cloud.vmware.com/" target="_blank" title="VMware cloud">cloud</a> mission is to provide consistent operations across diverse infrastructures including hybrid and native public clouds. The consistency concept is reflected in many themes including <strong>Cloud Migration, On-Demand Scaling, Multi-cloud operations,</strong> and <strong>Modern App Architecture</strong>.</p> <p>We saw quite a lot of conference sessions explicitly talking about this, such as:</p> <ul> <li>Run Upstream Kubernetes <strong>Consistently</strong> Across Clouds with VMware Tanzu</li> <li><strong>Consistent</strong> Load Balancing for Multi-Cloud Environments</li> <li>Apply <strong>Consistent</strong> Security Across VMs, Containers, and Bare Metal</li> <li>NSX Cloud: <strong>Consistently</strong> Extend NSX to AWS and Azure</li> </ul> <p>There are two layers of consistency to think about here: <strong>Consistent Infrastructure</strong> and <strong>Consistent Operations</strong>:</p> <ul> <li>“Consistent Infrastructure” means to deploy the core building blocks of cloud computing—compute, storage, networking—using the same technologies, so users don’t have to worry about incompatibilities among the underlying cloud vendors This allows users to focus on choosing the cloud services that work best for them.</li> <li>“Consistent Operations” means providing essential operational capabilities so users have the visibility, automation, security, and governance they need to properly manage and operate their systems and apps across multiple environments. 
Consistent operations assures that users can run applications across all environments efficiently and securely.</li> </ul> <h4 id="consistency-is-also-a-kentik-story">Consistency is also a Kentik Story…</h4> <p>Kentik Provides a Single View for Diverse Network Infrastructure: <img src="//images.ctfassets.net/6yom6slo28h2/3XO9tm6lJWVFlcgfQPMeiU/d58783dc622558569f7c8bc40e3dadda/kentik-single-view-for-diverse-network-infrastructure.png" class="image center" title="Kentik: A Single View for Diverse Network Infrastructure" style="max-width: 600;" thumbnail /></p> <p>If you are looking for a modern network analytics platform that can:</p> <ul> <li>Provide a 360-degree unified view of your diverse infrastructure (hybrid-cloud &#x26; multi-cloud)</li> <li>Intelligently detect anomalies based on historical as well as real-time traffic data</li> <li>Accelerate your troubleshooting process, powered by machine learning</li> <li>Integrate quickly and easily with your existing workflows and tools, enabled by rich APIs</li> <li>Provide a genuinely end-to-end network monitoring and analytics solution</li> <li>Embrace Kubernetes and service meshes, even as they continue to evolve. Today, Istio metrics are one of many data sources ingested by the Kentik platform to enrich network flow data with service mesh context to provide complete observability for cloud-native environments.</li> </ul> <p>… then you should give Kentik a try! It’s easy to get started: <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial right now</a>.</p> <h2 id="summary">Summary</h2> <p>Tying everything together, VMware has evolved by doubling down on its strategy of offering a unified operating model for hybrid cloud. That means it is now positioning its portfolio as the VMware Hybrid Cloud Platform—managing Kubernetes apps alongside traditional apps on vSphere, across on-prem and cloud platforms.</p><![CDATA[The State of Network Automation: Don't Worry... You Aren't Behind]]><![CDATA[Kentik CTO Jonah Kowall highlights challenges and opportunities in network automation and describes how Kentik is leading the way in providing next-generation solutions for automation, notification, advanced API integrations with telemetry, and more. "Every organization has an automation goal, and there’s no doubt that network automation is not only essential for avoiding costly outages, but also helps organizations scale without putting people in the work path... The problem is that every organization has a storied history of automation tools, meaning we already have at least a dozen of them in our organizations across various silos and stacks, some of which are commercial and some are open source."]]>https://www.kentik.com/blog/the-state-of-network-automation-dont-worry-you-arent-behindhttps://www.kentik.com/blog/the-state-of-network-automation-dont-worry-you-arent-behind<![CDATA[Jonah Kowall]]>Mon, 23 Sep 2019 07:00:00 GMT<p>Every organization has an automation goal, and there’s no doubt that network automation is not only essential for avoiding costly outages, but also helps organizations scale without putting people in the work path. 
This is how Google is able to manage millions of servers running billions of containers each day, and how cloud-native companies have constructed their applications on top of new infrastructure underpinned by Kubernetes.</p> <p>The problem is that every organization has a storied history of automation tools, meaning we already have at least a dozen of them in our organizations across various silos and stacks, some of which are commercial and some are open source.</p> <p>Within the network domain, these older tools are often the NCCM-style tools that automate repetitive tasks (examples include ManageEngine, Micro Focus, SolarWinds, and other vendor-specific tools such as those from Cisco, Arista, and Juniper). There have also been some new entrants, and some great open-source options, thanks to contributions by DigitalOcean in the form of Netbox. Aside from these NCCM tools, many organizations are also adopting network orchestration tools that promote infrastructure as code, and DevOps cultures and methodologies.</p> <p>In Gartner’s recent Market Guide for Network Automation, 2018, a survey of 205 network professionals shows adoption of Linux tools (such as Chef/Puppet/Ansible) for network automation as the most common approach (at about a third of respondents).</p> <p>These DevOps tools can typically manage multiple types of infrastructure. Most commonly seen on the network side is <strong>Ansible</strong>, but these implementations are often augmented by Napalm or generic Netconf/YANG in Python. Ansible is essentially custom code or scripting (known as playbooks), and playbooks can be purpose-built and integrated with other libraries such as netmiko or nornir for those wanting to avoid making a larger time investment to learn Ansible. Clearly Python is the winner across the board here (a minimal sketch of this approach appears below).</p> <p>You may ask a few typical questions when looking at these investments of time and/or money:</p> <h3 id="why-should-i-do-this">“Why should I do this?”</h3> <p>This is a big change, as network engineers are going to have to make a significant skill jump from thinking in terms of packets, routing, devices, and terminals to <em>checking in code</em>.</p> <p>But the advantage is—once you adopt these practices—you can use various techniques to implement continuous validation contained in a release or testing pipeline. This means making sure configurations are accurate, secure, compliant with other policies, or generally of higher quality.</p> <p>The result of that is fewer network outages due to poor syntax or basic semantics causing misconfigurations, which are the most common cause of outages. Open source tools like Batfish can be used for both basic and even more advanced validation, with other alternatives as well.</p> <h3 id="why-should-i-have-my-developers-managing-infrastructure">“Why should I have my developers managing infrastructure?”</h3> <p>The right question to answer is: <em>“Why aren’t your infrastructure engineers learning some coding skills?”</em></p> <p>In today’s environment, this is no longer an option, and people who do not have development skills are not future-proofing themselves, since they cannot scale the organization effectively over time. If the network team doesn’t have these skills, a good exercise is to offer training as a professional development activity. They will thank you, and the team will scale better to meet demands.</p> <p>The challenge with DevOps tools is that oftentimes these are great solutions for those with a greenfield network within a cloud or data center environment.</p>
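<p>To make the Python point concrete, here is a minimal sketch of the kind of task these libraries automate, using netmiko to pull interface state from a router. The device address and credentials are placeholder values, and a production script or playbook would add inventory management and error handling:</p> <pre><code># Minimal netmiko sketch: connect to a device and collect interface state.
# Device details below are hypothetical placeholders.
from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",  # netmiko platform driver
    "host": "192.0.2.1",         # RFC 5737 documentation address
    "username": "netops",
    "password": "example-password",
}

with ConnectHandler(**device) as conn:
    # The same session could also push validated configuration changes.
    output = conn.send_command("show ip interface brief")
    print(output)
</code></pre> <p>The same pattern sits underneath Ansible’s network modules: a playbook task is ultimately a structured wrapper around this kind of session, which is why teams often mix playbooks with small, purpose-built Python scripts.</p>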
<p>Most organizations, however, have existing technical debt in the form of mixed legacy and modern equipment.</p> <p>The net result is islands of automation as highlighted by this EMA research poll:</p> <img src="//images.ctfassets.net/6yom6slo28h2/4LTsuKz6XmgkXw7LO6vE8G/31c39f48967c90f60facdb6fae352f81/ema-network-automation-2020-stat.PNG" class="image center thumbnail" alt="EMA research poll on islands of network automation" style="max-width: 800;" /> <p>There are too many automation tools, each of which is used for specific gear, environments, or use cases. If you believe you <em>don’t</em> need another automation tool, the real answer is that <em>you probably do,</em> if you want to enable advanced use cases around CI and CD.</p> <h2 id="telemetry-and-network-automation">Telemetry and Network Automation</h2> <p>Ultimately we at Kentik believe that automation is a key area for the integration of telemetry to drive several use cases. The automated network management capabilities <a href="https://newsroom.cisco.com/feature-content?type=webcontent&articleId=1816954" target="_blank">promised</a> by large vendors, <a href="https://www.datacenterknowledge.com/networks/intent-based-networking-data-center-cisco-vs-juniper" target="_blank">again</a> and <a href="https://blogs.cisco.com/analytics-automation/cisco-ai-network-analytics-making-networks-smarter-simpler-and-more-secure?oid=psten017292&ccid=cc000098&dtid=oblgzzz000659" target="_blank">again</a>, are focused on the small problems. Examples include automating the fixing of WiFi issues (rebooting or resetting ports), or pushing out an ACL.</p> <p>While these are valid use cases, they’re issues that our legacy NCCM tools can already solve. These are <em>not</em> the problems which network operators spend time on—we’ve already automated them for the most part. The concept and goal of a fully closed-loop system sounds wonderful, but in today’s heterogeneous networks it’s not yet feasible, especially as we augment our infrastructures with the public cloud. We can drive more efficient operations by providing network professionals with better information, more easily accessible, from within their existing workflows or goals.</p> <p>No matter how far along you are in your transformation or optimization, there will always be other organizations who are more advanced, and plenty who are far behind. Everyone is grappling with multiple stacks and the complexity behind them. You are not alone, nor are you behind everyone. The industry is evolving, and as more operators have more provisioning, operations, debugging, and remediation workflows automated and orchestrated, we’ll move as an industry closer to enabling the promise of closed-loop automation.</p> <h2 id="chatops">ChatOps</h2> <p>In terms of real-world action your peers are taking today, many of their workflows are routed through the tools they use to interact within the networking group and across their enterprise.</p> <p>For real-time collaboration, they are often using Slack, Microsoft Teams, or other common chat platforms to augment email and other messaging platforms. Every organization has adopted something at this point to increase collaboration and teamwork. 
Many have extended these platforms, and the more advanced products in this space have many bots and integrations.</p> <p>One such integration is into tooling.</p> <p>We have hooked Kentik up to these systems to be able to answer questions about the network devices, paths, networks, and even to pull information about telemetry and alarms. This is extremely useful when coupled with integrations to automation systems like Ansible and Netbox. Kentik has been working with a consulting partner, Network to Code, who has built this exact type of system.</p> <p>As this integration matures, we will be releasing more details as to how you can get this in your environment or get help to have it implemented for your specific uses. <strong>You can get a preview at Ansiblefest.</strong> Please <strong><a href="mailto:[email protected]" title="Email Jonah Kowall">reach out to schedule a meeting with me</a></strong>, as I will be attending.</p> <h2 id="integration-into-workflow-tools">Integration into Workflow Tools</h2> <p>For organizations who haven’t made the full leap into ChatOps, or require work integrated across groups, we see automation-related workflows integrating with ticketing systems, notification systems, and other such tools to be kept alerted of things beginning to go wrong, or—in the worst case—once there is a major problem.</p> <p>This type of automation is essential for any network operator and these, in turn, can drive or at least trigger automation, even if partly human-driven once initiated. This is precisely why <a href="https://www.nelsonfrank.com/blog/the-operational-advantages-of-unifying-itsm-and-itom/" title="Read my blog post The Operational Advantages of Unifying ITSM and ITOM at Nelson Frank for more detail." target="_blank">ServiceNow has these capabilities</a> in a single platform, and how that can be very beneficial to teams. Kentik integrates with ServiceNow for this reason as a standard notification channel and, for many of our customers, ServiceNow is a critical hub of response to technical issues and their remediation.</p> <h2 id="use-of-telemetry-in-custom-scripts">Use of Telemetry in Custom Scripts</h2> <p>Advanced users of Kentik integrate with our flexible APIs to access any data within Kentik to drive their own custom solutions. This could be verifying connectivity, traffic patterns, performance, and usage.</p> <p>As an organization, we are driving towards making these types of integrations easier and enabling best practices: Using expertise from customers and team members who run the most demanding and complex networks, augmented by machine learning to provide automated analysis. As these methods advance, we will be creating a new closed-loop system to make teams more efficient.</p> <p>Customers today drive automation from Kentik to make operational decisions such as changing routing, deflecting threats, or scrubbing DDoS traffic. These advanced use cases are only possible with our real-time view of network traffic and anomalies within that traffic.</p> <p>Hopefully, this gives you some good ideas as to what you can use automation and telemetry to drive. 
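</p> <p>As one concrete illustration of this custom-script pattern, the sketch below polls a traffic-analytics API and posts a chat notification when a link crosses a utilization threshold. The endpoint, payload shape, and field names are illustrative assumptions, not a definitive API reference:</p> <pre><code># Hypothetical sketch: query a traffic-analytics API, alert a chat webhook.
# The endpoint, payload, and field names are assumptions for illustration.
import os
import requests

API_URL = "https://api.example.com/query/topx"
HEADERS = {"X-Auth-Email": os.environ["API_EMAIL"],
           "X-Auth-Token": os.environ["API_TOKEN"]}
query = {"dimension": ["InterfaceID"], "metric": "bits_per_sec",
         "lookback_seconds": 300}

rows = requests.post(API_URL, json=query, headers=HEADERS, timeout=30).json()
for row in rows.get("results", []):
    if row["bits_per_sec"] > 8_000_000_000:  # example threshold: 8 Gbps
        requests.post(os.environ["CHAT_WEBHOOK_URL"], timeout=10,
                      json={"text": "High utilization on interface "
                                    + str(row["InterfaceID"])})
</code></pre>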
<p>If you have any comments or questions, please <a href="/contact/">contact our team directly</a> or <a href="#demo_dialog" title="Request your Kentik demo">request a demo</a>.</p><![CDATA[How to Maximize the Value of Streaming Telemetry for Network Monitoring and Analytics]]><![CDATA[Kentik explains the advantage that streaming telemetry (also known as streaming network telemetry) brings to network analytics and our approach to leveraging streaming telemetry for maximum value.]]>https://www.kentik.com/blog/how-to-maximize-the-value-of-streaming-telemetryhttps://www.kentik.com/blog/how-to-maximize-the-value-of-streaming-telemetry<![CDATA[Aaron Kagawa, Crystal Li]]>Wed, 28 Aug 2019 07:00:00 GMT<p>Streaming telemetry is no longer an unfamiliar term in the network monitoring realm. In fact, interest in streaming telemetry has been increasing over recent years, while interest in SNMP (Simple Network Management Protocol) is falling, according to Google Trends:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5su8oA3HJl97ugw70eltLT/e2aeee31a061d293a9e9442c079b9b85/streaming-telemetry-search-popularity.png" class="image center" alt="streaming telemetry versus snmp search popularity" style="max-width: 700;" thumbnail /> <h2 id="what-is-streaming-telemetry">What is Streaming Telemetry?</h2> <p>Streaming telemetry, also known as streaming network telemetry, is an innovative method of real-time data collection. Network devices such as routers, switches, and firewalls continually send data about the network’s health and functionality to a centralized location. This system provides a robust platform to access a wide array of metrics that modern network devices generate, effectively addressing the challenges posed by next-generation networks. Streaming telemetry is the comprehensive practice of transmitting measurements from various sources to a receiving station for storage and further analysis, ultimately streamlining network management.</p> <p>Streaming telemetry uses a push-based mechanism that transmits data automatically and continuously from various remote network devices (e.g., routers, switches, firewalls, etc.) to a central repository. Selecting a proper telemetry architecture can potentially remove many issues—such as security, scaling, polling gaps, and resource utilization of the polled device—around sending and receiving streaming telemetry data.</p> <p>In this blog post, we will review the <strong>current state</strong> of streaming telemetry and its ecosystem, discuss our take on the <strong>value that streaming telemetry brings</strong> to the network analytics table, and outline <strong>Kentik’s approach</strong> to powering up network teams by leveraging streaming telemetry.</p> <h2 id="streaming-telemetry-where-are-we-now">Streaming Telemetry: Where are We Now?</h2> <h3 id="core-technology">Core Technology</h3> <p>Streaming telemetry can potentially accelerate <strong>network troubleshooting, automation,</strong> and <strong>traffic optimization</strong>. 
Core components of the technology to support those goals include:</p> <ul> <li>Near real-time network data, achieved with <strong>push-based</strong> data collection</li> <li>A programmatic way of configuring and managing network devices, achieved by a <strong>data model</strong>, which describes the specific metrics and metadata to include (for example, <a href="https://tools.ietf.org/html/rfc6020" title="YANG - A Data Modeling Language for the Network Configuration Protocol" target="_blank">IETF YANG</a>, <a href="http://www.openconfig.net/" title="OpenConfig: A vendor-neutral, model-driven network management designed by users" target="_blank">OpenConfig</a>, and other vendor-proprietary models)</li> <li>A highly scalable <strong>architecture and framework</strong> with more data point granularity and superior performance</li> </ul> <h3 id="streaming-telemetry-vs-snmp">Streaming Telemetry vs. SNMP</h3> <p>As network complexity increases, especially in large enterprises, traditional monitoring methods like SNMP face real-time visibility and scalability challenges. Streaming telemetry emerges as a modern alternative, addressing the limitations of SNMP in various aspects.</p> <p>SNMP (Simple Network Management Protocol) is a pull-based model where the monitoring system periodically requests data from network devices. This method can cause delays in detecting issues and consumes considerable resources on both the monitored device and the monitoring system. Additionally, SNMP lacks a standardized data model, leading to inconsistencies in the data collected from different vendors.</p> <p>On the other hand, streaming telemetry uses a push-based model that allows network devices to stream data continuously and automatically to a centralized location. This approach enables near real-time network data collection, significantly improving the visibility and responsiveness of network monitoring. 
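</p> <p>To make the push model concrete, here is a minimal sketch of a dial-out style collector, assuming a device has been configured to export JSON-encoded telemetry over UDP. The port number and message fields are hypothetical, since transports and encodings vary widely by vendor (more on that below):</p> <pre><code># Minimal dial-out collector sketch: devices push JSON-encoded telemetry
# over UDP. Port and message fields are illustrative assumptions only.
import json
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 20000))

while True:
    payload, (device_ip, _port) = sock.recvfrom(65535)
    msg = json.loads(payload)
    # e.g. {"path": "interfaces/interface[name=et-0/0/1]/state/counters",
    #       "timestamp": 1566950000, "in_octets": 123456789}
    print(device_ip, msg.get("path"), msg.get("in_octets"))
</code></pre> <p>Contrast this with SNMP, where a poller has to request each counter on a timer; here the device streams continuously and the collector simply receives.</p> <p>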
Streaming telemetry also leverages standardized data models, such as IETF YANG and OpenConfig, promoting consistency in the data collected across various devices and vendors.</p> <p>By providing real-time visibility, standardized data models, and a more efficient data collection mechanism, streaming telemetry challenges SNMP as the preferred method for network monitoring and analytics in modern enterprises.</p> <h3 id="the-vendor-ecosystem">The Vendor Ecosystem</h3> <p>Major networking vendors now support streaming telemetry on many of their hardware platforms, including:</p> <ul> <li><a href="https://developer.cisco.com/docs/ios-xe/#!streaming-telemetry-quick-start-guide" title="Cisco Streaming Telemetry Guide" target="_blank">Cisco</a> - OS: IOS XE, XR and Nexus OS; Platform: ASR9K, CRS, NCS 6K</li> <li><a href="https://www.juniper.net/documentation/en_US/junos/topics/concept/junos-telemetry-interface-oveview.html" title="Juniper Junos Streaming Telemetry Interface" target="_blank">Juniper</a> - OS: Junos OS; Platform: MX, QFX, EX, vMX</li> <li><a href="https://www.arista.com/en/solutions/software-defined-network-telemetry" title="Arista Network Telemetry" target="_blank">Arista</a> - EOS</li> <li><a href="https://infocenter.nokia.com/public/7750SR150R1A/index.jsp?topic=%2Fcom.sr.system.mgmt%2Fhtml%2Ftelemetry-intro.html" title="Nokia System Management Guide: Telemetry" target="_blank">Nokia</a> - SR OS</li> <li>Ciena</li> <li>Infinera</li> <li>… and many more.</li> </ul> <p>We are also seeing many new <strong>open-source projects</strong> related to streaming telemetry, such as:</p> <ul> <li><a href="https://blogs.cisco.com/sp/introducing-pipeline-a-model-driven-telemetry-collection-service" title="Introducing Pipeline: A Model-Driven Telemetry Collection Service" target="_blank">Pipeline</a> (backed by Cisco)</li> <li><a href="https://github.com/Juniper/open-ntiy" title="Open NTI on GitHub: Open Network Telemetry Collector build with open source tools" target="_blank">OpenNTI</a> (backed by Juniper)</li> <li><a href="https://github.com/aristanetworks/goarista/" title="Arista Go library on GitHub: Fairly general building blocks used in Arista Go code and open-sourced for the benefit of all." target="_blank">GoArista</a> (backed by Arista)</li> </ul> <h2 id="streaming-telemetry-state-of-adoption">Streaming Telemetry: State of Adoption</h2> <p>With increasing interest from major technology companies and growing support from network hardware vendors, we’re seeing that early deployments of streaming telemetry have picked up speed in recent years, especially in organizations with large-scale infrastructures.</p> <p>However, since the technology is still not standardized, there are many choices and variables, leading to <strong>different flavors of streaming telemetry</strong> that could make deployment more complex and slow down adoption. 
For example:</p> <ul> <li><strong>Transport options:</strong> Many choices like TCP, UDP, and gRPC</li> <li><strong>Session Initiation options:</strong> Dial-out (i.e., the device sends data to the collector) versus Dial-in (i.e., the collector connects to the device)</li> <li><strong>Encoding options:</strong> Choices of JSON, XML, and Google Protocol Buffers (GPB)</li> </ul> <p>There’s still a long way to go in standardizing streaming telemetry interfaces, which will ultimately likely boil down to either (1) picking a winner or (2) coming up with some best practice solutions and reference guides on when to use each option.</p> <p>A long-term commitment to consistency and effort (from both networking vendors and the open-source community) will be required to move the technology forward over the coming years. As such, we expect that SNMP and streaming telemetry will <em>coexist</em> for a very long time.</p> <h3 id="the-gnmi-protocol">The gNMI Protocol</h3> <p>A new specification, <a href="https://tools.ietf.org/html/draft-openconfig-rtgwg-gnmi-spec-01" title="gNMI, a network management protocol based on the gRPC framework" target="_blank">gRPC Network Management Interface (gNMI)</a>, is currently one of the main efforts to standardize streaming telemetry and other areas of network management. gNMI is a <a href="https://grpc.io" title="gRPC: A high performance, open-source universal RPC framework" target="_blank">gRPC-based</a> protocol for state management on network elements. Current participants in the project include big tech brands such as Google, Facebook, Microsoft, Apple, Netflix, AT&#x26;T, T-mobile, Comcast, and others.</p> <p>From a streaming telemetry perspective, the goal of gNMI is to <strong>normalize and control telemetry streams</strong> across multiple vendors with <strong>consistent data elements and interfaces</strong> for data collection.</p> <h3 id="combining-streaming-telemetry-with-other-data-sources">Combining Streaming Telemetry with Other Data Sources</h3> <p>From the network operations perspective, streaming telemetry can improve efficiency in many use cases, including:</p> <ul> <li><strong>Detecting problems</strong> by setting up network monitors and alerts based on pre-configured thresholds or network performance baselines</li> <li><strong>Troubleshooting</strong> connectivity and performance issues</li> <li><strong>Planning for network capacity</strong> according to usage and budgets</li> <li><strong>And much more…</strong> especially when we can use AI or machine-learning techniques to make automated decisions based on telemetry data.</li> </ul> <p>However, streaming telemetry shouldn’t be the <em>only</em> data source that drives these capabilities. As an example:</p> <p>As a network operator, let’s say that you want to be notified <strong>when utilization is high</strong> for critical backbone links. The next step would be to determine the characteristics of the traffic that are driving up utilization. For example, which applications, clients, and servers are prominent on the highly-utilized links and can thus be used to make various optimization decisions (e.g., changing traffic patterns)?</p> <p>An appropriate approach could be:</p> <ol> <li>Use streaming telemetry metrics as a set of indicators of thresholds and then</li> <li>Use NetFlow to figure out what type of traffic is causing it.</li> </ol> <p>As another example, streaming telemetry can also report <strong>real-time information on packet drops</strong> across links. 
This information can then be used via a <strong>network automation workflow</strong> to provision new paths and optimize traffic across the network.</p> <p>The idea is to <a href="https://www.kentik.com/blog/data-enrichment-will-be-the-new-correlation/" title="Related blog post: Data Enrichment Will Be the New Correlation">correlate all relevant data with multidimensional data enrichment</a> ― regardless of whether the data is sourced from streaming telemetry, network flows, or events and logs ― to see the bigger picture and learn the story behind the superficial symptoms.</p> <h2 id="kentiks-approach-to-streaming-telemetry">Kentik’s Approach to Streaming Telemetry</h2> <p>At Kentik, we’ve been evaluating the market readiness to support this exciting technology, and current customers have asked for streaming telemetry support to help take advantage of the wide variety of data that can be sourced from streaming telemetry sources.</p> <p>That’s why <strong>Kentik now officially supports an MVP release of streaming telemetry</strong>.</p> <p>We are bringing all of our innovation for flow data to streaming telemetry. Unlike traditional approaches, Kentik’s <a href="https://www.kentik.com/product/kentik-platform" title="Kentik Platform: AIOps for Network Monitoring &#x26; Analytics">AIOps platform for network monitoring and analytics</a> allows users to easily combine flow data with streaming telemetry. Kentik’s backend architecture is designed to receive a high volume of streaming data, contextualized with Interface Classification, flow enhancements, flow tagging, and more.</p> <p>Kentik gives real meaning to the data, which is one of the major differentiators compared to other tools in the market today. Legacy tools may be able to collect the data but do not provide deep insights into it.</p> <p>As shown below, Kentik <strong>ingests telemetry data at scale</strong>, just like every other type of data we collect. Then via <strong>enrichment</strong> and <strong>machine learning</strong>, Kentik surfaces potential problems in real-time so that network teams can quickly and accurately <em>respond to incidents</em>, proactively recognize and <em>prevent issues from impacting service and business</em>, and focus on <em>network optimization</em> rather than firefighting.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2tJ8HD6SzFvwW5W6dqRYGo/08a93fbbebd52de59df0e686297623d0/kentik-support-for-streaming-telemetry.png" class="image center no-shadow" alt="Kentik support for streaming telemetry" style="max-width: 700;" thumbnail /> <p>Additionally, we have a robust roadmap of capabilities that will dramatically expand the usefulness of an already-powerful technology over time.</p> <h3 id="current-support-and-roadmap">Current Support and Roadmap</h3> <p>Kentik’s product team has always employed a user-centric approach to feature development. We usually implement features in multiple phases, in order to gather feedback from customers as we iterate during the development process.</p> <p>The thought process behind this is to design a scalable mechanism for ingesting and storing data, along with UI components, so we can lay down a solid architectural foundation to leverage streaming telemetry. Second, we bring basic support to customers to gather feedback and evolve iteratively. 
Third, we combine everything, normalize the requirements, and build the workflow to collect metrics and understand the data.</p> <p>Kentik’s Phase 1 support for streaming telemetry includes:</p> <ul> <li>Direct collection of telemetry data</li> <li>Interface classification support</li> <li>Support for <a href="https://www.juniper.net/documentation/en_US/junos/information-products/pathway-pages/junos-telemetry-interface/junos-telemetry-interface.pdf" title="Junos Telemetry Interface Feature Guide" target="_blank">Juniper “gNMI” JTI</a> with UI support</li> <li>Interface metrics (partial support)</li> </ul> <p>Please <a href="mailto:[email protected]" title="Email Kentik Customer Success">contact our Customer Success Team</a> if you want to get a preview of this early version of streaming telemetry support.</p> <p>With access to the streaming telemetry features, you can get statistics and visualizations of network ingress and egress traffic, via which interfaces, with connectivity types and other relevant data:</p> <img src="//images.ctfassets.net/6yom6slo28h2/47Ch4ur7gukMa9USsTsOGG/2c21fad8616fc05ef211079c658e313b/kentik-support-streaming-telemetry-table-1.png" alt="kentik-support-streaming-telemetry-table-1"> <img src="//images.ctfassets.net/6yom6slo28h2/2xo4JjCjR8pA9NhV7ENTb6/7bf44da556fe2b75f0d5518c85bc1d5a/kentik-support-streaming-telemetry-graph-2.png" class="image center no-shadow" alt="Kentik visualization of streaming telemetry data" style="max-width: 800;" thumbnail /> <p>In subsequent phases, we will add support for more vendors (e.g., Cisco Dial-Out for ASR), full interface metrics, more sample interval options, full alerting on metrics and state changes, and much more. The goals are to eliminate gaps in visibility, understand the complete health of the network, and relate this information to applications and traffic throughout the entire infrastructure.</p> <h2 id="conclusion">Conclusion</h2> <p>Digital businesses drive the fastest revenue growth in history, and networks underpin all of it. New network monitoring and management capabilities are in urgent demand, and streaming telemetry is filling the visibility gap by providing real-time and HD-like visibility. Consuming telemetry data at scale while correlating it with all the other aspects of network context can be challenging. Kentik is well on the way to solving this difficult network monitoring problem.</p> <p>To be the first to know about our latest developments, <a href="#subscribe_dialog">subscribe to the Kentik blog</a>. You can also <a href="https://www.kentik.com/go/get-demo/" title="Request a Kentik Demo">request a personalized demo</a> to see Kentik’s powerful network analytics—and our latest streaming telemetry features—for yourself.</p><![CDATA[How Zoom Uses Kentik for Network Visibility, Performance, Peering Analytics & Improved Customer Support]]><![CDATA[Learn how enterprise video communications leader, Zoom, uses Kentik for network visibility, performance, peering analytics and improved customer support. 
Zoom's Alex Guerrero, senior manager of SaaS operations, and Mike Leis, senior network engineer, share how they use Kentik to help Zoom deliver "frictionless meetings."]]>https://www.kentik.com/blog/new-customer-video-how-zoom-uses-kentik-for-network-visibility-performance-peering-analyticshttps://www.kentik.com/blog/new-customer-video-how-zoom-uses-kentik-for-network-visibility-performance-peering-analytics<![CDATA[Michelle Kincaid]]>Tue, 13 Aug 2019 14:00:00 GMT<p>Zoom is the leader in modern enterprise video communications. The company is fast-growing and known for the frictionless experience they provide via their easy, reliable cloud platform for video and audio conferencing (and much more).</p> <p>With technology at the heart of the company, Zoom has 17 global data centers, including five in the U.S. The company’s extensive global network connects all of those data centers and is a critical part of the frictionless, always-on service Zoom provides to customers.</p> <p>In this video, Zoom’s Alex Guerrero, senior manager of SaaS operations, and Mike Leis, senior network engineer, discuss how network analytics from Kentik enables Zoom to see its network at a global scale.</p> <p>“The visibility of being able to see your entire network and all of the trends happening across every device in your entire network, all at once, is huge. I don’t know of any other tool that I can think of that does that quite the way Kentik does,” notes Mike…</p> <div class="gatsby-resp-iframe-wrapper" style="max-width: 837px; margin-left: auto; margin-right: auto; margin-bottom: 24px" > <div style="padding-bottom: 56.272401433691755%; position: relative; height: 0; overflow: hidden;"> <iframe src="https://fast.wistia.net/embed/iframe/qargzm2gqk" title="Kentik Case Study Zoom Video" allow="autoplay; fullscreen" allowtransparency="true" frameborder="0" scrolling="no" class="wistia_embed" name="wistia_embed" msallowfullscreen="" style=" position: absolute; top: 0; left: 0; width: 100%; height: 100%; "></iframe> <script src="https://fast.wistia.net/assets/external/E-v1.js" async=""></script> </div> </div> <p>With Kentik, Zoom can make informed decisions on network peering, capacity planning, and performance, and above all, can maintain its always-on customer experience. To learn more about these benefits from Kentik, check out our <a href="https://www.kentik.com/solutions/network-visibility/" title="Network Visibility from Kentik">solutions for network visibility</a> and explore our <a href="https://www.kentik.com/product/kentik-platform/" title="AIOps platform for network visibility">AIOps platform for network professionals</a>.</p><![CDATA[Data Enrichment Will Be the New Correlation]]><![CDATA[At one point, data was called "the new oil." While that’s certainly an apt description for the *insights* we can extract from data, most organizations today are finding that new data repositories and "data lakes" often don't provide the expected benefits due to the analytics challenge. 
CTO Jonah Kowall explains how advanced data enrichment techniques, leveraging AIOps technologies, can make the promise of data analysis a reality.]]>https://www.kentik.com/blog/data-enrichment-will-be-the-new-correlationhttps://www.kentik.com/blog/data-enrichment-will-be-the-new-correlation<![CDATA[Jonah Kowall]]>Thu, 25 Jul 2019 14:00:00 GMT<p>At one point, data was called “the new oil.” While that’s certainly an apt description for the <em>insights</em> we can extract from data, most organizations today are finding that new data repositories and “data lakes” often don’t provide the expected benefits due to the analytics challenge.</p> <p>In fact, <a href="https://www.kentik.com/resources/gartner-market-guide-npmd-2020/">Gartner</a> recently predicted that, “through 2022, only 20% of analytic insights will deliver business outcomes.” What this means is that—although we collect and store data—it’s what we do (or are not yet able to do) with it that counts.</p> <p>When this idea is applied to operational tooling, the reality is far worse. We’ve been touting the linkages between development, testing, operations, and automation for decades. However, the reality is that we’ve only begun to solve part of the problem. The closed-loop nature of monitoring and automation is still only being addressed in modern companies that develop a lot of custom automation code.</p> <p>In the traditional enterprise, where systems are highly variable and have different ages, the reality is different.</p> <p>Gartner has coined the term “AIOps” to advance these techniques by applying machine learning (ML) and artificial intelligence (AI) to the problem in order to better address the challenge. They point out three main areas where AIOps can be applied, which I’ll dig into further in this blog post:</p> <ol> <li>Root Cause Analysis (RCA)</li> <li>Log Analytics</li> <li>Event Correlation</li> </ol> <h2 id="root-cause-analysis">Root Cause Analysis</h2> <p>From Gartner’s list, <em>root cause analysis (RCA)</em> can benefit most from the application of AIOps techniques. This is what is occurring in the APM arena, where guided root cause analysis is becoming a common feature across market-leading products such as AppDynamics, Dynatrace, <a href="https://www.kentik.com/product-updates/november-december-2020/">New Relic</a>, and upstart Instana.</p> <p>These APM tools have their own data collection mechanisms in the form of software agents and other proprietary technologies. Granted, there are open-source agents and APIs out there, but they are largely not used for the RCA process as they lack the depth or context of these proprietary software agents.</p> <p>These tools face future challenges as the diversity of data sources increases: their need to control the inputs to their algorithms, combined with the requirement to continually build new agents and instrumentation, makes for a losing battle. That’s why there are so many emerging standards in APM for agents, APIs, and the way data is collected.</p> <p>All of these are doomed before they begin due to the diversity of and variance in the applications, languages, and frameworks. This is a futile exercise, but that doesn’t mean companies won’t spend billions of dollars trying to solve it.</p> <h2 id="log-analytics">Log Analytics</h2> <p>When it comes to creating insights out of data not generated by a monitoring system, <em>log analytics</em> immediately comes to mind. 
All infrastructure technologies and custom applications generate logs, with no standard semantics for what these messages mean or how important they are.</p> <p>Applying ML and AI techniques to this data will result in great gains in productivity for users and other systems. We’ve seen Splunk, Elastic, and Sumo Logic investing heavily in moving this market forward and embracing the application of these new analytics to logs.</p> <p>These techniques, while they improve the type of data you can extract from difficult-to-understand logs, still lack details around the <em>relationships</em> between data, such as where specific logs come from or how they relate back to a specific transaction or user. Thus, they’re great tools for basic or advanced troubleshooting, but not much else.</p> <p>Some of the log companies have begun evolving toward more advanced workflows and use cases similar to event correlation.</p> <h2 id="event-correlation">Event Correlation</h2> <p>Of all these areas, the application of ML and AI has been most dramatic in <em>event correlation</em>. These systems generally consume more structured, event-based data from other <a href="https://www.kentik.com/blog/monitoring-vs-observability-understanding-the-role-of-each/">monitoring systems</a>. Their goal is to extract what is important and what is not and determine how the events are related.</p> <p>While these tools often deal with challenges similar to log management systems, event correlation tools have a major advantage in that the problem space is smaller, more widely known, and better controlled in terms of data inputs. This has enabled innovators such as Moogsoft and BigPanda to create distinct advantages over legacy technology providers who once owned this area and domain.</p> <p>All is good in <a href="https://www.kentik.com/blog/aiops-comes-of-age-gartners-market-guide-for-aiops-platforms/">AIOps</a>… or maybe not, since we still have major pain even amidst the gains. The issue with event correlation systems is that while they ingest and analyze data, the data behind the correlation decisions is discarded or summarized to make the problem easier to solve at scale. The log analytics solutions have an entirely different problem: they do not correlate until query time, resulting in very slow performance at scale. Most of these solutions also require that the data being correlated is well-defined and either adds new data or creates additional meaning during ingestion.</p> <h2 id="multidimensional-data-enrichment">Multidimensional Data Enrichment</h2> <p>One of the key areas <em>not</em> being addressed in current approaches is that correlation is not just for analyzing monitoring events. Correlation can also be done inside of monitoring systems. This is something Gartner alludes to in its research, but there are generally very few examples offered.</p> <p><em>Data enrichment</em> is something that needs to be done to provide additional context to existing data, using completely different data sources. This is often done in a primitive manner: pulling in metadata such as tags (e.g., from a cloud platform) or orchestration engines (e.g., Kubernetes) in order to provide more context around metrics.</p>
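<p>To make this primitive version concrete, here is a minimal sketch of what single-source tag lookups typically look like. All names and values below are hypothetical; a real pipeline would join far more sources:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># A minimal sketch of primitive, single-source enrichment.
# The tag tables and record fields here are hypothetical examples.
cloud_tags = {"i-0abc123": {"env": "prod", "team": "payments"}}         # cloud platform tags
k8s_labels = {"10.0.4.17": {"namespace": "checkout", "pod": "api-7f"}}  # orchestrator metadata

def enrich(metric: dict) -> dict:
    """Attach whatever context we can find for this metric's source."""
    enriched = dict(metric)
    enriched.update(cloud_tags.get(metric.get("instance_id"), {}))
    enriched.update(k8s_labels.get(metric.get("src_ip"), {}))
    return enriched

print(enrich({"instance_id": "i-0abc123", "src_ip": "10.0.4.17", "bytes": 5120}))
# adds env, team, namespace, and pod context to the raw metric
</code></pre></div> <p>Each lookup bolts on a dimension or two, independently of the others.</p>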
<p>However, it’s never done in a multidimensional way at scale.</p> <p>The ability to overlay multiple data sets—including orchestration, public cloud infrastructure, network path, CDNs involved in the delivery of traffic… even security and threat-related data—is essential to create the additional context needed for algorithms to provide better insights.</p> <p>There are very few platforms that can accomplish this type of enrichment at scale, not only due to the data storage challenges, but also because most simply lack the ability to ingest and correlate data in this manner.</p> <p>I was not aware of these types of capabilities before joining Kentik, but the company has been building and executing this type of multidimensional enrichment, at scale, for the last several years—all in an effort to create a next-generation <a href="https://www.kentik.com/product/kentik-platform/" title="Learn more about Kentik&#x27;s AIOps platform for network monitoring and analytics">AIOps platform for the network professional</a>. Ingesting and correlating data to create additional context is required to achieve the level of situational awareness that lets network professionals make the right decisions, every time.</p> <p>As we evolve Kentik’s data platform and capabilities, we will be rethinking what is possible in order to bring the most scalable and capable platform to market.</p> <p>To be the first to know about our latest developments, <a href="#subscribe_dialog">subscribe to the blog</a>.</p> <p><strong>Related Reading:</strong></p> <ul> <li>Kentik Report: <a href="https://www.kentik.com/resources/state-of-automation-artificial-intelligence-and-machine-learning-in-network-management-2019" title="The State of Automation, Artificial Intelligence and Machine Learning in Network Management, 2019">The State of Automation, Artificial Intelligence and Machine Learning in Network Management, 2019</a></li> </ul><![CDATA[The First AIOps Platform for the Network Professional is Here]]><![CDATA[CTO Jonah Kowall introduces Kentik's AIOps solution for network management, explaining the need for evolution and innovation in the network performance monitoring and diagnostics market and providing an overview of the components of Kentik's revolutionary platform. ]]>https://www.kentik.com/blog/the-first-aiops-platform-for-the-network-professional-is-herehttps://www.kentik.com/blog/the-first-aiops-platform-for-the-network-professional-is-here<![CDATA[Jonah Kowall]]>Wed, 24 Jul 2019 07:00:00 GMT<p>Kentik has been many things to many people. Our company was created with a vision of solving <em>any</em> problem for any user wishing to analyze and mine their network traffic data. Our platform has been proven to scale to support the largest networks in the world—while providing rapid queries and responses—without rolling up data, thanks to our data architecture. Our sophisticated users, with complex needs and advanced imaginations, have been able to answer any question in near real-time, across any environment, without needing to plan in advance.</p> <p>That’s what makes Kentik unique in the <a href="https://www.kentik.com/resources/gartner-market-guide-npmd-2020/" title="Read Gartner&#x27;s Market Guide for Network Performance Monitoring and Diagnostics, 2020">network monitoring and analytics market</a>.</p> <p>The ongoing challenge is that today’s network professional is being overwhelmed by complexity and diversity, making it difficult to know what questions to ask. 
The trends towards multi-vendor, overlay-driven networking and the addition of new data centers in the public cloud are requiring network professionals to be <em>generalists</em>—needing to know a little about a lot of different networking and application delivery technologies. At the same time, these networks are <em>complex</em>, so it’s also essential for network teams to have <em>depth</em> and <em>expertise</em> at a level that’s typically scarce within any given organization.</p> <p>This is the reason Kentik has been working hard through 2019 to build features that bring forward both the questions and answers that the modern network professional needs.</p> <h2 id="aiops-from-kentik">AIOps from Kentik</h2> <p>Today, Kentik introduces the first AIOps platform specifically for network professionals. With this platform, Kentik is providing insights, visualizations, and greater capabilities to take action.</p> <p>The focus areas for Kentik are intuitive onboarding and workflows to make managing networks easier. We accomplish this both by surfacing interesting or anomalous data, and by making it much easier to implement integrations with your favorite automation technologies.</p> <p>By collecting and correlating large volumes of network traffic data, associated metadata, and other third-party data in real-time, Kentik is able to turn these disparate data sources into instant insights about business-impacting network conditions—insights which were not possible to glean before.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6CNUg38KQq1WPTIsB3j4Ti/f98aab294bdb64c6debf0c92bd085832/kentik-aiops-network-management-platform.png" class="image center no-shadow" alt="Kentik AIOps for Network Management Platform Diagram" style="max-width: 1200;" /> <p>An open, API-driven system with many pre-built integrations is necessary for today’s multi-vendor network infrastructures. Kentik has built support for many leading technologies (e.g., Streaming Telemetry and Flowspec), while also expanding our platform to support SNMP and Syslog. This means additional devices can quickly be added and supported across disparate data sources. We’ve also recently added support for Palo Alto Networks devices, Cisco ASA, Cisco NBAR and AVC, along with Cisco SD-WAN and Meraki MX product lines.</p> <p>Our integrations leverage our ability to ingest and enrich data in real time across multiple data sources. Our scalable data pipelines are called <strong><em>K/Ingest</em></strong> and <strong><em>K/Enrich</em></strong> and, with these platform services, we can correlate traffic data with threat feeds, custom DNS information, user information, and other elements not found in the network data stream itself. This correlation can be done either at query time or at ingestion time—depending on the user’s requirements for real-time alerting and actions based on the newly combined data.</p> <p>One of the top initiatives on any network professional’s short list is automation: taking action on both data and insights. With <strong><em>K/Automate</em></strong> we have built-in support for understanding traffic changes and automatically taking action: blackholing traffic by leveraging Flowspec, automatically opening tickets in ServiceNow, and even notifying your teams via PagerDuty or OpsGenie with the click of a button.</p> <p><strong><em>K/Advise</em></strong> provides the ability to surface interesting anomalies in context. 
This means when there are large deviations or other anomalies, the user is presented with that data, front and center.</p> <img src="//images.ctfassets.net/6yom6slo28h2/401X9AlXm8Ww0pOYJDTMVb/b7b4f7ae8d857f6dc171fca4bb502478/aiops-kadvise-screenshot-network-management.png" class="image center no-shadow" alt="AIOps K/Advise Network Management" style="max-width: 300;" thumbnail /> <p>K/Advise enables actionable data to be presented to the user without any configuration or other adjustments after the data is ingested into the platform.</p> <h2 id="why-now">Why Now?</h2> <p>The network monitoring market is full of stagnation, even to the point where Gartner is no longer publishing the “Network Performance Monitoring and Diagnostics (NPMD)” Magic Quadrant beyond 2019. This lack of innovation is readily apparent. Consider, for example, how on-premises appliances are still the norm.</p> <p>Gartner’s initial goal of adding the requirement of “diagnostics” into the network performance monitoring Magic Quadrant was to show that tools had to evolve from being <em>reactive,</em> packet-based technologies into <em>proactive,</em> machine learning (ML) driven systems.</p> <p>As the first AIOps platform for the network professional, Kentik has done just that. Expect to see much more as we continue rolling out capabilities in the months ahead!</p> <p>If you have suggestions or ideas, please <a href="https://www.kentik.com/contact/" title="Contact the Kentik team">contact our team directly</a> or feel free to reach out to me via <a href="mailto:[email protected]">email</a> or on Twitter <a href="https://twitter.com/jkowall">@jkowall</a>.</p> <p><strong>Related Reading:</strong></p> <ul> <li>Kentik Report: <a href="https://www.kentik.com/resources/state-of-automation-artificial-intelligence-and-machine-learning-in-network-management-2019" title="The State of Automation, Artificial Intelligence and Machine Learning in Network Management, 2019">The State of Automation, Artificial Intelligence and Machine Learning in Network Management, 2019</a></li> </ul><![CDATA[BGP and RPKI: A Path Made Clear with Kentik]]><![CDATA[Kentik's Greg Villain and Jason Philippon explain the basics of using Resource Public Key Infrastructure (RPKI) to protect against BGP prefix hijack attacks and stop them from propagating across the internet.]]>https://www.kentik.com/blog/bgp-and-rpki-a-path-made-clear-with-kentikhttps://www.kentik.com/blog/bgp-and-rpki-a-path-made-clear-with-kentik<![CDATA[Jason Philippon, Greg Villain]]>Mon, 08 Jul 2019 09:00:00 GMT<p>Border Gateway Protocol (more commonly known as <strong>BGP</strong>) is the routing protocol that makes the internet work. It is the language spoken by routers to determine how packets can be sent from one router to another to reach their final destination.</p> <p>In its early days, <a href="https://www.kentik.com/kentipedia/bgp-sflow-analysis/">BGP</a> didn’t include any method to verify whether a prefix received from a peer had the correct ASN originating it in its AS_PATH attribute. In simpler terms, any BGP-speaking network could announce itself (deliberately or not) as the origin of any prefix.</p> <p>Time and time again, this resulted in a “BGP prefix hijack,” where, in many cases, part of the internet was unreachable. In the most recent incident, which happened last week, an ISP incorrectly accepted invalid routes into their network from one of their customers. 
While this event included multiple issues, one of them was related to invalid prefixes being announced throughout the internet. In this case, it was due to more-specific prefixes that should not have been advertised.</p> <p>However, an improved routing security mechanism to make the internet routing world safer does exist. It’s called <strong>RPKI</strong>.</p> <p>Without RPKI, the only way to arm yourself and your network against propagation of BGP hijacks and other erroneous routing information is to parse the IRR (Internet Routing Registry), which even then relies on blind trust that the IRRs are correct.</p> <h2 id="what-is-rpki-resource-public-key-infrastructure">What is RPKI (Resource Public Key Infrastructure)?</h2> <p>RPKI stands for <strong>Resource Public Key Infrastructure</strong>. It is defined by <a href="https://tools.ietf.org/html/rfc6480" target="_blank" title="RFC6480 standard at IETF">RFC6480</a> and is a cryptographic method that was designed to sign BGP route prefix announcements with the correct originating AS number.</p> <p>One way to think about it is that RPKI is to BGP what DNSSEC is to DNS. It offers a way to sign and validate origination of BGP prefixes against an official and signed list of prefixes by origin ASN.</p> <p>(For more background, check out Cloudflare’s very comprehensive article about <a href="https://blog.cloudflare.com/rpki/" target="_blank" title="Cloudflare Blog: RPKI - The required cryptographic upgrade to BGP routing">RPKI and its benefits</a>.)</p> <h2 id="how-does-rpki-work">How Does RPKI Work?</h2> <p>The entire RPKI process stands outside of the BGP routing protocol itself. What that means is that the use of RPKI to validate BGP advertised data doesn’t involve the BGP protocol at all.</p> <h3 id="generating-and-signing-roas">Generating and signing ROAs</h3> <p>The first step for RPKI to work is that networks need to sign their prefixes. The result of an ASN cryptographically signing a prefix is called an <strong>ROA</strong> (Route Origin Authorization) record. The current signing infrastructure and process is handled by the five independent regional internet registries (RIRs)—AFRINIC, APNIC, ARIN, LACNIC, and RIPE—which therefore act as CAs (Certificate Authorities).</p> <p>Besides the certificate part, an ROA contains the attributes listed below, which we’ll look at in detail via a Kentik ROA:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">{ "prefix": "141.193.36.0/22", "maxLength": 24, "asn": "AS6169", "ta": "Cloudflare - ARIN" } </code></pre></div> <p>The components of this ROA are as follows:</p> <ul> <li><strong>prefix</strong> describes the prefix being signed in the ROA.</li> <li><strong>maxLength</strong> gives the highest mask that a prefix can take within the aforementioned prefix in order for it to be considered valid. In this specific example, all prefixes contained within 141.193.36.0/22 with netmasks equal to 22, 23, and 24 are considered valid. If a prefix is contained within the aforementioned one and has a /25 mask, it will then be considered invalid even though it is within a broader, valid one.</li> <li><strong>asn</strong> defines the certified origin ASN for all prefixes included in the prefix attribute, provided their mask is broader than or equal to the maxLength attribute. This allows route validation to compare the origin ASN of a received prefix against the certified owner of that prefix—a discrepancy is then seen as a hijack. 
(AS6169 is Kentik’s ASN.)</li> <li><strong>ta</strong> points to the Trusted Anchor that the ROA comes from. In this example, it shows as coming from ARIN, with the additional “Cloudflare” prefix indicating that the list of ROAs it comes from was pulled from ARIN by Cloudflare.</li> </ul> <h3 id="trusted-anchors">Trusted Anchors</h3> <p>In RPKI verbiage, Certificate Authorities are called Trusted Anchors, aka TAs, but they play essentially the same role. As explained earlier, today’s five RIRs are the TAs in the RPKI world.</p> <p>What that means is that you can only sign your own prefixes and, if they somehow get deallocated from you, the associated ROAs will get revoked.</p> <p>TAs are responsible for two elements of the RPKI chain:</p> <ol> <li>To provide the infrastructure for members to sign their prefixes, and</li> <li>To provide the consolidated list of ROAs for public consumption.</li> </ol> <h3 id="how-are-roas-used">How are ROAs used?</h3> <p>The overarching goal of RPKI is for networks to be able to compare the BGP announcements they receive against the aggregated list of ROAs, and to drop any that don’t come from the ASN they are supposed to come from.</p> <p>Since BGP has no knowledge of RPKI, another method is needed to leverage a consolidated, global list of ROAs for route validation. This method pairs routers with a server, which is aptly called a “validator”.</p> <p>Validators are tasked with:</p> <ol> <li>Pulling ROA updates from all of the known TAs, and</li> <li>Presenting the results to the routers they are paired with. Validators handle all the crypto processing of the data pulled from TAs.</li> </ol> <p>In order for this to work at scale, router vendors agreed on a lightweight protocol for routers to query validators. That protocol is called <strong>RTR</strong> (the “RPKI-to-Router” protocol). It has been defined in the successive <a href="https://tools.ietf.org/html/rfc6810" target="_blank" title="RFC6810 at IETF">RFC6810 </a>and <a href="https://tools.ietf.org/html/rfc8210" target="_blank" title="RFC8210 at IETF">RFC8210</a> standards.</p> <p>RTR-capable routers are able to query validators for ROA data and use the resulting info for route validation (in essence, deciding to keep or drop a given prefix based on ROA data).</p> <p>The process described above can be summarized in the following diagram:</p> <img src="//images.ctfassets.net/6yom6slo28h2/7tVFhMLCeBK7ubnW4Ex6FY/abe2b6767e88772f61e2b84781b7af34/rpki-roa-process.png" class="image center no-shadow" alt="RPKI ROA process" style="max-width: 495;" thumbnail /> <h2 id="how-does-kentik-fit-into-the-rpki-ecosystem">How Does Kentik Fit into the RPKI Ecosystem?</h2> <p>To support the community’s initiative to secure BGP via RPKI, Kentik has now added RPKI status visibility into our product. This provides another option for network engineers who want to join the RPKI journey. It also helps network operators understand how their routing plane would react when and if they turn on “Strict Route Validation.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/2et5hClgViwlyBx6QYD280/2377d3cf941ab090bde99b33df5f065a/kentik-rpki-routes.png" class="image center no-shadow" alt="Kentik RPKI: Strict Route Validation" style="max-width: 700;" thumbnail /> <p>Until everyone adopts RPKI, the RPKI information Kentik provides helps users determine two critical things:</p> <ol> <li> <p>If traffic is destined to prefixes originated by ASNs that don’t own them (i.e., 
hijacks), and/or if it is being announced by the right origin but more specifically than it should be.</p> </li> <li> <p>What would happen if strict route validation were to be enabled on the routers from which Kentik gets the flows and BGP feeds.</p> </li> </ol> <p>For more detail on using RPKI in Kentik, please see the related blog post, “<a href="https://www.kentik.com/blog/technical-guide-using-rpki-resource-public-key-infrastructure-in-kentik/" title="Related post: Technical Guide - Using RPKI in Kentik">Technical Guide: Using RPKI in Kentik</a>”. It provides a step-by-step example with screenshots to guide you. Not a Kentik user yet? <a href="https://www.kentik.com/go/get-demo/" title="Request a Kentik Demo">Request a demo</a>.</p> <h3 id="special-thanks-department"><em>Special Thanks Department</em></h3> <p>Kentik would not be able to tackle the RPKI challenge without the help of our industry peers. For that, we extend our thanks to <strong>Cloudflare</strong>, who (among other things) offered their help from the beginning to bootstrap and perfect Kentik’s understanding of RPKI. Additionally, they have made a great deal of well-supported, open source tooling available to the community for free. (Kentik’s RPKI implementation is built on Cloudflare’s open source library, as well as the curated list of ROAs Cloudflare offers via <a href="https://rpki.cloudflare.com/rpki.json" target="_blank" title="Cloudflare's Curated List of ROAs">https://rpki.cloudflare.com/rpki.json</a>.)</p> <p>A special thanks goes out to <strong>Louis Poinsignon</strong>, who leads the development of Cloudflare’s RPKI tool-chain (<a href="https://twitter.com/lpoinsig" target="_blank">twitter.com/lpoinsig</a>), for his phenomenal support and willingness to help. Thanks as well to <strong>Jerome Fleury</strong>, who runs the network team at Cloudflare (<a href="https://twitter.com/jerome_uz" target="_blank">twitter.com/jerome_uz</a>), for suggesting our two companies team up when Cloudflare was about to announce their involvement in the RPKI community initiative last year.</p> <p>Thanks also to <strong>Job Snijders</strong> (<a href="http://twitter.com/JobSnijders" target="_blank">twitter.com/JobSnijders</a>) and <strong>Paulo Lucente</strong> from <strong>NTT</strong> and the <strong>PMACCT project</strong>, who kindly pointed us to the PMACCT code to guide us through the implementation.</p> <p>Lastly, thanks to <strong>Aaron Weintraub</strong> from <strong>Cogent</strong> for his time in reviewing and testing our early, iterative versions for accuracy and usefulness.</p><![CDATA[Technical Guide: Using RPKI in Kentik]]><![CDATA[Crystal Li's new Technical Guide walks you through new Kentik features for supporting Resource Public Key Infrastructure (RPKI), explaining Kentik's new RPKI Validation Status and RPKI Quick Status dimensions.]]>https://www.kentik.com/blog/technical-guide-using-rpki-resource-public-key-infrastructure-in-kentikhttps://www.kentik.com/blog/technical-guide-using-rpki-resource-public-key-infrastructure-in-kentik<![CDATA[Crystal Li]]>Mon, 08 Jul 2019 08:00:00 GMT<p>This Technical Guide will walk you through new Kentik features for supporting Resource Public Key Infrastructure (RPKI), explaining the new RPKI Validation Status and RPKI Quick Status dimensions.</p>
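<p>If you’d like to inspect raw ROAs yourself while following along, here is a small, hypothetical helper that pulls the Cloudflare-curated ROA list mentioned in the previous post and prints the ROAs covering a given prefix. The field names match the ROA examples in this guide, though the exact format of the published file may evolve:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Illustrative helper (not a Kentik tool): list the published ROAs covering
# a prefix, using the Cloudflare-curated ROA export referenced in this series.
import ipaddress
import json
import urllib.request

URL = "https://rpki.cloudflare.com/rpki.json"

def roas_covering(prefix: str):
    data = json.load(urllib.request.urlopen(URL))
    target = ipaddress.ip_network(prefix)
    for roa in data.get("roas", []):
        roa_net = ipaddress.ip_network(roa["prefix"])
        if target.version == roa_net.version and target.subnet_of(roa_net):
            yield roa

for roa in roas_covering("141.193.36.0/24"):
    print(roa["asn"], roa["prefix"], roa["maxLength"], roa["ta"])
</code></pre></div>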
<p>For a more general introduction to Kentik’s RPKI capabilities, please see the related blog post, “<a href="https://www.kentik.com/blog/bgp-and-rpki-a-path-made-clear-with-kentik/" title="Blog: BGP and RPKI - A Path Made Clear with Kentik">BGP and RPKI: A Path Made Clear with Kentik</a>.”</p> <p>Resource Public Key Infrastructure (RPKI), defined by <a href="https://tools.ietf.org/html/rfc6480" target="_blank" title="RFC6480 standard at IETF">RFC6480</a>, is a cryptographic method that was designed to sign BGP route prefix announcements with the originating AS number.</p> <p>Kentik has now integrated RPKI support via the new dimensions “<strong>RPKI Validation Status</strong>” and “<strong>RPKI Quick Status</strong>” (see screenshot below), to allow users to precisely determine what would happen to their existing network traffic if they were to turn on RPKI validation for their networking equipment.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1aF3yCp57QnEkd69cMMLlY/4f768ff7452f9bd1781153c1549c6e92/rpki-tech-image-1.png" class="image center no-shadow" alt="Kentik RPKI Validation and Status Settings" style="max-width: 603;" thumbnail /> <p>These two dimensions are “destination-directional” because they correspond to routes received by routers from external peering sessions. We’ll describe these dimensions in more detail, below.</p> <h2 id="rpki-validation-status">RPKI Validation Status</h2> <p>This dimension contains the full RPKI state for a given flow. The values it can take are:</p> <h3 id="rpki-unknown">RPKI Unknown</h3> <p>No ROA has been found that is associated with the incoming traffic. While this doesn’t guarantee that the destination prefix was announced by an ROA-certified ASN, a route validator will not have the router drop this traffic.</p> <h3 id="rpki-valid">RPKI Valid</h3> <p>An ROA exists for the traffic’s destination prefix, and the BGP announcements for it come from the correct, certified ASN. This type of traffic represents the highest level of certainty around the legitimacy of the origin ASN (i.e., it’s the safest and will not be dropped).</p> <h3 id="rpki-invalid">RPKI Invalid</h3> <p>There are multiple sub-cases for this dimension, as described below:</p> <ul> <li> <p><strong>RPKI Invalid: valid covering prefix</strong>: While the traffic is seen as RPKI-invalid, Kentik has determined that there is a secondary, covering BGP announcement that is RPKI-valid. In practice, this means that the initial prefix for this traffic will be dropped in the case of RPKI strict route validation, but the covering RPKI-valid prefix will take over, leaving traffic uninterrupted.</p> </li> <li> <p><strong>RPKI Invalid: unknown covering prefix</strong>: Similar to the aforementioned case: the current BGP origin for the traffic is invalid, but there’s an existing backup announcement covering it that will prevent this traffic from being dropped, as it has an RPKI validation status of unknown. (Unknowns are not dropped: They correspond to the vast majority of traffic over the internet, until such time as RPKI is widely adopted and everyone signs all of their prefixes.)</p> </li> <li> <p><strong>RPKI Invalid: prefix length out of bounds</strong>: Traffic under this label will be dropped in the case of strict route validation. 
The prefix length seen in the BGP table falls outside the maxLength of every covering ROA.</p> </li> </ul> <p>Let’s take one of Kentik’s ROAs as an example:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">{ "prefix": "141.193.36.0/22", "maxLength": 24, "asn": "AS6169", "ta": "Cloudflare - ARIN" } </code></pre></div> <p>The 141.193.36.0/22 ROA comes with a maxLength attribute of 24. Practically, what that means is that all prefixes inside the /22 with a netmask between /22 and /24 will be seen as valid. With this in mind, 141.193.36.0/25 will be considered RPKI-invalid because /25 is outside of the /24 max length.</p> <ul> <li> <p><strong>RPKI Invalid: incorrect Origin ASN (should be AS&#x3C;AS_Number>)</strong>: If the preferred BGP route for the traffic’s destination prefix isn’t originated by the ASN specified by the ROA, an <strong>“Incorrect Origin ASN (should be AS&#x3C;AS_NUMBER>)”</strong> label will be associated with this traffic. In practice, this means that the preferred BGP route appears to be announced by a hijacking ASN. This also means that there is no valid or unknown alternative for this traffic in the BGP table. Therefore, traffic corresponding to this prefix will be entirely dropped if Strict Route Validation is enabled.</p> </li> <li> <p><strong>RPKI Invalid: explicit ASN 0</strong>: The RPKI standard also allows us to statically define prefixes that shouldn’t be trusted at all. Trust Anchors, and anyone else who leverages RPKI, can inject artificial ROAs that contain an origin ASN value of <strong>0</strong>. An ROA with <strong>ASN = 0</strong> means that any traffic coming from that prefix, and from all prefixes contained within it as defined by <strong>maxLength</strong>, will be considered explicitly invalid.</p> </li> </ul> <h2 id="rpki-quick-status">RPKI Quick Status</h2> <p>While the “RPKI Validation Status” dimension provides a lot of granularity with respect to the reasons why traffic is valid or invalid, it may not provide the best view for quickly determining what traffic will be dropped and what will not.</p> <p>The “RPKI Quick Status” dimension serves this purpose.</p>
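<p>To recap the statuses described above in code form, here is a minimal, illustrative origin-validation check in the spirit of RFC 6811. This is not Kentik’s implementation: the one-entry ROA table and helper name are hypothetical, the invalid sub-cases are collapsed into a single label, and a real validator also handles ASN 0 ROAs and millions of entries:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Minimal, illustrative route origin validation (hypothetical helper, not
# Kentik's implementation). Uses only the Python standard library.
import ipaddress

ROAS = [
    # Kentik's ROA from the example above
    {"prefix": "141.193.36.0/22", "maxLength": 24, "asn": "AS6169"},
]

def origin_validation(announced_prefix: str, origin_asn: str) -> str:
    announced = ipaddress.ip_network(announced_prefix)
    covered = False
    for roa in ROAS:
        roa_net = ipaddress.ip_network(roa["prefix"])
        if announced.subnet_of(roa_net):
            covered = True  # at least one ROA covers this prefix
            if announced.prefixlen > roa["maxLength"]:
                continue    # prefix length out of bounds for this ROA
            if origin_asn == roa["asn"]:
                return "RPKI Valid"
    # Covered by a ROA but no (length, ASN) match: invalid; otherwise unknown.
    return "RPKI Invalid" if covered else "RPKI Unknown"

print(origin_validation("141.193.36.0/24", "AS6169"))   # RPKI Valid
print(origin_validation("141.193.36.0/25", "AS6169"))   # RPKI Invalid (out of bounds)
print(origin_validation("141.193.36.0/23", "AS64512"))  # RPKI Invalid (wrong origin)
print(origin_validation("198.51.100.0/24", "AS6169"))   # RPKI Unknown (no ROA)
</code></pre></div>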
<p>The Quick Status dimension can help operators determine how traffic is going to behave globally by rolling up the previously mentioned RPKI Validation Statuses into more legible aggregates for a given network.</p> <table> <thead> <tr> <th>RPKI Quick Status</th> <th>Corresponding RPKI Validation Status</th> <th>Route Validation Behavior</th> </tr> </thead> <tbody> <tr> <td>RPKI Unknown</td> <td>RPKI Unknown</td> <td>Will not be dropped</td> </tr> <tr> <td>RPKI Valid</td> <td>RPKI Valid</td> <td>Will not be dropped</td> </tr> <tr> <td>RPKI Invalid - Covering Valid/Unknown</td> <td>RPKI Invalid - Valid Covering Prefix<br>RPKI Invalid - Unknown Covering Prefix</td> <td>Will not be dropped</td> </tr> <tr> <td>RPKI Invalid - Will be dropped</td> <td>RPKI Invalid - Prefix Length Out of Bounds<br>RPKI Invalid - Incorrect Origin ASN<br>RPKI Invalid - Explicit ASN 0</td> <td>Will be dropped</td> </tr> <tr> <td>Empty value</td> <td>Empty value</td> <td>Undetermined behavior:<br>The prefix may be in a static route<br>The prefix may be a /32 or /31<br>No AS_Path info available</td> </tr> </tbody> </table> <h3 id="an-example-leveraging-rpki-quick-status">An Example: Leveraging “RPKI Quick Status”</h3> <p>When considering Strict Route Validation, the most common task at hand is to evaluate the amount of traffic that will be dropped once it is enforced throughout the network.</p> <p>As a result, what most users will likely want to do is focus on the RPKI Invalid cases that will end up being dropped. Filtering on the Quick Status dimension can help with that task, by simply setting the following filter:</p> <img src="//images.ctfassets.net/6yom6slo28h2/52nYEQmyNaJuH1LMMBPmVs/c5a42b6880baead0f18c5605c3866d73/rpki-tech-image-2.png" class="image center no-shadow" alt="Kentik RPKI Filter" style="max-width: 618;" thumbnail /> <p>A sample query like the one below can yield useful information on what traffic will eventually get dropped:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6lkWgPpvzUSlgk4nsTqdu6/041a9e0f3973c3b62d0c84f7f10c1a5b/rpki-tech-image-query-3.png" class="image center no-shadow" alt="RPKI Query" style="max-width: 560;" thumbnail /> <ul> <li><strong>Site</strong> provides a broad level of grouping.</li> <li><strong>RPKI Validation Status</strong> displays all the different associated RPKI Validation Statuses for traffic that will end up being dropped.</li> <li><strong>Destination Route Prefix/LEN</strong> provides a reference to the prefixes that will get dropped for further investigation.</li> <li><strong>Destination AS Number</strong> allows us to detect mismatches between the BGP announcement and what RPKI expects as the “official” origin for this prefix.</li> </ul> <p>Here’s sample output from this query (click image to zoom):</p> <img src="//images.ctfassets.net/6yom6slo28h2/5bktWz5eEncos5jfOSQPyG/f60a3ec5d536ca567bc53c2b8095bd19/rpki-tech-image-4.png" class="image center no-shadow" alt="RPKI Query Sample Output" style="max-width: 1600;" thumbnail /> <p>In this example, we notice that prefix <strong>45.239.121.0/24</strong> will be dropped, but also that:</p> <ul> <li><strong>Site</strong>: This drop will be localized on the ATL1 pop</li> <li>The <strong>BGP announcement</strong> seen on the ATL1 pop says the origin ASN is <strong>AS61503</strong>, from the Destination AS Number dimension</li> <li>However, the <strong>official origin ASN</strong> for this prefix based on ROAs is <strong>AS266852</strong>, from the RPKI Validation Status dimension</li> </ul> <p>We can also drill down further to 
help identify, more precisely, where this issue comes from, and even warn the involved parties (from the official origin ASN to the transit provider conveying the hijack).</p> <p>Clicking the “Show by” contextual menu at the end of this row will show its AS PATH (click image to zoom):</p> <img src="//images.ctfassets.net/6yom6slo28h2/3NyrTmDXvQepPE7l7Xn2MS/5526af56e5e541ef2b8c405293902cdc/rpki-tech-image-5.png" class="image center no-shadow" title="RPKI AS PATH" style="max-width: 1600;" thumbnail /> <p>This AS PATH will eventually tell us that this specific 45.239.121.0/24 prefix is currently seen as originated and prepended multiple times by AS61503, via the former Global Crossing ASN, and finally by a peering session with Level3’s AS3356 (click image to zoom):</p> <img src="//images.ctfassets.net/6yom6slo28h2/583Lqetr2YSBCJ6okXXggs/53b1466386ac1a1e979356d2b8885df0/rpki-tech-image-6.png" class="image center no-shadow" title="RPKI Drilldown" style="max-width: 1600;" thumbnail /> <h2 id="rpki-dashboard">RPKI Dashboard</h2> <p>To obtain an overview of RPKI traffic more efficiently, you can create a dashboard and share it within your organization. Below is an example:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3gHGJcEFXe3DhSbRVrUlMZ/e55eccf34796424549bb8bd01093ae70/rpki-tech-image-7.png" class="image center no-shadow" title="RPKI Dashboard" style="max-width: 750;" thumbnail /> <h3 id="rpki-alerting-available-soon">RPKI Alerting (Available soon)</h3> <p>Finally, in the near future, you will be able to use Kentik Alerts to monitor invalid RPKI traffic and be proactively notified in the event of a rise in this traffic. Stay tuned for more details as additional RPKI features are enabled!</p><![CDATA[What You Should Know About the Emerging Network Analytics Landscape]]><![CDATA[Kentik's Jim Meehan and Crystal Li report on what they learned from speaking with enterprise network professionals at Cisco Live 2019, and what those attendees expect from a truly modern network analytics solution.]]>https://www.kentik.com/blog/what-you-should-know-about-the-emerging-network-analytics-landscapehttps://www.kentik.com/blog/what-you-should-know-about-the-emerging-network-analytics-landscape<![CDATA[Jim Meehan, Crystal Li]]>Wed, 03 Jul 2019 07:00:00 GMT<p><strong>An insider’s look at Cisco Live 2019 and some surprising gaps in the latest network analytics solutions.</strong></p> <p><img src="//images.ctfassets.net/6yom6slo28h2/2nkKW9CbL6R4t27OPDreFh/2517d419e3947e21499fd8f3a953808e/kentik-booth-cisco-live-2019-with-swag.png" title="Kentik Booth Info and Swag at Cisco Live 2019" style="max-width: 300px; margin-bottom: 25px;" class="image right" /> It was a valuable experience to be part of Cisco’s recent annual customer and partner conference, “Cisco Live!” in San Diego. Aside from the keynotes and sessions, it was enlightening to hang out on the expo floor and in the Kentik booth, meeting and talking with network professionals from all over the world.</p> <p>Engaging with industry peers who are looking for network monitoring solutions gave us the opportunity to hear honest opinions, learn about the frustrations and challenges people are facing, and, of course, show off Kentik’s differentiators via live demos.</p> <p>As part of our post-mortem, this post looks back at what we heard from conference attendees we spoke with. 
It also sheds light on how networking professionals think about the requirements for a modern network monitoring and analytics solution.</p> <h2 id="observations-from-the-expo-floor">Observations from the Expo Floor</h2> <p>It can take a few days to visit all of the booths on Cisco’s sprawling “World of Solutions” expo floor. Among all the solution categories, <strong>network management, monitoring, and analytics</strong> may be the most packed. Cisco on its own has multiple solutions here (including DNA Center, ACI APIC, vManage/vEdge for <a href="https://www.kentik.com/resources/integrated-advanced-sd-wan-visibility-with-kentik-and-silver-peak/">SDWAN</a>, and more)… not to mention solutions from all of the other vendors. The crowding in this sector is a testament to the high demand among enterprises for tools to monitor and analyze their network infrastructures.</p> <p>One trend we noticed is that every vendor out there claims that their product <strong>utilizes networking data,</strong> and that data-driven approaches are employed across the board. Supported data sources, therefore, are key to understanding each vendor’s focus and capabilities. Monitoring products consume flows, events, logs, metrics, packets, configurations, transaction details, and many other data resources, in different combinations, with varying degrees of success at achieving traffic-centric observability.</p> <p>With so much data utilized to achieve visibility, it’s not a surprise that <strong>AI/ML</strong> was a top buzzword at many booths. Although the AI/ML specifics are still quite hazy for most of the vendors, it’s clear that they hope to leverage AI technologies such as deep learning to learn from complex traffic data and make better decisions.</p> <img src="//images.ctfassets.net/6yom6slo28h2/aCcrAF6gI32Bvx4SgYvHL/d15a42daa5e42e477d0b1db804ea977f/cisco-world-of-solutions-cisco-live-2019.png" alt="cisco-world-of-solutions-cisco-live-2019"> <h3 id="enterprise-infrastructure-challenges">Enterprise Infrastructure Challenges</h3> <p>Before we discuss which characteristics customers were looking for in network monitoring tools at Cisco Live, let’s first summarize the recent enterprise infrastructure challenges that are driving these requirements:</p> <ol> <li><strong>Increasingly complicated network infrastructures:</strong> Driven primarily by the emergence of the cloud, networks are increasing in complexity. We see many enterprises beginning their cloud journey with SaaS adoption, followed by deploying applications in public cloud infrastructure (starting with the “lift and shift” model), then reconstructing entire applications using modern, cloud-native approaches. Each step into the cloud adds another layer of complexity when it comes to managing the networks that power everything running on top.</li> <li><strong>One networking team and many application teams:</strong> There is usually only a single networking team that’s responsible for all of the enterprise networking that powers enterprise applications—which are often developed by multiple, distributed application teams. With services today being digitized at exponential speed, enterprises need to run a lot of different types of applications on their own. 
This reality increases the workload of the network team, who must ensure every path is problem-free to maintain service availability.</li> <li>The <strong>team structure</strong> is outdated and slow to collaborate: The team structure of many enterprises is no longer suitable for complex issue triage processes. Each team has its own unique set of tools, and communication is often inefficient, significantly slowing down the <a href="https://www.kentik.com/blog/network-troubleshooting-complete-guide/">network troubleshooting process</a> when issues occur.</li> </ol> <h3 id="frustrated-netops-teams">Frustrated NetOps Teams</h3> <p>Cisco Live is a great place to listen to the stories of enterprise networking professionals.</p> <p>One director-level network architect stopped by our booth and mentioned that his team manages the entire data center, campus, WAN, and cloud network in the organization. He told us his team uses <strong>siloed tools</strong> for different use cases and in distinct parts of the network. But as a leader, he is more interested in <strong>understanding the big picture of the overall health</strong> of his company’s networking infrastructure.</p> <p>Another NetOps lead relayed a very similar situation: Her organization has many <strong>specialized and disjoint tools</strong> for different areas, and <strong>lacks a single monitoring tool</strong> that can provide <strong>end-to-end visibility</strong>.</p> <p>We also had inquiries about <strong>integration</strong> and <strong>automation</strong> capabilities from multiple solution architects, who were hoping to find a way to have all the tools somehow work together and support existing workflows.</p> <h3 id="the-end-to-end-requirement">The “End-to-End” Requirement</h3> <p>Network professionals may not be able to describe the specifics of every requirement, but they clearly understand what they <em>don’t</em> want. That is: disjoint legacy monitoring tools that frustrate them in their day-to-day work and fail to deal with new, complex infrastructure.</p> <p><strong>“End-to-end”</strong> is a phrase that these network professionals used to describe what they are looking for. In a broad sense, end-to-end could mean a solution to monitor every aspect of the network. But more realistically, it refers to a balance: decent coverage from a single product, with tight integration into the entire ecosystem.</p> <h2 id="kentik-for-enterprise-network-visibility-challenges">Kentik for Enterprise Network Visibility Challenges</h2> <img src="//images.ctfassets.net/6yom6slo28h2/6kxbLJvNNZefyfguIBBfmX/a0328060299cf9c81bdbc3c2897aa0b6/kentik-single-view-for-diverse-network-infrastructures.png" class="image center no-shadow" alt="Kentik Provides a Single View for Diverse Network Infrastructure" style="max-width: 750;" thumbnail /> <p>If you are looking for a modern network analytics platform that can:</p> <ul> <li>Provide a 360-degree unified view of your diverse infrastructure</li> <li>Intelligently detect anomalies based on historical as well as real-time traffic data</li> <li>Accelerate your troubleshooting process, powered by ML</li> <li>Integrate quickly and easily with your existing workflows and tools, enabled by rich APIs</li> <li>And provide a genuinely <strong>end-to-end</strong> network monitoring and analytics solution…</li> </ul> <p>… we encourage you to <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today. 
We’d be happy to show you how Kentik addresses all of the above-mentioned concerns and more.</p><![CDATA[The Network Traffic Analytics that Enterprises Need]]><![CDATA[The most commonly used network monitoring tools in enterprises were created specifically to handle only the most basic faults with traditional network devices. CTO Jonah Kowall explains why these tools don't scale to meet today's network visibility needs, why more enterprises are moving from faults & packets to flow, and how Kentik can help.]]>https://www.kentik.com/blog/the-network-traffic-analytics-that-enterprises-needhttps://www.kentik.com/blog/the-network-traffic-analytics-that-enterprises-need<![CDATA[Jonah Kowall]]>Mon, 01 Jul 2019 07:00:00 GMT<p>The most commonly used network monitoring tools in enterprises (e.g., SolarWinds, ManageEngine, Paessler) were created specifically to handle the basic ups/downs, or faults, with traditional network devices. While many of these tools have expanded to support other technologies, they fundamentally <em>cannot</em> handle high-scale data collection, nor are they architected for the new network. While these tools can be cost-effective and work great for basic needs, today’s enterprise networks are far from basic… and far from traditional.</p> <p>As just one example, many organizations are now running SD-WAN technologies, which require the monitoring and management of new solutions. While these SD-WAN products come with tooling, their visibility and management capabilities are lacking.</p> <h3 id="cloud-adoption-and-network-visibility-gaps">Cloud Adoption and Network Visibility Gaps</h3> <p>A more significant network visibility challenge comes in the form of <em>cloud adoption,</em> inclusive of SaaS, IaaS, and PaaS. These cloud technologies also are not handled by legacy network monitoring tools, as those tools typically do not monitor traffic—or if they do, they cannot scale to today’s new data volumes, nor are they path-aware. And none of them understand application topology, especially when running in cloud-native environments.</p> <p>As enterprises adapt and change their data centers—augmenting them with public and private cloud services—more of the resources are becoming shared among many teams within the enterprise. As part of this trend, being able to quickly identify and remediate network issues, usage spikes, and network threats is essential to delivering high-quality services. Managing network capacity and ordering circuit and connection upgrades in advance are also essential to keep the business running efficiently.</p> <p>The network is a critical component when operating a hybrid data center built upon various cloud services. For this reason, enterprises are moving from a focused approach of monitoring network <em>faults</em> and <em>packets</em> to one of monitoring network traffic via <em>flow</em> and other <em>cloud-native data sources</em>.</p> <p>A further challenge is that with cloud services, the legacy technologies to aggregate, analyze, and store packets are no longer feasible without extremely high cost and complexity.</p>
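<p>Flow records, by contrast, are compact, structured summaries of traffic that cloud platforms emit natively. As a minimal illustration, here is a sketch that parses a single AWS VPC Flow Log record in the default (version 2) format; the sample record is the SSH-traffic example from AWS’s documentation:</p> <div class="gatsby-highlight" data-language="python"><pre class="language-python"><code class="language-python"># Parse one AWS VPC Flow Log record in the default v2 format.
# Field order follows the format AWS documents for version 2 records.
FIELDS = ["version", "account_id", "interface_id", "srcaddr", "dstaddr",
          "srcport", "dstport", "protocol", "packets", "bytes",
          "start", "end", "action", "log_status"]

record = ("2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
          "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK")

flow = dict(zip(FIELDS, record.split()))
print(f'{flow["srcaddr"]}:{flow["srcport"]} to {flow["dstaddr"]}:{flow["dstport"]} '
      f'proto={flow["protocol"]} bytes={flow["bytes"]} action={flow["action"]}')
</code></pre></div> <p>Fourteen whitespace-separated fields summarize an entire flow, which is what makes collection and analysis tractable at cloud scale.</p>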
<p>The analysis of traffic data via flow technologies (including VPC Flow Logs from cloud providers such as Amazon, Microsoft, and Google, as well as agent-based approaches such as <a href="https://packagecloud.io/kentik/kprobe" target="_blank" title="Kentik kProbe at packagecloud">kProbe</a> or nProbe that generate flow from hosts) is becoming the right solution to address the visibility gaps.</p> <h3 id="how-the-cloud-natives-solve-this-issue">How the “Cloud Natives” Solve this Issue</h3> <p>Traffic analytics based on <em>flow sources</em> (whether more traditional sources like NetFlow and sFlow, or virtual private cloud flow logs) have been in heavy use by first movers, particularly those who build and deliver SaaS services—and they’ve been leveraging Kentik.</p> <p>We’re talking about the types of companies that deliver the high-performance, always-available SaaS services that enterprises depend on today (many of whom, like Kentik, are <a href="https://www.cncf.io/about/members/" target="_blank" title="Cloud Native Computing Foundation members">members of the Cloud Native Computing Foundation</a>).</p> <p>Their networks encompass not only <em>north-south traffic,</em> as data moves from their systems to the internet and into their users’ hands, but also <em>east-west traffic,</em> as data traverses their geographically-distributed physical and logical data centers. Many SaaS companies deploy and observe the network within orchestrated applications running on Kubernetes or Kubernetes-based cloud services because, as SaaS providers, their applications are extremely dynamic.</p> <h3 id="why-more-enterprises-are-going-with-the-flow">Why More Enterprises are “Going with the Flow”</h3> <p>As their existing tooling doesn’t provide the required visibility, it’s now more common for enterprises to run into issues with traffic delivery as they move and optimize applications and workloads. Enterprises also find they lack the capabilities to measure and analyze the capacity, usage, and internet topology (routing) required to deliver their services.</p> <p>With Kentik, we can proactively identify these issues and impending problems for enterprises, SaaS providers, and service providers alike. We’re also able to provide data to measure usage by location, services, or even individual teams and users with our advanced data collection and analytics platform.</p> <p>Kentik delivers modern network analytics at the scale required by the services today’s enterprises run on. And as enterprise networks become more and more like the networks those SaaS vendors have built, enterprises are seeking this same type of visibility.</p> <p>For Kentik, the challenge is not just collecting vast amounts of network data, but also enriching it at scale. You can learn more about how Kentik is making advances in monitoring not just flow data, but enriching it with <a href="https://www.kentik.com/blog/going-beyond-the-netflow-introducing-universal-data-records/" title="Kentik Blog: Going Beyond the (Net)Flow: Introducing Universal Data Records">any data element in any format</a>.</p> <p>But this is just the beginning. Stay tuned for the next chapter of Kentik. 
Coming soon…</p> <p><strong>Related Reading:</strong></p> <ul> <li><a href="https://www.kentik.com/blog/what-are-vpc-flow-logs-and-how-can-they-improve-your-cloud/" title="What are VPC Flow Logs and How Can They Improve Your Cloud?">What are VPC Flow Logs?</a></li> <li><a href="https://www.kentik.com/netflow-guide-types-of-network-flow-analysis" title="NetFlow Guide: Types of Network Flow Analysis">NetFlow Guide: Types of Network Flow Analysis</a></li> <li><a href="https://www.kentik.com/blog/the-visibility-challenge-for-network-overlays/" title="The Visibility Challenge for Network Overlays">The Visibility Challenge for Network Overlays</a></li> </ul><![CDATA[Introducing the Industry's First Visibility Solution for Virtual Routing and Forwarding (VRF)]]><![CDATA[Kentik's Aaron Kagawa and Crystal Li explain VRF (virtual routing and forwarding) and Kentik's unique solution for understanding where VRF traffic exits your network.]]>https://www.kentik.com/blog/introducing-the-industrys-first-visibility-solution-for-vrf-virtual-routing-and-forwardinghttps://www.kentik.com/blog/introducing-the-industrys-first-visibility-solution-for-vrf-virtual-routing-and-forwarding<![CDATA[Aaron Kagawa, Crystal Li]]>Wed, 19 Jun 2019 07:00:00 GMT<p>Virtual routing and forwarding (VRF) has been around for years as a technology. However, for those in charge of monitoring a network, there has never been a solution capable of providing full visibility into the end-to-end VRF traffic that flows in and out of a network.</p> <p>Kentik is changing that. We now offer an industry-first VRF visibility solution to show where VRF traffic will exit your network. This functionality goes beyond local device VRF configurations and utilizes route tables, BGP peering, and other sources to bring end-to-end visibility of VRFs that no other network monitoring tool can provide.</p> <h2 id="a-refresh-what-is-vrf">A Refresh: What is VRF?</h2> <p><a href="https://en.wikipedia.org/wiki/Virtual_routing_and_forwarding" target="_blank" title="Wikipedia: Virtual routing and forwarding">Virtual routing and forwarding (VRF)</a> is a technology that allows multiple instances of a routing table to coexist within the same router at the same time. You can think of VRFs as “logical routers” residing in one physical router, serving to automatically segregate the traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7LcZtwKjNThey4NDEgjyGS/e46f39f86dbdb7a01e397e675967ef79/unnamed.png" class="image center no-shadow" alt="Virtual Routing and Forwarding (VRF) Implementation" style="max-width: 512;" thumbnail /> <p>VRF is one of the earliest network virtualization techniques, creating multiple virtual networks within a single network entity (as illustrated above). In a single network component, multiple VRF resources create isolation between virtual networks. That’s why VRF is widely used in the infrastructure of ISPs, enterprises, research &#x26; education, and many other verticals, as the technique supports data center, peering, interconnection, and traffic engineering use cases.</p> <h2 id="the-vrf-traffic-visibility-challenge">The VRF Traffic Visibility Challenge</h2> <p>Network engineers across industries often struggle with visibility into VRF. As just one example, consider the case of ISPs. 
<h2 id="the-vrf-traffic-visibility-challenge">The VRF Traffic Visibility Challenge</h2> <p>Network engineers across industries often struggle with visibility into VRF. As just one example: Consider the case of ISPs. ISPs carry traffic for many customers over the same physical router-to-router links, and they configure VRFs to separate each customer’s traffic in order to achieve multi-tenancy.</p> <p>As a network engineer, you face many challenges in making sure all end customers can transmit business data through the pipe without any traffic leaking.</p> <p>Without VRF visibility, network engineers struggle to answer questions such as:</p> <ol> <li>What does my <strong>inbound or outbound traffic</strong> at the provider edge (PE) segmented by VRFs look like?</li> <li>How can I ensure <strong>no traffic is leaking</strong> by verifying that network partitions using VRF/VRF-lite function correctly?</li> <li>Can I visualize all traffic associated with a <strong>specific route distinguisher</strong>?</li> <li>Do the <strong>names of the VRFs</strong> that I created for a specific route distinguisher make sense?</li> <li>Can an <strong>alert be raised</strong> for a sudden change (e.g., an increase or decrease) in bandwidth for my customers at the PE distinguished by VRFs?</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/t2xIh2np7FeJi5FnJvsBM/a330d7494e2b83643ec487bc4ebd2d14/pasted_image_0.png" class="image center no-shadow" alt="Virtual Routing and Forwarding (VRF) in an ISP Context" style="max-width: 512px;" thumbnail /> <h2 id="vrf-visibility-from-kentik">VRF Visibility from Kentik</h2> <p>Kentik is set to solve end-to-end VRF visibility challenges with comprehensive coverage:</p> <ol> <li><strong>VRF Awareness:</strong> The first phase of our VRF implementation includes support for providing local device VRF visibility by enhancing flow records with VRF information. We are doing this by polling for the VRF information from the standard L3VPN MIBs. As shown in the screenshot below, there are eight new dimensions associated with VRF support, including source and destination VRF Name, VRF Route Distinguisher, VRF Route Target, and VRF Extended Route Distinguisher.</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/Z0YafWUQSBGDt8J0jBq4T/ba468eecff9a4dbac9d5ad8738ed6fb6/pasted_image_0.png" class="image center no-shadow" alt="Kentik VRF Dimensions" style="max-width: 300px;" thumbnail /> <p><strong>2. VRF Manual API &#x26; Alerting Capability:</strong> To give users programmatic control of VRF attributes associated with each interface, we added support for VRF attributes in the interface methods of our device API, which can be experimented with in the <a href="https://api.kentik.com/api/tester/v5" title="Kentik API Tester">Kentik API tester</a>. Moreover, all these VRF dimensions are also supported in alert policies.</p> <p><strong>3. Associate VRF with BGP Attributes:</strong> Recently, we also added functions to correlate VRF information with BGP, which is a true differentiator. This means we can now calculate various BGP and Ultimate Exit attributes correctly in VRF L3VPN configurations.</p> <p>A quick recap on the Kentik-patented <a href="https://www.kentik.com/blog/visualizing-traffic-exit-points-in-complex-networks/" title="Kentik Blog: Ultimate Exit: Visualizing Traffic Exit Points in Complex Networks">Ultimate Exit (UE)</a> feature: Ultimate Exit enables end-to-end visibility of the traffic, providing an easy way to visualize what volumes of traffic are flowing in and out of your network, from any source to any destination network. You can then use that information to cut costs (e.g., peering) and to more accurately estimate the cost of carrying any set of traffic for any given customer.</p>
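<p>To make the flow-enrichment idea from the VRF Awareness phase concrete, here’s a minimal sketch (our simplification, not Kentik’s production code) of how flow records might be annotated with VRF names keyed by device and interface; the table contents and field names are invented:</p>

```python
# Toy illustration: enrich flow records with VRF names at ingest time.
# VRF_TABLE stands in for data polled from the standard L3VPN MIBs;
# all devices, interfaces, and VRF names are invented.
from dataclasses import dataclass

VRF_TABLE = {
    ("pe1.example", "ge-0/0/1"): "CUSTOMER-A",
    ("pe1.example", "ge-0/0/2"): "CUSTOMER-B",
}

@dataclass
class FlowRecord:
    device: str
    in_iface: str
    out_iface: str
    src_ip: str
    dst_ip: str
    bytes: int
    src_vrf: str = ""
    dst_vrf: str = ""

def enrich_with_vrf(flow: FlowRecord) -> FlowRecord:
    """Attach source/destination VRF names based on the flow's interfaces."""
    flow.src_vrf = VRF_TABLE.get((flow.device, flow.in_iface), "default")
    flow.dst_vrf = VRF_TABLE.get((flow.device, flow.out_iface), "default")
    return flow

flow = enrich_with_vrf(FlowRecord("pe1.example", "ge-0/0/1", "ge-0/0/2",
                                  "10.0.0.5", "10.9.9.9", 1500))
print(flow.src_vrf, flow.dst_vrf)  # CUSTOMER-A CUSTOMER-B
```

<p>Once records carry these names, per-VRF grouping and filtering works just like any other flow dimension.</p>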
<p>Now, you can do even more with VRF visibility, such as (1) obtain VRF routing tables via BGP peering, (2) enhance flow records with the correct BGP UE and AS path from VRF routing tables, and (3) associate VRF information with the right routing tables to correctly associate UE.</p> <img src="//images.ctfassets.net/6yom6slo28h2/25072ANLLQKi1rEWHVleGp/9d7e12a40df7af611ee6ae9c0e7b0c46/pasted_image_0.png" class="image center no-shadow" alt="Kentik VRF Sankey Diagram" style="max-width: 700px;" thumbnail /> <p>With this capability, Kentik customers can now see, on a per-VRF basis, where the traffic is entering the network, how far they are carrying it, where it is leaving, what type of interface it leaves on (e.g., transit/peer/customer), and what the volume is. This enables them to figure out the cost of providing customer service within a VRF.</p> <h2 id="summary">Summary</h2> <p>With VRF visibility supported by Kentik:</p> <ul> <li>An <strong>infrastructure/network planner</strong> can see inbound or outbound traffic at the provider edge (PE) segmented by VRFs.</li> <li>A <strong>network operator</strong> can see all traffic associated with a specific route distinguisher (RD) or verify the names of the VRFs that are associated with a specific RD.</li> <li>A <strong>network operator</strong> can get alerts for changes (e.g., increase/decrease) in traffic volume per customer using VRF IDs to distinguish customers at the PE.</li> <li>An <strong>enterprise network</strong> can verify that VRF-lite network partitions are functioning correctly (e.g., to ensure there is no traffic leaking).</li> </ul> <p>As a result, in today’s complex network deployments, end-to-end VRF traffic visibility from Kentik allows network operations teams to understand and manage traffic in networks of all types, from source to destination, and to gain more accurate cost calculations for any given customer.</p> <p>For more VRF visibility support information, please see <a href="https://kb.kentik.com/Da02.htm#Da02-VRF_dimensions" title="Kentik Knowledgebase: VRF Dimensions">VRF Dimensions</a> in our Knowledge Base, or contact our <a href="mailto:[email protected]" title="Email Kentik Customer Success">Customer Success team</a>.</p><![CDATA[Kentik True Origin Brings CDN Insights to ISPs]]><![CDATA[Kentik's Greg Villain and Philip De Lancie explain the various benefits of True Origin™ and how ISPs can use it to see what content is most valuable to subscribers, how it impacts the network, and how to deliver a better subscriber experience. Includes a quick tutorial on setting up a Sankey chart to explore video traffic volumes.]]>https://www.kentik.com/blog/kentik-true-origin-brings-cdn-content-delivery-insights-to-ispshttps://www.kentik.com/blog/kentik-true-origin-brings-cdn-content-delivery-insights-to-isps<![CDATA[Greg Villain &#x26; Philip De Lancie]]>Tue, 04 Jun 2019 07:00:00 GMT<p>With Kentik customers now starting to take advantage of <a href="https://www.kentik.com/news/kentik-true-origin-broadband-and-mobile-providers-fast-insights/" title="Read more about Kentik True Origin">True Origin™</a>, the time is right to learn more about the benefits of this new feature-set.
In the sections that follow, we’ll explain what True Origin is, what operational needs it addresses, how it works, and the steps involved in putting it to use.</p> <p>By the end of this post you should have a pretty good idea of how True Origin helps network operations and engineering, particularly for Internet Service Providers. The following capabilities provide a taste of how useful True Origin can be:</p> <ul> <li>Get an accurate picture of what content is valuable to your subscribers, how that content is delivered, and the impact on your network.</li> <li>Identify hurdles in delivering content to subscribers and resolve them for a better subscriber experience.</li> <li>Dive deeper into embedded cache setups and better support their deployment and operation.</li> </ul> <p>To understand how True Origin helps us with the above scenarios, let’s start by putting ourselves in the place of an (eyeball) ISP with a large number of subscribers that consume content from the Internet. The content comes into the ISP network from various sources via various paths with varying cost and performance. For the ISP’s executive and engineering teams, these variations increase the complexity of meeting two main operational goals:</p> <ul> <li>Ensuring the high throughput required for a good user experience.</li> <li>Maximizing cost-efficiency.</li> </ul> <p>To succeed at these goals, an ISP needs to define a robust, performant architecture and also a set of best practices. To do that, you need real-time, high-quality information about two things: the traffic over your network and the cost to you of carrying that traffic.</p> <p>Until now, there hasn’t been a single tool that provides the required visibility—in the same place, at the same time—into both traffic and cost-of-carrying. That’s why True Origin is a major step forward for every ISP carrying a significant volume of Internet-sourced content.</p> <h1 id="what-is-true-origin">What is True Origin?</h1> <p>Kentik’s foundational mission is to bring our users network traffic visibility that’s fast, easy to consume, and highly actionable. By correlating an ever-wider set of traffic data into a single, instantly-queryable dataset (the Kentik Data Engine, aka KDE), we’re able to generate technical and business insights with direct, powerful relevance to your network operations. True Origin falls right in line with this mission, building on a set of key Kentik Detect features that are geared towards eyeball ISPs:</p> <ul> <li><strong>CDN tagging:</strong> Most Internet-based content is sourced from one of approximately 45 commercial and purpose-built CDNs. Kentik has built an engine that maintains a curated list of the source and destination IPs associated with these CDNs, so our flow records from their traffic include not just source and destination IP addresses but also the actual CDN names.</li> <li><strong>OTT tagging:</strong> Because commercial CDNs deliver traffic for multiple content providers, OTT services can’t be identified based on the server IP in a flow record. Instead, we need to be able to match the content provider from which a subscriber has requested content with the subsequent download to that subscriber.
To do that we developed OTT tagging, a powerful DNS-based methodology that enables us to identify the OTT service behind a given flow regardless of the CDN, hosting provider, or connectivity type utilized for the delivery of that service’s content.</li> <li><strong>Subscriber tagging:</strong> A subscriber tag is a reference to the individual subscriber who is at the generating or receiving end of a flow. By incorporating subscriber information into each KDE flow record we can provide ISPs with the next level of flow visibility.</li> </ul> <p>The above features are each powerful by themselves. But working together they uniquely address the questions faced daily by our ISP partners. The main goal is to provide actionable, human-readable answers about the relationship between specific content and the methods by which that content is delivered to subscribers. Here’s a look at some ISP questions that True Origin can help answer:</p> <ol> <li>Is there an ongoing content-related traffic event? Is it linked to a specific content provider?</li> <li>Does the root cause of a given issue lie with the content provider, the CDN, or my own network?</li> <li>What entities does specific content pass through on its way to my subscriber? What are the economics of that route?</li> <li>How does my throughput ranking from content providers compare to that of competing ISPs?</li> <li>To what extent are my subscribers choosing to get content directly from content providers rather than via my own IPTV offering?</li> <li>Are my subscribers getting content from the nearest possible servers?</li> <li>What content am I serving from embedded appliances, and how efficiently? Are more needed?</li> <li>Is content “shedding” from my embedded appliances to my transit or peers?</li> </ol> <p>Flow data alone (source and destination ASNs and IPs) lacks the information needed to answer the questions above. True Origin addresses that limitation by enhancing KDE flow records with additional data that enables Kentik Detect to provide richer context. The following factors are now either included in or derived from these enhanced records:</p> <ul> <li><strong>Connectivity types:</strong> Content may be delivered from outside the network via IX peering, private peering, or transit. Or it may come internally from embedded cache appliances. Each method has different performance and cost criteria.</li> <li><strong>Source and destination CDNs:</strong> Which CDN delivers what, how does each work, and how does that affect network utilization and cost?</li> <li><strong>Multi-service providers:</strong> A single content provider may operate multiple services distributing multiple forms of content. Facebook, for example, delivers not only a social web application, but also live video, two messaging applications, Instagram, and a gaming application.
It’s important to be able to evaluate the resource demands of each such service not only individually but also by content provider—largely because interconnection and troubleshooting discussions for any of the underlying OTT services will involve the parent provider.</li> <li><strong>OTT service categorization:</strong> As subscribers become cord-cutters, you need to be able to group OTT services into business categories in order to see the impact of individual types of content (e.g., video) on network utilization and costs regardless of the individual content providers involved.</li> </ul> <h2 id="how-true-origin-works">How True Origin Works</h2> <p>We’ve already implemented OTT tagging and CDN tagging as dimensions that you can filter and group-by in Kentik Detect, so to get a feel for how True Origin works we’ll start with an example that uses both features.</p> <p>Consider a simple (conceptually) transaction in which a client requests OTT content from a server, which we’ll assume in this case is managed by a CDN:</p> <ol> <li>The transaction typically begins with a DNS query from a given IP (a subscriber). The DNS server looks up the hostname, which corresponds to the content provider, and returns the IP address of the server that will be the actual source of the content. KDE receives this information via kProbe, our software host agent.</li> <li>The flow data from the transaction itself contains a source IP (the server pushing the content) and a destination IP (the subscriber downloading the content).</li> <li>As DNS responses and flow data are ingested, KDE correlates flows and DNS queries in which:</li> </ol> <ul> <li>The source IP in flow data is matched to the IP returned by the DNS query.</li> <li>The destination IP in flow data is matched to the IP from which the DNS query originated.</li> <li>The timestamps on the flow data and the DNS query are approximately the same.</li> </ul> 4. When the above conditions are met, the KDE flow record will be enriched with the CDN name (CDN tagging) and the content provider name (OTT tagging).
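<p>As a rough sketch of that correlation step (our simplification, not Kentik’s production pipeline; the field names and time window are invented for illustration):</p>

```python
# Toy correlation of DNS answers with flow records at ingest time.
# A real pipeline does this streaming and at enormous scale; this
# just illustrates the matching conditions from steps 1-4 above.
from collections import namedtuple

DnsAnswer = namedtuple("DnsAnswer", "client_ip server_ip hostname ts")
Flow = namedtuple("Flow", "src_ip dst_ip ts")

MATCH_WINDOW_SECS = 30  # assumed tolerance for "approximately the same"

def tag_flow(flow, recent_dns_answers):
    """Return the hostname behind a flow if a recent DNS answer matches."""
    for ans in recent_dns_answers:
        if (flow.src_ip == ans.server_ip          # server returned by DNS
                and flow.dst_ip == ans.client_ip  # subscriber who asked
                and abs(flow.ts - ans.ts) <= MATCH_WINDOW_SECS):
            return ans.hostname                   # drives CDN/OTT tagging
    return None

answers = [DnsAnswer("198.51.100.7", "203.0.113.10", "vod.example-ott.com", 100)]
flow = Flow(src_ip="203.0.113.10", dst_ip="198.51.100.7", ts=112)
print(tag_flow(flow, answers))  # vod.example-ott.com
```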
<p>Based on the transaction described above, True Origin may seem relatively simple in theory. In practice, though, successful execution requires resolving the following challenges:</p> <ul> <li>CDN tagging by itself isn’t sufficient to obtain a clear picture:</li> </ul> <ul> <li>CDNs carry traffic for multiple content providers on the same servers.</li> <li>Content providers typically rely on a mix of CDNs to deliver their content.</li> <li>Content providers may deliver their content via both their own infrastructure (including a variety of upstream connectivity methods and providers) and commercial CDNs.</li></ul> - Correlating DNS queries and network-generated flows into a single time-series database is hard, because you can’t do it offline and send results later. DNS queries/responses must be correlated with flows in real time as they are ingested into the flow-record datastore. - CDNs come in many different shapes and forms, each involving its own mix of infrastructure and connectivity, which brings up the following considerations that can affect performance and cost: <ul> <li>To what extent does the CDN embed servers in the network of the last mile ISP?</li> <li>To what extent does it host content-serving cache servers on its own infrastructure?</li> <li>To what extent does it peer directly with ISPs or leverage diverse transit?</li></ul> <p>With True Origin, Kentik has solved the challenges above, enabling ISPs—as well as the operators of enterprise and campus LANs—to get a clear picture of how their network utilization and costs are impacted by Internet-sourced content.</p> <p>To do so, you’ll use the following source and destination dimensions, whose values are stored in each flow record at ingest:</p> <ul> <li><strong>CDN:</strong> A content delivery network, derived as described in CDN tagging above.</li> <li><strong>OTT Service:</strong> An individual content service whose hostname is looked up via DNS.</li> <li><strong>OTT Service Provider:</strong> An entity that offers a service. For example, Facebook is the provider for the services Facebook Messenger, WhatsApp, Instagram, etc., Netflix is the provider of the service Netflix, and Google is the provider for Google Drive, Gmail, Google Maps, etc.</li> <li><strong>OTT Service Type:</strong> The nature of the content provided by the service. Values include Adult, Ads, Antivirus, Audio, Cloud, Conferencing, Dating, Developer Tools, Documents, Ecommerce, File Sharing, Gaming, IoT, Mail, Maps, Media, Messaging, Network, Newsgroups, Photo Sharing, Social, Software Download, Software Updates, Storage, Video, VPN, Web.</li> </ul> <p>The key factor in OTT tagging is the <em>hostname</em> that, as described earlier, is looked up in a DNS query. That’s the source from which we derive the OTT Service value; the other OTT dimension values are associated with the service via our curated list of providers and service types, which continually evolves as we discover and analyze new sources of traffic.</p>
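<p>Conceptually, that curated list behaves like a lookup from hostname to service metadata. A toy illustration (all entries invented; the real list is far larger and continually maintained):</p>

```python
# Toy hostname -> OTT metadata lookup. Real-world matching also has
# to cope with wildcards and constantly changing CDN hostnames.
OTT_CATALOG = {
    "vod.example-ott.com": {
        "service": "ExampleVOD", "provider": "Example Media", "type": "Video",
    },
    "chat.example-social.com": {
        "service": "ExampleChat", "provider": "Example Social", "type": "Messaging",
    },
}

def ott_dimensions(hostname: str) -> dict:
    """Resolve the OTT Service / Provider / Type dimensions for a flow."""
    return OTT_CATALOG.get(
        hostname, {"service": None, "provider": None, "type": None})

print(ott_dimensions("vod.example-ott.com")["type"])  # Video
```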
<h1 id="leveraging-interface-classification">Leveraging Interface Classification</h1> <p>Now that we understand a bit about how True Origin works, let’s start thinking about how to put it to use in Kentik Detect. Once again, we’ll be approaching this from the point of view of an ISP with eyeball subscribers. Because our subscribers pull far more content from the Internet than they push to it, our ingress traffic has a much greater effect on volume and costs than our egress traffic. So what we’ll mainly focus on is traffic that enters our network towards our subscribers.</p> <p>With most flow visibility tools it wouldn’t be easy to segment our traffic this way, but Kentik Detect’s <a href="https://kb.kentik.com/Cb10.htm" title="Kentik Knowledgebase: Interface Classification">Interface Classification</a> feature is designed to handle exactly this kind of challenge. So to get the results we need when using True Origin, we’ll start with a quick detour into Interface Classification.</p> <p>Interface classification works in multiple stages. First, rules are applied against which interfaces are evaluated and classified; then the flow records from traffic across those interfaces are enriched with information about those interfaces; and finally, queries on those flow records can group-by or filter based on any or all of the following Interface Classification dimensions:</p> <ul> <li><strong>Network Boundary</strong> (source and destination): Internal or External depending on whether the interface connects inside or outside of the network (see Network Boundary Attribute).</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/7qNFkVjt3b32ezwMf4upG4/5e8c41a3d7b9e288fa97f9cb38008c50/network-boundary-types.png" class="image center no-shadow" alt="Network Boundary Types" style="max-width: 500px;" thumbnail /> - __Provider:__ The provider via which traffic from a given externally facing interface reaches the Internet (see [Provider Classification](https://kb.kentik.com/Cb10.htm#Cb10-Provider_Classification "Kentik Knowledgebase: Provider Classification")). Note that in this context “provider” is an umbrella term that could, depending on your specific situation, refer to a transit provider, an Internet Exchange supplier, a customer, or even a CDN (in the case of an embedded cache). - __Connectivity Type__ (source and destination): The type of interface, including types like Transit, Peering, and Backbone (see [Understanding Connectivity Types](https://kb.kentik.com/Ab05.htm#Ab05-Understanding_Connectivity_Types "Kentik Knowledgebase: Understanding Connectivity Types")). <img src="//images.ctfassets.net/6yom6slo28h2/75w8fTUn7ZXSBLUJp81gfI/64ff811a89da53cf25373616b010098a/network-cdn-true-origin-connectivity-types.png" class="image center no-shadow" alt="Kentik Network Connectivity Types" style="max-width: 600px;" thumbnail /> <p>Among the types of connectivity covered by Interface Classification, <a href="https://kb.kentik.com/Ab05.htm#Ab05-Embedded_Cache" title="Kentik Knowledgebase: Embedded Cache">Embedded Cache</a> is particularly relevant to traffic delivered by some of the major CDNs. As an ISP, I may have interfaces that are connected to a cache appliance server that’s been provided by a CDN or content provider—such as a Facebook appliance (FNA), Google Global Cache (GGC), Netflix Open Connect (OCA), or Akamai cache.</p> <p>If so, I want to be sure that those interfaces are classified as such. To do that, I’ll set an Interface Classification rule as shown below: <img src="//images.ctfassets.net/6yom6slo28h2/64HCWqsr8www3KAOJ8bVcJ/a7a67b254c4d0605868023f62d057453/kentik-rule-for-embedded-cache.png" class="image center no-shadow" alt="Kentik Rule for Embedded Cache" style="max-width: 700px;" thumbnail /></p> <p>Using regex, the above rule will evaluate all SNMP-polled interface descriptions.</p>
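<p>In spirit, the evaluation works like the following sketch (the description pattern and interface names are invented; in Kentik the rule is configured in the UI, as shown above):</p>

```python
import re

# Hypothetical convention: an ISP describes its embedded-cache ports
# like "EMBED-NETFLIX-CACHE-01" in its interface-naming scheme.
RULE = re.compile(r"EMBED-(AKAMAI|NETFLIX|FACEBOOK|GOOGLE)-CACHE", re.I)

def classify(description: str):
    """Classify one SNMP-polled interface description."""
    m = RULE.search(description)
    if not m:
        return None
    return {
        "network_boundary": "INTERNAL",
        "connectivity_type": "Embedded Cache",
        "provider": m.group(1).title(),  # the "$1" group capture
    }

print(classify("EMBED-NETFLIX-CACHE-01"))
# {'network_boundary': 'INTERNAL', 'connectivity_type': 'Embedded Cache',
#  'provider': 'Netflix'}
```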
<p>Matching interfaces will be labeled with these three dimensions:</p> <ul> <li><strong>Network Boundary</strong> = INTERNAL</li> <li><strong>Connectivity Type</strong> = Embedded Cache</li> <li><strong>Provider</strong>: The $1 value indicates that the provider value will be determined by performing a group capture, using regex matching, on the first group <strong>(AKAMAI|NETFLIX|FACEBOOK|GOOGLE)</strong> in the interface description (see <a href="https://kb.kentik.com/Cb10.htm#Cb10-Provider_Classification_with_Regex" title="Kentik Knowledgebase: Provider Classification with Regex">Provider Classification with Regex</a>).</li> </ul> <p>Once Interface Classification is applied, we can take advantage of it to better understand how traffic from content providers (via CDNs) is affecting our network utilization and costs. In Data Explorer, for example, we can set filters on Interface Classification dimensions so our visualizations and tables are focused on the traffic that’s of greatest interest. To see only traffic coming from the outside, we’d set a filter to <strong>SOURCE Network Boundary EQUALS External</strong>. To see only traffic going out of the network, the filter would instead be <strong>DESTINATION Network Boundary EQUALS External</strong>.</p> <h1 id="understanding-our-video-traffic">Understanding Our Video Traffic</h1> <p>Now that we see how Interface Classification helps tailor our queries, we can combine it with True Origin dimensions to really narrow down to a specific type of content. We’ll focus on video traffic because these days it represents the lion’s share of traffic volume.</p> <p>We’ll start in the Filtering pane of the Data Explorer sidebar, where we’ll create a new ad hoc filter made up of two filter groups (see <a href="https://kb.kentik.com/Da04.htm#Da04-Filter_Groups_Interface" title="Kentik Knowledgebase: Filter Groups Interface">Filter Groups Interface</a>). One group will contain two ORed conditions that cover traffic from two possible sources: either outside the network (via externally facing interfaces) or from caches embedded inside the network. A second filter group, ANDed with the first, will use the OTT Service Type dimension with a value of Video. The resulting filter will look like this: <img src="//images.ctfassets.net/6yom6slo28h2/4n7fKMLvBg9DtdhvXj5OQB/d610bec1f139e68b08f7aa65e85bcfd9/kentik-filter-for-video-from-external-or-embedded-cache.png" class="image center no-shadow" alt="Kentik Filter for Video from External or Embedded Cache" style="max-width: 500px;" thumbnail /></p> <p>Next, still in the sidebar, we’ll choose our group-by dimensions in the Query pane as shown below (see <a href="https://kb.kentik.com/Db03.htm#Db03-Query_Basic_Options" title="Kentik KB: Query Basic Options">Query Basic Options</a>), so that the query will show us top-X traffic by a combination of OTT dimensions, connectivity type, and CDN. <img src="//images.ctfassets.net/6yom6slo28h2/3Nd9lZlOXtYoZa8aKM8Wmw/d01d622d15b8b8579f2a7a0602998c95/cdn-true-origin-group-by-dimensions.png" class="image center no-shadow" alt="Group By Dimensions" style="max-width: 500px;" thumbnail /></p>
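<p>Under the hood, a top-X query of this kind boils down to grouping flows by a multi-dimension key and ranking the groups by volume. A drastically simplified sketch (invented records; KDE runs the equivalent across billions of rows):</p>

```python
from collections import Counter

# Each enriched flow record carries the five group-by dimensions.
flows = [
    {"ott_type": "Video", "service": "ExampleVOD", "provider": "Example Media",
     "connectivity": "Embedded Cache", "cdn": "ExampleCDN", "bytes": 9_000},
    {"ott_type": "Video", "service": "ExampleVOD", "provider": "Example Media",
     "connectivity": "Transit", "cdn": "OtherCDN", "bytes": 4_000},
]

totals = Counter()
for f in flows:
    key = (f["ott_type"], f["service"], f["provider"],
           f["connectivity"], f["cdn"])
    totals[key] += f["bytes"]

# The ranked keys become the ribbons of the Sankey diagram.
for key, byte_count in totals.most_common(10):
    print(key, byte_count)
```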
<img src="//images.ctfassets.net/6yom6slo28h2/3Nd9lZlOXtYoZa8aKM8Wmw/d01d622d15b8b8579f2a7a0602998c95/cdn-true-origin-group-by-dimensions.png" class="image center no-shadow" alt="Group By Dimensions" style="max-width: 500px;" thumbnail /></p> <p>We also specify traffic from all devices (see <a href="https://kb.kentik.com/Da03.htm#Da03-Devices_Pane_Settings" title="Kentik KB: Devices Pane Settings">Devices Pane Settings</a>) and a time range of Last 1 Day (see <a href="https://kb.kentik.com/Da03.htm#Da03-Time_Pane_Settings" title="Kentik KB: Time Pane Settings">Time Pane Settings</a>).</p> <p>Finally, from the View Type menu at the top right of the chart display area, we choose Sankey diagram (see <a href="https://kb.kentik.com/Da01.htm#Da01-Chart_View_Types" title="Kentik KB: Chart View Types">Chart View Types</a>). When we run the query with these settings, the resulting Sankey diagram (example below) shows the video traffic coming from external sources or an embedded cache, ranked (top-X) according to a key built from our five group-by dimensions: <img src="//images.ctfassets.net/6yom6slo28h2/DtgKoxN2x3wElOEtV8eWM/6753dc8ad79df9d569a4db6c748293cf/kentik-sankey-chart-service-connectivity-cdn.png" class="image center no-shadow" alt="Sankey Diagram Showing Service, Connectivity and CDN" style="max-width: 700px;" thumbnail /></p> <h3 id="so-what-do-we-learn-from-this-sankey">So what do we learn from this Sankey?</h3> <p>Let’s first look at relative traffic volume to confirm that what we see represents reality. Based on the volume of video-only traffic, our top OTT service providers (3rd column) are Netflix, Google’s YouTube, and Hulu. Nothing surprising there, but it’s good that our query results are consistent with the usual suspects and our general knowledge about their traffic.</p> <p>Now let’s dig into a few more interesting things. First, Netflix’s OpenConnect CDN is known to have a nearly perfect offload score. Based on these results that appears to be warranted, because we can see that the Netflix-supplied traffic is coming exclusively from the embedded caches in our network. If we ever see Netflix “shedding” traffic into connectivity types like Transit or Peering, we’ll know that this is an anomaly that needs to be investigated.</p> <p>Second, it’s worth noting (though it may be assumed) that the vast majority of our video traffic is being supplied via CDNs, either commercial (Akamai, Level3, etc.) or in-house (Netflix, Google CDN, Amazon CloudFront, etc.). This underscores the value to ISPs of being able to see beyond CDNs to identify the true providers of their bandwidth-heavy traffic.</p> <p>Next, by hovering over any listed OTT service (2nd column) we can easily identify both the connectivity type (4th column) and the CDN (5th column). This is important because, as previously noted, it’s extremely important for ISPs to deliver OTT services with the best possible performance at the best possible cost. Content whose connectivity type is transit makes those goals more challenging for two reasons:</p> <ul> <li><strong>Transit connectivity</strong> adds an unknown number of hops, which may impact performance.</li> <li><strong>Transit costs</strong> are higher than embed or peering (IX or PNI) costs, which impacts our bottom line.</li> </ul> <p>Despite the potential drawbacks above, the delivery policies of some video content providers leverage the full spectrum of available means. 
In our Sankey, for example, we can hover over Hulu to see the following delivery methods in use:</p> <ul> <li>The majority is delivered by <strong>transit</strong> (Level3), which translates into higher cost.</li> <li>A good chunk is delivered via <strong>PNI</strong> (e.g., Edgecast and Fastly).</li> <li>A tiny slice of the traffic is delivered via <strong>embedded cache</strong> (Akamai).</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/5pgpEhyj8vJ0PNp2tcufsE/9fe18b600e0d8cfd026a8b604d64cf15/cdn-video-traffic-embedded-cache-transit-pni.png" class="image center no-shadow" alt="Providers may deliver traffic via a combination of embedded cache, transit and PNI" style="max-width: 700px;" thumbnail /> <p>Information like the above enables you to <em>plan intelligently</em> for the demands that CDN-delivered content puts on your infrastructure. And it also alerts you to opportunities to either <em>lower costs</em> or <em>boost performance</em> (e.g., deliver video at a higher bitrate) by trying to find alternatives to transit — more embedded caches, perhaps? — for traffic from content providers. In future posts about True Origin we’ll dive deeper into specific use cases, such as troubleshooting traffic that should be going to caches but is instead arriving via transit. In the meantime, if you have questions or need help with using True Origin, don’t hesitate to <a href="mailto:[email protected]" title="Email Kentik Support">contact Kentik support</a>.</p><![CDATA[The Visibility Challenge for Network Overlays]]><![CDATA[Kentik CTO, Jonah Kowall, explains the challenges inherent in ensuring visibility in the age of network overlays, plugins, and public clouds... and how Kentik addresses Kubernetes and other overlay technologies.]]>https://www.kentik.com/blog/the-visibility-challenge-for-network-overlayshttps://www.kentik.com/blog/the-visibility-challenge-for-network-overlays<![CDATA[Jonah Kowall]]>Wed, 29 May 2019 07:00:00 GMT<p>Kubernetes and Docker are increasingly familiar to DevOps and SRE teams, but still relatively unfamiliar to network teams, even though the network interactions are complex. In order to maintain reliable applications in these new architectures, network teams must become more involved to proactively identify issues.</p> <p>Those working in network operations or development are well used to the “blame game”: The network is most often at fault… until the network team provides proof that it’s not. I recall finding a strange CIFS configuration problem via packet capture 10 years ago, when we were about to abort a massive data center cutover. Upon fixing the issue, we were able to cut over with just enough time, and it was a big win for the team. Ultimately, packets do not lie, and analyzing packets is always the final diagnostic and analysis effort when issues occur.</p> <h2 id="network-visibility-in-the-age-of-overlays">Network visibility in the age of overlays</h2> <p>Unfortunately, packet capture is becoming less feasible with the abstraction of networks via overlays and encryption.</p> <p>In today’s infrastructures, these additional layers are deployed on top of the existing network.
In a typical Kubernetes deployment, one of several network overlay technologies is used—the two most common being Flannel and Calico—but there are <em>dozens more</em>, as per <a href="https://kubedex.com/kubernetes-network-plugins/" target="_blank" title="Kubedex Blog: Kubernetes Network Plugins">this excellent blog by Steven Acreman</a> and his <a href="https://docs.google.com/spreadsheets/d/1qCOlor16Wp5mHd6MQxB5gUEQILnijyDLIExEpqmee2k/edit?usp=sharing" target="_blank" title="Kubernetes Network Plugins Google Sheet by Steven Acreman">associated Google Sheet</a>.</p> <p>The challenge is that each of these network overlay technologies requires a new or modified set of tools capable of understanding the protocols, security, and routing of packets in these new layers.</p> <h2 id="improving-network-visibility-for-kubernetes-and-similar-overlay-deployments">Improving network visibility for Kubernetes and similar overlay deployments</h2> <p>But the benefits of network overlay technologies need not come at the cost of decreased network visibility. Based on our customers’ need to address these issues, Kentik has created a way to see <em>inside</em> the network overlay technologies used within public clouds, such as Amazon Elastic Kubernetes Service (EKS), Google Kubernetes Engine (GKE), and Azure Kubernetes Service (AKS).</p> <p>Kentik also supports on-premises Kubernetes deployments running plugins such as Flannel and Calico. These capabilities allow Kentik to collect data from within the network overlays and services to discover how the Kubernetes pods and nodes communicate. Although this helps a lot, capturing packets remains a challenge because legacy tools have limited support for overlays.</p> <p>When debugging a service communication issue in a typical network environment, it’s difficult to determine whether the problem is in the physical network, firewalls, a logical configuration such as routing, or other access controls. It could even be at the host level with the configuration of DNS, or external to the network at the load balancing layers.</p> <p>These same services and requirements remain in place in today’s infrastructure—even on public cloud—but now there is significantly more complexity as you are running another network on top. (See this overview to <a href="https://sookocheff.com/post/kubernetes/understanding-kubernetes-networking-model/" target="_blank" title="Blog Post: A Guide to the Kubernetes Networking Model">learn more about Kubernetes networking</a>.) In many of these overlay deployments, there are complexities around addressing, routing, and advertising. These are all in addition to the existing complexity of the physical network.</p> <p>Without a tool to help diagnose the traffic and paths, it is challenging to isolate problems. Some users even run BGP via Calico, which creates challenges around managing granular traffic and path control at the pod level. Within these overlays, there are also security policies that may filter access by pod, service, IP, URL, or other methods.</p> <h2 id="introducing-kubetags">Introducing “kubetags”</h2> <p>Kentik deploys inside a Kubernetes cluster as a small container called kubetags, via <a href="https://hub.docker.com/r/kentik/kubetags" target="_blank" title="Kentik kubetags via Docker Hub">Docker Hub</a>. This very lightweight agent adds Kubernetes metadata to the flow data captured with either Kentik’s kProbe agent (on the host or container) or via the native <a href="https://www.kentik.com/solutions/infrastructure/#aws" title="Kentik Solutions for AWS, Azure, and GCP flow logs">flow log support Kentik offers on Amazon, Microsoft, and Google clouds</a>.</p>
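<p>The effect is easiest to picture as pseudocode. A toy sketch of flow enrichment with pod metadata (invented structures and names; not the actual kubetags implementation):</p>

```python
# Toy illustration: annotate a flow with Kubernetes context by looking
# up the pod that currently owns each IP. In a real cluster this index
# must be kept in sync with the Kubernetes API, since pods (and their
# IPs) come and go constantly.
POD_INDEX = {
    "10.42.0.7":  {"pod": "checkout-5d9f", "namespace": "shop", "service": "checkout"},
    "10.42.1.12": {"pod": "payments-7c2a", "namespace": "shop", "service": "payments"},
}

def add_k8s_metadata(flow: dict) -> dict:
    """Merge pod/namespace/service names into a flow record."""
    src = POD_INDEX.get(flow["src_ip"], {})
    dst = POD_INDEX.get(flow["dst_ip"], {})
    flow["src_pod"], flow["dst_pod"] = src.get("pod"), dst.get("pod")
    flow["src_service"], flow["dst_service"] = src.get("service"), dst.get("service")
    return flow

flow = add_k8s_metadata({"src_ip": "10.42.0.7", "dst_ip": "10.42.1.12", "bytes": 880})
print(flow["src_service"], "->", flow["dst_service"])  # checkout -> payments
```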
<p>We are looking to extend these types of visibility features to address additional overlays in the future.</p> <p>Learn more about the <a href="https://www.kentik.com/product/kentik-platform/" title="Learn More About Our Platform: the Kentik Network Observability Platform">Kentik Network Observability Platform</a>, <a href="#demo_dialog">request a demo</a>, or start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Going Beyond the (Net)Flow: Introducing Universal Data Records]]><![CDATA[Kentik's Aaron Kagawa explains why today's network analytics solutions require new types of contextual data and introduces the concept of Universal Data Records. Learn how and why Kentik is moving beyond network flow data.]]>https://www.kentik.com/blog/going-beyond-the-netflow-introducing-universal-data-recordshttps://www.kentik.com/blog/going-beyond-the-netflow-introducing-universal-data-records<![CDATA[Aaron Kagawa]]>Thu, 23 May 2019 07:00:00 GMT<p>Every story has a beginning. At Kentik, flow data was ours. It’s how we began to set the bar for what modern network analytics should look like. In just a few short years, <a href="https://www.kentik.com/customers/" title="Kentik Customers Page">we’ve proven</a> that flow-based analytics (with formats like <a href="https://www.kentik.com/kentipedia/netflow-overview/" title="Netflow Overview in the Kentipedia">NetFlow</a>, <a href="https://www.kentik.com/blog/netflow-vs-sflow/" title="Blog Post: NetFlow Vs sFlow">sFlow</a>, and <a href="https://www.kentik.com/kentipedia/j-flow-analysis/" title="jFlow Analysis Overview in the Kentipedia">JFlow</a>) give enterprises and service providers powerful insights into network performance, availability, security, and much more.</p> <p>However, networks are growing more complex. Infrastructure is increasingly diverse. And visibility is diminishing alongside new overlays and technologies layered upon the network.</p> <p>Going <em>beyond</em> flow data is now necessary for maintaining comprehensive network visibility. That’s why Kentik is not just “going with the flow.” We’re evolving.</p> <p>We’ve enhanced our core platform to accept any data element in any format—including fields like application ID, user ID, NAT translations, and vendor-specific fields—and even data records that aren’t flow at all.</p> <p>It was hard work, but the results have proven to be well worth the effort. We call the new architectural element <strong>“Universal Data Records” (UDRs)</strong>, and with this, Kentik now has the ability to innovate faster than ever before—adding more data sources to our platform to stay ahead of and address the always-evolving network visibility challenges faced by our customers.</p> <h1 id="universal-data-records-for-the-cloud">Universal Data Records for the Cloud</h1> <p>One of the challenges we hear about with increasing frequency is centered around networking in the cloud era. As organizations shift applications and workloads to multi-cloud environments and new network overlays appear, many network professionals are losing visibility into their networks.
For example, it becomes harder to understand which teams might be running which services in what locations. This type of uncertainty can lead to performance problems, security nightmares, and cloud spending surprises.</p> <p>UDRs allow us to flexibly receive and store new data fields that aren’t present in traditional network data. This innovation made it possible to add support for <a href="https://www.kentik.com/resources/google-cloud-vpc-flow-logs-for-kentik/" title="Learn more about Kentik's support for Google VPC Flow Logs">VPC Flow Logs from Google Cloud Platform (GCP)</a>, followed by adding support for <a href="https://www.kentik.com/resources/aws-vpc-flow-logs-for-kentik/" title="Learn more about Kentik's support for AWS VPC Flow Logs">AWS VPC Flow Logs</a>, both of which contain new fields that describe attributes like instance names and zone/region names.</p> <p>Additionally, the rise of containers, with their many moving (and short-lived) parts, has made it increasingly difficult for network operators to understand what traffic is coming from where. That’s why we’ve also added UDRs for Kubernetes and Istio. This provides our customers with visibility into pod-to-pod and service-to-service traffic flows, giving them a better understanding of container orchestration and service meshes, along with ways to visualize them. And we now have several other monitoring capabilities in the works for cloud and cloud-native data sources that are set to roll out over the coming year.</p> <p>You can read how Pandora uses Kentik for our new <a href="https://www.kentik.com/blog/how-pandora-leverages-vpc-flow-logs-kentik-for-google-cloud-visibility/" title="Kentik Blog: How Pandora Leverages VPC Flow Logs for GCP Visibility">cloud visibility capabilities here</a>.</p> <h1 id="udrs-for-the-firewall">UDRs for the Firewall</h1> <p>Most recently, UDRs allowed us to add visibility into firewalls, including <a href="https://www.kentik.com/product-updates/january-2019/#Palo%20Alto%20Networks%20Firewalls" title="Kentik Product Update: Firewall Support">Cisco ASA and others</a>. Firewalls can carry deep insights into network traffic based on their ability to perform deep packet inspection and authentication, and add attributes (such as user names and application types) to flow data. These are hugely valuable for adding both <em>security</em> and <em>application context</em> to network activity.</p> <p>For us, this is part of Kentik’s continued evolution. We’re no longer just supporting core and edge. Rather, we’re stepping beyond that—and quickly.</p> <h1 id="from-beginning-end-to-end">From <del>Beginning</del> End to End…</h1> <p>With a powerful backend equipped to ingest more data types than ever—and running in a highly scalable SaaS-delivered product—the Kentik team has a full pipe of planned work.</p> <p>Additional data sources and types we’re adding include SNMP, Streaming Telemetry, MPLS labels, Cisco NBAR, VXLAN, SD-WAN performance metrics, and syslog. We also just rolled out full alerting support for the UDR fields mentioned previously in this post.</p> <p>Correlating and alerting on all of these data sources, together, will arm our customers with complete, end-to-end network visibility and actionable insights.</p>
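<p>Conceptually, a universal record pairs a small fixed core with an open-ended set of extra fields. A toy sketch of the idea (our illustration, not Kentik’s actual schema):</p>

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class UniversalDataRecord:
    """Toy 'universal' record: a fixed core plus arbitrary extensions."""
    timestamp: int
    src_ip: str
    dst_ip: str
    bytes: int
    # Whatever a given source knows how to report: application IDs,
    # user names, NAT translations, vendor-specific fields, and so on.
    extra: Dict[str, Any] = field(default_factory=dict)

rec = UniversalDataRecord(1558569600, "10.0.0.5", "172.16.3.9", 2048,
                          extra={"application": "sip", "user": "jdoe",
                                 "nat_src_ip": "198.51.100.4"})
print(rec.extra["application"])  # sip
```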
<p>Keep an eye on our <a href="https://www.kentik.com/product-updates/" title="Kentik Product Updates">product updates page</a> where we post monthly recaps of what’s new. You can also <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today to see Kentik’s powerful network analytics for yourself.</p><![CDATA[How Race Communications Used Kentik to Stop Mirai Botnet Infection and Abuse]]><![CDATA[Customer success engineer Dan Kelly explains how Race Communications used Kentik's powerful network analytics to identify malicious traffic associated with the Mirai botnet, determine which of Race Communications’ customer IP addresses were being exploited, save its online IP reputation and fend off other types of DDoS and botnet attacks.]]>https://www.kentik.com/blog/how-race-communications-used-kentik-to-stop-mirai-botnet-infection-and-abusehttps://www.kentik.com/blog/how-race-communications-used-kentik-to-stop-mirai-botnet-infection-and-abuse<![CDATA[Dan Kelly]]>Tue, 21 May 2019 07:00:00 GMT<p>The Mirai botnet was first discovered back in 2016, but has continued to persist and abuse common vulnerabilities and exposures (CVEs) on IoT devices, including home routers and many other network-connected devices. In short, the Mirai network of bots was built by malicious actors who exploited remote access and control protocol ports over many different device types, producing damaging traffic levels and creating an advanced, powerful tool that can be used for large-scale DDoS attacks and many other nefarious purposes.</p> <p>When Kentik customer Race Communications—a provider of reliable, high-speed internet and advanced communications to communities throughout California—learned of Mirai’s potential risk to its customers, the team knew it needed to act fast.</p> <p>In this blog post, we outline how Race Communications was able to leverage Kentik’s powerful network analytics to identify malicious traffic associated with Mirai, determine which of Race Communications’ customer IP addresses were being exploited by the botnet, and ultimately, save its online IP reputation.</p> <h1 id="the-race-to-stop-mirai">The Race to Stop Mirai</h1> <p>Race Communications was alerted to the potential Mirai risk when the team received a letter from another network online noting that IP address(es) owned by Race Communications were acting maliciously over the internet. These addresses were apparently port scanning IP addresses that belonged to the company sending the letter. The letter took the form of a formal complaint and asked that Race Communications cease this activity.</p> <p>While the port scans were only coming from a few hosts, Race Communications noticed that the complaint was against an entire /24 IP block. This had the potential to lead to the entire /24 block getting blacklisted for malicious activity. This type of blacklisting could potentially cause other Race Communications customers to experience a loss of connectivity to services due to poor IP reputation.</p> <p>Race Communications knew that by utilizing network forensics capabilities from Kentik, the team would be able to quickly drill down into the incident and determine the root cause behind the formal complaint.</p> <h1 id="network-forensics--visibility-from-kentik">Network Forensics &#x26; Visibility from Kentik</h1> <p>The Race Communications network team turned to Kentik’s “Unique Destination Port” metric and was quickly able to see how many ports the address listed in the formal complaint might be hitting, and why it would be considered a scan.
When this number revealed only 40 destination ports on average, the team again turned to the insights available with the Kentik platform.</p> <p>Checking the pattern of all ports that were being utilized, Race Communications discovered that the vast majority of all destination ports were either port 23 (Telnet) or port 37215 (Huawei Remote Procedure Port). The Huawei port was of immediate interest to the network team, as this port is part of a long-lived exploit cataloged as <a href="https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2017-17215" target="_blank" title="Huawei HG532 Vulnerability Info from Common Vulnerabilities and Exploits">CVE-2017-17215</a>. The exploit requires only a single authentication to work, and once it has been exploited, remote code execution is possible on the associated device. (<a href="https://vuldb.com/?id.114804" target="_blank" title="HUAWEI HG532 SERVICE PORT 37215 PACKET REMOTE CODE EXECUTION Info from Vuldb">Vuldb has additional information</a> on the life of this exploit.)</p> <p>The final port that Race Communications observed was port 2323, which accounted for a tiny fraction of the total traffic and was always sourced from port 23 on protocol TCP. This is a potential sign of Mirai-variant botnet C&#x26;C traffic, <a href="https://www.securityfocus.com/bid/102344" target="_blank" title="SecurityFocus article on Mirai variant">as described in this article</a>, and is additionally associated with <a href="https://nvd.nist.gov/vuln/detail/CVE-2016-10401" target="_blank" title="NIST National Vulnerability Database Info on CVE-2016-10401">CVE-2016-10401</a>, an exploit associated with ZyXEL network devices and escalation of permissions. The methods of exploiting the Huawei CVE and the ZyXEL CVE are very similar, requiring one authentication first.</p> <p>In just a three-hour period, Race Communications was able to see a very large Unique Destination IP count per port. As shown in the image below, this rate was highly consistent:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1YJ1vvuFwzr3ah2XoPOxTT/0a9f73143a8ada543398b74d795681d1/botnet-detection-murai-kentik-1.jpg" class="image center no-shadow" alt="Kentik Report: Unique Destination IP Port Metric" style="max-width: 750px;" thumbnail /> <p>Race Communications was also able to leverage the Destination Port information from Kentik to backtrace and pinpoint several additional Source IPs on its customers’ networks that were participating in similar traffic patterns.</p> <p>During this portion of the investigation, while working through the traffic, Kentik and the Race Communications network team found additional interesting and actionable information within the Kentik platform:</p> <ul> <li>Most traffic to destination port 37215 was hitting IP addresses in Asian market regions, such as China, Japan, and South Korea. To the team, this was a potential indicator that the attackers were targeting addresses within a region more likely to contain a Huawei product.</li> <li>Destination port 23 traffic showed an evenly distributed pattern between the Asian markets and the United States. To the Race Communications team, this had the potential to imply that the attacker(s) were searching for new devices using Telnet, aiming to find vulnerable Huawei devices in the process, and scripting to additionally hit these devices with a vendor-specific attack.</li> <li>Each Source IP that was on its customers’ networks was reaching an hourly average of 3,500 Unique Destination IPs consistently, for weeks.
Each Unique Destination is another potential report, and another hit against Race Communications’ customer IP reputation.</li> </ul> <h1 id="an-interesting-forensic-finding">An Interesting Forensic Finding</h1> <p>Perhaps the most interesting finding for Race Communications was that Kentik’s Spamhaus Botnet and Threat-List data feeds tagged all of the offending IPs that Race Communications had identified manually. Additionally, Kentik’s platform highlighted other suspicious activity among Race Communications’ customer subscriber IPs. This meant that there had been significantly more complaints against its customer IP blocks than Race Communications was initially aware of from the single, formal complaint the team received.</p> <img src="//images.ctfassets.net/6yom6slo28h2/Ymxmh3Jq2GCFkxOpSpqZt/6d5b4ad066630d39af65d7d008c37482/botnet-detection-murai-kentik-2.jpg" class="image center no-shadow" alt="Kentik Report: Top Threat Sources" style="max-width: 750px;" thumbnail /> <p>The Race Communications team knew it was not sufficient to wait for another entity online to formally complain. With insights from Kentik, the team could act proactively to detect malicious subscriber traffic heading to the larger internet.</p> <h1 id="automated-ddosbotnet-detection">Automated DDoS/Botnet Detection</h1> <p>Kentik was able to offer Race Communications an automated way to detect botnet activity by creating Kentik DDoS-type policies with modifications. Kentik now alerts the provider whenever a subscriber IP communicating outbound via ports 23 and 37215 reaches more than 500 unique destinations.</p> <p>These insights give Race Communications the option to proactively notify internal operational groups of events that could impact IP reputation in the long term. Additionally, Race Communications can now contact its customers to alert them of a possible infection on one of their Internet-connected devices.</p> <h2 id="additional-ddos-detection-features">Additional DDoS Detection Features</h2> <p>In addition to the capabilities offered by Kentik’s platform scrubbing partners Radware and A10 Networks, and our BGP-based Remote Trigger Black Hole (RTBH), Kentik now offers <a href="https://www.kentik.com/blog/kentik-takes-a-leap-forward-in-ddos-defense/" title="More Info About Kentik&#x27;s BGP FlowSpec DDoS Mitigation Features">BGP FlowSpec mitigations</a>. FlowSpec can be utilized to mitigate traffic with low-to-no collateral effects versus RTBH. It excels in situations where non-volumetric attacks like the ones associated with the Mirai botnet are compromising commercial security and reputation.</p> <h1 id="summary">Summary</h1> <p>Using Kentik, Race Communications can protect its business and its online IP reputation by unlocking:</p> <ul> <li>Information needed to remediate not just the <em>reported</em> problem, but the <em>entire detectable problem</em></li> <li>A path by which to contact subscribers and explain specifically why they have been the subject of a complaint (e.g., if they have infected machines on their network, response and action can quickly take place)</li> <li>An automated way of being informed of new IP addresses that form the Mirai pattern or other botnet traffic patterns</li> <li>Faster mean time to diagnose (MTTD)</li> <li>Faster mean time to repair (MTTR)</li> </ul> <p>Kentik is able to offer insights that can help identify almost any malicious or unwanted traffic on <em>any network</em>, and it provides automatic notification and mitigation capabilities.</p>
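<p>The detection logic itself is easy to picture. Here’s a rough sketch of the unique-destination threshold policy described above (our simplification; a real policy evaluates streaming flow data continuously):</p>

```python
from collections import defaultdict

SUSPECT_PORTS = {23, 37215}  # Telnet and the Huawei RPC port
THRESHOLD = 500              # unique destinations before alerting

def scan_suspects(flows):
    """Flag subscriber IPs fanning out to too many unique destinations."""
    fanout = defaultdict(set)
    for f in flows:
        if f["dst_port"] in SUSPECT_PORTS:
            fanout[f["src_ip"]].add(f["dst_ip"])
    return [ip for ip, dsts in fanout.items() if len(dsts) > THRESHOLD]

# 600 flows from one subscriber IP to 600 distinct destinations on port 23.
flows = [{"src_ip": "203.0.113.50", "dst_ip": f"10.0.{i // 256}.{i % 256}",
          "dst_port": 23} for i in range(600)]
print(scan_suspects(flows))  # ['203.0.113.50']
```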
<p>For more information on network visibility from Kentik, we encourage you to <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p> <h3 id="related-reading">Related reading:</h3> <ul> <li>Kentipedia Entry: <a href="https://www.kentik.com/kentipedia/ddos-detection/" title="DDoS Detection Defined in the Kentipedia">DDoS Detection</a></li> <li>Kentik Solution Brief: <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="Kentik DDoS Detection and Defense">Kentik DDoS Detection and Defense</a></li> </ul><![CDATA[How to Achieve Multi-Cloud Visibility in Hybrid and Multi-Cloud Architectures]]><![CDATA[Learn how to maintain multi-cloud visibility within hybrid and multi-cloud architectures, as well as the requirements for modern cloud monitoring tools.]]>https://www.kentik.com/blog/how-to-achieve-multi-cloud-visibility-network-monitoring-and-analyticshttps://www.kentik.com/blog/how-to-achieve-multi-cloud-visibility-network-monitoring-and-analytics<![CDATA[Crystal Li]]>Tue, 14 May 2019 07:00:00 GMT<p>Multi-cloud is quickly becoming a “new normal” deployment scenario as organizations of all types leverage an ever-increasing variety of cloud computing services. Here are just four of the reasons for this trend:</p> <ul> <li><strong>Preference:</strong> Different teams within an organization have different preferences when picking clouds due to each team’s competency and specialization.</li> <li><strong>No lock-in:</strong> The move to multi-cloud avoids vendor “lock-in” and reduces reliance on one cloud provider. This helps many organizations not only gain stronger negotiating power but also build a more resilient infrastructure.</li> <li><strong>Efficiency:</strong> Multi-cloud can promote operational efficiency that generates positive business outcomes (e.g., enabling rapid migration when pricing or capability make it appealing).</li> <li><strong>Regulations:</strong> Some organizations make the move to meet compliance/regulatory requirements for international business.</li> </ul> <p>Regardless of the exact <em>“whys,”</em> adoption of cloud architectures doesn’t happen in a vacuum. Many cloud services are deployed to extend, supplement, or enhance existing on-prem infrastructure. And so, whenever we talk about “multi-cloud” architectures, we should also consider “hybrid-cloud” architectures — these two scenarios go hand-in-hand when we take a holistic look at an organization’s network infrastructure.</p> <h2 id="understanding-multi-cloud-and-hybrid-cloud-architectures">Understanding Multi-Cloud and Hybrid Cloud Architectures</h2> <p>As multi-cloud adoption becomes more prevalent, it’s essential to understand both multi-cloud and hybrid-cloud architectures. At a basic level, these models determine how cloud resources interact and integrate with each other and potentially with on-premises infrastructure.</p> <h3 id="what-is-multi-cloud-architecture">What is Multi-Cloud Architecture?</h3> <p>In a multi-cloud architecture, an organization uses multiple cloud providers to fulfill its computing needs. It’s not about redundancy; different cloud providers are often chosen because they excel in particular service offerings or workloads. For example, one provider might offer optimal database services while another has a superior machine learning platform.
By employing a multi-cloud strategy, organizations can harness the strengths of various providers, ensuring operational efficiency and avoiding a one-size-fits-all approach.</p> <h3 id="what-is-hybrid-cloud-architecture">What is Hybrid Cloud Architecture?</h3> <p>On the other hand, a hybrid cloud architecture blends the use of public cloud resources with on-premises infrastructure. This setup offers organizations the flexibility of the cloud while retaining the control and security of their on-premises systems. The hybrid approach’s key advantages include deployment flexibility, better data control, and the ability to scale out (using public cloud) without major changes to existing on-prem setups.</p> <h2 id="what-is-multi-cloud-visibility-and-why-is-it-important">What is Multi-Cloud Visibility and Why is it Important?</h2> <p>In multi-cloud architectures, visibility is not just about seeing your resources but understanding how they interact across diverse environments.</p> <p>Effective management of multi-cloud environments demands a comprehensive view across all cloud providers and on-prem infrastructures. With multiple clouds, each with its unique dashboard and metrics, it can be a Herculean task to maintain a cohesive picture of operations. Moreover, monitoring dispersed resources and workloads becomes increasingly complex due to the sheer volume of data and the potential for disparities.</p> <p>However, a unified view becomes indispensable in ensuring optimal performance. Potential issues like latency, data transfer bottlenecks, or inconsistencies can arise as resources spread across multiple providers. Comprehensive multi-cloud visibility allows organizations to address these concerns preemptively, ensuring seamless operations and the high performance that today’s digital enterprises demand.</p> <h2 id="the-reality-of-multi-cloud-visibility-today">The Reality of Multi-Cloud Visibility Today</h2> <p>Along with the rise of hybrid- and multi-cloud architectures, it’s worth noting another reality — visibility (or lack thereof). In multi-cloud environments, it’s no longer sufficient to leverage each cloud provider’s built-in visibility tools. Those tools were built for each provider’s own cloud infrastructure, and they typically lack some essential features, including:</p> <ul> <li><strong>Resiliency:</strong> The built-in cloud monitoring tools for an IaaS solution reside <em>inside</em> that public cloud! When that provider’s cloud service goes down, its native monitoring services will also be impacted. Users can no longer rely on built-in monitoring tools to do any kind of troubleshooting in such a scenario and will be left in the dark.</li> <li><strong>Lack of a holistic view</strong> across multiple cloud providers and the organization’s on-prem infrastructure. For example, you cannot use AWS CloudWatch to monitor GCP’s network statistics and/or your on-prem data centers. 
Today, there remains a huge gap between the <em>need</em> for a unified view (one that allows monitoring of your entire infrastructure in hybrid/multi-cloud environments) and the tools that are available <em>within</em> each environment.</li> <li>An individual cloud vendor is probably <em>not</em> the <strong>Subject Matter Expert</strong> you want for <em>all</em> of your networks, old and new (including the ones you own).</li> </ul> <h2 id="the-need-for-a-modern-cloud-monitoring-tool">The Need for a Modern Cloud Monitoring Tool</h2> <p>In hybrid and multi-cloud environments, a modern monitoring tool should meet three key requirements:</p> <ul> <li><strong>Unified view:</strong> A modern cloud monitoring tool must provide a unified view across your entire infrastructure — hybrid, multi-cloud, and on-premise components. A single-pane-of-glass view and management can help reduce operational complexities and avoid silos.</li> <li><strong>Powerful analytics:</strong> The tool must unlock powerful analytics and enable in-depth visibility. The tool must also be able to help manage the top challenges we face with modern cloud infrastructures, including performance, cost, and security. Analytics features should be contextualized and enable you to answer both <em>technical</em> and <em>business</em> questions for various stakeholders.</li> <li><strong>Modern integrations:</strong> Such a tool must also integrate with container orchestration and service mesh management (e.g., Istio). One major cloud advantage is the agility of adopting cloud-native architecture, including containers/microservices. However, to leverage that advantage, your monitoring tool needs to provide visibility into the communications <em>between</em> those services (e.g., slow communication between services puts the business at risk).</li> </ul> <h2 id="kentik-for-multi-cloud-visibility-now-with-support-for-gcp-aws-and-microsoft-azure-flows">Kentik for Multi-Cloud Visibility: Now with Support for GCP, AWS and Microsoft Azure Flows</h2> <p>Kentik is moving at a fast and firm pace to expand cloud visibility from on-prem to public clouds. We are on a mission to provide a scalable, powerful, and easy-to-use solution that delivers holistic and unified insights for your infrastructure.</p> <p>Last summer, we started with a <strong>Google Cloud Platform (GCP)</strong> integration by adding GCP VPC Flow Logs as one of the data sources (see our <a href="https://www.kentik.com/blog/kentik-for-google-cloud-vpc-flow-logs/" title="Google Cloud VPC Flow Logs Support Blog Post">GCP VPC Flow Logs blog post</a> and <a href="https://www.kentik.com/resources/google-cloud-vpc-flow-logs-for-kentik/" title="Kentik Solution Brief: Google Cloud VPC Flow Logs">GCP solution brief</a> for more detail). Last winter, we enabled visibility into <strong>AWS</strong> with its VPC Flow Logs (see our <a href="https://www.kentik.com/blog/how-kentik-helps-assess-aws-visibility" title="AWS VPC Flow Logs Support for Kentik">AWS Visibility blog post</a> and <a href="https://www.kentik.com/resources/aws-vpc-flow-logs-for-kentik/" title="Solution Brief: AWS VPC Flow Logs for Kentik">AWS visibility solution brief</a>).</p> <p>But we haven’t stopped there: Now, as of spring 2019, we’ve extended our footprint into <strong>Azure</strong> by integrating <strong>Network Security Group Version 2 (NSGv2) Flow Logs</strong> as well.
For more detail, read our new solution brief about <a href="https://www.kentik.com/resources/azure-nsg-flow-logs-for-kentik/" title="Solution Brief: Azure NSG Flow Logs for Kentik">Azure NSG Flow Logs for Kentik</a>.</p> <h2 id="getting-started-with-the-azure-nsgv2-flow-logs-integration">Getting Started with the Azure NSGv2 Flow Logs Integration</h2> <p><a href="https://docs.microsoft.com/en-us/azure/network-watcher/network-watcher-nsg-flow-logging-overview" title="Read more on Microsoft Azure: Introduction to flow logging for network security groups" target="_blank">Azure NSGv2 Flow Logs</a> allow you to get information about ingress and egress IP traffic through a Network Security Group (NSG) on a <strong>per-rule</strong> basis.</p> <p>The onboarding workflow is straightforward. To export Azure NSGv2 Flow Logs to the Kentik platform, just follow these few steps:</p> <ol> <li><strong>Gather Azure Information:</strong> This might include Azure Role, Azure Subscription ID, Resource Group and Location from your Azure instance. The main goal is to make sure that you have the essential information handy and have the right permissions granted for the exporting.</li> <li><strong>Add Azure Cloud and Complete the Settings:</strong> <img src="//images.ctfassets.net/6yom6slo28h2/1CwjDtqLiLXZvS3BAkxC6L/70cc879e2aa6dd4e3c531a12c2a6b020/azure-enable-flow-log-export-1.png" class="image center no-shadow" alt="Azure Flow Log Export" style="max-width: 512px;" /> <ul> <li><strong>Authorize Access to Azure:</strong> Enter the Subscription ID of the Azure instance from which Kentik’s NSG Flow Exporter application will export flow logs and authorize the access.</li> <li><strong>Specify Azure Resources:</strong> Enter the Resource Group Name and Location, as well as the Storage Account where flow logs will be generated.</li> <li><strong>Configure Flow Log Export</strong> to export flow logs to a Storage Account from the specified Resource Group and Location. (An auto-generated script is recommended.)</li> <li><strong>Validate Configuration.</strong></li> </ul> </li> <li><strong>Apply Azure Dimensions</strong> (Source/Destination) and get desired insights into the cloud. <img src="//images.ctfassets.net/6yom6slo28h2/69mDqf1fWcZTYUnn6plU9M/ffd7249a423ce75b45584218bbb593c8/apply-azure-dimensions-for-flow-log-export-2.png" class="image center no-shadow" alt="Azure Dimensions for Flow Log Export" style="max-width: 474px;" /></li> </ol> <p>A sample Azure traffic dashboard is shown below. It can capture the traffic profile with time series, by subscription, by account, or by region, both inbound and outbound for a 30-day overview:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Ks051iTw80yNL7bbqZr4f/60d3592a0ea5cf413443fe3af769cf3d/azure-network-traffic-flow-dashboard-3.png" class="image center no-shadow" alt="Azure Network Traffic Flow Log Export" style="max-width: 750px;" thumbnail /> <p>For more detailed instructions on getting started, see the <a href="https://kb.kentik.com/Bd08.htm#Bd08-Kentik_for_Azure" title="Azure Flow Logs Support in Kentik - Kentik Knowledge Base">Kentik for Azure topic</a> in our Kentik Knowledge Base.</p> <h3 id="kentik-for-multi-cloud-aws--azure--gcp">Kentik for Multi-Cloud: AWS + Azure + GCP</h3> <p>The <a href="https://www.kentik.com/solutions/microsoft-azure/" title="Kentik solutions for Azure Observability">Azure integration</a> completes Kentik’s trifecta of public cloud options.
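</p> <p>One practical aside before moving on: the flow log export configured in step 3 of the onboarding workflow above can also be scripted. Below is a minimal sketch in Python using the <code>azure-mgmt-network</code> SDK. The operation and field names are our best approximation of that SDK’s flow-log API, and all resource IDs are placeholders, so verify against the current Azure documentation before relying on it.</p> <pre><code># A minimal sketch (not production code) of enabling NSG flow logs to the
# storage account that Kentik reads from. Operation and field names are
# approximations of the azure-mgmt-network flow-log API; verify before use.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

SUBSCRIPTION_ID = "YOUR-SUBSCRIPTION-ID"    # gathered in step 1
RESOURCE_GROUP = "NetworkWatcherRG"         # resource group of the Network Watcher
NETWORK_WATCHER = "NetworkWatcher_eastus"   # the watcher for your region

client = NetworkManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Enable flow logging for one NSG, writing logs to the chosen storage account.
poller = client.flow_logs.begin_create_or_update(
    RESOURCE_GROUP,
    NETWORK_WATCHER,
    "kentik-nsg-flow-log",
    {
        "location": "eastus",
        "target_resource_id": "/subscriptions/YOUR-SUBSCRIPTION-ID/resourceGroups/"
                              "my-rg/providers/Microsoft.Network/networkSecurityGroups/my-nsg",
        "storage_id": "/subscriptions/YOUR-SUBSCRIPTION-ID/resourceGroups/"
                      "my-rg/providers/Microsoft.Storage/storageAccounts/kentikflowlogs",
        "enabled": True,
    },
)
print(poller.result().provisioning_state)
</code></pre> <p>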
Kentik’s cloud visibility platform now provides easy analytics, powerful visualization, and instant insight, empowering customers with:</p> <ul> <li>A unified view into your infrastructure across hybrid and multi-cloud environments</li> <li>Integration with container orchestration and service mesh management</li> <li>The ability to manage performance, cost, and security issues across modern cloud infrastructures</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/gdrUcCOcZtjfAWtPqxQJM/e27f67b58738d6e564292cb21b40b585/kentik-azure-multi-cloud-monitoring-aws-azure-gcp-4.png" class="image center no-shadow" alt="Azure Network Traffic Flow Log Export" style="max-width: 512px;" thumbnail /> <p>For more information on how to leverage the traffic data from your infrastructure and put it into application and business context for your hybrid and multi-cloud environment, you can read our <a href="https://www.kentik.com/resources/cloud-visibility/" title="Cloud Visibility Solution Brief from Kentik">cloud visibility solution brief</a>, <a href="#demo_dialog" title="Request a demo of Kentik’s cloud visibility solution">request a personalized demonstration</a>, or sign up for a <a href="#signup_dialog" title="Start your free trial of Kentik’s cloud visibility solution">free trial</a>.</p><![CDATA[Network Teams: Keys to Cloud Migration Success]]><![CDATA[As enterprises integrate their networks with their public cloud strategies, several best practices for network teams are emerging. In this guest blog post, Enterprise Management Associates analyst Shamus McGillicuddy dives into his recent cloud networking research to discuss several strategies that can improve the chances of a successful cloud networking initiative. ]]>https://www.kentik.com/blog/network-teams-keys-to-cloud-migration-successhttps://www.kentik.com/blog/network-teams-keys-to-cloud-migration-success<![CDATA[Shamus McGillicuddy]]>Tue, 07 May 2019 07:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/1jDGNtIbOiiae96VqOZD7L/5ec026645b978984c2fc400e671b34c1/featured1-cloud-success.jpg" style="max-width: 325px; margin-bottom: 25px;" class="image right" /> <p>As enterprises integrate their networks with their public cloud strategies, several best practices for network teams are emerging. Enterprise Management Associates (EMA) research recently found that the following strategies can improve the chances of a successful cloud networking initiative.</p> <h3 id="be-a-team-player">Be a Team Player</h3> <p>Network infrastructure professionals should focus on building good relationships with the other teams that are key players in the cloud because the cloud can create internal conflicts. Fourteen percent of network teams say the public cloud has created a rift between them and the security operations team, and 17% of network teams say the cloud has created a rift with the DevOps group.</p> <p>Make sure you are able to partner with these groups. Share tools and data and give DevOps and SecOps teams views that are tailored to their roles. Put best practices and formal processes for collaboration in place. And most importantly, demand a seat at the table as early as possible. EMA research has generally found that the earlier network teams get involved with a cloud strategy, the more successful they are at supporting that strategy.</p> <h3 id="lay-the-groundwork">Lay the Groundwork</h3> <p>Network teams should assess the network requirements of every application before it migrates to the cloud.
Already, 99% of network teams do this, and 89% do it for every application. EMA research has found that the most successful network teams are more likely to assess every application that moves to the cloud.</p> <p>These assessments take time. Only 2% of enterprises take a brief snapshot, and 16% collect data for only one day. More typically, the assessment period ranges from two days to two weeks.</p> <p>Network teams also collect a variety of data during these assessments. The most popular data collected include infrastructure metrics such as SNMP MIBs and traps (52%), network flows such as NetFlow or sFlow (44%), and synthetic traffic generated by active monitoring tools (43%).</p> <h3 id="understand-and-manage-cloud-networking-costs">Understand and Manage Cloud Networking Costs</h3> <p>Most network managers are actively monitoring and managing cloud networking costs. Already, 89% of network teams are monitoring cloud networking costs today and 10% say they need to start doing this.</p> <p>Not everyone is ready at the outset. Of the network teams that currently monitor cloud costs, 33% said they had to acquire new tools. And cost monitoring requires more than just collecting billing data. Sixty-one percent of network managers extrapolate cloud costs via network infrastructure data (e.g., SNMP MIBs), and 55% extrapolate costs by monitoring traffic.</p> <p>Network managers have several reasons for monitoring cloud networking costs. First, 64% are trying to prevent incurring unplanned charges. Also, 61% are trying to improve budget planning, and 59% apply this insight to billing or cost allocation, either to other parts of the IT organization or to lines of business.</p> <h3 id="modernize-your-tools">Modernize Your Tools</h3> <p>EMA predicts that at least one of your network management tools will let you down during your cloud journey. Seventy-four percent of network managers have told EMA that they had an incumbent network management or monitoring tool fail to meet their public cloud requirements. While 35% said they customized their existing tools to make things work, 28% had to buy new tools.</p> <p>Some network management vendors are simply failing to respond to the cloud. Twenty-eight percent of network managers say their incumbent tools failed because their vendors failed to create a roadmap for cloud support. However, the most common reasons for tool failure are cost (45%) and the complexity of the cloud support (44%). In other words, network tool vendors are asking for additional licensing, or vendors are adding cloud functionality that is simply too difficult for customers to use.</p> <p>The natural next step is to use native monitoring tools and services offered by cloud providers, like AWS CloudWatch. Ninety-eight percent of network managers use such tools today, but not everyone is sold on them. Only 55% say these monitoring services are essential: “I don’t find [CloudWatch] helpful. It requires a lot more fine-tuning and extrapolation to make it useful for [network operations],” a senior network architect at a large global media company told EMA.</p> <p>“Native tools that come from Azure are not ready for prime-time,” a network architect at a large North American retailer told EMA. “We can get the data we need, but you have to build the tools yourself. That’s no fun.”</p> <p>Resiliency is a big issue with native monitoring tools, according to 28% of network managers. These services simply go down when the cloud provider goes down.
Twenty-four percent of network managers say native cloud monitoring tools can’t provide an end-to-end view across the different regions or availability zones of a single cloud provider. And 24% also said that their existing network monitoring tools cannot effectively collect and integrate data from these cloud monitoring services, which limits their ability to correlate cloud insights with insights from their enterprise network.</p> <h3 id="final-thoughts-on-cloud-networking">Final Thoughts on Cloud Networking</h3> <p>The above provides just a few recommendations. EMA’s research into the subject of cloud network engineering and operations is extensive. For instance, we’ve explored how network teams leverage colo providers to improve hybrid cloud and multi-cloud networking, and we’ve identified how network teams are using automation tools to improve cloud network engineering and governance.</p> <p>However, the four tips above should be a good start for anyone. Be a team player, lay the groundwork by opening up your tools and data, keep an eye on cloud networking costs, and modernize your toolsets.</p><![CDATA[Network Engineering and Operations in the Multi-Cloud Era]]><![CDATA[In this post, we provide an overview of a new research report from EMA analyst Shamus McGillicuddy. The report is based on the responses of 250 enterprise IT networking professionals who note challenges and key technology requirements for hybrid and multi-cloud networking.]]>https://www.kentik.com/blog/network-engineering-and-operations-in-the-multi-cloud-erahttps://www.kentik.com/blog/network-engineering-and-operations-in-the-multi-cloud-era<![CDATA[Michelle Kincaid]]>Wed, 20 Mar 2019 07:00:00 GMT<p>A new research report, “<a href="https://www.kentik.com/resources/network-engineering-and-operations-in-the-multi-cloud-era">Network Engineering and Operations in the Multi-Cloud Era</a>,” looks at the networking challenges enterprises face as they move to cloud environments. The report, written by EMA analyst Shamus McGillicuddy (and sponsored by Kentik), draws from a survey of 250 enterprise IT networking professionals, and highlights key technology requirements for hybrid and multi-cloud networking, as well as the emerging tool strategies. In this post, we share some of the key findings.</p> <h3 id="hybrid-cloud-networking-challenges">Hybrid Cloud Networking Challenges</h3> <p>As we typically see when any new technology or IT trend moves past the R&#x26;D phase and into broad adoption, security risks seem to arise. Therefore, it’s no surprise to see that this study found “security is a huge issue for hybrid cloud networking.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/2WGPd3NRzuo0lb1NINWCHV/9b91740c3fda78399b5084699b4744ce/hybrid-cloud-challenges.jpg" class="image center" style="max-width: 750px;" /> <p>However, it’s also worth noting that network complexity and lack of visibility are top challenges for enterprise NetOps teams. From our perspective, that’s because today’s cloud environments are becoming more dynamic, with unpredictable DevOps deployments adding huge variability to load. Instrumentation, incident response, and service assurance are all on the list of tasks now made dramatically more complex.
To keep traffic flowing reliably, NetOps teams know they need pervasive, granular, real-time visibility, and the legacy tools they’re using fail to deliver.</p> <h3 id="cloud-networking-performance-management">Cloud Networking Performance Management</h3> <p>Within the research, EMA asked the 250 enterprise NetOps respondents to describe how their network performance management tools support them from the following three perspectives:</p> <ul> <li>Network performance to and from cloud applications</li> <li>Application and server performance in the cloud</li> <li>Network performance between cloud workloads</li> </ul> <p>“The chart (below) shows that slightly less than half of enterprises are able to attain all three aspects of visibility across their public and private cloud deployments,” wrote the analyst. “Smaller minorities are gaining insight into these various aspects of performance with one of either public or private cloud infrastructure, but not both. Overall, this means that roughly half of enterprises lack end-to-end performance management capabilities across their entire public and private cloud environment.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/2Mjkf59tkzUpZZDyywaswe/8df71b96f1d7ec756dc56a2135bcbb91/cloud-networking-perf-management.jpg" class="image center" style="max-width: 750px;" /> <h3 id="legacy-network-management-tools-failure">Legacy Network Management Tools Failure</h3> <p>The majority (74%) of enterprise networking professionals surveyed for this research noted the legacy network management tool or tools they were using failed to address their cloud requirements. “Thirty-nine percent were forced to find another solution, but 35 percent were able to customize the tools to meet their needs. Large enterprises struggle with this issue more often.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/6SOEytCOp0ZrwQiz38Upoi/a47389890206824c6158d947f11e5f0d/legacy-tools-failure.jpg" class="image center" style="max-width: 750px;" /> <p>Why do these tools fail? Cost and complexity, according to EMA.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5jdUDGOAMLro1twCMjTXqM/f5dd7b1e43ad9128a336f32afca33d0b/why-tools-fail.jpg" class="image center" style="max-width: 750px;" /> <h3 id="network-automation">Network Automation</h3> <p>“There are countless engineering tasks involved in building out hybrid cloud and multi-cloud networks,” writes analyst McGillicuddy in the report. “Given the rate of change in the public cloud, engineering automation is a major goal for network teams.”</p> <p>With cloud costs and complexity issues bubbling up, and legacy network management tools failing to support enterprise NetOps teams, network automation is also on the rise. However, NetOps teams note they do want to maintain manual control over some network engineering and configuration tasks, according to the EMA research.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6EgWwsVmxaE0tYzkOEmVzy/74cf9b6b41df577b20343382e90a9d17/network-automation.jpg" class="image center" style="max-width: 750px;" /> <p>The above data shows, “Network teams excel at applying automation to network security controls in the cloud. DDI management (DNS, DHCP, and IP addressing), network monitoring configuration, and routing are also very frequently automated.
However, network security monitoring is less often automated,” notes EMA.</p> <h3 id="a-modern-approach-to-multi-cloud-networking">A Modern Approach to Multi-Cloud Networking</h3> <p>To see more from EMA’s research, <a href="https://www.kentik.com/resources/network-engineering-and-operations-in-the-multi-cloud-era">download the full report here</a>. Additionally, you can learn how Kentik’s modern network analytics approach helps enterprises address the challenges of multi-cloud visibility, complexity, cost management, security, and much more by downloading our <a href="https://www.kentik.com/resources/cloud-visibility">Kentik cloud visibility solution brief</a> or via <a href="https://www.kentik.com/go/get-demo/">demo request</a>.</p><![CDATA[Kentik Takes a Leap Forward in DDoS Defense ]]><![CDATA[Today almost every type of organization must be prepared to mitigate DDoS attacks and proactively defend their networks. In this post, we discuss our newly added BGP Flowspec support in Kentik Detect. This enables our customers to have the granularity needed to detect attack traffic in real time and mitigate attacks faster. ]]>https://www.kentik.com/blog/kentik-takes-a-leap-forward-in-ddos-defensehttps://www.kentik.com/blog/kentik-takes-a-leap-forward-in-ddos-defense<![CDATA[Crystal Li]]>Wed, 13 Mar 2019 07:00:00 GMT<p><em><strong>Introducing Kentik’s New BGP Flowspec Support</strong></em></p> <img src="//images.ctfassets.net/6yom6slo28h2/3lJmx6OYDRHINv1b7AUBEa/0a77f7e624899252b10ce2fa1d4c0971/ddos-protection.png" style="max-width: 165px; margin-bottom: 30px" class="image right no-shadow" alt="ddos protection" /> <p>Distributed denial-of-service <a href="https://www.kentik.com/resources/penteledata-keeps-ddos-at-bay-with-kentik-detect/">(DDoS) attacks</a> have been a continuous threat since the advent of the commercial Internet. Today, they are becoming increasingly prevalent and cause major financial damage to all types of organizations. In addition to ever larger traffic volumes, attackers are also increasing their target diversity, with attack traffic simultaneously spanning data, applications, and infrastructure to increase the chances of success. With many attacks stemming from totally unpredictable events like political dissent, employee misconduct, or actions of third parties, every type of organization needs to be prepared to mitigate DDoS attacks and proactively defend their networks to ensure business continuity.</p> <p>Existing methods of DDoS mitigation present a number of challenges:</p> <ul> <li> <p>A high degree of <strong>coordination is required between customers and service providers</strong>. For example, during an attack, service provider network engineers need to be skilled at finding attacks and choosing an appropriate mitigation strategy, and they need proper access to the infrastructure to apply it.</p> </li> <li> <p>A common DDoS mitigation technique is <strong>Remotely-Triggered Black Hole (RTBH), which requires extensive pre-configuration</strong> of discard routes and/or uRPF on all edge routers. Any misconfiguration can lead to downtime or ineffective mitigation with business impact.</p> </li> <li> <p>For destination-based RTBH, <strong>the victim’s destination IP address becomes completely unreachable</strong>. While this minimizes collateral damage to adjacent customers and infrastructure, the victim is still down. The mitigation actually “completes the attack.” The victim can update DNS to point at a different IP address in an attempt to get their application back up.
However, if the attack is targeting the DNS hostname and not the IP address, the attack will just switch over to the new IP address.</p> </li> <li> <p><strong>Source-based RTBH only works for a small number of sources</strong>. It can’t scale to a large <a href="https://www.kentik.com/go/assessment-network-readiness/">network</a> perimeter or when the source of the attack is distributed across thousands of sources.</p> </li> </ul> <h3 id="what-is-bgp-flow-specification-flowspec">What is BGP Flow Specification (Flowspec)</h3> <p>Defined in <a href="https://tools.ietf.org/html/draft-ietf-idr-rfc5575bis-06" target="_blank">RFC 5575</a>, BGP Flow Specification (Flowspec) is a DDoS mitigation solution that allows you to rapidly deploy and propagate filtering and policing across a large number of BGP peers.</p> <p>The basic elements of Flowspec are:</p> <ol> <li><strong>Flowspec match</strong>:</li> </ol> <blockquote> <p>A key concept from BGP is NLRI, Network Layer Reachability Information, which describes the network/prefix that the given BGP route matches. There are <strong>12 NLRI attributes</strong> defined in BGP Flowspec. These attributes are added to the NLRI field within the BGP Update Message that’s advertised to peers and define the particular traffic that the Flowspec route will match.</p> </blockquote> <table style="float: left; margin-left: 4%; margin-right: 4%;"> <tr> <td><strong>Type-1</strong></td> <td>Destination Prefix</td> </tr> <tr> <td><strong>Type-2</strong></td> <td>Source Prefix</td> </tr> <tr> <td><strong>Type-3</strong></td> <td>IP Protocol</td> </tr> <tr> <td><strong>Type-4</strong></td> <td>Port</td> </tr> <tr> <td><strong>Type-5</strong></td> <td>Destination port</td> </tr> <tr> <td><strong>Type-6</strong></td> <td>Source port</td> </tr> <tr> <td><strong>Type-7</strong></td> <td>ICMP type</td> </tr> <tr> <td><strong>Type-8</strong></td> <td>ICMP code</td> </tr> <tr> <td><strong>Type-9</strong></td> <td>TCP flags</td> </tr> <tr> <td><strong>Type-10</strong></td> <td>Packet length</td> </tr> <tr> <td><strong>Type-11</strong></td> <td>DSCP</td> </tr> <tr> <td><strong>Type-12</strong></td> <td>Fragment</td> </tr> </table> <p>As you can see, these NLRI types go beyond what’s available in a traditional BGP NLRI, which contains a destination IP / prefix match only. With Flowspec you can match attack traffic much more granularly. You can even distinguish conversations flowing between individual pairs of IPs. 
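</p> <p>To make the match side concrete, here is a minimal sketch in Python that renders a Flowspec rule in the textual form used by open-source BGP daemons such as ExaBGP. The grammar varies by daemon and version, so treat the syntax as illustrative rather than definitive:</p> <pre><code># A minimal sketch of composing a Flowspec rule as text, modeled on ExaBGP's
# flowspec grammar (check your daemon's docs for the exact syntax). It matches
# a UDP reflection flood toward a single victim IP and rate-limits it to zero,
# rather than blackholing the victim outright as RTBH would.

def flowspec_rule(dst, proto, src_port, action="rate-limit 0"):
    """Render one Flowspec route: NLRI match terms plus a Flowspec action."""
    return (
        "announce flow route { "
        f"match {{ destination {dst}; protocol {proto}; source-port ={src_port}; }} "
        f"then {{ {action}; }} }}"
    )

# Type-1 (destination prefix), Type-3 (protocol), and Type-6 (source port)
# matches with a traffic-rate action -- far more surgical than a /32 blackhole.
print(flowspec_rule("192.0.2.10/32", "udp", 53))
</code></pre> <p>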
This improved granularity can dramatically reduce the kind of over-blocking that occurs with traditional RTBH.</p> <div style="clear: both;"></div> <ol start="2"> <li><strong>Flowspec action</strong>:</li> </ol> <blockquote> <p>Flowspec uses BGP Extended Communities to define actions that routers will take for traffic matching the NLRI attributes from above.</p> </blockquote> <table style="float: left; margin-left: 4%;"> <tr> <th>Action Type</th> <th>Examples</th> </tr> <tr> <td>traffic-rate</td> <td>set to 0 to drop all traffic</td> </tr> <tr> <td>traffic-action</td> <td>sampling</td> </tr> <tr> <td>redirect to VRF </td> <td>Change Route Target (RT)</td> </tr> <tr> <td>traffic-marking </td> <td>DSCP value</td> </tr> </table> <div style="clear: both;"></div> <h3 id="why-flowspec-is-better">Why Flowspec is Better</h3> <p>Here are three key reasons why Flowspec is better:</p> <ul> <li>Flowspec has the same <strong>granularity</strong> as access control lists (ACLs), since it’s based on n-tuple matching.</li> <li>Flowspec has the same <strong>automation capability</strong> as Remotely-Triggered Black Hole (RTBH) routes. It’s now much easier to propagate filters to all edge routers in large networks.</li> <li>Flowspec leverages <strong>BGP best practices and policy controls</strong>. Familiar best practices used for RTBH can also be applied to BGP Flowspec.</li> </ul> <h3 id="which-network-vendors-support-flowspec">Which Network Vendors Support Flowspec?</h3> <p>Flowspec has been around for a few years, so at this point, most of the major routing stacks that support BGP also support Flowspec. This includes open source routing daemons and commercial software from various networking vendors such as <a href="https://www.cisco.com/c/en/us/td/docs/routers/asr9000/software/asr9k_r5-2/routing/configuration/guide/b_routing_cg52xasr9k/b_routing_cg52xasr9k_chapter_011.html" target="_blank">Cisco</a>, <a href="https://www.juniper.net/us/en/training/jnbooks/day-one/networking-technologies-series/deploying-bgp-flowspec/" target="_blank">Juniper</a>, and <a href="https://www.nanog.org/sites/default/files/wed.general.trafficdiversion.serodio.10.pdf" target="_blank">Nokia</a> (former Alcatel-Lucent).</p> <h3 id="kentik-flowspec-support">Kentik Flowspec Support</h3> <p>To strengthen Kentik’s Security and DDoS solution, we have now added Flowspec as a mitigation method. In the first phase, Kentik supports:</p> <ul> <li>Manual Flowspec mitigation:</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/1pdYKXKWKXK0AhK5X4ubRD/2630d55552270f3c708791754c81e413/flowspec-setting1200.png" style="max-width: 900px;" class="image center no-shadow" /> <img src="//images.ctfassets.net/6yom6slo28h2/2c077zbEAeupVnBqnhFv2G/c5871732b0935257d813d911c2a4d98b/flowspec2-1400.png" style="max-width: 1000px;" class="image center" thumbnail /> <ul> <li>Or, you can configure Flowspec as an automated mitigation method to associate with a particular alerting policy:</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/4iIA3HO34FSsyYEr6j2DEZ/af126263bb590e9cac8245c21f44207f/icmp-attack.png" style="max-width: 600px;" class="image center" thumbnail /> <ul> <li>You can also populate the Flowspec match criteria from the traffic details of the alarm that triggered the mitigation.
This allows for dynamic, granular, fully-automated DDoS mitigation:</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/3r51H0VqLiSLyzO0rA0Mh7/3928f7288174f0598031dfbe7797cd0f/flowspec4-900w.png" style="max-width: 550px;" class="image center" thumbnail /> <p>Stay tuned for more Flowspec functionality with additional auto-matching intelligence and automation options. For additional technical details, check out our <a href="https://kb.kentik.com/Gc10.htm#Gc10-Flowspec_Mitigation">Flowspec Mitigation topic</a> in the Kentik Knowledge Base. Of note, Flowspec support is not turned on by default, so if you are interested in using this feature, please don’t hesitate to contact the Kentik Customer Success team at <a href="mailto:[email protected]">[email protected]</a>.</p><![CDATA[Kubernetes Networking 101]]><![CDATA[In this blog post, we provide a starting point for understanding the networking model behind Kubernetes and how to make Kubernetes networking simpler and more efficient. ]]>https://www.kentik.com/blog/kubernetes-networking-101https://www.kentik.com/blog/kubernetes-networking-101<![CDATA[Crystal Li, Jim Meehan]]>Wed, 06 Mar 2019 08:00:00 GMT<h2 id="a-getting-started-guide-for-cloud-native-service-communication"><em>A Getting-Started Guide for Cloud-native Service Communication</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/2W0NirLGL18TLzzzwSGQT0/a11fbaf294491e7cb54375530e9384a5/network-design.jpg" style="max-width: 370px; margin-bottom: 20px;" class="image right" /> <p>As the complexity of microservice applications continues to grow, it’s becoming extremely difficult to track and manage interactions between services. Understanding the network footprint of applications and services is now essential for delivering fast and reliable services in cloud-native environments. Networking is not evaporating into the cloud but instead has become a critical component that underpins every part of modern application architecture.</p> <p>If you are at the beginning of the journey to modernize your application and infrastructure architecture with <a href="https://www.kentik.com/solutions/usecase/kubernetes-networking/" title="Learn more about Kentik for Kubernetes networking">Kubernetes networking</a>, it’s important to understand how service-to-service communication works in this new world. This blog post is a good starting point to understand <strong>the networking model behind Kubernetes</strong> and <strong>how to make things simpler and more efficient</strong>.</p> <p>(For container cluster networking, there is the <strong>Docker model</strong> and the <strong>Kubernetes model</strong>. In this blog, we will focus on discussing <a href="https://kubernetes.io/" target="_blank">Kubernetes</a>.)</p> <h2 id="a-refresh-kubernetes-basics">A Refresh: Kubernetes Basics</h2> <p>Let’s first review a few basic Kubernetes concepts:</p> <p><strong>Pod:</strong> “A pod is the basic building block of Kubernetes,” which encapsulates containerized apps, storage, a unique network IP address, and instructions on how to run those containers. Pods can either run a single containerized app or run multiple containerized apps that need to work together.
(“Containerized app” and “container” will be used interchangeably in this post.)</p> <div style="max-width: 700px; margin: 0 auto 30px;text-align: right;"><img src="//images.ctfassets.net/6yom6slo28h2/7d8M8VJeXRXr1K5vkPROSq/e32483abb606b024ecab392404647afc/pod.png" style="max-width: 700px; padding: 0; margin-bottom: 0;" class="image center no-shadow" /><em>Image © Kubernetes</em></div> <p><strong>Node:</strong> A node is a worker machine in Kubernetes. It may be a VM or physical machine. Each node contains the services necessary to run pods and is managed by Kubernetes master components.</p> <div style="max-width: 550px;margin: 0 auto 30px; text-align: right;"><img src="//images.ctfassets.net/6yom6slo28h2/7g8uWJRQsj99Q49VaM3ykh/8de65ae5050e0fb88914623604ea2f9b/node.png" style="max-width: 550px; padding: 0;margin-bottom: 0;" class="image center no-shadow" /> <em>Image © Kubernetes</em></div> <p><strong>Service/Micro-service</strong>: A Kubernetes Service is an abstraction which defines a logical set of pods and a policy by which to access them. Services enable loose coupling between dependent pods.</p> <div style="max-width: 600px;margin: 0 auto 30px;text-align: right;"><img src="//images.ctfassets.net/6yom6slo28h2/1yKjblJoAuWSovBpHs47Pd/42ff0c5941e71de54126907592b2a4ca/service-microservice.png" style="max-width: 600px; padding: 0;margin-bottom: 0;" class="image center no-shadow" /><em>Image © Kubernetes</em></div> <h2 id="kubernetes-networking-types-requirements-and-implementations">Kubernetes Networking: Types, Requirements, and Implementations</h2> <p>There are several layers of networking that operate in a Kubernetes cluster. That means there are distinct networking problems to solve for different types of communications:</p> <ol> <li>Container-to-container</li> <li>Pod-to-pod</li> <li>Pod-to-service</li> <li>External-to-service</li> </ol> <p><em>Sometimes, 1 &#x26; 2 are grouped as “pod networking” and 3 &#x26; 4 as “service networking.”</em></p> <p>In the Kubernetes networking model, in order to reduce complexity and make app porting seamless, a few rules are enforced as fundamental <strong>requirements</strong>:</p> <ul> <li><strong>Containers</strong> can communicate with all other <strong>containers</strong> <em>without NAT</em>.</li> <li><strong>Nodes</strong> can communicate with all <strong>containers</strong> <em>without NAT</em>, and vice-versa.</li> <li>The IP that a container sees itself as is the <strong>same IP</strong> that others see it as.</li> </ul> <p>This set of requirements significantly reduces the complexity of network communication for developers and lowers the friction of porting apps from monolithic (VMs) to containers. However, this type of networking model with a flat IP network across the entire Kubernetes cluster creates new challenges for network engineering and operations teams.
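</p> <p>As a quick aside, you can see these rules at work in a running cluster: the sketch below uses the official Kubernetes Python client to print each pod’s cluster-routable IP, which is the same address the pod sees on its own interface.</p> <pre><code># A minimal sketch: list every pod's cluster-visible IP with the official
# Kubernetes Python client (pip install kubernetes). Per the model above,
# the IP printed here is the same IP the pod sees on its own interface.
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig
v1 = client.CoreV1Api()

for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    print(f"{pod.metadata.namespace}/{pod.metadata.name} "
          f"on node {pod.spec.node_name} -> {pod.status.pod_ip}")
</code></pre> <p>That flat, cluster-wide address space is exactly what makes the model simple for developers and challenging for network operators.</p> <p>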
This is one of the reasons we see so many network <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#how-to-implement-the-kubernetes-networking-model" target="_blank">solutions and implementations</a> that are available and documented in <a href="http://kubernetes.io/" target="_blank">kubernetes.io</a> today.</p> <p>While meeting the three requirements above, each of the following implementations tries to solve different kinds of networking problems:</p> <ul> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#aci" target="_blank"><strong>Application Centric Infrastructure (Cisco)</strong></a> - provides container networking integration for ACI </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#aos-from-apstra" target="_blank"><strong>AOS (Apstra)</strong></a> - enables Kubernetes to quickly change the network policy based on application requirements </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#big-cloud-fabric-from-big-switch-networks" target="_blank"><strong>Big Cloud Fabric (Big Switch Networks)</strong></a> - is designed to run Kubernetes in private cloud/on-premises environments </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#cni-genie-from-huawei" target="_blank"><strong>CNI-Genie (Huawei)</strong></a> - CNI is short for &ldquo;Container Network Interface;&rdquo; this implementation enables Kubernetes to simultaneously have access to different implementations of the Kubernetes network model in runtime </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#cni-ipvlan-vpc-k8s" target="_blank"><strong>Cni-ipvlan-vpc-k8s</strong></a> - enables Kubernetes deployment at scale within AWS </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#cilium" target="_blank"><strong>Cilium (Open Source)</strong></a> - provide secure network connectivity between application containers </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#contiv" target="_blank"><strong>Contiv (Open Source)</strong></a> - provides configurable networking (e.g., native l3 using BGP, overlay using VXLAN, classic L2, or Cisco-SDN/ACI) for various use cases </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#contrail-tungsten-fabric" target="_blank"><strong>Contrail / Tungsten Fabric (Juniper)</strong></a><strong> </strong> - provides different isolation modes for virtual machines, containers/pods, and bare metal workloads </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#danm" target="_blank"><strong>DANM (Nokia)</strong></a> - is built for telco workloads running in a Kubernetes cluster </li> <li> <strong><a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#flannel" target="_blank">Flannel</a></strong> - a very simple overlay network that satisfies the Kubernetes requirements </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#google-compute-engine-gce" target="_blank"><strong>Google Compute Engine (GCE)</strong></a> - all pods can reach each other and can egress traffic to the internet </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#jaguar" target="_blank"><strong>Jaguar (OpenDaylight)</strong></a> - provides overlay network with one IP 
address per pod </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#knitter" target="_blank"><strong>Knitter (ZTE)</strong></a> - provides the ability of tenant management and network management </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#kube-router" target="_blank"> <strong>Kube-router (CloudNativeLabs)</strong></a> - provides a Linux LVS/IPVS-based service proxy, a Linux kernel forwarding-based pod-to-pod networking solution with no overlays </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#l2-networks-and-linux-bridging" target="_blank"><strong>L2 networks and Linux bridging</strong></a> </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#multus-a-multi-network-plugin" target="_blank"><strong>Multus (a Multi Network plugin)</strong></a> - supports the Multi Networking feature </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#nsx-t" target="_blank"><strong>NSX-T (VMWare)</strong></a> - provide network virtualization for a multi-cloud and multi-hypervisor environment </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#nuage-networks-vcs-virtualized-cloud-services" target="_blank"><strong>Virtualized Cloud Services (VCS) (Nuage Networks)</strong></a> </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#openvswitch" target="_blank"><strong>OpenVSwitch</strong></a> - to enable network automation through programmatic extension, while still supporting standard management interfaces and protocols </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#ovn-open-virtual-networking" target="_blank"><strong>Open Virtual Networking (OVN) (Open vSwitch community)</strong></a> </li> <li> <strong><a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#project-calico" target="_blank">Project Calico (Open Source)</a> </strong>- a container networking provider and network policy engine </li> <li> <a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#romana" target="_blank"><strong>Romana (Open Source)</strong></a> - a network and security automation solution that lets you deploy Kubernetes without an overlay network</li> <li><a href="https://kubernetes.io/docs/concepts/cluster-administration/networking/#weave-net-from-weaveworks" target="_blank"><strong>Weave Net (Weaveworks)</strong></a> - a resilient and simple-to-use network for Kubernetes and its hosted applications </li> </ul> <p>Among all these implementations, <strong>Flannel</strong>, <strong>Calico</strong>, and <strong>Weave Net</strong> are probably the most popular ones that are used as network plugins for the <strong>Container Network Interface (CNI)</strong>. CNI, as its name implies, can be seen as the simplest possible interface between container runtimes and network implementations, with the goal of creating a generic plugin-based networking solution for containers. 
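</p> <p>For a feel of what a runtime actually hands to a CNI plugin, here is a sketch of a minimal network configuration of the kind found under <code>/etc/cni/net.d/</code> on a node, rendered from Python for convenience. Field names follow the CNI spec’s reference bridge and host-local plugins; the values are illustrative:</p> <pre><code># A minimal sketch of a CNI network configuration (normally a JSON file on
# each node). Field names follow the CNI spec's reference bridge and
# host-local IPAM plugins; all values here are illustrative.
import json

cni_conf = {
    "cniVersion": "0.4.0",
    "name": "k8s-pod-network",
    "type": "bridge",          # the plugin binary the container runtime invokes
    "bridge": "cni0",          # Linux bridge each pod's veth attaches to
    "isGateway": True,
    "ipMasq": False,           # no NAT between pods, per the model above
    "ipam": {
        "type": "host-local",  # per-node address allocation
        "subnet": "10.244.1.0/24",
    },
}
print(json.dumps(cni_conf, indent=2))
</code></pre> <p>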
As a <a href="https://www.sdxcentral.com/articles/news/cncf-takes-cni-container-interface-standard-under-its-wing/2017/05/" target="_blank">CNCF project</a>, CNI has now become a common industry-standard interface.</p> <p>Some quick and additional information about <strong>Flannel</strong>, <strong>Calico</strong>, and <strong>Weave Net</strong>:</p> <ul> <li><a href="https://coreos.com/flannel/docs/latest/" target="_blank"><strong>Flannel</strong></a>: Flannel can run using several encapsulation backends with <strong>VxLAN</strong> being the recommended one. Flannel is a simple option to deploy, and it even provides some native networking capabilities such as host gateways.</li> <li><a href="https://www.projectcalico.org/" target="_blank"><strong>Calico</strong></a>: Calico is not really an overlay network, but can be seen as a pure IP networking fabric (leveraging BGP) in Kubernetes clusters across the cloud. It is a network solution for Kubernetes and is described as simple, scalable and secure.</li> <li><a href="https://www.weave.works/docs/net/latest/overview/" target="_blank"><strong>Weave Net</strong></a>: Weave Net is a cloud-native networking toolkit that creates a virtual network to connect containers across multiple hosts and enable automatic discovery. Weave Net also supports VxLAN.</li> </ul> <h2 id="what-is-a-service-mesh">What is a Service Mesh?</h2> <p>Application developers usually assume that the network below layer 4 “just works.” They want to handle service communication in Layers 4 through 7 where they often implement functions like load-balancing, service discovery, encryption, metrics, application-level security and more. If the operations team does not deliver these services within the infrastructure, developers will often roll up their sleeves and fill this gap by <strong>changing their code in the application</strong>. The downsides are obvious:</p> <ul> <li>Repeating the same work across different applications</li> <li>Disorganized code base that blends both application and infrastructure functions</li> <li>Inconsistent policies between services for routing, security, resiliency, monitoring, etc.</li> <li>Potential troubleshooting nightmares</li> </ul> <p>…And that’s where service meshes come in.</p> <p>One of the main benefits of a service mesh is that it <strong>allows operations teams to introduce monitoring, security, failover, etc. without developers needing to change their code</strong>.</p> <p>So far, the most quoted definition for service mesh is probably from William Morgan, Buoyant CEO:</p> <blockquote> <p>“A service mesh is a <strong>dedicated infrastructure layer</strong> for <strong>handling service-to-service communication</strong>. It’s responsible for the <strong>reliable delivery of requests</strong> through the complex topology of services that comprise a modern, cloud-native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, <strong>without the application needing to be aware</strong>.”</p> </blockquote> <p>With service meshes, developers can bring their focus back to their primary task of creating business value with their applications, and have the operations team take responsibility for service management on a dedicated infrastructure layer (sometimes called “Layer 5”).
This way, it’s possible to maintain consistent and secure communication among services.</p> <h3 id="cross-cloud-service-meshes">Cross-cloud Service Meshes</h3> <p>A <em>cross-cluster</em> service mesh is a service mesh that connects workloads running on different Kubernetes clusters — and potentially standalone VMs. When connecting those clusters across multiple cloud providers, we have a <em>cross-cloud</em> service mesh. However, from a technical standpoint, there is little difference between a cross-cluster and a cross-cloud service mesh.</p> <p>The cross-cloud service mesh brings all the benefits of a general service mesh but with the additional benefit of offering a single view into multiple clouds. (Learn more about this topic in our post on <a href="https://www.kentik.com/blog/kubernetes-and-cross-cloud-service-meshes/" title="Kentik Blog: Kubernetes and Cross-Cloud Service Meshes">Kubernetes and Cross-cloud Service Meshes</a>.)</p> <h2 id="service-mesh-control-plane-and-data-plane">Service Mesh Control Plane and Data Plane</h2> <p>Familiar to most network engineers, service meshes have two planes: a control plane and a data plane (a.k.a. forwarding plane). If traditional networking planes focus on routing and connectivity, the service mesh planes focus more on application communication such as HTTP, gRPC, etc.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3RbQeCMAdHsnuKLq1ZUv4B/64684dc44d965e8a3e0203f93197ce04/buoyant.png" style="max-width: 800px;" class="image center" /> <p>The <strong>data plane</strong> basically touches every data packet in the system to make sure things like service discovery, health checking, routing, load balancing, and authentication/authorization work.</p> <p>The <strong>control plane</strong> acts as a traffic controller of the system that provides a centralized API that decides traffic policies, where services are running, etc. It basically takes a set of isolated stateless <strong>sidecar proxies</strong> and turns them into a distributed system.</p> <p>There are many service mesh solutions out there; the most-used are probably <strong>Istio, Linkerd, Linkerd2</strong> (formerly Conduit), and <strong>Consul Connect</strong>. All of those are 100% Kubernetes-specific. Below is a comparison table from the <a href="https://kubedex.com/istio-vs-linkerd-vs-linkerd2-vs-consul/" target="_blank">Kubedex community</a> if you want to get nerdier on this topic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2gDLrLIEFoftf9BIGvjqt1/56d6ce255add46925920a6884275545b/table.png" style="max-width: 800px;" class="image center" /> <h2 id="istio">Istio</h2> <p>As it’s probably the most popular service mesh architecture today, let’s take a closer look into Istio.</p> <p>Istio is an open platform that lets you connect, secure, control, and observe services in large hybrid and multi-cloud deployments. Istio has strong integration with Kubernetes, plus it has backing from Google, Lyft, and IBM. It also has good traffic management and security features.
All of these have helped Istio to gain strong momentum in the cloud-native community.</p> <p>The design goal behind the Istio architecture is to make the system “capable of dealing with services at scale and with high performance,” according to the <a href="https://istio.io/docs/concepts/what-is-istio/#design-goals" target="_blank">Istio website</a>.</p> <h3 id="using-istio-in-practice-with-kubernetes-clusters">Using Istio in Practice with Kubernetes Clusters</h3> <p>Now let’s take a look at simple examples of a Kubernetes cluster with and without Istio, so we can look at the benefits of installing Istio without any changes to the application code running in the cluster.</p> <p>Take this <a href="http://sockshop.gcp.kentik.io/" target="_blank">Sockshop</a> for example: it’s a demonstration e-commerce site that sells fancy socks online. There are a few microservices within this application: “catalog,” “cart,” “order,” “payment,” “shipping,” “user,” and so on. Because each microservice needs to talk to others via remote procedure calls (RPCs), the application forms a “mesh of service communication” over the network given the distributed nature of microservices.</p> <p>Without a dedicated layer to handle service-to-service communication, developers need to add a lot of extra code in each service to make sure the communication is reliable and secure.</p> <p>With Istio installed, all of the microservices will be deployed as containers with a sidecar proxy container (e.g., Envoy) alongside. A sidecar “<a href="https://istio.io/docs/concepts/what-is-istio/#why-use-istio" target="_blank">intercepts all network communication between microservices, then configures and manages Istio using its control plane functionality.</a>” With no changes to the application itself (even if all of its components are written in different languages), Istio is well-positioned to generate detailed metrics, logs, and traces about service-to-service communication, enforce policy, provide mutual authentication, and deliver thorough visibility throughout the mesh.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3wGhQvDUtW0adKVKSa8nRT/abeeca34e85a27ab97b4e0e9d7793754/service-mesh.png" class="image center" style="max-width: 700px;" /> <p>Plus, it’s straightforward to get Istio set up in just a few steps without extra configuration (See the <a href="https://istio.io/docs/setup/kubernetes/quick-start/" target="_blank">Quick Start Guide</a>). This ease of deployment has also been one of Istio’s differentiators compared to other service mesh architectures.</p> <h2 id="key-takeaways-about-kubernetes-networking">Key Takeaways about Kubernetes Networking</h2> <p>Networking that underlies service communication has never been so critical. Because modern applications are so agile and dynamic, networks are a key part of ensuring application availability.</p> <p>Service meshes and the Kubernetes Networking Model are a sweet combination that can make life easier for developers, and can make applications more portable across different types of infrastructure. Meanwhile, ops teams get a great way to make service-to-service communication easier and more reliable.</p> <p>At Kentik, we embrace cutting-edge technology such as Kubernetes and service meshes even as they continue to evolve.
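</p> <p>For example, the service-to-service telemetry that sidecars emit is easy to pull programmatically. Istio ships its standard request metrics to Prometheus, which can be queried with a few lines of Python; the sketch below assumes a locally port-forwarded Prometheus endpoint:</p> <pre><code># A minimal sketch: pull Istio's service-to-service request rates from the
# Prometheus instance that collects the mesh's telemetry. The metric name
# istio_requests_total is Istio's standard request counter; the URL assumes
# a local port-forward (an assumption; adjust for your environment).
import requests

PROM_URL = "http://localhost:9090/api/v1/query"
QUERY = ('sum(rate(istio_requests_total[5m])) '
         'by (source_workload, destination_workload, response_code)')

resp = requests.get(PROM_URL, params={"query": QUERY})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    rps = float(series["value"][1])
    print(f'{labels.get("source_workload")} -> {labels.get("destination_workload")} '
          f'[{labels.get("response_code")}]: {rps:.2f} req/s')
</code></pre> <p>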
Today, Istio metrics are one of many data sources ingested by the Kentik platform to enrich network flow data with service mesh context to provide complete observability for cloud-native environments.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1tiKRQBNH8rfW1lBx9GQ4V/008b6e59472a1368d82868af664a5cba/sankey-kubernetes.png" style="max-width: 750px;" class="image center" /> <p>Explore more about Kentik’s “<a href="https://www.kentik.com/solutions/infrastructure#container-networking">Container Networking</a>” use cases, <a href="https://www.kentik.com/solutions/usecase/kubernetes-networking/" title="Kubernetes Networking with Kentik Kube">Kentik Kube for Kubernetes Networking</a>, or <a href="https://www.kentik.com/contact/">reach out to us</a> to tell us about your infrastructure transformation journey into cloud-native architecture.</p><![CDATA[Viasat Taps Kentik for Real-Time Visibility of Critical Broadband Networks]]><![CDATA[Broadband providers must manage high infrastructure costs on a price-per-bit basis. Visibility into their network performance and security is critical. In this post, we dig into how that led Viasat, a global provider of high-speed satellite broadband services and secure networking systems, to tap Kentik for help.]]>https://www.kentik.com/blog/viasat-taps-kentik-real-time-visibility-critical-broadband-networkshttps://www.kentik.com/blog/viasat-taps-kentik-real-time-visibility-critical-broadband-networks<![CDATA[Michelle Kincaid]]>Thu, 28 Feb 2019 08:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/reyr8IFaYouyksWEEucG6/edbef6f1187c8de72d7c3f92089e2f4d/viasat-1.png" class="image right no-shadow" style="max-width: 180px; margin-bottom: 20px; margin-left: 30px;" /> <p>Today’s broadband providers share a common goal: to deliver fast, always-on, secure connectivity to customers. To execute on that goal, providers operate some of the largest, most complex networks. Satellite-based providers who serve hard-to-reach customers have an additional challenge: managing high infrastructure costs on a price-per-bit basis.</p> <p>Visibility into network performance and security is critical for these providers. Any network downtime, or even just the presence of unnecessary traffic, can result in significant financial loss for both the providers and their customers. That’s why high-speed satellite broadband provider Viasat turned to Kentik.</p> <p>The legacy tools Viasat used in the past weren’t capable of incorporating normal daily or weekly traffic variation into their detection algorithms, producing a high volume of both false positives and false negatives. With a continuous stream of traffic anomalies like denial-of-service attacks, Viasat knew it needed a fast, accurate, real-time network traffic analytics platform to prevent outages and cost exposure from carrying unnecessary traffic across expensive satellite links.</p> <p>Leveraging Kentik, Viasat now achieves:</p> <ul> <li>Accurate baselining for faster security investigations</li> <li>Powerful APIs for data customization and automation</li> <li>Out-of-the-box accessibility and ease of use for cross-department ROI</li> </ul> <p>For the details on how and why Viasat chose Kentik’s modern network analytics, as well as more information on the ROI Viasat now achieves, check out our <a href="https://www.kentik.com/resources/case-study-viasat"><strong>new case study</strong></a>.</p><![CDATA[Cloud Architecture Best Practices]]><![CDATA[Building a cloud application is like building a house. 
If you don’t at least acknowledge the industry’s best practices, it may all come tumbling down. Here we look at AWS’ Well-Architected Framework as a good starting point for building effective cloud applications and outline why having the right tools in place can make all the difference. ]]>https://www.kentik.com/blog/cloud-architecture-best-practiceshttps://www.kentik.com/blog/cloud-architecture-best-practices<![CDATA[Crystal Li]]>Tue, 26 Feb 2019 08:00:00 GMT<h3 id="using-the-right-tools">Using the Right Tools</h3> <p><a href="https://www.kentik.com/resources/cloud-architecture-best-practices-using-the-right-tools"><img src="//images.ctfassets.net/6yom6slo28h2/5MIs3v98SfHmkkiHQbtnrM/cbe8c4ba2d70815e48139143728430b2/cloud-arch-best-practices.png" style="max-width: 220px; margin-left: 30px; padding: 5px; margin-right: 30px; margin-bottom: 20px;" class="image right" /></a></p> <p>Building a cloud application is somewhat like building a house. You want a design that’s unique and beautiful, but you also want to follow best practices to <strong>make sure your infrastructure is correctly built</strong>.</p> <p>In the construction industry, principles to ensure a building is constructed effectively are documented in the <a href="https://en.wikipedia.org/wiki/International_Building_Code" target="_blank">International Building Code</a>. In the realm of cloud infrastructure, Amazon has pioneered the <a href="https://aws.amazon.com/well-architected/" target="_blank">AWS Well-Architected Framework</a>. The Framework is something like a building code, established as a <strong>set of best practices for cloud architecture</strong>.</p> <p>If you’re building a cloud application, the Well-Architected Framework is a great place to start. It’s also a reminder that if you want a secure, high-performing, resilient, and efficient infrastructure, you must have the right tools in place. Otherwise, many of the cost, simplicity, and agility benefits of cloud applications won’t be achieved.</p> <p>That’s why we created a <a href="https://www.kentik.com/resources/cloud-architecture-best-practices-using-the-right-tools"><strong>white paper</strong></a> detailing the topic. In it, we:</p> <ul> <li>Outline the AWS Well-Architected Framework: Principles, “Five Pillars,” and best practices</li> <li>Discuss how the Framework applies to networking — a critical part of cloud architecture</li> <li>Highlight how Kentik can help you effectively implement these best practices</li> </ul> <p>If you want to get your cloud architecture right, check out our <a href="https://www.kentik.com/resources/cloud-architecture-best-practices-using-the-right-tools"><strong>new white paper</strong></a> for principles, practical advice and recommendations. We’d love to help you build solid cloud deployments that best serve your business goals.</p><![CDATA[How to Design Self-service Network Analytics into Your Services]]><![CDATA[If you are a service provider or enterprise providing your customers with traffic visibility into their networks, there are things you must consider.
In this post, we will look into design considerations for self-service network analytics portals.]]>https://www.kentik.com/blog/how-to-design-self-service-network-analytics-into-your-serviceshttps://www.kentik.com/blog/how-to-design-self-service-network-analytics-into-your-services<![CDATA[Akshay Dhawale]]>Thu, 14 Feb 2019 08:00:00 GMT<p>We recently blogged on why <a href="https://www.kentik.com/blog/network-insight-is-essential-for-communications-service-consumers/">network insight is essential for communications service consumers</a>, including why self-service portals are increasingly important for today’s networks and network operators. In that blog post, we looked at self-service portals from an end-user standpoint.</p> <p>At Kentik, we believe in curated, actionable insight catering to different user personas across organizations. So, in this post, we will look into these portals in terms of design considerations for service providers.</p> <p>If you are a service provider or enterprise and have decided to provide your customers with traffic visibility into their networks, amongst many other decisions you may find yourself thinking about:</p> <ol> <li>Differentiation</li> <li>Speed of delivery and deployment</li> </ol> <h3 id="differentiation">Differentiation</h3> <div style="max-width: 350px; float: right; padding-left: 30px;text-align: right;"><img src="//images.ctfassets.net/6yom6slo28h2/6AyDAHAyYWsdcJDZf1jBoH/ef68dfe542ca8d5b69b967b0dbe3c69c/product-differentiation2.png" style="max-width: 350px" class="image no-shadow" /><em>Image © feedough.com</em></div> <p>Differentiation is the outcome of identifying what precise service(s) you are offering to your customers. It may also refer to the different tiers of customers to whom you are offering a similar service.</p> <p>One level of differentiation is categorizing the customers (i.e. <strong>customer-oriented</strong>). I refer to “customer” in the modern network sense, where a customer can be, and oftentimes is, an internal service organization or cost center accountable for internal chargebacks. For instance, in an internal IT environment, the IT team may be responsible for the upkeep of internal networks. It is not uncommon for the same team to also be accountable to application owners for application issues. In this case, differentiation means separating out these entities on the network so they can be monitored and managed individually based on business requirements. In another example, a managed service provider (MSP) may be offering multiple services to customers, such as visibility into traffic to/from your datacenters, satellite office monitoring, or clean-pipe services. In any case, the essence of such differentiation is being able to tag the identified business context by looking at network traffic. <strong>The crucial aspect here is for the technology stack to be able to support this business logic correlation with network traffic in real-time and at scale.</strong></p> <p>The other type of differentiation is <strong>service-level oriented</strong>. That is, looking at how to offer the same type of service in different tiers. For instance, not every DDoS protection customer will need traffic scrubbing. Some, for cost reasons, might prefer to blackhole the traffic while others need a comprehensive mitigation solution with advanced post-mortem reporting. 
Or, you may want to slice visibility reports into tiers for customers at different price points.</p> <p>Oftentimes, the challenge here is to come up with a product offering that can be sliced into premium vs. basic tiers through both the content offered and the way it is delivered. Regardless of how easy or complex the required setup is, <strong>it is important that the underlying technology stack is inherently multi-tenant</strong>. This allows developers and ops teams to reutilize a set of services for similar customers without having to build a separate copy of the application for each instance.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7Aj6ecIIMIRW9siqj4LoQJ/02a186d808b38468ea97f6ac5708e143/single-multi-tenant.png" style="max-width: 800px" class="image center no-shadow" /> <h3 id="speed-of-delivery-and-deployment">Speed of Delivery and Deployment</h3> <p>Kentik is built with a highly scalable and distributed backend architecture which can ingest millions of flows per second and correlate with business logic in near real-time. Now, <a href="https://www.kentik.com/blog/empower-your-customers-with-self-service-analytics/">My Kentik</a> can offer multi-tenancy over the same backend (for nerd-approval rating, it is sub-multi-tenancy, as Kentik by itself is a multi-tenant SaaS offering). You can think of My Kentik as a framework which allows for multi-tenancy (tenants, in this case, are the different customers) over the data we store with full retention. This allows us to avoid unnecessary replication of the data while still offering a per-tenant instance with full cardinality.</p> <div style="max-width: 900px; margin: 0 auto 30px;"><img src="//images.ctfassets.net/6yom6slo28h2/lUVBnwlV2rPb0HuCq47x9/98fb4021eb35d6cbf0cce7cd21ac8236/mykentik-1000w.png" style="max-width: 900px" class="image center" /><em> Above you can see white-labeled web portal settings. Acme Telecommunications has set up My Kentik for its customers, ‘Acme Inc’ and ‘Another Tenant’, each with their own views and alerts.</em></div> <p>Another aspect of offering customer visibility is the ease and speed of onboarding new customers. Deployment at scale can come with its own challenges — for instance, allowing a parent admin to assume per-tenant views without creating unique logins for the same person on every tenant environment; or edits on one visualization not carrying over into all other tenant views.</p> <div style="max-width: 1000px; margin: 0 auto 30px;"><img src="//images.ctfassets.net/6yom6slo28h2/62Gm7Qcz1uMu9xiGnoJZFz/7eb156e84d9bba866eebcd1d6ab725dd/acme-whitelabel.png" style="max-width: 1000px" class="image center" /><em>Above, you can see a tenant Acme Inc has a white-labeled service portal, which includes admin-selected views and alerts.</em></div> <p>How does Kentik solve this problem? At the admin level, Kentik holds a central repository or library of alert policies, views, reports, and dashboards. A group of these can be put together as a library package and can be deployed to tenants in just one click (a hypothetical API sketch of this workflow follows below). Editing or augmenting a dashboard can be (optionally) designed to be auto-deployed for other tenant instances, too. This means the lead time for onboarding new customers is minimal, and system-wide upgrades for a multi-tenant application are mostly hassle-free.</p> <p>There are a few more challenges that come with offering self-service portals beyond the technical architecture selection. 
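</p> <p>Before we get to those: here is a minimal Python sketch of what the one-click package deployment described above could look like in API terms. Everything specific in it (the endpoint path, payload shape, auth header, and tenant IDs) is hypothetical and invented for illustration; it is not Kentik’s actual API.</p> <pre>
import requests

# Hypothetical illustration only: the endpoint, payload shape, and auth
# header below are invented for this sketch; they are not Kentik's API.
API_BASE = "https://portal.example.com/api/tenants"
HEADERS = {"X-Auth-Token": "REDACTED"}

package = {
    "dashboards": ["peering-overview"],
    "alert_policies": ["ddos-baseline"],
}

def deploy_package(tenant_ids, package):
    """Push one shared library package to every tenant, rather than
    maintaining a separate copy of each view per tenant."""
    for tid in tenant_ids:
        resp = requests.post(f"{API_BASE}/{tid}/packages",
                             json=package, headers=HEADERS)
        resp.raise_for_status()  # fail loudly if any tenant rejects it

deploy_package(["acme-inc", "another-tenant"], package)
</pre> <p>The point of the sketch is the shape of the workflow: one centrally maintained package, many tenants, and no per-tenant copies to keep in sync.</p> <p>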
In the next blog, we will explore more about the actual content that is reported, Kentik’s approach to tiering, and the operational pieces of such a service offering.</p> <p>For more information, please see the <a href="https://kb.kentik.com/Cb15.htm">My Kentik Portal</a> topic in the Kentik Knowledge Base.</p> <p><em>My Kentik is available now for all Kentik customers, with a limited number of tenants. For additional details about pricing packages for additional tenants, please contact your Kentik account team or email <a href="mailto:[email protected]">[email protected]</a>.</em></p><![CDATA[A Demo: Pinpointing an Abnormal Traffic Spike in Under 5 Minutes]]><![CDATA[For NetOps and SecOps teams, alarms typically mean many more hours of work ahead. That doesn’t have to be the case. In this post, we look at how to troubleshoot an abnormal traffic spike and get to root cause in under five minutes.]]>https://www.kentik.com/blog/demo-pinpointing-abnormal-traffic-spike-under-5-minuteshttps://www.kentik.com/blog/demo-pinpointing-abnormal-traffic-spike-under-5-minutes<![CDATA[Anthony Haraguchi, Crystal Li]]>Thu, 07 Feb 2019 08:00:00 GMT<p>Imagine you are an ops guy or gal and it’s your turn in the on-call rotation. Alarms go off in the middle of the night, flooding from everywhere, via Slack, email, and text messages. This scenario always leaves you wishing you could spend as little time as possible to discover the problem, fix it, and go back to sleep.</p> <p>Many readers who understand the complexity of today’s networks are shaking their heads and saying, “Good luck! It’s going to be a long night for that person.” However, things turn out differently when you’re working with Kentik’s modern analytics for all your networks.</p> <p>In this post, we look at how to troubleshoot an abnormal traffic spike and get to root cause in under five minutes (instead of several hours!).</p> <h3 id="1-start-from-a-dashboard">1. Start from a dashboard</h3> <p>In Kentik, every alert is linked to a pre-built dashboard that’s customized for the policy that generated the alert. These dashboards include all the relevant charts and graphs to ensure an efficient troubleshooting process. Think about how much time is saved by having all the data you need pre-assembled in one place, rather than manually pulling it from siloed sources. Kentik dashboards are your starting point to quickly browse through and get a sense of the overall health status of your network, see the big picture, and spot any potential problems.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7FnWSjMx0hcbyFBxnN6Tz7/e6333f05ff1f323f477c14e6e1946767/dashboard.gif" style="max-width: 800px;" class="image center" /> <h3 id="2-data-explorer-takes-you-farther">2. Data Explorer takes you farther</h3> <p>Troubleshooting networks with tools of the past is like cutting a four-foot sheet of plywood with a hand saw. By the time you’re done, you’re exhausted and swearing. Kentik’s Data Explorer is like a powerful table saw, slicing through network problems in seconds. Once you locate a spike on a dashboard, you can click on the graph, which takes you directly into the Data Explorer. The Data Explorer is composed of query controls on the left and a visualization with a table on the right, which together show the query output. 
This makes it simple to run ad-hoc queries (or chains of queries) with each new result appearing in under one second.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Ef8D1l0SYotmxpr9JZuwM/9a498123c348ebc4467ce4f0a77fd685/data-explorer.gif" style="max-width: 800px;" class="image center" /> <h3 id="3-drill-down">3. Drill down</h3> <p>A big part of the troubleshooting process is eliminating unrelated results so that the signal can stand out from the noise, allowing you to focus only on the traffic related to the problem. “Include” and “exclude” make this process fast and easy. Here’s how: find the problem traffic as a row in the table, open the drop-down menu for that row (on the very right of the image below), click “include” (or “exclude”) and then “run query.” Now the graph and table are redrawn with only the selected traffic included (or without that traffic in the case of “exclude”). Repeating this process a few times is a super fast way to discover root causes and resolve issues.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7stjnHx3xyocQxiu48Rmu8/f02cbaf0c6d980e96a1eb1c80e3960a6/drill-down.gif" style="max-width: 800px;" class="image center" /> <h3 id="4-zoom-in-and-add-additional-dimensions">4. Zoom in and add additional dimensions</h3> <p>The next step is to narrow the time range to pinpoint when the issue happened. Kentik’s time series visualizations are interactive, so this step is as simple as clicking and dragging across the narrower time range you’d like to see. When you zoom in, time series data is aggregated automatically. For example, in a “30-day” graph/chart, each data point represents one hour; in a “1-day” graph/chart, each point is 10 minutes; and in a “1-hour” graph/chart, each point is one minute. Now with the exact timeline, you may be able to correlate the spike to a possible incident/event that directly or indirectly caused it.</p> <p>Clicking anywhere in the dimension control (top left) will take you to the dimension selector. You can add or change dimensions to pivot the data, adding additional insight to help find the root cause. Once you run the updated query, you can see where the problem traffic is sourced from, which AS it comes from, and any other information that might be useful.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6vbvsCBwtunLSEE7BucW3I/dabc67762cab2558682d67d6fc84f4a4/zoom-in.gif" style="max-width: 800px;" class="image center" /> <h3 id="5-follow-up-actions">5. Follow-up actions</h3> <p>In the four simple steps above, you’ve seen how to quickly narrow down an issue and see why it happened, where the traffic originated, and much more. However, don’t call it a day just yet. There are a few more things to do:</p> <ul> <li>Share the view with other stakeholders</li> <li>Think about the action you will take — either to mitigate the risk or prevent it from happening the next time</li> </ul> <p>And Kentik can help with that too. 
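</p> <p>As a side note on the automatic time aggregation described in step 4: below is a minimal Python sketch of how a client might pick a rollup interval for a requested time window. The breakpoints mirror the examples above but are illustrative only; they are not Kentik’s actual rollup rules.</p> <pre>
from datetime import timedelta

def rollup_interval(window: timedelta) -> timedelta:
    """Pick a data-point width for a query window.

    Mirrors the step-4 examples (30 days -> 1-hour points,
    1 day -> 10-minute points, 1 hour -> 1-minute points),
    with breakpoints chosen here for illustration only.
    """
    if window >= timedelta(days=7):
        return timedelta(hours=1)
    if window >= timedelta(hours=6):
        return timedelta(minutes=10)
    return timedelta(minutes=1)

print(rollup_interval(timedelta(days=30)))   # 1:00:00
print(rollup_interval(timedelta(days=1)))    # 0:10:00
print(rollup_interval(timedelta(hours=1)))   # 0:01:00
</pre> <p>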
For a complete demo, please watch the video below.</p> <div style="position: relative; overflow: hidden; padding-top: 56.25%;margin: 0 auto;"><iframe src="https://www.youtube.com/embed/xDF6JomSpz8?rel=0" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen style="position: absolute; top: 0; left: 0; width: 90%; height: 90%; border: 0; border: 0px; margin-left: 5%; margin-right: 5%;"></iframe></div> <p>If you’d like to explore Kentik directly, you can <a href="#signup_dialog">sign up for a free trial</a> to experience troubleshooting network issues at lightning speed.</p> <p><em>Note: This is a demo environment.</em></p><![CDATA[6 “Wow!” Moments from Kentik Customers]]><![CDATA[“Wow!” moments don’t happen randomly. At Kentik, they reflect product value, differentiators, and the hard work of our engineering team. In this post, we look at a few of those moments when the insight delivered by Kentik made customers say, “Wow, that was amazing!” ]]>https://www.kentik.com/blog/6-wow-moments-from-kentik-customershttps://www.kentik.com/blog/6-wow-moments-from-kentik-customers<![CDATA[Crystal Li]]>Thu, 31 Jan 2019 08:00:00 GMT<p>Metrics and statistics tell us a lot about how our customers engage with Kentik — things like how often they log in, which features they’re using, and how many support cases they’ve opened. But sometimes we can immediately tell we’re delivering a great product experience from feedback that’s more qualitative. We call the best examples of this feedback “Wow!” moments — the moments where Kentik delivers an insight that makes customers say, “Wow, that was amazing!”</p> <p>“Wow!” moments don’t happen randomly. They reflect product value, differentiators, and the hard work of our engineering team. These are the “magic metrics” that make Kentik stand out from the crowd by being significantly better than other products.</p> <p>In this blog post, we picked six “Wow!” moments shared with us by Kentik customers. These are based on experiences from when customers first used Kentik, or later when Kentik helped resolve a tricky network problem. All of them highlight how network data insights revealed by Kentik made their jaws drop.</p> <h3 id="1sankey-diagram-for-bgp-analytics">1. Sankey Diagram for BGP Analytics</h3> <p>Kentik’s BGP analytics illustrate traffic per BGP path, with each hop represented as a separate column. The sizes of the bars and lines are in proportion to the volume of traffic through each path and hop. This allows customers to quickly grasp both the big picture and details of Internet routing.</p> <p>We caught many priceless looks as users first saw our Data Explorer displaying visualizations of paths, neighbors, transits, origins, countries and the traffic distribution among them.</p> <div style="max-width: 800px; margin: 0 auto; font-size: 20px;"><img src="//images.ctfassets.net/6yom6slo28h2/5E4nTb1D6pQa6HmoNFaeng/e41ddcea3f3175c70821b16056c1d311/1-sankey-bgp.gif" style="max-width: 800px;" class="image center" /> <i>The above captures how much traffic flows from which AS (AS number), through which country, to which AS.</i></div> <h3 id="2-time-range-navigation-and-aggregation">2. Time Range Navigation and Aggregation</h3> <p>Kentik automatically aggregates time series data in graphs and charts into steps that make sense for the currently selected time range. 
For example, in a 30-day graph, each data point represents 1 hour; in a 1-day graph each point is 10 minutes, and in a 1-hour graph, each point is 1 minute. The data visualization is dynamic and interactive, so you can quickly zoom in by selecting an area of interest. This makes it very easy to drill down on narrower time ranges where you see abnormal behavior, such as traffic spikes, to get more detail.</p> <div style="max-width: 800px; margin: 0 auto; font-size: 20px;"><img src="//images.ctfassets.net/6yom6slo28h2/1MMsNLJjcEA6S4GfXt1Ei0/295fdfea1723d012551c770b79da8596/2-time-range.gif" style="max-width: 800px;" class="image center" /> <i>When a spike happens, you can zoom into the smallest data roll-up unit for the exact time when it happened.</i></div> <h3 id="3data-pivot-and-navigation">3. Data Pivot and Navigation</h3> <p>Kentik retains all the data it receives, so there’s no limit to the granularity or detail that can be displayed. Aside from the ability to drill down into narrower time ranges, the Data Explorer makes it easy to pivot the data by adding or changing dimensions and filters to pinpoint exactly the view needed to understand an incident. It’s also easy to switch between visualization types to highlight meaningful data.</p> <div style="max-width: 800px; margin: 0 auto; font-size: 20px;"><img src="//images.ctfassets.net/6yom6slo28h2/35yyZaN1OGcjZ40L8dNSRV/d423181f2afecbc66d358f0bffa8cd10/3-datapivot.gif" style="max-width: 800px;" class="image center" /> <i>Above shows that you can laser-focus on one of the many flows that you are interested in, then add more dimensions and even change the chart type to present the data differently for better situational awareness.</i></div> <h3 id="4api-first">4. API First</h3> <p>Nearly everything in Kentik can be accessed by a simple REST API, which a lot of our customers find super helpful. Using APIs to integrate Kentik with other systems is a very powerful mechanism to leverage both the capabilities of the Kentik platform and the traffic data stored with Kentik. The “Show API Call” feature in the Data Explorer makes it simple to get a pre-built API call example for any query, allowing users to fetch either the raw data or a pre-built image file.</p> <div style="max-width: 800px; margin: 0 auto; font-size: 20px;"><img src="//images.ctfassets.net/6yom6slo28h2/34sp9X4J6seipntx2B4My8/b037e069d511a59cdc9053f3f06496fb/4-api-first.gif" style="max-width: 800px;" class="image center" /> <i>Above shows Kentik can generate API calls for charts or data via cURL or using JSON input.</i></div> <h3 id="5alert-investigation-dashboards">5. Alert Investigation Dashboards</h3> <p>When a user gets a notification from a Kentik alert policy on Slack, e-mail, or some other channel, they can view the new alerts in the “Alerting” section of the Kentik UI. Each alert policy is associated with a specific dashboard — either built-in or custom. Clicking on the dashboard icon next to an alarm opens up the associated dashboard with filters pre-set for all of the attributes associated with the alert. 
This saves steps and time and allows users to instantly see all the breakdowns that are useful for investigating each type of alarm.</p> <div style="max-width: 800px; margin: 0 auto; font-size: 20px;"><img src="//images.ctfassets.net/6yom6slo28h2/5Wky1WMaL1w45SwFZUMivP/512f7ce9dab16951ad74e9b63d078e56/5-alert.gif" style="max-width: 800px;" class="image center" /> <i>Above shows you can open a dashboard directly associated with the raised alert to shorten the time-to-troubleshoot significantly.</i></div> <h3 id="6guided-mode-dashboards">6. Guided Mode Dashboards</h3> <p>Customers who have seen guided mode dashboards really like the flexibility it provides to create customized workflows within Kentik. The guided mode allows users to enter a single value or variable and instantly pivot multiple perspectives on the same dashboard, without repeatedly setting up query options. Guided mode dashboards extend the value of network data by making it accessible for semi-technical or non-technical users.</p> <div style="max-width: 800px; margin: 0 auto; font-size: 20px;"><img src="//images.ctfassets.net/6yom6slo28h2/5IrCoQVeabFnheC48S0dMy/71ff94ee4c614a05328f54e50beeee0b/6-guidedmode.gif" style="max-width: 800px;" class="image center" /> <i>Above shows that by entering different AS numbers in the guided mode, you can do peer investigation for different ASes that matter to you.</i></div> <h3 id="summary">Summary</h3> <p>Yesterday’s “Wow!” quickly becomes today’s ordinary. That’s why Kentik is continually innovating new features and functionality, and working to create more new “Wow!” moments for our customers. Please check back for additional posts in this series.</p> <p><em>Special thanks to Kentik’s talented solution architects and our customers who share these stories.</em></p><![CDATA[Report: It's a Multi-cloud, Cost-containment World]]><![CDATA[Today we released a new report: “AWS Cloud Adoption, Visibility & Management.” The report compiles an analysis based on a survey of 310 executive and technical-level attendees at the recent AWS user conference. Simply put, we found: It’s a multi-cloud, cost-containment world.]]>https://www.kentik.com/blog/report-multi-cloud-cost-containment-worldhttps://www.kentik.com/blog/report-multi-cloud-cost-containment-world<![CDATA[Michelle Kincaid]]>Thu, 24 Jan 2019 08:00:00 GMT<p><em><strong>AWS Cloud Adoption, Visibility &#x26; Management</strong></em></p> <img src="//images.ctfassets.net/6yom6slo28h2/1Y29zHg2bSty9AQjtTFHIS/3ade00542276be6ced380dedac703aa2/aws-report-cover.jpg" style="max-width: 190px; padding: 5px; margin-right: 30px; margin-bottom: 20px;" class="image right" /> <p>Today we released a new report: “<a href="https://www.kentik.com/resources/aws-cloud-adoption-visibility-management"><strong>AWS Cloud Adoption, Visibility &#x26; Management</strong></a>.” The report compiles an analysis based on the survey responses of 310 executive and technical-level attendees at the recent AWS user conference, re:Invent.</p> <p>We conducted this survey because we continue to hear from customers and industry friends that cloud providers have taken away the huge overhead of building, maintaining and upgrading physical infrastructure. However, at the same time, the rapid expansion of public cloud use, as well as multi-cloud, hybrid cloud and cloud-native environments, has created new challenges for visibility and cost control. 
It’s our hope that by releasing the report and analysis, we’ll help draw attention to these challenges and highlight what we believe to be the drivers and the solutions.</p> <p>Here are some of the key findings in our report:</p> <ul> <li> <p><strong>Multi-cloud is real, and more common than hybrid-cloud</strong>. The clear majority of respondents (58%) indicated they were actively using more than one of the big-three cloud service providers, i.e. AWS, Azure, and Google Cloud. While the largest share of the group (40%) actively use two cloud service providers, nearly a fifth of respondents (18%) use all three. Surprisingly, only 33% of respondents reported using hybrid-cloud, with at least one cloud service provider as well as some type of traditional infrastructure (i.e. company-owned or co-location / third-party data centers).</p> </li> <li> <p><strong>A common multi-cloud combo: AWS + Microsoft Azure</strong>. It’s no surprise that when surveyed at an AWS user conference, 97% of our survey respondents reported that their organization actively uses AWS. However, more than one-third (35%) of respondents said their organization also actively uses Azure, and 24% reported using both AWS and Google Cloud Platform.</p> </li> <li> <p><strong>The biggest cloud challenge: Cost management (depending on who you ask)</strong>. Overall, nearly 30% of the survey-takers said their biggest cloud management challenge is cost management, with security taking second place with 22% of responses. However, when looking at the responses by title, challenge rankings shifted. (We provide a <a href="https://www.kentik.com/resources/aws-cloud-adoption-visibility-management">deeper analysis by title within the full report</a>.)</p> </li> <li> <p><strong>There is an influx of monitoring tools; no clear leader</strong>. While the largest percentage of respondents (54%) reported having a cloud monitoring tool for visibility into their cloud applications, other tools are being used to attempt to achieve total visibility, including: log management tools (48%), application performance management (APM) tools (40%), open source tools (34%), network performance management (NPM) tools (25%), and more.</p> </li> <li> <p><strong>At least two tools are used to try to gain cloud visibility</strong>. Respondents also noted using monitoring tools together in various combinations for cloud application monitoring. Fifty-nine percent (59%) of respondents reported using at least two tools for visibility into their cloud applications. Thirty-five percent (35%) of respondents use three or more tools for this.</p> </li> <li> <p><strong>Spreadsheets are still being used to understand AWS spend</strong>. Fifty-six percent (56%) of respondents say they use built-in tools within AWS (e.g. CloudWatch) to track and manage cloud services costs. Another 30% use third-party commercial tools. However, 10% of respondents reported that their organization still uses “manual tracking via spreadsheets” to understand what drives their AWS data transfer costs.</p> </li> <li> <p><strong>A big gap exists in use of AWS VPC Flow Logs for cloud visibility</strong>. While VPC Flow Logs have been available as a way for organizations to gain more granular, real-time cloud visibility, adoption ranges widely. 
While nearly a third (32%) indicated they are actively using VPC Flow Logs, even more (37%) indicated that they knew nothing about them at all.</p> </li> </ul> <p><a href="https://www.kentik.com/resources/aws-cloud-adoption-visibility-management">Download the full report</a> with analysis of the key findings. You can also find out more about Kentik’s modern approach to cloud visibility by downloading our <a href="https://assets.ctfassets.net/6yom6slo28h2/64quKIV7kkcUCs2uYIymuw/5980d1aee800705b65effed500fa8288/KentikCloudVisibility.pdf">Cloud Solution Brief</a>.</p><![CDATA[Enterprise & Service Provider Trends in Cloud Transformation]]><![CDATA[In a new report, *2019 Trends in Cloud Transformation,* 451 Research analysts dig into seven trends happening amidst the move to the cloud, as well as recommendations, and clear winners and losers for each of the trends. In this post, we provide an overview and a licensed copy of the report for you. ]]>https://www.kentik.com/blog/enterprise-and-service-provider-trends-cloud-transformationhttps://www.kentik.com/blog/enterprise-and-service-provider-trends-cloud-transformation<![CDATA[Michelle Kincaid]]>Wed, 16 Jan 2019 08:00:00 GMT<h3 id="the-winners-the-losers--the-recommendations"><em>The Winners, The Losers &#x26; The Recommendations</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/7vSxdJ4RS6L3aRxsSd9VPz/30b741caaf2434442d08835e4b5a7f13/451-2019cloudtrends.jpg" style="max-width: 190px;" class="image right" /> <p>“Every organization is becoming a service provider, creating digital services to better engage with customers, partners and suppliers and compete in the digital economy,” according to a <a href="https://www.kentik.com/resources/451-research-2019-trends-in-cloud-transformation">new 451 Research report</a>. “The currency of this economy is software: Every company will need to raise its software IQ to transform successfully.”</p> <p>In the report, <a href="https://www.kentik.com/resources/451-research-2019-trends-in-cloud-transformation">2019 Trends in Cloud Transformation</a>, the 451 Research analysts dig into the above statements and more, as part of their seven trends happening amidst the move to the cloud. They also offer recommendations and clear winners and losers for each of the trends. Here is a snapshot of three of the trends they’re seeing:</p> <h4 id="trend-1-optimization-as-a-service-will-emerge">Trend 1: Optimization as a service will emerge</h4> <ul> <li><strong>The reason</strong>: “Enterprises face increasing complexity, which results in spiraling costs and operational challenges,” according to 451 Research.</li> <li><strong>Winners include</strong>: “managed service providers of third-party or hybrid clouds”</li> <li><strong>Losers include</strong>: “enterprises that don’t act now”</li> </ul> <h4 id="trend-2-focus-will-shift-to-cloud-native-outcomes-delivered-as-a-service-by-intelligent-platforms">Trend 2: Focus will shift to cloud-native outcomes delivered as-a-service by intelligent platforms</h4> <ul> <li><strong>The reason</strong>: “While most companies making the move to cloud infrastructure are initially happy to lift and shift applications to the cloud, it soon becomes clear that while such an approach reduces infrastructure costs, it is not in itself helping the organization to become more agile and flexible,” according to the analysts. 
“That requires adopting a more radical migration to software designed for cloud delivery.”</li> <li><strong>Losers include</strong>: “enterprises that stick with a lift-and-shift to cloud application strategy”</li> </ul> <h4 id="trend-3-cloud-will-work-as-advertised-in-the-era-of-consumption">Trend 3: Cloud will work “as advertised” in the era of consumption</h4> <ul> <li><strong>The reason</strong>: “Cloud management has become very complicated because of the jumble of clouds, containers and venues,” report the analysts. “Making cloud work ‘as advertised’ to deliver key benefits such as speed and scale is getting harder.”</li> <li><strong>A recommendation</strong>: “Focus on cloud governance and the ongoing optimization of deployments as the real prize in this journey, rather than cloud readiness.”</li> </ul> <p>At Kentik, we licensed this 451 Research report because we’re seeing many of these same trends emerging for enterprises and service providers, and we wanted to share the analysts’ recommendations for success. That’s also why we’re continuously innovating on our modern network analytics platform. It is our goal to support our enterprise and service provider customers throughout their entire cloud transformation (and beyond), providing granular visibility, not just into the cloud, but via the insight and analytics needed to run all networks: old and new; the ones our customers own and the ones they don’t.</p> <p><strong>More Trends</strong>: Read the analysts’ deeper dive and check out all of the other trends, and cloud transformation winners and losers in the <a href="https://www.kentik.com/resources/451-research-2019-trends-in-cloud-transformation">451 Research report here</a>.</p><![CDATA[Five Cloud Deployment Mistakes That Will Cost You]]><![CDATA[There are five network-related cloud deployment mistakes that you might not be aware of, but that can negate the cloud benefits you’re hoping to achieve. In this post, we provide an overview of each mistake and a guide for avoiding them all. ]]>https://www.kentik.com/blog/five-cloud-deployment-mistakes-that-will-cost-youhttps://www.kentik.com/blog/five-cloud-deployment-mistakes-that-will-cost-you<![CDATA[Crystal Li, Jim Meehan]]>Wed, 09 Jan 2019 08:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/1IU7xpz8sYWSSaKA6caKeE/60578aa4dbc407e0e961e989d0735ab4/globe-mesh.svg" style="max-width: 200px;" class="image right no-shadow" /> <p>With physical network infrastructure moving to the cloud, it’s easy to think that networks are instantly faster, more reliable, limitless, and cheap. This fallacy is one of the biggest drivers of cloud deployment mistakes.</p> <p>Cloud technology can be complex though, especially when it comes to networking. It’s full of new concepts like VPCs, cloud interconnects, and multiple availability zones and regions. 
Combined with the rapid pace of deployment and lack of visibility into how cloud resources are being utilized, it’s easy to make costly mistakes.</p> <p>Here are five network-related cloud deployment mistakes that you might not be aware of, but that can negate the cloud benefits that you’re hoping to achieve:</p> <ul> <li><strong>Mistake #1</strong> - Duplicate services and unknown dependencies</li> <li><strong>Mistake #2</strong> - Traffic or request hair-pinning</li> <li><strong>Mistake #3</strong> - Unnecessary inter-region traffic</li> <li><strong>Mistake #4</strong> - Missing compression</li> <li><strong>Mistake #5</strong> - Internet traffic delivery</li> </ul> <p>The good news is: There’s always an opportunity to learn from these mistakes. That’s why we’ve put together a <a href="https://www.kentik.com/resources/five-cloud-deployment-mistakes-that-will-cost-you"><strong>new whitepaper</strong></a> detailing each of the above mistakes, how it happens, and how to avoid letting any of these mistakes happen to you.</p> <p>If you recognize any of the above as familiar experiences (or even if you don’t), <strong><a href="https://www.kentik.com/resources/five-cloud-deployment-mistakes-that-will-cost-you">check out the whitepaper</a></strong> to set things straight. We’d love to help you undo (and learn from!) some of the most common cloud deployment mistakes out there right now.</p><![CDATA[How Pandora Leverages VPC Flow Logs + Kentik for Google Cloud Visibility]]><![CDATA[Music streaming service Pandora recently announced its migration to Google Cloud Platform (GCP). For the NetOps and SecOps teams behind the migration, we know cloud visibility is now more important than ever. That’s why we caught up with Pandora’s James Kelty to tell us how the company plans to maintain visibility across its infrastructure, including GCP, with help from Kentik. ]]>https://www.kentik.com/blog/how-pandora-leverages-vpc-flow-logs-kentik-for-google-cloud-visibilityhttps://www.kentik.com/blog/how-pandora-leverages-vpc-flow-logs-kentik-for-google-cloud-visibility<![CDATA[Michelle Kincaid]]>Wed, 19 Dec 2018 08:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/7IBRNRiwW4uooM0oeiY4yk/a1b44b8ae7056ff3fd4371e3e82cac46/jameskelty-pandora.jpg" style="max-width: 320px; padding:20px;" class="image right no-shadow" alt="Pandora" /> <p>At Kentik, we’re big fans of Pandora. Not just because of its non-stop, personalized music, or because they’re a long-time Kentik customer, but also because of the modern, cloud-native approach they’ve taken to running the infrastructure behind their booming music streaming service.</p> <p>As part of its always-innovating mentality, just last month Pandora announced its migration to Google Cloud Platform (GCP). <a href="https://engineering.pandora.com/investing-in-pandoras-core-differentiators-1fbec589e1d" target="_blank">In a blog post</a>, the company said the move would support its scientists, developers, and analysts, who “traverse about 6 PB of data with tools such as Hive, Spark, and Presto every day to gain insights and improve the product.”</p> <p>For Pandora’s NetOps and SecOps teams behind the migration, we know cloud visibility is now more important than ever. 
It helps ensure cost-effective, scalable infrastructure and fast investigation into any potential issues to keep services fast and highly available across regions.</p> <p>That’s why we caught up with <strong>James Kelty, senior director of network operations and engineering at Pandora</strong>, to tell us how the music service plans to maintain visibility across its infrastructure, GCP now included. (For background on why Pandora initially decided to leverage Kentik, you can watch James tell the story <a href="https://www.kentik.com/resources/pandora-drives-superior-network-performance-with-kentik/">in this short video case study</a>.)</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p><strong>KENTIK: Why is cloud visibility important for Pandora?</strong></p> <div class="pullquote right" style="max-width: 280px;">"Cloud is an extension of infrastructure with new and differing rules, and cloud visibility allows us to tie both parts together."</div> <p><strong>JAMES at PANDORA:</strong> You always want your developers to be autonomous. However, a big challenge with cloud is that those developers don’t think in terms of types and size of infrastructure. Rather, they think more about having the ability to use what’s available, as quickly as they need to move. When you introduce cloud — and the metered services that come along with it — that can get a company into trouble. Cloud visibility supports cost control by showing us what’s happening across infrastructure and where developers are putting workloads and applications. No more billing surprises.</p> <p>Additionally, it’s important for alerting and security. When it comes to large looming infrastructure that can exist anywhere in world, keeping track of internal-to-external traffic flows and vice versa (e.g. seeing unexpected flows between specific areas and alerting on that) helps to make sure people aren’t circumventing any security policies.</p> <p>Simply put, cloud is an extension of infrastructure with new and differing rules, and cloud visibility allows us to tie both parts together.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p><strong>KENTIK: What is the advantage of VPC Flow Logs for cloud visibility?</strong></p> <p><strong>JAMES at PANDORA:</strong> Flow data extends our ability to see into and across GCP, and our use of VPC Flow Logs is focused on making sure our on-prem and GCP assets are speaking to each other in a way we’ve planned for.</p> <p>With VPC Flow Logs, we can see what our services are doing and if they’re talking to each other across regions. We can also drill into why they may or may not be talking to each other and improve services by making them more autonomous. That helps us tie together our on-prem and cloud platforms and have visibility into traffic in between them.</p> <p>For us, there is no demarcation line between on-prem and cloud. Knowing what is happening on and between both — especially as developers and engineers start using services based on availability rather than location — will save on cost and headaches.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p><strong>KENTIK: How does Kentik provide cloud visibility to Pandora?</strong></p> <div class="pullquote right">"Having the built-in capability to consume VPC Flow Logs was so much better than having to try to tie them together manually. 
With Kentik, we were able to immediately and very simply start pulling the logs in from GCP."</div> <p><strong>JAMES at PANDORA:</strong> In terms of the migration itself, GCP’s elastic infrastructure made it an easy decision for us. On top of that, we knew Google got VPC Flow Logs and their impressive granularity right from the beginning. So when we heard Kentik added support for Google VPC Flow Logs, we wanted to check it out.</p> <p>Since we’re standardized on Kentik as our flow platform, having the built-in capability to consume VPC Flow Logs was so much better than having to try to tie them together manually. With Kentik, we were able to immediately and very simply start pulling the logs in from GCP.</p> <p>As just one example of how we’re leveraging Kentik for cloud visibility, we have a nice dashboard in Kentik’s platform that shows us one of our projects using three zones in a single region. With the added visibility, we were able to see a lot (kilobytes per second) of cross-talking happening between the zones. Each egress to another zone has a metered cost. With Kentik’s ability to drill into the issue, we were able to reduce the unintentional flows, saving us both instantly and in the long term.</p> <p>We also put together a dashboard within Kentik’s platform to show us both our GCP internal traffic and on-prem traffic to see what applications have moved across our infrastructure. This allows us to see what applications need to be more cloud-native and helps us to determine whether to push the application into the cloud or not based on the costs associated with it.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p><strong>KENTIK: Who at Pandora benefits from cloud visibility from Kentik?</strong></p> <p><strong>JAMES at PANDORA:</strong> Our network operations and security operations teams use the Kentik platform for both on-prem and cloud visibility. While our developers don’t directly use Kentik (yet), when they have specific questions about cloud interconnects, we can easily spin up a report to give them a quick view of the information they’re after, including what their traffic looks like in the very moment they spot a potential issue.</p> <div style="border-top: 1px solid #d5dee0; width: 100%; padding-bottom: 20px;"></div> <p><strong>KENTIK: What advice do you have for the cloud providers or those who are seeking cloud visibility?</strong></p> <p><strong>JAMES at PANDORA:</strong> GCP VPC Flow Logs are more granular than other data sources, and therefore, they’re more helpful for network operators than other means of attempting to monitor network activity in cloud environments. If competing cloud platforms could come up with the level of granularity that Google Cloud has, we’d all be better off.</p><![CDATA[How to Build Effective Dashboards: Key Principles and 5 Examples for Cloud Monitoring ]]><![CDATA[If you work in this industry, chances are you use dashboards. But how do you build an effective dashboard that goes beyond pretty graphs and actually provides insight? 
In this post, we offer key principles for dashboard creation and share how we build them at Kentik, with examples of our dashboards for cloud monitoring.]]>https://www.kentik.com/blog/how-to-build-effective-dashboards-key-principles-examples-cloud-monitoringhttps://www.kentik.com/blog/how-to-build-effective-dashboards-key-principles-examples-cloud-monitoring<![CDATA[Crystal Li, Dan Rohan]]>Tue, 18 Dec 2018 08:00:00 GMT<p>If you work in this industry, chances are you use a dashboard of some sort. Dashboards are, after all, intended to “provide at-a-glance views of KPIs (key performance indicators) relevant to a particular objective or business process,” according to Wikipedia. In other words, dashboards help teams focus on <strong>what matters</strong>.</p> <p>Building a dashboard is not hard, especially when you have lots of data available and many choices on how to visualize it. However, to build a dashboard that <strong>provides true insight</strong> is NOT easy, especially with large datasets and many visualization choices. Common questions that are difficult to answer include:</p> <ul> <li>What are the most important views?</li> <li>Which diagrams should I include to answer the question(s) at hand?</li> <li>Which type of chart provides the best understanding? (Line? Bar? Pie? Stacked? Sunburst?)</li> <li>Which filters will properly scope the view?</li> <li>Should I embed relevant alerts into my dashboard?</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/2t2Q24YPikQUgWeIQcMAKu/4b4d1bac7125264b55978e0a61d1114a/sockshop-anim.gif" class="image center" style="max-width: 800px;" /> <p>At Kentik, we focus on analytics and visibility across all networks, and our platform provides powerful dashboarding and customization capabilities. That’s why we wanted to share our philosophy on effective dashboards and some best practices we’ve learned from working with many customers.</p> <h2 id="general-principles-for-effective-dashboards">General Principles for Effective Dashboards</h2> <p>No matter what kind of dashboard you are building, there are a few universal principles for creating effective dashboards.</p> <h4 id="know-your-audience">Know your audience</h4> <p>Just like when you’re preparing for a speech or presentation, one of the first questions should be “Who is the audience?” The same logic applies here. Some example audiences we often see are:</p> <ul> <li><strong>Management and executives</strong> usually want an overview rather than thousands of drill-down details. They typically care more about business impact than in-depth technical analysis and metrics.</li> <li><strong>NetOps and NetEng teams</strong> now face enormous infrastructure complexity. With projects that are often deployed across physical data centers and multiple cloud regions and zones, they often want to see comprehensive operational pictures to help visualize capacity, performance, throughput, and other metrics.</li> <li><strong>SecOps teams</strong> in their day-to-day jobs are all about control, policy, hardening, and hunting. 
A dashboard that contains detailed security analytics and <a href="https://www.kentik.com/kentipedia/network-forensics/" title="Network Forensics and the Role of Flow Data in Network Security">forensics</a> is essential for investigating potential and historical threat activity.</li> </ul> <p>Interviews and communication with stakeholders will help you nail down the audience if you are tasked with building dashboards for others.</p> <h4 id="whats-the-purpose">What’s the purpose?</h4> <p>Think about the objective and outcomes of the proposed dashboard(s). <strong>Are they strategic? Operational? Analytical?</strong> Or maybe a multi-purpose dashboard that serves multiple stakeholders. Even within a single audience, dashboards that are used daily may have different purposes from dashboards used for monthly or quarterly review.</p> <h4 id="tell-the-story">Tell the story</h4> <p>Making sense of data requires more than just grouping data logically. Real insight comes from telling the true story CLEARLY. Ideally, your dashboard should guide viewers through the key points in the storyline. You may want to spend some time thinking about how to structure, organize, and tell a story about the data that maximizes the impact.</p> <p>Every story has a conclusion, and your dashboard should too. If they’re not providing actionable insights, dashboards devolve into a set of pretty graphs without meaning. To tell a story with an effective conclusion, think about the next actions users will take after viewing the dashboard.</p> <h4 id="keep-it-succinct">Keep it succinct</h4> <p>Less is more — no one likes busy dashboards that are overloaded with data. Knowing exactly what you need in your dashboard is most important. A great-looking dashboard means nothing without the right information (or with too much). Choose only the most important components.</p> <h2 id="5-dashboards-for-aws-and-gcp-cloud-visibility">5 Dashboards for AWS and GCP Cloud Visibility</h2> <p>To highlight some of the considerations above, let’s review some dashboards that Kentik has recently added to support the launch of our cloud visibility solution:</p> <h4 id="1-gcp-traffic-trends-and-overview">1. “GCP Traffic Trends and Overview”</h4> <p>This dashboard was built with both executive and technical consumers in mind. By surfacing interesting traffic patterns and trends, <strong>this dashboard answers the question: “What’s happening on my network <em>today</em>?”</strong> The panels on this dashboard give the user a quick comparison of ingress vs. egress traffic in both a time-series and total-comparison format, which are useful starting places for network investigations or at-a-glance anomaly detection.</p> <p>The dashboard also addresses two key questions that are critical to understanding cloud network activity, presented in visually interesting ways. The Sankey diagram shows the top inter-zone and inter-region flows found within your GCP deployment, which can drive significant data transfer costs if not monitored carefully. The Sunburst diagram on the right shows how top GCP instances are serving traffic out to the Internet with an interactive breakdown showing source ports that you’re delivering services from, the machine or machines hosting that service, followed by the zone, subnet, destination ASNs, and even top destination IPs that are consuming those services. 
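</p> <p>Under the hood, diagrams like these come from multi-dimensional rollups of flow records. As a rough illustration, here is a minimal Python sketch that sums bytes over an ordered list of dimensions, the way rings of a sunburst or columns of a Sankey are built. The records and field names are hypothetical, not an actual VPC flow log schema.</p> <pre>
from collections import defaultdict

# Hypothetical flow records; real VPC flow logs carry many more fields.
flows = [
    {"src_zone": "us-east1-b", "subnet": "10.0.1.0/24", "dst_asn": 15169, "bytes": 120000},
    {"src_zone": "us-east1-b", "subnet": "10.0.2.0/24", "dst_asn": 2906, "bytes": 80000},
    {"src_zone": "us-west1-a", "subnet": "10.1.1.0/24", "dst_asn": 15169, "bytes": 45000},
]

def rollup(records, dims):
    """Sum bytes for each unique combination of the requested dimensions."""
    totals = defaultdict(int)
    for rec in records:
        key = tuple(rec[d] for d in dims)
        totals[key] += rec["bytes"]
    return totals

# One ring order for a sunburst: zone first, then destination ASN.
for key, total in sorted(rollup(flows, ["src_zone", "dst_asn"]).items()):
    print(key, total)
</pre> <p>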
If you’ve not checked out Sunburst visualizations yet, they’re a great way to visualize multi-dimensional traffic data!</p> <img src="//images.ctfassets.net/6yom6slo28h2/15FHuYKXdiCOmA6kaUEQqm/eb2a29d1eff2d2d3131f88a21d557676/gcptraffictrends.png" class="image center no-shadow" alt="GCP Traffic" style="max-width: 800px;" /> <h4 id="2-aws-traffic---30-day-overview">2. “AWS Traffic - 30 Day Overview”</h4> <p>The panels on this dashboard are geared toward financially-minded stakeholders within an organization, but they have cross-department utility as well. <strong>The main goal is to understand which elements of AWS network activity are driving the greatest transport costs</strong>. In case you’ve not already heard, per-GB network costs in the cloud can be up to 10x what you’re used to paying for typical Internet access from network carriers. So tracking and fully understanding what’s behind these charges are critical for achieving your cloud-spending targets.</p> <p>Following some of the design principles discussed earlier in this article, we configured this dashboard (and all of the dashboards you’ll see highlighted here) to tell a story with a coherent and organized structure. The dashboard is divided into three columns “below the fold.” The left column shows traffic entering your AWS infrastructure from the Internet. The middle column shows traffic that remains internal to AWS, and the right column shows egress traffic.</p> <p>We chose to display 30 days worth of data, aggregated by the total amount of data transferred over the last 30 days, rather than a data rate, such as bits-per-second. We believe this is the best way to accurately compare your cloud bills with Kentik data and determine exactly which team or project is responsible for the traffic that’s racking up huge charges on your end-of-month invoices.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5KTUZqD3WwScwo4UuM4ooo/52cf2a5322f70e51944a190246be6738/30dayoverview.png" class="image center no-shadow" alt="AWS Traffic" style="max-width: 800px;" /> <h4 id="3-gcp-activity-and-performance-overview">3. “GCP Activity and Performance Overview”</h4> <p><strong>This dashboard was truly designed for the techie at heart.</strong> It follows the same design pattern mentioned above, but takes it a step further by showing the ingress, internal, and egress traffic perspectives grouped by multiple dimensions and metrics, including project, subnet, zone, and VM. You’ll also notice that every other row alternates to show these critical boundary points by both bits-per-second and also the ever-useful GCP latency metric, which combines application and network latency for every TCP flow.</p> <p>In order to provide a familiar look and feel, we chose to render this data using stacked line charts, which might remind you of tools used in the past, like Cacti and MRTG. Of course, you can dive much deeper into these graphs than you can with the typical MRTG chart. Any of them can be used as great starting places for network spelunking. Just find a graph that shows the anomaly you’re interested in, or one that features the subnet, VM, or project you’re trying to analyze. Then you can click into each graph and drill down with the Kentik Data Explorer.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3mRFYh3c2Ag2IGoiM0myqu/2b4c4672703e99768ccc9f3e614c1eaf/gcp-activity-performance.png" class="image center" alt="GCP Activity and Performance" style="max-width: 800px;" /> <h4 id="4-gcp-customer-analytics">4. 
“GCP Customer Analytics”</h4> <p><strong>Who are your customers? What is the quality of their experience with your service or application? Are they being served by the best GCP infrastructure?</strong> These are the questions we help answer with our GCP Customer Analytics dashboard. This report was constructed with an eye toward organizations that are serving traffic to their customers directly from Google. It provides a super useful visualization that depicts counts of unique destination IP addresses on top of a Geo-heatmap. This gives the user quick insight into which parts of the world are being served by their GCP infrastructure.</p> <p>The dashboard goes further by showing you these destination geolocations matrixed with source zones, and color-coded by highest activity in order to help you understand if you’re serving traffic out of the appropriate zones and, by extension, if you might want to consider expanding your GCP footprint. By using the round trip + application latency metric, you can use this dashboard to infer the experience your customers are having and establish alert policies that notify you of potential scaling or performance issues that should be addressed.</p> <p>One other great thing about all of these preset dashboards is that they can all be used to drill down even further into your data. For example, pick almost any panel on any dashboard and select an element on the graph. In this case, we’ve selected the panel titled “<em>Inbound Latency</em> by Destination Project” and clicked on the ‘boreal-analog-7’ item.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1lMkNcjzpiCW42kEekec2C/2519437e6b7b78b8d4db55c5f121a870/inbound-latency.png" class="image center" alt="" style="max-width: 500px;" /> <p>When we click on the “Drill Down” link, we can then use the “GCP Project Explorer” dashboard to isolate the traffic that was most likely the cause of the latency spike in the graph shown above:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2yXz45A8wQ8kKq2WgQa6Au/e77dcaed10fe5f0e5c143b0bf0bd3be8/boreal-analog.png" class="image center" alt="" style="max-width: 800px;" /> <h4 id="5-aws-security-analysis">5. “AWS Security Analysis”</h4> <p>AWS VPC Flow Logs include a unique feature that we absolutely love: a field that indicates the firewall action for each flow. For background, AWS allows users to define network ACLs and firewall rules as part of their VPC architectures. This allows you to specify precisely which traffic is allowed into your environment and thus count when traffic <em>does</em> or <em>does not</em> match your allowed ruleset. This is, of course, extremely helpful in <strong>identifying targeted attacks against your AWS infrastructure</strong>.</p> <p>We chose to break this analysis out into a few helpful visualizations. First, we show the top geolocations that are originating denied flows. Then, we break out those denied flows by destination port on your AWS network. 
And finally, we show you which IPs were the source of the most denied flows.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2HPS0iG3scyEcWAcw2iuc4/01ca99b3aba408f80a8ddf34b906d2b1/aws-security-analysis.png" class="image center" alt="" style="max-width: 800px;" /> <h2 id="summary">Summary</h2> <p>Much has been said about the power of large datasets, but the fact is, without proper organization and a proper storyline, it remains “just data.” Effective dashboards go a long way toward making network analytics data useful and actionable for stakeholders across any organization.</p> <p>To learn more about Kentik’s dashboards for cloud visibility, please see the Kentik for <a href="https://www.kentik.com/blog/kentik-for-google-cloud-vpc-flow-logs/">Google Cloud VPC Flow Logs blog post</a> and <a href="https://kb.kentik.com/Fc12.htm">KB article</a>, as well as the <a href="https://www.kentik.com/blog/how-kentik-helps-assess-aws-visibility/">How Kentik Helps Assess AWS Visibility blog post</a> and <a href="https://kb.kentik.com/Fc14.htm">KB article</a>. Or, if you’d like to explore these dashboards directly, you can <a href="#signup_dialog">sign up for a free trial</a>.</p> <p style="font-size: 94%"><strong>Contributors: </strong><a href="https://www.kentik.com/blog/author/drohan">Dan Rohan, Product Manager</a> & <a href="https://www.kentik.com/blog/author/crystalli">Crystal Li, Senior Product Marketing Manager</a></p><![CDATA[Network Visibility for Cloud-Native Architecture]]><![CDATA[At the recent AWS re:Invent, we heard many attendees talking about the push for cloud-native to foster innovation and speed up development. In this post, we take a deeper dive into what it means to be cloud-native, as well as the challenges and how to overcome them. ]]>https://www.kentik.com/blog/network-visibility-for-cloud-native-architecturehttps://www.kentik.com/blog/network-visibility-for-cloud-native-architecture<![CDATA[Crystal Li, Jim Meehan]]>Tue, 11 Dec 2018 08:00:00 GMT<p>At the recent AWS re:Invent conference, we heard many attendees talking about cloud-native architecture and container-first approaches to application development. The discussions focused not only on leveraging cloud-native architecture to foster innovation but also on speeding up development for the attendees’ growing businesses. With all of the cloud-native buzz, we wanted to provide a deeper dive into the topic.</p> <h2 id="first-what-does-it-mean-to-be-cloud-native">First: What Does It Mean to be Cloud-Native?</h2> <p>Cloud-native is a term used to describe a modern software architecture, deployment, and operations model. It originated with early adopters of public cloud infrastructure, but it’s now gaining popularity across application development in general. It’s appealing because it speeds innovation by breaking large software projects into smaller components and creates portability by abstracting away infrastructure dependencies. The end goal is to deliver apps at the pace a business needs.</p> <p>In theory, a cloud-native architecture promises the best user experience with the least resources, which we see in four specific areas:</p> <ul> <li> <p><strong>Deployment agility</strong> - This is the idea that applications should be independent of the infrastructure they run on, so development teams can focus on creating value for the business rather than infrastructure dependencies. 
This provides the agility to quickly distribute or move workloads between in-house and cloud infrastructure in response to changing business conditions.</p> </li> <li> <p><strong>Rapid scalability</strong> - One of the characteristics of cloud-native architectures is “elasticity,” which unlocks autoscaling — the ability to scale down when the workload is small, and scale up without any reprovisioning during peak time for the business.</p> </li> <li> <p><strong>Operational efficiency</strong> - With increased scale, manual provisioning, management, and troubleshooting processes become unworkable. Automation becomes essential for managing workflows and reducing human errors.</p> </li> <li> <p><strong>High performance</strong> - With cloud-native architecture, it’s possible to leverage massive computing resources (e.g. CPUs and GPUs), and it lowers the barrier to enhanced digital performance for anyone (especially for AI and ML workloads which eat up a ton of computing resources).</p> </li> </ul> <p>Some <strong>concepts and technologies</strong> that are usually associated with cloud-native architectures:</p> <ul> <li> <p><strong>DevOps</strong> - In order to speed up deployments and reduce operational risk, development and operations teams started collaborating more frequently and brought in system automation. Eventually, these Dev and Ops teams merged into “DevOps” teams. DevOps teams now go hand-in-hand with cloud-native architecture, enabling streamlined application deployment and faster time to market.</p> </li> <li> <p><strong>Containers and Container Orchestration</strong> - <a href="https://about.gitlab.com/2017/11/30/containers-kubernetes-basics/" target="_blank">Gitlab</a> has a good definition for containers: “A container is a method of operating system-based virtualization that allows you to securely run an application and its dependencies independently, without impacting other containers or the operating system. Each container contains only the code and dependencies needed to run that specific application, making them smaller and faster to run than traditional VMs.” Nowadays, many applications are run in “containerized” form in the cloud. And with larger containerized applications, orchestration becomes necessary to automate, scale, and manage them. <a href="https://kubernetes.io/" target="_blank">Kubernetes</a> is by far the most popular orchestrator.</p> </li> <li> <p><strong>Microservices</strong> (sometimes called <strong>microservice architecture</strong>) - This refers to structuring an application with a collection of loosely coupled, lightweight services, each implementing a specific, granular piece of the application. Development teams can iterate or scale each microservice independent of the others, speeding development. Microservice architectures are also platform-agnostic, allowing services to be written in different languages or deployed across multiple types of infrastructure for maximum flexibility.</p> </li> <li> <p><strong>CI/CD (a.k.a. Continuous Integration/Continuous Delivery)</strong> - CI/CD isn’t totally new (a Wikipedia definition can be found <a href="https://en.wikipedia.org/wiki/CI/CD" target="_blank">here</a>) — it’s basically a software engineering practice of constantly deploying new, small code changes with automated fallout recovery mechanisms in place. 
Avoiding infrastructure monoliths is fundamental to achieving CI/CD.</p> </li> <li> <p><strong>Observability</strong> - Given the increasing complexity of microservices, there has been greater emphasis on the need for modern monitoring practices to gain better insight into how applications perform and operate. Observability incorporates concepts of pervasive instrumentation and retention of fine-grained metrics.</p> </li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/5WXnaFNvckSgKEAI8Qg6CQ/da0400ee26fa819991d41d062b070391/cattle-notpets.jpg" class="image center" style="max-width: 400px;" alt="Cattle, not pets" /> <h2 id="network-challenges-brought-by-cloud-native-architecture">Network Challenges Brought by Cloud-Native Architecture</h2> <p>While a cloud-native architecture has the potential to significantly improve ROI for organizations, the technologies mentioned above also bring new network challenges to the infrastructure. Why?</p> <p>First, the <strong>infrastructure challenges</strong>: Networks still underpin everything in cloud-native architectures. In fact, networks become even more critical because workloads are no longer in the form of monolithic software. Services that are single-function modules with well-defined interfaces need to talk to other services and resources that are all over the place. These communications dramatically increase network complexity.</p> <p>Second, <strong>dynamics!</strong> Containers and workloads can spin up or down based on demand anytime, anywhere. The ephemeral nature of containers makes it difficult to troubleshoot historical events, because the container may no longer be running. And because network identifiers like IP addresses may represent different services from moment to moment, they no longer reliably identify specific services or applications on their own.</p> <p>Next, <strong>multi-cloud makes it even more complicated</strong>. Although cloud-native architectures are designed to be infrastructure agnostic, operations teams still need to understand how applications affect network infrastructure to manage cost, performance, and security. This becomes very difficult when each infrastructure has a separate console for visibility and workload management, which creates silos and operational complexity.</p> <p>When an issue occurs, you need to see it first and then troubleshoot, understand, and fix it. Cloud-native architectures also create <strong>challenges for today’s visibility tooling</strong>:</p> <ul> <li> <p><strong>Traditional tools can’t deploy where you need them</strong>: Hardware appliances cannot be plugged into public cloud infrastructure, where many cloud-native workloads are deployed. Even VM-based appliances pose challenges here. These solutions may be useful for traditional infrastructure, but not for modern architectures.</p> </li> <li> <p><strong>Lack of context</strong>: Basic network identifiers like IP addresses, interfaces, and network devices lose their meaning in cloud-native environments. 
In order to provide useful insight, tooling must continuously map those identifiers to new labels like container and service names, customers, applications, and geolocations.</p> </li> <li> <p><strong>Compliance-centric</strong>: We see many tools that are focused on compliance requirements that are common in security operations center (SOC) workflows, but aren’t useful for other operational concerns like troubleshooting performance issues or cost management.</p> </li> </ul> <p>At the AWS re:Invent conference this year, we spoke with many attendees who visited Kentik’s booth, and many agreed with us on the challenges they have been facing while adopting cloud infrastructure and cloud-native architecture. Three of the most challenging problems we heard attendees mention were around a tooling gap in their organization, impacting:</p> <ul> <li> <p><strong>Performance management</strong>: This includes both network and application performance. Not only keeping services running and stable and fixing issues quickly, but also building and evolving the cloud architecture to achieve the best performance that aligns with business goals.</p> </li> <li> <p><strong>Cost management</strong>: We’ve heard from many cloud adopters who were surprised by some of the charges on their cloud bills. Being on the receiving end of this surprise is never fun (for you or your financial controllers). However, in many cases, you can cut your bill with informed network architecture decisions — for example, by reducing expensive inter-region and egress traffic.</p> </li> <li> <p><strong>Security</strong>: The cloud model is a shared responsibility model. While cloud providers are responsible for protecting the lower infrastructure layers, cloud adopters are the ones who are responsible for the applications they run on top of it. Because of these layers of responsibility, it’s harder than ever to ensure your cloud environment is fully protected.</p> </li> </ul> <h2 id="kentik-cloud-native-visibility">Kentik Cloud-Native Visibility</h2> <p>Cloud-native architecture has huge benefits, but it brings big challenges, too. Getting it right is crucial for the success of application development and migration. Kentik’s platform provides solutions that benefit every major cloud stakeholder, including NetOps, NetEng, SecOps, DevOps, and executives.</p> <p>Kentik’s cloud analytics suite addresses all of the cloud management challenges we discussed earlier in the post: network and application performance management, cost management, and cloud protection. Via rich analytics from the big picture down to fine details, plus proactive anomaly detection, we provide a comprehensive visibility solution for every piece of your cloud-native infrastructure.</p> <p>For more information on how to leverage the data from your infrastructure and put it into application and business context, you can read our <a href="https://www.kentik.com/resources/cloud-visibility">cloud visibility solution brief</a>, <a href="#demo_dialog">reach out to us</a>, or sign up for a <a href="#signup_dialog">free trial</a>.</p> <p style="font-size: 94%"><strong>Contributors: </strong><a href="https://www.kentik.com/blog/author/jmeehan">Jim Meehan, Product Marketing Director</a> & <a href="https://www.kentik.com/blog/author/crystalli">Crystal Li, Senior Product Marketing Manager</a></p><![CDATA[How Kentik Helps Assess AWS Visibility]]><![CDATA[Our network analytics platform supports visibility within public cloud environments via VPC Flow Logs. 
Our initial integration used VPC Flow Logs from Google Cloud Platform. Today, we are excited to extend our support to AWS. Read how we do it in this blog post. ]]>https://www.kentik.com/blog/how-kentik-helps-assess-aws-visibilityhttps://www.kentik.com/blog/how-kentik-helps-assess-aws-visibility<![CDATA[Crystal Li]]>Thu, 15 Nov 2018 08:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/1Qc7bjdBxa4cGkMwuGMwWo/3c447da9e5beefaae316db0fe38bdcf7/aws-cloud.jpg" style="max-width: 280px;" class="image right no-shadow" alt="AWS" /> <p>It’s (becoming) a cloud-centric world.</p> <p>Workloads are moving from on-prem to one or more of the big clouds (namely AWS, GCP, Azure, IBM, and Oracle). If your organization has made the move, have you ever wondered whether you’ve actually achieved the performance, reliability, and efficiency gains that the cloud promises, whether your data stays secure, or whether your ops costs have decreased?</p> <p>The answers can often be found with network visibility.</p> <p>This summer, Kentik extended our analytics platform to support visibility within public cloud environments via VPC Flow Logs. Our initial integration used <a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/" title="Kentipedia: What are VPC Flow Logs?">VPC Flow Logs</a> from Google Cloud Platform as a data source and fully exposed GCP-specific tags like regions, zones, and VM names as dimensions within the Kentik UI (read our <a href="https://www.kentik.com/resources/google-cloud-vpc-flow-logs-for-kentik/">Solution Brief</a>).</p> <p><strong>Today, we are excited to extend our support to AWS.</strong></p> <h2 id="review-why-vpc-flow-logs">Review: Why VPC Flow Logs?</h2> <p>Cloud providers have taken away the burden of the traditional hardware layer most organizations are used to. This helps organizations reduce the overhead of taking care of hardware. However, many cloud customers are then challenged by a lack of visibility into their traffic patterns going into and out of their <a href="https://www.kentik.com/kentipedia/cloud-network-performance-monitoring/">cloud network</a>. With an increasing number of requests from cloud adopters looking to monitor traffic in their VPCs, Amazon, Google, and Microsoft have all added support for a flow logs feature. (You can read more about the three providers’ flow logs in our recent blog post “<a href="https://www.kentik.com/blog/what-are-vpc-flow-logs-and-how-can-they-improve-your-cloud/">What Are VPC Flow Logs and How Can They Improve Your Cloud?</a>”)</p> <h2 id="whats-in-aws-vpc-flow-logs-and-whats-unique">What’s in AWS VPC Flow Logs (and what’s unique)?</h2> <p>AWS VPC Flow Logs capture information about IP traffic flowing in and out of interfaces that are part of an AWS VPC. These <a href="https://www.kentik.com/blog/what-are-vpc-flow-logs-and-how-can-they-improve-your-cloud/">flow logs can be used to analyze and troubleshoot issues</a> within a cloud environment in a few key ways:</p> <ul> <li><strong>They’re agentless.</strong> This makes integration much easier and cleaner. You can create and enable flow logs per ENI (Elastic Network Interface), per subnet, or per VPC.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/6sVWqN5LgW420Y0e8GqgqK/963a7edd4a0f960d845a02d604599076/agentless.jpg" style="max-width: 500px;" class="image center" alt="Integration with AWS" /> <ul> <li><strong>They include security context.</strong> AWS Flow Logs contain a field indicating the forwarding status of the traffic. 
Dropped traffic can indicate misconfigurations or malicious activities, and is useful both for proactive detection of events and for real-time investigation.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/27hVJ6lxY8mEUKUi2iqSQU/06671fe71a7958607fefe7ca6d5ec229/security-context.jpg" style="max-width: 850px;" class="image center" alt="Security" /> <ul> <li><strong>They provide a new layer of context.</strong> Last but not least, AWS VPC Flow Logs can be matched with cloud metadata like VM names, regions, zones, and account IDs, which adds deep context for questions like: What services are generating traffic? Which services am I using/not using? What are the dependencies among services? What services or instances are the top contributors to inter-zone traffic?</li> </ul> <p>Below is a sample flow log record which contains <code class="language-text">&lt;version> &lt;account-id> &lt;interface-id> &lt;srcaddr> &lt;dstaddr> &lt;srcport></code> <code class="language-text">&lt;dstport> &lt;protocol> &lt;packets> &lt;bytes> &lt;start> &lt;end> &lt;action> &lt;log-status></code></p> <p><img src="//images.ctfassets.net/6yom6slo28h2/6C8lYvG1wIkSwqqquQuCou/6c8ac4b3f971f3967114047982753187/sample-record.png" style="max-width: 900px;" class="image center" alt="Sample Flow Log Record" /><em>Image source: AWS</em></p> <p>The following is an example of a flow log record in which SSH traffic (destination port 22, TCP protocol) to network interface eni-abc123de in account 123456789010 was allowed:</p> <img src="//images.ctfassets.net/6yom6slo28h2/11FmTH7KOiW4GQkG2MMuKc/3df2b72a5614bed88bcd9798cfc601e9/aws-flow-log-sample.png" style="max-width: 800px;" class="image center" alt="AWS Flow Log" /> <p>You can find more details, including the structure of each flow record and possible limitations, in the <a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html" target="_blank">AWS user guide</a>.</p> <h2 id="what-the-kentik--aws-vpc-flow-logs-combination-gives-you">What the “Kentik + AWS VPC Flow Logs” combination gives you</h2> <p>In a <a href="https://www.kentik.com/blog/kentik-for-google-cloud-vpc-flow-logs/">previous blog about GCP VPC Flow Logs</a>, we talked about how Kentik can help multiple stakeholders within organizations stop flying blind on their cloud initiatives, including:</p> <ul> <li>Management and Executives: For auditing cloud networking costs</li> <li>NetOps and NetEng: For comprehensive planning and trending</li> <li>SecOps: For detailed security analytics and forensics</li> </ul> <p>These use cases hold true for AWS VPC Flow Logs, too.</p>
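<p>As a quick aside for anyone post-processing these logs directly: the space-separated record format shown above is easy to work with. Below is a minimal, illustrative sketch in Rust — not production code. The field positions assume the default (v2) format, the <code class="language-text">FlowRecord</code> and <code class="language-text">parse_record</code> names are hypothetical, and real records can contain “-” placeholders that a robust parser would handle explicitly.</p> <div class="gatsby-highlight" data-language="rust"><pre class="language-rust"><code class="language-rust">// A minimal sketch of parsing a default (v2) AWS VPC flow log record.
// Hypothetical types; error handling is reduced to Option for brevity.
#[derive(Debug)]
struct FlowRecord {
    account_id: String,
    interface_id: String,
    src_addr: String,
    dst_addr: String,
    src_port: u16,
    dst_port: u16,
    protocol: u8,
    packets: u64,
    bytes: u64,
    action: String, // "ACCEPT" or "REJECT"
}

fn parse_record(line: &str) -> Option&lt;FlowRecord> {
    // Fields: version account-id interface-id srcaddr dstaddr srcport
    //         dstport protocol packets bytes start end action log-status
    let f: Vec&lt;&str> = line.split_whitespace().collect();
    if f.len() &lt; 14 {
        return None; // not a complete v2 record
    }
    Some(FlowRecord {
        account_id: f[1].to_string(),
        interface_id: f[2].to_string(),
        src_addr: f[3].to_string(),
        dst_addr: f[4].to_string(),
        src_port: f[5].parse().ok()?, // "-" placeholders fail here -> None
        dst_port: f[6].parse().ok()?,
        protocol: f[7].parse().ok()?,
        packets: f[8].parse().ok()?,
        bytes: f[9].parse().ok()?,
        action: f[12].to_string(),
    })
}

fn main() {
    // The allowed-SSH example record from the AWS documentation.
    let line = "2 123456789010 eni-abc123de 172.31.16.139 172.31.16.21 \
                20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK";
    if let Some(rec) = parse_record(line) {
        println!("{} -> {} port {} ({} bytes, {})",
                 rec.src_addr, rec.dst_addr, rec.dst_port, rec.bytes, rec.action);
    }
}</code></pre></div>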
<p>Additionally, with Kentik:</p> <ul> <li>You can monitor multiple accounts in the same cloud</li> <li>You can monitor both on-prem infrastructure and multi-cloud infrastructure</li> </ul> <p>And you can do all of this under one unified view, gaining deep network visibility and business context to help deliver a better customer experience and grow revenue.</p> <img src="//images.ctfassets.net/6yom6slo28h2/tuOobtq8coq6GweOgYwim/21e602a9f31903881fe53b3e8c977891/aws-traffic-trends.png" style="max-width: 800px; margin-bottom: 30px;" class="image center" alt="AWS Traffic Trends" /> <img src="//images.ctfassets.net/6yom6slo28h2/200AkkUOju0iuSKKKsUWu4/0334449f6b7330460f492a389832fb5a/aws-traffic-trends-pies.png" style="max-width: 800px; margin-bottom: 30px;" class="image center" alt="AWS Trends - Dashboard" /> <img src="//images.ctfassets.net/6yom6slo28h2/79kAZNGT288C2qmGAeQsSa/98d279c436c3af9e36cc16252e7d8695/aws-customer-analytics.png" style="max-width: 800px; margin-bottom: 30px;" class="image center" alt="AWS Customer Analytics" /> <h2 id="getting-started-with-kentik--aws-vpc-flow-logs">Getting started with Kentik + AWS VPC Flow Logs</h2> <p>It’s straightforward to set up Kentik for AWS. To export VPC Flow Logs to the Kentik platform, just follow these four steps:</p> <ol> <li>Create an IAM role and assign the proper permissions and trust policy.</li> <li>Create an S3 bucket.</li> <li>Enable Flow Logs on one or more VPCs and send them to the S3 bucket.</li> <li>Configure Kentik to pull AWS Flow Logs.</li> </ol> <p>For detailed instructions, see the <a href="https://kb.kentik.com/Fc14.htm">Kentik for AWS article</a> in the Kentik Knowledge Base. If you need a Kentik account, you can sign up for a <a href="#signup_dialog">free trial</a>.</p> <p>Also, visit us at AWS re:Invent Booth #112 to learn more about our cloud visibility offerings.</p><![CDATA[The Recap: Kentik at Networking Field Day 19]]><![CDATA[During Networking Field Day 19, Kentik presented on new capabilities for service providers, cloud and cloud-native environments, and gave a technical talk on tagging and data enrichment. In this post, we recap the event highlights and provide the videos for watching and sharing. ]]>https://www.kentik.com/blog/recap-kentik-at-networking-field-day-19https://www.kentik.com/blog/recap-kentik-at-networking-field-day-19<![CDATA[Michelle Kincaid]]>Tue, 13 Nov 2018 08:00:00 GMT<p>On Friday we presented at Networking Field Day 19. This is an event we love because, well, NFD delegate <a href="https://twitter.com/ecbanks/status/1061009148106555392" target="_blank">Ethan Banks</a> of Packet Pushers summarized it nicely:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5JLqTWVscMS46OcWwcCWIs/1a48d2d6ab10e78c983be7fcc7713462/ethan-banks.jpg" style="max-width: 400px" class="image center" alt="Ethan Banks Tweet" /> <h2 id="so-whats-new-at-kentik">So What’s New at Kentik?</h2> <p><img src="//images.ctfassets.net/6yom6slo28h2/4mzWP02MDSKeyUgicWMU2G/8621fd9b984c634418a1de6baf60d537/play-button-orange.png" class="image left no-shadow" style="max-width: 28px; margin-right: 5px;" /><a href="https://www.kentik.com/resources/nfd19-whats-new-at-kentik"><strong>Watch the segment</strong></a></p> <p>Kentik Co-founder and CEO Avi Freedman kicked off NFD 19 with his take on how networks are changing. 
He said it’s largely due to:</p> <ul> <li>Multiple architectures</li> <li>Increased traffic volume and complexity</li> <li>Ephemeral API-driven loads</li> <li>Increasing workflow automation</li> <li>Varied security threats</li> </ul> <p>Avi then got into why traditional network monitoring breaks when trying to solve for the above network changes, due to lack of scale, lack of context, and more. That’s why Kentik continues to innovate on our network analytics platform to support the needs of network and operations teams in real time. As part of that, Avi gave an update on what we’ve been working on since our last NFD. Here’s a snapshot of what’s new:</p> <img src="//images.ctfassets.net/6yom6slo28h2/33KRTjHBRYUacyEyiCW28O/400244c61746233b2a49e4c4a4d65ecb/new-functionality1.png" style="max-width: 700px" class="image center" alt="New Functionality" /> <h2 id="kentik-for-service-providers">Kentik for Service Providers</h2> <p><img src="//images.ctfassets.net/6yom6slo28h2/4mzWP02MDSKeyUgicWMU2G/8621fd9b984c634418a1de6baf60d537/play-button-orange.png" class="image left no-shadow" style="max-width: 28px; margin-right: 5px;" /><a href="https://www.kentik.com/resources/nfd19-kentik-for-service-providers"><strong>Watch the segment</strong></a></p> <p>We’ve also continued to innovate on our network analytics capabilities for service providers. With the network changes mentioned above, service providers are being squeezed from all sides: they must maintain cost control amidst growing network load, fight commoditization to protect revenue, and hold off the competition.</p> <p>Kentik’s Head of Product Marketing Jim Meehan talked to the NFD crowd about how we’re aiming to help service providers address those ever-increasing pressures. Jim showed off a few key developments including OTT Service Tracking, CDN Attribution, Subscriber Tracking, and the new My Kentik portal. What were the key takeaways?</p> <p>CDN Attribution from Kentik helps with analyzing and tracking CDN traffic. From NFD delegate <a href="https://twitter.com/TheRoutingTable/status/1061018174794002432" target="_blank">Kevin Blackburn:</a></p> <img src="//images.ctfassets.net/6yom6slo28h2/3tbaLOfPeowmuIO6WkQIYM/b4b0133844cf2333c8c24588d8fc44a3/kevin-blackburn.png" style="max-width: 400px" class="image center" alt="Tweet from Kevin Blackburn" /> <p>Our ability to match DNS requests with flows in real time is also a bonus for service providers needing faster, clearer visibility. 
From NFD delegate <a href="https://twitter.com/al_rasheed/status/1061018027955638279" target="_blank">Al Rasheed:</a></p> <img src="//images.ctfassets.net/6yom6slo28h2/3HlNrAwAtWCa2aAUKmig66/48600eb616a875f34aa2b2b8b9522577/nfd-jim-leadsysadmin.jpg" style="max-width: 400px" class="image center" alt="Al Rasheed Tweet" /> <p>The <a href="https://www.kentik.com/resources/my-kentik">My Kentik portal</a>, our self-service analytics for internal and downstream customers, was also dubbed “really cool.” From industry peer and NFD live stream viewer <a href="https://twitter.com/WirelessJimP/status/1061020091540692992" target="_blank">Jeff Palmer</a>:</p> <img src="//images.ctfassets.net/6yom6slo28h2/7hfvOt7f1Ygc4OK6smwo2G/85a5e867aa420ba7d07fa30456bd0637/jim-palmer.png" style="max-width: 400px" class="image center" alt="Tweet from Jeff Palmer" /> <h2 id="tech-talk-on-tagging--enrichment">Tech Talk on Tagging &#x26; Enrichment</h2> <p><img src="//images.ctfassets.net/6yom6slo28h2/4mzWP02MDSKeyUgicWMU2G/8621fd9b984c634418a1de6baf60d537/play-button-orange.png" class="image left no-shadow" style="max-width: 28px; margin-right: 5px;" /><a href="https://www.kentik.com/resources/nfd19-tech-talk-tagging-and-enrichment"><strong>Watch the segment</strong></a></p> <p>In a technical talk during the event, Avi discussed how our platform goes beyond the basics to help enterprises and service providers enrich network data with context. He also talked about how to gain insights from network traffic through tagging. And yes, Avi did nerd out with a flux capacitor featured during this talk:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Y7QQgUJPqe8yQgqQc6esC/cf1338b346cf766f75aadf1b3148a24f/flux-capacitor.jpg" style="max-width: 650px" class="image center" alt="Flux Capacitor" /> <h2 id="kentik-for-cloud--cloud-native">Kentik for Cloud &#x26; Cloud-Native</h2> <p><img src="//images.ctfassets.net/6yom6slo28h2/4mzWP02MDSKeyUgicWMU2G/8621fd9b984c634418a1de6baf60d537/play-button-orange.png" class="image left no-shadow" style="max-width: 28px; margin-right: 5px;" /><a href="https://www.kentik.com/resources/nfd19-kentik-for-cloud-and-cloud-native"><strong>Watch the segment</strong></a></p> <p>Kentik for Cloud and Cloud Native was a big part of NFD 19 for us. That’s because performance, security, and cost management challenges don’t go away when you move to the cloud. During this NFD segment, delegate Ethan Banks tweeted:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2HXaJTEHPGWowaAYGu2maO/bb2ae0c7ff1ab70e82edaf3ce6dc33ed/ethanbanks-nfd-tweet.png" class="image center" style="max-width: 400px;" alt="Ethan Banks Tweet" /> <p>Jim and Kentik peer Crystal Li, Senior Product Marketing Manager, showed how maintaining visibility into cloud and across hybrid environments can be made easier with Kentik and our support for VPC Flow Logs. 
We outlined our Google Cloud and Amazon AWS developments in this session, too.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5uHQUBeHfysoQo0CA2YUeW/c5a96e7baa925d62b02cbec0d42aeff8/multi-cloud-architecture.png" style="max-width: 700px" class="image center" alt="Multi-cloud architecture" /> <h2 id="whats-next-for-kentik">What’s Next for Kentik</h2> <p><img src="//images.ctfassets.net/6yom6slo28h2/4mzWP02MDSKeyUgicWMU2G/8621fd9b984c634418a1de6baf60d537/play-button-orange.png" class="image left no-shadow" style="max-width: 28px; margin-right: 5px;" /><a href="https://www.kentik.com/resources/nfd19-whats-next-for-kentik"><strong>Watch the segment</strong></a></p> <p>Avi closed NFD 19 with a look at Kentik’s roadmap. But before that, he got into our “negative roadmap.” That is, all of the features we won’t be focused on anytime soon — from generic BI, to SIEM, to APM. Kentik’s road ahead does, however, include innovation around metrics, streaming telemetry, segmentation, orchestration integration, and edge computing. Be sure to check it out. And if you’re left wanting to see more, <a href="#demo_dialog">sign up for a scheduled demo</a> or log on now to <a href="#signup_dialog">trial our platform</a> for yourself.</p><![CDATA[What Are VPC Flow Logs and How Can They Improve Your Cloud?]]><![CDATA[Cloud providers take away the huge overhead of building, maintaining, and upgrading physical infrastructure. However, many system operators, including NetOps, SREs, and SecOps teams, are facing a huge visibility challenge. Here we talk about how VPC flow logs can help.]]>https://www.kentik.com/blog/what-are-vpc-flow-logs-and-how-can-they-improve-your-cloudhttps://www.kentik.com/blog/what-are-vpc-flow-logs-and-how-can-they-improve-your-cloud<![CDATA[Crystal Li]]>Wed, 07 Nov 2018 08:00:00 GMT<h2 id="the-need-for-visibility-in-the-cloud"><em>The Need for Visibility in the Cloud</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/5G8MlZRwQgqAUo0AMWW68/42c88e9046e7c569413a7e23ad655c20/featured-rainbow-flow.png" alt="Kentik for Cloud Visibility" style="max-width: 450px; margin-bottom: 20px;" class="image right no-shadow" /> <p>The cloud is science, not magic.</p> <p>By now, we should all appreciate that cloud providers take away the huge overhead of building, maintaining, and upgrading physical infrastructure, and allow organizations to better focus on their own core competencies. However, with public clouds abstracting organizations’ underlying physical layers, only showing the logical/virtualized views, and with distributed systems across on-prem and multiple public clouds, system operators (including NetOps, SREs, SecOps, etc.) are facing a huge visibility challenge.</p> <p>To surface and help address these challenges, we interviewed some of our customers who have migrated their workloads from on-premises to the public cloud, as well as those who deployed greenfield cloud-native applications. Here are three key challenges frequently mentioned:</p> <ul> <li> <p><strong>Challenge #1 - Traditional appliance-based monitoring tools don’t work</strong>. Since you lose ownership of the physical layer, there is nowhere to “plug in” physical appliances. You may still use monitoring appliances in your private/on-prem data centers, but how will you see the other side of the world (your footprints in public clouds)? It’s an all-or-nothing scenario; any mid-ground solution will <em>not</em> fly.</p> </li> <li> <p><strong>Challenge #2</strong> - The “lift and shift” cloud migration method has become less popular. 
Many network operators now want to promote cloud-native architecture when migrating to the public cloud to fully leverage the cloud’s benefits. However, <strong>app behavior and performance become a big uncertainty</strong> without proper real-time micro and macro visibility of the infrastructure. “Dynamically orchestrated” and “microservices-oriented” are key aspects of “cloud-native” architecture that make this especially challenging.</p> </li> <li> <p><strong>Challenge #3 - Cost management has never been so tricky</strong>. Are there hidden fees behind your cloud bill? Some have heard that only egress traffic is chargeable, or that inter-region/VPC bandwidth is more expensive than intra-region/VPC traffic, and that traffic consumption may be charged differently for each geo/zone. We’ve heard from many organizations who were shocked to see much higher bills than they initially expected, with few details available to understand what factors are driving those costs.</p> </li> </ul> <p>In a nutshell, networking is not only <em>not</em> going away, but it’s becoming a more critical piece of the puzzle when aiming to guarantee app availability and performance while maintaining cost control. That’s why network monitoring must transform to support cloud environments and provide the same level of insight and transparency that operators are accustomed to having in traditional infrastructure. That’s the only way operators will feel comfortable moving critical applications into a cloud environment.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5M91QDi2J2wIAgWKA6MSAy/007e9a4b7ddb482ecd766053543b1fed/frustration.jpg" class="image center" style="max-width: 800px;" alt="Cloud challenges" /> <center>You definitely don’t want to look like them when managing your infrastructure, do you?</center> <h3 id="good-news-cloud-giants-now-offer-visibility-via-cloud-flow-logs">Good News! Cloud Giants Now Offer Visibility via Cloud Flow Logs</h3> <p>As mentioned above, if you don’t operate physical infrastructure, getting complete network visibility has been almost impossible. However, we are happy to see cloud giants like Google (GCP), Amazon (AWS), and Microsoft (Azure) acknowledge their customers’ demands for real visibility by adding virtual private cloud (VPC) flow data—sometimes called “<a href="https://www.kentik.com/kentipedia/what-are-vpc-flow-logs/" title="Kentipedia: What are VPC Flow Logs?">cloud flow logs</a>”—to their offerings. Let’s take a moment to review what each of them provides.</p> <h4 id="google-gcp---vpc-flow-logs">Google GCP - VPC Flow Logs</h4> <p>GCP VPC Flow Logs capture telemetry data like NetFlow, plus additional metadata that is specific to GCP. As shown in the diagram below, customers can leverage VPC Flow Log monitoring to track network activity ① over Google Cloud interconnects, ② over VPNs, ③ between workloads, ④ from endpoints going out over the Internet, and ⑤ from workloads to Google services.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5HbDNTuN1Ks2swkAYq4k8g/d74ed79acbb074b1ec9aa9af44e22578/google-cloud-platform.png" class="image center" style="max-width: 700px;" alt="Google VPC Flow Logs diagram" /> <p>These logs include a variety of rich data points, including the IP 5-tuple that we are very much familiar with (which contains the source IP, source port, destination IP, destination port, and the layer 4 protocol). Moreover, the logs also store information like timestamps, performance metrics (e.g. 
throughput, RTT), and endpoint tags such as VPC, VM name, and geo annotations. According to Google, their VPC Flow Logs are meant to promote use cases such as network monitoring, network usage and egress optimization, network forensics and security analytics, and real-time security analysis.</p> <p>A couple of highlights worth mentioning with these logs:</p> <ul> <li><strong>Speed</strong> - GCP VPC Flow Log generation happens in near real-time (updates every 5 seconds vs. minutes) — which means the flow data has 5-second granularity.</li> <li><strong>Openness</strong> - Google promotes a partner ecosystem, meaning all of those flow logs can be exported to a partner system for analysis and visualization.</li> </ul> <p>For more information, please refer to their official docs and announcement:</p> <ul> <li><a href="https://cloud.google.com/vpc/docs/using-flow-logs" target="_blank">Using VPC Flow Logs</a></li> <li><a href="https://cloud.google.com/blog/products/gcp/introducing-vpc-flow-logs-network-transparency-in-near-real-time" target="_blank">Introducing VPC Flow Logs—network transparency in near real-time</a></li> </ul> <h4 id="amazon-aws---vpc-flow-logs">Amazon AWS - VPC Flow Logs</h4> <p>AWS was the first cloud provider to make their VPC Flow Logs available to customers, with the goal of helping their users troubleshoot connectivity and security issues and make sure that network access rules work as expected. With VPC Flow Logs, AWS enables you to capture information about the IP traffic going to and from network interfaces in a VPC.</p> <p>What you need to know about AWS VPC Flow Logs:</p> <ul> <li>Agentless</li> <li>Can be enabled per ENI (Elastic Network Interface), per subnet, or per VPC</li> <li>Includes actions (ACCEPT/REJECT) of Security Group rules</li> <li>Can be used to analyze and troubleshoot network connectivity and more</li> </ul> <p>Each AWS flow log record is a space-separated string with the following format:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">&lt;version> &lt;account-id> &lt;interface-id> &lt;srcaddr> &lt;dstaddr> &lt;srcport> &lt;dstport> &lt;protocol> &lt;packets> &lt;bytes> &lt;start> &lt;end> &lt;action> &lt;log-status></code></pre></div> <p>AWS Flow Log configuration can also be managed with APIs. That means you can create, describe, and delete flow logs using the Amazon EC2 Query API, and view flow log records using the CloudWatch Logs API.</p> <p>For more information, please refer to their official docs:</p> <ul> <li><a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html" target="_blank">VPC Flow Logs</a></li> <li><a href="https://docs.aws.amazon.com/vpc/latest/userguide/working-with-flow-logs.html" target="_blank">Working With Flow Logs</a></li> <li><a href="https://aws.amazon.com/blogs/aws/vpc-flow-logs-log-and-view-network-traffic-flows/" target="_blank">VPC Flow Logs – Log and View Network Traffic Flows</a></li> </ul> <h4 id="microsoft-azure---flow-logging--virtual-network-tap">Microsoft Azure - Flow Logging &#x26; Virtual Network TAP</h4> <p>Azure flow logging is a feature of Azure Network Watcher, a tool used to monitor, diagnose, and gain insights into Azure cloud’s network performance and health. Azure flow logging allows you to view information about ingress and egress IP traffic through a Network Security Group (NSG). 
Similar to what AWS and GCP offer, Azure flow logs show outbound and inbound flows with 5-tuple details in JSON format. The nuance here is that the flows are recorded on a per-rule basis; therefore, whether the traffic was allowed or denied is also captured.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4tyUqJGgqcIQASeI2USOui/4ef3fe12c1384905c3eaf7b231446483/azure-raw-log.png" class="image center no-shadow" style="max-width: 550px;" alt="Azure flow logging" /> <p>Furthermore, Azure recently put its virtual network TAP (Terminal Access Point) into developer preview in the west-central-us region. This service allows you to continuously stream VM network traffic to an analytics tool.</p> <p>For more information, please refer to their official docs:</p> <ul> <li> <p><a href="https://docs.microsoft.com/en-us/azure/network-watcher/network-watcher-nsg-flow-logging-overview" target="_blank">Introduction to flow logging for network security groups</a></p> </li> <li> <p><a href="https://docs.microsoft.com/en-us/azure/virtual-network/virtual-network-tap-overview" target="_blank">Virtual network TAP</a></p> </li> </ul> <h3 id="kentiks-cloud-visibility-solution">Kentik’s Cloud Visibility Solution</h3> <p>Kentik’s mission is to offer the world’s most powerful network analytics. We’ve already addressed many use cases across both internal networks and the internet. And now, our mission is expanding seamlessly to cloud-native networks.</p> <p>Kentik now accepts VPC flow logs as flow data sources and can take full advantage of each cloud provider’s tags to add infrastructure, application, and service context to the data presented in the Kentik User Interface (for GCP and AWS as of now; Azure is coming in the near future).</p> <p>With support for VPC flow logs from public cloud, Kentik provides a 360° view of the performance, composition, and paths of actual traffic — no matter where your workloads are located. We’ve combined our deep network analytics with application and business data to help optimize your costs, ensure your cloud-deployed apps are delivering a great customer experience, and grow your revenue.</p> <img src="//images.ctfassets.net/6yom6slo28h2/74PanFwEPSe6Wo0WSoQMyq/31c39d2c4b189d54b2bde56a585c6259/kentik-for-cloud.jpg" style="max-width: 800px;" alt="Kentik for Cloud" class="image center" /> <p>For more information, check out our <a href="https://www.kentik.com/blog/kentik-for-google-cloud-vpc-flow-logs/">Kentik for Google Cloud VPC Flow Logs</a> blog post.</p><![CDATA[Under the Hood: How Rust Helps Keep Kentik’s Performance on High]]><![CDATA[We post a lot on our blog about our advanced network analytics platform, use cases, and the ROI we deliver to service providers and enterprises globally. However, today’s post is for our fellow programmers, as we go under Kentik’s hood to discuss Rust. ]]>https://www.kentik.com/blog/under-the-hood-how-rust-helps-keep-kentik's-performance-on-highhttps://www.kentik.com/blog/under-the-hood-how-rust-helps-keep-kentik's-performance-on-high<![CDATA[Will Glozer]]>Mon, 05 Nov 2018 07:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/20WX4Oj8SUG86saiIYGwU0/801b05c8559788397411e37a0de068d6/rust-logo-blk.svg" class="image right no-shadow" style="max-width: 180px;" alt="Rust"> <p>We post a lot on our blog about our advanced network analytics platform, use cases, and the ROI we deliver to service providers and enterprises globally. 
However, today’s post is for our fellow programmers out there, as we go under Kentik’s hood to discuss <a href="https://www.rust-lang.org/en-US/index.html" target="_blank">Rust</a>, a new programming language that is getting a lot of attention for its performance, safety, and design. Simply put: Rust is a big part of how Kentik ingests gigabits per second of flow data representing over a petabit per second of Internet traffic, and stores 100 TB of network flow data each day. The performance and reliability of our platform rely on software written in Rust, and we benefit from the robust ecosystem of open source libraries available on <a href="https://crates.io/" target="_blank">crates.io</a>.</p> <p>On the ingest side, Kentik’s high-performance host and sensor agent captures raw network traffic and converts it into kflow, our internal Cap’n Proto-based flow record format. In addition to basic data like source and destination IP address, port, protocol, etc., we collect network performance metrics like TCP connection setup latency, retransmitted and out-of-order packet counts, and window size. We also utilize <a href="https://crates.io/crates/nom" target="_blank">nom</a>, an excellent Rust parser combinator library, for high-performance decoding of application layer protocols like DHCP, DNS, and HTTP.</p> <p>Rust’s memory ownership model allows us to share the underlying packet capture buffer with the entire processing pipeline while ensuring that any references into the buffer do not outlive the packet data. This allowed us to implement very efficient zero-copy parsers that do minimal allocation.</p> <p>On the datastore side, we’ve recently rolled out a new backend disk storage format written in Rust that delivers improved performance and higher storage density. And our query layer utilizes a HyperLogLog extension, also written in Rust, for high-performance cardinality queries. Rust’s performance and low memory use are key benefits here.</p> <p>These components are distributed as shared libraries with a C API and linked into various parts of our distributed storage engine. We’ve taken this same approach with libkflow, a library for generating and sending kflow records to Kentik. However, libkflow is written in Go, and the resulting shared library is large because it contains the full runtime, garbage collector, etc.</p> <p>Additionally, when the Go runtime is initialized it creates many threads, which fails spectacularly when linked into a program that then forks. Go also has <a href="https://github.com/golang/go/issues/13492" target="_blank">long-outstanding</a> <a href="https://go-review.googlesource.com/c/go/+/37868" target="_blank">bugs</a> that cause the runtime to segfault when linked into a static binary. We’ve had to develop and maintain patches to the Go runtime that work around these issues.</p> <p>None of this is meant as an attack on Go. The majority of Kentik’s backend systems are written in Go and we’re happy with that choice. However, Rust code is easier to embed as a dynamic or static library since there is no runtime or garbage collector, and no overhead calling Rust functions from C or vice versa. Rust’s design combined with LLVM’s powerful optimizer also often produces code with superior performance, less memory use, and no garbage collection overhead.</p> <p>Rust is a critical component in the Kentik software stack, and absolutely delivers on its promise of performance and safety while being a joy to program in.</p>
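<p>To make the “easy to embed” point concrete, here is a minimal sketch (a simplified illustration with hypothetical names, not our production code) of a Rust function exported with a C ABI. Compiled with <code class="language-text">crate-type = ["cdylib"]</code>, it yields an ordinary shared library that C, Go, or anything with a C FFI can call directly, with no runtime to initialize, no extra threads, and no garbage collector:</p> <div class="gatsby-highlight" data-language="rust"><pre class="language-rust"><code class="language-rust">// A tiny, hypothetical example of exposing Rust to C callers.
// Compiled as a cdylib, this produces a plain shared library.

/// Count how many bytes in a NUL-terminated C string equal `needle`.
/// Safety: `s` must point to a valid NUL-terminated string.
#[no_mangle]
pub unsafe extern "C" fn count_byte(s: *const u8, needle: u8) -> usize {
    let mut n = 0;
    let mut p = s;
    while *p != 0 {
        if *p == needle {
            n += 1;
        }
        p = p.add(1);
    }
    n
}</code></pre></div> <p>A C caller would declare it as <code class="language-text">size_t count_byte(const unsigned char *s, unsigned char needle);</code> and link against the library like any other native dependency.</p>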
<p>If you’re a developer who has been curious about it, we’d encourage you to check out <a href="https://www.rust-lang.org/en-US/index.html" target="_blank">Rust</a>. And, of course, if you’re curious about Kentik, <a href="#signup_dialog">trial it here</a>.</p><![CDATA[This Halloween, Don’t Get Spooked by Cloud Visibility Myths]]><![CDATA[To some, moving to the cloud is like a trick. But to others, it’s a real treat. So in the spirit of Halloween, here’s a blog post to break down two of the spookiest (or at least the most common) cloud myths we’ve heard of late. ]]>https://www.kentik.com/blog/this-halloween-dont-get-spooked-by-cloud-visibility-mythshttps://www.kentik.com/blog/this-halloween-dont-get-spooked-by-cloud-visibility-myths<![CDATA[Crystal Li, Michelle Kincaid]]>Wed, 31 Oct 2018 07:00:00 GMT<p>We’ve heard both sides of the story: To some, moving to the cloud is like a trick. But to others, it’s a real treat. So in the spirit of Halloween, we wanted to break down two of the spookiest (or at least the most common) cloud myths we’ve heard of late. From there, we hope to help you remove the mystery of maintaining visibility, performance, and security — no matter where you are in your cloud journey.</p> <img src="//images.ctfassets.net/6yom6slo28h2/81jog0RXfU6qkCQ8ek04U/a6155beb45c9eafdb720d78beecf31da/spooky-cloud.jpg" style="max-width: 600px;" class="image center" alt="Afraid of the cloud?" /> <h3 id="myth-1-going-to-the-cloud-means-you-no-longer-need-to-worry-about-your-applications-or-networking-there">Myth 1: Going to the cloud means you no longer need to worry about your applications or networking there.</h3> <p>You’ve likely heard of the <a href="https://aws.amazon.com/compliance/shared-responsibility-model/" target="_blank">“Shared Responsibility Model”</a> that every public cloud provider employs today. Take AWS’s model, for example, which we can also apply to other public clouds. In a nutshell, cloud providers’ responsibility is “<strong>of</strong> the cloud” while a cloud customer’s responsibility is “<strong>in</strong> the cloud.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/CK1xWx3lksqCkIIa4kwYe/367c30877ba3f279997490e2b038d138/shared-responsibility-model.png" style="max-width: 700px;" class="image center" alt="AWS Shared Responsibility Model"> <p>On the one hand, cloud providers like Amazon AWS, Google Cloud, and Microsoft Azure are responsible for protecting the infrastructure that runs all of the services offered in their cloud, which is composed of the hardware, software, networking, and facilities that run the cloud services. On the other hand, these providers require that their customers perform all of the necessary security configurations and management tasks, including things such as data, applications, the operating system, and networking configuration/protection. And so, as depicted above on the right, networking doesn’t go away just because it’s wearing a “cloud” mask; it’ll still be there, and it will still need you to manage it.</p> <p>Don’t assume you don’t own your cloud environment. That’s a myth. The “Shared Responsibility Model” should remind every cloud architect and operator that it’s an “all or nothing” game. 
As per the diagram, if a cloud customer does a lousy job of fulfilling their end of the responsibility, no matter how well the cloud providers do on their side, the organization and business are still at huge risk.</p> <h3 id="myth-2-you-will-lose-control-when-you-move-to-cloud">Myth 2: You will lose control when you move to the cloud.</h3> <p>As a reminder, public cloud promises many great benefits, including deployment agility, operational efficiency, rapid scalability, high performance, and above all else, improved ROI.</p> <p>Yet, at the same time, it remains true that when you move workloads up to AWS, Google Cloud, Azure, etc., you lose ownership of the hardware, including servers, disks, routers/switches, and so on. It is also true that when you operate on your cloud infrastructure, you are actually operating on the “abstracted” layer that the cloud providers make available to you (hey, call it “virtualization”). And finally, it is also true that in your cloud journey, many traditional monitoring/management tools won’t support your modern cloud operations, leaving you clueless when something happens.</p> <p>Sounds like you’re losing control, right?</p> <p>Well, losing control at certain levels is not necessarily a bad thing if that serves a business goal (e.g. you don’t want to deal with the overhead of building your own data center). What’s important is keeping a few key questions in mind:</p> <blockquote> <p><strong>Question 1:</strong> What benefit do you want to gain from the cloud (cloud-native architecture)?</p> </blockquote> <blockquote> <p><strong>Question 2:</strong> Are you maximizing those benefits yet? If not, how can you improve this?</p> </blockquote> <p>That’s where having at least some control is important. That’s also where it becomes critical to know the visibility/control gaps in today’s cloud infrastructure, as that’s typically where organizations lose control. So what are the gaps?</p> <blockquote> <p><img src="//images.ctfassets.net/6yom6slo28h2/1RhvwtNYfKcmmkCuicqAui/04a154c99528a0670161b41755b65d00/web-visibility.png" style="max-width: 140px;" class="image right no-shadow" alt="cloud visibility" /><strong>Gap 1:</strong> Visibility into application performance dependency mapping</p> </blockquote> <blockquote> <p>Chances are that each of your applications is part of a highly distributed system. When one component has issues, it’s hard to identify which service is affected and whether or not other services are experiencing similar issues or changes.</p> </blockquote> <blockquote> <p><img src="//images.ctfassets.net/6yom6slo28h2/6g2a0PxyKsCiyiGu42aYKW/1f40f1c4e1b3fcb1df96fe4f4a59c9ad/multiple-alerts.png" style="max-width: 140px;" class="image right no-shadow" alt="hybrid cloud alerts" /><strong>Gap 2:</strong> Unified visibility across multiple cloud infrastructures</p> </blockquote> <blockquote> <p>If you have a hybrid cloud (e.g. private data center and public cloud infrastructure), or multi-public-cloud IaaS (AWS+Azure+GCP+etc.), it makes much more sense from a business perspective to have one operations team manage it all, rather than hiring multiple teams with different expertise to handle each one, creating silos. 
However, with digital infrastructure growing and constantly changing, having just one team do it all can create visibility gaps, particularly from a resources and expertise standpoint.</p> </blockquote> <blockquote> <p><img src="//images.ctfassets.net/6yom6slo28h2/2BnjJRv7TKEWkEoA2AiOYK/af87071fa2c265919d745a297ae959c9/cloud-cost.png" style="max-width: 140px;" class="image right no-shadow" alt="cloud cost" /><strong>Gap 3:</strong> Visibility into bandwidth and usage cost for cloud bills</p> </blockquote> <blockquote> <p>One key reason many enterprises are adopting public cloud is to cut IT bills. However, if you don’t architect your cloud infrastructure wisely, you can end up spending even more than before.</p> </blockquote> <p>Regardless of the gaps, it’s a myth that you have to lose visibility and therefore control entirely when you’re in the cloud. Here are three important points to remember:</p> <blockquote> <p><strong>Complete visibility across workloads in your cloud environment is a must.</strong> Moving up to the cloud doesn’t mean you should blindly trust that everything runs safely. We recommend you take the opposite approach: be more cautious and make sure you have complete visibility across your entire cloud environment.</p> </blockquote> <blockquote> <p><strong>Proactively team up with your cloud provider(s) to gain the visibility they own.</strong> The fact that Amazon, Microsoft, and <a href="https://www.kentik.com/blog/kentik-for-google-cloud-vpc-flow-logs/">Google offer VPC Flow Logs</a> validates this. Many cloud customers have craved underlying flow data because it unlocks deeper visibility that matters to the business. And cloud giants have recently acknowledged this importance.</p> </blockquote> <blockquote> <p><strong>Find a cutting-edge tool that fits your cloud infrastructure and your business.</strong> Avoid the haunted house of traditional metrics from legacy vendors. Real-time, modern cloud visibility from Kentik allows you to see, run, and protect your infrastructure (wherever it may be) and your business.</p> </blockquote> <p>Don’t be spooked by the cloud. It is possible to maintain visibility into your infrastructure. See for yourself in a new video on how we’re <a href="https://www.kentik.com/resources/kentik-reinventing-analytics">reinventing network analytics</a> or <a href="https://www.kentik.com/go/get-demo/">sign up for a demo</a>.</p> <p>Also, don’t miss the whitepaper “<a href="https://www.kentik.com/resources/aws-cloud-adoption-visibility-management">AWS Cloud Adoption, Visibility &#x26; Management</a>,” where we look at the survey responses of 310 technical and executive-level peers who attended the 2018 AWS user conference.</p><![CDATA[Network Insight is Essential for Communications Service Consumers]]><![CDATA[Real-time network data insights are not only important to the service provider. In this post, we discuss why service providers’ end-customers who consume those services (subscribers, digital enterprises, hosting customers, etc.) also need visibility.]]>https://www.kentik.com/blog/network-insight-is-essential-for-communications-service-consumershttps://www.kentik.com/blog/network-insight-is-essential-for-communications-service-consumers<![CDATA[Akshay Dhawale, Crystal Li]]>Tue, 23 Oct 2018 07:00:00 GMT<p>In a recent blog post, we announced the availability of the <a href="https://www.kentik.com/blog/empower-your-customers-with-self-service-analytics/">“My Kentik” self-service analytics portal</a>. 
We highlighted the great value-add that Kentik’s service provider customers can now deliver to their end-customers with real-time network data insights. Read ahead to see why having this visibility is important not only to the service provider, but also to their end-customers, who consume those services — subscribers, digital enterprises, hosting customers, etc.</p> <h2 id="where-is-my-money-spent">Where is my money spent?</h2> <p>Imagine you are looking for a place to rent. Eventually, you narrow it down to three properties that you like equally. Each of the landlords offers a different option for paying rent:</p> <ul> <li><strong>Option 1:</strong> You pay a fixed monthly amount to your landlord — rent, garbage, and utilities are all included, but no further breakdown details are provided.</li> <li><strong>Option 2:</strong> You pay only the rent to your landlord. You pay all other bills on your own directly to the providers.</li> <li><strong>Option 3:</strong> You can log in online to view and pay an itemized bill with all the details presented. Even better, you can make suggestions on choosing different utility vendors.</li> </ul> <p>Hold your thoughts for a moment and think about managing the infrastructure OpEx in your business. How would you want to pay those bills?</p> <h2 id="self-service-portals-are-critical-customer-service-tools">Self-service portals are critical customer service tools</h2> <p>Microsoft polled 5,000 people across the world for their “<a href="https://info.microsoft.com/rs/157-GQE-382/images/EN-CNTNT-Report-DynService-2017-global-state-customer-service.pdf" target="_blank" rel="noopener">2017 State of Global Customer Service Report</a>,” which showed strong evidence for the rise of self-service:</p> <ul> <li>90% said they expect an online portal for self-service; and</li> <li>74% have used a self-service support portal to find answers to service-related questions before calling an agent.</li> </ul> <p>So which factors are driving the self-service revolution in the customer engagement model?</p> <ul> <li>First of all, the self-service portal visualizes the service — it gives consumers a better way to utilize existing services and discover new services that can help solve the business problem.</li> <li>Second, the self-service portal improves communication and productivity. When the service consumer shares the same view as the provider, the two are more likely to have a meaningful conversation and more efficiently pinpoint issues.</li> <li>Third, the self-service portal is a great tool for consumers to increase the value of the services by visualizing them.</li> <li>Last but not least, as a positive side effect, service consumers now have the opportunity to learn new skills and improve knowledge of the products and services provided by the service provider.</li> </ul> <h2 id="how-my-kentik-can-help">How “My Kentik” can help</h2> <p>The <a href="https://www.kentik.com/resources/my-kentik/">“My Kentik” portal</a> is a built-in feature of the Kentik platform that enables curated, self-service network traffic visibility for downstream customers. Now service consumers can get a level of visibility that’s never before been available to them — powered by the same dataset their providers have, but highly customized.</p> <p>Imagine you are subscribed to DDoS protection services from an MSSP (managed security service provider). When an incident occurs, using “My Kentik”, you can now immediately access the traffic and alert details yourself. 
That allows both high-level and drill-down analysis to speed up troubleshooting and issue resolution, even while waiting on the phone with customer service. To see for yourself, <a href="https://www.kentik.com/go/get-demo/">sign up for a demo</a> or <a href="https://portal.kentik.com/signup">free trial</a>.</p> <img class="image center" src="//images.ctfassets.net/6yom6slo28h2/lkMpA4qoc860qMKUa2W0u/dc98136b8c46438424cd54352e87e864/mykentik-alerts.png" alt="DDoS Alert" style="max-width: 1000px;" /> <p>If you are an enterprise customer who buys Internet/MPLS circuits, with “My Kentik”, you can now <a href="https://www.kentik.com/solutions/network-visibility/#wan-sd-wan">view your WAN traffic utilization</a> and understand which departments, WAN sites, and data centers are driving network utilization. You can also use this data to make a more cost-optimized plan for network changes and expansion.</p> <img class="image center" src="//images.ctfassets.net/6yom6slo28h2/4TNlQfG5bGc0aGae2aiEaC/1130937f41961105a00a4c680aa3e511/mykentik-pie.png" alt="WAN traffic breakdown" style="max-width: 355px;" /> <p>If you’re part of a team of application developers or architects, the “My Kentik” portal can provide <a href="https://www.kentik.com/solutions/network-visibility/">cloud or data center traffic insights</a>, so you can understand how the applications you’re deploying are affecting the infrastructure and vice versa. That includes the east-west traffic between microservices and other components within the app itself and informs app architecture decisions that help avoid impact from infrastructure bottlenecks.</p> <img class="image center" src="//images.ctfassets.net/6yom6slo28h2/1Imf0JhaF6akkQaqsES00y/04689189f5f71e0fc889eaa1c91c8a31/mykentik-customer-visibility-standard.png" alt="Customer Visibility Standard" style="max-width: 1000px;" /> <p>If you are a customer receiving IP transit or peering services, “My Kentik” can provide traffic utilization and billing correlation analysis. Breakdowns like port utilization, geolocation, ASN, and traffic type uncover the drivers behind total traffic volume and explain unexpected increases and cost exposure.</p> <img class="image center" src="//images.ctfassets.net/6yom6slo28h2/7n9oyFMk7eQmM4Y88ESgKw/fd7e222bd24aaf4477a2b10c7d7a2362/mykentik-subtenant-widgets.png" alt="Subtenant Widget" style="max-width: 1000px;" /> <p>If you’re a customer of hosting or IaaS services, “My Kentik” can provide an understanding of traffic details such as per-host and per-site utilization, application/service utilization, and even connection-level details of each host’s historical network activity for security and forensic analysis.</p> <img class="image center" src="//images.ctfassets.net/6yom6slo28h2/6GfR87hR04Smcea0uiewKE/beb800dba790b07751adca5ea8276005/mykentik-custom-visibility-standard.png" alt="Custom Visibility Standard" style="max-width: 1000px;" /> <p>My Kentik Portal delivers transparency across all of these customer types and use cases, which can shorten incident response time, make troubleshooting far easier, enable more robust infrastructure, and accelerate business growth. 
Talk to your service provider now and ask about their integration with Kentik.</p> <p>For more information, please see the <a href="https://assets.ctfassets.net/6yom6slo28h2/B4oF6RRrQkAey28MaCAaS/5c15281268d1d2cbc61b91f0cd52ec1b/my-kentik.pdf">“My Kentik” solution brief</a> and the <a href="https://kb.kentik.com/Cb15.htm">Subtenancy topic</a> in the Kentik Knowledge Base or contact the Kentik Customer Success team at <a href="mailto:[email protected]">[email protected]</a>.</p><![CDATA[Kentik Invests in European Market Expansion]]><![CDATA[Kentik is growing! Today we marked our permanent landing in Europe. Kentik’s SaaS solution has already been embraced by dozens of organizations across the European continent -- and our new presence will support even more organizations who deliver or depend on routed networks and the Internet as an essential part of their operations and business.]]>https://www.kentik.com/blog/kentik-invests-in-european-market-expansionhttps://www.kentik.com/blog/kentik-invests-in-european-market-expansion<![CDATA[Jim Frey]]>Tue, 16 Oct 2018 07:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/qqcdMIGYbQ2IsIUQSUoaG/a45c44fbc319c8950764e5398979532c/blog-eu-map.png" style="max-width: 300px;" class="image right no-shadow" alt="Kentik in Europe"> <h3 id="kentik-is-growing"><em>Kentik is growing!</em></h3> <p>While we’re based in the US, we have always found both great interest and ready adoption by providers and digital businesses around the world, including those in Europe. Despite a relatively short 3+ years since our launch, Kentik’s SaaS solution has already been embraced by dozens of organizations across the European continent — particularly those who deliver or depend upon routed networks and the Internet as an essential part of their operations and business.</p> <p>Throughout the past several years, Kentik has engaged in active discussions with many more European organizations who understand and value what the Kentik service can provide, in terms of network visibility, traffic analytics, network performance intelligence, and streaming anomaly detection. And because Europe comprises a large number of densely clustered countries, there is a high need for clarity and insights around peering and transit, for both technical operations and business optimization. Reaching all of these potential new clients requires, at some point, more dedicated resources that are aligned directly with the needs of the European marketplace. The Kentik team has long understood this, and recently committed to making such investments a reality.</p> <p><strong>And with that decision comes our big news:</strong></p> <p>The team is embarking upon a major initiative and rollout to expand services within Europe and the greater EMEA region:</p> <ul> <li>First and foremost, this includes offering <strong>SaaS hosting from a European location</strong>, in Frankfurt, Germany. We chose Frankfurt because of its abundant network connectivity, myriad choices for data center facilities, and its excellent fit for the data residency requirements shared by many European organizations.</li> <li>Second, Kentik is <strong>expanding our channel program</strong> to add more local European partners, including our first in Germany, Diverse GmbH, with plans to continue growing our partner roster in the months and years ahead.
Diverse GmbH joins our existing list of regional partners, which includes Interdata, Baffin Bay Networks, and Acorus Networks.</li> <li>Lastly, Kentik is investing in <strong>local personnel resources within Europe</strong>, adding field roles to our existing operations personnel in the region.</li> </ul> <p>“On a daily basis, we hear of new and existing network management challenges from enterprises and service providers alike. All are seeking effective technologies for visibility and insights into transit, capacity, and interconnection, and yet regulations create new obstacles for where organizations can store network and user data,” said <strong>Jean-Marc Odet, CEO of Interdata</strong>, a leading French company specializing in network integration, architecture, and security, and a value-added reseller (VAR) for Kentik. “We’re excited to see Kentik expand into Europe. Kentik’s powerful network analytics will help many more organizations in the region address their network challenges, and we look forward to working with them on their efforts.”</p> <p>With these growth initiatives, Kentik will be in place to support more customers in more countries with both shared and specific needs. We will remove barriers to adoption and give European organizations a real choice when it comes to cutting-edge network traffic and performance monitoring solutions. Kentik is bringing the world’s most powerful network analytics to Europe! To learn more, <a href="https://www.kentik.com/news/kentik-expands-into-europe-addresses-demand-modern-network-analytics/">read our press release</a> or <a href="mailto:[email protected]">reach out to us</a>.</p><![CDATA[Visualizing Traffic Exit Points in Complex Networks]]><![CDATA[Relentless traffic growth and a constant stream of new technologies, e.g. SDN and cloud interconnects, make it harder to understand how services traverse the network between application infrastructure and users or customers. In this post, we discuss how that led Kentik to build our BGP Ultimate Exit to help address traffic visibility challenges. ]]>https://www.kentik.com/blog/visualizing-traffic-exit-points-in-complex-networkshttps://www.kentik.com/blog/visualizing-traffic-exit-points-in-complex-networks<![CDATA[Dan Rohan]]>Wed, 19 Sep 2018 09:00:43 GMT<p>Let’s face it — today’s networks are complex. The physical topology continues to expand with relentless traffic growth, and a constant stream of new technologies like SDN, Clos architectures, and cloud interconnects makes it even harder to understand how services traverse the network between application infrastructure and users or customers. Even simple questions like “where will this traffic exit my network?” become difficult to answer.</p> <p>It’s that “exit point” question in particular that led Kentik to build our <a href="https://www.kentik.com/blog/package-tracking-for-the-internet/">BGP Ultimate Exit feature set</a> (UE for short). In a nutshell, UE enriches all the flow data Kentik receives with tags that indicate the PoP, router, interface, etc. where that flow will exit the network, potentially many hops away. For outbound traffic, the exit might be to an adjacent network, a customer, or an upstream provider. For inbound traffic, the “exit” might be to a host in a data center.
In either case, UE is insanely valuable across both enterprise and service provider networks, for diagnosing routing issues, understanding which traffic sources are driving traffic growth at remote points in the network, or even understanding the relative cost to serve each customer.</p> <p>To be clear, UE is fundamentally different from simply looking at traffic based on the typical source or destination fields available in regular NetFlow. To illustrate, let’s consider a network that looks roughly like the diagram below. Each network device (switch, router) in this network is sending data to Kentik (flow, BGP, and SNMP data).</p> <img src="//images.ctfassets.net/6yom6slo28h2/2qSC5oK8YQkgCYUs22kOiW/c90ba7333ea6635cd3b727e61feab270/diagram1.png" style="max-width: 700px" alt="network diagram" class="image center" /> <p>As you can see, we have a network where traffic is entering an edge/border device destined for a host on the same network.</p> <p>Based upon the flow generated from the border router, we can see that this flow is originating from 1.1.1.1 and headed towards 3.3.3.3. Looking at that same flow, we see that the source interface is the transit interface and the destination interface is the backbone interface connected to device B. We can figure all of this out just by using the standard fields in flow records exported from the edge device alone.</p> <p>But what if you wanted to figure out which piece of network gear is actually handing this traffic off to the destination host (3.3.3.3)?  If you don’t have a detailed mental map of your network (and honestly, who really does at this point?) the traditional approach is to log into a router, run a show route command to figure out which adjacent device is announcing the IP internally, then jump to that device to look at ARP tables, figure out which interface the device is attached to, rinse and repeat.</p> <p>BGP Ultimate Exit dramatically simplifies this by determining the egress point (err, the “ultimate exit”) at the exact moment that the traffic ingresses the first router. Kentik stitches together flow data, BGP routes, and SNMP interface IP data to determine precisely where the traffic will be routed, without having to do any ‘flow hopping’ recursion lookups or logging into your routers. Here’s a quick sketch of how it works:</p> <ol> <li>As packets ingress the router, the router performs lookups to determine how to forward the traffic and also creates a flow record, which is exported to Kentik.</li> <li>Kentik’s ingest layer maintains a full BGP table from each router. By looking up the destination IP from the flow record in the BGP table for the router it was received from, we enrich each flow with additional BGP-related fields, including the BGP next-hop IP address for the matching route.</li> <li>Kentik’s ingest layer also maintains an SNMP interface IP table for every device in the network. By looking up the next-hop IP in this table, we can tag each flow with the egress router and site that the next-hop IP is associated with.</li> <li>Kentik also maintains an auto-generated in-memory table of the ASNs that are adjacent to each interface.
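Comparing each flow against this table allows us to additionally tag it with a specific egress interface.</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/4FFJiy2UUEio2g8oi6oimA/0088225a0c2de2ba95ed90635b540ded/diagram2.png" style="max-width: 700px" alt="network diagram" class="image center" /> <p>To make the stitching concrete, here’s a minimal Python sketch of steps 2 and 3 above — a longest-prefix match against the exporting router’s BGP table, followed by a next-hop lookup in an SNMP-derived interface table. The tables, names, and tag fields here are purely illustrative, not Kentik’s actual implementation:</p>
<pre><code class="language-python">import ipaddress

# Illustrative per-router BGP table: prefix -> BGP next-hop IP.
BGP_TABLES = {
    "edge-router-a": {
        ipaddress.ip_network("3.3.3.0/24"): "10.0.0.2",
        ipaddress.ip_network("0.0.0.0/0"): "198.51.100.1",
    },
}

# Illustrative SNMP-derived table: interface IP -> (device, site, interface).
INTERFACE_IPS = {
    "10.0.0.2": ("device-b", "site-east", "xe-0/0/1"),
}

def longest_prefix_match(bgp_table, dst_ip):
    """Return the next-hop of the most specific route covering dst_ip."""
    dst = ipaddress.ip_address(dst_ip)
    routes = [prefix for prefix in bgp_table if dst in prefix]
    return bgp_table[max(routes, key=lambda p: p.prefixlen)] if routes else None

def enrich(flow):
    """Tag one flow record with its ultimate-exit device, site, and interface."""
    next_hop = longest_prefix_match(BGP_TABLES[flow["exporter"]], flow["dst_ip"])
    device, site, interface = INTERFACE_IPS.get(next_hop, (None, None, None))
    return {**flow, "ue_next_hop": next_hop, "ue_device": device,
            "ue_site": site, "ue_interface": interface}

print(enrich({"exporter": "edge-router-a", "src_ip": "1.1.1.1", "dst_ip": "3.3.3.3"}))
# -> tagged with ue_device="device-b", ue_site="site-east", ue_interface="xe-0/0/1"
</code></pre>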
<p>As another example, imagine you wanted to see the destinations for content from a specific server or set of servers, and how that traffic was being delivered over your infrastructure. Perhaps you’re planning some network maintenance and want to detail the expected impact of your work. Kentik’s UE feature makes this easy. We’ll use the same diagram, but flip the arrows around:</p> <h4 id="steps-in-the-kentik-ui">Steps in the Kentik UI:</h4> <ul> <li>Select all devices to make sure the query considers the entire network</li> <li>Add filters to uniquely identify traffic from the server in question. For example, if you wanted to see where traffic from source IP 3.3.3.3 left your network, you’d set up your filters like so:</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/2d8bu8hwg8Y86UkYOuI26O/7b8e5b0d397308bc39d5c969a66a6199/filtering.png" style="max-width: 400px" alt="network diagram" class="image center" /> <ul> <li>Choose the dimensions to include in the query output. The example below is particularly useful, showing the network entry point(s) by source IP, source interface, and ingress device, then the egress UE device and interface, together with the next-hop ASN and destination IP.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/43vZ7FABBuKuY8Aa0a6uwK/35ac57297a9f17f13a9ce0a872d5133f/query.png" style="max-width: 400px" alt="network diagram" class="image center" /> <p>The output will be similar to the diagram below, showing the end-to-end flow of the traffic from that particular server:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2AefbyoeDaC2oa80cqo0wO/3a2ba9e9e5b338abd642bc4d3ea5e05a/end-to-end-flow.png" style="max-width: 700px" alt="end-to-end flow of traffic" class="image center" /> <p>For service providers, UE can help sales and product teams understand how each customer impacts network load, and the relative cost of delivering each customer’s traffic. By applying filters for a customer’s ASN, interfaces, or subnets, the SP can get a view of how much of the customer’s traffic is delivered locally out of the same PoP or region where it ingressed, and how much traverses the SP’s backbone to egress in remote regions (with a much higher associated cost). This type of visibility allows product teams to create differentiated pricing structures, calculate margin for each customer, and enforce contracts with “traffic distance” terms.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2uxZpoDZWYaOEi48a8w62C/7195fb9954ef2f98f6e38150e8f57034/Screen-Shot-2018-09-24-at-11.08.24-AM.png" style="max-width: 700px" alt="network diagram" class="image center" /> <p>As you can see, Kentik’s BGP Ultimate Exit feature set provides key functionality for understanding and managing traffic flows in networks of all types.</p> <p>Not already a Kentik customer? Start leveraging the powerful visualizations in <a href="/platform/kentik-detect/">Kentik Detect</a> today by <a href="#demo_dialog">requesting a demo</a> or <a href="#signup_dialog">signing up for a free trial</a>.</p><![CDATA[NetFlow vs. sFlow: What’s the Difference?]]><![CDATA[In this post, we look at the difference between NetFlow and sFlow and how network operators can support all of the flow protocols that their networks generate.
]]>https://www.kentik.com/blog/netflow-vs-sflowhttps://www.kentik.com/blog/netflow-vs-sflow<![CDATA[Ken Osowski]]>Tue, 11 Sep 2018 09:00:38 GMT<p>Flow-based network monitoring relies on collecting information about packet flows (i.e., a sequence of related packets) as they traverse routers, switches, load balancers, ADCs, network visibility switches, and other devices. Network elements that support traffic monitoring protocols such as <a href="https://www.kentik.com/kentipedia/netflow-overview/" title="Kentipedia: What is NetFlow? An Overview of the NetFlow Protocol">NetFlow</a> and <a href="https://www.kentik.com/kentipedia/big-data-sflow-analysis/" title="Kentipedia: Big Data sFlow Analysis">sFlow</a> extract critical details from packet flows, like source, destination, byte and packet counts, and other attributes. This “metadata” is then streamed to “flow collectors” so that it can be stored and analyzed to produce network-wide views of bandwidth usage, identify abnormal traffic patterns that could represent possible security threats, and zero in on congestion.</p> <p>Flow-based monitoring provides significant advantages over other network monitoring methods. It can capture substantially more detail about network traffic composition than SNMP metrics, which show only total traffic volume. It’s also significantly less expensive and easier to deploy than raw packet capture. And since flow generation capability is now a built-in feature of almost all modern network equipment, it provides a pervasive monitoring footprint across the entire network.</p> <h2 id="what-is-netflow">What is NetFlow?</h2> <p><a href="https://www.kentik.com/kentipedia/netflow-overview/" title="Kentipedia: What is NetFlow? An Overview of the NetFlow Protocol">NetFlow</a> is the trade name for a <strong>session sampling</strong> flow protocol invented by Cisco Systems that is widely used in the networking industry. In networking terms, a “flow” is a unidirectional set of packets sharing common attributes such as source and destination IP, source and destination ports, IP protocol, and type of service. NetFlow statefully tracks flows (or sessions), aggregating packets associated with each flow into flow records, which are then exported. NetFlow records can be generated based on every packet (unsampled or 1:1 mode) or based on packet sampling. Sampling is typically employed to reduce the volume of flow records exported from each network device.</p> <p>The most commonly deployed NetFlow versions are v5 and v9, with the main difference being that v9 supports “templates,” which allow flexibility over the fields contained in each flow record, while v5 flow records contain a fixed list of fields. IPFIX is an IETF standards-based protocol that is largely modeled on NetFlow v9. Other NetFlow variants supported by various network equipment vendors include <a href="/kentipedia/j-flow-analysis/" title="Kentipedia: J-Flow">J-Flow</a> (a NetFlow v5 variant) from Juniper, cflowd from Alcatel-Lucent and Juniper, NetStream from 3Com/Huawei, and RFlow from Ericsson.</p> <div as="Promo"></div> <h2 id="what-is-sflow">What is sFlow?</h2> <p>sFlow, short for “sampled flow,” is a <strong>packet sampling</strong> protocol created by InMon Corporation that has seen broad network industry adoption. Adopters include network equipment vendors that already support NetFlow, such as Cisco, router vendors like Brocade (now Extreme), and many switching products, including those from Juniper and Arista.
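It is used to record statistical, infrastructure, routing, and other metadata about traffic traversing an sFlow-enabled network device. sFlow doesn’t statefully track flows, as is the case with NetFlow, but instead exports a statistical sampling of individual packet headers, along with the first 64 or 128 bytes of the packet payload. sFlow can also export metrics derived from time-based sampling of network interfaces or other system statistics.</p> <p>As a rough illustration of that sampling model, here’s a toy Python sketch of the 1-out-of-N selection and the truncation to leading bytes. Real sFlow agents do this in hardware at line rate; this only shows the idea:</p>
<pre><code class="language-python">import random

SAMPLE_N = 100      # illustrative sampling rate of 1:100
HEADER_BYTES = 128  # keep the header plus the first bytes of payload

def sflow_sample(packets, n):
    """Randomly keep ~1-in-N packets, truncated to their leading bytes."""
    return [pkt[:HEADER_BYTES] for pkt in packets if random.randrange(n) == 0]

packets = [bytes(1500) for _ in range(100_000)]  # stand-ins for wire packets
samples = sflow_sample(packets, SAMPLE_N)

# Scaling the sample count back up by N estimates the true packet count.
print(len(samples) * SAMPLE_N)  # ~100,000
</code></pre>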
<p>sFlow does not support unsampled mode like NetFlow does, nor does it timestamp traffic flows. It relies on accurate and reliable statistical sampling methods for documenting flows, thereby reducing the amount of flow information that ultimately needs processing and analysis.</p> <h2 id="key-differences-between-netflow-and-sflow">Key Differences Between NetFlow and sFlow</h2> <p>Here are some key differences between NetFlow and sFlow:</p> <ul> <li>NetFlow does not forward packet samples directly to collectors, but instead exports “flow records,” created by tracking the collection of packets associated with a session. This session-specific, summary flow information is created as a single record in the network device’s RAM or TCAM. The device then exports a NetFlow datagram that contains multiple flow records. This stateful session tracking requires its share of network device CPU and memory resources — in some cases, a significant amount of resources when higher packet sampling rates are configured.</li> <li>sFlow packet sampling consists of randomly sampling individual packets. Based on a defined sampling rate, an average of 1 out of N packets is randomly sampled. sFlow captures packet headers and partial packet payload data into sFlow datagrams that are then exported to collectors.</li> <li>Since sFlow captures the entire packet header, by default it’s able to provide full layer 2–7 visibility into all types of traffic flowing across the network including MAC addresses, VLANs, and MPLS labels, in addition to the Layer 3 and 4 attributes typically reported by NetFlow. sFlow has less resource impact on devices since it only performs packet sampling and does not have to identify and keep track of sessions as is the case with NetFlow.</li> <li>sFlow has the option to export interface and other system counters to collectors. Counter sampling performs periodic, time-based sampling or polling of counters associated with an interface enabled for sFlow. Interface statistics from the counter record are gathered and sent to collectors by sFlow. sFlow analysis applications can then display the traffic statistics in a report, which helps isolate network device issues.
Three different categories of counters can be generated: <ul> <li>Generic interface counters: records basic information and traffic statistics on an interface</li> <li>Ethernet interface counters: records traffic statistics on an Ethernet interface</li> <li>Processor information: records CPU usage and memory usage of a device</li> </ul> </li> </ul> <table> <tbody> <tr> <td style="background-color: #22548b; color: #fff;"><strong>Flow Protocol Capabilities</strong></td> <td style="background-color: #22548b; color: #fff;"><strong>NetFlow</strong></td> <td style="background-color: #22548b; color: #fff;"><strong>sFlow</strong></td> </tr> <tr> <td>Packet Capturing</td> <td>Not Supported</td> <td>Partially Supported</td> </tr> <tr> <td>Interface Counters</td> <td>Not Supported</td> <td>Fully Supported</td> </tr> <tr> <td>IP/ICMP/UDP/TCP</td> <td>Fully Supported</td> <td>Fully Supported</td> </tr> <tr> <td>Ethernet/802.3</td> <td>Not Supported</td> <td>Fully Supported</td> </tr> <tr> <td>Packet Headers</td> <td>Specific Fields Only</td> <td>Fully Supported</td> </tr> <tr> <td>IPX</td> <td>Not Supported</td> <td>Fully Supported</td> </tr> <tr> <td>Appletalk</td> <td>Not Supported</td> <td>Fully Supported</td> </tr> <tr> <td>Input/Output Interfaces</td> <td>Fully Supported</td> <td>Fully Supported</td> </tr> <tr> <td>Input/Output VLAN</td> <td>Some Vendors</td> <td>Fully Supported</td> </tr> <tr> <td>Source &amp; Destination Subnet/Prefix</td> <td>Fully Supported</td> <td>Fully Supported</td> </tr> </tbody> </table> <h2 id="what-are-the-applications-of-netflow-and-sflow">What are the Applications of NetFlow and sFlow?</h2> <p>Network monitoring and analysis are obviously of great value to network operators. Network flow data—because it carries far more detail than SNMP counters while remaining much lighter-weight than raw packet capture—enables deeper analysis. NetFlow and sFlow enable a wide variety of network monitoring, application monitoring, network planning, troubleshooting, and security use cases, such as:</p> <ul> <li>Visibility into network and bandwidth usage by application and users</li> <li>LAN/WAN traffic measurement</li> <li>Troubleshooting and diagnosing network problems</li> <li>Detecting anomalous network traffic and illicit network usage (as in DDoS attacks)</li> <li>Measuring and monitoring quality-of-service or other key performance indicators and ensuring adherence to SLAs (service-level agreements)</li> </ul> <h2 id="about-flow-protocol-sampling">About Flow Protocol Sampling</h2> <p>Both NetFlow and <a href="https://www.kentik.com/kentipedia/big-data-sflow-analysis/" title="Kentipedia: Big Data sFlow Analysis">sFlow</a> use sampling techniques. With sFlow it’s required, and with NetFlow it’s optional. There is a long-running discussion in the industry about the accuracy of data and insight derived from sampling-based flow protocols. With sampling enabled, network devices generate flow records from a 1-in-N subset of traffic packets. As the variable N increases, flow records derived from the samples may become less representative of the actual traffic, especially for low-volume flows over short time windows. In the real world, how high can N be while still enabling us to see a given traffic subset that is a relatively small part of the overall volume? What is the impact on the accuracy of flow record analysis?
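<a href="https://www.kentik.com/blog/accuracy-in-low-volume-netflow-sampling/">Testing performed by Kentik</a> indicates that even at sampling rates as low as 1:10000, lower bandwidth traffic patterns are discernible even in high throughput networks.</p> <p>The arithmetic behind such estimates is simple enough to sketch. Assuming plain random 1-in-N sampling, a collector scales the counts seen in samples back up by N, and the expected relative error shrinks with the square root of the number of samples observed for a given traffic subset — a common rule of thumb rather than an exact bound:</p>
<pre><code class="language-python">import math

def estimate_bytes(sampled_sizes, n):
    """Scale the bytes seen in samples back up by the sampling rate N."""
    return sum(sampled_sizes) * n

def relative_error(samples_seen):
    """Rule of thumb: expected error falls off as 1 / sqrt(samples seen)."""
    return 1 / math.sqrt(samples_seen) if samples_seen else float("inf")

# A flow observed in 400 samples of 1,500 bytes each, at 1:10000 sampling:
print(estimate_bytes([1500] * 400, 10_000))  # 6,000,000,000 bytes estimated
print(f"{relative_error(400):.1%}")          # ~5.0% expected relative error
</code></pre>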
<a href="https://www.kentik.com/blog/accuracy-in-low-volume-netflow-sampling/">Testing performed by Kentik</a> indicates that even at sampling rates as low as 1:10000, lower bandwidth traffic patterns are discernible even in high throughput networks.</p> <h2 id="choosing-between-netflow-and-sflow">Choosing Between NetFlow and sFlow</h2> <p>So which is better? In many ways, sFlow provides a more comprehensive picture of network traffic, because it includes the full packet header, from which any field can be extracted, where NetFlow typically contains only a subset of those fields. sFlow also typically places less load on network devices. In many cases, the choice is not up to the user though, because most networking gear supports only one or the other. Many networks contain gear from multiple vendors, and the key question for the network operator then becomes — does my network monitoring platform support all of the flow protocols that my network generates? This is an important consideration to ensure there are no visibility gaps across the infrastructure.</p> <p>Contemporary big data network monitoring platforms, such as Kentik, are well suited to cope with network monitoring challenges. Kentik’s adoption of a big data architecture is at the core of their network flow-based monitoring platform, which supports NetFlow, IPFIX, and sFlow protocols. This allows Kentik to correlate high volumes of flow data records for customers, eliminating network monitoring accuracy concerns. Big data is not only about handling large volumes of data, but also letting network operations staff navigate through and explore that data very quickly.</p> <p>For an excellent overview of the origins, evolution and extensibility of NetFlow and sFlow, see Kentik CEO Avi Freedman’s blog posts on <a href="https://www.kentik.com/netflow-sflow-and-flow-extensibility-part-1/" title="NetFlow, sFlow, and Flow Extensibility: Part 1 of 2">NetFlow, sFlow, and Flow Extensibility: Part 1</a> and <a href="https://www.kentik.com/netflow-sflow-and-flow-extensibility-part-2/" title="NetFlow, sFlow, and Flow Extensibility: Part 2 of 2">NetFlow, sFlow, and Flow Extensibility: Part 2</a> and, more recently, his <a href="https://www.kentik.com/blog/the-network-also-needs-to-be-observable-part-3-network-telemetry-types" title="The Network Also Needs to be Observable, Part 3: Network Telemetry Types">overview of network telemetry types</a>.</p> <p>To see how Kentik can help your organization instrument its network with multiple flow-based protocols, <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Empower Your Customers with Self-Service Analytics]]><![CDATA[Service assurance and incident response are just one side of the network performance coin. What if you could use the same data to provide additional value to customers, and highlight the great service you provide? Today we announced the “My Kentik” portal to do just that. Read the details in this post. ]]>https://www.kentik.com/blog/empower-your-customers-with-self-service-analyticshttps://www.kentik.com/blog/empower-your-customers-with-self-service-analytics<![CDATA[Jim Meehan]]>Wed, 05 Sep 2018 07:30:52 GMT<p>For operations teams, delivering a great customer or user experience is paramount. Keeping the network performant and reliable is job number one, and most teams would agree that pervasive, real-time data is essential to accomplishing that task. 
Service assurance and incident response are just one side of the coin though. What if you could use the very same data to highlight to customers the great value that your service provides?</p> <p>Today, we’re announcing the availability of “<a href="https://www.kentik.com/resources/my-kentik/">My Kentik</a>” which allows Kentik customers to present filtered, curated views of network data to their own downstream customers or internal divisions. My Kentik is now a built-in part of the Kentik platform, and powered by the same datasets that Kentik already collects for service assurance, analytics, and reporting tasks. Delivered as a branded portal experience, My Kentik makes it easy for end customers to access both details and big-picture insights related to the services they purchase from the network provider.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3bGjF4gpnySKiEsswsyIe0/693016d90b118c02d2714b18ce224123/my-kentik1.png" class="image right no-shadow" style="max-width: 1200px" alt="My Kentik" /> <p>My Kentik creates multiple new opportunities to leverage network data for increased value across the business:</p> <ul> <li>Reduce support and billing caseload by providing customers with self-service access to reports and data about their service utilization</li> <li>Increase the stickiness and perceived value of services with complimentary analytics and reporting</li> <li>Generate additional revenue by offering analytics as an add-on service.</li> </ul> <p>My Kentik includes extensive GUI-driven customization capability for administrators so network providers can tailor dashboards and reports that are specific to the types of services they offer, or even specific to individual customers. Enterprises can provide curated views for each department, WAN site, or application team within a datacenter.</p> <p>The types of views that can be provided range across the whole spectrum of services that providers or enterprises offer to downstream customers or internal divisions. 
Some example scenarios include:</p> <ul> <li><strong>DDoS / MSSP providers</strong>:  Views for each protected customer showing alerts and associated traffic details</li> <li><strong>Peering exchanges</strong>:  Traffic matrices by MAC or ASN and peering opportunity views for each exchange participant</li> <li><strong>IP transit providers</strong>:  Utilization by port and PoP, traffic breakdowns by geo, remote ASN, and traffic type for each transit customer</li> <li><strong>Hosting / IaaS</strong>:  Per-host and per site / PoP utilization, application / service breakdowns, on-net / off-net traffic breakdown for each hosting customer</li> <li><strong>Enterprise</strong>: WAN traffic utilization, on / off-net breakdown, local and remote top talkers view, per cloud &#x26; SaaS provider breakdown for each department or WAN location</li> <li><strong>Datacenter</strong>:  Traffic matrix of provider / consumer services, top talker visibility, and network impact awareness for each application team</li> </ul> <p>The configuration controls for My Kentik are provided within the Admin menu of the Kentik portal, and allow Kentik customers to create a custom URL and logo for downstream users to access their branded My Kentik portal.</p> <img src="//images.ctfassets.net/6yom6slo28h2/FmEpWKT2OQSC2YY4MSEuY/d465cb98965c2659cc3bf48b53a44d05/admin-menu.png" class="image center" style="max-width: 840px" alt="Admin menu" /> <p>Configuring individual tenant customers is also straightforward, with controls for data filtering and which views are accessible.  Admins within the individual tenants can also create logins for additional users to access that tenant’s views.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5KSfn3A5bOemUI2SQKqmwq/57d7cca6cd76ceedb1b03725218a99b0/edit-subtenant1-1016x1024.png" class="image center" style="max-width: 600px" alt="Edit Subtenant" /> <p>For additional information, please see the <a href="https://kb.kentik.com/Cb15.htm">Subtenancy</a> topic in the Kentik Knowledge Base or contact the Kentik Customer Success team at <a href="mailto:[email protected]">[email protected]</a>.</p> <p>My Kentik is available now for all Kentik customers, with a limited number of subtenants. For additional details about pricing packages for additional subtenants, please contact your Kentik account team or email <a href="mailto:[email protected]">[email protected]</a>.</p><![CDATA[The State of Network Management in 2018]]><![CDATA[With increased business reliance on internet connectivity, the network world has and will continue to get increasingly complex. In this post, we dig into the key findings from our new “State of Network Management in 2018” report. We also discuss why we’re just in the early stages of how our industry will need to transform. ]]>https://www.kentik.com/blog/state-of-network-management-2018https://www.kentik.com/blog/state-of-network-management-2018<![CDATA[Michelle Kincaid]]>Thu, 23 Aug 2018 06:45:32 GMT<img src="//images.ctfassets.net/6yom6slo28h2/2jOP7TBWjeOkQq0q4U4k6o/99f018c425ab0f98b24ac1b45c1db71d/state-of-network-management-full.png" class="image right" style="max-width: 400px" alt="State of Network Management report" /> <p>Between multiprotocol label switching (MPLS) and software-defined networking (SDN), there were about 15 years where the networking world was pretty static. That has changed. 
Right now, we’re in a world moving as fast as the ISP world did back in the 90s, and every few weeks there’s something new.</p> <p>To see just how much the networking industry has changed, as well as current trends and challenges, we conducted a survey of 531 networking industry professionals at the recent Cisco Live 2018 conference. Today, we’re announcing the results in a new report: “<strong>The 2018 State of Network Management Report</strong>.”</p> <p>Check out the key findings below and <strong><a href="https://www.kentik.com/resources/whitepaper-state-of-network-management-2018/">download the report</a></strong> to get all of the details, including commentary on the findings from our Co-founder and CEO Avi Freedman, who has been in the networking industry for over three decades.</p> <h3 id="key-findings">Key Findings:</h3> <ul> <li><strong>Automation is trending:</strong> The largest share of respondents (35%) marked automation as the most important network trend right now. Yet, only 15% of respondents said their organization is prepared for it.</li> <li><strong>AI &#x26; ML buzzword fatigue:</strong> Despite the industry buzz around artificial intelligence (AI) and machine learning (ML), less than 1% of respondents marked it as a most important trend. However, 45.2% of respondents do perceive it to be helpful for network management.</li> <li><strong>User experience network challenges:</strong> Data breach was “the biggest network worry” (33.1% of responses). However, user experience was right behind it (28.8% of responses). As more organizations conduct business online, network outages now tie directly to customer success for many companies across industries.</li> <li><strong>Proliferation of tools for cloud visibility:</strong> There has been a huge proliferation of tools to manage cloud and internet dependencies. As a result, many organizations are trying various combinations of tools to manage the visibility challenge. Network traffic analytics was the most common approach, used by 28.3% of network professionals.</li> <li><strong>Shared tools challenge becomes more real:</strong> A majority of respondents (67%) agreed that using the same stack of tools to manage both network performance and security could significantly improve operational efficiencies. However, only about 40% of respondents (39.5%) said their organization is using the same stack of tools to manage both network performance and network security.</li> <li><strong>Age-old problem with incident response still exists:</strong> The largest share of respondents (30.1%) said the hardest part of managing and resolving an incident on their networks is that users or customers know about incidents before they do. Another 26% reported that their biggest challenge with incident response is that data exists, but they can’t access or analyze it easily. Without the ability to analyze network data in real time, network professionals cannot mitigate issues before they affect users and customers.</li> </ul> <p>With increased business reliance on internet connectivity, the network world has become, and will continue to become, increasingly complex. We’re just in the early stages of how our industry will need to transform.</p> <p>But there’s good news: There’s progress being made. Many teams, including ours here at Kentik, are focused on rapidly solving these problems.
To learn more about Kentik’s network analytics, <a href="#demo_dialog">request a demo</a> or <a href="#signup_dialog">sign up for a free trial</a> today. And don’t forget to <a href="https://www.kentik.com/go/2018-state-of-network-management/">download our 2018 State of Network Management Report</a>.</p><![CDATA[Service Assurance Starts with Network Analytics]]><![CDATA[Managing quality of service and service-level agreements (SLAs) is becoming more complex for service providers. In this post, we look at how and why enterprise cloud services and application usage are driving service providers to rethink service assurance metrics. We also discuss why network-based analytics is critical to satisfying service assurance needs. ]]>https://www.kentik.com/blog/network-analytics-for-service-assurancehttps://www.kentik.com/blog/network-analytics-for-service-assurance<![CDATA[Ken Osowski]]>Fri, 10 Aug 2018 09:00:09 GMT<img src="//images.ctfassets.net/6yom6slo28h2/Fnblde2LeKMumi6Qq64sW/f44ddddb64fae66a5ffbf5e2cea67884/service-assurance.png" class="image right no-shadow" style="max-width: 190px" alt="" /> <p>Enterprises are increasingly trying to reduce costs while improving agility, scalability, and time-to-market. To do that, many are turning to cloud services and applications. They’re also moving their internal service infrastructure to public and hybrid cloud environments.</p> <p>Service providers see this as an opportunity to add incremental revenue streams by leveraging demand from their installed base. However, to do that, the providers must move beyond quality of service (QoS) and service-level agreement (SLA) metrics for traditional physical, protocol-driven networks, such as MPLS or Carrier Ethernet. Now, with enterprises relying more heavily on cloud services and application usage across hybrid networks (consisting of both private and public networks), the key to making service providers’ end users happy is determining the quality and consistency of the end user’s application experience.</p> <p>This is at the heart of what is called service assurance in the service provider community.</p> <h2 id="service-assurance-challenges">Service assurance challenges</h2> <p>Virtualization and software-defined networks (SDN) are two trends requiring a rethinking of how legacy service assurance is accomplished. Next-generation service delivery platforms need to deliver a superior customer experience in a more efficiently managed SDN environment.</p> <p>The first step in this transition is to virtualize network functions, then introduce software control of the network, and then add automation. Automation, accompanied by monitoring, is crucial in an end-user environment where devices and applications are constantly changing.</p> <p>This requires service providers to think beyond existing network SLAs and ensure cloud service assurance. Service providers now must correlate application content usage with the network and session layers to provide a competitive and satisfactory customer experience that accelerates customer acceptance of cloud services.</p> <p>SDN-enabled service provider cloud environments are purposely designed to be highly dynamic in nature, while requiring a real-time view of the state of the infrastructure that is regularly updated and impacted by network changes.
Service assurance now requires understanding network changes and their impacts on application performance through a combination of network and application monitoring to overcome the challenges that this new agile services environment creates.</p> <h2 id="service-assurance-best-practices">Service assurance best practices</h2> <p>As service providers prepare for effective service assurance, many are hiring service assurance engineers and managers to help. Key areas of focus for service assurance include:</p> <ul> <li>Managing all aspects of service quality assurance with active participation in major service projects.</li> <li>Assuring service quality management with a strong focus on internal and external communications.</li> <li>Coordinating and consolidating changes, incidents, and problems related to telecom services.</li> <li>Collecting data and preparing regular and ad-hoc management reports regarding changes, incidents, and service problems.</li> <li>Working with internal and external parties to maintain network stability and service levels.</li> <li>Defining and monitoring service quality measurements.</li> </ul> <p><a href="https://www.kentik.com/solutions/network-business-intelligence/">Network-based analytics</a> is critical to satisfying service assurance needs. Network analytics has the power to match application traffic patterns across various protocols and correlate them with the data paths traversed throughout the network. This allows network assurance engineers to understand what should be considered normal and/or baseline performance measurements and use this information to identify suboptimal paths, packet loss, congestion points, or security threats.</p> <h2 id="embracing-service-assurance-methodologies">Embracing service assurance methodologies</h2> <p>Effective application performance management depends on <a href="https://www.kentik.com/solutions/data-center-hybrid-cloud/#hybrid-cloud-visibility">modern network performance management tools</a>. Kentik helps service assurance engineers who are tasked directly with ensuring the viability and performance of their organizations’ applications, no matter what networks are used. Kentik builds on highly accurate network visibility in near real-time to facilitate service assurance, enabling:</p> <ul> <li><strong>End-to-end network visibility</strong> to help with increased application complexity and high rates of change in SDN environments.</li> <li><strong>Highly accurate network monitoring</strong> to put past events into context where they may have otherwise been ignored.</li> <li><strong>Troubleshooting application problems</strong> before they become service-affecting.</li> <li><strong>Visibility for all devices</strong> on the network with assurance that a connectivity problem can be readily identified.</li> <li><strong>End-to-end network visibility</strong> to predict how future application usage might impact network performance.</li> </ul> <p>Kentik’s adoption of a big data architecture is at the core of the <a href="https://www.kentik.com/product/kentik-platform/" title="Learn More About Our Platform: The Kentik Network Observability Platform">Kentik Network Observability Platform</a>.
This brings great advantages for service assurance analytics, as big data is not only about handling large volumes of data, but also about letting network operations staff navigate through and explore that data very quickly.</p> <p>Service assurance will be a key driver for using network data to understand how organizations are operating from second to second and how specific service assurance use cases are analyzed and presented. To see how Kentik can help your organization analyze, monitor, and react to traffic patterns in the context of service assurance best practices, <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Kentik for Google Cloud VPC Flow Logs]]><![CDATA[VPC Flow Logs Blog Post: The migration of applications from traditional data centers to cloud infrastructure is well underway. In this post, we discuss Kentik’s new product expansion to support Google’s VPC Flow Logs. ]]>https://www.kentik.com/blog/kentik-for-google-cloud-vpc-flow-logshttps://www.kentik.com/blog/kentik-for-google-cloud-vpc-flow-logs<![CDATA[Jim Meehan]]>Tue, 24 Jul 2018 08:18:55 GMT<p>It’s no secret that the migration of applications from traditional data centers to cloud infrastructure is well underway. And it’s tempting to think that “the network” is just one of the many infrastructure management headaches that disappear after migrating to the cloud. However, most organizations find that understanding the network behavior of cloud-deployed applications is still a critical part of ensuring their availability and performance. In many cases, it’s even more important than before, given the increasing scale and distributed nature of modern applications.</p> <div class="pullquote right">Kentik customers can get full visibility into network activity within GCP projects — and also between GCP and traditional data centers in hybrid cloud architectures.</div> <p>Because the underlying network devices are abstracted away from the end user in cloud environments, cloud providers haven’t historically been able to provide detailed telemetry about network activity. So ops teams couldn’t count on the data — NetFlow, sFlow, and IPFIX — they had used in traditional data center environments to get a full understanding of how applications talk to each other on the network. This loss of visibility and traditional tooling — “going dark” — was an unfortunate but unavoidable tradeoff when moving to the cloud.</p> <p>Fortunately, that situation is now changing for the better. Google recently announced the <a href="https://cloudplatform.googleblog.com/2018/04/introducing-VPC-Flow-Logs-network-transparency-in-near-real-time.html" title="Google: Introduction of VPC Flow Logs">availability of VPC Flow Logs</a> for Google Cloud Platform, which provide detailed, real-time telemetry of network activity within and between VPCs inside GCP projects. It’s very much like NetFlow for VPCs, but better. VPC Flow Logs provide 5-second granularity, whereas NetFlow is typically 1-minute granularity.
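VPC Flow Logs also contain fields with network latency measurements, and tags that identify various attributes (VM, VPC, and region / zone names) associated with the source and destination of the traffic, which provides extremely useful context about each flow, and the underlying network activity it represents.</p> <p>For a sense of what that context looks like, here’s an abridged sketch of a single record’s payload as Python data. The field names follow the VPC Flow Logs schema, but the values are invented and many fields are trimmed for readability:</p>
<pre><code class="language-python"># One VPC flow log record's jsonPayload, abridged (values invented; note
# that Cloud Logging renders 64-bit integer fields as strings):
record = {
    "connection": {"src_ip": "10.128.0.2", "src_port": 52318,
                   "dest_ip": "10.132.0.7", "dest_port": 443, "protocol": 6},
    "bytes_sent": "19874",
    "packets_sent": "82",
    "rtt_msec": "11",       # latency, measured for TCP flows
    "reporter": "SRC",      # which side of the connection reported the flow
    "src_instance": {"vm_name": "web-1", "zone": "us-central1-a"},
    "dest_instance": {"vm_name": "api-3", "zone": "europe-west1-b"},
}

# The VM and zone annotations make the flow meaningful without extra lookups:
conn = record["connection"]
print(f'{record["src_instance"]["vm_name"]}:{conn["src_port"]} -> '
      f'{record["dest_instance"]["vm_name"]}:{conn["dest_port"]}, '
      f'{record["bytes_sent"]} bytes, rtt {record["rtt_msec"]} ms')
</code></pre>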
<h2 id="vpc-flow-logs-are-now-supported-in-kentik-with-google-cloud-platform-specific-tags">VPC Flow Logs are now supported in Kentik with Google Cloud Platform-specific tags</h2> <p>Kentik now also supports VPC Flow Logs as a data source and fully exposes the GCP-specific tags as dimensions and filter terms through the Kentik UI. This means Kentik customers can get full visibility into network activity within GCP projects — and also between GCP and traditional data centers in hybrid cloud architectures. The latter situation is one where customers have told us they are especially blind.</p> <p>Kentik + VPC Flow Logs provide teams across the operations spectrum with an extraordinarily useful tool to ensure the availability and performance of services, and maintain a great user / customer experience.</p> <h3 id="vpc-flow-logs-for-devops-or-sre-teams-improved-situational-awareness-and-service-assurance">VPC Flow Logs for DevOps or SRE Teams: Improved Situational Awareness and Service Assurance</h3> <p>“What happened?” or “What is happening?” is always the question of the moment. When services go down or some other unexpected condition is impacting user experience, the clock is ticking. Logs are essential, but often don’t tell the whole story. The network “sees all,” and Kentik’s ability to retain a detailed, real-time picture of network activity provides instant answers to key questions like:</p> <ul> <li>Who / what is talking to the service in question? Top talkers?</li> <li>How has traffic volume or distribution changed after the incident started?</li> <li>Is this the only service affected, or do other services see similar changes?</li> </ul> <p>Fast filtering, pivots, and drill-downs let teams quickly get to root cause and gather the details they need to restore services to a healthy state. Going a step further, Kentik also baselines normal traffic distribution to / from services or hosts, to proactively detect potential problems when conditions change, for an even faster response. As an API-first platform, Kentik is also easy to integrate with cloud deployment and incident response toolchains.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1REGzUJaxaekIkg4qU0W86/ef6f6e623582d1758589f402edddef21/devops-sre.png" class="image center" style="max-width: 815px" alt="Google VPC Flow Logs for Devops" /> <h3 id="vpc-flow-logs-for-netops-and-neteng-teams-comprehensive-planning-and-trending">VPC Flow Logs for NetOps and NetEng Teams: Comprehensive Planning and Trending</h3> <p>At scale, GCP projects quickly become complex. Various application tiers may be deployed across multiple zones and regions, and potentially communicating with remote services in a hybrid or multi-cloud architecture. Without a way to visualize traffic flows and service dependencies, it becomes nearly impossible to understand the big picture and take a data-driven approach to cloud infrastructure planning and growth.</p> <p>Kentik’s flexible visualizations and dashboards can provide NetOps and NetEng teams with easy answers for:</p> <ul> <li>Traffic growth trending and capacity planning. Do we need to add Google Cloud Interconnects?  Where?</li> <li>Top traffic producers and consumers.
Which services create the highest cost exposure for inter-zone traffic?  Can we reduce cost by changing where some services are deployed, or refactoring them to be more network efficient?</li> <li>Which global geo-locations are accessing my services?  Are users being served by the zone that provides them the best performance / experience?  Would some user segments be better served by deployments in new zones?</li> <li>Service dependency mapping. Which Google services and legacy data center services does my application depend on?  If I migrate or decommission this service, which other services will be affected?</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/3eDWib4JygAgEQ84WiIeSQ/6f7000241de1483f9ceedd372babe1b2/netops.png" class="image center" style="max-width: 850px" alt="Google VPC Flow Logs for NetOps" /> <h3 id="for-secops-teamsvpc-flow-logs-enable-detailed-security-analytics-and-forensics">For SecOps Teams: VPC Flow Logs Enable Detailed Security Analytics and Forensics</h3> <p>Controls, policy, hardening and patching are all still basic tenets of security engineering and operations. But incident response is also a key capability for modern security teams. Competent incident response requires data — lots of it, and fast. Kentik’s ability to let users quickly navigate through a comprehensive log of network activity provides insight for security teams which is both broad and detailed. Since the network is both the point of entry and internal transport for threats, VPC Flow Logs provide pervasive instrumentation of potential threat activity to, from, and within GCP projects.</p> <p>Kentik’s fast, detailed archive of all VPC network activity can provide SecOps teams with the details they need for:</p> <ul> <li>Real-time awareness. Which connections are currently active to / from suspected compromised VMs?  Where do they originate? What else has that source talked to in my environment?</li> <li>Timeline. When did this activity start?</li> <li>Understanding scope / lateral movement. Were there subsequently any suspicious connections from this VM to other VMs in the VPC?  To VMs in other VPCs or projects?</li> <li>Uncovering potential data exfiltration. Were there any unexpected high-volume traffic flows from this VM or others out to the Internet?</li> </ul> <p>Kentik’s streaming alerting engine also baselines past network activity and provides notifications of potentially malicious activity, like traffic from unexpected geographies, or traffic between host pairs or service pairs that haven’t been seen before.</p> <img src="//images.ctfassets.net/6yom6slo28h2/G5eyOdeJ6SY0uQwasqqYA/75a43b5b1ca1860b244eeb4961d7a2da/secops.png" class="image center" style="max-width: 850px" alt="Google VPC Flow Logs for SecOps" /> <h3 id="for-management-and-executives-audit-cloud-networking-costs">For Management and Executives: Audit Cloud Networking Costs</h3> <p>Cloud data transfer costs 10 times wholesale Internet bandwidth rates, and even inter-VPC traffic costs more than many organizations pay for Internet connectivity. A key use case we see for Kentik customers is proactively monitoring for new applications consuming Internet, cloud interconnects, and inter-VPC traffic and nipping misrouted traffic in the bud. 
For example, traffic going over the Internet that should be going across private connections, or former customers pounding APIs and causing traffic charges.</p> <p>This is an extension of a classic cost optimization use case from customers’ in-house infrastructure, but it has become even more pressing with higher cloud bandwidth costs.</p> <h3 id="getting-started-with-kentik--vpc-flow-logs">Getting Started with Kentik + VPC Flow Logs</h3> <p><strong>It’s easy to add VPC Flow Logs from a GCP project into your Kentik account.</strong></p> <p>To summarize the steps:</p> <ol> <li>Enable VPC Flow Logs for one or more VPCs in your GCP project</li> <li>Enable export of VPC Flow Logs to a GCP Pub/Sub topic, and create a pull subscription with appropriate permissions for Kentik to access that topic.</li> <li>Set up the virtual device in Kentik that will act as the container for VPC Flow Logs from this topic / project.</li> </ol> <p>For detailed instructions, see the <a href="https://kb.kentik.com/Fc12.htm" title="Knowledge Base: Enabling VPC Flow Logs">Kentik for Google VPC</a> article in the Kentik Knowledge Base.</p> <p>If you need a Kentik account, you can <a href="#signup_dialog">sign up for a free trial here</a>.</p><![CDATA[The Evolving NetFlow Product Landscape]]><![CDATA[NetFlow offers a great way to preserve highly useful traffic analysis and troubleshooting details without needing to perform full packet capture. In this post, we look at how NetFlow monitoring solutions quickly evolved as commercialized product offerings and discuss how cloud and big data improve NetFlow analysis. ]]>https://www.kentik.com/blog/the-evolving-netflow-product-landscapehttps://www.kentik.com/blog/the-evolving-netflow-product-landscape<![CDATA[Ken Osowski]]>Fri, 20 Jul 2018 09:45:36 GMT<p>NetFlow is a protocol that was originally developed by Cisco to help network operators gain a better understanding of their network traffic conditions. Once NetFlow is enabled on a router or other network device, it tracks unidirectional statistics for each unique IP traffic flow, without storing any of the payload data carried in that session. By tracking only the metadata about the flows, NetFlow offers a way to preserve highly useful traffic analysis and troubleshooting details without needing to perform full packet capture — the latter of which can be very expensive and yield few incremental benefits.</p> <p>NetFlow monitoring solutions quickly evolved as commercialized product offerings represented by three main components:</p> <ul> <li><strong>NetFlow exporter</strong>: A NetFlow-enabled router, switch, probe or host software agent that tracks key statistics and other information about IP packet flows and generates flow records that are encapsulated in UDP and sent to a flow collector.</li> <li><strong>NetFlow collector</strong>: An application responsible for receiving flow record packets, ingesting the data from the flow records, pre-processing and storing flow records from one or more flow exporters.</li> <li><strong>NetFlow analyzer</strong>: An analysis application that provides tabular, graphical and other tools and visualizations to enable network operators and engineers to analyze flow data for various use cases, including network performance monitoring, troubleshooting, identifying security threats and capacity planning.</li> </ul> <p>Cisco started by providing NetFlow exporter functions in their various network products running Cisco’s IOS software.
Cisco has since developed a vertical ecosystem of NetFlow partners who have mainly focused on developing NetFlow collector and analysis applications to fill various network monitoring functions.</p> <p>In addition to Cisco, other networking equipment vendors have developed NetFlow-like or compatible protocols, such as J-Flow from Juniper Networks or sFlow from InMon, to create exporter interoperability with third-party collector and analysis application vendors that also support NetFlow, creating a horizontal ecosystem across networking vendors.</p> <p>The IETF also created a standard flow protocol format, IPFIX, which builds on Cisco’s NetFlow but serves as an open, industry-driven standard — flow protocols are now enhanced consistently for the entire networking industry rather than evolved unilaterally by Cisco.</p> <p>NetFlow collector and analysis applications represent two key capabilities of NetFlow network monitoring products that are typically implemented on the same server. This is appropriate when the volume of flow data being generated by exporters is relatively low and localized. In cases where flow data generation is high or where sources are geographically dispersed, the collector function can be run on separate and geographically distributed servers (such as rackmount server appliances). In these cases, collectors then synchronize their data to a centralized analyzer server.</p> <h3 id="netflow-productization">NetFlow Productization</h3> <p>Products that support NetFlow components can be classified as follows (with example vendor products listed in each category):</p> <p><strong>NetFlow exporter support in a device</strong>:</p> <ul> <li>Cisco 10000 and 7200 routers</li> <li>Cisco Catalyst switches</li> <li>Juniper MX and PTX series routers (via IPFIX)</li> <li>Alcatel-Lucent routers</li> <li>Huawei routers</li> <li>Enterasys switches</li> <li>Flowmon probes</li> <li>Linux devices</li> <li>VMware servers</li> </ul> <p><strong>Stand-alone NetFlow collector</strong>:</p> <ul> <li>SevOne NetFlow Collector</li> <li>NetFlow Optimizer (NetFlow Logic)</li> </ul> <p><strong>Stand-alone NetFlow analyzer</strong>:</p> <ul> <li>Solarwinds NetFlow Traffic Analyzer (NTA)</li> <li>PRTG Network Monitor</li> <li>ManageEngine NetFlow Analyzer</li> </ul> <p><strong>Bundled NetFlow collector and analyzer</strong>:</p> <ul> <li>Arbor Networks PeakFlow</li> <li>Plixer Scrutinizer</li> </ul> <p><strong>Open source NetFlow network monitoring</strong>:</p> <ul> <li>nfdump/nfsen</li> <li>nProbe</li> <li>SiLK</li> </ul> <p><strong>Network monitoring products that focus on machine and probe data</strong>:</p> <ul> <li>Splunk</li> <li>Sumo Logic</li> <li>ArcSight</li> <li>Cisco Tetration</li> </ul> <p>Open source network monitoring tools seem like a good option, <a href="https://www.kentik.com/blog/hidden-risks-open-source-network-flow-analyzers/">but they are very difficult to scale horizontally</a> and most do not understand IP addresses as anything more than text, so prefix aggregation and routing data cannot be used.</p> <p>Network incident monitoring vendors like Splunk collect a lot of machine and probe data. Many vendors in this product category are seeing the value of integrating NetFlow. However, these platforms are designed primarily to deal with unstructured data like logs.
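Highly structured data like NetFlow often contains fields with formats that require translation or correlation with other data sources to provide value to the end user.</p> <p>A short Python sketch shows just how rigidly structured this data is: a NetFlow v5 flow record is a fixed 48-byte binary structure whose fields only become meaningful once a collector unpacks and translates them. This minimal parser skips the 24-byte datagram header and all error handling:</p>
<pre><code class="language-python">import socket
import struct

# A NetFlow v5 flow record is a fixed 48-byte binary structure (it follows
# a 24-byte datagram header, skipped here); every field is positional.
V5_RECORD = struct.Struct("!4s4s4sHHIIIIHHBBBBHHBBH")

def parse_v5_record(buf):
    """Translate one raw v5 record into named fields a human can use."""
    (src, dst, nexthop, in_if, out_if, pkts, octets, first, last,
     sport, dport, _pad1, tcp_flags, proto, tos, src_as, dst_as,
     src_mask, dst_mask, _pad2) = V5_RECORD.unpack(buf)
    return {
        "src_ip": socket.inet_ntoa(src), "dst_ip": socket.inet_ntoa(dst),
        "next_hop": socket.inet_ntoa(nexthop),
        "in_if": in_if, "out_if": out_if,
        "packets": pkts, "bytes": octets,
        "src_port": sport, "dst_port": dport,
        "protocol": proto, "tcp_flags": tcp_flags,
        "src_as": src_as, "dst_as": dst_as,
    }

print(parse_v5_record(bytes(48)))  # all-zero record, just to show the shape
</code></pre>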
Highly structured data like NetFlow often contains fields with formats that require translation or correlation with other data sources to provide value to the end user.</p> <h3 id="pushing-netflow-limits">Pushing NetFlow Limits</h3> <p>With DDoS attacks on the rise, NetFlow has been increasingly used to identify these threats. NetFlow is most effective for DDoS troubleshooting when sufficient flow record detail is available and can be compared with other data points such as performance metrics, routing and location. Unfortunately, until recently even state-of-the-art NetFlow analysis tools have struggled to achieve troubleshooting effectiveness due to data reduction. The volume of NetFlow data can be overwhelming, with millions of flows per second per collector for large networks.</p> <p>Since most NetFlow collectors and analysis tools are based on scale-up software architectures hosted on single servers or appliances, they have extremely limited storage, compute and memory capacity. As a result, it is common practice to roll up the details into a series of summary reports and to discard the raw flow record details. The problem with this approach is that most of the detail needed for operationally useful troubleshooting is lost. This is particularly true when attempting to perform dynamic baselining, which requires scanning massive amounts of NetFlow data to understand what is normal, then looking back days, weeks or months in order to assess whether current conditions are the result of a DDoS attack or an anomaly.</p> <h3 id="how-cloud-and-big-data-improve-netflow-analysis">How Cloud and Big Data Improve NetFlow Analysis</h3> <p>Cloud-scale computing and big data techniques have opened up a great opportunity to improve both the cost and functionality of NetFlow analysis and troubleshooting use cases. These techniques include:</p> <ul> <li>Big data storage allows for the storage of huge volumes of augmented raw flow records instead of needing to roll up the data to predefined aggregates that severely restrict analytical options.</li> <li>Cloud-based SaaS options save the network managers from incurring CapEx and OpEx costs related to dedicated, on-premises appliances.</li> <li>Scale-out NetFlow analysis can deliver faster response times to operational analysis queries on larger data sets than traditional appliances.</li> </ul> <p>The key to solving the DDoS protection accuracy issue is big data. By using a scale-out system with far more compute and memory resources, a big data approach to DDoS protection can continuously scan network-wide data on a multi-dimensional basis without constraints.</p> <p>Cloud-scale big data systems make it possible to implement a far more intelligent approach to the problem, since they are able to:</p> <ul> <li>Track and baseline millions of IP addresses across network-wide traffic, rather than being restricted to device-level traffic baselining.</li> <li>Monitor for anomalous traffic using multiple data dimensions such as the source geography of the traffic, destination IPs, and common attack ports. This allows for greater flexibility and precision in setting detection policies.</li> <li>Apply learning algorithms to automate the upkeep of detection policies to include all relevant destination IPs.</li> </ul> <p>Kentik Detect has the functional breadth for capturing all the necessary network telemetry in a big data repository to isolate even the most obscure DDoS attacks and network events — as they happen or as they are predicted to happen.
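</p> <p>As a toy illustration of the baselining idea (not Kentik’s actual algorithm), the sketch below keeps an exponentially weighted moving average and deviation of per-minute byte counts for each destination IP, and flags traffic that strays far from its own history:</p> <pre><code># Toy traffic baseline: EWMA of per-minute byte counts per destination IP.
ALPHA = 0.1          # smoothing factor
K = 4.0              # deviations above baseline that count as anomalous

baselines = {}       # dst_ip maps to (ewma, ewm_deviation)

def observe(dst_ip, bytes_per_min):
    ewma, dev = baselines.get(dst_ip, (float(bytes_per_min), 0.0))
    error = abs(bytes_per_min - ewma)
    anomalous = dev > 0 and error > K * dev
    # update the running estimates
    ewma = ALPHA * bytes_per_min + (1 - ALPHA) * ewma
    dev = ALPHA * error + (1 - ALPHA) * dev
    baselines[dst_ip] = (ewma, dev)
    return anomalous

if observe("203.0.113.7", 9_500_000):
    print("traffic to 203.0.113.7 deviates sharply from its baseline")
</code></pre> <p>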
Network visibility using NetFlow is key to managing your network and maintaining the best possible security posture. To understand more about NetFlow see this <a href="https://www.kentik.com/kentipedia/what-is-netflow-overview/">Kentipedia article</a> and <a href="https://www.kentik.com/blog/accurate-visibility-with-netflow-sflow-and-ipfix/">blog post</a>. To see how Kentik Detect can help your organization monitor and adjust to network capacity patterns and stop DDoS threats, read this <a href="https://www.kentik.com/blog/improving-legacy-approaches-to-capacity-management/">blog</a>, <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[The Launch of Kentik’s First Open Source Project: Mobx Form]]><![CDATA[We just published our first open source project on GitHub and npm. Called Mobx Form, in this blog post we look at how the project helps developers with coding complex forms.]]>https://www.kentik.com/blog/the-launch-of-kentiks-first-open-source-project-mobx-formhttps://www.kentik.com/blog/the-launch-of-kentiks-first-open-source-project-mobx-form<![CDATA[Michelle Kincaid]]>Mon, 16 Jul 2018 07:00:00 GMT<p>Today we’re excited to announce we published our first open source project on GitHub and npm. So what is the project? And what does it do?</p> <p>If you’ve ever signed up for or purchased anything online, there’s a chance you’ve completed a series of tedious forms. The underlying technology behind those forms has not significantly changed in the last decade. However, with browsers becoming more interactive, libraries have been created to help developers implement robust, nontrivial forms.</p> <p>While a form often looks simple enough to an end user who completes it, for developers the code for that form can actually encompass a lot more work behind the scenes. ReactJS and Mobx are popular libraries for building highly interactive applications, but offer only low-level building blocks to handle complex data entry needs. Without a form-specific library, the very basics of forms — setup, validation, and submission — require significant boilerplate code.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5JZPw2q3qEwyQqIISAmU60/50d183bcea3ec2a9029e1d9a4b696354/complex-form.png" class="image right" style="max-width: 250px" /> <p>At Kentik, our network analytics platform requires dozens of complex forms (see image at right). Network and security teams enter everything from IP addresses to complex regular expressions, tackling everything from peering and capacity planning, to performance monitoring, and anomaly detection and alerting. When we started building our platform, <a href="https://www.kentik.com/platform/kentik-detect/">Kentik Detect®</a>, we went through an exercise of evaluating open source form libraries to see if they would fit our use cases. Our engineering team wanted a declarative solution to avoid boilerplate code and create consistent user experiences and code patterns, but knew they would need lots of imperative hooks to deal with special cases.</p> <p>Yet, as we continued to evolve our product and scale our user base to include the top enterprises and service providers internationally, it became clear that we’d need our own solution for forms. At the end of the day (actually 24/7), a great deal of work from our developers goes into ensuring our customers avoid value errors in our forms and, ultimately, that users gain fast insights into what is happening on their networks.
That’s why we created <a href="https://github.com/kentik/mobx-form">Mobx Form</a> — and we’re putting it up on GitHub and npm because we know other developers in our community might also need help tackling the complexity of these types of forms.</p> <p>Leading our Mobx Form project, Aaron Smith, Kentik’s engineering manager and one of the brains behind our sleek UI (see below), notes: “Our Mobx Form code was quickly incorporated into our product, and over the course of a year, we’ve been focused on adding to it and fixing bugs. It’s now meeting the needs of developers here at Kentik. And while it’s our first open source project, we’re looking forward to sharing more with the community as we continue to build upon our easy-to-use UI and fast network analytics.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/4idRegjlSUgSEWOyGe8m0w/51eaf442e988ad05f6f9d10b4de8520d/mobx-ui.png" class="image left" style="max-width: 500px" /> <p>To start using Kentik’s new Mobx Form library, <a href="https://github.com/kentik/mobx-form">check out the project on GitHub</a> or <a href="https://www.npmjs.com/package/@kentik/mobx-form">on npm</a>. If you’d like to see what our powerful network analytics platform can do for your network and your business, <a href="#demo_dialog">schedule a demo</a> or sign up for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Cloud-Scale Visibility Tools: The Right Stuff]]><![CDATA[Digital transformation is not for the faint of heart. In this post, ACG Analyst Stephen Collins discusses why it’s critical for ITOps, NetOps, SecOps and DevOps teams to make sure they have the right stuff and are properly equipped for the network visibility challenges they face.]]>https://www.kentik.com/blog/cloud-scale-visibility-tools-the-right-stuffhttps://www.kentik.com/blog/cloud-scale-visibility-tools-the-right-stuff<![CDATA[Stephen Collins]]>Tue, 10 Jul 2018 07:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/NzxFtOvRi8A0MmYgOMsQc/9de65e5770c08617262daf71a9ac7fc4/john-glenn.jpg" style="float: right; padding-left: 15px; width: 210px;" /> <p><custom-counter></custom-counter></p> <p>This series of guest posts has concentrated on the numerous challenges facing enterprise IT managers as businesses embrace digital transformation and migrate IT applications from private data centers into the cloud. The recurring theme has been the critical need for new tools and technologies for gaining visibility into cloud-scale applications, infrastructure and networks. In this post, I would like to finally expand on this theme.</p> <p>The scope of cloud-scale visibility is daunting and technically demanding. Monitoring needs to span multiple domains: the private enterprise data center and WAN; fixed and mobile service provider networks; the public Internet; and hybrid multi-cloud infrastructure. Full stack visibility is compulsory, including application software, computing infrastructure and visibility into both virtual network layers and the various physical underlay networks.</p> <p>Network and computing infrastructure is increasingly software-driven, allowing for extensive, full stack software instrumentation that provides monitoring metrics for generating KPIs. Software probes and agents that can be easily installed and spun up on-demand are displacing costly hardware probes that need to be physically deployed by on-site technicians.
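</p> <p>Such software probes can be remarkably lightweight. As a rough sketch (standard library only, with a hypothetical target URL), a synthetic check that times a full HTTP transaction might look like this:</p> <pre><code>import time
import urllib.request

def synthetic_check(url, timeout=5.0):
    """Time one synthetic HTTP transaction; return (ok, seconds)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()                  # pull the full body, like a real user
            ok = resp.status == 200
    except OSError:
        ok = False
    return ok, time.perf_counter() - start

ok, elapsed = synthetic_check("https://app.example.com/health")  # hypothetical endpoint
print(f"ok={ok} elapsed={elapsed * 1000:.1f} ms")
</code></pre> <p>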
Active monitoring techniques now play a key role in tracking the performance of cloud-based applications accessed via the Internet, including synthetic monitoring that simulates user application traffic flows for proactively detecting problems before they impact a large number of users.</p> <p>Performance metrics and other types of monitoring data can be collected in real time using streaming telemetry protocols such as gRPC. At the network layer, streaming telemetry data is displacing SNMP polling and CLI screen scraping for gaining visibility into state information. Now that support for NetFlow, sFlow and IPFIX is commonplace in routers and switches, flow metadata is a readily available source of telemetry for real-time visibility into network traffic flows across all monitoring domains.</p> <p>Network data is big data. The collection of massive amounts of streaming telemetry requires a high-speed data pipeline for ingesting data in real time and distributing it to the appropriate monitoring and analytics tools. Highly scalable Kafka clusters that utilize a publish/subscribe model are a commonly deployed pipeline solution, supplying telemetry data to multiple consumer analytics engines and tools.</p> <p>Streaming analytics engines consume and process data for generating operational insights in real time. Column-oriented databases ingest data to support near real-time multi-dimensional analytics for correlating a wide range of time series data types. Machine learning engines analyze huge data sets to discover correlations and trends that might be impossible for operators to discern using traditional monitoring techniques. Hadoop-based data lakes support offline batch processing on massive amounts of data for gaining business intelligence insights.</p> <p>While Big Data open source software is freely available, many enterprise IT organizations can’t sustain the investment needed for developing Big Data monitoring and analytics tools in-house, or their IT managers prefer to rely on the vendor community to supply fully supported productized solutions based on open source.</p> <p>Big Data was born in the cloud and Big Data analytics is well-suited for cloud-based deployments. SaaS-based Big Data analytics solutions are also an attractive option for organizations seeking a productized solution with low upfront costs, no on-site installation required and minimal ongoing maintenance.</p> <p>I conclude by referencing a quote often attributed to astronaut John Glenn — someone who unquestionably had “the right stuff.” Nobody is asking IT managers to do something as outrageously risky as “sitting on top of 2 million parts — all built by the lowest bidder on a government contract.” But digital transformation is not for the faint of heart, so it’s critical that ITOps, NetOps, SecOps and DevOps teams make sure they have the right stuff and are properly equipped for the challenges they are facing.</p><![CDATA[Moving from Silos to Synergy]]><![CDATA[Silos in enterprise IT organizations can inhibit cross-functional synergies, leading to inefficiencies, higher costs and unacceptable delays for detecting and repairing problems.
In this post, ACG analyst Stephen Collins examines how IT managers can start planning for now as the first step in moving from silos to synergy.]]>https://www.kentik.com/blog/moving-from-silos-to-synergyhttps://www.kentik.com/blog/moving-from-silos-to-synergy<![CDATA[Stephen Collins]]>Thu, 28 Jun 2018 13:00:51 GMT<img src="//images.ctfassets.net/6yom6slo28h2/41BbPKDAWQQqIIAaeGyUii/d10cab83b1f79e27040c6d9974e86df6/tech-silos.jpg" class="image right" style="max-width: 340px" alt="Tech silos" /> <p>Within business, there is a natural tendency for organizations to gravitate towards operational silos in which teams share information, resources, tools and techniques to achieve their common objectives.</p> <p>Enterprise IT organizations are no exception.</p> <p>Yet silos can inhibit cross-functional synergies, leading to inefficiencies, higher costs and unacceptable delays for detecting and repairing problems.</p> <p>Enterprise IT staff typically specialize in network operations (NetOps), security operations (SecOps) and IT operations (ITOps). DevOps is a new specialization that integrates rapid application development and operational deployment in small teams that deploy new software continuously.</p> <p>In most IT organizations, each team uses its own set of tools tailored for the specific needs of its operational silo. This also applies to the data that feeds these tools. Network managers use tools that ingest SNMP data, flow metadata and packet capture traces. Security teams sift through log data files and watch dashboards for various security appliances. IT operators monitor web servers, application resources, backend databases and storage subsystems.</p> <p>This mode of operation is perfectly acceptable, provided that there are few interdependencies between silos and when issues arise they can be quickly resolved through well-defined processes for exchanging information between teams.</p> <p>However, the migration of applications and services to the cloud — SaaS, hybrid multi-cloud, SD-WANs, IoT — is breaking this model. The heavy reliance on public network and application infrastructure makes it tricky to cleanly partition operational teams into silos. When something goes wrong, it won’t necessarily be easy to tell if a problem is due to network performance degradation, a poorly coded DevOps application or a failure within the cloud application infrastructure.</p> <p>The new generation of NetOps, SecOps, ITOps and DevOps tools will need to be more tightly integrated to facilitate cross-functional teamwork. Each team will rely on its own set of tools, but well-defined APIs will allow data to be shared between teams so that events can be rapidly correlated across operational stacks.</p> <p>For example, what at first glance might appear to be an infrastructure failure or performance anomaly could turn out to be a DDoS attack, with each set of tools interpreting the impact differently. By integrating multiple stacks to share data, operators can more quickly determine the actual root cause and proceed to mitigate the attack.</p> <p>Breaking down operational silos involves redesigning business processes and then selecting the right tools and creating the supporting software stacks to implement those processes. 
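</p> <p>On the tooling side, even a thin integration goes a long way: normalize events into a common shape and publish them somewhere every team’s stack can read. A minimal sketch of the idea; the endpoint and event schema below are hypothetical, standard library only:</p> <pre><code>import json
import urllib.request

def publish_event(kind, source, details):
    """Post a normalized event to a shared bus read by NetOps, SecOps and ITOps tools."""
    event = {"kind": kind, "source": source, "details": details}
    req = urllib.request.Request(
        "https://events.example.internal/api/v1/events",  # hypothetical shared endpoint
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

# The same anomaly becomes visible to every silo at once:
publish_event("traffic-anomaly", "netops-flow-monitor",
              {"dst": "198.51.100.10", "observed_gbps": 9.4, "baseline_gbps": 0.6})
</code></pre> <p>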
Certainly not a transition that will take place overnight, but one that IT managers should start planning for now as the first step in moving from silos to synergy.</p><![CDATA[Case Study: How Immedion Maintains Always-On, Secure Data Centers]]><![CDATA[With increased reliance on cloud and data centers, providers are under more pressure to maintain real-time network visibility to reduce potential threats to their service offerings. That's why provider Immedion chose Kentik. Read the case study.]]>https://www.kentik.com/blog/case-study-how-immedion-maintains-always-on-secure-data-centershttps://www.kentik.com/blog/case-study-how-immedion-maintains-always-on-secure-data-centers<![CDATA[Michelle Kincaid]]>Thu, 31 May 2018 15:36:18 GMT<img src="//images.ctfassets.net/6yom6slo28h2/2UoqWVfNTGEeOuMCQ6gWAA/9d0b7da535fedd2a2cba950baaa512cb/immedion.png" class="image right" style="max-width: 180px; padding-left: 20px; margin-bottom: 15px;" alt="Immedion" /> <p>Businesses of all sizes are increasingly reliant on cloud and data center providers that host their mission-critical applications and data. With bad actors constantly looking for opportunities to disrupt or penetrate those providers’ networks, real-time network visibility and threat alerting are imperative. That’s why cloud, data center, and managed services provider <a href="https://www.immedion.com/">Immedion</a> chose Kentik to help maintain high customer satisfaction and provide always-on, accessible, secure services.</p> <p>“As a growing data center provider, we implemented <a href="/platform/kentik-detect/">Kentik Detect</a>® on our seven sites in two days and automated DDoS detection and mitigation across all of our sites within a month,” said David Johnson, director of network engineering for Immedion.</p> <p><strong>With Kentik, Immedion gains:</strong></p> <ul> <li>Real-time visibility and security across seven facilities</li> <li>Commitment to service-level agreements (SLAs)</li> <li>Significant time and cost savings</li> </ul> <p>“Having worked with other DDoS solutions in my past, Kentik is a welcome addition to the flow analytics and DDoS attack protection solution options on the market,” added Johnson. “From our perspective, Kentik is unparalleled in the industry when it comes to performance, scalability, efficiency, and cost.”</p> <p><a href="/resources/case-study-immedion/">Read the full Immedion case study.</a></p> <p>Read our solution brief to learn more about <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a>.</p> <p><em>Ready to see the ROI for yourself? <a href="#demo_dialog">Schedule a demo</a> or sign up for a <a href="#signup_dialog">free trial</a>.</em></p><![CDATA[Network Capacity Planning 101: Requirements & Best Practices]]><![CDATA[In this post, we look at the best practices for an effective capacity planning solution that ensures optimal network performance and visibility.]]>https://www.kentik.com/blog/network-capacity-planning-101-best-practiceshttps://www.kentik.com/blog/network-capacity-planning-101-best-practices<![CDATA[Ken Osowski]]>Tue, 29 May 2018 04:00:00 GMT<p>Modern networks are made up of a collection of routers, switches, firewalls, and other network elements. 
While they are all configured to maintain the best possible network performance, availability and throughput — both inside the boundaries of the internal network and across links to other networks — operations teams leverage <a href="https://www.kentik.com/solutions/usecase/network-capacity-planning/" title="Network Capacity Planning Solutions from Kentik">network capacity planning</a> to identify potential shortcomings, misconfigurations, or other parameters that could affect a network’s availability or throughput within a forecasted timeframe. From a high-level perspective, network operators engage in network capacity planning to understand some key network metrics:</p> <ul> <li>Types of network traffic</li> <li>Capacity of current network infrastructure</li> <li>Bandwidth utilization at various points in the network</li> <li>Current network traffic volumes for both the internal network and connectivity to external networks</li> </ul> <div as="Promo"></div> <h2 id="how-capacity-planning-benefits-your-network-performance">How capacity planning benefits your network performance</h2> <p>By performing this type of network profiling, operators are able to understand the maximum capability of current resources and the impact of adding incremental new resources needed to serve future bandwidth demands. While capacity planning helps with the identification of new network infrastructure, it can also help to identify additional staff or resources that will manage and monitor the network.</p> <h2 id="a-few-network-capacity-planning-best-practices">A few network capacity planning best practices</h2> <ul> <li>Establish the degree to which sources, destinations, network devices, users, applications, protocols, and services are generating traffic. Accurately assess the network impact and traffic volumes of new services.</li> <li>Measure and analyze traffic metrics to establish performance and capacity baselines for future bandwidth consumption.</li> <li>Assure optimal application service delivery by identifying network bottlenecks before they impact service.</li> <li>Understand the impact of network bandwidth utilization from a user’s perspective, including application performance KPIs and SLAs.</li> <li>Identify the network performance profile of individual devices on network infrastructure.</li> <li>Understand and assess the limits of your load balancing equipment so that strained CPU/memory usage doesn’t increase latency or cause network downtime.</li> </ul> <h2 id="key-metrics-for-planning-network-capacity">Key metrics for planning network capacity</h2> <p>Network operations teams often set a baseline for network performance. A key question they ask is: What is expected of the network when it is working properly? (Operating a network is a complex endeavor and, regardless of how much planning takes place, problems do occur that lead to costly network downtime.) Network engineers determine optimal performance thresholds, which can then be utilized by network management tools.
Key metrics include:</p> <ul> <li><strong>Bandwidth</strong> — the maximum rate that information can be transferred, typically measured in bits/second</li> <li><strong>Throughput</strong> — the actual rate that information is transferred</li> <li><strong>Latency</strong> — the delay between the sender and the receiver</li> <li><strong>Jitter</strong> — the variation in packet delay seen at the destination</li> <li><strong>Packet loss</strong> — measured as a percentage of packets lost with respect to packets sent</li> <li><strong>Error rate</strong> — the number of corrupted bits expressed as a percentage or fraction of the total sent</li> </ul> <p>When a threshold for a key performance metric is reached, the icon for a network element within a capacity planning tool’s interface may display as red, and depending on the severity of the event, may also issue an alert.</p> <h2 id="key-capacity-planning-solution-requirements">Key capacity planning solution requirements</h2> <p>Key requirements for capacity planning solutions will vary to some degree based on the type of organization using them. For example, an enterprise, depending on its size, may need to understand WAN usage, ISP uplink capacity, east-west data center hotspots, and inter-data center adequacy.</p> <p>For ISPs, looking at network utilization is crucial to formulating business insights when connecting to other ISP networks. ISPs, regardless of their network interconnection type (e.g., transit, backbone, paid peering, or free peering), all need to assess their network utilization in order to forecast business impacts. To ensure network operators can readily take action on insights from capacity planning tools, key requirements to look for in a solution include:</p> <ul> <li>Built-in link utilization visualizations</li> <li>Ability to set the level or granularity of reporting on network elements</li> <li>Ability to create thresholds using customizable performance metric values</li> <li>Prediction of the time frames in which full utilization of network resources is forecast to occur</li> <li>Ability to automatically set alert thresholds based on trend data</li> <li>Dynamic reporting of performance metrics based on built-in or custom on-the-fly dimensions</li> <li>Ability to formulate a custom report and email it on a recurring basis</li> <li>Organization of key capacity planning metrics as a customizable dashboard interface</li> </ul> <h2 id="why-network-capacity-planning-is-essential-for-your-business">Why network capacity planning is essential for your business</h2> <p>The business value of network capacity planning cannot be overstated. Being able to quickly focus on network bottlenecks in a timely manner has a direct correlation to user satisfaction and optimizing network infrastructure expenditures. However, gathering data on networks can be an overwhelming task, and a network performance monitoring tool is necessary to keep track of all the factors that impact your network.</p> <p>Kentik has the functional breadth for capturing all the necessary network telemetry in a big data repository to isolate even the most obscure capacity-impacting network events — as they happen or as they are predicted to happen. Network visibility is key to facilitating capacity planning.
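</p> <p>To make the arithmetic behind these metrics concrete, here’s a toy sketch of the core link-utilization calculation and threshold check, using two interface octet-counter readings (the numbers are hypothetical):</p> <pre><code>def utilization_pct(octets_t0, octets_t1, interval_s, link_bps):
    """Average utilization between two counter readings, as a percentage."""
    bits = (octets_t1 - octets_t0) * 8       # ignores counter wrap for brevity
    return 100.0 * bits / (interval_s * link_bps)

LINK_BPS = 10_000_000_000                    # a 10 Gbps interface
WARN_PCT = 80.0                              # alert threshold from the capacity plan

util = utilization_pct(octets_t0=1_250_000_000_000,
                       octets_t1=1_568_750_000_000,
                       interval_s=300,       # 5-minute polling interval
                       link_bps=LINK_BPS)
if util > WARN_PCT:
    print(f"link at {util:.1f}% -- above the {WARN_PCT:.0f}% planning threshold")
</code></pre> <p>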
To see how Kentik’s capacity analytics can help your organization analyze, monitor, and adjust to network capacity patterns, read <a href="https://www.kentik.com/blog/improving-legacy-approaches-to-capacity-management/" title="Improving Legacy Approaches to Capacity Management">our blog post on improving capacity management</a>, <a href="#demo_dialog">request a demo</a>, or sign up for a <a href="#signup_dialog">free trial</a> today.</p> <p>Learn more about using Kentik for <a href="https://www.kentik.com/solutions/usecase/network-capacity-planning/" title="Kentik: Network Capacity Planning use cases.">network capacity planning use cases here</a>.</p><![CDATA[RSA 2018: A Parade Watcher’s View]]><![CDATA[Kentik's VP of Channels Jim Frey takes a look back at RSA 2018. After walking the show floor, in this post he highlights how the event has grown over the years as the security vendor landscape has gotten more complex and bloated. He also explains why good technology alone is not the answer.]]>https://www.kentik.com/blog/rsa-2018-a-parade-watchers-viewhttps://www.kentik.com/blog/rsa-2018-a-parade-watchers-view<![CDATA[Jim Frey]]>Thu, 24 May 2018 15:21:28 GMT<img src="//images.ctfassets.net/6yom6slo28h2/6i4m7CyiSkQycyimUaSysM/419c2f96c2b61a7fb0e0bf49c69f0743/rsaconference-768x400.jpg" class="image right" style="max-width: 400px" alt="RSA Conference" /> <p>It’s been a few weeks now since the RSA conference happened in San Francisco, and I needed that time to fully digest the onslaught of information. As a long-time standee on the sidelines of cybersecurity, watching the sector grow unabated, lurching from issue to issue, always expanding and rarely, if ever, consolidating, I have to marvel at the sheer circus of it all. My area of expertise is network management and monitoring, which commonly utilizes the very same technology elements that underlie advanced security monitoring. But unlike network management, the security solutions space has seen oodles of new startups enter over the past few years, adding to a growing clamor at RSA. How is the average security pro ever going to keep up?</p> <p><strong>Yes, the crooks keep winning, but…</strong></p> <p>After walking the expo floor at RSA, my conclusion is that the vendors are the ones who are really taking it to the bank. Just like the California Gold Rush of the 19th century, the ones who will get rich are the merchants – the suppliers of security technologies and services. In the final tally, the amount of money spent on all of the tools and products will outstrip the losses incurred from successful attacks by the vast majority of organizations.</p> <p><strong>The security technology vendor landscape is beyond bloated.</strong></p> <p>After walking the RSA expo floor and navigating the dizzying array of offerings being hawked, I had the opportunity to sit in on a market landscape briefing given by 451 Research. Clearly, the venture community is betting on the merchants. New investments are pouring into companies at a rapid clip, on the order of $1 billion per quarter. And while enterprise security is the most active area of tech M&#x26;A for three years running, with 140+ acquisitions expected this year, there simply aren’t enough buyers to handle the supply. But beyond that sobering assessment of the vendors, the 451 team had some sage thoughts to share.
A few that I found most timely and compelling:</p> <ul> <li>“Every security product will be an analytics product.”</li> <li>“Cybersecurity is the killer app for big data analytics.”</li> <li>“Security automation will move closer to the mainstream.”</li> </ul> <p>Put these together, and big data analytics that can contribute to automation would be a really good combination.</p> <p><strong>Good technology alone is not the answer.</strong></p> <p>Let’s face it. We have a number of big societal and economic weaknesses that make cybercrime pay. Bad systems design, poor coding practices, insufficient testing, too much trust, too little technical security proficiency, and a financial sector that gaily throws around credit like so many flower petals at a wedding all add up to a virtual paradise for the bad guys. The banks not only expect fraud, they plan on it. They are willing to take massive losses to crime and they still turn a healthy profit. Think about that for a minute – failure is actually being embraced and tolerated. In my opinion, financial incentives (the carrot) and CxO penalties (the stick) are both helpful at getting the IT sector to take security seriously, but we also just need to take frank stock and do some growing up. We need to add discipline and good engineering practices to building and deploying IT infrastructure and applications. We need to say no to rushing new products and services to market if they have not been developed in ways that are both stable and secure. The 451 team is finally seeing organizations take this seriously, adding secure coding practices, and giving rise to DevSecOps. They also noted a marked rise in the business for MSSPs (managed security services providers), who can augment staff with expertise and coverage. So the cavalry is coming – it’s about time!</p> <p><strong>Takeaways</strong></p> <p>One thing is for certain – this juggernaut will not slow anytime soon. It will keep many good people employed, and massive resources (in currency and headcount) will be thrown at the many, very real challenges. But many resources will also be poorly spent, with little net effect or improved security posture. After nearly 30 years standing on the sidelines, watching this parade, I have a few recommendations to make:</p> <ul> <li>First, unless you are a large organization or a service provider, get help. There is no way that your security team can keep up with the multi-headed hydrae that are enterprise security technology and the realities of today’s threat landscape.</li> <li>Next, when going to RSA, you’ll probably want to spend your quality time in the big booths of the large established vendors. Sad to say, but many of the little booths around the edges are filled with startups that, as likely as not, won’t be there next year, or they will have become a pod in one of those bigger booths.</li> <li>Finally, take advantage of the investments you have already made. Don’t overlook the complementary value of your monitoring tools and technologies, many of which can provide security-specific insights at no additional cost to you. And given 451’s take, if those tools happen to be big data analytics with the ability to automate response, you’re in an even better position.</li> </ul><![CDATA[NetOps & SecOps Collaboration: Shared Tools are Essential]]><![CDATA[Network performance and network security are increasingly becoming two sides of the same coin. 
Consequently, enterprise network operations teams are stepping up collaboration with their counterparts in the security group. In this post, EMA analyst Shamus McGillicuddy outlines these efforts based on his latest research.]]>https://www.kentik.com/blog/netops-secops-collaboration-shared-tools-are-essentialhttps://www.kentik.com/blog/netops-secops-collaboration-shared-tools-are-essential<![CDATA[Shamus McGillicuddy]]>Tue, 22 May 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/6I6zF8Zfpeogi66KIOsG8M/83d84e61ac49c32c4619e8adc84bd0b0/Knoweldge-Sharing.jpg" class="image right" style="max-width: 400px;" alt="" /> <p>Network performance and network security are increasingly becoming two sides of the same coin. Consequently, enterprise network operations teams are stepping up their collaboration with their counterparts in the security group. Forty-two percent of network managers are collaborating with security teams more than they have in previous years, according to Enterprise Management Associates’ (EMA) new research, “<a href="https://www.enterprisemanagement.com/research/asset.php/3599/Network-Management-Megatrends-2018:-Exploring-NetSecOps-Convergence,-Network-Automation,-and-Cloud-Networking">Network Management Megatrends 2018: Exploring NetSecOps Convergence, Network Automation, and Cloud Networking</a>.”</p> <p>This collaboration is increasing for a variety of reasons, not least of which is the fact that security incidents are the second most common root cause of complex IT service problems, trailing only network infrastructure issues. Security systems, such as firewalls blocking legitimate traffic, are the fourth most common cause of complex IT service issues. Furthermore, 35 percent of network managers say that the reduction of security risk has become a more important measure of network operations’ success in recent years.</p> <p><strong>Shared Tools &#x26; Shared Data Sets are Crucial to NetOps &#x26; SecOps Collaboration</strong></p> <p>EMA’s Megatrends research identified how this collaboration between network and security teams plays out inside the enterprise. Shared tools and data are clearly essential. Forty percent of enterprises claim to have fully converged network and security operations with shared tools and processes. Another 35 percent maintain separate groups but have integrated their tools for collaboration. A few (16 percent) have separate teams with some shared tools for collaboration.</p> <p>Collaboration isn’t just about incident detection and incident response. These enterprises say the most critical point of collaboration for networking and security is infrastructure design and deployment (38 percent of survey respondents). Event monitoring (31 percent) and incident response (27 percent) are secondary priorities.</p> <p><strong>Collaboration Requisites: Network Performance Management &#x26; Advanced Network Analytics</strong></p> <p>Given that network operations and security operations are either sharing or integrating tools, EMA asked enterprises to identify the most important tools for this collaboration. Network performance monitoring (33 percent) and advanced network analytics (32 percent) are the most important tools in the network manager’s toolset. Security incidents are a common root cause of complex IT service problems, so a performance monitoring solution can serve as an early warning system.
It can also help network managers identify anomalies that could support the security team’s investigation.</p> <p>Advanced network analytics solutions apply heuristics like pattern recognition, anomaly detection, and event correlation to multiple sets of network data, identifying hidden patterns to illuminate threats and breaches.</p> <p>Network managers also identified several technologies from the security toolset that support collaboration. They include security analytics (31 percent), security incident and event management (24 percent), threat intelligence feeds (23 percent), and DDoS detection and prevention (21 percent).</p> <p><strong>Roadblocks to Collaboration</strong></p> <p>The convergence of network and security operations isn’t easy. The two groups have very different philosophies and cultures, and they typically dislike each other. EMA’s research identified several barriers to success.</p> <p>The most common challenge to network and security collaboration, from the network team’s perspective, is a lack of defined processes and practices (29 percent). Management and monitoring tools are an important foundation for best practices and policies. Network managers should look for tools that help them map, document, and communicate critical services. They can share this with the security team, helping that group to understand what’s important to the network team, how things work, and how the team typically responds to events. Network teams should document everything, including incident response, and ask the security team to define their own processes. Also, the network and security teams should ask the IT service management group if they can assist with any of this. They may have tools that can help.</p> <p>The second leading challenge (26 percent) is the fact that network and security teams have different goals. The network team’s mission is to connect people to applications and services. The security team’s mission is to lock things down, limit access, and protect assets. This is a leadership problem and a cultural problem. EMA recommends that IT leadership step up and show how these two groups can pull in the same direction.</p> <p>The third most common problem (24 percent) these groups encounter when collaborating is a lack of shared data that is consistent, relevant, and current. This is partly a tooling problem. Network and security teams should look for ways to integrate their tools and data sets well. Even better, they should share certain tools across groups, which will give them a single view of infrastructure and facilitate best practices and processes for collaboration.</p> <p><strong>The Benefits of Collaboration</strong></p> <p>EMA research identified three top drivers for network and security operations collaboration. First, enterprises see it as an opportunity to reduce operational expenses (38 percent). Converged teams will retire redundant tools and streamline workflows, which should boost IT productivity.</p> <p>Second, enterprises see an opportunity for risk reduction (37 percent). With the two teams working together, especially on infrastructure design and deployment, the network should become inherently more secure.</p> <p>Finally, enterprises expect more efficient workflows, reducing mean time to insight and remediation (34 percent). 
When incidents do occur, network and security operations will be better aligned to respond.</p> <p>Given these potential benefits, EMA recommends that enterprises ask their network operations and security tool vendors — especially network analytics and network performance monitoring vendors — if they can help with network and security operations collaboration. These vendors should be able to support integration and data sharing. Additionally, network teams should start establishing and documenting best practices and processes for collaboration with the security group. They should look for opportunities to strike a balance between the different goals of networking and security groups. If the two groups can’t solve this issue themselves, they can look to the IT executive suite for support.</p><![CDATA[The Impact of 5G on Enterprise Network Monitoring]]><![CDATA[5G is marching towards commercialization. In this post, we look at the benefits and discuss why network monitoring for performance and security is even more crucial to the operation of hybrid enterprise networks that incorporate 5G network segments.]]>https://www.kentik.com/blog/the-impact-of-5g-on-enterprise-network-monitoringhttps://www.kentik.com/blog/the-impact-of-5g-on-enterprise-network-monitoring<![CDATA[Ken Osowski]]>Thu, 17 May 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/1BasyXiSpGuySwoO2s6Cca/83b0ee889642c934f36a94109b391b6f/5b0f4a6a6a732_576670-5g-300x169.jpg" class="image right" style="max-width: 300px" alt="" /> <p>The fifth generation of wireless technology, more commonly referred to as 5G, is marching towards commercialization. Performance is a key driver behind the movement, with 5G enabling:</p> <ul> <li><strong>Greater throughput</strong> — mobile broadband with bandwidth ranging from 50 Mbps to 10 Gbps.</li> <li><strong>Lower latency</strong> — ultra-low latency down to 1 millisecond.</li> <li><strong>More connection density</strong> — 1 million connected mobile devices in less than half of a square mile, compared to around 2,000 with 4G cellular networks.</li> </ul> <p>With the promise of greater performance, network service providers are actively stepping up their testing and trials. Standards organizations, including the 3rd Generation Partnership Project (3GPP) and the Internet Engineering Task Force (IETF), are also stepping in to codify 5G specifications. As a result of the buzz, many businesses large and small are in the process of evaluating what 5G will do for them.</p> <p>To offer just one example, the scalable connection density and low-latency response from 5G will play a key role in driving IoT device monitoring and control across industries. End users of these devices will experience throughput and responsiveness only previously available on fixed networks. Other applications set to benefit from this kind of wireless network performance and scalability include enhanced mobile broadband, enterprise communications, autonomous vehicles, telemedicine, augmented reality (AR), connected homes, and industrial automation.</p> <p><strong>5G and SDN Capabilities Intersect</strong></p> <p>Many enterprises are replacing their fixed MPLS wide area networks (WANs) with virtualized software-defined WANs (SD-WANs) to reduce costs.
Alongside that shift, 5G offers sufficient capacity and reliability to provide an additional path to support an SD-WAN, providing greater coverage, particularly for remote sites.</p> <p>Support for services such as SD-WAN can be segregated from other enterprise traffic using 5G network slicing — the term standards organizations coined for the ability to dynamically provision service-specific data pipes in 5G networks — in order to cost-effectively reach areas where fixed-line networks are not economical. In these cases, remote enterprise sites may not need the highest bandwidth but require a deterministic quality of service.</p> <p>5G service providers can make this possible by offering customized data pipes that support a broad profile of application usage with network slicing as the basis for “5G cloud services” including SD-WAN traffic. For example, a service provider with this kind of application performance diversity would be able to have a common connection across two different enterprise customers — with one customer running video collaboration applications and the other customer needing to meet low latency demands for autonomous cars.</p> <p>Network-slicing technology has the ability to create custom levels of network performance and latency commitments and will be inherent to 5G networks.</p> <p><strong>Network Monitoring Needs</strong></p> <p>Enterprises also need to consider the risks when incorporating 5G networks as part of their hybrid network schemes, as well as the impacts this can have on performance monitoring. Some of these considerations include:</p> <ul> <li><strong>Monitoring</strong> — 5G can potentially enable many more mobile endpoints, requiring network monitoring tools that can quickly sift through vast amounts of detail.</li> <li><strong>Performance</strong> — Wireless networks have historically been seen as a WAN backup when the primary circuit goes down. With 5G, wireless WANs could serve as a primary connectivity method, so measuring performance across these connections will become critical for enterprises to ensure effective integration of 5G networks.</li> <li><strong>Security</strong> — 5G will introduce new security concerns. Most SD-WAN solutions are encrypted SSL or IPsec tunnels, but new attack vectors may present themselves with data going over the air instead of over wires.</li> <li><strong>Visibility</strong> — Historically, enterprise networks had a router/firewall connected to the WAN as a gateway. Will a BYOD mentality take hold that sidesteps this convention? Monitoring traffic flows could be a real challenge if the packets don’t always flow through the WAN gateway.</li> </ul> <p>To address these issues, network monitoring for performance and security is even more crucial to the operation of hybrid enterprise networks that incorporate 5G network segments.</p> <p>A big data, SaaS approach is ideal for unifying network data at scale and providing the compute power to monitor hybrid enterprise networks. If you’re interested in learning more about how Kentik’s big data approach is used in hybrid network environments that include 5G networks, <a href="#demo_dialog">schedule a demo</a>.
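</p> <p>As a small illustration of the performance point above, the sketch below (standard library only, target host hypothetical) samples TCP connect times across a path and derives average latency and jitter, the kind of measurement that matters once a 5G segment becomes a primary WAN link:</p> <pre><code>import socket
import time

def sample_rtts(host, port=443, samples=10):
    """Measure TCP connect time repeatedly; a crude stand-in for a real active probe."""
    rtts = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        rtts.append((time.perf_counter() - start) * 1000)   # milliseconds
        time.sleep(0.2)
    return rtts

rtts = sample_rtts("remote-site.example.com")   # hypothetical remote site
avg = sum(rtts) / len(rtts)
jitter = sum(abs(a - b) for a, b in zip(rtts, rtts[1:])) / (len(rtts) - 1)
print(f"avg latency {avg:.1f} ms, jitter {jitter:.1f} ms over {len(rtts)} samples")
</code></pre> <p>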
If you already know that you want to implement a much more cost-effective network performance and security monitoring solution for your enterprise network, start a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Machine Learning and AI: The Superhero Solution for IT Operations]]><![CDATA[At that ONUG Spring 2018 event, ACG analyst Stephen Collins moderated a panel discussion on re-tooling IT operations with machine learning and AI. The panelists provided a view “from the trenches." In this post, Collins shares insights into how panelists' organizations are applying ML and AI today, each in different operational domains, but with a common theme of overcoming the challenge of managing operations at scale.]]>https://www.kentik.com/blog/machine-learning-and-ai-the-superhero-solution-for-it-operationshttps://www.kentik.com/blog/machine-learning-and-ai-the-superhero-solution-for-it-operations<![CDATA[Stephen Collins]]>Tue, 15 May 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/3sJp3CoYC4iWgiAiW8Qe8c/6cf23775b3be2db5ac9af49342c473b3/Iron-Man-300x169.jpg" class="image right" style="max-width: 340px" alt="Iron Man" /> <p>At last week’s ONUG Spring 2018 event in San Francisco, I moderated a panel discussion on re-tooling IT operations with machine learning (ML) and AI. The panelists provided a view “from the trenches,” sharing insights into how their organizations are applying ML and AI today, each in different operational domains, but with a common theme of overcoming the challenge of managing operations at scale.</p> <p><strong>The Panelists</strong></p> <ul> <li><strong>Harmen Van Der Linde, Global Head of CitiManagement Tools at Citigroup:</strong> His organization is responsible for the delivery of infrastructure software deployment automation and monitoring solutions. This involves managing highly dynamic, cloud-scale infrastructure in which things happen too fast for human operators to keep up as they are continuously flooded with alerts and alarms. Harmen’s team has applied statistical analysis of time series data to model network behavior and is using linear regression to analyze trends and predict future behavior.</li> <li><strong>Keith Shinn, SVP of Service Experience and Insights at Fidelity Investments:</strong> He manages a global team that is applying data-driven user experience principles to service delivery. Keith’s team wanted to improve business processes and better serve customers at Fidelity’s nationwide investor centers. They had a large-scale event correlation infrastructure in place but switched from polling to streaming, supported by a Kafka data pipeline. The team then used machine learning algorithms to analyze time series data and generate insights relevant to Fidelity’s business objectives.</li> <li><strong>Bryan Larish, Director of Technology at Verizon:</strong> He leads a team tasked with the simple goal of using ML and AI to make Verizon’s network run even better. Easier said than done, given that Verizon’s mobile and fixed-line networks are among the largest in the world! The Verizon team started with statistical analysis of time series data that resulted in operational improvements compared to existing methods. ML algorithms were able to derive useful correlations from the massive amount of KPI metrics collected from the network. 
It is notable that Verizon is also starting to use neural networks based on deep learning to optimize <a href="https://www.kentik.com/blog/the-gartner-market-guide-for-network-performance-monitoring-and-diagnostics/">network performance monitoring</a>, leveraging that technology’s pattern-matching capabilities.</li> </ul> <p><strong>Common Themes from the Panel</strong></p> <p>All three panelists stressed the need to have a clear understanding of your organization’s business objectives because these will determine how you source and curate the data that will be collected and analyzed. They all recommend starting with the low-hanging fruit — statistical analysis of time series data — which can yield immediate operational efficiencies.</p> <p>With a wealth of ML and AI technology in the public domain, it came as no surprise that each organization relied heavily on open source software across the entire ML and AI software stack. However, while open source tools are freely available, the people who know how to use these tools are generally not. Therefore, each organization had to hire additional staff with relevant expertise in ML and AI, and the message was to be prepared to make a similar investment.</p> <p>Because ML and AI eliminate time-consuming, labor-intensive tasks, there is legitimate concern that they will lead to the elimination of jobs. However, Verizon’s Bryan Larish offered a different take. He used an analogy inspired by the Marvel Comics character Tony Stark, who is transformed into a superhero with special powers while wearing his Iron Man suit. Verizon intends to make its network run better by augmenting the capabilities of its operations teams with ML and AI, transforming ordinary operators into a legion of extraordinary Iron Men. Hard to argue with that!</p><![CDATA[PhoenixNAP: How Network Visibility Enhances Security of Multi-Tenant Environments]]><![CDATA[Modern enterprise networks are becoming more dynamic and complex, which poses significant challenges for today’s IT leaders. In this post, data center and IT service provider phoenixNAP discusses how Kentik Detect helps overcome network visibility challenges.]]>https://www.kentik.com/blog/phoenixnap-how-network-visibility-enhances-security-of-multi-tenant-environmentshttps://www.kentik.com/blog/phoenixnap-how-network-visibility-enhances-security-of-multi-tenant-environments<![CDATA[Adrian Montebello]]>Fri, 11 May 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/19hG5JNPuow4Q48Amwa2Cu/5f56f0bb23024077fa9ec1c2c1ba4013/phoenixnap-500x101.png" class="image right no-shadow" style="max-width: 250px" alt="phoenixnap" /> <p>Modern enterprise networks are becoming more dynamic and complex, which poses significant challenges for today’s information technology leaders.</p> <p>As more workloads move to the cloud and access networks become more open, ensuring security and stability becomes a more demanding task. The attack surface expands, creating more gaps for hackers to exploit potential vulnerabilities. At the same time, maintaining network health is more important than ever for today’s organizations, with digital operations now the backbone of successful business operations.</p> <p>To improve the security of their networks, enterprises need to have comprehensive insight into all network activity.
By implementing pervasive network visibility systems, they can mitigate risk and improve overall data security.</p> <h4 id="the-anatomy-of-a-multi-tenant-network">The Anatomy of a Multi-Tenant Network</h4> <p>Hybrid, cross-cloud infrastructures now in common use present a new, challenging facet for managing network security. Across businesses of all sizes, many workloads now run on public cloud or virtualized environments with a third-party provider.</p> <p>In these environments, multiple tenants share the same infrastructure and network resources, requiring new, advanced methods for managing network security. Compared to dedicated environments, multi-tenant infrastructures carry a higher risk of breach or failure, which makes enhanced threat detection and monitoring a critical security function. Without these in place, the entire infrastructure is at risk, and even a minor issue can affect all clients that share common infrastructure.</p> <p>Infrastructure providers, therefore, need to be able to extract detailed network activity data and provide timely insight to each tenant. Identifying who moves what through the network is difficult under normal circumstances, and the petabytes of data transferred within multi-tenant networks make the task even more difficult. This can result in lengthy “dwell times,” i.e., the time between a compromise and its identification, which can have a severe impact on a tenant’s business.</p> <p>If the network/infrastructure vendor has their network instrumented for detailed traffic visibility, they will be in a much better position to provide accurate information in real time. Both the vendor and the client business have more resources to help them mitigate the risk, speed incident response, and keep their infrastructure secure.</p> <h4 id="why-network-visibility-matters">Why Network Visibility Matters</h4> <p>The advancement of security and network technologies has taken network visibility to an entirely new level. Cutting-edge systems can provide a comprehensive overview of all network activity to create a foundation for the defense against different types of threats.</p> <p>By gaining real-time visibility into their networks, organizations can detect any unusual activities before they create damage or financial loss. Moreover, they can provide a timely response in case of a breach, as well as gather relevant data for post-incident forensics. More specifically, implementing network visibility solutions and systems can help on three significant levels.</p> <p><strong>1.  Proactive detection of abuse or potentially malicious activity</strong> Proactive detection is an essential step toward securing network traffic. Rather than extracting intelligence from incident reports after the fact, proactive monitoring continuously gathers data that can help prevent the incident in the first place. Tools designed for proactive monitoring create baselines of network traffic and automatically alert on deviations from historical activity, making it easier to detect different types of malicious behavior.</p> <p>With this type of intelligence, network engineers can prevent a greater number of malicious intrusions and breaches. Unusual network activities are reported in real time and are easier to stop from making an impact. Moreover, this kind of intelligence minimizes the impact in the case of an actual breach.</p> <p><strong>2.  Real-time investigation (incident response)</strong> If a breach or equipment failure occurs, real-time investigation is critical to minimizing downtime.
However, with network attacks becoming rapidly more sophisticated, fast response time is becoming harder to achieve without a comprehensive network visibility solution.</p> <p>The problem with real-time investigation is the fact that some complex attacks, such as blended DDoS or highly specialized malware, may not be easy to detect. In these cases, the time span between the initial compromise and detection is prolonged, resulting not only in a poor user experience but also in long-term damage to the company. Even though there has been an overall improvement in companies’ ability to minimize dwell time over the last couple of years, an alarming number of organizations still fail to detect a breach in a reasonable timeframe.</p> <p>According to this year’s SANS Incident Response report, the ability of organizations to identify and respond to a breach within 24 hours increased significantly last year – from 40% of respondents in 2016 to 50% in 2017. On the other hand, about 15% of organizations surveyed need months to identify a breach, while about 5% take more than seven months.</p> <p>The extent to which this can be damaging is perhaps best illustrated by some recent newsworthy breaches. In last year’s Equifax breach, it took months for the company to identify the compromise. The result was almost 143 million affected accounts. More recently, security analysts discovered that personal information of 198 million US voters had been publicly accessible for ten years via an Amazon S3 instance belonging to the data firm Deep Root Analytics.</p> <p>These examples both indicate that companies are still taking too long to identify vulnerabilities and protect their data. While the SANS report figures show a significant improvement in the way companies handle potential security breaches, these real-world examples warn of the dangers of not being able to do so.</p> <p><strong>3.  Post-incident network traffic forensics</strong> To provide an effective post-incident response, network engineering teams need access to detailed, relevant data from across the entire network. They need to be able to do a granular investigation that encompasses more than simply identifying the target resources affected by an attack. They need information on any unusual activity that may have preceded the incident in order to identify entry points, lateral movement, and to inform potential solutions for improved security posture going forward.</p> <h4 id="a-case-for-intelligent-network-visibility">A Case for Intelligent Network Visibility</h4> <p>The complexity of network monitoring grows in parallel with the network itself. The bigger the network, the more difficult it becomes to track all activities on it. At phoenixNAP, we faced a similar issue as we opened new data centers worldwide and as our customer base grew.</p> <p>As a provider of custom bare metal, cloud, colocation, and advanced managed services solutions, phoenixNAP is focused on ensuring the security and reliability of its network. Our global network expands at a rapid pace, which is why we needed a comprehensive network monitoring tool.</p> <p>Kentik enabled us to more effectively monitor and manage our global network and provide our clients with a higher degree of stability. By moving from an appliance-based tool to Kentik’s scalable network traffic intelligence platform, Kentik Detect®, we enhanced our ability to detect deviant network behavior and provide more comprehensive incident response.
Below are some of the practical ways in which Kentik enabled more effective network monitoring and analysis at phoenixNAP:</p> <p><strong>1.  Traffic engineering and load balancing across locations and ISPs</strong> PhoenixNAP uses Kentik to analyze traffic patterns for its clients. A number of factors such as destination ASNs, AS paths, next hop and outgoing location, and, finally, ISP are taken into consideration to plan and execute traffic path changes and engineering. This allows us to scale bandwidth and improve network performance characteristics like latency and loss as our clients’ businesses grow.</p> <p><strong>2.  Proactive alerting of traffic dips and potential outages</strong> PhoenixNAP uses Kentik to track traffic usage per client. Kentik keeps a historical baseline for each client’s traffic, which we use both for historical traffic comparison and for automatic detection of any loss or disruption in traffic. Alerts from Kentik automatically open tickets with our engineering teams to investigate any incidents and provide an early warning for network-wide problems like path changes or other issues. This enables us to be proactive and start remediating immediately while also providing timely notifications to our clients about ongoing situations.</p> <p><strong>3.  Real-time interface capacity alerts</strong> Kentik also monitors router and switch ports for utilization. Alarms are generated when port utilization crosses a threshold that indicates imminent congestion and customer impact. This gives the engineering teams valuable information and provides lead time to augment infrastructure capacity to support our continuous growth. PhoenixNAP executive teams also rely on data from Kentik to determine drivers of growth and inform decision making about targeted network upgrades.</p> <p><strong>4.  Integration with multiple communication channels</strong> Out of the box, Kentik integrates with email systems, Slack, PagerDuty, and other logging systems. This gives technical teams the ability to receive notifications on a variety of channels rather than having to dig through a complex UI to get details about ongoing incidents. It also increases collaboration among teams and avoids the need to constantly relay incident status information between the NOC and the engineering teams.</p> <p><strong>5.  Helping clients with compromised hosts</strong> The recent memcached amplification attack method resulted in a record-breaking 1.7 Tbps DDoS attack. Using Kentik, phoenixNAP set up alert policies and automated notifications to provide early warning if client systems became involved in attack activity. If any malicious activity of this kind was detected, the traffic would first be filtered to prevent impact to the network. Afterward, the affected client would be notified so that the affected systems could be patched. Kentik helped us improve overall security, preserve valuable bandwidth, and keep the network highly available for all phoenixNAP clients. For more details on how we used Kentik’s solutions to better serve our clients, read <strong><a href="https://www.kentik.com/resources/case-study-phoenixnap/">our case study</a></strong>.
You can also hear more about our Kentik use cases in this <strong><a href="/resources/webinar-how-network-visibility-enhances-security-of-multi-tenant-environments/">on-demand joint webinar</a>.</strong></p><![CDATA[AWS Route 53 BGP Hijack: What Kentik Saw]]><![CDATA[News broke last week that attackers attempted to steal cryptocurrencies from users of MyEtherWallet.com by using a BGP route hijack attack. Numerous Kentik Detect customers saw changes in their traffic patterns, allowing them to detect this attack. In this post, we look at how the attack worked and the visibility that Kentik Detect provided our customers.]]>https://www.kentik.com/blog/aws-route-53-bgp-hijack-what-kentik-sawhttps://www.kentik.com/blog/aws-route-53-bgp-hijack-what-kentik-saw<![CDATA[Justin Ryburn]]>Wed, 02 May 2018 04:00:00 GMT<h3 id="using-kentik-detect-to-analyze-and-respond-to-bgp-issues"><em>Using Kentik Detect to analyze and respond to BGP issues</em></h3> <p>Last week, news broke that attackers attempted to <a href="https://www.forbes.com/sites/thomasbrewster/2018/04/24/a-160000-ether-theft-just-exploited-a-massive-blind-spot-in-internet-security/#756394f5e26b">steal cryptocurrencies from users of MyEtherWallet.com by using a BGP route hijack attack</a>. Numerous Kentik Detect® customers saw changes in their traffic patterns that allowed them to detect this attack. Let’s take a look at how this works.</p> <h4 id="what-is-bgp-route-hijacking">What is BGP Route Hijacking?</h4> <p>In simple terms, Border Gateway Protocol (BGP) is the protocol that routes traffic on the Internet. Each BGP-speaking organization is assigned an Autonomous System Number (ASN) that identifies it on the Internet. It can then announce the routes (groups of IP addresses) that it owns from its ASN. During a <a href="https://www.kentik.com/kentipedia/bgp-hijacking/" title="Kentipedia: BGP Hijacking, Understanding Threats to Internet Routing">BGP route hijack</a>, an attacker advertises IP prefixes from an ASN that is not the normal originator. This causes legitimate traffic to those IPs to be redirected to the attacker. During last week’s attack, the attacker was redirecting traffic that belonged to Amazon’s Route 53 DNS servers. Below is a diagram of how this attack worked:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6gt1hXeDlKWcQU8Cgo8y4o/aa5916d9f4b0504bf8bccb5086f512e2/aws-attack.png" class="image center" style="max-width: 500px" alt="AWS attack" /> <p>Once the attackers have the DNS requests coming to them, they can respond with the IP address of a server they control. In this case, requests for MyEtherWallet.com were answered with the IP address of a fake server in Russia. Unsuspecting MyEtherWallet.com users then entered their usernames and passwords on the fake site, and the attackers used them to access those accounts on the real MyEtherWallet.com website. This type of attack is often referred to as a Man-in-the-Middle (MITM) attack.</p> <h4 id="what-kentik-saw">What Kentik Saw</h4> <p>According to the Cloudflare blog, the BGP hijack took place from 11:05 UTC to 12:55 UTC, so we will focus on that time range.
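</p> <p>In routing terms, a hijack like this surfaces as a prefix suddenly being originated by an unexpected ASN. As a minimal illustration of that check (a sketch, not a Kentik feature, and with illustrative prefix-to-origin mappings), consider:</p> <pre><code># Expected origin ASN for monitored prefixes (illustrative values).
EXPECTED_ORIGIN = {
    "205.251.192.0/24": 16509,   # Amazon
    "205.251.193.0/24": 16509,
}

def check_announcement(prefix, origin_asn):
    """Flag BGP announcements whose origin ASN differs from the expected one."""
    expected = EXPECTED_ORIGIN.get(prefix)
    if expected is not None and origin_asn != expected:
        print(f"Possible hijack: {prefix} originated by AS{origin_asn}, "
              f"expected AS{expected}")

# An announcement like the April 2018 event would surface as:
check_announcement("205.251.193.0/24", 10297)
</code></pre> <p>Feeding a check like this from a live BGP feed is exactly the kind of monitoring that turns a stealthy hijack into an immediate alert. With that context, let’s look at the traffic data.</p> <p>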
First, we filtered traffic just to the hijacked subnets:</p> <img src="//images.ctfassets.net/6yom6slo28h2/BYJTrqs0FM8ya880uQWaw/e8be5573dfbfef6694f1e5a148ad686c/filtering600w.png" class="image center" style="max-width: 300px" alt="Filtering" /> <p>And then we looked at the packets per second (PPS) by destination AS number:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2n0pCTKgxC6OOKwGEG2qsw/1bac29bd05ba890f7370d55a625de9b7/pps-dest-as1100w.png" class="image center" style="max-width: 800px" alt="Packets per second" /> <p>We can see that there is a major spike to eNet, Inc. (AS10297), a small ISP in Ohio that the attackers used to propagate this attack. There is definitely a dip in traffic to the legitimate destination on Amazon (AS16509), but the traffic did not completely go away, likely because not all regions accepted the hijacked routes from AS10297. Drilling down further, we can look at just the traffic that was hijacked. This makes it very easy to see the timeframe for the hijack as well as the distribution of traffic across the various hijacked IP prefixes.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7sNHEO26WWCMMsSqykQ6g2/ab3f0263cad9728ff1beeb8a0d14f08e/drill-down1100w.png" class="image center" style="max-width: 800px" alt="Drill-down" /> <p>To get an even better idea of the distribution of the traffic across the various hijacked prefixes, let’s take a look at a Sankey diagram. Here we have the DNS clients sending their DNS queries to the hijacked blocks advertised by AS10297. We can see that most of the traffic is UDP port 53 DNS traffic, but there is a little bit that is TCP port 53. By looking at the thickness of the bands, you can see how much traffic is taking each path.</p> <img src="//images.ctfassets.net/6yom6slo28h2/wbzTKuemOsU6C0WSaUkU8/acef678f8b25f8bdecff1257e6612165/sankey-paths-1100w.png" class="image center" style="max-width: 900px" alt="Sankey Diagram" /> <p>For more information on Sankey diagrams, check out our blog: <a href="https://www.kentik.com/insight-delivered-the-power-of-sankey-diagrams/">Insight Delivered: The Power of Sankey Diagrams</a>.</p> <h4 id="summary">Summary</h4> <p>Although not always as high profile as this event, BGP hijacking attacks <a href="https://arstechnica.com/information-technology/2017/12/suspicious-event-routes-traffic-for-big-name-sites-through-russia/">happen regularly</a>. In this case, the attackers stole cryptocurrencies. In other cases, the damage to digital enterprises has been devastating. To see how you can get this type of visibility in your own network, <a href="#demo_dialog">schedule a demo</a> or sign up today for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[DDoS Protection in the Wild Wild West]]><![CDATA[The Internet is the wild wild west -- and the pace of DDoS attacks is not letting up. But thanks to recent advances in streaming telemetry, network visibility, and Big Data, the good guys are armed with the weapons they need to maintain the peace.]]>https://www.kentik.com/blog/ddos-protection-in-the-wild-wild-westhttps://www.kentik.com/blog/ddos-protection-in-the-wild-wild-west<![CDATA[Stephen Collins]]>Tue, 01 May 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/67z2CCMUxyMy8uCIEA0eEi/79b45b425f9d7738e693e1de5f422c3b/560352e7b713e.png" class="image right" style="max-width: 400px" alt="DDoS Protection" /> <p>Distributed denial-of-service (DDoS) attacks are the Achilles heel of Internet-centric enterprise IT.
Businesses that run websites, host applications in the cloud, or rely on cloud-based services are not only vulnerable to direct attacks on critical infrastructure, but can also take collateral damage as a result of attacks on other targets. Even worse, the frequency, intensity, scale and diversity of DDoS attacks are all increasing. Sophisticated bad actors are also using DDoS attacks as a smokescreen to mask advanced threats designed to breach perimeter security, exfiltrate data and deliver malware.</p> <p>The Internet is the wild wild west, except the bad guys aren’t gunslingers in black hats strutting into town in broad daylight, but hackers in hoodies lurking in the shadows of the Dark Web.</p> <p>One of the first large-scale DDoS attacks occurred nearly 20 years ago, in 1999. Since then, network engineers and equipment vendors have developed solutions to detect attacks as they occur and rapidly take action to mitigate against their ill effects. While these solutions were reasonably effective at the time, recent technology trends have created conditions conducive to attacks that are exacerbating the problem.</p> <p>DDoS attackers are taking advantage of the global scope of the Internet, the explosion in the number of applications and services in the cloud and the proliferation of millions and millions of poorly secured IoT devices that are easily compromised. Hackers are harnessing the power of vast IoT botnets to launch massive volumetric attacks from hundreds of thousands of endpoints. Attacks targeting web servers or application servers can also knock out perimeter security appliances such as firewalls and intrusion protection systems. On top of this, the ever-increasing speed of Internet backbone and access connections means that the rate of traffic hitting networks and systems under attack is also increasing.</p> <p>Hybrid multi-cloud enterprise IT presents attackers with a target-rich environment in which <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="Read our solution brief to learn more about Kentik DDoS Detection and Defense">DDoS protection</a> involves multiple approaches that work in concert to detect and mitigate attacks. There are cloud-based DDoS protection solutions for websites and Internet-facing applications. The leading cloud service providers offer some type of DDoS protection. Internet service providers (ISPs) deploy DDoS detection and mitigation systems in their own networks. Enterprises have the option of installing an on-premise DDoS protection appliance or relying on a DDoS protection service provided by the ISP.</p> <p>There is no silver bullet, and different solutions are required in each domain: enterprise networks, ISP networks, and in the cloud.</p> <p>Detecting DDoS attacks is problematic when attacks generate what appears to be legitimate traffic. If websites or applications become slow or inaccessible to users, how does an enterprise IT manager determine if this is due to a sudden surge in demand vs. a DDoS attack? False positives are always a concern, and attacks should be mitigated immediately, but the last thing anyone wants to do is block legitimate traffic.</p> <p>What if the enterprise itself isn’t under attack, but users are experiencing the effects of collateral damage due to overwhelming demand on networks or cloud infrastructure shared with the intended target?
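</p> <p>One heuristic that helps with the surge-vs-attack question above is to watch the shape of the traffic, not just its volume. A legitimate flash crowd usually grows its source population smoothly, while many attacks shift the source distribution abruptly. Below is a toy sketch of one such signal, the Shannon entropy of source IPs within a traffic window; this is a simplified illustration, not a complete detector:</p> <pre><code>import math
from collections import Counter

def source_entropy(src_ips):
    """Shannon entropy (bits) of the source-IP distribution in one window."""
    if not src_ips:
        return 0.0
    counts = Counter(src_ips)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A sudden, sustained jump or collapse in this value relative to its
# baseline is one signal worth alarming on alongside volume thresholds.
</code></pre> <p>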
DDoS attacks can impact a wide radius around the initial point of attack.</p> <p>DDoS protection is complex, and IT managers would be wise to seek solutions from vendors and service providers with proven expertise directly relevant to the particular needs of their enterprise. One approach is to outsource DDoS protection to managed security service providers. Large enterprises are often inclined to manage their own deployments, which could be a hybrid approach of on-premise and cloud-based detection and mitigation solutions.</p> <p>Due to the vast scale of the Internet, effective DDoS protection relies on tools that constantly ingest network telemetry data and provide real-time visibility into Internet traffic flows, end-to-end network paths and the endpoints generating malicious traffic. Retaining historical data can be extremely valuable for characterizing attacks as they occur and for conducting post-mortem analysis, so IT managers should closely evaluate the benefits of DDoS protection solutions that incorporate Big Data for rapid, multi-dimensional analytics of network telemetry data.</p> <p>Make no mistake, the bad guys aren’t going away, and the pace of DDoS attacks is not letting up. But thanks to recent advances in <a href="https://www.kentik.com/blog/how-to-maximize-the-value-of-streaming-telemetry-for-network-monitoring-and/" title="Kentik Blog: How to Maximize the Value of Streaming Telemetry for Network Monitoring and Analytics">streaming telemetry</a>, network visibility and Big Data, the good guys are armed with the weapons they need to maintain the peace.</p><![CDATA[The Role of Predictive Analytics in Network Performance Monitoring]]><![CDATA[Predictive analytics has improved over the past few years, benefiting from advances in AI and related fields. In this post, we look at how predictive analytics can be used to help network operations. We also dig into the limitations and how the accuracy of the predictions depends heavily on the quality of the data collected.]]>https://www.kentik.com/blog/the-role-of-predictive-analytics-in-network-performance-monitoringhttps://www.kentik.com/blog/the-role-of-predictive-analytics-in-network-performance-monitoring<![CDATA[Ken Osowski]]>Tue, 24 Apr 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/OlSaOBuXegOUK2MCOAgku/d593d9b4799c04348ae502f623acab04/Screen-Shot-2018-04-23-at-12.26.22-PM-300x199.png" class="image right" style="max-width: 300px" /> <p>Predictive analytics uses historical data to predict future events. In the context of IT practices, predictive analytics is gaining interest, due in part to advances in supporting technologies such as big data, machine learning, and artificial intelligence. For network performance monitoring (NPM), predictive analytics can be used to help network operations teams identify potential network failures and performance issues with greater accuracy and lower mean time to repair (MTTR). To address issues before they affect network operations, key performance metrics are monitored and analyzed. Identifiable patterns are then used to facilitate network changes to deal with performance issues or possible security threats.</p> <h2 id="key-use-cases-for-npm">Key Use Cases for NPM</h2> <ul> <li><strong>Determining network performance —</strong> Machine learning can be used in predictive analytics for network performance optimization.
By predicting capacity problems accurately, operations teams can act preemptively to rebalance the load on a network and provision the network with more capacity (a minimal forecasting sketch of this idea appears at the end of this post). Predictive analytics can also examine trends in data traffic patterns based on usage type and provide an early warning whenever it discovers possible issues.</li> <li><strong>Identifying security threats —</strong> Predictive analytics enables security analysts to recognize anomalous behavior from systems, devices and/or users. Rapid detection of security breaches is more important than ever, and predictive analytics can provide clues that escape human observers. Predictive analytics, along with NetFlow (or other flow data variants), can help weigh the risk of devices on a network and predict which are at highest risk. The cost of a network breach is typically several million dollars, so the more quickly you can detect and correct the breach, the less cost and impact there will be to your company’s reputation and bottom line.</li> </ul> <h2 id="the-caveat">The Caveat</h2> <p>As network operators implement network function virtualization (NFV) and software-defined networks (SDN) methodologies, operations teams will be dealing with a higher frequency of network changes. That makes it harder to predict network performance or security anomalies. With constant network changes becoming the norm, historical data may also be less available, making predictions less accurate.</p> <h2 id="big-data-and-cloud-scale-resources-to-the-rescue">Big Data and Cloud-Scale Resources to the Rescue</h2> <p>Even with sophisticated predictive learning capabilities, the level of prediction accuracy still depends on the quality, detail, and accessibility of the data set.</p> <p>With all this data, network performance monitoring tools are necessary to extract insights and trends. Machine learning techniques are used to find patterns in data and to build models that predict future outcomes. A variety of machine learning algorithms are available, including linear and nonlinear regression, neural networks, support vector machines, decision trees, and other algorithms.</p> <p>Advancements in distributed storage have opened the way for big data to play a pivotal role in providing a foundational platform for predictive network analytics. And while most network operators have invested in big data repositories, those repositories need to be designed for predictive analytics use cases at the outset, not as an afterthought. Use case-driven predictive analytics is key to effectively storing and using all of this data.</p> <p>The availability of cloud-scale compute power has paved the way for advanced predictive network analytics algorithms. These algorithms require immense compute power to apply machine learning techniques that combine and contextualize all of this data in near real-time.</p> <p>Predictive analytics has improved over the past few years, benefiting from advances in AI and related fields. The limitation of AI is that the accuracy of the predictions depends heavily on the quality of the data collected.</p> <p>A big data, SaaS approach is ideal for unifying network data at scale and providing the compute power to achieve predictive analytics benefits. If you’re interested in learning more about how Kentik’s big data approach is used for predictive analytics, check out the Kentik Detect for DDoS Protection <a href="//assets.ctfassets.net/6yom6slo28h2/HiW5fElf4kmIwSyyKgSmA/78f8c0b8f0f25ea2131fea9333e90891/kentik_ddos_solution_brief.pdf">solution brief</a>.
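</p> <p>As promised above, here is a minimal sketch of trend-based capacity forecasting: fit a straight line to recent link utilization and estimate when the trend crosses capacity. This is a toy model on synthetic data, far simpler than production forecasting:</p> <pre><code>def days_until_full(samples, capacity_bps):
    """Least-squares line fit over (day, bps) samples; returns the estimated
    number of days until average utilization reaches link capacity."""
    n = len(samples)
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, samples)) / \
            sum((x - x_mean) ** 2 for x in xs)
    if slope &lt;= 0:
        return None                # flat or declining trend
    intercept = y_mean - slope * x_mean
    return (capacity_bps - intercept) / slope - (n - 1)

# 30 days of daily-average utilization on a 10 Gbps link (synthetic data)
daily_avg = [4.0e9 + 0.05e9 * day for day in range(30)]
print(f"Estimated days until capacity: {days_until_full(daily_avg, 10e9):.0f}")
</code></pre> <p>Real forecasting must also account for seasonality, bursts, and confidence intervals, but even this simple model turns raw utilization history into an actionable planning signal.</p> <p>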
If you already know that you want to implement a much more cost-effective approach to predictive network analytics, start a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Hybrid Multi-Cloud and Hyperscale Performance Monitoring: Go Big or Go Home]]><![CDATA[Forward-thinking IT managers are already embracing big data-powered SaaS solutions for application and performance monitoring. If hybrid multi-cloud and hyperscale application infrastructure are in your future, ACG Analyst Stephen Collins' advice for performance monitoring is “go big or go home.”]]>https://www.kentik.com/blog/hybrid-multi-cloud-and-hyperscale-performance-monitoring-go-big-or-go-homehttps://www.kentik.com/blog/hybrid-multi-cloud-and-hyperscale-performance-monitoring-go-big-or-go-home<![CDATA[Stephen Collins]]>Tue, 17 Apr 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/5feStuJjSEYe8c8Y4uqcSy/fbb3b8c747e64bfe17b3b3a344c2bf44/maxresdefault-300x169.jpg" alt="maxresdefault-300x169.jpg" class="image right" style="max-width: 300px;" /> <p>We live in the age of analytics, powered by incredible advances in distributed computing and big data technology. Companies are turning to data and analytics to improve all aspects of how they do business. Data scientists are the new rock stars as enterprise IT managers are busy retooling for this new age of data-driven insights, operations and decision making.</p> <p>So it should come as no surprise that big data analytics will play a critical role in managing application performance in hybrid multi-cloud and hyperscale infrastructure. Network data is big data, characterized by massive volume, high velocity, and a wide variety of data types for monitoring application, infrastructure, and network performance. Big data analytics enables IT managers to rapidly correlate data across multiple datasets to extract actionable insights for a spectrum of operational use cases. Consider the many types of telemetry data that need to be collected, processed and stored:</p> <ul> <li>Wire data extracted directly from packets, including flow metadata</li> <li>KPI data from network elements and monitoring probes</li> <li>Server, OS, VM and container instrumentation</li> <li>Application performance metrics</li> <li>Syslog data from various servers and network elements</li> </ul> <p>This telemetry is primarily time series data, which is often enriched and fused with contextual data from other sources, including:</p> <ul> <li>CDNs, DNS servers and GeoIP databases</li> <li>User, device and provider data from OSS/BSS and CRM servers</li> <li>Security threat intelligence feeds</li> </ul> <p>Modern big data platforms are capable of handling the volume, velocity and variety of performance monitoring data in hybrid multi-cloud and hyperscale environments. Highly scalable big data clusters support the cost-effective storage capacity required for petabytes of data and high-velocity data pipelines capable of ingesting streaming telemetry data in real time. Column-oriented big data repositories enable powerful multi-dimensional analytics on massive time series datasets. IT managers can perform complex queries that correlate across multiple data types in near real-time — seconds vs. minutes — gaining insights into application and network performance that are not possible using existing monitoring tools.</p> <p>Big data accommodates the large datasets required to execute machine learning algorithms that can automatically detect conditions, trends and anomalies in real time.
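</p> <p>At its simplest, streaming anomaly detection of this kind can be reduced to maintaining an exponentially weighted baseline and flagging samples that fall far outside it. The sketch below is a toy illustration of that principle, not any vendor’s implementation:</p> <pre><code>class EwmaDetector:
    """Streaming anomaly detection against an exponentially weighted baseline."""

    def __init__(self, alpha=0.1, n_sigma=4.0):
        self.alpha, self.n_sigma = alpha, n_sigma
        self.mean = None          # EWMA of the metric
        self.var = 0.0            # EWMA of its variance

    def update(self, value):
        """Feed one sample; returns True if it is anomalous vs. the baseline."""
        if self.mean is None:     # first sample seeds the baseline
            self.mean = value
            return False
        deviation = value - self.mean
        anomalous = self.var &gt; 0 and abs(deviation) &gt; self.n_sigma * self.var ** 0.5
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return anomalous
</code></pre> <p>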
Since machines are far better at crunching numbers than humans, these systems can automate detection that would otherwise require time-consuming manual workflows, surfacing conditions humans might never detect at all. Machine learning also enables predictive analytics so that IT managers can be proactive in anticipating problems and taking action before they occur. Ultimately, big data analytics and machine learning will provide the closed-loop feedback critical for automating many NetOps, ITOps, SecOps and DevOps workflows, reducing OPEX and improving uptime by eliminating operator errors.</p> <p>While big data can deliver big benefits, deploying and operating big data clusters is resource-intensive and talent with the necessary expertise is in short supply. This is why many organizations are choosing SaaS-based big data solutions instead of deploying platforms on-premise. With SaaS, IT managers don’t have to manage the associated complexity and technology risk, and they significantly reduce the up-front investment in favor of affordable, pay-as-you-grow pricing.</p> <p>Forward-thinking IT managers are already embracing big data-powered SaaS solutions for application and performance monitoring. If hybrid multi-cloud and hyperscale application infrastructure are in your future, my advice for performance monitoring is “go big or go home.”</p><![CDATA[Preparing for the Hybrid Multi-Cloud Endgame]]><![CDATA[While we’re still in the opening phase of the hybrid multi-cloud chess game, ACG analyst Stephen Collins takes a look at what's ahead. In this post, he digs into what enterprises embracing the cloud can expect in the complex endgame posing many new technical challenges for IT managers.]]>https://www.kentik.com/blog/preparing-for-the-hybrid-multi-cloud-endgamehttps://www.kentik.com/blog/preparing-for-the-hybrid-multi-cloud-endgame<![CDATA[Stephen Collins]]>Tue, 10 Apr 2018 04:00:00 GMT<p>While we’re still in the opening phase of the hybrid multi-cloud chess game, I would like to take a look ahead at what enterprises embracing the cloud can expect in what promises to be a complex endgame posing many new technical challenges for IT managers. Ensuring application and network performance looms as a critical concern in environments where a new generation of IT software and computing infrastructure spans multiple geographically distributed data centers and a mix of public and private networks, including SD-WAN overlays.</p> <div as="WistiaVideo" videoId="xogyx13gzl" audio></div> <div class="caption" style="margin-top: -15px">Navigating the Complexities of Hybrid and Multi-cloud Environments with Phil Gervasi</div> <p>In hybrid multi-cloud environments, DevOps teams will utilize a hybrid of public cloud services and private cloud infrastructure. Given the diversity of applications and differing requirements, they will typically leverage multiple services provided by some combination of Amazon, Google and Microsoft. In addition to popular SaaS applications such as Salesforce and Office 365, enterprises will adopt a broad range of cloud-based IT solutions from leading software companies like IBM, Oracle and SAP.</p> <p>Private cloud infrastructure will be a fixture in the hybrid multi-cloud endgame, for applications requiring physical security or whose performance could be impacted by variable throughput and latency in the “best effort” Internet. Private cloud infrastructure will be located either in an on-premise data center or a shared hosting facility.
In addition, well into the endgame, IT managers will still be tasked with maintaining traditional data centers for legacy applications and databases, and it is likely they will be required to provide access to these from new applications deployed in public or private clouds.</p> <p>In such highly complex environments, it is obvious that many things can go wrong, and will. Therefore, enterprise IT managers need to prepare by acquiring the right tools and learning the new skills needed to win the hybrid multi-cloud endgame. Properly equipped, DevOps, NetOps and ITOps teams will be able to rapidly detect and determine the root cause of performance anomalies and failures when they occur so they can take the necessary remedial action.</p> <h2 id="what-will-happen-when-users-report-that-an-application-seems-too-slow">What will happen when users report that an application seems “too slow”?</h2> <p>Let’s take a quick top-down look at some of the potential hybrid multi-cloud problems and the tools IT managers will need to solve them.</p> <p>At the highest level, DevOps teams will need to monitor the performance of highly dynamic applications composed of microservices that are distributed across many servers and racks and may also be running in multiple data centers operated by different cloud providers. Performance monitoring tools must be able to map the topology of these applications and measure the performance of each component. DevOps teams also need tools for monitoring the performance of the underlying computing infrastructure, including each layer in the supporting software stack.</p> <p>If the application and infrastructure layers are performing properly, then it is likely the problem is somewhere in the network. But in whose network? Is the performance anomaly in the cloud provider’s hyperscale data center network? Or perhaps the CDN the cloud provider is using? Or maybe it is somewhere in the Internet, along the paths application flows traverse across multiple peering and transit networks? NetOps teams need tools for gaining visibility into these various networks to determine the root cause of performance problems.</p> <p>Network monitoring must also provide end-to-end visibility into paths that traverse both public and private networks. SD-WAN monitoring requires visibility into a combination of both public and private underlay network connections utilized by the SD-WAN overlay. Consider what will likely be a common scenario, which is private cloud applications running in a hybrid SD-WAN based on an existing MPLS backbone but utilizing broadband Internet connectivity for remote sites and users. In this environment, IT managers need a toolset that enables them to clearly delineate between private cloud application and infrastructure problems vs. network layer problems so they can quickly zero in on the root cause.</p> <p>There is no doubt that hybrid multi-cloud IT applications will be complex to deploy and manage, but the good news is that smart people are working on innovative solutions to the critical problems.
We still have a long way to go before reaching the endgame, so now is the time to start preparing and educating yourself on the new tools and techniques needed to execute a winning strategy.</p><![CDATA[Kentik Detect for FinServ Networks: Real-World Use Cases]]>https://www.kentik.com/blog/kentik-detect-for-finserv-networks-real-world-use-caseshttps://www.kentik.com/blog/kentik-detect-for-finserv-networks-real-world-use-cases<![CDATA[Justin Ryburn]]>Wed, 04 Apr 2018 04:00:00 GMT<p>In our last blog post, <a href="https://www.kentik.com/blog/at-the-turning-point-finserv-data-networks/">At The Turning Point: FinServ Data Networks</a>, we discussed the challenges faced by financial services organizations when it comes to managing modern networks. In this post, we dig a little deeper with real-world examples of how Kentik helps our financial industry customers solve these challenges. Before we dive into the specifics, let’s take a look at a few aspects of Kentik Detect® that make this possible:</p> <ul> <li><strong>Unlimited scale, no compromises</strong> – The Kentik Data Engine (KDE) was built using a modern distributed systems architecture so it can scale horizontally to any network size, retain the raw data for 90 days or more, and provide the details you need in seconds. Networking teams no longer need to be limited by unresponsive tools, aggregated stats, and data dead ends. For a deeper dive on KDE, check out our <a href="https://www.kentik.com/platform/kentik-detect/kentik-data-engine/">product page</a>.</li> <li><strong>Comprehensive detection</strong> – Kentik’s policy-based alerting engine continuously evaluates incoming network data to detect anomalies and trigger notifications or automatic actions. Users can choose from a library of pre-built policies, or create new policies with custom traffic filters, baselines and threshold conditions. For more information, check out our <a href="https://www.kentik.com/blog/kentik-detect-alerting-configuring-alert-policies/">alerting blog post</a>.</li> <li><strong>Business context</strong> – Custom Dimensions map business data (e.g., identifiers like customer names, locations, and services) onto network traffic data. This allows non-technical stakeholders from product, marketing, sales, management, and executives to understand the data and gain insight in terms that are relevant to their roles. To learn more about Custom Dimensions, check out our <a href="https://kb.kentik.com/Cb06.htm#Cb06-Custom_Dimensions">Knowledge Base article</a>.</li> <li><strong>Your data, your way</strong> – Fully customizable and interlinked dashboards allow Kentik Detect to adapt to each user’s workflows, not the other way around. Technical stakeholders from operations, security, and application teams can build their own views oriented to their role, and also curate data to create a “UI within the UI” for sales, finance, executives, and others. For more information, check out our <a href="https://kb.kentik.com/Db02.htm#Db02-Dashboards">Dashboards Knowledge Base article</a>.</li> </ul> <p>With these things in mind, let’s take a look at a few examples of how our FinServ customers use Kentik Detect to solve real-world problems.</p> <h4 id="the-big-picture">The Big Picture</h4> <p>One of the more obvious, yet powerful uses for Kentik Detect is a dashboard that provides a comprehensive overview of network traffic across the entire infrastructure: LAN / WAN, internal data centers and public cloud.
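</p> <p>Under the hood, business-context labeling of the kind Custom Dimensions provides (described above) amounts to enriching each flow record with lookups against business data. A toy sketch with invented mappings, purely to illustrate the idea:</p> <pre><code># Invented business mappings: subnet prefix -&gt; site, server IP -&gt; service
SITE_BY_PREFIX = {"10.1.": "Branch-NYC", "10.2.": "ATM-Network", "10.3.": "HQ-Trading"}
SERVICE_BY_IP = {"192.0.2.10": "payments-api", "192.0.2.20": "trade-feed"}

def enrich(flow):
    """Attach business context to a raw flow record (dict) via lookups."""
    src = flow["src_ip"]
    flow["site"] = next((site for prefix, site in SITE_BY_PREFIX.items()
                         if src.startswith(prefix)), "unknown")
    flow["service"] = SERVICE_BY_IP.get(flow["dst_ip"], "unknown")
    return flow

print(enrich({"src_ip": "10.1.22.7", "dst_ip": "192.0.2.10", "bytes": 48_000}))
</code></pre> <p>Once every record carries labels like these, dashboards can be grouped and filtered by business terms rather than raw IPs.</p> <p>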
To illustrate this, we will imagine a company called ACME Bank &#x26; Investments with a dashboard called <em>ACME Bank &#x26; Investments Network Overview</em>. <img src="//images.ctfassets.net/6yom6slo28h2/7EAtixgqpGiEC0kqEmyaGs/e11c406c3ba6fa00c9f2dda404cf80be/big-picture-1200w.png" alt="big-picture-1200w.png" class="image center" style="max-width: 800px;" /></p> <p>This first panel shows traffic per location across their WAN: banking branches, ATMs, HQ buildings, trading desks, etc.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4fEpUr1ceAUIKoUeSa6qIy/7879ade795b0ae841b586ec53ce6c639/wan-client-locations-psd.gif" alt="wan-client-locations-psd.gif" class="image right" style="max-width: 800px;" /> <p>In the next panel, we’ll take a look at the relationships between clients across ACME’s WAN and services running in their internal data center in San Jose (SJC1). The Sankey diagram illustrates the utilization across datacenter zones, server IPs, and services. Kentik makes extensive use of the Sankey visualization type to illustrate relationships within network data. (Read more on our Sankeys in our blog post <a href="https://www.kentik.com/blog/insight-delivered-the-power-of-sankey-diagrams/">Insight Delivered: The Power of Sankey Diagrams</a>.)</p> <img src="//images.ctfassets.net/6yom6slo28h2/4iWfenA4IgMgAsUG4u0e60/51fc40dcd825bc2a082d114907970cd1/sankey1-1000w.png" alt="sankey1-1000w.png" class="image center" style="max-width: 800px;" /> <p>With a simple filter change, another panel illustrates traffic flows between WAN locations and public cloud services.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3pb8wtkZhCeOuoKiUaSmGI/ee71787055ab9334f191fc936d318c21/sankey2-1000w.png" alt="sankey2-1000w.png" class="image center" style="max-width: 800px;" /> <p>Panels can be set to auto-refresh every 30, 60, or 90 seconds, making it easy to verify traffic flows during network maintenance or while troubleshooting active events. Panels can also be navigationally linked to other dashboards, or can drill down to fine details using the Data Explorer.</p> <h4 id="is-it-the-network">Is It The Network?</h4> <p>Network operations teams are usually “guilty until proven innocent.” Application performance problems are often blamed on the network unless ops teams can provide data to demonstrate otherwise. Getting that data has historically been a huge challenge. Traditional tooling is often not deployed where and when the problems occur, and doesn’t retain the details necessary to definitively answer the “was it the network?” question. Kentik Detect solves both those problems. Our kprobe agent deploys anywhere – on servers, VMs, public cloud instances, or sensor hosts connected to packet brokers or taps. The kprobe agent produces flow data enriched with performance metrics and layer 7 details, which are stored in the Kentik Data Engine. That enriched data can then trigger proactive notifications about performance problems when they arise, and provide the fine details engineers need to quickly distinguish between application and network performance issues. <img src="//images.ctfassets.net/6yom6slo28h2/onXSFz4dBA04aKO0QWGiG/aa8b9cb4edd7818e71c62e53cd6cb6f1/notification-1200w.png" alt="notification-1200w.png" class="image center" style="max-width: 800px;" /></p> <p>Using our ACME Bank &#x26; Investments example again, the policy and alert above allow ACME’s networking team to quickly see that an application running on server3_sjc1 in SJC1 Datacenter Zone A is slow in responding to requests.
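</p> <p>The reasoning the team applies here boils down to a simple decision rule: if application-level latency rises while network-level metrics stay flat, suspect the application; if network round-trip time rises too, suspect the network. A toy sketch of that rule over enriched flow aggregates (the field names are illustrative, not Kentik’s schema):</p> <pre><code>def trend(window, field):
    """Ratio of the recent average to the baseline average for one metric."""
    half = len(window) // 2
    base = sum(r[field] for r in window[:half]) / half
    recent = sum(r[field] for r in window[half:]) / (len(window) - half)
    return recent / base if base else 1.0

def diagnose(window):
    """Classify a latency alert from per-minute aggregates of enriched flow data."""
    app_up = trend(window, "app_latency_ms") &gt; 1.5    # 50%+ increase
    net_up = trend(window, "network_rtt_ms") &gt; 1.2
    load_up = trend(window, "bps_in") &gt; 1.2
    if app_up and not (net_up or load_up):
        return "application problem: latency rose while network metrics stayed flat"
    if net_up:
        return "network problem: RTT rose alongside application latency"
    return "inconclusive: correlate with server-side metrics"
</code></pre> <p>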
The team can also see the exact HTTP URL where the slow responses were observed. Drilling down on the alert brings up a dashboard with more detail:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1C6mr8pBRuYEWUIwaWM8G0/baf629215d560030b4d69a560d452f44/dashboard-detail-1200w.png" alt="dashboard-detail-1200w.png" class="image center" style="max-width: 800px;" /> <p>ACME can quickly see that the application’s latency has increased but there has been no change in the packets per second (pps) or bits per second (bps) towards this server. Just to be sure, the team can dive into another panel to compare the application latency (+ axis) with the server latency (- axis):</p> <img src="//images.ctfassets.net/6yom6slo28h2/6l7SqFmJcA6wQ84M6AUAMC/17b6e2af6e8871b80f720595af764fd4/app-server-latency-1200w.png" alt="app-server-latency-1200w.png" class="image center" style="max-width: 800px;" /> <p>Very quickly, the networking team at ACME is able to isolate the problem to an application performance issue instead of a network problem.</p> <h4 id="securing-services">Securing Services</h4> <p>Application deployment models have changed dramatically, from monolithic apps that live in a static location, to a distributed services model where app components are deployed and auto-scaled dynamically. That’s made it very difficult for networking teams to get accurate information on the network traffic relationships between service components and clients. Application owners have even less awareness about how their apps and components communicate. That makes deploying security policies like firewall rules or ACLs nearly impossible. If you don’t understand which components and services need to communicate with each other, how can you even begin? Once again, Kentik Detect can help here. We can create a Kentik view for ACME that lists each unique source / dest / service as actually observed on the network:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6yjrJ8gTe0eum2w4YW0Weo/7457f0a4663a1ec5b357cfcaf23a50e2/list-view-1200w.png" alt="list-view-1200w.png" class="image center" style="max-width: 800px;" /> <p>Now the security team can build a ruleset with the confidence that they will not inadvertently block any communication that’s required for the application to function. Even better, they could configure a policy to alert whenever communication occurs that deviates from this list of expected connections.</p> <h4 id="summary">Summary</h4> <p>These are just a few examples of how Kentik Detect’s modern, scalable approach to collecting, organizing, and displaying network data can help ease the network pain for financial services organizations. If you are not already a Kentik customer and would like to get started analyzing your network traffic today, <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[At The Turning Point: FinServ Data Networks]]><![CDATA[While data networks are pervasive in every modern digital organization, there are few other industries that rely on them more than the Financial Services Industry.
In this blog post, we dig into the challenges and highlight the opportunities for network visibility in FinServ.]]>https://www.kentik.com/blog/at-the-turning-point-finserv-data-networkshttps://www.kentik.com/blog/at-the-turning-point-finserv-data-networks<![CDATA[Jim Meehan]]>Thu, 29 Mar 2018 04:00:00 GMT<h3 id="challenges--opportunities-for-network-visibility-in-the-financial-services-industry">Challenges &#x26; Opportunities for Network Visibility in the Financial Services Industry</h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/3AK3Yb5uNy42UeOCIg2y2U/549477bc1120b672b1b2086acd3dd18b/retail-banking-ot-core-banking-300x200.jpg" alt="retail-banking-ot-core-banking-300x200.jpg" class="image right" style="max-width: 300px;" />While data networks are pervasive in every modern digital organization, there are few other industries that rely on them more than the Financial Services Industry (FSI). Dating all the way back to the telegraph era, banks and brokerages have used networks to sync transactions and transfer money. Network availability and performance are critically important in banking networks where millions of dollars are transacted every second. Investment firms dealing with High-Frequency Trading applications have zero tolerance for loss and latency. And anyone who’s been close to an FSI IT department is familiar with the expectations and stress placed on the network team. On top of the internal pressure, an increasing pace of external challenges and technology trends makes the job even harder. A few of these include:</p> <ul> <li>Migration from large monolithic applications to distributed applications (a.k.a. microservices)</li> <li>Internet everywhere: it’s now your WAN, your digital supply chain, and delivery vehicle</li> <li>Moving from statically deployed applications to DevOps, CI/CD development models</li> <li>Migrating from a central internal data center to a hybrid multi-cloud environment</li> <li>Relentless growth of traffic volume and network elements</li> <li>New security requirements due to increased threat activity and regulation</li> </ul> <p>While some of these trends provide significant benefits to the organization by lowering infrastructure costs or speeding application development cycles, they are also creating extraordinary challenges for network and security operations teams. Basic details about network traffic that these teams previously relied upon to perform day-to-day tasks like troubleshooting congestion, outages, misconfiguration, or potential threats are now harder to get because:</p> <ul> <li>Traditional tooling can’t deploy to all the places where applications now live (i.e. public cloud)</li> <li>Current tools can’t scale with traffic growth and still provide on-demand details</li> <li>Network identifiers like IP addresses or interface names have lost their context in the face of dynamically deployed and autoscaled applications</li> </ul> <p>Along with these challenges comes opportunity, however. Tooling can’t be an afterthought — especially in organizations so reliant on the network. Network monitoring and visibility strategies must change to accommodate these trends in the network itself. In order to meet these new challenges, monitoring architecture should mirror the trends and technologies being employed in applications and networks overall. Key considerations for new network tooling include:</p> <ul> <li> <p><strong>Web-scale</strong> — A buzzword, yes.
But also a legitimate philosophy that encompasses modern distributed computing, built-in redundancy and high availability, and horizontally scalable data ingest, storage, and compute. Appliance-based architectures (physical or virtual) don’t scale to the volume of monitoring data produced by today’s networks.</p> </li> <li> <p><strong>Fast and flexible</strong> — Past monitoring architectures required a tradeoff: fast answers to predefined questions, or long query times (minutes to hours) with full query flexibility. Neither approach is workable for solving emergent issues in critical networks because the answers you need aren’t available when you need them. To be truly operationally useful, modern tooling must provide answers in seconds with robust data filtering and segmentation capabilities.</p> </li> <li> <p><strong>Proactive insight</strong> — Enabling network teams to quickly find “the needle in the haystack” is necessary, but not sufficient. Modern tooling should baseline normal network activity at scale, proactively find anomalies before customers or users notice, and provide the details that engineers need to solve the problem quickly.</p> </li> <li> <p><strong>Real context</strong> — In the past, networks may have been static enough that engineers could identify applications, users, or locations by IP address or network element names alone. That’s no longer possible at today’s scale, especially as application components are dynamically scaled and deployed. For full understanding, modern tooling must label network data with business-level context like application and service names, usernames, or physical locations.</p> </li> <li> <p><strong>Deploys everywhere</strong> — Monitoring needs to go everywhere your applications and traffic go:  your WAN (SD or not), your datacenters, public cloud instances. A single UI to view all the traffic, anywhere, puts a stop to the network ops swivel chair.</p> </li> <li> <p><strong>Serves everyone</strong> — Network data has lots of potential value across multiple teams:  network engineering and operations for sure, but also SecOps and DevOps teams, and even up into management and executive ranks. All those teams are going to use the data differently, however. To truly provide value across teams, modern tooling must allow data to be curated and views and workflows to be customized.</p> </li> </ul> <p>Here at Kentik, we’ve built the next-generation network traffic analytics platform that incorporates these requirements and more. We’ve been helping FSI network and security teams meet and exceed their organizations’ high expectations for network availability, performance and security, while simultaneously pivoting to new technologies and operational models. Stay tuned for upcoming blogs about specific challenges Kentik has helped customers solve. If you’d like to learn more, contact us to <a href="#demo_dialog">schedule a demo</a> or <a href="#signup_dialog">start a trial</a>.</p><![CDATA[Internet Underlay Visibility is Critical for SD-WAN Overlays]]><![CDATA[Fixing a persistent Internet underlay problem might be as simple as using a higher bandwidth connection or as complex as choosing the right peering and transit networks for specific applications and destination cloud services.
In this blog, ACG analyst Stephen Collins advises that to make the best-informed decision about how to proceed, IT managers need to be equipped with tools that enable them to fully diagnose the nature of Internet underlay connectivity problems.]]>https://www.kentik.com/blog/internet-underlay-visibility-is-critical-for-sd-wan-overlayshttps://www.kentik.com/blog/internet-underlay-visibility-is-critical-for-sd-wan-overlays<![CDATA[Stephen Collins]]>Tue, 27 Mar 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/1HKPIIn83CcKYuEkMWWc0C/639abce5879b54abc59fa729f667735c/Public-cloud-ethernet-hero-1080x675-300x188.jpg" class="image right" style="max-width: 300px; margin-bottom: 20px;" alt="sd-wan underlay overlay illustration" /> <p>Looking back at the flood of announcements and the flurry of M&#x26;A activity, it’s fair to say that 2017 was the “Year of the SD-WAN.” Or at least the year that SD-WANs were permanently etched into our collective consciousness.</p> <h2 id="what-is-an-sd-wan-overlay">What is an SD-WAN overlay?</h2> <p><a href="https://www.kentik.com/kentipedia/sd-wan-software-defined-networking-defined-and-explained/" title="Kentipedia: Software-Defined Wide Area Network (SD-WAN) Explained">SD-WANs</a> are the confluence of four technology trends: software-defined networking in wide area networks (WANs), commodity hardware for customer premises equipment, Internet connectivity for business applications, and enterprise IT hybrid multi-cloud migration. Yet the simple term SD-WAN belies the extraordinary impact these trends are having on enterprise networking.</p> <p>Classic enterprise WANs rely on services like MPLS and Carrier Ethernet that are fully managed by service providers from end to end. SD-WANs are virtual overlay networks based on tunnels that carry traffic over multiple underlay networks, typically a hybrid of existing carrier services and unmanaged connections via the public Internet.</p> <h2 id="what-are-the-beneifts-to-implementing-sd-wan-overlays">What are the benefits of implementing SD-WAN overlays?</h2> <p>The benefits of SD-WANs are compelling both economically and operationally. Software-driven SD-WANs are application aware and able to route individual traffic flows over the best paths to ensure end-to-end throughput and performance. SD-WAN network elements continuously monitor the state of overlay connections for throughput, latency and jitter and can dynamically reroute traffic over alternate overlay connections in the event of a problem.</p> <p>However, SD-WAN overlays are inherently blind to the inner workings of underlay networks. For underlays based on carrier services like MPLS, the service provider can be relied on to resolve problems, but for underlays based on Internet connectivity, how do enterprise IT managers identify and resolve problems?</p> <h2 id="how-can-internet-connectivity-impact-sd-wan-overlays">How can Internet connectivity impact SD-WAN overlays?</h2> <p>Internet underlay visibility is critical to ensuring the smooth operation of SD-WAN overlays. Let’s take a quick look at two typical SD-WAN scenarios leveraging Internet connectivity.</p> <p>A common use case for SD-WAN early adopters has been the use of fixed line and wireless (4G LTE) broadband Internet services for connecting hundreds of branch offices and remote sites that are distributed across a large geographic area.
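</p> <p>The path-selection logic described earlier (monitor each overlay connection for latency and jitter, then reroute) can be sketched in miniature. The sample below is a toy model with invented RTT values, not any vendor’s algorithm; it scores each underlay by mean round-trip time plus a jitter penalty:</p> <pre><code>import statistics

def score(samples_ms):
    """Score one underlay path from recent RTT samples: mean RTT plus jitter."""
    return statistics.mean(samples_ms) + 2 * statistics.pstdev(samples_ms)

def best_path(paths):
    """Pick the underlay with the lowest (RTT + jitter) score."""
    return min(paths, key=lambda name: score(paths[name]))

paths = {                      # invented RTT samples per underlay, in ms
    "mpls":      [31, 30, 32, 31, 30],
    "broadband": [22, 24, 80, 23, 95],   # fast on average, but jittery
    "lte":       [48, 50, 47, 52, 49],
}
print(best_path(paths))        # "mpls" wins once jitter is penalized
</code></pre> <p>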
These SD-WANs typically involve a number of different broadband service providers, and traffic will flow to and from these sites over many Internet peering and transit networks.</p> <p>When a problem is detected in an MPLS network, the IT manager will call the service provider for a resolution, but when a problem is detected in the SD-WAN overlay, how will the IT manager rapidly determine the root cause in the vast expanse of the Internet underlay?</p> <p>The other scenario involves SD-WANs that incorporate Internet “breakout” connections for popular SaaS applications like Office 365 and <a href="http://salesforce.com/">Salesforce.com</a> or for enterprise IT applications deployed in public cloud services provided by Amazon, Google or Microsoft. SD-WAN software can rapidly detect connectivity problems with these services and even switch over automatically to alternate underlay connections.</p> <p>SD-WAN vendors and service providers, however, do not typically provide the tools needed to diagnose the cause of the underlying Internet connectivity problem. Transient problems in the Internet underlay may not be an issue, but if there are persistent problems with an underlay connection, the IT manager needs to know the root cause. Contacting the ISP might yield an answer, but the source of the problem could also be several hops beyond its own network.</p> <p>To quickly resolve problems in either scenario, IT managers need new tools that can gain visibility into the end-to-end <a href="https://www.kentik.com/kentipedia/what-is-network-topology/" title="Kentipedia: What is Network Topology?">network topology</a> and the various paths that traffic flows are traversing from the Internet breakout connection into multiple cloud provider networks.</p> <p>Fixing a persistent Internet underlay problem might be as simple as using a higher bandwidth connection or as complex as choosing the right peering and transit networks for specific applications and destination cloud services. Either way, to make the best-informed decision about how to proceed, IT managers need to be equipped with <a href="https://www.kentik.com/product/core/" title="Learn more about Kentik Core, one of Kentik&#x27;s solutions for network visibility.">network visibility tools</a> that enable them to fully diagnose the nature of Internet underlay connectivity problems. Learn more about using Kentik to <a href="https://www.kentik.com/solutions/usecase/wan-and-sd-wan/" title="Kentik: WAN &#x26; SD-WAN use cases.">monitor both SD-WAN overlay and underlay performance here</a>.</p><![CDATA[SaaS Applications: No-Brainer or Headache?]]><![CDATA[Traditional monitoring tools for managing application performance in private networks are not well-suited to ensuring the performance, reliability and security of SaaS applications.
In this post, ACG analyst Stephen Collins makes the case for why enterprise IT managers need to employ a new generation of network visibility and big data analytics tools designed for the vast scale of the Internet.]]>https://www.kentik.com/blog/saas-applications-no-brainer-or-headachehttps://www.kentik.com/blog/saas-applications-no-brainer-or-headache<![CDATA[Stephen Collins]]>Wed, 21 Mar 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/5voElVojHUeY00KOoEi8Ia/47e6ea9645c8625dbc6d90e7beec9dbb/cloud2-300x200.jpg" alt="cloud2-300x200.jpg" class="image right" style="max-width: 300px; margin-bottom: 15px;" /> <p>A recent Gartner report <a href="https://www.gartner.com/en/newsroom/press-releases/2017-10-12-gartner-forecasts-worldwide-public-cloud-services-revenue-to-reach-260-billionin-2017">forecasts</a> that the total market for software-as-a-service (SaaS) will double in size from almost $50 billion in 2016 to $100 billion in 2020. The rapid pace of adoption for business applications is proving that SaaS is a win-win for enterprise IT managers and the software companies providing these services. The economic benefits of SaaS are so compelling it is hard to justify deploying application software on-premise if a SaaS-based version is available that satisfies customer requirements.</p> <p>While SaaS providers are utilizing state-of-the-art cloud-scale infrastructure to deliver applications to users via the Internet, enterprise IT managers face entirely new challenges because these services depend on the performance and reliability of networks that the enterprise doesn’t own or operate.</p> <p>Choosing to deploy SaaS applications may seem like a no-brainer, but what does an enterprise IT manager do when someone calls the help desk to report that an application is “slow” or simply no longer accessible? The first call is often to the SaaS provider, but what if there is no problem with the application or supporting infrastructure on its end? What’s the next step in determining the root cause, particularly if the problem is deep in the Internet at a point that is impacting the flow of application traffic for a significant number of users? Even worse, what if the problem clears up on its own only to reappear at a later date, and this happens repeatedly without the root cause ever being determined?</p> <p>Is SaaS really a no-brainer or a headache?</p> <p>To complicate matters, today’s enterprise user population encompasses not just employees working in company offices, but highly mobile and remote workers connected via a mix of 4G LTE, public WiFi and consumer broadband networks. Access to SaaS applications may also be provided to customers and partners, wherever those users might be situated.</p> <p>On the provider side, there are different deployment models for SaaS applications. The expedient approach is to deploy on public cloud infrastructure provided by Amazon, Microsoft or Google. However, this may not be cost-effective for certain applications, so SaaS providers often deploy on their own “bare metal” infrastructure, frequently located in a shared hosting facility.
Software behemoths like IBM and Oracle deploy SaaS applications using their own cloud infrastructure.</p> <p>The net result of all this is that for any given enterprise, traffic flows for SaaS applications traverse many different paths across the Internet from thousands of endpoints to many destination data centers in the cloud.</p> <p>In addition, large SaaS providers might deploy multiple data centers that are distributed geographically, while others ensure broad geographic coverage using a content distribution network (CDN), which is an Internet overlay for intelligently distributing application software and data to local caches that are kept in sync with the primary data center. This approach introduces yet another layer of complexity in terms of gaining end-to-end visibility into the flow of application traffic over the Internet.</p> <p>The traditional monitoring tools for managing application performance in private networks are not well-suited to ensuring the performance, reliability and security of SaaS applications. Enterprise IT managers need to employ a new generation of network visibility and Big Data analytics tools designed for the vast scale of the Internet.</p> <p>Armed with these tools and the new skills needed to master them, SaaS application deployment starts to look like more of a no-brainer than a headache for enterprise IT managers.</p><![CDATA[The Importance of Network Security & Planning in Educational Institutions]]><![CDATA[College, university and K-12 networking and IT teams who manage and monitor campus networks are faced with big challenges today. In this post, we take a deeper look at the challenges and provide requirements for a cost-effective network monitoring solution.]]>https://www.kentik.com/blog/the-importance-of-network-security-planning-in-educational-institutionshttps://www.kentik.com/blog/the-importance-of-network-security-planning-in-educational-institutions<![CDATA[Ken Osowski]]>Mon, 19 Mar 2018 04:00:00 GMT<img src="//images.ctfassets.net/6yom6slo28h2/4BZT8pNaJimQQ6aCQgWOSY/ea6efb401aafd51e4c7863a4296888bc/education-gets-a-virtual-up-ab778d5480-300x169.jpg" alt="" class="image right" style="max-width: 300px;" /> <p>Whether it’s for a college, university or K-12 institution, networking and IT teams engaged in the management and monitoring of campus networks are facing some big challenges today:</p> <ul> <li>Maintaining very large, complex local wireless and fixed WAN networks that include access to research networks that span broad geographies</li> <li>Identifying and stopping security threats such as DDoS attacks and malware intrusion</li> <li>Enforcing copyright laws and restricting content</li> <li>Supporting a disparate collection of network and user devices</li> <li>Supporting high numbers of remote and transient users engaged in research activities, creating a higher security risk</li> <li>Dealing with unpredictable student users who may have an inclination to maliciously misuse or breach network security</li> <li>Monitoring high volumes of large file sharing and file downloading that can impact overall network performance</li> <li>Maintaining information privacy</li> </ul> <p>These challenges not only impose a responsibility on school network operations staff to keep networks running smoothly, but also add watchdog duties to prevent misuse of network resources.
At the same time, professors and teachers are looking to leverage the latest in network-based collaboration and learning tools that incorporate streaming HD audio and video for virtual classroom environments. New applications and methods of communication that can affect network performance without any warning are constantly added to the mix.</p> <p><strong>What is Needed?</strong> Two needs have risen to the top of the list for educational institutions wanting to maintain control of their network. The first is eliminating internal network misuse. The second is preventing security threats originating from outside their networks. Both of these can bring a network to a halt, or, worse yet, result in data being compromised. And unlike enterprises, educational institutions (in particular, universities with students living on campus) are particularly vulnerable to cyber attacks, whether from gaming retaliation initiated by outside “poor losers” or due to the inherent “openness” of an educational network. After all, the latter is what facilitates the free exchange of ideas, making it easier for researchers to engage in worldwide collaboration in pursuit of common research goals. It’s apparent that network security and planning are crucial to educational institutions, but they may not have as large a budget as enterprises do to efficiently monitor their networks. When budgets are a concern, it’s crucial to find a network monitoring solution that can still scale to monitor the largest, most complex networks without breaking the budget.</p> <p><strong>Key capabilities of a cost-effective network monitoring solution should include:</strong></p> <ul> <li>Leveraging a SaaS solution to save on maintenance and management costs</li> <li>Improving the speed and accuracy of anomaly detection and alerting</li> <li>Enabling cross-departmental traffic visibility</li> <li>Providing DDoS detection and mitigation</li> <li>Supporting ISP traffic analysis to ensure quality of service</li> <li>Troubleshooting and investigating network abuse complaints</li> </ul> <p><strong>A solution that meets these requirements can provide the following benefits:</strong></p> <ul> <li>Eliminated CapEx because no dedicated network monitoring devices are required.</li> <li>Reduced OpEx by not needing dedicated staff to configure and maintain monitoring devices, and by reducing traffic on ISP links that are billed based on usage.</li> <li>Customized reporting matched to campus network topology, covering dorms, buildings, departments, or other campus-specific user populations as identified by IP address, VLAN, or MAC address.</li> <li>Ensured service levels. Higher education institutions in particular experience dramatic traffic spikes due to the usage patterns of large residential student populations. Properly managing network investments involves optimizing traffic delivery across existing links before determining if upgrades are required.</li> <li>Improved network security. Defend more successfully against DDoS attacks, and find and shut down network or other IT abuses.</li> <li>Identification of excessive network usage by users, devices, and OTT services.</li> </ul> <p>A big data, SaaS approach is ideal for unifying network data at scale and providing the compute power to achieve these benefits. 
If you’re interested in learning more about how Kentik’s big data approach enhances network security and planning, check out the Kentik for DDoS Protection and Defense <a href="/resources/brief-kentik-ddos-detection-and-defense/">solution brief</a>. If you already know that you want to implement a much more cost-effective network security and planning solution for your institutional network, start a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Insight Delivered: The Power of Sankey Diagrams]]><![CDATA[Sankey diagrams were invented by Captain Matthew Sankey in 1898. Since then they have been adopted in a number of industries such as energy and manufacturing. In this post, we will take a look at how they can be used to represent the relationships within network data.]]>https://www.kentik.com/blog/insight-delivered-the-power-of-sankey-diagramshttps://www.kentik.com/blog/insight-delivered-the-power-of-sankey-diagrams<![CDATA[Jim Meehan]]>Wed, 14 Mar 2018 04:00:00 GMT<h3 id="understand-relationships-in-network-traffic-data-using-kentiks-sankey-diagram-visualization"><em>Understand relationships in network traffic data using Kentik’s Sankey diagram visualization</em></h3> <h2 id="seeing-steam">Seeing steam</h2> <img src="https://images.ctfassets.net/6yom6slo28h2/569mbDZxrikcueA8M4OMSQ/aafaee02af09c52a4b5dff104bdc7944/steam-engine-400w.png" class="image right" style="max-width: 400px" alt="Sankey: Thermal Efficiency of Steam Engines" /> <p>Sankey diagrams have been around for more than a century.  The first one, drawn by Captain Matthew Sankey in 1898, was used to show the energy inputs, outputs and efficiency of a steam engine.  In general, a Sankey diagram is a type of flow diagram where the width of the bands represents the proportional quantity of flow distributed over one or more dimensions. In the case of Captain Sankey’s original diagram, the “flow” is steam, and the dimensions are steam production and steam utilization. Sankey diagrams have since been adapted for many other uses to show flows of energy, materials, revenue, and more.</p> <h2 id="visualizing-network-data">Visualizing network data</h2> <p>We often see network data presented in time-series visualizations — line graphs that indicate how traffic volume varied over time.  Multi-line or stacked graphs can show how traffic is split over a single additional dimension. Time-series displays can’t easily show the traffic volume relationships among multiple dimensions, though, and this is where Sankey diagrams really shine as a way to visualize traffic flows in IP networks.</p> <p>Showing how traffic volume is distributed over multiple dimensions makes it easy to spot significant relationships between different aspects of the traffic, for example, how traffic sources are related to traffic destinations.  Kentik has made wide use of the Sankey visualization in our network traffic analysis platform, starting with our Peering Analytics feature, and later as a general-purpose visualization type in the Data Explorer.</p> <div as="Promo"></div> <img src="https://images.ctfassets.net/6yom6slo28h2/2C7RYwR8LWAyi0MQUeyUS/212751fa1e1f07cc20cd4171e98550ef/query-pane-283w.png" class="image right" style="max-width: 180px" alt="Query pane" /> <p>To understand how Sankey diagrams work in Kentik, let’s look at a few examples.  To start, we’ll look at a relatively simple example showing traffic measured in bits per second (bps), broken down by source IPs, destination IPs and services (protocol / port).</p>
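<p>Nothing about the technique itself is proprietary, which makes it easy to see what a Sankey of flow data is doing conceptually. As a purely illustrative sketch (not Kentik’s implementation), here is how the same three-dimension view could be drawn by hand in Python with the open source plotly library, using made-up flow tuples:</p> <pre><code># Illustrative only: hand-rolling a Sankey of flow volumes with plotly,
# not Kentik's implementation. The flow tuples below are made up.
import plotly.graph_objects as go

# (source IP, destination IP, service, Mbps)
flows = [
    ("10.0.6.3", "192.0.2.10", "tcp:443", 40.0),
    ("10.0.6.3", "192.0.2.22", "tcp:22", 5.0),
    ("10.0.7.9", "192.0.2.10", "tcp:443", 25.0),
]

labels = []
def node(label):
    # One node index per unique value in each dimension column
    if label not in labels:
        labels.append(label)
    return labels.index(label)

sources, targets, values = [], [], []
for src, dst, svc, mbps in flows:
    # Band 1: source IP to destination IP
    sources.append(node("src " + src))
    targets.append(node("dst " + dst))
    values.append(mbps)
    # Band 2: destination IP to service
    sources.append(node("dst " + dst))
    targets.append(node(svc))
    values.append(mbps)

fig = go.Figure(go.Sankey(
    node=dict(label=labels),
    link=dict(source=sources, target=targets, value=values),
))
fig.show()
</code></pre> <p>Inside Kentik, none of this code is necessary; the portal builds the diagram for us. 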
To set that up, we’ll go to <em>Data Explorer » Explorer View</em>, click anywhere in the Group By Dimensions selector in the Query pane of the Sidebar, and choose the following dimensions:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/15zifbJEpG0qyqqa8CswKq/e697b395f846896ee359949a91d1f577/metrics-280w.png" class="image right" style="max-width: 180px" alt="Metric drop-down" /> <p>Then move down to the Metrics drop-down menu and select <strong>bits/s</strong>:</p> <p>In the upper right-hand corner (above the chart), we choose the visualization type:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5f16zHq1cWmQeU2yOkyKok/7ca36757544735147180f7d36a0b56b8/visualization-type-250px.png" class="image center" style="max-width: 280px" alt="visualization type" /> <p>We will leave the rest of the panes with their default settings and click the blue Run Query button; Data Explorer returns a Sankey diagram that looks like this:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/nKcKq2exkO4eOi4uqQWk/fbe9d93a779c9d9e3dc6e9e06c8b9bf8/sankey-animation.gif" class="image center" style="max-width: 906px" alt="Animated Sankey diagram" /> <p>You’ll notice that the order of the dimensions (left to right) matches the order that was set in the Group By Dimensions selector — in this case, Source IP/CIDR, Destination IP/CIDR and Destination Protocol:IP Port.  The connecting bands make it easy to see, for example, which destinations the source IP 10.0.6.3 was talking to, which services were in use for those connections, and the relative traffic volume represented by the width of the band.  Visualizing network traffic this way makes it very easy to see top contributors for use cases like pinpointing root causes of congestion. Mousing over the different parts of the diagram highlights individual components, and reveals discrete traffic volumes in a tooltip.</p> <h2 id="understanding-cost-per-customer">Understanding cost per customer</h2> <p>For companies that sell IP transit services, questions often come up about the load individual customers place on different parts of the network, especially in relation to revenue from each customer. Specifically, operators want to know where customers’ traffic enters the network, where it exits, which adjacent network(s) it exits to, and how far it is carried. For these providers, the answers lead directly to the cost per customer.</p>
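<p>Mechanically, that calculation is a grouped aggregation over enriched flow records: group by customer, ingress point and egress point, sum the volume, and weight each group by what it costs to haul traffic along that path. A simplified sketch of the idea in Python (the record fields and cost figures are hypothetical, purely for illustration):</p> <pre><code># Illustrative sketch: rolling enriched flow records up to a per-customer
# transport cost. Record fields and the cost model are hypothetical.
from collections import defaultdict

flows = [
    # (customer, ingress site, egress site, exit ASN, GB carried)
    ("Customer XYZ", "NYC", "LAX", "AS174", 120.0),
    ("Customer XYZ", "NYC", "ORD", "AS3356", 30.0),
    ("Customer ABC", "SJC", "SJC", "AS2914", 200.0),
]

def cost_per_gb(ingress, egress):
    # Hypothetical rate card: local hand-off vs. long-haul transport
    return 0.002 if ingress == egress else 0.015

totals = defaultdict(float)
for customer, ingress, egress, exit_asn, gb in flows:
    totals[customer] += gb * cost_per_gb(ingress, egress)

for customer, cost in sorted(totals.items()):
    print(f"{customer}: ${cost:.2f}")
</code></pre> <p>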
A similar use case appears in enterprise networks, to understand how different departments, groups or services utilize the network.</p> <p>To illustrate this, we will use the following <strong>Group-By Dimensions</strong> in the Kentik Data Explorer:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4D8F0VqZFu04YEOWaOOO2K/7d9fe0ea853f98ca1ed7005822d8749e/cpc-options-229w.png" class="image center" style="max-width: 228px" alt="Cost per customer" /> <p>Once we run this query, we get a visualization like the one below, which makes it easy to see exactly where Customer XYZ’s traffic enters the network, where it exits, and which adjacent network it exits to.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4UWQ4HgCvuegkQygEym4Aw/79d53dc34072e8f2d97833827545cced/customerxyz-traffic-1000w.png" class="image center" style="max-width: 800px" alt="Customer traffic example" /> <h2 id="real-time-situational-awareness">Real-time situational awareness</h2> <p>Another powerful use for Sankey diagrams is to visualize attack traffic entering a network, showing exactly where DDoS traffic is sourced from, how it entered the network, who’s targeted, and the type of traffic. As an example, let’s take a look at the following <strong>Group-By Dimensions</strong>:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2Y9RmLMdMs6euKYiiyG04q/3f208857801eda368df6c8deeda698ed/real-time-options-229w.png" class="image center" style="max-width: 228px" alt="Group-by dimensions" /> <p>Once we click the blue <strong>Run Query</strong> button, we get a chart like the one below.  This kind of visualization provides instant situational awareness about how the attack is affecting the network and informs fast decision making about the appropriate response.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1wbxgdXG6YwyYSaCqa2u08/90cc6b73f3718619778f4889b2308f6b/real-time-sankey-986w.png" class="image center" style="max-width: 986px" alt="" /> <h2 id="troubleshooting-data-center-congestion">Troubleshooting data center congestion</h2> <p>Many data center network teams need to look at east/west traffic across data center fabrics to see flows between services, servers, racks, or pods in addition to traffic across the external border out to the Internet. A Sankey diagram can provide instant visibility when troubleshooting data center congestion issues. To show an example of this, let’s use these <strong>Group-By Dimensions</strong>:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/2JvwxSP8dGEqCiCWWgeaeO/f03769baa1a5eadb34fe4d9fef4f5960/group-by-dimensions-228w.png" class="image center" style="max-width: 228px" alt="Troubleshooting" /> <p>A query like this will result in a visualization that looks something like the image below.  Top congestion contributors are easy to spot because they stand out as the biggest bars.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4KvuCOEck0iwOMs8uyCS28/bf0f98c80b9b965b6894913ec01ebc96/data-congestion-1000w.png" class="image center" style="max-width: 800px" alt="Congestion" /> <h2 id="workload-network-impact">Workload network impact</h2> <p>Data center operators also frequently want to understand which services are the top traffic producers and consumers. This is important for understanding how compute or storage workloads contribute to network load, and for pinpointing root causes of data center congestion hotspots. In this example, we are using Kentik’s Custom Dimensions feature, which maps business-specific data like service names onto network flow data.</p>
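<p>Conceptually, a custom dimension is just a lookup table applied to every flow at ingest. A minimal sketch of the idea (the mapping and field names here are invented for illustration; in Kentik the enrichment happens automatically at scale):</p> <pre><code># Illustrative sketch: a "custom dimension" is a lookup applied to each flow.
# The mapping and field names are invented for the example.
service_by_ip = {
    "10.1.0.5": "checkout-api",
    "10.1.0.9": "search",
    "10.2.0.4": "object-storage",
}

flows = [
    {"src": "10.1.0.5", "dst": "10.2.0.4", "bits": 8_000_000},
    {"src": "10.1.0.9", "dst": "10.2.0.4", "bits": 2_500_000},
]

for flow in flows:
    # Tag each flow with business-level service names for later group-bys
    flow["src_service"] = service_by_ip.get(flow["src"], "unknown")
    flow["dst_service"] = service_by_ip.get(flow["dst"], "unknown")

for flow in flows:
    print(flow["src_service"], "->", flow["dst_service"], flow["bits"], "bits")
</code></pre> <p>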
(Check out our <a href="https://kb.kentik.com/Cb06.htm#Cb06-Custom_Dimensions">Knowledge Base article</a> for details on Custom Dimensions.) For this visualization, we’ll use the following <strong>Group-By Dimensions</strong>:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/6d0W7xfmZG8Qe6gCoEG46E/b68c5dc2dcb8e84ad1e07e8e0137d4fb/workload-network-impact-options.png" class="image center" style="max-width: 228px" alt="Group-by dimensions" /> <p>Once we have defined these Custom Dimensions and built our query as described in previous sections, we get a visualization that looks similar to this:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/5yAwbSR5722u6wWQ2KG2sQ/dcb791875b7c870dfb3a8ccf36021355/workload-network-impact-sankey-957w.png" class="image center" style="max-width: 957px" alt="Visualizations" /> <h2 id="geolocation-relationships">Geolocation relationships</h2> <p>Understanding how traffic maps to geographic locations can be essential for network planning, user demographic analysis, and even product development and pricing.  Kentik includes a built-in geolocation database, which automatically tags every flow with source and destination country, region, and city. By picking the <strong>Group-By Dimensions</strong> below (all Destination type) we can create a “traffic roadmap” for a specific remote ASN:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/4AVTGfF2W4WoCkwaMAK8Wi/62762df80ae4db259ad396e7c48fee8d/geo-location-226w.png" class="image center" style="max-width: 228px" alt="Group-by dimensions" /> <p>After running the query, the resulting visualization would look something like this:</p> <img src="https://images.ctfassets.net/6yom6slo28h2/1Jl9xyfdrq2oCCqkquw84k/daec700ad236b7f6a06f13bc5b509672/final-1000w.png" class="image center" style="max-width: 1000px" alt="Resulting Sankey Visualization" /> <p>Not already a Kentik customer? Start leveraging the powerful visualizations in Kentik today by <a href="#demo_dialog">requesting a demo</a> or signing up for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Massive Scale Visibility Challenges Inside Hyperscale Data Centers]]><![CDATA[The adoption of hyperscale principles presents data center operators with massive scale visibility challenges. Yet it's inevitable. In this post, ACG analyst Stephen Collins discusses why operators need to prepare for the inevitable by acquiring and mastering the new tools needed to assure the performance, reliability and security of these applications.]]>https://www.kentik.com/blog/massive-scale-visibility-challenges-inside-hyperscale-data-centershttps://www.kentik.com/blog/massive-scale-visibility-challenges-inside-hyperscale-data-centers<![CDATA[Stephen Collins]]>Wed, 14 Mar 2018 01:56:42 GMT<img src="//images.ctfassets.net/6yom6slo28h2/67YGW7QYowu8CKkiUSeIMS/29803c3fc896c1a65bf7be628ba4e70b/1039202_104c-300x169.jpg" class="image right" style="max-width: 300px; margin-bottom: 20px;" alt="" /> <p>Hyperscale data centers are true marvels of the age of analytics, enabling a new era of cloud-scale computing that leverages Big Data, machine learning, cognitive computing and artificial intelligence. Architected to scale up smoothly in order to accommodate increasing demand, these massive data centers are based on modular designs that allow operators to easily add compute, memory, storage and networking resources as needed. 
Yet massive scale creates network visibility challenges unlike those faced by operators of existing enterprise data centers based on the classic three-tier architecture.</p> <p>Hyperscale data centers achieve massive scale by racking and stacking cost-effective, commodity hardware platforms like those specified by the Open Compute Project. With thousands of servers based on multicore processors, each offering up to 32 CPU cores with the latest Intel Xeon processors, these data centers have staggering compute capacity. Advanced virtualization techniques enable hyperscale data centers to execute hundreds of thousands to millions of individual workloads.</p> <p><strong>To further complicate matters:</strong> These workloads are highly distributed and dynamic. Container-based applications composed of numerous discrete microservices typically span multiple servers and racks within a data center, and auto-scaling orchestration mechanisms can spin workload instances up and down as needed, consuming available compute capacity on demand and distributing workloads in a virtual topology with little correlation to the underlying physical topology. At the same time, workloads accessing data sets stored on multiple servers within the data center and even on servers in other data centers can generate a high volume of traffic internal to the data center. The net result is constantly shifting and unpredictable “east-west” traffic flow patterns traversing the leaf-spine switching fabric inside and between data centers.</p> <p>In a classic three-tier data center, traffic flows predominantly “north-south” from the ingress/egress point through load balancers, web servers and application servers. In this architecture, it is straightforward to identify bottlenecks and performance anomalies. In hyperscale data centers, the trend is for only about 15% of traffic to flow north-south, while the remaining 85% flows east-west. At the same time, the overall flow of data is increasing such that the link speed of typical spine connections is increasing from 10G to 40G and moving to 100G. The sheer volume of data traffic and the number of flows are such that existing methods for instrumenting and monitoring classic data center networks either won’t work or are not cost-effective in the hyperscale domain.</p> <h4 id="massive-scale-presents-data-center-operators-with-new-types-of-network-visibility-and-performance-management-challenges">Massive scale presents data center operators with new types of network visibility and performance management challenges.</h4> <p>With thousands of servers interconnected via a leaf-spine switching architecture and traffic flowing predominantly east-west, the resulting network topologies are so complex that most operators are employing BGP routing within the data center and equal-cost multi-path routing (ECMP) to select the optimal path for each flow. Operators need new tools for gaining visibility into these topologies and for continuously monitoring traffic flows in order to discover bottlenecks and detect anomalies rapidly enough to take corrective action in real time.</p> <p>It is also critical that operators be able to gain visibility into application dependencies external to the data center. How is traffic flowing to other data centers? Which services or microservices are being accessed? Is the performance of these connections impacting the application? Is traffic engineering required to improve the performance of Internet connections or the data center interconnect? 
Or perhaps traffic engineering is needed to ensure the resiliency of these connections in the event of connectivity failures or performance anomalies? Data center operators need new tools in order to address these challenges.</p> <p><strong>The bottom line is:</strong> The adoption of hyperscale principles presents data center operators with massive scale visibility challenges. Yet this evolution is inevitable as businesses race to exploit a new generation of data-intensive, real-time, cloud computing applications for competitive advantage. Therefore, operators need to prepare for the inevitable by acquiring and mastering the new tools needed to assure the performance, reliability and security of these applications.</p><![CDATA[Industry 4.0: What Could Possibly Go Wrong?]]><![CDATA[Industry 4.0 is the next phase in the digitization of the manufacturing sector. In this post, ACG Research's Principal Analyst Stephen Collins looks at the major implications that Industry 4.0 will create for enterprise IT systems, software and network infrastructure, and how to model, monitor and control it all.]]>https://www.kentik.com/blog/industry-4-0-what-could-possibly-go-wronghttps://www.kentik.com/blog/industry-4-0-what-could-possibly-go-wrong<![CDATA[Stephen Collins]]>Thu, 08 Mar 2018 02:53:54 GMT<img src="//images.ctfassets.net/6yom6slo28h2/5jMTbdxqRy8oEK244uGCIY/5fab2d2cd5304d8240e188e885f618f9/35386183_thumb5_690x400-300x174.jpg" alt="35386183_thumb5_690x400-300x174.jpg" class="image right" style="max-width: 300px;" /> <p>This series of blog posts will examine many aspects of ensuring application performance in cloud-scale enterprise WANs utilizing a hybrid of public and private networks. However, in this post and the next, I’d like to take a closer look at the unique challenges of managing network performance inside the new generation of hyperscale data centers supporting the delivery of cloud-scale applications. The scale and performance characteristics of these networks are radically different from those supporting traditional three-tier enterprise data centers, and therefore existing tools and techniques for managing network performance are not adequate.</p> <p>A key benefit of migrating enterprise IT applications to public clouds is that webscalers like Amazon, Google and Microsoft will end up managing the performance of the supporting hyperscale data centers. However, there will always be mission-critical applications that businesses need to deploy in their own private cloud infrastructure, which means these companies will have to invest in acquiring the new tools and skills needed to deploy and manage hyperscale data centers. A prime example is the cloud-scale, data-intensive, cognitive computing infrastructure that will support Industry 4.0 applications. Management consultants coined the term “Industry 4.0” to describe the process of digital transformation in the manufacturing domain. 
McKinsey offers a typical definition:</p> <p><em>“We define Industry 4.0 as the next phase in the digitization of the manufacturing sector, driven by four disruptions: the astonishing rise in data volumes, computational power, and connectivity, especially new low-power wide-area networks; the emergence of analytics and business-intelligence capabilities; new forms of human-machine interaction such as touch interfaces and augmented-reality systems; and improvements in transferring digital instructions to the physical world, such as advanced robotics and 3-D printing.”</em></p> <p>Digital transformation in this sector will have major implications for enterprise IT systems, software and network infrastructure. Industry 4.0 involves modeling and monitoring the physical world in the digital domain using cyber-physical systems. IoT plays a key role in sensing all of the critical elements in the physical world and delivering a continuous feed of sensor data to a Big Data analytics cluster situated in the cloud. Industry 4.0 “smart factories” consist of multiple cyber-physical system modules creating virtual copies of the physical world, monitoring physical processes and working cooperatively with each other in real time.</p> <p><strong>What could possibly go wrong?</strong> No doubt, many things can and will, but let’s focus on the overarching challenge: the massive scale of these operations. It is common for large industrial companies to have multiple factories distributed globally with close ties to key suppliers, which could number in the hundreds. An Industry 4.0 manufacturing operation will involve continuously ingesting huge volumes of sensor data from a vast number of endpoints, typically several orders of magnitude more than even the largest enterprise manages today.</p> <p>Private cloud hyperscale data centers will be needed to support Big Data clusters for processing sensor data in real time, along with a whole host of Industry 4.0 applications for modeling, monitoring and controlling the collection of cyber-physical system modules that comprise the full-scale operation. These data centers will be based on a highly efficient leaf-spine switching architecture with full mesh connectivity, enabling network traffic to flow “east-west” between any two servers with minimal latency. This is proving to be the optimal data center network design for DevOps-based cloud computing applications deployed using vast numbers of containers running individual microservices. In my next blog post, we’ll take a closer look at the unique challenges of managing application performance inside these hyperscale data center environments.</p><![CDATA[Finding Bots with Kentik Detect]]><![CDATA[Kentik Detect now incorporates IP reputation data from Spamhaus, enabling users to identify infected or compromised hosts. 
In this post we look at the types of threat feed information we use from Spamhaus, and then dive into how to use that information to reveal problem hosts on an ad hoc basis, to generate scheduled reports about infections, and to set up alerting when your network is found to have carried compromised traffic.]]>https://www.kentik.com/blog/finding-bots-with-kentik-detecthttps://www.kentik.com/blog/finding-bots-with-kentik-detect<![CDATA[Justin Ryburn]]>Tue, 06 Mar 2018 02:39:56 GMT<h2 id="new-feature-uses-threat-feeds-to-reveal-compromised-hosts"><em>New Feature Uses Threat Feeds to Reveal Compromised Hosts</em></h2> <p>In our previous post, on <a href="https://www.kentik.com/seeing-cdn-traffic-with-kentik-detect/">CDN Attribution</a>, we mentioned that our development team has been hard at work enabling new ways to visualize and investigate network traffic patterns in Kentik Detect®. One of these new features is the ability to find traffic from infected or compromised hosts. What makes this possible is the fact that we now enrich flow records (NetFlow, sFlow, IPFIX, etc.) in Kentik Detect with IP reputation data from our friends at Spamhaus. Based on this information, we’ve exposed two new dimensions that you can use for group-by or filtering: Bot Net CC and Threat List Host. In this post, we’ll walk through a use case for the Bot Net CC dimension, but first let’s look deeper into the threat feed information from Spamhaus and the corresponding dimensions in Kentik Detect:</p> <ul> <li><strong>Bot Net CC</strong>: This dimension is based on a list of IPs where botnet command and control (CC) servers are known to be running (or were running in the past). By looking for IPs that are communicating with these CC IPs, you can identify hosts (or subscribers) in your network that are potentially infected with malware or otherwise participating in botnet activity.</li> <li><strong>Threat List Host</strong>: This dimension includes a broader set of potentially malicious IPs on the Internet. The IPs are identified as malware distribution points, phishing websites, spam sources, etc. When this column is populated in Kentik Detect’s datastore, we identify the specific type of threat associated with the IP. This dimension can help hosting providers identify IPs in their own network that are “doing bad stuff,” but it can also enable any network operator to identify client IPs that are talking to remote IPs on the list.</li> </ul> <h4 id="querying-for-bots">Querying for Bots</h4> <p>To illustrate the use of Bot Net CC, let’s go back to the fictional enterprise from our last post, Pear, Inc. The savvy reader is probably already aware that a popular method of launching Distributed Denial of Service (DDoS) attacks is to infect hosts with malware. These hosts then communicate with a Command and Control (C&#x26;C) server that tells them where to send attack traffic. Pear’s IT Security department wants to see if they have any hosts on their network that have been compromised and are communicating with a C&#x26;C server, thereby participating in a botnet.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5F5GrHyUfKuMEiggWGMGK6/429fa7f9cb0dccac7662038dc37708af/query-for-bots-588w-300x144.png" class="image center" style="max-width: 300px" /> <p>We can check this by building a query in the sidebar panes of the Data Explorer (Data Explorer » Explorer View).</p>
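<p>Under the hood, the logic is simple because the hard work happened at ingest: every flow record was already tagged against the reputation data, so finding infected hosts reduces to a filter plus a group-by. A toy sketch of that logic in Python (the addresses and field names are invented for illustration):</p> <pre><code># Toy sketch of the query's logic: find internal sources whose flows were
# tagged as destined for a known botnet C2 address. All data here is invented.
CC_SERVERS = {"203.0.113.7", "198.51.100.99"}  # stand-in for the reputation feed

flows = [
    {"src": "10.0.6.3", "dst": "203.0.113.7", "bps": 120_000},
    {"src": "10.0.8.14", "dst": "192.0.2.50", "bps": 900_000},
    {"src": "10.0.9.2", "dst": "198.51.100.99", "bps": 45_000},
]

suspects = {}
for flow in flows:
    if flow["dst"] in CC_SERVERS:
        # Group by source IP, summing traffic sent toward C2 servers
        suspects[flow["src"]] = suspects.get(flow["src"], 0) + flow["bps"]

for src, bps in sorted(suspects.items()):
    print(f"{src} sent {bps} bps to known botnet C2 addresses")
</code></pre> <p>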
Using the <strong>Group By Dimensions</strong> selector in the <strong>Query</strong> pane, we set the query to look at two dimensions that, when combined (shown at right), will give us a list of compromised hosts on our network: Source IP/CIDR and Destination Bot Net CC. Also in the sidebar (for details on sidebar settings, refer to the <a href="https://kb.kentik.com/Da03.htm#Da03-Sidebar_Panes">Sidebar Panes</a> article in our Knowledge Base), we set the <strong>Time</strong> pane to look back at the last hour, and in the <strong>Devices</strong> pane we choose <strong>All</strong> so that we can look at our entire network. We’ll also build a filter in the <strong>Filtering</strong> pane (shown at right) that will ignore our normal network traffic and look only at botnet traffic:</p> <img src="//images.ctfassets.net/6yom6slo28h2/49Q47WKj2oyoCEoGWMysoe/679fa321929daf8242012774fad11dca/filtering-588w-300x118.png" class="image center" style="max-width: 300px" /> <ul> <li>Click anywhere in the pane to open the <strong>Filtering Options</strong> dialog.</li> <li>In the <strong>Ad-Hoc Filter Groups</strong> pane, click <strong>Add Filter Group</strong>.</li> <li>Change “Include” to “Exclude”.</li> <li>Change the dimension to <em>Destination Bot Net CC</em>.</li> <li>Click <strong>Save Changes</strong>.</li> </ul> <p>After our query configuration is complete, we click on the <strong>Run Query</strong> button at the top of the sidebar. If there were four infected hosts, the query would return a graph similar to the one below. Based on these results, the IT Security staff at Pear would know which hosts to examine for botnet malware, so that it can be identified and removed.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3UG7RwWwvuCciKqAwaEEmc/b39337cbbc469ab0a0a48cf73c592c4b/runquery-1000w.png" class="image center" style="max-width: 800px" alt="" /> <p>Pear could take this a step further by running this query as a scheduled report that is emailed to Kentik Detect users within the Pear organization who are assigned to the report as subscribers. Reports are based on dashboards. The following steps show how to save the query to a new dashboard and then set up a weekly report:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2sIuKJszqo2meIOi8W0ksa/1d4db8132a6f4803e2737d5b1802b009/add-subscription-700w.png" class="image center" style="max-width: 700px" alt="" /> <ul> <li>From the drop-down <strong>View Options</strong> menu at the upper right of the Data Explorer display area (above the graph), choose <em>Add to Dashboard » New Dashboard</em>. The <strong>Add Dashboard</strong> dialog will open.</li> <li>Enter “Bot Traffic” in the Name field, then click <strong>Add Dashboard</strong> in the lower right corner. The <strong>Add Panel</strong> dialog will open.</li> <li>Leave the dialog settings at defaults (the name of the panel will be “Top Source IP/CIDR, Dest Bot Net CC by Average bits/s”) and click the <strong>Add Panel</strong> button. 
You’ll be taken to the new “Bot Traffic” dashboard.</li> <li>Click <strong>Admin</strong> in the main portal navbar, then choose the <strong>Subscriptions</strong> page from the sidebar.</li> <li>Click the <strong>Add Subscription</strong> button at upper right to open the <strong>Add Subscription</strong> dialog.</li> <li>Fill out the subscription settings as shown in the screenshot above, then click the <strong>Add Subscription</strong> button.</li> </ul> <p>If the Pear IT Security department were to follow the steps above, they would receive an email every Monday with a list of infected hosts covering the previous seven days.</p> <h4 id="proactive-alerting-for-compromised-hosts">Proactive Alerting for Compromised Hosts</h4> <p>Getting a scheduled report is nice, but it would be even more helpful to be notified automatically whenever an infected host is found. By creating an alert policy in Kentik Detect’s alerting system, Pear’s IT department could get a Slack notification in their #itsec channel any time a new host starts talking to a C&#x26;C server. To do this, we would choose <strong>Alerting</strong> from the main portal navbar, then click <strong>Policies</strong> in the sidebar. On the resulting Policies page we click the <strong>Add Policy</strong> button. In the resulting <strong>Add Alert Policy</strong> dialog, we make settings on the following tabs:</p> <ul> <li><strong>General Settings tab</strong>:  We give our policy a name and change the drop-down <strong>Policy Dashboard</strong> menu to <em>Bot Traffic</em>, the dashboard we created in the previous section. Note that we leave <strong>Silent Mode</strong> enabled so that the alert does not activate while the alerting system is determining what’s normal (the baseline). It’s also nice to add a description so that everyone in the organization can see at a glance what this policy is for.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/3bnjV69xNuGAEscU0uwaw6/e8e10bfd0da07e5a40b5ac5b980ada24/add-alert-policy.png" class="image center" style="max-width: 700px" alt="" /> <ul> <li> <p><strong>Dataset tab</strong>: In the <strong>Data Funneling</strong> pane we click anywhere in the <strong>Dimensions</strong> selector to add the same dimensions we used in our original query: Source IP/CIDR and Destination Bot Net CC. We’ll also click in the <strong>Secondary Metrics</strong> selector and choose <em>Bits/second</em>.<img src="//images.ctfassets.net/6yom6slo28h2/BasmDiZ7xeuCuUAIIEY2q/dbce212762341716007fb4f3f5c01287/dimensions-693w.png" class="image center" style="max-width: 693px" alt="" /> Moving down to the <strong>Building Your Dataset</strong> pane, we change the <strong>Maximum Number of Keys Analyzed Per Evaluation</strong> to 100. We leave everything else at default so the system auto-calculates the <strong>Minimum Traffic Threshold</strong>.<img src="//images.ctfassets.net/6yom6slo28h2/6PpxZwagRGw6qoQ6kI0kAy/21b7098ccff5e6a693507f05511cff24/notifications-546w.png" class="image center" style="max-width: 546px" alt="" /></p> </li> <li> <p><strong>Alert Thresholds tab:</strong> We choose a criticality level (e.g. Major) from the list at left, then set the switch at right to Enabled, which reveals a number of panes containing controls. In the <strong>Threshold Configuration</strong> pane, we choose <em>Check If Certain Conditions Exist</em> from the drop-down <strong>This Threshold Will</strong> menu. 
Leave the other settings at their defaults.<img src="//images.ctfassets.net/6yom6slo28h2/4giKMIIgIgIQsyC8Mc2K8I/8fce6fa063d353185e7a41013553fdd9/threshold-config-556w.png" class="image center" style="max-width: 556px" alt="" /> Moving down to the <strong>Conditions</strong> pane, we set a threshold of <em>Greater Than 1 kpps</em>. This will trigger the alarm if any of the hosts on Pear’s network are sending more than 1000 packets per second of traffic to a known Botnet C&#x26;C server.<img src="//images.ctfassets.net/6yom6slo28h2/4348q7qlxmc0kGi004Mea/bd91c7b082b6bf1a6907fbc64e513a77/conditions-687w.png" class="image center" style="max-width: 687px" alt="" /> The final pane that needs our attention is <strong>Notifications</strong>. Click anywhere in the <strong>Send Alert Status Updates To</strong> selector and choose <em>Slack itsec</em>. Note that this notification channel must already have been configured under Alerting » Channels. For more information, check out our KB article on <a href="https://kb.kentik.com/Gc04.htm#Gc04-Add_or_Edit_Channel">creating a notification channel</a>.<img src="//images.ctfassets.net/6yom6slo28h2/3D8X9edLuMkUgEg0y8QkUE/ce885de129e36bc34d73bbe849db2caa/notifications-546w.png" class="image center" style="max-width: 546px" alt="" /></p> </li> <li> <p><strong>Historical Baseline tab</strong>: Leave default settings.</p> </li> </ul> <p>Once all of the required fields have been specified there will be a green check mark on all of the tabs. When we click on the <strong>Add Alert Policy</strong> button, the policy will begin in Silent Mode, studying the normal patterns of traffic. After the Silent Mode end date, the IT Security department will get a Slack notification every time a new host starts talking to a botnet C&#x26;C server. (For more information on setting up policies, you can refer to our <a href="https://kb.kentik.com/Gc02.htm#Gc02-Alert_Policies_Page">Knowledge Base</a> article or our blog post on <a href="https://www.kentik.com/kentik-detect-alerting-configuring-alert-policies/">Configuring Alert Policies</a>.)</p> <h4 id="summary">Summary</h4> <p>This is just a taste of the ways that Kentik Detect’s new Threat Feeds information can enable your company to see compromised hosts on their network that they’ve never been able to see before. This insight can be used for both proactive monitoring and data forensics to protect the company’s infrastructure. If you are an existing customer and would like help setting up or using this feature, get in touch with our <a href="mailto:[email protected]">Kentik support team</a>. If you are not already a Kentik customer, it’s easy to get started: <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Recent memcached Attacks: What Kentik Sees]]><![CDATA[Attackers are using memcached to launch DDoS attacks. Multiple Kentik customers have reported experiencing the attack activity. It started over the weekend, with several external sources and mailing lists reporting increased spikes on Tuesday. 
In this blog post, we look at what our platform reveals about these attacks.]]>https://www.kentik.com/blog/recent-memcached-attacks-what-kentik-seeshttps://www.kentik.com/blog/recent-memcached-attacks-what-kentik-sees<![CDATA[Jim Meehan]]>Thu, 01 Mar 2018 02:31:13 GMT<h3 id="using-kentik-detect-to-analyze-and-respond-to-memcached-attacks"><em>Using Kentik Detect to analyze and respond to memcached attacks</em></h3> <p>Attackers are using an open source software package called memcached to launch DDoS attacks, and multiple Kentik Detect® customers have reported experiencing the attack activity. It started over the weekend, with several external sources and mailing lists reporting increased spikes on Tuesday.</p> <h4 id="memcached-what-is-that">Memcached? What is that?</h4> <p>For background, <a href="https://memcached.org/">memcached</a> is an open source software package that provides an in-memory caching layer often deployed as a component of web application stacks to reduce load on traditional databases for frequently accessed objects. In a typical deployment, memcached would be accessed only by internal application components; however, it appears that thousands of memcached instances are exposed to the Internet at large. The memcached service exhibits several qualities that make it a perfect target for DDoS reflection and amplification abuse: by default it listens on UDP and binds to every interface, and can generate large responses in reply to small requests. In that respect, it’s very much like the DNS and NTP services that have been abused in the past. Attackers send a small request packet to well-connected hosts with memcached running. The request packets have a spoofed source IP of the intended victim. The memcached server then generates a flood of reply traffic toward the spoofed victim IP.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3yeRIiBlrOcqmGMCSAwu0o/c6d61a860a2ae275219533b26a2db1d2/memcache.png" alt="memcache.png" class="image center" style="max-width: 800px;" /> <p>The main difference between memcached attacks and DNS and NTP attacks of the past is the Bandwidth Amplification Factor (BAF). The <a href="https://www.us-cert.gov/ncas/alerts/TA14-017A">US-CERT is reporting</a> BAFs of 10,000 to 51,000 for memcached versus 28-54 for DNS and 556 for NTP. That makes for massive attack potential.</p> <h4 id="what-kentik-sees">What Kentik Sees</h4> <p>We pulled some interesting stats based on anonymized traffic aggregates from Kentik Detect customers who have opted into our data sharing program. First, a timeline of total memcached traffic volume over the last week: <img src="//images.ctfassets.net/6yom6slo28h2/4el0VsSl1ewcMUgkewk204/89e869f3d27d8a58d3e03328793feba0/timeline-1000w.png" alt="timeline-1000w.png" class="image center" style="max-width: 800px;" /></p> <p>There are small blips extending back into the prior week, but attack activity really started in earnest around noon UTC on Feb 24th. We believe that’s attributable to the new attack vector becoming commoditized and available within the for-pay DDoS attack black market. The graph scale above is intentionally omitted, but <strong>multiple Kentik customers saw 100+ Gbps peaks of memcached traffic</strong>. We also looked at unique sources, starting from Feb 24th.</p>
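<p>Why does the number of sources matter so much? Because at these amplification factors, a small pool of reflectors goes a very long way. A quick back-of-the-envelope calculation using the US-CERT BAF range quoted above (the attacker’s spoofed-request capacity is an assumption for illustration):</p> <pre><code># Back-of-the-envelope math using the US-CERT BAF range cited above.
# The attacker's spoofed-request capacity is an assumption for illustration.
request_bps = 100e6  # 100 Mbps of spoofed memcached requests
for baf in (10_000, 51_000):
    attack_bps = request_bps * baf
    print(f"BAF {baf:,}: {request_bps / 1e6:.0f} Mbps of requests "
          f"becomes {attack_bps / 1e12:.1f} Tbps at the victim")
</code></pre> <p>Even allowing for real-world inefficiency, the arithmetic explains how attacks this large can come from so few machines. 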
We can see that a relatively small number of sources are generating a lot of traffic, indicating that the attackers are likely leveraging <strong>well-connected servers with 10G connectivity</strong>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/19jp7HRCxgs8CEO2ya8sUS/3ded1ac2c388833194622278afd6c779/sources-1000w.png" alt="sources-1000w.png" class="image center" style="max-width: 800px;" /> <p>And lastly, a Sankey flow diagram indicating relative memcached traffic volume for source and destination countries over the last few days:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6lkztl0AvKq0QicKWaGK6Y/2743eacf0eebc80fea6e765ebcdea5ef/sankey-memcache-1000w.png" alt="sankey-memcache-1000w.png" class="image center" style="max-width: 800px;" /> <h4 id="two-key-takeaways">Two Key Takeaways</h4> <ol> <li>Most Kentik customers already had policies configured and enabled to detect these attacks. The “VERY HIGH PPS” and “UDP HIGH BPS” policies in the Alert Policy Library both work well for detecting these attacks out of the box. The example below shows an alarm from a memcached reflection attack Tuesday: <img src="//images.ctfassets.net/6yom6slo28h2/116p21xuXkYuemwao8UgGm/1b232cda28c2023c47f2391192dd9902/alarm-1000w.png" alt="alarm-1000w.png" class="image center" style="max-width: 800px;" /></li> <li>As of ~4 pm PST Tuesday, Kentik updated our global “UDP BADPORTS” saved filter to include UDP/11211. Any customer alert policies, saved views, or dashboards that reference this filter will now automatically include memcached traffic. That includes the “UDP BADPORTS ATTACK” policy in the alert library, which can also be used for more targeted detection of these attacks. This will provide our customers with a faster, easier way to see the traffic associated with this type of attack.</li> </ol> <h4 id="summary">Summary</h4> <p>These types of UDP amplification attacks are increasing in frequency and volume. They can be very costly to a business in terms of lost revenue and high Service Level Agreement (SLA) payouts. To see whether the memcached attacks are affecting your network, <a href="#demo_dialog">schedule a demo</a> or sign up today for a <a href="#signup_dialog">free trial</a>.</p> <p>To learn more about how Kentik can help improve network security and protect against DDoS attacks, read our <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a> solution brief.</p><![CDATA[How Limelight Networks Automates Traffic Engineering and Reduces Infrastructure Cost and MTTR]]>https://www.kentik.com/blog/how-limelight-networks-automates-traffic-engineering-and-reduces-infrastructure-cost-and-mttrhttps://www.kentik.com/blog/how-limelight-networks-automates-traffic-engineering-and-reduces-infrastructure-cost-and-mttr<![CDATA[Michelle Kincaid]]>Thu, 01 Mar 2018 01:48:58 GMT<img src="//images.ctfassets.net/6yom6slo28h2/5a1WX9aLRCEqkWSioCKGA4/4bf9877e0a9d46fa19ebb7a03789e602/llnw_dark_logo_500x500px-e1519691784631-300x116.jpg" alt="llnw_dark_logo_500x500px-e1519691784631-300x116.jpg" class="image right" style="max-width: 300px;" /> <p>Most web content and application providers today utilize Content Delivery Networks (CDNs) to ensure the best user experience. Therefore, CDN providers are constantly competing to deliver the fastest websites and software downloads, most responsive applications, highest-quality video, and lowest latency gaming. 
To thrive in the highly competitive CDN marketplace, operators must continuously optimize their globally distributed networks for both performance and cost, while strategically planning capacity and growth.</p> <p>That’s why Limelight turned to Kentik. “At our scale, no commercial solution could meet our needs until we found Kentik,” said Julien Vaught, vice president of network architecture and engineering at Limelight Networks.</p> <p><strong>With network traffic intelligence from Kentik, Limelight achieves:</strong></p> <ul> <li>Significant reduction in MTTR</li> <li>Automation for <a href="https://www.kentik.com/kentipedia/network-traffic-engineering/" title="Kentipedia: Network Traffic Engineering">traffic engineering</a></li> <li>Big cost savings for transport and IP transit</li> </ul> <p>“Kentik has the best feature functionality on the market and has substantially reduced our MTTR for customer, peering, and transit-related issues,” added Vaught. “It’s rare to be able to leverage one product for value in so many different and useful ways across the organization, but with Kentik, we’ve been able to do just that.” <a href="https://www.kentik.com/resources/case-study-limelight/">Read the full Limelight case study.</a></p> <p>Ready to see the ROI for yourself? <a href="#demo_dialog">Schedule a demo</a> or sign up for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Digital Transformation Starts With Managing Digital Disruption]]><![CDATA[While defining digital transformation strategies is a valuable exercise for C-level executives, CIOs and IT managers also need to adopt a pragmatic and more tactical approach to “going digital.” That starts with acquiring tools and building the new skills needed to ensure business success and profitability in the face of digital disruption. In this post, ACG Principal Analyst Stephen Collins looks at how to manage it all.]]>https://www.kentik.com/blog/digital-transformation-starts-with-managing-digital-disruptionhttps://www.kentik.com/blog/digital-transformation-starts-with-managing-digital-disruption<![CDATA[Stephen Collins]]>Wed, 28 Feb 2018 01:29:26 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/5zV889svPqMGsKWOCoEicO/0d15fcacd9194bdf5d3b8431089bb7e6/-300x180.jpg" alt="-300x180.jpg" class="image right" style="max-width: 300px;" />Wikipedia defines digital transformation as “the change associated with the application of digital technology in all aspects of human society” and notes “that digital usages inherently enable new types of innovation and creativity.” Management consultants have been raking in big bucks helping established businesses plot digital transformation strategies to defend against “digital disruptors” poised to enter existing markets and win over customers. While this is a valuable exercise for C-level executives, CIOs and IT managers also need to adopt a pragmatic and more tactical approach to “going digital” that starts with acquiring the tools and building the new skills needed to ensure business success and profitability in the face of digital disruption.</p> <p>As the world becomes pervasively more digital, businesses actually have little control over the timing and phasing of the digital transformation process, which is equal parts proactive and reactive. While it is imperative to develop a plan, a realistic strategy must also factor in the inevitable disruption that will impact various aspects of the business, driven by external forces beyond management’s control. 
Whether management recognizes it or not, the majority of businesses are already being digitally transformed and forced to deal with the effects of digital disruption.</p> <p><strong>Let’s look at some of the typical digital disruption challenges businesses are facing.</strong></p> <p><strong>Competitive forces</strong> have driven many companies to engage with customers online, not only selling products but also providing support and other services. What does a company do if its online operations happen to be hit by a DDoS attack? Will they even know they are under attack?</p> <p><strong>Employees are increasingly mobile</strong> and working remotely over consumer-grade broadband and public WiFi connections. What happens when remote workers start complaining that “the network is slow”? Is it actually a network problem and, if so, whose network?</p> <p>The majority of software vendors are embracing <strong>the SaaS model</strong> and moving their applications to the cloud, and have enticed enterprise customers with compelling pricing models. However, what steps do enterprise IT managers take to determine why a SaaS application has suddenly started performing poorly? What happens after they call the vendor who says that everything looks fine on its end?</p> <p>Many large businesses already have <strong>DevOps teams deploying new microservices-based applications</strong> in public cloud infrastructure. How are these teams able to monitor the performance of these cloud-based applications? How are they able to diagnose network problems that might span the cloud provider’s infrastructure, the public Internet, enterprise networks and service provider access networks?</p> <p><strong>Streaming video</strong> is now widely used in business settings, but video traffic typically traverses multiple paths from its source in the cloud across CDNs and the public Internet to the enterprise user’s device. How does IT operations manage video streaming to ensure end-user quality of experience? How will they be able to rapidly identify and fix performance problems?</p> <p><strong>IoT</strong> is inherently digital and adoption is occurring at a rapid pace as businesses instrument the physical world with sensors spanning a wide range of environments. Yet industrial-scale IoT applications require new tools for cloud-scale visibility in order to ensure performance and security in the face of unexpected digital disruption to the devices, the underlying networks or the supporting infrastructure.</p> <p>If the fuel for digital transformation is software and the engine is commodity hardware, then the public Internet serves as the power distribution grid. For more than two decades, large-scale data centers and private WANs have been the backbone of business operations. Now digital transformation is driving businesses to utilize public networks for much more than e-commerce applications. The Internet readily connects businesses with customers, partners and employees, anywhere in the world. It provides connectivity to public and private clouds for the delivery of both consumer and business applications and content. The deployment of IoT relies on the pervasive nature of Internet connectivity. One could argue that every facet of digital transformation ties back to the Internet. Yet due to its decentralized and distributed nature, the public Internet is particularly vulnerable to enabling and propagating the forces of digital disruption. 
Successfully managing through these disruptions to realize the ultimate benefits of digital transformation requires new tools and techniques for ensuring the delivery of cloud-based applications and services. Too many businesses currently lack these tools because they have only been concerned with managing traditional private data centers and enterprise WANs.</p><![CDATA[The Future of DDoS Protection in an IoT World]]><![CDATA[IoT represents a massive threat to network infrastructure, as seen in widely publicized IoT-based DDoS attacks like Mirai. So what needs to happen to safeguard our devices and networks from participating in these botnet attacks? And how can IoT device originated attacks get quickly identified and stopped by network operators? In this post, we discuss scalable IoT DDoS protection.]]>https://www.kentik.com/blog/the-future-of-ddos-protection-in-an-iot-worldhttps://www.kentik.com/blog/the-future-of-ddos-protection-in-an-iot-world<![CDATA[Ken Osowski]]>Thu, 22 Feb 2018 14:15:42 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/2Ibi1Rtn5muCY8SQkEqOGU/2f1d9f025efcd8760bd571f653d0e53e/ddos-detective.png" alt="ddos-detective.png" class="image right" style="max-width: 300px;" />The Internet of Things (IoT) represents a massive threat to network infrastructure as already seen in widely publicized IoT-based DDoS attacks. As an example of what can happen, the Mirai malware set loose in late 2016 created a botnet of IoT devices that included ordinary consumer devices such as security cameras, routers, and other home use IoT devices purposely designed to take websites and entire networks offline. The scale of the initial Mirai-based attack was eye-opening. The KrebsOnSecurity website came under a sustained DDoS attack in September 2016 from more than 175,000 IoT devices. That one attack maxed out at 620 Gbps, almost twice the size of the next largest attack that Akamai had ever seen!</p> <p>At the end of September 2016, the authors of Mirai released the source code for their botnet. This set the stage for other copycat attacks. Some Mirai botnets grew quite large and were used to launch devastating attacks, including one on October 21st, 2016 that waged an attack against Domain Name Service (DNS) firm Dyn that disrupted Twitter, Netflix, Reddit and a host of other major sites. Another Mirai botnet variant was used in extortion attacks against a number of banks and Internet service providers in the United Kingdom and Germany. Justice was ultimately served with the Mirai co-creators pleading guilty to charges of using their botnet to conduct click fraud—a form of online advertising fraud that cost Internet advertisers more than $16 billion. Based on an annual report from the Spamhaus Project, there has been a 32% increase in botnet controllers in 2017. So what needs to happen to safeguard our devices and networks from participating in these botnet attacks? And how can IoT device originated attacks get quickly identified and stopped by network operators?</p> <p><strong>Consumer IoT Vulnerabilities Still Widespread</strong> Industrial IoT deployments have their vulnerabilities but not to the extent that consumer-driven IoT usage does. In industrial IoT deployments, secure methods and procedures deployed by dedicated network operations staff attempt to ensure that these devices are not compromised. 
This is not to say that industrial IoT deployments are always secure, but consider the typical lack of discipline used to set up consumer IoT devices such as video cameras, thermostats, lighting, switches, smart speakers and TVs in our homes, all of which are potentially vulnerable to compromise and use in DDoS attacks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/hazbHfMsDekQIUUMCOWO2/b035dda5943c048d2e117dbcf0973382/34-AM.png" alt="34-AM.png" class="image right" style="max-width: 300px;" /> <p>I currently have 30 or so IP addresses in my home associated with consumer IoT devices and a Comcast connection that is just over 100 Mbps downstream and 12 Mbps upstream. If I fail to configure my firewall correctly, or protect my network with no password or a weak one, a bad actor could get into some of my IoT devices and I could unwittingly become part of a botnet-based attack. My upstream speed and IoT device count alone may not be high enough to stage a high-volume DDoS attack, but there are 1,100 doors in my housing development with the potential for the same vulnerabilities in place. With consumer IoT device popularity on the rise and residential broadband providers like Comcast, Charter, AT&#x26;T, Google Fiber and others offering 1 Gigabit Internet access to tens of millions of homes, the next Mirai-style attack could be massive in volume. And full-duplex DOCSIS 3.1 cable network technology makes it possible to offer 500 Mbps upstream services (on a 1 Gbps service) that, in aggregate across millions of homes, would dwarf the bandwidth of the original Mirai botnet’s attack.</p> <p><strong>Kentik’s Scalable <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">IoT DDoS Protection</a></strong> Kentik’s adoption of a big data architecture is at the core of its network monitoring and DDoS protection platform, Kentik Detect®. This brings some real advantages for IoT DDoS protection, including detection and mitigation, because big data is not only about handling large volumes of data, but also letting network operations staff make sense of that data very quickly and take action without human intervention if necessary. Key advantages include:</p> <ul> <li><strong>Threat Feeds</strong> – In an integration with one of the industry’s leading independent threat intelligence providers, the Spamhaus Project, Kentik Detect uses comprehensive IP reputation data to tag traffic associated with suspected malicious sources or destinations. Hosts are identified that may be infected with malware, participating in botnets, or engaged in other undesirable activity, and can be flagged for customer notification, clean-up, removal from the network, or other security enforcement.</li> <li><strong>Scalable, accurate DDoS protection</strong> – A10 Networks and Kentik together provide a highly scalable DDoS protection and analytics solution using Kentik Detect’s real-time, automated triggering of A10 Thunder TPS mitigation, to stop hard-to-detect multi-vector DDoS attacks. Also, Kentik Detect can use Radware’s DefensePro platform to provide attack mitigation at network throughputs up to 300 Gbps with up to a 230M PPS attack prevention rate. 
<p><strong>Kentik’s Scalable <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">IoT DDoS Protection</a></strong> Kentik’s adoption of a big data architecture is at the core of its network monitoring and DDoS protection platform, Kentik Detect®. This brings some real advantages for IoT DDoS protection, including detection and mitigation, because big data is not only about handling large volumes of data, but also about letting network operations staff make sense of that data very quickly and take action without human intervention if necessary. Key advantages include:</p> <ul> <li><strong>Threat Feeds</strong> – In an integration with one of the industry’s leading independent threat intelligence providers, the Spamhaus Project, Kentik Detect uses comprehensive IP reputation data to tag traffic associated with suspected malicious sources or destinations. Hosts that may be infected with malware, participating in botnets, or engaged in other undesirable activity are identified, and can be flagged for customer notification, clean-up, removal from the network, or other security enforcement.</li> <li><strong>Scalable, accurate DDoS protection</strong> – A10 Networks and Kentik together provide a highly scalable DDoS protection and analytics solution, using Kentik Detect’s real-time, automated triggering of A10 Thunder TPS mitigation to stop hard-to-detect multi-vector DDoS attacks. Also, Kentik Detect can use Radware’s DefensePro platform to provide attack mitigation at network throughputs up to 300 Gbps with up to a 230M PPS attack prevention rate. Kentik Detect can also automatically invoke mitigation methods using Radware’s Cloud DDoS Protection Services, which support over 2 Tbps of mitigation capacity.</li> <li><strong>Full data retention, deep IoT analytics</strong> – Kentik’s big data solution doesn’t create summaries or roll-ups that discard network traffic details. Instead, raw data is retained unsummarized, and exploratory analytics enable DDoS traffic patterns to be recognized that might otherwise go unnoticed as IoT infrastructure is built out.</li> <li><strong>Adaptive baselining and anomaly detection</strong> – Big data enables automated tracking of dozens of traffic dimensions to determine which should be baselined and measured for anomalies (a toy illustration of the idea follows this list). This enables far more accurate detection and notification by making the system responsive to organic changes in IoT network infrastructure and traffic patterns, which makes it easier to distinguish IoT-based DDoS threats from normal traffic.</li> <li><strong>Custom Dashboards</strong> – Kentik Detect’s Custom Dashboard feature enables users to quickly make sense of the large volumes of data generated by IoT devices. By creating custom panels that visualize the data in the way that makes the most sense to the user, better insight can be gained into IoT network traffic patterns that may represent a security threat.</li> </ul>
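<p>The baselining concept is simple to illustrate. The sketch below tracks a single traffic series with an exponentially weighted moving average and flags samples that deviate far from it. To be clear, this is a generic toy example of the technique, not Kentik Detect’s actual implementation, which baselines many dimensions with adaptive policies:</p> <pre><code class="language-python">
# Toy baseline-plus-anomaly detection on one traffic series (illustrative only).

def detect_anomalies(samples_mbps, alpha=0.1, threshold=3.0):
    """Flag samples that deviate from an exponentially weighted baseline.

    alpha:     smoothing factor for the running mean/variance
    threshold: deviation (in sigmas) that counts as anomalous
    """
    mean, var = samples_mbps[0], 0.0
    alerts = []
    for i, x in enumerate(samples_mbps[1:], start=1):
        sigma = max(var ** 0.5, 1e-9)
        if i > 10 and abs(x - mean) > threshold * sigma:  # skip warm-up
            alerts.append((i, x, mean))
        diff = x - mean               # update the baseline (EWMA mean/variance)
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return alerts

# Steady ~100 Mbps traffic with a sudden flood starting at sample 60
series = [100.0 + (i % 7) for i in range(60)] + [900.0] * 5
for idx, value, baseline in detect_anomalies(series):
    print(f"sample {idx}: {value:.0f} Mbps vs baseline ~{baseline:.0f} Mbps")
</code></pre>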
<p>To see how Kentik can help your organization analyze, monitor, and react to IoT DDoS threats, <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[What’s Driving Advanced Network Analytics Adoption? Instant ROI]]><![CDATA[A new EMA report on network analytics is full of interesting takeaways, from reasons for deployment to use cases after analytics are up and running. In this post, we look specifically at the findings around adoption to see why and how organizations are leveraging network analytics.]]>https://www.kentik.com/blog/why-is-advanced-network-analytics-adoption-on-the-rise-instant-roihttps://www.kentik.com/blog/why-is-advanced-network-analytics-adoption-on-the-rise-instant-roi<![CDATA[Michelle Kincaid]]>Mon, 12 Feb 2018 14:15:29 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/mvGgK5CKUoqoK0uceSW0C/7fd06cc95b9d6e7ef8ef26b1b56beb47/roi.png" alt="roi.png" class="image right" style="max-width: 300px; padding: 15px; margin-left: 20px; margin-bottom: 20px;" />EMA analyst Shamus McGillicuddy recently published a <a href="http://www.enterprisemanagement.com/research/asset.php/3549/Advanced-Network-Analytics:-Applying-Machine-Learning-and-More-to-Network-Engineering-and-Operations-">report on the state of advanced network analytics</a>. The report is full of interesting takeaways, from reasons for deployment to use cases after analytics are up and running. In this post, however, we look specifically at the findings around adoption to see why and how organizations are leveraging network analytics. As background, Kentik sponsored the EMA report, so we supported Shamus in coming up with questions to ask the 200 survey respondents who drove the report’s findings. All respondents qualified for the survey by “being directly involved in a network analytics initiative” within their company.</p> <h4 id="whats-driving-network-analytics">What’s driving network analytics?</h4> <ul> <li>IT automation (34%)</li> <li>Data center consolidation (26%)</li> <li>External cloud (IaaS, SaaS, PaaS, etc.) (23%)</li> <li>Business-and-IT-driven digital transformation (21%)</li> <li>Digital experience management/optimization (21%)</li> </ul> <p>Software-defined data center, SD-WAN, and the convergence between network operations and security operations were other notable drivers, according to the survey. So why did automation rise to the top? Shamus notes:</p> <p><em>“Today, network automation is often limited to repeatable low-level tasks that require limited insight and understanding of network state. Unique high-level tasks usually require human intervention. However, as new solutions apply machine learning and other advanced heuristic techniques, they can gain an unprecedented understanding of the network. Analytics solutions can then feed this insight to automation tools, giving them the necessary intelligence to take on some of those complex, unique tasks without human intervention.”</em></p> <p>Looking at the bigger picture, more organizations now have critical infrastructure running outside of their direct control due to digital transformation trends like automation, consolidation, and cloud. While these initiatives may speed delivery timelines and reduce cost, they can also add complexity to an organization’s network, which, in turn, creates a greater need for analytics that unlock real-time visibility across environments and applications.</p> <h4 id="top-5-benefits-of-network-analytics">Top 5 benefits of network analytics</h4> <p>EMA survey respondents reported that network analytics bring ROI in the following ways:</p> <ul> <li>Faster time to repair network problems (69%)</li> <li>More efficient use of network capacity (60%)</li> <li>Proactive ability to prevent network problems (56%)</li> <li>Optimized capacity for service delivery (51%)</li> <li>Better correlation between network change and network health &#x26; performance (50%)</li> </ul> <p>In short, network analytics are driving cost savings, speed, scale, and optimized user experience — all of which, in the end, drive greater profitability.</p> <h4 id="overcoming-the-obstacles-to-adoption">Overcoming the obstacles to adoption</h4> <p>While budget constraints typically top the list of barriers to new technology adoption, it was interesting to see respondents say that risk is more of an obstacle for deploying analytics. Here are the top five reported barriers to entry:</p> <ul> <li>Risk regarding objection from security and/or compliance team (32%)</li> <li>Risk regarding data sovereignty conflict (22%)</li> <li>Lack of budget (21%)</li> <li>Lack of leadership - unsupportive management (20%)</li> <li>Politics - conflict about who should own and share the data (20%)</li> </ul> <p>In our industry, it’s not uncommon to see these barriers to adoption. In fact, Kentik’s co-founders previously ran some of the biggest network infrastructure for companies like Akamai, Netflix, and Cloudflare, so they know firsthand what can slow down deployments.
That’s why our founders built Kentik with:</p> <ul> <li><strong>Immediate ROI</strong>: <a href="https://www.techvalidate.com/product-research/kentik/facts/35B-3E9-D7E">Here’s why our customers say Kentik drives the fastest ROI.</a></li> <li><strong>Security in mind</strong>: <a href="https://www.kentik.com/why-your-netflow-is-safe-in-the-cloud/">Here are 5 reasons Kentik Detect is secure</a>.</li> <li><strong>Strong APIs for cross-team collaboration</strong>: <a href="https://www.techvalidate.com/product-research/kentik/facts/B5D-2AA-CD2">Here’s a customer testimonial.</a></li> </ul> <p>Here are a few examples of how our customers are benefiting from our network analytics platform:</p> <ul> <li><strong>Pandora</strong> drives superior network performance (<a href="https://www.youtube.com/watch?v=ktXv1sKHzfU">video</a>)</li> <li><strong>GTT</strong> enhances visibility and security while adding profit (<a href="https://www.kentik.com/resources/case-study-gtt/">case study</a>)</li> <li><strong>DreamHost</strong> protects its service performance and brand reputation (<a href="https://www.kentik.com/resources/case-study-dreamhost/">case study</a>)</li> </ul> <p>If you want to hear more about network traffic intelligence from Kentik, get a demo or sign up for a <a href="#signup_dialog">free trial</a>. You can also download the full EMA report at: “<a href="http://www.enterprisemanagement.com/research/asset.php/3549/Advanced-Network-Analytics:-Applying-Machine-Learning-and-More-to-Network-Engineering-and-Operations-">Advanced Network Analytics: Applying Machine Learning and More to Network Engineering and Operations</a>.”</p><![CDATA[Seeing CDN Traffic with Kentik Detect]]><![CDATA[CDNs have been around for years, but they’ve gained new importance with the rise of video streaming services like Netflix and Hulu. As traffic from those sites soars, CDNs introduce new challenges for network operations teams at both service providers and enterprises. Kentik Detect's new CDN Attribution makes identifying and tracking CDN traffic a whole lot easier. In this blog, we provide examples of how companies can implement this functionality.]]>https://www.kentik.com/blog/seeing-cdn-traffic-with-kentik-detecthttps://www.kentik.com/blog/seeing-cdn-traffic-with-kentik-detect<![CDATA[Justin Ryburn]]>Thu, 08 Feb 2018 14:15:27 GMT<h3 id="cdn-attribution-reveals-traffic-source-for-enterprises-and-sps"><em>CDN Attribution Reveals Traffic Source for Enterprises and SPs</em></h3> <p>As you can tell from our <a href="https://www.kentik.com/product-updates/">Product Updates page</a>, new features keep rolling out from the Kentik Detect® development team, giving our customers new ways to visualize, monitor, and respond to their network traffic. In recent months it’s been a challenge to give every new component the attention it deserves, so in the next couple of posts we’ll circle back to some recently added capabilities that are worthy of further exploration. By looking at use cases for each of these features, we hope you’ll get a feel for how they help both enterprises and service providers (SPs) to gain deeper visibility into their network traffic. This time we’ll take on CDN Attribution, and next time Threat Feeds.</p> <p>Content Delivery Networks (CDNs) have been around for years, but they’ve gained new importance with the skyrocketing popularity of video streaming services like Netflix, Amazon Video, and Hulu.
As traffic from those sites soars to a whole new level, CDNs help deliver the content and maintain a good experience for the consumer. But they also introduce new challenges for network operations teams at both service providers and large enterprises.</p> <h4 id="seeing-cdn-traffic">Seeing CDN Traffic</h4> <p>Kentik Detect’s CDN Attribution feature helps by enabling operators to better understand traffic that is associated with CDNs. To see how this works, let’s suppose that we have a fictional enterprise called Pear, Inc. Pear’s IT department wants to be able to see if employees are streaming content from CDNs across their network. They would also like to be able to rate-limit that traffic so that business traffic, which is a higher priority, has a better shot at the bandwidth available on the network.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/53wkai0NQA6USMo80AyGmC/2ab7391d279cf4930b7a3ac6b0bfcbd3/query1-594w.png" alt="query1-594w.png" class="image right" style="max-width: 300px;" />To accomplish these goals with Kentik Detect, we start by building a query in the sidebar panes of the portal’s Data Explorer (Data Explorer » Explorer View). The query will look at two of the dimensions that are stored with each flow record that we ingest into our datastore: Source CDN and Destination IP/CIDR.</p> <p>Still in the sidebar, we set the Time pane to look back at the last hour (for details on settings in the Data Explorer sidebar, see the <a href="https://kb.kentik.com/Da03.htm#Da03-Sidebar_Panes">Sidebar Panes</a> article in our Knowledge Base). In the Devices pane, meanwhile, we’ll choose “All” so we can look at traffic across the entire network.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/13g9ipcwakukU6OUYogIAa/ffa2311ada76af2da18acc33f93ffbd6/filtering-exclude-598w.png" alt="filtering-exclude-598w.png" class="image right" style="max-width: 300px;" />We also set filters in the Filtering pane (shown at right) so that the results will show only flows that come from a CDN:</p> <ul> <li>Click anywhere in the pane to open the Filtering Options dialog.</li> <li>In the Ad-Hoc Filter Groups pane, click Add Filter Group.</li> <li>Change “Include” to “Exclude.”</li> <li>Change the dimension to Source CDN.</li> <li>Click Save Changes.</li> </ul> <p>Once our query is configured and we click Run Query at the top of the sidebar, the query should return a graph that looks something like the one below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/504PjMvs1G4UkKkK8USImq/586bc4ac9814c65a5db66742462b50c3/top-source-cdn-1056w.png" alt="top-source-cdn-1056w.png" class="image center" style="max-width: 1000px; padding: 20px;" /> <p>Armed with this information, Pear’s IT department would be able to see which hosts on their network are pulling content from each CDN. And they could also export the results in a .csv file to make it easier to configure rate-limiting; from the Options menu at the upper right, above the chart, choose Export » Legend Data (CSV).</p>
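<p>For teams that would rather script this kind of report than click through the portal, the same question can be asked programmatically. Here’s a minimal sketch against Kentik Detect’s SQL query API; the endpoint and auth headers follow the v5 API conventions, but treat the table and column names as illustrative and check the Knowledge Base for the schema available on your account:</p> <pre><code class="language-python">
# Minimal sketch: top source CDNs by bytes over the last hour, via Kentik
# Detect's SQL query API. Endpoint and headers per the v5 API docs; table
# and column names are illustrative -- verify them in the Knowledge Base.
import requests

API_URL = "https://api.kentik.com/api/v5/query/sql"
HEADERS = {
    "X-CH-Auth-Email": "user@example.com",    # your portal login
    "X-CH-Auth-API-Token": "YOUR_API_TOKEN",  # from your user profile
}

SQL = """
SELECT src_cdn, SUM(in_bytes) AS total_bytes
FROM all_devices
WHERE ctimestamp > 3600   -- relative lookback in seconds (last hour)
GROUP BY src_cdn
ORDER BY total_bytes DESC
LIMIT 10
"""

resp = requests.post(API_URL, headers=HEADERS, json={"query": SQL})
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
</code></pre>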
<h4 id="monitoring-for-cdn-cache-fill">Monitoring for CDN Cache Fill</h4> <p><img src="//images.ctfassets.net/6yom6slo28h2/Zqzz006DMQC8OqCqKuiGa/3c0d3463628724518a4a723f029f50a8/query2-596w.png" alt="query2-596w.png" class="image right" style="max-width: 300px;" />A lot of SPs host a CDN caching server at the edge of their network to reduce the amount of content traffic that flows across their backbone. To make this work, the CDN must push new content to the cache as it becomes available. This is known as CDN Cache Filling. It’s usually done during off hours when other network traffic is low, though SPs can request that it be done at a different time of day.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/5Umx5S7XHiiu8oAeSs608y/979976ef083dae2fa5f58b2335743057/filtering-include-594w.png" alt="filtering-include-594w.png" class="image right" style="max-width: 300px;" />Monitoring cache-fill traffic is an important task for networking teams at these SPs. To do that, we can once again build a query in Data Explorer that looks at Source CDN and Destination IP/CIDR (shown above right). This time we will filter the traffic to look only at flows to our CDN caching server, 17.0.0.1/32, and only from the CDN named Netflix (shown at right).</p> <p>Looking at this traffic across all devices, but this time with Lookback set to “Last 1 Day,” the returned graph will now show us the traffic from the cache-filling operation, which should look similar to the graph below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4wP9f92cV26wcW0CsCWqgO/78ffd8b8e4044dd4ddb1381b49656d59/last-one-day-1056w.png" alt="last-one-day-1056w.png" class="image center" style="max-width: 1000px; padding: 20px;" /> <p>That’s just a small taste of what you can do with CDN Attribution. For more information on how CDN Attribution works or how to configure it in your Kentik Detect account, check out our <a href="/resources/using-cdn-attribution/">Tech Note</a> or refer to our <a href="https://kb.kentik.com/Fc08.htm">Knowledge Base article</a>. If you’re not already a customer, there are a couple of easy ways to see more of what you can do with Kentik Detect: <a href="#demo_dialog">schedule a demo</a> or sign up today for a <a href="#signup_dialog">free trial</a>. As noted earlier, next time we’ll dig into Threat Feeds…</p> <p>Until then, watch our CDN Attribution feature in action here: <a href="https://www.youtube.com/embed/f0xGgRAcjJo">https://www.youtube.com/embed/f0xGgRAcjJo</a></p><![CDATA[What Do Poker and Networking Have in Common?]]><![CDATA[If you've spent any amount of time with Kentik Co-founder and CEO Avi Freedman, you know he's a networking nerd at heart. You may have also heard of his past (and sometimes current) pastime as a high-stakes poker player. Now, in merging two of Avi's passions together, we've launched Kentik Poker!]]>https://www.kentik.com/blog/what-do-poker-and-networking-have-in-commonhttps://www.kentik.com/blog/what-do-poker-and-networking-have-in-common<![CDATA[Michelle Kincaid]]>Wed, 07 Feb 2018 14:45:41 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/5Dz5xZyJAQ0mGkqi0AGAcQ/a513802cc2a280ed73ef907272c772ba/kentik-cards-300x300.png" alt="kentik-cards-300x300.png" class="image right" style="max-width: 300px;" />Come try your hand at Kentik Poker.</p> <p>If you’ve spent any amount of time with Kentik Co-founder and CEO Avi Freedman, you know he’s a networking nerd at heart. You may have also heard of his past (and sometimes current) pastime as a high-stakes poker player, in both No-Limit Hold’em and Pot-Limit Omaha. He’s even played at the World Series of Poker and the Ultimate Poker Challenge.
Now, in merging two of Avi’s passions together…</p> <p><strong>We’ve launched Kentik Poker!</strong></p> <p><img src="//images.ctfassets.net/6yom6slo28h2/4jqf3W087eYwO6SosCA6e0/ecb04397e3bc08403dca2a54acfd7b2a/poker-night-seattle.jpg" alt="poker-night-seattle.jpg" class="image right" style="max-width: 300px;" />For our inaugural event, we traveled up to Seattle and brought together our networking peers from cloud companies, content providers, hosting and infrastructure services, and even the travel industry. Along with a poker-training session from professional dealers and several No-Limit Texas Hold’em tables, the event included food, drinks, and free-form discussions among networking tech, business, and policy nerds, as well as some travel and gadget-geeking chatter.</p> <p>Congratulations go to our five Seattle tournament winners, who included: Josh Noll of Microsoft’s Global Network Acquisition Group, Jim Muggli of Network Utility Force, and Network Engineer Mark Reynolds IV. Prizes were new Bose headphones (to avoid any corporate gift issues with playing for or awarding money).</p> <p>With the fun we had at our first event, we’re already preparing for the next one, happening March 8 in Denver!</p> <p><strong><img src="//images.ctfassets.net/6yom6slo28h2/2qMt3lrcXCK2ccoQkKwG6e/dc9f96c06a1ee23033a7569bcd112dae/seattle-poker-avi.png" alt="seattle-poker-avi.png" class="image right" style="max-width: 300px;" />According to Avi:</strong></p> <p>“In poker, your best strategy is table selection. While it’s fun to play with the real pros like Phil Ivey, that’s not actually who you want to play with if you want to win. However, in the networking space, we all learn by playing with the best and sharing best practices for what works and doesn’t.”</p> <p><strong>Ready to play Kentik Poker?</strong></p> <p>Are you a Denver-based network nerd with a poker face? Try your luck at our upcoming tournament. <a href="https://www.greenvelope.com/viewer/?ActivityCode=.public:e4eb9dc506aa4fe8a809b1b6289897cf31303637323335"><strong>Request your seat here.</strong></a></p> <p>Not based in Denver? Stay tuned because we’re planning for more games in more cities soon!</p><![CDATA[News in Networking: Advanced Analytics for an Oil Company & HCI for an NFL Team]]><![CDATA[This week Verizon invested in an Open Network Automation Platform membership. Oil company Shell talked about the success of its predictive analytics. The Chicago Bears said they’re all in on hyperconverged infrastructure. And a Google Cast protocol bug caused temporary Wi-Fi outages. More stories after the jump...]]>https://www.kentik.com/blog/news-in-networking-advanced-analytics-for-an-oil-company-hci-for-an-nfl-teamhttps://www.kentik.com/blog/news-in-networking-advanced-analytics-for-an-oil-company-hci-for-an-nfl-team<![CDATA[Michelle Kincaid]]>Fri, 19 Jan 2018 17:06:55 GMT<p>This week’s top story picks from the Kentik team.</p> <p>This week Verizon invested in an Open Network Automation Platform (ONAP) platinum membership. Oil company Shell talked about the success of its predictive analytics. The Chicago Bears NFL team said they’re all in on HCI, or hyperconverged infrastructure. And a Google Cast protocol bug caused temporary Wi-Fi outages.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://www.sdxcentral.com/articles/news/verizon-joins-onap/2018/01/"><strong>Verizon joins ONAP</strong></a> <strong>(SDxCentral)</strong> Verizon is the newest platinum member of the Open Network Automation Platform (ONAP).
Srini Kalapala, Verizon’s VP of global technology and supplier strategy, told SDxCentral that “the company has been doing network virtualization for a while. Now, it will initially evaluate which parts of the ONAP open source software it can immediately use.”</li> <li><a href="http://www.zdnet.com/article/shell-drills-for-data-insights/"><strong>Oil company Shell is seeing big benefits from advanced analytics</strong></a> <strong>(ZDNet)</strong> “Shell’s upstream operations team uses predictive analytics capabilities to optimize the ordering, storage, and utilization of pieces of spare part inventory for onshore and offshore oil rigs. These include well heads and pipeline parts. The project has delivered millions of dollars in benefits and paid for itself in less than four weeks,” noted ZDNet.</li> <li><a href="https://techcrunch.com/2018/01/17/cloudflare-access-aims-to-replace-corporate-vpns/"><strong>Cloudflare Access aims to replace corporate VPNs</strong></a> <strong>(TechCrunch)</strong> “Essentially Cloudflare is doing the important part of the VPN — inspecting certificates and traffic, establishing a chain of trust for packets — in a less clunky way and one that enables companies to let data live on cloud services instead of internal servers,” reported TechCrunch.</li> <li><a href="http://www.lightreading.com/mobile/backhaul/sprint-cox-go-from-patent-suit-to-partnership/d/d-id/739841?"><strong>Sprint, Cox go from patent suit to partnership</strong></a> <strong>(Light Reading)</strong> “Following negotiations over a long-running VoIP patent infringement suit, Sprint and Cox have settled their differences and announced a multi-year deal that will give Sprint access to Cox broadband infrastructure throughout the cable company’s service footprint,” according to Light Reading.</li> <li><a href="https://siliconangle.com/blog/2018/01/16/amazon-bags-comcast-latest-marquee-customer/"><strong>Amazon bags Comcast as its latest marquee customer</strong></a> <strong>(SiliconANGLE)</strong> “Comcast isn’t exactly a new customer for Amazon, but in its announcement today the company said it chose AWS as its “preferred public cloud infrastructure provider,” which means it’ll be using more of its services,” reported SiliconANGLE.</li> <li><a href="https://www.sdxcentral.com/articles/news/chicago-bears-win-big-nutanix-hyperconverged-infrastructure/2018/01/?c_action=home_slider"><strong>Chicago Bears win big with Nutanix hyperconverged infrastructure</strong></a> <strong>(SDxCentral)</strong> It may be the NFL, but the Chicago Bears are in on HCI. According to SDxCentral, the football team’s IT team “deployed Nutanix Enterprise Cloud software for its mission-critical applications, including Microsoft SQL Server databases, financial reporting software, and an internally-developed player scouting application.”</li> <li><a href="https://www.businesswire.com/news/home/20180117005267/en/A10-Networks-Launches-Full-Spectrum-Cloud-Scrubbing"><strong>A10 Networks launches full spectrum cloud scrubbing and on-premise enterprise DDoS protection solution</strong></a> <strong>(Press Release)</strong> Kentik partner A10 Networks announced this week that it has expanded its DDoS protection lineup with a combined cloud-scrubbing and on-premise offering.
“A10 now provides a single advanced solution for on-premise and cloud scrubbing enterprise DDoS defenses, backed by our DDoS SIRT team,” Raj Jalan, CTO of A10, said in a press release.</li> <li><a href="http://www.post-gazette.com/business/career-workplace/2017/10/11/Carnegie-Mellon-Uber-Lyft-Ride-hail-apps-access-low-income-neighborhoods/stories/201710110028"><strong>Carnegie Mellon is home to a $27.5M project to build cloud computing solutions</strong></a> <strong>(Pittsburgh Post-Gazette)</strong> Carnegie Mellon University (CMU) has a big investment in a new program “to build a smarter solution for edge devices — like a router or larger access point to a network — to operate on the cloud,” according to the Pittsburgh Post-Gazette. “The CONIX Research Center, headquartered on CMU’s Oakland campus, will host researchers from six different universities for the next five years to solve this connectivity problem.”</li> <li><a href="http://www.theregister.co.uk/2018/01/15/router_vendors_update_firmware_to_protect_against_google_chromecast_traffic_floods/"><strong>Google Cast protocol bug causing temporary Wi-Fi outages on many routers</strong></a> <strong>(The Register)</strong> “Wi-Fi router vendors have started issuing patches to defend their products against Google Chromecast devices,” reported The Register earlier this week. “The bug is not in the routers, but in Google’s “Cast” feature, used in Chromecast, Google Home, and other devices. Cast sends multicast DNS (MDNS) packets as a keep-alive for connections to products like Google Home, and it seems someone forgot to configure the feature to go quiet when Chromecast devices are sleeping.”</li> <li><a href="https://www.bizjournals.com/sanjose/news/2018/01/17/john-chambers-cisco-vc-firm-jc2-ventures.html"><strong>Longtime Cisco chief John Chambers launches Palo Alto-based VC firm</strong></a> <strong>(Business Journal)</strong> Former Cisco leader John Chambers announced this week he’s launching a new venture firm called JC2 Ventures. According to Silicon Valley Business Journal, the VC will focus on “investing in the Internet of Things, Big Data, neuro-linguistic programming, artificial intelligence, social media, security and ag-tech.”</li> <li><a href="http://packetpushers.net/podcast/podcasts/show-372-kentik-network-traffic-intelligence-sponsored/"><strong>Podcast 372: Kentik &#x26; Network Traffic Intelligence</strong></a> <strong>(Packet Pushers)</strong> Last but certainly not least, Kentik’s Co-founder and CEO Avi Freedman took the airwaves last week for a Packet Pushers podcast on Kentik and our network traffic intelligence platform. Give it a listen to hear Avi’s take on the challenges of getting visibility in the cloud and on provider networks and what we’re doing to help.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[How IoT Drives the Need for Network Management Tools]]><![CDATA[As IoT adoption continues, enterprises are finding a massive increase in the number of devices and the data volumes generated by these devices on their networks.
Here’s how enterprises can use network monitoring tools for enhanced visibility into complex networks.]]>https://www.kentik.com/blog/how-iot-drives-the-need-for-network-management-toolshttps://www.kentik.com/blog/how-iot-drives-the-need-for-network-management-tools<![CDATA[Ken Osowski]]>Wed, 03 Jan 2018 14:00:18 GMT<h2 id="looking-into-network-monitoring-in-an-iot-enabled-network">Looking into network monitoring in an IoT-enabled network</h2> <p>Whether in finance, healthcare, energy, manufacturing, or the web, enterprises across all industries share a common goal: digitizing the organization as fast as possible. This allows businesses to take advantage of the many technologies that now enable greater speed and agility, and ultimately promise more revenue. As part of the movement, organizations are also looking to benefit from the internet of things (IoT).</p> <p>As IoT adoption in the enterprise continues to take shape, organizations are finding that these diverse capabilities represent another massive increase in the number of devices, and the data volumes generated by those devices, in enterprise networks.</p> <div as="Promo"></div> <p>IoT infrastructure represents a broad diversity of technology. New data streams, protocols, security guidelines, and backup procedures challenge network and security operations staff. The higher volume of IoT network traffic makes capacity planning and network management more difficult, especially as new IoT deployments emerge. Also, IoT devices with inadequate security safeguards are easy targets to hijack with malware that weaponizes them for DDoS attacks. This has the potential to disrupt infrastructure, as already seen in widely publicized IoT-based DDoS attacks.</p> <p>So, how can digital businesses cope with these challenges without giving up on IoT? How will network monitoring tools evolve to accommodate this ever-changing IoT network landscape?</p> <h2 id="key-iot-analytics-requirements">Key IoT analytics requirements</h2> <p>Network-based analytics is critical to managing IoT infrastructure. Network analytics has the power to examine details of the IoT communications patterns made through various protocols and correlate these to data paths traversed throughout the network. Normal or baseline performance measurements are established, and this information can then be used to identify suboptimal paths, packet loss, congestion points, or security threats.</p> <p>Traffic analytics is without a doubt a very powerful tool for network staff troubleshooting IoT solutions. But many network management tools weren’t architected to handle the scale of today’s networks, let alone the scale of traffic introduced by millions of IoT devices. Network management tools need to address IoT network analytics challenges head-on, starting with some key requirements. They must:</p> <ul> <li>Easily incorporate newly deployed devices and sensors for monitoring</li> <li>Scale to support very high network monitoring data ingest volumes</li> <li>Support detailed IoT device monitoring data across long periods of time</li> <li>Provide flexible data models for reporting with a high level of field customization</li> <li>Return query results against all this data in seconds</li> </ul> <p>So, how do legacy network management tools stack up against these requirements? Appliance-based network management solutions are too resource-constrained to handle the vast volume of data generated by IoT infrastructure.
And software-based network management tools silo flow data, imposing severe constraints on analytics methods that require network data correlation across many network locations. This leads us to a big data approach to capture and report on this unstructured IoT data.</p> <h2 id="kentiks-scalable-and-flexible-iot-analytics">Kentik’s scalable and flexible IoT analytics</h2> <p>Kentik’s adoption of a big data architecture is at the core of the network monitoring platform. This brings some real advantages for IoT analytics, because big data is not only about handling large volumes of data, but also about letting network operations staff navigate through and explore that data very quickly. Key advantages include:</p> <ul> <li><strong>Full data retention, deep IoT analytics</strong> – Kentik’s big data solution doesn’t create summaries or roll-ups that discard network traffic details. Instead, raw data is retained unsummarized, and exploratory analytics enable traffic patterns to be recognized that might go unnoticed as IoT infrastructure is built out.</li> <li><strong>Custom reporting dimensions</strong> – Custom dimensions are customer-defined labels that are applied to flow data based on user-defined criteria, as Kentik ingests the data into the Kentik Data Engine (KDE). This allows users to create IoT-specific mappings, such as identifying IoT devices by their IP addresses (see the sketch following this list). This makes IoT-specific traffic flows much easier to identify and report on.</li> <li><strong>API access</strong> – Kentik provides API access to administrative functions like defining users, devices, and custom dimensions, as well as the ability to pull formatted data from the system. This allows for more automated integration with other tools to avoid siloed applications.</li> <li><strong>Adaptive baselining and anomaly detection</strong> – Big data enables automated tracking of dozens of traffic dimensions to determine which should be baselined and measured for anomalies. This enables far more accurate detection and notification by making the system responsive to the organic changes in IoT network infrastructure and traffic patterns, which makes it easier to distinguish IoT-based threats from normal traffic.</li> <li><strong>Custom dashboards</strong> – Kentik’s Custom Dashboard feature enables users to quickly make sense of the large volumes of data generated by IoT devices. By creating custom panels that visualize the data in the way that makes the most sense to the user, better insight can be gained into IoT network traffic patterns.</li> </ul>
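<p>To make the custom reporting dimensions idea concrete, the standalone sketch below mimics what such a mapping effectively does at ingest: apply a user-defined label to each flow (here, keyed on source IP block) so that IoT traffic can be grouped and reported on. Kentik performs this enrichment inside KDE as flow data arrives; this is just the concept in miniature, with made-up address blocks and labels:</p> <pre><code class="language-python">
# Conceptual illustration of an IoT custom-dimension mapping: label flows by
# source IP block, then aggregate by label. Address blocks, labels, and flow
# tuples are all made up; Kentik applies such mappings at ingest inside KDE.
import ipaddress

IOT_CLASSES = {
    ipaddress.ip_network("10.20.0.0/24"): "security-camera",
    ipaddress.ip_network("10.20.1.0/24"): "thermostat",
}

def iot_class(src_ip: str) -> str:
    """Return the IoT label for a flow's source IP, or 'other'."""
    addr = ipaddress.ip_address(src_ip)
    for net, label in IOT_CLASSES.items():
        if addr in net:
            return label
    return "other"

# (source IP, bytes) pairs standing in for flow records
flows = [("10.20.0.7", 512000), ("10.20.1.9", 8000), ("192.0.2.5", 64000)]
totals = {}
for src, nbytes in flows:
    label = iot_class(src)
    totals[label] = totals.get(label, 0) + nbytes
print(totals)  # {'security-camera': 512000, 'thermostat': 8000, 'other': 64000}
</code></pre>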
<p>No matter how quickly an organization embraces IoT, it’s important to remember that the business value comes not from the type of IoT device or how the device connects to the network, but from the insights that the device’s data can create. This data is used to understand how businesses are operating from second to second, and IoT analytics is at the heart of this revolution. To see how Kentik can help your organization analyze, monitor, and react to IoT traffic patterns, <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[What’s Ahead for Networking in 2018?]]><![CDATA[From telcos, to financial services, to tech companies, we asked 30 of our peers one question: What are your 2018 networking predictions? Yes, it’s a broad question. But respondents (hailing from network, data center, and security operations teams) surfaced five main predictions for the year ahead.]]>https://www.kentik.com/blog/whats-ahead-for-networking-in-2018https://www.kentik.com/blog/whats-ahead-for-networking-in-2018<![CDATA[Jim Frey]]>Thu, 21 Dec 2017 17:35:37 GMT<p>We asked our partners and peers for predictions.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/2pVOrqaFLigwkeGYU6s0k/50e87193f76b6644043913ea24d25895/2018predictions-300x173.jpg" alt="2018predictions-300x173.jpg" class="image right" style="max-width: 300px;" />From telcos like BT and CenturyLink, to financial services such as MasterCard, to tech companies like Amazon, SAP, and more, we asked 30 of our peers one question: What are your 2018 networking predictions? Yes, it’s a broad question. But respondents (hailing from network, data center, and security operations teams) surfaced five main predictions for the year ahead. In order of most-voted-for, they are:</p> <ol> <li>More network automation</li> <li>More cloud adoption</li> <li>More SD-WAN growth</li> <li>More demand for security</li> <li>More machine learning curiosity</li> </ol> <p>Because the top-five predictions likely won’t surprise most in our industry, we asked our partners and peers to weigh in on some of the reasons why these trends surfaced to the top. Here’s what they said…</p> <h3 id="more-network-automation"><strong>More Network Automation</strong></h3> <p>I’m going to comment directly on the first prediction, as it’s always been a personal fascination. We’ve been talking about network automation for a long time, but it’s traditionally been held back by nervous networking pros who see more risk than reward. Now, things are changing. The advent of cloud and virtual infrastructure is driving automation up and down the stack, and it’s dragging networking into the fold. Security is also a driving factor in automation: it’s better to shoot first and ask questions later than get compromised. But network automation isn’t the hard part; it’s being comfortable with the accuracy of the triggering intelligence, and we see that aspect of analytics improving steadily.</p> <h3 id="more-cloud-adoption"><strong>More Cloud Adoption</strong></h3> <p>“Workload migration to the cloud and the multi-cloud will continue to accelerate, with containerization becoming increasingly prominent across these environments,” suggests Monika Goldberg, executive director of <a href="https://www.shieldx.com/">ShieldX Networks</a>, a 2017 Gartner Cool Vendor in Cloud Security.
“Traditional appliance and virtualized appliance solutions will feel more pressure as more cost-effective and agile cloud-native software and service solutions emerge and mature.” Kentik teammate Justin Ryburn suggests: “I think we will see a lot more attention given to how to approach hybrid multi-cloud as users start to realize both on-prem workloads and cloud workloads are going to be a reality for quite a while.” Though Ryburn adds, “The migration to the cloud is going to be slower than some people thought it might be.” On that last point, I’d add my own observation that cloud migration isn’t happening as fast as the pundits and cloud vendors thought, in no small part due to cost, the fact that there are many applications that simply are not “cloud appropriate,” and simple organizational inertia around systems that “aren’t broke, so why fix them?”</p> <h3 id="more-sd-wan-growth"><strong>More SD-WAN Growth</strong></h3> <p>“SD-WAN will continue to be a hot topic as it represents a significant cost savings for enterprises when compared to traditional carrier MPLS services,” according to Ryburn. “We will continue to see announcements from carriers about managed SD-WAN offerings as they look to keep their existing enterprise customers by moving them to the newer technology.”</p> <h3 id="more-demand-for-security"><strong>More Demand for Security</strong></h3> <p><strong><em>Vulnerabilities Exploited in CDNs</em></strong> “New vulnerabilities in content delivery networks (CDNs) have left many wondering if the networks themselves are vulnerable to a wide variety of cyber-attacks,” notes Carl Herberger, vice president of security solutions at <a href="https://www.radware.com/Partners/TechnologyPartners/Kentik/">Radware</a>, a Kentik alliance partner. Herberger predicts these five cyber “blind spots” will be attacked in 2018:</p> <ul> <li>Increase in dynamic content attacks</li> <li>Attacks on non-CDN services</li> <li>SSL-based DDoS attacks</li> <li>Web application attacks</li> <li>Direct IP attacks</li> </ul> <p><em><strong>Third-Party Attacks</strong></em> Radware’s Herberger also predicts we’ll see more “attacking from the side.” He believes this technique will attack the integrity of a company’s site through a variety of tactics:</p> <ul> <li>DDoS the enterprise’s analytics company</li> <li>Brute-force attack against all users or against all of the site’s third-party companies</li> <li>Port the admin’s phone and steal login information</li> <li>Large botnets to “learn” ins and outs of a site</li> <li>Massive load on “page dotting”</li> </ul> <p><strong><em>Cloud Providers to be Targeted</em></strong> Our partners from the <a href="https://www.a10networks.com/blog/live-a10-kentik-ddos-demo-mwc">A10 Networks Security Engineering Research Team</a> (SERT) say the cloud is next on hackers’ hit list. “As more companies move to the cloud, attackers will directly or indirectly target cloud providers,” says A10 SERT. “Just one look at the Dyn and Mirai attacks of 2016 show this trend forming, and it’ll reach a new peak in 2018. Corporations will have limited response capabilities to deal with their cloud provider being attacked, as they have no control over the underlying infrastructure.
This will cause more companies to look at a multi-cloud strategy to avoid putting all of their workloads with one cloud provider.”</p> <h3 id="more-machine-learning-curiosity"><strong>More Machine Learning Curiosity</strong></h3> <p><strong><em>AI to Power Emerging Security Tech</em></strong> A10 SERT also suggests, “While we’re not talking about full-fledged AI here, the rise of commoditized machine learning capabilities and chat bots being built into just about every new product will allow for human and electronic intelligence to be combined more effectively. Come next year, this will give security teams the ability to assess and prioritize security vulnerabilities based on more than just a single label, thus offering deeper protection.” <strong><em>AI in Networking?</em></strong> Luca Deri, founder of <a href="https://www.ntop.org/nprobe/ntop-and-kentik-bring-nprobe-on-the-cloud/">ntop</a>, a Kentik alliance partner, notes, “In the modern internet, it is no longer possible to use a firewall to delimit bad from good, simply because it’s all mixed-up. For this reason it is compulsory to take into account non-network-oriented metadata which include things such as if a human [is] using a device, the time of day, the location, and the activity being carried on. With this further step, it becomes possible to make monitoring ready for the next decade.” On a similar note, Kentik teammate Justin Ryburn adds, “I think we will hear a lot more about AI and machine learning in 2018. A lot of research is being done on how these technologies can be applied to networking, but I think we are still a few years away from shipping products in this space.”</p> <h3 id="looking-ahead"><strong>Looking Ahead</strong></h3> <p>Whatever your network needs may be in 2018, be prepared. The continued adoption of SD-WAN, multi-cloud deployments, an expanding threatscape, and developments in machine learning are making networks more dispersed and complex. Network visibility will be more critical than ever as enterprises and service providers make these major trends a reality. For SaaS-based, real-time network traffic intelligence from Kentik, <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Fascinating Facts from Kentik]]><![CDATA[What brands of network devices are Kentik customers using? Where does their international traffic come from and go to? What's the current norm for packet sizes and internet traffic protocols? Drawing on Kentik Detect's ability to see and analyze network traffic, this post shares some intriguing factoids, and it sheds light on some of the insights about your own network traffic that await you as a Kentik customer.]]>https://www.kentik.com/blog/fascinating-facts-from-kentikhttps://www.kentik.com/blog/fascinating-facts-from-kentik<![CDATA[Justin Ryburn]]>Mon, 18 Dec 2017 14:00:34 GMT<h2 id="big-data-stats-reveal-industry-trends"><em>Big Data Stats Reveal Industry Trends</em></h2> <p>Roughly 100 billion flow records each and every day. That’s how much flow data is ingested by Kentik Data Engine (KDE), the distributed big data backend that powers Kentik Detect®. It’s also just one of the many interesting statistics that we run across as we operate our SaaS platform for network traffic analytics. As the year draws to a close, we thought we’d share some of those fascinating tidbits with you, our loyal readers. Needless to say, everything presented below comes from customers that have granted permission to use their data.
The data has been aggregated and anonymized to protect data privacy, an issue we take very seriously and have built into our platform.</p> <h3 id="network-devices-which-brands-are-trending">Network Devices: Which Brands are Trending?</h3> <img src="//images.ctfassets.net/6yom6slo28h2/3wAaIhjozCSSqkWcuImqgq/d6b89364028155a3a64de3ec86a65a01/Brands-500w.png" class="image right no-shadow" style="max-width: 500px" alt="Network device brands" /> <p>Cisco Systems is by far the market leader in the overall IT Networking sector. So it’s interesting to see that Juniper Networks, at 49.6%, is the leading maker of routers and switches sending flow to Kentik Detect. On the other hand, given Arista’s recent market traction, it’s not surprising to see them in the number-three spot. And how about the fact that Linux OSes show up in our Top 10 list as well? Maybe the “white box” trend really is taking off?</p> <h3 id="international-traffic-where-to-and-from">International Traffic: Where To, and From?</h3> <p>The graphs below are both based on flow records from Kentik customers for traffic whose source or destination is outside the United States. At left we see the top 10 destinations, by country, for traffic that is leaving the United States. Since most of the network operators in the US have a network POP in Canada (represented on the graph as CA), it makes sense that our northern neighbor would be in the number-one spot, with 18% of the traffic. Canada is closely followed by China (CN), Great Britain (GB), Germany (DE), and Mexico (MX).</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Q0HhMK5qEGuOQaOiUIkI4/ea83d61230c170072dd1649e318e791a/ToFrom_US-800w.png" class="image center no-shadow" style="max-width: 800px" alt="International traffic" /> <p>What about the other direction: traffic that originates in foreign countries and is destined for the United States? In this case, as we can see from the graph at right above, the rankings are a bit different. Canada is down in the number 4 spot, and China is down to number 6. The top 3 spots are occupied by the Netherlands (NL), Great Britain (GB), and Germany (DE).</p> <h3 id="packet-sizes">Packet Sizes</h3> <p>Here’s one that’s probably not too surprising: there appears to be a tendency toward larger packets on the Internet, with most traffic made up of packets greater than 1000 bytes. Packets of 1400 bytes are the most common, accounting for 55% of the traffic. There are jumbo frames, those greater than 1500 bytes, among the data set, but they make up less than 1% of the traffic, so they were left off the graph to make it easier to read.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1LzoMG0REMAmMSc8YYIAQc/ed1d6d4f62d036d1d6337dfb93266bad/Packet_size-820w.png" class="image center no-shadow" style="max-width: 800px" alt="Packet Size" /> <h3 id="ip-versions-and-internet-protocols">IP Versions and Internet Protocols</h3> <p>Google maintains Internet-wide stats on the overall <a href="https://www.google.com/intl/en/ipv6/statistics.html">IPv6 adoption rate</a>. As of this writing, those numbers show that about 17% of traffic on the Internet is IPv6. But for Kentik customers, the number is much lower. While Kentik Detect supports IPv6, only 0.2% of traffic for which flow records are sent to KDE uses IPv6 (below left).
Apparently most of our customers don’t yet have a high volume of IPv6 traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5NSLWMkl0I6EwMGIkQO0OA/1428e2a9e43d873d4614fa53f7d0ea09/VersionProtocol-808w.png" class="image center no-shadow" style="max-width: 800px" alt="IP versions and protocols" /> <p>As for IP protocols, when we look at which ones our customers are most commonly seeing on their networks (above right), it’s not surprising that TCP (protocol 6, for those not familiar) comes out on top, with almost 71% of the traffic. It is interesting, though, how the rest of the list comes out. UDP (protocol 17) is down in the number 5 spot with only 3.2% of the traffic, and ICMP (protocol 1) comes in right behind it with 2.7%. To be honest, I had to go look up what some of these other protocols are, as I have not seen those protocol numbers in years.</p> <h3 id="port-numbers-and-services">Port Numbers and Services</h3> <img src="//images.ctfassets.net/6yom6slo28h2/6N5QlMneRUiuqWoCcICU40/6fdef3478fe7d74fc55564d075b81bd8/Ports-500w.png" class="image center no-shadow" style="max-width: 500px" alt="port numbers and services" /> <p>Now let’s take a look at TCP/UDP port numbers, so we can get an idea of what kind of services are in use. It’s not surprising to see 443 (HTTPS) and 80 (HTTP) as the top two hitters, given that most Internet applications are now HTTP-based. I was a little surprised to see, though, that port 443 beat out port 80. I guess that speaks to the increasing prevalence of SSL-encrypted traffic.</p> <h3 id="summary">Summary</h3> <p>The above is just a quick taste of what can be seen in the traffic data that Kentik customers explore with our big data analytical tools. Are you ready to see what kinds of interesting data points are hiding in your network traffic? Get started today by <a href="#demo_dialog">scheduling a demo</a> or starting a <a href="#signup_dialog">free trial</a>.</p><![CDATA[DIY: The Hidden Risks of Open Source Network Flow Analyzers]]><![CDATA[Advances in open source software packages for big data have made Do-It-Yourself (DIY) approaches to Network Flow Analyzers attractive. However, careful analysis of all the pros and cons needs to be completed before jumping in. In this post, we look at the hidden pitfalls and costs of the DIY approach.]]>https://www.kentik.com/blog/hidden-risks-open-source-network-flow-analyzershttps://www.kentik.com/blog/hidden-risks-open-source-network-flow-analyzers<![CDATA[Ken Osowski]]>Tue, 12 Dec 2017 14:00:07 GMT<h3 id="analyzing-the-diy-approach-to-network-traffic-analyzers"><em>Analyzing the DIY Approach to Network Traffic Analyzers</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/1jHHHg17UIu6cKYCOmgGCS/52ee67e7bf529ea01d7ab68926e4183a/diy-disasters-300x158.jpg" class="image right" style="max-width: 400px" alt="DIY disasters" /> <p>The “do it yourself” (or DIY) mentality isn’t new to our industry. From hardware to software, people have been grappling with the buy vs. build dilemma for years. However, as enterprises and service providers put their 2018 tech budgets into action, we’re here to point out one DIY networking trend where the fine print is worth reading: Open source network flow analyzers. Traditional network monitoring appliances and network flow analyzer software were built with one primary purpose: to diagnose network issues. Legacy vendors behind those products paid less attention to developing proactive and actionable insight.
To do the latter, they would need analyzer software to process network data in real time from many different sources across the network, including sensors, routers, switches, and hosts, and complement that with BGP, SNMP, GeoIP, and threat data. But offering that at scale isn’t easy or cheap. With the limitations and expense of using legacy solutions, DIY is tempting. Some network teams have decided to develop their own custom analyzer software for network monitoring. And why wouldn’t they? It’s much more doable now than ever with open source building blocks readily available.</p> <h4 id="diy-requirements">DIY Requirements</h4> <p>The biggest challenge in DIY tech is typically finding the best starting point among the myriad options available on the open source market. For DIY NetFlow analyzer projects, that boils down to identifying an open source big data backend for NetFlow data analysis that meets the most critical big data requirements:</p> <ul> <li>High-volume NetFlow collector ingest scalability</li> <li>Easy to use and expand UI frontend</li> <li>NetFlow data retention scalability</li> <li>Real-time query response</li> <li>High availability</li> <li>Open API access</li> </ul> <p>Hadoop, ELK, and Google’s BigQuery are among the short list of options that meet some of those requirements for DIY projects. But in looking closely at each:</p> <ul> <li><strong>Hadoop</strong> has one main shortcoming: it does not have any tools that help with data modeling to support the analysis of NetFlow records. Some have implemented data cubes to fill this gap, but cubes fall short for very large volumes of data and are unable to adapt the data model in real time.</li> <li><strong>The ELK stack</strong> is a set of open source analyzer tools. Yet, when evaluating the ELK stack against the key NetFlow analysis requirements list, it falls short in several areas. For one, it cannot store binary data: all binary data (such as NetFlow records) must be re-formatted as JSON, resulting in massive storage bloat at scale and performance bottlenecks. It also has inadequate multi-tenancy fair usage: the ELK stack can use tags to implement data-access segmentation, but there is no way to enforce fairness.</li> <li><strong>BigQuery</strong> is a RESTful web service that enables interactive analysis of large datasets working in conjunction with Google Storage. When evaluating BigQuery against the key NetFlow analysis requirements, the data throughput limit of 100K records/second falls short for large networks generating tens of millions of NetFlow records/second.</li> </ul> <h4 id="the-cost-of-diy">The Cost of DIY</h4> <p>The DIY tools approach promises to address network analyzer functions at a lower cost than commercial vendor offerings. However, staffing an in-house deployment results in up-front investments and continuing, long-term resource allocations that can skew long-term total cost of ownership (TCO) higher. When estimating a DIY project in the feasibility stage, key costs often underestimated come from:</p> <ul> <li>Training all teams on the involved network protocols and their usage</li> <li>Maintaining resilience and reliability at scale</li> <li>Implementing geo-distributed flow data ingest</li> <li>Creating and maintaining a flow-friendly data store</li> </ul> <p>While building an in-house, custom network analyzer may seem like the right approach, careful consideration needs to be given to its pros and cons; it can have hidden costs that are not obvious at the outset.</p>
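<p>One of those hidden costs is easy to demonstrate: the storage bloat that comes from re-encoding binary flow records as JSON, which an ELK-based design requires (as noted above). A NetFlow v5 flow record is a fixed 48 bytes on the wire; the sketch below shows what happens when the same fields are serialized as JSON (field names here are illustrative):</p> <pre><code class="language-python">
# Demonstrating JSON storage bloat for flow records. A NetFlow v5 record is
# a fixed 48-byte binary struct; JSON-encoding the same fields (names here
# are illustrative) multiplies the per-record footprint several times over.
import json

BINARY_RECORD_BYTES = 48  # NetFlow v5 flow record size on the wire

record = {
    "src_addr": "10.1.2.3", "dst_addr": "192.0.2.45", "next_hop": "10.1.2.1",
    "input_if": 12, "output_if": 7, "packets": 1423, "bytes": 1310720,
    "first": 1512345678, "last": 1512345699, "src_port": 51512,
    "dst_port": 443, "tcp_flags": 27, "proto": 6, "tos": 0,
    "src_as": 64512, "dst_as": 15169,
}
json_bytes = len(json.dumps(record).encode("utf-8"))

print(f"binary: {BINARY_RECORD_BYTES} B  JSON: {json_bytes} B  "
      f"bloat: {json_bytes / BINARY_RECORD_BYTES:.1f}x")
# At 10M flow records/sec (a large network), an extra 100 B per record is
# ~1 GB/s of added storage -- on the order of 80 TB per day.
</code></pre>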
<h4 id="a-clear-diy-alternative">A Clear DIY Alternative</h4> <p>Based on the big data requirements and cost alone, if you’re planning to tackle a DIY project in the year ahead, you and your network team need to consider another model: one that delivers on the original drivers for DIY projects, lower costs and faster time-to-use. The best DIY alternative is a cloud-based SaaS model for implementing network analytics. SaaS-based network analytics has many benefits, including that the approach can:</p> <ul> <li>Ensure “day 0” time-to-value by eliminating the need for hardware</li> <li>Achieve real-time network visibility for real-time problem resolution</li> <li>Provide instant compatibility and integration with various NetFlow-enabled devices</li> <li>Eliminate capital and related operations investments such as space, power, and cooling</li> <li>Reduce staffing required to maintain hardware and software</li> <li>Improve speed to deployment for new software features and bug fixes</li> </ul> <h4 id="kentiks-saas-approach">Kentik’s SaaS Approach</h4> <p>Kentik is the first to take on ultra-high-volume NetFlow monitoring using a highly cost-effective SaaS approach at massive scale with near real-time results. Kentik’s SaaS customers get immediate results at lower cost from the moment they start using the service, and Kentik’s operations team is always there to ensure the health and success of each managed environment. In order to meet these NetFlow big data backend requirements, the Kentik Detect® platform leverages the following key elements, all of which are critical to a successful SaaS implementation:</p> <ul> <li>A clustered ingest layer to receive and process flow data in real time at massive scale</li> <li>A front end/API that uses an industry-standard language; all front end queries used by the UI portal or a client API are based on PostgreSQL</li> <li>Caching of query results by one-minute and one-hour time periods to support sub-second response</li> <li>Full support of compression for file storage to provide both storage and I/O read efficiency</li> <li>Rate-limiting of ad-hoc, un-cached queries to provide fairness across all queries</li> </ul> <p>This environment delivers on planned (but often not met) DIY objectives by including a robust set of open REST and SQL APIs. This enables internal tool development teams to integrate with the Kentik SaaS environment to address their specific operational needs. These use cases include:</p> <ul> <li>Customizing flow enrichment from sensors, routers, switches, hosts</li> <li>Unifying ops and business data with BGP, SNMP, Geo, threat data</li> <li>Creating a custom UI for network + business needs</li> </ul> <p>Using any of the standard big data open source distributions can lead to partial success for DIY projects, but only with a large investment of time and money. Either way, network analytics is no longer optional for network operators, since the insights learned from network intelligence can translate directly into operational and business value. To move past DIY and learn how to harness the power of Kentik Detect for a truly effective real-time NetFlow analyzer, <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[News in Networking: Cisco’s Acquisition, Juniper’s Security M&A, and Elon’s AI]]><![CDATA[Cisco will acquire cloud-cost comparison tool Cmpute.io.
Juniper Networks may also be moving on cloud-focused M&A, specifically in the multi-cloud cybersecurity sector. Beyond cloud, the hype of AI continues. Elon Musk’s Tesla team is working on an AI project that could be “the best in the world," he said this week. More news after the jump...]]>https://www.kentik.com/blog/news-in-networking-ciscos-acquisition-junipers-security-ma-and-elons-aihttps://www.kentik.com/blog/news-in-networking-ciscos-acquisition-junipers-security-ma-and-elons-ai<![CDATA[Michelle Kincaid]]>Fri, 08 Dec 2017 18:52:07 GMT<h3 id="this-weeks-top-story-picks-from-the-kentik-team"><em>This week’s top story picks from the Kentik team.</em></h3> <p>Cisco announced plans this week to acquire Cmpute.io, an India-based cloud-cost comparison tool. Juniper Networks may also be moving on cloud-focused M&#x26;A. According to Light Reading, the company “has set its sights on being a leading player in the multi-cloud cybersecurity sector.” Beyond the cloud, the hype of AI continued to make headlines this week. According to CNBC, Elon Musk’s Tesla team is working on an AI project that could be “the best in the world.”</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://www.sdxcentral.com/articles/news/cisco-buy-cmpute-io-another-cloud-cost-comparison-tool/2017/12/"><strong>Cisco to buy Cmpute.io, acquiring another cloud cost comparison tool</strong></a> <strong>(SDxCentral)</strong><br> “Cisco announced plans to buy Cmpute.io for its software product that compares workloads and costs across clouds and identifies ways to save money. Cmpute.io is a privately-held company operating in Bangalore, India, and incorporated in Delaware. Cisco won’t disclose the acquisition price,” reported SDxCentral.</li> <li><a href="http://www.lightreading.com/carrier-security/cloud-security/juniper-scouts-for-multicloud-cybersecurity-manda-targets/d/d-id/738843?"><strong>Juniper scouts for multi-cloud cybersecurity M&#x26;A targets</strong></a> <strong>(Light Reading)</strong><br> “Juniper Networks has set its sights on being a leading player in the multi-cloud cybersecurity sector and is ready to make a strategic acquisition to advance that strategy, the vendor’s CEO told industry analysts at a briefing in London Wednesday,” according to Light Reading.</li> <li><a href="https://www.sdxcentral.com/articles/news/verizon-says-intelligent-edge-will-reduce-network-costs/2017/12/"><strong>Verizon says its intelligent edge will reduce network costs</strong></a> <strong>(SDxCentral)</strong><br> “Speaking at a Barclays investor conference, Ed Chan, SVP of technology, strategy, and planning at Verizon, said the company’s intelligent edge network is basically changing how the service provider is running the network by making software the control point for the network, which means it’s easier to automate services and share different network assets,” noted SDxCentral.</li> <li><a href="https://www.networkworld.com/article/3239967/lan-wan/vmware-targets-cloud-and-container-networking-with-latest-nsx-t-launch.html"><strong>VMware targets cloud and container networking with latest NSX-T launch</strong></a> <strong>(NetworkWorld)</strong><br> “VMware today released a new version of its NSX virtual networking software that aims to make it easier to manage network requirements of cloud-native and application-container-based applications. 
The move represents the latest example of a network vendor evolving its automation tooling to operate in not just traditional data center and campus networks, but increasingly in cloud environments that cater to a faster pace of application development,” said NetworkWorld.</li> <li><a href="https://techcrunch.com/2017/12/07/salesforce-is-latest-big-tech-vendor-to-join-the-cloud-native-computing-foundation/"><strong>Salesforce is latest big tech vendor to join the Cloud Native Computing Foundation</strong></a> <strong>(TechCrunch)</strong><br> “Salesforce announced today that it was joining the Cloud Native Computing Foundation (CNCF), the open-source organization that manages Kubernetes, the popular open-source container orchestration tool,” noted TechCrunch. “Salesforce is a SaaS vendor, but it too is seeing what so many others are seeing: containerization provides a way to more tightly control the development process. Kubernetes and cloud native computing in general are a big part of that, and Salesforce wants a piece of the action.”</li> <li><a href="https://www.cnbc.com/2017/12/08/elon-musk-talks-up-teslas-upcoming-artificial-intelligence-hardware.html"><strong>Elon Musk told party attendees that Tesla is making A.I. hardware that could be ‘the best in the world’</strong></a> <strong>(CNBC)</strong><br> “Tesla CEO Elon Musk talked up the company’s work to develop custom hardware for artificial intelligence on Thursday during a talk at a company party for academic and industry researchers,” reported CNBC. “The specialized hardware could one day be used inside Tesla vehicles to do the computing work necessary for autonomous driving. Currently, Tesla’s Autopilot hardware system relies on graphics cards from Nvidia.”</li> <li><a href="https://www.sdxcentral.com/articles/announcements/apm-npm-next-gen-infrastructure-assurance-available/2017/12/"><strong>New Report: 2017 Next-Gen Software-Defined Infrastructure Assurance: Converging Roles of APM &#x26; NPM</strong></a> <strong>(SDxCentral)</strong><br> “With NPM and APM at the core of assurance, capacity planning, scaling and even security of our next-generation technology infrastructure, we’re certain that new technology investments (Big Data, AI/ML, automation) by end-users and vendors alike in performance management will provide a great return on investment for the foreseeable future,” noted SDxCentral in a new report on the topic.</li> <li><a href="http://www.eweek.com/it-management/gartner-says-enterprises-must-embrace-digital-transformation-in-2018"><strong>Gartner says enterprises must embrace digital transformation in 2018</strong></a> <strong>(eWEEK)</strong><br> “Gartner executives speaking at its annual Data Center Infrastructure and Operations Conference say enterprises should start implementing digital transformation projects in 2018 or risk falling behind their competition,” according to eWEEK.</li> <li><a href="https://www.nytimes.com/2017/12/02/technology/from-the-arctics-melting-ice-an-unexpected-digital-hub.html"><strong>Melting Arctic ice makes high-speed internet a reality</strong></a> <strong>(The New York Times)</strong><br> “High-speed internet cables snake under the world’s oceans, tying continents together and allowing email and other bits of digital data sent from Japan to arrive quickly in Britain.
Until recently, those lines mostly bypassed the Arctic, where the ice blocked access to the ships that lay the cable,” notes The New York Times in a feature on one Alaskan town.</li> <li><a href="https://www.wired.com/story/evolution-of-data-leaks/"><strong>Wired visualizes the evolution of data leaks</strong></a> <strong>(Wired)</strong><br> “Based on data from CyberScout and the Identity Theft Resource Center, we created a series of contour lines showing how frequently companies, hospitals, government agencies, and other organizations were compromised by various methods over time,” reported Wired in an infographic on how data leaks have evolved.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Black Friday vs. Cyber Monday: Traffic Insights from Kentik]]><![CDATA[Media reports tell us that Cyber Monday marked a single-day record for revenue from online shopping. We can assume that those sales correlated with a general spike in network utilization, but from a management and planning perspective we might want to go deeper, exploring the when and where of traffic patterns to specific sites. In this post we use Kentik Detect to see what can be learned from a deeper dive into holiday traffic.]]>https://www.kentik.com/blog/black-friday-vs-cyber-monday-traffic-insights-from-kentikhttps://www.kentik.com/blog/black-friday-vs-cyber-monday-traffic-insights-from-kentik<![CDATA[Justin Ryburn]]>Thu, 30 Nov 2017 14:00:48 GMT<h3 id="comparing-online-holiday-sales-with-network-traffic"><em>Comparing Online Holiday Sales With Network Traffic</em></h3> <p>Unless last week’s Thanksgiving turkey still has you in a snooze, you likely heard some of the recent buzz about Black Friday and Cyber Monday. Kicking off the holiday shopping season, Black Friday deals brought in $5.03 billion last week, while Cyber Monday officially became (<a href="https://www.forbes.com/sites/jeanbaptiste/2017/11/28/report-cyber-monday-hits-new-record-at-6-6-billion-over-1-billion-more-than-2016/">according to Adobe</a>) “the largest U.S. online shopping day ever,” hitting a new record of $6.59 billion in a single day.</p> <p>At Kentik, we recognized immediately that our ability to see deep into network traffic could yield valuable insights not simply into these shopping events themselves but into any business scenario in which network traffic can be correlated with revenue. So we wanted to investigate how the sales figures matched up with traffic to the sites doing the selling. To see that we turned to the Data Explorer section of Kentik Detect, which is designed for ad hoc querying of traffic data including flow records (NetFlow, sFlow, etc.), BGP, GeoIP, and NPM metrics. With the permission and help of a few of our customers, we took a look at traffic patterns from both residential and commercial networks to some of the largest retailers promoting the biggest holiday sales, namely Amazon, Walmart, and Target.</p> <h4 id="walmart-being-prepared">Walmart: Being Prepared</h4> <p>Based on an article in the <a href="https://blogs.wsj.com/cio/2015/11/25/wal-mart-revamps-e-commerce-technology-as-amazon-applies-pressure/?mod=djemCIO_h&#x26;mg=prod/accounts-wsj">Wall Street Journal</a> earlier this week, we know that Walmart prepared for an influx of network traffic associated with its holiday deals by redesigning data centers and leveraging analytics. 
As Walmart’s CTO told the Journal, “We’ll be standing there in the network operations center, hoping for a boring technology week.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/XX2gnVO5mSowmuWeKYsCa/04381283bbb91a28c6cd782d50713e2c/Filter-Walmart-400w.png" alt="Filter-Walmart-400w.png " class="image right" style="max-width: 400px;" /> <p>One indication of how Walmart hoped to keep things uneventful comes from applying our analytics to traffic across the networks of some of our customers (each of whom opted in to let us look at the data). Kentik Detect’s ability to filter on destination AS (shown at right) enabled us to look at traffic headed toward Walmart’s infrastructure. As shown in the graph below, there’s a noticeable drop in traffic at approximately midnight on November 18 — one week prior to Black Friday. Did Walmart prepare for the holiday rush by shifting some traffic away from their own infrastructure to resources in the public cloud?</p> <img src="//images.ctfassets.net/6yom6slo28h2/5cXHVnpzu8UKSSQCS4guoG/5aa8ee9bea2709cfbb0760fdb0ed710e/Graph_Shift-811w.png" alt="Graph_Shift-811w.png " class="image center" style="max-width: 811px;" /> <h4 id="zooming-in-on-the-thanksgiving-weekend-we-found-two-other-interesting-observations">Zooming in on the Thanksgiving weekend, we found two other interesting observations:</h4> <ul> <li>Normally there’s a big dip in traffic overnight, but that dip was quite minor on “Thanksgiving Eve” (Wednesday, November 22). We think that’s because Walmart launched all of their online deals that evening at midnight.</li> <li>Walmart’s network traffic volume appears to be in line with the overall sales results reported by Adobe: Cyber Monday was bigger than Black Friday.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/5aZIU9vsBa8ose2kqM2sKA/fadf2c48a3545fb1c113195131f42da3/Graph_Walmart-812w.png" alt="Graph_Walmart-812w.png " class="image center" style="max-width: 812px;" /> <h4 id="targets-black-friday-vs-cyber-monday">Target’s Black Friday vs. Cyber Monday</h4> <p>Another large online retailer we looked at — this time via a large residential broadband provider — was Target.com. The provider has instrumented all of their DNS resolvers with Kentik’s kprobe host agent, which lets them examine the application-layer details of all DNS requests from their subscribers. Using that data, we looked at DNS query volume (for target.com and <a href="http://www.target.com">www.target.com</a>) to get a picture of how often this provider’s customers were visiting the Target website.</p> <p>An interesting observation popped out of that graph: contrary to what we saw with Walmart, Target had about 4x greater traffic volume on Black Friday than on Cyber Monday.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6fN7YcuNFu6OOsCm8ys06c/2681ab7ccac98ca0576effbd13108197/Graph_Target-815w.png" alt="Graph_Target-815w.png " class="image center" style="max-width: 815px;" /> <h4 id="amazons-black-friday-vs-cyber-monday">Amazon’s Black Friday vs. Cyber Monday</h4> <p>We also looked at DNS query volume toward Amazon, using DNS to distinguish traffic toward Amazon.com vs. the rest of AWS.
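</p> <p>That distinction is essentially string bucketing on the DNS query names that kprobe records. Here is a minimal sketch of the idea in Python; the hostname patterns are illustrative assumptions, not the exact rules we used:</p> <pre><code class="language-python">from collections import Counter

# Rough sketch: bucket DNS query names into Amazon retail vs. other AWS.
# The suffix lists below are assumptions for illustration, not Kentik's rules.
def classify_dns_query(qname):
    qname = qname.rstrip(".").lower()
    if qname == "amazon.com" or qname.endswith(".amazon.com"):
        return "amazon-retail"
    if qname.endswith((".amazonaws.com", ".cloudfront.net")):
        return "aws-other"
    return "other"

queries = ["www.amazon.com.", "s3.us-east-1.amazonaws.com",
           "d111111abcdef8.cloudfront.net", "www.kentik.com"]
print(Counter(classify_dns_query(q) for q in queries))
# Counter({'aws-other': 2, 'amazon-retail': 1, 'other': 1})
</code></pre> <p>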
And once again we noticed a couple of interesting things about the Amazon traffic:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2sEgSgxoqY2oYWy0OU4GuQ/a2f88761f1269c3713d9f613ef9914fc/Filter_Amazon-400w.png" alt="Filter_Amazon-400w.png " class="image right" style="max-width: 300px;" /> <ul> <li>Compared to the week-before volume (shown as a dashed gray line), traffic volume in general was elevated starting on Thanksgiving Day. No surprise there.</li> <li>And as with Target, traffic on Black Friday had higher overall volume and a higher peak than traffic on Cyber Monday.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/7rie5u8mkwsm4oCikKqmAO/5812372c61caea261f5beff7a6cddbbe/Graph_Amazon-813w.png" alt="Graph_Amazon-813w.png " class="image center" style="max-width: 813px;" /> <h4 id="devices-down-for-dinner">Devices Down for Dinner</h4> <p>With all the emphasis on shopping it’s easy to forget that Thanksgiving is supposed to be about being grateful and gathering as a family. There’s some good news on that front. Our analysis shows that as people came to the dinner table on Thanksgiving, fewer of them appeared to be shopping or using their devices. You can see the dinner-time dip on the multi-day graph above, but below is a zoomed-in view of Thanksgiving Day only, with times shown in Pacific Standard Time (GMT -8).</p> <img src="//images.ctfassets.net/6yom6slo28h2/7JgNJIn3UWOOGK2w8k26C4/ae840144e9ee06e1e6a0c689eeb36359/Graph_Thanksgiving-813w.png" alt="Graph_Thanksgiving-813w.png " class="image center" style="max-width: 813px;" /> <h4 id="conclusion">Conclusion</h4> <p>The value of being able to make observations like these stretches far beyond retailers and the consumers who shop with them. For anyone whose business depends on an online presence, pervasive visibility is critical because network traffic translates directly into revenue. Imagine if the <a href="https://www.kentik.com/level-3-route-leak-what-kentik-saw/">Level 3 route leak</a> had occurred on Black Friday or Cyber Monday. Based on the sales numbers that retailers saw this shopping season — over $11.5 billion in just two days — a network outage could have cost millions of dollars per minute.</p> <p>Understanding the details of network traffic in real time, and accelerating a resolution for any observed issues, can have a huge impact on the bottom line. To learn how to harness the power of Kentik Detect to protect your online revenue stream, <a href="#demo_dialog">request a demo</a> or sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[News in Networking: SD-WAN for $3.3B and Facebook’s Free Open/R Networking Tool]]><![CDATA[This week, we learned the SD-WAN market is forecasted to reach $3.3 billion by 2021. Everyone from Cisco and VMware to AT&T and Charter is taking note. Also this week, Facebook open-sourced its Open/R networking development platform, which it uses for its own wide-area networks, data center fabric and wireless mesh topologies. More after the jump...]]>https://www.kentik.com/blog/7987-2https://www.kentik.com/blog/7987-2<![CDATA[Michelle Kincaid]]>Fri, 17 Nov 2017 16:22:59 GMT<p>This week’s top story picks from the Kentik team.</p> <p>This week, we learned the SD-WAN market is forecasted to reach $3.3 billion by 2021, according to a new report from IHS Markit. Everyone from tech giants like Cisco and VMware to telcos like AT&#x26;T and Charter is taking note and getting in on the trend. Also this week, Facebook open-sourced its Open/R networking development platform.
Facebook uses Open/R to support its wide-area networks, data center fabric and wireless mesh topologies.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://www.sdxcentral.com/articles/news/ihs-cisco-vmware-in-sd-wan-market-two-horse-race/2017/11/"><strong>Cisco, VMware in SD-WAN market ‘two-horse race’</strong></a> <strong>(SDxCentral)</strong><br> “VMware and Cisco have acquired the two SD-WAN market share leader positions, making the SD-WAN market a two-horse race for the number-one spot,” said IHS Markit Analyst Cliff Grossner, according to SDxCentral. “And we could see even more consolidation as vendors set out to add SD‑WAN to their capability sets, especially since the technology is key to supporting connectivity in the multi-clouds that enterprises are building.”</li> <li><a href="http://www.lightreading.com/carrier-sdn/sd-wan/charter-taps-sd-wan-to-boost-off-net-reach/d/d-id/738218"><strong>Charter taps SD-WAN to boost off-net reach</strong></a> <strong>(Light Reading)</strong><br> “Spectrum Business, the enterprise arm of Charter Communications, is jumping into the SD-WAN market, but doing so in a way that is distinctively different from its telco competitors. Spectrum is leveraging its existing Ethernet footprint — it is the fourth largest US Ethernet provider — and a new customer portal to create an SD-WAN overlay using Nuage Networks’ software,” reported Light Reading.</li> <li><a href="https://www.sdxcentral.com/articles/news/att-sd-wan-set-to-gain-network-based-dynamics/2017/11/"><strong>AT&#x26;T makes its SD-WAN ‘dynamic’</strong></a> <strong>(SDxCentral)</strong><br> Also in on the SD-WAN hype is AT&#x26;T. According to SDxCentral, Vice President for Intelligent Edge Josh Goodell said that the company’s over-the-top (OTT) SD-WAN offering, “based on VeloCloud’s technology, can dynamically route traffic across multiple carrier links. The carrier currently counts around 100,000 SD-WAN deployments under an internally-derived static configuration.”</li> <li><a href="https://www.totaltele.com/498623/IoT-will-be-a-cash-cow-for-carriers"><strong>IoT will be a cash cow for carriers</strong></a> <strong>(TotalTelecom)</strong><br> “The Internet of Things (IoT) will open up new opportunities for carriers to sell enhanced data services to a wide range of new subscribers, Huawei’s deputy chairman and rotating CEO Ken Hu said at the Chinese vendor’s Global Mobile Broadband Forum,” according to TotalTelecom.</li> <li><a href="https://www.sdxcentral.com/articles/news/alibaba-cloud-next-gen-data-center-uses-cisco-tech/2017/11/"><strong>Alibaba’s next-gen data center uses Cisco tech</strong></a> <strong>(SDxCentral)</strong><br> “The newest Alibaba Cloud data center in Beijing uses Cisco technology.
It comes as the Chinese cloud provider has vowed to “match or surpass” Amazon Web Services (AWS) by 2019,” reported SDxCentral.</li> <li><a href="https://siliconangle.com/blog/2017/11/16/facebook-open-sources-openr-networking-development-platform/"><strong>Facebook open-sources its Open/R networking development platform</strong></a> <strong>(SiliconANGLE)</strong><br> “We have been working with external partners and operators to support and use Open/R, and we invite more operators, [Internet service providers], vendors, systems integrators and researchers to leverage Open/R as a platform for implementing new network routing ideas and applications,” noted Facebook engineers, according to SiliconANGLE.</li> <li><a href="http://www.reuters.com/article/us-brocade-commns-m-a-broadcom/broadcom-closes-5-5-billion-brocade-deal-idUSKBN1DH1T9"><strong>Broadcom closes acquisition of network gear maker Brocade</strong></a> <strong>(Reuters)</strong><br> “Broadcom Ltd said on Friday it closed its acquisition of network gear maker Brocade Communications Systems Inc, giving it a larger share of the data center products market,” reported Reuters.</li> <li><a href="https://www.totaltele.com/498647/Automated-networks-a-must-for-5G-Vodafone"><strong>Automated networks are a must for 5G, says Vodafone</strong></a> <strong>(TotalTelecom)</strong><br> “Operators will need to harness the full power of artificial intelligence (AI) and process automation technology to cope with the data generated on their 5G networks, according to Johan Wibergh, chief technology officer at Vodafone Group,” reported TotalTelecom.</li> <li><a href="http://www.zdnet.com/article/github-to-devs-now-youll-get-security-alerts-on-flaws-in-popular-software-libraries/"><strong>GitHub adds security alerts on flaws in popular software libraries</strong></a> <strong>(ZDNet)</strong><br> “Development platform GitHub has launched a new service that searches project dependencies in JavaScript and Ruby for known vulnerabilities and then alerts project owners if it finds any. The new service aims to help developers update project dependencies as soon as GitHub becomes aware of a newly announced vulnerability,” noted ZDNet.</li> <li><a href="https://arstechnica.com/information-technology/2017/11/google-fiber-now-sells-55-per-month-gigabit-internet-in-one-city/"><strong>Google Fiber now sells $55/mo gigabit Internet…in one city</strong></a> <strong>(Ars Technica)</strong><br> “Google Fiber’s gigabit Internet service has consistently been priced at $70 a month since it launched in 2012, but it’s now available for just $55 in the ISP’s latest city,” reported Ars Technica.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Level 3 Route Leak: What Kentik Saw]]><![CDATA[As last week's misconfigured BGP routes from backbone provider Level 3 caused Internet outages across the nation, the monitoring and troubleshooting capabilities of Kentik Detect enabled us to identify the most-affected providers and assess the performance impact on our own customers. 
In this post we show how we did it and how our new ability to alert on performance metrics will make it even easier for Kentik customers to respond rapidly to similar incidents in the future.]]>https://www.kentik.com/blog/level-3-route-leak-what-kentik-sawhttps://www.kentik.com/blog/level-3-route-leak-what-kentik-saw<![CDATA[Jim Meehan]]>Wed, 15 Nov 2017 14:00:23 GMT<h3 id="identifying-source-and-scope-with-kentik-detect"><em>Identifying Source and Scope with Kentik Detect</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/4zlhG8Fw2kc6SaaKsWEk4m/357921704fe7ece1990c698bdc0006ef/Outage_heat_map-501w.png" alt="Outage_heat_map-501w.png" class="image right" style="max-width: 400px; margin-bottom: 10px;" /> Last week’s big story in the networking space was the <a href="https://www.wired.com/story/how-a-tiny-error-shut-off-the-internet-for-parts-of-the-us/">route leak</a> at major backbone provider Level 3 Communications, which triggered disruptions to Internet service nationwide. At Kentik, we’re focused on visibility and analytics for an increasingly broad range of networking use cases, so we naturally wondered if we might be able to see the performance impact of this incident on real users accessing the Kentik Detect portal. As we’ve <a href="https://www.kentik.com/kentik-detect-for-kentik-site-reliability/">posted in the past</a>, we operate our own infrastructure and we use our own network traffic intelligence platform — Kentik Detect — to monitor our services. So to answer this question we once again turned to our own product.</p> <h4 id="instrumental-agents">Instrumental Agents</h4> <p>First, let’s touch on some background that will make things clearer when we dig into the specifics of the Level 3 event. All of the front-end nodes that serve our portal are instrumented with our <a href="https://kb.kentik.com/?Bd03.htm">kprobe</a> software host agent. Kprobe looks at the raw packet data in and out of those hosts, turns it into flow records, and passes it into Kentik Detect. One of kprobe’s major advantages is that it can measure performance characteristics — retransmits, network latency, application latency, etc. — that we don’t typically get from flow data that comes from routers and switches. By directly measuring the performance of actual application traffic we avoid a number of the pitfalls inherent to traditional approaches such as ping tests and synthetic transactions, most notably the following:</p> <ul> <li>Active testing agents can’t be placed everywhere across the huge distribution of sources that will be hitting your application.</li> <li>Synthetic tests aren’t very granular; they’re typically performed on multi-minute intervals.</li> <li>Synthetic performance measurements aren’t correlated with traffic volume measurements (bps/pps), so there’s no context for understanding how much a given performance problem affected traffic, revenue, or users.</li> </ul> <p>With an infrastructure that’s pervasively instrumented for actual network performance metrics, the above issues disappear. 
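</p> <p>To make the contrast concrete, here is a toy illustration of the kind of rollup involved in treating retransmits as a loss proxy. The record fields and numbers are invented for the sketch; they are not kprobe’s actual schema:</p> <pre><code class="language-python">from collections import defaultdict

# Toy rollup: per-flow retransmit counts into a per-minute loss-proxy series.
# Field names and values are invented for illustration.
flows = [
    {"ts": 0,  "retransmits": 12,  "out_pkts": 4000},
    {"ts": 30, "retransmits": 450, "out_pkts": 5200},
    {"ts": 70, "retransmits": 900, "out_pkts": 5100},
]

per_minute = defaultdict(lambda: {"retx": 0, "pkts": 0})
for f in flows:
    bucket = f["ts"] // 60  # one-minute buckets
    per_minute[bucket]["retx"] += f["retransmits"]
    per_minute[bucket]["pkts"] += f["out_pkts"]

for minute, m in sorted(per_minute.items()):
    pct = 100.0 * m["retx"] / m["pkts"]  # share of packets retransmitted
    print(f"minute {minute}: {pct:.1f}% retransmitted")
# minute 0: 5.0% retransmitted
# minute 1: 17.6% retransmitted
</code></pre> <p>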
This instrumentation provides huge benefits to our Ops/SRE team (and to our customers!), particularly along three key dimensions:</p> <ul> <li>Our instrumentation metrics and methods provide near-instant notification of performance problems, with relevant details automatically included, potentially saving hours of investigative work.</li> <li>With the necessary details immediately at our fingertips, we are typically able to make changes right away to address an issue.</li> <li>Even if a fix is not immediately possible, the detailed data enables us to react and inform any users that may be impacted.</li> </ul> <p>Kentik delivers our platform on bare metal sitting inside a third-party Internet datacenter. Your applications and infrastructure may look different from ours. You might deliver an Internet-facing service from AWS, Azure, or GCE. Or perhaps you have an application in a traditional datacenter delivered to internal users over a WAN. Or possibly you have a microservices architecture with distributed application components. In all of those scenarios, managing network performance is critical for protecting user experience, and all of the benefits listed above still apply.</p> <h4 id="exploring-the-impact">Exploring the Impact</h4> <p>For analysis of the Level 3 <a href="https://www.kentik.com/kentipedia/bgp-route-leaks/" title="Kentipedia: What are BGP Route Leaks and How to Protect Your Networks Against Them">route leak</a>, we opened the Data Explorer section of the Kentik portal and set a query to look at total TCP retransmits per second (an indirect measurement of packet loss) around the time of the incident on November 6. As you can see from the graph below, there’s a noticeable spike in retransmits starting at 9:40 am PST, exactly the time that the Level 3 route leak was reported to have started.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4Rf880r0GI0KuSuIM8OOAY/07a5428854f5781018476f40aeb3c05d/Total_retransmits-814w.png" alt="Total_retransmits-814w.png" class="image center" style="max-width: 814px;" /> <p>Next we wanted to look at which customer networks were most affected. Building on our original query, we sliced the traffic by Source ASN. Some Kentik users would be inside corporate ASNs (at the office), but with today’s increasingly flexible workforce, there are also many users accessing the portal via consumer broadband ISPs from home, coffee shops, airplanes, etc. In the resulting graph and table (below), we saw lots of Comcast ASNs in the Top-N. This correlates well with anecdotal reports of many <a href="https://www.engadget.com/2017/11/07/comcast-internet-outage-level-3-route-leak/">Comcast customers experiencing reachability problems</a> to various Internet destinations for the duration of the incident.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6KeYJUnsreiaEMQ8QOcGKu/ba0e12dec7c76eb9fcf859cf7afa8adb/Retransmits_by_ASN-816w.png" alt="Retransmits_by_ASN-816w.png" class="image center" style="max-width: 814px;" /> <p>Next we drilled further into the Comcast traffic in order to pinpoint exactly which parts of their network were affected.
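</p> <p>Mechanically, a drill-down like this is just a filter plus a group-by over the enriched flow records. A toy version follows; the records and the small Comcast ASN set are invented sample data:</p> <pre><code class="language-python"># Toy drill-down: restrict flow records to Comcast origin ASNs, then sum
# retransmits by source region. Records and the ASN set are invented sample
# data (Comcast operates many ASNs; 7922 is the best known).
COMCAST_ASNS = {7922, 7015, 33650}

records = [
    {"src_asn": 7922, "src_region": "CA", "retransmits": 840},
    {"src_asn": 7922, "src_region": "WA", "retransmits": 610},
    {"src_asn": 3320, "src_region": "DE", "retransmits": 55},
]

by_region = {}
for r in records:
    if r["src_asn"] in COMCAST_ASNS:
        region = r["src_region"]
        by_region[region] = by_region.get(region, 0) + r["retransmits"]

for region, retx in sorted(by_region.items(), key=lambda kv: -kv[1]):
    print(region, retx)
# CA 840
# WA 610
</code></pre> <p>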
We filtered to Comcast ASNs only (see below) and changed the group-by dimension to look at source regions.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6SLbkWSk4owucO0ss2UgcG/6117a11cbef86649c893b73ed5644a30/Filtering-815w.png" alt="Filtering-815w.png" class="image center" style="max-width: 814px;" /> <p>In the resulting graphs and table (below) we saw that the top affected geographies were centered on California, the Pacific Northwest, and the Northeast. That correlates quite nicely with some of the Comcast outage heat maps that were <a href="https://www.theverge.com/2017/11/6/16614160/comcast-xfinity-internet-down-reports">circulating in the press</a> on November 6 (including the Downdetector.com image at the start of this post).</p> <img src="//images.ctfassets.net/6yom6slo28h2/3kfwLz43M46oe4kUOk6yE2/c7e07b855920b83c5c929ba3df53eb4e/Retransmits_by_region-816w.png" alt="Retransmits_by_region-816w.png" class="image center" style="max-width: 814px;" /> <p>Lastly, we tried to identify the source of the performance issues by looking for a common path between Kentik and the affected origin ASNs. To do so, we looked at retransmits grouped by 2nd hop ASNs (immediately below) and 3rd hop ASNs (below).</p> <img src="//images.ctfassets.net/6yom6slo28h2/1sstKGIDEk26WEKEe4GA2c/be1e264718d95e13522b7b3f32fc6491/2nd_hop-808w.png" alt="2nd_hop-808w.png" class="image center" style="max-width: 808px;" /> <img src="//images.ctfassets.net/6yom6slo28h2/3Ia0mdOafKSMKKIw8UUS2m/12c541d9682190f45939a15c3e71112a/3rd_hop-orig-808w.png" alt="3rd_hop-orig-808w.png" class="image center" style="max-width: 800px;" /> <p>Now the picture became pretty clear: the routes leaked from Level 3 (3rd hop) forced a lot of traffic through NTT America (2nd hop) that would normally have gone to or through Level 3 directly. That appears to have caused a lot of congestion either within NTTA’s network or on the NTTA/Level3 peering interconnects, which caused the packet loss observed in the affected traffic. This explanation — arrived at with just a few quick queries in the portal’s Data Explorer — once again turned out to be well-correlated with <a href="https://dyn.com/blog/widespread-impact-caused-by-level-3-bgp-route-leak/">other media reports</a>.</p> <h4 id="alerting-on-performance">Alerting on Performance</h4> <p>Coincidentally, a few days after the Level 3 incident Kentik released new alerting functionality that enables alert policies to watch, learn, and alarm not only on traffic volume characteristics (e.g. bps or pps) but also on the performance metrics provided by kprobe. If we’d had that functionality at the time of the Level 3 event we would have been proactively notified about the performance degradation that we instead uncovered manually in Data Explorer. And while our customers were able to see the effect of the route leak using bps and pps metrics, performance metrics will make possible even earlier detection with higher confidence. We and our service-operating customers are excited that performance-based alarms will now enable us to deliver an even better customer experience.</p> <p>With digital (and traditional!) enterprises increasingly reliant on the Internet to deliver their business, deep visibility into actual network performance is now essential if you want to ensure that users have a great experience with your service or product. 
Combining lightweight host agents with an ingest and compute architecture built for Internet scale, Kentik Detect delivers the state-of-the-art analytics needed to monitor and troubleshoot performance across your network infrastructure. Ready to learn more? Contact us to <a href="#demo_dialog">schedule a demo</a>, or sign up today for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[News in Networking: The Big Misconfig and IoT’s Enterprise Time Bomb]]><![CDATA[Last year it was the Dyn outage. This week, a network misconfiguration at Level 3 caused Comcast, Spectrum, Verizon, Cox, RCN and other telcos across the country to feel the effects of the outage. Also this week, findings from a Forrester survey revealed IoT is now an “enterprise security time bomb,” with a huge challenge around being able to identify these types of devices on the network. More after the jump...]]>https://www.kentik.com/blog/news-in-networking-the-big-misconfig-and-iots-enterprise-time-bombhttps://www.kentik.com/blog/news-in-networking-the-big-misconfig-and-iots-enterprise-time-bomb<![CDATA[Michelle Kincaid]]>Fri, 10 Nov 2017 20:01:17 GMT<p>This week’s top story picks from the Kentik team.</p> <p>This time last year it was the Dyn outage. This week, a network misconfiguration at Level 3 caused a big stir. Comcast, Spectrum, Verizon, Cox, RCN and other telcos across the country felt the effects of the ISP’s outage. Also this week, findings from a survey conducted by Forrester revealed IoT is now an “enterprise security time bomb,” with a huge challenge around being able to identify these types of devices on the network.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://www.wired.com/story/how-a-tiny-error-shut-off-the-internet-for-parts-of-the-us/"><strong>How a tiny error shut off the internet for parts of the US</strong></a> <strong>(Wired)</strong><br> A network misconfiguration at Level 3 took down providers like Comcast, Spectrum, Verizon, Cox, and RCN on Monday. Level 3, which was just acquired by CenturyLink, said in a statement to Wired that it resolved the issue in about 90 minutes. “Our network experienced a service disruption affecting some customers with IP-based services,” the company said. “The disruption was caused by a configuration error.”</li> <li><a href="https://economictimes.indiatimes.com/tech/internet/tata-communications-betting-big-on-iot-to-spend-100-million-in-2-3-years/articleshow/61481819.cms"><strong>Tata Communications betting big on IoT, to spend $100M in 2-3 years</strong></a> <strong>(The Economic Times)</strong><br> “IoT is a serious business and will become a significant revenue line for the firm going forward,” an SVP of Tata Communications told the Economic Times. According to the report, the telco will invest about $100 million in IoT over the next few years.</li> <li><a href="http://www.zdnet.com/article/iot-devices-are-an-enterprise-security-time-bomb/"><strong>IoT devices are an enterprise security time bomb</strong></a> <strong>(ZDNet)</strong><br> “The majority of enterprise players cannot identify IoT devices on their networks,” according to a new study conducted by analyst firm Forrester.
“IoT is causing serious security concerns for enterprises worldwide with few companies capable of securing them as they are unable to identify devices properly,” wrote ZDNet regarding the study’s findings.</li> <li><a href="https://www.bizjournals.com/sanjose/news/2017/11/07/vmware-acquisitions-could-continue-after-velocloud.html"><strong>Here are more companies VMware may want to buy after VeloCloud</strong></a> <strong>(Business Journal)</strong><br> Following VMware’s acquisition of VeloCloud, the San Jose Business Journal has a list of other companies that could be next on the to-buy list. SentinelOne and Carbon Black are two security companies VMware is potentially interested in, according to the story.</li> <li><a href="https://www.wsj.com/articles/salesforce-bases-new-service-on-amazons-cloud-1462917059"><strong>Salesforce bases new service on AWS</strong></a> <strong>(Wall Street Journal)</strong><br> Salesforce’s annual Dreamforce conference wrapped up today. During the show, the company announced it is building a new service on the AWS cloud, “a notable shift for a company that generally has used its own computing infrastructure,” reported the Wall Street Journal.</li> <li><a href="https://www.networkworld.com/article/3236472/data-center/researchers-developing-building-free-data-centers.html"><strong>Researchers developing building-free data centers</strong></a> <strong>(NetworkWorld)</strong><br> “A new outdoor server farm concept that uses vats of liquid-cooled computers instead of buildings could be literally located in farmland,” according to NetworkWorld.</li> <li><a href="https://www.networkworld.com/article/3235917/lan-wan/a-deep-dive-into-ciscos-intent-based-networking.html"><strong>Cisco SVP gives deep dive into intent-based networking</strong></a> <strong>(NetworkWorld)</strong><br> Cisco SVP Scott Harrell gave NetworkWorld a deep dive on its new, much-talked-about intent-based networking. Speed and agility are two key pieces of the company’s networking strategy, according to Harrell. “You need to be able to respond in the network. It’s critical for the network to rapidly evolve to meet those needs with minimal manual intervention,” he said.</li> <li><a href="https://techcrunch.com/2017/11/10/ibm-passes-major-milestone-with-20-and-50-qubit-quantum-computers-as-a-service/"><strong>IBM makes 20-qubit quantum computing machine available as a cloud service</strong></a> <strong>(TechCrunch)</strong><br> “IBM has been offering quantum computing as a cloud service since last year when it came out with a 5 qubit version of the advanced computers. Today, the company announced that it’s releasing 20-qubit quantum computers, quite a leap in just 18 months,” reported TechCrunch.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Improving Legacy Approaches to Network Capacity Management]]><![CDATA[Managing network capacity can be a tough job for Network Operations teams at Service Providers and Enterprise IT. Most legacy tools can't show traffic traversing the network and trend that data over time.
In this post, we'll look at how Kentik Detect changes all that with new Dashboards, Analytics, and Alerts that enable fast, easy planning for capacity changes and upgrades.]]>https://www.kentik.com/blog/improving-legacy-approaches-to-capacity-managementhttps://www.kentik.com/blog/improving-legacy-approaches-to-capacity-management<![CDATA[Justin Ryburn]]>Mon, 06 Nov 2017 14:00:55 GMT<h3 id="better-tooling-brings-faster-more-reliable-capacity-decisions"><em>Better Tooling Brings Faster, More Reliable Capacity Decisions</em></h3> <p>Anyone who has ever done capacity planning for a network knows the challenges this role brings. How do you figure out when your links, routers, switches, firewalls, and other network infrastructure are going to run out of capacity? In some cases, capacity is ignored until there is an issue because no one has the time to deal with it. <img src="//images.ctfassets.net/6yom6slo28h2/66t7iRaThC2qa0WOuI0KK8/37c2892b86798e4cc1044faff2ccdc7b/capacity-planning.png" alt="capacity-planning.png" class="image right" style="max-width: 360px; padding: 10px; margin: 20px;" /></p> <p>There are, of course, traditional capacity planning tools available that collect counter data and let you set an alert on a static threshold. However, that restricts you to an interface-level view. It does not include the business insights that service providers need, like total capacity to a given external network, total capacity in or out of a given market, or even what type of connectivity is being looked at (transit, backbone, paid peering, or free peering). It also does not cover the needs of large enterprises, like understanding WAN usage, ISP uplink capacity, east-west datacenter hotspots, and inter-datacenter link adequacy. In many cases, capacity planning is a completely manual process with complex spreadsheets to pull in the data, aggregate it, run statistics on it, and trend it over time. There has to be a better way, right?</p> <p>In this blog post, we will take a look at how Kentik Detect can help gain integrated insights into network capacity, utilization, performance, and traffic composition to ensure the best service delivery at minimum cost.</p> <h4 id="kentik-detect-capacity-analytics">Kentik Detect Capacity Analytics</h4> <img src="//images.ctfassets.net/6yom6slo28h2/1iAVQlB05SE48wwKS08miS/6c0e221faeaae6348ca1d14131fd53cb/kentik-detect-capabilities-331w.png" alt="kentik-detect-capabilities-331w.png" class="image right" style="max-width: 240px; margin-bottom: 20px;" /> <p>Kentik recently added a powerful Capacity Analytics capability to the Kentik Detect platform. This feature allows the user to get a quick view of the link utilization across the network, automatically sorted or filtered to expose the most urgent issues, with projected capacity “run out” dates. To see this feature in action, click on the “Analytics” dropdown menu in the Kentik Portal and select “Capacity.”</p> <p>For this example, we will run the report with the default In / Out Dimensions settings of “Source Interface” and “Destination Interface.” We’ll change the Granularity setting to “Showing device level data” instead of “Showing site level data.” This will show the name of the device in the output instead of just the interface name.
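</p> <p>Before we run it, it’s worth seeing how simple the math behind a “run out” date is: fit a trend to recent utilization and solve for when it crosses link capacity. Here is a rough sketch in Python; the sample numbers are invented, and Kentik computes this automatically for every interface:</p> <pre><code class="language-python">from datetime import date, timedelta

# Sketch: project a "run-out" date from weekly p98 utilization samples.
# Sample data is invented for illustration (e.g., a 30 Gbps bundle).
capacity_mbps = 30000
weeks = [0, 1, 2, 3]                    # week index
p98 = [21000, 22900, 25100, 27400]      # weekly p98 utilization, Mbps

# Least-squares fit: p98 = slope * week + intercept
n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(p98) / n
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, p98))
den = sum((x - mean_x) ** 2 for x in weeks)
slope = num / den
intercept = mean_y - slope * mean_x

weeks_to_full = (capacity_mbps - intercept) / slope
run_out = date(2017, 11, 6) + timedelta(weeks=weeks_to_full)
print(f"projected run-out: {run_out} ({weeks_to_full:.1f} weeks out)")
# projected run-out: 2017-12-05 (4.3 weeks out)
</code></pre> <p>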
To look at trending data and run-out date projections, we’ll set the “Trending” switch to “Enabled.” Last but not least, we’ll set a date filter to show only the interfaces that are going to run out of capacity by the end of the year (2017-12-31).</p> <p>For Thresholds, we’ll use the defaults of 90% for Critical, 70% for Warning, and 20% for Display (minimum utilization to appear on the report). We will leave the Time, Devices, and Filtering all to their current default settings which will analyze one week’s worth of data for all devices. Lastly, we click the blue “Run Report” button at the top to see the magic begin.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3svvKGHAzCKCkmgYwIwI0E/292066948b2291dca5b0ccc9e553a92f/run-report-812w.png" alt="run-report-812w.png" class="image center" style="max-width: 812px;" /> <p>In the report output we can see there’s a section for each site (i.e. PoP or router group) we have defined in Kentik Detect. In the “Ashburn, VA” section we will focus on the second line for a device called pe1.iad.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4Vb4ZQMjEkq46CCueiq8aq/960bcf03a174c93120875be509048d0e/pe1iad-812w.png" alt="pe1iad-812w.png" class="image center" style="max-width: 812px;" /> <p>This has a link called ae7 (an aggregated Ethernet bundle) which is a backbone type of link with 30 Gbps of capacity. Kentik automatically discovers the capacity of interfaces (including bundles) via SNMP. The astute reader will see that we have three different ways to analyze utilization: Average, 98th Percentile, and Max. Based on the thresholds that we set, we quickly see which links currently have warning and critical utilization levels in each of the three categories. The average appears to be okay, but looking at the 98th percentile and max levels, we can see we are approaching capacity.</p> <p>If we look at the second line in the traffic columns, we can see how the traffic is growing over time. In this case, the MoM (month-over-month) growth rate is about 11% for the 98th percentile calculation and 10% for the maximum.</p> <p>The last thing we will look at on this report is the projected date at which we will run out of capacity on this link. This is probably the most powerful part of the whole report as it gives us an estimated target date so we can plan to either shift traffic or have an upgrade in place by that date. In this case, the 98th percentile is showing a run-out date of December 11, and the maximum is showing December 6. An action plan is definitely needed by the beginning of December to avoid congestion on this interface. Most capacity planners would need very complex spreadsheets to uncover that date, but Kentik’s Capacity Analytics provides prioritized, automatic calculations for every interface in the network.</p> <h4 id="pro-active-alerting-based-on-trend-data">Pro-Active Alerting Based on Trend Data</h4> <p>Running a report is great, but what if we could get automatic, pro-active notifications when an interface is nearing its capacity? With Kentik Detect’s Alert Policies you can do just that. In fact, we have a pre-built example in our alert policy library to get you started. To access the feature, click on Alerting » Library and scroll down until you see the alerts below. 
Click the highlighted button to copy them from the library into your account’s active policy set.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2vukYqAR2smqe4SKeKaQYO/fcb2255dacc864ac64b3810e8aa380be/alerting-library-812w.png" alt="alerting-library-812w.png" class="image center" style="max-width: 812px;" /> <p>By default, these alerts are set to trigger if any interface has at least 700 Mbps of traffic and is at more than 85% utilization. You can adjust these settings to meet your needs by clicking on the alert to bring up the edit window, and then click the “Alert Thresholds” tab.</p> <img src="//images.ctfassets.net/6yom6slo28h2/68W2elyxJm8wykACYwuME4/6c02f8b4244eba5b43d16fce737f3ef6/conditions-812w.png" alt="conditions-812w.png" class="image center" style="max-width: 812px;" /> <p>At the bottom of the Alert Thresholds page is a spot where you can configure how you want to be notified (email, Slack, Syslog message, JSON Post to a webpage, or PagerDuty notification). This gives you the ability to tie the alerts to a capacity management system you may already have in place.</p> <p>One more thing you may want to adjust is the dashboard that is linked to any alarms that are generated. We will cover more about how to use the dashboard in the next section, but there is a Capacity Management dashboard that might make sense for this type of alert policy. To adjust this, click on “Policy Dashboard” on the “General Settings” tab.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6xUQuHOR6EIykKacuMO88c/7cc0d96bc54639963a44e70c2a4b4a71/policy-dashboard-812w.png" alt="policy-dashboard-812w.png" class="image center" style="max-width: 812px;" /> <p>Once you have an alert for an interface utilization nearing capacity, how do you tell the top talkers on that link? For enterprises, this could be useful for seeing if there is a misconfiguration or misuse that might be causing the bandwidth consumption. For service providers, this is helpful for seeing if a single customer or CDN is using the majority of the capacity on the link.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2TMiPHgi12Is8yMWaqS08A/ae3e0b41a8f82e39825e11a9b5e47f0b/actions.png" alt="actions.png" class="image right" style="max-width: 120px;" /> <p>On the right side of the alert, you will see some action buttons. Click on the highlighted button to open this in the Kentik Data Explorer. When the Data Explorer loads, you will see a graph similar to the one we depicted, showing the total traffic on the link.</p> <img src="//images.ctfassets.net/6yom6slo28h2/76ApPqUZjyCGOGwUqCKiUU/c5b01bb1c5588701e26077d353c61b4e/data-explorer-812w.png" alt="data-explorer-812w.png" class="image center" style="max-width: 812px;" /> <p>In order to dive deeper into the traffic, click in the “Group By Dimensions” field and add some additional details you would like to see. A good place to start is adding Source IP/CIDR, Destination IP/CIDR, and Destination Protocol:IP Port to see your top IP talkers on the link.</p> <p>If you want to learn more about leveraging the power of the Kentik Data Explorer, check out <a href="https://kb.kentik.com/?Db03.htm#Db03-Data_Explorer">our KB topic</a>. For more details on how to tune our alerts to meet your needs, check out our previous blog post on <a href="https://www.kentik.com/kentik-detect-alerting-configuring-alert-policies/">configuring alert policies</a>.</p> <h4 id="capacity-planning-dashboard">Capacity Planning Dashboard</h4> <p>In the previous section, we mentioned a “Capacity Management” dashboard. 
This can be really handy for a capacity planner to get a quick snapshot of the top links across the network and a view of their capacity. To access this dashboard, click on Dashboards » Capacity Management.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7LccRizvMsUw8Aecqg02Mi/7becc98e1bb043af567738a3b2be62a3/capacity-planning-dashboard-812w.png" alt="capacity-planning-dashboard-812w.png" class="image center" style="max-width: 812px;" /> <p>This dashboard is divided into sections by interface capacity (1G, 10G, 20G, 30G, 40G, and 100G) as well as by direction (source or destination). The default time window for the information is the past day. At a quick glance the capacity planner can see what interfaces are nearing capacity as they scroll through the dashboard. As an added bonus, you can export the dashboard by clicking on the icon in the upper-right-hand corner.</p> <img src="//images.ctfassets.net/6yom6slo28h2/54MmMAqdwcu2Gsm4k8oOca/0ee509cb767c9a6cf945299e25f05836/live-update.png" alt="live-update.png" class="image center" style="max-width: 240px;" /> <p>To really round out this feature, you can create dashboard subscriptions to get the output automatically delivered to yourself or your team via email on a schedule you choose. To set up a subscription, click on Admin » Subscriptions » Add Subscription and fill out the form with a report title, recipient email address(es), and delivery schedule.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3KxwoLYDRCiAQSqYsS2g46/c33084f4829a44729a05cb82ebed10a8/add-subscription-812w.png" alt="add-subscription-812w.png" class="image center" style="max-width: 400px;" /> <h4 id="summary">Summary</h4> <p>As we have seen, the Kentik Detect platform gives capacity management teams a much richer toolset to head off congestion, plan and budget for upgrades, and prevent the fire drills that are common in many network operations workflows. If you are already a Kentik customer and have questions on how to get set up with Capacity Analytics and Alert Policies, get in touch with our <a href="mailto:[email protected]">Kentik support team</a>. If you are not currently a Kentik customer but want to benefit from these planning tools, contact us today to <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a>.</p><![CDATA[News in Networking: VMware for SD-WAN, Google’s SDN, and Chevron’s Cloud]]><![CDATA[This week, VMware announced plans to acquire SD-WAN startup VeloCloud. Google released a new Andromeda SDN to reduce its cloud latency. SoftBank, Facebook, Amazon and others made plans to lay a 60-Tbps undersea cable, dubbed Jupiter. And energy giant Chevron selected Microsoft as its preferred cloud provider. Those headlines and more after the jump...]]>https://www.kentik.com/blog/7939-2https://www.kentik.com/blog/7939-2<![CDATA[Michelle Kincaid]]>Fri, 03 Nov 2017 16:18:00 GMT<p>This week’s top story picks from the Kentik team.</p> <p>This week, VMware announced plans to acquire SD-WAN startup VeloCloud. Google released a new Andromeda SDN to reduce its cloud latency. SoftBank, Facebook, Amazon and others made plans to lay a 60-Tbps undersea cable, dubbed Jupiter.
And energy giant Chevron selected Microsoft as its preferred cloud provider.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://techcrunch.com/2017/11/02/vmware-acquires-velocloud-as-it-moves-deeper-into-networking/"><strong>VMware acquires VeloCloud as it moves deeper into networking</strong></a> <strong>(TechCrunch)</strong><br> VMware announced Thursday that it is acquiring VeloCloud, a startup that provides cloud-based WAN. “The companies did not reveal the purchase price,” reports TechCrunch.</li> <li><a href="http://www.eweek.com/networking/google-boosts-performance-of-andromeda-snd-on-cloud-platform"><strong>Google releases new version of Andromeda SDN stack</strong></a> <strong>(eWEEK)</strong><br> “Google says the new version of Andromeda Software Defined Network substantially reduces latency on the company’s cloud platform,” according to eWEEK.</li> <li><a href="https://www.sdxcentral.com/articles/news/cord-300-billion-opportunity-network-edge/2017/11/"><strong>CORD: The $300 billion opportunity at the network edge</strong></a> <strong>(SDxCentral)</strong><br> “CORD, the effort to transform sites at the network edge into modern data centers, represents a $300 billion opportunity for a range of vendors from software developers to white box and networking suppliers and system integrators,” reports SDxCentral.</li> <li><a href="http://www.lightreading.com/automation/colt-automations-silent-killer-is-poor-quality-data/d/d-id/737883"><strong>Automation’s ‘silent killer’ is poor quality data</strong></a> <strong>(Light Reading)</strong><br> Fahim Sabir, the director of architecture and development for UK service provider Colt, told Light Reading and conference attendees this week that “data quality has been a massive concern for Colt Technology Services Group as it has worked on automating parts of its business.”</li> <li><a href="https://www.networkworld.com/article/3235768/hybrid-cloud/cisco-aims-to-simplify-multi-cloud-deployments.html"><strong>Cisco aims to simplify multi-cloud deployments</strong></a> <strong>(NetworkWorld)</strong><br> “Cisco’s new portfolio of solutions takes the complexity out of hybrid and multi-cloud deployments, reduces costs and speeds up time to market,” says analyst Zeus Kerravala in a post for NetworkWorld.</li> <li><a href="http://venturebeat.com/2017/10/30/softbank-facebook-and-amazon-commit-to-8700-mile-transpacific-internet-cable/"><strong>SoftBank, Facebook, Amazon, others to lay 60Tbps undersea cable</strong></a> <strong>(VentureBeat)</strong><br> “Japanese telecommunications giant SoftBank is joining forces with Facebook, Amazon, and a number of other technology companies to build a new 14,000 km (8,700 mile) transpacific subsea cable connecting Asia with North America. The Jupiter cable system will have two landing points in Japan… as well as one at Daet in the Philippines and one near Los Angeles,” reports VentureBeat.</li> <li><a href="https://www.theregister.co.uk/2017/11/01/microsoft_crystalnet_network_simulator/"><strong>Microsoft reveals network simulator that keeps Azure alive</strong></a> <strong>(The Register)</strong><br> “Microsoft has let the world in on one of its key Azure management tools: a simulator designed to help prevent nearly 70 per cent of the bugs that cause network downtime.
The simulator, called CrystalNet, is a design tool Microsoft Research created for its admins to help avoid downtime during routine maintenance and upgrades,” according to The Register.</li> <li><a href="https://www.forbes.com/sites/alexkonrad/2017/10/30/chevron-partners-with-microsoft-in-cloud/#34cf0620684d"><strong>Chevron signs 7-year deal With Microsoft in one of cloud’s biggest wins yet</strong></a> <strong>(Forbes)</strong><br> “Energy giant Chevron has selected Microsoft as its preferred cloud provider in one of the biggest head-to-head wins yet in a developing race between tech leaders for cloud computing customers. Under the terms of the seven-year deal, Chevron will move its development of new applications to Microsoft’s cloud service Azure,” reports Forbes.</li> <li><a href="https://www.networkworld.com/article/3235708/cloud-computing/ibms-latest-private-cloud-is-built-on-kubernetes-and-is-aimed-at-microsoft.html"><strong>IBM’s latest private cloud is built on Kubernetes, and is aimed at Microsoft</strong></a> <strong>(NetworkWorld)</strong><br> “IBM today announced a new version of its private cloud platform that supports the popular open source application container platform Kubernetes. IBM Cloud Private gives customers an option to deploy applications onto the private cloud software in three ways,” says NetworkWorld.</li> <li><a href="http://www.zdnet.com/article/most-loathed-programming-language-heres-how-developers-cast-their-votes/"><strong>Most loathed programming language? Here’s how developers cast their votes</strong></a> <strong>(ZDNet)</strong><br> “Developers on Stack Overflow really don’t want to work in Perl and don’t like Microsoft much either… Other intensely disliked languages include PHP, Object-C, Coffeescript, and Ruby,” according to ZDNet.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Kentik for Site Reliability]]><![CDATA[At Kentik, we built Kentik Detect, our production SaaS platform, on a microservices architecture. We also use Kentik for monitoring our own infrastructure. Drawing on a variety of real-life incidents as examples, this post looks at how the alerts we get — and the details that we're able to see when we drill down deep into the data — enable us to rapidly troubleshoot and resolve network-related issues.]]>https://www.kentik.com/blog/kentik-detect-for-kentik-site-reliabilityhttps://www.kentik.com/blog/kentik-detect-for-kentik-site-reliability<![CDATA[Jim Meehan]]>Mon, 30 Oct 2017 13:00:05 GMT<h3 id="troubleshooting-our-saas-with-our-own-platform"><em>Troubleshooting Our SaaS… With Our Own Platform</em></h3> <p>Effective Site Reliability practices include monitoring and alerting to improve incident response. Here at Kentik, we use a microservices architecture for our production Software as a Service (SaaS) platform, and we also happen to have a great solution for monitoring and alerting about the performance of that kind of application. In this post, we’ll take four real incidents that occurred in our environment, and we’ll look at how we use Kentik — “drinking our own champagne” — to monitor our stack and respond to operational issues.</p> <h4 id="issue-1-pushing-code-to-repository">Issue 1: Pushing Code to Repository</h4> <p>Our first incident manifested as an inability to push code to a repository. 
The build system reported: Unexpected status code [429]: Quota Exceeded, and our initial troubleshooting revealed that it couldn’t connect to the Google GCE-hosted container registry, gcr.io. But the GCE admin console showed no indication of quota being exceeded, errors, expired certificates, or any other cause. To dig deeper, we looked at what Kentik could tell us about our traffic to/from gcr.io. In Data Explorer, we built a query, using the time range of the incident, with “Full Device” as the group-by dimension, and we filtered the query down to the IP address for gcr.io.</p> <img src="//images.ctfassets.net/6yom6slo28h2/42PKMZCcsEWU6GwEc2UmuO/f59e8e0a1615265bee494d9ef76a0166/Explorer_top_device-818w.png" class="image center" style="max-width: 800px" alt="" /> <p>As we can see in the output, two hosts (k122/k212) were sending a relatively high rate of pps to gcr.io but then stopped shortly after 11:00 UTC. It turns out that k122/k212 are development VMs that were assigned to our summer interns. Once we talked to the interns we realized that a registry project they were working on had scripts that were constantly hitting gcr.io. The astute reader has probably already realized that an HTTP response code of 429 means we were being rate limited by gcr.io because of these scripts. Without the details that we were able to query for in Kentik, this type of root-cause analysis would have been difficult to impossible.</p> <h4 id="issue-2-load-spike">Issue 2: Load Spike</h4> <p>The next incident was brought to our attention by an alert we had set up in Kentik to pro-actively notify us of traffic anomalies. In this case, the anomaly was high bps/pps to an ingest node (fl13) that deviated from the historical baseline and correlated with high CPU on the same node.</p> <img src="//images.ctfassets.net/6yom6slo28h2/C1GsEV2TIW0qgqI8AUqcW/69539f46b3fb68d646259090a7157d24/Alert_issue_2-809w.png" class="image center" style="max-width: 800px" alt="Alert" /> <p>We turned again to Data Explorer to see what Kentik Detect could tell us about this increase in traffic, building a query using “Source AS Number” for the group-by dimension and applying a filter for the IP of the node that was alerting.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1VWLjNDWjuIwoi6SkqqowE/9b75434bbf37dcea7442707ebbb4bc30/Explorer_src_AS-820w.png" class="image center" style="max-width: 800px" alt="Explorer" /> <p>As you can see from the resulting graph, there was a big spike in traffic around 22:15 UTC. Looking at the table, we can see that traffic was coming from an ASN that we’ve anonymized to 123456. Having a good pro-active alert was hugely helpful, because it allowed us to quickly understand where the additional traffic was coming from and which service it was destined to, to verify that this node was handling the additional load adequately, and to know where to look to verify other vitals. Without this alert, we might never have been able to isolate the cause of the increase in CPU utilization. Using Kentik Detect we were able to do so in under 10 minutes.</p> <h4 id="issue-3-query-performance">Issue 3: Query Performance</h4> <p>Like most SaaS companies, we closely monitor our query response times to make sure our platform is responding quickly to user requests, 95 percent of which are returned in under 2 seconds. Our next incident was discovered via an alert that triggered because our query response time increased to more than 4 seconds.
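</p> <p>That check is conceptually just a percentile computed over a sliding window of recent response times. A toy version in Python, with invented samples and an invented 2-second target:</p> <pre><code class="language-python">from math import ceil

# Toy latency alarm: nearest-rank p95 over a sliding window of query
# response times (seconds). Samples and the target are invented.
def p95(samples):
    ordered = sorted(samples)
    return ordered[ceil(0.95 * len(ordered)) - 1]

window = [0.4, 0.7, 1.1, 0.9, 4.6, 0.8, 5.2, 0.6, 0.5, 4.9]
if p95(window) > 2.0:
    print(f"ALARM: p95 query latency {p95(window):.1f}s exceeds 2s target")
# ALARM: p95 query latency 5.2s exceeds 2s target
</code></pre> <p>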
We simultaneously had a network bandwidth alert showing more than 20 Gbps of traffic among 20 nodes. Drilling down in Data Explorer, we were immediately able to identify the affected microservice. We graphed the traffic by the source IP (sub-query processes) hitting our aggregation service, which revealed a big spike in traffic. Our aggregation service hadn’t been designed to handle 50+ workers responding simultaneously during a large query over many devices (flow sources).</p> <img src="//images.ctfassets.net/6yom6slo28h2/62sOmcROWQeSWgaom4c6cw/29bd7a482186bfdbc39c61fcff1ae1fd/Explorer_src_IP-817w.png" class="image center" style="max-width: 800px" alt="" /> <p>We also looked at this traffic by destination port and saw that, in addition to the spikes on port 14999 (aggregation service), there was a dip on 20012 (ingest), which is a service running on the same node. The dip indicates that data collection was also affected, not just query latency.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1SJcEQI3iU0scmcgcOuqkq/d9cfd9f9a4480cc8b3f0c67caf2d5238/Explorer_dst_port-815w.png" class="image center" style="max-width: 800px" alt="" /> <p>Because Kentik Detect gave us detailed visibility into the traffic between our microservices, we were able to troubleshoot an issue in under 30 minutes that would otherwise have taken us hours to figure out. And based on the insight we gained, we were able to tune our aggregation service pipeline control to prevent a recurrence.</p> <h4 id="issue-4-high-source-ip-count-to-internal-ip">Issue 4: High Source IP Count to Internal IP</h4> <p>The final incident we’ll look at isn’t technically a microservices issue, but it’s something that most network operators who deal with campus networks will be able to relate to. Once again, we became aware of the issue via alerts from our anomaly detection engine, with three alarms firing at the same time:</p> <ul> <li>High number of source IPs talking to our NAT</li> <li>High number of source IPs talking to a proxy server</li> <li>High number of source IPs talking to an internal IP</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/6N274pdxy8iCcYcIyqieqW/6c95f0a5d01fef014f049cbe74ecfd2e/Alert_issue_4-810w.png" class="image center" style="max-width: 800px" alt="" /> <p>Our initial investigation revealed that the destination IP was previously unused on our network.
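</p> <p>The fan-in counting behind those three alarms is conceptually simple. Here’s a minimal sketch (purely illustrative, not Kentik’s detection engine) of counting distinct source IPs per destination from flow records and flagging destinations whose counts jump far past a baseline:</p> <pre><code class="language-python">from collections import defaultdict

def source_ip_fanin(flows):
    """Count distinct source IPs per destination IP.

    `flows` is an iterable of (src_ip, dst_ip) pairs parsed from flow records.
    """
    sources = defaultdict(set)
    for src_ip, dst_ip in flows:
        sources[dst_ip].add(src_ip)
    return {dst: len(srcs) for dst, srcs in sources.items()}

def flag_high_fanin(fanin, baseline, factor=10):
    """Return destinations whose unique-source count exceeds factor x baseline."""
    return [dst for dst, count in fanin.items()
            if count > factor * baseline.get(dst, 1)]
</code></pre> <p>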
Given all the recent news of data breaches, we wondered whether a compromise had taken place.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1WAKsNpPM8WqmgeGWaCyMg/1b98df22a6803c7f2f1d1bfa615ba6ad/Open_dashboard-819w.png" class="image center" style="max-width: 800px" alt="" /> <p>On our Active Alerts page, we used the Open in Dashboard button (outlined in red in the screenshot above) to link to the dashboard associated with the alert, where we were able to quickly see a profile of the traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Rlvu4sCdiygm2kS2m0AAc/30027a5bf1a3c2274611a6992294662d/Dashboard_issue_4-810w.png" class="image center" style="max-width: 800px" alt="" /> <p>The emerging traffic profile looked very much like a normal workstation — except that it appeared to be involved in some kind of digital currency mining, and it was traversing the same Internet transit as our production network!</p> <img src="//images.ctfassets.net/6yom6slo28h2/1XWK7DznvSoGAecc0C6e2C/80a78e3f41e977fc0cf5c275502458f9/DNS_query_table-809w.png" class="image center" style="max-width: 800px" alt="" /> <p>Using a Data Explorer query to reveal the DNS queries made by this host, we were able to confirm this suspicion.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1k60pjy84sCCe8Keqqe8YC/17e554e5ca54cdfd60986f3d19528f12/DNS_query_table-809w.png" class="image center" style="max-width: 800px" alt="" /> <p>After disabling the port, we investigated and discovered that one of our remote contractors had connected to the WiFi in our datacenter. We have since locked down the network. (We still love this contractor; we just don’t want him using our network to mine virtual currencies.) Without the alerts from Kentik Detect, and the ability to drill down into the details, it would likely have taken much longer for us to learn about and resolve this rogue host incident.</p> <h4 id="summary">Summary</h4> <p>The incidents described above provide a high-level taste of how you can use Kentik Detect to monitor network and application performance in the brave new world of distributed compute, microservices, hybrid-cloud, and DevOps. If you’re an experienced Site Reliability/DevOps engineer and you’re intrigued by this post, you might be just the kind of person we need on our SR team; check out our <a href="https://www.kentik.com/careers/">careers page</a>. If you’re an existing customer and would like help setting up to monitor for these types of incidents, contact our <a href="mailto:[email protected]">customer success team</a>. And if you’re not already a Kentik customer and would like to see how we can help you monitor in your environment, <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Configuring Kentik for SSO]]><![CDATA[As security threats grow more ominous, security procedures grow more onerous, which can be a drag on productivity. In this post we look at how Kentik's single sign-on (SSO) implementation enables users to maintain security without constantly entering authentication credentials.
Check out this walk-through of the SSO setup and login process to enable your users to access Kentik Detect with the same SSO services they use for other applications.]]>https://www.kentik.com/blog/configuring-kentik-for-ssohttps://www.kentik.com/blog/configuring-kentik-for-sso<![CDATA[Greg Villain & Philip De Lancie]]>Mon, 23 Oct 2017 13:36:07 GMT<h2 id="single-sign-on-enhances-user-security-and-convenience"><em>Single Sign-on Enhances User Security and Convenience</em></h2> <p>As a user of network-connected services, you’re probably familiar with the dilemma: security threats grow more ominous, so security procedures grow more onerous, creating an increasing drag on productivity. To help make it more convenient to maintain security without constantly entering authentication credentials, Kentik has enabled single sign-on (SSO) for the Kentik Detect portal. That means Kentik users are now able to access the portal via the same authentication services they use for other SSO-enabled applications, allowing them to access many services with just one sign-on.</p> <p>Kentik’s new SSO implementation is (exclusively) compliant with standard SAML2 transport, which is sometimes referred to as “Federated Identity Management.” In the SAML2 terminology:</p> <ul> <li>Kentik is the “service provider” (SP).</li> <li>Your SSO is the “identity provider” (IdP).</li> </ul> <p>Kentik’s SSO implementation has been successfully tested with the following identity providers:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2QiBmKE1YAssSEg0AyWis6/ce394e890dd14b63eafe376f66cf3d4b/SSO-Logos-824w.png" class="image center no-shadow" style="max-width: 824px" alt="SSO logos" /> <p>In the rest of this post, we’ll look at how SSO works, how to configure it for your Kentik account (which requires a “Super Admin” user), and how your users will sign on once you’ve enabled SSO.</p> <h2 id="how-sso-works">How SSO works</h2> <p>SSO is conceptually quite simple. Each Kentik customer (organization) sets up an identity provider that keeps track of who in the company has permission to access Kentik Detect. (These Kentik users may be categorized into groups to facilitate differentiated management by role.) As shown in the following diagram, when a user attempts to log into the portal via the SSO login URL, Kentik finds the IdP in the company’s Kentik Detect SSO settings and contacts the IdP to request verification of the user. If the IdP is able to authenticate the user, a SAML2 response is returned to Kentik and Kentik logs the user in. If the IdP can’t authenticate, the user is unable to access the portal via SSO.</p> <img src="//images.ctfassets.net/6yom6slo28h2/62HWW0k0z6Ug2CUMoCwY86/1e7267c8120476c26d9205b88400554a/SSO-Overview_diagram-813w.png" class="image center no-shadow" style="max-width: 813px" alt="SSO overview diagram" /> <h3 id="sso-configuration-prerequisites">SSO Configuration Prerequisites</h3> <p>Two prerequisites must be met before you can successfully configure your Kentik account for SSO:</p> <ul> <li><strong>Identity provider account</strong>: You must have one of the following:<br> - <em>Existing identity provider account</em>: An account with an existing SAML2 identity provider (e.g. Okta, OneLogin, Ping Federate, Google GSuite, Duo, etc.)
and a directory of users.<br> - <em>In-house identity management</em>: A self-operated SAML2-compatible IdP or identity gateway, such as Shibboleth.<br> <strong><em>Note:</em></strong> For users requiring LDAP or Active Directory as an authentication backend, we recommend Shibboleth, which has open source LDAP and AD extensions.</li> <li><strong>Kentik Super Admin user</strong>: At least one user in your organization must have a Super Admin account configured in the Kentik Detect portal (<a href="https://portal.kentik.com">https://portal.kentik.com</a>).</li> </ul> <h4 id="about-super-admin-users">About Super Admin Users</h4> <p>Super Admin users are equivalent to Admin users, with the following additional privileges:</p> <ul> <li>They can configure SSO from the portal’s Single Sign-on page (Admin » Single Sign-on).</li> <li>When SSO is required, Super Admins are the only ones who can still use non-SSO login (e.g. username + password, or username + password + 2FA), allowing a Super Admin to disable SSO in case of an identity provider failure.</li> <li>They can turn other users into Super Admin users.</li> </ul> <p>To prevent a single point of failure, we recommend that you set up two Super Admins so that when one is unavailable, you can still reach the other. We don’t recommend more than two, however, because it’s wise to restrict the number of users that are allowed to log in using the traditional username/password approach.</p> <p>Any Admin-level user in a given organization can check who the Super Admin users are by looking at the <strong>Level</strong> column in the <strong>User List</strong> (Admin » Users). If no user is a Super Admin, please contact <a href="mailto:[email protected]">Kentik support</a> to request that a Super Admin be designated for your organization.</p> <p><strong><em>Note:</em></strong> If your organization signed up with Kentik prior to October 2017, the first user registered to your account will be automatically set as a Super Admin (to change, go to Admin » Users).</p> <h3 id="kentik-sso-configuration">Kentik SSO Configuration</h3> <img src="//images.ctfassets.net/6yom6slo28h2/5HfPZBnFfya80Yqso8Eici/fb2fa7520e9c2e8b6d963fb146b8c3b0/SSO-Admin_sidebar-216w.png" class="image right" style="max-width: 216px" alt="SSO admin sidebar" /> <p>Now let’s assume that you are a Super Admin ready to dig into SSO configuration. When you click <strong>Admin</strong> from the portal navbar, the <strong>Security</strong> section of the sidebar at left will include a link to the <strong>Single Sign-on</strong> page. All of the configuration steps below are performed either on that page or in your identity provider’s management app.</p> <p>The settings on the page are divided into two main sections: a set of switches at top and a set of fields below. Before configuration, check that the <strong>SSO Enabled</strong> switch at top is set to Off (default), so that you can complete the settings before actually turning on SSO.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5bO6UJg4y42i0WgGSwA8qA/a8af1f4ee6d0d8218e590e0e54c0163f/SSO-SSO_enabled-559w.png" class="image center" style="max-width: 559px" alt="SSO enabled" /> <h4 id="add-kentik-to-your-idp">Add Kentik To Your IdP</h4> <p>SSO involves two-way communication between Kentik and the identity provider, which requires that each is aware of the other.
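</p> <p>To make that mutual awareness concrete, here is a deliberately simplified Python sketch of the service-provider side of the SAML2 flow described above. This is not Kentik’s code, and the URLs and helper names are hypothetical placeholders, but the two steps it shows (redirect the user’s browser to the IdP, then consume the IdP’s response at the ACS URL) are the standard SAML2 web-browser SSO exchange:</p> <pre><code class="language-python">import base64
import zlib
from urllib.parse import urlencode

# Placeholder value; in practice this is the "IDP SSO Url" from your IdP.
IDP_SSO_URL = "https://idp.example.com/sso"

def redirect_url_for_login(authn_request_xml):
    """Build the IdP redirect for an SP-initiated login (HTTP-Redirect binding).

    The AuthnRequest XML is deflated, base64-encoded, and passed as SAMLRequest.
    """
    deflated = zlib.compress(authn_request_xml.encode())[2:-4]  # raw DEFLATE
    token = base64.b64encode(deflated).decode()
    return IDP_SSO_URL + "?" + urlencode({"SAMLRequest": token})

def handle_acs_post(form):
    """Consume the IdP's POST to the SP's ACS URL and return the response XML.

    A real service provider must also validate signatures and assertions
    before logging the user in, which is what the settings described below
    control on the Kentik side.
    """
    return base64.b64decode(form["SAMLResponse"]).decode()
</code></pre> <p>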
The information you’ll need to configure your IdP to recognize Kentik Detect as a service provider (SP) is found in the first two fields:</p> <ul> <li><strong>SP Entity Id</strong>: A unique identifier for Kentik.</li> <li><strong>SP ACS Url</strong>: The endpoint of the Assertion Consumer Service at Kentik, which is the URL to which the IdP should post its response.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/4KD5vecIdikwoM4kK2Yye8/311ab6cba1c9f4d2ca790d2fdfd66ef0/SSO-Identity_provider-560w.png" class="image center" style="max-width: 560px" alt="" /> <p>Note that some IdP solutions, including Shibboleth, can take the above information from an XML configuration file. We’ve provided a ready-made config file for that purpose, which you can download directly from the <strong>Single Sign-on</strong> page via the <strong>Download Kentik SP Metadata</strong> button at the bottom.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2L3Cs179RuCYOWwcCwYWgA/3b7d8ab188b56746bcaeac504b355bc3/SSO-XML_download-279w.png" class="image center" style="max-width: 279px" alt="SSO XML download" /> <h4 id="configure-idp-settings-in-kentik">Configure IdP Settings in Kentik</h4> <p>Once you’ve added Kentik to your IdP, go back to the <strong>Single Sign-on</strong> page to set IdP-related settings with the following controls:</p> <ul> <li><strong>IDP SSO Url</strong> (required): A field to enter the IdP entry-point URL, which is the IdP URL to which Kentik redirects the browser when a user initially attempts to log in.</li> <li><strong>Email Attrib. Key</strong> (required): A field to enter the IdP’s email attribute key, which tells Kentik where to find the user’s email in the IdP’s response to an authentication request.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/63ibl2mHmwGSEcug6E8SeK/0cf859fc02ef8e3e89766009394069eb/SSO-IDP_url_Email_key-559w.png" class="image center" style="max-width: 559px" alt="SSO-IDP url Email key" /> <ul> <li><strong>Encrypt Assertions</strong> (default = Off): Specify whether or not you want SAML assertions (authentication, attribute, and/or authorization decision statements) in a response from the IdP to be encrypted.</li> <li><strong>IdP Signing Cert</strong> (optional): A field to enter the IdP’s public signing key. If a signing cert is provided in this field, Kentik will reject any response for which there is either no signature or a signature that can’t be verified. If no signing cert is provided, then Kentik will not require a signature or attempt to verify signed responses.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/66YSIGdQNq0USSGuGsqKIi/75a80203491a6e5f85cd6f8bd748d630/SSO-Encrypt_Cert-559w.png" class="image center" style="max-width: 559px" alt="SSO-Encrypt Cert" /> <h4 id="configure-user-level-behavior">Configure User-level Behavior</h4> <p>At this point, you may also want to set the optional <strong>User Level Attrib. Key</strong>, which is a field to enter your IdP’s user-level attribute key. If the IdP’s response to an authentication request includes an IdP-specified user level, this setting tells Kentik where to find it. That allows user levels to be managed from the IdP:</p> <ul> <li>If this field is left blank, or the field is specified but the IdP-provided value is invalid, then the field will be ignored, meaning that the user level will be determined by Kentik’s internal user-level value for the user.
The only valid (Kentik-recognized) values for the key are 0 (Member) and 1 (Admin).</li> <li>If the IdP does provide a valid level for a given user, then at login Kentik’s internally stored user-level value for that user will be overwritten with the IdP-provided value. For example, if a user that’s registered in Kentik as an Admin is identified by the IdP as a Member, then that user will become a Member and no longer have access to Admin privileges.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/6d8tGUYekMYoGuuywugi46/4fab0d8d900e5426474b80d336f137bb/SSO-User_level-559w.png" class="image center" style="max-width: 559px" alt="SSO User Level" /> <p>The fact that all values other than 0 (Member) and 1 (Admin) are ignored prevents an existing user level in Kentik from being overwritten with an invalid level from the IdP. It also means (because there is no valid value representing the Super Admin level) that a user’s level can’t be changed to Super Admin via the IdP (this is intentional, to discourage the automatic creation of excessive Super Admin users).</p> <p>Keep the following in mind when considering how to manage user levels with your IdP:</p> <ul> <li>If the IdP provides a valid user-level key, then any user level that is changed directly in Kentik (via portal or API) will be reset at the next SSO login to the IdP-provided level.</li> <li>If a Super Admin user is included in an IdP group whose user level is collectively set via an IdP user-level key, then that user will lose Super Admin privileges.</li> <li>In cases where the user-level values used by your IdP are not Kentik-valid (e.g. true and false rather than 1 and 0), your IdP may enable you to configure a transform that makes the value Kentik-valid before including it in the SAML assertion sent to Kentik.</li> </ul> <h4 id="configure-auto-provisioning">Configure Auto-provisioning</h4> <p>The <strong>Auto-provisioning</strong> switch (default = Off) determines what happens when sign-on is attempted by someone who is successfully authenticated by the IdP but is not already registered with Kentik as a user (they don’t currently exist in Admin » User):</p> <ul> <li>If set to On, login will be allowed and the user will be automatically registered with Kentik.</li> <li>If Off, login will be denied.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/5khtFjZVgkcC4KAwyKeqkW/761e9ee40971a37b845b70f3d6de8d4a/SSO-Auto_provision-559w.png" class="image center" style="max-width: 559px" alt="SSO-Auto provision" /> <p>If you decide to use auto-provisioning, you’re most likely to achieve the expected results by taking into account the following:</p> <ul> <li>If the <strong>User Level Attribute Key</strong> field is blank or a valid user-level key is not found in the IdP response, then the user level assigned to auto-provisioned users will be Member.</li> <li>There is currently no mechanism to auto-provision the Full Name field in Kentik’s internal record for each user. Instead, the Full Name of auto-provisioned users will be set to the IdP-provided email address (see <strong>Email Attrib. Key</strong> in <a href="#BL75-Configure_IdP_Settings_in_Kentik"><strong>Configure IdP Settings in Kentik</strong></a>), after which it can be changed directly in Kentik (portal or API).</li> <li>You can’t “auto-deprovision” a user. In other words, removing a user from the IdP does not remove that user from Kentik’s internal list of the organization’s users.
That has to be done from the portal (Admin » User) or via Kentik’s <a href="https://kb.kentik.com/v0/Ec06.htm">User API</a>.</li> </ul> <h4 id="fine-tune-the-configuration">Fine-tune the Configuration</h4> <p>In addition to the settings above, you’ll also find some additional settings to tailor the configuration to your organization’s specific needs:</p> <ul> <li><strong>SSO Required</strong> (default = Off):<br> - If set to On, all users (except for Super Admins) will be required to log in via SSO.<br> - If Off, users may log in with either SSO or standard login.</li> <li><strong>Disable 2FA</strong> (default = Off):<br> - If On, two-factor authentication will be disabled whenever SSO is enabled.<br> - If Off, 2FA users who sign on via SSO will still need to input a one-time password from their 2FA source (e.g. Google Authenticator).</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/5nC1KqVgPYKGEkCGuS4IYy/7511d6666ab6e833e0bf2fba5796f34d/SSO-Required_disable-559w.png" class="image center" style="max-width: 559px" alt="SSO-Required" /> <h4 id="sso-liftoff">SSO Liftoff</h4> <p>Once you’ve got all of your settings defined, you’re ready to set the <strong>SSO Enabled</strong> switch to On. If needed, you can turn SSO off at any time without losing any of the settings you’ve made.</p> <h3 id="sso-login-process">SSO Login Process</h3> <p>Once SSO is enabled, logins will take place at a newly created URL that is specific to your organization. In the following example, _company_shortname_ is a placeholder for the actual value, which is the last segment of the URL shown in the <strong>SP Entity Id</strong> or <strong>SP ACS Url</strong> field (see <a href="#BL75-Add_Kentik_To_Your_IdP"><strong>Add Kentik To Your IdP</strong></a>):</p> <p><code class="language-text">https://portal.kentik.com/sso/_company_shortname_</code></p> <p>When users land on your Kentik SSO login gateway page:</p> <ul> <li>If they already have a valid active session (as defined by the IdP), they will be automatically logged into the Kentik Detect portal.</li> <li>If they don’t already have a current session, they will be redirected to the IdP’s login screen and then back to the Kentik Detect portal upon successful authentication.</li> </ul> <h3 id="migrating-to-sso">Migrating to SSO</h3> <p>The easiest way to transition from plain authentication to SSO is to leverage the <strong>SSO Required</strong> configuration switch and proceed in three steps:</p> <ol> <li>Configure SSO with <strong>SSO Required</strong> set to Off (default). Perform all of the needed tests and validate the optional features with a single user, typically the staff member (a Super Admin) in charge of SSO.</li> <li>Send an announcement to your organization’s Kentik user base.
Include the following:<br> - A clear date on which Kentik access will be only via SSO.<br> - Contact info for the Super Admin, so that users can ask for help before the cutoff date.<br> - The new login URL for Kentik access via SSO (replace the _company_shortname_ placeholder with the last segment of the URL shown in the <strong>SP Entity Id</strong> or <strong>SP ACS Url</strong> field):<br> <a href="https://portal.kentik.com/sso/_company_shortname_">https://portal.kentik.com/sso/_company_shortname_</a></li> <li>On the cutoff date, flip the <strong>SSO Required</strong> switch to On, after which Kentik users will only be able to log in using the SSO URL.<br> <strong><em>Note:</em></strong> Once SSO is required, users attempting non-SSO access (<a href="https://portal.kentik.com/login">https://portal.kentik.com/login</a>) will be denied access.</li> </ol> <h3 id="conclusion">Conclusion</h3> <p>With the addition of SSO, Kentik has made it significantly easier to use Kentik Detect securely without the hassle of separate authentication. If you’re already a Kentik customer, ask your Kentik support team (<a href="mailto:[email protected]">[email protected]</a>) to help get you started. If you’re not already a customer, find out what you’re missing by signing up today for a <a href="#signup_dialog">free trial</a> or contacting us to <a href="#demo_dialog">request a demo</a>.</p><![CDATA[News in Networking: Russian Internet for North Korea, Google Finds DNS Vulnerabilities]]><![CDATA[Reports this week suggest a Russian telco started providing internet connectivity to North Korea. Russia also made headlines for its covert efforts to steal secrets from the NSA. Oracle made a bunch of news with its OpenWorld conference this week, including taking aim at AWS. And Google disclosed seven vulnerabilities in the Dnsmasq DNS software. More headlines after the jump...]]>https://www.kentik.com/blog/news-in-networking-russian-internet-for-north-korea-and-dns-vulnerabilitieshttps://www.kentik.com/blog/news-in-networking-russian-internet-for-north-korea-and-dns-vulnerabilities<![CDATA[Michelle Kincaid]]>Fri, 06 Oct 2017 15:51:36 GMT<p>This week’s top story picks from the Kentik team.</p> <p>Reports this week suggest a Russian telco started providing internet connectivity to North Korea. Russia also made headlines for its covert efforts to steal secrets from the NSA. Oracle made a bunch of news with its OpenWorld conference this week, including taking aim at AWS. And Google disclosed seven vulnerabilities in the Dnsmasq DNS software.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="http://www.38north.org/2017/10/mwilliams100117/"><strong>Russia Provides New Internet Connection to North Korea</strong></a> <strong>(38 North)</strong><br> “A major Russian telecommunications company appears to have begun providing an internet connection to North Korea.
The new link supplements one from China and will provide back-up to Pyongyang at a time the US government is reportedly attacking its Internet infrastructure and pressuring China to end all business with North Korea,” reports Johns Hopkins blog 38 North.</li> <li><a href="https://arstechnica.com/information-technology/2017/10/the-cases-for-and-against-claims-kaspersky-helped-steal-secret-nsa-secrets/"><strong>Russia Reportedly Stole NSA Secrets with Help of Kaspersky</strong></a> <strong>(Ars Technica)</strong><br> While not yet confirmed, a story out Thursday suggests, “Hackers working for the Russian government stole confidential material from a National Security Agency contractor’s home computer after identifying files through the contractor’s use of antivirus software from Moscow-based Kaspersky Lab.”</li> <li><a href="https://www.wsj.com/articles/oracle-chairman-larry-ellison-takes-aim-at-amazon-1506916787"><strong>Oracle Corp.’s Larry Ellison Takes Aim at Amazon</strong></a> <strong>(Wall Street Journal)</strong><br> Oracle OpenWorld began on Sunday. In his opening keynote, company chairman Larry Ellison took aim at Amazon AWS. He “ran through several demonstrations of Oracle Database 18c, saying customers would pay several times more using Amazon’s technology,” reports the WSJ. Oracle’s new database, due out in December, “will autonomously provision only the computing resources as customers need them… [and] Oracle will guarantee its bill will be less than half what Amazon would charge customers for a similar service and less than 30 minutes per year of downtime.”</li> <li><a href="http://www.executivegov.com/2017/10/nist-dhs-partner-to-establish-internet-routing-system-security-standards/"><strong>NIST, DHS Partner to Establish Internet Routing System Security Standards</strong></a> <strong>(ExecutiveGov)</strong><br> The U.S. National Institute of Standards and Technology and the Department of Homeland Security have worked together to create new internet security standards. “BGP as currently deployed has no built-in security mechanisms, so it is common to see examples of ‘route hijacks’ and ‘path detours’ by malicious parties meant to capture, eavesdrop upon or deny legitimate internet data exchanges,” Doug Montgomery, an NIST computer scientist, said in regards to the new standards.</li> <li><a href="https://mackeepersecurity.com/post/nfl-players-association-exposed-personal-data"><strong>Misconfig in Elasticsearch Leaks NFL Data</strong></a> <strong>(MacKeeper)</strong><br> A misconfigured Elasticsearch database was “used to collect data from Orchard Audit module that is tracking/analyzing user activity on a number of NFL related domains (mostly, nflpa.com) and sending back to the Elasticsearch for analysis,” according to the MacKeeper blog.</li> <li><a href="https://www.coindesk.com/cloudflare-suspends-website-using-cryptocurrency-miner-malware/"><strong>Cloudflare Suspends Torrent Website for Cryptocurrency Miner ‘Malware’</strong></a> <strong>(CoinDesk)</strong><br> Torrent site ProxyBunker said Cloudflare removed “all its relevant domains due to a miner hiding in the website’s code.
A portal to other torrent sites, ProxyBunker had been running the ‘Coinhive’ monero miner for four days prior to the suspension,” reports CoinDesk.</li> <li><a href="https://security.googleblog.com/2017/10/behind-masq-yet-more-dns-and-dhcp.html"><strong>Yet More DNS and DHCP Vulnerabilities</strong></a> <strong>(Google Blog)</strong><br> A new blog post from Google talks about Dnsmasq, which “provides functionality for serving DNS, DHCP, router advertisements and network boot and is commonly installed in systems as varied as desktop Linux distributions (like Ubuntu), home routers, and IoT devices.” Google security engineers found “seven vulnerabilities including three potential remote code executions, one information leak, and three denial of service vulnerabilities” in the software.</li> <li><a href="https://www.helpnetsecurity.com/2017/10/03/companies-unprepared-dns-attacks/"><strong>Most Companies Are Unprepared for DNS Attacks</strong></a> <strong>(HelpNetSecurity)</strong><br> On the topic of DNS vulnerabilities, Dimensional Research released results of a survey of more than 1,000 security and IT professionals worldwide this week. On the topic of DNS attacks, the survey found that “86 percent of DNS solutions failed to first alert teams of an occurring DNS attack, and nearly one-third of professionals doubted their company could defend against the next DNS attack.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Kentik APIs Enable Multi-Solution Integration]]><![CDATA[In today's world of heterogeneous environments and distributed systems, APIs drive synergistic innovation, creating a whole that's more powerful than the parts. Even in networking, where the CLI rules, APIs are now indispensable. At Kentik, APIs have been integral to our platform from the outset. In this post we look at how partners and customers are expanding the capabilities of their systems by combining Kentik with external tools.]]>https://www.kentik.com/blog/kentik-apis-enable-multi-solution-integrationhttps://www.kentik.com/blog/kentik-apis-enable-multi-solution-integration<![CDATA[Justin Ryburn]]>Mon, 02 Oct 2017 13:00:51 GMT<h3 id="open-connections-can-build-powerful-combinations"><em>Open Connections Can Build Powerful Combinations</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/3oL5ikSXmESgy802yGoU2g/e8a6af7018d5c1b6898557fffb904548/API_connections-500w.png" alt="API_connections-500w.png " class="image right" style="max-width: 500px; padding: 20px;" /> The Application Programming Interface (<a href="https://en.wikipedia.org/wiki/Application_programming_interface">API</a>) has been around since the 1960s, but in today’s world of heterogeneous environments and distributed systems it’s become indispensable. Even in networking, hard-core bastion of the command-line interface (CLI), APIs have become useful enough that commentators such as Chris Grundemann are blogging about whether the <a href="https://chrisgrundemann.com/index.php/2017/api-new-cli-fact-fiction/">API is the new CLI</a> for networking professionals.</p> <p>The popularity of APIs, typically RESTful these days, is driven by the fact that they simplify the integration of different web-based applications, which enables innovation based on much faster and more open communication between systems. 
For an easy-to-understand example, take the “Skills” that are available for Amazon’s Alexa. Using the Alexa app on your phone, you can add a skill that integrates with a Nest account, after which you can control the temperature in your house via a voice command. How is that magic actually accomplished? When you ask Alexa to change the temperature, the Amazon system makes an API call to the Nest system. Because of APIs, two (or more) completely separate systems can now act as one cohesive unit.</p> <p>The benefits that can come from integrating Alexa and Nest apply as well to the integration of tools for operating and securing a network. While Kentik addresses use cases across monitoring, alerting, analytics, and planning, organizations will likely always encounter specific scenarios that aren’t totally handled by any one tool or system. That’s why network operations has for years involved deployment of a mix of different commercial, open-source, and home-grown tools. This creates what we refer to as the swivel chair problem for operations personnel, who must constantly switch between various systems to troubleshoot an issue. Monitoring solutions that implement APIs help solve this problem because they enable multiple systems to work together as a unified whole.</p> <h4 id="apis-for-kentik">APIs for Kentik</h4> <p>Here at Kentik, our engineers work really hard to continually expand the functionality and data types supported by our platforms. We realize, however, that Kentik Detect alone may not cover all network use cases. So APIs have been integral to the design of our platform from the outset.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2uuIjqxYq0C0YmaqsI8cIq/8d86a14177fcdde560f08df5527a32fa/API_menu-300w.png" alt="API_menu-300w.png " class="image right" style="max-width: 300px;" /> <p>On the one hand, our focus on open APIs has facilitated partnerships with complementary vendors, such as the integrations that allow our alerting system to interact seamlessly with DDoS mitigation from leading providers such as Radware and A10. On the other hand, we provide a set of very powerful APIs that users can leverage to integrate our functionality and network traffic data with other systems. As part of that effort, we’ve built some resources right into the product to make our APIs as easy as possible for customers to use. In the Data Explorer section of the Kentik portal, for example, the View Options menu at top right of the graph display area (see image at right) exposes the Kentik API call (as cURL) needed to return either chart or table data, and will also show you the JSON needed to make the call.</p> <p>One example of using Kentik APIs in a provider setting is to enable customer portals, which is explained in this <a href="https://www.kentik.com/kentik-apis-for-customer-portal-integration/">blog post</a>. To see the full variety of the calls that are available, you can refer to our Knowledge Base, which covers both <a href="https://kb.kentik.com/?Ec03.htm">Admin APIs</a> and the <a href="https://kb.kentik.com/?Ec04.htm">Query API</a>.
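</p> <p>To give a feel for what a Query API call looks like from a script, here’s a rough Python sketch. The authentication headers are the ones used by the v5 API; the payload fields below are illustrative assumptions, and the exact JSON for any query is best copied from Data Explorer’s View Options menu as described above:</p> <pre><code class="language-python">import requests

# v5 Query API endpoint; see the Knowledge Base and the API tester for details.
API_URL = "https://api.kentik.com/api/v5/query/topXdata"

# Credentials come from your user profile in the portal (Admin » Users).
HEADERS = {
    "X-CH-Auth-Email": "user@example.com",
    "X-CH-Auth-API-Token": "YOUR_API_TOKEN",
    "Content-Type": "application/json",
}

# Illustrative payload: top source ASNs by traffic over the last hour.
# Field names here are assumptions for this sketch; copy the real JSON
# from the View Options menu for a query you've built in the UI.
payload = {
    "queries": [{
        "query": {
            "dimension": ["src_as"],
            "metric": "bytes",
            "lookback_seconds": 3600,
            "topx": 8,
        },
    }],
}

resp = requests.post(API_URL, headers=HEADERS, json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
</code></pre> <p>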
Another powerful resource for API calls is our <a href="https://api.kentik.com/api/tester/v5">API tester</a>, which allows the user to try out API syntax in a GUI.</p> <h4 id="kentik-connect-pro-for-grafana">Kentik Connect Pro for Grafana</h4> <p>Another API-based option that we’ve developed for our customers is Kentik Connect Pro, a <a href="https://grafana.com/plugins/kentik-app?source=kentik">plug-in</a> that we worked with Grafana to develop for their popular open-source data graphing software. This allows our customers to skip a lot of the heavy lifting that would otherwise be involved in pulling in their Kentik data alongside the other data that they are already graphing.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6PNyRExCWQqy0MIiEaIaoS/8f7c30b03050a6997031576c3c3b01cf/Graphana_main-820w.png" alt="Graphana_main-820w.png " class="image center" style="max-width: 820px;" /> <p>Kentik Connect for Grafana ships with Kentik Data Source, a database connector that allows you to read and visualize data directly from our backend, the Kentik Data Engine (KDE). Within the Grafana environment, you can specify the parameters of the traffic that you want Kentik Connect to display:</p> <ul> <li><img src="//images.ctfassets.net/6yom6slo28h2/3cR2FA9IiIwwYAicww4guu/efbe9d7f2e39d8bb994b3a38a07db50b/Graphana_groupby-300w.png" alt="Graphana_groupby-300w.png " class="image right" style="max-width: 300px;" /><strong>Timespan</strong>: Set the time range for which you want to see traffic data.</li> <li><strong>Devices</strong>: View traffic from all devices or individual routers, switches, or hosts.</li> <li><strong>Group By</strong> (see screenshot at right): Group by over 30 source and destination dimensions representing NetFlow, BGP, or GeoIP data.</li> <li><strong>Metrics</strong>: Display data in metrics including bits, packets, or unique IPs.</li> <li><strong>Filters</strong>: Filter the data based on a list of dimensions very similar to the Group By options.</li> <li><strong>Sort</strong>: Visualizations are accompanied by a sortable table showing Max, 95th percentile, and Average values.</li> </ul> <p>Kentik Connect also allows you to edit the configuration of devices you have registered in Kentik Detect. As with any Grafana dashboard, current settings can be managed and dashboards can be saved, shared, and starred. To enable integration between Kentik Connect in Grafana and your Kentik Detect portal, you will need your email address and Kentik API token, which can be found in the portal’s User Details dialog (go to Admin » Users and click on your row in the User List).</p> <img src="//images.ctfassets.net/6yom6slo28h2/2h7Apy7Mi06scmeyKK4Skc/c9454f8a2ca24c26bfacb3932042acbf/Admin_user-814w.png" alt="Admin_user-814w.png " class="image center" style="max-width: 814px;" /> <h4 id="summary">Summary</h4> <p>The examples above represent just a few of the ways to use the Kentik APIs. As you can see, using these APIs to integrate with other systems can be a very powerful mechanism to leverage both the capabilities of the Kentik platform and your Kentik-stored traffic data (flow records, SNMP, GeoIP, BGP, etc.). If you’re already a Kentik Detect user and would like some help implementing these APIs, get in touch with the Kentik <a href="mailto:[email protected]">support team</a> for assistance.
If you’re not yet a user and would like to see what these powerful integration capabilities can do for you, contact us today to <a href="#demo_dialog/">request a demo</a> or start a <a href="#signup_dialog">free trial</a>.</p><![CDATA[News in Networking: 160-Tbit/s Cable, Airline Network Issues, and a Patched-Mac Vuln]]><![CDATA[Microsoft, Facebook, and telecom provider Telxius announced this week that their high-capacity subsea cable project is complete. A “network issue” in a global flight-booking system caused major airline delays. And an “alarming number” of patched Macs are vulnerable to an issue in the Extensible Firmware Interface. More after the jump...]]>https://www.kentik.com/blog/news-in-networking-the-160-tbits-cable-airline-network-issues-and-a-patched-mac-vulnhttps://www.kentik.com/blog/news-in-networking-the-160-tbits-cable-airline-network-issues-and-a-patched-mac-vuln<![CDATA[Michelle Kincaid]]>Fri, 29 Sep 2017 17:20:21 GMT<p>This week’s top story picks from the Kentik team.</p> <p>Microsoft, Facebook, and telecom provider Telxius announced this week that their high-capacity subsea cable project is complete. A “network issue” in a global flight-booking system caused major airline delays. And an “alarming number” of patched Macs are vulnerable to an issue in the Extensible Firmware Interface.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://www.theverge.com/2017/9/25/16359966/microsoft-facebook-transatlantic-cable-160-terabits-a-second"><strong>Microsoft and Facebook Laid a 160-Tbit/s Cable 4,100 Miles Across the Atlantic</strong></a> <strong>(The Verge)</strong><br> Microsoft, Facebook, and telecoms infrastructure provider Telxius announced their new high-capacity subsea cable, called Marea, is complete. “The cable is capable of transmitting 160 terabits of data per second, the equivalent of streaming 71 million HD videos at the same time, and 16 million times faster than an average home internet connection,” reports The Verge.</li> <li><a href="https://www.bloomberg.com/news/articles/2017-09-28/airlines-suffer-worldwide-delays-as-amadeus-booking-system-fails"><strong>Airlines Suffer Worldwide Delays After “Network Issue”</strong></a> <strong>(Bloomberg)</strong><br> Amadeus IT Group SA, which operates a global flight-bookings system, said a “network issue” was to blame for a number of delays on Thursday morning.</li> <li><a href="https://arstechnica.com/information-technology/2017/09/bug-in-fully-patched-internet-explorer-leaks-text-in-address-bar/"><strong>Internet Explorer Bug Leaks Whatever You Type in Address Bar</strong></a> <strong>(Ars Technica)</strong><br> A bug in Internet Explorer “allows any currently visited website to view any text entered into the address bar as soon as the user hits enter.
The technique can expose sensitive information a user didn’t intend to be viewed by remote websites,” reports Ars Technica.</li> <li><a href="http://www.zdnet.com/article/big-data-case-study-how-ups-is-using-analytics-to-improve-performance/"><strong>How UPS is Using Analytics to Improve Performance</strong></a> <strong>(ZDNet)</strong><br> Juan Perez, chief information and engineering officer for UPS, says network planning tools and analytics will help his company “to optimize its logistics network through the effective use of data.”</li> <li><a href="https://arstechnica.com/information-technology/2017/09/an-alarming-number-of-macs-remain-vulnerable-to-stealthy-firmware-hacks/"><strong>Alarming Number of Patched Macs Vulnerable to Stealthy Firmware Hacks</strong></a> <strong>(Ars Technica)</strong><br> “The exposure results from known vulnerabilities that remain in the Extensible Firmware Interface, or EFI, which is the software located on a computer motherboard that runs first when a Mac is turned on. EFI identifies what hardware components are available, starts those components up, and hands them over to the operating system,” reports Ars Technica.</li> <li><a href="http://www.zdnet.com/article/ericsson-and-telstra-encrypt-in-transit-data-over-100gbps-link/"><strong>Ericsson and Telstra Encrypt In-Transit Data Over 100Gbps Link</strong></a> <strong>(ZDNet)</strong><br> Ericsson, Ciena, and Telstra have encrypted data while in transit over a 100Gbps link between the U.S. and Australia using multiple subsea cable systems, reports ZDNet.</li> <li><a href="http://www.crn.com/slide-shows/networking/300092796/emerging-vendors-2017-networking-and-voip-startups-you-need-to-know.htm/pgno/0/11"><strong>Emerging Vendors 2017: Networking And VoIP Startups You Need To Know</strong></a> <strong>(CRN)</strong><br> CRN just published a list of top networking and VoIP startups to watch. Kentik was honored to be named on the list as a company “with the goal of making networking easier to manage and consume.” Check out the full list.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[News in Networking: Cisco Without Chambers, CCleaner Malware & Programmable Networks]]><![CDATA[Cisco Chairman John Chambers announced this week that he will not seek re-election. The networking giant also announced a partnership with Viacom. Meanwhile, Cisco researchers found that the CCleaner malware was targeting at least 18 tech companies. More after the jump...]]>https://www.kentik.com/blog/news-in-networking-cisco-without-chambers-ccleaner-malware-programmable-networkshttps://www.kentik.com/blog/news-in-networking-cisco-without-chambers-ccleaner-malware-programmable-networks<![CDATA[Michelle Kincaid]]>Fri, 22 Sep 2017 20:03:34 GMT<p>This week’s top story picks from the Kentik team.</p> <p>Cisco Chairman John Chambers announced this week that he will not seek re-election. The networking giant also announced a partnership with Viacom. And Cisco researchers found that the CCleaner malware was targeting at least 18 tech companies. 
A busy week for that company and the others below.</p> <p><strong>Here are the headlines we’re reading this week:</strong></p> <ul> <li><a href="http://www.lightreading.com/ethernet-ip/routers/the-john-chambers-era-is-ending-at-cisco/a/d-id/736432"><strong>The John Chambers Era Is Ending at Cisco</strong></a> <strong>(Light Reading)</strong><br> Cisco Chairman John Chambers announced this week that he will not seek re-election. CEO Chuck Robbins is expected to be named chairman at the shareholder meeting in December.</li> <li><a href="http://www.lightreading.com/video/video-storage-delivery/cisco-inks-video-delivery-deal-with-viacom/d/d-id/736419"><strong>Cisco Inks Video Delivery Deal With Viacom</strong></a> <strong>(Light Reading)</strong><br> In other Cisco news, this week the networking giant announced a partnership with Viacom “to build a versatile video network foundation that will enhance Viacom’s distribution of its premier branded content across multiple linear, digital and mobile screens for viewers in the U.S., Canada, Mexico and the Caribbean.”</li> <li><a href="https://www.wired.com/story/ccleaner-malware-targeted-tech-firms"><strong>CCleaner Malware Fiasco Targeted At Least 18 Specific Tech Firms</strong></a> <strong>(WIRED)</strong><br> Following news that hundreds of thousands of companies were hit with the CCleaner malware, Wired now reports that Cisco researchers found “the hackers had attempted to filter their collection of backdoored victim machines to find computers inside the networks of 18 tech firms, including Intel, Google, Microsoft, Akamai, Samsung, Sony, VMware, HTC, Linksys, D-Link and Cisco itself.”</li> <li><a href="https://www.cnbc.com/2017/09/18/aws-starts-charging-for-ec2-by-the-second.html"><strong>AWS will Charge by the Second, Its Biggest Pricing Change in Years</strong></a> <strong>(CNBC)</strong><br> According to CNBC, Amazon’s price change “comes four years after Google outdid AWS with per-minute pricing. Historically AWS has charged by the hour for its EC2 cloud computing service.”</li> <li><a href="https://www.sdxcentral.com/articles/news/digitalocean-launches-scalable-spaces-object-storage/2017/09/"><strong>DigitalOcean Launches Scalable Spaces Object Storage</strong></a> <strong>(SDxCentral)</strong><br> Cloud services provider DigitalOcean added scalable object storage to its product line this week. Object storage has been “one of the most requested products that we’ve been asked to build,” the company said.</li> <li><a href="https://techcrunch.com/2017/09/21/database-provider-mongodb-has-filed-to-go-public/"><strong>Database Provider MongoDB Files to Go Public</strong></a> <strong>(TechCrunch)</strong><br> Ten years after being founded, document database provider MongoDB has filed for IPO.
The company “brought in $101.4 million in revenue in the most recent year ending January 31, and around $68 million in the first six months ending July 31 this year,” reports TechCrunch.</li> <li><a href="http://www.zdnet.com/article/telstra-lays-out-roadmap-for-programmable-network/"><strong>Telstra Lays Out Roadmap for Programmable Network</strong></a> <strong>(ZDNet)</strong><br> Australian telco Telstra announced this week its plans to add virtualized customer premises, pre-designed templates, managed services, and dynamic real-time control of IP VPN networks to its programmable network over the next few months.</li> <li><a href="https://www.wsj.com/articles/hewlett-packard-enterprise-to-cut-10-of-workforce-1506049877"><strong>Hewlett Packard Enterprise to Cut 10% of Workforce</strong></a> <strong>(Wall Street Journal)</strong><br> HPE announced plans to reduce its workforce by 10 percent, or about 5,000 jobs. This is part of the HPE Next initiative, “a three-year plan announced in June to take out $1.5 billion in gross costs and shift resources toward areas such as research and development,” according to the Wall Street Journal.</li> <li><a href="https://venturebeat.com/2017/09/19/you-might-use-ai-but-this-doesnt-mean-youre-an-ai-company/"><strong>You Might Use AI, But That Doesn’t Mean You’re an AI Company</strong></a> <strong>(VentureBeat)</strong><br> “You’re not an AI company because there are a few people using a few neural networks somewhere,” said Andrew Ng, a founder of the Google Brain team and expert in the space. “It’s much deeper than that,” he went on to tell VentureBeat.</li> <li><a href="https://www.technologyreview.com/s/608878/how-to-get-one-trillion-devices-online/"><strong>How to Get One Trillion Devices Online</strong></a> <strong>(MIT Tech Review)</strong><br> Chris Doran of ARM, the company that designs smartphone chips, explained to MIT Tech Review why security is the biggest obstacle for the Internet of Things. The Cliff Notes: “Computing is shifting from who’s got the most performant device to who’s got the most energy-efficient device, and the next step will be who’s got the most secure device.”</li> <li><a href="http://www.lightreading.com/optical/subsea/2017-storms-may-mean-network-rethink/d/d-id/736566?"><strong>2017 Storms May Mean Network Rethink</strong></a> <strong>(Light Reading)</strong><br> Gil Santaliz, founder and managing director of New Jersey Fiber Exchange, told Light Reading, “The impact of storms such as the U.S. and the Caribbean have seen this fall isn’t in the damage done to cables or even cable landing stations. It’s in what happens to traffic after it leaves those places.” The publication reports that Santaliz is “expecting the global industry to take a long look at how traffic is managed going forward.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[News in Networking: Data Center Hurricane Prep, Gambling with DDoS, and Rural Broadband]]><![CDATA[As the east coast prepares for Hurricane Irma, those with data centers in the storm’s path are also making efforts to avoid interruptions. DDoS attackers took a gamble this week, making a hit on popular online poker site America's Cardroom. And a debate has come up over how CenturyLink’s plans to acquire Level 3 Communications could affect broadband in rural areas.
More after the jump...]]>https://www.kentik.com/blog/news-in-networking-data-center-hurricane-prep-gambling-with-ddos-and-rural-broadband-connectionshttps://www.kentik.com/blog/news-in-networking-data-center-hurricane-prep-gambling-with-ddos-and-rural-broadband-connections<![CDATA[Michelle Kincaid]]>Fri, 08 Sep 2017 19:46:50 GMT<p>This week’s top story picks from the Kentik team.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7nlLDy0kGQyGme0QSy888Q/5e053daefbb7fdce15bb892687f93969/News_tablet-396w.png" alt="News_tablet-396w.png" class="image right" style="max-width: 300px;" /> <p>As the U.S. east coast prepares for Hurricane Irma, those with data centers in the storm’s path are also making efforts to avoid interruptions. Meanwhile, DDoS attackers took a gamble this week, making a hit on popular online poker site America’s Cardroom. The site’s CEO is said to be considering a bounty for information on who was behind the attack. At the same time, a debate has arisen over whether and how CenturyLink’s plans to acquire Level 3 Communications could affect broadband in rural areas.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="http://www.datacenterknowledge.com/manage/use-data-center-irmas-path-heres-how-not-lose-your-data"><strong>Use a Data Center in Irma’s Path? Here’s How Not to Lose Your Data</strong></a> <strong>(Data Center Knowledge)</strong><br> With Hurricane Irma continuing on its path towards the U.S., major data center providers are preparing. According to the article, drawing on their experience from recent Hurricane Harvey, CyrusOne “switched to on-site generator power at one point.” Meanwhile, Equinix, Data Foundry, Digital Realty Trust, Internap, and Netrality “all reported that their Houston data centers had not experienced any service interruptions.”</li> <li><a href="http://searchnetworking.techtarget.com/news/450425547/Intent-based-networking-needed-to-run-more-complex-networks"><strong>Intent-based Networking is Needed to Run More Complex Networks</strong></a> <strong>(TechTarget)</strong><br> The debate continues over why and how intent-based networking systems will work. TechTarget reports on expert input about whether these systems are needed to manage networks of the future that connect data centers, public clouds, and IoT.</li> <li><a href="http://www.lightreading.com/automation/the-autonomous-network-is-the-endgame-for-telecom/a/d-id/735860"><strong>The Autonomous Network Is the Endgame for Telecom</strong></a> <strong>(Light Reading)</strong><br> “The immediate goal of automation is to create ‘autonomous processes’ — in which the network does what it needs to do without human intervention. That’s been an aspiration of telecom architects and solution providers for literally decades,” says Light Reading founder and CEO Steve Saunders in an opinion piece.</li> <li><a href="https://www.theregister.co.uk/2017/08/27/google_routing_blunder_sent_japans_internet_dark/"><strong>Google BGP Routing Blunder Sent Japan’s Internet Dark</strong></a> <strong>(The Register)</strong><br> A typo at Google is likely to blame for a border gateway protocol (BGP) issue that took down a large chunk of Japan’s Internet connectivity in recent weeks.
According to The Register, “The trouble began when The Chocolate Factory ‘leaked’ a big route table to Verizon.”</li> <li><a href="https://www.scmagazine.com/ddosd-online-poker-site-ceo-contemplating-posting-reward-to-find-attacker/article/687314/"><strong>DDoS’d Online Poker Site CEO Contemplating Posting Reward to Find Attacker</strong></a> <strong>(SC Magazine)</strong><br> The online poker site America’s Cardroom was hit earlier this week with a DDoS attack timed to disrupt a major tournament, prompting the company CEO to consider putting a 10-bitcoin bounty out to discover if the attack was launched by a competitor.</li> <li><a href="https://prwire.com.au/pr/46864/akamai-warns-of-large-ddos-attacks-from-spike-ddos-toolkit"><strong>Akamai Warns of Large DDoS Attacks from Spike DDoS Toolkit</strong></a> <strong>(Press Release)</strong><br> Akamai recently released a new threat advisory to warn enterprises of “a high-risk threat of powerful distributed denial of service (DDoS) attacks from the Spike DDoS toolkit. With this toolkit, malicious actors are building bigger DDoS botnets by targeting a wider range of Internet-capable devices.”</li> <li><a href="https://morningconsult.com/2017/09/07/biggest-merger-no-one-iss-talking-undermine-rural-broadband-expansion/"><strong>Rural Broadband Expansion At Issue in CenturyLink-Level 3 Deal</strong></a> <strong>(Morning Consult)</strong><br> CenturyLink recently announced plans to acquire Level 3 Communications for $34 billion, which would make it competitive with telecom giant AT&#x26;T. However, according to Morning Consult, people opposed to the deal say it would “hurt broadband access for rural providers by eliminating access to wholesale rates for critical fiber connections to the internet backbone.”</li> <li><a href="http://www.sfchronicle.com/business/article/One-man-s-DIY-Internet-service-connects-12175313.php#photo-14053437"><strong>One Man’s DIY Internet Service Connects Isolated Marin County Hamlet</strong></a> <strong>(San Francisco Chronicle)</strong><br> Also on the rural broadband front, kudos go to Brandt Kuykendall, a Marin County, California resident. Kuykendall’s daughter needed fast, reliable internet access for school, but that didn’t exist in their neighborhood. Instead, Kuykendall decided to DIY a telecom.</li> <li><a href="http://www.lightreading.com/how-kentik-helps-operators-turn-network-data-into-sales/d/d-id/735930"><strong>How Kentik Helps Operators Turn Network Data into Sales</strong></a> <strong>(Light Reading)</strong><br> The team here at Kentik recently caught up with Carol Wilson of Light Reading. In a new story, Wilson reports on our efforts: “Network operators are becoming more adept at using their network traffic data in a variety of ways: To assure service delivery; bolster security; and improve customer experience, to name just a few.
Now, one network intelligence specialist, Kentik Technologies Inc., is explaining to operators how they can leverage its platform, the Kentik Detect Netflow-based network traffic intelligence system, to also identify sales prospects.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Using Interface Classification in Kentik Detect]]><![CDATA[In our latest post on Interface Classification, we look beyond what it is and how it works to why it's useful, illustrated with a few use cases that demonstrate its practical value. By segmenting traffic based on interface characteristics (Connectivity Type and Network Boundary), you'll be able to easily see and export valuable intelligence related to the cost and ROI of carrying a given customer's traffic.]]>https://www.kentik.com/blog/using-interface-classification-in-kentik-detecthttps://www.kentik.com/blog/using-interface-classification-in-kentik-detect<![CDATA[Greg Villain & Philip De Lancie]]>Tue, 05 Sep 2017 13:36:34 GMT<h3 id="analyzing-your-connections-for-competitive-advantage"><em>Analyzing Your Connections for Competitive Advantage</em></h3> <p>In our <a href="https://www.kentik.com/the-why-and-how-of-interface-classification/">first post on Interface Classification</a>, we looked at what it is and how it works. This time we’ll follow up with more specifics on why it’s useful, illustrating with a few use cases that demonstrate the practical value of this new feature. To briefly review, Interface Classification enables an organization to quickly and efficiently assign a Connectivity Type and Network Boundary value to every interface in the network, and to store those values in the Kentik Data Engine (KDE) records of each flow that is ingested by Kentik Detect. Associating the classification information of interfaces with the flows that pass through those interfaces makes it possible for queries to segment traffic based on interface characteristics. For a closer look at how this is achieved, check out the <a href="https://kb.kentik.com/v4/Cb10.htm">Interface Classification article</a> in our Knowledge Base.</p> <p>The easiest place to see Interface Classification in action is the Data Explorer in the Kentik Detect portal. 
Interface classifications may be accessed here in two forms:</p> <ul> <li>Group-by dimensions (sidebar Query pane): Set the combination of fields that define a set of traffic that can be counted (by metric) and ranked.</li> <li>Filters (sidebar Filters pane): Include or exclude traffic records that contain a specified value in the field that you’re filtering on.</li> </ul> <p>The following table shows the columns in which the Interface Classification values are stored in the records of the KDE and how they are referred to in the Data Explorer UI.</p> <table style="width: 100%;"> <thead> <tr> <th>Portal dimensions and filters</th> <th>KDE field name</th> </tr> </thead> <tbody> <tr> <td>SOURCE: Connectivity Type</td> <td>`i_src_connect_type_name`</td> </tr> <tr> <td>SOURCE: Network Boundary</td> <td>`i_src_network_bndry_name`</td> </tr> <tr> <td>DESTINATION: Connectivity Type</td> <td>`i_dst_connect_type_name`</td> </tr> <tr> <td>DESTINATION: Network Boundary</td> <td>`i_dst_network_bndry_name`</td> </tr> </tbody> </table> <p>In the next sections we’ll look at how these new dimensions can be easily applied in some very common use cases. Without them, performing the same tasks would require devoting significantly more effort to the construction of queries.</p> <h4 id="checking-country-level-connectivity">Checking Country-level Connectivity</h4> <p>One common task for network operators is to regularly review global connectivity, often evaluating it on a country-by-country basis. To intelligently price content destined for a given country, for example, content providers need a pretty good idea of how much it actually costs them to deliver traffic to that locale. With costs of bandwidth varying from PNI (Private Network Interconnect) to Transit, it’s typically useful to look at segmentation for traffic — either inbound or outbound, depending on whether you are a content provider, carrier, or ISP — to each country of interest.</p> <p>We set up for this type of analysis using the following settings in the sidebar of the Data Explorer:</p> <ul> <li><strong>Group-by dimensions</strong>: [Destination] Connectivity Type</li> <li><strong>Filters</strong> (ANDed):<br> - Destination Network Boundary = external<br> - Destination Country = FR (France for this example)</li> <li><strong>Devices</strong>: Select all routers. We don’t need to be more specific because the Network Boundary filter will ensure that we only query on traffic as it exits the network.</li> <li><strong>Display Type</strong>: Pie Chart (gives us a quick visual overview).</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/dNHZXs8CgEoOquG4yeUMY/add7f74fbd35067f4e2e12e951b559e9/Top_dest_connectivity-723w.png" class="image center" style="width: 723px" /> <p>After we click <em>Apply Changes</em>, we can see in the result shown above that a lot of the French traffic goes through transit, which is the most costly form of connectivity. 
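The same breakdown can also be pulled programmatically. The sketch below shows roughly what such a call could look like; treat it as a hedged illustration, since the endpoint, headers, and request schema are assumptions to verify against Kentik’s API documentation, and the dimension and filter field names are taken from the table earlier in this post.</p> <pre><code># Illustrative sketch: the country-level connectivity breakdown as an
# API query rather than a portal session. The endpoint, auth headers,
# and body schema are assumptions to check against Kentik's API docs;
# field names follow the table above.
import requests

API_URL = "https://api.kentik.com/api/v5/query/topXdata"  # assumed endpoint
HEADERS = {
    "X-CH-Auth-Email": "you@example.com",     # your portal login (placeholder)
    "X-CH-Auth-API-Token": "YOUR_API_TOKEN",  # token from the portal (placeholder)
    "Content-Type": "application/json",
}

# Group by [Destination] Connectivity Type; filter to external-boundary
# traffic destined for France, mirroring the sidebar settings above.
body = {
    "queries": [{
        "query": {
            "dimension": ["i_dst_connect_type_name"],
            "metric": "bytes",
            "lookback_seconds": 86400,  # last 24 hours
            "filters_obj": {
                "connector": "All",
                "filterGroups": [{
                    "connector": "All",
                    "filters": [
                        {"filterField": "i_dst_network_bndry_name",
                         "operator": "=", "filterValue": "external"},
                        {"filterField": "dst_geo",  # destination country (assumed name)
                         "operator": "=", "filterValue": "FR"},
                    ],
                }],
            },
        },
    }],
}

resp = requests.post(API_URL, headers=HEADERS, json=body, timeout=30)
resp.raise_for_status()
print(resp.json())
</code></pre> <p>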
We can go deeper on this first pass of the investigation by figuring out which PoPs said traffic is delivered by and, even more precisely, from which devices and toward which ISP networks.</p> <p>To do this our Data Explorer settings would be as follows:</p> <ul> <li><strong>Dimensions/Group By</strong>: [Full] Site, [Destination] Connectivity Type, [Full] Device, [Destination] Next-Hop ASN, [Destination] ASN.</li> <li><strong>Display and Sort By</strong> (Advanced Options): 95th percentile (smooths out potential spikes).</li> <li><strong>Filters</strong> (ANDed):<br> - Destination Network Boundary = external<br> - Destination Country = FR (France for this example)</li> <li><strong>Devices selected</strong>: Select all routers.</li> <li><strong>Display Type</strong>: Sankey diagram</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/CmEmPnCdos6Ua6yK8cqAA/00e79bfc2f7c1c3082ecfce00b970e3f/Sankey_connectivity-815w.png" alt="Sankey_connectivity-815w.png" class="image center" style="max-width: 815px;" /> <p>The resulting diagram shows where (from which of your devices) the traffic with a given destination ASN is leaving your network, as well as the first-hop ASN for that traffic. By combining that information with your cost information for each of those ASes, you can now develop an overview of your outbound traffic cost structure in the selected country.</p> <h4 id="optimizing-connectivity-by-country">Optimizing Connectivity by Country</h4> <p>Many peering review meetings consist largely of reviewing connectivity in a given market and trying to see what can be optimized for cost and/or performance. Either way, it’s always a good practice to try to shift as much traffic as possible from transit to any form of peering (ideally free PNI, but possibly public peering in an IX fabric). Figuring out how to do that is effectively a two-step process.</p> <p>Step 1 is to select the destination country. How you do this depends on whether you already know what destination country you want to study or whether you want to select that country based on how much transit connectivity it requires. For the latter analysis in Data Explorer we would use the following settings:</p> <ul> <li><strong>Dimension/Group By</strong>: [Destination] Country</li> <li><strong>Filters</strong> (ANDed):<br> - Destination Network Boundary = external<br> - Destination Connectivity Type = transit</li> <li><strong>Devices</strong>: Select all routers.</li> <li><strong>Display Type</strong>: Time series stacked graph</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/1f2CCnosikwSCiEK0AkwOa/609e8f0baed2e64054e28405380d3edd/Top_transit_countries-818w.png" alt="Top_transit_countries-818w.png" class="image center" style="max-width: 818px;" /> <p>As shown above, we now have a list of destination countries that are ranked by transit connectivity. This tells us where the greatest potential lies for saving money by shifting from transit to peering. Let’s focus on Germany (DE) for our next step.</p> <p>Step 2 is to look at the Destination ASNs for the given country. This is where Data Explorer’s drill-down capabilities become particularly useful. In the table under the graph, click on the Show Options icon at the right of the DE row. 
In the resulting popup menu, choose Show by » Destination » AS Number.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6LlChwaycMi0qusow8EU4M/09b13de1d96e04880c153ac4b075490f/Dest_AS_drilldown-811w.png" alt="Dest_AS_drilldown-811w.png" class="image center" style="max-width: 811px;" /> <p>After applying the changes, we have an ordered list of the top ASNs in Germany that are currently reached via transit, a.k.a. a list of the traffic destinations with which it would be most beneficial for us to peer. This is valuable business intelligence that would be far more difficult to obtain without our new Interface Classification dimensions.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2N3ZDKK8nKyG0SwwyQ8CmW/fb68e27a9ce8dd11c006d22d42ee99b0/Top_transit_ASNs-818w.png" alt="Top_transit_ASNs-818w.png" class="image center" style="max-width: 818px;" /> <h4 id="add-ultimate-exit-for-deeper-intelligence">Add Ultimate Exit for Deeper Intelligence</h4> <p>By now we have a good idea of the power that Interface Classification gives us in terms of extracting business value from network data. We can take this even further by marrying IC with another relatively new feature of Kentik Detect: <strong>Ultimate Exit</strong>. As a quick refresher (see <a href="https://www.kentik.com/package-tracking-for-the-internet/">this post</a> to get fully reacquainted), Ultimate Exit augments flow records in KDE with information about the site, device, and interface of the exit point of the underlying flows from which the records are derived.</p> <p>As shown in the Sankey diagram below, using this Ultimate Exit information allows you to slice traffic by the sites/devices/interfaces toward which it is headed so that you can see how much traffic is flowing between an entry site and an ultimate exit site/device/interface (the BGP ultimate exit dimension). You can also see which eyeball networks (the destination ASN) are reached by that traffic, and you can export that information for business analytics (from the Options menu at top right of the graph, choose Export » Table » as CSV). The business utility of this capability is to enable you to look at where transport is the most expensive between your entry site and ultimate exit site so that you can take appropriate steps to optimize.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3MKW9lMAo8EO0uSSoogucU/795d341a480b8232fc57eb41e2de6d9d/Sankey_UE-673w.png" alt="Sankey_UE-673w.png" class="image center" style="max-width: 673px;" /> <p>To understand the benefit of combining Interface Classification with Ultimate Exit, let’s assume that your cost per Mb/s depends on connectivity type, and that the greater the cost the lower your ROI. As shown in the following diagram, the resulting ROI hierarchy would be: transit (ROI = $) &#x3C; paid peering (ROI = $$) &#x3C; IX peering (ROI = $$$) &#x3C; routing between devices across your network (ROI = $$$$) &#x3C; routing within a single device, a.k.a. hot-potato-routing (ROI = $$$$$).</p> <img src="//images.ctfassets.net/6yom6slo28h2/6DrpwI0ZrigOy2cAc2e6mK/03ba1a86ea802521f08ed50fd1100971/ROI_levels-818w.png" alt="ROI_levels-818w.png" class="image center" style="max-width: 818px;" /> <p>To maximize ROI, you need network business intelligence that tells you where the traffic you carry for a given customer falls into this hierarchy. You can get that information in Data Explorer using two new dimensions: BGP Ultimate Exit Connectivity Type and BGP Ultimate Exit Network Boundary. 
Here’s how to set up such a query:</p> <ul> <li><strong>Dimension/Group By</strong>:<br> - [Source] AS Number: this isn’t crucial, but it helps us make sure that we’re catching ports for the right customer ASN.<br> - [Full] Site: the sites where we connect to our customer (named Customer AS1234 in this example).<br> - [Destination] BGP Ultimate Exit Site: the site where the customer’s traffic will be handed over, which will give us an idea of how much transport is involved for us to deliver it.<br> - [Destination] BGP Ultimate Exit Connectivity Type: the type of connectivity over which this customer’s traffic will be handed over. This can help calculate the cost of transport between the customer’s traffic entry and exit sites.<br> - [Destination] AS Number (optional).</li> </ul> <ul> <li><strong>Filters</strong>: [Source] Interface Description.</li> <li><strong>Devices</strong>: All edge routers.</li> <li><strong>Display Type</strong>: Sankey diagram.</li> </ul> <p>This query yields a Sankey with some very valuable ROI information. Focusing on the central part of the diagram, the annotations in the image below use color to show three different types of paths taken by the traffic of the customer in this example, with each representing different levels of ROI:</p> <ul> <li>Traffic on the pink path is the most profitable. This traffic is routed within the same site (Washington to Washington), so it has neither a metro nor long-haul cost. And since it’s delivered to another customer (Ultimate Exit Connectivity Type is “Customer”), it can be billed to that other customer.</li> <li>Traffic on the orange path is second in terms of ROI. It’s hot-potato-routed, so it has no transport costs, and it’s handed over via free PNI (Ultimate Exit Connectivity Type is “Free PNI”), so there’s no cost to deliver it. But it does not actually generate revenue like traffic on the pink path.</li> <li>Lastly, traffic on the blue path has the lowest ROI. It enters in San Francisco and is delivered in Sao Paulo, Brazil, which involves costly transport between two countries. This cost is probably not offset by the fact that it is delivered over a free PNI.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/60KZIaDII08AMY4UMM6UAI/8b603f6511521ee976deb276d80ef222/Three_traffic_types-814w.png" alt="Three_traffic_types-814w.png" class="image center" style="max-width: 814px;" /> <p>The fact that Kentik Detect enables us to see different paths like the ones that we’ve colored in the diagram above illustrates that we can use it to precisely establish the different types of traffic that can be compounded into an ROI — weighted by traffic volume — for each customer. A ballpark ROI analysis can be conducted by eyeballing a Sankey generated from the settings shown above, which can be downloaded as an image to share during a traffic engineering or sales meeting. Or the underlying data can be <a href="https://kb.kentik.com/v4/Db03.htm#Db03-Export_Chart_or_Table">exported from Kentik Detect</a> (either as CSV via the portal or as <a href="https://kb.kentik.com/v0/Ec04.htm#Ec04-Query_Data_Method">JSON via our APIs</a>) for further analysis in a financial or BI environment.</p> <p>Ready to learn more about using Interface Classification, and about Kentik in general? <a href="#demo_dialog">Schedule a demo</a> or sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[News in Networking: New Data Centers, Same Old Internet Trolls]]><![CDATA[Apple announced two new data centers in Iowa. Cisco made another acquisition. 
VMware reported a strong quarter. DDoS attacks spiked. Google sped up TCP/IP. And Wired put together an interactive map to show where internet trolls live. More headlines after the jump...]]>https://www.kentik.com/blog/news-in-networking-new-data-centers-same-old-internet-trollshttps://www.kentik.com/blog/news-in-networking-new-data-centers-same-old-internet-trolls<![CDATA[Michelle Kincaid]]>Fri, 25 Aug 2017 18:30:15 GMT<p>This week’s top story picks from the Kentik team.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Uz68vhqVSAk0SoeayYkGo/5d9cbe59005519da4795d02a3123a463/News_tablet-396w.png" alt="News_tablet-396w.png " class="image right" style="max-width: 300px;" /> <p>Apple announced two new data centers in Iowa this week. Cisco made plans to acquire hyperconvergence software company Springpath. VMware reported a strong quarter and a new Vodafone deal. DDoS attacks spiked, targeting gaming companies. Google sped up TCP/IP. And Wired put together an interactive map to show where internet trolls live.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://www.cnbc.com/2017/08/24/apple-gets-208-million-in-tax-breaks-to-build-iowa-data-center.html"><strong>Apple Gets $208M in Tax Breaks to Build Iowa Data Center</strong></a> <strong>(Reuters)</strong><br> Apple is moving to Iowa! Not really, but they are building two data centers in Des Moines, which will create 50 jobs in the area and give the tech giant a $208 million tax break, according to Reuters.</li> <li><a href="http://www.zdnet.com/article/cisco-buys-springpath-for-320-million/"><strong>Cisco Buys Springpath for $320M</strong></a> <strong>(ZDNet)</strong><br> Cisco announced plans this week to acquire hyperconvergence software company Springpath for a cool $320 million. The deal is expected to close in Q1 2018.</li> <li><a href="https://www.sdxcentral.com/articles/news/verizon-considers-competing-cisco-juniper-software/2017/08/"><strong>Verizon Considers Competing with Cisco, Juniper on Software</strong></a> <strong>(SDxCentral)</strong><br> Verizon is getting more involved in virtualization. Now SDxCentral reports the telco could be planning to take on Cisco and Juniper with its Virtual Network Services software.</li> <li><a href="https://www.sdxcentral.com/articles/news/vmware-scores-huge-win-vodafone-nfv/2017/08/"><strong>VMware Scores ‘Huge Win’ with Vodafone for NFV</strong></a> <strong>(SDxCentral)</strong><br> VMware CEO Pat Gelsinger said business is good. On the company’s earnings call this week, he noted a “huge win” with Vodafone as a new NFV customer.</li> <li><a href="http://www.lightreading.com/mobile/5g/ericsson-cto-prioritizes-non-consumer-5g/d/d-id/735723"><strong>Ericsson CTO Prioritizes Non-Consumer 5G</strong></a> <strong>(Light Reading)</strong><br> Ericsson has a new CTO, Erik Ekudden, and he has “given a strong signal that his 5G focus will be on developing enterprise rather than consumer applications,” according to Light Reading.</li> <li><a href="http://www.techrepublic.com/article/10-bad-habits-network-administrators-should-avoid-at-all-costs/"><strong>10 Bad Habits Network Administrators Should Avoid At All Costs</strong></a> <strong>(TechRepublic)</strong><br> What shouldn’t network administrators do? 
The Kentik team weighed in on TechRepublic’s article to remind peers to use change control for logging changes, and not to rely on the command-line interface (CLI) to troubleshoot their networks.</li> <li><a href="http://www.eweek.com/security/ddos-attackers-taking-direct-aim-at-gaming-companies-akamai-reports"><strong>DDoS Attackers Taking Direct Aim at Gaming Companies, Akamai Reports</strong></a> <strong>(eWEEK)</strong><br> Akamai’s latest 2017 State of the Internet Report went live this week. The report notes a spike in DDoS attacks, 82 percent of which were targeted at gaming companies.</li> <li><a href="https://www.networkworld.com/article/3218084/lan-wan/how-google-is-speeding-up-the-internet.html"><strong>How Google is Speeding Up the Internet</strong></a> <strong>(NetworkWorld)</strong><br> TCP/IP is the internet’s main protocol for data transmission. In an attempt to make data move faster, Google engineers have developed an algorithm that speeds it up by about 14 percent.</li> <li><a href="https://www.wired.com/2017/08/internet-troll-map/"><strong>Where Do Internet Trolls Live? Here’s a Map</strong></a> <strong>(Wired)</strong><br> To find out just how bad internet trolling is, Wired put together a map. “The company analyzed 92 million comments over a 16-month period, written by almost 2 million authors on more than 7,000 forums that use the software,” according to the media channel’s new interactive story. “The numbers reveal everything from the trolliest time of day to the nastiest state in the union.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Visualizing the Digital Eclipse]]><![CDATA[With much of the country looking skyward during the solar eclipse, you might wonder how much of an effect there was on network traffic. Was there a drastic drop as millions of watchers were briefly uncoupled from their screens? Or was that offset by a massive jump in live streaming and photo uploads? In this post we report on what we found using forensic analytics in Kentik Detect to slice traffic based on how and where usage patterns changed during the event.]]>https://www.kentik.com/blog/visualizing-the-digital-eclipsehttps://www.kentik.com/blog/visualizing-the-digital-eclipse<![CDATA[Jim Meehan]]>Tue, 22 Aug 2017 16:33:35 GMT<h3 id="analyzing-the-eclipses-impact-on-internet-traffic"><em>Analyzing the Eclipse’s Impact on Internet Traffic</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/DWrX6q38YgqMSMeoSamEe/638800f54367cd588787a7f14ad29db2/Eclipse-500w.png" class="image right" style="max-width: 500px; margin-left: 20px; margin-bottom:20px; margin-top: 15px;" alt="Eclipse" /> <p>As many as <a href="https://www.greatamericaneclipse.com/statistics/">7.4 million people</a> were expected to travel to the path of totality Monday to get a view of the Great American Eclipse. Millions more may have stayed in town but headed outside to see what they could see. With so many people away from their desktops, we had to wonder… What happened to network traffic in the US? Did legions of American workers step away from their desks to get a glimpse? Did students already back at school get an early recess to take it all in? As our fellow citizens ventured outside to watch the sun take second place, were enough of them off of the Internet to create a true digital eclipse? 
Or did our digital dependencies push us to post photos of the sky on social media and race to stream video from whatever feed best captured the event?</p> <p>Since the nature of our business here at Kentik is network traffic intelligence, we had a good view of what the Internet looked like leading up to and through the eclipse’s totality. When we looked at network traffic served up by Internet service providers (ISPs) and web companies here in the US (with their explicit permission), the pattern was analogous to the changes witnessed by those of us who weren’t in the path of totality: a partial digital eclipse. The networks we monitored around the time of the eclipse did not see outage-causing spikes in network traffic. But we did observe noticeable network activity… and it was enough to trigger traffic anomaly alerts for some of our customers.</p> <p><strong>Graphing Internet Traffic</strong></p> <p>The graph below shows an ISP’s network traffic over a 24-hour span from Sunday to Monday evenings. Note the spike around 01:00 UTC (9:00 PM EDT) — that’s right when all of the Game of Thrones fans settled in for another shocking episode. Then traffic slowed while the East Coast slept. Later, as the eclipse came into view around 18:40 UTC (2:40 PM EDT), there was an obvious lull in network traffic, indicating that people may have temporarily stepped away from Internet-connected devices.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6vtZX3AGTCoMCESesWk4Iy/4feb5aec5bbb491de36a7167f36046e4/Eclipse-ISP_network_traffic-817w.png" class="image center" style="max-width: 817px" alt="" /> <p>Another Kentik customer who serves content to consumers globally also saw traffic dip during the eclipse. For the chart below we filtered the results on a few specific geographies that were directly in the path of totality, and the time delta between locations is clearly visible as the eclipse moved West to East (all times PDT). In some locations, traffic volume fell by 50 percent.</p> <img src="//images.ctfassets.net/6yom6slo28h2/68Fwi4ikN2MgK0gOKSoo2G/9221546d77d2ef89cf5132512cd59746/Eclipse-Top_Dst_region-810w.png" class="image center" style="max-width: 810px" alt="" /> <p>Lastly, the chart below shows traffic from another of our ISP customers as it exited from their network toward some large web companies (including Apple, Google, and Facebook) that deal with photos and other user-generated digital media. Many photo-taking eclipse-watchers were likely using phones set to automatically store photos to the cloud, and also posting those photos to social media. While we don’t look at actual user behaviors, our best bet based on the timing is that the spike represents a large number of photos being synced and uploaded simultaneously by the ISP’s customers. Even after the initial spike, the traffic remains elevated above the week-ago historical volume (dashed gray line) through the end of the day.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2ntiwQ3SDWq0MMmKMIM0ac/ac4ebfbd28a631dea4c559c5ec772396/Eclipse-Top_dst_as-816w.png" class="image center" style="max-width: 816px" alt="" /> <p><strong>Preparing for the next one</strong></p> <p>If you didn’t make it to Monday’s path of totality, there’s <a href="https://www.vox.com/science-and-health/2017/8/21/16178956/when-is-next-total-solar-eclipse">good news</a> from media outlet Vox: “Total solar eclipses happen somewhere in the world every 18 months or so.” That means you’ve got time to prepare for the next one. 
Similarly, most ISPs, telecommunications providers, and web companies had time to prepare for Monday’s big event. AT&#x26;T, Verizon, and Sprint even deployed <a href="http://knkx.org/post/wireless-carriers-deploy-cell-wheels-boost-coverage-eclipse-path">“Cell on Wheels”</a> ahead of the eclipse to support overloaded networks.</p> <p>Unfortunately, however, most events affecting access to and performance of the Internet aren’t so predictable. DDoS attacks drive unexpected traffic swarms that cause network outages. Big weather events and emergencies cause network latency that limits connectivity. New application deployments change traffic patterns and trigger unexpected behaviors in users and infrastructures. And even routine routing shifts, maintenance, and errors can cause outages for ISPs, cloud providers, and, ultimately, the businesses and end-users that rely on their services.</p> <p>Maintaining service quality despite these unpredictable incidents requires the equivalent of eclipse glasses. You need to be able to see the details of events as soon as they begin to unfold, rapidly analyze causes and impacts, and initiate an informed response ASAP. That’s nearly impossible to do with the limited storage and compute power of legacy network monitoring systems. Kentik Detect, on the other hand, runs on a big data platform that is purpose-built to deliver real-time visibility and analytics at Internet scale.</p> <p>To learn more about Kentik, <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Monitoring DNS with Kentik]]><![CDATA[The Domain Name System (DNS) is often overlooked, but it's one of the most critical pieces of Internet infrastructure. As driven home by last October's crippling DDoS attack against Dyn, the web can't function unless DNS resolves hostnames to their underlying IP addresses. In this post we look at how combining Kentik's software host agent with Dashboards in Kentik gives you the tools you need to ensure DNS availability and performance.]]>https://www.kentik.com/blog/monitoring-dns-with-kentik-detecthttps://www.kentik.com/blog/monitoring-dns-with-kentik-detect<![CDATA[Justin Ryburn]]>Mon, 21 Aug 2017 13:00:11 GMT<h3 id="dashboards-for-dns-metrics-reveal-issues-with-your-infrastructure"><em>Dashboards for DNS Metrics Reveal Issues With Your Infrastructure</em></h3> <p>Working behind the scenes, the Domain Name System (DNS) is often overlooked, but it’s one of the most critical pieces of the Internet infrastructure. No matter how well your network is routing IP traffic, the network is essentially down without DNS to resolve hostnames like <a href="http://www.google.com">www.google.com</a> to their underlying IP addresses. That point was driven home by the Mirai DDoS attack last October against the DNS infrastructure of Dyn, Inc. The attack had a crippling effect on numerous major sites including GitHub, Twitter, Reddit, AirBnB, and Netflix.</p> <p>As with any critical component, network operations needs to be able to continuously monitor DNS performance and to respond quickly to any issues that may arise. But how? Using Kentik’s software host agent with the Dashboard functionality of the Kentik Detect portal gives you the tools you need to ensure that DNS is operating at peak performance.</p> <p>Here’s a brief overview of how the solution is put together. The Kentik host agent is installed on DNS servers, as shown at left in the image below. 
The agent monitors the DNS queries and responses as they pass to and from the server across its Ethernet interface (physical or virtual). This information is turned into flow data and sent over an SSL-encrypted channel to the Kentik Data Engine (KDE), from which it is queryable in Kentik Detect. (For a deeper dive, see our Knowledge Base article on <a href="https://kb.kentik.com/Bd03.htm">Host Configuration</a>.)</p> <img src="//images.ctfassets.net/6yom6slo28h2/vQeCVperYc6wqQSkcEqSQ/d8764603a9c3e5d2174b71523ccdc892/DNS_overview-815w.png" alt="DNS_overview-815w.png" class="image center" style="max-width: 815px; padding: 20px" /> <img src="//images.ctfassets.net/6yom6slo28h2/4zZBBNgTL2uEa2wcMyWmEM/d53998c9fca51c7260f9bcefcec376d8/DNS_dimensions-160w.png" alt="DNS_dimensions-160w.png" class="image right" style="max-width: 160px;" /> <p>Once we have the data in our distributed big data database, there are all kinds of powerful things we can do with it, including custom query-based Dashboards. Before we explore the <a href="https://kb.kentik.com/Db02.htm">Dashboard possibilities</a>, let’s take a look at the DNS fields that Kentik Detect makes available as group-by dimensions and for filtering. As shown at right, we can look at data based on the DNS query itself, the DNS query type code, the DNS return code, and the DNS response the server sent back to the client.</p> <h4 id="top-talkers-and-beyond">Top Talkers and Beyond</h4> <p>Now let’s use these DNS fields to look at specific aspects of our DNS utilization. Note that in the following descriptions we’ll be looking at visualizations that are rendered within dashboard panels that we’ve created in Kentik Detect. You can learn how this is done by referring to our KB topic on <a href="https://kb.kentik.com/Db02.htm#Db02-Add_or_Edit_Dashboard">Adding Dashboard Panels</a>. Preset dashboards with some of these views will be added to Kentik Detect in the coming months. It’s worth noting that dashboard panels can be created from views in the portal’s Data Explorer, and also opened in Data Explorer for further drill-down. Also note that in addition to monitoring via dashboards you can set our alerting system to notify you of changes to any of the monitored parameters (see this post on <a href="https://www.kentik.com/kentik-detect-alerting-configuring-alert-policies/">Configuring Alert Policies</a>).</p> <p>Probably the most obvious way to look at DNS traffic is to examine the Top Talkers to the DNS infrastructure. Let’s start by looking at the top DNS clients that are talking to our servers. In this particular case, we are looking at the traffic in packets per second (PPS), but we could also look at bits per second (BPS), unique Src IPs, Geo, or a number of other metrics. Monitoring this panel will enable us to see at a glance if one individual client is sending an increased number of queries to our servers.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4HQQQpb5bWqegwg0iSUqqG/66545e44ea528cdb2cb5ed9bf8a824fa/DNS_top_clients-630w.png" alt="DNS_top_clients-630w.png" class="image center" style="max-width: 630px;padding: 20px" /> <p>The next thing we want to keep an eye on is the top queries we are receiving from the clients. This gives us the ability to quickly see if there is a significant increase in DNS queries to our servers for a single hostname. 
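To make the idea of “top queries” concrete outside the portal, here is a small stand-alone sketch that tallies the most-queried hostnames from a BIND-style query log. It is purely illustrative: the Kentik host agent derives this information from traffic on the wire, and the log path and line format below are assumptions you would adjust for your own servers.</p> <pre><code># Purely illustrative: a stand-alone approximation of the "top DNS
# queries" view. The Kentik host agent derives this from traffic on the
# wire; here we just tally hostnames from a BIND-style query log. The
# log path and line format are assumptions -- adjust for your servers.
import re
from collections import Counter

QUERY_RE = re.compile(r"query: (\S+) IN (\S+)")

def top_queries(log_path, n=10):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            m = QUERY_RE.search(line)
            if m:
                counts[(m.group(1), m.group(2))] += 1  # (hostname, qtype)
    return counts.most_common(n)

if __name__ == "__main__":
    for (qname, qtype), hits in top_queries("/var/log/named/query.log"):
        print(f"{hits:8d}  {qtype:6s}  {qname}")
</code></pre> <p>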
Here’s a Data Explorer view of this metric.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5XZP2m5km4e4YQsai62qwk/8941bd93f5912d7e303312e921891dc2/DNS_by_ingress-795w.png" alt="DNS_by_ingress-795w.png" class="image center" style="max-width: 795px;padding: 20px" /> <p>All good so far, so let’s take a look at the responses that our servers are returning. Similar to the previous view, this gives us a quick glimpse at how many responses we are sending back to the clients for each hostname, so we can monitor for increases in a given response.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6ebOlkwY4oaOQksggkEgYA/ddf2625ce4bb214298bb5a5885fcdb4c/DNS_by_egress-821w.png" alt="DNS_by_egress-821w.png" class="image center" style="max-width: 821px;padding: 20px" /> <p>This is very powerful information, but we aren’t done yet. As shown in the following dashboard panels, we can also look at the type of DNS queries we are receiving (below left) as well as the return codes that are being sent back to the client (below right). This lets us quickly see if we have a change in one particular query type or response code in our infrastructure.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5b3zJnddCg2wwwI0g8I8yw/d84a9900caffc68af1c2d6c968f5a1b3/DNS_query_code-814w.png" alt="DNS_query_code-814w.png" class="image center" style="max-width: 814px; padding: 20px" /> <h4 id="query-handling-and-server-utilization">Query Handling and Server Utilization</h4> <img src="//images.ctfassets.net/6yom6slo28h2/73QSWmnty8aqSuC0WEEoKK/68caa02c5ee44f48079fd19e3068ae59/DNS_Local_or_forward-300w.png" alt="DNS_Local_or_forward-300w.png" class="image right" style="max-width: 300px; padding: 10px" /> <p>Another thing that is interesting to keep an eye on is how many of the received DNS queries are handled by our local servers and how many of them are forwarded on to an authoritative DNS server. Watching this can indicate whether our clients are requesting new hostnames that have not been seen before.</p> <p>We can also check on the utilization of our DNS servers. We can look at this in a few different ways, the first of which is by market (shown below). Here we are looking at the PPS that is received by the DNS infrastructure in each one of our markets. We can quickly see if we have an increase in utilization (PPS) in any one market.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3LrCMiUL3yKQGE8UCiWuka/3d790efd1330e54e3e66c387a2faaf6e/DNS_by_market-630w.png" alt="DNS_by_market-630w.png" class="image center" style="max-width: 630px; padding: 20px" /> <p>Within a market, we can take a look at the load on each server in terms of PPS (below left). This gives a more granular look at how much traffic is being handled by each server within the market. If we start to see one server taking a lot more load than another, we probably need to investigate the root cause.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4enVTOhv3GIkU648KsegGK/a6e10ce8c3915dfe07352c7b4328447f/DNS_ring-821w.png" alt="DNS_ring-821w.png" class="image center" style="max-width: 821px; padding: 20px" /> <p>Another way to monitor the load on our DNS servers is by the IP type (IPv4 vs IPv6) of the requests that are coming in (above right). 
Shifts in this mix don’t necessarily indicate a problem, but they’re worth keeping an eye on.</p> <h4 id="summary">Summary</h4> <p>The views shown above are just a few of the many that can be built into a dashboard to monitor a DNS infrastructure. If you’re already a Kentik Detect user, contact <a href="mailto:[email protected]">Kentik support</a> for more information; our Customer Success team can help you build your own custom DNS Dashboard. If you’re not yet a user and you’d like to get started monitoring your own infrastructure, <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[News in Networking: Internet v. Hate, DDoS Waves, and a $300M Cyber Hit]]><![CDATA[Web companies put the brakes on hate group The Daily Stormer’s use of their services. The EFF responded with a warning. A new DDoS attack known as “pulse wave” was uncovered. DDoS attacks hit a bunch of Blizzard Entertainment games. And the biggest shipping company said a single attack cost it nearly $300 million.]]>https://www.kentik.com/blog/news-in-networking-the-internet-v-hate-ddos-waves-and-a-300m-cyber-hithttps://www.kentik.com/blog/news-in-networking-the-internet-v-hate-ddos-waves-and-a-300m-cyber-hit<![CDATA[Michelle Kincaid]]>Fri, 18 Aug 2017 18:42:35 GMT<p>This week’s top story picks from the Kentik team.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/3Epzjhcn8QwMICq2eoeksk/31dc0875f367b7c6fb9eb2543f320b3a/News_tablet-396w.png" alt="News_tablet-396w.png " class="image right" style="max-width: 300px;" /> There was no shortage of industry news this week. Web companies put the brakes on hate site The Daily Stormer using their services. The EFF responded, “We would be making a mistake if we assumed that these sorts of censorship decisions would never turn against causes we love.” A new DDoS attack known as “pulse wave” was uncovered with its clockwork-like traffic spikes. DDoS attacks hit a bunch of Blizzard Entertainment games. And the biggest shipping company, Maersk, said a single attack cost it nearly $300 million.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="http://gizmodo.com/cloudflare-ceo-on-terminating-service-to-neo-nazi-site-1797915295"><strong>Cloudflare CEO Opens Up Far-Right Website to DDoS Attacks</strong></a> <strong>(Gizmodo)</strong> Cloudflare CEO Matthew Prince made the executive decision to end DDoS protection for the alt-right website The Daily Stormer. “This was my decision. This is not Cloudflare’s general policy now, going forward,” Prince told Gizmodo. 
“I think we have to have a conversation over what part of the infrastructure stack is right to police content.” The EFF <a href="https://www.eff.org/deeplinks/2017/08/fighting-neo-nazis-future-free-expression">responded</a> to say, “We would be making a mistake if we assumed that these sorts of censorship decisions would never turn against causes we love.”</li> <li><a href="https://www.darkreading.com/attacks-breaches/pulse-wave-ddos-attacks-emerge-as-new-threat-/d/d-id/1329657"><strong>‘Pulse Wave’ DDoS Attacks Emerge As New Threat</strong></a> <strong>(Dark Reading)</strong> Research from security firm Imperva reveals a new type of DDoS attack, where “short but successive bursts of attack traffic” are a clockwork attempt to take down targets.</li> <li><a href="https://www.scmagazine.com/world-of-warcraft-overwatch-hearthstone-and-other-games-hit-by-ddos/article/681691/"><strong>World of Warcraft, Overwatch, Hearthstone, Other Games Hit by DDoS</strong></a> <strong>(SC Magazine)</strong> Blizzard Entertainment has been no stranger to DDoS attacks over the years. This week the gaming giant said an attack hit the servers hosting the popular World of Warcraft, Overwatch, Hearthstone, and other games.</li> <li><a href="https://www.wsj.com/articles/maersk-sees-surprise-loss-as-cyberattack-costs-take-effect-1502871686"><strong>Maersk Estimates $300M Loss from Cyber Attack</strong></a> <strong>(Wall Street Journal)</strong> What’s the cost of a cyber attack on your business? If you’re the world’s largest shipping company, it’s about $300 million. That’s how much A.P. Moller-Maersk said it lost from an outage caused by a recent attack.</li> <li><a href="https://www.totaltele.com/497790/Verizon-to-fund-own-emergency-services-network"><strong>Verizon to Fund Own Emergency Services Network</strong></a> <strong>(TotalTelecom)</strong> Following AT&#x26;T’s FirstNet plans, Verizon announced it also intends to roll out a new services network for first responders to use in the event of an emergency.</li> <li><a href="https://www.engadget.com/2017/08/14/mit-s-new-ai-can-keep-streaming-video-from-buffering/"><strong>MIT’s New AI Prevents Streaming Video from Buffering</strong></a> <strong>(Engadget)</strong> A new neural network built by researchers at MIT aims to stop buffering and pixelation on streaming video once and for all. The AI-based system “automatically adjusts video quality based on network conditions,” according to Engadget.</li> <li><a href="https://www.bloomberg.com/news/articles/2017-08-16/london-maintains-lead-in-private-data-plumbing-ahead-of-brexit"><strong>London Maintains Lead in Network Capacity Ahead of Brexit</strong></a> <strong>(Bloomberg)</strong> London’s private network capacity can handle 159 terabytes per second, according to a report from data center provider Equinix. That ranks the city ahead of New York City, Frankfurt, and many others’ capacities. Bloomberg reminds, “Businesses are turning to private data networks due to the huge amounts of data they are now generating.”</li> <li><a href="http://www.zdnet.com/article/github-seeks-to-spur-innovation-with-kubernetes-migration/"><strong>GitHub Seeks to Spur Innovation with Kubernetes Migration</strong></a> <strong>(ZDNet)</strong> Deploying Kubernetes clusters that run application containers isn’t easy, according to GitHub. However, the company did it to “allow for faster innovation on the online code sharing and development platform,” reports ZDNet. 
In separate GitHub news, the company’s CEO, Chris Wanstrath, <a href="https://www.cnbc.com/2017/08/17/github-ceo-chris-wanstrath-is-stepping-down.html">announced</a> this week that he is stepping down.</li> <li><a href="https://www.technologyreview.com/s/608640/inside-the-increasingly-complex-algorithms-that-get-packages-to-your-door/"><strong>Increasingly Complex Algorithms That Get Packages to Your Door</strong></a> <strong>(MIT Tech Review)</strong> “If you had to hand-deliver 50 packages, how would you go about planning the best route?” The answer is not easy, according to MIT Tech Review, which tried to get input on the topic from some of the biggest delivery companies. This feature story discusses the algorithms that aim to simplify the process.</li> <li><a href="https://arstechnica.com/gadgets/2017/08/hughes-signs-deal-to-launch-100mbps-satellite-internet-service-in-2021/"><strong>Hughes Signs Deal to Launch 100Mbps Satellite Internet Service in 2021</strong></a> <strong>(Ars Technica)</strong> Hughes Network Systems “is planning for its next major leap in bandwidth—a 100 Mbps-capable network based on a new satellite to be launched in 2021.” Ars Technica explains how Hughes plans to do it.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[The Why and How of Interface Classification]]><![CDATA[Kentik addresses the day-to-day challenges of network operations, but our unique big network data platform also generates valuable business insights. A great example of this duality is our new Interface Classification feature, which streamlines an otherwise-tedious technical task while also giving sales teams a real competitive advantage. In this post we look at what it can do, how we've implemented it, and how to get started.]]>https://www.kentik.com/blog/the-why-and-how-of-interface-classificationhttps://www.kentik.com/blog/the-why-and-how-of-interface-classification<![CDATA[Greg Villain &#x26; Philip De Lancie]]>Mon, 14 Aug 2017 13:16:39 GMT<h3 id="classifying-network-interfaces-enhances-engineering-and-business-insights"><em>Classifying Network Interfaces Enhances Engineering and Business Insights</em></h3> <p>Given that Kentik was founded primarily by network engineers, it’s easy to think of our <em>raison d’être</em> in terms of addressing the day-to-day challenges of network operations. While that’s a key aspect of our mission, our unique big data platform for capturing, unifying, and analyzing network data actually supports a broader scope. Kentik Detect is built to generate valuable insights not only from the technical perspective but also for business intelligence. A great example of this duality is a feature called <strong>Interface Classification</strong> (IC) that we recently introduced to our portal. IC saves technical teams effort by streamlining an otherwise tedious task, but it also exposes information that provides sales teams with real competitive advantage. In this post we’ll look at several aspects of this new capability, including the technical and business problems it solves, the database dimensions we’ve added to enable it, and how to get started using it. 
In a future post we’ll take a closer look at some typical use cases.</p> <h4 id="keeping-counts-real">Keeping Counts Real</h4> <p>From a technical standpoint, Interface Classification provides an elegant solution to a critical question of flow-based analytics: how do we uniquely identify flows when they may pass through multiple interfaces within our autonomous system (AS)? We know that a flow is a set of packets that share common attributes including protocol, source and destination IP, and src, dst, and next-hop AS. The problem arises because flow records (NetFlow, sFlow, IPFIX, etc.) are generated by individual routers, switches, and hosts rather than system-wide. So unless the records generated at various points can be correlated, a given flow may be counted as multiple flows, throwing off the count.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3One2jfUl2CKy4YQYAGkiY/a79c21575faeb613ff3c03ea3b540815/IC-Flows_diagram-749w.png" alt="IC-Flows_diagram-749w.png" class="image center" style="max-width: 749px;" /> <p>In the example diagrammed above, AS100 is an imaginary Kentik Detect customer that receives traffic from and sends traffic to multiple ASNs via five network devices. We have only two traffic flows (the light and dark gray lines), but each flow is generating three flow records that are sent to Kentik Detect: at ingress (orange segments at left), at every internal network element (red segments in center), and at egress (blue segments at right). If we want to count flows at ingress, then we need to be able to exclude the flow records that aren’t from the two orange segments at the left. Likewise, if we want to count the flows at egress, we need to be able to exclude the flow records that aren’t from the two blue segments at the right.</p> <p>The multiple-count issue has (naturally) been anticipated in Kentik Detect from the outset. The flow metadata stored in the Kentik Data Engine (our backend) isn’t simply what comes to us from network devices; instead it is enhanced with a variety of derived or externally acquired information (e.g. BGP, GeoIP, SNMP, etc.) that makes it possible to minimize double-counting. For example, you can manually select the devices included in a query, or use filters based on device names (e.g. names containing “EDGE”) or manually-entered interface descriptions. You can even use Saved Filters to avoid having to recreate the necessary filters for each query.</p> <p>The downside of these approaches is that they place a hurdle in front of users who are trying to get information. And unfortunately they aren’t completely foolproof, because the user has no way of knowing if the fields being filtered are consistent (does everybody in the organization use the exact same label strings?) or accurate (have descriptions been updated when interfaces are repurposed?).</p> <p>Interface Classification addresses the above issues by introducing the concept of <strong>Network Boundary</strong>, which allows any interface in a Kentik Detect user’s network to be characterized in one of two ways:</p> <ul> <li><strong>Internal</strong>: Connects the network device to which it belongs with an interface on another device within the same ASN. 
This could be a backbone interface, or an interface towards a server inside the network.</li> <li><strong>External</strong>: Connects the network device to which it belongs with an interface on a device in a different ASN, which might belong to a transit provider or a peer.</li> </ul> <p>Kentik Detect keeps a device database for information about each device (and its interfaces) that your organization registers with the system. When traffic crosses a device, we can derive from the device-generated flow records (e.g. NetFlow) which interfaces it passed through, look up those interfaces in the device database, and incorporate attribute values from that database into the flow records that we store in the Kentik Data Engine (KDE). To implement Network Boundary we’ve augmented the device database with a network boundary attribute, which enables the following:</p> <ul> <li>Each interface on each device can now be classified with its Network Boundary value (Internal or External).</li> <li>Each flow record stored in KDE can now include the Network Boundary value for its source and destination interfaces.</li> </ul> <p>Assuming that the interfaces themselves have been correctly classified (we’ll get to this later), if we now filter on a given combination of source and destination network boundary values, as shown in the table below, we can easily zero in only on traffic with the directionality that we want to query, and avoid seeing the same flow counted multiple times.</p> <table class="Table_Body"> <tbody> <tr class="Table_Head"> <td>Network Boundary</td> <td>SOURCE = Internal</td> <td>SOURCE = External</td> </tr> <tr> <td>DESTINATION = Internal</td> <td><i>Internal traffic:</i> The flow represents traffic that is circulating inside the network from router to router or from router to host.</td> <td><i>Inbound traffic:</i> The flow represents ingress traffic from a peer, transit provider, Internet Exchange, etc.</td> </tr> <tr> <td>DESTINATION = External</td> <td><i>Outbound traffic:</i> The flow represents egress traffic from your network to another network.</td> <td><i>Hot-potato traffic:</i> The flow represents traffic that goes directly from a peer to a peer without ever crossing your backbone.</td> </tr> </tbody> </table> <h4 id="what-does-my-traffic-cost">What Does My Traffic Cost?</h4> <p>Now that we have a sense of Interface Classification from the engineering perspective, let’s dig into the business side. Needless to say, you can’t make a profit moving traffic unless you sell your services for more than your cost. So to optimize pricing you need to know the end-to-end ROI of traffic coming from a given network, transiting through your own network, and exiting at multiple locations in different volumes. You also need to be able to slice traffic by categories of connectivity: transit, Internet Exchange peering, direct peering (free or paid), long-distance backbone, metropolitan backbone, etc. That information isn’t included in any flow protocol, which has made rational pricing surprisingly difficult to determine. As with the flow-counting issue, Kentik Detect’s powerful filtering capabilities can get you a long way toward the info you need. 
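For instance, before Interface Classification you could approximate “egress via transit” by matching on your interface description conventions, along the lines of the sketch below (illustrative only: the naming convention and the flow-record layout are assumptions invented for this example).</p> <pre><code># Illustrative pre-Interface-Classification heuristic: guess that a flow
# egressed via transit when the output interface's description follows
# the network's naming convention. The convention ("IPT:" prefix) and
# the flow-record layout here are assumptions made up for this sketch.
TRANSIT_MARKERS = ("ipt:", "transit")

def looks_like_transit_egress(flow: dict) -> bool:
    desc = flow.get("output_interface_description", "").lower()
    return any(marker in desc for marker in TRANSIT_MARKERS)

flows = [
    {"output_interface_description": "IPT: NTT ae-3"},
    {"output_interface_description": "BB: sjc-iad backbone"},
]
print([looks_like_transit_egress(f) for f in flows])  # [True, False]
</code></pre> <p>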
Heuristics like these work because networking best practices — if consistently adhered to — give you a couple of ways to try to figure out what’s what:</p> <ul> <li><strong>Interface description</strong>: Most decent-size networks follow more or less consistent naming conventions, and those names are one of the first attributes looked at when troubleshooting, so there’s usually a strong incentive to keep them up to date. If an organization has SNMP polling enabled on all of its network devices, descriptions are readily available and pulled by Kentik Detect.</li> <li><strong>Interface IP address</strong>: Depending on a network’s addressing policies, connectivity type can be inferred from the IP addressing on the interface, which is polled by Kentik Detect. Examples of policies might include:<br> - a range of IP addresses may be used exclusively for transit customers;<br> - a range of RFC1918 IP addresses may be used exclusively for interfaces on CDN servers behind a load balancer;<br> - a range may correspond to the LAN of a particular Internet Exchange.</li> </ul> <p>The two methods above enable us to take the same approach to connectivity typing that we took with network boundary. In other words, we created attributes for the <strong>Connectivity Type</strong> of interfaces, and we can assign connectivity type values for both source and destination interfaces to flows as they are ingested into KDE. That makes it much easier to include connectivity type as a factor in a query, so you can look at the type of network from which a given set of traffic is entering and the type to which it is going. Here’s the list of connectivity types that we are supporting in our initial release:</p> <ul> <li>Transit</li> <li>IX Peering (Fabric)</li> <li>Free private peering</li> <li>Paid private peering</li> <li>Backbone</li> <li>Customer</li> <li>Host</li> <li>Available (identifies an interface that is available and unused versus one that isn’t available, e.g., no transceiver).</li> <li>Reserved (identifies an interface that is currently available but already allocated for future use).</li> </ul> <h4 id="interface-classification-rules">Interface Classification Rules!</h4> <p>So far we’ve looked at the challenges that Interface Classification addresses and the attributes whose values can now be incorporated directly into KDE flow records. But these attributes only help us if attribute values have been accurately and consistently assigned to the interfaces themselves. That’s a big hurdle, because some networks have many thousands or tens of thousands of interfaces to classify. And it’s not a one-time job, because interfaces are added or decommissioned every day, rendering manual classification nearly impossible to keep consistent. So we need a practical way to automate the initial classification and then keep it up to date. That brings us to the real fun, which is our new Interface Classification rules engine.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4LtngcziRaw2g2scykGmC8/68eae622c4dbb8f0fb44fbe5fb643780/Edit_rule-300w.png" alt="Edit_rule-300w.png" class="image right" style="max-width: 300px; margin-bottom: 20px;" /> <p>The rules engine, accessed in the portal via Admin » Interface Classification, looks periodically at the SNMP attributes gathered from the interfaces that you’ve registered with Kentik Detect, and it evaluates those attributes using a limited set of user-configurable “if-match-then-classify” rules. 
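Conceptually, the evaluation is first-match-wins down an ordered rule list, along the lines of the following sketch (an illustration of the idea, not Kentik’s actual implementation; the rule fields mirror those described next).</p> <pre><code># A conceptual sketch of "if-match-then-classify" evaluation -- an
# illustration only, not Kentik's implementation. Rules are tried in
# order and the first match wins, mirroring the portal's rule list.
import ipaddress
import re
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Interface:
    description: str
    ip: Optional[str] = None

@dataclass
class Rule:
    matches: Callable[[Interface], bool]
    connectivity_type: str   # e.g. "transit", "free private peering"
    network_boundary: str    # "internal" or "external"

RULES = [
    Rule(lambda i: "peering: pi" in i.description.lower(),
         "free private peering", "external"),
    Rule(lambda i: re.search(r"transit", i.description, re.I) is not None,
         "transit", "external"),
    Rule(lambda i: i.ip is not None
         and ipaddress.ip_address(i.ip) in ipaddress.ip_network("10.0.0.0/8"),
         "host", "internal"),
]

def classify(iface: Interface) -> Optional[Tuple[str, str]]:
    for rule in RULES:              # first (top-most) match wins
        if rule.matches(iface):
            return rule.connectivity_type, rule.network_boundary
    return None                     # unclassified

print(classify(Interface("peering: PI equinix-sv1", "10.1.2.3")))
# -> ('free private peering', 'external')
</code></pre> <p>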
These rules are defined in the editing pane of the Add Rule dialog (image at right), which you reach via the green Add Rule button.</p> <p>The “If” section tells the rules engine what condition to match, while the “Then” section enables you to define the corresponding Connectivity Type. By default, the Network Boundary is automatically determined by the Connectivity Type. You can change the automatic correspondence between the two attributes (click the Configure Network Boundaries link on the main IC page), or you can toggle Auto to off, then manually set the boundary that will be applied for an individual rule.</p> <p>These settings allow us to build rules in a couple of different ways:</p> <ul> <li>Base the rule on interface description, which might be stated as something like “if description contains <em>peering: PI</em> then classify the interface as <em>free private peering</em>.” In this case “peering: PI” would be a string that you provide based on your knowledge of the interface description conventions used on your network, and <em>free private peering</em> would be selected from the connectivity type drop-down.</li> <li>Base the rule on IP, which might be stated as “if IP address is in subnet <em>198.51.100.0/24</em> then classify the interface as <em>host</em>.”</li> </ul> <p>The following table gives an idea of the types of matching you can currently use in your rules:</p> <table class="Table_Body"> <tbody> <tr class="Table_Head"> <td>Interface Attribute</td> <td>Match clause</td> <td>Matches when...</td> </tr> <tr> <td>Description</td> <td>Equals</td> <td>Provided string is an exact match with the description (case sensitive).</td> </tr> <tr> <td>Description</td> <td>Contains</td> <td>Provided string is found in the description (case insensitive).</td> </tr> <tr> <td>Description</td> <td>Matches Regex</td> <td>Provided string is found in the description with Standard Regex match.</td> </tr> <tr> <td>IP Address</td> <td>is contained in subnet</td> <td>Interface's IP address is within the user-provided CIDR.</td> </tr> <tr> <td>IP Address</td> <td>is a Public IP Address</td> <td>Interface's IP address is a publicly routable IP address.</td> </tr> <tr> <td>IP Address</td> <td>is a Private IP Address</td> <td>Interface's IP address is reserved (e.g. RFC1918, test-net, doc-net, apipa, cgn, etc.).</td> </tr> <tr> <td>IP Address</td> <td>has no IP address</td> <td>Interface has no IP address.</td> </tr> </tbody> </table> <p>The Interface Classification UI gives you the ability to test rules before they are actually applied so that you know how many (and which) interfaces would be classified by the rule. Once you are happy with a rule, you save it to the rules list (shown in the screenshot below). In most cases you’ll define multiple rules, which can be reordered in the list as desired (as well as disabled or deleted).</p> <img src="//images.ctfassets.net/6yom6slo28h2/45qPf05GYEwmM8o4mquimi/c72fa3055db5104e80d5c7dd07cb7d53/Rules_list-813w.png" alt="Rules_list-813w.png" class="image center" style="max-width: 813px;" /> <p>When you click the Evaluate Rules button (upper right), the engine works through the interfaces on all of your registered devices. Classification is applied based on the first (top-most) rule whose match condition (“if”) results in a match. When an interface is classified, the values of the two attributes (connectivity type and network boundary) are written to your organization’s devices database in Kentik Detect, which is updated every three hours. 
From there the values are applied to incoming flow records as they are processed by our ingest layer.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7tH3i2NBpmkQMMs8U228cu/16a2f432ed186eea0b85b065c87f02af/Class_status-300w.png" alt="Class_status-300w.png" class="image right" style="max-width: 280px;" /> <p>If no rule is matched for a given interface, that interface won’t be classified. The percent of your interfaces that were successfully classified is shown (see image at right) as part of the Classified Interfaces pane of the Interface Classification page. The pane also provides a list of devices that have unclassified interfaces. High classification ratios are usually achieved by using a rigorously consistent interface description naming system and IP addressing system across your entire infrastructure. Now that we understand what Interface Classification is, we can look at various scenarios to see how you can use it when querying your network data in Kentik Detect. We’ll get to that in the next post in this series. In the meantime, for more details see the <a href="https://kb.kentik.com/Cb10.htm">Interface Classification article</a> in the Kentik Knowledge Base. You can also familiarize yourself with Kentik in general, and Interface Classification in particular, by <a href="#demo_dialog">scheduling a demo</a> or signing up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[News in Networking: Docker, Disney, and Big Data Caps]]><![CDATA[Docker’s $1.3 billion valuation. Disney’s $1.6 billion acquisition for streaming services. And international internet speed tests and U.S. home internet data caps. Those stories and more after the jump.]]>https://www.kentik.com/blog/news-in-networking-docker-disney-and-big-data-capshttps://www.kentik.com/blog/news-in-networking-docker-disney-and-big-data-caps<![CDATA[Michelle Kincaid]]>Fri, 11 Aug 2017 17:58:01 GMT<p>This week’s top story picks from the Kentik team.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/1rrNppq7naMA8ceMkuwqO6/e48811c2082ccd4be866f64be0dd16c6/News_tablet-396w.png" alt="News_tablet-396w.png" class="image right" style="max-width: 300px;" /> Docker’s popularity is running strong, to the tune of a $1.3 billion valuation, according to Bloomberg. Disney is also showing big money, reportedly paying $1.6 billion for streaming services provider BAMTech. The news comes after Disney announced plans this week to cut its streaming ties with Netflix. And finally, how fast is your internet? And how much of your home data is getting capped? Two new reports aim to answer that.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://www.bloomberg.com/news/articles/2017-08-09/docker-is-said-to-be-raising-funding-at-1-3-billion-valuation"><strong>Software Maker Docker Is Raising Funding at $1.3 Billion Valuation</strong></a> <strong>(Bloomberg)</strong> Docker is expected to announce it raised $75 million in funding, according to Bloomberg’s source. 
This investment “will help fuel Docker’s newest push to win business customers and finally monetize its free open-source tools popular with developers worldwide.”</li> <li><a href="https://451research.com/report-short?entityId=93122&#x26;type=mis&#x26;alertid=633&#x26;contactid=0033200002BXCL9AAP&#x26;utm_source=sendgrid&#x26;utm_medium=email&#x26;utm_campaign=market-insight&#x26;utm_content=newsletter&#x26;utm_term=93122-Disney+nabs+BAMTech+in+%241.6bn+play+for+streaming+services"><strong>Disney Nabs BAMTech in $1.6B Play for Streaming Services</strong></a> <strong>(451 Research)</strong> Not only did Disney this week announce it’s ending a valuable streaming deal with Netflix, it also announced the $1.6 billion acquisition of BAMTech. According to 451 Research, the deal “shows that Disney sees value not only in building its own streaming services but also in enabling other studios to do the same.”</li> <li><a href="https://techcrunch.com/2017/08/09/aws-joins-the-cloud-native-computing-foundation/"><strong>AWS Joins the Cloud Native Computing Foundation</strong></a> <strong>(TechCrunch)</strong> Amazon announced AWS is now part of the Cloud Native Computing Foundation (CNCF), the open-source home of the Kubernetes project. According to TechCrunch, “Rumor is that Amazon’s AWS cloud computing platform will soon launch its own Kubernetes-based container management service.”</li> <li><a href="https://www.itwire.com/networking/79413-telstra-s-big-bets-for-the-network-of-2020.html"><strong>Telstra’s Big Bets for the Network of 2020</strong></a> <strong>(ITWire)</strong> Australian telecom giant Telstra is looking ahead — to 2020. The company’s vision, according to its principal consultant of networks, Craig Mulhearn, will include SDN, “multiple mobile networks, data centres in the exchange, and catering for autonomous vehicle networking.”</li> <li><a href="http://www.lightreading.com/automation/big-data-is-on-centurylinks-automation-wish-list/d/d-id/735310?"><strong>Big Data Is on CenturyLink’s ‘Automation’ Wish List</strong></a> <strong>(Light Reading)</strong> Big data is a big part of CenturyLink’s networking efforts. Enterprise Architect Kevin McBride told Light Reading that using automation and analytics will inform its networks what to do next. “You can start asking questions of that data and have it place a workload,” he said. “The feedback loop is what makes this powerful.”</li> <li><a href="https://techcrunch.com/2017/08/09/ookla-speedtest-global-index/"><strong>How Internet Speeds Stack Up in New Monthly Global Ranking</strong></a> <strong>(TechCrunch)</strong> Speedtest.net is a go-to for checking broadband speeds. Now, the company is putting together a monthly report to show the rankings. TechCrunch reports, “The data will be compiled into a monthly report called the Speedtest Global Index, and the first one is available now.”</li> <li><a href="https://arstechnica.com/information-technology/2017/08/at-least-196-internet-providers-in-the-us-have-data-caps/"><strong>Data Cap Analysis Found Almost 200 ISPs Imposing Data Limits in US</strong></a> <strong>(Ars Technica)</strong> ISP tracker BroadbandNow reports that 196 home internet providers in the U.S. “impose monthly caps on Internet users.
Not all of them are enforced, but customers of many ISPs must pay overage fees when they use too much data,” according to Ars Technica.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[HBO Attack: Can Alerting Help Protect Data?]]><![CDATA[Major cyber-security incidents keep on coming, the latest being the theft from HBO of 1.5 terabytes of private data. We often frame Kentik Detect's advanced anomaly detection and alerting system in terms of defense against DDoS attacks, but large-scale transfer of data from private servers to unfamiliar destinations also creates anomalous traffic. In this post we look at several ways to configure our alerting system to see breaches like the attack on HBO.]]>https://www.kentik.com/blog/hbo-attack-can-alerting-help-protect-datahttps://www.kentik.com/blog/hbo-attack-can-alerting-help-protect-data<![CDATA[Justin Ryburn]]>Wed, 09 Aug 2017 13:00:06 GMT<h3 id="setting-kentik-detect-to-alert-on-unauthorized-data-transfer">Setting Kentik Detect to Alert on Unauthorized Data Transfer</h3> <p>Last week we saw the latest in a series of major cyber-security incidents, this time an <a href="http://ew.com/tv/2017/07/31/hbo-hacked-game-of-thrones/">attack against HBO</a> that caused the leak of 1.5 terabytes of private data. Here at Kentik, there’s been a lot of talk around the (virtual) water cooler about how the Kentik platform, specifically our alerting system, might have helped HBO realize that they had a breach. We’ve been planning a deeper dive into a specific use case since our <a href="https://www.kentik.com/kentik-detect-alerting-configuring-alert-policies/">last post on alerting</a>, so I thought this incident could serve as the basis for a few informative examples of what our alerting system can do.</p> <p>To be clear, the following are hypotheticals that we’re exploring here solely for the purpose of illustration. Each of the following scenarios involves creating an alert policy based on an assumption about how a massive data theft like the HBO case would manifest itself in the network data that is collected and evaluated by the Kentik alerting system. The example policies are not mutually exclusive and could be used in combination to increase the chance of being notified of such an event. In each case we begin on the Alert Policy Settings page of the Kentik portal: choose Alerts » Alerting from the main navbar, then go to the Alert Policies tab and click the Create Alert Policy button at upper right.</p> <h3 id="top-25-interfaces">Top 25 Interfaces</h3> <p>There haven’t been a lot of technical details released on the attack itself, but we can assume that if we were HBO we would keep private data on a server that is separate from the servers from which we stream HBO Go and HBO Now. Given that, for our first example we can build an alert that looks for a change in the source interfaces that are sending traffic out of our network. <img src="//images.ctfassets.net/6yom6slo28h2/1TZMuEgODGaOU4KgUAMykg/f1cc75c86f979a5d899c894da7bb331e/Interface_shift-818w.png" alt="Interface_shift-818w.png " class="image center" style="max-width: 800px;" /></p> <p>To build our policy, we begin with the Devices pane in the sidebar at the left of the Policy Settings page, which we set to look at data across all of our devices.
This could be narrowed down if we are looking at specific switches that would source highly sensitive data.</p> <p>In the Query pane, also in the sidebar, we set the group-by dimension that the policy will monitor to Source Interfaces, and we set the primary metric to Bits/s. This means that the top-X evaluation made by the alerting system will involve looking at all unique source interfaces to determine which have the highest traffic volume as measured in Mbps. Also in the sidebar, we’ll use the Filters pane to apply a filter on the MYNETWORK tag, which excludes traffic destined for our own network. That way we will only capture traffic that is leaving the network.</p> <p>Next we’ll move to the General Settings section of the Policy Settings page. We set the Track field to track current traffic for 50 items (source interfaces, as we set in the Query pane), and the History Top field to keep a baseline history of 50 items. We’ll also set the policy to automatically determine (based on an internal algorithm) the minimum amount of traffic that a source interface will need in order to be included in our Top 50 list.</p> <p>Moving on to Historical Baseline Settings, we configure the settings to look back over the last 21 days in 1 day increments (steps). Since we are evaluating every 5 minutes, we need to decide how we will aggregate this data in hourly rollups as well as our Final Aggregation for the 21 days of baseline. We set both of these settings to use 25th percentile. With the Look Around setting, we broaden the time slice used for Final Aggregation to 2 hours instead of a single 1-hour roll-up.</p> <p>Finally, we configure the threshold, which is a set of conditions that must be reached in order for the alert to actually trigger. In this case, we use an “if-exist” comparison to compare our current evaluation of the top 25 source interfaces to the historical baseline group of 50 source interfaces with the most traffic. The activate if setting makes it so that the threshold will only be evaluated for interfaces with greater than 50 Mbps of traffic.</p> <p>With the settings above, we’ve created an alert policy that is continually keeping track of the top 50 source interfaces that are sending traffic out of the network. If any of the top 25 interfaces in the current sample are not in the top 50 historically, the threshold will trigger, sending us a notification (we covered Notification Channels in the previous blog, so we haven’t gone into detail here on how to set that up). If 1.5 terabytes of data left the network from an interface that normally had fairly low volume, this policy would definitely fire off a notification.</p> <h3 id="top-15-destination-ip-addresses">Top 15 Destination IP Addresses</h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/6TWtVqcDZYWo2agWmWK8ms/cc373bd708f152e85bb58728044e7c64/Dst_IP_shift_sidebar-400w.png" alt="Dst_IP_shift_sidebar-400w.png " class="image right" style="max-width: 300px;" /> Another assumption we can make is that if we were HBO we would keep our private servers on their own subnet. For this example, we’ll assume an IP/CIDR of 172.16.12.0/24, and we’ll build an alert that looks for changes in the destination IP addresses that are talking to this source.</p> <p>This alert policy is similar to the previous one, but it monitors the top 35 destination IP addresses and creates an alert if the current Top 15 are not in the Top 30 historically. To make this policy work, we start back at the sidebar.
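</p> <p>Before we fill in those settings, here is a rough Python sketch of the top-X comparison that drives all three example policies in this post. This is purely illustrative pseudologic on our part, not how Kentik evaluates policies internally:</p> <pre><code>def should_trigger(current_topx, baseline_topx, traffic, min_mbps=50.0):
    """Fire when a sufficiently busy key in the current top-X list is
    missing from the historical baseline top-X set ("if-exist" check)."""
    for key in current_topx:
        if traffic.get(key, 0.0) > min_mbps and key not in baseline_topx:
            return True, key
    return False, None

# Policy 1: compare the current top 25 source interfaces against the
# historical top 50 (hypothetical values shown).
current = ["eth0", "xe-0/0/3"]
baseline = {"eth0", "eth1", "ae0"}
mbps = {"eth0": 40.0, "xe-0/0/3": 900.0}
print(should_trigger(current, baseline, mbps))  # (True, 'xe-0/0/3')</code></pre> <p>With that logic in mind, let's continue configuring the policy.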
In the Query pane we set the group-by dimension to destination IP/CIDR. And in the Filter pane we narrow the data that we are evaluating by setting a filter for the source subnet that the private server is part of.</p> <p>We also change Historical Baseline settings so that the policy uses 98th percentile aggregation for the Minute -> Hourly Rollup and 90th percentile for the Final Aggregation. Lastly we’ll change the comparison options settings so that the current group is top 15 and the baseline group is top 30. If breached data were transferred from the private subnet to a new IP on the Internet then this threshold would be triggered and we would receive notification.</p> <h3 id="top-10-countries">Top 10 Countries</h3> <p>Another alert policy variation we could use is to look at destination country. This approach would be based on the assumption that the country this attack came from is not one that normally sees a lot of traffic from this private server. In the Query pane we’ll change the group-by dimension to destination country, but we’ll keep the Filter pane as-is to look at the source subnet for the private server. In General Settings, meanwhile, we’ll set both Track and History Top to 20.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2ghYhiLIysw4s6YQ2o4ie2/5deab42ba3e1ace7f05babb4ae544cdb/Site_Dst_Geo_threshold-812w.png" alt="Site_Dst_Geo_threshold-812w.png " class="image center" style="max-width: 800px;" /> <p>Configured this way, the alert policy is monitoring the Top 20 countries and the threshold will be triggered if any of the Top 10 in the current sample is not in the Top 20 historically. If we see a large volume of traffic leaving our server headed to a country to which we don’t normally send traffic, we’ll get a notification.</p> <h3 id="summary">Summary</h3> <p>It’s always easier to see the take-aways of a situation when you have the benefit of being a Monday Morning Quarterback. But what these examples show is that with the powerful capabilities of Kentik’s alerting system you can monitor your network for changes in traffic patterns and be notified of traffic anomalies that might just indicate the kind of data grab suffered by HBO.</p> <p>If you’re already a Kentik Detect user, contact <a href="mailto:[email protected]">Kentik support</a> for more information — our Customer Success team is here to help — or head on over to our <a href="https://kb.kentik.com/Ab10.htm">Knowledge Base</a> for more details on alert policy configuration. If you’re not yet a user and you’d like to experience first hand what Kentik Detect has to offer, <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[How Many Routes Do You Really Need?]]><![CDATA[With BGP and NetFlow correlated into a unified datastore, Kentik Detect's advanced analytics provide valuable insights for both engineering and sales. In this post we look into a fairly recent addition to Kentik Detect, Route Traffic Analytics. 
Especially useful for capacity planners and peering coordinators, RTA makes it easy to see how many unique routes are represented in a given percent of your traffic, which indicates the route capacity needed in your edge routers.]]>https://www.kentik.com/blog/how-many-routes-do-you-really-needhttps://www.kentik.com/blog/how-many-routes-do-you-really-need<![CDATA[Justin Ryburn]]>Mon, 07 Aug 2017 13:00:01 GMT<h3 id="using-kentik-detect-analytics-to-optimize-network-design"><em>Using Kentik Detect Analytics to Optimize Network Design</em></h3> <p>In a <a href="https://www.kentik.com/learning-from-your-bgp-tables/">previous post</a>, we looked at how Peering Analytics in the Kentik Detect portal allows you to visualize the traffic leaving your network by geography, site, or BGP AS path. In this post, we’ll look at another analytics feature, namely Route Traffic Analytics. Like Peering Analytics, RTA is useful to Capacity Planners and Peering Coordinators. It evaluates flow and BGP data to derive the distribution of traffic across routes, allowing you to see how many unique network prefixes (routes) are represented in a given percent of your traffic. That tells you which routes you need to have in your IP forwarding table (a.k.a. Forwarding Information Base, FIB, or CEF) to cover a given percentage of your traffic. Pretty awesome stuff! To access Route Traffic Analytics, go to Analytics » Route Traffic, where you’ll see a quick summary of total traffic for the last 24 hours on the device that’s currently selected in the Devices pane of the sidebar. The average and maximum Mb/s numbers are the same numbers you would get in the Data Explorer if you ran a query for total traffic on this device for the last 1 day.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1EwpTcAevikiIOMSOkqa44/816725f91c261ac47ab62b11637ef255/RTA-quick_traffic-800w.png" alt="RTA-quick_traffic-800w.png" class="image center" style="max-width: 800px;" /> <h4 id="route-distribution">Route Distribution</h4> <p>The quick traffic calculation is interesting as far as it goes, but let’s dig in a little more. In the sidebar’s Options pane, click the drop-down Action menu and choose Summary. You can also change which devices and 24-hour time-range you look at. Don’t forget to apply your changes with the big green button at the top. This should return a graph showing your route traffic distribution.</p> <img src="//images.ctfassets.net/6yom6slo28h2/hyzJ2WpdsIo8MOgmEMIkQ/266fc7c902933d7989b71842817f475b/RTA-route_distribution-823w.png" alt="RTA-route_distribution-823w.png" class="image center" style="max-width: 800px;" /> <p>What is this graph telling us? On the left (Y-axis) we have the traffic in Mbps and on the bottom (X-axis) we have the number of routes in the forwarding table. As the number of routes increases, we see a corresponding increase in total traffic (plotted with the blue line). We also have lines for 80th, 90th, and 95th percentile. The dashed horizontal lines represent the traffic for each percentile. The colored vertical lines represent the number of routes for each percentile. This allows us to see at a pretty quick glance how many routes cover how much of our traffic. 
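</p> <p>The computation behind that curve is simple enough to sketch. Here is an illustrative Python fragment (our own toy example, not Kentik's implementation) that sorts routes by traffic and counts how many are needed to reach a coverage target:</p> <pre><code>def routes_needed(route_mbps, coverage=0.95):
    """How many of the busiest routes carry `coverage` of total traffic?"""
    totals = sorted(route_mbps.values(), reverse=True)
    target = coverage * sum(totals)
    running = 0.0
    for count, mbps in enumerate(totals, start=1):
        running += mbps
        if running >= target:
            return count
    return len(totals)

# Hypothetical average Mbps per destination prefix over 24 hours:
traffic = {"203.0.113.0/24": 620.0, "198.51.100.0/24": 310.0,
           "192.0.2.0/24": 45.0, "2001:db8::/32": 25.0}
print(routes_needed(traffic, coverage=0.95))  # 3 routes cover 95%</code></pre> <p>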
In addition to the graph there is also a table that gives further information about the same underlying data.</p> <img src="//images.ctfassets.net/6yom6slo28h2/26Ube1sVMssWQG4MsqYM42/8e4ecef1e974b3b711a6560023688c0e/RTA-summary_table-821w.png" alt="RTA-summary_table-821w.png" class="image center" style="max-width: 800px;" /> <p>As you can see, each row represents traffic (over the selected 24-hour window) for a given route. The row tells us the percentage of overall traffic the route represents as well as that route’s average and maximum Mbps. It also shows the percentage of the forwarding table that is covered up to this row, and the cumulative average Mbps of all routes up to this row.</p> <h4 id="top-1000-routes">Top 1000 routes</h4> <p>Once you are done digesting the route distribution summary, let’s dig a bit deeper. Back on the Action menu, select Top 1000 Routes, then click Apply Changes. You should then see a table that looks something like the following.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3ULvXooP7iyCcSiuEe4mWa/b1fdae42c593d49a135a7bef1ef06430/RTA-Top_1000-808w.png" alt="RTA-Top_1000-808w.png" class="image center" style="max-width: 800px;" /> <p>So what is this telling us? Each row in the table gives us a set of details (representing a 24-hour period) for a given route, including average Mbps, Maximum Mbps, and total cumulative average Mbps of this route plus the ones above it. It also shows the percent of overall traffic this route represents, the percentage of your forwarding table covered up to this row, the number and name of the route’s destination AS, and the route’s destination country code.</p> <h4 id="how-route-traffic-analytics-helps">How Route Traffic Analytics Helps</h4> <p>In a previous life I worked at a hardware vendor, and if I’d had access to this kind of information I could have used it all the time. Best practices in network design traditionally dictated that an edge router must be able to hold a full BGP route table, and as the size of the global BGP route table exploded, so too did the memory requirements for those routers. As forwarding table size became one of the biggest cost drivers for routers, customers often wanted to know if they really needed an expensive router or if they could get away with a cheaper switch.</p> <p>The question is still valid today: do you really need a full BGP route table in your device? What if you deployed a lower cost switch that could store only a partial table? How many routes do you really need? The fact is that if you can cover 99% of your traffic with a few thousand routes, you can just get a default route for the other 1% of your traffic. Most modern switches can handle 128k routes with no issue. That could equate to huge savings on the type of device you deploy at the edge of your network. For more information on “Lean FIB”, check out <a href="https://www.fastly.com/blog/building-and-scaling-fastly-network-part-1-fighting-fib/">this blog post</a> from Fastly.</p> <p>Once we’ve analyzed our route table and figured out what we need to know, we still might like to be able to dump this data to a CSV file so we can analyze it further. No problem! Go back to the Action menu, choose Export Top Routes As CSV, and then Apply Changes. You’ll now see a link in the main display area that you can click on to download the Top 1000 routes data as a CSV file.</p> <p>Does all this Route Traffic Analytics talk leave you eager to unlock the knowledge hiding in your routing table? 
<a href="#demo_dialog">Request a demo</a> of Kentik Detect, or start a <a href="#signup_dialogl">free trial</a> today. If you’re already a Kentik Detect user, you’ll find more detailed information on the Analytics features in our <a href="https://kb.kentik.com/v3/Db06.htm#Db06-Route_Traffic_Analytics">Knowledge Base</a>. Or contact <a href="mailto:[email protected]">support</a> for further assistance.</p><![CDATA[News in Networking: DDoS Mitigations and VPNs in China and Russia]]><![CDATA[DDoS attacks for all: Netflix tests against these disruptive attacks. Analyst firm Frost & Sullivan talks about how service providers need better DDoS mitigation. New Kaspersky Lab research recognized a Chinese telecom as seeing the biggest DDoS attack in Q2. More after the jump...]]>https://www.kentik.com/blog/news-in-networking-ddos-mitigations-and-china-russias-vpnshttps://www.kentik.com/blog/news-in-networking-ddos-mitigations-and-china-russias-vpns<![CDATA[Michelle Kincaid]]>Fri, 04 Aug 2017 17:01:12 GMT<p>This week’s top story picks from the Kentik team.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/5Jii8AbKi4Soi6oq8mICSK/ee0b8e828564512674df0ba27302e2eb/News_tablet-396w.png" alt="News_tablet-396w.png" class="image right" style="max-width: 300px;" /> DDoS attacks and their mitigation efforts are all over the news this week. A new Wired feature talks about how streaming giant Netflix tests against these disruptive attacks. Analyst firm Frost &#x26; Sullivan talks about how service providers need better DDoS mitigation. And new Kaspersky Lab research recognized a Chinese telecom as seeing the biggest DDoS attack in Q2. Also in the news this week, China and Russia both cracked down on VPN usage.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://www.wired.com/story/netflix-ddos-attack/"><strong>How Netflix DDoS’ed Itself to Protect the Entire Internet</strong></a> <strong>(Wired)</strong> In a Wired feature out this week, Scott Behrens, a security engineer at Netflix, talks about how the streaming giant tests DDoS attacks. According to the story, “Netflix engineers reroute customers away from a certain region of production servers so they can have a real-world sandbox in which to experiment. The process also helps ensure that Netflix can continue to provide service to its customers even if one of its regions goes down or experiences problems; during a Chaos Kong [test] all user traffic gets rerouted from a particular region, ideally without customers noticing.”</li> <li><a href="https://ww2.frost.com/news/press-releases/sophisticated-ddos-attacks-drives-service-providers-seek-more-optimal-and-efficient-mitigation-approach/"><strong>Sophisticated DDoS Attacks Drive Service Providers to Seek Better Approach</strong></a> <strong>(Frost &#x26; Sullivan)</strong> A new report called “Service Provider Requirements for DDoS Mitigation” from Frost &#x26; Sullivan discusses how “DDoS attacks have become more formidable than ever, growing year-over-year in terms of scale, frequency and sophistication. 
As a result, service providers must revisit their DDoS defenses and strategies on a regular basis, on top of re-evaluating their effectiveness and ability to meet their needs.”</li> <li><a href="http://www.darkreading.com/attacks-breaches/chinese-telecom-ddos-attack-breaks-record-/d/d-id/1329518"><strong>Chinese Telecom DDoS Attack Breaks Record</strong></a> <strong>(Dark Reading)</strong> Two hundred seventy-seven hours: that’s how long a DDoS attack on a Chinese telecom company lasted earlier this year, according to a new report published this week by Kaspersky Lab. The Lab team said the attack represents “a 131% hourly increase compared to the longest attack recorded earlier this year.”</li> <li><a href="https://www.theregister.co.uk/2017/08/02/typosquatting_npm/"><strong>This Typosquatting Attack on npm Went Undetected for 2 weeks</strong></a> <strong>(The Register)</strong> According to The Register, “A two-week-old campaign to steal developers’ credentials using malicious code distributed through npm, the Node.js package management registry, has been halted with the removal of 39 malicious npm packages.”</li> <li><a href="https://www.computerworld.com.au/article/625550/why-ssl-tls-attacks-rise/"><strong>Why SSL/TLS Attacks Are on the Rise</strong></a> <strong>(Computerworld)</strong> New research from Zscaler reports that “as enterprises get better about encrypting network traffic to protect data from potential attacks or exposure, online attackers are also stepping up their SSL/TLS game to hide their malicious activities.”</li> <li><a href="http://searchnetworking.techtarget.com/tip/Service-providers-rushing-in-to-provide-container-networking"><strong>Service Providers Rushing in to Provide Container Networking</strong></a> <strong>(TechTarget)</strong> TechTarget reminds, “It is no secret that container networking has been the new hotness in the development and open source world for the last couple of years.” Now, thanks to big networking vendors, container networking is also picking up momentum.</li> <li><a href="https://www.nytimes.com/2017/08/03/business/china-internet-censorship.html"><strong>China’s Internet Censors Play a Tougher Game of Cat and Mouse</strong></a> <strong>(The New York Times)</strong> China is known for its Great Firewall as a way to limit content from getting into its citizens’ hands. Now, the country is shutting down VPNs for tech-savvy citizens who know how to get around the wall. According to The New York Times, “In recent days, Apple has pulled apps that offer access to such tools — called virtual private networks, or VPNs — off its China app store, while Amazon’s Chinese partner warned customers on its cloud computing service against hosting those tools on their sites.
Over the past two months, a number of the most popular Chinese VPNs have been shut down, while two popular sites hosting foreign television shows and movies were wiped clean.”</li> <li><a href="http://www.reuters.com/article/us-russia-internet-idUSKBN1AF0QI"><strong>Putin Bans VPNs to Stop Russians Accessing Prohibited Websites</strong></a> <strong>(Reuters)</strong> Also in an effort to ban access to certain websites and content, Russian President Vladimir Putin signed a law this week prohibiting VPNs.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Learning From Your BGP Tables]]><![CDATA[Can BGP routing tables provide actionable insights for both engineering and sales? Kentik Detect correlates BGP with flow records like NetFlow to deliver advanced analytics that unlock valuable knowledge hiding in your routes. In this post, we look at our Peering Analytics feature, which lets you see whether your traffic is taking the most cost-effective and performant routes to get where it's going, including who you should be peering with to reduce transit costs.]]>https://www.kentik.com/blog/learning-from-your-bgp-tableshttps://www.kentik.com/blog/learning-from-your-bgp-tables<![CDATA[Justin Ryburn]]>Mon, 31 Jul 2017 13:00:40 GMT<h3 id="kentiks-advanced-analytics-turns-routes-into-insights"><em>Kentik’s Advanced Analytics Turns Routes into Insights</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/3VA02EuWZ206acC8e6Kcwe/2163dc06a3b4bc057a9bceccc01a76e5/Monitor_screen-430w.png" class="image right" style="max-width: 400px" alt="" /> <p>When it comes to routing traffic on the global Internet, BGP is the one and only protocol. While it’s not perfect, it has withstood the test of time and is now so embedded in how the Internet works that it would be near impossible to change. That’s why network engineers have long used the BGP routing table on routers or looking glasses to get an idea of how their Internet traffic is routed. An experienced network engineer can read a BGP table like Neo reads the code in the Matrix. Kentik was founded by network engineers, and we’ve taken BGP-reading a step further by marrying the routing table on network devices with flow data (e.g. NetFlow) coming from those devices. This gives you the ability to analyze your route table in ways that were never before possible. When I was a network operator, this type of visibility would have been a game changer. What would I have done with all the time I saved? This is the premise behind the Analytics features in Kentik Detect, Peering Analytics and Route Traffic Analytics. We’ll look at Peering Analytics in this post and at Route Traffic Analytics in a future post.</p> <h4 id="peering-analytics">Peering Analytics</h4> <p>Peering Analytics allows you to visualize the traffic leaving your network by geography, site, or BGP AS path. This is extremely useful to Peering Coordinators and IP Capacity Planning teams but can also be used as a target prospect tool by Sales and Marketing teams at service providers. To use this powerful feature you must be running BGP between at least one device in your network and the Kentik Data Engine (KDE). Details on how to do that can be found in Knowledge Base topics <a href="https://kb.kentik.com/Ab06.htm">BGP Overview</a> and <a href="https://kb.kentik.com/Bd01.htm">Router BGP Configuration</a>. 
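</p> <p>Conceptually, what Peering Analytics is doing is the join sketched below: take each flow's byte count, look up the BGP AS path its destination prefix follows, and sum traffic per path. Here is a deliberately tiny Python illustration (hypothetical data; Kentik performs this correlation at massive scale in KDE):</p> <pre><code>from collections import defaultdict

# Hypothetical (bytes, AS path) records after correlating flow with BGP:
flows = [
    (1_200_000, (64500, 64496, 64511)),
    (800_000,   (64500, 64496, 64511)),
    (300_000,   (64500, 64499)),
]

traffic_by_path = defaultdict(int)
for nbytes, as_path in flows:
    traffic_by_path[as_path] += nbytes

for path, nbytes in sorted(traffic_by_path.items(), key=lambda kv: -kv[1]):
    print(" -> ".join(f"AS{asn}" for asn in path), f"{nbytes / 1e6:.1f} MB")
# AS64500 -> AS64496 -> AS64511 2.0 MB
# AS64500 -> AS64499 0.3 MB</code></pre> <p>Back to the setup.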
Once that is done, Peering Analytics can be found in the Kentik Detect portal at Analytics » Peering. You will also want to have 3-4 days' worth of flow data stored in KDE for Peering Analytics to return useful information. Before jumping into peering visualizations, you must first define a BGP Dataset. The BGP Dataset allows you to precisely specify the information that you want to include in your analysis. When creating a Dataset, you determine which devices you want to look at, how you want to filter the traffic (we recommend using the <a href="https://kb.kentik.com/Cb04.htm">MYNETWORK_OUT</a> tag described in our Knowledge Base), the time range for the data, and the depth of the path information. Since this feature uses one-hour slices of data, the minimum Time Range is two days. For more details on each option check out our KB topic <a href="https://kb.kentik.com/Db05.htm#Db05-Dataset_Advanced_Fields">Dataset Advanced Fields</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5tkRVsc7q8uQCQwuiqWOeS/4fa843d0a5ab254c5d639e7f0834c85d/Dataset_create-810w.png" class="image center" style="max-width: 800px" alt="" /> <p>Once you click Create Dataset, you will be returned to the BGP Dataset List. Make sure that the Rows (#) field is incrementing, which tells you that your Dataset is being populated with data.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4cX9c6ZB5uUaYGEk2Miuiq/445d8f0aa6d1a5eb1e6bf2027b5363c4/Dataset_list-825w.png" class="image center" style="max-width: 825px" alt="" /> <h4 id="visualizing-your-routes">Visualizing Your Routes</h4> <p>Now that we’ve built our Dataset, let’s take a look at the awesome visualizations available from this feature. To get started, click on the Peering link in the list (at the right of the row, under Actions) to bring up the Peering Analytics page. In the left pane, you have some options to narrow down what you are looking at. You can narrow to individual devices, filter on a number of dimensions, filter by Interface, and filter by ASN. In the main pane you’ll see visualizations of your BGP Path data. The first one should look something like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/4L20zweFdmweAgY0GSwgAa/894bc103eafdd74a6c74c78f71b1c752/Peering_sankey-674w.png" class="image center" style="max-width: 625px" alt="Sankey" /> <p>This is a Sankey diagram showing the BGP paths that your traffic is taking through the Internet to reach its destination. The width of each grey bar indicates the proportion of the traffic that is taking a given path. If you hover over a grey bar, it will give you more details on the source and destination of that particular path and how much traffic is following it. The next visualization we will see further down the page is a time-series line graph that looks something like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/6hZUPq2NnqMGCqqmE4UCyW/8a949183aae8441f93d131d071bb65f3/Peering_paths-674w.png" class="image center" style="max-width: 625px" alt="Peering paths" /> <p>Each line on this graph represents a different BGP Path. On the Y-axis we have the Mbps that is taking that particular path. This is plotted against a timeline on the X-axis. The final visualization we have on the BGP Path tab is a table of the data from these two graphs.
This one should look something like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1ZpH3dFZQwEaGKmAAeM4ks/ed1b94e1d0a3b98b3a4562cb6a550e93/Peering_table-674w.png" class="image right" style="max-width: 670px" alt="Peering table" /> <p>Each row in the table represents a different BGP path. To get more details you can click on the magnifying glass in a given row to get a popup with additional visualizations for that specific BGP Path. Clicking on the graph icon in the Explore column will load this data into the Data Explorer where you can dive even further into the traffic details. In addition to the valuable information on this BGP Path tab there are four more tabs to explore. While all of these tabs have similar visualizations, each displays slightly different types of information:</p> <ul> <li><strong>Transit ASNs</strong> - This shows the sum of traffic through and to each ASN (after the first hop). This allows you to see potential peers and customers that you may want to contact.</li> <li><strong>Last-Hop ASNs</strong> - This shows the ASNs and countries to which traffic leaving your network is ultimately going, which allows you to quickly see where your traffic is being consumed. This is particularly helpful for content providers who need to be able to see who is consuming their content.</li> <li><strong>Next Hop ASNs</strong> - This shows the traffic from your network to the ASNs with which you have a direct BGP relationship. This allows your IP Capacity Planning team to see the total traffic across the network that is being sent to each BGP neighbor.</li> <li><strong>Countries</strong> - This shows the source and destination countries of traffic flowing through your network.</li> </ul> <p>As you can see, once you start looking at your BGP tables combined with flow data you can extract a lot of useful information. As a scale-out big data SaaS, Kentik Detect is uniquely capable of providing these insights. To learn more about how Kentik Detect can unlock the knowledge hiding in your BGP routing table, <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a>. If you’re already a customer, consult our <a href="https://kb.kentik.com/Db05.htm">Knowledge Base</a> or contact <a href="mailto:[email protected]">support</a> for more detailed information on how to use Peering Analytics.</p><![CDATA[News in Networking: The Best Network Speeds, Cuba’s DIY Internet, and the Cost of Automation]]><![CDATA[This week, Verizon was dubbed a winner of network speeds. Mitel agreed to buy ShoreTel. Viacom said it wouldn’t buy Scripps. Packet grew globally. The Ericsson-Cisco partnership slowed. And feature articles highlighted Cuba's internet and IT automation costs. More headlines after the jump...]]>https://www.kentik.com/blog/news-in-networking-the-best-network-speeds-cubas-diy-internet-and-the-cost-of-automationhttps://www.kentik.com/blog/news-in-networking-the-best-network-speeds-cubas-diy-internet-and-the-cost-of-automation<![CDATA[Michelle Kincaid]]>Fri, 28 Jul 2017 13:00:05 GMT<h3 id="this-weeks-top-story-picks-from-the-kentik-team"><em>This week’s top story picks from the Kentik team.</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/3nr4EFGlxYwSYQmOWWWMoq/3b27deacedc85631c52bd2b265cb1328/News_tablet-396w.png" style="max-width: 396px;" class="image right">This week Verizon was dubbed the winner of network speeds. Mitel agreed to buy ShoreTel, while Viacom said it wouldn’t buy Scripps Networks. 
The cloud provider Packet grew globally, while the Ericsson-Cisco partnership slowed. A feature story explored the lengths Cubans go to for internet access, and another feature looked at the cost of IT automation.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://www.cnet.com/news/verizon-keeps-top-spot-as-best-performing-network/"><strong>Verizon Keeps Top Spot as Best-Performing Network</strong></a> <strong>(CNET)</strong> A new report from RootMetrics ranks Verizon as having the best network speeds. According to CNET, “All four major wireless carriers improved their median network speeds in at least 20 markets.”</li> <li><a href="https://techcrunch.com/2017/07/27/mitel-to-buy-shoretel-for-430-million-to-create-unified-communications-powerhouse/"><strong>Mitel to Buy ShoreTel for $430M to Create Unified Communications Powerhouse</strong></a> <strong>(TechCrunch)</strong> Telecommunications company and UC provider Mitel is buying competitor ShoreTel. According to TechCrunch, “The two companies when combined will have 3,200 channel partners and 4,200 employees worldwide.”</li> <li><a href="https://www.wsj.com/articles/viacom-out-of-the-running-for-scripps-networks-1501112001?mod=djem10point"><strong>Viacom Out of the Running for Scripps Networks</strong></a> <strong>(Wall Street Journal)</strong> Media conglomerate Viacom has bowed out of the bidding to buy Scripps Networks Interactive. The Wall Street Journal reports this leaves Discovery Communications as “the only remaining suitor in talks to purchase the cable TV programmer.”</li> <li><a href="https://techcrunch.com/2017/07/26/packet-expands-its-global-infrastructure-with-11-new-edge-locations/"><strong>Packet Expands Its Global Infrastructure with 11 New Edge Locations</strong></a> <strong>(TechCrunch)</strong> Bare metal cloud provider Packet is quickly gaining ground. The company announced it’s expanding into 15 data centers across the globe. “The idea behind these new edge locations, which are now online in Los Angeles, Seattle, Dallas, Chicago, Ashburn, Atlanta, Toronto, Frankfurt, Singapore, Hong Kong and Sydney, is to give Packet’s users the ability to easily deploy a global low-latency infrastructure,” reported TechCrunch.</li> <li><a href="http://searchnetworking.techtarget.com/news/450423229/Ericsson-Cisco-partnership-activity-expected-to-slow-until-2020"><strong>Ericsson-Cisco Partnership Activity Expected to Slow Until 2020</strong></a> <strong>(TechTarget)</strong> Some analysts have lowered their expectations for telecom equipment manufacturer Ericsson’s partnership with Cisco. TechTarget reports the deal was intended to generate $1 billion in annual sales by 2018, but Ericsson has seen a buying slowdown largely due to the fact that “most service providers have bought the network infrastructure needed for the 4G networks connecting mobile devices today.” That will change with 5G — but likely not until 2020.</li> <li><a href="https://arstechnica.com/tech-policy/2017/07/facebook-alphabet-amazon-and-netflix-called-to-testify-on-net-neutrality/"><strong>Congress Summons ISPs and Websites to Hearing</strong></a> <strong>(Ars Technica)</strong> Net neutrality debates continue, with ISPs and web companies being the latest group called to testify before Congress.
“I’m sending formal invitations to the top executives of the leading technology companies including Facebook, Alphabet, Amazon, and Netflix, as well as broadband providers including Comcast, AT&#x26;T, Verizon, and Charter Communications, inviting each of them to come and testify before our full Energy and Commerce Committee,” US Rep. Greg Walden (R-Ore.) said during an FCC oversight hearing, according to Ars Technica.</li> <li><a href="https://www.wired.com/2017/07/inside-cubas-diy-internet-revolution/"><strong>Inside Cuba’s DIY Internet Revolution</strong></a> <strong>(Wired)</strong> Wired just published a feature story on how Cubans are creating workarounds for lack of access to an open internet. One of those workarounds: “Every week, more than a terabyte of data is packaged into external hard drives known as el paquete semanal (“the weekly package”). It is the internet distilled down to its purest, most consumable, and least interactive form: its content. This collection of video, song, photo, and text files from the outside world is cobbled together… and it travels around the island from person to person… a network that transmits data via shoe rubber, bus, horseback, or anything else.”</li> <li><a href="http://www.lightreading.com/automation/the-hidden-%28human%29-cost-of-automation-/a/d-id/734923?"><strong>The Hidden (Human) Cost of Automation</strong></a> <strong>(Light Reading)</strong> Light Reading’s founder and CEO discusses automation in a new article this week. Included in his opinion piece is an example: “AT&#x26;T currently employs 246,000 staff. Moving to a software-based, virtualized and automated network could potentially enable it to eliminate 30% of its staff (or 78,000 employees) according to one source. Assuming an average salary of $70,000, not including benefits, that’s an annual savings for AT&#x26;T of $5,166,000,000 — or a scoot over $5 billion — but also, obviously, results in a major human cost to those who could lose their jobs.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Why Your NetFlow is Safe in the Cloud]]><![CDATA[Among Kentik Detect's unique features is the fact that it's a high-performance network visibility solution that's available as a SaaS. Naturally, data security in the cloud can be an initial concern for many customers, but most end up opting for SaaS deployment. In this post we look at some of the top factors to consider in making that decision, and why most customers conclude that there's no risk to taking advantage of Kentik Detect as a SaaS.]]>https://www.kentik.com/blog/why-your-netflow-is-safe-in-the-cloudhttps://www.kentik.com/blog/why-your-netflow-is-safe-in-the-cloud<![CDATA[Alex Henthorn-Iwane]]>Mon, 24 Jul 2017 13:00:36 GMT<h3 id="five-reasons-the-kentik-detect-saas-is-secure"><em>Five Reasons the Kentik Detect SaaS is Secure</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/6m7r9D7oQM8socI2s4gsMa/0a0fda0f4deac28ab2093fddbb39e849/Cloud_safe-500w.png" alt="Cloud_safe-500w.png" class="image right no-shadow" style="max-width: 400px;" /> <p>We spend a lot of time in this blog talking about the key capabilities that make Kentik Detect unique in the world of network management. We ingest flow records — augmented with goodies like BGP and GeoIP — at massive scale (currently a hundred billion daily).
Data is available for querying in under three seconds from receipt. We respond to ad hoc queries across billions of rows in under two seconds (95th-percentile). And we retain network data unsummarized for 90 days (longer by arrangement). Enabled by a scale-out big data architecture that’s purpose-built for network operations, these capabilities are critical for effective visibility. But our architecture also makes possible another unique and valuable attribute of our product, which is that Kentik Detect is a SaaS.</p> <p>Being a SaaS offers huge benefits to our customers, but it’s not a characteristic that every prospect initially finds attractive. (Full disclosure: we also offer Kentik Detect deployed on premises or on a private cloud.) In fact, depending on the type of organization, we often hear concern from the network operations team about whether the security team will let them export NetFlow to the cloud. In this post we’ll look at why so many of our customers who start with a “no NetFlow in cloud” stance ultimately realize how safe — and cost-effective — it is to use us as a SaaS.</p> <h4 id="1-netflow-data-is-only-metadata">1. NetFlow data is only metadata.</h4> <p>NetFlow is derived only from packet headers, and is made up only of metadata about the packets that make up your traffic (in that sense it’s analogous to the call detail records retained by your telephony providers). There’s no way to determine from NetFlow the actual content of the packets themselves, which might include proprietary data or be subject to PII, PCI, or HIPAA restrictions. Further, if the flow data is being produced by Internet edge devices, all of the packet headers from which that data is derived have already traversed the public Internet in the clear.</p> <h4 id="2-netflow-is-less-sensitive-than-your-other-data-in-the-cloud">2. NetFlow is less sensitive than your other data in the cloud.</h4> <p>Like most businesses these days, your organization probably already makes use of SaaS tools such as Salesforce, Office365, Google Apps, Slack, GitHub, Box, Dropbox, Evernote, Marketo, or Eloqua.
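</p> <p>It helps to see just how little a flow record actually contains. Below is a schematic Python sketch of a NetFlow-style record (illustrative only; the exact field set varies by exporter and by the enrichment applied at ingest). Every field is metadata derived from headers or lookups, and the packet payload never appears:</p> <pre><code>from dataclasses import dataclass

@dataclass
class FlowRecord:
    """Schematic NetFlow-style record: header metadata plus enrichment."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int      # e.g. 6 = TCP
    packets: int
    bytes: int
    src_asn: int       # enrichment from BGP
    dst_country: str   # enrichment from GeoIP

r = FlowRecord("198.51.100.7", "203.0.113.9", 55001, 443, 6, 1200, 1_450_000,
               64500, "US")
print(r)  # nothing here reveals what the packets actually carried</code></pre> <p>Now think back to the SaaS tools just listed; does your organization use any of them?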
If so, the information you’re already putting into the cloud is far more private and sensitive than anything in NetFlow. Sending NetFlow to the cloud effectively adds zero additional risk.</p> <h4 id="3-bgp-peering-tables-are-already-publicly-visible">3. BGP peering tables are already publicly visible.</h4> <p>If you’re using BGP to peer with service providers, then you’re already sharing all of your routes with the Internet. There are many publicly available looking-glasses and route-view sites that can show the BGP routes you are advertising and to whom they are being advertised. Peering with Kentik is safer than the peering you’re already doing for Internet traffic delivery, and it allows our data engine to combine your NetFlow records with time-correlated BGP data, creating a unified datastore that enhances your ability to extract operational, security, and business insights.</p> <h4 id="4-kentik-accepts-netflow-via-encrypted-transit-or-pni">4. Kentik accepts NetFlow via encrypted transit or PNI.</h4> <p>Kentik offers an easy-to-deploy agent that will encrypt NetFlow at your local premises and put it in a secure tunnel to Kentik Detect. And if you have connectivity into Equinix, you have another and even more private way to get us your NetFlow, which is via a Private Network Interconnect to our Equinix colocation site.</p> <h4 id="5-netflow-can-be-stored-securely-and-privately">5. NetFlow can be stored securely and privately.</h4> <p>The Kentik Data Engine (KDE) was built from the ground up to keep each customer’s data completely separate, with no path that can be used to jump the fence. We also utilize many security safeguards such as regular vulnerability assessments, two-factor authentication, and automated source code security analyses. For more information on Kentik’s information security management program, check out the <a href="https://kb.kentik.com/Ab03.htm">Kentik Security article</a> in our Knowledge Base.</p> <h4 id="saas-as-a-safe-solution">SaaS as a Safe Solution</h4> <p>Based on the points above, it’s easy to see why the vast majority of our customers deploy on our multi-tenant SaaS infrastructure. Most of those who have initial concerns about NetFlow in the cloud are able to address the issues to the satisfaction of their security stakeholders. Of course there are some very large organizations that deploy Kentik Detect on-premises, and we also offer the option of a single-tenant cloud deployment. But it’s hard to beat the simplicity and low total cost of ownership of SaaS. Among the hundreds of customers to date that use our public SaaS you’ll find large enterprises, banking and finance companies, and government agencies. Having laid their data security concerns to rest, these customers are able to take advantage of the advanced capabilities mentioned at the start of this post.</p> <p>So now it’s your turn. If you’re ready for big data-powered network traffic intelligence, why wait? Learn more by digging into our <a href="https://www.kentik.com/kentik-detect/">product</a>, seeing what our <a href="https://www.kentik.com/customers/">customers</a> think, or reading our white paper about the <a href="https://info.kentik.com/rs/869-PAD-887/images/Kentik-KDE-whitepaper-Jul2015.pdf">Kentik Data Engine</a>. 
Better yet, dive right in by <a href="#demo_dialog">requesting a demo</a> or starting a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[News in Networking: Google’s Networking Algorithm, NASA’s IoT, and Another S3 Leak]]><![CDATA[Google announced a new algorithm for higher bandwidths and lower latencies. NASA gave advice on IoT networks. Python is the most popular programming language. And Dow Jones is the latest to see an S3 misconfiguration.]]>https://www.kentik.com/blog/googles-networking-algorithm-nasas-iot-network-and-another-s3-leakhttps://www.kentik.com/blog/googles-networking-algorithm-nasas-iot-network-and-another-s3-leak<![CDATA[Michelle Kincaid]]>Fri, 21 Jul 2017 14:53:14 GMT<p>This week’s top story picks from the Kentik team.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/3UCDkS9g4gUeeASkC8WOEm/a781d671de28f20336c38de46a62b1ba/News_tablet-396w.png" alt="News_tablet-396w.png" class="image right" style="max-width: 300px;" /> This week Google announced a new congestion control algorithm for higher bandwidths and lower latencies for internet traffic. NASA’s Jet Propulsion Laboratory CTO gave advice on successfully running an IoT network. Python is the most popular programming language. And Dow Jones is the latest to see an Amazon S3 misconfiguration, contributing to a data leak.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="https://techcrunch.com/2017/07/20/google-cloud-gets-a-new-networking-algorithm-that-improves-internet-throughput/"><strong>Google’s New Networking Algorithm Boosts Internet Throughput</strong></a> <strong>(TechCrunch)</strong> Google announced a new congestion control algorithm this week called TCP BBR. The algorithm “achieves higher bandwidths and lower latencies for internet traffic,” including for Google.com and YouTube.</li> <li><a href="http://www.networkworld.com/article/3208350/internet-of-things/nasa-s-cto-tells-enterprises-how-to-network-iot.html"><strong>NASA’s CTO Tells Enterprises How to Network IoT</strong></a> <strong>(NetworkWorld)</strong> For optimal network infrastructure in the age of industrial IoT, “build an IoT network that’s separate from the regular network.” That’s according to the CTO of NASA’s Jet Propulsion Laboratory, who advises that it will reduce cybersecurity risks for your organization.</li> <li><a href="http://www.zdnet.com/article/programming-languages-python-is-hottest-but-go-and-swift-are-rising/"><strong>Python is Hottest Programming Language, But Go and Swift Are Rising</strong></a> <strong>(ZDNet)</strong> Open source programming language Python has passed C and Java to become the most popular programming language, according to IEEE Spectrum’s fourth interactive programming language ranking.</li> <li><a href="http://www.zdnet.com/article/fedex-said-tnt-petya-attack-financial-hit-will-be-material-some-systems-wont-come-back/"><strong>FedEx Said Some Systems Won’t Come Back After TNT Petya Attack</strong></a> <strong>(ZDNet)</strong> FedEx is still feeling the effects of the recent Petya ransomware attack.
According to ZDNet, the shipping giant did not have cyber insurance and is “still evaluating the financial impact of the attack, but it is likely that it will be material.”</li> <li><a href="https://www.wsj.com/articles/dow-jones-inadvertently-exposed-some-customers-information-1500237742"><strong>Cloud Configuration Error Exposes Dow Jones Subscribers</strong></a> <strong>(Wall Street Journal)</strong> Dow Jones seems to be the latest to experience an Amazon S3 misconfiguration, causing a big data leak. (Verizon and Republican data firm Deep Root Analytics have also seen this happen recently.) About 2.2 million Dow Jones subscriber records were affected, according to the Wall Street Journal.</li> <li><a href="https://www.sdxcentral.com/articles/news/kentik-scores-kddi-deal-extends-monitoring-platform-to-japan/2017/07/"><strong>Kentik Scores KDDI Deal, Extends Monitoring Platform to Japan</strong></a> <strong>(SDxCentral)</strong> Here at Kentik this week, we announced our newest customer, KDDI, a Fortune 500 Japan-based company and the largest organization that has publicly affirmed its use of Kentik. We also announced a reseller relationship with Net One Systems, one of Japan’s largest systems integrators. As SDxCentral reports, the two announcements together mark Kentik’s entry into the Japanese marketplace.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Big Data SaaS Saves Network Operations!]]><![CDATA[What do summer blockbusters have to do with network operations? As utilization explodes and legacy tools stagnate, keeping a network secure and performant can feel like a struggle against evil forces. In this post we look at network operations as a hero's journey, complete with the traditional three acts that shape most gripping tales. Can networks be rescued from the dangers and drudgery of archaic tools? Bring popcorn...]]>https://www.kentik.com/blog/big-data-saas-saves-network-operationshttps://www.kentik.com/blog/big-data-saas-saves-network-operations<![CDATA[Alex Henthorn-Iwane]]>Wed, 19 Jul 2017 18:31:05 GMT<p>A Summer Blockbuster in Three Acts</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/19dnuuyqmCywGUw84kYMQc/50b3a3b5b0ca886bdec70c4dc6cb6238/Feline_triptych-500w.png" alt="Feline_triptych-500w.png" class="image right" style="max-width: 300px;" /> I recently had the chance to present at A10 Connect, a user conference for A10 Networks. I thought it would be fun to frame my presentation in three acts like a typical summer blockbuster. In act one, you have the “fundamental problem” facing the hero, typically an external problem like a zombie apocalypse or a broken romance. In act two, you have the “deeper challenge,” which is more internal, such as self-doubt due to a traumatic history, unreasonable demands from a supposed ally, or a betrayal from within the hero’s circle of trust. The third act is where new resources are typically revealed to help the hero gain resolution.
<img src="//images.ctfassets.net/6yom6slo28h2/3UrNalS0MEe0guUCC4AICU/875628de82e92cf3e8fe3383b1d05d52/wordpress-media-8771" alt="PackTrack-Route_map-492w.png" class="image right" style="max-width: 300px;" /> As anyone in operations knows, despite decades of continual advances in networking technology, it’s still a daily struggle to answer basic questions:</p> <ul> <li>Is it the network?</li> <li>What happened after deploying that new app?</li> <li>Are we under attack or did we shoot ourselves in the foot?</li> <li>How do we efficiently plan and invest in the network?</li> <li>How do we start to automate?</li> </ul> <p>Why is it still such a challenge to get actionable information about our networks? Because “package tracking” in a large network is a big data problem, and traditional network management tools weren’t built for that volume of data. As a point of comparison, Fedex and UPS together ship about 20 million packages per day, with an average delivery time of about one day. A large network can easily “ship” nearly 300 billion “packages” (aka traffic flows) per day with an average delivery time of 10 milliseconds. Tracking all those flows is like trying to drink from a fire hose: huge volume at high velocity.</p> <h4 id="the-red-herring-more-tools--screens">The Red Herring: More Tools &#x26; Screens</h4> <p><img src="//images.ctfassets.net/6yom6slo28h2/2034VlmQr2QA4Gmu62WgyO/b66f9cd67b9be253fbfdf684aca44eee/Tool_outcomes-300w.png" class="image right" style="max-width: 300px;" /> In traditional network management, the answer to this data overflow is to pile on more tools, resulting in more screens to look at when you need to figure out what’s going on. But study after study shows that the more you have to swivel between network management tools the less you are able to achieve positive network management outcomes. A representative paper from Enterprise Management Associates (EMA), <a href="https://info.kentik.com/rs/869-PAD-887/images/EMA-End-Of-Point-Solutions-WP.pdf">The End of Point Solutions</a>, concludes that network teams with one to three tools are able to detect problems before end users do 71% of the time. For teams with 11+ tools, that figure drops to only 48% of the time. Similarly, only 6% of teams with one to three tools experience days with multiple network outages, whereas 34% of teams with 11+ tools experience multiple-outage days. If the goal of a NetOps team is to ensure the delivery of high quality service, those are pretty telling numbers. Your network team is the hero in this movie, and your fundamental challenge is now clear.</p> <h4 id="act-2-the-deeper-challenge">Act 2: The Deeper Challenge</h4> <p>At this point in the movie, you can’t have all doom and gloom, so there is a ray of light. The good news is that most networks are already generating huge volumes of valuable data that can be used to answer many critical questions. Of course, that in turn brings up the deeper challenge: how on earth can you actually use that massive set of data? <img src="//images.ctfassets.net/6yom6slo28h2/5S2idEPU0EIAEUoQoKOeCu/fba7751e2bcdf0ca0d9556395f981452/Bit_lake-811w.png" alt="Bit_lake-811w.png" class="image right" style="max-width: 300px;" /> Old ways of thinking clearly aren’t going to get you anywhere, so a revelation is needed — an opening of the mind to new possibilities. 
In the case of network data, that means realizing that to get value from that data you need to consolidate it into a unified solution with the following capabilities:</p> <ul> <li>Data unification: the ability to fuse varied types of data (traffic flow records, BGP, performance metrics, geolocation, custom tags, etc.) into a single consistent format that enables records to be rapidly stored and accessed (a toy sketch of this fusion appears at the end of act 3).</li> <li>Deeply granular retention: keep unsummarized details for months, enabling ad hoc answers to unanticipated questions.</li> <li>Drillable visibility: unlimited flexibility in grouping, filtering, and pivoting the data.</li> <li>Network-specific interface: controls and displays that present actionable insights for network operators who aren’t programmers or data analysts.</li> <li>Fast (&#x3C;10 sec) queries: answers in real operational time frames, so users won’t ignore the data and go back to bad old habits.</li> <li>Anomaly detection: notify users of any specified set of traffic conditions, enabling rapid response.</li> <li>API access: integration via open APIs to allow access to stored data by third-party systems for operational and business functions.</li> </ul> <p>Of course, just opening one’s mind to the dream isn’t the same as having the solution. How do you get your hands on a platform offering the capabilities described above? You could try to construct it yourself, for example by building it with open source tools. But you’ll soon find that path leads to daunting challenges. You’ll need a lot of special skills on your team, including network expertise, systems engineers who can build distributed systems, low-level programmers who know how to deal with network protocols, and site reliability engineers (SREs) to build and maintain the infrastructure and software implementation. Even for organizations that have all those resources, it may not be worthwhile to devote them to this particular issue. At this point in the movie, you, as the hero, are truly facing a deeper challenge.</p> <h4 id="act-3-big-data-saas-to-the-rescue">Act 3: Big Data SaaS to the Rescue</h4> <p><img src="//images.ctfassets.net/6yom6slo28h2/21z0tlWgq4u68QqoCiCMEo/166d3489845ccc46400dd37a4710bfa3/Bat_cat-300w.png" alt="Bat_cat-300w.png" class="image right" style="max-width: 300px;" /> Fortunately for network teams, there’s an accelerating trend toward cloud services as the solution for all sorts of thorny problems. In the network traffic visibility realm, Kentik Detect is that solution. Kentik offers an easy-to-use big data SaaS that’s purpose-built to deliver real-time network traffic intelligence. Kentik Detect gives you detail at scale, super-fast query response, anomaly detection, open APIs, and a UI built by-and-for network operators. With Kentik Detect, network teams now have the analytical superpowers they’ve always needed. Our public cloud ingests and stores hundreds of billions of flow records in an average day and makes all of that incoming data queryable within three seconds of receipt. Even more impressive, ad-hoc multi-dimensional queries across multiple billions of rows return answers in less than two seconds (95th percentile). That defines “fast and powerful.” Instead of swiveling between several tools to troubleshoot a network performance issue, your teams can now quickly detect and address a huge range of network-related issues from a single solution.</p>
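<p>To make the “data unification” capability from act 2 concrete, here’s a minimal sketch of fusing a raw flow record with BGP and GeoIP context into a single enriched record. It’s illustrative only; the field names are invented for this example and are not Kentik’s actual schema:</p> <code>
# Illustrative only: a toy "unification" step that fuses a raw flow
# record with BGP and GeoIP context. Field names are invented.

def enrich_flow(flow, bgp_table, geoip_db):
    """Return one enriched record from a raw flow plus context data."""
    enriched = dict(flow)  # start from the raw 5-tuple and counters
    enriched["dst_as_path"] = bgp_table.get(flow["dst_ip"], [])
    enriched["src_country"] = geoip_db.get(flow["src_ip"], "unknown")
    return enriched

# Toy lookup tables standing in for live BGP and GeoIP sources
bgp_table = {"203.0.113.9": [64496, 64511]}
geoip_db = {"198.51.100.7": "US"}

flow = {"src_ip": "198.51.100.7", "dst_ip": "203.0.113.9",
        "src_port": 52100, "dst_port": 443, "proto": 6,
        "bytes": 120000, "packets": 80}

print(enrich_flow(flow, bgp_table, geoip_db))
</code>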
<p>Kentik Detect doesn’t require a costume — though we’ve been known to give away some pretty <a href="https://kentik.com/tshirt">snazzy t-shirts</a> and socks — to transform you into a networking superhero. And your network stakeholders need not suspend disbelief to experience Kentik’s blockbuster story (spoiler alert: it has a happy ending). Learn more by digging into our <a href="https://www.kentik.com/kentik-detect/">product</a>, seeing what our <a href="https://www.kentik.com/customers/">customers</a> think, or reading a white paper on the <a href="https://info.kentik.com/rs/869-PAD-887/images/Kentik-KDE-whitepaper-Jul2015.pdf">Kentik Data Engine</a>. Better yet, dive right in by <a href="#demo_dialog">requesting a demo</a> or starting a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Announcing KDDI’s Adoption of Kentik]]><![CDATA[We’re very excited to share that KDDI, one of the world’s largest telecommunications companies, has selected the Kentik platform for network planning and operations, and to protect their key network assets and services. Read about the reasons that KDDI chose Kentik and why we’re so excited about this announcement.]]>https://www.kentik.com/blog/why-kddi-selected-kentik-for-network-planninghttps://www.kentik.com/blog/why-kddi-selected-kentik-for-network-planning<![CDATA[Avi Freedman]]>Wed, 19 Jul 2017 17:57:54 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/3Hi3yq5dHqqmg8aYW2WkWG/025d4b95c293845472b9d595a3918dce/kddi-japan-300x158.jpg" alt="kddi-japan-300x158.jpg" class="image right" style="max-width: 300px;" /> We’re incredibly excited today to <a href="https://www.kentik.com/news/top-japanese-telecom-operator-kddi-corporation-selects-kentik-for-network-traffic-intelligence/">announce</a> that KDDI, a leading global carrier and Fortune 500 company, has chosen to deploy the Kentik platform to drive the planning, operation, and protection of their network assets and services.</p> <p><strong>Why We’re Excited</strong> There are a few reasons why this announcement is so significant for Kentik:</p> <ol> <li>As a Fortune Global 500 company (technically in the top 300), KDDI is the largest organization that has publicly affirmed its use of Kentik.</li> <li>As a leading global carrier, KDDI showcases that Kentik can deliver the scale, performance, insight and value needed by the world’s largest service providers.</li> <li>Because KDDI is one of the largest companies in Asia, our direct relationship with them, together with our relationship with <a href="https://www.kentik.com/news/net-one-systems-and-kentik-partner-to-bring-network-traffic-intelligence-to-japan/">Net One Systems</a>, one of Japan’s largest systems integrators, marks our entry into the Japanese local market.</li> </ol> <p><strong>Why KDDI Selected Kentik</strong></p> <p>KDDI’s network is massive, with network capacity and dozens of data centers spread across 28 countries and 63 cities, tied together via a global backbone composed of a large and diverse set of subsea and terrestrial fiber cable links.  The overall network generates huge amounts of network traffic telemetry data, which is filled with valuable insights that were historically difficult to extract.
Kentik has turned that data into real-time network traffic intelligence, calling out key insights in the areas of performance, efficiency, security, and revenue optimization, with instant access to any level of detail about technical and business context.</p> <p>Better network traffic intelligence makes a material impact on the business, because with faster and more flexible insights, engineers can optimize the performance of service delivery while lowering costs and avoiding risks of outages. Here’s what Toru Maruta, general manager of KDDI’s IP Network Department, had to say:</p> <p>“At KDDI, we strive to be continuously innovative across our entire organization. However, for our network engineers, that ability was previously held back by the amount of time the team was spending on tracking down answers about our network. Given the sheer scale and complexity of our network and the amount of data it generates, network planning analyses often took days to compile. Using Kentik Detect, our team can now access a rich dataset that offers valuable insights about our network within seconds. Kentik gives us the answers we need to build a better network.”</p> <p><strong>Get Modern Network Traffic Intelligence</strong></p> <p>Service providers of all stripes can benefit from big data-powered network insights in similar ways to KDDI, in both planning and operational realms.  If you’d like to learn more, check out our <a href="https://www.kentik.com/kentik-detect/">products</a>, read our <a href="https://info.kentik.com/rs/869-PAD-887/images/Kentik-KDE-whitepaper-Jul2015.pdf">Kentik Data Engine</a> (KDE) white paper, and dig into why <a href="https://info.kentik.com/rs/869-PAD-887/images/why-nfv-mandates-big-data-management-approach.pdf">NFV needs advanced analytics</a>. Already know that you want to get a deeper look?  <a href="https://www.kentik.com/go/get-demo/">Request a demo</a> and we’ll walk you through how you can use network traffic intelligence to better operate, plan and innovate your network services.</p><![CDATA[News in Networking: Apple’s New Data Center, Intel’s New Chip, and a Survey on Network Challenges]]><![CDATA[Out with the old, in with the new, or so it seems this week. Apple has a new data center, its first in China. Intel announced a new line of microprocessors. Ericsson announced a new network services suite for IoT. Also this week, Kentik’s survey from Cisco Live reveals network challenges affecting digital transformation. More after the jump.]]>https://www.kentik.com/blog/news-in-networking-apples-new-data-center-intels-new-chip-and-a-survey-on-network-challengeshttps://www.kentik.com/blog/news-in-networking-apples-new-data-center-intels-new-chip-and-a-survey-on-network-challenges<![CDATA[Michelle Kincaid]]>Thu, 13 Jul 2017 17:09:40 GMT<h3 id="this-weeks-top-story-picks-from-the-kentik-team"><em>This week’s top story picks from the Kentik team.</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/3nr4EFGlxYwSYQmOWWWMoq/3b27deacedc85631c52bd2b265cb1328/News_tablet-396w.png" style="max-width: 396px;" class="image right">Out with the old, in with the new, or so it seems this week. Apple has a new data center, its first in China. Intel announced a new line of microprocessors. Cisco has a new acquisition. Ericsson has new network services for IoT.
Also this week, Kentik’s survey from Cisco Live reveals network challenges affecting digital transformation.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="http://www.reuters.com/article/us-china-apple-idUSKBN19X0D6"><strong>Apple Announces Its First China Data Center</strong></a> <strong>(Reuters)</strong> Apple is building its first data center in China to comply with cybersecurity regulations. “The addition of this data center will allow us to improve the speed and reliability of our products and services while also complying with newly passed regulations,” Apple told Reuters.</li> <li><a href="https://www.reuters.com/article/us-intel-datacenters-idUSKBN19W2WT"><strong>Intel Rolls Out New Chips in Battle for Data Center Business</strong></a> <strong>(Reuters)</strong> Intel announced a new line of microprocessors for data centers. Reuters reports this sets up “a battle with Advanced Micro Devices and others for the lucrative business of supplying the chips that power cloud computing,” designed for high-compute tech like AI and autonomous cars.</li> <li><a href="http://www.zdnet.com/article/ericsson-introduces-massive-iot-network-services-suite/"><strong>Ericsson Introduces ‘Massive IoT’ Network Services Suite</strong></a> <strong>(ZDNet)</strong> Ericsson launched a suite of network services geared towards IoT. According to ZDNet, the company is also “enabling VoLTE across Cat-M1 networks and introducing automated machine learning to its Network Operations Centers.”</li> <li><a href="https://techcrunch.com/2017/07/13/cisco-acquires-network-security-startup-observable-networks/"><strong>Cisco Acquires Network Security Startup Observable Networks</strong></a> <strong>(TechCrunch)</strong> Cisco announced it has acquired St. Louis-based Observable Networks. According to TechCrunch, the startup “provides real-time network behavior monitoring to help IT teams detect anomalies that might be related to security breaches, focusing particularly on cloud deployments.”</li> <li><a href="https://virtualizationreview.com/articles/2017/07/12/cloud-complications.aspx"><strong>Study: Cloud Complicates Networks More than SDN, NFV</strong></a> <strong>(The Virtualization Review)</strong> Kentik this week announced findings from a survey we conducted at Cisco Live. The Virtualization Review reported on our news, noting “Cloud adoption — not disruptive trends such as software-defined networking (SDN) and network functions virtualization (NFV) — is introducing the most complexity into enterprise networking.”</li> <li><a href="https://blogs.wsj.com/cio/2017/07/13/digital-business-software-drive-it-spending-growth/"><strong>IT Spending Growth Accelerates as IoT, Blockchain Create New Categories of Investment</strong></a> <strong>(Wall Street Journal)</strong> Global IT spending is up. It’s expected to reach $3.5 trillion this year, up 2.4% from 2016, according to WSJ’s article on a new Gartner report. The growth is “fueled in part by the rise of new cloud-enabled digital platforms such as IoT and smart machines.”</li> <li><a href="https://enterprisersproject.com/article/2017/7/devops-lessons-learned-advice-it-leaders"><strong>DevOps Lessons Learned: Advice for IT Leaders</strong></a> <strong>(The Enterprisers Project)</strong> The Enterprisers Project asked seven IT professionals to share their DevOps lessons learned and Kentik’s Jim Frey weighed in. According to Frey, DevOps engineers should not ignore the network.
“Recognize that you have to account for the network in your design, your planning and your testing,” he advised.</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Kentik Detect Alerting: Configuring Alert Policies]]><![CDATA[Operating a network means staying on top of constant changes in traffic patterns. With legacy network monitoring tools, you often can't see these changes as they happen. Instead you need a comprehensive visibility solution that includes real-time anomaly detection. Kentik Detect fits the bill with a policy-based alerting system that continuously evaluates incoming flow data. This post provides an overview of system features and configuration.]]>https://www.kentik.com/blog/kentik-detect-alerting-configuring-alert-policieshttps://www.kentik.com/blog/kentik-detect-alerting-configuring-alert-policies<![CDATA[Justin Ryburn]]>Tue, 11 Jul 2017 16:00:30 GMT<p><em><strong>Deep, Powerful Anomaly Detection to Protect Your Network</strong></em></p> <img src="//images.ctfassets.net/6yom6slo28h2/jxnxbyf39Y4CwEkwuUOMy/31ba23f7b53e788927181b5d23661f39/Detailed_alerts-501w.png" alt="Detailed_alerts-501w.png" class="image right" style="max-width: 300px;" /> <p>One of the most challenging aspects of operating a network is that traffic patterns are constantly changing. If you’re using legacy network monitoring tools, you often won’t realize that something has changed until a customer or user complains. To stay on top of these changes, you need a comprehensive network visibility solution that includes real-time anomaly detection. Kentik Detect fits the bill with a policy-based alerting system that lets you define multiple sets of conditions — from broad and simple to narrow and/or complex — that are used to continuously evaluate incoming flow data for matches that trigger notifications. The power of Kentik Detect’s alerting system — offering a <a href="https://www.kentik.com/resources/penteledata-keeps-ddos-at-bay-with-kentik-detect/">field-proven 30 percent improvement</a> in catching attacks — comes from the processing capabilities of Kentik’s scale-out big data backend combined with an exceptionally feature-rich user interface. In this blog post we’ll look at the basics of configuring this system to keep your team up to date on traffic that deviates from baselines or absolute norms. In future posts we’ll explore particular capabilities and use cases of the system, and how to harness its power for better network operations.</p> <h4 id="kentik-alert-library">Kentik Alert Library</h4> <p>The first thing to know is that you don’t have to master the configuration of alert policies from scratch. Kentik support is available to walk you through the configuration process, and we’ve also provided an extensive library of policy templates that cover common alerting scenarios and get you most of the way toward creating an alert policy that meets your specific needs. 
You’ll find these templates on the Kentik Alert Library tab of the Alerting section of the Kentik Detect portal (accessed via the Alerts menu in the navbar).</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/4HyFeRzvCgSK4G46sAycwA/ec6ed6ff4fae717284e37a47db18fbfc/Alert_library-808w.png" alt="Alert_library-808w.png" class="image right" style="max-width: 300px;" /> The library includes templates for alerting on <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">DDoS attacks</a>, changes in top talker IP addresses, changes in geography, interfaces approaching capacity, devices that stop sending flow, and much more. Kentik is continually adding to the library as we add new features and uncover new use cases. To configure a policy based on any of these templates, click the copy button at the right of the template’s row in the template list on the library tab. A copy of the template will be added to the list of policies on the Alert Policies tab, which is where you go to configure it for your system.</p> <h4 id="alert-policies">Alert Policies</h4> <p>The Alert Policies tab is the access point for all of the policies currently configured for your organization, and it’s also where you add new policies. To configure a template copied from the library, or to edit an existing policy, click the edit button at the right of that policy’s row in the alert policies list. To start a fresh policy, click the Create Alert Policy button at the top right of the tab. Either action will take you to an Alert Policy Settings page, which is where you’ll get down to the configuration process. The policy settings page is made up of three main sections: the title bar (with a few basic controls, e.g. save, cancel, etc.), the sidebar, and the main settings area, which is itself divided into general, historical, and threshold settings. With all of these sections, policy settings can seem pretty complicated at first, but by breaking it down we’ll be able to see the logic that gives the system its power and flexibility.</p> <img src="//images.ctfassets.net/6yom6slo28h2/bRAH0ML2OO6GUIcgKCIGY/838bddb1cabd4b0136db4247b3c9a18c/Alert_policy_page-820w.png" alt="Alert_policy_page-820w.png" class="image right" style="max-width: 300px;" /> <p>Let’s start with the sidebar, which is where you specify the metrics that a given alert will be looking at and focus the alert on a specific subset of total traffic. This gives you a lot of flexibility to define the scope of your policies so you can include or exclude only certain matches. The top pane in the sidebar is for setting the Devices whose traffic will be monitored by this policy. This pane is nearly identical to the Devices pane in Kentik Detect’s Data Explorer (see our Knowledge Base topic <a href="https://kb.kentik.com/KB_Articles/Db03.htm"><strong>Explorer Device Selector</strong></a>), but without the Router and nProbe buttons.</p> <h4 id="dimensions-metrics-and-filters">Dimensions, metrics, and filters</h4> <img src="//images.ctfassets.net/6yom6slo28h2/3mWpFCsbNeG82osiw4Ks8c/d03f783d6a67fa75485f15fec786f899/Alert_dimensions-394w.png" alt="Alert_dimensions-394w.png" class="image right" style="max-width: 300px;" /> <p>Next comes the Query pane, where you determine which aspects of ingested flow data will be evaluated for a match with the alert policy’s specified conditions. 
The first order of business is to define the dimensions (derived from the columns in Kentik Detect’s database) to query for data. You can choose up to 8 different dimensions simultaneously, including the IP 5-tuple, geography information, BGP, QOS, Site, and many other types of information. This lets you get very granular with the types of network traffic or conditions you want to trigger an alert on, something that’s not possible with the limited controls of legacy tools. <img src="//images.ctfassets.net/6yom6slo28h2/5HJJ4kg2sge2m846aOuyWC/14dd1a6c43d73f781cf31b01315197b3/Alert_metrics-159w.png" alt="Alert_metrics-159w.png" class="image right" style="max-width: 300px;" /> To complement that long list of dimensions you can also choose from a wide variety of metrics. While most legacy tools monitor only Bits/s and maybe Packets/s, Kentik Detect has the ability to monitor Flows/s, Retransmits/s, a count of source or destination IPs, ASNs, Countries, Regions (States), and Cities. With all those different metric options, you can build a really granular and powerful policy, such as monitoring for a change in the top 10 countries that are sending traffic to your network. Also noteworthy is the concept of primary and secondary metrics. The primary metric will be the main item that is evaluated. If the user is building a “Top N” type of policy, for example, then the primary metric would be the units (e.g. bits/s, packets/s, flows/s, etc.) by which ingested flow data will be evaluated to determine the ranking. The addition of one or more secondary metrics allows the user to set additional static thresholds that must be met in order for the policy to trigger. To expand on our “top countries” example above, we could add a secondary metric of Bits/s and Packets/s to make sure that the amount of traffic is above a set level before the alert is triggered. The result would be a policy that notifies us when there’s a change in the top countries, but only if traffic exceeds the specified secondary metric values. We can get even more specific about the traffic we want to track by defining one or more filters in the Filters pane. Either saved filters or locally defined filters can be applied, and you can filter for any of the dozens of dimensions that we saw earlier in the Query pane. This lets you get really granular in what you are tracking. For example, you might use filters to build an alert for private IP (RFC1918 allocated space) traffic coming into your network but filtering out internal interfaces that are allowed to have private IP traffic on them.</p> <h4 id="general-policy-settings">General Policy Settings</h4> <p>Once the sidebar panes are defined, the action moves to the panes of the main settings area. The General Settings pane defines the overall settings for a policy, including the name, description, how many items to track (e.g. 10 for top-10 policies), and the ability to enable or disable the alert. One noteworthy aspect of the track items setting is that you don’t have to specify which items you want to track; instead the system can automatically learn what the top-N items are and then notify you of changes to that list. For example, maybe you want to track the top 25 interfaces in your network receiving web (80/TCP) traffic and be notified if that list changes for any reason. With Kentik Detect, you can create a policy for that without having to know in advance which interfaces are your top 25.</p>
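<p>Conceptually, everything configured so far (dimensions, a primary metric, secondary metrics, filters, and a track count) boils down to a small declarative spec. Here’s a hypothetical sketch of the “top countries” policy discussed above, expressed as plain Python data; the structure and key names are invented for illustration and are not Kentik’s actual API:</p> <code>
# Hypothetical policy spec; key names are invented for illustration.
# Real policies are configured on the Alert Policy Settings page.
top_countries_policy = {
    "name": "Change in top 10 source countries",
    "dimensions": ["src_country"],        # what the policy groups by
    "primary_metric": "flows_per_sec",    # ranks the learned top-N list
    "secondary_metrics": {                # static floors that must also
        "bits_per_sec": 50_000_000,       # be exceeded before triggering
        "packets_per_sec": 10_000,
    },
    "track_items": 10,                    # size of the learned top-N list
    "filters": [                          # narrow the traffic in scope
        {"dimension": "src_ip", "operator": "not_in_cidr",
         "values": ["10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16"]},
    ],
}
</code>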
<img src="//images.ctfassets.net/6yom6slo28h2/5PJkwwUsoMyK4OA0AmScKI/dc3da6f0d0a78478676fa54ab23643e9/Policy_general-800w.png" alt="Policy_general-800w.png" class="image right" style="max-width: 300px;" /> Another interesting feature is Learning Mode, which allows you to set a period for which the policy’s traffic will be tracked, with notifications muted, in order to establish a baseline. You can also associate a dashboard with a policy so that when you click through to that dashboard from an alarm generated by the policy the panels of the dashboard show the traffic subset that is defined by the policy.</p> <h4 id="historical-baseline-settings">Historical Baseline Settings</h4> <p>The Historical Baseline settings allow you to configure what the current traffic patterns are compared against to see if there is a change (these settings are ignored for Static mode; see policy threshold settings below). The Look Back settings set the depth and granularity of the historical data that will be used for the baseline. The aggregation settings define how the historical data points will be calculated for comparison with current traffic. The Look Around setting is useful for traffic patterns that are hard to capture in a single 1-hour window. <img src="//images.ctfassets.net/6yom6slo28h2/63lSLmGe1GIaQAiCQ0qCyU/1a2bb77a8db082fe4daca5327a11fdde/Policy_historical-800w.png" alt="Policy_historical-800w.png" class="image right" style="max-width: 300px;" /> One of the coolest features of the historical baseline is the ability to make it Weekend vs. Weekday Aware. If your network has different traffic patterns on a weekend than it does on a weekend, this will help make the historical baseline more accurate.</p> <h4 id="policy-threshold-settings">Policy Threshold Settings</h4> <p>The policy threshold settings define the conditions that must be met in order to trigger an alarm (which results in notifications and, optionally, mitigation). Multiple thresholds can be configured per alert policy. Each threshold is assigned a criticality (critical, major2, major, minor2, or minor) to keep them distinct based on their importance to the user. <img src="//images.ctfassets.net/6yom6slo28h2/56p9KkANZucUUayI0gcgCI/0c86152afeddf3f9a7af9ed411413ab3/Policy_threshold-800w.png" alt="Policy_threshold-800w.png" class="image right" style="max-width: 300px;" /> A key setting for each threshold is the Comparison Mode, which determines what the primary traffic is compared to when it’s evaluated to determine whether there is a match with the conditions defined in the threshold. The options are Static, Baseline, or If-Exists. Unless the Comparison Mode is Static you’ll see some additional Comparison Option settings. One important comparison option is Direction, which determines which key (current or historical) is the primary and which key will be the comparison. In other words:</p> <ul> <li>If the direction is Current to History (default), the current dataset is primary and the comparison dataset is historical data which is derived based on the alert’s Historical Baseline settings.</li> <li>If the direction is History to Current, the historical dataset is primary and the comparison dataset is the current.</li> </ul> <p>In most use cases, the default direction is the one to use. However, the History to Current setting is useful if for Top-N type policies where you want to be notified if something drops out of the top-N group. For example, we might want to know if any of the countries that are normally in our Top 10 no longer are. 
<p>Meanwhile, the Activate If and Activate When settings allow the user to configure the conditions that must be met, and how long they must be present, in order for the policy to trigger an alarm.</p> <h4 id="notification-channels">Notification Channels</h4> <p><img src="//images.ctfassets.net/6yom6slo28h2/6EJhzdQIrCuIieAeYQEqok/955ed10d9addaf5db348557cb47a400e/Create_notification-394w.png" alt="Create_notification-394w.png" class="image right" style="max-width: 300px;" /> Kentik Detect uses notification channels to determine how users are notified when an alarm is triggered. There are currently five types of channels (email, JSON POST Webhook, Syslog, Slack, and PagerDuty), but we can easily and quickly add new notifications and integrations as customer demand warrants. This is another aspect that makes the SaaS approach of Kentik Detect so powerful; as our users know, our features are continually expanding over time. Multiple notification channels can be defined and attached to one or more thresholds as needed. This flexibility allows different notifications to be sent depending on the threshold that is triggered. For example, you may want an email sent to the team for minor issues but a PagerDuty page sent for a critical alert. In addition, you can modify the “when” in the threshold to send different notifications at different times. For example, you can send an email immediately but only send a PagerDuty notification if the alert is not acknowledged within an hour.</p> <h4 id="mitigations">Mitigations</h4> <p><img src="//images.ctfassets.net/6yom6slo28h2/2jniCtAuXWmcuQOAwqWY6U/79a31ee76c9c5ba3ab665bc305a621f6/Apply_mitigation-406w.png" alt="Apply_mitigation-406w.png" class="image right" style="max-width: 300px;" /> Kentik Detect offers both built-in and third-party mitigation options that are configured on the Alerting system’s Mitigation tab and then attached to a policy in the Alert Policy Settings page as shown below. Organizations whose Kentik Detect plan includes BGP have an RTBH mitigation option, in which the system originates a Remote Trigger BlackHole route into the user’s network to mitigate an attack. Kentik Detect also includes APIs that support third-party orchestration servers like Radware DefenseFlow as well as hardware scrubbing appliances like Radware DefensePro or A10 Thunder TPS. Mitigations can be triggered automatically, activated manually in response to an alarm, or activated if there’s been no user acknowledgement within a certain time.</p> <h4 id="summary">Summary</h4> <p>With a system that’s as deep and powerful as Kentik Detect, it’s difficult for an overview like this to do more than skim the surface, so we’ve covered just a few of the many capabilities available with our alerting system. If you’re already a Kentik Detect user, head on over to our <a href="https://kb.kentik.com/v0/Ab10.htm">Knowledge Base</a> for more detailed information on how to configure policy-based alerts. Or contact <a href="mailto:[email protected]">Kentik support</a>; we’re here to help you get up to speed as painlessly as possible. If you’re not yet a user and would like to experience firsthand what Kentik Detect has to offer, <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Security in an SD-WAN World]]><![CDATA[As one of 2017’s hottest networking technologies, SD-WAN is generating a lot of buzz, including at last week’s Cisco Live.
But as enterprises rely on SD-WAN to enable Internet-connected services — thereby bypassing carrier MPLS charges — they face unfamiliar challenges related to the security and availability of remote sites. In this post we take a look at these new threats and how Kentik Detect helps protect against and respond to attacks.]]>https://www.kentik.com/blog/security-in-an-sd-wan-worldhttps://www.kentik.com/blog/security-in-an-sd-wan-world<![CDATA[Justin Ryburn]]>Thu, 06 Jul 2017 18:30:39 GMT<h3 id="protecting-availability-for-internet-based-wan-sites"><em>Protecting Availability for Internet-based WAN Sites</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/1wwpCiOZeEwIE2Q8AAWWea/c5a954bd72dd2322a8ca052e095bcb3e/SDWAN_500w.jpg" alt="SDWAN_500w.jpg" class="image right" style="max-width: 500px;" /> SD-WAN is one of 2017’s hottest networking technologies. You can’t read tech news or listen to a tech podcast these days without hearing about it. Last week at Cisco Live was no exception; Cisco’s $610M acquisition of Viptela in May made for a lot of SD-WAN buzz.</p> <p>For enterprises, SD-WAN makes a lot of sense. Carrier MPLS services are really expensive bandwidth. Enterprises have historically paid those high prices to get SLA guarantees. But the reality is that within most providers both MPLS VPN services and public Internet services ride on the same shared network infrastructure. Enterprises are starting to realize this and are looking to save money by switching to public Internet circuits as a transport mechanism.</p> <div class="pullquote left">Like most IT journeys, the road to SD-WAN has its challenges.</div> <p>SD-WAN products enable this shift by using overlay tunnels on top of underlying Internet links. According to Gartner analyst Andrew Lerner in his June 2017 blog post “<a href="https://blogs.gartner.com/andrew-lerner/2017/06/03/sd-wan-is-going-mainstream/">SD-WAN is going Mainstream</a>,” there are an estimated 6000+ paying SD-WAN customers and 4000+ production deployments. Unfortunately, like most things in IT, this journey is not without challenges.</p> <p>One thing that needs careful consideration is the set of security ramifications that comes with moving from a private MPLS circuit to a public Internet circuit. The first aspect that comes to mind is privacy of the data. Now that the traffic is routed over the Internet, encryption becomes a must-have. Most organizations have some sort of data that has to be carefully guarded, whether under HIPAA, PCI, or other compliance rules. This can be fairly easily solved with a firewall or SD-WAN appliance that creates IPSec tunnels between the sites.</p> <p>Another important consideration is the possibility of DDoS attacks, which become a real concern once your office locations are on public IP addresses. E-commerce websites have been dealing with this for years, but most enterprise WAN sites were spared due to the private IP routing they were using on MPLS VPN services.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1kywDkqCTCYAOAEwEaYu2c/db050f1b6e9dcb1ca536ac7102f64ce9/ddos_attack-819w.png" alt="ddos_attack-819w.png" class="image center" style="max-width: 800px;" /> <p>For a lot of enterprises, the risk of having a remote site taken off the network by a DDoS attack is simply unacceptable. That’s definitely true if it’s a customer service office, and becomes critically so when you’re talking about retail locations that generate revenue.
With voice and data services combined to ride over the same circuit, that type of attack would mean the employees in that remote office cannot get their work done or the cash registers stop raking in the dough.</p> <div class="pullquote right">Kentik’s integrations with A10 and Radware enable multiple mitigation options.</div> <p>Thankfully, Kentik Detect can help here. Our flexible and powerful alerting policies give network managers very granular visibility into attacks when they take place, and our anomaly detection system can send users proactive alert notifications whenever current traffic deviates noticeably from the historical pattern. We also offer a number of automated mitigation options. One option is to use the power of BGP to announce a Remote Trigger BlackHole (RTBH) route from our portal to the routers in your network. Blackholing the IP address under attack, however, may effectively complete the attack by taking the site offline. So we’ve also partnered with A10 and Radware to create bi-directional API integrations with their orchestration and scrubbing solutions to mitigate DDoS attacks. This allows the attack traffic to be scrubbed out while clean traffic continues to flow to the remote office.</p> <p>If you’re rolling out SD-WAN and haven’t thought about a comprehensive way to understand end-to-end traffic, as well as to detect and stop DDoS attacks, take a look at how Kentik Detect can help. Learn more about our big data-powered <a href="https://www.kentik.com/kentik-detect/">network traffic intelligence</a>, dig into our industry-leading <a href="https://www.kentik.com/ddos-detection/">DDoS detection</a>, take a look at our integrations with <a href="https://info.kentik.com/rs/869-PAD-887/images/Radware_Kentik_Solution_Brief.pdf">Radware</a> and <a href="https://info.kentik.com/rs/869-PAD-887/images/A10-SB-19171-EN-01.pdf">A10</a> and explore how we can help monitor <a href="https://www.kentik.com/network-performance-monitoring/">network performance</a>. Or, if you’re ready to dive right in, you can <a href="#demo_dialog">request a demo</a> or start a free, fully functional <a href="#signup_dialog">30 day trial</a>.</p><![CDATA[News in Networking: Verizon-Disney Rumors, ‘Most Promising’ 5G Operators, and UN List of Security Gaps]]><![CDATA[Verizon’s deal with Yahoo may still be on the minds of many, but Disney may be Verizon's new big brand-of-interest. Also on acquisitions, Forrester’s CEO says Apple should buy IBM. Juniper Research released a list of the most promising 5G operators in Asia. And the UN released a list of countries with cybersecurity gaps. More after the jump...]]>https://www.kentik.com/blog/news-in-networking-verizon-disney-rumors-most-promising-5g-operators-and-un-list-of-security-gapshttps://www.kentik.com/blog/news-in-networking-verizon-disney-rumors-most-promising-5g-operators-and-un-list-of-security-gaps<![CDATA[Michelle Kincaid]]>Thu, 06 Jul 2017 15:30:52 GMT<p>This week’s top story picks from the Kentik team.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4RxB2Qzvg4gieCAAACA8mu/a51f7109a38a5e4dbcb2bbcdbc9f5cc9/News_tablet-396w.png" alt="News_tablet-396w.png " class="image right" style="max-width: 300px;" /> <p>Verizon’s deal with Yahoo may still be on the minds of many, but Disney is rumored to be Verizon’s new big brand-of-interest. Also on the acquisitions front, Forrester’s CEO says Apple should buy IBM.
Meanwhile, Juniper Research released a list of the most promising 5G operators in Asia, and the UN released a list ranking countries based on cybersecurity hygiene.</p> <p><strong>Here are those headlines and more:</strong></p> <ul> <li><a href="http://www.lightreading.com/video/mobile-video/verizon-disney-rumor-dismissed-by-analysts/d/d-id/734250?piddl_msgid=254604"><strong>Verizon-Disney Rumor Dismissed by Analysts</strong></a> <strong>(Light Reading)</strong> It was only a few weeks ago when Verizon closed its acquisition of Yahoo. However, rumors are swirling that it could be eyeing another big deal. Although analysts say it won’t happen, a new report claims Verizon has Disney on its M&#x26;A shortlist. Why? Light Reading reports that the brand could help with targeting millennials.</li> <li><a href="http://blogs.forrester.com/george_colony/17-06-29-apple_should_buy_ibm"><strong>Apple Should Buy IBM</strong></a> <strong>(Forrester)</strong> Forrester CEO George Colony is making the case for Apple to buy IBM. Why? According to Colony, “By buying IBM, Apple would get the world-class machine learning platform.” He adds, “Apple could write the check tomorrow — it has $268 billion of cash reserves, and IBM’s purchase price would be in the $200-$250 billion range.”</li> <li><a href="http://www.networkworld.com/article/3204594/mobile-wireless/http-and-dns-in-a-5g-world.html"><strong>HTTP and DNS in a 5G World</strong></a> <strong>(Network World)</strong> HTTP and DNS may be “household name protocols,” according to Network World contributor and wireless wiz Alan Carlton, but NFV and MEC could change that. Carlton notes, “HTTP will need to become more streamlined and lightweight to meet the high throughput and strict delay requirements of 5G.” As for DNS, he advises that IoT devices with 5G connectivity could create “whole new requirements for discovery and addressing of these devices.”</li> <li><a href="https://www.sdxcentral.com/articles/news/sk-telecom-ntt-docomo-beat-att-promising-5g-operator-list/2017/07/"><strong>SK Telecom, NTT DoCoMo Beat AT&#x26;T in ‘Most Promising’ 5G Operator List</strong></a> <strong>(SDxCentral)</strong> Juniper Research just released a list ranking the “most promising” Asian operators in the 5G space. Interestingly, it put SK Telecom, NTT DoCoMo, KT, and China Mobile higher on the list than AT&#x26;T. According to SDxCentral, Juniper said it based its rankings “on the operators’ time in development; the breadth and value of the operators’ 5G partnerships; and their progress in 5G network testing.”</li> <li><a href="http://www.reuters.com/article/us-cyber-un-idUSKBN19Q19L"><strong>UN Survey Finds Security Gaps Everywhere Except Singapore</strong></a> <strong>(Reuters)</strong> If you’re looking for a country with good cybersecurity hygiene, head to Singapore. According to a new survey from the United Nation’s International Telecommunication Union, Singapore tops the list based on “countries’ legal, technical, and organizational institutions, their educational and research capabilities, and their cooperation in information-sharing networks.” It’s followed by the U.S. and Malaysia.</li> <li><a href="https://arstechnica.com/information-technology/2017/06/50-million-us-homes-have-only-one-25mbps-internet-provider-or-none-at-all/"><strong>50M US Homes Have Only One 25Mbps Internet provider or None at All</strong></a> <strong>(Ars Technica)</strong> A group of researchers just released findings on home internet access via data they pulled from the FCC. 
According to Ars Technica, the research shows, “More than 10.6 million U.S. households have no access to wired Internet service with download speeds of at least 25Mbps, and an additional 46.1 million households live in areas with just one provider offering those speeds. That means more than 56 million homes in the U.S. do not have access to high-speed broadband connections.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Three Little NetFlow Databases in a Big Bad World]]><![CDATA[Obsolete architectures for NetFlow analytics may seem merely quaint and old-fashioned, but the harm they can do to your network is no fairy tale. Without real-time, at-scale access to unsummarized traffic data, you can't fully protect your network from hazards like attacks, performance issues, and excess transit costs. In this post we compare three database approaches to assess the impact of system architecture on network visibility.]]>https://www.kentik.com/blog/three-little-netflow-databases-in-a-big-bad-worldhttps://www.kentik.com/blog/three-little-netflow-databases-in-a-big-bad-world<![CDATA[Alex Henthorn-Iwane]]>Mon, 26 Jun 2017 13:00:45 GMT<h3 id="storybook-endings-require-architecture-that-cant-be-blown-down"><em>Storybook Endings Require Architecture That Can’t Be Blown Down</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/26JaTD2JUgAQews80cUcMa/f370036734d0b2c1a0c69b28b8068bfe/Three_little_pigs-500w.png" alt="Three_little_pigs-500w.png" class="image right" style="max-width: 500px; margin-top: 15px; margin-bottom: 20px;" /> <p><em>I hope I’m not going out on a limb when I say that most of us have heard of the tale of the three little pigs and the big bad wolf. Today we’ll use that classic fable to talk about three ways that folks have tried to collect and analyze flow data — with varying degrees of success…</em></p> <p>Once upon a time there were three organizations that wanted to keep themselves safe from network performance problems, congestion, weird traffic shifts, sub-optimal planning, costly transit, and many other issues. They decided that what they needed was to be able to collect, store, and analyze network flow data like NetFlow, sFlow, and IPFIX. So they all set out to build their own flow analytics system, each one based on a different database.</p> <p><strong>Single Server Straw House</strong></p> <p>The first organization decided to build with straw… that is, with a single-server software architecture using a relational database like MySQL to contain the data. But the single-server architecture was weak. Its walls were made of thin stalks of memory, CPU, and storage. And the relational database was too slow and cumbersome to be of much help. So most of the data had to be rolled up into summaries, and because the straw hut was so tiny the vast majority of details had to be thrown away. Sad! What little data was left over was packed into inflexible tables with maybe one or two drill-down tables, each with only a single indexed column.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1vfSO25U76sGW0muUoc0uC/dedc6bb7408cd8673c287132827bbf46/Trad_appliance_architecture-822w.png" alt="Trad_appliance_architecture-822w.png" class="image center" style="max-width: 822px;" /> <p>Needless to say, the little straw hut of sparse, summarized data was no match for the huffing and puffing of real-world use cases.
When the big bad wolf came to the door, the system collapsed.</p> <p><strong>Traditional Big Data Wood House</strong></p> <p>The second organization chose to build using a traditional, Hadoop-style big data system. While this allowed much more space for data, it was still rather fragile. Like most such systems, it had to discard data by subsampling the flow records it received. What was left was MapReduced into data cubes — basically complex multidimensional counters — which reduced cardinality and created duplicate entries that had to be resolved because the system could no longer tell them apart. The cubes were essentially pre-digested data, which made it easier for a single-server system to return quick responses to queries. The problem was that they lost fidelity, and the data cube for a given query had to be pre-built, a process that took from minutes to days.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Ep8tWfeo4EeKSMsk6wAS0/c8d5fbab953ec7e508af3e484735c02d/Trad_big_data_architecture-824w.png" alt="Trad_big_data_architecture-824w.png" class="image center" style="max-width: 824px;" /> <p>This inflexibility meant that their structure couldn’t bend to new pressures and requirements, like new use cases that required a different set of queries than the ones their cubes had been built for. So instead of a strong, unified data structure they ended up with many different and fragile fragments of data pre-compiled as separate tools for different use cases. As those fragmented tools multiplied, their teams lost efficiencies. So while the “wooden house” of traditional big data was bigger and better than a tiny straw hut, it still couldn’t stand up to the big bad wolf of pressure from the requirements of digital transformation. Despite the best efforts of the organization’s network operations, engineering, and planning teams, the wooden house ultimately collapsed.</p> <p><strong>Steel Reinforced Kentik House</strong></p> <p>The third organization chose to build its network protections on the solid foundation of Kentik Detect. Kentik doesn’t roll up, summarize, reduce, chop, boil, or fricassee network flow data. All of the data in the fields of the received flow records is not only retained but also enriched with added information from BGP, GeoIP, and custom data sources. This enhanced data is stored in a massive, scale-out infrastructure with petabytes of capacity. When a user submits a query, the Kentik Data Engine breaks it into thousands of subqueries that are sharded across the entire storage layer using a fairness queueing methodology (see our post on <a href="https://www.kentik.com/designing-for-database-fairness/">Designing for Database Fairness</a>).</p> <img src="//images.ctfassets.net/6yom6slo28h2/2zrnO51r0EQMSge40YuuAk/7ee3f71738fd39f91e632de56e1e3e55/Kentik_architecture-824w.png" alt="Kentik_architecture-824w.png" class="image center" style="max-width: 824px;" /> <p>Compared to the straw and wooden houses of the first two organizations, Kentik’s data analytics architecture was like building with steel-reinforced concrete. No more siloed, weak strands or fragile data cubes. Answers came back in a few seconds, even on large datasets. And the sheer power of the Kentik platform meant that it could support a ton of different use cases, automation via native APIs, and webhooks.</p>
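<p>To picture the scatter/gather pattern behind those fast answers, here’s a toy sketch of a query fanned out to shards and merged. It is a drastic simplification of what KDE actually does (no fairness queueing, no real storage engine), with invented class and method names:</p> <code>
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

class Shard:
    """Toy stand-in for one slice of the storage layer."""
    def __init__(self, rows):
        self.rows = rows  # rows: list of (key, byte_count) tuples

    def scan(self):
        """Run one subquery: aggregate bytes per key on this shard."""
        partial = Counter()
        for key, nbytes in self.rows:
            partial[key] += nbytes
        return partial

def top_n(shards, n=10):
    """Fan the query out to every shard in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: shard.scan(), shards)
    merged = Counter()
    for partial in partials:
        merged.update(partial)
    return merged.most_common(n)

# Example: two shards each holding part of the flow data
shards = [Shard([("10.0.0.1", 500), ("10.0.0.2", 300)]),
          Shard([("10.0.0.1", 200), ("10.0.0.3", 100)])]
print(top_n(shards, n=2))  # [('10.0.0.1', 700), ('10.0.0.2', 300)]
</code> <p>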
The biggest, baddest wolves huffed and puffed around the clock, but the Kentik house kept the third organization safe and sound.</p> <p><strong>Enter the House of Kentik</strong></p> <p>So maybe this analogy is a bit precious, but the truth remains that Kentik’s architecture provides unsurpassed flow data analytics and stands up to the toughest use cases. And you needn’t take just our word for it. Check out what Pandora says about their experience with Kentik in this <a href="https://youtu.be/ktXv1sKHzfU">cool video</a>. Visit our <a href="https://www.kentik.com/customers/">Customers page</a> to see who else is using Kentik. Or step inside today for your own look around: <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a>.</p><![CDATA[News in Networking: Intent-based Networking, a Huge S3 Data Leak, and BGP Route Leaks]]><![CDATA[This week Cisco made a series of product announcements, including intent-based networking for automation. NTT launched an SD-WAN. An open Amazon S3 server exposed millions of voters’ PII. A group proposed a way to reduce BGP route leaks. More after the jump...]]>https://www.kentik.com/blog/news-in-networking-intent-based-networking-a-huge-s3-data-leak-and-bgp-route-leakshttps://www.kentik.com/blog/news-in-networking-intent-based-networking-a-huge-s3-data-leak-and-bgp-route-leaks<![CDATA[Michelle Kincaid]]>Thu, 22 Jun 2017 14:42:22 GMT<p>This week’s top story picks from the Kentik team.</p> <p>This week Cisco made a series of product announcements, including intent-based networking aimed at increasing automation in IT. NTT launched a Virtela-based SD-WAN platform. An open Amazon S3 server exposed the PII of millions of U.S. voters. And a network working group announced a draft of a proposal that aims to reduce BGP route leaks.</p> <p>Here are those stories and more:</p> <ul> <li><a href="http://www.networkworld.com/article/3202105/software-defined-networking/cisco-brings-intent-based-networking-to-the-end-to-end-network.html"><strong>Cisco Unveils Intent-based Networking</strong></a> <strong>(Network World)</strong> Ahead of Cisco Live U.S. next week, Cisco made a series of product announcements, including intent-based networking aimed at more IT automation. According to Network World, “Intent-based systems operate in a manner where the administrators tell the network what it wants done and the how is determined by the network and then the specific tasks are automated to make this happen.”</li> <li><a href="https://www.sdxcentral.com/articles/news/ntt-launches-sd-wan-platform-virtelas-sdn-expertise/2017/06/"><strong>NTT Launches SD-WAN Platform with Virtela’s SDN Expertise</strong></a> <strong>(SDxCentral)</strong> NTT Communications this week launched an SD-WAN platform based on capabilities from Virtela, the SDN company it acquired for $525 million back in 2014.</li> <li><a href="http://www.zdnet.com/article/security-lapse-exposes-198-million-united-states-voter-records/"><strong>198 Million Americans Hit by “Largest Ever” Voter Records Leak</strong></a> <strong>(ZDNet)</strong> A huge database of voter information, owned by Republican data analytics firm Deep Root Analytics, was exposed this week. A security firm found the database was being kept in an open Amazon S3 storage server. 
According to ZDNet, “It’s believed to be the largest ever known exposure of voter information to date.”</li> <li><a href="https://www.wsj.com/articles/oracles-cloud-business-lifts-profit-1498077696"><strong>Oracle Claims Best Quarter Ever as Cloud Takes Off</strong></a> <strong>(Wall Street Journal)</strong> Oracle is trying to gain ground on Amazon, Google and Microsoft in the cloud wars. With its earnings report out this week, we learned it had a 15 percent jump in its fiscal Q4. “It’s the best quarter we have ever had,” Oracle co-Chief Executive Mark Hurd said, reported WSJ. “We had a goal of $2 billion in ARR; we finished with nearly $2.1 billion. Next year, we will sell more.”</li> <li><a href="https://www.theregister.co.uk/2017/06/19/bgp_route_leak_prevention_draft_policy/"><strong>Internet Boffins Take Aim at BGP Route Leaks</strong></a> <strong>(The Register)</strong> A network working group published an “Internet Draft” this week, aiming to reduce BGP route leaks. The Register notes the draft proposes that “neighboring BGP speakers announce not only their routes, but also their roles, so systems receiving those route announcements better understand the scope of the advertisement.”</li> <li><a href="http://blogs.enterprisemanagement.com/shamusmcgillicuddy/2017/06/07/optimizing-business-network-analytics/"><strong>Optimizing the Business With Network Analytics</strong></a> <strong>(EMA Blog)</strong> In this blog post, EMA analyst Shamus McGillicuddy notes that network managers “need tools that help them present that insight to the business. Reports and dashboards that are designed for consumption by non-technical personnel will be critical for impactful network analytics. Some vendors might produce canned versions of these reports and dashboards, but customization will be essential, too. Furthermore, vendors may want to consider developing professional services organizations that help close the gap between network operations and the business.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Accuracy in Low-Volume NetFlow Sampling]]><![CDATA[NetFlow data has a lot to tell you about the traffic across your network, but it may require significant resources to collect. That's why many network managers choose to collect flow data on a sampled subset of total traffic. In this post we look at some testing we did here at Kentik to determine if sampling prevents us from seeing low-volume traffic flows in the context of high overall traffic volume.]]>https://www.kentik.com/blog/accuracy-in-low-volume-netflow-samplinghttps://www.kentik.com/blog/accuracy-in-low-volume-netflow-sampling<![CDATA[Jim Meehan]]>Mon, 19 Jun 2017 16:39:19 GMT<h2 id="seeing-a-needle-in-the-traffic-haystack">Seeing a Needle in the Traffic Haystack</h2> <p>Network flow data (e.g., NetFlow, sFlow, IPFIX) is often used to gain visibility into the characteristics of IP traffic flowing through network elements. This data — flow records or datagrams, depending on the protocol — is generated by network devices such as routers and switches. Equivalent to PBX “call detail records” from the world of telephony, flow data is metadata that summarizes the IP packet flows that make up network traffic.</p>
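<p>For intuition, a flow record is essentially a small summary structure keyed by the IP 5-tuple, with counters attached. Here’s a simplified sketch; the fields are illustrative, and real NetFlow/IPFIX records carry many more:</p> <code>
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """Simplified flow record: the IP 5-tuple plus counters."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int     # 6 = TCP, 17 = UDP
    packets: int      # packets observed for this flow
    bytes: int        # bytes observed for this flow

rec = FlowRecord("198.51.100.7", "203.0.113.9", 52100, 443, 6,
                 80, 120_000)
</code> <p>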
In this post we’ll look at why flows are sampled and explore how visibility is impacted by the rate at which total flows are sampled to collect flow records for monitoring and analysis.</p> <h2 id="sampling-what-and-why">Sampling: What and Why</h2> <p>When a network device monitors flows and produces flow records, it’s consuming resources such as CPU and memory. Those resource requirements can be significant for devices that are deployed on networks with high volumes of traffic. A high-volume stream of flow records can also be challenging for monitoring/management systems to ingest on the receiving end. Consequently, network devices typically aren’t configured to capture every flow. Instead, traffic is sampled to make the load at both ends more manageable. With sampling enabled, network devices generate flow records from a random 1-in-N subset of the IP traffic packets. The value of N is chosen based on the overall volume of traffic passing through the device. It can range from as low as 100 for devices passing around 1 Gbps of traffic to as high as 50,000 for devices that pass about 1 Tbps of traffic. As N increases, the flow records derived from the samples may become less representative of the actual traffic, especially of low-volume flows over short time-windows. In the real world, how high can N be while still enabling us to see a given traffic subset that is a relatively small part of the overall volume? To get a sense of the limits, we decided to set up a test.</p> <h2 id="a-sampling-test-environment">A Sampling Test Environment</h2> <p>Our test involves sending low-volume test traffic across a device with relatively high overall traffic volume to see if we can pick out the test traffic against the backdrop of the existing traffic. The test is conducted using the setup shown in the following image. As traffic passes through the router on its way from the generator to the target, the router collects flow records and sends them to Kentik, our big data network visibility solution. 
The unsummarized flow data is ingested in real time into the Kentik Data Engine (our scale-out backend) where it’s ready for ad hoc analytics and visualization.</p> <img src="https://images.ctfassets.net/6yom6slo28h2/3upyTWgdmU0GqAgu6ec2Uy/5d54a0a2eb5b12696c1336cf12f4cea1/Test_traffic_diagram-622w.png" alt="Test_traffic_diagram-622w.png " class="image center no-shadow" style="max-width: 500px;" /> <p>In the following image from the Kentik portal, the sampling column (at right) of the Device List shows that the device, named “test_router,” is configured to sample 1 in 10000 packets.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3ezo6f5bqgwqQKoKYcOI4E/e62ea4a82736f9dd68a3ac57f26da732/Device_list-800w.png" alt="Device_list-800w.png" class="image center" style="max-width: 700px;" /> <p>As shown in the following graphs from the Kentik portal, the baseline traffic volume forwarded by test_router is 20 Mpps (first image) or 150 Gbps (second image).</p> <img src="//images.ctfassets.net/6yom6slo28h2/4sOyLYW5WwsCiiyEEcOEK2/acc887c1bb1c83e525204a58d6d26025/Total_by_packets-815w.png" alt="Total_by_packets-815w.png" class="image center" withFrame style="max-width: 700px;" /> <img src="//images.ctfassets.net/6yom6slo28h2/15c4zk7qq62WkW00wcGqey/c73d683995040d7c924cd603d1c05de5/Total_by_bits-823w.png" alt="Total_by_bits " class="image center" withFrame style="max-width: 700px;" /> <h2 id="generating-test-traffic">Generating Test Traffic</h2> <p>We generate test traffic with a simple Python script using <a href="https://github.com/secdev/scapy/">Scapy</a>, a packet crafting library. The script generates randomized TCP packets toward our test target, and also prints the pps rate every second.</p> <code>
from scapy.all import *
import socket
import time
from timeit import default_timer as timer

s = conf.L2socket(iface="eth0")  # raw layer-2 socket on eth0

count = 0
oldtime = timer()
while True:
    count += 1
    newtime = timer()
    if (newtime - oldtime) >= 1:
        # once per second, print the achieved packets-per-second rate
        print time.ctime() + ': ' + str(round(count / (newtime - oldtime)))
        oldtime = newtime
        count = 0
    # craft a TCP packet with randomized (fuzzed) fields and send it
    pe = Ether()/IP(dst="10.10.10.1")/fuzz(TCP())
    s.send(pe)
</code> <p style="margin-bottom: 0;">&nbsp;</p> <p>Running the script, we can see that it is generating roughly 560 packets/second, which represents a very low volume of traffic compared to the total being forwarded by this router.</p> <code>
Thu May 4 23:59:50 2017: 546.0
Thu May 4 23:59:51 2017: 561.0
Thu May 4 23:59:52 2017: 568.0
Thu May 4 23:59:53 2017: 577.0
Thu May 4 23:59:54 2017: 575.0
Thu May 4 23:59:55 2017: 568.0
Thu May 4 23:59:56 2017: 565.0
Thu May 4 23:59:57 2017: 568.0
Thu May 4 23:59:58 2017: 572.0
Thu May 4 23:59:59 2017: 573.0
</code> <h2 id="test-traffic-observation">Test Traffic Observation</h2> <img src="//images.ctfassets.net/6yom6slo28h2/vY0lgSpwmkay2c0ScqU2m/6b4e3855b0c847ef1c4ae163f427a251/Test_traffic_filter-400w.png" alt="Test_traffic_filter-400w.png " class="image right" style="max-width: 300px;" /> <p>Now let’s see if the small volume of test traffic is discernible against the large volume of overall traffic. To do so, we’ll add a two-dimension filter in Data Explorer, Kentik’s primary ad hoc query environment. The filter narrows our view to just the traffic going from the traffic generator to the test target.</p>
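<p>Before looking at the results, it’s worth estimating what 1-in-10000 sampling should show for a 560 pps flow. The arithmetic below is our own back-of-the-envelope Poisson approximation, not part of the test tooling:</p> <code>
import math

N = 10_000        # 1-in-N packet sampling rate
true_pps = 560    # known rate from the generator script
window_s = 3600   # length of the measurement window, in seconds

samples = true_pps * window_s / N        # ~201.6 sampled packets
estimated_pps = samples * N / window_s   # scales back to ~560 pps
rel_err = 1 / math.sqrt(samples)         # ~7% relative std. error

print(f"{samples:.0f} samples -> {estimated_pps:.0f} pps "
      f"(± {rel_err:.0%} over a one-hour window)")
</code> <p>At one-second granularity the expected count is only about 0.06 samples, so individual data points will swing wildly even as longer-window averages converge.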
When we run a query using this filter, with the metric set to show packets/second, we get the following graph and table.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4CFJVTAALu00IuKU8sCIMo/d79f54c4fff27311c04639b319a2414b/Test_result_graph-820w.png" alt="Test_result_graph-820w.png " class="image center" withFrame style="max-width: 800px;" /> <p>The graph has a definite sawtooth pattern induced by the sampling rate, but the traffic is clearly visible. We can see from the Avg pps column of the table that over a relatively short window, the average (547 pps) has converged to a value very close to the known rate of 560 pps that the script is sending.</p> <h2 id="conclusion">Conclusion</h2> <p>These initial test results show that it is possible to measure small “needle-in-a-haystack” traffic flows at relatively high sampling intervals (high N) with accuracy that is adequate for all common use cases. To quantify the parameters more precisely, we plan additional testing on the effects of various sample rates on data reporting accuracy. Check back for future posts on this topic. In the meantime, you can learn more about how Kentik helps you see and optimize your network traffic by <a href="#demo_dialog">scheduling a demo</a>. Or explore on your own by signing up today for a <a href="#signup_dialog">free trial</a>; in 15 minutes you could be looking at your own traffic via the Kentik portal.</p><![CDATA[News in Networking: AWS Downtime, Skills for DevOps Engineers, and Tabs vs. Spaces Debate]]><![CDATA[The NSA officially blamed North Korea for the WannaCry ransomware attacks. A Virginia school got creative to keep its students on fast broadband. Ericsson predicts a 5G user spike. The tabs versus spaces programmer debate continues. And the Kentik team advises on avoiding AWS downtime and offers skills for DevOps engineers to know. All that and more after the jump...]]>https://www.kentik.com/blog/news-in-networking-aws-downtime-skills-for-devops-engineers-and-tabs-vs-spaces-debatehttps://www.kentik.com/blog/news-in-networking-aws-downtime-skills-for-devops-engineers-and-tabs-vs-spaces-debate<![CDATA[Michelle Kincaid]]>Thu, 15 Jun 2017 14:06:50 GMT<h3 id="this-weeks-top-story-picks-from-the-kentik-team"><em>This week’s top story picks from the Kentik team.</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/3nr4EFGlxYwSYQmOWWWMoq/3b27deacedc85631c52bd2b265cb1328/News_tablet-396w.png" style="max-width: 396px;" class="image right">This week the NSA officially blamed North Korea for the WannaCry ransomware attacks. A Virginia school got creative to keep its students on fast broadband. Ericsson predicts a 5G user spike. The tabs versus spaces programmer debate continues. And the Kentik team advises on avoiding AWS downtime and offers skills for DevOps engineers to know.</p> <p><strong>Here are those stories and more:</strong></p> <ul> <li><a href="https://www.theverge.com/2017/6/14/15805346/wannacry-north-korea-linked-by-nsa"><strong>NSA Blames North Korea For WannaCry Ransomware</strong></a> <strong>(The Verge)</strong> The U.S. 
National Security Agency said it has “moderate confidence” that WannaCry, the name of the ransomware that took down thousands of computers, came from North Korean hackers.</li> <li><a href="http://searchaws.techtarget.com/tip/Know-your-AWS-downtime-limits"><strong>Know Your AWS Downtime Limits</strong></a> <strong>(TechTarget)</strong> “For many AWS customers, the reality is that most will tolerate some downtime almost fatalistically,” Kentik Co-founder and CEO Avi Freedman told TechTarget in this article on boosting uptime. “Zero downtime is achievable, but only with investment in architecture and layered or multi-vendor solutions.”</li> <li><a href="https://www.wired.com/story/schools-secret-spectrum-free-internet-digital-divide"><strong>Tapping Secret Spectrum to Give Kids Fast Broadband</strong></a> <strong>(Wired)</strong> “About five million households with school-aged children are mired in the so-called homework gap, because they can’t afford broadband or they live in underserved rural areas,” the article cites from Pew Research Center. Monticello High School in Albemarle County, Virginia, doesn’t want that stat to slow its students’ education, so it got creative.</li> <li><a href="http://www.lightreading.com/mobile/5g/ericsson-eyes-half-a-billion-5g-users-by-2022/d/d-id/733651"><strong>Ericsson Eyes Half a Billion 5G Users by 2022</strong></a> <strong>(Light Reading)</strong> Ericsson predicts there will be more than half a billion 5G customers by the end of 2022. According to Light Reading, Ericsson attributes this spike to the momentum behind the technology and the industry’s decision to “fast track its standardization.”</li> <li><a href="http://www.lightreading.com/services/broadband-services/isp-startup-targets-disruption-in-us-suburbs/d/d-id/733681"><strong>ISP Startup Targets Disruption in US Suburbs</strong></a> <strong>(Light Reading)</strong> “In an ISP industry that is dominated by a handful of players, the chances for Common Networks to succeed are slim,” reports Light Reading. However, the startup is targeting suburban markets and believes “it can exploit a customer niche in areas that won’t attract the same fiber investments as larger, denser cities.”</li> <li><a href="https://arstechnica.com/information-technology/2017/06/according-to-statistics-programming-with-spaces-instead-of-tabs-makes-you-rich/"><strong>Tabs or Spaces? Developer Survey Reveals Where Programmers Make More Money</strong></a> <strong>(Ars Technica)</strong> The tabs or spaces debate continues for developers. But Stack Overflow’s annual survey says developers using spaces make more money. Ars Technica jokes that the latest survey includes “lots of bad news.”</li> <li><a href="http://www.techrepublic.com/article/10-critical-skills-that-every-devops-engineer-needs-for-success/"><strong>10 Skills Every DevOps Engineer Needs for Success</strong></a> <strong>(TechRepublic)</strong> Network awareness is one of the key skills DevOps engineers need to be successful, according to Kentik’s VP of Strategic Alliances Jim Frey in this TechRepublic article. Why? “The end objective of any DevOps project is to successfully deliver an application to the end user who will consume it. That involves the network,” Frey said. “My advice for DevOps engineers is to ignore the network at your peril. 
A good DevOps engineer will recognize that you have to account for the network in your design, your planning and your testing.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Telecoms Take On Network Analytics & Visibility]]><![CDATA[Telecom and mobile operators are clear on both the need and the opportunity to apply big data for advanced operational analytics. But when it comes to being data driven, many telecoms are still a work in progress. In this post we look at the state of this transformation, and how cloud-aware big data solutions enable telecoms to escape the constraints of legacy appliance-based network analytics.]]>https://www.kentik.com/blog/telecoms-take-on-network-analytics-visibilityhttps://www.kentik.com/blog/telecoms-take-on-network-analytics-visibility<![CDATA[Alex Henthorn-Iwane]]>Mon, 12 Jun 2017 13:00:05 GMT<h3 id="for-those-who-escape-the-box-a-transformational-opportunity"><em>For Those Who Escape the Box, a Transformational Opportunity</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/2AFWOy4ekISeECUMOcqMS4/bb04945334409273c4c628ffa04c5701/Fishing_contrast-500w.png" class="image right" style="max-width: 400px; margin-bottom: 20px; margin-left: 15px;" alt="" /> <p>I recently attended the Light Reading Big Communications Event (BCE) in Austin, which is focused on networking in the telecommunications industry. I had a series of discussions there that provided me with helpful insight into the state of network analytics and visibility for telecoms. Fundamentally, my take is that telecom and mobile operators are clear on both the need and the opportunity to apply big data for advanced operational analytics. But in terms of how they see themselves utilizing big data network analytics, many telecoms are still evolving from being vendor-solution driven to being data driven.</p> <p>It’s widely accepted that telecoms have made some questionable investments in big data over the past several years. Many of these deployments have been akin to science experiments. Every possible form of data was collected in massive Hadoop data lakes based on frameworks sold by software vendors who made most of their revenue through huge consulting fees. In many cases, telecoms were essentially paying these vendors to build trawlers, staff them, and send them out on these data lakes to fish for value. In many cases these projects never delivered a solid ROI, and after tens of millions of dollars the plug was pulled.</p> <p>What can be learned from these early experiments? While big data is an interesting technology in and of itself, big data network analytics only earns its keep when it’s driven by real-world use cases. Fortunately, big data network analytics is now a reality for use cases focused on network operations and planning. But there’s a complicating factor, which is that being data driven isn’t just a technology move, it’s also a cultural move. And there’s definitely a wide spectrum of cultural attitudes about analytics across different telecom organizations.</p> <h4 id="data-driven-vs-vendor-driven">Data-Driven vs. 
Vendor-Driven</h4> <img src="//images.ctfassets.net/6yom6slo28h2/1xaCWFEPHOKw4KEAwGeyso/f001d0b8055da1e4111643bcab001510/data-driven-1024x479.jpg" class="image left" style="max-width: 350px" alt="" /> <p>This breadth of opinion was brought home to me by two somewhat contrasting discussions at BCE. The first took place around a panel I participated in on “Implementing Predictive Analytics.” I was struck by comments from Ray Watson, VP of Global Technology from Masergy, about the company’s approach. I knew that Masergy had leaned forward aggressively into virtualizing network functions and utilizing automation. Ray said that they take an iterative improvement approach to everything they do. Speaking with him after the panel, it was clear that they had made investments in operational big data network analytics. It was also clear that they saw network analytics as something that wasn’t just a solution delivered to them, but was something that supported organizational competence, leading to better and better outcomes as iterative learning is applied.</p> <div class="pullquote right">How does the structure of a network analytics system affect its value for telecoms?</div> <p>The other discussion was an informal chat with a team from a different telecom company. As we were talking about the ways that big data platforms for network analytics can provide value, I could clearly see a divide in opinion. Some of the team saw the potential for big data network traffic intelligence as a platform. If they had access to the right data, they could use APIs to create iterations on analyses that could be incorporated into operational runbooks, anomaly detection policies, and automation. Others felt that it was up to the vendor to figure out how to make advanced analytics work for them out of the box.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2dQSDJsKV2MwG26YC6ecS2/db858cd250cb1bb718321fdaba626d82/dataskillz.jpg" class="image right" style="max-width: 400px" alt="" /> <p>My take was that developing data-driven skills was a far better option. My reasoning was that those within the organization itself are far more likely to have deep technical and institutional familiarity with the network. Further, because vendors must create features with general appeal, it’s difficult for a commercial product to fully meet the needs of any particular business. And it can actually be quite difficult for vendors to evolve generalized features once they become widely adopted. Finally, in this era of agile digital business, driven by the cloud and APIs, why would you want to be captive to a small and static set of features that are identically available to all of your competitors? To achieve long-term competitive advantage, telecoms must instead develop the in-house “muscles” required to iterate optimizations in the short-term. You can’t do that when your environment is restricted to UI-only interactions.</p> <h4 id="a-transformational-opportunity">A Transformational Opportunity</h4> <p>There’s a third discussion that’s also relevant here, which is one I had recently with a Gartner analyst who focuses on the telecom sector. One observation he made is that big data network analytics is a transformational platform. In his view, too many telecoms are focused on software-defined networking architectures without yet having developed the skills and culture to reliably leverage deep analytics for effective automation. 
To make decisions, you need a platform that can provide actionable data, which enables you to apply insights and learn from iterative observation. The analyst and I agree that without such a platform, and the skills developed from using it, telecoms won’t have sufficient perspective to make decisions on SDN architecture. That makes big data network analytics, based on real world use cases, a transformational opportunity for telecoms.</p> <div class="pullquote right">Data-driven transformation depends on escaping the appliance box.</div> <p>Of course, to reap the rewards of that transformation, telecoms must be ready to break out of the box. By box, I mean the traditional appliance-bound ways of network monitoring. It’s deeply ironic for telecoms — who are aggressively trying to move applications and services toward big data and the cloud — to continue to settle for 90s-era appliances (whether hardware or virtual) in any major area of their operations. But old habits die hard, and so-called “industry leading” vendors are practiced at spreading FUD to keep customers wedded to outmoded solutions.</p> <p>What about your business? Are you ready to learn more about what a big data network traffic intelligence platform is, and what it can do for you? Check out our Kentik Detect <a href="https://www.kentik.com/kentik-detect/">product pages</a>. You can also get some third-party perspectives on Kentik by visiting our <a href="https://www.kentik.com/resources/">Resources page</a>, where you’ll find analyst reports from Forrester, Ovum, 451 Research, EMA, and IDC (check the “Analyst Report” filter box to find them more easily). Or, if you’re a Gartner research client, talk to your NPMD and Gartner for Technology Professionals analysts and ask them about Kentik. And if you’d like to see Kentik Detect in action, <a href="#demo_dialog">schedule a demo</a> or sign up for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[News in Networking: Security-Defined Networking, SD-WAN, and Container Strategies]]><![CDATA[The meaning of SDN changed. That is, if you’re working in Cisco’s Security Business. It means “security-defined networking,” which is where they’re focusing. SD-WAN is still hogging the spotlight, but CenturyLink says it’s “no quick fix.” Meanwhile, containers are a big part of AT&T’s network strategy. “Not everything is suited for virtual machines,” said AT&T’s CTO. More after the jump...]]>https://www.kentik.com/blog/news-in-networking-security-defined-networking-sd-wan-and-container-strategieshttps://www.kentik.com/blog/news-in-networking-security-defined-networking-sd-wan-and-container-strategies<![CDATA[Michelle Kincaid]]>Wed, 07 Jun 2017 19:01:53 GMT<p>This week’s top story picks from the Kentik team.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/6xchuS0eWWMSYWoEEQ8cqQ/307c69ad226ef4a5d3c50671600b8058/News_tablet-396w.png" alt="News_tablet-396w.png" class="image right" style="max-width: 300px;" /> This week the meaning of SDN changed. That is, if you’re working in Cisco’s Security Business. It means “security-defined networking,” which is where this unit of Cisco is focusing its efforts. SD-WAN is still hogging the spotlight, but CenturyLink says it’s “no quick fix.” Meanwhile, containers are a big part of AT&#x26;T’s network strategy. “Not everything is suited for virtual machines,” said the telecom giant’s CTO. 
</p> <p><strong>Here are those stories and more:</strong></p> <ul> <li><a href="https://www.sdxcentral.com/articles/news/ciscos-ulevitch-say-sdn-stands-still-nothing/2017/06/"><strong>Cisco’s Ulevitch: We Say SDN Stands for ‘Still Does Nothing’</strong></a> <strong>(SDxCentral)</strong> OpenDNS creator and Kentik investor David Ulevitch, who now serves as SVP and GM of Cisco’s Security Business, is aiming to change the acronym SDN, typically used for software-defined networking. According to SDxCentral, he joked at a recent fintech conference that “SDN really stands for ‘still does nothing.’ But we think it might actually stand for security-defined networking.”</li> <li><a href="http://www.lightreading.com/carrier-sdn/sd-wan/centurylink-sd-wan-no-quick-fix/d/d-id/733508"><strong>CenturyLink: SD-WAN No Quick Fix</strong></a> <strong>(Light Reading)</strong> The hype of SD-WAN is here. However, “success in this arena needs to be based on establishing long-term customer relationships, not offering a quick fix to the need for Internet offload or cheaper connections than MPLS can provide,” according to Bill Grubbs, network solutions architect with CenturyLink.</li> <li><a href="https://marketexclusive.com/possible-sale-gigamon-inc-nysegimo-attracts-likes-hewlett-packard-enterprise-co-nysehpe-cisco-systems-inc-nasdaqcsco/115672/"><strong>Cisco, HPE Eyed As Possible Gigamon Buyers</strong></a> <strong>(Market Exclusive)</strong> Cisco and Hewlett-Packard Enterprise have their sights set on network monitoring solution provider Gigamon, according to Market Exclusive. While Gigamon did not comment on whether it’s exploring a possible sale, the company’s shares jumped after the industry rumors started earlier this week.</li> <li><a href="http://searchmicroservices.techtarget.com/news/450419875/IBM-Google-Lyft-launch-Istio-open-source-microservices-platform"><strong>IBM, Google, Lyft Launch Istio Open Source Microservices Platform</strong></a> <strong>(TechTarget)</strong> IBM, Google and Lyft this week announced a new open source platform that aims to connect, manage and secure microservices. According to TechTarget, the platform “supports managing traffic between microservices, enforcing access policies and aggregating telemetry data — all without requiring changes to the microservices code.”</li> <li><a href="http://www.eweek.com/virtualization/vmware-updates-vrealize-cloud-management-to-fit-hybrid-cloud-trend"><strong>VMware Updates vRealize Cloud Management to Fit Hybrid Cloud Trend</strong></a> <strong>(eWEEK)</strong> The cloud trend these days is hybrid rather than all off-prem, according to VMware, which announced it updated its vRealize technology this week to fit the trend. “We’ve watched the evolution of cloud from on-prem to public cloud and now to hybrid, and we are seeing a lot of customers doing both,” David Jasso, the company’s Cloud Management Platform director of product marketing, told eWEEK. “We’re addressing this trend—including the choices of running our cloud on AWS and/or Azure—in everything we do.”</li> <li><a href="https://www.sdxcentral.com/articles/news/att-exec-says-containers-are-key-to-network-architecture/2017/05/"><strong>AT&#x26;T Exec Says Containers Are Key to Network Architecture</strong></a> <strong>(SDxCentral)</strong> AT&#x26;T’s Andre Fuetsch, president of AT&#x26;T Labs and CTO, says containers are an important piece of the telecom giant’s network architecture strategy. “Not everything is suited for virtual machines,” Fuetsch told SDxCentral. 
“When you start looking at various parts of the network where you need speed, reliability, redundancy, there’s some benefits you can get from containers that you can’t get from the alternatives.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Network Visibility for Higher Education IT]]><![CDATA[In higher education, embracing the cloud enhances your ability to achieve successful outcomes for students, researchers, and the organization as a whole. But just as in business, this digital transformation can succeed only if it's anchored by modern network visibility. In this post we look at the network as more than mere plumbing, identifying how big data network intelligence helps realize high-priority educational goals.]]>https://www.kentik.com/blog/network-visibility-for-higher-education-ithttps://www.kentik.com/blog/network-visibility-for-higher-education-it<![CDATA[Alex Henthorn-Iwane]]>Tue, 06 Jun 2017 13:00:25 GMT<p>How Big Data Network Intelligence Enables Institutional Success</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/10okeA1aAoucou6GOmImYW/333838d006435aa314a5b6fccbd28c73/IT_themes-500w.png" alt="IT_themes-500w.png" class="image right" style="max-width: 300px;" /> There’s little debate any more about the importance of IT to the success of modern organizations. Of course the meaning of “success” varies depending on the organization type. For traditional business it’s typically a matter of metrics like market share, revenue growth, and profit. In the world of higher education, meanwhile, the yardsticks are more varied, involving not only financial health but measures like completion rates and research leadership. Even so, IT has a central role to play in both arenas. Institutions that leverage digital technologies in a strategic and prioritized way can enhance their ability to achieve successful outcomes for students, researchers, and the organization as a whole. That means embracing the cloud, putting the network at the heart of learning, research, and operational systems. And as with digital business transformation, this digital higher education transformation can only succeed if it’s anchored by modern network visibility.</p> <p>Successful outcomes are driven by digital transformation.</p> <p>Digital transformation isn’t about change for the sake of change; it’s guided by the goals of the organization. In higher education, the priorities that IT must serve are encapsulated into the above diagram from an article, published in the January 2017 issue of Educause Review, by Susan Grajek and the 2016-2017 Educause IT Issues Panel. The diagram shows student success as the driving goal; at research-heavy institutions, research success would be given equivalent weight. The desired outcome at the top of the pyramid is directly supported by digital transformation, which involves a hybrid cloud approach that includes traditional IT infrastructure and home-grown applications as well as internet-based open source tools, IaaS/PaaS/SaaS, and Web APIs. The rest of the pyramid in large part underscores the importance of digital transformation to all aspects of IT. Data-driven decision-making, enabled by big data, must not only influence student analytics but drive a continuous deployment of optimization across the IT landscape, shepherded by sound data management and governance. 
An organizational orientation to “data-drivenness” helps to deliver on a solid IT foundation. For example, data-driven learning and optimization lead to effective automation, wiser enterprise IT investments, and smarter information security. Leveraging and showcasing this data-driven efficiency and ROI helps IT organizations make the case for more stable funding, which can help create a sustainable staffing environment, while also contributing to improved overall affordability in higher education.</p> <h4 id="visibility-is-key-to-it-success">Visibility is Key to IT Success</h4> <p>As the diagram shows, the network is obviously a critical part of next-generation education IT. But its influence goes far beyond simply acting as digital “plumbing.” After all, the network doesn’t just support digital education transformation, it’s also in a unique position to see how well all that digital stuff is working. In the age of digital education and hybrid cloud, all those web APIs, SaaS tools, infrastructure, and disaggregated applications rely on a huge matrix of north-south and east-west network communications. Complicating matters, higher education IT has to offer network services to multiple user bases, including educational and administrative departments, grant-driven research projects, and large residential student populations.</p> <p>Performance, efficiency, and security depend on modern network visibility.</p> <p>Modern network visibility is key to understanding all these interactions from a performance, efficiency, and security point of view. Modern visibility requires digital transformation of how education networks are planned, operated and secured. If it makes sense to move from siloed databases to big data for student success analytics, then it also makes sense to take a big data approach to the network itself, moving away from the siloed, summarized datasets of 90’s era network management software and appliances. Research such as the EMA report “<a href="https://info.kentik.com/rs/869-PAD-887/images/EMA-End-Of-Point-Solutions-WP.pdf">The End of Point Solutions</a>” has repeatedly shown tangible improvements in network operations outcomes from unifying network data (e.g. traffic, routing, performance, DNS, and HTTP) and making it available for integrated approaches to analysis and anomaly detection. More specifically, modern network visibility, powered by big data, can help higher education in a number of ways that strongly correlate to the top themes called out in the Educause Review article:</p> <ul> <li><strong>Data-driven decision making</strong>: A big data network analytics platform makes the vast and valuable details generated by network telemetry usable. Legacy network monitoring tools don’t have sufficient storage or computing power, so they discard details and instead retain summary reports that shed zero light on important operations, security and planning decisions. The depth of details and speed of modern big data network analytics platforms allow for a virtuous cycle of analysis, learning, optimization, and automation that leads to better performance, staff productivity and continuous improvement. This leads in turn to more sustainable staffing, even in the face of rising demands.</li> <li><strong>Better performance at lower cost</strong>: Higher education organizations must support very high bandwidth internet throughput. 
Internet path-aware network traffic analytics helps network organizations find ways to reduce paid transit and increase settlement-free peering while delivering better performance for key internet-based apps and tools.</li> <li><strong>More accurate DDoS protection</strong>: Higher education is a constant target for DDoS attacks. Dealing efficiently and accurately with attacks is the mandate because DDoS is a form of performance and availability overhead as well as a potential cover for more intrusive threats. Big data analytics have been shown to be superior in detection accuracy, leading to greater automation. And their open APIs allow greater flexibility.</li> <li><strong>More pervasive security analytics</strong>: One of the biggest challenges for information security is how to corral data effectively for defensive purposes when security tools that have the power to examine issues deeply are too expensive to deploy pervasively. Big data network analytics helps by providing a platform that can store pervasively collected traffic telemetry and other data. This data can be used for anomaly detection of conditions that might escape view in other tools due to lack of deployment footprint, mistaken assumptions, or configuration entropy.</li> </ul> <h4 id="network-leadership-for-digital-transformation">Network Leadership for Digital Transformation</h4> <p>What’s been keeping higher education from the advantages above? The cloud era has brought increasingly data-driven operation and automation, but network teams often remain stuck in low gear, with staff still primarily deployed on manual, CLI-based tasks. Much has been made of culture issues within IT, and networking in particular, but the major factor has been the lack of sufficient data to enable an alternative path forward.</p> <p>Kentik makes it easy to access and utilize network traffic data.</p> <p>With modern network visibility powered by big data, network teams can instead become leaders in digital transformation. The raw materials of rich, pervasive network data are already available, and with Kentik Detect it’s now become easy to access a comprehensive platform for big data network traffic intelligence. Kentik’s SaaS solution is used by well over 100 web and digital business leaders, including web enterprises like Box, Yelp, and Pandora, telecom giants, and top-5 cloud providers. We also serve regional education networks like OSHEAN, OneNet, and Kanren, individual universities like University of Washington and UPenn, and even K-12 education organizations. To get a feel for what big data network traffic intelligence from Kentik Detect can do, check out our <a href="https://www.youtube.com/embed/ktXv1sKHzfU">Pandora case study video</a>. You can also find general information about Kentik Detect’s capabilities on our <a href="https://www.kentik.com/kentik-detect/">product pages</a>. Best of all, see for yourself what Kentik Detect could do for your own network management operations: start a <a href="#signup_dialog">free trial</a> or email us at <a href="mailto:[email protected]">[email protected]</a> to <a href="#demo_dialog">request a demo</a>.</p><![CDATA[News in Networking: Meeker’s Internet Trends, SD-WAN Adoption, and OS Networking]]><![CDATA[In headlines this week, investor Mary Meeker released her annual “Internet Trends” report, which includes internet growth across regions. SD-WAN is not top-of-mind for IT professionals, according to a new survey. 
And open source networking tools may be getting better, but there are still a lot of challenges with them. More after the jump...]]>https://www.kentik.com/blog/6949-2https://www.kentik.com/blog/6949-2<![CDATA[Michelle Kincaid]]>Wed, 31 May 2017 18:44:56 GMT<h3 id="this-weeks-top-story-picks-from-the-kentik-team"><em>This week’s top story picks from the Kentik team.</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/3nr4EFGlxYwSYQmOWWWMoq/3b27deacedc85631c52bd2b265cb1328/News_tablet-396w.png" style="max-width: 396px;" class="image right"> <p>Making headlines this week, investor Mary Meeker released her annual “Internet Trends” report, which includes internet growth, or lack of it, across regions. SD-WAN is not top-of-mind for IT professionals — at least not yet, according to a new survey. And open source networking tools may be getting better, but there are still a lot of challenges with them.</p> <p><strong>Here are those stories and more:</strong></p> <ul> <li><a href="https://www.recode.net/2017/5/31/15693686/mary-meeker-kleiner-perkins-kpcb-slides-internet-trends-code-2017"><strong>Mary Meeker’s 2017 Internet Trends Report: All the Slides, Plus Analysis</strong></a> <strong>(Recode)</strong> One of Silicon Valley’s most-viewed annual slide decks is live. Mary Meeker, partner at VC firm Kleiner Perkins Caufield &#x26; Byers, delivered her “Internet Trends 2017” presentation earlier today. Among the trends (via 355 slides): There’s room for internet growth in India. Netflix is booming; TV, not so much. And Alexa is becoming a popular household name, as voice services are taking over typing.</li> <li><a href="http://www.lightreading.com/carrier-sdn/sd-wan/survey-says-sd-wan-no-panacea/a/d-id/733310?"><strong>Survey Says: SD-WAN is No Panacea</strong></a> <strong>(Light Reading)</strong> Results of a new study conducted by 451 Research and commissioned by Cato Networks show the SD-WAN market is growing. However, according to the 350 survey participants working in the IT space, only 20 percent reported plans to deploy SD-WAN in the next year.</li> <li><a href="https://themerkle.com/top-5-cryptocurrency-exchanges-hit-by-ddos-attacks/"><strong>Top 5 Cryptocurrency Exchanges Hit by DDoS Attacks</strong></a> <strong>(The Merkle)</strong> Top exchanges for Bitcoin and other cryptocurrencies are seeing spikes in internet traffic. Some of that is due to increasing interest in these digital assets. However, according to reports, traffic spikes are also the result of DDoS attacks on these exchanges. Not necessarily surprising, as BGP hijacks can contribute to the attacks.</li> <li><a href="http://www.eweek.com/security/ibm-cisco-integrate-threat-intelligence-to-improve-cyber-security"><strong>IBM, Cisco Integrate Threat Intelligence to Improve Cybersecurity</strong></a> <strong>(eWeek)</strong> Today IBM and Cisco announced a partnership to actively share threat intelligence data, products and services. According to eWeek, “the IBM Resilient Incident Response Platform (IRP) will also be integrated with Cisco’s ThreatGrid platform to pull in indicators of compromise.”</li> <li><a href="http://searchsdn.techtarget.com/tip/Open-source-network-software-matures-but-needs-incentives-for-use"><strong>Open Source Network Software Matures, But Needs Incentives for Use</strong></a> <strong>(TechTarget)</strong> Open source software for networking is on the rise, due to its design, improved flexibility and reduced costs, according to this article. 
The article goes on to note that “networking open source adoption challenges run the gamut of technical, political, cultural and economic.”</li> </ul> <p><strong>Until next week, follow us on <a href="https://twitter.com/kentikinc">Twitter</a> and <a href="https://www.linkedin.com/company/kentik">LinkedIn</a> to see more of these headlines in real time.</strong></p><![CDATA[Consolidated Tools Improve Network Management]]><![CDATA[Stuck with piles of siloed tools, today's network teams struggle to piece together both the big picture and the actionable insights buried in inconsistent UIs and fragmented datasets. The result is subpar performance for both networks and network teams. In this post we look at the true cost of legacy tools, and how Kentik Detect frees you from this obsolete paradigm with a unified, scalable, real-time solution built on the power of big data.]]>https://www.kentik.com/blog/consolidated-tools-improve-network-managementhttps://www.kentik.com/blog/consolidated-tools-improve-network-management<![CDATA[Alex Henthorn-Iwane]]>Wed, 31 May 2017 13:00:30 GMT<p>Freeing Your Network Teams From Legacy Limits</p> <img src="//images.ctfassets.net/6yom6slo28h2/4HWtv0SCveIi6MyWu46SQc/59db8f2544fda5b91b729b07483b88ca/Old_tools-500w.png" alt="Old_tools-500w.png" class="image right" style="max-width: 300px;" /> <p>At Kentik, we firmly believe that network engineering, operations, and planning are critically dependent not only on the quality of traffic data but also the accessibility of that data. While quality is easily defined in terms of accuracy, relevance, and detail, the value of even high-quality data is severely diminished if it can’t be accessed quickly and easily by those who need it. Unfortunately, that’s pretty much the status quo for network teams today. Stuck with piles of siloed tools, they struggle to piece together the big picture and the actionable insights buried in inconsistent UIs and fragmented datasets. The result is subpar performance for both networks and network teams. It’s high time to move away from this legacy paradigm to a unified, scalable, real-time solution built on the power of big data.</p> <p>Today’s siloed network management tools can be traced back to an earlier era, when design was constrained by the limited computing, memory, and storage capacity of appliances or single-server software deployments. This meant that any given tool could only handle a narrow use case, which led to tool fragmentation. On the one hand, you had large management software vendors like HP, CA, IBM, and BMC who sold big branded “suites” that were actually, under the hood, a collection of separate tools, often from acquired companies. On the other side were “best of breed” tools, sold by smaller vendors that specialized in one particular area. Either way, it turned out, APIs were mostly architectural afterthoughts and users ended up with a collection of disparate, narrow tools that couldn’t — even with hefty consulting fees — be integrated into a seamless, efficient whole.</p> <p>The problem facing network teams who still work with these fragmented tools is described in the report from Enterprise Management Associates titled: “The End of Point Solutions: Modern NetOps Requires Pervasive and Integrated Network Monitoring,” by Senior Analyst Shamus McGillicuddy:</p> <p><em>“Fragmentation of visibility has long plagued the world of network operations. 
IT organizations have no shortage of tools that provide them with glimpses of what is happening with network infrastructure, but these tools often provide very narrow views. Some tools present insights gleaned from the collection of device metrics while others use network flows. Other tools gain insight through analysis of packet data, and so on. In many cases, multiple, separate tools receive the same set of source network data but retain different data subsets. While a network operations team can assemble a good understanding of the health and performance of a network with these tools, it is not easy. In fact, as Enterprise Management Associates (EMA) research has shown, a lack of end-to-end network visibility is the top challenge to enterprise network operations today.”</em></p> <h4 id="integration-via-swivel-chair">“Integration” via Swivel Chair</h4> <img src="//images.ctfassets.net/6yom6slo28h2/49O2v5kvj2miWi2O2wYY0K/a3bd0523fee1a302f0717b4d60df3dd2/Noc_jock-400w.png" alt="Noc_jock-400w.png" class="image right" style="max-width: 300px;" /> <p>With true cross-tool integration being extremely rare in legacy tools, engineers have to do the “integration” work themselves via swivel-chair operations. High-value engineers are forced into highly inefficient workflows in which they visually correlate peaks on graphs, tables, and other data from multiple tools spread across several screens. This archaic approach is exemplified in the photo at right, in which the tools available to a representative network operations center (NOC) engineer include a particularly arcane piece of gear: a telephone.</p> <p>Poorly integrated tools that discard actionable details on network traffic result in lots of wasted time. But the overall cost goes beyond the efficiency of individual engineers. According to the EMA “End of Point Solutions” report, there’s an inverse relationship between the number of tools juggled by network operations teams and their performance outcomes. One example is problem detection:</p> <p><em>“Our research found that IT organizations using fewer network management tools reported more effective network problem detection than organizations that were using more tools. The typical network operations team reported detecting 60% of network problems before end-users experience and report these issues. However, organizations using 11 or more network monitoring and troubleshooting tools detect only 48% of problems before end-users, and organizations using only one to three tools catch 71% of problems before they affect end users.”</em></p> <p>Another area where worse outcomes tracked with more tools is network stability:</p> <p><em>“Network stability also correlates with the size of a management toolkit. Among organizations that use 11 or more tools, 34% experience several network outages a day and another 28% experience network outages several times a month. Meanwhile, just 6% of organizations using one to three tools experience several outages a day. Instead, 21% of them experience just one or two outages per year, and 18% said they almost never have an outage.”</em></p> <h4 id="the-consolidated-solution">The Consolidated Solution</h4> <p>Kentik’s founders, who ran large network operations at Akamai, Netflix, YouTube, and Cloudflare, well understand the challenges faced by teams working with siloed legacy tools and fragmented data sets. They knew that network data was filled with valuable operations information, and also how much of that value (and their own time) was being wasted. 
So they built a big data engine that could ingest diverse traffic data into a unified time-series data set, keep that data unsummarized for months, and make it available to answer any ad-hoc question within moments. The result is Kentik Detect.</p> <p>With Kentik Detect you can finally get the following data sets in one platform for both hyper-speed queries and streaming anomaly/DDoS detection:</p> <ul> <li>Netflow, sFlow, IPFIX</li> <li>BGP</li> <li>GeoIP</li> <li>SNMP (device &#x26; interface data)</li> <li>Host or sensor-based network performance metrics such as latency, TCP retransmits, errors, out-of-order packets, etc.</li> <li>DNS log data</li> <li>HTTP data such as URLs</li> <li>Your custom tags applied in real-time, based on ingested data field values</li> </ul> <p>This data correlation dovetails nicely with the take-away of the EMA report:</p> <p><em>“The days of network operators relying on point tools for network monitoring and troubleshooting are over, even if network managers aren’t yet aware of this change. The time has come for them to put away their point solutions, spreadsheets, and open source tools. EMA research shows that network operations teams are more effective when they use an integrated, consolidated toolset.”</em></p> <p>Our customers have found the integration, speed, and power of Kentik Detect to be liberating. To get a feel for what’s possible, check out the <a href="https://www.youtube.com/embed/ktXv1sKHzfU">Pandora case study video</a>. If you’d like more general information about Kentik Detect, check out our <a href="https://www.kentik.com/kentik-detect/">product pages</a>. To read the referenced EMA white paper, download it <a href="https://info.kentik.com/rs/869-PAD-887/images/EMA-End-Of-Point-Solutions-WP.pdf">here</a>. And if you’d like to see for yourself what Kentik Detect can do for your network management operations, start a <a href="#signup_dialog">free trial</a> or <a href="#demo_dialog">request a demo</a>.</p><![CDATA[News in Networking: Open Data Centers, Internet 911, and $138.5B for Data]]><![CDATA[This week in networking news, GE and HPE are just two of the companies pushing for open data center designs via a new nonprofit. Meanwhile, researchers are pushing for a way for first responders to have faster internet access. And 451 Research is forecasting the data platforms and analytics market will reach $138.5 billion by 2021. More after the jump...]]>https://www.kentik.com/blog/news-in-networking-open-data-centers-911-internet-and-138-5b-for-datahttps://www.kentik.com/blog/news-in-networking-open-data-centers-911-internet-and-138-5b-for-data<![CDATA[Michelle Kincaid]]>Wed, 24 May 2017 19:35:43 GMT<h3 id="this-weeks-top-story-picks-from-the-kentik-team"><em>This week’s top story picks from the Kentik team.</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/3nr4EFGlxYwSYQmOWWWMoq/3b27deacedc85631c52bd2b265cb1328/News_tablet-396w.png" style="max-width: 396px;" class="image right">This week in networking news, GE, HPE and LinkedIn are just a few of the companies pushing for open data center designs via the launch of a new nonprofit. Meanwhile, researchers are pushing for a way for first responders to have faster internet access during emergency situations. Facebook’s Telecom Infrastructure Project is aiming to disrupt the $350 billion market. And 451 Research is forecasting the data platforms and analytics market will reach $138.5 billion by 2021. 
</p> <h4 id="here-are-those-stories-and-more">Here are those stories and more:</h4> <ul> <li><a href="http://www.zdnet.com/article/linkedin-hpe-launch-nonprofit-to-support-open-datacenter-designs/"><strong>GE, HPE Launch Nonprofit to Support Open Data Center Designs</strong></a> <strong>(ZDNet)</strong> New nonprofit Open19 Foundation was launched this week by notable companies like GE, HPE and LinkedIn. According to the organization’s website, its mission will be to provide “an industry specification that defines a cross-industry common server form factor, creating a flexible and economic data center and edge solution for operators of all sizes.”</li> <li><a href="https://phys.org/news/2017-05-team-high-speed-internet-lane-emergency.html"><strong>Team Creates High-Speed Internet Lane for Emergency Situations</strong></a> <strong>(Phys.org)</strong> Emergency situations and natural disasters can disrupt networks with huge spikes of traffic. For first responders and crisis managers, quickly sharing critical information during these times can be nearly impossible. That’s why researchers at Rochester Institute of Technology created the Multi Node Label Routing protocol, a new network protocol for “a faster and more reliable way to send and receive large amounts of data through the internet.”</li> <li><a href="http://www.fiercetelecom.com/telecom/verizon-asks-fcc-to-eliminate-annual-international-traffic-revenues-reporting"><strong>Verizon Asks FCC to Eliminate Annual International Traffic, Revenues Reporting</strong></a> <strong>(FierceTelecom)</strong> The FCC tracks international network traffic and revenue reports on an annual basis. The purpose is so that the government can benchmark and monitor rates. However, Verizon says the process is too “burdensome.” That’s why they’re asking the FCC to eliminate these reports.</li> <li><a href="http://www.businessinsider.com/inside-facebooks-telecom-infrastructure-project-2017-5?op=1"><strong>Inside Facebook’s Plan to Eat the $350B Telecom Market</strong></a> <strong>(Business Insider)</strong> Facebook’s Telecom Infrastructure Project is aiming to disrupt the $350 billion market. It’s all part of the social media giant’s mission to “make the world more open and connected,” according to reports. Rightfully so, as it already has support from industry heavyweights like Equinix, Orange and Telia. Axel Clauberg, a VP for Technology Innovation at Deutsche Telekom weighed in to say, “Decades ago, hardware and communications engineers went to the telecom companies to build amazing new things, like mobile networks. Then they went to the tech companies like Cisco to build the tech that created the internet. Today, they are going directly to the internet companies like Google and Facebook and creating new hardware so they don’t have to rely on the vendors.”</li> <li><a href="http://www.lightreading.com/data-center/data-center-infrastructure/how-iot-5g-and-nfv-will-impact-data-center-infrastructure/a/d-id/733000?itc=lrnewsletter_iotm2mupdate"><strong>How IoT, 5G &#x26; NFV Will Impact Data Center Infrastructure</strong></a> <strong>(Light Reading)</strong> Data center operators need to prepare for the incoming data generated by IoT, 5G and NFV, according to industry analyst Roz Roseboro. How can the operators stay ahead? “Automation will allow operators to move faster, but they cannot afford to do so recklessly,” advises Roseboro. 
“They will need to continue to be disciplined in their testing, leveraging the increased visibility and analytics capabilities of next-generation service assurance and monitoring systems.”</li> <li><a href="https://451research.com/report-short?entityId=92531&#x26;type=mis&#x26;alertid=488&#x26;contactid=0033200002BXCL9AAP&#x26;utm_source=sendgrid&#x26;utm_medium=email&#x26;utm_campaign=market-insight&#x26;utm_content=newsletter&#x26;utm_term=92531-451+Research+predicts+Total+Data+market+to+reach+%24138.5bn+by+2021"><strong>451 Research Predicts Total Data Market to Reach $138.5B by 2021</strong></a> <strong>(451 Research)</strong> Last but certainly not least, 451 Research this week is predicting the “Total Data” market, which consists of data analytics and data platforms, will reach $138.5 billion by 2021. The analysts’ forecast is based on analysis from a sample size of 336 relevant vendors.</li> </ul><![CDATA[Network Traffic Intelligence for ISPs]]><![CDATA[Large or small, all ISPs share the imperative to stay competitive and profitable. To do that in today's environment, they need traffic visibility they can't get from legacy network tools. Taking their lead from the world's most-successful web-scale enterprises, ISPs have much to gain from big data network and business intelligence, so in this post we look at ISP use cases and how Kentik Detect's SaaS model puts key capabilities within easy reach.]]>https://www.kentik.com/blog/network-traffic-intelligence-for-ispshttps://www.kentik.com/blog/network-traffic-intelligence-for-isps<![CDATA[Alex Henthorn-Iwane]]>Tue, 23 May 2017 13:00:20 GMT<p>Why Every ISP Needs a Robust Network Monitoring Solution</p> <img src="//images.ctfassets.net/6yom6slo28h2/3b50zGLWd2Qo8SiS8kiqKm/3d7e001fbbe372ed0bd634338886f85d/Monitored-500w.png" alt="Monitored-500w.png" class="image right" style="max-width: 300px;" /> <p>Internet Service Providers (ISPs) come in varying sizes, from rural broadband and small cable MSOs to Tier 2 players and Tier 1 global giants. Their customers might be consumers, businesses, or a mix of the two. But one thing they have in common is the imperative to stay competitive and profitable. To do that in today’s network environment, ISPs need deeper network visibility. Given the advanced capabilities provided by cloud and big data technology, there’s no longer any justification for legacy monitoring appliances that summarize away all the details and force operators to swivel between siloed tools.</p> <p>It used to be that cloud-scale network monitoring was within reach of only the biggest, richest organizations, those that were most software-savvy and R&#x26;D-heavy. The web-scale companies that successfully pioneered big data approaches reaped institutional rewards when they used the resulting data to improve operations and planning. Data-driven operations led to huge gains in efficiency as automation enabled by rich data reduced human involvement in manual tasks and helped shift human resources to higher-value activities.</p> <p>ISPs can gain similar advantages by becoming far more data driven. Imagine a big data time-series datastore that unifies traffic flow records (NetFlow, sFlow, IPFIX) with related data such as BGP routing, GeoIP, network performance, and DNS logs, that retains unsummarized data for months, and that has the compute and storage power to answer ad hoc queries across billions of data points in a couple of seconds. 
The insights available to ISPs from that kind of network monitoring solution would include:</p> <ul> <li>Peering and transit analytics to maximize direct peering opportunities</li> <li>Faster root-cause correlation</li> <li>Alerting on anomalies such as traffic shifts</li> <li>DDoS detection and API-driven mitigation</li> <li>Transit backbone traffic entry and ultimate exit optimization (ensuring that traffic is routed off the backbone as fast as possible)</li> <li>Automated rebalancing of traffic to maintain transit commits at lowest cost</li> <li>Margin analysis for traffic from hosting and transit customers</li> <li>CDN attribution (understanding how much traffic you’re carrying for CDNs operated by Akamai, Google, Netflix, etc.)</li> </ul> <p>Note that the above use cases cover network performance monitoring, planning, and business intelligence. Big data insights have the power to drive efficiency, market savvy, automation, and better service experience. By leveraging that power, ISPs can increase competitiveness and profitability.</p> <h4 id="build-versus-buy">Build versus Buy</h4> <div class="pullquote right">The skills and resources required for open source don’t match core ISP priorities.</div> <p>It’s one thing to know that something would be good for your business, and quite another to actually achieve it. With the advent of open source big data engines, the power of big data network analytics has seemed tantalizingly close. But open source tools require skill sets and resources that are outside of the core competencies and priorities of most ISPs. It’s no trivial task to develop and maintain the network-savvy applications needed to fully address your use cases. And that keeps generic open source tools from being a fully viable path.</p> <p>That’s why we started Kentik. Kentik’s founders were network operators running some of the world’s largest networks, and they were frustrated with the pitiful state of commercially available network monitoring tools, particularly for flow data. Why couldn’t the analytical power wielded so profitably by web-scale companies be available for network data as well? And why couldn’t it be available for everyone, without having to build it yourself?</p> <h4 id="need-answered-kentik-detect">Need Answered: Kentik Detect</h4> <p>Kentik Detect is the answer to those questions. Kentik’s founders researched all of the available open source platforms and found none that were able to meet what would be required in terms of ingest scale and latency, multi-tenancy, and ad-hoc query responsiveness. So they innovated a purpose-built big data engine for network flows and related data. Kentik Detect ingests hundreds of billions of network data records daily from our customers and empowers network operations with powerful analytics, customizable dashboards, peering analytics, and more. It also protects your network with anomaly detection, DDoS detection, and automated mitigation via RTBH and popular mitigation solutions.</p> <div class="pullquote right">Kentik’s affordable SaaS avoids the cost of appliances, upgrades, and maintenance.</div> <p>From an adoption point of view, one compelling aspect of Kentik Detect for ISPs is that it is offered as SaaS. No capital costs, appliances, or software to install and maintain. No upgrade and EOL cycles. As an annual subscription, it’s affordable enough that big data network monitoring power is now being used by ISPs ranging from rural telecoms all the way up to Tier 1 ISPs like GTT and global carriers.</p> <p>Kentik Detect is also built in an API-first fashion, so that it supports aggressive network automation. 
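</p> <p>Most of these automations follow the same closed loop: query the analytics API, decide, push a change. The sketch below shows the shape of such a loop for one of the use cases listed earlier, keeping transit commits in check. The endpoint, token header, response fields, and commit numbers are all hypothetical placeholders, not the actual Kentik API.</p> <code>
import requests

ANALYTICS_API = "https://analytics.example.com/v1/traffic/by-provider"  # hypothetical endpoint
COMMIT_MBPS = {"provider_a": 10000, "provider_b": 6000}  # example commit levels

def check_transit_commits(api_token):
    """Fetch per-provider traffic averages and flag providers running over commit."""
    resp = requests.get(ANALYTICS_API, headers={"X-API-Token": api_token}, timeout=30)
    resp.raise_for_status()
    for row in resp.json()["providers"]:  # assumed response shape
        if row["avg_mbps"] > COMMIT_MBPS.get(row["name"], 0):
            # In a real deployment this is where a routing-policy change would be
            # triggered, e.g., shifting prefixes toward an under-committed provider.
            print("over commit:", row["name"], row["avg_mbps"])
</code> <p style="margin-bottom: 0;">&nbsp;</p> <p>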
Kentik customers are automating many aspects of their network operations and management tasks. An intriguing example is Pandora, which harvests Geo, performance, and traffic data from Kentik and uses it to automatically configure its Geo server load balancers (GSLBs) to rebalance traffic across its large network. Check out our <a href="https://www.youtube.com/watch?v=ktXv1sKHzfU">Pandora case study video</a> to learn more.</p> <p>If you’re an ISP and you haven’t yet considered how big data network traffic intelligence can transform your business, then it’s time to take a look at <a href="https://www.kentik.com/kentik-detect/">Kentik Detect</a>. <a href="#demo_dialog">Request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[News in Networking: Australia Internet Woes, VMware’s Deal, and NFV at BCE]]><![CDATA[This week we learned Australia invested $36B to modernize broadband. But it’s not working, according to NY Times, which reports on the country’s crawling internet speeds. Also happening this week, VMware scooped up mobile app intelligence startup Apteligent for more analytics capabilities. Meanwhile, over in Austin, the annual BCE is taking place. That’s where BT talked NFV challenges. More after the jump...]]>https://www.kentik.com/blog/news-in-networking-australia-internet-woes-vmwares-deal-and-nfv-at-bcehttps://www.kentik.com/blog/news-in-networking-australia-internet-woes-vmwares-deal-and-nfv-at-bce<![CDATA[Michelle Kincaid]]>Wed, 17 May 2017 17:53:51 GMT<p>This week’s top story picks from the Kentik team.</p> <img src="//images.ctfassets.net/6yom6slo28h2/42iQNhlPoQMWIqE8CQS482/52223d73d8a201a72058e9c760bdf42c/News_tablet-396w.png" alt="News_tablet-396w.png" class="image right" style="max-width: 300px;" /> <p>This week we learned Australia invested $36 billion (USD) to modernize its broadband. But that’s not working, according to The New York Times, which reports on the country’s crawling internet speeds. Not crawling this week is VMware, which scooped up mobile app intelligence startup Apteligent for more analytics capabilities for its customers. Meanwhile, over in Austin, Light Reading’s annual Big Communications Event (BCE) is taking place. That’s where BT talked about the challenges of NFV. </p> <p>Here are those headlines and more:</p> <ul> <li><a href="https://www.nytimes.com/2017/05/11/world/australia/australia-slow-internet-broadband.html"><strong>How Australia Bungled Its $36 Billion High-Speed Internet Rollout</strong></a> <strong>(The New York Times)</strong> Forget the “down under” jokes. Australia is trying to figure out how to dig its way up and out of an internet speed slump. The country ranks No. 51 in terms of internet speeds, according to Akamai, despite its $36 billion (USD) investment to modernize broadband. Australia’s internet speeds are slower than developing economies like Kenya and Thailand.</li> <li><a href="https://techcrunch.com/2017/05/15/apteligent-acquired-by-vmware/"><strong>App Analytics Company Apteligent Acquired by VMware</strong></a> <strong>(TechCrunch)</strong> VMware this week announced it’s acquiring mobile app intelligence startup Apteligent, the company previously known as Crittercism. 
According to TechCrunch, “VMware will be able to provide a more robust set of capabilities to its mobile analytics customers.”</li> <li><a href="http://www.lightreading.com/nfv/nfv-specs-open-source/nfv-woes-could-be-fixed-with-service-models-bt-suggests/d/d-id/732870?itc=lrnewsletter_lrdaily/"><strong>NFV Woes Could Be Fixed With Service Models, BT Suggests</strong></a> <strong>(Light Reading)</strong> Happening in Austin this week is Light Reading’s Big Communications Event, where BT’s chief network architect, Neil McRae, suggested: “Having a common, base-level service model would make life easier for telcos by standardizing the drab, ordinary parts of a service. Carriers could still fine-tune the services to their liking, and they could also differentiate in the way they stitch services together.”</li> <li><a href="http://searchsdn.techtarget.com/tip/Why-SDN-DevOps-dont-share-automated-network-missions"><strong>Why SDN, DevOps don’t share automated network missions</strong></a> <strong>(TechTarget)</strong> How do SDN and DevOps fit together? Do they share the same goals? According to one network engineer, “Comparing the goals of DevOps and SDN in terms of automation as interchangeable fails to understand the point of a software-defined network.”</li> <li><a href="http://www.brendangregg.com/blog/2017-05-16/working-at-netflix-2017.html"><strong>A Netflix Performance Engineer Explains His Job</strong></a> <strong>(Brendan Gregg’s Blog)</strong> Ever wonder what it’s like to be a performance engineer for the single largest source of peak downstream Internet traffic in the U.S.? Check out Netflix performance engineer Brendan Gregg’s latest blog post for “a day in the life.”</li> <li><a href="https://news.hpe.com/a-new-computer-built-for-the-big-data-era/?utm_source=MIT+Technology+Review&#x26;utm_campaign=72e7faef25-The_Download&#x26;utm_medium=email&#x26;utm_term=0_997ed6f472-72e7faef25-154823461"><strong>HPE Unveils Computer Built for the Era of Big Data</strong></a> <strong>(Press Release)</strong> If you want to buy a computer built for big data, talk to HPE. This week they introduced “the world’s largest single-memory computer.” According to their press release, the prototype contains 160 terabytes of memory, and the company expects the architecture to scale to a nearly limitless pool of memory.</li> </ul><![CDATA[SDN and Self-Driving Networks]]><![CDATA[SDN holds lots of promise, but its practical applications have so far been limited to discrete use cases like customer provisioning or service scaling. The long-term goal is true dynamic control, but that requires comprehensive traffic intelligence in real time at full scale.
As our customers are discovering, Kentik Detect's traffic visibility, anomaly detection, and extensive APIs make it an ideal source for actionable traffic data that can drive network automation.]]>https://www.kentik.com/blog/sdn-and-self-driving-networkshttps://www.kentik.com/blog/sdn-and-self-driving-networks<![CDATA[Jim Meehan]]>Mon, 15 May 2017 13:00:55 GMT<h3 id="traffic-intelligence-is-the-key-to-effective-network-automation"><em>Traffic Intelligence is the Key to Effective Network Automation</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/6jCHVBf9csaC4yKaKsk4KK/1d22586ea8245ff35f4b19a1230be698/Freeway_traffic-500w.png" alt="Freeway_traffic-500w.png" class="image right" style="max-width: 500px; margin: 15px;" /> <p>Reading the tech press, one might understandably conclude that software-defined networks (SDN) are “eating the world” (to borrow from Marc Andreessen). Some of this excitement is justified, because SDN holds lots of promise for improving the way that we build, scale, and operate network infrastructure. As our traffic grows and network loads become increasingly dynamic, it’s difficult to keep pace using manual provisioning. So we have SDN/overlay technologies like VXLAN to simplify provisioning of connectivity between VMs running on diverse physical hosts, and NFV in service-provider networks to simplify deployment of customer services. To date, though, practical, deployable SDN technology has mostly been limited to these niches. Meanwhile, the hype of SDN goes far beyond automating and simplifying network provisioning.</p> <img src="//images.ctfassets.net/6yom6slo28h2/RY2CQAyvyoiqSEeWqKGWu/75ce752e691a12ed7b191c8fb13ea7f3/Scarecrow_brain-300w.png" alt="Scarecrow_brain-300w.png" class="image left" style="max-width: 300px; margin: 15px;" /> <h4 id="where-da-brain">Where Da Brain?</h4> <p>SDN advocates often tout the promise of things like “<a href="https://www.juniper.net/us/en/dm/the-self-driving-network/">self-driving networks</a>” that execute automatic responses to adverse conditions such as attacks and congestion. What’s been delivered so far, though, focuses primarily on network programmability. SDN APIs, interfaces, and orchestration typically enable operators to respond to discrete external events, like customer provisioning or service scaling, by reconfiguring the network from a central point of control. Getting from that to a truly self-driving network will require a feedback loop in which brain-like components consume massive amounts of information about network traffic and dynamically re-program the network based on metrics and analytics.</p> <img src="//images.ctfassets.net/6yom6slo28h2/wY7YrEAZ1es8cuQ2CUs0u/d93c3febb800b53e53a44f9af7c47bde/Google_Espresso-497w.png" alt="Google_Espresso-497w.png" class="image right" style="max-width: 497px; margin: 15px;" /> <p>By and large, the SDN components available in commercially consumable solutions haven’t yet achieved that level of intelligence, though hypergiants like Google and Facebook are making progress with in-house projects.
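</p> <p>In skeleton form, that feedback loop is simple to state, even though building it at scale is the hard part. A toy sketch, with every function a stand-in (a real system would plug in a flow collector, an analytics engine, and an SDN controller, respectively):</p> <pre><code>import time

def observe():
    """Collect current traffic metrics (stub: per-link load in Gbps)."""
    return {"link-a": 7.2, "link-b": 1.4}

def decide(metrics, limit_gbps=6.0):
    """Return the links whose load exceeds policy and need traffic shifted."""
    return [link for link, load in metrics.items() if load > limit_gbps]

def act(overloaded):
    """Reprogram the network (stub: a real system calls a controller API)."""
    for link in overloaded:
        print(f"shifting traffic away from {link}")

while True:                  # the continuous observe-decide-act loop
    act(decide(observe()))
    time.sleep(60)           # re-evaluate every minute
</code></pre> <p>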
Google recently published some details on their <a href="https://blog.google/topics/google-cloud/making-google-cloud-faster-more-available-and-cost-effective-extending-sdn-public-internet-espresso/">Espresso peering edge architecture</a> which uses metrics from their various apps to dynamically choose the datacenter/location from which an end user will be served, optimizing for loss, latency, and other factors to deliver the best user experience.</p> <p>In a similar fashion, Facebook recently published a blog describing their <a href="https://code.facebook.com/posts/1782709872057497">Express Backbone</a> that provides internal connectivity between datacenters. A major component of the system is the ability to continuously generate a matrix of site-to-site traffic loads based on metrics (sFlow) collected from the network. The system then dynamically provisions an MPLS LSP topology to meet the observed loads while optimizing for various traffic classes (e.g. latency sensitive vs. insensitive).</p> <img src="//images.ctfassets.net/6yom6slo28h2/5P7kZl1M6QoGS2q2WUiSSy/734828274a171e3b9fe17cf282e8374f/Facebook_Express-736w.png" alt="Facebook_Express-736w.png" class="image center no-shadow" style="max-width: 736px;" /> <p>Like the SDN hype in the press — and the collateral from network hardware vendors — Facebook glosses over the component on their diagram that is labeled “Sflow Collector,” which is what actually provides inputs to the controller. Google’s Espresso blog post says “we leverage our large-scale computing infrastructure and signals from the application itself…” which gets a little closer to describing the scope of the task. Collecting network traffic metadata at scale and transforming it into network control inputs, all in real-time, is a significant engineering challenge. What if you don’t have an army of in-house software engineers to craft a purpose-built, real-time big data platform?</p> <h4 id="automation-ready-traffic-intelligence">Automation-ready Traffic Intelligence</h4> <p>Kentik was essentially founded as a response to the above question. Our API-enabled SaaS, Kentik Detect, is a single, central service for collecting, processing, and analyzing flow data, and it produces outputs that arm network operators with visibility, forensics, automated detection, and — increasingly — automated control. Using the built-in DDoS detection feature set, for example, you can automatically orchestrate a number of different network actions. RTBH route injection via BGP will drop malicious traffic at the network edge. And API integration with partners like Radware and A10 enables automated triggering of DDoS scrubbing hardware when Kentik Detect identifies traffic that is outside of either absolute parameters or baseline norms.</p> <p><a href="https://www.youtube.com/watch?v=ktXv1sKHzfU&#x26;feature=youtu.be"><img src="//images.ctfassets.net/6yom6slo28h2/2BhLQSLZ2McSuY4OqCMaUa/fb6f60518771869324b7599e3bdd1ba3/Pandora_Kelty-500w.png" alt="Pandora_Kelty-500w.png" class="image right" style="max-width: 500px; margin: 15px;" /></a></p> <p>Kentik Detect also enables customers such as Pandora to optimize user experience while cutting IP transit costs. Pandora’s feedback loop relies on Kentik APIs to observe traffic volumes and network conditions. Translated into actionable intelligence, the data from Kentik Detect helps determine (similar to Google’s Espresso SDN) the servers from which individual end-users are served. 
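</p> <p>Underneath all of these systems is the same primitive: reducing a firehose of sampled flow records to an aggregate that a controller can reason about, such as the site-to-site traffic matrix Express Backbone derives from sFlow. A minimal sketch, assuming records have already been decoded and tagged with ingress and egress sites (the field names and sampling rate here are illustrative assumptions):</p> <pre><code>from collections import defaultdict

SAMPLING_RATE = 1000  # assume 1-in-1000 packet sampling; scale bytes back up

flow_records = [
    # (ingress_site, egress_site, sampled_bytes) -- normally decoded from sFlow
    ("sjc", "iad", 1_500_000),
    ("sjc", "ord", 400_000),
    ("iad", "sjc", 2_250_000),
]

matrix = defaultdict(int)
for src_site, dst_site, sampled_bytes in flow_records:
    # Each sampled byte stands in for SAMPLING_RATE bytes on the wire.
    matrix[(src_site, dst_site)] += sampled_bytes * SAMPLING_RATE

for (src, dst), total in sorted(matrix.items()):
    print(f"{src} -> {dst}: {total / 1e9:.1f} GB")
</code></pre> <p>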
And it also guides automated manipulation of routing to control the transit involved in reaching those users. Pandora’s senior director of network operations and engineering, James Kelty, discusses the company’s use of Kentik Detect in <a href="https://www.youtube.com/watch?v=ktXv1sKHzfU&#x26;feature=youtu.be">this video</a>.</p> <p>Watch this space for further posts on how Kentik Detect’s big data-powered network traffic intelligence helps drive automation in our customers’ networks. We’re excited to help our customers achieve the full promise of SDN and be part of the move toward the “self-driving network.” If you’re looking to build more self-driving capabilities into your network, Kentik Detect is the network traffic intelligence component needed to provide actionable network data to your automation subsystems. You can learn more about Kentik Detect on our <a href="https://www.kentik.com/platform/kentik-detect/">product pages</a>, or let us show it to you by <a href="https://www.kentik.com/go/get-demo/">scheduling a demo</a>. Or try it out directly by starting a <a href="https://portal.kentik.com/">free trial today</a>.</p><![CDATA[News in Networking: ISPs, NFVs, and a SpaceX-based Internet]]><![CDATA[Following reports of a Russian BGP hijacking last week, in the headlines this week is new research suggesting a similar hijacking incident could take down bitcoin’s ecosystem. Meanwhile, Light Reading is talking network functions virtualization (NFV) and how network operators can overcome its challenges. SpaceX also makes our highlights, with news of more than 4,000 internet satellites it plans to launch. Read about these stories and more after the jump...]]>https://www.kentik.com/blog/news-in-networking-isps-nfvs-and-a-spacex-based-internethttps://www.kentik.com/blog/news-in-networking-isps-nfvs-and-a-spacex-based-internet<![CDATA[Michelle Kincaid]]>Wed, 10 May 2017 17:19:25 GMT<p>This week’s top story picks from the Kentik team.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6zHBEi3XfUkgyICKSOMs2c/4db8e63aa6a2dc81b4a00a936538f5e3/News_tablet-396w.png" alt="News_tablet-396w.png" class="image right" style="max-width: 300px;" /> <p>Following reports of a Russian BGP hijacking, making the headlines this week is new research that suggests a similar hijacking incident could take down bitcoin’s ecosystem. Meanwhile, Light Reading is talking network functions virtualization (NFV) and how network operators can overcome relevant challenges. SpaceX also makes our highlights, with news of more than 4,000 internet satellites it plans to launch. </p> <p>Here are those stories and more:</p> <ul> <li><a href="https://www.bleepingcomputer.com/news/security/isps-could-damage-bitcoin-ecosystem-if-they-wanted-to/"><strong>ISPs Could Damage Bitcoin Ecosystem If They Wanted To</strong></a> <strong>(Bleeping Computer)</strong> Today most ISPs know about Border Gateway Protocol (BGP) hijacking. To add to those risks, newly published research suggests a hijacking incident could take down the Bitcoin ecosystem.</li> <li><a href="http://www.lightreading.com/nfv/nfv-mano/nfvs-major-movements/d/d-id/732527?itc=lrnewsletter_lrdaily"><strong>NFV’s Major Movements</strong></a> <strong>(Light Reading)</strong> When it comes to network functions virtualization (NFV), network operators need more support.
According to the editor, “There are [only] a few key areas where operators can engage with collaborative bodies and independent organizations to further their strategies, particularly when it comes to multi-vendor testing and deciding on an all-important management and orchestration strategy.”</li> <li><a href="http://insidebigdata.com/2017/05/05/shift-turn-key-big-data-intelligence/"><strong>The Shift to Turn-Key Big Data Intelligence</strong></a> <strong>(InsideBigData)</strong> It’s still early innings for big data, according to this article penned by Kentik’s Alex Henthorn-Iwane. Two major aspects of maturation are defining big data’s future: the journey from open source to SaaS, and the journey from bespoke business intelligence to real-time operational use cases.</li> <li><a href="http://www.cio.com/article/3194609/software/appdynamics-explains-logic-behind-that-37-billion-cisco-acquisition.html?idg_eid=ef5d8103c0b492d0599db067e86d7910&#x26;email_SHA1_lc=11769875a7f08e6a79a7a30d51400a0c796aaa22&#x26;cid=cio_nlt_cio_enterprise_applications_2017-05-08&#x26;utm_source=Sailthru&#x26;utm_medium=email&#x26;utm_campaign=CIO%20Enterprise%20Applications%202017-05-08&#x26;utm_term=cio_enterprise_applications"><strong>AppDynamics explains logic behind that $3.7 billion Cisco acquisition</strong></a> <strong>(CIO)</strong> At a recent user conference, AppDynamics CEO David Wadhwani said one of the reasons the company was acquired was that “Cisco realized through the partnership conversations that we were having that the data model could make their networks even smarter.”</li> <li><a href="http://www.cnbc.com/2017/05/04/spacex-internet-satellites-elon-musk.html"><strong>SpaceX details internet satellite plans</strong></a> <strong>(CNBC)</strong> Beginning in 2019, SpaceX will launch over 4,000 internet satellites. According to CNBC, SpaceX thinks the U.S. “lags behind other developed nations in broadband speed and price competitiveness, while many rural areas are not serviced by traditional internet providers. The company’s satellites will provide a “mesh network” in space that will be able to deliver high broadband speeds without the need for cables.”</li> <li><a href="http://www.computerweekly.com/news/450418305/HSBC-adopts-cloud-first-strategy-to-solving-big-data-business-problems?utm_content=control&#x26;utm_medium=EM&#x26;asrc=EM_ERU_76743007&#x26;utm_campaign=20170508_ERU%20Transmission%20for%2005/08/2017%20(UserUniverse:%202362553)&#x26;utm_source=ERU&#x26;src=5633007"><strong>HSBC adopts cloud-first strategy to solving big data business problems</strong></a> <strong>(ComputerWeekly)</strong> Managing data is a core competency for international banking organization HSBC.
In this article, the bank’s Chief Architect David Knott talks about how machine learning, big data analytics, and cloud are changing the way HSBC does business.</li> <li><a href="https://www.technologyreview.com/s/604251/a-bug-fix-that-could-unlock-the-web-for-millions-around-the-world/?set=607833&#x26;utm_source=MIT+Technology+Review&#x26;utm_campaign=3e06bab595-The_Download&#x26;utm_medium=email&#x26;utm_term=0_997ed6f472-3e06bab595-154823461"><strong>A ‘Bug Fix’ That Could Unlock the Web for Millions Around the World</strong></a> <strong>(MIT Tech Review)</strong> A new study from the Internet Corporation for Assigned Names and Numbers (ICANN), which maintains the list of valid Internet domain names, suggests “companies that do business online are missing out on billions in annual sales thanks to a bug that is keeping their systems incompatible with Internet domain names made of non-Latin characters.”</li> </ul><![CDATA[Package Tracking for the Internet]]><![CDATA[Without package tracking, FedEx wouldn't know how directly a package got to its destination or how to improve service and efficiency. 25 years into the commercial Internet, most service providers find themselves in just that situation, with no easy way to tell where an individual customer's traffic exited the network. With Kentik Detect's new Ultimate Exit feature, those days are over. Learn how Kentik's per-customer traffic breakdown gives providers a competitive edge.]]>https://www.kentik.com/blog/package-tracking-for-the-internethttps://www.kentik.com/blog/package-tracking-for-the-internet<![CDATA[Jim Meehan]]>Tue, 09 May 2017 13:00:13 GMT<h3 id="kentik-introduces-bgp-ultimate-exit-traffic-analysis"><em>Kentik Introduces BGP Ultimate Exit Traffic Analysis</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/76SpJdQplSSIkEYsSQYGws/712b3c99b548ee7d22912c36abb05418/PackTrack-Route_map-492w.png" alt="PackTrack-Route_map-492w.png" class="image right" style="max-width: 300px;" /> <p>Imagine for a second what FedEx would be like without package tracking. Generally speaking, packages would probably still be delivered reliably, but sometimes before a package reached its ultimate destination it would be sent all over the map to get where it’s going. Or maybe that would happen more than just “sometimes.” Without the visibility and analytics provided by tracking data, there would be no way to know, nor any way to leverage that data to improve delivery times, reduce cost, or allocate load across the paths and system components that serve various customers.</p> <p>Carriers, transit providers, and other entities that generate revenue by delivering IP traffic on behalf of others are in many ways analogous to FedEx. But believe it or not, many of these services have been operating for 25 years now — since the advent of the commercial Internet — without the IP equivalent of package tracking. It’s not their fault, but rather a reflection of what’s been available to do the job. Until now, neither existing commercial options nor hand-built in-house systems have been able to address more than bits and pieces of what providers need.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5YfZnqfDhKKKmS0usmIkkc/bdf04a3542ef9d0f1c27b3f5d7ba3532/PackTrack-Unobtanium-400w.png" alt="PackTrack-Unobtanium-400w.png" class="image right" style="max-width: 300px;" /> <p>Effective tracking requires end-to-end understanding, and that’s the component that’s been especially hard to get.
You need to be able to look at some subset of traffic entering the network on one side and see where and to whom that traffic exits on the other side. With that type of analysis, you can quantify not only traffic volume but also traffic “distance.” The combination of those two metrics enables an operator to understand the “cost” for that traffic subset. Cost analysis at that level of granularity is one of the “holy grails” of network traffic analysis, having so far remained — like unobtanium — just beyond reach.</p> <p>Kentik’s release of “Ultimate Exit” functionality changes all that, bringing the seemingly unobtainable not only within reach but — even better — within the framework of a powerful, comprehensive network traffic visibility solution. In a nutshell, Ultimate Exit uses BGP to enhance every flow record (NetFlow, sFlow, IPFIX, etc.) ingested into the Kentik Data Engine with two new fields that represent the Device (router) and Site (PoP) where that flow will exit the network to an adjacent autonomous system (AS). That enables service providers using Kentik Detect to answer a key business question: where and in what quantity is a particular customer’s traffic traversing our network?</p> <h4 id="using-ultimate-exit">Using Ultimate Exit</h4> <img src="//images.ctfassets.net/6yom6slo28h2/3d0snvTLBeMO46qwEkIyOa/4012167d45f6428f4ddb3a457b8d90e1/PackTrack-shot_1_2-400w.png" alt="PackTrack-shot_1_2-400w.png" class="image right" style="max-width: 300px;" /> <p>Now that we understand how valuable end-to-end traffic tracking can be, let’s take a practical use case and see how Ultimate Exit is used in the Kentik Detect portal. We’ll assume that the interfaces everywhere across our network are appropriately labeled to indicate the customer to which they’re connected. That allows us, using the Filter pane in the Data Explorer sidebar (shown in part at upper right), to apply a simple Interface Description filter that selects all of the traffic entering the network from a particular customer (regardless of how many interfaces are involved).</p> <p>In the sidebar’s Query pane, meanwhile, we can use the dimension selector (shown at lower right) to group the matching traffic by source ASN and next-hop (adjacent) ASN. Also, we’ll set the display type (in the Display pane) to render the results as a Sankey Flow Diagram, which is a particularly useful visualization for this type of analysis. When we click Apply Changes, the diagram below will return within a couple of seconds.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/6hd9Ezzm0M6O0uEk4sW0Mo/a04f05a83737465568f46fa501686977/PackTrack-shot_3-800w.png" alt="PackTrack-shot_3-800w.png" class="image right" style="max-width: 300px;" /> <img src="//images.ctfassets.net/6yom6slo28h2/5zdCdzFWFOkAMiOSuSyS0M/27cc61f74745100535fe9151ebeeda8e/PackTrack-shot_4-400w.png" alt="PackTrack-shot_4-400w.png" class="image right" style="max-width: 300px;" /></p> <p>This Sankey diagram answers the “to whom” part of the question, and to be fair, that’s something that existing solutions might be able to do as well. The diagram also shows us that Customer XYZ sends us traffic from two different ASNs (1110 and 1111). Even so, key information is missing. We can’t yet see where (geographically or topologically) the traffic from XYZ is entering the network, and we also can’t see where it egresses to the next hop. As shown in the Query pane at right, we can fix this by adding a couple more dimensions to our view. Applying these changes, the diagram will be updated almost instantly.</p>
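<p>As a side note on mechanics, this is why the per-flow enrichment matters: once each record carries its exit device and site, groupings like the ones above become simple aggregations. A minimal sketch of the idea, with the lookup table, field names, and per-GB costs all invented for illustration (in the real feature, the exit is derived from BGP at ingest):</p> <pre><code>from collections import defaultdict

EXIT_BY_PREFIX = {          # dest prefix -> (exit_device, exit_site); illustrative
    "203.0.113.0/24": ("edge-rtr-1", "MIA"),
    "198.51.100.0/24": ("edge-rtr-7", "DAL"),
}

COST_PER_GB = {             # invented transport cost by (ingress, egress) site pair
    ("MIA", "MIA"): 0.00, ("MIA", "ATL"): 0.01, ("MIA", "DAL"): 0.03,
}

flows = [  # (customer, ingress_site, dest_prefix, bytes)
    ("cust-xyz", "MIA", "203.0.113.0/24", 9_000_000_000),
    ("cust-xyz", "MIA", "198.51.100.0/24", 2_000_000_000),
]

cost = defaultdict(float)
for customer, ingress, prefix, nbytes in flows:
    _device, egress = EXIT_BY_PREFIX[prefix]          # the enrichment step
    cost[customer] += (nbytes / 1e9) * COST_PER_GB[(ingress, egress)]

for customer, dollars in cost.items():
    print(f"{customer}: ${dollars:.2f} of transport cost")
</code></pre>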
<img src="//images.ctfassets.net/6yom6slo28h2/1bGGTA5cC622sE0OiQ8kU4/af1e7b0b425494bf11dd0c5e4b8685ad/PackTrack-shot_5-800w.png" alt="PackTrack-shot_5-800w.png" class="image right" style="max-width: 300px;" /> <p>Now we can see not only source ASN and next-hop ASN, but also the distribution of traffic across the ingress sites/PoPs (in the 2nd column), and for each of those, the distribution across the egress sites/PoPs (in the 3rd column). Mousing over different parts of the diagram helps us understand that, in this case, traffic is being accepted and delivered relatively efficiently. For example, traffic entering the MIA PoP is egressing from the same PoP (MIA) and an adjacent PoP (ATL), not from PoPs on other continents.</p> <h4 id="focusing-with-filters">Focusing with Filters</h4> <img src="//images.ctfassets.net/6yom6slo28h2/1VQAvkGEtWu2wCamY6QAao/0e7fe9792508acaea25a3f656a757bdd/PackTrack-shot_6-400w.png" alt="PackTrack-shot_6-400w.png" class="image right" style="max-width: 300px;" /> <p>As with any Kentik visualization, we can easily add filters for further drill-down into details. Let’s say, for example, that we want to refocus the visualization to look only at the traffic entering the DAL PoP (bottom of column 2 in the Sankey diagram above). We can do so by adding a filter for that PoP (the second filter shown at right), which narrows the resulting Sankey diagram as shown below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4nukwiybp6A8m6UwgCmI6Y/7f871670f315a597b0d51009e6c492b4/PackTrack-shot_7-800w.png" alt="PackTrack-shot_7-800w.png" class="image right" style="max-width: 300px;" /> <h4 id="ultimate-exit-use-cases">Ultimate Exit Use Cases</h4> <p>As we’ve seen from the above, Ultimate Exit gives Kentik Detect users a quite straightforward way to understand traffic patterns at the customer level. That’s nice, but what’s the business value of that information? One key use case, especially during contract negotiation, is to understand the relative cost of a given customer’s traffic. A customer whose traffic egresses mostly to domestic peers is going to cost a lot less than a customer whose traffic must be transported over a long-haul transmission network. Arming your sales team with this insight provides a rational basis for the pricing of transit services, and gives you a serious leg up over competitors.</p> <p>A key use case is to understand the relative cost of a given customer’s traffic.</p> <p>Another major use case is network optimization and discovery/resolution of network misconfiguration. Ultimate Exit has already been used by a Kentik customer to discover that traffic from one of their downstream customers was egressing primarily via paid transit when it should have been egressing mostly via settlement-free peers. Correcting that misconfiguration led to an overall 30% reduction in paid IP transit traffic at one of their PoPs.</p> <p>Enhancing each flow record with data about its exit router/site is just the beginning for the Ultimate Exit feature set. Our roadmap for the remainder of 2017 calls for several additional iterations, including the labeling of flows with exit interface information and allowing the operator to supply cost information for site-to-site pairs, so we can calculate true economic cost for any traffic subset.</p> <h4 id="learn-more-with-a-demo-or-free-trial">Learn More With a Demo or Free Trial</h4> <p>Ultimate Exit is just one aspect of Kentik Detect’s comprehensive provider-friendly feature set for traffic analytics.
To learn more, check out our <a href="https://www.kentik.com/kentik-detect/">product pages</a>, then <a href="#demo_dialog">request a demo</a>. Or dive right in with a <a href="#signup_dialog">free trial</a>; in 15 minutes you can be exploring your own network traffic in the Kentik Detect portal.</p><![CDATA[Internet2 Global Summit: Open Source Network Tools & DDoS Detection]]>https://www.kentik.com/blog/internet2-global-summit-open-source-network-tools-ddos-detectionhttps://www.kentik.com/blog/internet2-global-summit-open-source-network-tools-ddos-detection<![CDATA[Alex Henthorn-Iwane]]>Fri, 05 May 2017 14:50:08 GMT<p>In my last post from the Internet2 Global Summit, I covered some of the less techie topics. In this post, I wanted to briefly cover a couple of interesting tidbits related to current open source network visualization and how Kentik can complement Internet2’s forthcoming DDoS scrubbing service offering.</p> <h4 id="open-source-projects-for-network-traffic-and-performance-visualization">Open Source Projects for Network, Traffic and Performance Visualization</h4> <p>On the second day of the conference, I enjoyed a session called “Pragmatic Network Visualization,” which covered two open source projects presented by Sean Dilda from Duke University and Jennifer Schopf at Indiana University.</p> <p>Sean shared details about the Duke University-developed Cartographer tool. They use the tool to visualize their network, in particular how their devices interact with the network. It utilizes information gained from performing SSH connections to campus routers and switches to create maps of device-to-network interconnections as well as the path a packet takes through the network. They offer this tool from their website to university staff, who can search for devices by IP, MAC Address, subnet name, vrf name, or device name on the Cartographer site. While the tool is built against a Cisco-only network infrastructure, it bridges multiple different switch and router OSes. The tool isn’t yet portable or deployable outside of Duke, though that’s something they’re working on; you can see the presentation slides <a href="http://meetings.internet2.edu/media/medialibrary/2017/04/21/20170425-dilda-cartographer_MLy6UHT.pdf">here</a>.</p> <p>Another open source project is called <a href="http://www.netsage.global/">NetSage</a>, which Jennifer presented. The NetSage project is led by Schopf at IU, working with folks from UC Davis and the University of Hawaii. Its goal is to understand the behaviors of the NSF-funded international transit circuits using flow data such as sFlow, along with performance data from deployed instances of <a href="https://www.perfsonar.net/">perfSonar</a> — which is an open source, distributed network performance system used by science networks. NetSage utilizes open source archiving to store the data, and they’ve built some cool visualizations on top of it to help look at traffic patterns, sources and destinations of traffic, and where transmission issues are occurring on international links. The project is entering year three of a five-year, $5M funding grant from the National Science Foundation.</p> <h4 id="kentik-highly-complementary-to-new-ddos-scrubbing-service">Kentik Highly Complementary to New DDoS Scrubbing Service</h4> <p>Internet2 is launching a new, clean-pipe DDoS scrubbing service in conjunction with a partner commercial service provider company. The idea is that an institution detects an attack and sends an alert to the scrubbing service operations team to start scrubbing a particular prefix. The service provider then redirects the traffic to their scrubbing center, and clean traffic is returned on a VLAN within the institution’s existing Internet2 link.</p>
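<p>Mechanically, that handoff can be little more than a webhook: when detection fires, post the offending prefix to whatever interface the scrubbing operator exposes. A hypothetical sketch, with the endpoint and payload schema invented here since each provider defines its own interface:</p> <pre><code>import requests

SCRUB_API = "https://scrubbing.example.net/v1/mitigations"   # placeholder URL

def request_scrubbing(prefix, attack_type, token):
    """Ask the scrubbing service to divert and clean traffic for a prefix."""
    payload = {"prefix": prefix,            # e.g. "192.0.2.0/24"
               "reason": attack_type,       # e.g. "udp-amplification"
               "action": "start"}
    resp = requests.post(SCRUB_API, json=payload,
                         headers={"Authorization": f"Bearer {token}"},
                         timeout=15)
    resp.raise_for_status()
    return resp.json()

# Typically wired to an alerting webhook, so that detection can trigger
# mitigation without a human in the loop for well-understood attack types.
</code></pre>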
<p>The service is in a pilot phase through June 30, 2017, and is planned to be available for production use from July 1, 2017. Higher education institutions and research and education networks can pass down the service to their downstream networks (universities or K-12 systems). However, there isn’t anything restricting a K-12 system from subscribing to and paying for the service directly.</p> <p>If an institution doesn’t have on-premises DDoS detection capabilities, that can be procured by sending flow data to the service center, along with providing SNMP access to edge routers. However, the cost of that detection service add-on isn’t trivial.</p> <p>Kentik offers a highly complementary alternative that covers network operations, peering analytics, network anomaly and DDoS detection, and triggered alerting to the scrubbing service, at a lower overall cost. For existing R&#x26;E network customers of Kentik, this is a sweet side benefit to an existing investment. For potential R&#x26;E network teams who are considering using Internet2’s scrubbing service but don’t have DDoS detection capabilities, this is an ideal time to evaluate Kentik since you can get a complete network analytics and detection platform for less than detection alone.</p> <p>If you’d like to learn more about Kentik Detect, visit our <a href="https://www.kentik.com/kentik-detect/">product page</a> or <a href="#demo_dialog">request a demo</a>. Or if you know you’re ready to try big data-powered network visibility out, you can be up and running with a fully functional <a href="#signup_dialog">free trial</a> in fifteen minutes.</p><![CDATA[News in Networking: Transatlantic Typos, Money in the Cloud, and Pandora's New Visibility]]><![CDATA[Digital transformation is underway. If you don’t believe it, read this week’s WSJ recap of a panel on the topic. You should also check out the earnings reports released this week by the big-three public cloud providers, who all saw spikes. Lastly, watch Pandora talk about how the music-streaming company is getting better network performance for better customer experience. All that and more after the jump...]]>https://www.kentik.com/blog/news-in-networking-transatlantic-typos-money-in-the-cloud-and-pandoras-new-visibilityhttps://www.kentik.com/blog/news-in-networking-transatlantic-typos-money-in-the-cloud-and-pandoras-new-visibility<![CDATA[Michelle Kincaid]]>Wed, 03 May 2017 17:35:22 GMT<p>This week’s top story picks from the Kentik team.</p> <p><a href="https://www.youtube.com/watch?v=ktXv1sKHzfU&#x26;feature=youtu.be"><img src="//images.ctfassets.net/6yom6slo28h2/5lc6OOiVOwuEMMUEuWyAgq/d13faa3c65005a31c0926e2f92cb41d5/Pandora_Kelty-500w.png" alt="Pandora_Kelty-500w.png" class="image right" style="max-width: 300px;" /></a></p> <p>Digital transformation is underway. If you don’t believe it, read this week’s Wall Street Journal recap of a panel on how VCs are viewing this era compared to that of the Industrial Age. You can also look no further than the earnings reports released this week by the big-three public cloud providers, who all saw spikes.
Lastly, listen to Pandora’s head of network operations and engineering talk about how the music-streaming company is getting better network performance for better customer experience.</p> <p>Here are those articles and more from this week:</p> <ul> <li><a href="https://blogs.wsj.com/cio/2017/04/28/digital-transformation-requires-rethinking-vc-says/"><strong>Digital Transformation Requires Rethinking, VC Says</strong></a> (Wall Street Journal) Digital transformation is here. Don’t believe it? Ask Albert Wenger, a partner with venture capital firm Union Square Ventures, who joined a recent panel and stated: “The thing that’s alarming to me is a refusal, generally, to acknowledge that the digital transformation is as profound as the industrial transformation.”</li> <li><a href="https://www.nytimes.com/2017/04/27/technology/quarterly-earnings-cloud-computing-amazon-microsoft-alphabet.html?em_pos=small&#x26;emc=edit_tu_20170428&#x26;nl=bits&#x26;nl_art=0&#x26;nlid=80002889&#x26;ref=headline&#x26;te=1"><strong>Cloud Produces Sunny Earnings at Amazon, Microsoft, Google</strong></a> (The New York Times) The big three public cloud providers came out on top in Q1. Amazon reported that most of its overall profits during the quarter came from AWS. Microsoft Azure grew by 93 percent from the previous year’s quarter. Alphabet doesn’t break out Google’s cloud numbers, but the segment was up 49 percent from the same quarter last year.</li> <li><a href="https://www.youtube.com/watch?v=ktXv1sKHzfU&#x26;feature=youtu.be"><strong>Pandora Turns Up Volume on User Experience with Network Traffic Intelligence</strong></a> (Video) It’s not surprising that music-streaming service Pandora has built out a significant network infrastructure to connect a global community of musicians and listeners. However, without sufficient depth and breadth of visibility and insight, they couldn’t achieve the best possible network performance, cost-efficiency, and protection. Watch how they’ve added network visibility in this video.</li> <li><a href="http://www.theregister.co.uk/2017/05/02/telia_hiccups_cloudflare_falls_over/"><strong>Transatlantic link typo by Sweden’s Telia broke Cloudflare in the US</strong></a> (The Register) Typos can break the internet. For proof, on Tuesday, Cloudflare reported a slowdown in its network traffic due to human error. A typo in one of its transatlantic internet backbone links caused the issue.</li> <li><a href="https://code.facebook.com/posts/1782709872057497"><strong>Building Express Backbone: Facebook’s new long</strong>-<strong>haul network</strong></a> (Facebook Blog) Facebook says it’s improving upon some of the technical constraints of the “classic” backbone network.
In the process, the social media giant said it learned some of the “growth pains of keeping the network state consistent and stable,” and that “the centralized real-time scheduling has proven to be simple and efficient in the end.”</li> <li><a href="http://www.computerweekly.com/news/450418006/DDoS-a-top-security-and-business-issue-study-shows"><strong>DDoS a top security and business issue, study shows</strong></a> (ComputerWeekly) A new study from Neustar found that security and business leaders want DNS at the core of information security strategies. That’s in part due to the rise of DDoS attacks.</li> </ul><![CDATA[Top Reasons to Leave Legacy NPM Behind]]><![CDATA[NPM appliances and difficult-to-scale enterprise software deployments were appropriate technology for their day. But 15 years later, we're well into the era of the cloud, and legacy NPM approaches are far from the best available option. In this post we look at why it's high time to sunset the horse-and-buggy NPM systems of yesteryear and instead take advantage of SaaS network traffic intelligence powered by big data.]]>https://www.kentik.com/blog/top-reasons-to-leave-legacy-npm-behindhttps://www.kentik.com/blog/top-reasons-to-leave-legacy-npm-behind<![CDATA[Alex Henthorn-Iwane]]>Mon, 01 May 2017 13:00:30 GMT<h2 id="why-its-time-to-move-on"><em>Why It’s Time to Move On…</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/9yAasaYMvKyYGAMwkQuUe/14c6ba945a3f2de17a957690920207a0/Broke_wagon-sepia-500w.png" class="image center no-shadow" style="max-width: 500px" /> <p>NPM appliances and difficult-to-scale enterprise software deployments were appropriate technology for their day. But 15 years later, we’re well into the era of the cloud, and legacy NPM approaches are far from the best available option. So it’s high time to sunset the horse-and-buggy NPM systems of yesteryear and instead take advantage of SaaS network traffic intelligence powered by big data. With that in mind, here are a dozen things about legacy NPM that underscore the new reality: it’s time to migrate to a modern, advanced NPM solution.</p> <p>1. Low fidelity</p> <p>Legacy NPM systems are built on low-scale, single-server software architectures with highly constrained computing, memory, and storage resources. As a result, they can’t retain very much data. So they have to roll everything into summaries and discard the underlying details. That robs you of data fidelity when you really need to dig in deep. It’s not just that it’s difficult or slow to figure things out — it’s impossible, because the details are no longer there.</p> <p>2. Highly constrained ‘analytics’</p> <p>Legacy systems have limited computational power and memory. Even with relatively small sets of detailed data, they don’t allow you the freedom to ask the questions you need answered, particularly the ones you didn’t anticipate in advance. Instead, they limit your views to pre-defined reports.</p> <p>3. Dead-end troubleshooting</p> <p>When you don’t have a lot of details and you don’t have much analytical freedom, it doesn’t take long — three clicks? — for your troubleshooting workflows to hit a dead end.
That leaves you in a blind alley with only your intuition and guesswork to figure things out.</p> <p>4. Overly simplistic anomaly detection</p> <p>With most NPM products, the alerting is as simplistic as the reporting. You’re generally able to define only one field and one metric to monitor, such as bps per destination IP. This simplistic approach leads to lots of false positives by alerting on normal traffic. When operators “de-tune” triggers to avoid a sea of red in their alarm management views, they get false negatives instead, missing significant anomalies and attacks.</p> <p>5. Operational blindness</p> <p>With legacy tools, your users will often notice issues before you do. Even when problems are reported, they disappear — like footprints in the sand — before you can figure them out. Without complete, detailed historical data, there is no way to go back and look at the details from when it was happening.</p> <p>6. Inaccurate or no DDoS detection</p> <p>Most NPM tools don’t offer DDoS detection. Those that do are highly inaccurate because they can’t do more than track a handful of different packet types. The result is tons of false negatives, allowing attacks to go undetected until they are wreaking havoc.</p> <p>7. Siloed tools and costly infrastructure</p> <p>If all of the limitations we’ve already covered aren’t frustrating enough, siloed tools — each requiring its own copy of the same dataset — are sure to multiply the frustration. Multiple tools mean separate UIs, little to no usable APIs, islands of summarized partial visibility. And once you accumulate a number of these siloes, you have to buy costly packet brokers so they can all share limited span ports. Now you’re running a network infrastructure to support an ecosystem of siloed tools. Costly and ineffective.</p> <p>8. Fragmented visibility</p> <p>With all those siloed NPM tools, your visibility is fragmented. Maybe your team will swivel-chair between different systems and laboriously calculate correlations by eyeball. Or maybe they’ll dump .csv files and work in Excel. Or maybe operators will just stay siloed in their different knowledge sets.</p> <p>9. Painful upgrades</p> <p>Oh boy. Now that you’ve got all those tools and infrastructure, you’re in for some fun as various software upgrades need to be deployed. These are time-consuming and risky upgrades, so you only do them occasionally, if at all.</p> <p>10. Feature lag, training gaps, and no mercy</p> <p>With only occasional software upgrades being deployed, you won’t be getting new functionality very often. You’ll be months to a year or more behind the curve. This means the upgrades are more dramatic, time-consuming, and risky. You’ll be doing serial minor to major OS upgrades, and all the deployed versions mean that it’s impossible for the vendor to ensure that all upgrade paths work. But the vendor says that’s your fault for waiting so long. And when you do bite the bullet and catch up on upgrades, you face a whole raft of changes, which creates a training gap.</p> <p>11. EOLs</p> <p>Randomly dropped into all of those painful upgrade cycles, EOLs turn the complexity and frustration knobs to eleven. Now you have to go replace your NPM boxes with the latest generation, usually at the same cost as (or more than) the original products that you purchased. And you thought software upgrades were painful?</p> <p>12. Cloud and big data head fakes</p> <p>Adding to the cruelty, legacy NPM vendors will do anything to keep their outdated architectures pumping out the dollars.
So they’ll offer “cloud” and “big data” solutions that are at best bandaids and in some cases just pile more burden on you. Some NPM vendors offer VM-based versions of their venerable appliances, which means that you get to deal with all of the infrastructure issues. Other vendors will try to convince you to tack an open source big data “project” onto your deployment, while retaining all of the existing appliances.</p> <p>If the above symptoms of legacy NPM sound all too familiar, there is a genuine solution: Kentik Detect, the industry’s first big data, multi-tenant SaaS for network traffic analytics. Kentik Detect unifies network visibility — operational troubleshooting, highly accurate anomaly and DDoS detection, and peering and capacity planning — into a single cost-effective solution that delivers ultra-fast answers to ad-hoc questions. Learn more on our <a href="https://www.kentik.com/kentik-detect/">product pages</a>. <a href="#demo_dialog">Request a demo</a> or start a <a href="#signup_dialog">free trial</a> today and be up and running in fifteen minutes.</p><![CDATA[Internet2 Global Summit: The Future of R&E Networks & ITaaS]]>https://www.kentik.com/blog/internet2-global-summit-the-future-of-re-networks-itaashttps://www.kentik.com/blog/internet2-global-summit-the-future-of-re-networks-itaas<![CDATA[Alex Henthorn-Iwane]]>Fri, 28 Apr 2017 19:25:03 GMT<p>I attended Internet2’s Global Summit this week, an event that gathers a broad audience from Research and Education (R&#x26;E) networks from around the world, from engineers up to C-level attendees. The conference was jam-packed with excellent presentations, to which it’s impossible to do full justice. In this post, I’ll cover a couple of topics from the executive track on the first day of the conference.</p> <h3 id="the-future-of-re-networks-better-sharing-of-resources"><strong>The Future of R&#x26;E Networks: Better Sharing of Resources</strong></h3> <p>One session was a panel consisting of Wendy Huntoon, President &#x26; CEO of Kinber, Ibrahim Ha, head of EMEA Infra Connectivity Programs at Facebook, and Chris Sedore, CEO of NYSERNET. A major theme of this session was more efficiently sharing resources. Wendy addressed the need to ensure that R&#x26;E leaders establish strong communications and onboarding for researchers to become aware of and get access to infrastructure resources and services. Without an easy way to onboard, researchers tend to give up on central IT teams and procure their infrastructure resources directly. Chris emphasized the need to cut down on needless duplication of investments such as fiber runs, particularly when looking at constrained resources. As an example, he mentioned that his fiber map looked remarkably similar to Internet2’s map in many places. Ibrahim addressed how Facebook overcame barriers to getting infrastructure to emerging countries in regions like Africa by collaborating with telecom and other webscale companies to share the costs of building subsea cable infrastructure to create access to bandwidth.
One executive from AARNet (Australia’s version of Internet2) commented that R&#x26;E networks could gain tremendously from collaborating with Facebook and other webscale companies around fiber and other infrastructure-related topics.</p> <h3 id="id-rather-remove-my-appendix-with-a-spork"><strong>“I’d Rather Remove My Appendix with a Spork”</strong></h3> <p>In a separate session, Michael McCartney, the CIO of Purdue University’s main campus, gave an insightful and humorous talk, telling the story of a successful IT as a Service transformation. When he started out, there were 67 distributed data centers on campus. He made it his mission to win researchers over to a new service model that utilized pre-planned scale-out purchases, one-time fees for what was in effect a five-year subscription, and a cloud-like model where unutilized capacity could be allocated on an as-needed basis to others. He made that model successful by hiring liaisons from academic backgrounds, who intimately understood researchers’ requirements and helped them get what they needed. At first, folks on campus were more than skeptical, with one researcher telling Michael, “I’d rather remove my appendix with a spork than let you touch my computers.” But over time, he won them over, and the vast majority of the distributed data center owners happily moved to and stayed with the new services. Elias Eldayrie, CIO of the University of Florida, told a very similar story in his talk, with similarly impressive results. He recounted how UF’s central infrastructure services grew from supporting a tiny fraction of overall research spending to something like 75% over the course of a few years. In the context of a large research university that has $750M+ in annual research funding, that’s pretty impressive. Overall, both of their teams transformed central IT infrastructure services from minor to major players in supplying infrastructure to their campuses. In my next post on Global Summit, I’ll cover some nifty open source projects and how Kentik complements Internet2’s forthcoming DDoS scrubbing service.</p><![CDATA[News in Networking: ONUG, Bufferbloat, and DDoSing with IoT]]><![CDATA[Today we’re launching a weekly blog series called “News in Networking.” Tune in each week for a quick roundup of industry news that seems noteworthy to the Kentik team. This week’s highlights include a look at hot topics at the Open Networking User Group (ONUG) Spring 2017 conference, an article on bufferbloat (yes, it's real) and the causes behind a slow internet, a list of IoT-enabled DDoS attacks that underscore security risks, and more...]]>https://www.kentik.com/blog/news-in-networking-onug-bufferbloat-and-ddosing-with-iothttps://www.kentik.com/blog/news-in-networking-onug-bufferbloat-and-ddosing-with-iot<![CDATA[Michelle Kincaid]]>Wed, 26 Apr 2017 18:41:35 GMT<p>This week’s top story picks from the Kentik team.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3nr4EFGlxYwSYQmOWWWMoq/3b27deacedc85631c52bd2b265cb1328/News_tablet-396w.png" alt="News_tablet-396w.png" class="image right" style="max-width: 300px;" /> <p>We’re launching a weekly blog series called “News in Networking.” Starting today, every Wednesday we’ll give you a glance at the industry news our team at Kentik is reading.
If you like our weekly aggregation and want to see it more regularly, or if you want to weigh in on any of the stories, tweet us at @kentikinc!</p> <p>Here’s what’s happening this week:</p> <ul> <li><a href="http://searchsdn.techtarget.com/news/450417298/ONUG-Spring-2017-conference-issues-include-barriers-to-cloud-adoption?utm_medium=EM&#x26;asrc=EM_NLN_76113637&#x26;utm_campaign=20170424_Riverbed%20widens%20WAN%20portfolio%20with%20Xirrus&#x26;utm_source=NLN&#x26;track=NL-1817&#x26;ad=913984&#x26;src=913984"><strong>ONUG Spring 2017 conference issues include barriers to cloud adoption</strong></a> (SearchSDN) The Open Networking User Group (ONUG) Spring 2017 conference is underway in San Francisco. Kentik CTO Dan Ellis is on a panel today on how traditional tools are inadequate for modern networks. This TechTarget SearchSDN article says you can also expect to hear Amazon and Microsoft talking about barriers to enterprise cloud adoption.</li> <li><a href="http://www.networkworld.com/article/3107744/internet/the-hidden-cause-of-slow-internet-and-how-to-fix-it.html?idg_eid=ef5d8103c0b492d0599db067e86d7910&#x26;email_SHA1_lc=11769875a7f08e6a79a7a30d51400a0c796aaa22&#x26;cid=ndr_nlt_Insider%20Normal%20Subscribers_2017-04-23&#x26;utm_source=Sailthru&#x26;utm_medium=email&#x26;utm_campaign=Best%20of%20Insider%202017-04-23&#x26;utm_term=Insider%20Normal%20Subscribers"><strong>The hidden cause of slow Internet and how to fix it</strong></a> (NetworkWorld) A TCP-related phenomenon known as “bufferbloat” is a real thing, according to NetworkWorld’s Phil Hippensteel, who interviewed Jim Gettys, the Google computer programmer who uncovered it. “What is not fully understood is the extent of its impact on the normal flow of Internet traffic,” reports Hippensteel.</li> <li><a href="http://www.crn.com/slide-shows/internet-of-things/300084663/8-ddos-attacks-that-made-enterprises-rethink-iot-security.htm"><strong>8 DDoS Attacks That Made Enterprises Rethink IoT Security</strong></a> (CRN) DDoS attacks via IoT devices have wreaked havoc on universities, Netflix, Twitter, and even banks in Russia. CRN listed those attacks among the eight that should make enterprises rethink IoT security. To validate the spike in attacks, CRN cites Neustar, which “mitigated 40 percent more DDoS attacks from January through November, compared to the year earlier.”</li> <li><a href="https://www.wsj.com/articles/verizon-at-t-in-billion-dollar-bidding-war-for-5g-spectrum-1493146927"><strong>Verizon, AT&#x26;T in billion</strong>-<strong>dollar bidding war</strong></a> (Wall Street Journal) Verizon bid $1.8 billion for Straight Path Communications, reported WSJ’s Thomas Gryta and Ryan Knutson. Earlier this month, AT&#x26;T offered $1.6 billion for Straight Path, a potentially critical player in the move towards 5G networks.</li> <li><a href="http://p.nytimes.com/email/re?location=pMJKdIFVI6pghfX2HXfSzxRpdoyDWYNWReyZzkKFb2WW7pwd0XyLzn1+CRzLXkafRGyOfwzvVcijfpxsHbg5qJtp1GaLYfXv6KNdoUvvWmws+Gc7bgnlJt1JO32/DKYdzgqwDJEhs+0Op1fLfWulyZVGHq4hRRLBbMUk9Z5S0bIofumKVUbCXIEzMHlo5BKOZJvx+OpG6Ps=&#x26;campaign_id=688&#x26;instance_id=96349&#x26;segment_id=107199&#x26;user_id=ef5d8103c0b492d0599db067e86d7910&#x26;regi_id=80002889"><strong>F.C.C. Leader Seeks Tech Companies’ Views on Net Neutrality</strong></a> (The New York Times) FCC Chairman Ajit Pai recently met with Cisco, Oracle, Intel, and Facebook leaders, seeking feedback on his plans to roll back net neutrality rules.
Pai’s proposal is set to be announced Wednesday afternoon.</li> </ul><![CDATA[Accurate Visibility with NetFlow, sFlow, and IPFIX]]><![CDATA[Most of the testing and discussion of flow protocols over the years has been based on enterprise use cases and fairly low-bandwidth assumptions. In this post we take a fresh look, focusing instead on the real-world traffic volumes handled by operators of large-scale networks. How do NetFlow and other variants of stateful flow tracking compare with packet sampling approaches like sFlow? Read on...]]>https://www.kentik.com/blog/accurate-visibility-with-netflow-sflow-and-ipfixhttps://www.kentik.com/blog/accurate-visibility-with-netflow-sflow-and-ipfix<![CDATA[Alex Henthorn-Iwane]]>Mon, 24 Apr 2017 13:00:13 GMT<h2 id="comparing-flow-protocols-for-real-world-large-scale-networks">Comparing flow protocols for real-world large-scale networks</h2> <p>A lot of ink has been spilled over the years on the topic of flow protocols, specifically how they work and their relative accuracy. Historically, however, most of the testing, opinion, and coverage has been based on enterprise use cases and fairly low-bandwidth assumptions. In this post we’ll take a new look, focusing instead on use cases and bandwidths that are more representative of — and relevant to — large internet edge and datacenter operations.</p> <p>One of the things that can be rather confusing is that there are a lot of different flow protocol names. Behind the many variants there are actually only two major technologies related to capturing and recording traffic-flow metadata. The first is based on stateful flow tracking, and the other is based on packet sampling. Both approaches are explained below. But first, based on that fundamental distinction, let’s classify common flow protocols accordingly:</p> <ul> <li><strong>Stateful flow tracking</strong>:</li> </ul> <p>— <em>NetFlow</em>: Originally developed by Cisco in 1996. The most used versions are v5 and v9. — <em>NetFlow by another name</em>: Other vendors support NetFlow but call it something else, including J-Flow (Juniper), RFlow (Redback/Ericsson), cFlowd (Alcatel), Netstream (3Com/HP/Huawei). — <em>IPFIX</em>: The IETF standards-based successor to NetFlow, sometimes referred to as NetFlow v10.</p> <ul> <li><strong>Packet sampling</strong>:</li> </ul> <p>— <em>sFlow</em>: Launched in 2003 by a multi-vendor group (sflow.org) that now includes Alcatel, Arista, Brocade, HP, and Hitachi.</p> <h2 id="flow-tracking-vs-packet-sampling">Flow tracking vs. packet sampling</h2> <p>So what’s the difference between the stateful flow tracking of NetFlow and the packet sampling of sFlow? A while back, Kentik’s CEO Avi Freedman wrote an excellent two part blog on <a href="https://www.kentik.com/netflow-sflow-and-flow-extensibility-part-1/">flow protocol extensibility</a>, so I’ll leverage his words:</p> <p><em>Routers and switches running NetFlow/IPFIX designate a collection of packets as a flow by tracking packets, typically looking for packets that come from and go to the same place and share the same protocol, source and dest IP address, and port numbers. This tracking requires CPU and memory — in some circumstances, a huge amount of it. For example, with a forged source-address DDoS attack, every packet can be a flow, and routers have to try to maintain massive tables on the fly to track those flows! Also, to cut down on CPU and network bandwidth, flows are usually only “exported” on average every 10 seconds to a few minutes. 
This can result in very bursty traffic on sub-minute time scales.</em></p> <p><em>sFlow, on the other hand, is based on interface counters and flow samples created by the network management software of each router or switch. The counters and packet samples are combined into “sFlow datagrams” that are sent across the network to an sFlow collector. The preparation of sFlow datagrams doesn’t require aggregation and the datagrams are streamed as soon as they are prepared. So while NetFlow can be described as observing traffic patterns (“How many buses went from here to there?”), with sFlow you’re just taking snapshots of whatever cars or buses happen to be going by at that particular moment. That takes less work, meaning that the memory and CPU requirements for sFlow are less than for NetFlow/IPFIX.</em></p> <h2 id="which-is-more-accurate">Which is more accurate?</h2> <p>Most published tests and blogs have historically focused on two major points of comparison between NetFlow and sFlow:</p> <ul> <li>Aggregate traffic volume accuracy on a per-interface basis.</li> <li>Per-host traffic volume accuracy as compositional elements of the interface traffic.</li> </ul> <p>These tests have mostly been performed at very low traffic volumes (under 10Mbps in multiple tests). In those test scenarios, it’s been repeatedly observed that aggregate interface volumes are essentially the same for both protocols. Drilling down into the host-level details, however, tests with traffic rates in the Kbps range often indicate that NetFlow more reliably captures traffic flow statistics at greater granularity.</p> <h2 id="when-traffic-volume-is-low-netflow-can-capture-the-details-of-every-flow">When traffic volume is low, NetFlow can capture the details of every flow</h2> <p>This observation makes sense when traffic volume is low, because in that situation it’s not too taxing for NetFlow to collect every flow (as distinct from sampling the flows) and thereby reliably capture all flow statistics. If you’re looking to examine traffic coming from individual client machines in an SMB or even an enterprise network setting, then that increased granularity is helpful.</p> <p>At higher traffic volumes, however, network engineers and operators are focused on different issues, and the need for per-flow granularity starts to fade significantly. For example, if you’re operating a network for an ISP, cloud or hosting provider, or Web enterprise, it wouldn’t be unusual to be pushing tens to hundreds of Gbps through your Internet edge devices. For large carrier ISP operations located in major metropolitan areas, with highly consolidated IP edge points of presence, it’s fairly commonplace for individual routers to push 1Tbps. If you’re a hosting, cloud, or Web provider, the types of hosts you’re concerned with are servers, which are each usually pushing traffic at tens to hundreds of Mbps.</p>
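<p>At these rates, the statistics work strongly in sampling’s favor: the relative error of a scaled-up estimate shrinks roughly as 1/sqrt(n) for n samples, so any flow that matters at multi-Gbps scale is measured to within a few percent. A back-of-the-envelope sketch:</p> <pre><code>from math import sqrt

def sampling_error(packets, rate=1000):
    """Approximate relative error of a 1-in-`rate` sampled traffic estimate."""
    samples = packets / rate
    return 1 / sqrt(samples) if samples else float("inf")

# Roughly: a minute of 1 Gbps at ~800-byte packets is on the order of 10M packets.
for packets in (10_000, 1_000_000, 10_000_000):
    print(f"{packets:>11,} pkts -> ~{sampling_error(packets):.1%} error")
# Kbps-scale flows are noisy (~30%); Gbps-scale flows land within ~1%.
</code></pre>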
If you’re a hosting, cloud, or Web provider, the types of hosts you’re concerned with are servers, which are each usually pushing traffic at tens to hundreds of Mbps.</p> <p>Our SaaS platform, Kentik Detect, collects and stores upwards of 125B flow records per day from 100+ customers operating networks whose use cases are weighted toward high-volume Internet edge and east-west traffic, both intra-datacenter and inter-datacenter. Based on their feedback and our own observation and analysis, our take is that for these applications the accuracy of sFlow and NetFlow/IPFIX is essentially the same. In addition, since packet samples are sent immediately while NetFlow records can queue for minutes before being sent, sFlow has the advantage of delivering telemetry data with lower latency.</p> <h2 id="flow-retention-and-analytical-scale">Flow retention and analytical scale</h2> <p>Whether you’re running an enterprise, web company, cloud/hosting, ISP, or mobile/telco network, you are likely running significant volumes of traffic, which makes it critical to be able to retain flow records and analyze them in detail at scale. Most flow collectors of any note can handle a fairly high level of streaming ingest, but that’s where their scalability ends. With single-server or appliance systems, detail at the flow-record level isn’t retained for long, if at all. Instead you get summaries of top talkers, etc. We talk to tons of users of these legacy tools and they tell us what we already know, which is that those few summary reports are neither comprehensive nor flexible enough to be informative about anything below the surface.</p> <h2 id="kentik-solves-the-problem-of-inflexible-summary-reports">Kentik solves the problem of inflexible summary reports</h2> <p>Kentik was created to solve this problem. Our platform ingests massive volumes of flows, correlates them into a time-series with BGP, GeoIP, performance, and other data, and retains records for 90 days (more by arrangement). You can perform ad-hoc analysis on billions of rows of data without any predefined limits, using multiple group-by dimensions and unlimited nested filtering. You can run queries and get answers back in a few seconds, pivot your analysis, drill down, zoom in or out, and keep getting answers fast until you get just the data you need to make a decision. There’s much more to our solution; for additional information check out our website’s <a href="https://www.kentik.com/product/kentik-platform/">platform pages</a>.</p> <p>If you’re not yet familiar with how Kentik applies the power of big data to flow analysis, we’d love to show you; contact us to <a href="#demo_dialog">schedule a demo</a> to walk you through it. Or you can dive in directly by starting a <a href="#signup_dialog">free trial</a>; within 15 minutes you can be in the Kentik Detect portal looking at traffic on your own network. 
Either way, if you’re operating a serious network then Kentik’s scale, granularity, and power will enable you to see, understand, and respond to every aspect of your traffic.</p><![CDATA[Big Data Analytics on Tap for Kentik at ONUG Spring 2017]]>https://www.kentik.com/blog/big-data-analytics-on-tap-for-kentik-at-onug-spring-2017https://www.kentik.com/blog/big-data-analytics-on-tap-for-kentik-at-onug-spring-2017<![CDATA[Alex Henthorn-Iwane]]>Fri, 14 Apr 2017 15:51:02 GMT<h2 id="onug-spring-2017-in-sf">ONUG Spring 2017 in SF</h2> <p><img src="//images.ctfassets.net/6yom6slo28h2/1KrcfnPeNqiayOgqcQWcEU/e00ec06ad64932d587d488dce02e3676/Onug_logo-300w.png" alt="Onug_logo-300w.png" class="image right" style="max-width: 300px;" /> Next week, Kentik will be sponsoring and participating in the Open Networking User Group (ONUG) Spring 2017 conference. Running April 25-26 at the Mission Bay Conference Center in San Francisco, this is the first time ONUG has held a conference outside of NYC. We’re excited that one of the main foci is modern instrumentation and analytics. We’ll be sponsoring and exhibiting our big data-powered network traffic intelligence, plus our CTO Dan Ellis will be part of a panel on the afternoon of Wednesday, April 26th.</p> <h3 id="why-kentik--onug">Why Kentik &#x26; ONUG?</h3> <p>For those who aren’t aware, ONUG is a user-driven consortium. The board includes IT leaders primarily from large enterprise and financial companies like Bank of America, Cigna, GE, Gap, Morgan Stanley, and FedEx, plus tech companies like Yahoo and Intuit. The mission of ONUG is “to enable greater choice and options for IT business leaders by advocating for open interoperable hardware and software-defined infrastructure solutions that span across the entire IT stack, all in an effort to create business value.” The ONUG community is increasingly relevant to Kentik because while we started out with customers primarily clustered around web companies and various types of service providers, Kentik has not only gained customers but is also seeing increasing engagement at financial services and enterprise organizations. All of these enterprise organizations are looking to modernize their approach to analytics across the board to harness cloud-scale computing platforms and open, native APIs. Nick Lippis, Co-Founder and Co-Chair of ONUG adds: “As a forward thinker in improving how businesses use big data, we are thrilled to have Kentik joining the ONUG Community and demonstrating at their first ONUG Conference this week. Kentik is proving their commitment in helping businesses — and the IT executives that run them — optimize their capacities by turning network data into real time intelligence. We look forward to hearing their insights and strategies in moving our businesses forward.”</p> <h3 id="where-to-find-kentik-while-at-onug-spring"><strong>Where to Find Kentik while at ONUG Spring:</strong></h3> <p> <strong>Exhibits:</strong> We’ll be exhibiting and giving demonstrations of our big data network traffic intelligence. Find us, along with a number of other great solution providers and sponsors such as Arista, Cisco, Huawei, NTT i³, Verizon and Viptela. <strong>Wednesday Panel: Retooling for the Software-Defined Enterprise</strong> Dan Ellis, Kentik CTO and former Netflix-CDN-Ops-badass, will be participating in a panel moderated by ACG Research senior analyst Steve Collins. 
The panel, which will also include folks from Yahoo, FedEx, and Brocade, will focus on how traditional tools are inadequate for modern networks, and explore the “new big data analytics approach to IT infrastructure management and its promise of greater infrastructure insight plus operational efficiency.” What a perfect setup to talk about how our customers are getting value from big data-powered NetFlow, BGP, geoIP, and performance data, not only from a human operator perspective, but via automation as well.</p> <h3 id="see-you-there">See you there!</h3> <p>We’re looking forward to ONUG, seeing you there and showing you how powerful big data network traffic intelligence is. Visit our <a href="https://www.kentik.com/kentik-detect/">product page</a> if you’d like to learn more about Kentik Detect. If you’d like to get a demo, hit us up on the chat on <a href="https://kentik.com">kentik.com</a> or send us an email at <a href="mailto:[email protected]">[email protected]</a>. Know you’d like to get this analytics power in your hands? Start a <a href="#signup_dialog">free trial</a>.</p><![CDATA[From Network Flow Monitoring to CapEx Savings]]><![CDATA[It's very costly to operate a large-scale Internet edge, making lower-end edge routers a subject of keen interest for service providers and Web enterprises alike. Such routers are comparatively short on FIB capacity, but depending on the routes needed to serve your customers that might not be an issue. How can you find out for sure? In this post, Alex Henthorn-Iwane, VP Product Marketing, explains how a new feature in Kentik Detect can show you the answer.]]>https://www.kentik.com/blog/from-network-flow-monitoring-to-capex-savingshttps://www.kentik.com/blog/from-network-flow-monitoring-to-capex-savings<![CDATA[Alex Henthorn-Iwane]]>Mon, 10 Apr 2017 13:00:43 GMT<h3 id="route-traffic-analytics-enables-lower-cost-edge-routers"><em>Route Traffic Analytics Enables Lower-Cost Edge Routers</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/3TNM2LuxeEAmgqc8Cy6IGw/17a448babeb2acf537d2a4fe4b272c88/Cut_costs-500w.png" alt="Cut_costs-500w.png" class="image right" style="max-width: 400px;" /> <p>With a single name-brand edge router costing up to $1M+, operating a large-scale Internet edge can be a costly proposition. Whether you’re a Web enterprise selling goods and services online or an ISP selling connections, that cost can be a significant hurdle to launching or expanding your business. That’s why a major area of exploration these days is the use of lower-cost edge routers or even white box-based solutions. But there’s a major advantage that you give up when moving from established, high-end routers to lower-cost options: routing and forwarding scalability.</p> <p>High-end, brand-name routers can hold the entire Internet routing table many times over in their control plane processor memory, and their control plane software is efficient and stable enough to enable timely processing of continuous updates to the Routing Information Base (RIB, a.k.a. IP Routing Table). This capacity to scale, which illustrates the advantage of mature products, is something that brand name routers have developed over a long time. Lower-cost routers don’t have that advantage, but they are steadily catching up in terms of RIB scalability.</p> <p>A much larger capacity gap exists between high- and low-end routers with respect to the Forwarding Information Base (FIB, a.k.a. CEF table or IP forwarding table), meaning the scale of the forwarding plane hardware. 
In high-end routers the FIB accommodates the entire Internet RIB, over 360K CIDR-aggregated entries and roughly twice that in unaggregated prefixes. Lower-cost routers have dramatically less FIB capacity, typically on the order of 30K entries.</p> <h4 id="can-your-fib-fit-your-rib">Can Your FIB Fit Your RIB?</h4> <p>Let’s suppose that your network is heavily multi-homed to the Internet, and that you are looking for ways to reduce CapEx on your edge routers. The key question is whether you can handle traffic for the vast majority of your customers with 30K FIB entries. If you can manage to send most of your important traffic within that constraint you can then use a default route to handle your remaining traffic flows.</p> <div class="pullquote left">Can you handle traffic for the vast majority of your customers with 30K FIB entries?</div> <p>To arrive at a practical answer to that high-level question, network managers need to answer two more detailed questions. First, what’s the correlation of traffic flows to routed prefixes? And second, how much and how often does that correlation change? The latter question is important because even if you can calculate route traffic density, if the correlations change too rapidly and dramatically, it might not be feasible to utilize route traffic density data to configure the FIBs in lower-cost routers. Fortunately, Kentik Detect’s fast, deep analytical visibility into traffic flows can help answer those questions.</p> <p>Our recent <a href="https://www.kentik.com/kentik-hackathon/">Kentik hackathon</a> included a project by some of our engineers showing how Kentik Detect could be used to provide an analysis by tranches of routes, thereby showing how much traffic (by percent) is being handled by how many prefixes. An analysis of this type addresses the real need discussed above to assess the feasibility of utilizing lower-cost routers. And it also provides practical insight into which prefixes to include in a lower-cost router’s FIB. The engineers who worked on the hackathon project utilized traffic flow and routing data from a number of networks that were interested in the results.</p> <h4 id="yes-most-likely">Yes! (Most Likely)</h4> <img src="//images.ctfassets.net/6yom6slo28h2/ZJWlYdr7oqQ6keO44gAQs/07d177e5768fe80b7e56c2a01d9295e1/RTA-hackathon_result-300w.png" alt="RTA-hackathon_result-300w.png" class="image right" style="max-width: 300px;" /> <p>The finding of the hackathon project was that in many large networks, it is indeed possible to handle the vast majority of traffic flows in a FIB whose capacity is limited to 30k BGP routes. The following table, based on data from a large, multi-homed network and condensed for brevity, shows the correlation between the number of routes and the percentage of traffic flows associated with those routes. The number of routes is shown in the left column, while the right column shows the percentage of overall traffic volume covered by those routes. As you can see, in this network it is possible to serve 98% of all traffic flows with only 7716 routes, far below the 30K threshold set for the project.</p> <p>The hackathon project showed that it’s relatively easy to derive route traffic density using the capabilities of Kentik’s big data network traffic intelligence platform. For the hackathon the use case of the route assessment was in the realm of business intelligence. 
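</p> <p>The core arithmetic behind that kind of tranche analysis is simple enough to sketch. Here’s a rough Python illustration (the data structure and numbers are hypothetical; Kentik Detect runs this kind of analysis at scale against billions of flow records):</p> <pre><code>def top_routes_for_coverage(prefix_bytes, coverage=0.98):
    """Given {bgp_prefix: total_bytes}, return the smallest
    traffic-ranked set of prefixes that carries at least
    `coverage` of all observed bytes."""
    total = sum(prefix_bytes.values())
    ranked = sorted(prefix_bytes.items(), key=lambda kv: kv[1], reverse=True)
    chosen, running = [], 0
    for prefix, nbytes in ranked:
        chosen.append(prefix)
        running += nbytes
        if running &gt;= coverage * total:
            break
    return chosen

# Toy example: two prefixes carry 98% of the traffic, so the third can be
# left to a default route rather than consuming a FIB entry.
demo = {"198.51.100.0/24": 900, "203.0.113.0/24": 80, "192.0.2.0/24": 20}
print(top_routes_for_coverage(demo))  # ['198.51.100.0/24', '203.0.113.0/24']
</code></pre> <p>If the chosen list stays under the 30K-entry FIB budget, as it did in the networks studied, the long tail of remaining prefixes can simply follow a default route.</p> <p>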
But given the level of detail Kentik Detect makes available, a dataset of this type could also be used in a production scenario as the input to operational automation that pushes prioritized routes into the FIBs of lower-capacity routers.</p> <p>Further analysis of the results showed that in the networks surveyed there was little change over time in the percentage of traffic associated with the top prefixes. That indicates that even pushing a top N list of prefixes into the FIB on a daily basis may be good enough to keep traffic flowing optimally within the capacity constraints.</p> <p>Aside from the interesting data gathered from customer networks, this project also demonstrated the power and flexibility of Kentik Detect to deliver highly valuable analytics that straddle the difficult line between operational details and business intelligence. Which leads me to…</p> <h4 id="route-traffic-analytics-in-the-kentik-portal">Route Traffic Analytics in the Kentik Portal!</h4> <p>I originally wrote this blog post with a placeholder statement that we’d be developing this feature at some point in the future. But before we could publish the post, our engineering team beat me to the punch by turning the hackathon project into a new feature — Route Traffic Analytics — that is already live in the Kentik portal (reached via the Analytics menu).</p> <p>Route Traffic Analytics is configured in the sidebar, where you choose the devices to include, filter traffic by any of dozens of dimensions, and choose options from a variety of settings (order, data-series resolution, slice sizing, time range) that help ensure that you see the data that’s of greatest utility. The Actions setting includes a number of analyses that yield practical outputs showing the correlation of routing and traffic flows:</p> <ul> <li><strong>Correlate flows to routes</strong>: The Summary analysis provides insight into the number and percentage of traffic flows correlated to the number and percentage of routes, plus Mbps per analyzed tranche of routes. This insight is helpful to understand the capacity needed in edge router FIBs to handle the desired percentage of traffic flows.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/2kTAUxobE0quUiU6UsGimu/c9e782b0c95a9c07430e9d185720ea04/RTA-Summary-800w.png" alt="RTA-Summary-800w.png" class="image center" style="max-width: 800px;" /> <ul> <li><strong>Show top 1000 routes</strong>: A listing of the top 1000 routes by traffic density, which provides more details per route.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/5QCwtVIWxGaCskUCmg48wc/636343add5ba13c679436f21c1cfaeeb/RTA-Top_1000-800w.png" alt="RTA-Top_1000-800w.png" class="image center" style="max-width: 800px;" /> <ul> <li><strong>Show volume overview</strong>: A quick calculation of average and max Mbps per route.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/1iAHo87dBAIO60iAo0q04q/6401bb7514b81930791acb6c5243ac8f/RTA-Quick_traffic-800w.png" alt="RTA-Quick_traffic-800w.png" class="image center" style="max-width: 800px;" /> <ul> <li><strong>Export CSV for top routes</strong>: Exports top routes to CSV, which can be used to configure routers.</li> </ul> <p>Route Traffic Analytics is just one example of the kind of network insights that Kentik Detect can provide to help you reduce costs without sacrificing service reliability or performance. If this sounds like the level of network visibility you need, <a href="#demo_dialog">schedule a demo</a>. 
Or dive in directly right now by starting a <a href="#signup_dialog">free trial</a>. In just 15 minutes you could be signed up and analyzing real traffic data on our powerful, secure, multi-tenant SaaS.</p><![CDATA[Flow Data is Top Source for Network Analysis]]><![CDATA[Not long ago network flow data was a secondary source of data for IT departments trying to better understand their network status, traffic, and utilization. Today it's become a leading focus of analysis, yielding valuable insights in areas including network security, network optimization, and business processes. In this post, senior analyst Shamus McGillicuddy of EMA looks at the value and versatility of flow for network analytics.]]>https://www.kentik.com/blog/flow-data-is-top-source-for-network-analysishttps://www.kentik.com/blog/flow-data-is-top-source-for-network-analysis<![CDATA[Shamus McGillicuddy]]>Mon, 03 Apr 2017 13:00:57 GMT<h2 id="enterprise-it-embraces-flow-for-monitoring-and-planning"><em>Enterprise IT embraces flow for monitoring and planning</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/1k9BmiOVDQQ8Q8IO2Aa20Y/271006a524ec31cd66344215952bb11b/Surfing-500w.png" class="image right" style="max-width: 500px" alt="surfing" /> <p>Network flow data is quickly emerging as an essential source of data for <a href="https://www.kentik.com/kentipedia/network-traffic-analysis/" title="Kentipedia: Network Traffic Analysis">analyzing network performance</a>, optimizing infrastructure, and detecting malicious traffic. This wasn’t the case just a couple of years ago when EMA issued its <a href="http://www.enterprisemanagement.com/research/asset.php?id=2749">Network Management Megatrends 2014</a> report. At the time, EMA’s research found that network flow data was only the fourth most popular source of data used for network engineering and capacity planning. Instead, enterprises were much more likely to approach these use cases with packet inspection tools, log files, and data extracted from management system APIs.</p> <p>Just two years later, as EMA reported in <a href="http://www.enterprisemanagement.com/research/asset.php?id=3230">Network Management Megatrends 2016</a>, things had changed significantly. Network flow records emerged as the most popular source of data for engineering and capacity planning, used for these purposes by 41% of the surveyed network infrastructure teams. It’s a change that makes perfect sense, because network flow records provide clear and concise insight into network traffic patterns.</p> <h3 id="flow-data-for-wan-monitoring">Flow data for WAN monitoring</h3> <p>The trend in engineering and capacity planning applies as well to wide-area network (WAN) service assurance, where network flow data is also essential. Enterprise WANs are currently undergoing tremendous change. Many enterprises are, for example, replacing managed WAN services like MPLS with public internet connections at remote sites. 
Organizations are pursuing these changes for a variety of reasons, not least of which is that they enable direct cloud access and higher bandwidth at a lower cost.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2LajoGTrU4aOgqKScaweiq/454735b9b774f471e4aa209a877661f5/Most_popular-300w.png" class="image left" style="max-width: 300px" alt="" /> <p>While <a href="https://www.kentik.com/kentipedia/sd-wan-software-defined-networking-defined-and-explained/" title="Kentipedia: SD-WAN Explained">software-defined WAN (SD-WAN) technology</a> provides the infrastructure that helps enterprises leverage the internet for business applications, network flow data will be the data source they use to assure performance at these remote sites, regardless of whether they connect via Internet or MPLS. EMA’s 2016 research report, <a href="http://www.enterprisemanagement.com/research/asset.php/3283/Next-Generation-Wide-Area-Networking">Next-Generation Wide-Area Networking</a>, found that network flow analysis was the most popular type of tool for monitoring the health and performance of MPLS-based connectivity and Internet connectivity at remote sites, with 51% and 52% of enterprises using the tool for those use cases, respectively. So regardless of the nature of the network, enterprises are using network flow data for service assurance.</p> <p>Popularity does not necessarily indicate value — many tools are popular simply because they are less expensive or less disruptive to infrastructure. But when EMA asked organizations to identify their most valuable WAN monitoring tools, network flow analysis again led all other tools, with 31% of enterprises identifying it as the most valuable tool for monitoring MPLS performance and 30% of enterprises indicating network flow analysis as the best tool for monitoring internet performance.</p> <h3 id="advanced-network-analytics">Advanced network analytics</h3> <p>As enterprises leverage more sophisticated analytics technology, network flow data has even more potential value for IT operations and lines of business. Many enterprises apply advanced analytics techniques like big data to network data. In our research for Network Management Megatrends 2016, EMA found that network flow records are the most popular type of data included in these initiatives, used in 49% of all network analytics projects.</p> <h3 id="flow-analysis-helps-enterprises-with-security-optimization-and-business-processes">Flow analysis helps enterprises with security, optimization, and business processes.</h3> <p>So how do enterprises derive value from advanced network analytics? First and foremost, they enhance network security monitoring (36% of projects). The second most popular use case is <a href="https://www.kentik.com/kentipedia/what-is-network-optimization/" title="Kentipedia: What is Network Optimization?">network optimization</a> (30%). And 27% of these projects are able to apply network analytics to business process optimization.</p> <p>The application of network analytics to business process optimization has huge potential because it shows that the IT organization is poised to leverage network data, particularly network flow data, to improve business operations. 
In these situations, network analytics enables an IT organization to reposition itself as a business partner rather than a cost center.</p> <h3 id="the-upshot-value-and-versatility">The upshot: value and versatility</h3> <p>EMA believes that <a href="https://www.kentik.com/kentipedia/netflow-guide-types-of-network-flow-analysis/" title="NetFlow Guide: Types of Network Flow Analysis">network flow records</a>, whether formatted as NetFlow, IPFIX, sFlow, or similar protocol alternatives, constitute one of the most versatile and valuable classes of network data. From <a href="https://www.kentik.com/kentipedia/network-capacity-planning/" title="Kentipedia: Network Capacity Planning">capacity planning</a> and <a href="https://www.kentik.com/kentipedia/network-performance-monitoring/" title="Kentipedia: Network Performance Monitoring">performance monitoring</a> to network security monitoring and business process optimization, the potential value is there. We recommend that any IT organization that is not already leveraging network flow data evaluate its applicability to their use cases. For organizations that are already using flow data, but with legacy tools, we recommend investigating how advanced analytics that include big data can increase your ability to unlock the value of network flow data, which will help you achieve your IT and overall business goals.</p> <hr> <p><em>Ready for a first-hand look at the benefits of big data network visibility? Contact us to <a href="#demo_dialog">schedule a demo</a>, or sign up now for a <a href="#signup_dialog">free trial</a>.</em></p><![CDATA[Why Large Enterprises Need Modern DDoS Defense]]><![CDATA[Today's increased reliance on cloud and distributed application architectures means that denial of just a single critical dependency can shut down Web availability and revenue. In this post we look at what that means for large, complex enterprises. Do legacy tools protect sufficiently against new and different vulnerabilities? If not, what constitutes a modern approach to DDoS protection, and why is it so crucial to business resilience?]]>https://www.kentik.com/blog/why-large-enterprises-need-modern-ddos-defensehttps://www.kentik.com/blog/why-large-enterprises-need-modern-ddos-defense<![CDATA[Alex Henthorn-Iwane]]>Mon, 27 Mar 2017 13:00:53 GMT<p><em><strong>Cloud complexities raise the bar for effective protection</strong></em></p> <img src="//images.ctfassets.net/6yom6slo28h2/56BisiymdqimaAwsakK6Eq/116c755c16fd3bc5138c4c5a72d266f6/Complex_net-400w.png" alt="Complex_net-400w.png" class="image right no-shadow" style="max-width: 300px;" /> <p>I recently had an interesting conversation with an industry analyst about how Kentik customers use our big data network visibility solution for more accurate DDoS detection, automated hybrid mitigation, and deep ad-hoc analytics. I was focused on our current customer base in digital business as well as cloud and service providers. But it became clear to me based on his feedback that any complex enterprise today can benefit from a modern approach to DDoS protection, and that Kentik can add real value in that context.</p> <p>What’s a complex enterprise? The term, which surfaces repeatedly in my conversations with various analysts, goes beyond just the notion of size, though there’s a fairly strong correlation between the size of a business and the complexity of its IT infrastructure. From an infrastructure point of view, a complex enterprise operates multiple, geographically dispersed datacenters and cloud deployments. 
These datacenters each have multiple BGP Internet peerings to facilitate resilience and performance. The collective bandwidth profile of these datacenters is high volume, in some cases adding up to hundreds of Gbps.</p> <h3 id="vulnerabilities-and-diversity">Vulnerabilities and Diversity</h3> <p>It’s important to note that the IT architecture of many of these complex enterprises leans increasingly towards distributed applications. Service components and dependencies are spread across datacenters, the cloud, and the Internet, and applications involve increased east-west traffic flows, which makes end-to-end performance heavily reliant on predictable network behavior. The disruption of network bandwidth or performance to any particular set of services can create a disastrous ripple effect on the entire application ecosystem, which makes it critical to protect more than just north-south, client-server Web traffic.</p> <h3 id="hackers-can-shut-down-web-revenues-by-disabling-just-one-critical-dependency">Hackers can shut down Web revenues by disabling just one critical dependency.</h3> <p>To get an idea of how important effective protection can be, remember that by disabling just one digital dependency — the Dyn DNS service — hackers were recently able to shut down e-commerce for many of the industry’s most visible Web brands. But given the number of datacenters, diversity of internet peering, and overall traffic volume, outsourcing DDoS mitigation full time to cloud services is astronomically expensive and doesn’t provide a sound ROI.</p> <p>Another wrinkle for complex enterprises is that over time they’ve often acquired a variety of Internet-edge facing devices, including edge routers, switches, and load balancers. Plus they often have a multitude of siloed tools for network analysis, DDoS detection, and mitigation. In the many cases where these enterprises have grown via mergers and acquisitions, the diversity of infrastructure and tools is even more pronounced.</p> <p>The above realities — growing vulnerability due to distributed applications, increasingly complex and diverse infrastructure, and archaic siloed tools — underscore the challenges that complex enterprises face in the realm of DDoS protection. They need greater unification of telemetry data, detection policies, and mitigation triggering. And they need to be able to construct their own hybrid defense system, with a combination of detection plus both on-site mitigation appliances and on-demand bursting to cloud mitigation.</p> <p>These factors are driving enterprises to reconsider further investment in legacy tools, looking instead for modern solutions that transcend traditional silos and limitations. When I asked the analyst what percentage of his client inquiries about DDoS are from large enterprises, his response was “most,” in particular from large financial services organizations.</p> <h3 id="modern-architecture-and-accuracy">Modern Architecture and Accuracy</h3> <p>So far we’ve established that there’s a real need for modern DDoS protection, but what does that mean? Many analysts define as “modern” the widespread move in IT toward hybrid cloud and cloud-scale applications and analytics. 
In this context, modern DDoS protection can be understood to mean a fully hybrid approach combining the following characteristics:</p> <ul> <li>Cloud-like scale for the “application” aspect of detection and forensics.</li> <li>A hybrid on-premises and cloud-bursting approach to deep packet inspection (DPI) mitigation hardware, both in terms of where it lives and how it’s paid for (CAPEX vs OPEX).</li> </ul> <p>Kentik Detect is a modern approach to detection and analytics, and supports a modernized approach to DDoS defense. Kentik Detect delivers both cloud scale and <a href="https://youtu.be/9TyJdPdZrn0">field-proven gains of 30%</a> in attack detection accuracy versus traditional detection appliances. How is this possible? Detection appliances are limited to simple policies such as looking at volume (bits/packets/flows per second) against large IP pools. Any baselining, if it’s available at all, can only be performed on a per-exporter basis and relies on static policies that quickly fall out of date. Kentik, meanwhile, leverages its significant computational advantage to perform far more sophisticated anomaly detection policies than earlier appliance-based solutions can handle.</p> <p>Freed from the compute and storage constraints of appliance architectures, Kentik Detect enables you to do things no appliance-based system can:</p> <ul> <li>Track millions of individual IPs, so you don’t lose important attacks in the noise of high overall traffic volumes.</li> <li>Configure multi-dimensional alerting policy criteria by grouping BGP, IP, GeoIP, and other fields.</li> <li>Measure a broad variety of metrics including the traditional (bps/pps/fps) as well as innovative new types including unique source IPs, unique dest IPs, unique source AS, unique dest AS, unique source Geo, unique dest Geo, and TCP retransmits.</li> <li>Baseline on a network-wide basis using an adaptive inclusion-set configuration so that significant IPs are always baselined and important deviations aren’t missed.</li> </ul> <h3 id="big-network-data-unification">Big (Network) Data Unification</h3> <p>Aside from the capabilities enabled by superior compute power and storage capacity, Kentik is also different because it’s built to be a data unification platform combining flow records (NetFlow, sFlow, IPFIX, etc.) with BGP, GeoIP, SNMP, and performance metrics from packet capture. With daily storage of flow records in the range of 130 billion, our SaaS platform has the proven scale to serve the largest enterprise needs, and we also deploy on-premises for large telecoms, cloud providers, and enterprises. SaaS or on-prem, the performance of our ad-hoc analytics is super-fast even on multi-dimensional queries across multiple billions of rows. Over 95% of our customers’ queries return answers in just a couple of seconds.</p> <p>Kentik’s column store architecture is also extensible, allowing customers to create custom columns that are auto-generated on ingest from a combination of inbound data records and pre-configured tags. 
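</p> <p>As a toy illustration of that ingest-time enrichment pattern (a generic sketch, not Kentik’s actual ingest pipeline; the tag table and field names here are invented):</p> <pre><code>import ipaddress

# Pre-configured tags: network block mapped to a business label (hypothetical values).
TAGS = {
    ipaddress.ip_network("10.10.0.0/16"): "datacenter-east",
    ipaddress.ip_network("203.0.113.0/24"): "cdn-partner",
}

def enrich(flow_record):
    """At ingest, derive a custom column for the record from the
    record's own fields plus the pre-configured tag table."""
    src = ipaddress.ip_address(flow_record["src_ip"])
    flow_record["src_tag"] = next(
        (label for net, label in TAGS.items() if src in net), "untagged")
    return flow_record

print(enrich({"src_ip": "10.10.4.7", "bytes": 1500}))
# {'src_ip': '10.10.4.7', 'bytes': 1500, 'src_tag': 'datacenter-east'}
</code></pre> <p>Because the column is computed once at ingest, later queries can group and filter on it without rescanning raw records.</p> <p>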
We’re continuously adding more supported data types, with data columns such as threat feeds already in the works.</p> <h3 id="speed-and-extensibility-make-kentik-detect-a-one-stop-silo-free-visibility-solution">Speed and extensibility make Kentik Detect a one-stop, silo-free visibility solution.</h3> <p>The combination of speed and data extensibility means that Kentik can be a one-stop, silo-free platform for network visibility — not just for DDoS but for a broad range of use cases including operations, security, planning, and business intelligence. That’s why Kentik has been recognized by leading analyst firms Gartner, Forrester, and IDC for the big data power of its analytics platform. When it comes to DDoS detection, that analytical power is crucial for gaining forensic and situational awareness, which is why there’s no comparison between Kentik Detect and traditional appliances.</p> <p>Kentik Detect is also mitigation neutral, with built-in RTBH capabilities and automated triggering of 3rd-party solutions via multi-condition settings in our alert policies. Because it already supports multiple mitigation techniques and vendors out of the box, and is “API-first” in its software architecture, Kentik Detect is an ideal arbiter for detecting attacks and triggering hybrid mitigation across the whole enterprise network.</p> <h3 id="time-to-make-a-change">Time to Make a Change?</h3> <p>As with the rest of IT, the modernization of DDoS defense is being driven by the rising importance of the cloud, a trend that is likely increasing the complexity of your organization’s infrastructure, network, and DDoS attack surface. That’s why, as advised by the analyst I began this post with, if you’re looking to modernize it’s important to thoroughly assess what you’re trying to protect. Getting a handle on all of your critical digital dependencies is an essential step in developing a clear picture of your requirements.</p> <h3 id="theres-no-better-time-to-modernize-your-approach-to-ddos">There’s no better time to modernize your approach to DDoS.</h3> <p>The bottom line is that there’s no better time than the present to assess whether your enterprise network security organization needs to modernize its approach to DDoS. For a big-picture view of DDoS trends and why Big Data Analytics is needed in the Age of DDoS, check out our <a href="https://www.kentik.com/resources/webinar-the-age-of-ddos-requires-big-data-analytics">webinar</a> by that name, which we presented in collaboration with Joseph Blankenship, a senior analyst from Forrester. If you’re a Gartner client and you’d like to get their perspective on Kentik, feel free to talk to your contacts in the core analyst or GTP organization; they’re familiar with us.</p> <p>Meanwhile, if you’re already keeping your eyes open for the best way to modernize, it’s definitely worth your while to take a look at what we’re doing at Kentik. Read our solution brief to learn more about <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a>.</p> <p>Ready to try it for yourself? <a href="#demo_dialog">Request a demo</a> or start a <a href="#signup_dialog">free trial</a> today!</p><![CDATA[How Important is the Internet to Enterprises Today?]]><![CDATA[After presenting at the recent CIOArena conference in Atlanta, Kentik VP of Strategic Alliances Jim Frey came away with a number of insights about the adoption of digital business operations in the enterprise. 
In the first of a series of related posts, Jim looks at audience survey responses indicating how reliant enterprises — even those that aren't digital natives or located in tech industry hotspots — have become on the Internet for core elements of their business.]]>https://www.kentik.com/blog/todays-enterprises-rely-on-the-internethttps://www.kentik.com/blog/todays-enterprises-rely-on-the-internet<![CDATA[Jim Frey]]>Mon, 20 Mar 2017 13:00:26 GMT<p>Digital Business Transformation Makes the Network Indispensable</p> <img src="//images.ctfassets.net/6yom6slo28h2/6ayUt6XCP6YUqeE6o4ucQS/ca82b9ca835251a96d4422609e5e40f4/Safety_net-500w.png" alt="Safety_net-500w.png" class="image right" style="max-width: 300px;" /> <p>CIOArena conferences offer C-level, executive, and senior IT leaders from mid- to large-sized enterprises a chance to network, look at key trends, and learn about cutting-edge solutions like Kentik Detect. Kentik participated in the March CIOArena event in Atlanta, which enabled us to share perspectives with other participants and to learn how they view modern network visibility as their approach to digital business operations matures. I’ll be writing a series of posts capturing some key observations from our time at the conference, but here’s one of our main take-aways from this event: while enterprises are increasingly focused on digital business, they are still maturing in their approach to understanding and managing the key dependencies, network and beyond, of their digital business applications.</p> <p>I led off Kentik’s participation in the conference by giving a presentation entitled “The Network Doesn’t Just Run the Business, It Sees the Business.” I’ll cover that presentation in more depth in a future post, but you can <a href="https://info.kentik.com/rs/869-PAD-887/images/JFrey%20Kentik%20CIOarena%208Mar17.pdf">view the slides here</a>. The main point I made that’s relevant to this blog post is that the Internet is a critical piece of the delivery path for digital business. As such, the network is literally running the business, carrying the revenue. That also means that the network is more than a utility, it’s a key source for operational and business intelligence, because it’s in a unique position to see both the activity and performance of digital business.</p> <p><strong>Digital Business Brings Internet Reliance</strong></p> <p>While it might seem obvious by now, pretty much every business of any appreciable size today sees the Internet as important. But just how important? One of the survey questions we ran at the conference asked “How important is the Internet to your business?” The point of this survey question was to separate businesses into distinct segments based on their reliance or focus on digital business. Here are the results of that survey question:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1cU3fPXWRosqqwQwGWMcwQ/ec7f6bc3fde629fa34d3eb15d329e4c6/Pie_chart-350w.png" alt="Pie_chart-350w.png" class="image right" style="max-width: 300px;" /> <p>In addition to those shown in the graph there was also a fifth choice: “The Internet has no importance to my business.” But nobody chose that response. 
Not a huge surprise — businesses that have zero dependence on the Internet are not likely to show up at a CIO conference in the first place — but if you think about it, nearly every business today has, at the very least, an email address and a website.</p> <p>At the low end of the Internet-dependence spectrum was the roughly 16% of respondents who see the Internet as important but not core to the business. I interpret this as meaning that they aren’t critically dependent on the Internet in terms of what they make or deliver as services, how they go to market, and how they carry out key business processes. These businesses communicate externally via email, perform some social media marketing, utilize some productivity apps, perhaps even host non-essential but useful IT functions in the cloud or utilize SaaS. But if their Internet connectivity went down or was severely disrupted, no critical aspect of the business would stop working. “Internet down? Bummer. Can’t get all my work done today. Oh well, there’s always tomorrow.”</p> <p>Near in spirit to those respondents were just over 14% that see the Internet as “important to key aspects of the business.” To me, this indicates that some portion of the business relies heavily on either cloud or SaaS resources, that their go-to-market involves digital resources or programs, or that their website or a mobile app is a major customer interface for information, though not for revenue or transactions. There would be real pain if those resources or go-to-market mechanisms were disrupted, but no existential threat to the business. “Internet down? Not good. We need to get this fixed soon or we’re going to feel it on the bottom line.”</p> <p>A majority now sees the Internet as critical to their business.</p> <p>Most fascinating to me though, was the fact that a majority of those surveyed see the Internet as “critical” to the business. To me, this means that revenue is at stake in some fashion, that a major disruption to cloud resources or Internet-accessible users, customers, and markets would be a board-level event. Heads may very well roll if the disruption is significant enough. “Internet down? Ouch! We need to fix this immediately, and then plan to prevent this in the future.”</p> <p>On the far end from “no importance” were 19% of surveyed attendees that hail from digital-native organizations. These include Web companies that rely exclusively on the Internet to reach and transact with their customers, as well as service providers for whom Internet traffic *is* the business. It’s certainly interesting that nearly a fifth of the represented businesses in the Atlanta metro are essentially Internet-based. “Internet down? Crisis! How could we possibly let this happen in the first place?!?”</p> <p><strong>Digital Business is Everywhere</strong></p> <p>While Atlanta is a major metropolitan area, it’s not Silicon Valley. But if over 70% of business respondents represented at a CIO-ish conference see the Internet as revenue-critical or they couldn’t imagine their business without it, I think we can safely conclude that digital business is nearly everywhere. In further blogs, we’ll look to the responses from these surveyed businesses for further insights into the rapidly evolving story of digital business transformation. 
In particular we’ll cover two additional survey questions that we asked, which focused on application architectures and hosting deployment models as well as the maturity of the respondents’ approach to network management.</p> <p>For the 70% that sees digital business operations and internet traffic flow as crucial to their business, it follows that network traffic intelligence is mission-critical. If you’re using legacy tools that were architected in the pre-cloud era, it’s high time to protect your services and revenue by upgrading to a modern solution. Check out the <a href="https://www.kentik.com/product/kentik-platform/">Kentik Platform</a> and learn how big data can power both operational and business insights. If you already know that you’re ready for big data-powered network visibility, you can sign up for a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Kentik Joins Internet2]]><![CDATA[Kentik is pleased to announce our membership in the Internet2® consortium, which operates a nationwide research and education (R&E) network and establishes best practices for R&E networking. Because Internet2 is a major source of innovation, our participation will enable us to grow our connection to the higher education networking community, to learn from member perspectives, and to support the advancement of applications and services for R&E networks.]]>https://www.kentik.com/blog/kentik-joins-internet2https://www.kentik.com/blog/kentik-joins-internet2<![CDATA[Alex Henthorn-Iwane]]>Mon, 13 Mar 2017 13:00:57 GMT<h3 id="big-data-network-intelligence-comes-to-research--education-community"><em>Big Data Network Intelligence Comes to Research &#x26; Education Community</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/30Ycxgt07eGKwkAGU0GysM/d9ba80906a86337b6866dd1c5832eb86/internet2_member-300w.png" alt="internet2_member-300w.png" class="image right no-shadow" style="max-width: 200px;" /> <p>Internet2® is a powerhouse in the world of networking. The consortium’s nationwide research and education (R&#x26;E) network serves hundreds of universities, dozens of government agencies, tens of regional and state education networks, and tens of thousands of community anchor institutions. Operating at the intersection of networking with R&#x26;E, Internet2 sets the pace for best practices for many R&#x26;E networking teams, and is also a major source of innovation. That’s why we’re excited to announce that Kentik recently joined Internet2 as an industry member. Participation in Internet2 will give us the opportunity to grow our connection to the higher education networking community, to learn from member perspectives, and to support the development and deployment of advanced applications and services for R&#x26;E networks.</p> <h4 id="the-need-to-see-deeper">The Need to See Deeper</h4> <p>The network organization at a medium-to-large university or college essentially operates like an ISP. Most of the early, venerable Autonomous System Numbers (ASNs) are associated with such institutions, and they are typically multi-homed, often quite densely. Their network teams must meet the service performance expectations of large populations of resident students who consume lots of bandwidth. 
And there can be dramatic, unpredictable fluctuations in the demand for both internal and Internet bandwidth as various academic departments win or lose funding for research projects, which are often structured as collaborations with colleagues from other institutions.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5PbsxrjbVuoMO0oGk40OY4/5c642d641f2dd2e3687dd05a80f8761d/See_deeper-400w.png" alt="See_deeper-400w.png" class="image right" style="max-width: 400px;" /> <p>The combination of multi-homing, high bandwidth demands, and unpredictable traffic patterns makes comprehensive visibility essential for higher education networks. Universities need to manage internal network capacity, optimize their peering and transit traffic, and deal with anomalies and DDoS attacks. With NetFlow, sFlow, and IPFIX emanating from edge and internal routers and switches, there’s plenty of data to work with. But unfortunately most university network engineers are still hamstrung with network analysis tools that were born in the Nineties and designed around constrained storage and computing power. The result is that analytics are typically limited to summary snapshots.</p> <p>We know from direct experience that there’s a much more effective approach to providing the visibility that networking teams need to manage their demanding network traffic and performance requirements. As part of Internet2, we’ll now have the chance to raise awareness in the R&#x26;E community about applying big data analytics to the network.</p> <h4 id="kentik-goes-to-college">Kentik Goes to College</h4> <img src="//images.ctfassets.net/6yom6slo28h2/5dsUWlYXWwkQQAAY4imKgw/d34264f781856674755a2d09f03cc486/Campus-400w.png" alt="Campus-400w.png" class="image right" style="max-width: 400px;" /> <p>Another reason for our participation in Internet2 is the fact that Kentik is gaining customers in the higher education space. Our first higher-ed customer, University of Washington, which signed on mid-2016, has since been joined by several East Coast peers. Strong interest from additional institutions continues; just last month, for example, we met with network engineers from a couple of universities at NANOG 69 in Washington, D.C.</p> <p>Why is higher education interested in Kentik? Because Kentik Detect gives these institutions the power of big data network analytics in an easy-to-adopt SaaS, without all of the overhead and cyclical replacement costs associated with appliance-based solutions. For more insight into the value of Kentik to these types of institutions, here’s an excerpt of what one university network engineer had to say in an <a href="https://www.itcentralstation.com/product_reviews/kentik-review-41883-by-david-bendersky">IT Central Station review</a>:</p> <p><em>“Kentik answers the flow question: what are my flows, where they are going, and what can I do to better optimize my connectivity. Kentik also baselines flow behavior and can alert you when there are abnormal flows such as DDoS.”</em></p> <h4 id="lets-connect-at-global-summit">Let’s Connect at Global Summit</h4> <p>I’ll be getting more plugged-in to Internet2 by attending the 2017 Global Summit in Washington this coming April, and I’m looking forward to meeting folks there. In the meantime, if you’d like to learn more just let us know at <a href="mailto:[email protected]">[email protected]</a> and we can arrange a <a href="#demo_dialog">demonstration of Kentik Detect</a>, our big data network analytics and DDoS protection solution. 
If you’re ready to take it for a spin, you can start a <a href="#signup_dialog">free trial</a> and be up-and-running with big data network traffic visibility in fifteen minutes.</p><![CDATA[The State of DDoS Attacks and Defense]]><![CDATA[DDoS attacks constitute a very significant and growing portion of the overall cybersecurity threat. In this post we recap highlights of a recent Webinar jointly presented by Kentik's VP of Product Marketing, Alex Henthorn-Iwane, and Forrester Senior Analyst Joseph Blankenship. The Webinar focused on three areas: attack trends, the state of defense techniques, and key recommendations that organizations can implement to improve their protective posture.]]>https://www.kentik.com/blog/the-state-of-ddos-attacks-and-defensehttps://www.kentik.com/blog/the-state-of-ddos-attacks-and-defense<![CDATA[Alex Henthorn-Iwane]]>Mon, 06 Mar 2017 14:00:09 GMT<h3 id="recap-webinar-with-forrester-senior-analyst-joseph-blankenship"><em>Recap: Webinar With Forrester Senior Analyst Joseph Blankenship</em></h3> <p>I recently had the pleasure of presenting a webinar on Distributed Denial of Service (DDoS) attacks in collaboration with Forrester Senior Analyst Joseph Blankenship. Our specific focus was in three areas: attack trends, the state of defense techniques, and key recommendations that organizations can implement to improve their protective posture. This post is a recap of some of the material we covered.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1gMfFSOllYIUMsIosqgs04/f04c8e706bd4ca995b440459aee4a602/Webinar-poll_result-400w.png" alt="Webinar-poll_result-400w.png " class="image right" style="max-width: 400px;" /> <h4 id="ddos-concerns-for-business">DDoS Concerns for Business</h4> <p>Before we kicked off our formal presentations, we thought it would be interesting to poll audience members to identify their primary business concerns related to DDoS. The audience was diverse in terms of representing organizations from across the spectrum, including service providers, digital businesses, and traditional IT organizations. As shown at left, DDoS raises a range of concerns, with the foremost being protection of the organization’s brand.</p> <h4 id="key-ddos-trends">Key DDoS trends</h4> <img src="//images.ctfassets.net/6yom6slo28h2/2ZK8hi4MHuimSAigAIWsg/e0be2c57e0b17183331d1d49b2aae908/Webinar-attack_mode-400w.png" alt="Webinar-attack_mode-400w.png " class="image center" style="max-width: 400px;" /> <img src="//images.ctfassets.net/6yom6slo28h2/FBJVvwSO4eae2mgYgE4yQ/da9fb07f9c7eba354abb81cb73230b59/Webinar-DDoS_increase-400w.png" alt="Webinar-DDoS_increase-400w.png " class="image center" style="max-width: 400px;" /> <p>Joseph began the presentation by covering DDoS trends from Forrester’s perspective. A key observation — which shouldn’t come as a surprise but is important to keep in mind — is that DDoS attacks constitute a very significant portion of the overall cybersecurity threat. 30 percent of respondents to Forrester’s 2016 Global Business Technographics Security Survey reported suffering a cybersecurity breach as a result of an external attack, and 25% of those attacks were DDoS. Further, according to the same study, the portion of attacks that are DDoS is steadily and aggressively on the rise, as illustrated in the next graph.</p> <p>At this point Joseph handed off to me and I covered some more-technical aspects of DDoS trends. 
One key point, as revealed in Akamai’s Q3 2016 State of the Internet Security Report, is that the vast majority of DDoS attacks — over 98% — are focused on disrupting access to network or server infrastructure rather than on targeting application limitations or vulnerabilities. The precise mix of infrastructure attack types changes over time depending on how attackers think they can get the most leverage from their efforts. But while the proportion may vary, the presence of a few usual suspects is constant. The Akamai report identifies these top infrastructure attack vectors as UDP fragmentation, Domain Name Service (DNS) reflection, and Network Time Protocol (NTP) reflection.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2ziMoPZtneQyyUUqSQycUI/93fbd44e8eee79bc7135211626d4424b/Webinar-app_vs_infra-300w.png" alt="Webinar-app_vs_infra-300w.png " class="image right no-shadow" style="max-width: 240px;" /> <p>Another interesting fact is that while last year’s mega-attacks against OVH, Brian Krebs, and Dyn got most of the media attention, there is, according to the Neustar Fall 2016 DDoS report, a sweet spot for the traffic volume involved in DDoS attacks, which is in the range of 500 Mbps to just under 10 Gbps.</p> <h4 id="iot-as-a-cyberweapon">IoT as a Cyberweapon</h4> <p>Next up was a look by Joseph at how IoT devices have been turned against us, contributing to the DDoS problem. He covered IoT attack surfaces, how IoT contributed to the mega-attacks mentioned above, the timeline of the Mirai botnet, and the IoT concerns of cybersecurity leaders within IT organizations. Interestingly, DDoS ranked as the second-highest IoT cybersecurity concern in both the previously mentioned 2016 Forrester security report and its 2015 predecessor.</p> <h4 id="the-state-of-ddos-protection">The State of DDoS Protection</h4> <img src="//images.ctfassets.net/6yom6slo28h2/7pMfM2UXjUU2swu6sEu0Ey/6d8cc45a0c1448b59bd6d17bc764a12e/Webinar-detectionmitigation-500w.png" alt="Webinar-detectionmitigation-500w.png " class="image left no-shadow" style="max-width: 500px;" /> <p>In our next section we covered some issues around how DDoS protection is handled, beginning with a look at the main requirements for modern DDoS defense. While these may seem like no-brainer points, it’s interesting to note that according to the Neustar report, only 30% of DDoS attacks globally are detected in under an hour. A major reason for this is that a tremendous amount of DDoS detection as practiced in the industry today involves manual intervention.</p> <p>To speed detection and response, detection must be accurate enough to allow for fully automated mitigation. But there’s a major limitation to achieving that level of accuracy, which is the continued use of single-server appliances for DDoS protection. In the webinar, I laid out how constrained these appliances are in terms of both computation and storage. These limits cause suboptimal detection due to:</p> <ul> <li>Monitoring ranges of IP addresses rather than individual IPs.</li> <li>Incomplete baselining: <ul> <li>Per router/flow exporter (not network-wide);</li> <li>Single-dimension monitoring &#x26; alerting.</li> </ul> </li> <li>Static policies that quickly fall out of sync with reality.</li> </ul> <p>The above factors lead to significant false negatives. 
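</p> <p>A toy sketch makes the gap plain. The first check below mimics an appliance policy (one static threshold over a whole IP range); the second keeps a rolling per-host baseline, which is the general idea behind adaptive detection (an illustrative simplification, not Kentik’s production logic):</p> <pre><code>from collections import deque

STATIC_LIMIT_BPS = 5e9  # one fixed policy covering an entire IP range

def static_policy(range_total_bps):
    """Appliance-style check: a 900 Mbps flood aimed at one host inside
    a busy range never moves the aggregate enough to trip this."""
    return range_total_bps &gt; STATIC_LIMIT_BPS

class AdaptiveBaseline:
    """Per-host rolling baseline: alert when a host's current traffic
    exceeds a multiple of its own recent history."""
    def __init__(self, window=60, factor=4.0):
        self.history = {}  # host ip mapped to a deque of recent bps samples
        self.window, self.factor = window, factor

    def check(self, host, bps):
        hist = self.history.setdefault(host, deque(maxlen=self.window))
        baseline = sum(hist) / len(hist) if hist else bps
        hist.append(bps)
        return bps &gt; self.factor * baseline
</code></pre> <p>Tracking a baseline per host, rather than per range or per exporter, is what lets attacks that are small relative to aggregate volume surface at all.</p> <p>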
On top of that, appliance-based DDoS detection approaches offer nearly zero analytics for forensics and situational awareness, and they have very limited, almost non-functional APIs.</p> <p>Naturally I took the opportunity to contrast these appliances with Kentik’s DDoS protection solution. Kentik Detect offers the industry’s most accurate detection because:</p> <ul> <li>It can track millions of individual host IPs.</li> <li>It performs far more sophisticated baselining and anomaly detection on network-wide data.</li> <li>It performs multi-dimensional monitoring.</li> <li>It includes sophisticated multi-threshold alerting policies with auto-adaptive baselining and auto-triggered mitigation (built-in or via third-party integration).</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/2CIaarnlW0MmEqemOMQWkI/6969b101c70f8302d40cefd7c45df100/DDoS_diagram-820w.png" alt="DDoS_diagram-820w.png " class="image center no-shadow" style="max-width: 820px;" /> <p>On top of far greater detection accuracy, Kentik Detect offers big data network analytics for deep visibility and awareness.</p> <h4 id="recommendations">Recommendations</h4> <p>Building on what we’d covered so far, Joseph offered three major recommendations for protection:</p> <ul> <li>Implement a two-phase mitigation strategy with both downstream and upstream mitigation.</li> <li>Ensure that DDoS attacks are incorporated into incident response plans.</li> <li>Take into account other sources of service disruption, such as DNS service provider outages.</li> </ul> <p>From the Kentik perspective, I recommended that the audience consider a true hybrid approach to detection and mitigation. The world has moved far beyond the days when DDoS protection meant a single-vendor stack of devices. It’s possible to use a best-of-breed SaaS detection and analytics solution like Kentik Detect, paired with hybrid mitigation from on-premises appliances, cloud providers, and routing-based black-holing.</p> <div class="pullquote left">30% more attacks were caught and stopped with Kentik Detect and hybrid mitigation.</div> <p>One example of this hybrid approach is now deployed at regional ISP PenTeleData, which had been struggling with a legacy, single-vendor system that was missing a lot of attacks and causing a significant amount of pain to their network operations team. PenTeleData replaced that system with a solution pairing Kentik Detect (for detection and analytics) with Radware hybrid mitigation, including both on-premises and cloud-bursting components. The results were dramatic: PenTeleData experienced a 30% improvement in catching and stopping DDoS attacks.</p> <h4 id="watch-it-learn-more-try-it">Watch it. Learn More. Try it.</h4> <p>All in all it was great fun collaborating with Joseph on the webinar. We covered several other topics that we don’t have space to recap here, including statistics on where most DDoS attacks originate geographically as well as the number of repeat attacks on businesses. Check out the full webinar to get more details and see the results of our second audience poll, in which we asked: “How tangible is the DDoS threat to your business?” If you’d like to learn more about Kentik Detect, you can <a href="#demo_dialog">request a demo</a>. 
Better yet, experience for yourself the power of big data <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">DDoS protection</a> and analytics by signing up today for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Modern NPM: Critical for Effective APM]]><![CDATA[As the architecture of digital business applications transitions toward the cloud, network teams are increasingly involved in assuring application performance across distributed infrastructure. Filling this new role effectively requires a deeper toolset than provided by APM alone, with both internal and external network-level visibility. In this post from EMA's Shamus McGillicuddy, we look at how modern NPM solutions empower network managers to tackle these new challenges.]]>https://www.kentik.com/blog/modern-npm-critical-for-effective-apmhttps://www.kentik.com/blog/modern-npm-critical-for-effective-apm<![CDATA[Shamus McGillicuddy]]>Mon, 27 Feb 2017 14:00:03 GMT<h2 id="distributed-applications-make-network-role-key"><em>Distributed Applications Make Network Role Key</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/2GHrr13iMw6McgkY8uQgk2/f5a147736d11a19e9033393e91abd008/Muybridge_race_horse-500w.png" class="image right" style="max-width: 350px; margin-bottom: 20px;" alt="Muybridge race horse" /> <p>Network operations is no longer about the free and unfettered flow of packets across the wire. It’s bigger than that. Network managers are increasingly supporting cross-domain efforts to deliver high-performing applications. And effective application performance management (APM) depends on modern network performance management (NPM) tools.</p> <p>EMA has long tracked the operational mindset of enterprise network infrastructure teams, most recently in last year’s research report, “Network Management Megatrends 2016: Managing Networks in the Era of the Internet of Things, Hybrid Cloud, and Advanced Network Analytics.” For that report, we surveyed network professionals on a number of operational concepts, asking them to tell us which were becoming more important to them. Internal service-level agreements (SLAs) were at the bottom of the list. Security and application performance were at the top. Clearly, network managers no longer operate their networks in a vacuum. They are now focused on how overall network health and performance contributes to application performance, and they need tools that can support this focus.</p> <h3 id="application-performance-depends-on-the-network">Application Performance Depends on the Network</h3> <p>As application performance increasingly depends on network performance, there is growing evidence that the network team is playing a leading role in application operations. For instance, over the last few years network managers have seized a much bigger role in leading the cross-domain teams that form in response to application performance problems. In 2014, just 14% of enterprises said network operations led these cross-domain teams “most of the time” or “all of the time.” In 2016, that number shot up dramatically to 82%. This huge shift demonstrates that network managers need visibility into application performance.</p> <div class="pullquote left">Data centers now support applications that are dynamic, elastic, and distributed.</div> <p>Why is this happening? Many enterprises are adopting architectures that change the very nature of application infrastructure and place more pressure on the network. 
In EMA’s “Network Megatrends,” network infrastructure teams revealed that internal cloud transformation and software-defined data center (SDDC) initiatives are extremely influential on their current decision-making. The rise of cloud transformation and SDDCs reveals that many data center networks are supporting applications that are very dynamic, elastic, and distributed. Enterprises adopt these technologies to enable multi-tier applications that generate a tremendous amount of east-west traffic, rather than the north-south traffic patterns traditionally generated by monolithic, client-server application architectures.</p> <h3 id="the-network-role-in-apm">The Network Role in APM</h3> <p>The advent of multi-tier applications poses two related challenges for network teams. One is that the dynamic nature of applications served by cloud and SDDC architectures results in a higher rate of change in the network, as infrastructure teams are constantly receiving change tickets to establish network connectivity and network and security services for new workloads. High rates of change bring higher risk of mistakes.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4k8tQW6Mh2wuSwwwSgweWK/88e86f80aac288f4c36d9bd79c6e2bce/McClure_Tunnel_west-400w.png" class="image right" style="max-width: 400px" alt="" /> <p>The shift to distributed application architectures also means that east-west traffic dominates in these environments, which changes the nature of network monitoring. In the days when north-south traffic patterns ruled the data center, the operations team could monitor a few key upstream nodes on the network and gain visibility into nearly every fault and congestion issue. However, when most traffic is flowing east-west, the so-called choke-points on the network that offer a global view of network traffic fade away, which means that network teams need to broaden their monitoring vantage points. And with so much of this traffic being intra-application, the network team also needs deeper application visibility. For example, it’s important to be able to see how latency, dropped packets, and jitter are affecting interactions between the web, application, and database tiers of a given business service.</p> <p>These developments have placed a new APM mandate on network teams, who are struggling to adapt. Network problems are often to blame for application trouble: EMA’s “Network Megatrends” research found that “network configuration changes” and “network performance/congestion” were reported as the two most frequent root causes of complex IT service problems that required a response from cross-domain teams. Bear in mind, these survey respondents were network professionals, and they reported that their own technology domain was more often at fault for IT performance problems than any other domain, including security systems, storage systems, servers and hypervisors, application design and health, external cloud services, and even user error. They know that they need to do better.</p> <div class="pullquote left">Modern NPM gives an end-to-end view of both your own network and external cloud services.</div> <p>And they can do better. Through sophisticated analysis of network flows, packets, logs, and other data, modern NPM tools increasingly offer insight into the interplay between network performance and application performance. These modern tools should, first and foremost, give network managers an end-to-end view of their network. 
Given the increasingly hybrid nature of today’s applications, they should also provide visibility into external cloud services. Network managers also need faster real-time analysis of their network data to help them cope with increased complexity and higher rates of change. Furthermore, their management tools need to integrate with APM tools to facilitate collaboration between network operations and application management teams, and to give each group context for how their management domain is affected by the other. Thirty-seven percent (37%) of network managers already have this integration with APM tools today.</p> <h3 id="npm-to-dos-for-network-managers">NPM To-dos for Network Managers</h3> <p>If you are a network manager and are increasingly tasked with directly supporting the health and performance of your organization’s applications, you need tools that can support this mandate. NPM solutions can provide varying levels of insight into application performance. Examine the tools you have and compare them with competing solutions on the market. Find out which ones will give you the end-to-end network visibility that will help you deal with increased complexity and high rates of change. If your network generates high volumes of traffic, find tools that can handle that volume. Look for tools that can offer meaningful integration with APM and other IT management systems. Ultimately your requirements will be unique, so evaluate your infrastructure, determine what it will look like in five years, and map your NPM tool requirements accordingly.</p><![CDATA[Cloud-Native Network Management]]><![CDATA[As IT technologies evolve toward greater reliance on the cloud, longstanding networking practitioners are adapting to a new environment. The changes are easier to implement in greenfield companies than in more established brownfield enterprises. In this third post of a three-part series, analyst Jim Metzler talks with Kentik's Alex Henthorn-Iwane about how network management is impacted by the differences between the two situations.]]>https://www.kentik.com/blog/cloud-native-network-managementhttps://www.kentik.com/blog/cloud-native-network-management<![CDATA[Jim Metzler]]>Tue, 21 Feb 2017 14:00:34 GMT<h3 id="greenfield-companies-see-fewer-hurdles-to-cloud-based-technology"><em>Greenfield Companies See Fewer Hurdles to Cloud-based Technology</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/2nErYtA9TSUW4w6WIQ0ASs/88defd85547b13a0e3447c4c7272f8fe/Cloud_city2-500w.png" class="image right" style="max-width: 450px; margin-top: 15px; margin-bottom: 20px" alt="Cloud city" /> <p>This is the third and last blog post in my series about the impact of cloud computing on management tools. In the <a href="https://www.kentik.com/blog/culture-war-network-vs-cloud/">first post</a> I discussed the burgeoning culture war between advocates of cloud-centric and network-centric approaches to IT. In the <a href="https://www.kentik.com/blog/culture-war-how-network-vs-cloud-impacts-tools/">second post</a> I discussed the ways in which the adoption of cloud computing impacts how network organizations should think about their management tools. This time I’ll summarize an interview I recently conducted with Alex Henthorn-Iwane, Kentik’s VP of product marketing. 
We talked about some of the differences between a cloud-centric and a network-centric approach, and we also looked at how Kentik Detect advances the practice of network management.</p> <h4 id="risk-and-reward">Risk and Reward</h4> <p>I began the interview by asking Henthorn-Iwane if cloud-native companies approach IT differently than traditional companies. He responded that there are differences, many of which are based on the classic divide between companies that are greenfield and those that are brownfield. Because cloud-native companies are greenfield, he said, they have the luxury of not being burdened by legacy applications and infrastructure. As a result, they can focus on the optimal way to implement technology today. In contrast, a traditional brownfield company has to account for the impact that the implementation of new technology would have on existing applications and infrastructure. They need to consider whether it’s possible to migrate some or all of their existing base to the new technology, how that migration might be accomplished, and what that would cost.</p> <div class="pullquote left">Greenfield companies tend to be less risk-averse than brownfield companies.</div> <p>Henthorn-Iwane also pointed out that because brownfield companies have been in operation longer they tend to be more risk averse than greenfield companies. In part that’s because significant components of the IT environment in most brownfield companies are deemed to be fragile, which amplifies the technical risk associated with making changes. And because the network organization is, in many companies, the first group to be blamed for any technical problems, the network team in brownfield companies is often particularly wary when it comes to implementing new technologies. Adding to this caution is the fact that most brownfield companies are past the stage of rapid growth and instead are very focused on managing their profit margin. The possibility that implementing new technology could negatively impact the company’s financials compounds the risk-sensitive mindset.</p> <img src="//images.ctfassets.net/6yom6slo28h2/ExJOsfL3PwkQKmqkY446O/f93901f8dc9f87121edfd8c8a0d4a9ab/Leap_of_faith-400w.png" class="image right" style="max-width: 400px" alt="Leap of faith" /> <p>In contrast to traditional brownfield companies, most cloud-native companies have relatively little legacy IT environment to support, and their culture typically celebrates high risk and high reward. As a result, cloud companies typically have little if any technical aversion to trying state-of-the art technologies and architectures. In addition, many cloud-native companies are more concerned with growth than with profitability, so they also tend to not have any significant financial aversion to trying new things.</p> <p>Henthorn-Iwane said that while cloud-native companies are typically more aggressive about implementing new technologies that doesn’t mean that they don’t have significant concerns about the technologies they use. One example he gave was that while some cloud-native companies utilize public cloud providers such as AWS, many don’t. 
Typically, the ones that don’t are concerned about performance, and as a result they feel that they need to build and manage their own data centers.</p> <h4 id="data-without-silos">Data Without Silos</h4> <p>As our discussion turned toward Kentik’s products, Henthorn-Iwane noted that one of the most important trends in IT is the growing adoption of DevOps, and that the success of DevOps requires that developers have good feedback from the operational components of IT. In general there has been a lot of operational feedback available to developers and others from servers and applications, but notably less from the network. As described by Henthorn-Iwane, Kentik’s solution fills this gap, identifying, for example, how much latency various services experience and how much traffic they generate.</p> <div class="pullquote left">Kentik Detect ingests, unifies, and retains massive amounts of management detail.</div> <p>Many traditional management tools keep only summarized network traffic data, discarding the raw details due to compute and storage constraints. In contrast, Kentik takes a unique approach — big data on a distributed back end — that enables Kentik Detect to ingest and retain massive amounts of detailed management data, and that also enables multiple types of data to be unified into a single time-series datastore for effective analysis. This architecture allows planners, developers, and operations personnel to interactively ask questions and get answers in seconds. In addition, because the Kentik solution is cloud-based and supports multi-tenancy, it can respond securely to an extremely large volume of queries.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6VczLi3Zm0iG4AYOaU4K64/445645ac82a168d85447c9ce49c3911a/Silo_demo-400w.png" class="image right" style="max-width: 400px; margin-top: 15px;" alt="Silo" /> <p>Another Kentik advantage is the integration of management data into a single solution that operates across varying management tasks. The traditional approach to management was built around using separate tools to perform each task, which resulted in management data being stored in disparate data silos. Unfortunately, a given data silo is only accessible by the tool that created it, and in practice is only accessed by the person or team that is responsible for that tool. As a result, data silos create analysis silos. The Kentik solution creates and leverages a unified data set that enables more cogent, data-driven analytics for planning, security, and operational support.</p> <h4 id="apis-for-full-integration">APIs for Full Integration</h4> <p>Our final topic was the role of APIs in a management tool. Henthorn-Iwane said that APIs are an integral part of the Kentik solution and that they enable all of the appropriate IT teams to have cloud-like access to all of the management data. This is in contrast to management tools that were created prior to the cloud era, in which users interact solely with the UI while APIs are intended only for proprietary interaction between a given tool’s internal components. To overcome this limitation, many of these pre-cloud tools are stuck with a tacked-on API that is not engineered to provide deep access to all features and functions. 
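What does that cloud-like, API-first access look like in practice? Here’s a minimal sketch — the endpoint, token, and field names are illustrative placeholders of our own, not Kentik’s documented API — of the same question a user might build in a UI, asked as a single programmatic call:</p> <pre><code>import requests

# Hypothetical API-first query: the endpoint, token, and field names below
# are illustrative stand-ins, not a vendor's documented API.
API_URL = "https://api.example.com/v1/query"

query = {
    "metric": "bits_per_second",
    "group_by": ["dst_as"],
    "time_range": {"last": "1h"},
    "filters": [{"field": "site", "op": "=", "value": "sfo1"}],
}

resp = requests.post(
    API_URL,
    json=query,
    headers={"Authorization": "Bearer EXAMPLE_TOKEN"},
    timeout=30,
)
resp.raise_for_status()

# The same data the UI would chart, now available to any script or tool.
for row in resp.json().get("rows", []):
    print(row["dst_as"], row["bits_per_second"])</code></pre> <p>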
One impact of such limited APIs is that management tools from the pre-cloud era typically don’t integrate fully with cloud-era tools.</p> <p>As I wrap up this three-part series on the impact of cloud computing on network management tools, it’s clear to me that network managers are dealing with a sea change, not only in the technology of IT but also in the IT industry’s mindset when it comes to the cloud. Their response to this change will determine how effectively they can perform their roles. If Kentik is any guide, cloud-based and cloud-oriented approaches to network management can be a major step forward in reorienting what and how network managers are able to deliver to the business.</p><![CDATA[Kentik Hackathon!]]><![CDATA[The range, creativity, and skills of the Kentik engineering team were on full display at our recent on-site for local and remote engineers. Gathering to confer on coming product enhancements, the team also enjoyed San Francisco dining and tried recreational welding and blacksmithing. But the highlight was our first-ever hackathon, which yielded an array of smart ideas for extending the capabilities of the Kentik platform.]]>https://www.kentik.com/blog/kentik-hackathonhttps://www.kentik.com/blog/kentik-hackathon<![CDATA[Alex Henthorn-Iwane]]>Mon, 13 Feb 2017 14:00:21 GMT<p><strong>Versatile Platform Inspires New Angles on Network Visibility</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/4kWPSOa0rSosIqcaQOw0OO/f9135cd6884c0d25ce83578a0fb48af1/Welding-500w.gif" alt="Welding-500w.gif" class="image right" style="max-width: 300px;" /> <p>What does a top-notch engineering team do for fun? At a recent on-site in our San Francisco headquarters, Kentik engineers, both local and remote, came together to confer on work in progress, discuss technical topics, and plan for the year ahead. But it wasn’t all business; there were also nights out at San Francisco’s famed eateries and team activities including recreational welding and blacksmithing (you never know when you’ll need to shoe a horse). Perhaps the most interesting of these events was a hackathon that kicked off after work one evening and concluded with participants presenting their efforts the following afternoon. Some were mostly for fun and others were more practical, but all of our hackathon projects illustrated in one way or another the range, creativity, and skills of our engineering team. Let’s take a look at a few examples…</p> <p><strong>CDN Traffic Density-O-Meter</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/47Lq7o2jxKgGoOmYGMiW08/cb020d169a4b9da29c7ce74b4e454a57/Density_meter-300w.png" alt="Density_meter-300w.png" class="image right" style="max-width: 300px;" /> <p>Large service providers are particularly interested in knowing how much of their transit traffic is actually CDN traffic. The problem is that since CDN servers are hosted in many ISP domains, under the ASNs of those ISPs, it’s not possible to identify CDN traffic strictly from source ASN. This project took a feed of DNS data and performed a streaming analysis of flow data. Using source-AS or DNS matching, it was possible to detect the percentage of traffic in a given network that was from CDNs as well as the cumulative (overlapping) percentage.</p> <p><strong>Geo-Mapping Projects</strong></p> <p>We had two engineers who created nice visualizations utilizing data from our API. 
This first one is a 2D geo-visualizer that feeds snapshot data into a WebGL library wrapped in React.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5p1oTq1VzGcUwW4o4CsY48/82145cb6bab1cf16c6b9350b79446479/Geo_mapping1-816w.png" alt="Geo_mapping1-816w.png" class="image right" style="max-width: 300px;" /> <p>A second geo-mapping project, this one in 3D, utilized the three.js library, rendering its rotating globe at 60 frames per second.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1zbeNAPtNeeCaAwwO22cE6/61fadd798acce854d6baf2bb744d6926/Geo_mapping2-810w.png" alt="Geo_mapping2-810w.png" class="image right" style="max-width: 300px;" /> <p><strong>“Kentik, How Much Traffic Am I Sending to…?”</strong></p> <p>This one is pretty hard to illustrate without a sound recording (oops!), but the idea was to integrate Amazon’s Alexa voice control with the Kentik portal UI, allowing the operator to pose simple, voice-based queries to Kentik Detect. The project was primarily scoped around answering the simple question “How much traffic am I sending to <em>[destination AS name]</em>?” It actually worked, and it was pretty entertaining.</p> <p><strong>Sensor Data to KDE</strong></p> <p>Another fun project utilized kFlow (Kentik’s internal flow-data protocol) to send measurements from an Intel Arduino board and GPIO-connected temperature sensor to the Kentik Data Engine (KDE), our distributed big data backend. The data was used to trigger alarms that were defined in alert policies in our alerting system. There wasn’t time in the hackathon to modify the UI, so, as shown in the following graph, humidity was mapped to the protocol field while the temperature (divided by 100) was mapped to the Packets Per Second (PPS) metric.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1y3ffb0h3KWiecoocmgUU0/aaee3bdf54a25d6c4a6c3222719784e5/Sensor_data-836w.png" alt="Sensor_data-836w.png" class="image right" style="max-width: 300px;" /> <p><strong>Getting the Most out of Docker for Mac</strong></p> <p>One of our intrepid backend devs focused on a practical use of the native Docker application for Mac. The challenge is to keep anything in the local environment from affecting the build when invoking the build scripts from the GitHub repository. The answer is to ensure that everything happens in a Docker container created from pre-built images. When you control the whole stack this way, you can run Docker on any distro and the build works the same no matter what. The project demonstrated running a seamless build process on Docker for Mac.</p> <p><strong>A Probe Prototype</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/4iFvOjNggMKgQ4AwqSm6wG/ee20d342cd929cc1accb49b2f02db1ea/Probe_prototype-400w.png" alt="Probe_prototype-400w.png" class="image right" style="max-width: 300px;" /> <p>In this project, the engineer created prototype code in Rust for a packet capture probe that reads directly from the Berkeley Packet Filter (BPF) on Linux and performs protocol decodes, such as the SQL query statement and timing shown at right.</p> <p><strong>Lighting Up Alerts</strong></p> <p>Another project involved triggering different-colored LED lighting modes from alarms that were defined in alert policies in our alerting system. 
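The general shape of such a hook is simple; here’s a hedged sketch of one way it might look — the payload fields and the LED call are stand-ins of our own, not the project’s actual code:</p> <pre><code>import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Toy webhook receiver that maps alarm severity to an LED color.
# Payload fields and set_led_color() are illustrative stand-ins.
SEVERITY_COLORS = {"critical": "red", "major": "orange", "minor": "yellow"}

def set_led_color(color):
    print("LED ->", color)  # replace with real GPIO / LED-strip control

class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        alarm = json.loads(self.rfile.read(length) or b"{}")
        set_led_color(SEVERITY_COLORS.get(alarm.get("severity"), "green"))
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), AlarmHandler).serve_forever()</code></pre> <p>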
Simple… and fun stuff!</p> <img src="//images.ctfassets.net/6yom6slo28h2/3V89K8HXzi2MYg264QaEkm/098a251e778c6ddf0673a8136201735c/Alert_lighting-836w.png" alt="Alert_lighting-836w.png" class="image right" style="max-width: 300px;" /> <p><strong>Route Traffic Density Calculator</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/49sWYBawjC4wYQGO4A2SGa/7c4a5d88e525aea80f10bdb1fd976e86/Route_density-405w.png" alt="Route_density-405w.png" class="image right" style="max-width: 300px;" /> <p>For Web enterprises operating large-scale Internet edges, the cost of a single, name-brand edge router can be $1M or more. It takes a lot of $24.99/month subscribers to make that pay. So a major area of interest is exploring the use of lower-cost edge routers or even white box-based solutions. The challenge is that lower-cost routers have dramatically less FIB capacity than the big-brand boxes. So the key question is whether you can handle traffic for the vast majority of your customers with a much more limited FIB. If you can manage sending most of your (important) traffic within that envelope, you can use a default route to handle what’s left over.</p> <p>In this project, traffic was grouped by tranches of routes and analyzed to determine what percentage of traffic is handled by how many prefixes, giving practical insight into which prefixes to include in the FIBs of lower-cost routers. This project can be a starting point for network teams to explore how to maintain service quality and user experience for apps via their Internet edge while reducing CapEx.</p> <p><strong>Come join us!</strong></p> <p>Given the fairly short time frame, our first hackathon generated a lot of fun and valuable projects. Some projects can be put into action internally, and some can even be used to a limited degree in their raw form to help customers figure out answers to business-relevant questions. So it was a worthy capper to our engineering conference. We’re looking forward to more hackathons in the year ahead as both the product and the engineering team grow rapidly. What about you? Would you like to help build the industry’s most powerful network analytics engine? We’re <a href="https://www.kentik.com/careers/">hiring</a>!</p><![CDATA[Cisco’s Acquisition of AppDynamics]]><![CDATA[Cisco's late-January acquisition of AppDynamics confirms what was already evident from Kentik's experience in 2016, which is that effective visibility is now recognized industry-wide as a critical requirement for success. AppDynamics provides APM, the full value of which can't be realized without the modern NPM offered by Kentik Detect. In this post we look at how Kentik uniquely complements APM to provide a comprehensive visibility solution.]]>https://www.kentik.com/blog/ciscos-acquisition-of-appdynamicshttps://www.kentik.com/blog/ciscos-acquisition-of-appdynamics<![CDATA[Avi Freedman]]>Mon, 06 Feb 2017 14:00:49 GMT<h3 id="why-industry-validation-of-apm-makes-us-appy">Why Industry Validation of APM Makes Us ‘Appy</h3> <img src="//images.ctfassets.net/6yom6slo28h2/2m0FrUzD006iuKYqikC82o/e6749e14296d9d42e2f570947063c974/Be_appy-500w.png" alt="Be_appy-500w.png" class="image right" style="max-width: 300px;" /> <p>Six months ago Cisco made a big-time splash in the network analytics pool with the introduction of <a href="https://www.kentik.com/welcoming-cisco-to-the-2016-analytics-party">Tetration Analytics</a>. 
Now, with the late-January announcement of its $3.7 billion acquisition of AppDynamics, Cisco has made another blockbuster analytics-related move. On top of feeling happy for our friends at AppDynamics, we’re also very excited about what this means for our space! Coming off of a great 2016 at Kentik, we’re more confident than ever that Web, enterprise, and service-provider organizations are deeply hungry for the kind of network traffic intelligence offered by Kentik Detect. And it’s obvious that the major players have picked up on this hunger.</p> <h3 id="without-modern-npm-the-full-value-of-apm-remains-unrealized">Without modern NPM, the full value of APM remains unrealized.</h3> <p>Our excitement is based on more than just general good vibes in the industry. AppDynamics performs application performance management (APM), and it’s widely acknowledged that the full value of APM can’t be realized unless it’s paired with a modern take on <a href="https://www.kentik.com/resources/gartner-market-guide-npmd-2020/" title="Read Gartner&#x27;s 2020 Market Guide for Network Performance Monitoring and Diagnostics">Network Performance Management (NPM)</a>. Industry analysts at EMA have found that the top two root causes of difficult-to-troubleshoot IT/app performance issues are network config changes and network performance/congestion. Unfortunately, legacy NetFlow analysis tools and NPM appliances are siloed, they can’t store enough traffic data, and they don’t deploy well in cloud and distributed app environments. As a result, traditional NPM tools don’t provide the visibility that engineers need to determine if problems originate in the network and, if so, what the root cause is.</p> <h3 id="thats-where-kentik-comes-in">That’s where Kentik comes in.</h3> <p>Network infrastructure generates massive volumes of valuable metadata about the traffic across your network, including flow records (e.g. NetFlow, sFlow, IPFIX), BGP, GeoIP, and SNMP. Kentik Detect unlocks the value of that data by collecting it into a unified time-series database, making it available in real time for anomaly detection and alerting, and storing it unsummarized for months to enable fast, super-powerful analytics. With our cloud-, container-, and microservices-friendly NPM host agent, key network performance metrics are pervasively and contextually measured from packet capture and integrated into the time series with data from the rest of your infrastructure.</p> <img src="//images.ctfassets.net/6yom6slo28h2/BRtbUqHb2gOkUQMAGAU0K/904f41e2d354da58f50d33f8f0802ee6/Eureka-500w.png" alt="Eureka-500w.png" class="image right" style="max-width: 300px;" /> <p>Kentik Detect also has the industry’s most accurate anomaly detection system, which baselines large, adaptive sets of individual host IPs keyed to custom multi-dimensional criteria. The system can trigger alarms based on a nearly infinite variety of conditions involving metrics such as performance, traffic, or even unique source and destination IP. Engineers who subscribe to notifications about anomalies and threats can click through to contextual dashboard views and rapidly drill down in a full, ad-hoc data explorer. They can then pivot metrics and analytical dimensions to explore performance, traffic congestion, and even DDoS attacks. 
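As a toy illustration of what multi-dimensional keying buys you — our sketch, not Kentik Detect’s actual engine — consider tracking unique source IPs per destination and protocol, since a sudden jump in distinct sources aimed at one host is a classic DDoS signal:</p> <pre><code>from collections import defaultdict

UNIQUE_SRC_THRESHOLD = 5000   # hypothetical per-window trigger

# (dst_ip, protocol) -> set of source IPs seen in the current window
window = defaultdict(set)

def ingest(flow):
    window[(flow["dst_ip"], flow["proto"])].add(flow["src_ip"])

def end_of_window():
    """Return the multi-dimensional keys that tripped the threshold."""
    alarms = [key for key, srcs in window.items()
              if len(srcs) > UNIQUE_SRC_THRESHOLD]
    window.clear()
    return alarms

ingest({"src_ip": "198.51.100.7", "dst_ip": "203.0.113.10", "proto": "udp"})</code></pre> <p>Swap in any other dimension pair — interface, geography, BGP path — and the same per-key bookkeeping applies; the hard part, and the reason big data matters, is doing it adaptively for millions of keys at once.</p> <p>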
There’s gold in your network data, and Kentik Detect uncovers it.</p> <h3 id="apm-needs-npm-npm-needs-kentik">APM Needs NPM, NPM Needs Kentik</h3> <p>When Cisco invests in APM we see a Kentik-sized hole in their product line-up.</p> <p>Through our customers we know a great deal about the demand for effective, comprehensive network visibility — and how the promise of APM can’t be fully realized without it. So when we see Cisco investing in an APM vendor we see a big, Kentik-sized hole in their product line-up (and that of many other vendors to boot).</p> <p>Speaking of customer demand, we’re busy meeting the market’s need for highly scalable, granular, real-time network visibility, which Kentik can uniquely provide. Among our highlights in 2016:</p> <ul> <li>Our annual recurring revenue grew nearly 6x.</li> <li>Our innovative big data SaaS earned recognition from Gartner, IDC, and Forrester.</li> <li>We landed a top-five cloud provider, multiple Tier 1 ISPs, a globally-known telecom provider in Asia, and major CDN providers.</li> <li>Other new customers include SMBs, Web enterprise giants, financial services companies, institutions of higher learning, and other kinds of enterprises and organizations.</li> </ul> <p>Looking ahead, 2017 is also promising to be very exciting for us as we grow our team and our business rapidly. Yes, we’re (h)appy, not just because of the AppDynamics acquisition, but because we’re able to fill a real need for customers, the same need that we felt ourselves when we were running networks.</p> <p>Are you ready to see how Kentik can provide you with better network visibility? <a href="#signup_dialog">Start a free trial</a> today and you can be up and running in just fifteen minutes.</p><![CDATA[Arbocalypse Now: Saving Yourself from DDoS Appliance EOL]]><![CDATA[With end-of-life coming this summer for Peakflow 7.5, many Arbor Networks customers face another round of costly upgrades to older, soon-to-be-unsupported appliance hardware. In this post we look at how Arbor users pay coming and going to continue with appliance-based DDoS detection, and we consider whether that makes sense given the availability of a big data SaaS like Kentik Detect.]]>https://www.kentik.com/blog/arbocalypse-now-saving-yourself-from-ddos-appliance-eolhttps://www.kentik.com/blog/arbocalypse-now-saving-yourself-from-ddos-appliance-eol<![CDATA[Alex Henthorn-Iwane]]>Mon, 30 Jan 2017 14:00:10 GMT<h2 id="choosing-a-future-proof-ddos-detection-architecture">Choosing a Future-proof DDoS Detection Architecture</h2> <img src="//images.ctfassets.net/6yom6slo28h2/6jywYD8UEMmM8yQ8way4KY/74f834ab9dbb93bc9c26e78f01068bbc/Cliff_sign-500w.png" alt="Cliff_sign-500w.png " class="image right" style="max-width: 300px;" /> <p>One of our core beliefs at Kentik is that network analytics should be cloud-scale and cloud-based so that you’re not limited — today or tomorrow — by hardware infrastructure. By taking a big data SaaS approach to network analytics and DDoS detection, Kentik provides a distributed solution that scales with your traffic. It also has the critical advantage of protecting you from the continuous cycle of OS and appliance hardware EOLs and upgrades, which often are packaged with unwelcome licensing and vendor monetization surprises. In contrast, many Arbor Networks customers are facing an upcoming EOL and upgrade cycle with Peakflow version 7.5. 
Let’s look into how this will work out for Arbor customers so that we can get a feel for the pros and cons of the two approaches.</p> <h3 id="the-coming-peakflow-75-cliff">The Coming Peakflow 7.5 Cliff</h3> <p>Historically, Arbor, like many appliance vendors, has offered a few generations of appliance hardware. You get to upgrade the Peakflow software on your appliance for a while, but eventually Arbor releases a version that isn’t supported on your older hardware. At that point you are stuck with a software version that’s gone EOL. If you want to maintain support, you’re forced into a hardware upgrade, which is when the monetization magic happens (for the appliance manufacturer, that is). The product licenses are attached to the appliance, so you have to re-buy the entire product. Of course, existing customers get a substantial discount, but basically it means that customers face at least a partial repayment cliff every few years. Not fun.</p> <p>This is the situation that Peakflow 7.5.x customers will be in this coming summer. Just two years after the software’s release, Arbor will be ending support. And much of the installed base of hardware running 7.5 hasn’t been qualified for an upgrade to Peakflow 8.0. If you’re one of the unlucky owners of excluded appliances, be prepared to dig deep into your pockets for a hardware upgrade.</p> <img src="//images.ctfassets.net/6yom6slo28h2/36GLqsrz0QiUCmQoUMIyQc/af98608793464efe7ee50aaef73a04b1/Sales_pitch-400w.png" alt="Sales_pitch-400w.png " class="image right" style="max-width: 300px;" /> <h3 id="how-much-will-you-pay-more">How Much Will You Pay? More!</h3> <p>When Arbor began supporting VM-based appliances, they realized they were going to lose the ability to force upgrades based on “out-of-date hardware.” So they introduced “Flex Licensing,” which decouples the flow source licenses from the hardware. Flex Licensing includes an upfront perpetual license capex cost PLUS annual software subscription (for new versions) PLUS annual support and maintenance. The “software subscription” component is designed to recapture the revenue lost from the decline in forced hardware upgrades.</p> <p>If you’re an Arbor customer and haven’t already converted to Flex Licensing, it’s likely that you’ll be required to do so — on top of buying new hardware appliances — in order to upgrade to 8.0 and remain under support. Of course, you can convert to a VM-based deployment, but one way or another you’re still paying for the infrastructure.</p> <h3 id="good-news-theres-an-add-on">Good News! There’s an Add-On!</h3> <p>As a side note, Arbor recently announced a big data add-on to Peakflow called SP Insight, which is built on Druid open source software. I emphasize “add-on,” because you’re required to keep using Peakflow appliances, in addition to which you add your own infrastructure to run SP Insight. It’ll require you to buy yet another license, and to maintain an enterprise software deployment.</p> <p>But no biggie. After all, you’re already hooked on those expensive and continuously EOL’d appliances, so what’s a major enterprise software investment to boot? You’ll get to enjoy the process of configuring all of that stuff, and thrill at the experience of jumping between the appliance and the UI for the big data back-end, all the while devoting valuable resources to a bespoke big data compute and storage infrastructure. And just when you’re getting the whole thing settled into your daily workflow you’ll encounter the infinite pleasures of separate and likely asynchronous software upgrades. 
Gosh, you’ll be so busy chasing the dream that you won’t have time to realize that…</p> <img src="//images.ctfassets.net/6yom6slo28h2/4PpJAGnKg8cQIqyouaaE0S/5de07145593236bcfa05e4c00c243f4b/Its_a_trap-407w.jpg" alt="Its_a_trap-407w.jpg " class="image right" style="max-width: 300px;" /> <h3 id="escape-the-dizzying-cycle">Escape the Dizzying Cycle</h3> <p>Or, you could stop the madness and escape this carousel of EOLs, appliance upgrades, fragmented solutions, enterprise software, hosting maintenance, and licensing cliffs. How? By switching to a far more powerful DDoS detection solution based on big data and deployed in a cost-effective, zero-maintenance SaaS.</p> <p>With Kentik, you get it all. No appliances, enterprise software, or infrastructure maintenance burden. You pay a straight subscription fee based on how many of your network devices (routers, switches, hosts) are sending flow to Kentik Detect. Not only do you get DDoS detection that’s field-proven to catch 30% more attacks than traditional appliances, you can also trigger multiple DDoS mitigation methods, including RTBH and third-party integrations from an expanding list of leading vendors such as Radware (DefensePro) and A10 (Thunder TPS). Plus you’ll have raw traffic data that’s available in real time and retained for months (90 days standard), enabling ultra-granular, super-fast ad-hoc traffic analysis, network performance monitoring, and peering and transit analytics.</p> <p>Ready to see the power of big data DDoS protection and the world’s deepest, fastest network traffic analytics? Sign up for a <a href="#signup_dialog">free trial</a> on our SaaS; turn on a flow exporter and you can be analyzing your own data in fifteen minutes. Or <a href="#demo_dialog">schedule a demo</a> so we can give you a step-by-step walk-through of our solution.</p><![CDATA[Culture War: How Network vs. Cloud Impacts Tools]]><![CDATA[As cloud computing continues to gain ground, there's a natural tension in IT between cloud advocates and those who prefer the status quo of in-house networking. In part two of his three-part series on this "culture war," analyst Jim Metzler clarifies what is — and is not — involved in the transition to the cloud, and how the adoption of cloud computing impacts the way that network organizations should think about the management tools they use.]]>https://www.kentik.com/blog/culture-war-how-network-vs-cloud-impacts-toolshttps://www.kentik.com/blog/culture-war-how-network-vs-cloud-impacts-tools<![CDATA[Jim Metzler]]>Mon, 23 Jan 2017 14:00:43 GMT<p>What Cloud Technology Means for Network Management</p> <img src="//images.ctfassets.net/6yom6slo28h2/4S1NA0mem4262iKSwGyeAU/ae632be3046f221ccf8e8a544228e462/Greek_fire-500w.png" alt="Greek_fire-500w.png" class="image right" style="max-width: 300px;" /> <p>In my first post in this series I discussed the burgeoning culture war between people and organizations that are cloud-centric and those that are network-centric. As I described, every so often a shift in technology is so fundamental that it both impacts and is impacted by the culture of the IT organization. I also pointed out that the organizations that are able to maximize the potential benefits of these shifts are the ones that come to grips with the cultural impact that these shifts create. 
In this post I’ll discuss how the adoption of cloud computing impacts the way that network organizations should think about the management tools they use.</p> <p>Any change as fundamental as the shift to cloud computing is accompanied by lots of hyperbole and statements that might come true in the long run but have virtually no impact in the short term. An example of that phenomenon is the attention currently being paid to NoOps (no operations). A recent blog post[1] pointed out that NoOps was being driven by the adoption of cloud computing and quoted Forrester’s definition of NoOps as “the goal of completely automating the deployment, monitoring, and management of applications and the infrastructure on which they run.” Looked at this way, it is easy to see that NoOps is one possible end-state for the ongoing adoption of DevOps. It is also quite obvious that NoOps is an aspirational goal that will not have any significant impact on the choice of management tools for the foreseeable future.</p> <p>Cloud computing doesn’t mean relinquishing all in-house resources.</p> <p>To analyze how the adoption of cloud computing impacts management tools in the near term, we need to establish what that adoption does and does not mean. One thing it doesn’t mean is a transition within a very brief period of time from the traditional model in which all IT resources are provided on site to a new model in which those resources are instead provided via the cloud. Over the last several years we’ve seen growing adoption of public cloud applications and services, but most large enterprises still provide the majority of their applications in house. In addition, whether driven by concern over security or performance, some applications will continue to be provided on site for the foreseeable future.</p> <p>Another thing that the adoption of cloud computing doesn’t mean is that cloud-based delivery will become the end-state for all applications. Choosing to run a given application over the public cloud doesn’t lock you into always running that application in the cloud. Some companies find it advantageous to get started with running an application in the public cloud but to later bring that application on site due to economic, security, or performance considerations. The interest that companies have in being able to move applications back and forth was highlighted in the recent announcement made by VMware and Amazon[2].</p> <p>Management tools must perform well regardless of where applications are housed.</p> <p>The fact that application deployment will be diverse and dynamic — some will run on site, some will run in the cloud, and some will go back and forth — means that IT organizations must have management tools that are designed to perform well regardless of where the application is housed. Other key characteristics that IT organizations should look for in a management tool include:</p> <ul> <li>Easily deployed;</li> <li>Provides access to all available management data;</li> <li>Enables access by a wide range of users with differing access rights;</li> <li>Capable of sharing data programmatically;</li> <li>Economically feasible.</li> </ul> <p>The above characteristics are very similar to the factors that are driving IT organizations to make increasing use of cloud services. It follows that IT organizations should give strong consideration to acquiring SaaS-based management tools. 
One of the many advantages of these tools is that they afford IT organizations the ability to try a tool with a minimum of overhead.</p> <p>In my next post I’ll explore how Kentik’s approach to network management enables IT organizations to respond to the changes brought about by the adoption of cloud computing.</p> <hr> <p>[1] <a href="http://searchcloudapplications.techtarget.com/definition/noops">http://searchcloudapplications.techtarget.com/definition/noops</a><br> [2] <a href="https://blogs.vmware.com/vsphere/2016/10/vmware-aws-announce-strategic-partnership.html">https://blogs.vmware.com/vsphere/2016/10/vmware-aws-announce-strategic-partnership.html</a></p><![CDATA[Partnering with A10 Networks for DDoS Defense]]><![CDATA[While Kentik Detect's ground-breaking DDoS detection is field-proven to catch 30% more attacks than legacy systems, our DDoS capabilities aren't limited to standalone detection. We're also actively working with leading mitigation providers to create end-to-end DDoS protection solutions. So we're excited to be partnering with A10 Networks, whose products help defend some of the largest networks in the world, to enable seamless integration of Kentik Detect with A10 Thunder TPS mitigation.]]>https://www.kentik.com/blog/partnering-with-a10-networks-for-ddos-defensehttps://www.kentik.com/blog/partnering-with-a10-networks-for-ddos-defense<![CDATA[Jim Frey]]>Tue, 17 Jan 2017 14:00:01 GMT<h3 id="kentik-integration-with-thunder-tps-pairs-detection-with-mitigation"><em>Kentik Integration with Thunder TPS Pairs Detection with Mitigation</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/uR9KMOenuKOsCeQWYUGCu/9cdb01635291bfba306b3ef24bf1611e/A10_logo-401w.png" alt="A10_logo-401w.png" class="image right no-shadow" style="max-width: 250px;" /> <p>Late 2016 was a busy time for DDoS on multiple fronts. As an industry we experienced unprecedented DDoS attacks on OVH, Krebs, and Dyn that were driven by IoT botnets. As a company, Kentik announced ground-breaking DDoS detection capabilities in Kentik Detect. Drawing on our big data scale and our learning algorithms for baselining, we’ve now proven in the field that we can catch significantly more attacks than traditional approaches. Furthermore, we’ve enabled Kentik Detect to trigger hybrid mitigation, including support for “free” methods like RTBH as well as for integrations with industry-leading mitigation solutions.</p> <p>All of which brings me to the topic of today’s blog post. We’ve previously made public reference to the fact that Kentik has been working with A10 Networks, and today we are proud to formally announce both the partnership and our field-ready integration between Kentik Detect and the A10 Thunder TPS mitigation solution.</p> <div class="pullquote left">A10 solutions help protect some of the world's largest networks.</div> <p>A10 Networks is a networking industry leader, and their series of application networking, load balancing, and DDoS protection solutions accelerate and secure the applications and networks of the world’s largest enterprises, service providers, and cloud platforms. A10’s marquee customers include Yahoo, Box, GoDaddy, Ericsson, and Subaru.</p> <p>The fact that we had a number of customers in common meant that we started seeing A10 Networks regularly in client accounts. Eventually, a mutual customer asked for deeper integration. The result was a project to connect Kentik’s industry-leading DDoS detection and automated notification/triggering with A10’s advanced DDoS mitigation capabilities. 
Working with A10, we’re now able to support some exciting functionality, like triggering Thunder TPS’ dynamic mitigation, which escalates suspect traffic through progressively tougher countermeasures to minimize impact on legitimate traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3WM7UoLH9uWWEy0UsqgGWu/a8f40d84550f0d5c096ff28ebea7576b/create_mitigation-800w.png" alt="create_mitigation-800w.png" class="image center" style="max-width: 800px;" /> <p><strong>The Need Has Never Been Greater</strong></p> <p>Kentik’s integration with A10 Networks couldn’t come at a better time, because the need for DDoS protection is rising along with the threat posed by IoT botnets. The recent attacks on OVH, Krebs, and Dyn were (yet another) wake-up call to the industry, reminding us that DDoS-induced outages can have a major negative impact on any targeted business. An <a href="https://www.a10networks.com/sites/default/files/A10-MS-23175-EN.pdf">IDGConnect survey</a> in 2016 found that the average downtime due to DDoS attacks is 17 hours. Another <a href="https://www.networkworld.com/article/3064677/security/hit-by-ddos-you-will-likely-be-struck-again.html">article in Network World</a> states that “the loss numbers are big, too. Half of the organizations would lose at least $100,000 per hour in a peak-time DDoS-related outage, [and] 33% would lose more than $250,000 per hour.”</p> <p>There is a very real need for the industry at large to take a greater degree of ownership over the cyber-security hygiene of IoT devices and internet infrastructure. But the genie is already out of the bottle, so it’s also up to individual digital businesses to take responsibility for their own defense. The partnership between Kentik and A10 Networks will provide a far more accurate and powerful defensive solution than legacy tools and technologies developed for the Internet of a prior generation. Gunter Reiss, Vice President of Strategic Alliances at A10 Networks, puts it this way:</p> <p><em>“Recent studies have reported that DDoS attacks are costing companies more and more money. Companies must prepare their networks to stand up to DDoS attacks. We are delighted about our partnership with Kentik, as the Kentik Detect solution augments A10 Networks Thunder TPS appliances and ecosystem, providing massive scale and speed for multi-vector DDoS protection. Together we provide additional detection capabilities for our customers, alleviating the burden on the network and security IT staff.”</em></p> <p><strong>Effective Protection Starts with Detection</strong></p> <p>At Kentik, teaming with A10 is right in line with our ongoing focus on the front half of the end-to-end DDoS protection equation: accurate attack detection paired with fast, flexible notification and triggering. DDoS detection has always been a key use case for customers who deploy Kentik Detect. For example, Christian Balorre from Dailymotion mentions DDoS as a driving factor in the company’s move to Kentik:</p> <p><em>“When you’re handling several hundred gigabits of network traffic each second, effective DDoS detection and traffic engineering insights are critical. At Dailymotion we’ve been looking to replace our open source tools with a solution that can handle our volume and is fast, flexible, scalable, and easy to use. 
After testing several market-leading systems we chose Kentik Detect, which provides us with the metrics and insights we need to understand everything about our traffic.”</em></p> <p>We’re pumped up about sharing the powerful new “one-two” punch of Kentik+A10, which will help organizations of all sizes achieve the higher degree of DDoS protection that 2017 is going to demand. If you’d like to find out more about how Kentik Detect integrates with A10 Networks Thunder TPS solutions, check out the <a href="https://info.kentik.com/rs/869-PAD-887/images/A10-SB-19171-EN-01.pdf">joint solution brief</a> or <a href="#demo_dialog">schedule a demo</a>. Or you can experience Kentik Detect for yourself by signing up online for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Using NetFlow Analysis to Optimize IP Transit]]><![CDATA[Unless you're a Tier 1 provider, IP transit is a significant cost of providing Internet service or operating a digital business. To minimize the pain, your network monitoring tools would ideally show you historical route utilization and notify you before the traffic volume on any path triggers added fees. In this post we look at how Kentik Detect is able to do just that, and we show how our Data Explorer is used to drill down on the details of route utilization.]]>https://www.kentik.com/blog/using-netflow-analysis-to-optimize-ip-transithttps://www.kentik.com/blog/using-netflow-analysis-to-optimize-ip-transit<![CDATA[Alex Henthorn-Iwane]]>Mon, 09 Jan 2017 16:29:10 GMT<h3 id="how-bgp-enabled-visibility-cuts-transit-costs"><em>How BGP-enabled Visibility Cuts Transit Costs</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/3QZBgiToBqEQYQCekwegaK/2e257577dab61288728474ec5b26077f/Internet_connectivity-500w.png" alt="Internet_connectivity-500w.png" class="image right" style="max-width: 500px; margin-top: 15px;" /> <p>Whoever said “there’s no such thing as a free lunch” might just as well have been speaking about the Internet. Getting traffic from place to place has a cost, and one way or another, someone has to pay. For the biggest of the big — the “Tier 1” service providers like AT&#x26;T, Verizon, NTT, TeliaSonera, and China Telecom — payment is effectively based on barter. Known in the trade as “settlement-free peering,” the deal is simple: you transport my traffic and I’ll transport yours. But if you’re an ISP without national or international scope, the big boys won’t let you into their select peering club. Instead you’ll be paying with actual money for at least part of your access to the Internet. Needless to say, the less paid for these “IP transit services,” the better. This post will show you how Kentik Detect can help you keep those transit costs to a minimum.</p> <h4 id="a-vale-of-tiers">A Vale of Tiers</h4> <p>As noted above, the largest networks in the Internet (which is a network of networks) are the so-called Tier 1 service providers. All other networks in the Internet are connected, directly or indirectly, to one or another of the Tier 1s, so when the Tier 1s peer with one another they gain settlement-free access to every network on the Internet.</p> <p>Below Tier 1s are the Tier 2 ISPs that are typically regional providers or national ISPs that haven’t joined the Tier 1 club. Tier 2 ISPs utilize a mix of settlement-free peering and IP transit services. Below them are the Tier 3 ISPs, which are typically small operators that provide services — local broadband, telecom, fixed wireless, and cable — to retail subscribers. 
Tier 3s purchase IP transit services, as do some digital businesses that depend on bandwidth to deliver content (e.g. music or gaming) or to maintain availability and transaction performance (e.g. e-commerce sites).</p> <div class="pullquote right">Minimize transit costs by keeping traffic volume within pre-committed ranges.</div> <p>Whichever tier an organization belongs to, the IP transit services it buys are likely based on a Committed Information Rate (CIR), which is commonly metered based on “95th percentile.” Bursting above the CIR is billed extra (according to each provider’s formula), so a key way to minimize transit costs over time is to ensure that traffic volume stays (narrowly) within pre-committed ranges. But how? You need a path-aware traffic analytics system that shows traffic volumes on various routes, both historically, for planning, and in real time to detect transit links that are nearing their CIR so that you can take corrective action.</p> <p>The traditional appliance-based, single-server approach to network monitoring, with its inherent limitations in ingest and compute power, falls short for this application because it’s underpowered for large-scale real-time detection and it also involves discarding most of the data you need for historical analysis. Kentik Detect, on the other hand, offers a scalable big data backend, BGP-enriched flow data, and a dedicated BGP Analytics section, all of which makes it ideally suited to enable your organization to avoid excess costs by optimizing IP transit utilization.</p> <img src="//images.ctfassets.net/6yom6slo28h2/15gbP0mNQU8CsaaiQM644K/f02ddba97e5875cf354ac12e7e2aeab9/Next_hop_sidebar-300w.png" alt="Next_hop_sidebar-300w.png" class="image right" style="max-width: 300px;" /> <h4 id="checking-next-hop-in-data-explorer">Checking Next Hop in Data Explorer</h4> <p>To see how Kentik Detect can help, let’s begin with the Data Explorer section of our portal. Data Explorer can be used to rapidly assess 95th percentile traffic levels over a specified time-range. Let’s assume, for example, that our billing period starts on the 16th of the month. In the screenshot below, we can see how we would set configuration options in the sidebar to assess our utilization in relation to our CIR:</p> <ul> <li>Set the group-by dimension to next hop AS number.</li> <li>For metric, use bits per second.</li> <li>For traffic perspective, choose egress.</li> <li>For display and sort by, choose 95th percentile.</li> <li>Set the time range from the 16th to the present date.</li> <li>In the device selector, choose all relevant flow exporters (17).</li> </ul> <p>As shown in the following graph and table, we can now see a stack-ranked list of 95th percentile traffic volumes by next hop ASN.</p> <img src="//images.ctfassets.net/6yom6slo28h2/3zfbvA1HHGIAWaIWioogsw/bcf229714320289d9dc08529e5799559/Next_hop_chart-800w.png" alt="Next_hop_chart-800w.png" class="image center" style="max-width: 800px;" /> <p>If the traffic volume for all paths is within the committed range, we’re good. However, if the volume on any path is getting close to the corresponding CIR, then we would want to identify destination traffic that could be shifted from a highly used transit AS link to a less utilized one.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4YM8fpF8fS6kcoiSmOQIQM/e296d2a4695328b4a5689d0154810f42/Dimensions-300w.png" alt="Dimensions-300w.png" class="image right" style="max-width: 300px;" /> <p>This analysis is pretty easy to do. 
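Before we drill down, it’s worth making the 95th-percentile arithmetic itself concrete. Here’s a minimal sketch — the sample data and commit levels are made up for illustration, not output from Kentik Detect — that flags any next hop ASN whose 95th-percentile usage is within 10% of its commit:</p> <pre><code>from math import ceil

def p95(samples):
    """95th percentile by the nearest-rank method."""
    ranked = sorted(samples)
    return ranked[ceil(0.95 * len(ranked)) - 1]

# Made-up numbers: committed rate (Gbps) and one sample per 5-minute
# interval for each next hop ASN in the billing period so far.
cirs_gbps = {"AS9121": 9.0, "AS3356": 20.0}
samples_gbps = {
    "AS9121": [7.2, 8.8, 8.9, 9.4, 8.1, 8.6, 8.7, 8.9, 8.5, 8.8],
    "AS3356": [11.0, 12.5, 13.1, 12.2, 11.8, 12.9, 13.4, 12.7, 12.0, 12.3],
}

for asn, cir in cirs_gbps.items():
    usage = p95(samples_gbps[asn])
    if usage > 0.9 * cir:
        print(asn, "p95 of", usage, "Gbps is nearing its", cir, "Gbps commit")</code></pre> <p>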
Based on the above example, let’s say that Turk Telekom is nearing its commit level. We can rapidly create a Data Explorer analysis by adding destination ASN to our group-by dimensions and looking for destination ASN traffic that we might want to reroute.</p> <p>With the added group-by dimension the resulting visualization looks like the following:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3mA7HVGp7aW0SkQcuqCWWI/e9c1de29f37e48d2754675919c6fad91/Next_hopDest_AS-800w.png" alt="Next_hopDest_AS-800w.png" class="image center" style="max-width: 800px;" /> <p>If needed, we could have filtered to look at just Turk Telekom, but it turns out that there are some destination ASN traffic flows that are easily spotted. From here, we can work on route maps to shift that traffic to another exit router and next hop ASN path.</p> <h4 id="automated-traffic-rebalancing">Automated Traffic Rebalancing</h4> <p>One of the great things about Kentik Detect is that everything that you can do in the portal UI can also be done via SQL or REST. So it’s possible to run the above analysis on an automated basis by scripting queries via our APIs and using the top destination AS traffic information to automate router configurations so that traffic is rebalanced on a regular basis.</p> <p>If you’re not quite sure that you need that sort of automation, you can find out by using our built-in alerting system. Configure alerts to look for 95th percentile traffic that is above your CIRs. If you regularly get notifications from these alerts, then it’s likely that the developer resources required to build an automated traffic rebalancing process would provide a solid return on investment.</p> <h4 id="kentik-means-visibility">Kentik Means Visibility</h4> <p>“Kentik” literally means visible in Yiddish (as discovered by Avi Freedman, our co-founder and CEO). Figuratively speaking, Kentik is here to give you new eyes: to provide you with the exact visibility you need to deliver a great network experience. That helps make your user and customer experience the best it can be, while also keeping a lid on your costs. If you’d like to learn more, check out the <a href="https://www.kentik.com/kentik-detect/">Kentik Detect</a> product page and visit our <a href="https://www.kentik.com/visibility-performance/">NetFlow Analysis</a> solutions page.</p><![CDATA[Kentik Earns Forrester Nod as Breakout Vendor]]><![CDATA[Kentik is honored to be the sole network monitoring provider named by Forrester Research as a “Breakout Vendor” in a December 2016 report on the Virtual Network Infrastructure (VNI) space. The report asserts that I&O leaders can dramatically improve customer experience by choosing cloud networking solutions, and cites Kentik Detect as one of four groundbreaking products that are poised to supersede typical networking incumbents.]]>https://www.kentik.com/blog/kentik-earns-forrester-nod-as-breakout-vendorhttps://www.kentik.com/blog/kentik-earns-forrester-nod-as-breakout-vendor<![CDATA[Alex Henthorn-Iwane]]>Tue, 03 Jan 2017 14:00:27 GMT<h3 id="kentik-detect-recognized-for-virtual-network-infrastructure"><em>Kentik Detect Recognized for Virtual Network Infrastructure</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/IIza9bRqKcSc4w66EAM2C/b82099613b4a7fe011ddc813c9cab4ed/logo-vni.png" alt="logo-vni.png" class="image right" style="max-width: 300px;" /> <p>Kentik is honored to have been named by Forrester Research as a breakout vendor in the Virtual Network Infrastructure (VNI) space. 
In fact, Kentik was the sole network monitoring vendor included in the report “Breakout Vendors: Virtual Network Infrastructure — Four Vendors That Can Help Build a VNI.” Published December 6, 2016, the report is authored by Forrester analyst Andre Kindness with Glenn O’Donnell, William McKeon-White, and Diane Lynch.</p> <p>Cloud networking solutions demonstrably improve customer experience.</p> <p>The report starts from the premise that new and innovative networking solutions from hyperscale cloud providers have already been demonstrated to dramatically improve customer experiences. That leads to an imperative for Infrastructure &#x26; Operations (I&#x26;O) leaders who want to win in digital business: work with brand-new product lines instead of falling back on the typical networking incumbents. As an introduction to innovative technologies that are ready now to help virtualize the network, Kindness and his team profiled four vendors, both startup and established, that “should be on your radar as you begin developing virtual network infrastructure.” Along with Kentik, the report includes Arista Networks, Nuage Networks, and VeloCloud.</p> <h4 id="whats-vni">What’s VNI?</h4> <p>What does Forrester mean by VNI? Fundamentally, the term refers to a network built based on Forrester’s five VNI tenets, which can be paraphrased as follows:</p> <ul> <li>Powered by both software and hardware components</li> <li>Provides a business-wide fabric</li> <li>Interweaves layers from L2 through L7</li> <li>Leverages a programmable automation and orchestration system</li> <li>Empowers users and customers of network services</li> </ul> <h4 id="whats-a-breakout-vendor">What’s a Breakout Vendor?</h4> <p>Breakout Vendors ease the shift to virtual infrastructure while minimizing the risks.</p> <p>The Breakout Vendor designation is meant to identify “groundbreaking vendors that have moved away from business as usual and are helping teams virtualize the network.” Also factored in is the extent to which products “help teams make the shift with the least amount of risk.”</p> <h4 id="why-was-kentik-included">Why Was Kentik Included?</h4> <p>Forrester’s inclusion of Kentik was based on an assessment of how Kentik Detect — Kentik’s big data SaaS for network visibility, DDoS protection, performance monitoring, and peering analytics — addresses the following key pain points for I&#x26;O professionals:</p> <p>Kentik addresses key I&#x26;O pain points such as monitoring gaps and siloed tools.</p> <ul> <li>Virtual network monitoring gap: “A lack of monitoring has inhibited I&#x26;O professionals from virtualizing the network.”</li> <li>Fractured, siloed, inefficient monitoring solutions:<br> - Fractured &#x26; siloed: “Most monitoring implementations are limited and confined to certain networking segments.”<br> - Inefficient: “Forrester clients indicate that the complexity and costs hinder organizations from investing in monitoring tools.”</li> </ul> <p>The report assesses each vendor, including Kentik, on five criteria:</p> <ul> <li>Product offering and functionality</li> <li>Use case scenarios supported</li> <li>Maturity of the company and solution</li> <li>Challenges and mitigating factors to those challenges</li> <li>Road map</li> </ul> <h4 id="the-kentik-take">The Kentik Take</h4> <p>The networking industry has been slowly coming to grips with the fact that a lack of cloud-friendly network monitoring is hampering efforts to deliver an optimal customer experience. 
In this context, being cloud-friendly involves two things:</p> <ul> <li>The ability to monitor cloud-based infrastructure and the distributed application components and services that are running in virtual machines and containers across hybrid clouds.</li> <li>The use of cloud-scale resources to unify previously siloed, appliance-based monitoring solutions.</li> </ul> <p>Traditional network monitoring tools were built for the age of relatively monolithic applications run in centralized, internally owned and managed datacenters. While changes in information technology are typically more like a gradual slope than an abrupt cliff, the shift to cloud application and infrastructure models is having an unusually disruptive effect on networking and therefore on network monitoring. So much so that we’re now in the midst of a struggle for influence between cloud and traditional networking approaches, a topic that our guest analyst Jim Metzler wrote about in his recent blog post “<a href="https://www.kentik.com/culture-war-network-vs-cloud/">Culture War: Network vs. Cloud</a>.”</p> <p>Kentik Detect is built from the ground up as a VNI-friendly network monitoring solution.</p> <p>We built Kentik Detect from the ground up as a cloud-friendly — or in Forrester terms, VNI-friendly — approach to network monitoring. Based on real-time capture and long-term retention of unsummarized NetFlow enriched with BGP, GeoIP, and SNMP, we’ve been busy delivering a cloud-scale platform that unifies a variety of functional areas that so far includes:</p> <ul> <li>Network traffic analysis</li> <li>Network performance monitoring</li> <li>Advanced detection of DDoS and network anomalies</li> <li>Sophisticated automation of attack mitigation</li> <li>BGP-based routing/peering analytics</li> </ul> <p>Needless to say, we’re pleased to have earned Forrester’s recognition, and we’re eager to spread the news. So we’ve arranged to make the report available for you to <a href="https://reprints.forrester.com/#/assets/2/584/&#x27;RES136801&#x27;/reports">read for yourself</a>. If you’d like to learn more about our purpose-built big data SaaS for network operations (available via public cloud or private deployment), start by checking our product page on <a href="https://www.kentik.com/kentik-detect/">Kentik Detect</a>. If you’re ready to dive right in with cloud-friendly network monitoring and DDoS protection, sign up for a <a href="#signup_dialog">free trial</a> or <a href="#demo_dialog">request a demo</a>.</p><![CDATA[RouterFreak on Kentik Network Performance Monitoring]]><![CDATA[Earlier this year the folks over at RouterFreak did a very thorough review of Kentik Detect. We really respected their thoroughness and the fact that they are practicing network engineers, so as we've come up with cool new gizmos in our product, we've asked them to extend their review. 
This post highlights some excerpts from their latest review, with particular focus on Kentik NPM, our enhanced network performance monitoring solution.]]>https://www.kentik.com/blog/router-freak-on-kentik-network-performance-monitoringhttps://www.kentik.com/blog/router-freak-on-kentik-network-performance-monitoring<![CDATA[Alex Henthorn-Iwane]]>Thu, 22 Dec 2016 14:00:58 GMT<h3 id="excerpts-from-an-in-depth-look-at-kentik-detect"><em>Excerpts from an in-depth look at Kentik Detect</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/4XNUvwTZ7WcskosCk2MQmq/f251ce64aaf3fe241002852ce858ad2f/router_freak_logo-400w.png" alt="router_freak_logo-400w.png" class="image right no-shadow" style="max-width: 300px;" /> <p><em>Earlier this year the folks over at RouterFreak did a very thorough review of Kentik Detect. We really respected their thoroughness and the fact that they are practicing network engineers, so as we’ve come up with cool new gizmos in our product, we’ve asked them to extend their review. Following here, from their latest review, are some excerpts that focus on Kentik NPM, our enhanced network performance monitoring solution.</em></p> <hr> <p>Recently we <a href="http://www.routerfreak.com/kentik-detect-review/">reviewed Kentik Detect</a>, a very customizable, flexible, and scalable cloud-based <strong>NetFlow collector</strong>. Today we’ll be reviewing Kentik’s <strong>Network Performance Monitor (NPM)</strong> solution, which offers a new host monitoring agent in conjunction with Kentik Detect and gives users an even deeper level of <strong>network visibility</strong>.</p> <p><strong>What is Network Performance Monitoring (NPM)?</strong></p> <p>Kentik’s NPM solution goes beyond typical NetFlow traffic analysis in that it is enabled through the installation of an <strong>nProbe</strong> application on Linux-based servers. The probe captures packets from sampled flows of live, incoming and outgoing traffic and sends that information to Kentik Detect in IPFIX packets.</p> <p>What’s the benefit of this, you ask? Well, as the probe is installed on the server itself, it is privy to information which NetFlow devices are not.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2igaKPAzrauoUeGSkoKGUa/83aec811d1e24d54d2f4e63026e2d8d9/Metric_selector-300w.png" alt="Metric_selector-300w.png" class="image left" style="max-width: 300px;" /> <p>As per Kentik’s nProbe documentation, hosts which have the probe installed generate four additional metrics that are sent to the Kentik Detect back-end:</p> <ul> <li>Retransmits per second and %</li> <li>Out-of-order packets per second and %</li> <li>Fragments per second and %</li> <li>Network latency per client/server/application (ms)</li> </ul> <p>Further to this, the metrics are seamlessly added to the <strong>Data Explorer</strong> if the selected device(s) have the probe installed. For example, as per the image below, when I select the two “nntp” hosts and then click on the “Metric” dropdown menu, I’m provided with the standard metrics as well as the additional augmented metrics.</p> <p>On the other hand, if I select hosts which do not have the probe installed, these metrics are not displayed. Sure it’s a little feature, but it’s nice nevertheless as it avoids the need to memorize which hosts can and which cannot use these metrics.</p> <p><strong>Do I Need the Additional Visibility?</strong></p> <p>Yes! 
Here are just some of the issues you can identify using the augmented metrics:</p> <ul> <li><strong>Retransmits per second and %:</strong> If your clients and/or servers are retransmitting packets regularly it could be due to congestion. If this is the case, the retransmits will amplify the issue.</li> <li><strong>Out-of-order packets per second and %:</strong> Suboptimal use of redundant delivery paths. Reordering packets wastes resources and should be avoided.</li> <li><strong>Fragments per second and %:</strong> A device in the delivery path has a lower than expected MTU. This issue should be rectified because:<br> - Fragmented packets are often dropped by intermediate devices and firewalls.<br> - Reconstructing fragmented packets wastes resources.<br> - Applications often send their traffic with the DF bit set. As a result, this traffic will be dropped.</li> <li><strong>Network latency per client/server/application (ms):</strong> Slow performance is often blamed on the network, whether it be the client or server side. However, it’s just as possible that the application itself is at fault. This metric will allow you to identify where the latency is being introduced.</li> </ul> <p><strong>Test Drive</strong></p> <p>Let’s take a look at how we can use the <strong>“% Retransmits”</strong> metric to gain a deeper understanding of what is causing packet loss in a network.</p> <p>With the metric selected, along with the <strong>“Destination IP/CIDR”</strong> dimension, our graph looks like this:</p> <img src="//images.ctfassets.net/6yom6slo28h2/9reW4lJNhmeqQskqk6scG/989457b68e9e590ccd9f5551438a8779/Top_Dst_IP-819w.png" alt="Top_Dst_IP-819w.png" class="image center" style="max-width: 819px;" /> <p>What we see here is the percentage of retransmits to specific servers. This is a great start, though this information doesn’t tell us which application(s) are experiencing the retransmissions. Adding the <strong>“Destination Port”</strong> dimension provides us with that visibility:</p> <img src="//images.ctfassets.net/6yom6slo28h2/4eNSBCMyOQMISA4eiwYmqK/637e325c3f6c558c08291356e2ccdc21/Top_Dst_IP_2-819w.png" alt="Top_Dst_IP_2-819w.png" class="image center" style="max-width: 819px;" /> <p>But now let’s say that we no longer think the issue is specific server or service related. What could we do if we thought the issue was path related? We could remove both the <strong>“Destination Port”</strong> and <strong>“Destination IP/CIDR”</strong> dimensions and replace them with <strong>“Destination AS Number”</strong>. This gives us an AS-level view of where packets are being retransmitted:</p> <img src="//images.ctfassets.net/6yom6slo28h2/2o97HSoRmIAma2yysMWWy2/8204aab3eac83df2a49410cca8df3b35/Top_Dst_AS-819w.png" alt="Top_Dst_AS-819w.png" class="image center" style="max-width: 819px;" /> <p>As we’ve just seen, using these augmented metrics in conjunction with the preexisting dimensions provides us with <strong>new levels of visibility</strong> which were not available to us previously.</p> <p><strong>Router Freak’s Verdict</strong></p> <p>The nProbe agent adds more features to an already fantastic product. Kentik Detect does <strong>a great job</strong> of providing traffic visibility, but the nProbe agent takes it to <strong>a whole new level</strong>.</p> <p>While performing packet captures at multiple points in your network is a great idea when you’re troubleshooting an issue, it can be very time-consuming. 
Further to this, you need to ensure your captures are running before the issue occurs again in order to be able to analyze the data. On the other hand, as the nProbe agents are collecting data before, during and after the issue, you’re able to start your analysis <strong>immediately</strong>.</p> <p>What we really liked is that nProbe is much more than a simple NetFlow probe. NetFlow is the de facto standard for network traffic accounting, but nProbe includes both a <strong>NetFlow v5/v9/IPFIX probe</strong> and a <strong>packet capture</strong> (pcap) function that can be used to increase the available metrics.</p> <p>Another great feature of nProbe is its availability for Linux, Windows, and embedded systems such as ARM and MIPS/MIPSEL.</p> <p>The supported layer-7 applications number <strong>more than 250</strong>, including popular ones such as Skype and BitTorrent. Last but not least, both IPv4 and IPv6 are supported.</p> <p>All in all it’s a great feature and I really can’t think of a reason why you wouldn’t want it running on your servers.</p> <hr> <p><em>To read the whole review including further walk-throughs of real-world examples, go to <a href="http://www.routerfreak.com">www.routerfreak.com</a>. If you want to learn more about Kentik NPM, check out our performance monitoring <a href="https://www.kentik.com/network-performance-monitoring/">solution page</a>. If you’d like to get cloud-friendly network performance monitoring today, start a <a href="#signup_dialog">free trial</a> or <a href="#demo_dialog">hit us up</a> and we can walk you through a demo.</em></p><![CDATA[Kentik CEO Avi Freedman with PacketPushers on NPM & DDoS]]><![CDATA[Avi Freedman recently spoke with Ethan Banks and Greg Ferro of PacketPushers about Kentik's latest updates, which focus primarily on features that enhance network performance monitoring and DDoS protection. This post includes excerpts from that conversation as well as a link to the full podcast. Avi discusses his vision of appliance-free network monitoring, explains how host monitoring expands Kentik's functionality, and gives an overview of how we detect and respond to anomalies and attacks.]]>https://www.kentik.com/blog/avi-freedman-packetpushers-npm-ddoshttps://www.kentik.com/blog/avi-freedman-packetpushers-npm-ddos<![CDATA[Avi Freedman]]>Mon, 19 Dec 2016 14:00:01 GMT<h3 id="kentik-ceo-avi-freedman-in-conversation-with-packetpushers"><em>Kentik CEO Avi Freedman in Conversation with PacketPushers</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/4CbpMaQ6xyKoCQg2a00Kqe/2aac8ca1cda1aeef08a3f76b00df1390/packetpushers.png" alt="packetpushers.png" class="image right" style="max-width: 300px;" /> <p><em>I recently had the chance to talk with fellow nerds Ethan Banks and Greg Ferro from PacketPushers about Kentik’s latest updates in the arena of network performance monitoring and DDoS protection. I thought I’d share some excerpts from that conversation, which have been edited for brevity and clarity. If you’d like you can also listen to the <a href="https://www.kentik.com/pq-show-101-kentik-npm-for-cloud-ddos-detection-sponsored-2/">full podcast</a>.</em></p> <p><strong>Destroying Silos and Ripping Out Appliances</strong></p> <p>Startup life is hectic, and that is by design. We thought when we started Kentik that there was an opportunity to rip the monitoring appliances out of the infrastructure, and in startup life it’s just a question of how fast you want to go. And we’ve decided that we want to try going pretty fast. 
That involves often doubling the company size every year, adding new kinds of customers, and, as a CEO, a whole new area of geekdom about business and marketing. And in some ways figuring out how to explain the technology to people is a harder problem than building it. Especially explaining to people who don’t know what BGP, AS paths, and communities are. They just want to know: is it the goddamn network that’s causing my goddamn problem?</p> <p>Part of our secret long-term plan is to address how you combine all of this technology. You’ve got APM, and metrics, and NPM, and DDoS protection, and it all needs to work, and it all needs to be related. But again, back to the startup world, one thing at a time. First, badass NetFlow tool, then bring the performance side into it, then protect the infrastructure, and then we’ll see where we go from there. These silos really hurt people, and the lack of visibility is very difficult for operations.</p> <p><strong>Our Cloud-Friendly Network Performance Monitoring</strong></p> <p>The Kentik base is the Kentik Detect offering that we’ve had out commercially for about a year and a half. If you’re a super network nerd, we just tell you that it’s a badass NetFlow tool. It lets you see anything that you want to know, and alerts you to anomalous conditions, but in a way that lets you dig down because we keep all the details, and we present those details in a way that network and operations folks understand. So for the base product it’s really more about “where’s the traffic going?” than “what’s the performance of the network?”</p> <p>People used to say that measuring performance means SNMP. Before you retch, let’s say that means measuring congestion. If the link is congested, then a basic NetFlow tool function allows you to double click on that, and you can see that the link is full because there’s tons of traffic due to the database servers syncing.</p> <p>NPM is about leveraging data sources that are not just standard NetFlow, sFlow, or IPFIX.</p> <p>What that doesn’t really tell you is whether there’s a performance problem. The link may be almost full, but there may be no TCP retransmits, and the latency may be good. So really getting into network performance monitoring is about how we can get data sources that are not just standard NetFlow, sFlow, or IPFIX, but come from packets or web logs, or devices that can send that kind of augmented flow data, which has standard NetFlow but also latency information and retransmit information. And then to make it usable, so you can say “It isn’t the network,” or “Aha! It is the network, in fact it’s that remote network or it’s my data center,” or wherever it is.</p> <p>If you’re 100% in the cloud, generally what you would do is use an agent that can run on the server and watch the packets. We actually work with ntop, who’s a granddaddy in the space, and they have something called nProbe, which can watch packets and generate flow, but also add performance data and metadata. As you deploy that, you have to be careful that you don’t have the Heisenbugs, where your monitoring interferes with production and then things happen strangely. So sometimes you have to watch just a percentage of the traffic.</p> <p>The agent sends data in IPFIX, so standard format, but also adds in the performance data that we need — and soon application semantics and some other things — and it looks like just a network device. 
So every host looks like a network device, we’re getting Flow from it, and we’ll typically in Cloud then use a BGP table that’s representative of that kind of infrastructure. Because you probably don’t speak BGP with your Cloud provider. Unless you’re doing anycast, which some people do, but not AWS, GCE, Azure, SoftLayer, not the big guys. So the data’s coming from the servers. If you run hybrid Cloud, and you’re running at least some of your own infrastructure, then again switches, routers, BGP tables, but to get that sort of augmented information, it can come from the host agent.</p> <p>Our new host agent gives you metrics like retransmits, out-of-order, and latency.</p> <p>The goal of our first NPM release is to really get at whether there is a performance issue, and if so is it the network or back in the application? We’re not pointing down deep into the application because we’re not really taking the full application logs. But if we have the augmented data, it opens up a whole number of new metrics for Kentik Detect. So instead of just bits per second, or packets per second, or number of unique IPs, now you get retransmits, out-of-order packets, client latency, server latency, and application latency.</p> <p>For NPM the emphasis is really on those last three, because we have anomaly detection and alerting that will tell you, for example, that the observed client latency is now 10x, and show you that the network latency is fine and the application latency is high. Those are the things that the people with NPM are looking to do. What do you do once you know that? That’s the alerting part of it, which we’ll talk about in a little bit when we talk about DDoS and our new alerting system. But really alerting should be not just attacks but also performance and other application issues in our opinion.</p> <p>We’ve got the BGP data, so we lay it out, and we show you on a path basis where we think the problem is. So that if you see a problem, and you see that it looks like it’s a network, you can actually see, oh, is it the data center, is it my immediate upstream provider, or is it off in the middle?</p> <p>Our customers actually send our reports to pinpoint problems for upstream neighbors.</p> <p>We’ve got an advertising technology customer, and ad tech is a really interesting space for network sophistication. Or at least, I’ll say for network frustration when they don’t have the data. You’ve got all these companies that compete with each other, but they need to cooperate to give 50-millisecond SLAs, because if they don’t then the web pages don’t load. When people load a web page and see ad-tech stuff blocking it, they get upset. So where’s the problem? Often our customers actually have to send reports from our system to show their upstream neighbors where that problem is. That may sound like finger pointing, but what we’ve seen is that it’s actually taken very constructively compared to, I don’t know, last decade, when I just saw endless organizations that didn’t want to admit problems. Now people understand that there are problems, and they’re really just looking to collaborate. And with tools that are siloed, you can’t really do that.</p> <p>So for us NPM is about application versus network performance problems, and identifying where in the network, or where in the application, the problem might be.</p> <p><strong>On DDoS Protection</strong></p> <p>It’s not possible to prevent being attacked. 
And I think that there are attacks that could be generated, but that we have not yet seen, which very few people other than maybe Google or Akamai would be able to stop. But for the 99% of attacks that actually happen on the Internet, it’s a tractable problem. You can detect them, you can mitigate them. It might not be as cheap as you would like, but it is a solvable problem. Absolutely, it is. Even at the terabit level.</p> <p>We’re taking the traffic summary data, whether it’s sFlow, NetFlow, IPFIX, again from hosts, routers, load balancers, live. We have a continually running streaming data system that is looking at what the usual traffic is, and most of our customers are looking really at anomalies with minimum thresholds. So I don’t want to know that 3 kilobits went to 3 megabits, that’s not really interesting. I want to look at things that are in the tens of megabits going up much higher.</p> <p>You can alert if your traffic volume is way above baseline or fits a known DDoS pattern.</p> <p>So you’ve got baselines built in our system and you’re looking at traffic, and you say, oh, that’s a problem. And then we can look back and see in more detail what the traffic volume has been, and if it’s way above normal, especially for a known kind of DDoS pattern, then we can have that trigger an alert. What people want to do in response to the notification, that’s a business question for them. We have people that automatically trigger mitigation. We have people that want the big, red button, so it opens a ticket, and they log in and push that button to do something about it. And we have people that just want to be notified, and then examine on their own.</p> <p>Over the last year, really since late 2015, we’ve been dealing with larger and larger customers that say, “We need a protection solution that has open, big-data, real-time visibility and detection, but then ties to box and appliance and Cloud-provider mitigation.” And the two providers for which we’ve heard the most requests for that kind of integration are Radware and A10. So those are the mitigations that we’ve built. In terms of a complete mitigation solution, you can detect, you can do RTBH, you can use that to drive Flow Spec, and we also have running in production an end-to-end integration so that we can drive Radware and A10 appliance-based solutions, and then we’ll have some Cloud-based solutions as well that we’ll be integrating with, which a lot of our customers use.</p><![CDATA[Kentik Cited as IDC Innovator]]><![CDATA[Kentik's recent recognition as an IDC Innovator for Cloud-Based Network Monitoring was based not only on our orientation as a cloud-based SaaS but also on the deep capabilities of Kentik Detect. 
In this post we look at how our purpose-built distributed architecture enables us to keep up with raw network traffic data while providing a unified network intelligence solution, including traffic analysis, performance monitoring, Internet peering, and DDoS protection.]]>https://www.kentik.com/blog/kentik-cited-as-idc-innovatorhttps://www.kentik.com/blog/kentik-cited-as-idc-innovator<![CDATA[Alex Henthorn-Iwane]]>Thu, 15 Dec 2016 14:00:09 GMT<h3 id="kentik-detect-recognized-by-idc-for-cloud-based-network-monitoring"><em>Kentik Detect Recognized by IDC for Cloud-Based Network Monitoring</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/4fg0osBfN6GY2iGE6I0Aek/06f6f6fef9cd96edfddfa072bb8b3ccd/IDC_Innovator-500w.png" alt="IDC_Innovator-500w.png" class="image right no-shadow" style="max-width: 300px;" /> <p>At Kentik, we’re honored to have been recognized recently as an IDC Innovator for Cloud-Based Network Monitoring. Needless to say, this call-out by analysts Nolan Greene and Rohit Mehra reflects well on what we’ve been doing to advance the state of the art in areas such as DDoS protection, infrastructure visibility, performance monitoring, and peering analytics. But I think it also speaks to the importance of the cloud to the future of IT Operations Management in general and, more specifically, network monitoring.</p> <p>In the past year or so, various folks at Kentik have been blogging, <a href="https://www.kentik.com/pq-show-101-kentik-npm-for-cloud-ddos-detection-sponsored-2/">podcasting</a>, and <a href="http://www.apmdigest.com/keeping-digital-business-running">writing articles</a> about the need for cloud-based Network Performance Monitoring (NPM). We’ve also talked about how, in terms of moving to the cloud, IT Operations Management (ITOM) and network monitoring have <a href="https://devops.com/devops-impact-on-it-operations-management/">generally lagged</a> other areas of IT management such as IT Service Management (ITSM) and Application Performance Management (APM). By highlighting cloud-hosted monitoring with an IDC Innovators category, IDC lends independent analytical heft to the point we’ve been making.</p> <h4 id="why-kentik">Why Kentik?</h4> <p>IDC’s recognition of Kentik was two-fold, based not only on the fact that we’re SaaS/cloud-based (in fact, we also can deploy our big data solution on an on-premises cluster) but also on the deep capabilities of our Kentik Detect product. As stated in the research note:</p> <p><em>“Kentik offers a SaaS-based product that allows users to collect, visualize, and analyze network traffic and performance. It is a cloud-based network visibility and analytics solution that provides a panoramic view of any network. It monitors service delivery, detects DDoS attacks, optimizes internet peering, and unleashes innovation with unbounded ad hoc traffic analysis at scale.”</em></p> <p>In other words, it’s not simply that Kentik Detect is SaaS, but that we are able to make industry-leading capabilities available in a SaaS/cloud solution that provides a unified network view, including network traffic analysis, performance monitoring, Internet peering, and DDoS protection in a single platform. 
That makes Kentik Detect notably distinct from traditional appliance-based approaches.</p> <h4 id="before-the-cloud-lets-go-crazy">Before the Cloud: Let’s Go Crazy</h4> <img src="//images.ctfassets.net/6yom6slo28h2/4cMDTnNtMceSgYcqgS64Ky/fe4b740d591ea826856c5112303d82fa/Crazy_pills-400w.png" alt="Crazy_pills-400w.png" class="image right" style="max-width: 400px;" /> <p>Network engineers have multiple use cases for network data. But because appliances are low on capacity and low on compute power, vendors have had little choice but to focus their appliance models on individual use cases. Within each such use-case silo, incoming network data must be reduced to summary aggregates, with the rest of the raw data discarded regardless of its potential value in other areas.</p> <p>For an organization with multiple departments — network engineering, planning, operations, and security — the appliance-based approach requires sinking resources into a jumble of siloed systems that can’t inter-communicate or share information. It’s maddening enough for engineers to have to swivel-chair between a variety of insufficient tools, but when those tools are in different departments, the barriers to solving real-world problems are enough to drive engineers crazy. As we explained to <a href="http://www.clouditweek.com/topics/cloud-it/articles/427228-kentik-cloud-world-calls-cloud-scale-npm.htm">Cloud IT Week</a> not long ago: “It’s time for appliance-based network performance solutions to go bye-bye.”</p> <h4 id="kentik-detect-true-3rd-platform">Kentik Detect: True 3rd Platform</h4> <p>Kentik’s approach to Kentik Detect isn’t just about moving single-server code from a physical server or appliance to a VM and calling it SaaS. That’s cloud-washing, a phenomenon that should be approached with wariness. Rather, as IDC put it:</p> <p><em>“Kentik is a 3rd Platform-oriented company, leveraging a multitenant Big Data and analytics engine that enables a range of ad hoc network traffic and performance inquiries at scale”</em></p> <p>In case you’re not familiar with the phrase “3rd Platform,” it’s part of IDC’s forward-looking vision for IT, in which 3rd Platform is defined as business Information Technology that drives innovation and continuous change and is built on the four technology pillars of cloud, big data, mobility, and social business. One of IDC’s predictions is that 100+ “Industry Clouds” will emerge by 2020, disrupting today’s established industry market leaders. It is certainly Kentik’s ambition to build our network visibility cloud offerings in a way that will disrupt and replace the curse of legacy appliances.</p> <h4 id="siloes-be-gone">Siloes Be Gone!</h4> <p>So how does Kentik get around the limitations of appliances? By creating a distributed big data backend that’s purpose-built for the scale and speed of today’s network traffic. Called Kentik Data Engine (KDE), this datastore enables us to capture in real time — and keep for months without summarization — all of the details of network traffic data (flow records, BGP, GeoIP, etc.). And we can make all of that unified, time-correlated data not only easily accessible for rapid ad hoc querying but also highly useable across multiple use-case siloes. As IDC observed about the Kentik solution:</p> <p><em>“[Kentik] insights are applicable to network architects, network engineers, network systems analysts, and security operators.”</em></p> <p>This kind of comprehensive view isn’t possible without the scalability enabled by Kentik’s approach. 
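</p> <p>A quick back-of-envelope calculation shows why. All of the inputs below are illustrative assumptions rather than Kentik figures, but even modest flow volumes quickly outgrow what a single appliance can retain:</p> <pre># Back-of-envelope: what months of unsummarized flow retention implies.
# Every input is an illustrative assumption, not a Kentik figure.
flows_per_sec = 50_000     # assumed aggregate flow records/sec across exporters
bytes_per_record = 100     # assumed size of an enriched record (flow + BGP/GeoIP)
retention_days = 90        # i.e., months of lookback

total_bytes = flows_per_sec * bytes_per_record * 86_400 * retention_days
print(f"~{total_bytes / 1e12:.0f} TB of raw records")  # ~39 TB at these rates</pre> <p>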
As mentioned before, you can’t solve the issues of data reduction and siloes by simply virtualizing and cloud-washing old solutions. A fundamentally different architecture is required.</p> <h4 id="network-experience-delivered">Network Experience, Delivered</h4> <p>One of the major premises of IDC’s 3rd Platform vision is that digital transformation is happening at an astonishing rate and digital business practices will soon affect everything. The rise of IoT is evidence of that, with normal/offline lifestyles becoming digitally connected. Digital business is all about delivering a positive user or customer experience, and that experience relies on three key things: UI/UX, application performance, and the network.</p> <p>This brings us back to the start of this post. UI/UX is a well-known and amply addressed issue. Application (and related server) performance management has modernized and cloudified already. The laggard, in so many ways, is network management. And yet, without a great network experience — free from traffic congestion, latency, or other network performance issues, and safe from disruption by DDoS attacks — you can’t deliver a great customer experience.</p> <p>By providing cloud-based, big data-powered visibility and DDoS defense without the siloes, Kentik Detect helps network teams to work collaboratively, solve problems, innovate, and deliver the optimal network experience that’s needed for digital business to succeed. Ready to see for yourself? Sign up today for a <a href="#signup_dialog">free trial</a>, or <a href="#demo_dialog">request a demo</a>.</p><![CDATA[Culture War: Network vs. Cloud]]><![CDATA[Every so often a fundamental shift in technology sets off a culture war in the world of IT. Two decades ago, with the advent of a commercial Internet, it was a struggle between the Bellheads and the Netheads. Today, Netheads have become the establishment and cloud computing advocates are pushing to upend the status quo. In this first post of a 3-part series, analyst Jim Metzler looks at how this dynamic is playing out in IT organizations.]]>https://www.kentik.com/blog/culture-war-network-vs-cloudhttps://www.kentik.com/blog/culture-war-network-vs-cloud<![CDATA[Jim Metzler]]>Mon, 12 Dec 2016 14:00:12 GMT<h3 id="it-organizations-grapple-with-a-fundamental-shift-in-technology"><em>IT Organizations Grapple with a Fundamental Shift in Technology</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/4wDkdLe6NimmcSSkcgaAWu/59f83c5c2c50132837e8e7962cb91c8e/toy_boxing-500w.png" class="image right" style="max-width: 450px; margin: 15px;" alt="Network vs Cloud" /> <p>This is the first in a series of three blogs that will look at the growing culture war that enterprise IT organizations are facing as they adopt cloud computing. Culture wars within an IT organization aren’t a frequent event and they don’t typically result from a technology going through its normal evolutionary cycle. For example, over its life cycle of 30+ years Ethernet has evolved from being a shared technology to being a switched technology and Ethernet speeds have increased by four orders of magnitude. While this type of technological evolution has an impact on IT organizations, it doesn’t significantly change their culture. Every so often, however, there is a technology shift that isn’t evolutionary, but is instead a basic shift in how technology is implemented. 
Because these shifts are indeed fundamental, the organizations that maximize their potential benefit are also the organizations that must come to grips with the cultural impact these shifts create.</p> <div class="pullquote left">Bellheads built networks to Bell System specs while Netheads designed for data.</div> <p>The mid to late 1990s saw a major IT culture war that was often referred to as being between the Bellheads and the Netheads[1]. The term “Bellheads” primarily referred to people who came from the Bell System and whose approach to networking came in large part from an 885-page book entitled <em>Engineering and Operations in the Bell System</em>[2], as well as from the <em>Bell System Practices</em>[3], which was a compilation of technical publications that described the best methods for engineering, constructing, installing, and maintaining the telephone plant of the <a href="https://en.wikipedia.org/wiki/Bell_System">Bell System</a>. Implicit in the culture of the Bell System was that the network was designed primarily to support voice traffic, that capacity was dedicated for each call, that the network had to have five 9s (99.999%) of availability, and that it was acceptable to take years to introduce a new service.</p> <p>In contrast, the Netheads were the young Turks who developed and expanded the Internet and were enabled more by collaborative groups such as the IETF than by the type of prescriptive how-to guides relied on by the Bellheads. The Netheads saw the Bellheads as being dinosaurs in part because the Netheads used technologies, such as IP, that were foreign to the Bellheads to build networks that were designed primarily to carry data traffic. The Netheads developed NetFlow to support management, QoS to enable their networks to support delay-sensitive traffic, and a range of network functionality such as optimization and security to enable acceptable application delivery.</p> <div class="pullquote right">Those who once fought the establishment are now themselves firmly entrenched.</div> <p>As often happens in our industry, the young Turks who lead the fight against the established approach to technology and its embedded culture end up becoming firmly entrenched in the culture that they helped to introduce. That point was recently driven home to me when I attended a conference on cloud computing. One of the speakers at the conference was the vice president of cloud infrastructure engineering at a Fortune 500 company. While I was interested in what the VP had to say about the technologies that his company was using, I was more interested in what he had to say about the changes that were happening within the company’s IT organization because of their adoption of cloud computing. According to the VP, it had been relatively easy to get their younger application developers excited about moving to more agile development processes and leveraging open source-based applications such as OpenStack. He added that these young application developers became the champions of the cultural changes that enabled the organization to adopt a more agile DevOps approach to application development.</p> <p>The speaker then contrasted the agility and the culture of his company’s cloud-focused, application development organization with that of their network organization, a.k.a. the Netheads. He described the network organization as being stuck in a culture where it is acceptable to take weeks to implement a change. He said that culture needs to change. 
And he said that he was in the process of helping the network organization realize that it needs to become more agile, and of making sure that they had whatever training or other resources they needed to make the shift. The impression that he gave me was that the members of the network organization would either make the shift or be replaced.</p> <p>In the next post in this series I’ll look a bit deeper at the cloud vs. network culture war and discuss what this war says about how network organizations should think differently about the management tools they use.</p> <hr> <p>[1] <a href="https://www.wired.com/1996/10/atm-3/">https://www.wired.com/1996/10/atm-3/</a></p> <p>[2] <a href="https://www.amazon.com/Engineering-Operations-Bell-System-Rey/dp/B000FQ0ACM">https://www.amazon.com/Engineering-Operations-Bell-System-Rey/dp/B000FQ0ACM</a></p> <p>[3] <a href="https://en.wikipedia.org/wiki/Bell_System_Practices">https://en.wikipedia.org/wiki/Bell_System_Practices</a></p><![CDATA[The (Net)Flow That Kentik Makes Go: Know Your Traffic Flow Data Protocols]]><![CDATA[“NetFlow” may be the most common short-hand term for network flow data, but that doesn't mean it's the only important flow protocol. In fact there are three primary flavors of flow data — NetFlow, sFlow, and IPFIX — as well as a variety of brand-specific names used by various networking vendors. To help clear up any confusion, this post looks at the main flow-data protocols supported by Kentik Detect.]]>https://www.kentik.com/blog/the-netflow-that-kentik-makes-gohttps://www.kentik.com/blog/the-netflow-that-kentik-makes-go<![CDATA[Alex Henthorn-Iwane]]>Thu, 08 Dec 2016 14:00:33 GMT<h3 id="not-just-for-netflow-kentik-detects-analytics-cover-all-major-flow-protocols"><em>Not Just for NetFlow, Kentik Detect’s Analytics Cover All Major Flow Protocols</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/6lPImbVsVq0s8oqmccEWsi/eb9d7b14f6dec164fc12d0f3baa3838c/Flow_balloons-500w.png" alt="Flow_balloons-500w.png" class="image right" style="max-width: 400px;" /> <p>In networking terms, a “flow” defines a uni-directional set of packets sharing common attributes such as source and destination IP, source and destination ports, IP protocol, and type of service. “NetFlow” may be the most common short-hand term for this network flow data, but that doesn’t mean it’s the only important protocol for the exchange of metadata related to flows transiting network infrastructure. In fact there are three primary flavors of flow data — NetFlow, sFlow, and IPFIX — as well as a variety of brand-specific names used by various networking vendors. This practice allows some vendors to provide NetFlow-equivalent functionality without invoking a Cisco-owned trade name, but it also creates a bit of confusion in the marketplace. So to help provide clarity, we’ve listed below names and descriptions for the main flow-data protocols supported by Kentik Detect.</p> <ul> <li><strong>NetFlow</strong>: NetFlow is the trade name for a flow export protocol invented by Cisco Systems. NetFlow statefully tracks IP packets on a per-flow basis. The primary deployed versions, v5 and v9, are both supported by Kentik Detect. The main distinction is support in v9 for templating, which allows greater flexibility in content and interpretation of flow records.</li> <li><strong>IPFIX</strong>: IPFIX is an IETF standards-based protocol that is largely modeled on NetFlow v9 and is sometimes referred to as NetFlow v10. 
Like NetFlow (as well as J-Flow, cflowd, and RFlow) IPFIX is generated by stateful monitoring of packets within flow. IPFIX, however, allows variable-length fields and can integrate into its flow records types of information that would otherwise be sent to syslog or SNMP.</li> <li><strong>sFlow:</strong> sFlow is used to record statistical, infrastructure, routing, and other information about IP traffic traversing an sFlow-enabled router or switch. Unlike NetFlow, sFlow doesn’t statefully track packets within flows, but is derived instead from packet sampling. Created by InMon Corporation, sFlow is now a multi-vendor protocol that is supported by many vendors such as A10, Alaxala, Alcatel-Lucent, Allied Telesis, Arista, Aruba, Big Switch, Brocade, Cisco, Cumulus Networks, Dell, D-Link, Enterasys, Extreme Networks, F5, Fortinet, Hewlett-Packard, Huawei, IBM, Juniper Networks, NEC, Netgear, and ZTE.</li> <li><strong>J-Flow</strong>: J-Flow is a flow monitoring implementation from Juniper Networks. It is functionally equivalent to NetFlow. J-Flow comes in v5, v8, and v9 variants, each of which is cross-compatible with the corresponding version of NetFlow.</li> <li><strong>cflowd</strong>: cflowd is used by vendors such as Alcatel-Lucent and Nokia to designate both flow monitoring functionality and the format (protocol) of the resulting data. It is functionally equivalent to NetFlow, with multiple versions corresponding to v5, v8, v9, and IPFIX.</li> <li><strong>RFlow</strong>: RFlow was the flow monitoring implementation of Redback Networks, now part of Ericsson. RFlow is based on NetFlow v5.</li> </ul> <h4 id="flow-data-variations">Flow Data Variations</h4> <p>Pretty much all flow data protocols support what we might call the “basic” flow fields in the following list:</p> <ul> <li>Source &#x26; dest IP</li> <li>Protocol</li> <li>Source &#x26; dest port</li> <li>TCP flags</li> <li>Input interface and output interface ID</li> <li>Byte and packet counts</li> <li>ToS/DSCP value</li> <li>Source and dest ASN</li> <li>Source and dest IP mask</li> <li>Next hop IP</li> </ul> <p>Because of the variable functionality available in various protocol versions and implementations, however, there is much more to flow data than just the basics listed above. Some versions, but not all, support other data fields such as MAC address, VLAN ID, and IPv6. For example, NetFlow v9, IPFIX, and sFlow support IPv6 but NetFlow v5 and its equivalents don’t. For more details on some of these variations check out our Knowledge Base topic on <a href="https://kb.kentik.com/Ab02.htm">Flow Protocols</a>.</p> <h4 id="flow-exporting-devices">Flow Exporting Devices</h4> <p>Flow data is commonly associated with routers and switches, but devices such as load balancers, ADCs, network visibility switches, and security devices can also export flow data. There are some white box switches, however, that don’t support any flow protocol. What if you don’t have any network devices that can export flow? Fortunately, Kentik has partnered with ntop to provide Kentik-compatible host agent software called <a href="http://www.ntop.org/products/netflow/nprobe/">nProbe</a>, which can run either as a host agent or as a probe on a data center appliance. nProbe sends IPFIX to Kentik Detect.</p> <p>No matter which protocol you use, flow data adds up quickly, requiring an ingest, storage, and querying architecture that can handle massive volumes of traffic. 
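</p> <p>To get a feel for why that is, it helps to look at the wire format itself. Below is a minimal, deliberately simplified Python listener that unpacks the fixed 24-byte NetFlow v5 header and a few fields from each 48-byte flow record. The field layout follows the published v5 format; the port number is just a common collector convention, and templated protocols (v9/IPFIX) are skipped because they need more machinery:</p> <pre>import socket
import struct

# NetFlow v5: a 24-byte header followed by fixed 48-byte flow records.
HEADER = struct.Struct("!HHIIIIBBH")  # version, count, sys_uptime, unix_secs,
                                      # unix_nsecs, flow_sequence, engine_type,
                                      # engine_id, sampling_interval
RECORD_LEN = 48

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 2055))  # 2055 is a conventional flow-collector port

while True:
    data, addr = sock.recvfrom(4096)
    version, count, *_ = HEADER.unpack_from(data)
    if version != 5:
        continue  # v9/IPFIX are template-based; not handled in this sketch
    for i in range(count):
        rec = data[HEADER.size + i * RECORD_LEN:][:RECORD_LEN]
        src, dst = socket.inet_ntoa(rec[0:4]), socket.inet_ntoa(rec[4:8])
        pkts, octets = struct.unpack_from("!II", rec, 16)  # dPkts, dOctets
        print(f"{addr[0]}: {src} -> {dst} {pkts} pkts / {octets} bytes")</pre> <p>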
<a href="https://www.kentik.com/product/kentik-platform/">Kentik</a> offers the ease of SaaS but also the power of big data, turning flow, performance, BGP, SNMP, and geolocation data into powerful, real-time insights for <a href="https://www.kentik.com/visibility-performance/">network traffic analysis</a>, <a href="https://www.kentik.com/network-performance-monitoring/">network performance monitoring</a>, <a href="https://www.kentik.com/network-planning/">network planning</a>, and <a href="https://www.kentik.com/ddos-detection/">DDoS protection</a>. Ready to learn more? <a href="https://www.kentik.com/go/get-demo/">Contact us</a> and we’ll be happy to walk you through a demo. Or try it for yourself by signing up for a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Kentik Troubleshoots Network Performance]]><![CDATA[How does Kentik NPM help you track down network performance issues? In this post by Jim Meehan, Director of Solutions Engineering, we look at how we recently used our own NPM solution to determine if a spike in retransmits was due to network issues or a software update we'd made on our application servers. You'll see how we ruled out the software update, and were then able to narrow the source of the issue to a specific route using BGP AS Path.]]>https://www.kentik.com/blog/kentik-troubleshoots-network-performancehttps://www.kentik.com/blog/kentik-troubleshoots-network-performance<![CDATA[Jim Meehan]]>Mon, 05 Dec 2016 14:00:53 GMT<h2 id="ad-hoc-data-analysis-answers-key-questions-in-real-time"><em>Ad-hoc data analysis answers key questions in real time</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/3ilip7ddMccOW8oogU4KkA/ff65684db7d9f60d939fb79b3339a521/nProbe_diagram-500w.png" alt="nProbe_diagram-500w.png" class="image right" style="max-width: 500px;" /> <p>In an exciting development for Kentik, we’ve recently been <a href="https://www.idc.com/getdoc.jsp?containerId=prUS41922116">recognized by IDC</a> as a cloud monitoring innovator. The ability of Kentik Detect to offer cloud-friendly network performance monitoring — <a href="https://www.kentik.com/network-performance-monitoring/">Kentik NPM</a> — was key to the recognition we earned. In a recent <a href="https://www.kentik.com/how-detect-nprobe-monitor-network-performance/">blog post</a> by Kentik Solutions Engineer Eric Graham we explained how we “dog food” our own NPM solution to troubleshoot network performance issues within our own cloud-based application. In that post, Eric shows how he found issues on a group of internal hosts that were impacting a critical microservice. Using the performance metrics from nProbe agents running on each host, we were able to identify a particular switch as the root cause.</p> <p>Eric’s story illustrates why our operations team continues to use nProbe and Kentik NPM on a regular basis. How does it work? Kentik NPM monitors network performance by capturing samples of live traffic with nProbe and sending the enhanced flow data to the Kentik Data Engine (KDE; Kentik’s distributed big data backend). Correlated with BGP, GeoIP, and unsummarized NetFlow from the rest of the network infrastructure, the nProbe data becomes part of a unified time-series database that can span months of network activity. You can run ad-hoc queries covering a broad variety of performance and traffic metrics — including retransmits and latency — and zoom in or out on any time range. 
With 95th-percentile query response times of just a few seconds across billions of datapoints, Kentik NPM provides the real-time insights into performance that are needed to effectively operate a digital business.</p> <h3 id="tracing-bgp-instability">Tracing BGP Instability</h3> <p>Now let’s look at another example of how we use Kentik NPM in house. After a software update on November 19th, we saw some instability in the BGP sessions from a few customer devices that feed live routing data to our Kentik Detect backend. We were concerned about whether the software update might be the source of the instability. So of course we turned to Kentik Detect to answer the age-old question: “Is it the network or the application?”</p> <img src="//images.ctfassets.net/6yom6slo28h2/3uYwP212us2OiwS2eq4WiK/258089b380d8db5cd27cb005c0c53dd1/NPM1_Dimensions-400w.png" alt="NPM1_Dimensions-400w.png" class="image right" style="max-width: 400px;" /> <p>To answer that question we used a series of visualizations in the Data Explorer. For the first such visualization, we started by setting the group-by dimension in the Query pane (in the sidebar). As you can see from the image at left, Kentik Detect provides dozens of group-by dimensions, up to eight of which may be used simultaneously, which gives you many ways to pivot your analyses. In this case we chose the “Destination:IP/CIDR” dimension to frame the overall view in terms of the IPs that were the destination of the problematic traffic.</p> <p>Next we set Metric (still in the Query pane) to look at the percentage of TCP retransmits. Under Advanced Options, we set the Visualization Depth to very low (so we see just the top few results) and Display and Sort By to show results by 98th percentile. In the Filters pane, we filtered to show only dest port 179 (BGP). In the Devices pane we chose any devices that could be sending traffic to the destination IPs in question. And in the Display pane we chose Time Series Line Graph. With our visualization parameters defined, we clicked the Apply Changes button. The result looked like the image below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1KjUi69niYkUYkssQGiO0c/bada4d644e6944ff97e354dad3252770/NPM2_Dest_IP-800w.png" alt="NPM2_Dest_IP-800w.png" class="image center" style="max-width: 800px;" /> <p>The graph and table in the main display area show us the top three destination network elements (anonymized of course) in terms of 98th percentile of retransmits of traffic using the BGP port. In the table’s “key” column (at left) we see the IP addresses of those elements along with their corresponding hostnames (also anonymized). As we move right we see the statistics about retransmits to these IPs, including percent, rate (number per second), and the traffic involved (packets and mbps). What immediately jumps out from the graph is that there is a significant spike in the percentage of retransmitted TCP packets for these three hosts. And since the network elements are at two different customers (“customer.com” and “client.com”) we can tell that it’s not a problem with just one customer’s network.</p> <p>If you’re familiar with TCP/IP, you’ll know that anything above a very low percentage of retransmits (anywhere from 0.5% to at most 3%, depending on the application) will be very destructive to application performance. BGP is relatively resilient to TCP retransmissions because the protocol provides a “hold timer” to keep sessions between peers alive in case keepalive messages are dropped and must be retransmitted. 
However, as seen in the graph above, the retransmission percentage for traffic sent to these devices was spiking to between 15% and 30% repeatedly. The result was hold-timer timeouts and persistent flaps for these BGP sessions.</p> <h3 id="is-it-one-of-our-servers">Is It One of Our Servers?</h3> <p>With a graph like the above, it seemed highly likely that the root cause was the network. But was the issue possibly within Kentik’s own infrastructure? The next step toward finding out was to add “Full:Device” as another group-by dimension. This changes the analysis to look at the combination of the customer network element destination IP addresses and the devices exporting flow (NetFlow, IPFIX, sFlow, etc.) to Kentik. In this case, those devices are the nodes within the Kentik infrastructure that handle the BGP sessions for each customer device. All of these have nProbe deployed and are exporting network performance metrics and traffic flow details. Our next image shows the result of the revised analysis.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6RKIaM0xYkeoyOCaSMEMiQ/657161e5017445a8938888a8050d4c54/NPM3_Device-800w.png" alt="NPM3_Device-800w.png" class="image center" style="max-width: 800px;" /> <p>In this visualization, the key column shows the combination of the two dimensions specified in the query pane: the destination IP and hostname combined with the name of the device or Kentik node that is exporting the performance metrics. We can see that all three destination IPs were being served by different nodes, so there was no apparent Kentik infrastructure commonality. Since there was no correlation with the servers running Kentik applications, it seemed pretty certain that the BGP issues had nothing to do with our software update.</p> <h3 id="where-in-the-world">Where in the World?</h3> <p>So now we had the answer to our initial question. But it was still of interest to get a better sense of where the problem was. So we pivoted our analysis by changing our group-by settings, replacing “Full:Device” with “Destination:Country”. When we re-ran the visualization we got the result shown in the image below.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2LQMdsgjFuKmqcwqGw8syg/32853fe82493549451ed5ee7ad366c17/NPM4_Country-800w.png" alt="NPM4_Country-800w.png" class="image center" style="max-width: 800px;" /> <p>This time the key column showed the destination IP and hostname combined with the destination country, which in this case is Singapore (SG) for all three rows. So we saw a clear geographical commonality. (If we had also wanted to see whether other geographies were experiencing similar problems we could easily have re-run the visualization without the Destination:IP/CIDR dimension.)</p> <h3 id="clues-from-as-path">Clues from AS Path</h3> <p>At this point, we were pretty darned certain that the root cause of the TCP retransmits was due to an issue with Internet communications, somewhere outside of our own servers. But where?</p>
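<p>Before walking through the portal view, it’s worth spelling out the logic of the next pivot: if every path toward the affected hosts runs through the same upstream networks, the shared portion of the path is where to hunt. A toy version of that commonality check (the ASNs below are hypothetical, not the actual networks involved):</p> <pre><code># Toy commonality check on BGP AS paths (hypothetical ASNs). Paths read
# left to right: nearest upstream first, destination ("origin") ASN last.
def common_prefix(paths):
    """Return the leading ASNs shared by every path."""
    shared = []
    for hops in zip(*paths):
        if len(set(hops)) == 1:
            shared.append(hops[0])
        else:
            break
    return shared

paths_to_affected_hosts = [
    [2914, 174, 64501],   # three devices in three different destination ASNs
    [2914, 174, 64502],
    [2914, 174, 64503],
]
# Shared transit but differing destinations: suspicion falls on common hops.
print(common_prefix(paths_to_affected_hosts))   # [2914, 174]
</code></pre>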
<p>We were able to find some clues by pivoting the analysis again, this time dropping destination country and including instead the dimension “Destination:BGP AS_Path.” In this latest visualization (below) the key column shows the combination of IP and hostname with the Autonomous System (AS) Path, which is a list of ASNs or Autonomous System Numbers (numerical IDs for Internet-peered networks) that indicates the path (route) taken by the traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/zaKOsdNFLMQaUiSooUSwk/39af5061700b1ad6855c61ae087f8735/NPM5_DestASPATH-800w.png" alt="NPM5_DestASPATH-800w.png" class="image center" style="max-width: 800px;" /> <p>The first thing we saw was that the destination ASN — the network that the traffic ends up in and that the customer network elements are deployed in — did not present any correlation. However, the rest of the path was common. We can’t know for sure based on this analysis alone which particular “hop” in the AS Path was the problem network, though it was likely in the last few hops, as a problem earlier in the path would have affected many more customer devices. In any case, we had the confirmation we needed that the issue with TCP retransmits was not within our own network but rather due to an external root cause out on the Internet.</p> <h3 id="precise-continuous-npm">Precise, continuous NPM</h3> <p>So there you have another example of how we use Kentik NPM to monitor and troubleshoot network performance issues for our distributed, cloud-based, micro-services application. You may not be running BGP peerings, but TCP sessions are the bread and butter of communications between distributed application components, whether you’re running a three-tier web application or a massively distributed micro-services architecture. Kentik NPM allows you to get precise, continuous measurement of network performance for your actual application traffic.</p> <p>Learn more about applications of Kentik in <a href="https://www.kentik.com/solutions/usecase/troubleshoot-networks/" title="Network Troubleshooting with Kentik">Network Troubleshooting</a>.</p> <p>If you already know that you want to get your hands on a cloud-friendly network performance monitoring solution, start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[How to Configure Remotely Triggered Black-Hole Routing with Kentik Detect]]><![CDATA[Destination-based Remotely Triggered Black-Hole routing (RTBH) is an incredibly effective and very cost-efficient method of protecting your network during a DDoS attack. And with Kentik’s advanced Alerting system, automated RTBH is also relatively simple to configure. In this post, Kentik Customer Success Engineer Dan Rohan guides us through the process step by step.]]>https://www.kentik.com/blog/how-to-rtbh-with-kentik-detecthttps://www.kentik.com/blog/how-to-rtbh-with-kentik-detect<![CDATA[Dan Rohan]]>Thu, 01 Dec 2016 14:00:31 GMT<h2 id="a-fast-cost-effective-route-to-automated-network-protection">A Fast, Cost-effective Route to Automated Network Protection</h2> <img src="//images.ctfassets.net/6yom6slo28h2/55KUix9yF20GkWgwqk4oWI/f18aa8a47a402450728aa62648bfed68/ddos_alerting-500w.png" alt="ddos_alerting-500w.png" class="image right" style="max-width: 300px;" /> <p>Remotely Triggered Black-Hole routing (RTBH) is an extremely powerful technique that network operators can use to protect their network infrastructure and their customers against Distributed Denial of Service (DDoS) attacks. 
By automating the redirection of undesirable traffic at discrete points in a network, RTBH gives operators the ability to mitigate DDoS and to enforce blacklist/bogon routing policy from a centralized station.</p> <p>While there are a variety of methods to implement RTBH, including RFC 5635 and source-based RTBH, most network engineers consider <a href="https://tools.ietf.org/rfc/rfc3882.txt">destination-based RTBH</a> to be their best first-line defense against DDoS attacks. In a traditional, non-automated RTBH setup, a customer might call and say, “Help! I think I’m under attack.” An operator would then log into a route server and configure the mitigation, which is a /32 host “black hole” route. The route is then redistributed via BGP — along with a ‘no-export’ community and a next-hop address — to the routers where the attack traffic is entering the network. These routers then route the traffic to a destination that doesn’t exist (the black hole), for example a null interface.</p> <div class="pullquote left">RTBH is effective, inexpensive, and simple to implement with Kentik.</div> <p>Destination-based RTBH is not only incredibly effective at protecting network infrastructure, it’s also inexpensive and can be simple to implement, particularly with Kentik, which makes it exceptionally easy and flexible to automate. You create policies that define any number of conditions that will trigger an alarm, as well as how the system should respond to an alarm depending on the specific situation. That response may be immediate, delayed, or manual; it may include notification and/or mitigation; and it may involve just a /32 host route or include an aggregate /24.</p> <p>We’ve seen gung-ho customers who are using Kentik as a detection and trigger mechanism configure effective destination-based RTBH policies in as little as 30 minutes. The basic steps outlined below explain how you can take advantage of Kentik’s power and flexibility to implement RTBH protection on your network.</p> <h3 id="develop-a-plan-for-rtbh-ddos-mitigation">Develop a plan for RTBH DDoS mitigation</h3> <p>RTBH can be as simple or as complex as you want to make it. Here are the bare essentials that you’ll want to figure out ahead of actually configuring anything:</p> <ul> <li><strong>Which routers will you implement your policies on?</strong> To enable RTBH within Kentik you’ll need to filter traffic at key points within your infrastructure. You’ll likely want to set up RTBH policies for routers that have direct connections to your transit providers or peers, because you won’t want to carry illegitimate or attack traffic any further than necessary on your network. To use RTBH, those routers must support BGP and provide some minimal implementation of routing policy.</li> <li><strong>What will your routers do with black-holed traffic?</strong> You’ve got a number of options as to what specifically the black hole will be in your configuration. If you need ideas on how you might set up the black hole in your policies, see this <a href="https://www.nanog.org/meetings/nanog32/presentations/soricelli.pdf">awesome presentation</a> by Joe Soricelli and Wayne Gustavus from NANOG 32.</li> <li><strong>How long will you keep an RTBH policy active?</strong> Kentik can automate the triggering of RTBH mitigations based upon policies, so it’s up to the user to determine how long such policies should remain in force. 
Do you want a policy to remain in place until it’s manually removed, or would you like it to deactivate as soon as the attack subsides? Or maybe 20 minutes afterwards, or 30 minutes, or…?</li> </ul> <h3 id="enable-alerting">Enable alerting</h3> <p>Most operators find that the real joy of Kentik is its endless flexibility. Chances are excellent that Kentik can be configured to reliably alarm on a given condition no matter how complex a signature or baseline is involved. Luckily for us, the conditions that indicate that a DDoS attack is underway are extremely well understood, and Kentik offers a robust library of alert policies to help our customers quickly protect themselves against DDoS events, from the most common to the most corner-case.</p> <div class="pullquote right">Kentik’s preset alert policies help you quickly enable protection against attacks.</div> <p>To take advantage of Kentik’s alert presets, choose Alerting from the Alerts menu on the main navbar, then click the Kentik Alert Library tab. Find an alert whose description matches your needs, then click the Duplicate icon in the Action column, which copies the preset to the list of policies on the Alert Policies tab. Click the policy name to open the Alert Policy Settings page, where you can spend a little time getting more familiar with the policy and tuning it to your specific needs (the payoff is well worth it).</p> <p>This page is also where you can build a policy from scratch (access via the Create Alert Policy button on the Alert Policies tab), a process that is probably wisest to attempt initially with a Kentik sales engineer or customer success engineer. Either way, when you want to enable your new policy you’ll click the enabled/disabled icon in the right-most column of the policy list (the icon will switch to green when the policy is enabled).</p> <h3 id="configure-rtbh">Configure RTBH</h3> <p>Here’s where things get technical, but nothing too crazy. You’ll create a mitigation platform and method, link the two together, then associate them with an alert. To understand how configuration of the mitigation method differs from configuration of the mitigation platform, consider a scenario where RTBH policies are differentiated based on transit providers, interface capacities, or available peers. By creating multiple methods, operators can achieve all the flexibility they need to match nearly any RTBH deployment scenario.</p> <p>Here’s an outline of the process:</p> <ul> <li><strong>Identify your routers</strong>: In the Kentik portal, go to Admin » Devices. 
From this pane, make a note of the device IDs of the routers that you decided above to implement your policies on.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/32FV7a3H2MWmmekYAs6yCU/8d4704385be17617c6acc241fe6f3e0e/Device_list-800w.png" alt="Device_list-800w.png" class="image center" style="max-width: 800px;" /> <img src="//images.ctfassets.net/6yom6slo28h2/42lX4kEz8A6moGsKW8wMK4/5d0bad50a87e042dcb3cdddf1108d29e/Create_platform-400w.png" alt="Create_platform-400w.png" class="image right" style="max-width: 300px;" /> <ul> <li><strong>Create a mitigation platform</strong>:<br> - Go to Alerts » Alerting, and then to the Mitigation tab.<br> - Click the Create a Mitigation Platform button.<br> - In the resulting modal, give your platform a good name and description, then enter the device IDs of the routers you selected in the previous step (comma-delimited list).<br> - Click the Submit button to close the modal.</li> <li><strong>Create a mitigation method</strong>:<br> - Back on the Mitigation tab, click the Create a Mitigation Method button.<br> - In the resulting modal, give your method a name and description, then be sure to set up a notification channel so that you can be aware when your policy triggers an alarm.<br> <img src="//images.ctfassets.net/6yom6slo28h2/8PFCkR4vwQ2ecqUqMYmkQ/38edea6d8609ec39eaebf304e7821f36/Create_method-400w.png" alt="Create_method-400w.png" class="image right" style="max-width: 300px;" />- Next, enter any IP addresses that you’d like to exclude from ever being blackholed. Good candidates might be infrastructure addresses, point-to-point networks, or other addresses critical to the normal functioning of your network.<br> - Next, select the grace period that Kentik should honor prior to withdrawing the blackhole route. Many operators are happy with the 30-minute default because it provides enough cushion to discourage repeat attacks while not being excessively punitive to the IP that was the attack destination.</li> <li><strong>Select RTBH as the platform</strong>:<br> - Still in the modal, choose RTBH from the drop-down Platform menu.<br> - In the resulting fields, provide a meaningful community string and next hop address. This is where the plan and policies we discussed earlier come into play (you did read that awesome NANOG document we referenced, didn’t you?).<br> - Lastly, if you’re going to do anything fancy like withdraw blocks from certain routers and re-advertise in other locations, you may want to select the ‘Convert IP to a /24’ checkbox. Otherwise, leave it unselected.<br> - When you’re finished, click Submit to close the modal.</li> <li><strong>Link the mitigation method to a platform:</strong> After you’ve completed the mitigation method, you’ll need to link it to your mitigation platform. On the Mitigation tab, locate the platform that you just created and choose your new mitigation method from the drop-down Link Mitigation Method menu.</li> <li><strong>Link mitigation platform and methods to alerts:</strong> Almost done. Next, we’ll configure the specific alerts that you enabled earlier to trigger the mitigation platform and method(s) just configured:<br> - Go to Alerting » Alert Policies.<br> - Open a policy that you’d like to link to an RTBH mitigation, scroll down to the Thresholds section, and go to the Threshold (Critical, Major, etc.) 
that you’d like to trigger on.<br> <img src="//images.ctfassets.net/6yom6slo28h2/stsNv0dYpaeOKIw4qoAyc/2874e4e52752b068c1ce2c0cbadb5fb7/Enable_mitigation-400w.png" alt="Enable_mitigation-400w.png" class="image right" style="max-width: 300px;" />- Set the Mitigation switch to Enabled.<br> - Then choose your mitigation platform and method from the drop-down menus.<br> - Next, you’ll need to decide if you’d like to have the mitigation take effect immediately when Kentik raises the alarm, after a user manually acknowledges the alarm, or after a timer expires without any user having acknowledged the alarm.<br> - Finally, select how fast Kentik should clear the mitigation.<br> - Click the Save button (top right or bottom).</li> </ul> <p>So there you have it: You’ve just created an alert policy that is configured to trigger RTBH mitigation!</p> <p>If you’d like help with RTBH setup, please reach out to us at <a href="mailto:[email protected]">[email protected]</a>.</p> <p>Learn more about Kentik’s solutions for DDoS detection and DDoS mitigation:</p> <ul> <li><a href="https://www.kentik.com/solutions/detect-and-mitigate-ddos/" title="Learn about Kentik solutions for DDoS detection and mitigation">Kentik DDoS Detection and DDoS Mitigation</a></li> <li>Kentik Case Studies about DDoS protection: <a href="https://www.kentik.com/resources/case-study-square-enix/" title="Game on: Square Enix gains critical network insights from Kentik">Square Enix Network Observability and DDoS Protection Case Study</a>, <a href="https://www.kentik.com/resources/case-study-immedion/" title="Immedion Maintains Always-On, Secure Data Centers, Stops DDoS Attacks with Kentik">Immedion DDoS Protection Case Study</a>, <a href="https://www.kentik.com/resources/penteledata-keeps-ddos-at-bay-with-kentik-detect/" title="PenTeleData Keeps DDoS at Bay">PenTeleData DDoS Protection Case Study</a></li> </ul><![CDATA[How ISPs & Managed Service Providers Can Offer DDoS Protection]]><![CDATA[As organizations increasingly rely on digital operations, there's no end in sight to the DDoS epidemic. That aggravates the headaches for service providers, who stand between attackers and their targets, but it also creates the opportunity to offer effective protection services. Done right, these services can deepen customer relationships while expanding revenue and profits. But to succeed, providers will need to embrace big data as a key element of DDoS protection.]]>https://www.kentik.com/blog/offering-big-data-ddos-protectionhttps://www.kentik.com/blog/offering-big-data-ddos-protection<![CDATA[Alex Henthorn-Iwane]]>Mon, 28 Nov 2016 14:00:02 GMT<h2 id="how-service-providers-can-seize-the-cybersecurity-opportunity"><em>How Service Providers Can Seize the Cybersecurity Opportunity</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/3pvd69MJNmG28YgQykUQga/6ad105d020bd434aad868f08254037f4/Thaad-400w.png" class="image right" style="max-width: 400px" alt="" /> <p>While Tier 1 communications providers have long made extensive cybersecurity capabilities part of their managed services portfolios, the offerings of Tier 2 telcos — hosting providers, ISPs, and MSPs — have typically been more limited. The recent dramatic rise in DDoS (distributed denial of service) threats, however, has given service providers of nearly every stripe the opportunity to offer protection services that can enhance customer relationships while growing revenue and profits. 
To do so successfully, service providers will need to embrace big data as a key element of powerful DDoS protection.</p> <h3 id="the-ddos-threat-and-opportunity">The DDoS Threat… and Opportunity</h3> <p>Recent years have brought the DDoS threat home in a way that makes it hard to ignore the risks. Back in 2014 — long ago in Internet time — 41% of organizations globally were hit by DDoS attacks, with more than three-quarters of those (78%) targeted twice or more in the year. Far from dissipating over time, attacks have grown in severity and volume. Recent spectacular attacks include those against internet hosting company OVH, security researcher Brian Krebs, and, most famously, DNS provider Dyn, which resulted in outages at Twitter, Netflix, Amazon, and many other websites for hours.</p> <p>As the threat grows, so too does the vulnerability. More and more businesses are investing in significant digital initiatives to fuel competitiveness, revenues, and profits, and more IT assets are being outsourced to the cloud. That makes both top- and bottom-line aspects of businesses more susceptible to DDoS disruption. Luckily, it appears that the leaders of all types of businesses and government agencies are finally realizing that their organizations could be next. Partly as a result, worldwide spending on information security (per Gartner) was $85 billion in 2015 and is growing at a compound annual growth rate (CAGR) of 9.3%, for a projected market size of $117 billion in 2019.</p> <h3 id="big-data-enhances-accuracy">Big Data Enhances Accuracy</h3> <p>Most people think of DDoS protection simply as “stopping attack traffic,” and in a basic sense that’s true. In an increasingly competitive environment, however, it’s not enough for a service provider to offer just the basics. The first generation of DDoS protection services were based on products built solely around detection and mitigation appliances. Appliances are still necessary and relevant for mitigation because ASIC and Network Processor power is needed for deep packet inspection when scrubbing traffic. But there’s no longer any reason for detection to be trapped in first-generation technology.</p> <div class="pullquote right">Legacy constraints on CPU, memory, and storage limit high-traffic tracking.</div> <p>Legacy detection appliances are severely constrained in their CPU, memory, and storage, which limits their ability to track high volumes of traffic data. They try to compensate by relying on manual configurations and resorting to a variety of computational shortcuts. But they nonetheless miss an unacceptably high percentage of attacks.</p> <p>The key to solving this DDoS detection accuracy issue is big data. A distributed system that scales to network traffic volume can continuously scan network-wide data on a multi-dimensional basis without constraint. And it has the computational power to apply learning algorithms to baselining. The result is 30% more accurate DDoS attack detection. For more information on how big data helps accuracy, read our post on <a href="https://www.kentik.com/big-data-for-ddos-protection/">Big Data for DDoS Protection</a>.</p> <h3 id="analytics-enable-consultative-relationship">Analytics Enable Consultative Relationship</h3> <p>One of the chief advantages that telecom and hosting providers have is the fact that customers already entrust them with critical connectivity and infrastructure services. This trust places them in an ideal position to offer a highly consultative approach. 
Unfortunately, traditional DDoS systems are nearly devoid of real analytics. This means that to offer valuable data in the pre-sales, post-sales, or managed services phase of a customer relationship, service providers have to deploy a separate tool at additional cost.</p> <p>That shouldn’t be the case, since customers can already send vast quantities of rich network telemetry — traffic flow records, BGP routing, and SNMP metrics — to the detection layer of a DDoS protection service. Big data helps by retaining all of that data in full detail, and making it possible to leverage it to advise customers with insights that add real value, cementing the trust relationship. For a glimpse at the kind of analytical power that big data provides, read our post <a href="https://www.kentik.com/ddos-source-geography-netflow-analysis/">DDoS Detection: Source Geography NetFlow Analysis</a>.</p> <h3 id="big-data-that-works">Big Data That Works</h3> <p>Big data can mean many things, since there is a plethora of platforms, both open source and commercial, that promise the sun, moon, and stars to willing masochists. But most big data platforms aren’t fundamentally suited to real-time applications such as DDoS defense and network operations. The few that can be effectively adapted are extremely expensive. And no matter what home-grown option you choose, you have to develop and then maintain DDoS detection and network analytics capabilities. Nearly everyone who’s tried it has soon realized that it’s a painful and resource-intensive endeavor.</p> <div class="pullquote right">Kentik's big data solution is purpose-built for network operations.</div> <p>That’s why Kentik created a big data solution that’s purpose-built for network operations including DDoS protection, deep analytics, and network performance monitoring. Most providers will find Kentik Detect fast and convenient to deploy as a SaaS; you can sign on in less than 15 minutes, register your devices, and set up <a href="https://www.kentik.com/automate-rtbh-ddos-protection-in-under-an-hour/">automated RTBH in under an hour</a>. For service providers with data sovereignty, regulatory, or other requirements, Kentik Detect can also be set up on-premises.</p> <p>Kentik Detect’s big data backend enables us to offer the industry’s most powerful and accurate DDoS detection, combined with integrated support for hybrid mitigation via RTBH and leading mitigation vendor solutions. One regional ISP that’s taken advantage of our superior detection — plus integrated support for Radware DefensePro — is PenTeleData; learn more by reading the <a href="https://info.kentik.com/rs/869-PAD-887/images/Kentik-PenTeleData_case_study.pdf">case study</a>.</p> <p>For additional information about Kentik Detect’s DDoS protection capabilities, read our solution brief to learn more about <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a>. If you already know that you’re ready to try Kentik, <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Big Data DDoS Protection vs the DDoS Marketplaces Threat]]><![CDATA[The source of DDoS attacks is typically depicted as a hoodie-wearing amateur. But the more serious threat is actually a well-developed marketplace for exploits, with vendors whose state-of-the-art technology can easily overwhelm legacy detection systems. 
In this post we look at why you need the firepower of big data to fend off this new breed of commercial attackers.]]>https://www.kentik.com/blog/the-case-for-big-data-ddos-protectionhttps://www.kentik.com/blog/the-case-for-big-data-ddos-protection<![CDATA[Alex Henthorn-Iwane]]>Mon, 21 Nov 2016 14:00:38 GMT<h2 id="ddos-has-crossed-the-chasm-where-does-that-leave-you">DDoS Has Crossed the Chasm. Where Does That Leave You?</h2> <img src="//images.ctfassets.net/6yom6slo28h2/21I6l7bw6k0qgIwiIeGCo0/1bf1f6292cff66b2dc355f34eac96043/DDoS_hoodie-500w.png" alt="DDoS_hoodie-500w.png" class="image right" style="max-width: 300px;" /> <p>At Kentik, we recently debuted our updated DDoS protection solution, which pairs the most accurate available detection with support for hybrid mitigation, including both RTBH and integration with mitigation systems from industry-leading vendors. So we’ve all been thinking a good deal lately about DDoS protection. One of the things that has struck me is that the DDoS threat is so often conceived of as coming from the sort of mysterious outsider that you see personified in images as a hoodie-wearing amateur: malevolent but not incorporated into a larger organized structure.</p> <p>While the lone sociopath with an axe to grind may make good graphics and TV plots, it’s actually more apt to think of many cyber-security threats, including DDoS, in terms of a professional marketplace of exploits. Sure, the so-called “script kiddies” in hacker forums do exist. But according to security experts, many if not most of those kiddies are like entrepreneurs or startups: they’re working to impress commercial buyers of their capabilities and services.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5gRBu0ffscEkMOYKGmgkge/0fcb6d1ebc9ec51e56fccd4492e06fbf/DDoS_Inc-500w.png" alt="DDoS_Inc-500w.png" class="image right" style="max-width: 300px;" /> <h3 id="ddos-goes-to-market">DDoS Goes to Market</h3> <p>The reality is that we’re looking at a fairly well-developed marketplace with vendors at multiple levels. At the “enterprise” level we have nation states; it’s well established that they regularly launch DDoS attacks either as distractions or to punish geopolitical foes. At the mid-market we find criminal syndicates that use DDoS to extract ransoms. Hacktivists can be considered the non-profit sector. Meanwhile a host of retail-level DDoS attackers constitute a B2C sector — the main portion of the DDoS market — in which, for a fraction of a bitcoin, an unscrupulous gamer can launch a DDoS attack against an online foe.</p> <p>The existence of a broad marketplace drives innovation and the rapid adoption of new vectors, malware, and botnets. In just a few years we’ve witnessed the rise of IoT botnet herders, like Mirai, that have unleashed unprecedented attack power. This is power at cloud scale, except that using botnets makes massive scale-out bandwidth essentially free.</p> <p>So DDoS has, as author and marketing consultant Geoffrey Moore would say, “crossed the chasm.” It’s not just, or even primarily, the realm of hoodied hobbyists, hackers, and hangers-on. It’s serious business. Bottom line: if you have a business that depends heavily on the free flow of Internet traffic to reach your customers, your critical IT assets, or your digital solution suppliers, you are a target. 
And you’re up against an agile, innovative, and robust marketplace of players that are going for your network traffic jugular — and more.</p> <h3 id="detection-is-stuck-in-the-past">Detection is Stuck in the Past</h3> <p>If DDoS has crossed the chasm with cloud-scale, innovative and agile approaches, why has DDoS protection lagged so far behind? DDoS detection has been based on single-server appliances since the late 1990s. This is like bringing a (rubber) knife to a (laser) gunfight.</p> <div class="pullquote right">Legacy appliances lack two major requirements for success: accuracy and intelligence.</div> <p>It’s revealing to look at the detection layer of your DDoS defenses because legacy appliances lack two major requirements for success: accuracy and intelligence. In terms of accuracy, appliances tend to miss a lot of attacks because they are so strapped for compute, memory, and storage resources. That results in severe computational shortcuts in monitoring and baselining. One indicator is the amount of manual administration that it takes to keep monitoring schemes up to date with organic changes in the network. It’s an approach that’s not very modern and not very accurate.</p> <p>As for intelligence, traditional <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/">DDoS detection</a> appliances summarize the raw traffic data and then discard the details, so they are literally incapable of providing deep analytics. And analytics are precisely what you need to understand the full implications of changing conditions, vectors, and traffic trends. Sometimes there’s no substitute for slicing and dicing the details of network traffic to figure out what’s going on. But you can’t do that if you don’t have the data.</p> <h3 id="the-case-for-big-data">The Case for Big Data</h3> <p>The application of big data to network operations and anomaly detection is a major advance for <a href="https://www.kentik.com/resources/penteledata-keeps-ddos-at-bay-with-kentik-detect/">DDoS protection</a>. Using cloud-scale compute and storage gives you the headroom to finally look at traffic data holistically, even at very high volume. Big data systems can also utilize learning algorithms that simply aren’t possible on traditional appliances. The result is far higher accuracy of attack detection. In field deployment, Kentik Detect customers have seen a 30-percent improvement in detection accuracy — check it out in our <a href="https://www.kentik.com/resources/penteledata-keeps-ddos-at-bay-with-kentik-detect/">PenTeleData case study</a>.</p> <div class="pullquote right">Big data systems are typically API-aware for ease of integration.</div> <p>Because big data systems are typically built to be highly API-aware they generally lend themselves to integration. In the case of Kentik Detect, we not only integrate with major mitigation solution providers, but can also signal <a href="https://www.kentik.com/automate-rtbh-ddos-protection-in-under-an-hour/">RTBH</a> for any BGP-peered customer.</p> <p>Kentik Detect’s deep storage of raw traffic data, combined with the ability to perform ad-hoc analytics, means that operations engineers can also get fast answers to critical diagnostic questions.</p>
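<p>As a toy illustration of that slicing and dicing (synthetic records, not a real attack capture), note how raw, unsummarized rows can be re-pivoted along whatever dimension the question calls for, by source country one moment and by protocol and country the next, without re-collecting anything:</p> <pre><code># Synthetic flow records: nothing has been summarized away, so the same
# rows can be grouped along any dimension after the fact.
import pandas as pd

flows = pd.DataFrame({
    "src_country": ["CN", "US", "VN", "CN", "BR", "CN"],
    "protocol":    ["udp", "udp", "udp", "tcp", "udp", "udp"],
    "mbps":        [900, 100, 600, 20, 400, 700],
})

print(flows.groupby("src_country")["mbps"].sum().sort_values(ascending=False))
print(flows.groupby(["protocol", "src_country"])["mbps"].sum())
</code></pre>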
<p>For an example of the kind of rapid pivoting enabled by Kentik Detect, check out our post on <a href="https://www.kentik.com/ddos-source-geography-netflow-analysis/">source geography analysis</a> of a DDoS attack.</p> <h3 id="kentik-detect-big-data-saas-for-ddos-protection"><a href="https://www.kentik.com/resources/kentik-detect-for-network-operations/">Kentik Detect</a>: Big Data SaaS for DDoS Protection</h3> <p>DDoS attackers have long since crossed the chasm from enthusiasts to pragmatists, and their arsenal is now brimming with state-of-the-art technology that can easily overwhelm legacy appliance-based detection. Built on big data, Kentik Detect gives you the firepower to fend off this new breed of determined commercial attackers.</p> <p>If you’re ready to experience Kentik Detect for yourself, sign up now to start a <a href="#signup_dialog">free trial</a>, or <a href="#demo_dialog">request a demo</a>. If you’d prefer to first learn more, check out our <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/">Kentik Detect for DDoS Protection</a> solution brief and our white paper on <a href="https://www.kentik.com/resources/the-case-for-big-data-powered-ddos-protection/">The Case for Big Data DDoS Protection</a>. Either way, given the true nature of the hoods inside those DDoS hoodies, sticking with the status quo isn’t a viable option.</p><![CDATA[It's Time for NPM Appliances to Go Bye-Bye]]>https://www.kentik.com/blog/its-time-for-npm-appliances-to-go-bye-byehttps://www.kentik.com/blog/its-time-for-npm-appliances-to-go-bye-bye<![CDATA[Alex Henthorn-Iwane]]>Fri, 18 Nov 2016 15:43:17 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/5hl00YwK1Uy4eaggUoOICg/d9dd22d3bd91a240697e4566809e71a5/cloudit.png" alt="cloudit.png" class="image right" style="max-width: 300px;" /> I recently had the chance to talk with some editors at TMCNet, specifically for Cloud IT Week, and we had a good conversation about the need for NPM to move beyond aging appliances and into cloud models. It was one of those rapid, speed-dating type conversations and you never know how it will work out, but I just read the article and they really captured things well. The opener of the <a href="http://www.clouditweek.com/topics/cloud-it/articles/427228-kentik-cloud-world-calls-cloud-scale-npm.htm">article</a> is:</p> <blockquote> <p><em>“The writing’s on the wall – it’s time for appliance-based network performance solutions to go bye-bye. At least that’s how Alex Henthorn-Iwane of Kentik sees it.”</em></p> </blockquote> <p>As the article mentions and we’ve referenced before, Gartner, in its May 2016 research note, agrees that a real sea change is needed for NPM:</p> <blockquote> <p><em>“Migration to the cloud, in its various forms, creates a fundamental shift in network traffic that traditional network performance monitoring tools fail to cover. I&#x26;O leaders must consider cloud-centric monitoring technologies to fill visibility gaps.”</em></p> </blockquote> <p>Kentik offers a cloud-friendly NPM solution that includes the mature and proven nProbe NPM agent from ntop, which can be installed on application and load-balancing servers. It performs packet capture on sampled traffic flows of actual application traffic. The agents send this data, plus traffic flow statistics, to our cloud-based, big data platform Kentik Detect. Kentik Detect also takes in flow, BGP, SNMP, and other data at scale from routers, switches, and probes. 
All that data is stored raw for 90 days, and you can <a href="https://gestaltit.com/favorites/rich/drill-baby-drill-netflow-kentik/">drill, baby drill</a> into billion-row datasets in seconds. It’s a big data analytics portal for network data, and you’ve never seen anything like it.</p> <p>Why is this cloud friendly? Because cloud infrastructure doesn’t have traditional switches with tap ports for you to attach traditional probes. Application components are more distributed in cloud environments, and environments may spin up on demand. With nProbe, you can include the agent for instrumentation in your container orchestration scheme so that wherever servers are hosting application components, you can have NPM metrics capture as well.</p> <p>If you’re an organization practicing DevOps, then you want to measure everything and feed that information back to the dev team. Traditional appliance-based NPM APIs are at best second-class citizens, so they’re not functionally useful for actually sharing data. Kentik’s NPM SaaS back-end offers a powerful portal, but also REST and SQL APIs. The vast majority of our customers utilize our APIs to get rich network data from our platform into other portals, tools, and databases. We also don’t restrict users, so developers can have full access to the portal to understand latency, retransmits, and traffic flow dynamics.</p> <p>The beauty of the Kentik Detect back-end is that it’s not just for NPM: it is also the world’s most powerful NetFlow traffic analysis platform, plus the most accurate DDoS detection platform that can automatically trigger attack mitigation using multiple methods. No matter the cause of performance issues, you need a platform that can help on multiple fronts rather than swivel-chairing between different tools. As I said to Cloud IT Week:</p> <blockquote> <p><em>“If you are doing digital business in the cloud, then you should really be looking at a cloud-friendly way of doing NPM and of really protecting your overall network experience.”</em></p> </blockquote> <p>If you want to get cloud-friendly NPM, then read more about <a href="https://www.kentik.com/kentik-detect/">Kentik Detect</a> and our <a href="https://www.kentik.com/network-performance-monitoring/">NPM solution</a>. Want to get your hands on big data, cloud-friendly NPM power now? Start a <a href="#signup_dialog">free trial</a>, and we’ll be in touch to walk you through a demo and help you see how to move your NPM practice into the modern, cloud era.</p><![CDATA[The Dyn &#x26; Mirai Horror Show: The Weaponization of DDoS using Botnets]]><![CDATA[Whether it's 70s variety shows or today's DDoS attacks, high-profile success begets replication. So the recent attack on Dyn by Mirai-marshalled IoT botnets won't be the last severe disruption of Internet access and commerce. 
Until infrastructure stakeholders come together around meaningful, enforceable standards for network protection, the security and prosperity of our connected world remains at risk.]]>https://www.kentik.com/blog/the-dyn-mirai-horror-showhttps://www.kentik.com/blog/the-dyn-mirai-horror-show<![CDATA[Alex Henthorn-Iwane]]>Mon, 14 Nov 2016 14:00:19 GMT<h3 id="what-those-70s-shows-tell-us-about-the-ddos-threat"><em>What Those 70s Shows Tell Us About the DDoS Threat</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/34RWLCof7OYmuCwOAcEUCG/a6d464f85f71a3dab8c239ad6532d7f5/Donny_Marie-400w.png" class="image right" style="max-width: 400px; margin: 15px;" alt="Donny and Marie" /> <p>Back when I was a kid in the 1970s (sorry, I’m aging myself), a major component of the very limited lineup on network television was the variety show. As you can tell from the name, these shows featured a variety of different acts, including music, stand-up comedy, and dance routines. Just like the success of Survivor and American Idol in the 2000s spawned a profusion of reality shows of different stripes, the seventies had a huge glut of variety shows that were trying to repeat the success of top-performing shows hosted by Carol Burnett and Sonny &#x26; Cher.</p> <p>Every celebrity seemed to get a variety show, and of course most didn’t last. I remember watching the premiere of the very short-lived variety show hosted by Wolfman Jack, a New York DJ famous for his signature wolf howl (he did it like seven times on the show). It was quite something to watch poor Wolfman attempt to sashay around the stage and “sing” the opening number in his growly voice. Another prime example of the variety-show gold rush was the Donny &#x26; Marie show. It featured the brother-sister pop duo Donny and Marie Osmond, who hailed from the Osmond entertainment family; every week they would faithfully sing “I’m a little bit country, I’m a little bit rock-n-roll.” Donny &#x26; Marie were very clean entertainers, kind of to an extreme. In fact, the show was so sickly sweet that even by the standards of the day, it was a bit nauseating.</p> <h4 id="here-come-the-copy-cats">Here come the copy-cats</h4> <p>By now you may be asking yourself: other than a very tenuous naming parallel, what do Donny &#x26; Marie have to do with DDoS attacks such as the one launched against Dyn by a botnet assembled with Mirai malware? Let me explain…</p> <p>Security companies and experts have been warning for years of a perilous lack of security in IoT devices, which has created a bonanza (ha — another classic television reference) for botnet malware like Bashlight and Mirai. Over the last couple of years, the footprint of botnet malware has grown dramatically and attacks have been getting far stronger, such as the Terabit+ attack on OVH not long ago. Outside of the usual security crowd, however, this phenomenon hasn’t garnered much sustained attention, because to most people the concept of IoT botnet attacks still seemed too abstract.</p> <p>Now, however, with Netflix, Amazon, and PayPal going down — and hundreds of millions of e-commerce dollars lost — everybody is sitting up and paying attention, giving attackers a bonafide hit. But this is no fun variety show. It’s a full-on horror show, which by the way, is one of the more profitable and copied segments in movie-making these days. Now that it’s apparent that neglect of basic security has created so many Internet vulnerabilities, we can expect attackers to be like producers rushing to emulate hit shows. 
Unless the industry chooses to act, we’ll see “successful” incidents like the Dyn attack replicated with stomach-churning frequency.</p> <img src="//images.ctfassets.net/6yom6slo28h2/14HcKd9FYooYeI0oeeS0UA/38200fe439ae53d1c62ca3076ad77c59/Death_star-400w.png" class="image left" style="max-width: 400px" alt="" /> <h4 id="just-a-demo">Just a demo?</h4> <p>Flashpoint did a very nice <a href="https://www.flashpoint-intel.com/action-analysis-mirai-botnet-attacks-dyn/">after-action analysis</a> of the attack on Dyn. Their conclusion? It’s unlikely that the attack was politically motivated or launched by a nation-state, primarily because Dyn is not a political target and because the same infrastructure used in the Dyn attack was also used to launch an attack against a gaming company. The Flashpoint assessment is that this was the type of attack that typically originates with the Hackforums community, which is where attackers offer their services on a commercial basis. That possibly makes the attack something that was done simply to show what’s possible. In other words, a demo.</p> <p>A <a href="https://www.schneier.com/blog/archives/2016/10/ddos_attacks_ag.html">similar view</a> was put forward by security expert Bruce Schneier, though he adds that the Dyn attack bears some similarity to the kinds of probing that some major nation-states use to assess <a href="https://www.schneier.com/blog/archives/2016/09/someone_is_lear.html">vulnerabilities in Internet infrastructure</a>.</p> <p>So if this was just a “demonstration,” what would the fully functional death star look like when it gets going? We know for sure that it’s not going to be pretty.</p> <h4 id="the-clone-army">The clone army</h4> <img src="//images.ctfassets.net/6yom6slo28h2/7sFznv6Sze8mMgCk80Qgk6/aee1d2b26b40dd4377838681a20e9b45/Clone_army-500w.png" class="image right" style="max-width: 500px" alt="" /> <p>DDoS attacks often utilize botnets, consisting of thousands of compromised hosts that can be told to boo, throw fruit, or launch whatever vector the attackers decide upon. Of course, the owners of those hosts typically don’t know that they’re participants in a show. IoT has made it much, much easier to harness these unwitting extras. Just look at the username-password dictionaries that Mirai uses. Shipping products with default usernames and passwords — or even hard-coded passwords — is beyond negligent. At this point IoT manufacturers have created a fifth column by deploying armies of ready-to-activate clones into the heart of the digital economy.</p> <h4 id="isps-in-the-audience">ISPs in the audience</h4> <p>Another notable aspect of the Dyn attack is that it was absolutely seen by ISPs but those ISPs had very little incentive to act. During the attack, we were live with customers who could observe spikes of DNS traffic going out of or through their networks but saw no reason to do anything about it. That’s partly because the attack was more like water torture than a massive, all-at-once flood, and partly because the volume of bandwidth consumed by the traffic was trivial in the context of each individual ISP. While that reasoning may be understandable, there is certainly a case to be made that more collaboration within the Internet infrastructure industry could be very helpful in combating the scourge of DDoS.</p> <h4 id="cancel-the-show">Cancel the show!</h4> <p>The Dyn &#x26; Mirai Show was frighteningly bad. So how do we get it canceled? 
Once again it comes back to a willingness on the part of the Internet and IoT industries to take the threat seriously and police themselves accordingly. Kentik CEO Avi Freedman has written about the need for Internet infrastructure players to utilize long-established BCP-38 techniques to <a href="http://venturebeat.com/2016/01/29/the-great-network-forgery/">prevent IP address spoofing</a>. That will prevent attackers from lurking so easily in the shadows. A related suggestion by Avi, reported recently in the <a href="http://www.sandiegouniontribune.com/news/science/sd-me-internet-ofthings-20161024-story.html">San Diego Union-Tribune</a>, is to establish an industry-wide label of assurance. “What’s needed,” he says, “is highly-visible, Underwriters Laboratory or Consumer Reports-style ratings of the cybersecurity of devices.”</p> <p>It’s clear that if we want to deal effectively with the coming onslaught of IoT-based attacks, the Internet and IoT industries will have to stop being passive spectators to the destruction, instead working actively to raise the standard for security hygiene. The real threat to the digital economy and even society at large is now apparent. It’s time for a new action series that stars the industry fighting back.</p> <p>Read our solution brief to learn more about <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a>.</p><![CDATA[Partnering with Radware for DDoS Protection]]>https://www.kentik.com/blog/partnering-with-radware-for-ddos-protectionhttps://www.kentik.com/blog/partnering-with-radware-for-ddos-protection<![CDATA[Jim Frey]]>Tue, 08 Nov 2016 14:00:11 GMT<img src="//images.ctfassets.net/6yom6slo28h2/2XJcYgjIJy2K4Y0Wc4c6Is/a694dad7a28a354892d99c47ba028f93/Radware_Logo_Color_300.png" alt="Radware_Logo_Color_300.png" class="image right" style="max-width: 300px;" /> <p>Kentik recently announced some major new enhancements to Kentik Detect in the arena of <a href="https://www.kentik.com/ddos-detection/">DDoS protection</a>. If you saw that announcement and/or some of the materials associated with it, you’ll notice that Kentik Detect now supports integrated automation of multiple DDoS attack mitigation options. A key aspect of our DDoS protection solution is our partnership with Radware and integration with Radware DefensePro.</p> <p>Radware is an industry leader in DDoS protection. They count some impressive organizations among the customers of their cyber-security solutions, including:</p> <ul> <li>The largest CDNs and cloud DDoS providers</li> <li>7 of the top 14 stock exchanges</li> <li>12 of the top 20 commercial banks</li> <li>6 of the top 20 retailers</li> <li>4 of the largest telcos worldwide</li> </ul> <p>Naturally, Kentik started to cross paths with Radware in mutual accounts, and it just made sense to start working together. What really catalyzed things, as is the case in most business partnerships, was a joint customer who was looking for an integration, and in this case that joint customer was PenTeleData. PenTeleData was one of Kentik’s earliest adopters and had used Kentik Detect for network traffic analysis since 2014. Meanwhile, they had been using an appliance-based DDoS detection system from a well-known vendor in that space, but it was missing a lot of attacks, leading to a ton of havoc.</p> <p>PenTeleData’s desire for a better overall solution dovetailed with our development of the recently announced DDoS detection enhancements. 
PenTeleData chose Radware to refresh their DDoS mitigation layer, and asked for Kentik and Radware to connect the two systems. PenTeleData initially deployed the combined solution in May of 2016, and it has been in full production since June. The results have been stellar — a 30% improvement in catching and stopping DDoS attacks.</p> <blockquote> <p>“When we first deployed Kentik Detect, we started seeing attacks that weren’t being caught by our previous DDoS defense solution,” said Frank Clements, Engineering Manager at PenTeleData. “Once we set Kentik Detect to automatically trigger mitigation via our Radware DefenseFlow platform, the constant pattern of interrupts and firefighting really quieted down.”</p> </blockquote> <img src="//images.ctfassets.net/6yom6slo28h2/6Ma4aroUEw62kE0GQkss26/94c21636ca1c4318700dd26fa4972ba4/penteledata.png" alt="penteledata.png" class="image right" style="max-width: 200px;" /> <p>You can’t argue with results. Since that time working together at PenTeleData, Kentik and Radware have launched direct field and technical collaboration so that anyone who wants the power of big data for detection and analytics, paired with the mitigation prowess of Radware DefensePro, can get it. As David Anderson, Corporate Vice President, North America for Radware said:</p> <blockquote> <p>“Radware’s award-winning DefensePro DDoS protection and network security platform provides the fastest and most complete mitigation from all attack types on the market today. The integration with Kentik Detect provides the best of both worlds for fast and accurate detection with complete and accurate mitigation.”</p> </blockquote> <p>We couldn’t agree more. We’re delighted to be partnering with Radware and look forward to helping all of our mutual customers get the best possible DDoS protection. If you want to learn more about the joint solution, read the <a href="https://info.kentik.com/rs/869-PAD-887/images/Radware_Kentik_Solution_Brief.pdf">Radware-Kentik solution brief</a>. If you want more details on the PenTeleData story, read the <a href="https://info.kentik.com/rs/869-PAD-887/images/Kentik-PenTeleData_case_study.pdf">case study</a>. Know you need big data DDoS detection accuracy and deeper network analytics? Start a <a href="#signup_dialog">free trial</a> or <a href="#demo_dialog">request a demo</a> today.</p><![CDATA[Automate Remotely Triggered Black Hole DDoS Protection in Under an Hour]]><![CDATA[DDoS attacks pose a serious and growing threat, but traditional DDoS protection tools demand a plus-size capital budget. So many operators rely instead on manually-triggered RTBH, which is stressful, time-consuming, and error-prone. 
The solution is Kentik’s automated RTBH triggering, based on the industry’s most accurate DDoS detection, which sets up in under an hour with no hardware or software install.]]>https://www.kentik.com/blog/automate-rtbh-ddos-protection-in-under-an-hourhttps://www.kentik.com/blog/automate-rtbh-ddos-protection-in-under-an-hour<![CDATA[Alex Henthorn-Iwane]]>Tue, 01 Nov 2016 13:00:03 GMT<h2 id="ddos-detection-and-mitigation-at-the-speed-of-saas">DDoS Detection and Mitigation at the Speed of SaaS</h2> <img src="//images.ctfassets.net/6yom6slo28h2/6Cx7RTfsPuSuo8U4uO4uEw/c100f36b501e57769b31b3f7cf21ec77/dynamite.jpg" alt="ddos dynamite illustration" class="image right" style="max-width: 300px;" /> <p>In the wake of the recent takedown of DNS provider Dyn, it’s common knowledge that Distributed Denial of Service (DDoS) attacks pose a serious and growing threat to the Internet and its component networks. What’s less obvious is what to do about that threat, especially if you don’t have a large capital budget to invest in traditional DDoS protection tools. A common fallback is to rely on manually defending your network with Remote Triggered Black Hole (RTBH) routing. Stressful, time-consuming, and error-prone, that’s a far-from-ideal way to protect network availability and performance. Instead, Kentik can help you automate RTBH triggering based on the industry’s most accurate DDoS detection. And you can set it all up in under an hour without having to install any hardware or software.</p> <h2 id="how-rtbh-works">How RTBH works</h2> <p>The “black hole” part of RTBH refers to the fact that edge routers are typically provisioned with a static route (commonly for 192.0.2.1) pointing to the null0 interface — the black hole. Any traffic routed to that virtual interface is simply dropped, never to be heard from again. Using iBGP, traffic can be redistributed to this static route on your own network’s edge routers. Alternatively, you can announce routes for that traffic with a specific BGP community (commonly :666) that signals to your ISPs that they should blackhole it.</p> <p>During an attack, a remote iBGP speaker injects a “trigger” route with next-hop IP or BGP community attributes. This trigger initiates activation of blackholing policies, either at the local edge and/or within the upstream ISP networks. The trigger source can inject the route either via full-mesh iBGP peering with all of the edge routers or via any/all route reflectors that are required to propagate the trigger route to the edge routers.</p> <img src="//images.ctfassets.net/6yom6slo28h2/h8zu6GllPUYuwW2S282yA/61488e65d554d67d196c2a18ddca522d/typing.png" alt="ddos typing illustration" class="image right" style="max-width: 300px;" /> <h2 id="ddos-protection-via-typing-at-2-am">DDoS protection via typing at 2 a.m.</h2> <p>As noted above, if your network operations budget doesn’t allow you to invest in traditional DDoS protection or NetFlow analysis tools, you are pretty much stuck with manual triggering of black-holing, which is a stressful, time-consuming, and error-prone process. This is especially true when you’re a network tech who’s been woken up in the middle of the night by a hysterical pager. Bleary-eyed, you log into your systems to try to figure out what’s gone wrong (without automated analysis, you may be hunting around a bit to figure things out).</p>
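<p>The irony is that the trigger itself is tiny. Here’s a minimal sketch of injecting and withdrawing the /32 described above from a script, written in the style of ExaBGP’s process API (which applies lines printed to stdout as BGP actions); the addresses and community value are hypothetical placeholders, and this is a rough illustration rather than Kentik’s implementation:</p> <pre><code>#!/usr/bin/env python3
# Minimal RTBH trigger sketch in the style of ExaBGP's process API:
# ExaBGP launches this script and applies each line printed to stdout
# as a BGP action. Addresses and community are placeholders.
import time

BLACKHOLE_NEXT_HOP = "192.0.2.1"    # resolves to null0 on the edge routers
BLACKHOLE_COMMUNITY = "65000:666"   # hypothetical blackhole community

def announce(victim_ip):
    """Inject the /32 trigger route for an IP under attack."""
    print(f"announce route {victim_ip}/32 next-hop {BLACKHOLE_NEXT_HOP} "
          f"community [{BLACKHOLE_COMMUNITY}]", flush=True)

def withdraw(victim_ip):
    """Pull the trigger route once the attack has subsided."""
    print(f"withdraw route {victim_ip}/32 next-hop {BLACKHOLE_NEXT_HOP}",
          flush=True)

if __name__ == "__main__":
    announce("203.0.113.50")        # example victim address
    time.sleep(1800)                # grace period, then clean up
    withdraw("203.0.113.50")
    while True:                     # ExaBGP expects the process to stay alive
        time.sleep(60)
</code></pre>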
<p>Once you see that a DDoS attack is underway, you need to ssh to the remote trigger router and type in a network statement or other config, keeping your fingers crossed that you’re not fat-fingering something at 2 AM.</p> <h2 id="big-data-ddos-detection-with-automated-rtbh">Big data DDoS detection with automated RTBH</h2> <p>The 2 AM scenario is clearly less than foolproof, but luckily it’s no longer the only affordable option, because automated RTBH is now available in a SaaS DDoS protection solution. With recent DDoS defense enhancements to Kentik, network operators can access state-of-the-art DDoS detection and automated RTBH mitigation without a bank-breaking capital investment. The major enhancements to Kentik’s DDoS protection capabilities include:</p> <ul> <li><em>The industry’s most accurate DDoS detection.</em> We’ve added true anomaly detection to our already-powerful, network-wide traffic monitoring and alerting. Kentik includes the typical statically-configured thresholds, but it also intelligently tracks which IPs are the top-N traffic receivers. These IPs are baselined individually and evaluated for anomalies. We then trigger alerting and auto-mitigation policies based on configured thresholds or deviations from baseline. Because Kentik always looks at network-wide data — even when that means billions of records per policy — and automatically adjusts baselining, it’s far more accurate than any other DDoS detection system. Period.</li> <li><em>Automated mitigation.</em> Kentik can now be configured as a remote triggering system. Configure Kentik as an iBGP peer of your edge routers or route reflectors, create some mitigation policies with mitigation enabled, and you’re ready to go. If you are using or considering Radware DefensePro or A10 Thunder TPS mitigation systems, those are also supported as auto-mitigation options that you can configure via the Kentik portal UI.</li> </ul> <p>Automated RTBH (and other mitigation techniques) means no more manual blackholing at 2 AM. Sleep easier. Use your time for more productive things. Move the business forward instead of rowing in place like a crazy person.</p> <p>On top of handling your DDoS-related needs, Kentik also gives you state-of-the-art big data network traffic visibility. We correlate flow records (NetFlow, sFlow, and IPFIX) with BGP attribute data from live BGP peers, as well as SNMP interface and GeoIP data. We unify the data into a distributed HA time-series database. And we retain that data for a minimum of 90 days. With Kentik you can group and graph billion-row sets of raw traffic details by up to eight dimensions (out of dozens) and get answers to ad hoc queries in a few seconds. This is Big Data power.</p> <h2 id="saas-is-sooo-easy">SaaS is sooo easy</h2> <p>The great thing about getting all this automation and analytics via Kentik is that it’s so easy. There are no boxes, no software to install, no capital expenses, no operational overhead or maintenance. It’s an annual subscription. Simple.</p> <p>You can register for a <a href="#signup_dialog">free trial</a>, start sending flow data, and get up and running in the portal in about 15 minutes. If you’re already doing RTBH, you can configure Kentik as your automated remote trigger in under an hour. We’ve seen customers for whom automated mitigations started happening nearly immediately after signing up and configuring RTBH. 
And automated mitigation makes for happy network engineers.</p> <p>Ready to learn more about RTBH and how Kentik delivers the industry’s most accurate DDoS protection? Check out the following links:</p> <ul> <li><a href="https://www.bgp4all.com.au/ftp/isp-workshops/Security%20Presentations/00-RTBH.pdf">RTBH overview</a></li> <li><a href="https://www.kentik.com/solutions/detect-and-mitigate-ddos/" title="Learn about Kentik solutions for DDoS detection and mitigation">Kentik DDoS Detection and DDoS Mitigation</a></li> <li>Kentik Case Studies about DDoS protection: <a href="https://www.kentik.com/resources/case-study-square-enix/" title="Game on: Square Enix gains critical network insights from Kentik">Square Enix Network Observability and DDoS Protection Case Study</a>, <a href="https://www.kentik.com/resources/case-study-immedion/" title="Immedion Maintains Always-On, Secure Data Centers, Stops DDoS Attacks with Kentik">Immedion DDoS Protection Case Study</a>, <a href="https://www.kentik.com/resources/penteledata-keeps-ddos-at-bay-with-kentik-detect/" title="PenTeleData Keeps DDoS at Bay">PenTeleData DDoS Protection Case Study</a></li> </ul><![CDATA[How Scalable Architecture Boosts DDoS Detection Accuracy]]><![CDATA[Can legacy DDoS detection keep up with today's attacks, or do inherent constraints limit network protection? In this post Jim Frey, Kentik VP Strategic Alliances, looks at how the limits of appliance-based detection systems contribute to inaccuracy — both false negatives and false positives — while the distributed big data architecture of Kentik Detect significantly enhances DDoS defense.]]>https://www.kentik.com/blog/big-data-for-ddos-protectionhttps://www.kentik.com/blog/big-data-for-ddos-protection<![CDATA[Jim Frey]]>Tue, 25 Oct 2016 13:00:03 GMT<h3 id="how-scalable-architecture-boosts-accuracy-in-detection"><em>How Scalable Architecture Boosts Accuracy in Detection</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/64qY9NK0CIYwmykkY0K60o/d206d5da58d0b106c4e4dc616852a0bc/night_lights-500w.png" class="image right" style="max-width: 500px; margin-bottom: 20px;" alt="ddos night lights illustration" /> <p>Last week’s massive attack on DNS provider Dyn — with its attendant disruption to many web companies and their users — was yet another reminder of the severity of the DDoS threat. Though spectacular exploits against Internet infrastructure providers like Dyn are just a fraction of overall attacks, complacency is not a viable option. Success in digital business requires maximum availability and performance, which depends in turn on effective, comprehensive DDoS defense.</p> <p>Today’s most common approach to DDoS defense involves out-of-band detection appliances coupled with hybrid cloud mitigation. While the network environment — traffic volume, infrastructure distribution, and the size and sophistication of attacks — has evolved dramatically in recent years, the appliances themselves remain largely unchanged. Is appliance-based detection keeping up, or is it inherently limited in ways that have real consequences for network protection?</p> <h4 id="limits-of-the-status-quo">Limits of the status quo</h4> <p>In the prevailing appliance-based model, an out-of-band appliance detects attacks based on NetFlow, sFlow, IPFIX, and BGP data. 
This appliance then signals the network, via network control plane or element management protocols, to either drop traffic at the network edge or redirect traffic to a private or public cloud mitigation device.</p> <p>Over time, only a fraction of total traffic will need to be mitigated. So the cost-efficiency of the system is maximized by dedicating one or more appliances to detection and selectively pushing traffic to mitigation devices. The out-of-band architecture also provides the option to utilize hybrid mitigation techniques that are tailored to specific needs and objectives. These may include Remote Triggered Black Hole (RTBH), Access Control Lists (ACLs), local mitigation appliances, and cloud-based mitigation services.</p> <div class="pullquote left">Appliance-based systems are plagued by problems with detection accuracy.</div> <p>The approach makes sense on its face, but the dirty secret of such appliance-based systems is that they are plagued by vexing problems with detection accuracy. These issues are rooted in the inherent compute and storage limitations of scale-up detection architectures. Legacy detection software typically runs on a single, multi-core CPU server using some Linux OS variant. When confronted with a massive volume of flow records, these servers must apply nearly all of their compute and memory resources to unpacking payload data from UDP datagrams, converting it from binary flow-protocol formats to ASCII-style data, and then storing it in a generic MySQL-style relational database (with the attendant high-latency read/write table-schema structure).</p> <p>The constraints outlined above leave traditional detection appliances with precious little memory and computing capacity to operate detection algorithms. As a result, the state of the art in appliance-based detection leans on a few possible approaches:</p> <ul> <li>Simplistic static thresholds applied broadly across all potential attack targets.</li> <li>A small pool of statically configured objects that perform baselining of IPs. Since there are so few of these, and because it is so difficult to manually change them, most network security teams end up configuring large pools of IP addresses into these objects. Since the traffic behavior towards individual IPs is lost in the averages of the IP address pool, the result is a constant stream of false negatives.</li> <li>Segmented rather than network-wide views of traffic data. Since most traditional tools rely on separate tables to track monitoring data for different purposes, there are hard limits to how “wide” of a dataset can be handled by a given table and its corresponding monitoring process. That encourages the segmentation of data into more predictable buckets. Most commonly, detection tools (and NetFlow analytics tools in general) are hard coded to segment data by flow exporter IP. As a result, if there is more than one flow exporter, any baselining is performed on a fraction of the overall network traffic, leading to inaccurate evaluation of anomalous conditions.</li> </ul> <h4 id="big-data-to-the-rescue">Big data to the rescue</h4> <div class="pullquote right">DDoS is a big data problem — too big for scale-up architecture.</div> <p>How do we get beyond the inaccuracies in legacy DDoS detection systems? By recognizing that DDoS is a big data problem and removing the constraints of scale-up architecture. 
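</p> <p>As a toy illustration of the adaptive approach described below (auto-learn the top-N traffic receivers, then baseline and measure each one individually), consider the following sketch. The names and thresholds are hypothetical, and this is emphatically not Kentik’s production algorithm:</p> <pre><code># Toy illustration of adaptive per-IP baselining and anomaly detection.
from collections import Counter, defaultdict

TOP_N = 10000           # how many "interesting" receivers to track
ALPHA = 0.05            # EWMA smoothing factor for the baseline
DEVIATION_FACTOR = 4.0  # flag traffic at more than 4x baseline

totals = Counter()             # long-run bytes per destination IP
baseline = defaultdict(float)  # EWMA of per-interval bytes per IP

def process_interval(flows):
    """flows: iterable of (dst_ip, nbytes) pairs for one interval."""
    interval_bytes = Counter()
    for dst_ip, nbytes in flows:
        interval_bytes[dst_ip] += nbytes
        totals[dst_ip] += nbytes

    # Auto-learn the current top-N traffic receivers, network-wide.
    interesting = {ip for ip, _ in totals.most_common(TOP_N)}

    alerts = []
    for ip in interesting:
        current = interval_bytes.get(ip, 0)
        base = baseline[ip]
        if base > 0 and current > DEVIATION_FACTOR * base:
            alerts.append((ip, current, base))  # candidate DDoS target
        # Update the rolling baseline for this IP.
        baseline[ip] = (1 - ALPHA) * base + ALPHA * current
    return alerts
</code></pre> <p>The hard part is doing this at scale. 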
The fact is that there are billions of traffic flow records to ingest and millions of IPs that need to be tracked individually and measured for anomalies. How is it possible to know which are significant? In a scale-up reality, it’s not.</p> <p>Luckily, cloud-scale big data systems make it possible to implement a far more intelligent approach to the problem:</p> <ul> <li>Monitor the network-wide traffic level of millions of individual IP addresses.</li> <li>Monitor against multiple data dimensions. While in many cases it is sufficient to look for violations of simple traffic thresholds, for the vast majority of attacks it’s becoming necessary to go beyond a single dimension and recognize the relationships between multiple indicators.</li> <li>Automatically identify and track “interesting” IP addresses by auto-learning and continuously updating a list of top-N traffic receivers. Then perform baselining and measurement to detect anomalies on any current member of that list.</li> </ul> <p>This scalable, adaptive approach to monitoring and anomaly detection has been field-proven to be far more accurate than legacy approaches. One Kentik customer, PenTeleData, is reporting greater than 30 percent improvement in catching and stopping DDoS attacks (i.e., fewer false negatives) since implementing the built-in detection and alerting capabilities of Kentik Detect. For more detail, read our <a href="https://info.kentik.com/rs/869-PAD-887/images/Kentik-PenTeleData_case_study.pdf">PenTeleData case study</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/74lfhpbETmQEoMcWagGGQm/d7753d75bd4af34fc4d6e9b4cd7013c4/DDoS_diagram-820w.png" class="image center no-shadow" style="max-width: 820px" alt="ddos diagram" /> <h4 id="deep-analytics">Deep analytics</h4> <p>The big data approach that Kentik uses to deliver more accurate DDoS detection also makes possible long-term retention of raw flow records and related data. Kentik Detect is built on Kentik Data Engine (KDE), a distributed time-series database that correlates NetFlow, sFlow, and IPFIX with BGP routing and GeoIP data, then stores it unsummarized for months. As a post-Hadoop big data solution, KDE can perform ad-hoc analytical queries — using up to eight dimensions and filtered by multiple field values — on billions of records with answers returned in just a few seconds.</p> <div class="pullquote left">Kentik Detect enables big data in real time, without the delays inherent in MapReduce.</div> <p>By enabling big data in real time, without the delays inherent in MapReduce, Kentik Detect provides deep forensic insight that can be invaluable in understanding the nature of network, traffic, and attack vectors as they change. Where volumetric attacks may be suspected of covering up more intrusive attacks, exploratory, ad-hoc analytics can be used to find them. For a great example, take a look at this blog series on using Kentik Detect to <a href="https://www.kentik.com/ddos-separating-friend-from-foe/">dig deeper into DDoS attacks</a> and <a href="https://www.kentik.com/using-kentik-detect-find-current-attacks/">find current attacks</a>.</p> <h4 id="tearing-down-silos">Tearing down silos</h4> <div class="pullquote right">DDoS protection is ultimately about delivering a great network experience.</div> <p>DDoS detection has been such a difficult problem for legacy approaches that it’s easy to forget that it’s also been an information silo.
Silos are counter-productive because they impede the clarity of insight needed to achieve truly important business and organizational goals. That’s relevant because DDoS protection is about more than just defense. Ultimately, the goal for network operations teams is to deliver a great network experience that supports a superior user/customer experience. To accomplish that higher-level goal, you need to be able to traverse easily between network performance monitoring (NPM), network traffic analysis, Internet routing analysis, DDoS detection/protection, and network security forensics.</p> <p>Big data is the ideal approach for unifying the data details at scale and providing the compute power to get operational value from analytics fast. Kentik Detect offers all these forms of visibility in one platform for precisely this reason. If you’re interested in learning more about how Kentik’s big data approach enhances DDoS and anomaly detection, read our solution brief, <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a>. If you already know that you want to get better DDoS detection and automated mitigation, start a <a href="#signup_dialog">free trial</a>.</p><![CDATA[Valuable Network Data Deserves Life in the Cloud]]><![CDATA[There's a horrible tragedy taking place every day wherever legacy visibility systems are deployed: the needless slaughter of innocent NetFlow data records. NetFlow is cruelly disposed of, and the details it can reveal are blithely ignored in favor of shallow summaries. Kentik VP of Marketing Alex Henthorn-Iwane reports on this pressing issue and explains what each of us can do to help Save the NetFlow.]]>https://www.kentik.com/blog/save-the-netflowhttps://www.kentik.com/blog/save-the-netflow<![CDATA[Alex Henthorn-Iwane]]>Mon, 17 Oct 2016 13:00:07 GMT<p>Network operators and engineers have been victims of a horrible tragedy: the needless slaughter of innocent NetFlow data records. Every day, all over the globe, decent and hard-working network infrastructure elements like routers and switches diligently record valuable statistics from IP traffic flows and faithfully bundle those statistics into NetFlow, sFlow, and IPFIX records. These records are then sent to NetFlow collectors, where they are supposed to have the opportunity to work, putting their rich stores of valuable statistical information to use towards the universally accepted and common good of improving network activity awareness and performance. These upstanding deeds — like solving congestion problems, planning capacity, defeating DDoS attacks, and the like — are the outcomes that have been promised for a (technology) generation now. This is the vision that the creators of this good and wholesome system imagined.</p> <div as="Promo"></div> <p>How far we have fallen short of that vision.</p> <div class="pullquote right">The details that NetFlow can reveal are ignored in favor of shallow summary reports.</div> <p>These innocent — and let’s not forget, <em>intrinsically valuable</em> — flow records are not being allowed to live the full productive lives they were promised. Instead, vendors of legacy network visibility systems have perpetrated an insidious program of NetFlow record layoffs.
The nearly endless details that NetFlow records can reveal are blithely ignored, with shockingly shallow summary reports foisted on network engineers and operators instead. This fools no one, of course. Network engineers know that a few single-dimension top talker pie charts don’t give them enough operational detail to really solve problems. Yet, somehow the purveyors of data reduction have succeeded in peddling the notion that this is as good as it gets.</p> <h2 id="a-darker-place">A darker place</h2> <p>While the shallow mockery of network visibility is bad enough, the full story goes to an even darker place. If denying proper employment to these valuable data citizens is a shame, then the sinister reality is even more heartbreaking: these NetFlow records are being cruelly disposed of. You heard that right. Every day around the world, literally trillions of NetFlow records innocently debark from the network into NetFlow collectors, little anticipating their own impending doom. In most cases, NetFlow, sFlow, and IPFIX records are FIFO’d without conscience within mere minutes of arrival.</p> <div class="pullquote left">All the data that’s tossed away represents value forever lost to the world’s networks.</div> <p>The global scale of this tragedy is alarming. Consider that a single network device that generates 4000 NetFlow records per second is amassing nearly 350 million such records per day (4,000 records/second × 86,400 seconds/day ≈ 345.6 million records). Trillions is a conservative estimate of the losses across the globe. But beyond the numbing size of the quantities we’re talking about, let’s talk about the value that’s being lost to the world’s networks. All that detail, in even one network, could do a world of good that’s currently just tossed away. How much better application performance, capacity planning, DDoS defense, and peering and transit savings could result? If network engineers weren’t denied the kind of details they really need to do their jobs, how much sweat, tears, and hours of sleep could be saved?</p> <p>The mandarins of legacy network monitoring will claim that there’s no other way. But that’s just not true. The claim that NetFlow collectors “just can’t keep data around for very long” might have been true in 1999, but not now. We have the cloud, we have big data. Every NetFlow record that wants to live and work for the good of networks can and should have the opportunity to do so. We’re not talking about some Pollyanna-ish notion of data immortality. We’re talking about a useful life of service for NetFlow records and their valuable information. For every NetFlow record, there comes a day when it must join its brethren in the great null and leave behind just a trace of its individuality in summary reporting memories. But the current, anachronistic destruction of NetFlow before its time must come to a stop!</p> <h2 id="a-better-way">A better way</h2> <p>Despite this bleak picture, we can report that there is now renewed hope for NetFlow. And the great news is that it’s so easy. Through the miracle of SaaS, it literally takes just fifteen minutes to get big data and cloud-scale capacity and analytical power. Proof is available with our free trial.</p> <div class="pullquote right">It’s time for engineers and operators to join in revolt against data destruction.</div> <p>It’s time for network engineers and operators to reject the data reduction orthodoxy and join the big data cloud revolution.
<em>Save the NetFlow!</em></p> <p>If you want to join the revolution and put NetFlow records back to work in your network, contact us and we’ll give you a demo of how big data at cloud scale can make that a reality in your network management practices.</p> <p>P.S. Spread the word to Save the NetFlow!</p> <p>Twitter: @kentikinc #savethenetflow</p><![CDATA[Making Network Performance Monitoring Relevant for Cloud and Digital Operations]]><![CDATA[In this post, based on a webinar presented by Jim Frey, VP Strategic Alliances at Kentik, we look at how changes in network traffic flows — the shift to the Cloud, distributed applications, and digital business in general — have upended the traditional approach to network performance monitoring, and how NPM is evolving to handle these new realities.]]>https://www.kentik.com/blog/making-npm-relevant-for-cloud-and-digital-operationshttps://www.kentik.com/blog/making-npm-relevant-for-cloud-and-digital-operations<![CDATA[Jim Frey]]>Mon, 10 Oct 2016 13:10:50 GMT<p><em>This post is based on a <a href="https://www.kentik.com/webinars/">webinar</a> presented by Jim Frey, VP Strategic Alliances at Kentik, on Network Performance Management for cloud and digital operations. The webinar looks at how changes in network traffic flows — due to the shift to the Cloud, distributed applications, and digital business in general — have upended the traditional approach to <a href="https://www.kentik.com/kentipedia/network-performance-monitoring/" title="Network Performance Monitoring defined in Kentipedia">network performance monitoring</a>, and how NPM is evolving to handle these new realities.</em></p> <h4 id="defining-network-performance-monitoring">Defining Network Performance Monitoring</h4> <img src="//images.ctfassets.net/6yom6slo28h2/3q2sqfENioaMc4IqsgCKiA/33b1b39a66d07bfcd221a28923b9845e/Flying_tortoise-500w.png" alt="Flying_tortoise-500w.png" class="image right" style="max-width: 300px;" /> <p>It’s important to understand, at a very high level, that with network performance monitoring, we’re talking about going beyond just recognizing whether or not the network is available and functional, to seeing how well it’s doing its job. Is the network doing what is expected for delivering applications and services? That’s slightly different than just understanding “is the network there and available?”</p> <p>Some things that you would see as typical activity measures on the network are not necessarily directly relevant to NPM. So this is not just looking at traffic volumes, or the counters or stats that you can get from interfaces. Or even the kind of data you get in logs and events. Those can be helpful in understanding network activity, and some forms of that data are relevant in simple ways to network performance, but really just on the side of utilization.</p> <p>So NPM is about another set of metrics. Metrics such as round-trip time, out-of-order packets, retransmits, and fragments tell you how well the network is doing its job. And more specifically, they tell you what role the network is playing when it comes to the bigger question, which is the end-user or customer experience.</p> <p>Ideally, when you put a network performance monitoring solution in place, not only can you get these new metrics about what’s going on with the network, but you have a way to tie them back to the applications and services that you’re trying to deliver.
That’s what helps you tie improving and protecting network performance to achieving your ultimate business goals.</p> <h3 id="key-npm-data-sources">Key NPM Data Sources</h3> <p>There are two primary categories of data sources for network performance monitoring. There’s synthetic data, which you can generate by setting up robots that will send examples of traffic through the network, and then see what happens to it, in order to calculate the metrics mentioned above.</p> <p>Then there’s data generated via passive observation of real, actual traffic. That’s done by either looking directly at packets that are going across the network or by looking at flow records generated from looking at those packets. Flow records are simply summaries about the packets that have been going across the network. So ultimately, both of those techniques come back down to understanding what’s happening at the packet level. The difference between them is that the data is transmitted and inspected in slightly different ways.</p> <p>Logs are also a way to get some information about network performance, but you have to be fortunate enough to have log sources that will give you discrete performance metrics. There are some log sources out there that do so; however, they are not common, and it’s not something that most people can make a primary part of their NPM strategy.</p> <h3 id="npm-architectures">NPM Architectures</h3> <p>There are different types of NPM tool architectures. The most common is buying and deploying direct packet inspection appliances, connecting them to the network. This is the way most packet inspection happens today.</p> <div class="pullquote right">SaaS is a way to access technologies that would otherwise be difficult to install and maintain.</div> <p>There are plenty of software-based solutions, too, that you can download and then deploy in whatever server you happen to have handy. That doesn’t tend to work as well when you’re doing packet inspection, because of the need to tune applications and servers to cope with a high volume of packet inspection.</p> <p>The emerging way to do NPM is SaaS. SaaS is becoming an important way to access technologies that would otherwise be difficult to install and maintain. SaaS allows you to get levels of functionality that would otherwise be expensive or difficult to achieve.</p> <h4 id="cloud-and-digital-operations-npm-challenges">Cloud and Digital Operations NPM Challenges</h4> <p>The real purpose of the discussion today is to talk about some of the challenges around network performance management with respect to cloud and digital operations. I want to start with pulling back a little bit and looking at the 10,000-foot view because this is an important context to think about.</p> <p>Remember that the reason why organizations are moving to the cloud and remaking themselves into digital operations is to achieve business goals. In most cases, those goals are better productivity and revenue.</p> <p>However, the operations and engineering guys in the technology team need to figure out how to do this while operating efficiently. They have to look at cost containment, optimizing what they have in place, and doing that quickly.</p> <div class="pullquote left">Cloud and digital operations create gaps that are not well filled by existing tools.</div> <p>What is the challenge? A Gartner analyst named Sanjit Ganguli captured this well in a piece that was put out in May of this year, called “Network Performance Monitoring Tools Leave Gaps in Cloud Monitoring,” and it stated a few of those major issues.
One point he made was that you’re still responsible for performance even if you don’t have control of the infrastructure. That’s certainly the case when you’re dealing with external cloud resources. Another point was that cloud and digital operations are creating a fundamental change in network traffic patterns. And that’s creating some gaps that are not well-filled by the existing tools.</p> <p>Packet analysis, in particular, becomes far more challenging when you have a cloud or hybrid cloud environment because packets travel across networks to which you do not have direct access. So the question of instrumentation becomes, where are you going to get those traditional measurements? If you can’t get the data, how do you answer performance questions?</p> <h3 id="three-npm-challenges">Three NPM challenges</h3> <p><strong><em>Instrumentation Challenges:</em></strong> When I owned all of my infrastructure, my data centers, my own WAN, and my own network, it was pretty straightforward for me to say, OK, I know where I can go and instrument for network performance data. I can deploy some appliance to look at packets and draw some data from it. So it’s pretty easy to get the data that I need to do NPM.</p> <p>But with mixed and hybrid infrastructures, even though you can find instrumentation points in the cloud portions of your infrastructure, you are not going to be able to see the rest of the contextual environment in which those resources are operating. And this makes it hard to know if you have the metrics and measurements that you need.</p> <p><strong><em>The Challenge of Changing Dynamics:</em></strong> The second problem is the dynamics problem. Digital infrastructures, whether internal private cloud, hybrid, or external cloud, are by definition very dynamic. They’ve been designed to be this way to provide agility for the organization, quick time to market, a faster ability for an organization to move and follow opportunity without having to go through long procurement and deployment cycles of traditional infrastructure.</p> <p>Well, this means it’s moving and changing much more quickly. And therefore, not only do you have a problem with finding a point to instrument, but that point, if you do find it, keeps moving. So it’s a constantly changing environment, and it’s difficult to keep up with.</p> <p><strong><em>Scale Challenge:</em></strong> The last problem is scale. How do you keep up with the fact that you can start to create and generate a bunch of new infrastructure very quickly? And then turn it off, by the way. So part of the natural behavior of these digital and Cloud environments is this sort of elasticity. Essentially, the whole point is, how can you handle and maintain the volume of data you need for keeping on top of performance and understanding it?</p> <h4 id="five-keys-to-dealing-with-npm-challenges">Five Keys to Dealing with NPM Challenges</h4> <p><strong><em>Synthetic and Passive Monitoring:</em></strong> I mentioned earlier that there were two categories of data used in NPM: synthetic and passive monitoring of real traffic. Synthetic test agents seem like a sound way to deal with cloud environments, because you can’t get in there necessarily and deploy your traditional hardware-based probes. But at least you can ping them, right? You can check to see if they’re up.
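</p> <p>As a simple illustration (the hostname and port below are placeholders), a basic synthetic check can be as little as timing a TCP connect:</p> <pre><code># Minimal synthetic check: time a TCP handshake to a target service.
import socket
import time

def tcp_connect_ms(host, port, timeout=3.0):
    """Return TCP connect time in milliseconds; raises on failure."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established; we only wanted the handshake timing
    return (time.perf_counter() - start) * 1000.0

if __name__ == "__main__":
    print("connect time: %.1f ms" % tcp_connect_ms("example.com", 443))
</code></pre> <p>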
And that has some value.</p> <div class="pullquote right">Measuring real traffic is essential for understanding actual performance.</div> <p>However, you need to measure real traffic, because it’s essential for understanding actual performance. You need to know more than whether a test message can make it through to an endpoint. You need to know what’s happening to all of the real production traffic that’s moving back and forth to those resources. When you have real traffic data, you can correlate more accurately between performance from the view of the endpoint, and the rest of the context of the traffic in the network and across the Internet.</p> <p>There is still value in synthetic data in making sure resources are responsive and available, even when there’s no real traffic going to them, such as during off hours. Also, if you are fortunate enough to have very simple and repeatable transactions that you’re monitoring, you can use synthetic tests to reproduce those and get a good proxy reading.</p> <p><strong><em>Deployable and Feasible:</em></strong> Not all tools and technologies can be used in the new, hybrid cloud environment. You’re probably going to need a mix of tools. You’re going to have to think about using agents, even though to some people that’s a dirty word, and you’re probably going to have to use some sensors. To other people, that’s a dirty word. Go ahead, get dirty, you’re going to have to use some mix to get your head around the complete environment and all of the potential viewpoints you’d need.</p> <div class="pullquote left">Appliances can’t give you access to packets on cloud-based infrastructure.</div> <p>A key point that I want to make clear is that traditional appliance-based approaches just aren’t going to be enough on their own. Appliances are extremely useful for performing deep packet-based inspection. But you have to get access to the packets. And that’s just not practical when you’re dealing with external cloud-based infrastructure. There are some adaptations of appliance techniques to try to do this. None of them has found great favor. They all have limitations. You may still want to have a traditional appliance back in your private data center, for your local environment. But you’ve got to go beyond just using appliances. That just doesn’t work anymore.</p> <p>Ultimately, you’ve got to be flexible with your approach here. But you’ve got to look at techniques and tools that are both very cost effective, because remember, I mentioned this earlier, the reason for moving to Cloud, and the reason for rebuilding IT or back-end infrastructure in a digital sort of manner, is to save money. You are trying to be cost effective. So you can’t take an approach that’s going to kill you from a cost perspective.</p> <p>You’ve got to find something that’s cost effective. It’s got to be easy to deploy, too. Really, the instrumentation and the methods you use for gathering performance metrics should be just as easy to deploy as the Cloud resources are themselves. You’ve got to find ways to be Cloud-like in the way you adapt, here.</p> <p>And finally, it really needs to move with the workloads. Remember I mentioned earlier that dynamics problem. The things you want to keep an eye on won’t necessarily be in the same place in an hour, or a day from now.
So the instrumentation approach, the strategy you use for gathering these metrics, needs to be able to float and stick with those workloads as they move around, or you’ll lose visibility.</p> <p><strong><em>Internet Path and Geo Awareness:</em></strong> If you’ve got digital operations, you’re a digital business, and your business is based on your employees and customers reaching your infrastructure and applications across the Internet. You have a huge level of dependence on that Internet connectivity being up and reliable and high performing. So you have to start thinking about how you get that view into the Internet path between you and your cloud services, or between you and your customers. This wasn’t part of the picture for traditional enterprise networks. In the past, most people were able to just blissfully ignore the performance impact of the Internet. But that’s no longer valid.</p> <div class="pullquote right">Without visibility into Internet performance you can’t optimize customer experience.</div> <p>Here’s why. The more networks or hops between your users and your resource, whether your data center or in the cloud somewhere, the more risk. The more hops, the more latency. Not all paths across the Internet are equal, and they do change regularly. Sometimes those paths work well for others, but not for you. So this is why, in the age of digital operations, it’s important to understand what’s going on with these Internet paths. Without this visibility, you are going to be at risk of not being able to influence or optimize customer experience.</p> <p><strong><em>Cloud-scale data analysis:</em></strong> NPM frankly is a big data problem in many ways, shapes, and forms. Most tools do not have a big data architecture. Without a big data architecture to keep all of the details, most NPM tools simply summarize and then throw away the details. That means you’re lacking the information you need to get to the bottom of problems.</p> <p>Big data architectures help in dealing with the volume of data, and they make analysis more flexible. A lot of folks have tried to build their own NPM tools using open source and commercial big data platforms, but it’s very expensive. Not necessarily expensive due to the cost of licenses, but due to the time and effort required to set up the big data back-ends, to figure out how to feed them data, and how to build reports and analyses.</p> <p>There are, though, tools coming along (Kentik is one of them) that are commercializing big data architectures. And that’s where the answer is going to be in the long run.</p> <div class="pullquote left">Cloud-scale big data architectures can deal with cloud-sized challenges in the operating environment.</div> <p>The cloud-scale tool concept offers an opportunity to deal with a cloud-sized problem that you have in your operating environment. The cloud approach lets you access resources and solutions that can handle all the data without you having to do all the hard work and having to build a back end by yourself. The cloud also makes it much easier to deploy these solutions. If you can access your NPM as a SaaS, it solves a lot of your total-cost-of-ownership pains.</p> <p><strong><em>API Friendliness:</em></strong> APIs are seeing tremendous adoption and use, and getting increased attention and value around this whole move to the cloud. We’re changing our applications development environment, but we’re now starting to have APIs be present in all aspects of cloud services, and virtualization services.
Whether you’re virtualizing internally, making a private cloud, whether you’re accessing Amazon Web Services, APIs are everywhere now.</p> <p>Unfortunately, NPM data has not been well serviced by existing APIs. That’s too bad because your NPM system needs to support API integration so you can feed that data into your digital operations and business decision-making processes. That performance data contributes to understanding the customer experience, user experience, and total activity. You can use performance data to help you with billing, service activity, and success and product management decisions.</p> <h4 id="learn-more">Learn More</h4> <p>Listen to the rest of the <a href="https://kentik.com/webinars/">webinar</a> to learn about the Kentik NPM solution and how one Ad Tech company utilized Kentik’s nProbe agent and Kentik Detect to monitor and optimize revenue-critical performance issues. If you already know that you want big data network performance monitoring for your digital operations team, start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[Podcast with Kentik CEO Avi Freedman & Jim Metzler]]><![CDATA[Kentik recently launched Kentik NPM, a network performance monitoring solution for today's digital business. Integrated into Kentik Detect, Kentik NPM addresses the visibility gaps left by legacy appliances, using lightweight software agents to gather performance metrics from real traffic. This podcast with Kentik CEO Avi Freedman explores Kentik NPM as a game-changer for performance monitoring.]]>https://www.kentik.com/blog/podcast-avi-freedman-jim-metzlerhttps://www.kentik.com/blog/podcast-avi-freedman-jim-metzler<![CDATA[Avi Freedman]]>Mon, 03 Oct 2016 13:00:55 GMT<h2 id="podcast-with-kentik-ceo-avi-freedman-and-analyst-jim-metzler"><em>Podcast with Kentik CEO Avi Freedman and analyst Jim Metzler</em></h2> <p><em>On September 20, Kentik announced Kentik NPM, the first network performance monitoring solution designed for the speed, scale, and architecture of today’s digital business. Addressing the visibility gaps left by legacy appliances in centralized data centers, Kentik NPM uses lightweight software agents to gather performance metrics from real traffic wherever application servers are distributed across the Internet. Performance data is correlated with NetFlow, BGP, and GeoIP data into a unified time series database in Kentik Detect, where it’s available in real time for monitoring, alerting, and silo-free analytics. In this podcast Kentik CEO Avi Freedman and Jim Metzler, VP at Ashton, Metzler &#x26; Associates, discuss Kentik NPM and how it fits into the overall performance monitoring market. Excerpts from the <a href="/resources/metzler-freedman-interview-on-kentik-network-performance-monitoring">full podcast</a> are presented below.</em></p> <h3 id="what-is-kentik-npm">What is Kentik NPM?</h3> <img src="//images.ctfassets.net/6yom6slo28h2/MqkzjPNxoAaw8uqS0WCge/fabb1f850e172dbd483fe2c8396c9c22/DE-Retransmits_stacked-500w.png" alt="DE-Retransmits_stacked-500w.png" class="image right" style="max-width: 500px;" /> <p>Our Network Performance Management solution, Kentik NPM, sits on top of the Kentik Detect platform. We think that what’s been missing from the network analytics tool suites is an easy way for people to look at their actual traffic, and determine whether applications and users are affected. 
So anyone that runs applications over the Internet that have to do with revenue, or the continuity of the business, we think that those are the folks that should care about Kentik NPM.</p> <h3 id="why-we-need-another-npm-solution-and-why-saas">Why we need another NPM solution, and why SaaS</h3> <p>What we’ve heard from our customers is that there’s no solution that they can consume the way they want. There are appliances, there’s downloadable software, but there’s been no SaaS option that gives them the basic network visibility that they want, with an understanding of their topology and how networks interconnect, but also gives them the context of what the performance of that traffic is. So what people want is SaaS, big-data based, open, and integratable with their tool suites, and they haven’t had options in the market.</p> <p>To have a true network time machine, you need to keep all of your data, which makes it a big-data problem.</p> <p>The main frustration that people had with a lot of the tool suites, especially on the NetFlow analytics side, which is where we started and spent the first couple of years of Kentik focusing on, is that most of the solutions take all the data in, aggregate it, and throw it away. So in order to really have a network time machine, you need to be able to keep all that data. But now you’ve got a really large big-data problem. Most of our users are storing billions, tens of billions, even hundreds of billions of records, and they don’t really want to fire up that much infrastructure and run that on-prem.</p> <p>So SaaS is attractive for people that just want to have the data processed in the cloud and stored in the cloud without having to turn on that infrastructure. And for the people that really do want it to be on-prem, we do on-prem SaaS as well, and in that case it’s really that they don’t want to manage, they don’t want to develop, they don’t want to build their own network-savvy big-data solution. They just want a managed solution.</p> <h3 id="why-kentik-npm-uses-a-host-agent">Why Kentik NPM uses a host agent</h3> <p>The basic component of the NPM solution is the core Kentik Detect offering, which takes everything from sFlow, NetFlow, and IPFIX to what we call “augmented flow data,” which has performance and security information in it, and makes that data available via SaaS portal, alerting system, and APIs. Most of our customers start with that, monitoring and sending us data from their existing network elements.</p> <p>The NPM solution adds to Kentik Detect with data from an nProbe host agent or sensor.</p> <p>The NPM solution adds to that by specifically taking data from a host agent, or a sensor. We’ve decided to use the nProbe technology from a company called ntop. Luca Deri has been running that project. It’s really the granddaddy, in some sense, of a lot of network monitoring. And that agent goes on a server, or on a sensor that can see copies of traffic, and then feeds us that data with not only the flow of the traffic but also the performance. And in some cases, the semantics of what the application is actually doing.</p> <p>So that data goes in, and then the NPM elements of the Kentik Detect platform will let you actually look at the flow of the traffic colored by performance and see if there is a problem, and if so, where it is. Is it the network? Is it the WAN? Is it the Internet? Is it the data center? 
Or where in the application should I look?</p> <p>So it’s Kentik Detect, with augmented flow coming from nProbe as the main sensor, serving up traffic from looking at pcap on servers or taps and spans.</p> <h3 id="the-downside-of-npm-appliances">The downside of NPM appliances</h3> <p>A lot of our customers write their own applications and control their entire infrastructure. And they are of the attitude that the last thing they want is to buy appliances. A lot of them are actually looking at White Box or running White Box, and they don’t even want to buy major brand switches and routers. Much less $300,000 appliances to go into the network.</p> <p>The last thing a lot of our customers want is to buy appliances.</p> <p>So the host agent gives them the flexibility to run on top of and integrated with all of their current application infrastructure, and to send the kind of data that they need to actually make decisions. We need an agent because most routers and switches, and firewalls and load balancers, don’t have the ability to report on the traffic that’s occurring with performance instrumentation. You can do it with a few different vendors, a few of their models, but there’s really no blanket way to do that.</p> <p>The host agent is something that plays well with our big-data architecture, it’s easy to deploy, and doesn’t require any physical mucking around with packet brokers. Now, if they have packet brokers installed, then that same host agent could be run on a $3,000, 40-gig sensor if they want to. But in our world, an appliance is just a commodity, white-box server running the same host agent, and that’s another way to deploy.</p> <h3 id="how-kentik-customers-are-using-kentik-npm">How Kentik customers are using Kentik NPM</h3> <p>A vertical that we’ve had a lot of success in — that we actually developed a lot of this technology with — is Ad Tech. You’ve got a number of companies, all very technically sophisticated, who often actually need to cooperate with their competitors. And they all, as a group, have latency SLAs in the tens of milliseconds. So 50ms to do the complete transaction of putting an ad on a web page. Because that web page may have multiple ads.</p> <p>The question that always comes up if they’re not meeting their SLA is why? Where’s the problem? It gets back to that age-old “is it the network?” question. And the main way that the Ad Tech folks are using Kentik’s NPM is to take a look at the performance and project it on top of the network flow, on top of their topology, and be able to see very clearly where the bottlenecks are, if there are any in the network, and then be able to drive action to do something about it.</p> <p>We have customers that are automatically routing around Internet bottlenecks.</p> <p>A very common thing is when there’s a problem with a network in the middle of the Internet that the traffic goes through to get to one of their trading partners. And then we actually have customers that are automatically influencing routing to route around those kinds of bottlenecks. So it’s detect, decide what to do, react, and sometimes automate it.</p> <p>The other big category is what we call SaaS companies. That’s sort of a subset of the web-company vertical. SaaS companies all believe in customer success and customer experience, and their revenue is almost always going over the Internet. 
So again, they need to be able to quickly determine when there’s a user complaint, or ideally proactively that there’s an issue but it isn’t with the network, or there’s an issue and it’s with this part of the network, so let’s go do something about it.</p> <p>And those are decisions and really observations that they can’t make without performance data, which you can’t really get from the routers and switches that they run in their infrastructure.</p> <h3 id="how-kentik-npm-helps-isps-and-hosting-companies">How Kentik NPM helps ISPs and hosting companies</h3> <p>The hosting companies and ISPs — we call them service providers — are the people that make the Internet, as opposed to the people that use the Internet. Their interest is in knowing about issues before customers are calling their NOC, really understanding what the performance is, and then, again, doing something about it. They’re typically very well-connected, have provisioning systems, and can signal their infrastructure.</p> <p>Companies that don’t have control of the hosts will deploy the agent in a sensor mode.</p> <p>But what’s been missing is actually an understanding of what’s going on. Now, for most of those companies, they don’t have control of the hosts. So they’ll deploy the agent in a sensor mode where it’s on top of a $3,000 machine seeing copies of the traffic. And typically they’ll focus on the outside of the network, and be looking at where they’re having performance problems in their peers and upstream providers.</p> <p>And their goal is not as much to maximize revenue that’s going inside the packets that are their application. Their goal is to increase customer satisfaction, and in some cases just have the data so that when someone reports an issue, they can go and look at it and have an intelligent conversation.</p> <h3 id="how-kentik-npm-applies-to-enterprises">How Kentik NPM applies to enterprises</h3> <p>Kentik’s NPM is used by enterprise very similarly to the way that SaaS companies use us. Which is, the parts of the network, and specifically the traffic, that makes revenue — the marketing web site, the APIs, which allow the business to go, to place orders and to make orders, and is part of their digital supply chain — that’s typically the area of traffic that they start focusing on. To be able to see what to do to keep traffic flowing well. To be able to proactively know where there are problems, tag that back to what application or users are affected, and to be able to deal with it.</p> <p>For enterprise users it’s largely about providing great service to the rest of the company.</p> <p>I’d say secondarily, a lot of enterprise, even the ones using SDWAN, are themselves service providers to their internal customers. So we’re talking more about the Fortune 1,000 in this case, that have global networks. And in that case, it’s less about optimizing revenue like the network-operator folks, and more about just providing great service to the rest of the company. We’re targeting the larger companies that either make revenue over the Internet or that have internal users, typically pretty broadly deployed, and have a service-provider-like outlook towards them.</p> <h3 id="kentik-npm-cost">Kentik NPM cost</h3> <p>We charge as an annual contract with a fee for each router, switch, and host. The host monitoring is typically a lot lower cost per host than any other approach. It depends a little bit on the volume, and how much people want to store. 
But we have customers that are monitoring hundreds of hosts, and tell us that it’s much more economical than either building it themselves, which is what they had been trying to do, or running the big data platforms, or using the appliance and downloadable software solutions we talked about.</p> <h3 id="trying-kentik-npm">Trying Kentik NPM</h3> <p>People can just sign up for a <a href="#signup_dialog">free trial</a> directly on our website, at <a href="https://www.kentik.com">www.kentik.com</a>, and actually get into trial on their own. Just download the agent, install, and start sending flow. If someone does sign up, we’re going to immediately contact and offer our assistance, but a lot of our customers don’t want to be bugged. They just want to go try it. So we’re fine with that. Typically people can be up and running in a matter of minutes with data from the routers and switches, depending on their config management system. It might be 15 minutes to a couple of hours to push something into a docker build, or however they deploy software onto hosts.</p> <div class="pullquote left">People can sign up for a free trial, download the agent, install, and start sending flow.</div> <p>Anyone can sign up for the trial, including our competitors — we see them doing that fairly often. And you can just start adding devices. We have an extensive knowledge base, so many people can do that themselves. That just starts a 30-day trial. Again, people can do that entirely autonomously, or we’re happy to jump in and help try to understand the use cases, especially once data is flowing, to walk through how Kentik Detect and the NPM components can solve the issues that they’re seeing. And especially talk with them about how it integrates with their metrics, databases, and their APM tools. So typically it’s a 30-day free trial. And then we extend it for some larger customers, or if we need to prove an integration, or add a connector for a certain type of device that they have.</p> <hr> <p><em>Want to learn more about the industry’s only purpose-built Big Data SaaS for network traffic analysis? See for yourself what you can do with Kentik by signing up for a</em> <a href="#signup_dialog"><em>free trial</em></a>. And if you’re inspired to get involved, <a href="https://www.kentik.com/careers">we’re hiring</a>!</p><![CDATA[Closing the Network Performance Monitoring Gap and Achieving Full Network Visibility]]><![CDATA[As Gartner Research Director Sanjit Ganguli pointed out last May, network performance monitoring appliances that were designed for the era of WAN-connected data centers leave significant blind spots when monitoring modern applications that are distributed across the cloud. In this post Jim Frey, VP Strategic Alliances, explores Ganguli's analysis and explains how Kentik NPM closes the gaps.]]>https://www.kentik.com/blog/closing-the-network-performance-monitoring-gaphttps://www.kentik.com/blog/closing-the-network-performance-monitoring-gap<![CDATA[Jim Frey]]>Mon, 26 Sep 2016 13:00:59 GMT<h3 id="delivering-npm-for-cloud-and-digital-operations"><em>Delivering NPM for Cloud and Digital Operations</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/4dwzlgyfZCO2c0uoKgiQws/7388e631c882ae962b349451b27689a8/wordpress-media-8914" alt="cleveland_museum-500w.png" class="image right" style="max-width: 300px;" /> <p>On May 27 of this year, Gartner Research Director Sanjit Ganguli released a research note titled “Network Performance Monitoring Tools Leave Gaps in Cloud Monitoring.” It’s a fairly biting critique of the state of affairs in NPM.
And I couldn’t agree more.</p> <p>In the note, Ganguli makes a number of key points in explaining the gap in cloud monitoring. The first is that traditional NPM solutions are of a bygone era. To quote:</p> <p><em>“Today’s typical NPMD vendors have their solutions geared toward traditional data center and branch office architecture, with the centralized hosting of applications.”</em></p> <p>Until just a few years ago, enterprise networks were predominantly composed of one or more private data centers connected to a series of campuses and branch offices by a private WAN that was based on MPLS VPN technology from a major telecom carrier. In those settings, you might drop NPM appliances into major data centers, perhaps directly connected to router or switch span ports, or via a tap or packet broker from the likes of a Gigamon or Ixia. You’d put a few others — I emphasize “few” because these appliances were and are not cheap — at other major choke points in and out of the network. Voila, you’ve covered (most of) your network. The devices would do packet capture, derive and crank out performance alerts and summary reports, and retain raw packets for a small window (usually just minutes), enabling closer examination if you could get back to them in time. Or you could spend the big $$ and get the stream-to-disk NPM appliances that might store packets a little longer (think hours instead of minutes).</p> <h4 id="the-pervasive-cloud">The pervasive cloud</h4> <p>Today, cloud is a pervasive reality. Just to be clear, what do I mean by cloud? For one, cloud refers to the move to distributed application architectures, where components are no longer all resident on the same server or data center, but instead are spread across networks, commonly including the Internet, and are accessed via API calls. Monolithic applications running in an environment like the one described earlier are not quite gone yet, but they represent the architecture of the past.</p> <p>Cloud also means that users aren’t just internal users. Digital business initiatives represent more and more of the revenues and profits for most commercial organizations, so users are now end-customers, audiences, clients, partners, and consumers, spread out across the Internet, and perhaps across the globe. The effect on network traffic is profound. As Ganguli says in his research note:</p> <p><em>“Migration to the cloud, in its various forms, creates a fundamental shift in network traffic that traditional network performance monitoring tools fail to cover.”</em></p> <p>Why can’t traditional NPM tools cover the new reality? For one, if you’re using an appliance, even a virtual one, you’re assuming that you can connect to and record packets from a network interface of some sort. That assumption worked in a traditional scenario when monolithic applications meant that there were relatively few points in the network where the LAN met the WAN and important communication was happening. But cloud realities break that assumption. There are way more connectivity points that matter. API calls may be happening between all sorts of components within a cloud that don’t have a distinct “network interface” that an appliance can attach to.
Ganguli observes that:</p> <p><em>“Packet analysis through physical or virtual appliances do not have a place to instrument in many public cloud environments.”</em></p> <p>The fact is that physical appliances are downright irrelevant in the modern environments of web enterprises, SaaS companies, cloud service providers, and the enterprise application development teams that act like them. And even virtual appliances become impractical to employ effectively. The sheer number of connectivity points that matter means that you’d need to distribute physical and virtual appliances far more broadly than ever before. That sounds like a lip-smacking feast for NPM vendors. But because the cost is completely prohibitive, it has never been economically feasible for network managers to deploy very many NPM appliances.</p> <p>The result is functional blindness, or to put it more mildly — it’s <strong>really</strong> hard to figure out what’s happening. In Ganguli’s words:</p> <p><em>“Compute latency and communications latency become much harder to distinguish, forcing network teams to spend more time isolating issues between the cloud and network infrastructure.”</em></p> <p>In the new cloud reality, when you have an application performance problem that is impacting user experience you can’t easily tell if it’s the network or not. Network teams can no longer run around with manual tools like TCPDump and Wireshark, trying to figure things out. Further, this creates what is actually an existential problem, because digital business, revenues, profits, audiences, brand, and competitiveness all go down the tubes when user experience goes into the can. And if you’re a network manager, you’re not going to get the dreaded “3 AM call” from just the CIO, you’re also going to be hammered by the CMO, CRO, CFO, and CEO.</p> <h4 id="npm-for-a-new-era">NPM for a new era</h4> <p>It’s high time for <a href="https://www.kentik.com/resources/gartner-market-guide-npmd-2020/" title="Read Gartner&#x27;s 2020 Market Guide for Network Performance Monitoring and Diagnostics">network performance monitoring</a> to come into the cloud era. It’s now possible to get rich performance metrics from your key application and infrastructure servers, even components like HAProxy and NGINX load balancers. You can map those metrics against Internet routes (BGP) and location (GeoIP), and correlate them with billions of volumetric traffic flow details (NetFlow, sFlow, IPFIX) from network infrastructure (e.g. routers and switches). It’s even possible to store those details unsummarized for months and to get answers in seconds on sophisticated queries across multi-billion row datasets. All this enables you to recognize issues immediately, rapidly figure out if the problem is with the network, and determine what to do about it. But you can’t do any of it without the instrumentation of cloud-friendly monitoring and the scalability of big data.</p> <div class="pullquote right">Kentik’s server-side NPM instrumentation goes wherever the servers go, in the data center or in the cloud.</div> <p>At Kentik, we’ve added server-side NPM instrumentation that can go where the servers go, whether they are in the data center or in the cloud — any cloud, anywhere. The Kentik nProbe Host Agent examines packets as they pass through the NIC (or vNIC) and generates traffic activity and performance metrics that are forwarded to Kentik Detect for correlation and analytics.
This puts end-point NPM data into the hands of the network pros, and instantly reveals what the actual network experience is from the viewpoint of the connected device. No more guessing or approximations from a bump somewhere on a wire that may or may not be anywhere near the actual server!</p> <p>The agent we are using is worth a few more words. It’s developed and supported by ntop, and is the same technology that Boundary Networks used as a basis for its visionary offerings more than five years ago. Boundary was ahead of its time when it tried, unsuccessfully, to sell its solution to the nascent (and, from a networking perspective, unsophisticated) DevOps sector. But along the way, hundreds of these agents were deployed and delivered solid value to many, many organizations. We know this because several of those organizations have come to Kentik seeking a restoration of the visibility they lost when Boundary later pivoted in another direction and eventually wound down its operations.</p> <p>With the advent of Kentik’s big data SaaS for network traffic analytics, monitoring, alerting, and defense, the data generated by nProbe becomes part of a unified overall solution that network engineers can use to ensure network performance.</p> <p>Want to know more about the industry’s only Big Data-based SaaS NPM solution? <a href="#demo_dialog">Request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p> <p>For the latest on Gartner’s take on network performance monitoring, read the <a href="https://www.kentik.com/resources/gartner-market-guide-npmd-2020/" title="Read Gartner&#x27;s 2020 Market Guide for Network Performance Monitoring and Diagnostics">Gartner Market Guide for Network Performance Monitoring and Diagnostics, 2020,</a> compliments of Kentik.</p><![CDATA[How Kentik Detect & nProbe Monitor Your Network Performance]]><![CDATA[Network performance is mission-critical for digital business, but traditional NPM tools provide only a limited, siloed view of how performance impacts application quality and user experience. Solutions Engineer Eric Graham explains how Kentik NPM uses lightweight distributed host agents to integrate performance metrics into Kentik Detect, enabling real-time performance monitoring and response without expensive centralized appliances.]]>https://www.kentik.com/blog/how-detect-nprobe-monitor-network-performancehttps://www.kentik.com/blog/how-detect-nprobe-monitor-network-performance<![CDATA[Eric Graham]]>Tue, 20 Sep 2016 10:00:07 GMT<h3 id="network-performance-monitoring-with-kentik-detect-and-nprobe"><em>Network Performance Monitoring with Kentik Detect and nProbe</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/10i1Hq0ZZksMOKAuwsoAym/09d1a7f62c1a604a40bdf1afe965fd9a/NPM_metrics-500w.png" alt="NPM_metrics-500w.png " class="image right" style="max-width: 500px;" /> <p>In the era of digital business, network performance is a critical aspect of keeping customers, subscribers, partners, and users happy. Can your customers complete transactions readily, or are you unintentionally pushing them toward your competitors? Are your employees able to complete their tasks without delay, or left idle waiting for “the system” to catch up? It’s up to IT and network engineering groups to address performance, but to do so they need effective ways to measure and monitor. 
They need to be able to see when performance is impacting user experience, service, or application quality so that they can respond with proactive or preventative steps.</p> <p>At a high level, some network performance monitoring functions are already available in many of the tools that are commonly embedded within the network infrastructure. For example, using SNMP or flow data to track traffic volume on interfaces, servers, and devices can provide necessary visibility into possible congestion points. Most network teams have some type of tool that can collect and display this information.</p> <div class="pullquote left">What's been missing is the ability to look at actual packets to track host-level performance.</div> <p>Nevertheless, the networking world is behind in understanding actual user performance because network devices, for the most part, only report traffic volumes. What’s been missing, until now, is the ability to look at actual packets to assess network health and track performance at the host level — without collecting and sifting through cumbersome and expensive full-packet captures. In this post we’ll talk about using Kentik’s NPM solution to watch the TCP performance of traffic as reported to Kentik Detect from a Kentik nProbe host agent. As an example, we’ll use traffic from Kentik’s own servers.</p> <h4 id="tcp-is-the-key">TCP is the key</h4> <p>To gather the metrics required to truly understand performance, a monitoring tool must look at one of the most important and widely used layers in the OSI model: transport. Specifically, TCP, which is used in over 90% of network transactions. The Transport layer is important because TCP visibility can give you reliable, detailed insight into network performance. TCP was designed as a stateful protocol to provide reliability across network infrastructure. This is accomplished by using acknowledgements, sequence numbers, and the ability to retransmit data to ensure packets reach their destination. If the sending side does not receive an acknowledgment, the packet can be retransmitted or the TCP/IP connection closed. One key measure of performance — retransmit rate — starts to increase when packets are delayed or dropped due to network congestion or other network problems.</p> <div class="pullquote right">To truly understand what's going on you need to be able to look at real traffic.</div> <p>In the past, network engineers would need to use Wireshark to find TCP-related information within packet-capture files, which is a lengthy, time-consuming process. Or they would use expensive packet streaming/inspection appliances, which are difficult to deploy in any quantity without significant capital resources. The industry has also played with synthetic transaction alternatives, but those never really took off because to truly understand what’s going on you need to be able to look at real traffic.</p> <p>While traditional NPM tools remain stuck in the era of centralized applications that can be monitored with monolithic appliances, IT as a whole has largely shifted to new approaches that distribute applications across the cloud, leaving gaps in what old-school NPM tools can see. As network traffic continues to grow, network engineers have struggled to maintain effective performance monitoring without affordable, comprehensive solutions.</p> <h4 id="modern-network-performance-monitoring">Modern network performance monitoring</h4> <p>At Kentik, we see the trend toward distributed applications as both an opportunity and a model. 
We start with a cloud-scale analytics engine — Kentik Detect — that is powered by a distributed post-Hadoop Big Data backend and runs a time-series database that is optimized for network data such as NetFlow and BGP. That enables us to ingest traffic data at massive scale, retain unsummarized details for months, and provide real-time answers to ad-hoc queries across billions of records.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6nFtks7IpqaoAkekuwQyMo/e4dd0a4fa97fb2185ce695081afc0083/metrics_menu-300w.png" alt="metrics_menu-300w.png " class="image left" style="max-width: 300px;" /> <p>For the NPM use case, we augment Kentik Detect with Kentik nProbe agent software that we’ve jointly developed with ntop, a leading provider of network visibility tools and technology, and that Kentik NPM customers can easily deploy on their hosts. Kentik nProbe inspects packets, creates augmented flow records, and sends them to Kentik Detect for continuous analytics processing in parallel with other flow data sources in the environment.</p> <p>The Kentik NPM combination enables network engineers, operators, and managers to capture, visualize, and analyze performance metrics (listed in the metrics menu, at left, from the Kentik Detect portal) in real time and within the context of the entire network infrastructure. And it enables them to easily pivot between traffic volumetrics and performance analysis to better diagnose issues and fix the root causes. With Kentik NPM, network performance monitoring moves into the modern age.</p> <h4 id="seeing-performance-in-context">Seeing performance in context</h4> <p>The TCP-based metrics that the nProbe host agent adds to Kentik Detect give operators the visibility to proactively recognize and troubleshoot network performance problems. These metrics include retransmits, out-of-order packets, fragments, and even server/client/application latency statistics. Kentik Detect unifies this data — along with flow data from non-host devices such as routers and switches — into a time-series database that correlates flows with geolocation (Country, Region, City), SNMP (interface level descriptions), and BGP routing information.</p> <p>Using Kentik Detect for NPM, traffic statistics can be correlated using basic 5-tuple TCP/IP information and visualizations can show the performance for grouped objects as well as where traffic is coming from and going to. Operators can quickly visualize flow data to determine which protocols and ports, IP-to-IP conversations, devices, device interfaces, BGP routes, and Autonomous System Numbers are active, combining up to eight dimensions at a time to understand where performance is a problem. The following graph, for example, is a multi-dimensional traffic flow diagram using nProbe host agent data.</p> <img src="//images.ctfassets.net/6yom6slo28h2/lah79D7Hdm4iOEg0iC2gk/0fbe91fe6fa8a377d12756d641e62274/Retransmits_line-840w.png" alt="Retransmits_line-840w.png " class="image center" style="max-width: 840px;" /> <h4 id="kentik-npm-use-case">Kentik NPM use case</h4> <p>We use Kentik Detect’s NPM capabilities internally at Kentik to troubleshoot network performance. If we didn’t have nProbe to provide host data, we would instead have to use packet captures and detailed pcap analysis. 
The following looks at one example of how we have used Kentik NPM.</p> <p><strong><em>Situation:</em></strong> Measuring kernel retransmits on some of our flow ingest servers, we saw that several of the servers indicated a high number of retransmits, but we didn’t have any TCP/IP detail to understand or correct the problem. We were also observing slow response times from applications running on the server. Our Operations team initially assumed that the problem was caused by external users and Internet-sourced traffic, but with no detail we couldn’t be certain.</p> <p><strong><em>Resolution:</em></strong> We installed the nProbe host agent on the server and added it to our collection of devices exporting flow to Kentik Detect. On the Data Explorer page of the Kentik Detect portal, we were now able to look at the nProbe-provided metrics, and we quickly found some pretty serious issues on a group of internal destination hosts: retransmits of 4% or more on hosts doing greater than 100 pps, as well as long latency. While these retransmits were minimal compared to overall traffic, they were significant to a critical microservice.</p> <p>Drilling into the performance statistics, we saw that the issues were all on hosts sitting behind switch interfaces that were converting 1G to 10G. We were then able to identify the root cause, which was a switch with shallow (or broken) buffers. We were already planning to upgrade the physical interfaces on these servers to 10G, and were able to accelerate the 10G upgrade to correct the problem.</p> <p>The following graph plots retransmits for all /32 destination hosts on the day of the 10G upgrade, showing a pretty dramatic improvement from beginning to end.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2rgL7EIKc02Oy4aegKgyw2/b163ab36dafb45e4ecb222bb26b2de14/Retransmits_stacked-840w.png" alt="Retransmits_stacked-840w.png " class="image center" style="max-width: 840px;" /> <p>While the use case discussed here may seem simple, when you’re managing hundreds of servers, it makes a huge difference to have the ability to look at performance and quickly drill into the problem without deploying hardware probes and analyzing pcap data.</p> <h4 id="what-next">What next?</h4> <p>The Kentik nProbe host agent can be deployed today by downloading the nProbe software and installing it on servers (see <a href="https://kb.kentik.com/?Bd03.htm">Host Configuration</a> in the Kentik Knowledge Base). No nProbe license is necessary if you use the startup flags that send resulting flow records directly to Kentik’s back end. You will need a device license in Kentik Detect, but you can get started and try it out for 30 days at no charge.</p> <img src="//images.ctfassets.net/6yom6slo28h2/7JzUey26fmUsKwA6AeyOEu/db3153b99e3bfa3b97ed3d0ebe4b65c7/nProbe_diagram-840w.png" alt="nProbe_diagram-840w.png " class="image center" style="max-width: 840px;" /> <p>The interesting thing about the nProbe technology is that it can operate in several modes, and can serve a number of different functions. For example, you could install nProbe on a dedicated server and send that server a copy of network packet streams, via tap or SPAN. One of our current Kentik Detect users is already trying this out. We call this “sensor mode” and it is, essentially, a much more cost-effective alternative to the packet inspection instrumentation appliances that are available in the market today. 
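</p> <p>To make the deployment model a bit more concrete, here is a hedged sketch of what an nProbe invocation can look like. The interface names and the collector address below are placeholders rather than Kentik-specific values; the Host Configuration article linked above documents the actual startup flags for exporting to Kentik Detect.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Illustrative only: replace the interface and collector address
# with values from your own environment.
# Host mode: watch the server's own NIC.
nprobe -i eth0 -n flow-collector.example.com:2055
# Sensor mode: point -i at the interface receiving the tap/SPAN copy.
nprobe -i eth1 -n flow-collector.example.com:2055</code></pre></div>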
<p>More on that soon - stay tuned!</p> <p>In the meantime, you can sign up for Kentik Detect in 15 minutes and begin to experience the power of Kentik’s Big Data NetFlow analysis, network performance monitoring, and DDoS detection. Start your <a href="#signup_dialog">free trial</a> today and let us know what you think at <a href="https://twitter.com/kentikinc">@kentikinc</a> or <a href="mailto:[email protected]">[email protected]</a>.</p><![CDATA[Avi Freedman Talks Attacks & Solutions: Cisco Live 2016]]><![CDATA[In our second post related to BrightTalk videos recorded with Kentik at Cisco Live 2016, Kentik CEO Avi Freedman talks about the increasing threats that digital businesses face from DDoS and other forms of attacks and service interruptions. Avi also discusses the attributes that are required or desirable in a network visibility solution in order to effectively protect a network.]]>https://www.kentik.com/blog/avi-freedman-talks-attacks-solutions-cisco-live-2016https://www.kentik.com/blog/avi-freedman-talks-attacks-solutions-cisco-live-2016<![CDATA[Avi Freedman]]>Mon, 12 Sep 2016 13:00:17 GMT<p><strong>Avi Freedman Talks Attacks and Solutions in Cisco Live 2016 Interview</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/7rrFPcJry0QCAKsKoas8YA/6e33d88ac3b3aa2ac8ae7a465dd9d17a/Avi_CiscoLive-500w.png" alt="Avi_CiscoLive-500w.png" class="image right" style="max-width: 300px;" /> <p><em>This is the second in a series of posts related to discussions that Kentik video-recorded with BrightTalk at Cisco Live 2016. In this post, Kentik CEO Avi Freedman talks about the attributes that are required or desirable in a network visibility solution in order to effectively protect one’s digital business from DDoS and other forms of attacks and service interruptions. Excerpts of what Avi had to say are posted below.</em></p> <h3 id="ddos-pain-points">DDoS pain points</h3> <p>DDoS attacks have become an increasing problem, especially visible up to the C-suite and even the board level. Criminals have had a much easier time getting access to the resources needed to be able to attack companies sufficiently to actually stop their business, especially if they’re digitally transforming or most of their business and their revenue flow already depends on the network infrastructure.</p> <div class="pullquote right">Criminals have had a much easier time attacking companies and stopping their business.</div> <p>So the main pain points that people have when dealing with DDoS attacks are, number one, what’s going on? Is it a misconfiguration, is it a denial of service attack, or is it something against our web site to try to get information? So first, what’s going on?</p> <p>Second, a lot of companies don’t have enough infrastructure to actually protect internally. So either they need to engage a cloud service for analytics and/or detection, or they need to do something that’s more like the hybrid approach that people take in cloud computing. Most of the existing solutions are 15-year-old technologies that run on appliances that you put in one place on your network. And they’ve got a limited amount of information that they can process and store for a limited amount of time.</p> <h3 id="why-big-data">Why Big Data?</h3> <div class="pullquote left">The big challenge is that you don’t always know in advance what you want to ask.</div> <p>The big challenge with DDoS attacks, with network planning, with most of the network analytics cases is you don’t always know in advance what you want to ask the system. 
So you really need a Big Data approach that can take all the data from the infrastructure, store it, give you insights, but also let you explore the data ad hoc in ways that you didn’t expect.</p> <p>Kentik’s solution is Big Data based. Most of the tools out there let you see your network in one specific way. But the way that practitioners actually want to interact with the visibility plane is like you would use Google Maps. It’s, you look at something and you say, “Oh, that’s interesting. Let me scroll it over here.” Well, if you don’t have the data anymore, you can’t answer those questions.</p> <p>So it’s really an iterative process that people want to get to. Once they have it, they never want to go back.</p> <h3 id="time-to-value">Time to value</h3> <p>With Kentik, the time to value is five to 15 minutes. That’s the time it takes to log in, set up an account, and start exporting data from infrastructure that you already have. And then within a couple of seconds of our receipt of the data it’s available for you to get alerts on, and to dig into at any level of detail that you want.</p> <p>That’s very different from current solutions, which typically require a proof of concept, a sales process, equipment to be dropped in, change controls to be applied, and then even once it’s running, the data is typically hours out of date and not available at the granularity that you need.</p> <h3 id="kentik-value-propositions">Kentik value propositions</h3> <p>Most of our customers are digital enterprises, SaaS companies, web companies, or service providers. They know that their packets are their revenue. So when we talk to executives, we talk about customer experience, and customer success, which is what the fastest-growing enterprise companies in the world use. They relentlessly focus on the performance and the ability of their customers to use their services. So that’s the first message that we talk about.</p> <div class="pullquote right">If your infrastructure isn’t working, you lose traffic, you lose revenue, and your brand suffers.</div> <p>The second is just availability, and absolute magnitude of revenue. If your infrastructure isn’t working, you lose the traffic right then, you lose that revenue, and also your brand suffers.</p> <p>And the third is security. It’s really an existential threat. Do you have the tools to be able to understand whether your business can continue into tomorrow?</p> <h3 id="platform-vs-silo">Platform vs. silo</h3> <p>When you’re looking at network visibility solutions, if you want to have a future-proof system the key is to look for something that can really be a platform. That means that it isn’t a silo in-and-of-itself, and that it takes data, provides certain functions, and integrates with the other tools that you need. Or even with your own software that you need to write.</p> <p>Make sure that it can actually keep the data that you need. Especially if you don’t know in advance what you’re going to need to use the data for. And make sure that it’s consumable in the ways that you want. Ideally, it’s something you either can run on premises, if that’s your security requirement, or can outsource to a cloud or SaaS vendor to run for you, if you want to be able to just focus on your business mission.</p> <hr> <p><em>Want to learn more about the industry’s only purpose-built Big Data SaaS for network traffic analysis? See for yourself what you can do with Kentik by signing up for a <a href="#signup_dialog">free trial</a>. 
And if you’re inspired to get involved, <a href="https://www.kentik.com/careers/">we’re hiring</a>!</em></p><![CDATA[Big Data for NetFlow Analysis]]><![CDATA[Cisco Live 2016 gave us a chance to meet with BrightTalk for some video-recorded discussions on hot topics in network operations. This post focuses on the first of those videos, in which Kentik's Jim Frey, VP Strategic Alliances, talks about the complexity of today's networks and how Big Data NetFlow analysis helps operators achieve timely insight into their traffic.]]>https://www.kentik.com/blog/big-data-for-netflow-analysishttps://www.kentik.com/blog/big-data-for-netflow-analysis<![CDATA[Jim Frey]]>Tue, 06 Sep 2016 13:00:54 GMT<p><strong>Cisco Live 2016 Interview Covers Why, How, and What’s Next</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/2Yzji3CSRaGkKO6qO4kEoA/7f82806babb784e07eb558ad5cd2e596/Jim_Frey.png" alt="Jim_Frey.png" class="image right" style="max-width: 300px;" /> <p><em>Cisco Live 2016 gave us a chance to connect with scores of visitors to our booth — both old friends and new — as well as the opportunity to meet with BrightTalk for some video-recorded discussions on hot topics in network operations. This post focuses on the first of those videos, in which Kentik’s Jim Frey, VP Strategic Alliances, talks about the complexity of today’s networks and how Big Data NetFlow analysis helps operators achieve timely insight into their traffic. You can read an overview below of what Jim had to say.</em></p> <h3 id="why-big-data-netflow-analysis">Why Big Data NetFlow Analysis?</h3> <p>The founders of Kentik have been watching for many years now to see the effect that Big Data is having on network management and analytics. It’s really fascinating, I think, that <a href="https://www.kentik.com/blog/the-new-normals-of-network-operations-in-2020/">network management</a> is one of the last areas within digital and IT operations that has taken advantage of Big Data technologies.</p> <h3 id="network-management-is-one-of-the-last-areas-of-it-operations-to-take-advantage-of-big-data">Network management is one of the last areas of IT operations to take advantage of Big Data.</h3> <p>One of the things that’s really different about Kentik is that we’ve decided to build a Big Data architecture at the core of a network monitoring solution. That brings some real advantages, because Big Data is great not only at handling large volumes of data, but also at letting you navigate through and explore that data very quickly.</p> <p>You have to build the right architecture to do that. You have to package it in a way that makes it effective and efficient. But once you’ve figured that out — and that’s one of the things we’ve done at Kentik — it becomes a very powerful solution for understanding the state of your network and then being able to drill in, drill down, drill left, drill right, and pivot your analysis. However you want to, you can change your view to get to the bottom of any sort of interesting or worrisome situation that you see.</p> <p>Our Big Data implementation also gives us a basis for doing automated evaluation of the metrics, so we can start doing things like automated alerting. DDoS detection is a big use case for us, and Big Data enables us to be both definitive and clear when you’re trying to answer a question about whether your network is performing the way you expect it to perform.</p> <p>Big Data analytics really helps with a couple of key things in particular. 
One is clear visibility into current activity on the network, so you can see exactly who’s talking to whom and driving what kind of activity, traffic, and volume. It can help you see trends in activity. You can use it to drill down on those trends and to help you recognize what’s normal and what’s not. When you see something that you’re not sure about, you want to be able to get down into it and understand what’s underneath the surface as quickly as possible.</p> <h3 id="big-data-architectural-considerations">Big Data Architectural Considerations</h3> <p>You can try to adapt other Big Data tools for network use cases, but it’s a fair bit of work.</p> <p>Organizations that are interested in Big Data as an architecture for network management have to think about a few things. Number one, not all Big Data architectures are specifically designed for the network monitoring use case. You can take other Big Data tools and try to adapt them, but it’s a fair bit of work to get to the level of functionality that most folks expect out of the non-Big Data tools that are out there. You don’t want to give up the goodness of those tools just to have Big Data.</p> <p>So what you really need to look for is solutions that take advantage of all the great things that Big Data can do for you but that have been adapted and optimized specifically for your <a href="https://www.kentik.com/blog/kentik-bridges-the-intelligence-gap-for-hybrid-cloud-networks/">network management</a> and security management use cases. You need a solution that does the heavy lifting for you, and gives you all of the benefits without having to build it all yourself.</p> <p>A lot of Big Data systems do a really good job of harvesting all this data, and storing it, and letting you get to it. But then when you want to run a report, it can take tens of minutes, dozens of minutes, even in worst cases hours, to get the data back out that you want. It’s hard to change and shift and ask new questions, because it means starting from scratch each time. So one of the big challenges with solutions that are built around Big Data architectures is to provide that ease of access, that quickness of access, with unrestricted ability to reach to all of the data that’s there available for you.</p> <p>That’s one of the things we’ve really focused on at Kentik. We handle very large volumes of data in the backend, with billions of records coming into our SaaS each day. But the other thing is being able to get data back out as quickly as possible. The vast majority — 95% of the queries run against our backend by customers today — return results in less than two seconds. So that means that pretty much right away you’re going to know what’s going on, or you can find out.</p> <h3 id="other-interesting-trends">Other Interesting Trends</h3> <p>Some of the other trends that I think are really interesting in this space, and that Kentik is watching, are things like SDN. SDN changes the way that networks behave. So we’re going to be monitoring what happens, and we can help tell you, for instance, how the behavior of the network changes after a change has been made through some sort of software-defined policy enforcement. So we’re keeping a close eye on that whole SDN and automated configuration management space.</p> <p>Another thing that we’re watching is the internet of things, or the “internet of everything,” as Cisco likes to call it. So we watch all the trends, and we try to figure out how we are going to be positioned for what’s coming next. 
And there’s always something new, right?</p> <h3 id="our-big-data-architecture-is-designed-to-be-very-flexible-and-have-a-very-long-life">Our Big Data architecture is designed to be very flexible and have a very long life.</h3> <p>So the Big Data architecture that we built is designed to be very flexible, and to have a long, very long life. Because we actually use open APIs to allow you to connect it to your other systems. It can connect very simply and easily with whatever sorts of other data sources you have coming in or outputs that you want to put in place. And that really adds to its longevity.</p> <p>We also have plans to continue to expand this platform, and grow it, and provide the SaaS service from multiple geographies. So that’s going to help us keep up with the geographic growth that’s a natural part of all sorts of network systems and businesses.</p> <h3 id="whats-next-for-big-data-netflow-visibility">What’s Next for Big Data NetFlow Visibility</h3> <p>We also intend to continue adding more functionality. Some of the things that we’re looking at now are to add deeper and richer and more-specific security-type investigations, alerts, and <a href="https://www.kentik.com/blog/how-race-communications-used-kentik-to-stop-mirai-botnet-infection-and-abuse/">anomaly detection</a>. We’re in the process of enhancing our ability to automatically recognize departures from normal, so we can understand the anomalies when they come up.</p> <p>We’re also enhancing our ability to directly integrate with external systems to trigger responses to problems. You can automatically hand off data to another system for action, or even set up an automated action if you think it’s warranted. We have a number of customers who are already doing that, using our system to monitor and recognize problems, and then automatically taking corrective action with other related systems.</p> <p>The absolute explosion in the number of connected devices means that there’s going to be a huge increase in the total traffic that’s using the Internet, the sources that need to be tracked, and the behavior that will need to be understood and characterized as normal versus abnormal. There are going to be some really interesting challenges coming along with that. And we’re going to be in a really good position to help, because of the scalability that we already have within our architecture.</p> <hr> <p><em>Want to learn more about the industry’s only purpose-built Big Data SaaS for network traffic analysis? See for yourself what you can do with Kentik by signing up for a <a href="#signup_dialog">free trial</a>. Need a more basic overview of NetFlow analysis?  Check out <a href="https://www.kentik.com/kentipedia/netflow-analysis/">NetFlow analysis</a> in the Kentipedia.</em></p><![CDATA[Kentik Shows NetFlow Analytics at NFD12]]><![CDATA[It was a blast taking part in our first ever Networking Field Day (NFD12), presenting our advanced and powerful network traffic analysis solution. Being at NFD12 gave us the opportunity to get valuable response and feedback from a set of knowledgeable network nerd and blogger delegates. 
See what they had to say about Kentik Detect...]]>https://www.kentik.com/blog/kentik-shows-netflow-analytics-at-ndf12https://www.kentik.com/blog/kentik-shows-netflow-analytics-at-ndf12<![CDATA[Alex Henthorn-Iwane]]>Mon, 29 Aug 2016 13:00:06 GMT<p>Warm Reception Makes for Fun Networking Field Day</p> <img src="//images.ctfassets.net/6yom6slo28h2/1wfwqxLFQM4q8y4IyuswwU/c35c0f0b99fd2d959544f3dd2db7d10f/NFD_Logo-300w.png" alt="NFD_Logo-300w.png" class="image right" style="max-width: 300px;" /> <p>It was a blast taking part in our first ever Networking Field Day (NFD12). We’re already confident that we are offering the most advanced and powerful NetFlow analysis solution available. But presenting at NFD12 gave us the opportunity to get valuable response and feedback from a set of knowledgeable network nerd and blogger delegates, as well as from the NFD12 streaming audience.</p> <p>Kentik CEO Avi Freedman gave an overview of our company and of our post-Hadoop Big Data engine, which ingests billions of NetFlow, sFlow, IPFIX, BGP, and SNMP data records, offers ad-hoc analyses, alerting, dashboarding, and provides open API integration. Principal engineer and co-founder Ian Pye presented the architecture of the data engine. Then Avi and Jim Meehan, Director of Solutions Engineering, provided demonstrations of how Kentik’s visibility is used for troubleshooting, DDoS detection, peering analytics, network performance, and network security management.</p> <p>If you haven’t had a chance to see the videos, you can watch them all at our NFD12 hub, where you can also sign up for a free trial (and get our cool NetFlow t-shirt).</p> <img src="//images.ctfassets.net/6yom6slo28h2/65XBLJgCyIyeAEgS6GeCeW/830701c6384d30db1f8e99a33723f625/NFD_screenshot-500w.png" alt="NFD_screenshot-500w.png" class="image right" style="max-width: 300px;" /> <p>It was gratifying to hear positive feedback from the NFD delegates, which mirrors the kinds of things we hear from our customers on a regular basis. Pasted below are some comments that the delegates tweeted during our presentations.</p> <p>On Kentik’s focus:</p> <p><em>“Loving the focus from kentikinc and @avifreedman - very clear on what they are and are not.”</em> - Carl Niger (@carl_niger)</p> <p><em>“Couldn’t agree more. ‘As a small startup we cannot do everything’ --- That’s called focus. Bravo!”</em> - Justin Cohen (@cantechit)</p> <p>On the need for full NetFlow details:</p> <p><em>“Not all customers are going to put expensive network taps in everywhere.”</em> - Brandon Mangold quoted Avi Freedman, adding #HeGetsIt</p> <p>On Kentik’s performance:</p> <p><em>“@kentikinc is impressively fast-moving around looking at data live. Looks highly optimized.”</em> - Brandon Mangold @SDNgeek</p> <p><em>“For queries against datasets in the many billions of records, @kentikinc claims worst-case latency in the 2 to 10 second range.”</em> - Ethan Banks (@ecbanks).</p> <p>And in response: “<em>Most current tools would take several minutes to get the same result. Very impressive.”</em> - Brandon Mangold (@SDNgeek)</p> <p>On Kentik’s power and usefulness:</p> <p><em>“You think you know your traffic, so you build your policies. 
@kentikinc actually knows your traffic, and you improve your policies.”</em> - Ethan Banks (@ecbanks)</p> <p><em>“Cool max-bits view for flows in @kentikinc - Very easy to see from distance when traffic is changing.”</em> - Justin Cohen (@cantechit)</p> <p><em>“@kentikinc provides an interface that lets you ask multi-layered questions of your network data, and then visualize the answer.”</em> - Ethan Banks (@ecbanks)</p> <p><em>“I can imagine getting lost for days just browsing through data in @kentikinc’s visualization platform, idly looking for issues :-)”</em> - John Bigboote (@mrtugs)</p> <p>A nice summary of the impression we left:</p> <p><em>“OH: ‘Badass netflow tool’ :-)”</em> - John Bigboote</p> <p>We’d like to give a great big thank you to the Tech Field Day crew, the awesome delegates, and everyone who tuned in. It was a great event for us, and we look forward to being part of Networking Field Day again in the future. In the meantime, if you just can’t wait to get badass netflow visibility, visit our <a href="https://www.kentik.com/nfd">NFD</a> hub and sign up for a free trial!</p><![CDATA[The Network Is Your Headphone Cord]]><![CDATA[The recorded music market used to be dependent on physical objects to distribute recordings to buyers. Now it's as if our headphone cords stretch all the way from our smartphones to the datacenter. That makes network performance and availability mission-critical for music services — and anyone else who serve ads, processes transactions, or delivers content. Which explains why some of the world's top music services use Kentik Detect for network traffic analysis.]]>https://www.kentik.com/blog/the-network-is-your-headphone-cordhttps://www.kentik.com/blog/the-network-is-your-headphone-cord<![CDATA[Alex Henthorn-Iwane]]>Mon, 22 Aug 2016 13:00:10 GMT<h3 id="netflowanalysis-in-the-age-of-streaming"><em>Netflow Analysis in the Age of Streaming</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/3SqBPvoxNuoiEmSqcmEQE/379fce301ab0dc2dc07bd12f03115f03/headphones-500w.png" alt="headphones-500w.png" class="image right" style="max-width: 300px;" /> <p>Music is rapidly completing its astonishing transformation from a physical object — like a CD — that you have and hold to an ephemeral service experience. According to the latest annual report from the International Federation of the Phonographic Industry (IFPI), streaming music revenue increased 45% in 2015 and now makes up nearly half of worldwide digital music revenue, which for the first time exceeded revenue from sales of physical recordings. The adoption of paid streaming music by tens of millions of subscribers comes on top of the nearly billion-strong audience using advertising-based platforms like YouTube to stream music for free.</p> <p>Through most of its history, the recorded music market has been dependent on a physical object — 45s, LPs, cassettes, CDs, 8-track tapes — to distribute recordings to buyers. The rise of the MP3 marked a transitional phase in which media players covered the dual roles of audio player and portable file storage. Now, from a music point of view, the smartphone or streaming media player is essentially just a network-connected audio converter. 
This new reality of streaming is transforming the way that music service providers have to deal with the physical realities of getting music in the form of bits to listeners, who expect a seamless experience.</p> <div class="pullquote left">The headphone cord has been stretched to reach from datacenter to playback device.</div> <p>To better understand what’s happening, we can think of this transformation as the disaggregation of the MP3 player into distributed components. The file storage and player has been distributed to servers living in hyper-connected datacenters. The headset cord has effectively been stretched to include a long digital segment (Internet connectivity from the datacenter to the user device), an insertion point for a device (smartphone or media player) that handles digital-to-analog conversion, and then a short, local segment of actual physical cord connecting that device to earphones (this may be done instead via Bluetooth, which introduces additional conversion and transmission steps).</p> <h4 id="from-cds-to-cdns">From CDs to CDNs</h4> <p>The most vulnerable part of the above scenario is the Internet connectivity. To deal with that challenge, music service providers are going beyond using public cloud hosting for their files. Instead they are constructing industrial-strength versions of the public Internet to ensure that the digital segment of the “headphone cord” is as good as it can get.</p> <p>The technical term for what music service providers are building is a Content Delivery Network (CDN). In other words, they are building private versions of the type of service that commercial providers like Akamai, CloudFlare, and others have long been offering on a fee basis to enterprises that want to ensure the performance and resilience of their e-commerce and marketing websites to different audiences around the world. Private CDNs built by the likes of Pandora and Spotify are each dedicated and tuned to their particular service requirements and push music files to servers that are as close to their major audiences as possible. These CDNs also ensure that the connectivity from those servers to the mobile and retail broadband networks that connect smartphones and other music players are robust and performant.</p> <div class="pullquote right">To ensure that bits flow freely, music providers are investing in Big Data network analytics.</div> <p>With great power — in the form of costly investments in private CDN infrastructure and connectivity — comes great responsibility. The goal is to ensure that music bits flow freely so revenue can flow freely. To do this, music providers are also investing in heavy-duty NetFlow analysis based on cutting-edge Big Data technologies. Intelligently implemented, traffic analytics based on Big Data allows you to examine in full detail billions of management data records generated by network and server infrastructure. 
And it enables you to answer — on an ad hoc basis and in seconds — key questions about the flow of traffic, including:</p> <ul> <li>How effectively is traffic flowing over the Internet to your audience via multiple ISPs and network transit providers?</li> <li>How can you optimize the cost-performance of that traffic flow?</li> <li>Is a big bump of anomalous traffic due to new demand, a misconfiguration in your network, or a DDoS attack?</li> </ul> <p>First-generation approaches to Big Data analytics, like Hadoop MapReduce, can’t return the above answers within the operational timeframes in which operations engineers at music, video, gaming, and other streaming providers must make key decisions. But luckily, with new columnar-store databases, flash storage, containerized micro-services, and bare metal computing clusters, it’s now possible to take in tens of billions of network data records per day and to perform analytics on those records immediately upon ingest.</p> <img src="//images.ctfassets.net/6yom6slo28h2/52m0uC0ANO4cMqUWOUU60S/696893ce407056ab351ff205fcf483dc/8track-300w.png" alt="8track-300w.png" class="image right" style="max-width: 300px;" /> <p>If Internet traffic is mission-critical for your organization — to serve ads, process transactions, or deliver content — then you can’t afford not to invest in network traffic analysis that delivers data-driven insight at operational speeds. The alternative, in today’s competitive digital business environment, is to risk seeing your business go the way of the 8-track tape. Ready to learn more? Sign up for a <a href="#signup_dialog">free trial</a>, or contact us to <a href="#demo_dialog">request a demo</a>.</p><![CDATA[BGP Routing Tutorial Series: Part 4]]><![CDATA[BGP used to be primarily of interest only to ISPs and hosting providers, but it's become something with which all network engineers should get familiar. In this conclusion to our four-part BGP tutorial series, we fill in a few more pieces of the puzzle, including when — and when not — it makes sense to advertise your routes to a service provider using BGP.]]>https://www.kentik.com/blog/bgp-routing-tutorial-series-part-4https://www.kentik.com/blog/bgp-routing-tutorial-series-part-4<![CDATA[Avi Freedman]]>Mon, 15 Aug 2016 13:00:24 GMT<p>For an updated version of all four parts of this series in a single article, see “<a href="https://www.kentik.com/kentipedia/bgp-routing/" title="BGP Routing: A Tutorial | Kentipedia">BGP Routing: A Tutorial</a>” or download our <a href="https://www.kentik.com/go/ebook/ultimate-guide-bgp-routing/" title="Download The Ultimate Guide to BGP Routing">Ultimate Guide to BGP Routing</a>.</p> <h2 id="further-thoughts-on-advertising-your-routes-with-bgp">Further Thoughts on Advertising Your Routes with BGP</h2> <img src="//images.ctfassets.net/6yom6slo28h2/6yacWNBWmWYiOq8sscESgW/98ed4c9a191604b08f341e4ea7943af0/traffic_circle-500w.jpg" class="image right" style="max-width: 300px" alt="BGP Routing" /> <p>In <a href="https://www.kentik.com/bgp-routing-tutorial-series-part-3/">Part 3</a> of this BGP routing tutorial, we looked at how to establish peering sessions with neighbor networks. This time we’ll take a look at the impact of using BGP with upstream service providers, whether you have only one (single-homed) or several (multi-homed).</p> <p>If you’ve only got a single upstream service provider, why might you want to bother speaking BGP to them? 
Well, you could say “for practice,” but configuring BGP generally involves a fair amount of behind-the-scenes work on the part of upstream providers, so they typically aren’t going to waste their time unless you have a good reason.</p> <p>If you’re single-homed you also don’t really need to “run defaultless” by accepting all routes. Since every packet destined for the Internet (as opposed to your internal network) is going to go out the same router interface, it doesn’t matter whether it does so via one default route or via searching a list of 45,000 or more routes heard via BGP.</p> <p>That leaves only one really valid reason for single-homed networks to use BGP, which is to have more control in advertising routes. To make a compelling case to your provider, you’ll have to understand two concepts that they will likely ask you about. One is “flaps,” which requires a bit of explanation and is covered in the following section. The other is routing-table space. If you’re in your service provider’s IP space or “aggregate announcement” they will likely ask why it makes sense to pollute the routing tables with an extra few routes by announcing your routes more specifically. You’re on your own for the answer to that one, but if you think you have a good case, talk to either your current or potential provider.</p> <h2 id="flapping-and-damping">Flapping and damping</h2> <img src="//images.ctfassets.net/6yom6slo28h2/GekpXkJTUqgSKkoqsm0ia/3955f1a8409c5d104ee9bcd52bda6af0/hummingbird-300w.png" class="image right" style="max-width: 300px" alt="hummingbird" /> <p>When you assert a route you are saying “I know how to get to 192.204.4.0/24” based on some internal knowledge that you actually do know how to get to 192.204.4.0/24. When you no longer know how to get to that route, the natural — and previously considered correct — thing to do is to withdraw that assertion. Advertising a route and then withdrawing that route is referred to as “flapping.”</p> <p>One downside of flapping is that it’s contagious. When you withdraw an assertion, your providers must then also withdraw that assertion, and then their providers and peers must do the same. All in all, thousands of routers around the world will have to look at that route, decide if they have a next-best path in their BGP table, and if so insert it into their IP routing table as the current best path.</p> <p>Route flapping consumes many CPU-seconds on routers that are sometimes very busy. In fact, it was consuming so much CPU time years ago that Sean Doran of Sprintlink said “this must stop.” Several people came up with an idea, which Cisco implemented in record time, to dampen the route flaps (you’ll hear people say “damp” and “dampen”; there’s no real consensus about which is the correct term). What this means in practice today is that if you flap a route more than once or twice that route will be dampened by many providers for at least an hour or so. In other words, the route is suppressed, meaning that it will not be advertised even if it is up.</p> <p>If you’re single-homed, you will be dampened if your provider withdraws your routes because someone resets the router. So if you want to have more control in advertising your routes, and you ask your upstream provider to announce you, you’ll likely be asked to explain why it makes a difference to you, meaning why the benefit of being multiply-announced outweighs the possible negative effects of being dampened due to instability in either your or your provider’s network. 
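</p> <p>As an aside, here is a minimal sketch of what a damping configuration can look like on a Cisco IOS router, using the commonly cited default timers (15-minute half-life, reuse below 750, suppress above 2000, maximum suppression of 60 minutes). Treat the exact values as illustrative; providers tune them to their own policies.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">router bgp 64512
 ! Penalize flapping routes; suppressed routes are not advertised
 ! until their penalty decays below the reuse threshold.
 bgp dampening 15 750 2000 60</code></pre></div>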
<p>After all, if you’re singly-connected to the ‘net, the whole Internet doesn’t need to know if you lost connectivity to your provider, since there’s no other path to get to you. So why bother all of the routers in the world by telling them whether or not you’re currently reachable?</p> <h2 id="bgp-for-the-multi-homed">BGP for the multi-homed</h2> <img src="//images.ctfassets.net/6yom6slo28h2/1m9FApmwmEUIucAy8UsYAU/363cdc46bec49e8abf9ab056dc76c245/antennas-400w.png" class="image right" style="max-width: 400px" alt="antennas" /> <p>Many networks have business-critical needs for assured Internet connectivity, and a common way to achieve this connectivity is by multi-homing, which means using the services of two or more upstream service providers. Generally, the goal of multi-homing is to use both upstream provider connections in a sane manner and “load-balance” them. Ideally, you’d like roughly half the traffic to go in and out of each connection. You’d also like “failover” routing, where if one connection goes down the other one keeps you connected to the Internet. In an ideal network, you’d be able to have any one of your connections to the ‘net go down and still maintain connectivity and speed.</p> <p>You don’t need BGP to load-balance; you can do that almost as well with “round-robin” or “route-caching” approaches. What’s most important about BGP if you’re multi-homed is the ability to advertise routes. In multi-homed situations the network operator may want to express different routing policies to each upstream provider, which it can do with BGP by advertising routes under its own ASN to each upstream provider, which in turn advertises them to its providers and peers (i.e., to “the rest of the Internet”). As noted in <a href="https://www.kentik.com/bgp-routing-tutorial-series-part-2/">Part 2</a>, doing this basic level of route advertisement is not hard, but you have to do it in a paranoid way because if you screw up your BGP route advertisements it can be felt all over the Internet.</p> <p>One nice thing about using BGP to advertise routes if you are multi-homed is that if you do have connectivity issues BGP is pretty smooth about handling them. If your providers are announcing specific prefixes for you, they would normally stop announcing you when they don’t know how to get to you any more. The beauty of speaking BGP to your providers is that when you lose connectivity to them, the BGP session will go down as well and all of those route advertisements will be automatically withdrawn.</p> <h2 id="routes-and-as-paths">Routes and AS-PATHs</h2> <p>A key concept to understand when you decide that you want to advertise and receive routes via BGP is the AS-PATH attribute. Every time a router advertises a route via BGP, that route is stamped with the Autonomous System Number (ASN) of the Autonomous System (AS) to which the router belongs (see Routes and Autonomous Systems in <a href="https://www.kentik.com/bgp-routing-tutorial-series-part-1/">Part 1</a>). As a route moves from AS to AS, it builds up an AS-PATH, which is useful for the following reasons:</p> <ul> <li>AS-PATH provides a diagnostic trace of routing on the Net. If you have full routes in one of your routers, or have query access to a router that does (such as telnet://route-server.cerf.net), you can find the route that encompasses a particular IP address and see which ASNs have advertised it. 
If you do some poking around, you can even see how a provider is actually connected.</li> <li>AS-PATH is one of a number of metrics that determine how routes heard via BGP are inserted into the actual IP routing table. We’ll be talking more about metrics in the future.</li> <li>AS-PATH can be used for filtering that enables policy routing. There are many reasons why you’d want to filter based on the AS-PATH, including, for example, to make sure that you only send routes that originate in your network. AS-PATH filtering is the best first step that you can work with to get comfortable with filtering routes. And if your network is fairly simple (as 90 percent of networks are), then you won’t need anything fancier for quite some time.</li> </ul> <p>For now, keep in mind that unless you’ve done some tuning of your own:</p> <ul> <li>The most specific route always wins, whether it’s a BGP route or a static internal route.</li> <li>If there’s a choice between multiple BGP routes, then the one with the shortest AS-PATH wins.</li> </ul> <p>To sum up, here are the most important questions to keep in mind for each peer when you’re either considering how to do BGP in general or specifically bringing up a new BGP session:</p> <ul> <li>What routes do you want the peer to hear? The most important thing is to ensure that you do not re-advertise routes to which you are not providing Internet connectivity.</li> <li>What do you want to do with the routes that you hear via the session? Do you want to tune them? Only take some? Take them all?</li> </ul> <h2 id="conclusion">Conclusion</h2> <div class="pullquote right">BGP routing has become something with which all network engineers should now be familiar.</div> <p>BGP used to be primarily of interest only to ISPs and hosting providers, whose revenue depends on delivering traffic. It then became the concern of web businesses to manage their connectivity to the Internet in a more intelligent way, since their user experience and revenue streams depend on reliable, high-performance Internet traffic delivery. Now, with the adoption of cloud solutions by many enterprises to meet their IT needs, as well as the overall trend to digital business models, BGP and Internet routing is becoming something with which all network engineers should get familiar.</p> <p>It would be naive to think that we’ve done complete justice to the topic of BGP in the four parts of this series, but I hope you’ve found these tutorials helpful. If we’ve piqued your interest in how the routes used by your traffic affect network performance and costs, you’ll find that a Big Data-based approach to BGP analytics provides the most powerful platform for valuable insights. 
At Kentik, we’ve incorporated BGP analytics into our cloud-based network visibility SaaS, which you can experience by signing up for a <a href="#signup_dialog">free trial</a>.</p> <h3 id="related-posts">Related posts</h3> <ul> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-1/">BGP Routing Tutorial Series: Part 1</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-2/">BGP Routing Tutorial Series: Part 2</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-3/">BGP Routing Tutorial Series: Part 3</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-4/">BGP Routing Tutorial Series: Part 4</a></li> </ul><![CDATA[PacketPushers: Flight from the Public Cloud?]]><![CDATA[Kentik CEO Avi Freedman joins Ethan Banks and Chris Wahl of the PacketPushers Datanauts podcast to discuss hot topics in networking today. The conversation covers the trend toward in-house infrastructure among enterprises that generate revenue from their IP traffic, the pluses and minuses of OpenStack deployments for digital enterprises, and the impact of DevOps practices on network operations.]]>https://www.kentik.com/blog/packetpushers_flight_from_public_cloudhttps://www.kentik.com/blog/packetpushers_flight_from_public_cloud<![CDATA[Avi Freedman]]>Mon, 08 Aug 2016 13:00:55 GMT<p>On Public Cloud, OpenStack, DevOps, and NetOps</p> <img src="//images.ctfassets.net/6yom6slo28h2/4HO2R8bz0IWwWYAyi6qGAG/4fd4daa7a61937f3d233eaa03265909f/packetpushers.png" alt="packetpushers.png" class="image right" style="max-width: 300px;" /> <p>It’s always a pleasure to have an in-depth conversation with smart, technical people. This past January, I got the chance to share about Kentik’s Big Data-based approach to NetFlow analysis during the <a href="http://packetpushers.net/podcast/podcasts/pq-show-71-kentik-real-time-network-visibility-sponsored/">PacketPushers PriorityQueue podcast</a>. At the time, we also tossed around some other topics that would be intriguing to explore at a later time. That time is now.</p> <p>In this three-part episode of the PacketPushers Datanauts podcast, entitled “Flight from the Public Cloud?”, I joined hosts Ethan Banks and Chris Wahl to discuss three hot button topics related to the cloud.</p> <p>In part one, Ethan and Chris and I discuss why web companies, ISPs, hosting companies and other digital enterprises that generate revenue from their IP traffic have tended over time to migrate from public cloud to infrastructure they own and operate themselves.</p> <p><strong>Listen to the podcast now:</strong></p> <p>In part two, we explore the phenomenon of digital enterprises opting out of OpenStack deployments while considering where OpenStack is helpful. We also talk about GIFEE (Google Infrastructure For Everyone Else).</p> <p>In part three, we dig into the idea that network operations has been doing DevOps for decades, what DevOps-like network monitoring looks like today, what is holding back broader adoption of DevOps practices in networking, and what the future holds for network orchestration.</p> <p>To access the full Datanauts podcast page, visit:</p> <p><a href="http://packetpushers.net/podcast/podcasts/datanauts-044-flight-from-the-public-cloud/">http://packetpushers.net/podcast/podcasts/datanauts-044-flight-from-the-public-cloud/</a></p> <p>It was a fun conversation and I hope you enjoy it as much as I did. 
If you have thoughts or questions, feel free to look me up on Twitter at <a href="https://twitter.com/@avifreedman">@avifreedman</a>.</p> <hr> <p><em>Want to learn more about NetFlow analysis in the cloud? Read about Kentik Detect, the industry’s first Big Data-powered SaaS solution for network traffic and performance analytics. Or see for yourself what you can do with Kentik by signing up for a <a href="#signup_dialog">free trial</a>. And if you’re inspired to get involved, <a href="https://www.kentik.com/careers/">we’re hiring!</a></em></p><![CDATA[On Raising Our $23M B Round]]>https://www.kentik.com/blog/on-raising-our-23m-b-roundhttps://www.kentik.com/blog/on-raising-our-23m-b-round<![CDATA[Avi Freedman]]>Thu, 04 Aug 2016 18:19:58 GMT<p><strong>A Big Step Forward for Big Data Network Visibility</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/5ytP2mXZJYQIwY8k4OgiWc/8424b7e1dc4415c833bd4ccccf273a3e/canstockphoto2382088.jpg" alt="canstockphoto2382088.jpg" class="image right" style="max-width: 300px;" /> <p>I am thrilled to announce today that Kentik has closed $23 million in Series B financing!  This round was led by Third Point Ventures, brought in new investors Glyn Capital and David Ulevitch, and was supported by continued investment from our seed and A round investors: August Capital, Data Collective VC, First Round Capital, and Engineering Capital.</p> <p>We started Kentik in 2014 to use network traffic insights to give operational, performance, and security intelligence to the people that make and use the Internet for their businesses, and we’ve seen strong growth and market adoption.  In the last year, we’ve gone from a handful of early adopters to over 60 customers, many well known, like Box, Yelp, OpenDNS, Dailymotion, and Pandora.  Others we can’t mention by name, but they are among the largest web enterprises and service providers in the world - and we have a number of smaller infrastructure provider and enterprise customers as well.</p> <p>It’s been exciting to see the feedback loop of great engineering partnered with our customers.  We’ve gone from first product to a sophisticated and robust analytics platform meeting real operational needs.  In the last year we’ve released adaptive alerting, ground-breaking peering and transit traffic analytics, and large-scale ad-hoc analytics with multi-dimensional views of the flow of traffic, with 99% of queries, tunable in billions of combinations, returning answers in just a few seconds.</p> <p>And - it’s helped that we’ve been able to align with premier investors.  At Kentik, ethical hustle is one of our core values. Third Point showed deep founder-like hustle to build the relationship and win the deal - and the same kind of conviction and willingness to act on it as our earlier investors.  Given their deep understanding of the kind of companies we’re helping, they were a perfect fit for us and our board of other technologists turned growth-experienced business folks.</p> <p>Going forward, our vision remains the same - to unlock the value of network data so that our customers can innovate, helping them to achieve ever higher performance, user experience, efficiency, revenue, and profitability. 
I’m grateful to our customers, partners, investors, and the Kentik team, and excited to see what the coming year will bring!</p><![CDATA[BGP Routing Tutorial Series: Part 3]]><![CDATA[In this post we continue our look at BGP — the protocol used to route traffic across the interconnected Autonomous Systems (AS) that make up the Internet — by clarifying the difference between eBGP and iBGP and then starting to dig into the basics of actual BGP configuration. We'll see how to establish peering connections with neighbors and to return a list of current sessions with useful information about each.]]>https://www.kentik.com/blog/bgp-routing-tutorial-series-part-3https://www.kentik.com/blog/bgp-routing-tutorial-series-part-3<![CDATA[Avi Freedman]]>Mon, 01 Aug 2016 13:00:52 GMT<p>For an updated version of all four parts of this series in a single article, see “<a href="https://www.kentik.com/kentipedia/bgp-routing/" title="BGP Routing: A Tutorial | Kentipedia">BGP Routing: A Tutorial</a>” or download our <a href="https://www.kentik.com/go/ebook/ultimate-guide-bgp-routing/" title="Download The Ultimate Guide to BGP Routing">Ultimate Guide to BGP Routing</a>.</p> <div as="Promo"></div> <h2 id="configuring-peering-for-neighbor-autonomous-systems">Configuring Peering for Neighbor Autonomous Systems</h2> <img src="//images.ctfassets.net/6yom6slo28h2/3d6GnVmnPyQgkIiIIIm0c6/1b6beee2c9cfd598f326801f1089571a/rail_junction-500w.png" class="image right" style="max-width: 400px; margin-bottom: 20px;" alt="" /> <p>So far in this series we’ve looked at a number of basic concepts about BGP, covering both who would want to use it and why. In particular we’ve learned that speaking or advertising BGP to your service providers and/or peers lets you do two things:</p> <ul> <li>Make semi-intelligent routing decisions concerning the best path for a particular route to take outbound from your network (otherwise you would simply set a default route from your border routers into your service providers).</li> <li>Advertise your routes to those providers, for them to advertise in turn to others (for transit connectivity) or just use internally (in the case of peers).</li> </ul> <p>We also pointed out some of the negative consequences that can result from careless BGP configuration. In this post, we’ll delve deeper into the mechanics of BGP by looking at how you actually configure BGP on routers.</p> <h2 id="autonomous-systems-and-asns">Autonomous Systems and ASNs</h2> <p>As discussed in Part 1, the term Autonomous System (AS) is a way of referring to a network such as a private enterprise network or a service provider network. Each AS is administered independently and may also be referred to as a domain. Each AS is assigned at least one Autonomous System Number (ASN), which identifies the network to the world. Most networks use (or at least show to the world) only one ASN. Each ASN is drawn from a 16-bit number field (allowing for 65,536 possible ASNs):</p> <ul> <li>ASNs 0 and 65,535 are reserved values</li> <li>The block of ASNs from 64,512 through 65,534 is designated for private use</li> <li>The remaining ASN values available for Internet routing range from 1 through 64,511 (except 23,456, which is reserved).</li> </ul> <h2 id="ebgp-vs-ibgp">eBGP vs. iBGP</h2> <p>One more clarification before we start configuring: BGP can be used internally (iBGP) within an AS to manage routes, and externally (eBGP) to route between ASes, which is what makes possible the Internet itself. 
In this article when we say BGP we’re talking about eBGP, not iBGP. eBGP and iBGP share the same low-level protocol for exchanging routes, and also share some algorithms. But eBGP is used to exchange routes between different ASes, while iBGP is used to exchange routes within the same AS. In fact, iBGP is one of the “interior routing protocols” that you can use to do “active routing” inside your network/domain.</p> <p>The major difference between eBGP and iBGP is that eBGP tries like crazy to advertise every BGP route it knows to everyone, and you have to put “filters” in place to stop it from doing so. iBGP, on the other hand, tries like crazy not to re-advertise routes: a route learned from one iBGP peer is never passed along to another iBGP peer. Because of that rule, iBGP can be a challenge to get working: to make it work you have to peer all of the iBGP “speakers” inside your network with all of the other iBGP speakers. This is called a “routing mesh” and, as you can imagine, it can get to be quite a mess when you have 20 routers that each have to peer with every other router. One solution is “BGP confederations,” a topic we’ll cover in a subsequent tutorial.</p> <h2 id="peering-sessions">Peering Sessions</h2> <p>So now let’s look at the actual configuration. BGP-speaking routers exchange routes with other BGP-speaking routers via peering sessions using ASN identification. At a technical level, this is what it means for one network or organization to peer with another. Here’s a simplified Cisco code snippet of a router BGP clause:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">router bgp 64512
 &lt;omitted lines...>
 neighbor 207.106.127.122 remote-as 701</code></pre></div> <p>The clause starts out by saying “router bgp 64512.” This means that what follows is a list of commands that describe how to speak BGP on behalf of ASN 64512. (We’re using 64512 in our examples because it’s not a live ASN, so if anyone uses a configuration straight from this column and uses this made up ASN, automated route-examination programs will detect it.)</p> <p>All that’s required to bring up a peering session is that one neighbor line under the router bgp clause. In this example, this line specifies 207.106.127.122 as the remote IP address (with respect to the customer’s router) of a router in the AS with ASN 701.</p> <p>The purpose of neighbor commands is to initiate peering sessions with neighbors. It’s possible to have BGP peering sessions that go over multiple hops, but eBGP multi-hop is a more advanced topic and has many potential pitfalls. So for now, let’s assume that all neighbors must be on a LAN interface (Ethernet, Fast Ethernet, FDDI). In practice, you nearly always use more than one line to specify how to exchange routes with a given neighbor in a given peering session. So a typical neighbor command sequence would look more like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">router bgp 64512
 &lt;omitted lines...>
 neighbor 207.106.127.122 remote-as 701
 neighbor 207.106.127.122 next-hop-self
 neighbor 207.106.127.122 send-community
 neighbor 207.106.127.122 route-map prepend-once out
 neighbor 207.106.127.122 filter-list 2 in
 &lt;omitted lines...></code></pre></div> <p>Every time a neighbor session comes up, each router will evaluate every BGP route it has by running it through any filters you specify in the BGP neighbor command. Any routes that pass the filter are sent to the remote end. This filtering is a critical process. 
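</p> <p>To make the filtering side concrete, here is a hypothetical sketch of what the access lists behind a “filter-list” line might look like in Cisco IOS syntax. The list numbers, the prefix-list name, and the 192.204.4.0/24 prefix are all made up for illustration, not taken from a real configuration:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">! AS-path access-list 2: the regexp "^$" matches an empty AS path,
! i.e. only routes that originated inside our own AS
ip as-path access-list 2 permit ^$

! Prefix list that permits only our own aggregate and nothing else
ip prefix-list OUR-BLOCKS seq 5 permit 192.204.4.0/24

router bgp 64512
 ! Applied together outbound; a route must pass both filters
 neighbor 207.106.127.122 filter-list 2 out
 neighbor 207.106.127.122 prefix-list OUR-BLOCKS out</code></pre></div> <p>Applied outbound like this, such lists ensure that only routes originated by your own AS ever reach the neighbor, no matter what else has leaked into your routing table. <p>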
The most dangerous element of BGP is the risk that your filtering will go awry and you’ll announce routes that you shouldn’t to your upstream providers.</p> <h2 id="seeing-routes">Seeing routes</h2> <img src="//images.ctfassets.net/6yom6slo28h2/59lkeKZKbuOaCueoUgMWY0/bf4d462d5f7f87739defce61b0a12d90/flight_board-400w.png" class="image right" style="max-width: 400px" alt="" /> <p>While the session is up, BGP updates will be sent from one router to the other each time one of the routers knows about a new BGP route or needs to withdraw a previous route announcement. To see a list of all current peering sessions, you can use the Cisco “show ip bgp summary” command, abbreviated below as “sho ip bgp summ”:</p> <p><code class="language-text">brain.companyx.com# sho ip bgp summ</code></p> <p>The command typically returns results like the following, which is a session summary from a core router at an ISP. The 6451x Autonomous Systems are BGP sessions to other routers at the same ISP whose ASNs are not shown to the world. The 205.160.5.1 session is down, and the sessions where the remote Autonomous Systems are 4231, 3564, and 6078 are external peering sessions with routers from another ISP.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">BGP table version is 1159873, main routing table version 1159873
44796 network entries (98292/144814 paths) using 9596344 bytes of memory
16308 BGP path attribute entries using 2075736 bytes of memory
12967 BGP route-map cache entries using 207472 bytes of memory
16200 BGP filter-list cache entries using 259200 bytes of memory

Neighbor        V    AS MsgRcvd MsgSent  TblVer  InQ OutQ  Up/Down   State
205.160.5.1     4  6313       0       0       0    0    0  never     Active
207.106.90.1    4 64514 1145670  237369 1159873    0    0  4d03h
207.106.91.5    4 64515    6078    5960 1159869    0    0  4d03h
207.106.92.16   4 64512    6128    6782 1159870    0    0  4d03h
207.106.92.17   4 64512    5962    6894 1159870    0    0  10:08:46
206.245.159.17  4  4231  161072  276660 1159870    0    0  2d05h
207.44.7.25     4  3564    6109  310292 1159867    0    0  22:40:50
207.106.33.3    4 64513  164708  724571 1159866    0    0  3d23h
207.106.33.4    4  3564    6086  274182 1159853    0    0  4d03h
207.106.127.6   4  6078    5793  310011 1159869    0    0  2d03h</code></pre></div> <p>Most of the above table is fairly self-explanatory:</p> <ul> <li>The Neighbor column gives the IP address of the neighbor with which the router is peered.</li> <li>The V column is the BGP version number. If it is not 4, something is very wrong! BGP version 3 doesn’t understand Classless (CIDR) routing and is thus dangerous.</li> <li>The AS column is the remote ASN.</li> <li>InQ is the number of messages received from the neighbor that are still waiting to be processed.</li> <li>OutQ is the number of messages queued up to be sent to the neighbor.</li> <li>The Up/Down column is the time that the session has been up (if the State field is empty) or down (if the State field is not empty).</li> <li>Anything in a State field indicates that the session for that row is not up. 
In just one of the nomenclature flaws of BGP, a state of Active actually indicates that the session is inactive.</li> </ul> <h2 id="coming-next">Coming Next</h2> <p>In our next installment we’ll be looking at what to keep in mind when configuring BGP, as well as topics such as route withdrawal, route flaps, route selection, load balancing, and BGP metrics.</p> <h2 id="related-posts">Related posts</h2> <ul> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-1/">BGP Routing Tutorial Series: Part 1</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-2/">BGP Routing Tutorial Series: Part 2</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-3/">BGP Routing Tutorial Series: Part 3</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-4/">BGP Routing Tutorial Series: Part 4</a></li> </ul><![CDATA[BGP Routing Tutorial Series: Part 2]]><![CDATA[BGP is the protocol used to route traffic across the interconnected Autonomous Systems (AS) that make up the Internet, making effective BGP configuration an important part of controlling your network's destiny. In this post we build on the basics covered in Part 1, covering additional concepts, looking at when the use of BGP is called for, and digging deeper into how BGP can help — or, if misconfigured, hinder — the efficient delivery of traffic to its destination.]]>https://www.kentik.com/blog/bgp-routing-tutorial-series-part-2https://www.kentik.com/blog/bgp-routing-tutorial-series-part-2<![CDATA[Avi Freedman]]>Mon, 25 Jul 2016 13:00:46 GMT<p>For an updated version of all four parts of this series in a single article, see “<a href="https://www.kentik.com/kentipedia/bgp-routing/" title="BGP Routing: A Tutorial | Kentipedia">BGP Routing: A Tutorial</a>” or download our <a href="https://www.kentik.com/go/ebook/ultimate-guide-bgp-routing/" title="Download The Ultimate Guide to BGP Routing">Ultimate Guide to BGP Routing</a>.</p> <div as="Promo"></div> <h2 id="more-basics-advertising-homing-and-cardinal-sins">More Basics: Advertising, Homing, and Cardinal Sins</h2> <img src="//images.ctfassets.net/6yom6slo28h2/2Zfdo0QKMooCwCWquWOw6C/b4cf4317cbfab3f7faaa93a0b8a9df11/wordpress-media-8933" class="image right" style="max-width: 400px" alt="Route 66" /> <p>In part 1 of this series, we established that BGP is the protocol used to route traffic across the interconnected Autonomous Systems (AS) that make up the Internet. We also looked at why effective BGP configuration is an important part of controlling your destiny on the Internet, and we covered some of the basic building-block concepts needed to understand what BGP does and how it does it. We’ll continue on that path in this post, adding more concepts and digging deeper into how BGP works and what makes it of value.</p> <h2 id="advertising-routes">Advertising Routes</h2> <p>The core function of BGP is to provide a mechanism through which any Autonomous System — a network connected to the Internet — can get traffic to any other AS. As discussed in Part 1, the path traveled by traffic is referred to as a route, and BGP is the protocol by which you “advertise” to the Internet the routes available to get traffic to your AS.</p> <p>One way of thinking about the BGP routes that you advertise to other entities is as promises to carry data to the IP space represented by the advertised route. 
For example, if you advertise 192.204.4.0/24 (in class C terms, the block starting at 192.204.4.0 and ending at 192.204.4.255), you promise that you know how to carry to its ultimate destination data that is destined for any address in 192.204.4.0/24.</p> <h2 id="single--and-multi-homed-networks">Single- and Multi-homed Networks</h2> <p>Another important BGP-related concept is “single-homed” vs. “multi-homed,” which is a major determinant in who uses BGP and who doesn’t:</p> <ul> <li>Single-homed means that you have only one upstream provider giving your network transit to the rest of the Internet.</li> <li>Multi-homed means that you connect to multiple providers to provide transit to the rest of the world.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/3003KuGPPyIACAY6MCuk8M/d18e7d337e41b3d578886c3703bc9d9b/One_way-400w.png" class="image right" style="max-width: 300px" alt="One way" /> <p>When you’re single-homed, you usually won’t want to use BGP to your upstream providers because you only have one path out of your network. So filling your router with 500,000+ BGP routes isn’t going to do you any good, since all of those routes point to the same place (your one upstream provider). You can get the same result much more simply by using a “default route” to point all packets that aren’t otherwise matched in your internal routing table to your upstream provider.</p> <p>Also, if you have one upstream provider, it’s almost guaranteed that your IP space is a sub-allocation (CIDR delegation, to be precise) of their larger IP blocks (aggregates). So you won’t be advertised to the outside world specifically; instead your provider will just advertise the overall block. If you have any other networks (e.g. an old Class C, customers with address space, etc.), then your provider will statically announce those routes to the world and statically route them inside their network to your router interface(s). So even if you did advertise to your provider the routes to your IP space, they’re not going to re-advertise them to the rest of the world because there already is a route to your provider for one of the larger aggregates of address space that you are inside of.</p> <p>Despite the above, there is one circumstance where you might wind up using BGP as a single-homed customer, which is if you have multiple connections to a single ISP. In some cases, you will use BGP to manage the load-balancing across these links. Often, your provider will want to help configure, monitor, and manage BGP since it will affect the service they deliver.</p> <h2 id="connecting-with-and-without">Connecting With and Without</h2> <p>To better understand the practical value of BGP if you’re either multi-homed or you have multiple connections to a single ISP, let’s look at what happens when you connect to the Internet without speaking BGP to your provider:</p> <ul> <li>You create a default route toward your upstream provider, and all non-local packets go out of the interface specified by that route.</li> <li>Your provider probably puts static routes toward you on their side, and redistributes those static routes into their interior gateway protocol (IGP). Then — unless all of their BGP is done statically — they probably redistribute their IGP into BGP.</li> </ul> <p>What happens differently if you do use BGP? 
Your provider will give you all of the routes that they have (that’s the easy part), will listen to your route announcements, and will then redistribute some or all of those to their peers and customers (that’s the hard part for them). The net differences boil down to this:</p> <ul> <li>They may start advertising a more specific route, which is no mean task in a complicated network designed, as most networks are, to prevent the accidental leaking of more specific routes.</li> <li>The routes that they normally advertise for you under just their ASN will now have your ASN attached as well.</li> </ul> <p>So what’s the most important benefit to you of using BGP? It’s not that you get full or partial routes from your providers. That’s cool — and maybe even useful — but you can do almost as well by just load-balancing all outgoing traffic in either a round-robin or route-caching manner. The most important thing for you about BGP will actually be the ability to have your routes advertised to your providers, and by them to their providers and peers (i.e. to the rest of the Internet).</p> <h2 id="cardinal-sins">Cardinal Sins</h2> <p>So now we understand why you might want to use BGP. Does that mean that you’re now ready to start configuring? Doing a basic level of route advertisement using BGP is not hard, but if you screw it up, you may get slapped down pretty hard, because screw-ups with BGP route advertisements can be felt all over the Internet. That’s right: <em>screw ups with BGP route advertisements can be felt all over the Internet!</em></p> <img src="//images.ctfassets.net/6yom6slo28h2/MgIX1aV7WumaW6aSGScyi/de435c72f3f932890c0f5fbe797c27ef/Blackhole-352w.jpg" class="image right" style="max-width: 350px" alt="" /> <p>The first cardinal sin of BGP is advertising routes to which you are not actually able to deliver traffic. This is called “black-holing,” which is one form of “route hijacking.” If you advertise some part of the IP space that is owned by someone else, and that advertisement is more specific than the one made by the owner of that IP space, then all of the data on the Internet destined for that space will flow to your border router. This will effectively disconnect that black-holed address space from the rest of the Internet. Let’s say, for example, that you announce a route for Instagram’s servers that’s more specific than the otherwise-best route. The result is that you’ve just black-holed Instagram for a period of time. Needless to say, that will make many people very unhappy.</p> <p>The second cardinal sin of BGP routing is not having strict enough filters on the routes you advertise. If you don’t filter well and have a BGP-speaking customer, you can pass on their poor hygiene and be an inadvertent vector for disrupting networks far from you on the Internet. If your provider is smart, there are filters in place to prevent you from a spectacular fail, which would hurt them and everyone else. But don’t count on it.</p> <p>The key to avoiding these sins is multi-level:</p> <ul> <li>Implement good filtering on your end.</li> <li>Check that your provider is also doing excellent filtering wherever possible.</li> <li>Be paranoid when configuring your BGP: Test your configs and watch out for typos! Think through everything that you do in terms of how it could screw things up and land you on the front page of the New York Times.</li> </ul> <p>Remember: the vast majority of the route hijacking on the Internet is due to misconfiguration! 
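</p> <p>One concrete way to practice that paranoia (a suggested habit, not part of the original checklist): after every filter change, make the router show you exactly what you are announcing before you walk away. In Cisco IOS terms, using the made-up prefix and neighbor addresses from this series, that might look like:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">! List exactly which prefixes we are advertising to this neighbor
router# show ip bgp neighbors 207.106.127.122 advertised-routes

! A deny-everything-else prefix list is a cheap safety net:
! announce our own aggregate and explicitly drop all other prefixes
ip prefix-list ANNOUNCE-OUT seq 5 permit 192.204.4.0/24
ip prefix-list ANNOUNCE-OUT seq 10 deny 0.0.0.0/0 le 32</code></pre></div> <p>If the advertised-routes output ever lists a prefix you don’t own, stop and fix your filters before the announcement propagates. That is exactly the kind of misconfiguration we’re talking about.</p> <p>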
That doesn’t mean that someone couldn’t be attempting to disrupt service or intercept packets, but usually the issue is a typo in someone’s config. Focusing on the points listed above is your best defense against shooting yourself in the foot when configuring BGP.</p> <h2 id="coming-next">Coming Next</h2> <p>In future posts we’ll cover more BGP fundamentals: peering sessions, injecting routes, basic filtering, and best-path selection. Because the adverse consequences of not quite knowing what you’re doing are so severe, I don’t advise that you go ahead and start playing with BGP yet. But if you just can’t wait, and you’re implementing BGP for the first time, then please at least get a friend or another provider to review your proposed configs for you before implementing them.</p> <h3 id="related-posts">Related posts</h3> <ul> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-1/">BGP Routing Tutorial Series: Part 1</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-2/">BGP Routing Tutorial Series: Part 2</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-3/">BGP Routing Tutorial Series: Part 3</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-4/">BGP Routing Tutorial Series: Part 4</a></li> </ul><![CDATA[Unearthing the Value of Network Traffic Data with Big Data Network Analytics]]><![CDATA[In most digital businesses, network traffic data is everywhere, but legacy limitations on collection, storage, and analysis mean that the value of that data goes largely untapped. Kentik solves that problem with post-Hadoop Big Data analytics, giving network and operations teams the insights they need to boost performance and implement innovation. In this post we look at how the right tools for digging enable organizations to uncover the value that's lying just beneath the surface.]]>https://www.kentik.com/blog/unearthing-the-value-of-network-traffic-datahttps://www.kentik.com/blog/unearthing-the-value-of-network-traffic-data<![CDATA[Alex Henthorn-Iwane]]>Mon, 18 Jul 2016 13:00:16 GMT<p>Seeing Beneath the Surface with Post-Hadoop Big Data</p> <img src="//images.ctfassets.net/6yom6slo28h2/5Y8iyuYZK8QiOWMM8GqCgc/436b09cd1a3de8252453ca781e7aaca8/mosaic-500w.png" alt="mosaic-500w.png" class="image right" style="max-width: 300px;" /> <p>Not long ago, the <em>New York Times</em> published a fascinating article about a rug designer named Luke Irwin who lives in Wiltshire, England. Irwin needed to run some electrical cables under his yard. While digging the trench, his contractor revealed an intricate mosaic floor of red, blue, and white tiles just 18 inches down. That’s how Irwin learned that his family home was built on top of a luxurious villa that was inhabited by upper-class Romans between A.D. 175 and 220.</p> <p>Comprising an estimated 20 to 25 rooms on the ground floor alone, the Irwin site is one of the richest Roman-era archaeological discoveries in recent history. According to the <em>Times</em>, the heritage organization Historic England called the find “‘unparalleled in recent years,’ in part because the remains of the villa, with its outbuildings, were so undisturbed.” With just a little digging, Irwin had uncovered a trove of nearly unprecedented value lying just beneath the surface.</p> <p>Buried out of reach</p> <p>Irwin’s story may be interesting, but what does it have to do with network traffic data? 
The answer is rooted in the experience of Kentik’s founders, who’ve spent decades building and operating some of the world’s biggest and most complex networks. They know firsthand that network teams typically carry around a vast reservoir of technical and institutional knowledge in their heads. But the value of that knowledge often remains buried because even experienced organizations have traditionally lacked the timely, comprehensive information required to yield actionable insights. Until now, the tools available to generate and access such information have been limited at best. At Kentik, we believe deeply in the power of post-Hadoop Big Data to address those limitations, making rich data readily accessible not only to engineering and operations, but also to wider areas of the organization.</p> <p>Data-driven insights can boost user satisfaction, make a business more competitive, and increase profits.</p> <p>Access to rich data matters in part because it enables insights that can make routine tasks far faster and more accurate. But information can also power innovation — not just seemingly unattainable innovation with a capital “I,” like flying cars, but also continuous incremental improvement in the operation of a digital business. Data-driven insights can reduce costs, achieving huge efficiencies over time. They can also improve network performance, laying the foundation for improved user experience, new features that weren’t previously feasible, and new revenue streams. The result is to boost user/customer satisfaction, make a business more competitive, and increase profits. (I wrote previously about this kind of potential in <a href="https://www.kentik.com/moneyball-your-network-with-big-data-analytics/">Moneyball Your Network</a>.) At the same time, access to rich data makes network teams happy because it empowers them to go beyond drudgery, driving the business forward with passion, excellence, and creativity.</p> <p>While this scenario sounds idyllic, it’s unfortunately not the reality for most network teams today. Like the pre-dig Irwin family, surrounded by buried riches, too many network organizations are separated from the true value of their network data by legacy limitations on the collection, storage, and analysis of flow records (e.g. NetFlow) and other network traffic data like BGP and GeoIP. And too many network managers and operators are trapped in a whack-a-mole existence, with insufficient data to make decisions and insufficient tools and resources to close the gap.</p> <p>Slow, shallow, and costly</p> <img src="//images.ctfassets.net/6yom6slo28h2/RlgF2ZwusKcSWqmYmQw8o/3eb36e536079fefeda296841501a8d81/shallow-420w.png" alt="shallow-420w.png" class="image right" style="max-width: 300px;" /> <p>Built on appliances, text files, or SQL databases, traditional network traffic analysis systems reduce rich, raw data to a few indexed tables, discarding most details in the process. Limited, slow, and costly, they’re too shallow to get you even 18 inches down, as it were, to the true value of your network data. Sure, you can get some pretty graphs of summary views, but without real analytical depth. For the practitioners who have to operate, engineer, and improve service delivery, shallow data is a bit of a curse.</p> <p>The alternative to these old-school systems has been Hadoop-based Big Data approaches. Some (think MapReduce) are prohibitively slow for operational use. 
Others (Spark, ELK) are prohibitively costly when you add up what it takes to get both raw data ingest and ad-hoc analytics in operational time frames. And that doesn’t include the cost, in the OSS case, of building and maintaining your own user-friendly user interface for analytics. Without it, the utility of your system is limited to a tiny cadre of expert users. You put in a lot of hard work, capital, and operational expense, but you shut out the broader set of users that would enable you to get a meaningful return on your investment. So while building a Big Data system on your own may seem like a promising solution, in reality it can be a scary (business) proposition.</p> <p>Dig deep without a backhoe</p> <p>Kentik exists to enable customers to unearth the value of their network data. That’s a job that requires the retention of massive volumes of raw data, the ability to instantly dig deep into details, and the flexibility of unconstrained data exploration. No stingy, limited indexes, no fragile BI data cubes. Instead we give you the freedom to perform any ad-hoc query on any subset of your data and the speed to get results in a few seconds or less. We give you fast time-to-value, getting you from sign-up to traffic visibility in fifteen minutes or less — without installing software or deploying massive on-premises machines. And we give you an affordable datastore that you can leverage via REST or SQL APIs for use by 3rd-party systems for DDoS mitigation or business intelligence. So with Kentik Detect you won’t be left looking at just the surface of your network data, wondering what unrealized business value lies buried below.</p> <hr> <p><em>Ready to learn more about Kentik? Read how we handle queries against huge volumes of traffic in this blog post on <a href="https://www.kentik.com/designing-for-database-fairness/">designing for database fairness</a>. Or see for yourself what you can do with Kentik by signing up for a <a href="#signup_dialog">free trial</a>. And if you’re inspired to get involved, <a href="https://www.kentik.com/careers/">we’re hiring!</a></em></p><![CDATA[Cisco Tetration: A Step in the Right Direction]]><![CDATA[Cisco's recently announced Tetration Analytics platform is designed to provide large and medium data centers with pervasive real-time visibility into all aspects of traffic and activity. Analyst Jim Metzler says that the new platform validates the need for a Big Data approach to network analytics as network traffic grows. But will operations teams embrace a hardware-centric platform, or will they be looking for a solution that's highly scalable, supports multiple vendors, doesn't involve large up-front costs, and is easy to configure and use?]]>https://www.kentik.com/blog/cisco-tetration-a-step-in-the-right-directionhttps://www.kentik.com/blog/cisco-tetration-a-step-in-the-right-direction<![CDATA[Jim Metzler]]>Mon, 11 Jul 2016 13:00:26 GMT<p>Data Centers Need Big Data Network Analytics, But as SaaS</p> <p>Cisco recently announced a new data center analytics platform, called Cisco Tetration Analytics, that is designed to resonate with operations teams at medium and large data centers by delivering pervasive real-time visibility across all aspects of data center traffic and activity. 
The announcement is significant for Cisco in part because it creates expectations of data center functionality to which competitors such as HP and Dell/EMC will likely have to respond.</p> <p>Cisco’s CEO Chuck Robbins underscored the importance of the new platform to Cisco by writing a blog post to highlight the announcement. Robbins said that “data centers can be thought of as the brain of a company, where the most critical information and applications operate and run,” but that unfortunately “we don’t know what’s happening inside our data centers.”</p> <img src="//images.ctfassets.net/6yom6slo28h2/67PmlfCa6AYMMIekUeAQo0/9dec0a33b12a6e55ea0c71e66dad7c25/Tetration_diagram-500w.png" alt="Tetration_diagram-500w.png" class="image right" style="max-width: 300px;" /> <p>In its initial release, the Cisco Tetration Analytics platform consists of 36 servers (1RU Cisco UCS C220 Rack Servers) and three Cisco Nexus 9372PQ Switches. Also included will be software sensors that can be deployed on servers to collect telemetry information, and hardware sensors that reside on Cisco’s 9000 Series Switches.</p> <p>To ease adoption, during the initial launch period the platform comes bundled at no extra cost with Tetration Analytics Quick-Start Implementation Services, which Cisco says will “help each customer integrate Cisco Tetration Analytics into the customer’s data center, define the most relevant use cases, and transform the data center operations to make them more efficient and secure.” Cisco hasn’t yet announced pricing for the platform, but given the amount of equipment that it requires, I have to assume that it will be very expensive.</p> <p>As management data increases, legacy approaches will be even less able to provide the required visibility.</p> <p>I thoroughly buy into the need for a Big Data-based approach to data center management. In a recent post I discussed some of the factors driving that need. One of those drivers is that current approaches to network management don’t provide the necessary visibility, largely because they enable only a small amount of management data to be stored for more than a brief period of time. Going forward, global IP traffic is expected to almost triple in volume between 2014 and 2019, driving a correspondingly dramatic increase in the volume of network management data. As management data increases, legacy approaches will be capable of storing an even smaller percentage of it, and will be even less able to provide the required visibility.</p> <p>Beyond storage, we can see from the big iron that Cisco deploys for Tetration that to rapidly pore through all of that network data, and get at the valuable details needed to make timely decisions, you need a lot of processing power.</p> <p>For most organizations, Big Data network analytics will look more like SaaS and less like a hardware stack.</p> <p>Because I see the need for a Big Data-based management platform and I believe that some organizations will want this platform to be on site, I think that Cisco Tetration Analytics is an important step in the right direction. To appeal to operations teams at the majority of medium and large data centers, however, what’s needed is a Big Data platform that supports equipment from all vendors, offers highly scalable processing, doesn’t come with a large up-front cost, and doesn’t require third-party services to configure and use. That all points to a platform that is based in the cloud. 
So while Tetration Analytics is indicative of the industry’s direction, for most organizations the fulfillment of the Big Data network analytics promise will look more like SaaS and less like a hardware stack.</p> <hr> <p><em>Jim Metzler is an independent industry analyst and consultant with a broad background in high-speed data services, network hardware, and network services, including positions as a network manager, product manager, engineering manager, and software engineer.</em></p><![CDATA[Welcoming Cisco to the 2016 Analytics Party]]><![CDATA[Cisco's announcement of Tetration Analytics means that IT and network leaders can no longer ignore the need for Big Data intelligence. In this post, Kentik CEO Avi Freedman explains how that's good for Kentik, which has been pushing hard for this particular industry transition. Cisco's appliance-based approach is distinct from the SaaS option that most customers choose for Kentik Detect, but Tetration's focus on Big Data analytics provides important validation that Kentik is on the right track.]]>https://www.kentik.com/blog/welcoming-cisco-to-the-2016-analytics-partyhttps://www.kentik.com/blog/welcoming-cisco-to-the-2016-analytics-party<![CDATA[Avi Freedman]]>Tue, 05 Jul 2016 13:00:56 GMT<p>Tetration Announcement Validates Big Data Direction</p> <img src="//images.ctfassets.net/6yom6slo28h2/4PxRpaEcV20YgmkYGaoO6O/36895e357236e9129d4ff129b02db334/Bruegel_wedding-500w.png" alt="Bruegel_wedding-500w.png" class="image right" style="max-width: 300px;" /> <p>I’d like to welcome Cisco to the 2016 analytics party. Why? Because while Cisco didn’t start this party, they are a big name on the guest list and their presence means that IT and network leaders can no longer ignore the need for Big Data intelligence. And that is good for those of us who have been tooting our horns for this particular industry transition.</p> <p>For those of you who missed the news, Cisco just announced Tetration Analytics, a full rack appliance meant to collect sensor data from data center infrastructure and analyze it with Big Data and machine learning power. Cisco claims that customers will be able to analyze billions of data records in a second.</p> <p>The Tetration Analytics announcement is a watershed moment in the embrace of Big Data analytics by the IT and network industry. It signals the product mainstreaming of what has largely been the domain of home-grown or purely services-driven open source software projects. Cisco’s seal of approval indicates that Big Data has finally arrived for operational network and infrastructure management, an area long overlooked by the mainstream Big Data analytics sector (not least because reliance on Hadoop put the required performance beyond reach).</p> <p>Compelling Business Case</p> <p>By offering a multi-million dollar rack of integrated hardware and software for data center analytics, Cisco has recognized the demand for a more out-of-the-box solution. The company has clearly calculated that it can offer a compelling business case for high-performance analytics to organizations for whom the DIY or services-based model would have been at least as expensive — and significantly riskier.</p> <p>Architected properly, one platform can satisfy operational, business, performance, and security intelligence.</p> <p>Of course, Tetration Analytics isn’t just about standalone analytics. 
It exists in the context of a multi-billion dollar business segment where Cisco has been investing heavily, including Insieme, AVC, and its take on converged server and data center networking architectures. We believe that architected properly, one platform can satisfy operational, business, performance, and security intelligence in a scalable and open way. So it will be intriguing to see where Cisco takes Tetration.</p> <p>Will Tetration go to the WAN? To the Internet Edge? Given ACI, AVC, and the on-chip performance instrumentation that Cisco has built, will they extend to NPM and APM functionality? We see happy customers of Cariden and Tail-f products, which Cisco has kept open. It will be telling to see how multi-vendor, open, and integrate-able Tetration will become. Or will it remain within a walled garden of Cisco-only infrastructure?</p> <p>Bullish on Big Data</p> <p>Whatever Cisco eventually does with Tetration, we’re very bullish on this development. In part that’s because we’re huge believers in Big Data analytics; it’s in Kentik’s DNA. For years, so many of our peers in the networking industry complained that the visibility provided by single-box or federated appliances was far too slow, shallow, and costly to give them the operational insight they need. We founded Kentik to meet the demand for highly granular intelligence using Internet-scale network data delivered at operational speed.</p> <p>Cisco’s approach provides important validation that we’re on the right track.</p> <p>As we grow, we see dozens of web companies, streaming music providers, digital enterprises, service providers, and many other organizations using Kentik Detect. And we see how excited they become when they can store and query tens-to-hundreds of billions of instantly accessible flow data records and perform ad hoc, multi-dimensional analytics with sub-second response times. Cisco’s approach, which addresses similar requirements for feeds, speeds, and analytics, provides us with an important additional validation that we’re on the right track.</p> <p>Another reason that we welcome Cisco’s announcement is because we’re complementary. In an era where IT is obsessed with public and hybrid clouds, it’s impossible to miss the fact that Cisco has taken a hardware-centric, private cloud approach. While we offer on-premises deployments, the vast majority of our customers opt to consume Kentik Detect as a SaaS offering. They love the fact that they can go from setup to full utilization of a super-powerful network intelligence platform in fifteen minutes — without having to deploy any hardware or software. That said, we’ve already seen demand for architectures like Tetration: strictly in-perimeter and with tightly-coupled analytics.</p> <p>So… Big Data network analytics is real. The party is on. Cisco’s announcement has drawn attention from millions of folks to the very point we’ve been making all along: no organization should settle for decade-old 1RU technology when the power of Big Data analytics — Internet scale at operational speed — is now available for IT operations. Are you ready to join the 2016 analytics party?</p><![CDATA[Kentik Detect for Network Security]]><![CDATA[Network security depends on comprehensive, timely understanding of what’s happening on your network. 
As explained by information security executive and analyst David Monahan, among the key value-adds of Kentik Detect are the ways in which it enables network data to be applied — without add-ons or additional charges — to identify and resolve security issues. Monahan provides two use cases that illustrate how the ability to filter out and/or drill-down on dimensions such as GeoIP and protocol can tip you off to security threats.]]>https://www.kentik.com/blog/protecting-your-network-comprehensive-understandinghttps://www.kentik.com/blog/protecting-your-network-comprehensive-understanding<![CDATA[David Monahan]]>Mon, 27 Jun 2016 13:00:22 GMT<h3 id="protecting-your-network-with-comprehensive-timely-understanding"><em>Protecting Your Network with Comprehensive, Timely Understanding</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/MHIgQBMHIqQSSWmmUsG8Y/422e2a8fa520778e5c5d9efb79f3e2b1/Secure_tower-500w.jpg" alt="Secure_tower-500w.jpg" class="image right" style="max-width: 500px;" /> <p>Network security depends on comprehensive, timely understanding of what’s happening on your network. That requirement dovetails nicely with the primary purpose of Kentik Detect™, which is to enable customers to monitor and analyze network traffic so that they can identify and resolve issues affecting network availability, performance, efficiency, and integrity. The foundation for this work is the network data that is captured and stored by Kentik’s post-Hadoop Big Data backend, Kentik Data Engine™ (KDE), which includes full-resolution flow records (e.g. NetFlow, IPFIX, sFlow), BGP routing tables, GeoIP data, and SNMP. KDE fuses these sources into a unified time-correlated dataset against which customers can run ad-hoc queries that are grouped and filtered on 40+ dimensions, yielding advanced visualizations and detailed tables in seconds. Customers can also access the data via APIs (SQL or REST) that enable tight integration with third-party analytical or mitigation tools.</p> <div class="pullquote left">Kentik Detect enables network data to be applied to enhance network security.</div> <p>Among the key value-adds of Kentik Detect are the ways in which it enables network data to be applied — without add-ons or additional charges — to enhance network security. The platform captures data from networks operating at Terabit scale, makes the data available for querying within seconds of receipt, and can store historical data for months in its SaaS platform. That means that security issues can be identified and explored not only in the present but also historically for as long as you remain a customer.</p> <p>To get a concrete feel for how Kentik Detect applies to security, let’s look at a couple of use cases that Kentik Detect can address. Note that queries created for ad hoc exploration can be converted into alerts that provide proactive notification of future recurrences.</p> <h4 id="detecting-unauthorized-communications">Detecting unauthorized communications</h4> <p>Because Kentik Detect observes all traffic to and from the network, it can easily identify all of the interactions between the monitored networks and the rest of the Internet. You can use this capability to identify compromised hosts and command-and-control communications by setting up queries that break out traffic by protocol and origin (GeoIP). 
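</p> <p>Because KDE can also be queried through a SQL interface, a query of that shape can be sketched directly in SQL. To be clear, this is an illustrative sketch only; the table and column names below are invented stand-ins, not Kentik’s actual schema:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">-- Hypothetical schema: outbound traffic over the last day, broken out
-- by destination country, protocol, and port, largest talkers first
SELECT dst_country, protocol, dst_port,
       SUM(bytes) AS total_bytes
FROM flows
WHERE flow_time > NOW() - INTERVAL '24 hours'
GROUP BY dst_country, protocol, dst_port
ORDER BY total_bytes DESC
LIMIT 25;</code></pre></div> <p>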
Advanced Big Data style visualizations returned from these queries make it easy to see where the traffic is going and on which ports.</p> <div class="pullquote right">GeoIP can be used to show traffic from regions that you know you don't operate in or do business with.</div> <p>Kentik Detect’s GeoIP functionality can be used to show traffic from regions that you know you don’t operate in or do business with, so you can block traffic of suspicious origin. From there you can drill down into traffic from the remaining regions. Internal servers should have a very limited contingent of hosts and domains that they are allowed to communicate with. With a few simple queries you can investigate further to identify hosts that you don’t recognize and may want to designate as out of bounds.</p> <p>A complementary approach is to check the protocols used by your traffic. Once again, servers should communicate using a fairly narrow range of protocols, and this is particularly true for internal servers that communicate with the Internet. Parties attempting external command and control (C2) would typically need to use TLS/SSL or non-standard ports for communications. You can use Kentik Detect to catch traffic using these unexpected protocols, especially traffic going to unexplained hosts, which is a significant indicator of unauthorized C2 communications.</p> <h4 id="identifying-data-exfiltration">Identifying data exfiltration</h4> <p>Data exfiltration involves an external threat actor or a trusted insider trying to move the data off-site. Indicators of data exfiltration include out-of-time-slot communications, excessively long communications, or significantly increased communications. These signs are easier to see if you’ve already blocked compromised hosts that you identified using the techniques discussed above.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4RhPZESjGo08MuaeqAak6A/ecb7ca58e3241f5864313874a067edfe/Data_intruder-300w.jpg" alt="Data_intruder-300w.jpg" class="image right" style="max-width: 300px;" /> <p>If two known good hosts begin communicating out of their normal time ranges, there may be an exfiltration issue. Using time boundaries on a Kentik Detect query, you’ll be able to see when the communications began and ended, and to cross reference odd-hours transfers against data transfer protocols and TLS traffic.</p> <p>If two hosts are engaged in extended or otherwise highly patternistic communications that are out of policy or “the norm,” that can also be indicative of dangerous activity such as command and control, reconnaissance, and/or data exfiltration. The flow records in Kentik Detect are sampled, so the detection pattern is dependent upon the volume of data. In the case of data exfiltration, attackers generally try to avoid tripping alarms that look for data spikes or odd-hours transfers, so they use throttled data transfers at regular intervals until the exfiltration is completed. These approaches actually make it easier for Kentik, because the longer the interactions take place the more data is collected to identify the anomalous behavior. The suspect patterns can be identified through Kentik’s search and reporting tools and then remediated using the existing network infrastructure. Since attackers often use redundant C2 servers, additional host-based remediation may also be necessary.</p> <p>Kentik Detect also makes it easy to see “bursty” or other large-volume communications. 
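</p> <p>As another purely illustrative sketch (with the same caveat that the flows table and its columns are invented, not Kentik’s real schema), a host-level volume comparison against each host’s own trailing average might look like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">-- Hypothetical: flag hosts whose outbound bytes today exceed
-- three times their average over the prior seven days
WITH daily AS (
  SELECT src_addr, flow_time::date AS day, SUM(bytes) AS day_bytes
  FROM flows
  WHERE flow_time > NOW() - INTERVAL '8 days'
  GROUP BY 1, 2
)
SELECT t.src_addr,
       t.day_bytes AS today_bytes,
       AVG(h.day_bytes) AS avg_prior_day_bytes
FROM daily t
JOIN daily h ON h.src_addr = t.src_addr AND h.day &lt; t.day
WHERE t.day = CURRENT_DATE
GROUP BY t.src_addr, t.day_bytes
HAVING t.day_bytes > 3 * AVG(h.day_bytes)
ORDER BY today_bytes DESC;</code></pre></div> <p>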
Users can drill down into oddly large data flows and compare them with historical flows to identify increases, after which investigation can focus on determining why the flows increased. Even analysts who have only been using Kentik Detect for a short time can quickly identify changes in data flows and volumes, kick off remediation workflows, and create new alerts that will trigger for future similar events.</p> <p>Read our solution brief to learn more about <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a>.</p> <hr> <p><em>David Monahan is a senior information security executive with years of experience in physical and information security for Fortune 100 companies, local governments, and small public and private companies.</em></p><![CDATA[Common Ground for CIOs and CISOs]]><![CDATA[CIOs focus on operational issues like network and application performance, uptime, and workflows, while CISOs stress about malware, access control, and data exfiltration. The intersection of these concerns is the network, so it seems evident that CIOs and CISOs should work together instead of clashing over budgets and tools. In this post, network security analyst David Monahan makes the case for finding a solution that can keep both departments continuously and comprehensively well-informed about infrastructure, systems, and applications.]]>https://www.kentik.com/blog/common-ground-for-cios-and-cisoshttps://www.kentik.com/blog/common-ground-for-cios-and-cisos<![CDATA[David Monahan]]>Mon, 20 Jun 2016 13:00:55 GMT<h2 id="cooperation-on-network-analytics-brings-better-business-outcomes"><em>Cooperation on Network Analytics Brings Better Business Outcomes</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/nARYIDKfQGqM6CWKEi4iw/e46aeb136722d4f119d192d7b927ebf8/Teamwork-500w.png" class="image right" style="max-width: 400px" alt="Teamwork" /> <p>Whether your title reads Chief Information Officer or Chief Information Security Officer, there’s an ever-growing list of concerns driving your agenda. For CIOs, the focus is on operational issues such as network and application performance, uptime, and workflows. For CISOs, the stress is about malware, access control, and data exfiltration. For both roles, the list of challenges often seems like more than can be adequately managed with available resources. Though IT budgets are increasing overall, especially for security, the demands on those budgets are also intensifying. So to maximize the impact of every dollar it’s in the interest of CIOs and CISOs — whose roles each contribute to the success of the other — to find common ground wherever they can.</p> <p>At the intersection of CIO and CISO issues and concerns is the network and its underlying infrastructure. These vital pathways for information impact virtually every facet of business operations and security, especially for businesses that are Internet or wide-area network dependent, such as e-commerce companies or enterprises that are geographically distributed.</p> <p>From the CIO perspective, network issues related to bandwidth constraints, latency, congestion, and packet loss are all tightly interconnected. The first three set the stage for the last, ultimately leading to customer frustration, abandoned transactions, and loss of revenue. From the CISO perspective, meanwhile, the network is the gateway to the business for attackers. 
Whether an internal or an external threat actor is involved, over 95% of attacks traverse the network at some point during reconnaissance, data collection, or data exfiltration.</p> <div class="pullquote right">It seems obvious that CIOs and CISOs should work together on operations and security.</div> <p>Given that the network is an obvious point of shared concern, or at least should be, for all CIOs and CISOs, it seems equally obvious that these two roles should be working together to find tools that support both operations and security out of the same console using the same dollars. But having been in security and network operations for over 20 years before becoming an analyst, I’m amazed at how many organizations seem to be at odds over budgets and tools, in some cases getting into severe turf wars that negatively affect the ability of both teams to function optimally.</p> <p>What should CIOs and CISOs be doing instead? Cooperatively finding solutions that can collect and analyze data from the network in as close to real time as possible so that they can keep well-informed — continuously and comprehensively — about their infrastructure, systems, and applications. Today, network analytics are more advanced than ever, drawing on algorithms and tools that are far beyond what was available even five years ago. Both IT and Security operations teams can benefit greatly from the crucial information that this latest generation of analysis solutions can reveal.</p> <div class="pullquote right">Define the attributes and capabilities your teams need to address challenges and threats.</div> <p>It’s important to realize that the available solutions are not by any means all created equal. So to take full advantage of the latest advances you need a clear understanding of your requirements before you step into the ring with vendors. Specifically, you need to know the attributes and capabilities that a solution must have in order to enable your organization to identify, troubleshoot, and isolate issues that affect both operational performance and security. It’s up to CIOs and CISOs to speak with their operations teams to define those requirements.</p> <p>At a minimum, each team should be able to identify the areas in which it’s critical for a network analytics solution to provide detailed, timely insight, such as major traffic sources, network delays, routing, communications patterns (especially changes in ports, sources, destinations, and highs and lows), latency, and transactions. Greater understanding in these areas will improve each team’s ability to find their culprits. Having common tools to achieve that understanding will not only reduce budgetary pressures but also increase workflow productivity and cooperation, making those tools a win-win-win for the company.</p> <hr> <p><em>David Monahan is a senior information security executive with years of experience in physical and information security for Fortune 100 companies, local governments, and small public and private companies.</em></p><![CDATA[Accuracy and Efficiency for Network Security]]><![CDATA[This guest post brings a security perspective to bear on network visibility and analysis. Information security executive and analyst David Monahan underscores the importance of being able to collect and contextualize information in order to protect the network from malicious activity. 
Monahan explores the capabilities needed to support numerous network and security operations use cases, and describes Kentik Detect as a next-generation flow analytics solution with high performance, scalability, and flexibility.]]>https://www.kentik.com/blog/accuracy-and-efficiency-for-network-securityhttps://www.kentik.com/blog/accuracy-and-efficiency-for-network-security<![CDATA[David Monahan]]>Mon, 13 Jun 2016 13:00:21 GMT<h2 id="kentik-delves-deeper-into-your-data-for-detection-and-defense"><em>Kentik delves deeper into your data for detection and defense</em></h2> <p>According to 2015 research reports published by Ponemon, Mandiant, and others, the median pre-detection dwell time for an intruder in a target network is around 200 days. We can all agree that having an undetected intruder for over six months is totally unacceptable, but there seems to be little solid agreement on how to solve the problem. The typical pat answers — “use better security” or “patch your vulnerabilities” or “practice least privilege” — may be sound pieces of advice, but they are clearly not getting us where we need to be. Why? Because they focus on the preventative aspects of security, which, though valuable, are inherently imperfect.</p> <div class="pullquote right">Reducing a network's intruder dwell time is mostly about detection.</div> <p>Truly reducing dwell time isn’t primarily about prevention, it’s about detection. Detection requires visibility and analysis. Visibility requires information, and analysis requires the ability to contextualize the information. Thus, reducing dwell time requires a better ability to capture and contextualize information.</p> <p>Over 95% of IT security attacks traverse some aspect of a network to execute. Whether a LAN, a private WAN, or the Internet, a network functions as the circulatory system for a body of knowledge. To get the data we need, we must tap into that flow. Medical professionals monitor blood pressure and cholesterol to understand the basics of health, but when there is a problem, they have to dig deeper for things like blood-gases, toxins, and pathogens. The same is true for networks. IT runs performance checks on interfaces to understand the basics of network health, but when there is a real problem, they have to gather more data from the traffic on the lines.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4f3hhFjJ44kQseIOUEM8uk/ca981b1b8d3e5eeb3d4157bded934d6e/DE_Main-840w.png" class="image center" style="max-width: 800px" /> <p>One way to gather this data is with Kentik Detect™ (above), a solution designed to solve many visibility issues, supporting both IT and security operations. Kentik Detect collects network telemetry at service provider speed and scale, ingesting and storing terabits of full-resolution network data including NetFlow, Geo-IP, BGP, SNMP, and network performance data. The data is not merely correlated but fully unified to create useful metadata for enhanced context. Kentik Detect’s Big Data engine also provides advanced analysis and alerting capabilities to make operations and security professionals aware of issues and incidents.</p> <p>Kentik Detect doesn’t just provide visibility — with scalability and performance — into communications patterns, protocol usage, top talkers, and other commonly used troubleshooting information. 
It also provides an analysis engine that goes beyond what many other network data capture tools provide, including visualization tools that operate at the level of Big Data analytics.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1I7TZ4X7P20amUCkA82C2a/2f263ea635ae43644ed025281e224fbe/Alerting-500w.png" class="image right" style="max-width: 500px" /> <p>Kentik Detect isn’t yet a DDoS remediation tool, but it can be very useful in identifying network-based attacks such as DDoS. While many tools have an easy time catching volumetric attacks, other attacks such as those targeting specific system resources, protocols, and applications can be far more difficult to identify. These types of attacks often go unnoticed by IT teams because they do not exceed thresholds set to identify a volumetric attack. In cases where the DDoS is a smoke screen for another attack, Kentik Detect provides a single tool for identifying and analyzing both the DDoS and the underlying attack.</p> <div class="pullquote right">Kentik combines cloud scalability and big-data analytics to create agility for detection and defense.</div> <p>Kentik’s value proposition is based on the combination of cloud scalability and big-data analytics, which creates agility for detection and defense of network-borne incidents. Kentik Detect can provide customers with effectively unlimited storage for collected data, so that any event can be researched back to the start of service or start of the event, whichever comes first. The query engine has Google-like response times for ad-hoc queries involving multi-dimensional field grouping and filtering across terabyte-sized datasets, removing one of the most common operational frustrations with query tools. Another feature that adds value is the ease with which a query can be converted into an alert. If operations defines a set of conditions that they want to track, a SQL-based query can then be used to create an ongoing alert. This is a fantastic time-saver for numerous operations scenarios.</p> <p>All in all, Kentik is positioned to offer next-generation network flow analytics with the high performance, scalability, and flexibility needed to support numerous network and security operations use cases.</p> <hr> <p><em>David Monahan is a senior information security executive with years of experience in physical and information security for Fortune 100 companies, local governments, and small public and private companies.</em></p><![CDATA[Why Network Traffic Visibility is Crucial for Managed Service Providers]]><![CDATA[Today's MSPs frequently find themselves without the insights needed to answer customer questions about network performance and security. To maintain customer confidence, MSPs need answers that they can only get by pairing infrastructure visibility with traffic analysis. In this post, guest contributor Alex Hoff of Auvik Networks explains how a solution combining those capabilities enables MSPs to win based on customer service.]]>https://www.kentik.com/blog/saving-the-customer-experiencehttps://www.kentik.com/blog/saving-the-customer-experience<![CDATA[Alex Hoff]]>Mon, 06 Jun 2016 13:00:49 GMT<h3 id="dont-come-up-empty-handed-when-clients-ask-questions"><em>Don’t come up empty-handed when clients ask questions</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/4ggMVoUhFCKkWaIAYAsIaw/29c3dfc32f9fbf4ecbf3f2009185ae0e/question_cloud-400w.jpg" alt="question_cloud-400w.jpg" class="image right" style="max-width: 300px;" />The biggest risk a managed service provider (MSP) faces is a customer question they can’t answer.
“I don’t know” hurts the confidence and trust a client has in them. “I don’t know” weakens the relationship.</p> <p>And yet, “I don’t know” is an answer that many MSPs must currently give to one of the most common client questions of all: Why is my Internet slow?</p> <p>This is where infrastructure <a href="https://www.kentik.com/resources/gartner-market-guide-npmd-2020/">visibility</a> paired with network traffic analysis can save the customer experience.</p> <p><strong>Finding the answers—fast</strong></p> <p>Using a network monitoring tool that focuses on infrastructure devices, you can easily identify bandwidth bottlenecks. For example, you might see that a client’s daily peak Internet usage is 20 Mbps and it’s choking the network. But is that a real capacity problem—or is it an inappropriate use of resources?</p> <p>Flow analytics and reporting can tell you that two IP addresses are streaming Pandora all day, another IP address is torrenting software, and yet another is using Netflix. But those analytics can’t immediately tell you who’s using those IP addresses.</p> <div class="pullquote right">Combining network details with flow analytics gets you answers for your client.</div> <p>Combine the network infrastructure details with flow analytics and in seconds you have answers for your client: Danny and Patricia stream Pandora, Darren is torrenting software. Oh, and Jessica loves her Netflix over lunch hour.</p> <p>Non-IT executives often don’t understand the nuts and bolts of network management, so it can be hard to demonstrate your value in a way that makes sense to them. But output a quick report on the top 10 bandwidth users in their environment—and why they’re the top 10—and suddenly you’re talking their language. They get it.</p> <p><strong>Identifying the right solutions at the right time</strong></p> <p>Having clear insight into network traffic issues also gives you the intel you need to serve as the value-add partner you were hired to be. If you uncover that Pandora and Netflix are the issues, you can recommend implementing some corporate policies about what users can and can’t do on the network.</p> <div class="pullquote right">Clear insight into traffic gives you the intel you need to add value.</div> <p>If there truly is a network capacity issue, then you’ll know that too. Traffic trends over time can point to gear or connections that will need an upgrade—and when they’ll need it.</p> <p>Planning ahead to maintain capacity and smooth network performance? That’s a great customer experience. Advance notice to allow for budgeting of infrastructure upgrades? That’s a great customer experience, too.</p> <p>Conversely, flow data and infrastructure visibility might show you equipment that’s not being used, or not being used efficiently. A configuration change or the removal of gear that’s outlived its usefulness might help squeeze more from the existing network. In that case, you might be able to point to cost-savings or improved performance—more customer experience wins.</p> <p><strong>Shoring up security</strong></p> <p>As data breaches and ransomware attacks make the daily news, security is fast becoming a growth area for managed services. A 2015 report from CompTIA showed that 38% of companies chose to hire an MSP primarily because of security concerns.</p> <div class="pullquote right">38% of companies hire an MSP primarily because of security concerns.</div> <p>Security threats come in many forms. They range from malware and phishing to DoS attacks and backdoor infiltration of a network.
Here again, infrastructure visibility combined with traffic analysis is a great way to problem-solve.</p> <p>For example, an analysis of network traffic might show connections on an unusual port to an unknown host. This could signal the presence of a Trojan hiding somewhere deep inside a client’s network. To find it, you need to map the unusual traffic in question to the device that’s generating it.</p> <p>And when it comes to security, you often don’t have time to play around. In the case of a DoS attack, for instance, you need to know immediately where the attack is coming from so you can use your firewalls and routers to block it before it takes services down. For that, real-time visibility is a must.</p> <p><strong>The bottom line</strong></p> <p>By 2020, it’s expected that customer experience will overtake price and product as the key brand differentiator across all industries. This is already true in managed services.</p> <div class="pullquote right">Competition is intense. Winning comes down to customer service.</div> <p>With MSPs multiplying and increased pressure from value-added resellers (VARs) entering the market, competition is intense. But price is not the way to win. It comes down to customer service.</p> <p>Service that resolves problems quickly with minimal interruption to the customer.</p> <p>Service that identifies and fixes problems before the customer even knows about them.</p> <p>Service that has a prompt, specific, and helpful answer to any question the customer might ask.</p> <p>When it comes to the network, infrastructure <a href="https://www.kentik.com/go/assessment-network-readiness/">visibility</a> and traffic analytics help you deliver.</p> <hr> <p><em>Alex Hoff is vice-president of product management at Auvik Networks. Auvik’s cloud-based RMM system for network infrastructure helps MSPs build a complete and profitable network service.</em></p><![CDATA[Leveraging Big Data for Continuous Improvement]]><![CDATA[Intelligent use of network management data can enable virtually any company to transform itself into a successful digital business. In our third post in this series, we look at areas where traditional network data management approaches are falling short, and we consider how a Big Data platform that provides real-time answers to ad-hoc queries can empower IT organizations and drive continuous improvement in both business and IT operations.]]>https://www.kentik.com/blog/leveraging-big-data-for-continuous-improvementhttps://www.kentik.com/blog/leveraging-big-data-for-continuous-improvement<![CDATA[Jim Metzler]]>Tue, 31 May 2016 13:00:09 GMT<h3 id="the-digital-business-benefits-of-data-driven-network-management"><em>The Digital Business Benefits of Data-driven Network Management</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/5Xj17FdHBSeU0S6OUGoEqk/723d141ee3cd164f706a19deef5362ad/Leverage-504w.png" alt="Leverage-504w.png" class="image right" style="max-width: 300px;" />This post is the third in my series about the intelligent use of network management data to enable virtually any company to transform itself into a digital business. In the first post we discussed the need for a Big Data approach to network management in order to support agile business models and rapid innovation. In the second post we looked at how insights from a Big Data approach to network management enable data-driven network operations.
This time we’ll cover how a Big Data solution that provides real-time answers to ad-hoc queries helps individuals within an organization leverage their expertise to drive continuous improvement in both business and IT operations.</p> <p>The advantages of applying a Big Data approach to business operations have been widely discussed, including in the article “When Big Data Goes Lean” from McKinsey &#x26; Company, which identifies the impact that Big Data is having on manufacturing. “The application of larger data sets, faster computational power, and more advanced analytic techniques,” the authors say, “is spurring progress on a range of lean-management priorities. Sophisticated modeling can help to identify waste, for example, thus empowering workers and opening up new frontiers where lean problem solving can support continuous improvement. Powerful data-driven analytics also can help to solve previously unsolvable (and even unknown) problems that undermine efficiency in complex manufacturing environments: hidden bottlenecks, operational rigidities, and areas of excessive variability. Similarly, the power of data to support improvement efforts in related areas, such as quality and production planning, is growing as companies get better at storing, sharing, integrating, and understanding their data more quickly and easily.”</p> <p><strong>Proactive Analytics for Network Security</strong></p> <p>The potential benefits from what McKinsey describes as “larger data sets, faster computational power, and more advanced analytic techniques” aren’t limited to manufacturing. Network security is another example of an area with lots of potential for data-driven improvement. Until the notorious 2014 breach of Target’s credit card database, security was widely perceived as being just an IT issue. That changed when Target’s profits dropped by almost 50% and the company fired not just their CIO but also their CEO. The belated realization that security was a core issue has since been reinforced on an ongoing basis by reports such as the IBM X-Force Threat Intelligence Report 2016, which pointed out that by 2019 cybercrime will become a 2.1 trillion dollar problem.</p> <div class="pullquote right">Network security is an area with lots of potential for data-driven improvement.</div> <p>Because the potential financial impact is now indisputable, security has become a priority concern for CIOs, CEOs, and boards. Particularly unsettling is the fact that organizations may not recognize when security has been compromised, meaning that months can elapse between the start of a breach and its resolution. The extent to which organizations have unknown security problems was highlighted in the 2015 Cost of Data Breach Study. The study found that malicious attacks can take an average of 256 days to identify while data breaches caused by human error take an average of 158 days. The study also pointed out that the ultimate cost of containing a given security issue is directly linked to how long it takes to identify it.</p> <p>With the stakes so high, it’s clear that network security is an area that can benefit greatly from enabling experts to implement continuous improvement and to solve challenging or hidden problems.
Network operations and network security groups regularly see anomalous traffic patterns and are faced with the question: “Is this traffic shift indicative of an organic change, a misconfiguration, or some form of security breach?” That seemingly simple question is nearly impossible to answer adequately within the confines of traditional network management tools. As discussed earlier in this series, legacy network analytics systems store only relatively small amounts of management data and use siloed GUIs. Without a comprehensive, unified view of network data — enabled by a distributed Big Data architecture to handle traffic at scale — there’s no effective way to share the detailed traffic insights that would allow various network teams to work together to address common problems.</p> <p><strong>Troubleshooting and Preventing Application Issues</strong></p> <div class="pullquote right">When applications are key to running the business, ensuring performance is critical.</div> <p>Another priority issue for a company’s IT organization and business leaders alike is to ensure acceptable application performance. This is particularly critical if the application is key to running the business. As pointed out in the 2015 Application and Services Delivery Handbook, the management task that organizations say is most important for them to get better at in the near term is rapidly identifying the source of degraded application performance.</p> <p>In part, troubleshooting is a high priority because poor performance in customer-facing applications typically leads to reduced revenue. But another reason is that troubleshooting is getting more challenging. Network operators must now manage new application delivery models that increasingly rely on public cloud services and include mobile workers, virtualized infrastructure, and new application types such as cloud-native. The growing complexity and distributed nature of applications means that there are more angles from which network operations needs to look at management data to rapidly solve performance or availability issues. Traditional network management based on summary-only views doesn’t allow access to the details needed to support that sort of analysis.</p> <p><strong>Time for Big Data Network Management</strong></p> <p>As network complexity grows, it’s increasingly evident that only a Big Data solution can cope. But as pointed out in our second post, not all Big Data-based analytics platforms are created equal. Systems developed for business intelligence (BI) analytics rely on batch processing and may require hours to run a single query. To be effective for network operations and security, however, an analytics solution must enable users to get a response to an individual query in seconds. Timely answers enable network operations and network security personnel to combine their technical expertise and institutional knowledge of their organization’s infrastructure with insight gained from rapid, detailed, iterative querying. This combination is the formula for solving application and availability problems faster.</p>
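<p>To make that combination concrete, here is a minimal sketch of an iterative, ad-hoc drill-down. It is illustrative only: the connection string, table, and column names are placeholders rather than any particular product’s schema. The point is the workflow, in which each answer immediately shapes the next question:</p> <div class="gatsby-highlight" data-language="php"><pre class="language-php"><code class="language-php"><?php
// Illustrative sketch only: placeholder schema, PostgreSQL-style interface.
$db = pg_connect('host=flowdb.example.com dbname=flows user=ops password=secret');

// Step 1: which destination ports spiked in the last 15 minutes?
$ports = pg_query($db, "
    SELECT dst_port, SUM(bytes) AS total_bytes
    FROM flow_records
    WHERE start_time > now() - interval '15 minutes'
    GROUP BY dst_port
    ORDER BY total_bytes DESC
    LIMIT 5");
$suspect = pg_fetch_assoc($ports);

// Step 2: for the top suspect port, which sources are driving the traffic?
$talkers = pg_query_params($db, "
    SELECT src_ip, SUM(bytes) AS total_bytes
    FROM flow_records
    WHERE start_time > now() - interval '15 minutes'
      AND dst_port = $1
    GROUP BY src_ip
    ORDER BY total_bytes DESC
    LIMIT 10", array($suspect['dst_port']));

while ($row = pg_fetch_assoc($talkers)) {
    printf("%s sent %d bytes to port %s\n",
        $row['src_ip'], $row['total_bytes'], $suspect['dst_port']);
}
pg_close($db);
?></code></pre></div> <p>When each of those queries returns in seconds rather than hours, the loop runs as fast as the engineer can think, which is exactly the pairing of human expertise and rapid querying described above.</p>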
<p>Ideally, these teams will use these deep analytical capabilities to proactively examine anomalies that point to sub-optimal conditions and then fix those issues before they impact users or compromise security.</p> <div class="pullquote right">Instead of continuing with traditional approaches, IT organizations should empower workers with comprehensive, real-time data-driven analytics.</div> <p>Einstein is often credited with saying that insanity can be defined as doing the same thing over and over again while expecting a different result. The traditional approach to combatting security incidents has led to the growing financial impact of cybercrime, which is expected to be a 2.1 trillion dollar problem by 2019. The traditional approach to troubleshooting degraded application performance has led to it being consistently identified as the area in which network organizations most need near-term improvement. While continuing with traditional approaches and expecting to get notably better at security or troubleshooting may not be insane, it does seem foolhardy. Instead, IT organizations should adopt a Big Data network management solution that empowers workers with comprehensive, real-time data-driven analytics. By leveraging that power with their own expertise, network teams will be able to seek continuous improvement and to address problems that they haven’t previously been able to solve or even see.</p><![CDATA[Peering for the Win]]><![CDATA[Traffic can get from anywhere to anywhere on the Internet, but that doesn’t mean all networks are directly connected. Instead, each network operator chooses the networks with which to connect. Both business and technical considerations are involved, and the ability to identify prime candidates for peering or transit offers significant competitive advantages. In this post we look at the benefits of intelligent interconnects and how networks can find the best peers to connect with.]]>https://www.kentik.com/blog/peering-for-the-winhttps://www.kentik.com/blog/peering-for-the-win<![CDATA[Jim Meehan]]>Mon, 23 May 2016 13:00:37 GMT<h3 id="finding-the-business-benefits-of-intelligent-interconnects"><em>Finding the Business Benefits of Intelligent Interconnects</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/7Hrtcmq0g0qIcoi2qWGmQI/c9da2ba0b5d89176094f7ce7c22e4a0f/Internet_BGP_map-504w.png" alt="Internet_BGP_map-504w.png" class="image right" style="max-width: 500px;" /> <p>It’s common knowledge that “the Internet” is actually a set of networks belonging to a diverse range of independent organizations such as content providers, ISPs, corporations, and universities. These networks create the Internet by interconnecting. Without these interconnections there would be no path for data originating in one network to travel to a destination in another network. But the fact that traffic can get from any Internet location to any other doesn’t mean that all networks are directly connected. Instead, each network operator chooses which other networks to connect with. With that in mind, it’s worth thinking a bit about the business and technical considerations involved when networks interconnect.</p> <h4 id="transit-and-peering">Transit and Peering</h4> <p>The first thing to know about a relationship between two networks is the form of interconnection. Two main types are common:</p> <ul> <li><strong>Transit</strong>: The networks interconnect so that one (usually an ISP, telco, or carrier) can provide reachability to the entire Internet for the other, which is typically an “endpoint” entity (e.g.
enterprise, content or application provider, residential broadband provider, etc.). There is almost always an accompanying commercial relationship, meaning that the endpoint entity pays the ISP to carry traffic to and from the rest of the Internet.</li> <li><strong>Peering</strong>: The networks interconnect to exchange only traffic that originates or terminates within their own networks (or perhaps the networks of their direct customers). Peering is usually between — not surprisingly — peers, meaning entities that are comparable. A wholesale carrier whose primary business is selling transit is not going to agree to peer with an endpoint content provider who would typically be a customer.</li> </ul> <div class="pullquote right">Peering can offer both business and technical advantages over transit.</div> <p>Compared to transit connections, peering can be advantageous to networks on both business and technical levels. Let’s suppose, for example, that you’ve been going through network B to get traffic to and from network C, and you then discover that it would be possible to connect with network C directly. Why would you want to do so? The benefits of peering typically boil down to three primary areas:</p> <ul> <li><strong>Reduced Cost</strong>: Peering shifts traffic between the two parties onto a direct link between their two networks. Both parties benefit because now neither of them has to pay a “middleman” ISP to carry that traffic. So peering with network C would reduce your costs by eliminating the transit fees that you were paying network B to exchange traffic with network C.</li> <li><strong>Improved Performance</strong>: Bypassing intervening networks (like network B) reduces the number of hops between the two networks. That means less latency and fewer potential points of failure.</li> <li><strong>Resiliency</strong>: Peering links also act as a redundant path between the two networks. If the peering link fails, traffic can still flow via transit, and vice versa. A residential broadband provider might peer with large content networks (Google, Facebook, Netflix, etc.). Their users could still reach those top destinations via peering links if the transit links were congested — by a large DDoS attack, for example (assuming that the attack traffic doesn’t originate from within the content providers’ networks).</li> </ul> <h4 id="applying-network-analytics">Applying Network Analytics</h4> <p>So now we understand why you might want to peer. But it doesn’t make sense to peer with just anyone; you have to find a network with whom peering would be mutually beneficial. How do you do that?</p> <div class="pullquote left">A traffic analytics system that correlates flow with BGP can reveal the best opportunities for peering.</div> <p>It turns out that when network flow records (e.g. NetFlow, IPFIX, sFlow, etc.) are correlated with BGP routing data in a datastore that’s optimized for traffic analytics, it’s relatively easy to discover the best peering opportunities for your network.</p>
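<p>As a minimal sketch of what that discovery looks like in practice, the script below ranks remote ASNs by traffic volume. The connection string, table, and column names are placeholders rather than an actual schema; the point is that once flow records are enriched with BGP-derived ASNs, finding peering candidates reduces to a single group-by query:</p> <div class="gatsby-highlight" data-language="php"><pre class="language-php"><code class="language-php"><?php
// Illustrative sketch only (placeholder schema): rank the remote ASNs that
// terminate the most of your outbound traffic, i.e., prime peering candidates.
$db = pg_connect('host=flowdb.example.com dbname=flows user=neteng password=secret');
$result = pg_query($db, "
    SELECT dst_as AS remote_asn,
           SUM(bytes) * 8.0 / 86400 AS avg_bps   -- average bps over one day
    FROM flow_records
    WHERE start_time > now() - interval '1 day'
    GROUP BY dst_as
    ORDER BY avg_bps DESC
    LIMIT 25");
while ($row = pg_fetch_assoc($result)) {
    printf("ASN %s: %.0f bps average outbound\n", $row['remote_asn'], $row['avg_bps']);
}
pg_close($db);
?></code></pre></div> <p>A real analysis would run the same aggregation on source ASNs for inbound traffic and exclude the ASNs you already interconnect with, but the shape of the question stays the same.</p>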
<p>Presented within a well-designed query and visualization interface, BGP analytics will help you see your prime peering candidates, which are the remote ASNs that terminate or originate the majority of the traffic flowing into and out of your network.</p> <img src="//images.ctfassets.net/6yom6slo28h2/423TPuiqLegc04k4moCsWw/d316768d939edd72ab9776a8f0f00290/PA-Main-no-navbar-840w.png" alt="PA-Main-no-navbar-840w.png" class="image center" style="max-width: 840px;" /> <p>An added benefit of applying correlated Flow-BGP analytics is that you can find additional insights that don’t fall squarely into the category of peering:</p> <ul> <li><strong>Transit Planning:</strong> Analytics might reveal that much of your traffic through an existing transit provider is actually being handed off to another transit provider before reaching its final destination. If the second provider sells transit for less, making a direct transit interconnection could cut your costs. It would also ensure that you avoid the relatively common problem of congestion-related disputes between “top tier” and “low cost” providers over who should pay for additional capacity at interconnection points.</li> <li><strong>Uncovering Sales Opportunities:</strong> If you’re a transit provider, correlated Flow-BGP analytics can uncover leads for your sales team. Looking at top destination (or pass-through) ASNs that are not currently direct connections can reveal entities that receive a significant volume of traffic from your network, and who could benefit (in terms of cost or performance) by buying some transit from you.</li> <li><strong>Customer Cost Analytics:</strong> Transit providers can get a leg up on the competition by better understanding the routes taken by their customers’ traffic. There’s a lot more room to negotiate with a potential customer whose traffic gets delivered mostly via no-cost domestic peering links than with a customer who has a lot of traffic being delivered via high-cost international transit.</li> </ul> <h4 id="big-data-big-benefits">Big Data, Big Benefits</h4> <div class="pullquote right">The key is to recognize that flow data plus BGP data makes Big Data.</div> <p>The common thread of the examples above is that better understanding — based on flow and BGP analytics — leads to better business and technical outcomes. And the key to better understanding is to recognize that flow data plus BGP data makes Big Data. It’s not uncommon for a multi-homed network to generate billions of flow records per day. Until recently, however, traffic analysis solutions were severely limited in compute and storage capacity. That meant that they could provide summary reports, but not the kind of deep, path-aware analyses that offer the insights outlined above. Only a big data solution can handle the required data at the required scale.</p> <p>Kentik has introduced the industry’s first purpose-built big data engine, built around a distributed post-Hadoop core, for network traffic and BGP analytics.
Offered as a cost-effective SaaS, Kentik Detect includes key features such as real-time ad-hoc querying, alerting and DDoS detection, and intuitive, multi-dimensional flow visualizations.</p> <p>Learn more about using Kentik for <a href="https://www.kentik.com/solutions/usecase/improve-peering-interconnection/" title="network peering and interconnection with Kentik">network peering and interconnection</a>.</p> <p>If you’re ready to start taking advantage right now of the insights offered by big data-based network intelligence, <a href="#demo_dialog">schedule a demo</a> or <a href="#signup_dialog">sign up for a free trial</a>.</p><![CDATA[Kentik APIs for Customer Portal Integration]]><![CDATA[The network data collected by Kentik Detect isn't limited to portal-only access; it can also be queried via SQL client or using Kentik's RESTful APIs. In this how-to, we look at how service providers can use our Data Explorer API to integrate traffic graphs into a customer portal, creating added-value content that can differentiate a provider from its competitors while keeping customers committed and engaged.]]>https://www.kentik.com/blog/kentik-apis-for-customer-portal-integrationhttps://www.kentik.com/blog/kentik-apis-for-customer-portal-integration<![CDATA[Eric Graham]]>Mon, 16 May 2016 13:00:26 GMT<h3 id="using-the-data-explorer-api-for-added-value-content"><em>Using the Data Explorer API for Added-value Content</em></h3> <p>Kentik Detect™ is a powerful solution that ingests and stores large volumes of network data on a per-device, per-customer basis. The data is stored in the Kentik Data Engine™, a timeseries database that unifies flow records (NetFlow v5/9, IPFIX, sFlow) with BGP, Geo-IP, and SNMP. Information stored in KDE can be utilized for a variety of use cases, including network traffic engineering, Distributed Denial of Service (DDoS) detection, BGP traffic analytics, network planning, and infrastructure security monitoring. Kentik Detect subscribers have full access to this data via the easy-to-use Kentik Detect portal as well as any PostgreSQL-capable client. In addition, one of the unique features of Kentik Detect is the ability to access stored data directly through one of Kentik’s open RESTful APIs.</p> <p>Kentik APIs enable programmatic access to KDE-stored data for a variety of purposes, including integration with existing network monitoring tools, backend datastore integration, and augmentation of customer portals. In this piece we’ll focus specifically on that last use case. Service providers may use a visually pleasing customer portal as an upsell to existing services, a way to attract new customers, or a means to reduce customer churn. The data stored in KDE can serve as the basis for graphs and tables that represent customer network traffic, and service providers can draw on that data to provide an interesting, attractive enhancement to their customers’ portals. The key to making that work is Kentik’s Data Explorer API.</p> <p>Note that what follows here is not intended as an HTML or PHP programming guide; the reader is expected to understand Web server security, API authentication, and customer portal development.
With that in mind, this document can serve as a general framework for understanding the different options, arrays, and general structure that should be used with the Data Explorer API.</p> <p><strong>About the Data Explorer API</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/6JU6qWN4wo4OmEO4wCcYuU/9c40995cd13b9b33c3987de03f8581f1/DE-Main-521w.jpg" class="image center" style="max-width: 521px" /> <p>The Data Explorer API models the functionality of the Data Explorer in the Kentik portal (pictured at right), allowing data in the KDE to be queried and results to be returned in one of three forms:</p> <ul> <li><em>Timeseries graph:</em> the data for an image (JPG, PDF, PNG, or SVG) of a graph that is a visualization of query results, similar to what is seen in the Data Explorer.</li> <li><em>Timeseries data:</em> the data used to generate time-series graphs in Data Explorer.</li> <li><em>Top-X data:</em> raw JSON, like the data used to generate the tables that are displayed in Data Explorer.</li> </ul> <p>The API supports the same querying options that are available in the Data Explorer, including group-by dimension, metric, time-range, filters, and devices. Service providers can also choose how customer data is displayed, including options such as stacked or line view, and whether history or 95th percentile is plotted.</p> <p>The data associated with individual customers can be differentiated using tags that are defined either in the Admin section of the portal (see <a href="https://kb.kentik.com/v3/Cb04.htm#Cb04-Tag_Field_Definitions">Tag Settings</a>) or using the <a href="https://kb.kentik.com/v0/Ec07.htm#Ec07-Tag_API">Tag API</a> in the Kentik V1 APIs. Tags can be defined based on a number of configuration parameters, including interface description (regex can be used to pull a portion of the name), IP addresses and subnets, device name, and ASN. Tags are applied as network data is ingested into KDE, with source or destination flow records each evaluated independently for matches. In the Data Explorer API, the tags to match in the query are passed in as filter settings.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5z6gnAxvAQWQ0Myys4OE66/cbf086d70ebf98b3601076c58622c8d3/Example-Customer-Portal-400w.png" class="image center" style="max-width: 400px" /> <p><strong>Laying the foundation</strong></p> <p>In the explanation below we’ll be looking at using the Data Explorer API to generate timeseries graphs with PHP. Before we dig into the details of the PHP code that addresses the API, let’s look at the building blocks that you’ll need to put in place:</p> <ul> <li><strong>Tags</strong>: In the Kentik Detect portal, go to the Add Tags page (Admin » Tags) to create tags that can be used to distinguish between customers.</li> <li><strong>Test directory</strong>: Create a new folder on the Web server (e.g. http://your_domain.com/test/) and be sure to check ownership and permissions.</li> <li><strong>Script file</strong>: Paste the PHP script covered in the next section into a file called “welcome.php” and put it into the test directory.</li> <li><strong>Customer login page</strong>: Create a basic customer login page that includes a request for the company name. This will define the customer tag passed in the filterValue option of the API’s query object so that the traffic of the requesting customer can be distinguished from the traffic of other customers.
To differentiate between inbound and outbound customer traffic, pass tags in the value of the src_flow_tags and/or dst_flow_tags filterField (see <a href="https://kb.kentik.com/Ec02.htm#Ec02-filterSettings_Object">filterSettings Object</a>).</li> <li><strong>HTML form</strong>: On the page that customers go to after logging in, create an HTML form to request information from your customer. The group-by dimensions, metrics, units, and lookback_seconds are typical options available for customer selection. To determine what metrics and units to include, reference the <a href="https://kb.kentik.com/Ec02.htm#Ec02-query_Object">query object documentation</a> defined in the Kentik Knowledge Base.</li> </ul> <p>The following example shows a basic HTML form for capturing a query request in a customer portal. This simple form would allow a customer to select the time-range to cover in the query, the dimension to plot in the graph, and the metric (units) in which the traffic should be measured:</p> <div class="gatsby-highlight" data-language="php"><pre class="language-php"><code class="language-php"><div id="section">
<b id="welcome">Enter a query:</b>
<form action="welcome.php" method="post">
<input type="text" name="company" readonly value="<?php echo $_POST['company']; ?>" /><br>
Choose a time-range:<br>
<select name="TimeRange">
<option value="900">15 minutes</option>
<option value="3600">1 hour</option>
<option value="21600">6 hours</option>
<option value="86400">1 day</option>
<option value="604800">1 week</option>
</select><br>
Choose a dimension:<br>
<select name="ReportAttribute">
<option value="Traffic">Total Traffic</option>
<option value="Geography_src">Source Country</option>
<option value="src_geo_city">Source City</option>
<option value="ASTopTalkers">AS Top Talkers</option>
<option value="PortPortTalkers">Top L4 Ports</option>
<option value="TopFlowsIP">Top IP Flows</option>
<option value="IP_src">Top Source IPs</option>
<option value="IP_dst">Top Destination IPs</option>
</select><br>
Choose a metric:<br>
<select name="Units">
<option value="bytes">Bits Per Second</option>
<option value="packets">Packets Per Second</option>
<option value="unique_src_ip">Unique Source IPs</option>
</select><br>
Submit your query:<br>
<input type="submit">
</form>
</div></code></pre></div> <p><strong>Scripting the request</strong></p> <p>Now we’re ready to look at code that handles the form input, makes a request to the KDE via the timeSeriesGraph endpoint of the Data Explorer API, and returns (in an HTTP response body) the binary data for an image of a graph. A script to pass REST API parameters can be in PHP, JavaScript, or Python.</p> <p>For basic testing and understanding you can use the PHP script provided below. We’ll look at the script in detail in the following section. Note that the API token and API email should be hidden from the customer, which is not done in this example.</p>
The script uses curl_setopt to define the HTTP POST request and format it in JSON:</p> <div class="gatsby-highlight" data-language="php"><pre class="language-php"><code class="language-php"><span class="token php language-php"><span class="token delimiter important">&lt;?php</span> <span class="token variable">$ch</span> <span class="token operator">=</span> <span class="token function">curl_init</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token variable">$fields</span> <span class="token operator">=</span> <span class="token keyword">array</span><span class="token punctuation">(</span> <span class="token string single-quoted-string">'imageType'</span><span class="token operator">=></span><span class="token string single-quoted-string">'image/svg+xml'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'width'</span><span class="token operator">=></span><span class="token string single-quoted-string">'1200'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'query'</span><span class="token operator">=></span><span class="token keyword">array</span><span class="token punctuation">(</span> <span class="token string single-quoted-string">'device_name'</span><span class="token operator">=></span><span class="token string single-quoted-string">'all_devices'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'metric'</span><span class="token operator">=></span><span class="token function">urlencode</span><span class="token punctuation">(</span><span class="token variable">$_POST</span><span class="token punctuation">[</span><span class="token string single-quoted-string">'ReportAttribute'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'units'</span><span class="token operator">=></span><span class="token function">urlencode</span><span class="token punctuation">(</span><span class="token variable">$_POST</span><span class="token punctuation">[</span><span class="token string single-quoted-string">'Units'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'time_type'</span><span class="token operator">=></span><span class="token string single-quoted-string">'relative'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'lookback_seconds'</span><span class="token operator">=></span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword type-casting">int</span><span class="token punctuation">)</span><span class="token variable">$_POST</span><span class="token punctuation">[</span><span class="token string single-quoted-string">'TimeRange'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'query_title'</span><span class="token operator">=></span><span class="token punctuation">(</span><span class="token variable">$_POST</span><span class="token punctuation">[</span><span class="token string single-quoted-string">'ReportAttribute'</span><span class="token punctuation">]</span><span class="token 
punctuation">)</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'sub_title'</span><span class="token operator">=></span><span class="token string single-quoted-string">'outbound'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filterSettings'</span><span class="token operator">=></span><span class="token keyword">array</span><span class="token punctuation">(</span> <span class="token string single-quoted-string">'connector'</span><span class="token operator">=></span><span class="token string single-quoted-string">'All'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'custom'</span><span class="token operator">=></span><span class="token string single-quoted-string">'false'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filterString'</span><span class="token operator">=></span><span class="token string single-quoted-string">''</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filterGroups'</span><span class="token operator">=></span><span class="token keyword">array</span><span class="token punctuation">(</span><span class="token punctuation">[</span> <span class="token string single-quoted-string">'connector'</span><span class="token operator">=></span><span class="token string single-quoted-string">'All'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filterString'</span><span class="token operator">=></span><span class="token string single-quoted-string">''</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filters'</span><span class="token operator">=></span><span class="token keyword">array</span><span class="token punctuation">(</span><span class="token punctuation">[</span> <span class="token string single-quoted-string">'filterField'</span><span class="token operator">=></span><span class="token string single-quoted-string">'src_flow_tags'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'operator'</span><span class="token operator">=></span><span class="token string single-quoted-string">'ILIKE'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filterValue'</span><span class="token operator">=></span><span class="token punctuation">(</span><span class="token variable">$_POST</span><span class="token punctuation">[</span><span class="token string single-quoted-string">'company'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token punctuation">)</span> <span class="token punctuation">)</span> <span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token variable">$header</span> <span class="token operator">=</span> <span class="token keyword">array</span><span class="token punctuation">(</span><span class="token string single-quoted-string">'Content-Type: application/json'</span><span class="token punctuation">,</span><span class="token string single-quoted-string">'X-CH-Auth-Api-Token:'</span><span class="token punctuation">,</span><span class="token string 
single-quoted-string">'X-CH-Auth-Email: '</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token variable">$data_string</span> <span class="token operator">=</span> <span class="token function">json_encode</span><span class="token punctuation">(</span><span class="token variable">$fields</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token variable">$postvars</span> <span class="token operator">=</span> <span class="token string single-quoted-string">''</span><span class="token punctuation">;</span> <span class="token variable">$url</span> <span class="token operator">=</span> <span class="token string double-quoted-string">"https://api.kentik.com/api/v4/dataExplorer/timeSeriesGraph"</span><span class="token punctuation">;</span> <span class="token function">curl_setopt</span><span class="token punctuation">(</span><span class="token variable">$ch</span><span class="token punctuation">,</span> <span class="token constant">CURLOPT_CUSTOMREQUEST</span><span class="token punctuation">,</span> <span class="token string double-quoted-string">"POST"</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">curl_setopt</span><span class="token punctuation">(</span><span class="token variable">$ch</span><span class="token punctuation">,</span> <span class="token constant">CURLOPT_HTTPHEADER</span><span class="token punctuation">,</span> <span class="token variable">$header</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">curl_setopt</span><span class="token punctuation">(</span><span class="token variable">$ch</span><span class="token punctuation">,</span> <span class="token constant">CURLOPT_URL</span><span class="token punctuation">,</span> <span class="token variable">$url</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">curl_setopt</span><span class="token punctuation">(</span><span class="token variable">$ch</span><span class="token punctuation">,</span> <span class="token constant">CURLOPT_POSTFIELDS</span><span class="token punctuation">,</span> <span class="token variable">$data_string</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">curl_setopt</span><span class="token punctuation">(</span><span class="token variable">$ch</span><span class="token punctuation">,</span> <span class="token constant">CURLOPT_RETURNTRANSFER</span><span class="token punctuation">,</span> <span class="token constant boolean">true</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">curl_setopt</span><span class="token punctuation">(</span><span class="token variable">$ch</span><span class="token punctuation">,</span> <span class="token constant">CURLOPT_CONNECTTIMEOUT</span><span class="token punctuation">,</span> <span class="token number">3</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">curl_setopt</span><span class="token punctuation">(</span><span class="token variable">$ch</span><span class="token punctuation">,</span> <span class="token constant">CURLOPT_TIMEOUT</span><span class="token punctuation">,</span> <span class="token number">20</span><span class="token punctuation">)</span><span class="token 
;</span>
punctuation">;</span> <span class="token variable">$image</span> <span class="token operator">=</span> <span class="token function">curl_exec</span><span class="token punctuation">(</span><span class="token variable">$ch</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">curl_close</span> <span class="token punctuation">(</span><span class="token variable">$ch</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token function">print_r</span><span class="token punctuation">(</span><span class="token variable">$image</span><span class="token punctuation">)</span><span class="token punctuation">;</span> <span class="token delimiter important">?></span></span></code></pre></div> <p>The script above can be modified to look at inbound traffic instead by making the following two changes, as sketched below:</p> <ul> <li>Change sub_title to “inbound.”</li> <li>Change filterField to “dst_flow_tags.”</li> </ul>
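;</span>
<p>For reference, here’s a minimal sketch of what the $fields array would look like after those two changes; everything else in the script, including the form field names (ReportAttribute, Units, TimeRange, company), stays exactly the same:</p> <div class="gatsby-highlight" data-language="php"><pre class="language-php"><code class="language-php">// Inbound variant of the $fields array: only two values differ
// from the outbound script shown above.
$fields = array(
    'imageType'=>'image/svg+xml',
    'width'=>'1200',
    'query'=>array(
        'device_name'=>'all_devices',
        'metric'=>urlencode($_POST['ReportAttribute']),
        'units'=>urlencode($_POST['Units']),
        'time_type'=>'relative',
        'lookback_seconds'=>((int)$_POST['TimeRange']),
        'query_title'=>($_POST['ReportAttribute']),
        'sub_title'=>'inbound'),                // change #1: was 'outbound'
    'filterSettings'=>array(
        'connector'=>'All',
        'custom'=>'false',
        'filterString'=>'',
        'filterGroups'=>array([
            'connector'=>'All',
            'filterString'=>'',
            'filters'=>array([
                'filterField'=>'dst_flow_tags', // change #2: was 'src_flow_tags'
                'operator'=>'ILIKE',
                'filterValue'=>($_POST['company'])
            ])
        ])
    )
);</code></pre></div>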
class="token string single-quoted-string">'time_type'</span><span class="token operator">=></span><span class="token string single-quoted-string">'relative'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'lookback_seconds'</span><span class="token operator">=></span><span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token keyword type-casting">int</span><span class="token punctuation">)</span><span class="token variable">$_POST</span><span class="token punctuation">[</span><span class="token string single-quoted-string">'TimeRange'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'query_title'</span><span class="token operator">=></span><span class="token punctuation">(</span><span class="token variable">$_POST</span><span class="token punctuation">[</span><span class="token string single-quoted-string">'ReportAttribute'</span><span class="token punctuation">]</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'sub_title'</span><span class="token operator">=></span><span class="token string single-quoted-string">'outbound'</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filterSettings'</span><span class="token operator">=></span><span class="token keyword">array</span><span class="token punctuation">(</span><span class="token operator">></span> <span class="token string single-quoted-string">'connector'</span><span class="token operator">=></span><span class="token string single-quoted-string">'All'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'custom'</span><span class="token operator">=></span><span class="token string single-quoted-string">'false'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filterString'</span><span class="token operator">=></span><span class="token string single-quoted-string">''</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filterGroups'</span><span class="token operator">=></span><span class="token keyword">array</span><span class="token punctuation">(</span><span class="token punctuation">[</span> <span class="token string single-quoted-string">'connector'</span><span class="token operator">=></span><span class="token string single-quoted-string">'All'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filterString'</span><span class="token operator">=></span><span class="token string single-quoted-string">''</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'filters'</span><span class="token operator">=></span><span class="token keyword">array</span><span class="token punctuation">(</span><span class="token punctuation">[</span> <span class="token string single-quoted-string">'filterField'</span><span class="token operator">=></span><span class="token string single-quoted-string">'src_flow_tags'</span><span class="token punctuation">,</span> <span class="token string single-quoted-string">'operator'</span><span class="token operator">=></span><span class="token string single-quoted-string">'ILIKE'</span><span class="token punctuation">,</span> 
<span class="token string single-quoted-string">'filterValue'</span><span class="token operator">=></span><span class="token punctuation">(</span><span class="token variable">$_POST</span><span class="token punctuation">[</span><span class="token string single-quoted-string">'company'</span><span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token punctuation">]</span><span class="token punctuation">)</span> <span class="token punctuation">)</span> <span class="token punctuation">)</span></code></pre></div> <p>The query object is one of the most important objects because it tells the interpreter what information the customer wants to see graphed. The query object includes a variety of fields whose values can either be preset by the provider or presented to the customer for input in the HTML form:</p> <ul> <li><em>device_name</em>: in most cases this will be all_devices (routers and hosts), but it could also be a comma-delimited list of individual devices.</li> <li><em>metric</em>: this represents the group-by dimension and wood most likely be customer-selected.</li> <li><em>units</em>: another value that would most likely be customer-selected. Options include bps, pps, flows/s, unique-source-ips, and unique-destination-ips. If the service provider is using the host agent (host-nprobe-basic), other units can be included such as latency, retransmits, out-of-order packets, and fragment data.</li> <li><em>time_type</em>: the time range can be a fixed at a specified time or relative to the current time.</li> <li><em>lookback_seconds</em>: If time_type is relative, service providers can define the duration of the time-range covered by the query.</li> <li><em>viz_type</em>: the graph type, either stacked area chart (default) or line chart.</li> <li><em>show_overlay</em>: a Boolean that specifies whether a history line is included in the graph.</li> <li><em>filterField</em>: part of filterSettings, this field defines the customer tag that enable differentiation between customers. If service providers want to show both inbound and outbound traffic, two separate REST calls will be needed to reference src_flow_tags and dst_flow_tags.</li> <li><em>filterValue</em>: the provider’s unique customer identifier, taken from the customer’s portal login.</li> </ul> <div class="gatsby-highlight" data-language="php"><pre class="language-php"><code class="language-php"> <span class="token variable">$header</span> <span class="token operator">=</span> <span class="token keyword">array</span><span class="token punctuation">(</span><span class="token string single-quoted-string">'Content-Type: application/json'</span><span class="token punctuation">,</span><span class="token string single-quoted-string">'X-CH-Auth-Api-Token:'</span><span class="token punctuation">,</span><span class="token string single-quoted-string">'X-CH-Auth-Email: '</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre></div> <p>The HTTP header needs to include the API token and API-email, both of which can be found on the Kentik portal for your (provider) user account. 
<div class="gatsby-highlight" data-language="php"><pre class="language-php"><code class="language-php"> <span class="token variable">$data_string</span> <span class="token operator">=</span> <span class="token function">json_encode</span><span class="token punctuation">(</span><span class="token variable">$fields</span><span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre></div> <p>The body of the REST API call needs to be JSON-encoded.</p> <div class="gatsby-highlight" data-language="php"><pre class="language-php"><code class="language-php"> <span class="token variable">$url</span> <span class="token operator">=</span> <span class="token string double-quoted-string">"https://api.kentik.com/api/v4/dataExplorer/timeSeriesGraph"</span><span class="token punctuation">;</span></code></pre></div> <p>The URL string passes the path for the REST call, including the endpoint, which in this case is set to timeSeriesGraph to return image file data. As mentioned above, the other available endpoints are timeSeriesData and topXData.</p> <p><strong>Final results</strong></p> <p>When the customer submits the query from a customer portal that uses the example PHP above, a JSON-encoded HTTP POST message will be generated and sent to the KDE via the Data Explorer API, and an SVG image will be returned on a separate HTML page similar to the following sample output, which shows a line chart of top outbound flows by Layer 4 port (port-to-port talkers) as measured in bits/second:</p> <img src="//images.ctfassets.net/6yom6slo28h2/24f8H1f7Rq4cCagssOs60E/c8c805f62036dda75e28b8a74dde46e7/outbound_chart-840w.png" class="image center" style="max-width: 840px" alt="Line chart of top outbound flows by Layer 4 port" /> <p>Part of the beauty of the Data Explorer API is that it corresponds closely to the Data Explorer itself, so you can play with options in the portal to see the effect of different parameter values in the API. The following image, for example, shows top inbound flows by Layer 4 port as measured in bits/second, this time as a stacked area chart that includes a dashed line for history:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3Nt4oDaz5C8KM0yiqsM662/31602dac506ad7e80440be7c85ded29b/inbound_chart-840w.png" class="image center" style="max-width: 840px" alt="Stacked area chart of top inbound flows by Layer 4 port" /> <p>As you can see from the above, the Data Explorer API can be a powerful tool for creating value-added content that can be integrated into customer portals. Providing informative customer-specific graphs isn’t just about making the customer portal more interesting and attractive; by giving customers greater insight into the traffic over their networks, the Data Explorer API becomes part of the toolset that providers can use to differentiate themselves from competitors while keeping customers committed and engaged. If you’d like to learn more about how Kentik integration can add value to your customer portal, contact us at <a href="mailto:[email protected]">[email protected]</a> (existing Kentik customers please contact <a href="mailto:[email protected]">[email protected]</a>).</p><![CDATA[Transforming NetOps with Big Data]]><![CDATA[Looking ahead to tomorrow's economy, today's savvy companies are transitioning into the world of digital business. In this post — the second of a three-part series — guest contributor Jim Metzler examines the key role that Big Data can play in that transformation. 
By revolutionizing how operations teams collect, store, access, and analyze network data, a Big Data approach to network management enables the agility that companies will need to adapt and thrive.]]>https://www.kentik.com/blog/transforming-netops-with-big-datahttps://www.kentik.com/blog/transforming-netops-with-big-data<![CDATA[Jim Metzler]]>Sun, 08 May 2016 13:00:30 GMT<p>Enabling a Data-driven Approach for Better Network Management</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/4m4IFLjTn2aca0Q8Q8qsK8/8adf28d4ffab81e60dbc0234f32bd064/Big_data_world-420w.jpg" alt="Big_data_world-420w.jpg" class="image right" style="max-width: 300px;" />Across virtually every sector of the economy today, companies face a common imperative: integrate digital technologies and practices or risk falling by the wayside. As we discussed in an earlier post, <a href="https://www.kentik.com/from-netflow-analysis-to-business-outcome/">From NetFlow Analysis to Business Outcome</a>, a Big Data approach to collecting, storing, accessing, and analyzing management data enables the level of collaboration that’s required for an organization to exhibit the key characteristics of a digital business. Those characteristics can be identified as:</p> <ul> <li>Agile business models and rapid innovation;</li> <li>An agile IT function.</li> </ul> <p>In this post we’ll continue our focus on the intelligent use of network management data to enable companies to transform. Building on the first post, we’ll look at the role of network data in enabling this transformation, particularly how a Big Data approach to network management data provides an IT organization with the needed depth of insight to achieve data-driven network operations.<br> <br> <strong>Traffic Growth Stresses Traditional Management Tools</strong></p> <p>The volume of data traffic across today’s networks is large and growing. Driven in part by the digital business movement, this growth is also propelled by factors such as the increasing use of streaming, the growing number of mobile users and connected devices, and the emerging adoption of the Internet of Things (IoT). Every indication is that the growth will continue. In May of 2015, for example, the Cisco VNI Global IP Traffic Forecast projected (see graph) that global IP traffic will almost triple in volume between 2014 and 2019, from about 60,000 to nearly 168,000 petabytes per month.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/4XFrdK1kdywkQAUqqoY2S4/cc76beed28b721521778078137c0bfe3/cisco_forecast_graph-420w.png" alt="cisco_forecast_graph-420w.png" class="image right" style="max-width: 300px;" />As pointed out in the previous blog, the traditional approaches to network management don’t provide the visibility required to identify security incidents or to troubleshoot problems. In part that’s because legacy systems discard most raw network data, storing only a small fraction for the long term. The strong growth in data traffic and the associated growth in management data means that on a going-forward basis these traditional approaches will likely store an even smaller percentage of management data and hence be even less able to provide the required detailed visibility.</p> <p>One of the key characteristics of network management is that there are multiple sources of management data. Each source has advantages and disadvantages. Flow data (NetFlow, sFlow, J-Flow, and IPFIX), for example, is critical because it provides pervasive coverage and details on the traffic. 
But the tradeoff for this detail is that there is a huge volume of data to collect, store, and analyze. And flow data also doesn’t provide insight into all aspects of performance.</p> <p>Another source of management data comes from routing protocols such as BGP. One advantage of BGP is that it provides details on the end-to-end traffic paths. But it lacks any awareness of the traffic that transits those paths. Additional sources of network data are GeoIP databases — which map IP addresses to specific geographic locations such as region, country and city — and packet capture (PCAP), which can give insight into application and network performance.</p> <p>Traditional management systems use one type of data source. Unfortunately, since network organizations typically don’t know in advance the source of the problem they are trying to troubleshoot, they won’t know if a particular data source is the best, or even an adequate source of insight. A much more powerful approach is to fuse multiple data types to create a multi-dimensional view of network operations. To do that effectively, enabling ad hoc querying across multiple data types, you need to be able to retain, in detail, all of the relevant types of management data. That’s not feasible within the constraints of traditional approaches.<br> <br> <strong>Big Data for Network Management</strong></p> <p>As described by the Gartner IT Glossary, Big Data involves high-volume, high-velocity, and/or high-variety information assets that utilize cost-effective, innovative forms of information processing to enable enhanced insight, decision making, and process automation. Wikipedia, meanwhile, defines Big Data as a term for data sets that are so large or complex that traditional data processing applications are inadequate. To the extent that Big Data enables more accurate analysis, decisions can be made with greater confidence, and better decisions can result in greater operational efficiency, cost reduction, and reduced risk.</p> <p>Based on these definitions, network management data is a perfect application for Big Data. As noted, the velocity and volume of network management data is so large that in current approaches most of the raw data must be discarded and only rollups of predefined aggregates can be kept. Further, the need to combine multiple types of network data for effective analysis meets the “high-variety” criterion established by Gartner.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/5BV4M9vMNUGskmIWsA0EKA/ff6d430b3c40be8ea3617665ea89dde1/ooda_loop-319w.png" alt="ooda_loop-319w.png" class="image right" style="max-width: 300px;" />A Big Data solution for network management enables ingesting, storing, and querying of massive amounts of management data. While the batch-based processing of Big Data systems commonly used for business intelligence (BI) analytics may require hours to run queries, a Big Data solution for network operations demands a much faster query response timeframe. The OODA (observe, orient, decide, act) loop for network operations requires actionable information within minutes. This means that Big Data queries must complete in seconds, since even knowledgeable engineers will need to query data repeatedly to observe conditions sufficiently to orient, decide, and execute a course of action.</p> <p>The good news is that Big Data technologies have been advancing at a rapid rate, enabling them to support the multiple data types, use cases, and OODA timeframes required for network management. 
As a result, Big Data network management insight is poised to enable data-driven network operations. In our next post we’ll look at how that plays a key role in helping companies transform into the agile digital businesses that are most likely to thrive in tomorrow’s economy.</p><![CDATA[Inside the Kentik Data Engine, Part 2]]><![CDATA[In part 2 of our tour of Kentik Data Engine, the distributed backend that powers Kentik Detect, we continue our look at some of the key features that enable extraordinarily fast response to ad hoc queries even over huge volumes of data. Querying KDE directly in SQL, we use actual query results to quantify the speed of KDE's results while also showing the depth of the insights that Kentik Detect can provide.]]>https://www.kentik.com/blog/inside-the-kentik-data-engine-part-2https://www.kentik.com/blog/inside-the-kentik-data-engine-part-2<![CDATA[Jim Meehan]]>Mon, 02 May 2016 13:00:25 GMT<p><img src="//images.ctfassets.net/6yom6slo28h2/2aoWtpJedeCwg8CCS60q2K/7ff91e03c2bfa766dab45fd038b6934d/jet_engine-521w-negrev.jpg" alt="jet_engine-521w-negrev.jpg" class="image right" style="max-width: 300px;" /> In part 1 of this series we introduced Kentik Data Engine™, the backend to Kentik Detect™, which is a large-scale distributed datastore that is optimized for querying IP flow records (NetFlow v5/9, sFlow, IPFIX) and related network data (GeoIP, BGP, SNMP). We started our tour of KDE with a word about our database schema, and then used SQL queries to quantify how design features such as time-slice subqueries, results caching, and parallel dataseries help us achieve extraordinarily fast query performance even over huge volumes of data. In this post we’ll continue our tour of KDE, starting with filtering and aggregation. SQL gives us easy methods to do both filtering and aggregation, either by adding additional terms to the WHERE clause or by adding a GROUP BY clause. We can filter and aggregate by any combination of columns using those options. Let’s look at traffic volume for the top source <strong>→</strong> dest country pairs where both the source and dest are outside the US:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT src_geo,
       dst_geo,
       Sum(both_bytes) AS f_sum_both_bytes
FROM big_backbone_router
WHERE src_geo &#x3C;> 'US'
  AND dst_geo &#x3C;> 'US'
  AND i_start_time > now() - interval '1 week'
GROUP BY src_geo,
         dst_geo
ORDER BY f_sum_both_bytes DESC
LIMIT 10

| src_geo | dst_geo | f_sum_both_bytes |
| HK      | BR      | 27568963549063   |
| GB      | BR      | 8594666838327    |
| NL      | DE      | 6044367035356    |
| HK      | GB      | 6004897386415    |
| HK      | SG      | 5305439621766    |
| HK      | CO      | 4893091337832    |
| NL      | BR      | 4330923877223    |
| HK      | JP      | 4086102823771    |
| HK      | PL      | 3833512917644    |
| HK      | TR      | 3501243783418    |

SELECT 10
Time: 0.675s</code></pre></div> <p>The first row, HK (Hong Kong) <strong>→</strong> BR (Brazil) seems like an interesting, unlikely pair.
Let’s filter on that and see who was talking to whom from a network/ASN perspective:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT src_as,
       i_src_as_name,
       dst_as,
       i_dst_as_name,
       Sum(both_bytes) AS f_sum_both_bytes
FROM big_backbone_router
WHERE src_geo = 'HK'
  AND dst_geo = 'BR'
  AND i_start_time > Now() - interval '1 week'
GROUP BY src_as,
         i_src_as_name,
         dst_as,
         i_dst_as_name
ORDER BY f_sum_both_bytes DESC
LIMIT 10

| src_as | i_src_as_name      | dst_as | i_dst_as_name      | f_sum_both_bytes |
|  65001 | Global Transit Net |  65101 | Eyeball Net A      | 11686340849432   |
|  65001 | Global Transit Net |  65201 | Eyeball Net B ASN1 | 5076468451751    |
|  65001 | Global Transit Net |  65301 | CDN Net A          | 3337948976347    |
|  65001 | Global Transit Net |  65102 | Eyeball Net C      | 1261908657743    |
|  65001 | Global Transit Net |  65103 | Eyeball Net D      | 1234101190857    |
|  65001 | Global Transit Net |  65104 | Eyeball Net E      | 1211922009485    |
|  65001 | Global Transit Net |  65105 | Eyeball Net F      | 334959552542     |
|  65001 | Global Transit Net |  65202 | Eyeball Net B ASN2 | 247936925394     |
|  65001 | Global Transit Net |  65106 | Eyeball Net G      | 229671528291     |
|  65001 | Global Transit Net |  65401 | Enterprise Net A   | 209961848484     |

SELECT 10
Time: 0.458s</code></pre></div> <p><strong>Substring and regex matching</strong> If we wanted to drill down on the first row, we could additionally filter on the specific source/dest ASNs.</p>
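<p>A sketch of what that ASN drill-down might look like, reusing the ASN pair from the first result row and the same columns used elsewhere in this post:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT ipv4_src_addr,
       ipv4_dst_addr,
       Sum(both_bytes) AS f_sum_both_bytes
FROM big_backbone_router
WHERE src_as = 65001
  AND dst_as = 65101
  AND i_start_time > Now() - interval '1 week'
GROUP BY ipv4_src_addr,
         ipv4_dst_addr
ORDER BY f_sum_both_bytes DESC
LIMIT 10</code></pre></div>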
<p>But let’s filter on the ASN names instead, so we can see how KDE supports SQL substring and regular expression matching on text columns. Substring/regex matching also works on other strings such as interface names and descriptions, AS_PATHs, and user-defined flow tags.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT ipv4_src_addr AS f_cidr24_ipv4_src_addr,
       ipv4_dst_addr AS f_cidr24_ipv4_dst_addr,
       Sum(both_bytes) AS f_sum_both_bytes
FROM big_backbone_router
WHERE i_src_as_name ~ 'Peer|Transit'
  AND i_dst_as_name LIKE '%Eyeball Net A%'
  AND i_start_time > Now() - interval '1 week'
GROUP BY f_cidr24_ipv4_src_addr,
         f_cidr24_ipv4_dst_addr
ORDER BY f_sum_both_bytes DESC
LIMIT 10

| f_cidr24_ipv4_src_addr | f_cidr24_ipv4_dst_addr | f_sum_both_bytes |
| 10.129.204.0           | 10.156.25.0            | 115419904954801  |
| 10.221.155.0           | 10.141.201.0           | 78651524382556   |
| 10.156.25.0            | 10.129.204.0           | 62329500664567   |
| 10.254.246.0           | 10.31.38.0             | 39162753399340   |
| 10.117.35.0            | 10.31.38.0             | 39073550458830   |
| 10.144.99.0            | 10.254.210.0           | 28582936121869   |
| 10.31.73.0             | 10.254.244.0           | 27632400104976   |
| 10.31.75.0             | 10.17.153.0            | 26265050083173   |
| 10.144.99.0            | 10.254.244.0           | 25763076333705   |
| 10.17.100.0            | 10.93.63.0             | 23713868194889   |

SELECT 10
Time: 0.980s</code></pre></div> <p>As we see above, KDE can aggregate IP address columns by arbitrary subnet masks. In this query we’ve grouped the data by source <strong>→</strong> dest /24 subnet pairs. KDE also natively understands IP addresses and CIDR notation in filters, so we can drill down on the top subnet pair and look at pairs of individual IPs:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT ipv4_src_addr,
       ipv4_dst_addr,
       Sum(both_bytes) AS f_sum_both_bytes
FROM big_backbone_router
WHERE ipv4_src_addr LIKE '10.129.204.0/24'
  AND ipv4_dst_addr LIKE '10.156.25.0/24'
  AND i_start_time > Now() - interval '1 week'
GROUP BY ipv4_src_addr,
         ipv4_dst_addr
ORDER BY f_sum_both_bytes DESC
LIMIT 10

| ipv4_src_addr  | ipv4_dst_addr | f_sum_both_bytes |
| 10.129.204.41  | 10.156.25.5   | 101922511168965  |
| 10.129.204.34  | 10.156.25.4   | 16534277019052   |
| 10.129.204.69  | 10.156.25.79  | 12821801454      |
| 10.129.204.85  | 10.156.25.79  | 12408606234      |
| 10.129.204.116 | 10.156.25.79  | 11170668135      |
| 10.129.204.110 | 10.156.25.79  | 11078339112      |
| 10.129.204.76  | 10.156.25.79  | 10895308401      |
| 10.129.204.84  | 10.156.25.79  | 10497115055      |
| 10.129.204.115 | 10.156.25.79  | 10361345421      |
| 10.129.204.75  | 10.156.25.79  | 9923494659       |

SELECT 10
Time: 0.660s</code></pre></div> <p><strong>Time-series data</strong> Summary tables are great, but often we want time-series data to build visualizations. In KDE, this is as simple as adding a time column to the SELECT and GROUP BY clauses.
Let’s take the top IP pair from the results above and get time-series data (bytes and packets) over the last week:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT i_start_time,
       Max(i_duration),
       Sum(both_bytes) AS f_sum_both_bytes,
       Sum(both_pkts) AS f_sum_both_pkts
FROM big_backbone_router
WHERE ipv4_src_addr LIKE '10.129.204.41'
  AND ipv4_dst_addr LIKE '10.156.25.5'
  AND i_start_time > Now() - interval '1 week'
GROUP BY i_start_time
ORDER BY i_start_time ASC

| i_start_time              |  max | f_sum_both_bytes | f_sum_both_pkts |
| 2016-03-10 23:00:00+00:00 | 3600 | 475231866603     | 558646000       |
| 2016-03-11 00:00:00+00:00 | 3600 | 141987820665     | 180911990       |
| 2016-03-11 01:00:00+00:00 | 3600 | 85119841990      | 130098569       |
| 2016-03-11 02:00:00+00:00 | 3600 | 102749092833     | 124245217       |
| 2016-03-11 03:00:00+00:00 | 3600 | 40349266424      | 74404852        |
| 2016-03-11 04:00:00+00:00 | 3600 | 47615668871      | 80084659        |
| 2016-03-11 05:00:00+00:00 | 3600 | 39601556357      | 71966274        |
| 2016-03-11 06:00:00+00:00 | 3600 | 44595721100      | 55644084        |
| 2016-03-11 07:00:00+00:00 | 3600 | 36984645947      | 73379683        |
| 2016-03-11 08:00:00+00:00 | 3600 | 57309415120      | 86561840        |
| 2016-03-11 09:00:00+00:00 | 3600 | 221576669330     | 219835996       |

SELECT 168
Time: 0.430s</code></pre></div> <p>It doesn’t make sense to return 10,080 1-minute intervals for a week-long time series query; you can’t display that many data points in a visualization that fits in your average web browser. So KDE auto-selects the interval width to return an appropriate number of points / rows, based on the overall time range covered by the query. In this case, for a query covering a week, we get 168 one-hour intervals. For more information, see <a href="https://kb.kentik.com/Eb02.htm#Eb02-Time_Rounding">Time Rounding</a>. KDE can also return a column showing the width (in seconds) of each interval so we (or the Kentik Detect frontend) can easily calculate rates like bits/sec or packets/sec.</p>
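<p>To make that concrete, here’s the arithmetic for the first row of the results above (a 3600-second interval):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">f_sum_both_bytes = 475231866603 bytes over 3600 seconds
(475231866603 bytes * 8 bits/byte) / 3600 s ≈ 1.06 Gbps
558646000 packets / 3600 s ≈ 155,179 packets/sec</code></pre></div>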
<p><strong>Maintaining full granularity</strong> None of the per-interval data above was pre-calculated; it was generated on the fly, at query time, from the individual records that we saw in the response to the first query in part 1 of this blog series. That means that when we see an area of interest it’s easy to narrow the time range or apply filters to drill down. We can get full-resolution, 1-minute granularity for any historical period within the data that KDE stores, such as this one-hour time range from 90 days ago:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT i_start_time,
       Max(i_duration) AS i_duration,
       Sum(both_bytes) AS f_sum_both_bytes,
       Sum(both_pkts) AS f_sum_both_pkts
FROM big_backbone_router
WHERE i_start_time > Now() - interval '2159 hours'
  AND i_start_time &#x3C; Now() - interval '2158 hours'
GROUP BY i_start_time
ORDER BY i_start_time ASC

| i_start_time              | i_duration | f_sum_both_bytes | f_sum_both_pkts |
| 2015-12-15 22:29:00+00:00 |         60 | 179245157376     | 189853696       |
| 2015-12-15 22:30:00+00:00 |         60 | 181873404928     | 192246784       |
| 2015-12-15 22:31:00+00:00 |         60 | 183132584960     | 193918976       |
| 2015-12-15 22:32:00+00:00 |         60 | 180520254464     | 191270912       |
| 2015-12-15 22:33:00+00:00 |         60 | 179917988864     | 190438400       |
| 2015-12-15 22:34:00+00:00 |         60 | 175917901824     | 185893888       |
| 2015-12-15 22:35:00+00:00 |         60 | 174799783936     | 184879104       |
| 2015-12-15 22:36:00+00:00 |         60 | 175613580288     | 185396224       |
| 2015-12-15 22:37:00+00:00 |         60 | 173256493056     | 182279168       |
| 2015-12-15 22:38:00+00:00 |         60 | 170268498944     | 179223552       |
| 2015-12-15 22:39:00+00:00 |         60 | 169344593920     | 178819072       |
| 2015-12-15 22:40:00+00:00 |         60 | 169141132288     | 178192384       |
| 2015-12-15 22:41:00+00:00 |         60 | 169238467584     | 178177024       |

SELECT 60
Time: 1.293s</code></pre></div> <p>In summary, the Kentik Data Engine provides a very performant, scalable, and flexible platform that enables Kentik Detect to analyze network traffic data without making any compromises on the granularity or specificity of the results. But that’s actually just the tip of the iceberg. We’re excited about the other types of datasets that KDE will consume, and the additional applications we’ll be able to build on top of it in the areas of network performance, security, and business intelligence. Want to try KDE with your own network data? Start a <a href="#signup_dialog">free trial</a> of Kentik Detect and experience KDE’s performance first hand. Or maybe you’d like to help us take KDE to the next level? <a href="https://www.kentik.com/careers/">We’re hiring</a>.</p><![CDATA[Inside the Kentik Data Engine, Part 1]]><![CDATA[Kentik Detect's backend is Kentik Data Engine (KDE), a distributed datastore that's architected to ingest IP flow records and related network data at backbone scale and to execute exceedingly fast ad-hoc queries over very large datasets, making it optimal for both real-time and historical analysis of network traffic. 
In this series, we take a tour of KDE, using standard Postgres CLI query syntax to explore and quantify a variety of performance and scale characteristics.]]>https://www.kentik.com/blog/inside-the-kentik-data-engine-part-1https://www.kentik.com/blog/inside-the-kentik-data-engine-part-1<![CDATA[Jim Meehan]]>Mon, 25 Apr 2016 13:00:40 GMT<p>Database design for Web-scale ad hoc queries</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/4OiwmpgDEQsuCEOeMMcKcC/0d41c95c6acba0bbc9c2542d0ed644e8/jet_engine-521w.jpg" alt="jet_engine-521w.jpg" class="image right" style="max-width: 300px;" />In a previous post, <a href="https://www.kentik.com/beyond-hadoop/"><strong>Beyond Hadoop</strong></a>, we looked at how the MapReduce approach to database design runs up against performance limits when queries can’t be defined in advance, and we introduced Google’s response to those limits: Dremel/BigQuery. With Dremel, Google pointed the way toward an architecture that enables a database to execute exceedingly fast ad-hoc queries over large datasets using an ANSI SQL query language. Here at Kentik, we’ve applied many of the same concepts to Kentik Data Engine™ (KDE), a datastore optimized for querying IP flow records (NetFlow v5/9, sFlow, IPFIX) and related network data (GeoIP, BGP, SNMP). In this series, we’ll take a tour of KDE and also quantify some of its performance and scale characteristics. KDE is the backend of Kentik Detect™ and as such enables users to query network data and view visualizations via the Kentik portal, a fast, intuitive UI. But KDE is also exposed to users via REST API and direct SQL (see <a href="https://kb.kentik.com/Eb01.htm">Connecting to KDE</a>). For this discussion we’ll access the backend directly so that we can get an “under the hood” look at the response times and how the queries are structured. <strong>KDE records structure</strong> First, let’s look at the structure of the records in the datastore. We can grab some individual flows from the last 5 seconds (or any time range), one row per flow record. Note that in this and all following examples, all identifying information (IP addresses, router names, AS names and numbers, subnets, etc.) has been anonymized:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT protocol AS proto,
       ipv4_src_addr AS src_addr,
       l4_src_port AS sport,
       ipv4_dst_addr AS dst_addr,
       l4_dst_port AS dport,
       in_bytes AS bytes,
       in_pkts AS pkts
FROM big_backbone_router
WHERE i_start_time > Now() - interval '5 seconds'
LIMIT 5;

| proto | src_addr       | sport | dst_addr      | dport | bytes | pkts |
| 17    | 10.235.226.99  | 26085 | 10.20.177.155 | 20815 | 2576  | 2    |
| 6     | 10.217.129.102 | 51130 | 10.16.54.83   | 80    | 52    | 1    |
| 6     | 10.93.39.104   | 9482  | 10.17.123.38  | 40558 | 52    | 1    |
| 6     | 10.246.217.104 | 61815 | 10.18.213.199 | 9050  | 52    | 1    |
| 6     | 10.246.217.104 | 45063 | 10.21.80.86   | 9050  | 52    | 1    |

SELECT 5
Time: 0.438s</code></pre></div> <p>The example above shows just a small subset of the available columns (see <a href="https://kb.kentik.com/Eb03.htm#Eb03-Main_Table_Schema">Main Table Schema</a> for the full list). The bulk of the queryable columns are related to flow fields, but we also include many additional columns derived by correlating flow records with other data sources — BGP, GeoIP, SNMP — as the flows are ingested into KDE. KDE makes new data available immediately; flow records that were received less than 5 seconds ago are already available to query.</p>
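<p>Those enrichment columns are queried exactly like the flow fields. As a sketch (using dimension columns that appear in part 2 of this series), you could pull geo and BGP context alongside each flow like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT ipv4_src_addr AS src_addr,
       src_geo,
       src_as,
       i_src_as_name
FROM big_backbone_router
WHERE i_start_time > Now() - interval '5 seconds'
LIMIT 5;</code></pre></div>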
<p><strong>How big is big?</strong> Next, let’s look at capacity: how big is our “big data”? To find out, we can query KDE to see how many flows we collected over the last week from a real carrier backbone device — we’ll call it “big backbone router” — that pushes ~250 Gbps:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT Sum(protocol) AS f_count_protocol
FROM big_backbone_router
WHERE i_fast_dataset = FALSE
  AND i_start_time > Now() - interval '1 week';

| f_count_protocol |
| 1764095962       |

SELECT 1
Time: 4.726s</code></pre></div> <p>The answer (1,764,095,962) shows that KDE is able to query nearly 1.8 billion rows and still respond in under five seconds. How? In part it’s because the query was split into 10,080 subqueries, each representing a one-minute “slice” (see <a href="https://kb.kentik.com/Eb01.htm#Eb01-Subqueries_Slices_and_Shards">Subqueries, Slices, and Shards</a>) of the one week time range. These subqueries were executed in parallel over many nodes, after which the results were summed into a single response. You’ll notice that we didn’t use the standard COUNT(*) syntax. That’s because the distributed nature of KDE requires some special handling for aggregation. We’ve overloaded the “AS” keyword to allow us to pass in Kentik-specific aggregation functions (which always start with “f_”). In this example we chose to count the “protocol” column because it’s the narrowest field (1 byte) and provides the best performance for COUNT(). <strong>Caching for faster results</strong> Another aspect of KDE is that since historical data doesn’t change after it’s ingested, we’re able to take advantage of caching subquery results. To see how this helps, we can re-run the same query a few minutes later. This time the query executes in less than half a second, approximately ten times faster than the first time:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT Sum(protocol) AS f_count_protocol
FROM big_backbone_router
WHERE i_fast_dataset = FALSE
  AND i_start_time > Now() - interval '1 week';

| f_count_protocol |
| 1763974750       |

SELECT 1
Time: 0.420s</code></pre></div> <p>The reason KDE is so much faster when re-running queries is that after the initial run it only has to run the subqueries for the one-minute slices that occurred since the prior run, which are then appended to the earlier, cached subquery results. This is especially useful for time-series queries that you might run on a regular basis (we’ll see those a bit later). Just to be clear, though, subquery result caching is not the primary driver of fast query times. For the rest of the examples in this post we’ll be looking at uncached results to get an accurate characterization of “first run” response times. <strong>Counting bytes and packets</strong> So far we’ve been looking at flow counts, which are important but only part of the story. What about counting bytes and packets? As shown in the following query (same router, same timespan), it’s equally straightforward to get those totals (for those of you who are keeping track, it comes to ~20 petabytes and ~23 trillion packets):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT Sum(both_bytes) AS f_sum_both_bytes,
       Sum(both_pkts) AS f_sum_both_pkts
FROM big_backbone_router
WHERE i_fast_dataset = FALSE
  AND i_start_time > Now() - interval '1 week';

| f_sum_both_bytes  | f_sum_both_pkts |
| 19996332614707365 | 23109672039934  |

SELECT 1
Time: 68.625s</code></pre></div> <p>As most readers will already know, routers typically employ sampling to generate flow data; the flow records are based on 1-in-N packets sampled off the wire. KDE stores the sampling rate for each flow as a separate column, and the aggregation functions automatically normalize the byte and packet counts such that they accurately represent the traffic that was actually on the wire, rather than just a simple sum of values from the samples themselves.</p>
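<p>As a simple illustration of that normalization (the 1-in-2048 sampling rate here is hypothetical, chosen just for the arithmetic):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">sampled flow record:  52 bytes, 1 packet, sampling rate 1:2048
normalized estimate:  52 * 2048 = 106496 bytes
                       1 * 2048 = 2048 packets</code></pre></div>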
<p><strong>Dataseries: Full or Fast?</strong> Gathering and aggregating the data for a bytes/packets count is clearly more labor-intensive than counting flows, and that’s reflected in the response time of the query above. While 68 seconds is amazingly fast for a full table scan over 1.8 billion rows, it may not be quite as fast as we’d want for an interactive web GUI. KDE addresses this by creating at ingest a dataseries at each of two resolutions: One is the “Full” dataseries that we’ve been looking at (every flow record sent by the device). The other is what we call the “Fast” dataseries, which is a subsampled representation of the received data that retains all of the dimensionality but with fewer rows (see <a href="https://kb.kentik.com/Ab04.htm">Resolution Overview</a> for details). When we run our previous query on the Fast dataseries the results return in less than one second and are only a hair’s breadth different from the Full dataseries results above — certainly accurate enough to provide the insights we need for any real-world use case:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT Sum(both_bytes) AS f_sum_both_bytes,
       Sum(both_pkts) AS f_sum_both_pkts
FROM big_backbone_router
WHERE i_start_time > Now() - interval '1 week';

| f_sum_both_bytes  | f_sum_both_pkts |
| 19949403395561758 | 23097734283381  |

SELECT 1
Time: 0.782s</code></pre></div> <p>By default, KDE auto-selects the Full or Fast dataseries depending on the timespan of the query. A typical workflow might start with a wide query (days to weeks), which defaults to the Fast dataseries, and progress to zooming in on a narrower time range (hours), which defaults to the Full dataseries. The user can override these defaults, both in the GUI and in SQL (see the “i_fast_dataset” parameter in our first bytes/packets example above). In this way, KDE strikes a great balance between query timespan, granularity, and response time, while still giving the user full control over those tradeoffs. So far we’ve seen that KDE leverages a combination of techniques — rich schema, time-slice subqueries, results caching, parallel dataseries — to achieve extraordinarily fast query performance even over huge volumes of data. 
In <a href="https://www.kentik.com/inside-the-kentik-data-engine-part-2/">part 2</a> of this series, we’ll continue our tour with a look at KDE’s extensive filtering and aggregation capabilities, and demonstrate how to drill down to timespans as fine as one minute that occurred as far back as 90 days. In the meantime there’s lots you can do to find out more about Kentik Detect: <a href="#demo_dialog">request a demo</a>, sign up for a <a href="#signup_dialog">free trial</a>, or simply <a href="mailto:[email protected]">contact us</a> for further information. And if you find these kinds of discussions fascinating, bear in mind that <a href="https://www.kentik.com/careers/">we’re hiring</a>.</p><![CDATA[NetFlow, sFlow, and Flow Extensibility, Part 2]]><![CDATA[NetFlow and IPFIX use templates to extend the range of data types that can be represented in flow records. sFlow addresses some of the downsides of templating, but in so doing takes away the flexibility that templating allows. In this post we look at the pros and cons of sFlow, and consider what the characteristics might be of a solution can support templating without the shortcomings of current template-based protocols.]]>https://www.kentik.com/blog/netflow-sflow-and-flow-extensibility-part-2https://www.kentik.com/blog/netflow-sflow-and-flow-extensibility-part-2<![CDATA[Avi Freedman]]>Mon, 18 Apr 2016 13:00:17 GMT<h3 id="looking-beyond-the-netflow-sflow-divide"><em>Looking beyond the NetFlow-sFlow divide</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/2jckcAiD6oIG2UIYmkcIsU/838b691b7c9b771562b8b82d5582f273/the_wave-521w.jpg" alt="the_wave-521w.jpg" class="image right" style="max-width: 520px;" /> <p>In <a href="https://www.kentik.com/netflow-sflow-and-flow-extensibility-part-1/">part 1 of this series</a>, we looked at the origins of NetFlow, why it was extended in v9 through the use of templating, and what some of the pros and cons are of the templating approach. While NetFlow v9 and it’s follow-on protocol IPFIX offer tremendous flexibility there are some tradeoffs including complexity of implementation and the fact that a template must be received before the underlying flow data records can be correctly understood. These factors led to the development of a NetFlow/IPFIX alternative called sFlow®. In this post we’ll look at how sFlow works compared to NetFlow, and then consider where flow data protocols are headed next.</p> <h4 id="the-sflow-difference">The sFlow difference</h4> <p><strong>sFlow</strong>, which has been available in switches and routers since 2001, is the brainchild of InMon Corporation, whose continued control over the protocol is both benevolent and absolute. Instead of the templating approach taken in NetFlow v9 and IPFIX, sFlow employs the similar concept of protocol extensions. These extensions are defined and optional, but you can write them into the code of either your sFlow library or your device. Unlike NetFlow/IPFIX, there’s only one possible set of data types in any given sFlow implementation, so you don’t need to wait for a template before you can begin processing the flow data packets.</p> <p>sFlow also differs from NetFlow/IPFIX in the way that flow records are generated. Routers and switches running NetFlow/IPFIX designate a collection of packets as a flow by tracking packets, typically looking for packets that come from and go to the same place and share the same protocol, source and dest IP address, and port numbers. 
This tracking requires CPU and memory — in some circumstances, a huge amount of it. For example, with a forged source-address DDoS attack, every packet can be a flow, and routers have to try to maintain massive tables on the fly to track those flows! Also, to cut down on CPU and network bandwidth, flows are usually only “exported” on average every 10 seconds to a few minutes. This can result in very bursty traffic on sub-minute time scales.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2BFPHGmNNOaAaq8QQMKWwa/f652681c0ed4219e9cb7c91f002a6a5b/sflow_logo-319w.png" alt="sflow_logo-319w.png" class="image right no-shadow" style="max-width: 250px;" /> <p>sFlow, on the other hand, is based on interface counters and flow samples created by the network management software of each router or switch. The counters and packet samples are combined into “sFlow datagrams” that are sent across the network to an sFlow collector. The preparation of sFlow datagrams doesn’t require aggregation and the datagrams are streamed as soon as they are prepared. So while NetFlow can be described as observing traffic patterns (“How many buses went from here to there?”), with sFlow you’re just taking snapshots of whatever cars or buses happen to be going by at that particular moment. That takes less work, meaning that the memory and CPU requirements for sFlow are less than for NetFlow/IPFIX.</p> <p>Like NetFlow/IPFIX, sFlow is extensible and binary, but unlike NetFlow you can’t add or change data types independently by changing your own template, because InMon ultimately controls what can and can’t be done. On the upside, sFlow gives you faster feedback and better accuracy than many NetFlow implementations, and it is relatively sophisticated. From a coding point of view, it’s not trivial to implement, but it’s easier than implementing templated flow.</p> <h4 id="flow-for-the-future">Flow for the future</h4> <p>Flow records incorporate the standard attributes of network traffic, but it’s often overlooked that today’s flow records can also incorporate many other types of data such as application semantics and network and application performance data. To create flow records that are augmented this way you’ve got to have an extensible method of passing data. Of course you can use IPFIX or NetFlow V9, but you run into the limitations discussed earlier: relatively high CPU/memory requirements in some situations, and not being able to process records until you receive the template, which can cause delays, particularly if you are using a high sample rate (template packets may slip through).</p> <h4 id="templates-could-be-kept-globally-available-instead-of-being-delivered-in-band">Templates could be kept globally available instead of being delivered in-band.</h4> <p>To deal with the latter problem one can create a form of NetFlow where templates are not delivered in-band, but are kept somewhere else that is globally available, allowing the definitions to be retrieved by endpoints. This approach would take a little bit less communication and require a lot less work to match templates to data packets. You could take the well-known data types like IPv4 and IPv6 and build a fast path for them. With templates out-of-band, the protocol would also be ‘re-sample-able’ in transport.</p> <p>That still leaves the issue of what’s the best form to use for the extended data. 
You might choose a simple NetFlow v5-like C structure that is extended to add the fields you like, a binary serialization format such as protobufs or Cap’n Proto, or an ASCII format such as XML or <a href="http://www.json.org/">JSON</a>. While JSON has gotten very popular, there are two issues with using it for flow data. The first is that even when compressed, JSON is bigger than binary formats. The second and more important issue is that with JSON you’re taking ASCII and converting it to binary, which makes JSON less efficient to parse. Most of the data systems that primarily store JSON natively would melt if you tried to do flow analytics with them. There is a way around this, however: put the JSON data into <a href="http://kafka.apache.org/documentation.html">Apache Kafka</a>, which can serve as a data bus into different systems where you could translate it into binary.</p> <h4 id="kentik-kflow">Kentik KFlow</h4> <p>At Kentik™, we designed the Kentik Data Engine™ (KDE) to ingest flow records in heterogeneous protocols (NetFlow v5/v9, IPFIX, and sFlow) into a single unified database. So we’re protocol agnostic, but we’ve had lots of opportunity to think about the commonalities and distinctions of the protocols. The flexibility of NetFlow/IPFIX compared to sFlow can be a huge benefit to some of our customers, but processing IPFIX is relatively expensive and we want our data layer to be as efficient as possible. Also, NetFlow can run over SCTP, which can be encrypted, but that capability is not well supported by many exporting devices.</p> <p>Given the above, we’ve given a lot of thought to how to structure and implement a protocol that transports flow metadata and allows it to be enriched with data that is only available close to the packets. As a result, we’ve developed our own flow record protocol, called KFlow™, that can be used by any Kentik customer that sends us flow records via the Kentik agent (rather than directly from routers or switches).</p> <img src="//images.ctfassets.net/6yom6slo28h2/5ihQVFqGY0AOm4ceIIwACA/cad5298228dce98138e2d102bdc5a6a6/Capn_Proto-319w.png" alt="Capn_Proto-319w.png" class="image right" style="max-width: 280px;" /> <p>KFlow is based on a hybrid concept in which there are certain well-known attributes but also out-of-band templates. That way we know what every packet means as soon as we get it, but the data types are extensible. For our transport layer we use <a href="https://capnproto.org/">Cap’n Proto</a> — a serialization library that provides a way of representing data in extensible binary formats — over HTTPS. That gives us an encrypted and efficient way to feed augmented flow data to our cloud.</p> <p>KFlow has worked beautifully for billions of flow records per day since we introduced it in February of this year. Does the hybrid approach represent the future of flow protocols? We don’t know; the market will ultimately decide. But what does seem clear is that both flow data and flow protocols are areas that are ripe for continued innovation.</p> <p>Want to find out more about KFlow, or learn how your business can benefit from network visibility that unifies NetFlow, IPFIX, and sFlow with BGP, GeoIP, and SNMP? <a href="mailto:[email protected]">Contact us</a> to ask questions, to <a href="#demo_dialog">request a demo</a>, or to start a <a href="#signup_dialog" title="Try Kentik&#x27;s NetFlow Analyzer Solution">free trial</a> today. 
We’d love to hear from you.</p><![CDATA[Beyond Hadoop]]><![CDATA[As the first widely accessible distributed-computing platform for large datasets, Hadoop is great for batch processing data. But when you need real-time answers to questions that can’t be fully defined in advance, the MapReduce architecture doesn't scale. In this post we look at where Hadoop falls short, and we explore newer approaches to distributed computing that can deliver the scale and speed required for network analytics.]]>https://www.kentik.com/blog/beyond-hadoophttps://www.kentik.com/blog/beyond-hadoop<![CDATA[Jim Meehan]]>Mon, 11 Apr 2016 13:00:57 GMT<h3 id="clustered-computing-for-real-time-big-data-analytics"><em>Clustered computing for real-time Big Data analytics</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/7gbiz9NyRGyCSk8q6WkCk8/6dd00e58b446ac6650751d2d3cf8db2e/MapReduce_diagram-521w.png" alt="MapReduce_diagram-521w.png" class="image right no-shadow" style="max-width: 520px; margin-top: 15px; margin-bottom: 15px;" /> <p>The concept of parallel processing based on a “clustered” multi-computer architecture has a long history dating back at least as far as Gene Amdahl’s work at IBM in the 1960s. But the current epoch of distributed computing is often traced to December of 2004, when Google researchers Jeffrey Dean and Sanjay Ghemawat presented a paper unveiling MapReduce. Developed as a model for “processing and generating large data sets,” MapReduce was built around the core idea of using a map function to process a key/value pair into a set of intermediate key/value pairs, and then a reduce function to merge all intermediate values associated with a given intermediate key. Dean and Ghemawat’s work generated instant buzz and led to the introduction of an open source implementation, Hadoop, in 2006.</p> <p>As the first efficient distributed-computing platform for large datasets that was accessible to the masses, Hadoop’s initial hoopla was well deserved. It has since gone on to become a key technology for running many web-scale services and products, and has also landed in traditional enterprise and government IT organizations for solving big data problems in finance, demographics, intelligence, and more.</p> <p>Hadoop and its components are built around several key functionalities. One is the HDFS filesystem, which allows large datasets to be distributed over many nodes. Another is the algorithms that “map” (split) a workload over the nodes, such that each is operating on its local piece of the dataset, and “reduce” to aggregate the results from each piece. Hadoop also provides redundancy and fault tolerance mechanisms across the nodes. The key innovation is that each node operates on locally stored data, eliminating the network bottleneck that constrained traditional high-performance computing clusters.</p> <img src="//images.ctfassets.net/6yom6slo28h2/tJY9rKh7zMG2u0yauCYqQ/244bea9227bbe23c8d7c479324d77a67/Hadoop_logo-319w.png" alt="Hadoop_logo-319w.png" class="image left no-shadow" style="max-width: 300px;" /> <p><strong>The limits of Hadoop</strong></p> <p>Hadoop is great for batch processing large source datasets into result sets when your questions are well defined and you know ahead of time how you will use the data. But what if you need fast answers to questions that can’t be completely defined in advance? 
<img src="//images.ctfassets.net/6yom6slo28h2/tJY9rKh7zMG2u0yauCYqQ/244bea9227bbe23c8d7c479324d77a67/Hadoop_logo-319w.png" alt="Hadoop_logo-319w.png" class="image left no-shadow" style="max-width: 300px;" /> <p><strong>The limits of Hadoop</strong></p> <p>Hadoop is great for batch processing large source datasets into result sets when your questions are well defined and you know ahead of time how you will use the data. But what if you need fast answers to questions that can’t be completely defined in advance? That’s a situation that’s become increasingly common for data-driven businesses, which need to make critical, time-sensitive decisions informed by large datasets of metrics from their customers, operations, or infrastructure. Often the answer to an initial inquiry leads to additional questions in an iterative cycle of question » answer » refine until the key insight is finally revealed.</p> <p>The problem for Hadoop users is that the Hadoop architecture doesn’t lend itself to interactive, low-latency, ad hoc queries. So the iterative “rinse and repeat” process required to yield useful insights can take hours to days. That’s not acceptable in use cases such as troubleshooting or security, where every minute of query latency means prolonged downtime or poor user experience, either of which can directly impact revenue or productivity.</p> <p>One approach that has been tried to address this issue is to use Hadoop to pre-calculate a series of result sets that support different classes of questions. This involves pre-selecting various combinations of dimensions/columns from the source data, and collapsing that data into multiple result sets that contain only those dimensions. Known as “multidimensional online analytical processing” (M-OLAP), this approach is sometimes referred to more succinctly as “data cubes.” Relatively fast queries can be asked of the result sets, and the resulting performance is certainly leaps and bounds better than anything available before the advent of big data.</p> <p>While the use of data cubes boosts Hadoop’s utility, it still involves compromise. If the source data contains many dimensions, it’s not feasible to generate and retain result sets for all of the possible combinations. The result sets also need to be continually regenerated to incorporate new source data. And the lag between events and data availability can make it difficult to answer real-time questions. So even with data cubes, Hadoop’s value in time-dependent applications is inherently constrained.</p> <p><strong>Big Data in real time</strong></p> <p><img src="//images.ctfassets.net/6yom6slo28h2/MpeDyxP7igC2Wyg6mUE4U/141f51a1a4779b41516e50620067fa8c/Google_BigQuery-319w.png" alt="Google_BigQuery-319w.png" class="image right no-shadow" style="max-width: 300px;" />Among the organizations that ran up against Hadoop’s real-time limitations was Google itself. So in 2010 Google one-upped Hadoop, publishing a white paper titled “Dremel: Interactive Analysis of Web-Scale Datasets.” Subsequently exposed as the BigQuery service within Google Cloud, Dremel is an alternative big data technology explicitly designed for blazingly fast ad hoc queries. Among Dremel’s innovations are a columnar data layout and protocol buffers for efficient data storage and super-fast full table scans, along with a tree architecture for dispatching queries and collecting results across clusters containing hundreds or thousands of nodes. It also enables querying using ANSI SQL syntax, the “lingua franca” of analysts everywhere.</p>
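<p>A quick way to build intuition for why the columnar layout matters (this is our simplified illustration, not Dremel’s actual storage format) is to compare scanning one field out of a row-oriented table versus a column-oriented one:</p>
<pre>
# Row-oriented: every query walks every field of every record.
rows = [
    {"src": "10.0.0.1", "dst": "10.0.0.9", "bytes": 1200, "proto": 6},
    {"src": "10.0.0.2", "dst": "10.0.0.9", "bytes": 800,  "proto": 17},
]
row_total = sum(rec["bytes"] for rec in rows)

# Column-oriented: the same data pivoted so each field is one array.
cols = {
    "src":   ["10.0.0.1", "10.0.0.2"],
    "dst":   ["10.0.0.9", "10.0.0.9"],
    "bytes": [1200, 800],
    "proto": [6, 17],
}
# A query like SELECT SUM(bytes) reads only the "bytes" array and
# skips the other columns entirely; on disk that means far less I/O.
col_total = sum(cols["bytes"])
assert row_total == col_total == 2000
</pre>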
<p>Dremel’s results are truly impressive. It can execute full scan queries over billions of rows in seconds to tens of seconds — regardless of the dimensionality (number of columns) or cardinality (uniqueness of values within a column) — even when those queries contain complex conditions like regex matches. And since the queries operate directly on the source data, there is no data availability lag; the most recently appended data is available for every query.</p> <p>Because massively parallel disk I/O is a key prerequisite for this level of performance, a significant hardware footprint is required, with a price tag higher than many organizations would be willing to spend. But when offered as a multi-tenant SaaS, the cost per customer becomes quite compelling, while still providing the performance of the entire cluster for any given query.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6DMTiwGbL2WwUmkAkyaS02/dc600b5b362894ce76f33308ecb6c28c/Kentik_datastore-201w.png" alt="Kentik_datastore-201w.png" class="image right" style="max-width: 200px; padding: 20px;" /> <p><strong>Post-Hadoop NetFlow analytics</strong></p> <p>Dremel proved that it was possible to create a real-world solution enabling ad hoc querying at massive scale. That’s a game-changer for real-time applications such as network analytics. Flow records — NetFlow, sFlow, IPFIX, etc. — on a decent-sized network add up fast, and Hadoop-based systems for storing and querying those records haven’t been able to provide a detailed, real-time picture of network activity. Here at Kentik, however, we’ve drawn on many of the same concepts employed in Dremel to build our <a href="https://www.kentik.com/metrics-for-microservices/">microservice-based platform</a> for flow-based traffic analysis. Called Kentik Detect, this solution enables us to offer customers an analytical engine that’s not only powerful and cost-effective but also capable of real-time performance across web-scale datasets, a feat that is out of reach for systems built around Hadoop. (For more on how we make it work, see <a href="https://www.kentik.com/inside-the-kentik-data-engine-part-1/">Inside the Kentik Data Engine</a>.)</p> <p>The practical benefit of Kentik’s post-Hadoop approach is to enable network operators to perform — in real time — a full range of iterative analytical tasks that previously took too long to be of value. You can see aggregate traffic volume across a multi-terabit network and then drill down to individual IPs and conversations. You can filter and segment your network traffic by any combination of dimensions on-the-fly. And you can alert on needle-in-the-haystack events within millions of flows per second. Kentik Detect helps network operators to uncover anomalies, plan for the future, and better understand both their networks and their business. <a href="#demo_dialog">Request a demo</a>, or experience Kentik Detect’s performance for yourself by starting a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[NetFlow, sFlow, and Flow Extensibility, Part 1]]><![CDATA[NetFlow and its variants like IPFIX and sFlow seem similar overall, but beneath the surface there are significant differences in the way the protocols are structured, how they operate, and the types of information they can provide. 
In this series we’ll look at the advantages and disadvantages of each, and see what clues we can uncover about where the future of flow protocols might lead.]]>https://www.kentik.com/blog/netflow-sflow-and-flow-extensibility-part-1https://www.kentik.com/blog/netflow-sflow-and-flow-extensibility-part-1<![CDATA[Avi Freedman]]>Mon, 28 Mar 2016 13:00:07 GMT<h3 id="how-flow-protocols-adapt-as-network-needs-evolve"><em>How flow protocols adapt as network needs evolve</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/ThXXCldhGEgKYcSW0uCei/72ef6bc763fe969b91fa0edea1aff8c8/Spyglass-521w.png" alt="Spyglass-521w.png" class="image right" style="max-width: 300px;" />Network flow records today are an integral part of the network operations landscape. But while NetFlow and its variants like IPFIX and sFlow are similar overall, beneath the surface there are significant differences in the way the protocols are structured, how they operate, and the types of information they can provide. These variations reflect the different histories and intents of the protocols, particularly how they approached the issues of extensibility and interoperability. NetFlow’s answer is vendor-extensible flow templating, while sFlow took a somewhat different path. In this series we’ll look at the advantages and disadvantages of each, and see what clues we can uncover about where the future of flow protocols might lead.</p> <p><strong>The origins of NetFlow</strong></p> <p>NetFlow began as the byproduct of a Cisco effort to enable faster execution of Access Control Lists (ACLs). New switching technology was developed to accelerate processing by treating packets that were going to the same place or traveling on the same path as a single group. To make it work, routers were equipped to read packet headers and use field values to group the packets.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/43WZInPf1e2U6CyumMI4EQ/04f31aaf48a9cf8b230b929785510432/Diorama_cavemen-420w.jpg" alt="Diorama_cavemen-420w.jpg" class="image right" style="max-width: 300px;" />The original idea was simply to define all of the packets that would be permitted or denied by an ACL as a “flow,” and then to keep track of whether each such flow is accepted or denied by the ACLs for each interface. But then the network ops and architecture folks, who at the time were mostly limited to looking at SNMP totals, asked for a mechanism by which these flow records could be exported so that they could be used for network analytics. In response to those requests, NetFlow was born.</p> <p>Even at the dawn of flow analytics, people were interested in many different kinds of “behind the SNMP” views. Starting with the basic ACLs of the mid-90s, the first flow fields added were IP addresses, ports, and protocol. By NetFlow v5 (the first widely adopted version), a variety of additional fields were included.</p>
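<p>In code terms, the grouping idea behind a flow is simple. Here is a rough Python sketch (our own toy illustration, nothing resembling router firmware) that aggregates packets into flows keyed by those first fields:</p>
<pre>
from collections import defaultdict

def flow_key(pkt):
    # A "flow" is all packets sharing the same key fields.
    return (pkt["src_ip"], pkt["dst_ip"], pkt["src_port"],
            pkt["dst_port"], pkt["proto"])

def aggregate(packets):
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for pkt in packets:
        stats = flows[flow_key(pkt)]
        stats["packets"] += 1          # one record per flow,
        stats["bytes"] += pkt["size"]  # not one per packet
    return dict(flows)

pkts = [
    {"src_ip": "10.1.1.1", "dst_ip": "10.2.2.2", "src_port": 51112,
     "dst_port": 443, "proto": 6, "size": 1500},
    {"src_ip": "10.1.1.1", "dst_ip": "10.2.2.2", "src_port": 51112,
     "dst_port": 443, "proto": 6, "size": 40},
]
print(aggregate(pkts))  # one flow: 2 packets, 1540 bytes
</pre>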
<p><strong>More of a good thing</strong></p> <p>As the utility of NetFlow became more widely realized, it didn’t take long for people to start asking for more. What about capturing MAC address, or VLAN tag, or IPv6? Later, users began wanting to add MPLS and other data that routers and switches could observe. And then they started wanting to track things that routers and switches couldn’t observe, like URL, DNS query, and application and network performance.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/43rDycywJWAYSOyuUg2ooy/ab01630c9e6ef9bc81ec9ef4a8a78dff/scaffolding-420w.jpg" alt="scaffolding-420w.jpg" class="image right" style="max-width: 300px;" />The problem with keeping up with these expanding requirements was that NetFlow v5 was built on a fixed data structure that didn’t have a place for the extra data fields. And the process for updating the NetFlow protocols was clumsy, so extending the structure every few months for things people wanted to add was going to be a huge pain. Plus, even if you could keep up, you’d eventually end up with packet headers that were uber-large because you’d have fields in there for data that many devices couldn’t grab.</p> <p>While NetFlow v5 remains usable, its lack of additional data types (e.g. IPv6, MAC addresses, VLAN, and MPLS) makes it more limited than other alternatives. Cisco’s response to these limits, while trying to avoid packet bloat, was to use templating to abstract the metadata from the flow data itself. Introduced by Cisco in NetFlow v9 and carried forward in IPFIX — the IETF standard that is sometimes referred to as NetFlow v10 — templates provide a less rigid basis for a collector to interpret flow records. Sent in-band from the exporter to the collector, templates make NetFlow v9 and IPFIX very flexible.</p> <p><strong>Templating pros and cons</strong></p> <p>As with any other protocol design decision, templating involves both pros and cons. In NetFlow v9, templates are each identified with a template ID. While that’s helpful, it doesn’t prevent a situation in which a given template ID might be used by multiple network equipment vendors, in which case each vendor’s equipment would likely be collecting and sending a different set of NetFlow data. IPFIX addressed this issue by creating vendor IDs that can each contain their own unique set of template IDs, making it easier to avoid namespace conflicts.</p> <div class="pullquote left">Figuring out which types of flow data the collector is getting can be a slow process.</div> <p>Another issue with templates is that the template packets come very infrequently, and until you get one you don’t know what your collected flow data means. If you sample incoming packets then you may miss the template packets for a while and wind up with even more data of unknown meaning. And even though the template information is really the same from one exporter to another, the protocol specifies that you’re not supposed to cache and remember templates across multiple exporting devices. The result is that figuring out which types of flow data the collector is getting can be a slow process.</p> <p>Templating can also be a challenge to implement correctly because it involves a complex multistage process. That explains why many implementations were horribly inaccurate for the first decade or more of NetFlow v9’s existence.</p>
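<p>To illustrate the mechanics, here is a heavily simplified sketch of the template concept (not a wire-accurate NetFlow v9 parser; packet headers, padding, and option templates are all omitted). A template maps field IDs and lengths to positions in each data record, and the collector can’t interpret a record until the matching template arrives:</p>
<pre>
# Well-known field IDs and their meanings (a small subset, per the v9 spec).
FIELD_NAMES = {8: "src_addr", 12: "dst_addr", 2: "packets"}

templates = {}  # template ID -> list of (field_id, byte_length)

def on_template(tmpl_id, fields):
    templates[tmpl_id] = fields          # may arrive only every few minutes

def on_data_record(tmpl_id, payload):
    if tmpl_id not in templates:
        return None                      # undecodable until the template shows up
    record, offset = {}, 0
    for field_id, length in templates[tmpl_id]:
        raw = payload[offset:offset + length]
        record[FIELD_NAMES.get(field_id, field_id)] = int.from_bytes(raw, "big")
        offset += length
    return record

on_template(256, [(8, 4), (12, 4), (2, 4)])
rec = bytes([10, 0, 0, 1, 10, 0, 0, 2]) + (99).to_bytes(4, "big")
print(on_data_record(256, rec))  # {'src_addr': 167772161, 'dst_addr': 167772162, 'packets': 99}
</pre>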
<p>On the positive side, NetFlow v9 is binary and relatively compact. And it fits in UDP packets, which makes it easy to transport. So while it was originally Cisco-only, other vendors started creating their own IDs and it became the de facto standard, widely used among many vendors. As the standards-based follow-on version, IPFIX has also gotten a lot of traction.</p> <p>In <a href="https://www.kentik.com/blog/netflow-sflow-and-flow-extensibility-part-2" title="NetFlow, sFlow and Flow Extensibility, Part 2">NetFlow, sFlow and Flow Extensibility: Part 2</a>, we’ll take a look at some key differences between NetFlow and sFlow, as well as speculate a bit on where today’s trends in network data are leading flow protocols.</p> <p>A number of our other <a href="https://www.kentik.com/blog/">Kentik blog posts</a>, including our recent post on The Evolution of BGP NetFlow Analysis, can help you learn more about some of the many practical applications of flow data. And please don’t hesitate to contact us if you’re interested in arranging a <a href="#demo_dialog" title="Let Us Show You Why Kentik is the Best Netflow Analysis Solution">Kentik demo</a> or starting a <a href="#signup_dialog" title="Try Kentik&#x27;s NetFlow Analysis Solution">free trial</a>.</p><![CDATA[Maximizing Network Metadata Value]]><![CDATA[The plummeting cost of storage and CPU allows us to apply distributed computing technology to network visibility, enabling long-term retention and fast ad hoc querying of metadata. In this post we look at what network metadata actually is and how its applications for everyday network operations — and its benefits for business — are distinct from the national security uses that make the news.]]>https://www.kentik.com/blog/maximizing-network-metadata-valuehttps://www.kentik.com/blog/maximizing-network-metadata-value<![CDATA[Jim Meehan]]>Mon, 21 Mar 2016 13:00:31 GMT<h3 id="making-flow-records-a-force-for-good-on-your-network"><em>Making Flow Records a Force for Good on Your Network</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/5PaRLB36r6OWAkS6qgagyM/aa29fd028b02d0e9bb3030bfdd119e13/data_eye-521.jpg" alt="data_eye-521.jpg" class="image right" style="max-width: 520px;" /> <p>Ever since revelations by Edward Snowden raised public awareness about government surveillance and data privacy, the term “metadata” has been tainted with sinister connotations. Regardless of your personal stance on the “war on terror,” it’s hard to overstate the role that telecom metadata has played in intelligence operations. As highlighted in the film “Zero Dark Thirty,” for example, the CIA used mobile phone call records to understand the existence, relationships, and movement of key Al Qaeda players, and eventually put those pieces together to track down Osama bin Laden.</p> <p>While metadata’s role in counterterror operations is dramatic, national security is by no means the only or even the most widespread metadata use case. In fact, the network infrastructure of nearly every business that uses IP networks — which these days is effectively everyone — generates volumes of metadata on an hourly basis. Leveraged effectively by network operators, this data (NetFlow, GeoIP, and BGP) can provide valuable operational and business insight. And due to striking advances in distributed computing as well as in disk and CPU economics, you can take advantage of IP metadata without having the financial and personnel resources of the CIA.</p> <p><strong>What is Metadata?</strong></p> <p>In the literal sense, metadata is “data about data” (as distinct from the underlying data itself). In the context of telecom networks, we’re talking about call detail records (CDRs): who talked to whom, when, and for how long. Note that these records do not include the actual contents of the conversations. For IP networks, the metadata is similar except that instead of phone calls the records document packet flows. In a flow record, the “who” and “whom” are IP addresses and port numbers, and the “how long” is byte and packet counts.</p>
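<p>As a concrete (and entirely made-up) illustration, a single flow record boils a whole conversation down to a handful of fields:</p>
<pre>
# One flow record: the IP-network analog of a call detail record.
flow_record = {
    "src_ip":   "198.51.100.23",         # who...
    "dst_ip":   "203.0.113.50",          # ...talked to whom
    "src_port": 55210,
    "dst_port": 443,
    "proto":    6,                       # TCP
    "start":    "2016-03-21T13:00:05Z",  # when
    "bytes":    48200,                   # and for "how long"
    "packets":  62,
}
# Note what is absent: no payload and no packet contents. Just metadata.
</pre>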
<img src="//images.ctfassets.net/6yom6slo28h2/5waEwm21H2WgSAAKoQ4yCI/8c76ea5d32150a7339568f6499e4dcd7/metadata-420w.png" alt="metadata-420w.png" class="image left no-shadow" style="max-width: 420px;" /> <p>Just like with CDRs, in flow records the metadata is external to the actual content of the packets in the flow. There’s much that can be learned from capturing the packets themselves and studying them for deep, direct details on infrastructure and network phenomena. But that also involves a level of technical complexity and expense that in many situations won’t yield much more actionable insight than an effective, comprehensive system for the collection and analysis of metadata in the form of flow records.</p> <p>One key practical advantage of IP metadata is that it’s often already readily available as a byproduct of some other process (e.g. usage-based billing) from components such as routers and switches that are already installed in the network. It’s also easier and more affordable to collect and store than packet capture (pcap). The biggest advantage of metadata, though, is that it’s pervasive: an always-on stream of information from every part of your network.</p> <p><strong>Analysis (R)evolution</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/4IgRcKlyvKyCi46i6UQYsS/05b0f5e784f693271d107567ceda8aef/analytics-420w.jpg" alt="analytics-420w.jpg" class="image right" style="max-width: 420px;" /> <p>Network metadata has been available for a relatively long time. For IP networks, we’ve had a couple of decades of NetFlow, with the more recent addition of sFlow, IPFIX, etc. For telecom networks, the CDR has been around much longer than that. Until recently, though, metadata use cases have been limited to non-real-time evidence gathering, like generating predefined traffic summaries or pulling out proof of conversations or connections among known phone numbers or endpoints. The cost of storage meant that the data couldn’t be retained for very long. And the high cost of compute — due largely to the lack of distributed architecture — meant that we couldn’t easily ask ad hoc questions or derive real-time insights with any level of detail.</p> <p>What’s changed? The costs of disk and CPU have plummeted, and distributed computing technology allows us to aggregate storage and horsepower in ways that are much more powerful than just the “sum of parts.” That enables long-term retention of metadata and fast ad hoc queries. So now we can design systems that allow us not only to get proof of known events and conditions, but also to quickly discover previously unknown actors, events, and conditions within the network. In many cases, fast analysis across pervasive metadata can deliver more value than sparse points of deep packet data surrounded by large gaps in time or location.</p> <p><strong>Delivering Value</strong></p> <p>In “Zero Dark Thirty” we saw the CIA use metadata in two ways. First, retroactive analysis of collected metadata allowed them to uncover the existence and identity of unknown individuals in the Al Qaeda network that could potentially lead to bin Laden. 
And second, analysis of live metadata allowed them to know the location of those individuals in real time, and when to launch tactical operations.</p> <img src="//images.ctfassets.net/6yom6slo28h2/35vqGQfWZa4USUeSEqKwqS/2b60af69b8eaae6834f8ed8e063c2736/microscope-319.jpg" alt="microscope-319.jpg" class="image left" style="max-width: 300px;" /> <p>Given the right tools, IP network operators can also leverage both retroactive and real-time analysis of network metadata. In the past, diagnosing revenue-impacting events on the network — outages, congestion, attacks, etc. — often required that they be “caught in the act” while the traffic was still present on the network. For operators, that’s an extremely frustrating and time-consuming constraint. But with the ability to perform retroactive analysis of unsummarized network metadata, we can now look back at the finest details of historical traffic, which empowers us to quickly spot patterns and root causes.</p> <p>The enhanced performance of distributed systems also helps us in the realm of real-time analysis, making it possible to see problems as they develop, to rapidly drill down to areas of concern, and to take corrective action based on informed conclusions. And by applying lessons learned from historical data, we can set up alerts that notify us in real time when similar conditions recur, so that they can be quickly (or even automatically) remediated. (For examples of both retroactive and real-time analysis with Kentik Detect, see <a href="https://www.kentik.com/detecting-hidden-spambots/">Detecting Hidden Spambots</a>).</p> <p><strong>What’s Holding You Back?</strong></p> <p>As more and more business operations become enabled by or reliant on IP networks, the actionable insights available from metadata become increasingly critical. Like the analytical wizards who turned mundane baseball statistics into a competitive advantage on the field, you can accumulate continuous improvements based on network data to <a href="https://www.kentik.com/moneyball-your-network-with-big-data-analytics/">Moneyball Your Network</a>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2xUYrwEFwIusACaUWEiWqs/f398cf6857fd03522a94a2707fe0dd10/money_flow-420w.jpg" alt="money_flow-420w.jpg" class="image right" style="max-width: 420px;" /> <p>So what’s currently keeping you from deriving the full value of your network metadata? If you’re using legacy network monitoring tools that are based on pre-cloud designs then it’s likely that processing and storage constraints are forcing you to discard most of that value. That’s a shame, because the data you’re throwing away could be used to boost performance, cut costs, and improve ROI. As a network data analysis solution built on state-of-the-art Big Data architecture, the <a href="https://www.kentik.com/product/kentik-platform/" title="Learn More About Our Platform: the Kentik Network Observability Cloud">Kentik Network Observability Cloud</a> can help. And because Kentik is an efficient SaaS, the cost of entry is surprisingly low. So if you’re ready to derive maximum value from your network metadata, take us for a test drive. Sign up now for a <a href="#signup_dialog">free 30-day trial</a>.</p><![CDATA[Evolution of BGP NetFlow Analysis, Part 2]]><![CDATA[In part 2 of this series, we look at how Big Data in the cloud enables network visibility solutions to finally take full advantage of NetFlow and BGP. 
Without the constraints of legacy architectures, network data (flow, path, and geo) can be unified and queries covering billions of records can return results in seconds. Meanwhile the centrality of networks to nearly all operations makes state-of-the-art visibility essential for businesses to thrive.]]>https://www.kentik.com/blog/evolution-of-bgp-netflow-analysis-part-2https://www.kentik.com/blog/evolution-of-bgp-netflow-analysis-part-2<![CDATA[Jim Meehan]]>Mon, 14 Mar 2016 13:00:44 GMT<h2 id="big-data-boosts-performance-benefits-business">Big data boosts performance, benefits business</h2> <p>In <a href="/blog/the-evolution-of-bgp-netflow-analysis-part-1/">part 1 of this series</a>, we looked at the incorporation of BGP into the NetFlow protocol and the subsequent use of passive BGP sessions to capture live BGP attribute data. These two innovations enabled some valuable analysis of Internet traffic, including DDoS detection and the assessment of links for peering and transit. But because NetFlow-based visibility systems were architected around scale-up computing and storage, the full potential of NetFlow and BGP data was left unrealized. Big Data and cloud architecture changed all that.</p> <h2 id="one-from-many">One from many</h2> <img src="//images.ctfassets.net/6yom6slo28h2/4htLV9UIfSSAi0oEk4oEgw/c8cd66e78ad106b6a84e2e07744a8e67/e_pluribus_unum-420w.jpg" alt="e_pluribus_unum-420w.jpg" class="image right" style="max-width: 300px;" /> <p>As the public cloud has grown, scale-up computing has given way to scale-out models, which has opened up a great opportunity to improve both the cost and functionality of NetFlow and BGP analysis. In terms of cost, the scale-out, micro-services approach made popular by public cloud and SaaS vendors offers a big leap in price-performance ratio. Combining commodity bare-metal hardware with open source containers, you can create a purpose-built public cloud platform that delivers exceptional processing and storage capabilities at a cost per ingested flow record that is dramatically lower than commercial appliances. Software on top of generic public cloud (AWS or equivalent) is also an option, but the performance overhead of VMs and the commercial overhead make the costs significantly higher.</p> <div as="Promo"></div> <p>From a functionality point of view, massive scale-out computing allows augmented NetFlow records to be unified instead of fragmented. In single-server architectures, NetFlow is put in one table, BGP in another, and GeoIP in yet another, and you then have to key-match those tables and rows to correlate the data. In a distributed system with sufficient compute and memory, you can instead ingest all of the NetFlow, BGP, and GeoIP into a single time-series database. As each NetFlow record comes in, you look at its time stamp, grab the latest relevant BGP UPDATE from memory, and augment the NetFlow record with a variety of BGP attribute fields. And you do the same for GeoIP. With a scale-out cluster, this can happen in real time for millions of inbound flow records.</p> <p>The performance impact of unifying all data into augmented flow records is pretty impressive. One metric that underscores the advantage is the number of rows that a system can query while returning results in a usable amount of time. In the older, single-server model, that number ranged from hundreds to thousands of rows of data across different tables. With a scale-out approach, the number jumps to millions or even billions of rows. This leap is accomplished by splitting the data into time-based slices, distributing the processing of those slices across different nodes in the cluster, and then aggregating the results. Instead of populating a limited set of report tables and then discarding the raw source data, a scalable, distributed architecture allows you to keep the raw data and build reports on the fly.</p>
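<p>Here is a toy sketch of that scatter-gather pattern (ours, simplified to threads in one process; a real cluster fans the slices out to separate worker nodes):</p>
<pre>
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each shard holds the raw flow records for one time slice on one "node".
shards = [
    [{"dst_as": 7018, "bytes": 900}, {"dst_as": 3356, "bytes": 400}],
    [{"dst_as": 7018, "bytes": 700}, {"dst_as": 2914, "bytes": 300}],
]

def scan(shard):
    # Scatter: each node aggregates its local slice independently.
    totals = Counter()
    for rec in shard:
        totals[rec["dst_as"]] += rec["bytes"]
    return totals

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(scan, shards))  # slices scanned in parallel

result = sum(partials, Counter())            # gather: merge partial results
print(result.most_common())  # [(7018, 1600), (3356, 400), (2914, 300)]
</pre>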
<img src="//images.ctfassets.net/6yom6slo28h2/3EDgSrOfHGqSGQaCsKi68I/d631e548da61903d6543622b1fad3ded/big_data-420w.jpg" alt="big_data-420w.jpg" class="image right" style="max-width: 300px;" /> <h2 id="real-world-big-data-benefits">Real-world big data benefits</h2> <p>The leap in performance, flexibility, and detail that’s possible with a distributed Big Data architecture directly impacts NetFlow and BGP analysis use cases. Let’s take the example of DDoS detection. There is now a wide variety of available mitigation options, from remote-triggered black holes (RTBH) and FlowSpec to on-premises and cloud-based scrubbing. The most cost-effective approach for network operators is to be able to access, at scale, a vendor-neutral range of such tools. But that’s not an option in single-server scenarios, where software detection and mitigation must be tightly coupled and delivered by the same vendor.</p> <p>BGP traffic analysis is another area where the distributed approach shines. Scale-up software can process a large set of BGP and NetFlow data and produce a picture of destination BGP paths according to traffic volume. However, that picture may take hours to generate. The problem is, if someone wants to drill down on a portion of the picture to understand the “why” behind it, they’re basically out of luck.</p> <p>Compare that to a Big Data scenario, where you have the speed and capacity to ingest and store raw flow records and to query them ad hoc. Real-time analysis of the full dataset means that the number of operationally relevant use cases explodes, because the number of different questions that you can ask is never limited by predefined reporting tables that you’ve had to populate in advance. In this approach, the combination of terms on which you can run a query in real time is nearly infinite. And because you can ask what you want when you want, it’s possible to enable a completely interactive — and therefore far more intuitive — presentation of BGP traffic paths.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5mJBTv0JFKao28IkmmoYec/0ad20214fbfd3c143c94c7f8522854ac/Zoom_in_out-840w.jpg" alt="Zoom_in_out-840w.jpg" class="image center" style="max-width: 600px;" /> <p>The difference in utility is comparable to the difference between a one-time satellite snapshot of terrain versus a live camera drone that can fly in close and see any desired details. The snapshot is way better than nothing, but the drone makes it clear how limited the snapshot really is.</p> <h2 id="evolving-business-with-technology">Evolving business with technology</h2> <img src="//images.ctfassets.net/6yom6slo28h2/7nzTD4F94sE4Mo2gQq0s86/bd879d142a04e823c855b221c49d1c71/bridge_gap-crop-319w.png" alt="bridge_gap-crop-319w.png" class="image right" style="max-width: 300px;" /> <p>By now it’s hopefully clear that the implementation of distributed, Big Data architecture has enabled a huge step forward in the evolution of NetFlow-based network visibility. There are multiple ways to take advantage of this advance, and a number of factors to consider when choosing how best to do so. 
For most organizations the most practical, cost-effective solution is a SaaS platform like the <a href="https://www.kentik.com/product/kentik-platform/" title="Learn More About Our Platform: the Kentik Network Observability Platform">Kentik Network Observability Platform</a>.</p> <p>The key point is that as the technical capabilities of NetFlow engines evolve so too does their business utility. If you rely on Internet traffic for revenue or key productivity tools (like IaaS, PaaS, or SaaS), then your network isn’t just carrying packets, it’s carrying your actual business. Snapshots may be good enough for “nice to know” management dashboards, but as your operations grow you need deeper insight to improve the quality of your customer experience (whether internal or external) and your infrastructure ROI. That’s why it makes business sense to invest in NetFlow and BGP analysis capabilities that are based on state-of-the-art Big Data architecture.</p> <p>We’d love the opportunity to show you the SaaS option. The easiest way to start? Take Kentik for a test drive with a <a href="#signup_dialog">free trial</a>.</p><![CDATA[The Evolution of BGP NetFlow Analysis, Part 1]]><![CDATA[Clear, comprehensive, and timely information is essential for effective network operations. For Internet-related traffic, there’s no better source of that information than NetFlow and BGP. In this series we’ll look at how we got from the first iterations of NetFlow and BGP to the fully realized network visibility systems that can be built around these protocols today.]]>https://www.kentik.com/blog/the-evolution-of-bgp-netflow-analysis-part-1https://www.kentik.com/blog/the-evolution-of-bgp-netflow-analysis-part-1<![CDATA[Jim Meehan]]>Mon, 07 Mar 2016 14:00:50 GMT<h3 id="enabling-comprehensive-integrated-analytics-for-network-visibility">Enabling comprehensive, integrated analytics for network visibility</h3> <img src="//images.ctfassets.net/6yom6slo28h2/PGvsYhZY2GWEuImSCYYOI/fb58c4287bef6700e3414cd77d8c8791/human_evolution-521w.jpg" class="image right no-shadow" style="max-width: 500px" alt="" /> <p>Clear, comprehensive, and timely information is the essential prerequisite for effective network operations. For Internet-related traffic, there’s no better source of that information than NetFlow and BGP. But while these protocols have been around for a couple of decades, their potential utility to network operators was initially unrealized, and the process of exposing more value has been a long, gradual evolution. The journey started with simply making the data available. It then progressed to the development of collection and analysis techniques that make the data usable, but with real limitations. The next leap forward has been to transcend the constraints of legacy approaches, both open source and appliance, using big data architecture. With a distributed, multi-tenant HA datastore in the cloud, Kentik has created a SaaS that enables network operators to extract far more practical value from NetFlow and BGP information. In this series we’ll look at how we got from the first iterations of NetFlow and BGP to the fully realized network visibility systems that can be built around these protocols today.</p> <p><strong>In the beginning…</strong></p> <p>Border Gateway Protocol (BGP) was first introduced in 1989 to address the need for an exterior routing protocol between autonomous systems (ASes). By 1994 BGP4 had become the settled protocol for inter-AS routing. 
Then in 1996, as the Internet grew into a commercial reality and the need for greater insight into IP traffic patterns intensified, Cisco introduced the first routers featuring NetFlow. Support for BGP was added in 1998 with NetFlow version 5, which is still in wide use today.</p> <img src="//images.ctfassets.net/6yom6slo28h2/2rIocssJCoYk8qkYEQEKoI/a094928f8071f00b1975e02c80133555/american_hops-420w.jpg" class="image right" style="max-width: 400px" alt="" /> <p>Support for BGP in NetFlow v5 enabled the export of source AS, destination AS, and BGP next hop information, all of which was of great interest to engineers dealing with Internet traffic. BGP next hop data made it possible for network engineers to know which BGP peer, and hence which neighbor AS, outbound traffic was flowing through. With that insight, network engineers could better plan their outbound traffic.</p> <p>A key use case for next hop data arises when determining which neighbor ASes to peer with. If both paid-transit and settlement-free peering options are available, and those options all provide equivalent and acceptable traffic delivery, then you’ll want to maximize cost savings by ensuring that the free option is utilized whenever possible. Armed with BGP next hop insights, engineers can favor certain exit routers by tweaking IGP routing, either by changing IGP link metrics or (with a certain more-proprietary protocol) by employing weights.</p> <p><strong>Kick AS and take names</strong></p> <p>While NetFlow v5’s BGP support was helpful with the above, simple aggregation of the supported raw data left many use cases unaddressed. Knowing ASNs is a first step, but it’s not that helpful unless you can also get the corresponding AS_NAME so that a human can understand it and take follow-up action. In addition, engineers wanted more visibility into the full BGP paths of their traffic. For example, beyond the neighbor AS, what is the 2nd hop AS? And how about the source and destination ASes? NetFlow v5’s BGP implementation didn’t offer that full path data, and while v9 introduced greater flexibility it still provided only a partial view.</p> <div class="pullquote left">NetFlow's BGP gaps were addressed by direct collection of BGP routing data.</div> <p>In the early 2000s, a first generation of vendors figured out how to address this gap by collecting BGP routing data directly and blending it with NetFlow. This was done by establishing passive BGP peering sessions and recording all of the relevant BGP attributes. A further enhancement came from integrating information from GeoIP databases to augment the NetFlow and BGP data by providing source and destination IP location. Now, with a GUI tool, network engineers could make practical use of NetFlow and BGP information.</p> <p>These enhancements helped engineers with a number of use cases. One was DDoS detection. Looking at a variety of IP header and BGP data attributes on inbound traffic, you could use pattern-matching to detect volumetric as well as more-nuanced denial of service attacks. Another use case was to find opportunities for transit cost savings, including settlement-free peering, by looking at traffic going through 2nd and 3rd hops in the AS_PATH.</p>
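<p>As a hypothetical sketch of that kind of analysis (with made-up ASNs and byte counts), tallying traffic by the second hop in each flow’s AS_PATH quickly surfaces where your transit traffic converges:</p>
<pre>
from collections import Counter

# Flow volumes joined with each flow's AS_PATH (neighbor AS first).
flows = [
    {"bytes": 5000, "as_path": [174, 2914, 15169]},
    {"bytes": 3000, "as_path": [174, 2914, 32934]},
    {"bytes": 1000, "as_path": [3356, 7018, 13335]},
]

second_hop = Counter()
for flow in flows:
    if len(flow["as_path"]) >= 2:
        second_hop[flow["as_path"][1]] += flow["bytes"]

# An AS two hops out that carries a lot of your traffic may be a
# candidate for direct, settlement-free peering.
for asn, byte_count in second_hop.most_common():
    print(f"AS{asn}: {byte_count} bytes")  # AS2914: 8000 bytes ...
</pre>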
<p>For companies delivering application traffic to end-users, the ability to view destination AS and geography helps in understanding how best to reach the application’s user base.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5OBjodOv1C6QmkCQuOgGM/5f5218f733a4e880dc296b3b1a1f83e3/going_under-L-420w.jpg" alt="going_under-L-420w.jpg" class="image right" style="max-width: 300px;" /> <p><strong>Struggling to keep up</strong></p> <p>The integration of fuller BGP data with NetFlow and other flavors of flow records created a combined data set that was a huge step forward for anyone trying to understand their Internet traffic. But at the same time the overall volume of underlying traffic was skyrocketing. Constrained by the technologies of the day, available collection and storage systems struggled to keep up, and network operators were prevented from taking full advantage of their richer data.</p> <p>One key issue was that the software architecture of NetFlow-based visibility systems was based on scale-up assumptions. Whether the software was packaged commercially on appliances or sold as a downloadable software-only product, this meant that any given deployment had a cap on data processing and retention. With most of the software functionality written to optimize single-server function and performance, stringing together a number of servers only yielded a sum-of-the-parts in aggregate price performance.</p> <div class="pullquote left">With legacy NetFlow datastores, running non-standard reports was possible but painfully slow.</div> <p>Another issue was that the databases of choice for early NetFlow and BGP analysis implementations were either proprietary flat files, or, even worse, relational databases like MySQL. In this scenario, one process would strip the headers off of the NetFlow packets and stuff the data fields into one table. Another process would manage the BGP peering(s) and put those records into another table. A separate process would then take data from those tables and crank rows of processed data into still more tables, which were predefined for specific reporting and alerting tasks. Once those post-processed report tables were populated, the raw flow and BGP data was summarized, rolled up, or entirely aged out of the tables due to storage constraints.</p> <p>While it was possible in some cases to run non-standard reports on the raw data, it was painfully slow. Waiting 24 hours to process a BGP traffic analysis report from raw data was not uncommon. In some cases, you could export that raw data, but given the single-processor nature of the software deployment, and considering all of the other processes running at the same time, it was so slow to do so that 99% of users never did. You might have to dedicate a server just to run those larger periodic reports.</p> <p><strong>Steep deployment costs</strong></p> <p>Yet another major issue was the cost of deployment. NetFlow and BGP are both fairly voluminous data sets. NetFlow, even when sampled, produces a lot of flow records because there are many short-lived flows. Whenever a BGP session is established or experiences a hard or soft reset, the software has to ingest hundreds of thousands of routes. 
Plus, you have the continuous flow of BGP UPDATE messages as route changes propagate across the Internet.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/2fV6jAkvvuQq8WEeo648Ys/d331ec6b1392280cb9d984aa098da1bf/server_install-420w.jpg" alt="server_install-420w.jpg" class="image right" style="max-width: 300px;" />Using single-server software, you may end up needing a bunch of servers to process all of that data. If you buy those servers pre-packaged with software from the vendor, you’ll pay a big mark-up. Consider a typical 1U rackmount server appliance from your average Taiwanese OEM. Raw, the cost of goods sold (COGS) may be anywhere from $1,000.00 to $2,000.00, but loaded with software, and after a basic manufacturing burn-in, you can expect to pay a steep $10K to $25K, even with discounts. And even if you’re buying a software-only product that isn’t pre-packaged onto an appliance, you still have the cost of space, cooling, power, and — most importantly — overhead for IT personnel. So owning and maintaining your own hardware in a scale-up software model is still expensive from a total cost of ownership (TCO) point of view.</p> <p>Considering the limited, inflexible reporting you get for these high costs, most users of legacy NetFlow and BGP analysis tools have been left hungry for a better way. In <a href="https://www.kentik.com/blog/evolution-of-bgp-netflow-analysis-part-2/">part 2 of this series</a>, we’ll look at an alternative approach based on big data SaaS, and consider how this new architecture can dramatically boost the value of BGP and NetFlow in network visibility, operations, and management.</p><![CDATA[From NetFlow Analysis to Business Outcome]]>https://www.kentik.com/blog/from-netflow-analysis-to-business-outcomehttps://www.kentik.com/blog/from-netflow-analysis-to-business-outcome<![CDATA[Jim Metzler]]>Mon, 29 Feb 2016 14:00:29 GMT<h3 id="enabling-digital-transformation-with-network-data"><em>Enabling digital transformation with network data</em></h3> <img src="//images.ctfassets.net/6yom6slo28h2/2Zh7o6nowgMK0gEaCSAuo2/f23963e35e9aa4400e958415498dbf47/bizness_hero.jpg" alt="bizness_hero.jpg" class="image right no-shadow" style="max-width: 330px;" /> <p>Over the last few years a lot has been written about the necessity for businesses of all types to embark on a <a href="https://www.capgemini.com/resource-file-access/resource/pdf/The_Digital_Advantage__How_Digital_Leaders_Outperform_their_Peers_in_Every_Industry.pdf">digital transformation</a>. This isn’t just an academic theory as demonstrated in the recently published book entitled <a href="http://www.amazon.com/Leading-Digital-Technology-Business-Transformation/dp/1625272472"><em>Leading Digital</em></a>. That book discussed how over 400 large companies in traditional industries such as finance, manufacturing and pharmaceuticals are making that transformation.</p> <p>The US retailer Home Depot is an example of a traditional company that is in the process of transitioning to become a digital business. 
According to an April 1, 2014 article in CIO magazine, Home Depot has already implemented:</p> <ul> <li>Omnichannel systems that include BORIS (buy online, return in store), BOSS (buy online, ship to store), and BOPIS (buy online, pick up in store);</li> <li>A project to create a mobile mapping application aimed at helping customers more easily find items in Home Depot’s cavernous stores;</li> <li>Analytics to monitor the online pricing of its rivals and to adjust its own prices in real time.</li> </ul> <p>The impact that the transformation to become a digital business has on IT organizations was discussed in an article in Forbes entitled <a href="http://www.forbes.com/sites/oracle/2014/01/10/the-top-10-strategic-cio-issues-for-2014/"><em>The Top 10 Strategic CIO Issues for 2014</em></a><em>.</em> The Forbes article discussed the changing role of the CIO and stated that “Any CIO pining for a return to the good old days of bonuses based on server-uptime and SLA enforcement should consider swapping out the CIO title for a new one: senior director of infrastructure.” The article went on to say that “the CIO job itself continues to undergo a profound transformation that is pushing business-technology leaders inexorably closer to customer demands and customer experiences and customer engagements; to revenue generation, enhancement, and optimization; and to sometimes-revolutionary new business models and operating models, and unheard-of new processes.”</p> <p>Similar to what was said in the Forbes article, in the book that I wrote entitled <a href="http://www.amazon.com/Transforming-digital-business-Forum-guides-ebook/dp/B00KY57ITW"><em>Transforming to a Digital Business</em></a> I pointed out that two of the key characteristics of a digital business are:</p> <ul> <li>Agile business models and rapid innovation;</li> <li>An agile IT Function.</li> </ul> <p>Each of the characteristics listed above depends on collaboration. For example, a company can’t have agile business models and rapid innovation if it’s difficult for the company’s business units and business leaders to work together to explore new ideas. In similar fashion, it isn’t possible to have an agile IT function without close collaboration by all of the relevant parties.</p> <p><strong>Where the Network Comes into Play</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/5BamY0nTe8ocg6EC20aMeg/39e7139a2f5a49655134b8d770816c54/many_roads.jpg" alt="many_roads.jpg" class="image right" style="max-width: 400px;" /> <p>At the same time that IT functions are working to facilitate the movement to become a digital business they are facing a spectrum of technical challenges. Most of these challenges stem from the fact that we are moving away from the traditional IT environment in which the IT organization owned and deployed a variety of hardware-based devices in facilities that they controlled. In the emerging IT environment, there are hardware and software-based devices, some of which are owned by the business and some of which are owned by the employees. Increasingly these devices are not housed in facilities owned by the business.</p> <p>There is a critical commonality between the traditional and the emerging IT environment. That commonality is that the network touches everything. As a result, management data gleaned from the network using a technology such as NetFlow can provide multiple levels of value. One level of value is that IT becomes more agile. 
It can, for example, troubleshoot problems more quickly regardless of where the problem originates, since network traffic data covers every device. Another level of value is that business leaders gain insight into their business processes and can better understand what’s working and what needs to be changed.</p> <p><strong>Network Data Analysis Needs Transformation Too</strong></p> <div class="pullquote left">Traditional approaches to collecting and analyzing management data don’t enable business agility.</div> <p>As the value of IT and business agility becomes better understood, the focus of technology leaders shifts toward positioning organizations to reap the benefits. Traditional approaches to collecting, storing, accessing, and analyzing management data fall short in that regard because they don’t enable the requisite ease of collaboration.</p> <p>Part of the challenge with the traditional approach is that only relatively small amounts of management data are stored for more than a brief period of time and the management data tends to be specific to a particular technology domain. Another part of the challenge with the traditional approach is that the typical interface to the management data is through a GUI or through a canned report. A GUI is fine if all you want to do is to drill down into a narrowly defined situation such as why some indicator icon has changed from green to yellow. A GUI is not helpful, however, if you want to get broader insight into what is happening in the network or in a key business process. A canned report is fine if all of the information you want is contained in that report. If it isn’t, you typically have to work with your vendor and wait for months for a new or enhanced report. There is no doubt that in our industry we overuse the word <em>agile.</em> As a result, it is sometimes difficult to know exactly what it means to be agile. That said, taking months to get the management insight you need is a good example of not being agile.</p> <p>The solution is to move away from the traditional approach to collecting, storing, accessing and analyzing management data and to adopt a big data approach. By being able to capture and store large data sets for a long period of time, network organizations are in a position to identify problems and analyze trends across the entire IT infrastructure, whether the infrastructure is traditional or emerging. However, in order to fully leverage those large data sets, network organizations need programmatic interfaces into them so that professionals from every part of the IT organization can access the same data and work collaboratively. In similar fashion, a big data approach combined with programmatic interfaces enables business units to also work collaboratively to identify problems, analyze trends in their business processes and identify where there are opportunities. The ability to do this is important if the business is a traditional bricks-and-mortar company and it is critical as the company evolves towards becoming a fully digital business.</p><![CDATA[BGP Routing Tutorial Series: Part 1]]><![CDATA[Border Gateway Protocol (BGP) is a policy-based routing protocol that has long been an established part of the Internet infrastructure. Understanding BGP helps explain Internet interconnectivity and is key to controlling your own destiny on the Internet. 
With this post we kick off an occasional series explaining who can benefit from using BGP, how it's used, and the ins and outs of BGP configuration.]]>https://www.kentik.com/blog/bgp-routing-tutorial-series-part-1https://www.kentik.com/blog/bgp-routing-tutorial-series-part-1<![CDATA[Avi Freedman]]>Mon, 08 Feb 2016 14:00:36 GMT<p>For an updated version of all four parts of this series in a single article, see “<a href="https://www.kentik.com/kentipedia/bgp-routing/" title="BGP Routing: A Tutorial | Kentipedia">BGP Routing: A Tutorial</a>” or download our <a href="https://www.kentik.com/go/ebook/ultimate-guide-bgp-routing/" title="Download The Ultimate Guide to BGP Routing">Ultimate Guide to BGP Routing</a>.</p> <div as="Promo"></div> <h2 id="bgp-basics-routes-peers-and-paths">BGP Basics: Routes, Peers, and Paths</h2> <p>Designed before the dawn of the commercial Internet, the <a href="https://www.kentik.com/kentipedia/what-is-bgp-border-gateway-protocol/" title="Kentipedia: What Is BGP? Understanding Border Gateway Protocol">Border Gateway Protocol (BGP)</a> is a policy-based routing protocol that has long been an established part of the Internet infrastructure. In fact, I wrote a series of articles about BGP, Internet connectivity, and multi-homing back in 1996, and two decades later the core concepts remain basically the same. There have been a few changes at the edge (which we’ll cover in future posts), but these have been implemented as the designers anticipated, by adding “attributes” to the BGP specification and implementations. In general, BGP’s original design still holds true today, including both its strengths (describing and enforcing policy) and weaknesses (lack of authentication or verification of routing claims).</p> <p>Why is an understanding of BGP helpful in understanding Internet connectivity and interconnectivity? Because effective BGP configuration is part of controlling your own destiny on the Internet. And that can benefit your organization in several key areas:</p> <ul> <li>Preserve and grow revenue.</li> <li>Protect the availability and uptime of your infrastructure and applications.</li> <li>Use the economics of the Internet to your advantage.</li> <li>Protect against the global security risks that can arise when Internet operators don’t agree on how to address security problems.</li> </ul> <p>BGP and Internet connectivity is a big subject, so there’s a lot of ground to cover in this series. The following list will give you a sense of the range of the topics we’ll be looking at:</p> <ul> <li>The structure and state of the Internet;</li> <li>How BGP has evolved and what its future might hold;</li> <li>DDoS detection and prevention;</li> <li>Down the road, additional topics such as MPLS and global networking, internal routing protocols and applications, and other topics that customers, friends, and readers are interested in seeing covered.</li> </ul> <p>For this first post we’ll get our feet wet with some basic concepts related to BGP: Autonomous Systems, routes, peering, and AS_PATH.</p> <h2 id="routes-and-autonomous-systems">Routes and Autonomous Systems</h2> <img src="//images.ctfassets.net/6yom6slo28h2/4ChhoAJuis8GkicS8cO0wG/e6697f54531f0e34b23e03bcc19555bf/Red_connectors-320w.jpg" class="image right" style="max-width: 300px"> <p>To fully understand BGP we’ll first get familiar with a couple of underlying concepts, starting with what it actually means to be connected to the Internet. 
For a host to be connected there must be a path or “route” over which it is possible for you to send a packet that will ultimately wind up at that host, and for that host to have a path over which to send a packet back to you. That means that the provider of Internet connectivity to that host has to know of a route to you; they must have a way to see routes in the section of the IP space that you are using. For reasons of enforced obfuscation by RFC writers, routes are also called Network Layer Reachability Information (NLRI). As of December 2015, there are over 580,000 IPv4 routes and nearly 26,000 IPv6 routes.</p> <p>Another foundational concept is the Autonomous System (AS), which is a way of referring to a network. That network could be yours, or belong to any other enterprise, service provider, or nerd with her own network. Each network on the Internet is referred to as an AS, and each AS has at least one Autonomous System Number (ASN). There are tens of thousands of ASNs in use on the Internet. Normally the following elements are associated with each AS:</p> <ul> <li>An entity (a point of contact, typically called a NOC, or Network Operations Center) that is responsible for the AS.</li> <li>An internal routing scheme so that every router in a given AS knows how to get to every other router and destination within the same AS. This would typically be accomplished with an interior gateway protocol (IGP) such as Open Shortest Path First (OSPF) or Intermediate System to Intermediate System (IS-IS).</li> <li>One or multiple border routers. A border router is a router that is configured to peer with a router in a different AS, meaning that it creates a TCP session on port 179 and maintains the connection by sending a keep-alive message every 60 seconds. This peering connection is used by border routers in one AS to “advertise” routes to border routers in a different AS (more on this below).</li> </ul> <h2 id="introducing-bgp">Introducing BGP</h2> <img src="//images.ctfassets.net/6yom6slo28h2/2iyQVTwKkYUaMYQKy0aUiK/a35d32375fef16b698843336dbebb21a/Peering_diagram-521w.png" class="image right no-shadow" style="max-width: 400px" alt="peering diagram"> <p>As explained above, the interconnections that are created to carry traffic from and between Autonomous Systems result in the creation of “routes” (paths from one host to another). Each route is made up of the ASN of every AS in the path to a given destination AS. BGP (more explicitly, BGPv4) is the routing protocol that is used by your border routers to “advertise” these routes to and from your AS to the other systems that need them in order to deliver traffic to your network:</p> <ul> <li>Peer networks, which are the ASes with which you’ve established a direct reciprocal connection;</li> <li>Upstream or transit networks, which are the providers that connect you to other networks.</li> </ul> <p>Specifically, your border routers advertise routes to the portions of the IPv4 and IPv6 address space that you and your customers are responsible for and know how to get to, either on or through your network. Advertising routes that “cover” (include) your network is what enables other networks to “hear” a route to the hosts within your network. In other words, every IP address that you can get to on the Internet is reachable because someone, somewhere, has advertised a route that covers it. If there is not a generally advertised route to cover an IP address, then at least some hosts on the Internet will not be able to reach it.</p>
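<p>Here is a tiny illustration of what “covering” means in practice, using Python’s standard ipaddress module (the prefixes are documentation examples, not real advertisements):</p>
<pre>
import ipaddress

# A route advertisement covers every address inside its prefix.
advertised = ipaddress.ip_network("203.0.113.0/24")

host_a = ipaddress.ip_address("203.0.113.42")
host_b = ipaddress.ip_address("198.51.100.7")

print(host_a in advertised)  # True: reachable via this advertisement
print(host_b in advertised)  # False: some other covering route is needed
</pre>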
<img src="//images.ctfassets.net/6yom6slo28h2/4CJu0G2cDSmSMcG4Ysg6Ac/be6f7f8253fef0b6f369402ff5abced9/Global_links-521w.jpg" class="image right" style="max-width: 300px"> <p>The advertising of routes helps a network operator do two very important things. One is to make semi-intelligent routing decisions concerning the best path for a particular route to take outbound from your network. Otherwise you would simply set a default route from your border routers into your providers, which might cause some of your traffic to take a sub-optimal external route to its destination. Second, and more importantly, you can announce your routes to those providers, for them to announce in turn to others (transit) or just use internally (in the case of peers).</p> <p>In addition to their essential role in getting traffic to its destination, advertised routes are used for several other important purposes:</p> <ul> <li>To help track the origin and path of network traffic;</li> <li>To enable policy enforcement and traffic preferences;</li> <li>To avoid creating routing, and thus packet, loops.</li> </ul> <p>Besides being used to advertise routes, BGP is also used to listen to the routes from other networks. The sum of all of the route advertisements from all of the networks on the Internet contributes to the “global routing table” that is the Internet’s packet directory system. If you have one or more transit providers, you will usually be able to hear that full list of routes.</p> <p>One further complication: BGP actually comes in two flavors depending on what it’s used for:</p> <ul> <li>External BGP (eBGP) is the form used when routers that aren’t in the same AS advertise routes to one another. From here on out you can assume that, unless otherwise stated, we’re talking about eBGP.</li> <li>Internal BGP (iBGP) is used between routers within the same AS.</li> </ul> <h2 id="the-as_path-attribute">The AS_PATH attribute</h2> <img src="//images.ctfassets.net/6yom6slo28h2/5wzJv68RtmKsIgCQW8GSiy/9c36d3c8e38b2b61409b412fba75a92b/Treasure_map-521w.png" class="image right no-shadow" style="max-width: 320px" alt="treasure map"> <p>BGP supports a number of attributes, the most important of which is AS_PATH. Every time a route is advertised from one AS to another over a peering session, the advertising router prepends its own ASN to this attribute. For example, when Verizon hears a route from NTT America, the route arrives “stamped” with NTT’s ASN, thereby building up the AS_PATH. (Note that when a route is advertised between routers in the same AS, using iBGP, no AS boundary is crossed and thus AS_PATH is left unchanged.)</p> <p>When multiple routes are available, remote routers will generally decide which is the best route by picking the route with the shortest AS_PATH, meaning the route that will traverse the fewest ASes to get traffic to a given destination AS. That may or may not be the fastest route, however, because there’s no information about the network represented by a given AS: nothing about that network’s bandwidth, the number of internal routers and hop-count, or how congested it is. From the standpoint of BGP, every AS is pretty much the same.</p>
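<p>As a rough sketch (ours; real BGP best-path selection weighs several attributes, such as LOCAL_PREF, before it ever compares AS_PATH length), the tie-break described above boils down to:</p>
<pre>
# Candidate AS_PATHs to the same destination prefix (illustrative ASNs).
candidates = [
    [701, 2914, 13335],
    [3356, 13335],
    [174, 1299, 6939, 13335],
]

# Shortest AS_PATH wins, regardless of how fast or congested the
# networks along each path actually are.
best = min(candidates, key=len)
print(best)  # [3356, 13335]
</pre>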
From the standpoint of BGP, every AS is pretty much the same.</p> <p>Additional uses for AS_PATH include:</p> <ul> <li><em>Loop detection:</em> When a border router receives a BGP update (path advertisement) from its peers, it scans the AS_PATH attribute for its own ASN; if found, the router will ignore the update and will not advertise it further to its iBGP neighbors. This precaution prevents the creation of routing loops.</li> <li><em>Setting policy:</em> BGP is designed to allow providers to express “policy” decisions such as preferring Verizon over NTTA to get to Comcast.</li> <li><em>Visibility:</em> AS_PATH provides a way to understand where your traffic is going and how it gets there.</li> </ul> <h2 id="conclusion-and-a-look-ahead">Conclusion… and a look ahead</h2> <p>So far we’ve just scratched the surface of BGP, but we’ve learned a few core concepts that will serve as a foundation for future exploration:</p> <ul> <li><em>Internet connectivity:</em> the ability of a given host to send packets across the Internet to a different host and to receive packets back from that host.</li> <li><em>Autonomous system (AS):</em> a network that is connected to other networks on the Internet and has a unique AS number (ASN).</li> <li><em>Route:</em> the path travelled by traffic between Autonomous Systems.</li> <li><em>Border router:</em> a router that is at the edge of an AS and connects to at least one router from a different AS.</li> <li><em>Peering:</em> a direct connection between the border routers of two different ASs in which each router advertises the routes of its AS.</li> <li><em>eBGP:</em> the protocol used by border routers to advertise routes.</li> <li><em>AS_PATH:</em> the BGP attribute that records the sequence of ASes a route traverses.</li> </ul> <p>In future posts we’ll get deeper into the uses and implications of the above concepts. We’ll also look at single-homed and multi-homed networks, how using BGP changes the connectivity between a network and the Internet, and who can benefit from using BGP. When we’ve got those topics down we can then look at the ins and outs of BGP configuration. Stay tuned…</p> <h4 id="related-posts">Related posts</h4> <ul> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-1/">BGP Routing Tutorial Series: Part 1</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-2/">BGP Routing Tutorial Series: Part 2</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-3/">BGP Routing Tutorial Series: Part 3</a></li> <li><a href="https://www.kentik.com/blog/bgp-routing-tutorial-series-part-4/">BGP Routing Tutorial Series: Part 4</a></li> </ul><![CDATA[Introducing BGP Peering Analytics in Kentik Detect]]><![CDATA[By mapping customer traffic merged with topology and BGP data, Kentik Detect now provides a way to visualize traffic flow across your network, through the Internet, and to a destination. This new Peering Analytics feature will primarily be used to determine who to peer (interconnect) with.
But as you’ll see, Peering Analytics has use cases far beyond peering.]]>https://www.kentik.com/blog/introducing-bgp-peering-analyticshttps://www.kentik.com/blog/introducing-bgp-peering-analytics<![CDATA[Dan Ellis]]>Tue, 26 Jan 2016 16:33:35 GMT<h2 id="new-kentik-detect-section-shows-more-than-pretty-pictures"><em>New Kentik Detect Section Shows More than Pretty Pictures</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/5T8MTBV5i8MQOaKeqgeUqK/7c1d18927dd4f3666440dbaa6d22fac2/Networking_concept-320w.jpg" class="image right no-shadow" style="max-width: 320px" alt="" /> <p>Back when we launched Kentik last June, I talked about the reasons that I was excited to get involved with the company (see <a href="https://www.kentik.com/20-years-of-flying-blind/">20 Years of Flying Blind</a>). Our initial task was to enable fast spelunking of network traffic data, both present-time and historical, to help operators see and understand what’s happening on their infrastructure. Once that was well underway it was time to provide a way to visualize traffic flow across your network, through the Internet, and to a destination. By mapping the customer’s traffic merged with topology and BGP data, we’ve now done that in Kentik Detect.</p> <p>Peering is typically the act of determining who one’s network should connect to and creating those relationships. So we’ve called this new feature Peering Analytics, because it will primarily be used to determine who to peer (interconnect) with. But as you’ll see, Peering Analytics — which launched in November 2015 and has now emerged from Beta into a full v1 release — has use cases far beyond peering.</p> <p><strong>BGP plus flow</strong></p> <div class="pullquote right">BGP analysis tells you a tremendous amount about how traffic gets where it’s going.</div> <p>Analyzing BGP paths can tell you a tremendous amount about how your traffic gets where it’s going. But until now, most BGP path tools have been limited to looking strictly at BGP paths. That allows you to see how you could get traffic to a given AS, but not whether you actually have traffic on any given path (or, if you do, how much traffic). What we did instead is develop a solution that takes advantage of all of the raw flow records (NetFlow, IPFIX, sFlow, pcap, etc.) that we collect in Kentik Data Engine (our clustered HA datastore) and merges them with the customer’s BGP data in realtime.
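<p>To illustrate the idea — this is just a sketch, not how KDE is actually implemented — the core of the merge is a longest-prefix join of flow records against the BGP table, aggregating traffic volume per AS_PATH. All of the records and ASNs below are hypothetical:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Sketch of attributing observed traffic volume to BGP paths.
import ipaddress
from collections import defaultdict

# BGP table: prefix -> AS_PATH learned for that prefix (made-up ASNs)
bgp_table = {
    ipaddress.ip_network("198.51.100.0/24"): (64501, 64510),
    ipaddress.ip_network("203.0.113.0/24"):  (64502, 64520, 64510),
}

# Flow records: (destination IP, bytes) pairs, as from NetFlow/sFlow/IPFIX
flows = [("198.51.100.7", 1_200_000), ("203.0.113.9", 300_000),
         ("198.51.100.99", 800_000)]

def as_path_for(dst_ip):
    """Longest-prefix match of a destination IP against the BGP table."""
    ip = ipaddress.ip_address(dst_ip)
    matches = [net for net in bgp_table if ip in net]
    return bgp_table[max(matches, key=lambda n: n.prefixlen)] if matches else None

bytes_per_path = defaultdict(int)
for dst, nbytes in flows:
    path = as_path_for(dst)
    if path:
        bytes_per_path[path] += nbytes

for path, nbytes in sorted(bytes_per_path.items(), key=lambda kv: -kv[1]):
    print(path, nbytes)  # e.g. (64501, 64510) 2000000</code></pre></div>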
So now you can see not only what paths are available to you, but what paths you’re actually using, and what your volume of traffic is on each.</p> <p>Our goal for this new peering feature was that users would be able to accomplish all of the following in a single top-level view:</p> <ul> <li>quickly notice path, peering, or traffic engineering anomalies;</li> <li>pick a specific peer, customer, or site and see a complete picture of where the traffic is coming from, passing through, and exiting;</li> <li>see at a glance which countries traffic is destined to;</li> <li>pick an individual traffic-infused BGP path out of a visualization and see all of the details including how it changed over time;</li> <li>potentially determine the cost involved in getting traffic where it’s going.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/4euIMfIes8EMa0qua2QEIm/0699b0f1c06bac4ee8937e75faf06802/Peering_Analytics-BGP-Paths-840w.png" class="image center" style="max-width: 840px" alt="Peering analytics - BGP paths" /> <p>Using the initial iteration of our peering analysis feature, you’ll be able to easily answer many of your most important questions about the traffic on your network:</p> <ul> <li>Who is my traffic going to? How is it getting there? Which country or region does it ultimately terminate in?</li> <li>With whom should I directly interconnect? Which transit provider should I buy my next circuit from?</li> <li>For a particular server, customer, or peer, how much traffic is there? Where is that traffic going to? How much is it costing me?</li> <li>Are my peers taking the traffic (and only the traffic) they should be?</li> <li>Is it time to add circuits to my network? If so, where should they go from/to (internal or external)?</li> <li>What opportunities are out there for our company, as an ISP, to sell to? Is there anyone that I’m not connected to but send traffic to?</li> </ul> <p><strong>Using Peering Analytics</strong></p> <p>Now that we’ve established the value of peering analytics in Kentik Detect, how do you actually use it? It basically boils down to just three steps, which we’ll go into in greater detail below:</p> <ul> <li>Generate a BGP dataset.</li> <li>Launch peering analysis on the BGP dataset.</li> <li>Use the filtering tools (dimensions, devices, interfaces, ASNs) to drill down into lower-level detail.</li> </ul> <div class="pullquote right">The flow records and BGP data in a dataset can be filtered further in the Peering sidebar.</div> <p>Generating a dataset involves narrowing the entire set of the flow and BGP data collected for your organization in KDE down to a subset that focuses on the time-range and devices that you’re interested in, which you can further filter based on exporting devices, minimum traffic volume per AS_PATH, source and destination AS, interfaces, and any of the 50+ filtering criteria found in the Data Explorer. We store this data in a “peering” aggregation format that allows quick exploration of paths.</p> <p>Once a dataset build is complete, you can launch peering analytics for that dataset. In the peering portal you’ll see tabs for five top-level visualizations of your traffic: BGP Paths, Transit ASNs, Last-Hop ASNs, Next-Hop ASNs, and Countries. Each of these tabs contains an interactive Sankey diagram — like the one shown above for BGP Paths — that you can click on to drill down on paths related to an individual AS. The tabs also contain line charts and tables.
For the BGP Paths tab, these appear like the following:</p> <img src="//images.ctfassets.net/6yom6slo28h2/1Yj3i2YbVWC2MamAYe8y4Y/de9902c84e0c593b9c2562e6c925de87/GK-BGP-paths-Line-840w.png" class="image center" style="max-width: 840px" alt="GK BGP paths" /> <img src="//images.ctfassets.net/6yom6slo28h2/3MhKb1JpIASSyE0gk8886A/ab1e663dbf36fff505c756c735ee9cbc/GK-BGP-paths-Table-840w.png" class="image center" style="max-width: 840px" alt="GK-BGP paths" /> <img src="//images.ctfassets.net/6yom6slo28h2/1vJi16UvvagiSOAoUo4ACI/0d8e7b83efe398286ad0a822a7ae4bfc/Peering-sidebar-320w.jpg" class="image left" style="max-width: 320px" alt="Peering sidebar" /> <p>While the tabs give a great deal of detail, you can narrow your view further by drilling down within the visualizations, by selecting or deselecting exporting devices, and by filtering lists of ASNs.</p> <p>One simple example of a typical <a href="https://www.kentik.com/kentipedia/what-is-internet-peering/" title="Kentipedia: What is Internet Peering?">peering</a> analytics use case would be to exclude the first-hop ASN from the visualizations so you can look at transit ASN traffic, allowing you to see candidate transit providers based on destination country. The last image below shows a line chart and table of traffic by Transit ASNs with the first hop ignored.</p> <p>It’s not necessarily hard to take a large set of data like NetFlow records and BGP routing updates, filter it down, and then compile a set of nice graphs. The problem is that once you compile, the dataset and all of the graphs are static. If you need a different perspective, you have to start over and compile a whole new dataset. That takes too much time to be practical as a way for network operators to get useful answers whenever they want. We had something more flexible in mind. Like our Data Explorer, we designed Peering Analytics to allow you to change views, apply filters, drill down, and keep adjusting until you see the data that answers your question.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5N6hkjyh0IkQQmG0uy4mak/b423f063568c942f650d019fd052a240/GK-Transit-Line-840w.png" class="image center" style="max-width: 840px" alt="GK transit line" /> <img src="//images.ctfassets.net/6yom6slo28h2/4XDHAMNZ5C8gQGseU6skCm/263ed8ae7195f9ee1060ab9f625417cc/GK-Transit-Table-840w.png" class="image center" style="max-width: 840px" alt="GK transit" /> <p><strong>The Analytics roadmap</strong></p> <div class="pullquote right">Peering analytics is part of a bigger context in which we’ll offer a wide variety of analytics using datasets.</div> <p>As we develop our Peering Analytics feature in subsequent versions, we’ll be looking at enhancements such as integration with PeeringDB, next-hop integration for East-West analysis, site-level aggregation, scheduled e-mail reporting with “difference information,” and ingress flow analysis with confidence information. Also, we’ve built peering analytics as part of a bigger context, which is that we want to be able to offer users a wide variety of analytics features that utilize datasets. That’s why we’ve built peering analytics as one component of a new Analytics section in our portal. The emphasis in this section is on revealing the big picture over the long term (e.g., looking at the last 30 days, or at the 95th percentile for an entire network) while also enabling a deep dive into lower-level details.
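<p>In case the 95th percentile (“p95”) is unfamiliar: it’s commonly computed by sorting the interval samples for the period and discarding the top 5%, so that brief spikes don’t dominate the figure. A generic sketch — not Kentik’s exact method — with hypothetical sample values:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Generic p95 computation over a month of traffic samples.
def p95(samples_mbps):
    """Sort the samples and discard the top 5%; the highest remaining
    sample is the p95 value."""
    ordered = sorted(samples_mbps)
    index = max(int(len(ordered) * 0.95) - 1, 0)
    return ordered[index]

# e.g. one sample per 5 minutes for 30 days = 8640 samples
samples = [42.0] * 8300 + [900.0] * 340  # mostly steady, a few big spikes
print(p95(samples))  # 42.0 -- the 340 spikes fall in the discarded top 5% (432 samples)</code></pre></div>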
Among the most promising possibilities to include in this section are Congestion Analysis, East-West Traffic Analysis, Threat Analysis, and CDN Analysis.</p> <p>We’ll cover key use cases of <a href="https://www.kentik.com/solutions/usecase/improve-peering-interconnection/" title="Learn more about using Kentik for peering analytics">Peering Analytics</a> in forthcoming blog posts. If you’re a content provider, ISP, wholesale service provider, or large enterprise with significant transit or peering connectivity, I think you’ll find a lot that can help you run your network better. So I hope you’ll give our team a chance to <a href="#demo_dialog">give you a demo</a> and get you started on a <a href="#signup_dialog">free trial</a>.</p><![CDATA[NetFlow, Cloud, and Data: Pacing Network Visibility With Growth]]>https://www.kentik.com/blog/netflow-cloud-and-big-data-making-sure-network-visibility-keeps-pace-with-growthhttps://www.kentik.com/blog/netflow-cloud-and-big-data-making-sure-network-visibility-keeps-pace-with-growth<![CDATA[Shamus McGillicuddy]]>Tue, 19 Jan 2016 14:00:04 GMT<p>In today’s hyperconnected world, the network sees all. Network operators can capture a tremendous amount of useful data from their infrastructure. The tricky part is extracting useful and actionable information from all that data.</p> <p><strong>Everything about your business traverses the network</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/18x2rhmfaGAIAAMQY6i260/f098f1cf8a918db32d83217bb4096498/all-seeing_eye-233x300.png" class="image center no-shadow" style="max-width: 233px" /> <p>Network data is a source of truth for everything in the modern business. Network data not only provides insight into the health and performance of IT infrastructure; it can also provide business intelligence. Enterprise Management Associates (EMA) research has found that enterprises are applying advanced analytics to network data for a variety of reasons, beginning with improved infrastructure capacity planning and technical performance monitoring.</p> <p>Flow records (NetFlow, sFlow, IPFIX, etc.), for instance, can provide network operators with significant insight into their infrastructure, especially when they apply advanced analytics.</p> <p><strong>Enterprises are crunching flow records with big data</strong></p> <p>EMA <a href="http://www.enterprisemanagement.com/research/asset.php/3056/EMA-Research-Report:-Big-Data-Impacts-on-IT-Infrastructure-and-Management">recently surveyed 156 enterprises</a> that are exporting IT infrastructure monitoring data into big data environments for advanced analysis. Forty-eight percent (48%) of these organizations considered flow records essential when applying advanced analytics to technical performance monitoring. Only application performance data was more important to a larger number (63%) of enterprises. What kind of monitoring are these enterprises doing? The most popular use cases were “network availability and performance monitoring” (53%), “systems availability and performance monitoring” (52%), and “storage availability and monitoring” (49%). Many enterprises (45%) are also applying these analytics to infrastructure optimization.</p> <p>For all the value that this analysis can bring to a network operator, the sheer volume of data that is collected from the network can be burdensome.
EMA research of enterprises exporting infrastructure monitoring data to big data environments has found that by 2017, 69% expect to be storing more than 50 terabytes of such data per day, with some even expecting petabyte volumes.</p> <p>Many network operators will elect to store this data in the cloud. In fact, EMA research found that 47% of cloud managers at these enterprises have seen their infrastructure planning and design practices impacted by the presence of big data projects, and 43% have seen cloud operations impacted. This indicates that many enterprises are turning to cloud-based big data storage and analytics to ease the burden on their internal infrastructure.</p> <p><strong>Kentik applies big data and the power of the cloud to network analytics</strong></p> <p>Kentik provides SaaS-based network analytics solutions. Taking a big-data approach to advanced network analytics, Kentik leverages cloud scaling to capture and store entire streams of raw network flow records, not just summary metadata. It analyzes these flows at a highly granular level for deep insight into network infrastructures that collectively represent huge carrying capacity, something the company describes as “terabit scale.” With this granularity, Kentik can detect bursts in activity missed by network monitoring platforms that rely on longer intervals for network data capture and analysis.</p> <p>All of Kentik’s flow capture and analysis happens in the cloud, which eases the burden of long-term data storage on local internal infrastructure. This cloud-scale service also allows network operators to delve into long-term data captures to figure out what’s happening with their infrastructure. They no longer have to rely on alerts and the analysis originally run on their network flows. They can dig in at any time and find answers to questions about infrastructure performance.</p> <p>EMA believes that, with experience, more and more enterprises will be exporting network data for advanced analysis. As this occurs, they will need the right solution providers in place to help them execute. Network operators will increasingly need platforms like Kentik to glean essential, actionable information from their network data.</p>
In particular, we saw that the Kentik Data Engine’s unrestricted capacity to store raw flow records (NetFlow, IPFIX, sFlow, etc.) allows you to start with a top-down view of all traffic and then use the portal’s toolset to see what’s going on from multiple angles. We filtered by source geo, zoomed into a time period, examined unique source IPs, and grouped traffic by destination IP. In the process we learned that a significant amount of attack traffic was coming from thousands of source IPs located in China and was aimed at a single destination IP: 10.10.10.10/32.</p> <p>In this post, we’ll use some additional techniques enabled by the Kentik Detect toolset to understand even more about this attack. Beyond academic curiosity, what’s the value of deeper analysis? To begin with, there are many different DDoS attack vectors, so knowing the type of an attack can help mitigate it effectively without potentially stepping on legitimate traffic. Further, while a DDoS attack may be the obvious problem, there may also be other sorts of attacks happening at the same time. Getting more detail helps ensure that you’re not ignoring a more pernicious threat.</p> <p><strong>Next analytical steps</strong></p> <p>Based on the large number of source IPs involved, it’s obvious that we’re dealing with a distributed denial of service (DDoS) attack. So it’s worth checking whether attack traffic is coming from other countries besides China. Starting where we left off in the previous post, we do this by removing our China-specific traffic filter (click the blue X at upper right of the filter specifying src_geo = CN). Lo and behold, there is indeed DDoS traffic coming from multiple countries including the U.S., Japan, Russia, Sweden, China, Taiwan, Brazil, and Estonia. Good to know.</p> <img src="//images.ctfassets.net/6yom6slo28h2/jdS7Vpw7jqgeoawMMicYk/2589838e7046ac17285e24d7c3388f0a/DDoS_3-Src_IPs_by_Geo.png" alt="DDoS_3-Src_IPs_by_Geo.png" class="image center" style="max-width: 600px;" /> <p>Next we want to know specifically what type of traffic we’re seeing and what that tells us about the type of attack on 10.10.10.10/32. First we’ll look at what protocol the traffic consists of. We can do this in multiple ways. In our source country analysis, our Units menu setting was <em>Unique Src IPs</em>; now, by choosing <em>Full » Proto</em> from the drop-down Metric menu, we can see how traffic is distributed amongst different protocols:<br> <img src="//images.ctfassets.net/6yom6slo28h2/7b5qpUSl3isos6aQaksCaE/f59fef9106fc0e95f5d2092b76f2ad7a/DDoS_3-Src_IPs_by_Proto.png" alt="DDoS_3-Src_IPs_by_Proto.png" class="image center" style="max-width: 600px;" /></p> <p>We can also easily shift to looking at per-protocol traffic measured in packets per second by choosing <em>Packets</em> from the Units menu:<br> <img src="//images.ctfassets.net/6yom6slo28h2/2OiNKbKbewCw8ckQ6wcMq0/e226f12f9f75603c51c5bd16c9d4999e/DDoS_3-Packets_by_Proto.png" alt="DDoS_3-Packets_by_Proto.png" class="image center" style="max-width: 600px;" /></p> <p>In both cases, it’s quite clear that we’re dealing with a UDP-based attack. So next, to shed further light on the nature of the DDoS attack, we want to know the port that this UDP traffic is aimed at.
To see that, we go back to the Metrics menu and choose <em>Destination » Port</em>.</p> <img src="//images.ctfassets.net/6yom6slo28h2/4xMkrdWakMsqCYUYiiccU2/66761c6eb17dfde10d2e70f7cc762dd9/DDoS_3-Packets_by_Port_dst.png" alt="DDoS_3-Packets_by_Port_dst.png" class="image center" style="max-width: 600px;" /> <p>So now we can see that the UDP traffic is being sent to multiple ports, and it’s obvious that we’re experiencing a DNS redirection/amplification attack occurring on port 53, with a lot of port 0 UDP packet fragments being generated as collateral traffic.</p> <p><strong>Behind the DDoS diversion</strong></p> <p>So far we’ve gotten a lot of insight into the details of the DDoS attack. But we don’t yet know whether that attack is actually the main event or simply a diversion from other, less obvious threats. We can get an initial feel for this by looking a little deeper at the ports involved. What we see is that in addition to 0 and 53 there’s a fairly high level of packets per second being thrown at port 4444 (green line in graph). Port 4444 is the UDP port registered to the Kerberos-related krb524 service, and is — at least for Windows machines — a <a href="https://isc.sans.edu/port.html?port=4444">well-known target</a> for buffer overflow attacks, often used to insert trojans like Hlinic and Crackdown.</p> <p>What’s interesting about this is that there are potentially two types of attacks going on in parallel: a DDoS attack and a buffer overflow trojan insertion. Many security blogs and publications have noted that in 2015 it became more common for DDoS attacks to be used as a way to <a href="http://www.esecurityplanet.com/mobile-security/9-enterprise-security-trends-for-2015.html">obfuscate other exploits</a>, and this may very well be an example of that technique.</p> <p>What we’ve learned so far is that by using Kentik Detect’s access to raw (unsummarized) flow records and the portal’s many flexible pivots for analysis we can find more than what a traditional pre-defined DDoS alert would have told us. This helps us not only to see and understand threats that we uncover by exploring, but also to apply these types of analyses in alerts that will generate notifications when a similar situation arises again.</p> <p><strong>Better-informed attack mitigation</strong></p> <p>It’s great to be able to satisfy our curiosity about anomalous traffic, but now that we recognize that we’re under attack and we understand the nature of that attack, the main focus shifts to insights that can inform our response. A straightforward analysis to give us some actionable intelligence would be to see all of the /24 source network addresses of the UDP traffic pointed at 10.10.10.10/32 from the source countries of the attack. Starting from our Packets/s by Port_dst view above, we do this by choosing <em>Source » IP/CIDR</em> on the Metric menu and entering 24 for the CIDR level.
Here’s the table from that view, which allows us to easily identify which /24s are the heaviest senders.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6gPWcK5kdOekyGwmACeUSK/869c377e73ebd071d67e4e8ce2aee4d6/DDoS_3-Src_24s_table-e1452304559176.png" alt="DDoS_3-Src_24s_table-e1452304559176.png" class="image center" style="max-width: 700px;" /> <p>Armed with this information, we’re now able to take a couple of concrete steps to mitigate the attack traffic:</p> <ul> <li>Ask our upstream ISP to drop this traffic.</li> <li>As an added precaution, drop all traffic from these countries going to port 4444 on our own routers.</li> </ul> <p><strong>Conclusion</strong></p> <p>Over the course of our posts on DDoS detection, we’ve seen how exploratory analysis can find attacks and exploits that may not have surfaced from an alert. Kentik Detect is exceptionally well-suited to these explorations because it stores complete, disaggregated flow records enhanced with GeoIP information and it also offers extensive analytical pivots, not only in the portal (as we’ve demonstrated) but also via psql client or our REST APIs. As a result, Kentik Detect enables much deeper insights about the nature of anomalies and attacks. While we began this analytical exercise with source geography as the starting point, I hope it’s become clear that there are many starting points as well as analytical options to use in understanding — and protecting — your network traffic.</p> <p>Read our solution brief to learn more about <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a>.</p><![CDATA[Designing for Database Fairness]]><![CDATA[Kentik Detect is powered by Kentik Data Engine (KDE), a massively-scalable distributed HA database. One of the challenges of optimizing a multitenant datastore like KDE is to ensure fairness, meaning that queries by one customer don't impact performance for other customers. In this post we look at the algorithms used in KDE to keep everyone happy and allocate a fair share of resources to every customer’s queries.]]>https://www.kentik.com/blog/designing-for-database-fairnesshttps://www.kentik.com/blog/designing-for-database-fairness<![CDATA[Ian Pye]]>Mon, 21 Dec 2015 14:34:26 GMT<h3 id="applying-multi-level-queues-in-multi-tenant-databases"><em>Applying Multi-level Queues in Multi-tenant Databases</em></h3> <p><img src="//images.ctfassets.net/6yom6slo28h2/2GWHGXWoN22Mgq2ewomImO/b9b451532e24963ae21ed4fdfc9dc76c/Goddess-of-Justice-crop-320w.jpg" alt="Goddess-of-Justice-crop-320w.jpg " class="image right" style="max-width: 300px;" />Under the hood, Kentik Detect is powered by Kentik Data Engine (KDE), a high-availability, massively-scalable, multi-tenant distributed database. One of the recurring challenges we encounter in our ongoing optimization of KDE is how best to address the need for fairness, which in computer science is another word for scheduling. What we want to achieve in practice is that queries by one customer will never cause another customer’s queries to run slower than a guaranteed minimum.</p> <p>We all know that slowness is bad, but what causes it in a distributed system? Traditionally, we are primarily concerned with four main resources: network, disk, memory, and CPU. Any of these, when exhausted, will cause a system’s throughput to plateau. Add more of this limited resource, and performance will increase until you hit the next resource limit.
This post will cover the algorithms we use in KDE to ensure that everyone stays happy (queries don’t time out) and every customer’s queries get their fair share of resources.</p> <p><strong>Subquerying by time</strong></p> <p>KDE is first and foremost a time series database, which means that there is always an implicit index on event time, queries always hit this index, and the selectivity of the index (how precisely it can identify a particular record) is extremely high. Every query made by a KDE user has a time-range predicate clause, expressed as a start and end time. Dividing this continuous time range into a series of discrete subranges provides us with a logical way to break potentially vast amounts of data into chunks that can then be operated on in parallel.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6i1bQUlOhiEcOWoQsMimue/7a0b745503cb85005b1743d0c2a00dea/Timeslice-sunset-320w.jpg" alt="Timeslice-sunset-320w.jpg " class="image right" style="max-width: 300px;" /> <p>To see how this plays out in practice we need to understand a little about the logical structure of KDE. As a given customer’s flow records and associated data are ingested into KDE, the data from each device is stored in two parallel “main tables,” one for a full resolution dataseries and another for a “fast” dataseries, which is optimized for faster execution of queries covering long timespans (see <a href="https://www.kentik.com/KB/KB_Articles/Ab04.htm">Resolution Overview</a> in the Kentik Knowledge Base). These tables are each divided into logical “slices” of one minute (for Full dataseries) or one hour (for Fast dataseries). KDE handles each such slice with its own discrete subquery.</p> <p>With time-based slices, the independence of subqueries holds because all of our aggregation operations — functions, like sum or min, that take multiple rows as input and return a single row — are commutative: (f(g(x)) = g(f(x))). Max of maxes and sum of sums are easy; for harder cases like mean of means, we rely on a more complex data structure to pass needed information (e.g. sample size) back to the top-level aggregation operator.</p> <p>You may recall from a previous blog post, <a href="https://www.kentik.com/metrics-for-microservices/">Metrics for Microservices</a>, that the KDE system is composed of master and worker processes. Worker processes run on the machines that store “shards,” which are on-disk physical blocks that each contain one slice from a given device’s main table. Every slice is represented in KDE by two identical shards, which, for high availability, are always stored on different machines in different racks. For a given worker to handle a given subquery the shard for the timespan covered by that subquery must reside on that worker.</p> <div class="pullquote right">A given worker can only handle subqueries that cover timespans whose shards are local.</div> <p>Masters, meanwhile, are responsible for splitting each query into subqueries, for identifying (via metadata lookup) which workers have access to the shards needed for a given subquery, and for assigning each subquery to a worker. The subqueries, which flow to each worker in a constant stream, are marked with two identifiers: customer_id (corresponds with customer) and request_id (corresponds with the query). The worker generates a result set from data in a shard and returns that result to the issuing master. 
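<p>To make the slicing concrete, here’s a simplified Python sketch of how a query’s time range might be carved into fixed-width slices, each becoming an independent subquery — hypothetical code, with slice widths matching the dataseries described above (one minute for full-resolution, one hour for fast):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Splitting a query's continuous time range into discrete subranges.
from datetime import datetime, timedelta

def slice_range(start, end, width):
    """Yield (slice_start, slice_end) subranges covering [start, end)."""
    cursor = start
    while cursor < end:
        nxt = min(cursor + width, end)
        yield cursor, nxt
        cursor = nxt

query_start = datetime(2015, 12, 9, 17, 0)
query_end   = datetime(2015, 12, 9, 18, 0)

# Full-resolution dataseries: one-minute slices -> 60 subqueries
full_res = list(slice_range(query_start, query_end, timedelta(minutes=1)))
# Fast dataseries: one-hour slices -> a single subquery
fast = list(slice_range(query_start, query_end, timedelta(hours=1)))

print(len(full_res), len(fast))  # 60 1</code></pre></div>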
Both the subqueries and the query as a whole have a deadline by which they must be serviced.</p> <p>Workers need three local resources to service a subquery: RAM, CPU, and Disk I/O. For purposes of scheduling, we can abstract these away and simply say that, for this post, the resources needed to service a subquery are represented as R.</p> <p><strong>Subquery fairness goals</strong></p> <p>To ensure fairness in the way that requests are serviced, KDE workers are designed to achieve three fairness-related goals:</p> <ol> <li>No customer can adversely affect another customer’s subquery response times. Adversely here means that the subquery misses its deadline. If all subquery deadlines are met, the overall query deadline will also be met.</li> <li>No request is allowed to starve the subqueries of other requests from the same customer.</li> <li>The system is work-conserving, meaning that if there is a subquery to run, it is run as quickly as possible, using all available resources.</li> </ol> <p>Goal number 3 eliminates a static reservation system where N is the number of customers and every subquery is allocated 1/N of every resource. Instead, we are forced to adopt an elastic global system where each subquery gets from 1 to 1/M of everything, with 1/M being the minimum fraction needed to complete a subquery before a deadline (R above).</p> <p>At this point you’re likely thinking: “If the goal is to run every subquery before its deadline, why not run the subqueries in order of arrival? After all, first in, first out makes for a happy queue.” This approach is known as EDF (earliest deadline first) — when every subquery carries the same deadline offset from its arrival, arrival order and deadline order coincide — and it turns out to be optimal in a uniprocessor system, where only one job can run at any given time. In other words, when there’s just a single processor, there’s no alternative scheduling algorithm that could process the workload in such a way that a deadline that would be missed under EDF would not also be missed under the alternative. For a formal mathematical proof of this, see FSU Professor Ted Baker’s post on <a href="http://www.cs.fsu.edu/~baker/realtime/restricted/notes/edfscheduling.html">Deadline Scheduling</a>.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/1SBdsGZTSIEiCQOug8wGQg/f829322cad6f2d1b93004585a6bd0f93/EDF-deadlines.png" alt="EDF-deadlines.png " class="image right" style="max-width: 520px;" />EDF is nice because it presents a very simple “greedy” algorithm. KDE, however, doesn’t run on a single processor; it runs on a lot of fancy servers, with R(total) > R(subquery). And that allows us to run many jobs at once. Using a simple counterexample, we can illustrate that EDF is non-optimal in a multiprocessor setting.</p> <p><strong>Forcing fairness with queuing</strong></p> <p>Based on the above we can see that ensuring fairness in a multiprocessor environment requires something fancier than simple EDF. What we came up with for KDE is that subqueries start in a queue of subrequests from a given request, are pulled into a queue of all subrequests for a given customer, and end up waiting for a processor in a queue with subrequests from other customers.
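<p>Before getting into the details, here’s a minimal Python sketch of that layered arrangement — hypothetical code that ignores deadlines, thread pools, and idle timeouts, but shows the rotation at each tier:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Three queue tiers: per-request FIFOs, a per-customer rotation over
# requests, and a rotation over customers feeding the process threads.
from collections import deque

queues = {}           # customer -> {request -> deque of subqueries}
requests = {}         # customer -> rotation of that customer's requests
customers = deque()   # rotation of customers with pending work

def enqueue(customer, request, subquery):
    queues.setdefault(customer, {}).setdefault(request, deque()).append(subquery)
    rot = requests.setdefault(customer, deque())
    if request not in rot:
        rot.append(request)
    if customer not in customers:
        customers.append(customer)

def next_subquery():
    """Least-recently selected customer, then least-recently selected
    request, then the oldest subquery in that request (FIFO)."""
    if not customers:
        return None
    customer = customers.popleft()
    request = requests[customer].popleft()
    subquery = queues[customer][request].popleft()
    # Put non-empty queues back at the end of their rotations.
    if queues[customer][request]:
        requests[customer].append(request)
    else:
        del queues[customer][request]
    if requests[customer]:
        customers.append(customer)
    return subquery

# Two requests from customer "a", one from "b": dispatch alternates between
# customers, and within "a" alternates between its two requests.
enqueue("a", "r1", "a-r1-s1"); enqueue("a", "r1", "a-r1-s2")
enqueue("a", "r2", "a-r2-s1"); enqueue("b", "r1", "b-r1-s1")
print([next_subquery() for _ in range(4)])
# ['a-r1-s1', 'b-r1-s1', 'a-r2-s1', 'a-r1-s2']</code></pre></div>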
This queue of queues of queues enables us to enforce fairness at the three points where subrequests advance from one queue to the next while still allowing us to ultimately reduce the problem to a uniprocessor model to which we can apply EDF and call it a day.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/dQhjq0e7aE4S8oGIiysQw/6d3a7c660fa84fd191653f47b730df13/Queing-diagram.png" alt="Queing-diagram.png " class="image right" style="max-width: 366px;" />The first instance of queueing is within a request (query); each request has its own queue. A request queue is spun up when the first subrequest (corresponding to an individual subquery) is received and is halted after 600 seconds without any subrequests. This queue enforces FIFO handling of subrequests from a given query.</p> <p>From here, a per-customer queue picks its subrequests from the least-recently selected (LRS) request queue that has an active subquery. Note that this request’s overall deadline is earliest because at the moment it hasn’t made any progress for the longest period of time. This combination ensures that all requests for a given customer make forward progress, and in essence are created equal (get equal throughput, on a weighted subrequest/sec basis).</p> <p>Next the subrequests in the per-customer queues are picked for process queues, where they wait for a thread that is available to actually process an individual subquery. “P” processing threads are created, where P varies depending on the capabilities of each server (P = 1/M = R). We apply the same allocation mechanism to picking from the customer queues as we do to the request queue. Whenever a process thread is free, it notifies a central dispatch service. This dispatcher then picks the least-recently selected customer queue that has an active subquery and adds that subquery to the process queue.</p> <p>The subrequests in each process queue are handled in order (EDF) by the next available process thread. All in all, this three-tiered queuing system ensures that the processors stay busy while also keeping any one customer or request from getting more than its fair share of processing power.</p> <p><strong>Master-worker scheduling</strong></p> <p>The fairness enforcement approach described above takes place within each worker. But to maximize throughput we also need to take into account how masters assign work to workers. As noted above, KDE is an HA system in which each shard is maintained on two different worker machines (if possible in two different racks). So two workers have access to the data needed for any combination of device and time-slice. The master has to decide which worker to use for each subquery to ensure fastest overall processing of a given query.</p> <div class="pullquote right">The master decides which worker for each subquery will ensure fastest processing of the overall query.</div> <p>Selecting the faster worker is complicated by the fact that workers run on heterogeneous hardware. Some boxes are beefy 4U monsters and some are not. In practice, we see that some hardware combinations are three to four times more efficient than others. 
A query isn’t finished until all of its subrequests are finished, so the slowest box gates the result: if half of a given query’s subqueries are sent to a fast box (2x faster) and half to a slow box, the overall query time will be about a third longer than if 75% of the subqueries are directed to the fast box.</p> <p>One approach to achieving the fastest aggregate throughput for all queries would be to exhaustively weight each worker so that the master can be smart about dispatching to ensure even load across workers. But in practice we’ve found that we can do an excellent job of keeping workers balanced by having each master keep local track of subqueries pending on each worker and prioritizing workers with the fewest outstanding subqueries.</p> <p>The awesome thing about this is that every master is keeping track on its own: there’s no global knowledge needed, just local knowledge. As more subqueries hit a box, the time subqueries take increases, so the chance of future subqueries going to this box goes down. We throw out all of the complications of sharing state across masters and having to statically weight workers, but we still get the real-time balancing we want.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/26raehld2kmAioK2GCUGg6/a66eb157132c0c83ec3c7c81497c24fd/Worker-Balance.png" alt="Worker-Balance.png " class="image right" style="max-width: 266px;" />The readout at right shows an example of this balancing at work: .88 is 4x faster than .187. Both have the same number of outstanding subqueries, but .88 is handling 4x the volume. Not bad for a greedy system with zero shared knowledge.</p> <p>Think this is fascinating stuff? <a href="https://www.kentik.com/careers/">We’re hiring</a>!</p><![CDATA[Using Kentik Detect to Find Current Attacks]]><![CDATA[With massive data capacity and analytical flexibility, Kentik Detect makes it easy to actively explore network traffic. In this post we look at how to use this capability to rapidly discover and analyze interesting and potentially important DDoS and other attack vectors. We start with filtering by source geo, then zoom in on a time-span with anomalous traffic. By looking at unique source IPs and grouping traffic by destination IP we find both the source and the target of an attack.]]>https://www.kentik.com/blog/using-kentik-detect-find-current-attackshttps://www.kentik.com/blog/using-kentik-detect-find-current-attacks<![CDATA[Jim Frey]]>Tue, 15 Dec 2015 15:50:11 GMT<h3 id="source-geography-analysis-in-data-explorer"><em>Source Geography Analysis in Data Explorer</em></h3> <p>In an earlier post, <a href="https://www.kentik.com/ddos-separating-friend-from-foe/">DDoS Detection: Separating Friend from Foe</a>, we showed how to use Kentik Detect to diagnose a potential DDoS attack when it’s already apparent that there’s anomalous traffic to a particular IP address. In this post, we’ll look at another way in which Kentik Detect can help protect against attacks, which is to actively explore network traffic to discover attacks that you might not otherwise realize are happening.
Specifically, we’ll look at how, starting with a simple source-geography analysis, we can use Kentik Detect’s Data Explorer to discover interesting and potentially important DDoS and other attack vectors.</p> <img src="//images.ctfassets.net/6yom6slo28h2/68nxPoHZSwG6kMSweMg4o4/e0fe92dddbf27f09002d3a150f884f4b/Total-traffic-in-bits-4days-833w.png" alt="Total-traffic-in-bits-4days-833w.png" class="image center" style="max-width: 600px;" /> When we log into the Kentik Detect portal, the first graph we see in the Data Explorer is going to look much like what most other NetFlow analysis tools show: a line graph of total traffic. Note that the example network that we’ll be looking at for this post is moving several tens of Gbps of traffic, so it’s representative of any ISP, hosting company, or online enterprise. As shown in the image above, there’s no obvious spike of traffic that indicates a massive DDoS attack or other exploit. But that doesn’t mean that there aren’t nefarious things going on at deeper levels of the network. <p>Fortunately, Kentik Detect has several features that make it easy to go beneath this surface picture:</p> <ul> <li><strong><em>Alerting:</em></strong> Kentik Detect enables you to craft custom SQL-based alerts that issue notifications on anomalous traffic so you can learn about potentially threatening conditions as they arise. For a look at using our alerting system to protect against threats, see our recent post <a href="https://www.kentik.com/detecting-hidden-spambots/">Detecting Hidden Spambots</a>.</li> <li><strong><em>Full-resolution data collection:</em></strong> Legacy network visibility tools only support drill-down to a limited set of pre-defined and calculated graphs and tables that often lead to analytical dead-ends. But with Kentik’s big data approach to flow records you’re no longer limited by those constraints. Kentik Detect allows you to analyze data on billions of flows, enriched with correlated data on BGP routing and geoIP.</li> <li><strong><em>Flexible analytical options:</em></strong> Kentik Detect provides a comprehensive set of options for shaping the results of query-generated graphs and tables, including grouping, device selection, time range, and units as well as filtering on dozens of different fields. As we go through this example, you’ll see that it’s very easy to change perspective on your data.</li> </ul> <p><strong>Getting Beneath the Surface</strong></p> <p>Now let’s look at putting Kentik Detect’s tools to work investigating suspicious traffic. As most of us are aware, most of the attacks on networks worldwide originate from just a few specific regions. With the querying capabilities of the Data Explorer, we can look at traffic that comes from a small set of countries that are typically considered most suspect. In this case, the network doesn’t typically get a lot of traffic from China, so we can use China as a test case to find anomalous traffic spikes.</p> <p>To show how this process works, let’s first create a visualization in the Data Explorer that’s narrowed to traffic coming only from China. We start by setting the Time Options (upper left) to a time range covering four days starting with the afternoon of December 7th and ending on the afternoon of December 11th, and checking that Group by Metric is set to Total Traffic and Units is set to Bits/second (both defaults).
We then set a filter for source geography in the left-hand sidebar:</p> <ol> <li><img src="//images.ctfassets.net/6yom6slo28h2/3sJryWWEhqae8808MCA8U/02814d6d59b7f13e1dfc0868bce1e282/Filter-src_geoCN-320.png" alt="Filter-src_geoCN-320.png" class="image right" style="max-width: 220px;" />Click Add Group. The Filters pane expands to show filter Group 1.</li> <li>Define a filter for traffic whose source is China:<br> - From the drop-down metrics menu, we choose Source Country.<br> - We leave the operator setting at “=” (default).<br> - In the value field, we enter CN for China.</li> <li>Click the Apply button at the upper right of the page. A graph is rendered and a corresponding table is populated below the graph that shows total traffic in bits per second coming from China.</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/4JYqmvq8ogY4kCQOUiYi4u/2fe4280f46a7b78d05d9d89d8162623d/Total-traffic-from-CN-830w.png" alt="Total-traffic-from-CN-830w.png" class="image center" style="max-width: 700px;" /> The graph shows two obvious spikes that are well above average. The first spike occurs from roughly 17:15 to 18:30 on December 9. We can zoom in by clicking on the graph and dragging across the time range that we want to focus on, at which point we can see that there are in fact two rapidly occurring large traffic spikes. The question is whether or not this indicates an attack. <img src="//images.ctfassets.net/6yom6slo28h2/39X4bkFW80uEKisEWIsIWe/f317eddd85a6e58cbbac13251be7d0d1/Total-traffic-in-bits-3hrs-830w.png" alt="Total-traffic-in-bits-3hrs-830w.png" class="image center" style="max-width: 700px;" /> <p><strong>Unique Source IP Analysis</strong></p> <p>To find out if these traffic spikes are suspicious, we’ll check if they are anomalous not only in terms of bits per second but also in terms of the number of source IPs. We look for a spike in source IPs by selecting Unique Src IPs from the drop-down Units menu, then clicking Apply. We can see in the newly rendered graph that there is in fact a huge increase in the number of unique source IP addresses sending traffic to a particular destination IP.<br> <img src="//images.ctfassets.net/6yom6slo28h2/2Ss7nXt1LyyGUkUG2oYckQ/fcc26114937b732e592aec815546ffa2/Unique-src-ips-832.png" alt="Unique-src-ips-832.png" class="image center" style="max-width: 600px;" /></p> <p><strong>Who’s Getting All that Traffic?</strong></p> <p>The next question is clearly: which IP or IPs are getting all this traffic from 14K or so individual host IPs? To get at this, we keep units at Unique Src IPs while choosing Destination » IP/CIDR from the drop-down Group by Metric menu. When we group traffic by destination IPs we can see that the target is one solitary destination IP address: 10.10.10.1. There’s really only one likely explanation for traffic from a suspect country to a single IP that suddenly spikes from nearly nothing to more than 1 Gbps: this is definitely an attack.<br> <img src="//images.ctfassets.net/6yom6slo28h2/4vYx7AemGsssWEcc4kSqIC/02dcc295de1314907522bc69080a622a/Src-IPs-per-device-by-dest-IP-840w.jpg" alt="Src-IPs-per-device-by-dest-IP-840w.jpg" class="image center" style="max-width: 600px;" /></p> <p><strong>Flexibility and capacity</strong></p> <p>Thanks to Kentik Detect’s analytical flexibility and massive data capacity, we’ve so far been able to rapidly filter by source geo, to zoom into a time period, to look at unique source IPs in our traffic, and to group traffic by destination IP.
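<p>Conceptually, the pivot we just performed is simple to express over raw flow records. Here’s a Python sketch — with a handful of hypothetical records standing in for the billions that Data Explorer actually queries — of filtering by source country and counting unique source IPs per destination:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text"># Filter by source country, then count unique source IPs per destination IP.
from collections import defaultdict

flows = [  # (src_ip, src_country, dst_ip) -- made-up sample records
    ("192.0.2.10", "CN", "10.10.10.1"),
    ("192.0.2.11", "CN", "10.10.10.1"),
    ("192.0.2.12", "CN", "10.10.10.1"),
    ("198.51.100.5", "US", "10.20.30.40"),
]

src_ips_per_dst = defaultdict(set)
for src_ip, src_country, dst_ip in flows:
    if src_country == "CN":                  # the src_geo = CN filter
        src_ips_per_dst[dst_ip].add(src_ip)  # group by destination IP

for dst, srcs in sorted(src_ips_per_dst.items(), key=lambda kv: -len(kv[1])):
    print(dst, len(srcs))  # 10.10.10.1 3 -- many sources, a single target</code></pre></div>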
If our access to data were limited by the same constraints that apply to legacy tools, we’d likely have to end our analysis here because we wouldn’t be able to access the underlying data needed to dig deeper. But it’s worth knowing more: was this just a volumetric attack, or was there more to it? We’ll explore that in our next DDoS-related blog post, where we’ll cover some interesting additional ways that Kentik Detect makes it easy to quickly change perspective on our data so we can gain deeper insight into what’s happening on our network.</p> <p>Read our solution brief to learn more about <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a>.</p><![CDATA[Detecting Hidden Spambots]]><![CDATA[If your network visibility tool lets you query only those flow details that you've specified in advance then you're likely vulnerable to threats that you haven't anticipated. In this post we'll explore how SQL querying of Kentik Detect's unified, full-resolution datastore enables you to drill into traffic anomalies, to identify threats, and to define alerts that notify you when similar issues recur.]]>https://www.kentik.com/blog/detecting-hidden-spambotshttps://www.kentik.com/blog/detecting-hidden-spambots<![CDATA[Jim Meehan]]>Thu, 03 Dec 2015 00:14:51 GMT<h3 id="protecting-your-network-with-spelunking-and-alerting-in-kentik-detect"><em>Protecting your network with spelunking and alerting in Kentik Detect</em></h3> <p>One of Kentik’s big ideas about network visibility is that it should be possible to identify and alert on any network anomaly even without advance knowledge of the specific threats you’re guarding against. That’s important because threat vectors are constantly evolving, so today’s network operations tools must excel at revealing anomalies as they occur and also at drilling down on unsummarized network data — flow records, BGP, SNMP — to uncover root causes.</p> <p>Kentik Detect is ideal in this regard because it’s able to ingest full resolution flow records (NetFlow v5/9, IPFIX, sFlow, etc.) at massive scale, to combine flow data with BGP and SNMP into a single, coherent datastore, and to make this unified data available within moments for responsive, multi-parameter querying. With that kind of power, it’s possible to find anomalies as they happen, not only via our sophisticated alerting system but also through real-time spelunking. Anomalies found in this way can then be used to define alerts that generate notifications upon future recurrences.</p> <h4 id="finding-spam-reflection">Finding spam reflection</h4> <p>One example of how to use Kentik Detect this way comes from a situation faced by one of our customers that is a large infrastructure provider. The issue was a new type of “spam reflection” that was causing headaches by allowing spammers to avoid detection while utilizing servers inside the customer’s network as spam sources. Here’s how these types of schemes work:</p> <ol> <li>The spammer sends a SYN from a spoofed source IP address to TCP port 25 (SMTP) on the target mail server. <strong><em>Note:</em></strong> Here at Kentik, we’re big fans of <a href="https://tools.ietf.org/html/bcp38">BCP-38</a> (a.k.a. anti-spoofing, a.k.a. ingress filtering). 
The spam reflection technique described here would be effectively neutralized if these best practices were pervasively employed.</li> <li>The mail server replies to the spoofed source IP, which is actually inside the infrastructure provider’s network. This “reflector” host is also under the control of the spammer (either compromised, or “procured” using stolen CC data).</li> <li>The reflector is configured with a generic routing encapsulation (GRE) tunnel to send the replies back to the real IP of the original source host.</li> </ol> <p>From the outside, the spam appears to be coming from the infrastructure provider’s network, but the provider itself never observes any outbound SMTP traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/5dddjvzVj2GKwEAscCMw0/ba755bf75ef0b02e0f24e4e6c1d64109/TCP-Reflection-Diagram-840w.png" class="image center" style="max-width: 840px" alt="TCP Reflection Diagram" /> <p>Traditional <a href="https://www.kentik.com/kentipedia/netflow-analysis/" title="Read “An Overview of NetFlow Analysis” in the Kentipedia">NetFlow analysis</a> systems may allow us to see either top SMTP receivers in the network or top GRE sources, but discovering which hosts are doing both would typically require correlation of the two lists. Manual correlation doesn’t make for an efficient network operations workflow, so correlation would typically be handled via custom scripts, which are fragile and tend to break over time.</p> <p>Fortunately, Kentik Detect makes it relatively easy to construct queries that can quickly perform the correlation required to identify network servers that are possibly being exploited in this way. Kentik Detect uses SQL as the query language, but the underlying data is stored in the Kentik Data Engine (KDE), a custom, distributed, column-store database. Multi-tenant and horizontally scaled, KDE typically returns query results in a few seconds or less even when querying billions or trillions of individual flow records. (For further insight into how we do this, see our blog post on <a href="https://www.kentik.com/postgresql-foreign-data-wrappers/">PostgreSQL Foreign Data Wrappers</a> by Ian Pye).</p> <h4 id="querying-with-sql">Querying with SQL</h4> <p>Now that we’ve got some background, let’s see how to use Kentik Detect to uncover the signs of spam reflection. We’ll do so with a set of queries that can be used manually and also built into an alert that sends notifications whenever this type of behavior occurs. First, we’ll write a query to get a list of all the hosts in the network that are receiving traffic from remote source port TCP/25 (the traffic described in Step 2 above). Here’s a breakdown of the elements of the query:</p> <table class="Table_Body"> <tbody> <tr class="Table_Head"> <td><b>SQL</b></td> <td><b>Description</b></td> </tr> <tr> <td class="Table_Code">SELECT ipv4_dst_addr,</td> <td>Return the destination IP.</td> </tr> <tr> <td class="Table_Code">SUM(both_bytes) AS f_sum_both_bytes</td> <td>Return the count of bytes. Aside from renaming the bytes column, we’re also passing a Kentik-specific aggregation function to the backend. See <a href="https://kb.kentik.com/Eb02.htm#Eb02-Subquery_Function_Syntax">Subquery Function Syntax</a> in the Kentik Knowledge Base for more information.</td> </tr> <tr> <td class="Table_Code">FROM all_devices</td> <td>Look across all flow sources (e.g.
routers).</td> </tr> <tr> <td class="Table_Code">WHERE ctimestamp &gt; 3600</td> <td>Look only at traffic from the last hour.</td> </tr> <tr> <td class="Table_Code">AND protocol = 6</td> <td>Look only at TCP traffic.</td> </tr> <tr> <td class="Table_Code">AND l4_src_port = 25</td> <td>Look only at traffic from source port 25 (SMTP).</td> </tr> <tr> <td class="Table_Code">AND dst_flow_tags ILIKE '%MYNETWORK%'</td> <td>Look only at traffic where the destination IP is inside our own network, as determined by it being tagged with a MYNETWORK tag. This condition makes use of Kentik’s user-defined tags functionality. See <a href="https://kb.kentik.com/Cb04.htm">Kentik Detect Tags</a> for more information.</td> </tr> <tr> <td class="Table_Code">GROUP BY ipv4_dst_addr</td> <td>Return one row per unique dest IP.</td> </tr> <tr> <td class="Table_Code">ORDER BY f_sum_both_bytes DESC</td> <td>Sort the list in order of number of bytes, highest first (descending).</td> </tr> <tr> <td class="Table_Code">LIMIT 10000;</td> <td>Limit the results to the first 10000 rows.</td> </tr> </tbody> </table> <p>Now we can run this query in the Kentik Detect portal’s Query Editor:</p> <ul> <li>Log in and choose Query Editor from the main navbar.</li> <li>Paste the above SQL into the SQL Query field, then click Submit.</li> </ul> <p>This image shows the query in the Query Editor along with the first few rows of results:</p> <img src="//images.ctfassets.net/6yom6slo28h2/5F9YdWax1KMAqmmewO2gaI/572214c591d676fd687536cb30d007ff/Query_Editor-SQL_1.png" class="image center" style="max-width: 840px" alt="Query Editor-SQL" /> <p>Next, we’ll build a query to find the top hosts inside the network that are sending the GRE (protocol 47) traffic discussed in Step 3 from above:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT ipv4_src_addr,
  SUM(both_bytes) AS f_sum_both_bytes
FROM all_devices
WHERE ctimestamp > 3600
  AND protocol = 47
  AND src_flow_tags ILIKE '%MYNETWORK%'
GROUP BY ipv4_src_addr
ORDER BY f_sum_both_bytes DESC
LIMIT 10000;</code></pre></div> <p>This second query looks very similar to the first, with just a few changes:</p> <ul> <li>We’ll select source IPs instead of dest IPs.</li> <li>The WHERE conditions involve protocol 47 this time instead of TCP source port 25.</li> </ul> <p>Now let’s combine those two queries using a SQL UNION statement, and wrap the combined query in an outer SELECT that will put SMTP and GRE into two separate columns.
<p>Now let’s combine those two queries using a SQL UNION statement, and wrap the combined query in an outer SELECT that will put SMTP and GRE into two separate columns. We’ll return the traffic volume as Mbps (instead of total bytes) and we’ll also limit the results only to hosts where Mbps > 0 for both traffic types.</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT ipv4_src_addr, ROUND(MAX(f_sum_both_bytes$gre_src_mbps) * 8 / 1000000 / 60, 3) AS gre_src_mbps, ROUND(MAX(f_sum_both_bytes$smtp_dst_mbps) * 8 / 1000000 / 60, 3) AS smtp_dst_mbps FROM (( SELECT ipv4_src_addr, SUM(both_bytes) AS f_sum_both_bytes$gre_src_mbps, 0 AS f_sum_both_bytes$smtp_dst_mbps FROM all_devices WHERE ctimestamp > 60 AND protocol=47 AND (src_flow_tags ILIKE '%MYNETWORK%') GROUP BY ipv4_src_addr ORDER BY f_sum_both_bytes$gre_src_mbps DESC LIMIT 10000) UNION ( SELECT ipv4_dst_addr,0 AS f_sum_both_bytes$gre_src_mbps, SUM(both_bytes) AS f_sum_both_bytes$smtp_dst_mbps FROM all_devices WHERE ctimestamp > 60 AND protocol=6 AND l4_src_port=25 AND (dst_flow_tags ILIKE '%MYNETWORK%') GROUP BY ipv4_dst_addr ORDER BY f_sum_both_bytes$smtp_dst_mbps DESC LIMIT 10000)) a GROUP BY ipv4_src_addr HAVING MAX(f_sum_both_bytes$gre_src_mbps) > 0 AND MAX(f_sum_both_bytes$smtp_dst_mbps) > 0 ORDER BY MAX(f_sum_both_bytes$gre_src_mbps) DESC, MAX(f_sum_both_bytes$smtp_dst_mbps) DESC LIMIT 100</code></pre></div> <p>The following image shows how the output looks in the Query Editor. The query found three on-net hosts that were receiving traffic from off-net TCP/25 sources and also sending outbound GRE traffic.</p> <img src="//images.ctfassets.net/6yom6slo28h2/1ZAtC6FXUEWmsG8cMa4oWI/1790d96d8cf8f834a22ef959f4bb4fef/Query_Editor-Results.png" class="image center" style="max-width: 840px" alt="Query Editor Results" /> <h4 id="from-query-to-alert">From query to alert</h4> <p>So far we’ve seen how straightforward it is to use SQL querying in the Query Editor to find anomalies that may indicate a threat to the network. With Kentik Detect’s alerting system we can turn this kind of one-time spelunking into an ongoing automated query that defends the network by notifying us when suspicious conditions arise. To see how this works in actual practice, let’s start with the query defined above and build it into an alert that runs the query every 60 seconds and notifies us if there are any results. We’ll begin by going to the Add Alert page in the Kentik Portal:</p> <ul> <li>From the Admin menu on the portal’s navbar, choose Alerts. You’ll go to the Alert List, a table listing all Alerts currently defined for your organization.</li> <li>Click the “Add Alert” button, which will take you to the Add Alert page, where you can define the settings that make up an alert.</li> </ul> <p>To use our SQL query as the basis of an alert, we’ll need to make a few modifications, as shown in the sketch below:</p> <ul> <li>We’ll rename the columns to include the strings “_key_”, “_out_”, and “_out2_”. The key is the column containing the item that we’re alerting about, which in this case is an IP address. The other two columns contain the values that we want to measure.</li> <li>We’ll replace any timeframe values in the query with a “%lookbackseconds%” variable. When the query runs, the alert engine will replace this variable with the value of the “Lookback window” setting. At least one reference to %lookbackseconds% is required.</li> </ul>
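<p>Applying those modifications to the combined query yields something like the following sketch. The exact output-column names here are illustrative rather than prescriptive; the important parts are the “_key_”, “_out_”, and “_out2_” strings embedded in the column names, and the %lookbackseconds% variable standing in for every hard-coded timeframe:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT ipv4_src_addr AS ipv4_src_addr$_key_,
  ROUND(MAX(f_sum_both_bytes$gre_src_mbps) * 8 / 1000000 / %lookbackseconds%, 3) AS gre_src$_out_mbps,
  ROUND(MAX(f_sum_both_bytes$smtp_dst_mbps) * 8 / 1000000 / %lookbackseconds%, 3) AS smtp_dst$_out2_mbps
FROM ((
  SELECT ipv4_src_addr, SUM(both_bytes) AS f_sum_both_bytes$gre_src_mbps, 0 AS f_sum_both_bytes$smtp_dst_mbps
  FROM all_devices
  WHERE ctimestamp > %lookbackseconds% AND protocol=47 AND (src_flow_tags ILIKE '%MYNETWORK%')
  GROUP BY ipv4_src_addr ORDER BY f_sum_both_bytes$gre_src_mbps DESC LIMIT 10000)
UNION (
  SELECT ipv4_dst_addr, 0 AS f_sum_both_bytes$gre_src_mbps, SUM(both_bytes) AS f_sum_both_bytes$smtp_dst_mbps
  FROM all_devices
  WHERE ctimestamp > %lookbackseconds% AND protocol=6 AND l4_src_port=25 AND (dst_flow_tags ILIKE '%MYNETWORK%')
  GROUP BY ipv4_dst_addr ORDER BY f_sum_both_bytes$smtp_dst_mbps DESC LIMIT 10000)) a
GROUP BY ipv4_src_addr
HAVING MAX(f_sum_both_bytes$gre_src_mbps) > 0 AND MAX(f_sum_both_bytes$smtp_dst_mbps) > 0
ORDER BY MAX(f_sum_both_bytes$gre_src_mbps) DESC, MAX(f_sum_both_bytes$smtp_dst_mbps) DESC
LIMIT 100</code></pre></div>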
<img src="//images.ctfassets.net/6yom6slo28h2/5SKUiPuHLyy8yI2SOs2Y2K/ff4213fca0d0c7de48fff22908e04ab6/Alert_Settings.png" class="image center" style="max-width: 840px" alt="" /> <p>Once we set the alert settings as shown in the image above, the alert engine will run this query every 60 seconds (the Lookback window), and generate a notification if we see any results 5 or more times (the Trigger threshold) over the course of a two-hour rolling window (the Event window). In addition to individual alert settings, we should also check our global alert settings (link at upper right of Alert List), which apply to all alerts for a given organization. As shown in the following image, these settings determine the details of alert notifications, including how the notifications are sent (via email, syslog, and/or JSON pushed to a user-specified URL).</p> <img src="//images.ctfassets.net/6yom6slo28h2/21QrbmtWuQOsS8qGkoSGAE/c81556e2ea6e56854a59b53728e94f18/Global_Alert_Settings.png" class="image center" style="max-width: 840px" alt="" /> <p>If our new alert is triggered, then events related to that alert will be listed in the Alerts Dashboard (choose Alerts from the portal’s navbar). Here’s a look at how these events will appear:</p> <img src="//images.ctfassets.net/6yom6slo28h2/3gb2QPFItai6GA4kEqSSei/8e07153ba46bfa8a0d2fb04056093ae6/Alert_Dashboard-840w.png" class="image center" style="max-width: 840px" alt="" /> <p>As described above, the querying and alerting capabilities of Kentik Detect are extraordinarily precise and flexible, enabling customers to quickly find activity of interest on their networks. If you’re not already experiencing Kentik Detect’s powerful NetFlow analysis visibility on your own network, <a href="#demo_dialog">request a demo</a> or start a <a href="#signup_dialog">free trial</a> today.</p><![CDATA[DDoS Detection: Separating Friend from Foe]]><![CDATA[DDoS attacks impact profits by interrupting revenue and undermining customer satisfaction. To minimize damage, operators must be able to rapidly determine if a traffic spike is a true attack and then to quickly gather the key information required for mitigation. Kentik Detect's Data Explorer is ideal for precisely that sort of drill-down.]]>https://www.kentik.com/blog/ddos-separating-friend-from-foehttps://www.kentik.com/blog/ddos-separating-friend-from-foe<![CDATA[Alex Henthorn-Iwane]]>Mon, 23 Nov 2015 16:50:57 GMT<p>Full traffic visibility to diagnose those nasty attacks</p> <p>In many organizations, networks are at the core of the business, enabling not only internal functions such as HR, supply chain, and finance but also the services and transactions on which the business depends for revenue. That makes network availability critical. Any interruption of access from the outside world turns off the revenue spigot, impacting profit and creating a bad user experience that can damage customer satisfaction and result in permanent loss of patronage. The worse the outage, the worse the damage. That’s why speed is so important in detecting, diagnosing, and responding to Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/41UU9alD1KYAca62YQGgAm/a6b76931b8044a53929fac10d298254c/151123-chess-set-520w.jpg" alt="151123-chess-set-520w.jpg" class="image right" style="max-width: 300px;" />One of the chief challenges in responding to an attack is to distinguish friend from foe.
Without a way to drill down into traffic details and examine host-level traffic behavior, it can be difficult to tell the difference. That’s why it’s critical to have a network visibility tool like Kentik Detect, which allows you to quickly filter for key attack metrics. The sooner you can determine where a traffic spike is coming from and going to, the sooner you can decide on the appropriate response. And even after an attack has passed, examining full resolution traffic data from that attack can still reveal information that can be applied to better prepare for future events.</p> <p><strong>About DoS and DDoS</strong></p> <p>Just to be sure we’re on the same page, a DoS attack is an attempt to make computing or network resources unavailable for normal usage, such as interrupting a host’s access to the Internet or suspending its services to the Internet. DoS becomes DDoS when the source of the attack is distributed, meaning that the attack comes from more than one unique IP address. DDoS attacks are commonly launched from “botnets” of compromised hosts that can number up into the thousands.</p> <div class="pullquote left">DDoS attacks are rapidly increasing in both frequency and size.</div> <p>It’s widely known that DDoS attacks are rapidly increasing in frequency and size. While mega attacks that last for many hours and reach 200 Gbps or more make the news, the vast majority of attacks last under an hour and are less than 1 Gbps in volume. Smaller attacks often happen without being noticed, though they may be harbingers of larger attacks to come. Mid-sized attacks are more readily felt, but distinguishing between a friendly surge in normal traffic and an attack is key to timely response. Large attacks are fairly obvious, and in these cases diagnosing the traffic is important to understand network entry points and sources. In all cases, a clear assessment is important to understand the best way to mitigate the attack.</p> <p>The most common form of DDoS is the volumetric attack, in which the intent is to congest all of the target network’s bandwidth. Roughly 90% of all DDoS attacks are volumetric, with application-layer attacks making up the remaining 10%. According to Akamai’s <a href="https://www.stateoftheinternet.com/downloads/pdfs/2015-internet-security-report-q1.pdf">Q1 2015 State of the Internet report</a>, over 50% of volumetric attacks are IP flood attacks involving a high volume of spoofed packets such as TCP SYN, UDP, or UDP fragments. A growing percentage of attacks are reflection and amplification attacks using small, spoofed SNMP, DNS, or NTP requests to many distributed servers to bombard a target with the much more bandwidth-heavy responses to those requests.</p> <p>In the last few quarters, Akamai and other Internet security observers have noted rapid growth of reflection attacks based on spoofed Simple Service Discovery Protocol (SSDP) requests sent to IP-enabled home devices with poorly protected Universal Plug-n-Play (UPnP) protocol stacks. SSDP reflection now accounts for over 20% of all volumetric DDoS attacks.</p> <p><strong>DDoS detection and analysis cases</strong></p> <p>Depending on your organization type (e.g. ISP, hosting company, content provider, or end-user organization), you may be concerned only with attacks that directly affect your resources. Or you might want to know about any attack traffic that’s passed — or is passing — through your network.
Either way, there are two general cases of DDoS analysis:</p> <ul> <li><em>Diagnosing</em> — You’ve already detected that something is amiss, for example one of your resources is experiencing service degradation, you’re seeing anomalous server log entries, or a circuit is unusually full. In this case, you’ll need to identify the traffic that’s causing the problem, and, if it’s not legitimate, to characterize the attack clearly enough to enable specific mitigating actions.</li> <li><em>Spelunking</em> — In this case you’re not aware of any current attacks but you want to explore your network traffic data to learn more about previous attacks (and maybe even find an attack in progress). We’ll cover this second mode of DDoS analysis in a separate forthcoming post.</li> </ul> <div class="pullquote left">Kentik Detect’s big data datastore gives you many ways to analyze volumetric traffic.</div> <p>In both of these analysis cases, the NetFlow and BGP information in Kentik Detect’s big data datastore gives you many ways to analyze volumetric traffic. In the following examples we’ll look at how this data can help you separate an attack from an innocent spike.</p> <p><strong>Diagnosis by Destination IP</strong></p> <p>For this first example, let’s say that you’re suddenly being alerted — e.g. by server overload alarms from your network management software, or by alert notifications from Kentik Detect — that the IP address “192.168.10.22” (anonymized here to protect the innocent) is getting hammered by anomalous traffic, indicating that it is possibly under attack. You’ll want to rapidly drill down on key characteristics of the suspect traffic to determine if it’s actually an attack, and if so to gather information that will help you to mitigate the attack quickly.</p> <p>As a starting point, we would go to the Data Explorer in the Kentik Detect portal. By clicking All in the device selection pane in the sidebar at left, and then the Apply button, we can see total network-wide traffic for all of the listed device(s). Since we know the IP address that we suspect is under attack, we can then use the Filters pane in the left-hand sidebar to filter the total traffic for that address:<br> <img src="//images.ctfassets.net/6yom6slo28h2/1bZ4d6pnDKCyyCMQeGMGuq/3c1d6e01d91752fdf13039e76822f438/151123-Filter-setting-for-IP.png" alt="151123-Filter-setting-for-IP.png" class="image right" style="max-width: 300px;" /></p> <ol> <li>Click Add Group. The Filters pane expands, showing the first filter group (Group 1).</li> <li>Define a filter for traffic whose destination IP is 192.168.10.22:<br> - From the drop-down metrics menu, we choose Dest IP/CIDR.<br> - We leave the operator setting at “=” (default).<br> - In the value field, we enter 192.168.10.22/32.<br> - The Filters pane then appears as shown at left.</li> <li>Click the Apply button at the upper right of the page. A graph is rendered (Fig. 2) and a corresponding table is populated below the graph (Fig. 3).</li> </ol> <p>Once we’ve applied this destination IP address filter, the resulting graph shows clearly that there is a significant, anomalous spike of traffic lasting over 20 minutes and continuing.<br> <img src="//images.ctfassets.net/6yom6slo28h2/2DwojHNRWIAMiis8K280cE/8debae385be2394fd436c62ca5105d54/151123-Tot-Traf-in-bits.png" alt="151123-Tot-Traf-in-bits.png" class="image right" style="max-width: 300px;" /></p>
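<p>Incidentally, the same first-pass check can be run outside the portal UI. Since Kentik Detect uses SQL as its query language, a minimal sketch of the equivalent query (using the anonymized address above, with an illustrative one-hour lookback) would look something like this:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT ipv4_dst_addr, SUM(both_bytes) AS f_sum_both_bytes FROM all_devices WHERE ctimestamp > 3600 AND ipv4_dst_addr = '192.168.10.22' GROUP BY ipv4_dst_addr</code></pre></div>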
<p><strong>Viewing by Source IP</strong></p> <p>Now we need to characterize this traffic. An abnormally large number of source IP addresses from atypical countries is indicative of a botnet, so we’ll look at traffic by source country, measured in unique source IP addresses:</p> <ol> <li>In the Group by Metric drop-down above the graph, choose Source » Country.</li> <li>In the Units drop-down, choose Unique Src Ip, then click Apply.</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/2sTf2Y3C9Say6umkoKaW8e/efdbe56ffb91f5d2c0c18015e79f6685/151123-Src-IPs-by-Geo.png" alt="151123-Src-IPs-by-Geo.png" class="image right" style="max-width: 300px;" /> <p>We can now see that there is a huge number of unique source IPs from China, the U.S., Vietnam, and other countries that are generating nearly a million packets per second in aggregate. Since this IP happens to be in the U.S. and doesn’t typically get traffic from Asia, that’s clearly suspicious. China is the biggest contributor to this suspect traffic, so we’ll isolate China and look at packets per second per source IP:</p> <ol> <li>In the Group by Metric drop-down, choose Source » IP/CIDR.</li> <li>In the Units drop-down, choose Packets/s, then click Apply.</li> </ol> <img src="//images.ctfassets.net/6yom6slo28h2/dvjZA7Fvm8OUoe4uAG2ys/abc0b8aebff783553dcdd2d1c901843d/151123-Packets-by-IP-Src-32.png" alt="151123-Packets-by-IP-Src-32.png" class="image right" style="max-width: 300px;" /> <p>This view validates that we’re looking at a rather large number of source IP addresses that are sending equivalent packets per second, which is indicative of a botnet.</p> <p><strong>Additional attack characteristics</strong></p> <p>Now that we’re pretty sure we’re under DDoS attack, it would be helpful to know a bit more about the traffic we’re being hit with. So next we’ll look at the protocol and destination port # of the traffic. First, in the Group by Metric drop-down, choose Full » Proto. We can see that the traffic is primarily UDP:<br> <img src="//images.ctfassets.net/6yom6slo28h2/4l1rGTpMO488yIA8UG8msY/328df0460bc30a6a44a4b306cc88b1ed/151123-Packets-by-Proto.png" alt="151123-Packets-by-Proto.png" class="image right" style="max-width: 300px;" /></p> <p>Next we’ll set the Group by Metric drop-down to Destination » Port to look at the destination port number.<br> <img src="//images.ctfassets.net/6yom6slo28h2/ZYH45ri86GkEMcmC6IeAu/5bfdc7603320b36c293eb2fe795bcffd/151123-Packets-by-Port-Dst.png" alt="151123-Packets-by-Port-Dst.png" class="image right" style="max-width: 300px;" /></p> <p><strong>Assessing mitigation options</strong></p> <p>When we look at the destination port # of the traffic, we can see that there is remarkable consistency, in that the vast majority of the UDP packets are going to port 3074, which is the Xbox protocol. Now we can be pretty certain that this is a botnet attack. Since this address otherwise doesn’t receive traffic from Asia, we can mitigate the majority of the attack by dropping this traffic from China and some of the other Asian countries.</p> <p>Remember, though, that our look at source countries listed the U.S. right after China, with over a hundred thousand packets per second. So to develop a complete mitigation plan we need to explore that issue next. Since this IP gets traffic from the U.S. under normal conditions, simply dropping traffic from the U.S. isn’t a good idea.
But what we can do is look at packets by source IP again, this time aggregated into /24s instead of /32 host addresses.<br> <img src="//images.ctfassets.net/6yom6slo28h2/2832dutfh6Ok2u86KA0oq2/94e76a6853438c321ba2c23c3812941d/151123-Packets-by-IP-Src-24.png" alt="151123-Packets-by-IP-Src-24.png" class="image right" style="max-width: 300px;" /></p> <p>We can see that there is a good number of /24s that are sending a fair amount of pps. So, one possible mitigation approach would be to rate-limit the pps from each of these /24s. Another mitigation would be to redirect traffic from these /24s to an internal or cloud-based scrubbing center.</p> <p><strong>Conclusion</strong></p> <p>There’s a large-scale dark market that trades in DDoS, and that market continuously innovates and evolves to meet new demand. With the nature of DDoS attacks constantly changing, network-centric organizations need an agile approach to DDoS detection. By offering complete visibility into network traffic anomalies, including both alerting and full-resolution drill-down on raw flow records, Kentik Detect enables operators to respond rapidly and effectively to each DDoS threat.</p> <p>Read our solution brief to learn more about <a href="https://www.kentik.com/resources/brief-kentik-ddos-detection-and-defense/" title="DDoS Detection and Defense from Kentik">Kentik DDoS Detection and Defense</a>.</p><![CDATA[Metrics for Microservices]]><![CDATA[Kentik Detect handles tens of billions of network flow records and many millions of sub-queries every day using a horizontally scaled distributed system with a custom microservice architecture. Instrumentation and metrics play a key role in performance optimization.]]>https://www.kentik.com/blog/metrics-for-microserviceshttps://www.kentik.com/blog/metrics-for-microservices<![CDATA[Ian Pye]]>Mon, 16 Nov 2015 14:00:25 GMT<h2 id="time-series-reporting-for-performance-optimization"><em>Time-series reporting for performance optimization</em></h2> <img src="//images.ctfassets.net/6yom6slo28h2/2y6t3KfwliuqgmEk0ecgmE/782d87761fb58e69804e13eed03a0979/mooreslaw-tombstone-500x250-e1447445570282.jpg" class="image right" style="max-width: 300px; margin-bottom: 20px;" alt="" /> <p>Once upon a time, life was simple. Programs ran in a single thread and performance was CPU bound. (I know this is a simplification, but bear with me as I try to make a point.) Using fewer instructions resulted in faster runtimes, and Moore’s Law could be counted on to reduce the cost of each instruction with every new CPU generation.</p> <p>Today, however, we live in scary times. Moore’s Law is effectively over. Applications now have to scale horizontally to meet performance requirements. And any time you have to deal with something more than a single box can handle, you introduce a whole host of complications (network, NUMA, coordination, and serialization, to name just a few).</p> <div class="pullquote left">KDE handles over 10B flow records/day with a microservice architecture that's optimized using metrics.</div> <p>Here at Kentik, our Kentik Detect service is powered by a multi-tenant big data datastore called Kentik Data Engine. KDE handles — on a daily basis — tens of billions of network flow records, ingestion of several TB of data, and many millions of sub-queries. To make it work at a cost that’s far below any existing solution, we’ve had to start off smart, with an in-house backend that can keep up with the volume of data and queries.
That means scaling horizontally, which involves a complex distributed system with a custom microservice architecture. And that leads us to metrics.</p> <p><strong>Instrumentation as a key requirement</strong></p> <p>When we designed KDE, one of our key enabling requirements was to have instrumentation for end-to-end monitoring of service delivery. My co-founder Avi Freedman’s experience at Akamai had already shown the value of having, for internal and customer use, functionality along the lines of <a href="https://www.akamai.com/us/en/multimedia/documents/technical-publication/keeping-track-of-70000-servers-the-akamai-query-system-technical-publication.pdf">Akamai Query</a>. Query, a combo real-time datastore and query engine, allows Akamai operations to monitor the health of their services and to “reach in” and query application components. Akamai uses it to support operational, performance, security, and efficiency use cases.</p> <div class="pullquote left">Done right, there is almost no such thing as too much instrumentation.</div> <p>We knew that a similar level of support for holistic end-to-end monitoring of the constituent components of user performance — application, application environment, and host as well as data center, WAN, and Internet networks — would be needed to enable KDE to ingest trillions of items, to return timely SQL query results, and to be compatible with the architecture that we’ve built for our customers. So it was critical to instrument every component leading to, around, and within our data engine. Done right, there is almost no such thing as too much instrumentation!</p> <p><strong>The life of a query</strong></p> <p>Consider briefly the life of a query in Kentik Detect. Using a psql client, our RESTful API, or our portal, a user accesses the KDE’s PostgreSQL-based frontend and runs the following query:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT src_as, sum(in_pkts) as f_sum_in_pkts FROM all_devices WHERE i_start_time >= now() - interval '5 min' GROUP BY src_as ORDER BY f_sum_in_pkts DESC LIMIT 10</code></pre></div> <p>In English, this is saying: give me the top 10 AS numbers sending traffic to my network over the last 5 minutes. Simple, right? Not so fast. All of the underlying data lives on distributed nodes across a cluster. To make the Kentik Detect SaaS offering work at scale on a multi-tenant platform, we needed to enable query rate-limiting and subquery result memoization. That means that a lot of work has to happen under the hood, all implemented as separate but cooperating services.</p> <p>Here’s what happens behind the scenes:</p> <ul> <li>The query is first validated against the PostgreSQL schema of the all_devices table (a pseudo-table merge of the network data from all of a customer’s devices sending flow records to KDE).</li> <li>From here it is sent to one of our middleware processes, chosen via Least Recently Used (LRU) from the pool of possibilities.</li> <li>The middleware:<br> - Runs a bunch of additional validations (Who is logged in? Does the user have access to the requested data? Etc…);<br> - Breaks the main query into five subqueries, one for each time unit covered by the main query (data is stored in 1-minute buckets).<br> - Further breaks the subqueries into one sub-subquery per device. Assuming that the user has 10 devices, this leaves us with 5 * 10 = 50 sub-subqueries (jobs), all of which must be fulfilled before the user sees anything.</li> <li>The master next connects to our metadata service and, for each job, selects one worker to service the job. Workers are processes that run on our storage nodes.</li> <li>The chosen worker gets a job, validates again that it has local data to run on, and, if valid, enqueues the job for processing.</li> <li>Another thread will pick up this job when it is ready to run (the scheduling algorithm for which will be the subject of a future blog post).</li> <li>The job is run, its result checked for errors, and sent back to PG via the middleware process that requested it.</li> <li>Once all 50 jobs return to PostgreSQL:<br> - The data is de-serialized, turned into PG tuples, and sorted;<br> - LIMIT and GROUP BY are applied;<br> - The top 10 results are displayed to the user.</li> </ul> <img src="//images.ctfassets.net/6yom6slo28h2/4AcHUYyRws2eqgSEkGCeGQ/034583594347c23d951e1cc48a5e98e4/Bowl_of_Queries.jpg" class="image right no-shadow" style="max-width: 200px; padding: 15px;" alt="" /> <p>The above is actually a simplified version of what really happens, because I cut out a lot of the annoying corner/failure cases that we have to handle!</p>
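<p>To make the decomposition concrete, here is a minimal sketch of the shape of one such sub-subquery: the same aggregation, but scoped to a single device and a single 1-minute bucket. This is purely illustrative (the internal job format is not something users ever see), and the per-device table name is hypothetical:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SELECT src_as, sum(in_pkts) AS f_sum_in_pkts
FROM router1   -- hypothetical: one device's data rather than the all_devices merge
WHERE i_start_time >= '2015-11-16 14:00:00'  -- one 1-minute bucket
AND i_start_time < '2015-11-16 14:01:00'
GROUP BY src_as</code></pre></div>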
<p>We’ve refined our production system to the point where the steps above all happen in a few hundred milliseconds. But things didn’t start out that optimized. So given all these moving parts, how did we figure out where performance was bottle-necking? Metrics.</p> <p><strong>Health checks and series metrics</strong></p> <p>As a first pass, all of our binaries use a healthcheck library. This allows us to connect to a local socket and get a static view of what’s going on at any given time, along with process version and build information. For example, here’s some sample output from an alert service:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">vagrant@chdev:~$ nc 127.0.0.1 10001
GOOD
Alert System: 20ce3d935b6988a15b7a6661c8b6198bd1afe419
Built on Linux 3.16.0-4-amd64 x86_64 [go version go1.5 linux/amd64] Debian GNU/Linux 8.1 (jessie) (Thu Oct 29 01:51:25 UTC 2015)
-------------------
1013 In Q: 0
1013 16 | V3 Base Q: 0
1013 16 | V3 Long Q: 0
1013 16 | V3 State Q: 0
1013 16 | V3 Rate Spot: 50.195457
1013 16 | V3 Rate Select Spot: 50.035864</code></pre></div> <br /> <div class="pullquote left" style="max-width: 250px;">Graphed, time series metrics make obvious at a glance if new code is helping.</div> <p>Note that the system is GOOD, queue depth is 0, and the flow rate is 50 flows/sec, of which all are being selected for further processing. This is useful, but really only in a binary sense: things are either completely broken, or working fine. Queue depths are max, or 0. Very rarely are they reported as somewhere in the middle, which in practice is where things get interesting. We also get a rate for events/second, but is this a local max? A local min?</p> <p>To dig deeper and get useful numbers, we need metrics reporting over time. Graphed, time series metrics make obvious at a glance if new code is helping. For example, consider this before-and-after image:</p> <img src="//images.ctfassets.net/6yom6slo28h2/7JJkzNwQnuqkyoeCkuA8GE/0c6de7a871b98476a53b4052c3e86bd0/runner-first-12-hours.png" class="image center" style="max-width: 730px" alt="" /> <p>The graphs show the time in milliseconds needed to init a query worker.
The top is before a fix, while the bottom version is after. Obvious, right? Up and to the right is not good when you are talking latency. But just try confirming this without a centralized metric system. Internally, we are doing this with OpenTSDB to store the metrics, a <a href="https://www.kentik.com/postgresql-foreign-data-wrappers/">PostgreSQL Foreign Data Wrapper</a> used as an ops interface and inter-component glue, and Metrilyx as the front-end. In the future we may change to a different time series datastore for the “simple” time series metrics. But OpenTSDB, when enhanced with some “operationalization” diapers like the SQL interface, works well for now, and it has rich integrations that we can leverage for monitoring hosts and application components.</p> <p><strong>Metrics tags in Go</strong></p> <img src="//images.ctfassets.net/6yom6slo28h2/XiwrWSoP8AuqYUsMAGgIS/3f11941e8cfc3389952cd6c964c473d0/appenginegophercolor-300x189.jpg" class="image right no-shadow" style="max-width: 220px" alt="" /> <p>Our backend code is mostly written in the Go language. We heavily leverage Richard Crowley’s excellent go-metrics library (<a href="https://github.com/rcrowley/go-metrics">https://github.com/rcrowley/go-metrics</a>), which is itself a port of Coda Hale’s and Yammer’s Java language library. The winning feature here is that Crowley’s code exports directly to OpenTSDB (along with Graphite and several other backends), saving us the trouble of writing a connector. One addition we did make, however, is the ability to register custom tags for each metric (e.g. node name and service name).</p> <p>The library offers five main types:</p> <ul> <li><strong>Counter</strong> — This is what it sounds like: just a number that goes to MAX_INT and then rolls over. Not so useful, we’ve found.</li> <li><strong>Gauge</strong> — Similar to a Counter, but goes up and down. Reports a single number, for example workers remaining in a thread pool. Again, not particularly useful because it lacks a time component.</li> <li><strong>Meter</strong> — Records a rate of events per second. Any time one is looking at a stream of events or jobs or processes, this is your friend.</li> <li><strong>Histogram</strong> — A gauge over time. The champ of metrics. Records percentile values. This allows us to say that the min depth of our worker pool is Y, while 95th percentile is Z.</li> <li><strong>Timer</strong> — Just like a Histogram, but optimized to take in durations of events. Using these, we track times for the max, min, 95th, 50th, etc. for each stage of our query pipeline.</li> </ul> <p><strong>Fewer surprises, happy customers</strong></p> <p>We run a microservice architecture, with components varying from microservice to “milliservice” in complexity. Every service is extensively instrumented, along with a whole host of server stats via the invaluable <a href="http://opentsdb.net/docs/build/html/user_guide/utilities/tcollector.html">project tcollector</a>. After every release, this setup allows us to compare before and after stats to see exactly what has changed. This makes for many fewer surprises, which makes for happy customers.</p> <p>One more thing: because we write code in Go, we get to cheat. Go supports real time performance profiling (both memory usage and time spent in each function). Even better, you can connect your profiler to a running process via HTTP (<a href="https://golang.org/pkg/net/http/pprof/">https://golang.org/pkg/net/http/pprof/</a>). Remember how all of our processes expose a health check port?
They also expose a PPROF HTTP port. At no runtime cost unless actively profiling, every binary allows us to tap in at any time and see where it is spending its time. How to use this will be covered in yet another blog post.</p> <p>Think some or all of this is fascinating? <a href="https://www.kentik.com/careers/">We’re hiring!</a></p><![CDATA[Moneyball Your Network with Big Data Analytics]]><![CDATA[The team at Kentik recently tweeted: "#Moneyball your network with deeper, real-time insights from #BigData NetFlow, SNMP & BGP in an easy to use #SaaS." There are a lot of concepts packed into that statement, so we thought it would be worth unpacking for a closer look.]]>https://www.kentik.com/blog/moneyball-your-network-with-big-data-analyticshttps://www.kentik.com/blog/moneyball-your-network-with-big-data-analytics<![CDATA[Alex Henthorn-Iwane]]>Mon, 19 Oct 2015 13:22:30 GMT<h3 id="turning-data-analytics-into-significant-competitive-advantage"><em>Turning data analytics into significant competitive advantage</em></h3> <p>Recently, the team at Kentik tweeted the following: “#Moneyball your network with deeper, real-time insights from #BigData NetFlow, SNMP &#x26; BGP in an easy to use #SaaS.” There are a lot of concepts packed into that statement, so I thought it would be worth unpacking for a closer look. We’ll use the Moneyball analogy as a way of exploring the vast, untapped potential of the telemetry data that’s being generated by telecom, cable, web enterprise, hosting, and other IP networks.</p> <img src="//images.ctfassets.net/6yom6slo28h2/6RaXiMKKOIgIAi2ucOEcGi/3495ccb9629f1d5a98e698e611dccefb/AdobeStock_76994623-300x200.jpeg" class="image right" style="max-width: 300px" alt="" /> <p>Most of you are probably familiar with the concept of Moneyball, which was popularized by Michael Lewis’ book (and later movie) of the same name. Lewis chronicled how Billy Beane, the GM of the Oakland Athletics, upended baseball operations with a new approach built on applying analytics to statistical data on player performance. As Lewis tells it, baseball was a field where tons of statistical data had been gathered and analyzed on an amateur basis for many decades, but most of that data was ignored by the professionals running baseball operations. By mining data that was copiously available but previously unleveraged, Beane gained competitive advantage for his team, and GMs across baseball began seeing the value hidden in arcane stats.</p> <p>Like baseball, network operations is a field in which a huge volume of data is available. IP networks emit a ton of statistical information — flow data (NetFlow, sFlow, IPFIX, etc.), SNMP, and routing data from BGP — that is routinely collected. And though the underlying reasons differ, network operations, like pre-Moneyball baseball operations, has yet to fully leverage this wealth of available data. The problem hasn’t been that the data has been discounted or ignored, but rather that traditional approaches available for handling the data are obsolete and ineffective, making it difficult to extract actionable insight.</p> <h4 id="architectures-without-big-data-scalability-make-you-guess-in-advance-what-questions-youll-need-answered-in-the-future">Architectures without big data scalability make you guess in advance what questions you’ll need answered in the future.</h4> <p>The key realization here is that network telemetry data is big data.
Any decent-size network, especially one that is multi-homed to the Internet, can easily generate millions or billions of telemetry records per day. Unfortunately, the techniques typically applied to ingest, process, and analyze that data are about 15 years out of date. Neither traditional on-premises enterprise software nor appliance-based architectures were built to handle the scale of network data happening today. In fact, they have so little capacity that in most cases you have to guess well ahead of time what kinds of questions you may need to ask in the future, so that the desired details can be summarized into pre-canned aggregates and reports. Most of the rest of the raw data is simply thrown away. But real life usually doesn’t fit neatly into pre-canned reports. By the time you realize that you need to ask different questions, it’s too late: the raw data is long gone.</p> <p>Big Data and cloud architecture change this picture. The trick is to build a system that truly addresses the need for both real-time operational insights and longer-term business planning. It’s easy to skew to one side or the other. If you lean towards answering longer-term questions, you’ll typically collect data from different sources discretely, and hold off on processing until query run-time. That approach preserves flexibility but isn’t fast enough to be operationally useful. On the other hand, you can design for real-time performance by limiting the questions that can be asked at any given moment, but then you don’t have much flexibility.</p> <p>The beauty of Kentik is that we’ve developed a third way, building a big data approach that covers both ends of the short vs. long spectrum. Kentik Data Engine (KDE), the datastore behind Kentik Detect, is architected for real-time ingest of full-resolution network data at Terabit scale. And as it ingests, it also brings diverse data types into a single, unified schema. That gives you a rich, deep set of detailed data on which to run real-time queries that ask any question you want. Your short-term “monitoring” questions might be something like “Is there an anomaly occurring right now?” or “Is anyone on our network on the receiving end of a DDoS attack?” The answers help you to improve operational performance and efficiency and to deliver better application and service performance. As for longer-term questions, you might want to know, “Who can I peer with to reduce my transit costs?” or “Are there organizations with significant traffic transiting my network that I can potentially convert into customers?”</p> <p>This is where we come back to Moneyball. You can only fully leverage the data emanating from your network if you’re able to access all of the raw detail in an environment that enables fast, flexible querying across both short- and long-term time-spans. Once you’ve got that, you can move aggressively to convert the answers to your questions into a boost in performance and ROI. And that’s what ultimately gives you the Moneyball effect: turning data analytics into significant competitive advantage.</p> <p>If you’re not sure that you have that kind of advantage going for your network, I encourage you to check out <a href="https://www.kentik.com">www.kentik.com</a> to get an idea of what I’m talking about. I’d also love to hear what you think. Hit me up via my LinkedIn or on Twitter at @heniwa.
Thanks!</p><![CDATA[Transcending MySQL-Based NetFlow Analysis]]><![CDATA[For many of the organizations we’ve all worked with or known, SNMP gets dumped into RRDTool, and NetFlow is captured into MySQL. This arrangement is simple, well-documented, and works for initial requirements. But simply put, it’s not cost-effective to store flow data at any scale in a traditional relational database.]]>https://www.kentik.com/blog/transcending-netflow-in-mysqlhttps://www.kentik.com/blog/transcending-netflow-in-mysql<![CDATA[Ian Pye]]>Mon, 12 Oct 2015 15:45:40 GMT<p><strong><em>Once upon a time, in a network monitoring reality not so far away…</em></strong></p> <p>I’ve known a good number of network engineers over time, and for many of the organizations we’ve all worked with or known, network monitoring has been distinctly old school.  It’s not because people or the organizations aren’t smart.  Quite the contrary—we were all used to building tools from open source software.  In fact, that’s how a lot of those businesses were built from the ground up.  Only, in network monitoring, there are limits to what can be done by the smart, individual engineer.  Here’s a common practice:  SNMP gets dumped into RRDTool, an open source data logging and graphing system for time series data. And NetFlow is captured into MySQL. This arrangement is attractive to network engineers because it’s simple, well-documented, works for initial requirements, and can grow to some small scale.</p> <p>When you run a Google search, a top result is a <a href="http://www.kelvinism.com/2008/12/netflow-into-mysql-with-flow-tools_5439.html">nifty blog post</a> that outlines a fast way to get this approach running with flow-tools, a set of open source flow tools. Compile with the “--with-mysql” configuration option, make a database, and you are off to the pub. If you scroll down below the post, however, you’ll find the following comment:</p> <blockquote> <p>“have planned.. but it´s not possible for use mysql records… for 5 minutos [minutes] i have 1.832.234 flows … OUCH !!”</p> </blockquote> <p>This comment underlines the limitations of applying a general purpose data storage engine to network data. Simply put, it’s not cost-effective to store flow data (e.g. NetFlow, sFlow, IPFIX, etc.) at any scale in a traditional relational database.</p> <p>Why is this so? To begin with, the write volume of flow records (especially during an attack, when you need the info the most) will quickly overwhelm the ingest rate of a single node. Even if you could successfully ingest raw records at scale, your next hurdle would be to calculate topK values on that data. That requires one of two options: either a sequence scan and sort or else maintaining expensive indices, which slow down ingest.</p> <p><img src="//images.ctfassets.net/6yom6slo28h2/rbcMSFGWIKeEI64KauE0C/e719af045d902e9a26d8ffd4216fb667/kentik_illustration_white_architecture-comparison-300x202.png" alt="kentik_illustration_white_architecture-comparison-300x202.png" class="image right" style="max-width: 300px;" />At this point the smart engineer says, “Ah ha! I will run flow sampling at a high rate.” Sampling will definitely help reduce the required ingest rate for flow records, but it comes with a serious drawback: you start to lose fidelity when your sample rate gets up into the range of 100,000:1. Based on past experience, extreme sampling can make it possible to load a few minutes of flow data into MySQL and run a quick “show me top IP” query.</p>
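<p>For concreteness, here is a minimal sketch of that kind of query, assuming a hypothetical flows table along the lines of what flow-tools would populate (the table and column names are illustrative):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">-- Top source IPs by bytes over the last five minutes.
-- With no covering index, this forces a full scan and sort of every row.
SELECT src_ip, SUM(bytes) AS total_bytes
FROM flows
WHERE start_time > NOW() - INTERVAL 5 MINUTE
GROUP BY src_ip
ORDER BY total_bytes DESC
LIMIT 10;</code></pre></div>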
<p>But you’re basically hoping the whole time that you get a result before running out of RAM. And when your site is down due to an attack, the last thing you want is to be running manual queries under those conditions.</p> <p>So what’s the solution? Rather than discarding data to accommodate old-school data storage, the smarter approach is to build a data storage system that handles the volume of data you need for effective <a href="https://www.kentik.com/resources/five-steps-to-network-observability-nirvana-webinar/">network visibility</a>. Then you can store trillions of flows/day, enjoy a reasonable sample rate, and run queries spanning a week or more of traffic, which can give you a much fuller view of what’s really going on. That idea — architecting a robust datastore for massive volumes of network data — was a key part of what we set out to do at Kentik as we developed the Kentik Data Engine (KDE), which serves as the backend datastore for Kentik Detect.</p> <p><strong><em>If Kentik existed back then…</em></strong></p> <p>As I think of the experience of many of my colleagues, we’ve dealt with networks that run multiple, distributed points of presence (PoP). Any organization with that kind of network footprint cares deeply about availability and security. A natural Kentik Detect deployment for this kind of network would be to install the Kentik agent and BGP daemon running locally in a VM in each PoP. The gateway router in each PoP would peer with the BGP daemon while exporting <a href="https://www.kentik.com/blog/flow-data-is-top-source-for-network-analysis/">flow data</a> and SNMP data to the agent. The agent would append BGP path data to each flow while also performing GeoIP lookups down to the city level for source and destination IPs. Every second, data for the last second would be sent via HTTPS to Kentik Detect.</p> <p>There are two major ways that the visibility provided by Kentik Detect would have benefited these networks. First, they would have had a more automated way of dealing with DDoS. Pretty much every significant network was and still is attacked constantly, and we all know stories where members of the operations team sometimes worked 24 to 48 hour shifts to manage the situation. In fact, this phenomenon was a common experience in web, cloud, and hosting companies. Adding insult to injury, sleep deprivation often led to simple mistakes that made things even harder to manage. The ability to automatically detect volumetric DDoS attacks and rate-limit aggressive IPs would have meant way better sleep. It also would have meant way better network operations.</p> <p>The other benefit of having Kentik Detect would have been the ability to look at long-term trends. Peering analysis can help web companies and service providers save tens to hundreds of thousands of dollars per month in network transit fees. To peer effectively, however, you have to be able to view traffic over a long time-frame. The old-school approach to data storage left us unable to work with a sufficient volume of data. Kentik Detect enables you to analyze months of traffic data, broken out by BGP paths (the series of network IDs each packet traverses when it leaves your network). So, Kentik’s ability to pair BGP paths with flow would have made peering analysis trivial to set up even though it involves looking at billions of flow records behind the scenes.
Kentik Detect could have easily picked out the top networks where peering would be most cost-effective, and that information could have saved a lot of money.</p> <p>Of course, what I’m describing isn’t necessarily history for a lot of folks — it’s still reality. If that’s the case for you or someone you know, take a look at how the <a href="https://www.kentik.com/product/kentik-platform/" title="Learn More About Our Platform: the Kentik Network Observability Cloud">Kentik Network Observability Cloud</a> can help you get to a better place.</p><![CDATA[Parsing Alert JSON]]><![CDATA[Kentik Detect's alerting system generates notifications when network traffic meets user-defined conditions. Notifications containing details about the triggering conditions and the current status may be posted as JSON to Syslog and/or URL. This post shows how to parse the JSON with PHP to enable integration with external ticketing and configuration management systems.]]>https://www.kentik.com/blog/parsing-alert-jsonhttps://www.kentik.com/blog/parsing-alert-json<![CDATA[Eric Graham]]>Fri, 02 Oct 2015 21:24:32 GMT<h3 id="how-to-parse-json-output-from-a-kentik-detect-alert"><em>How to Parse JSON Output from a Kentik Detect Alert</em></h3> <p>This how-to covers a simple approach to parsing the output from the Kentik alerting system so that you can shape it into whatever form(s) you need for your particular use-case(s). In this case the JSON is parsed with PHP, but the general idea is of course equally applicable to other languages such as Python, Perl, Ruby, or Go. This how-to is made up of the following topics:</p> <ul> <li><a href="#Fc02-Alerts_and_Alert_Output">Alerts and Alert Output</a></li> <li><a href="#Fc02-Step_by_Step">Step by Step</a></li> <li><a href="#Fc02-JSONparsing_PHP_Script">JSON-parsing PHP Script</a></li> <li><a href="#Fc02-Defining_an_Alert">Defining an Alert</a></li> <li><a href="#Fc02-JSON_Output">JSON Output</a></li> <li><a href="#Fc02-Parsed_Output">Parsed Output</a></li> </ul> <p><strong>Alerts and Alert Output</strong></p> <p>Kentik Detect’s alerting system is a powerful tool that informs users when network traffic meets a user-defined set of conditions. The alerting system supports several notification modes, including posting to customer-designated syslog and/or URL, which allows integration with external systems for functions such as ticketing and configuration management. The posted data, which provides information about the triggering conditions and the current status, is a set of JSON key:value pairs within an HTTP POST body. This how-to is intended for any customer or partner that is interested in processing this JSON notification data. It focuses on using PHP to parse the JSON and to write the desired values to a human-readable file on a web server.</p> <p>Kentik Detect customers use alerts to monitor various metrics in the data that is ingested into the Kentik Data Engine (KDE), including information on devices, interfaces, IP/CIDR, Geo, ASN, and ports. The alerting system is very flexible and can be set up to inform users about a variety of conditions, including DDoS attacks, the crossing of bandwidth thresholds (indicating the need to allocate more bandwidth), or changes in the status of device interfaces. The conditions are specified as a query that is run at a polling interval.
If the query returns matches a specified number of times within a specified duration, then the alert goes into alarm state, runs up to two user-specified supplemental queries to gather additional information, and begins generating notifications.</p> <p>The use-case example for this how-to is a DDoS attack. To follow the how-to you’ll need:</p> <ul> <li>A basic understanding of Kentik Detect alerts (for a full explanation please see <a href="https://kb.kentik.com/Ab10.htm"><strong>Alerts Overview</strong></a> in the Kentik Knowledge Base).</li> <li>A web server available for testing that is reachable from the Kentik Detect public SaaS deployment.</li> </ul> <p><strong>Step by Step</strong></p> <p>The flow of steps involved in parsing the alert notification JSON with PHP is as follows:</p> <ol> <li>Create a JSON-parsing PHP script similar to the script on GitHub that is referenced in JSON-parsing PHP script below.</li> <li>Create a test directory on your web server for your PHP script, check ownership and permissions, and copy the script to this directory, e.g.: <code class="language-text">http://your_domain.com/test/json-test.php</code></li> <li>In the Kentik Detect portal, go to Admin » Tags and confirm that a MYNETWORK tag is listed. This tag is typically defined with your network ASN or IP subnets. If you don’t already have one, see <a href="https://kb.kentik.com/Cb04.htm#Cb04-Add_a_MYNETWORK_Tag"><strong>Add a MYNETWORK Tag</strong></a>.</li> <li>In the Kentik Detect portal, go to Admin » Alerts » Global Alert Settings. In the Notification URL field, enter the full path to the location (from step 2) of your PHP script.</li> <li>Define a TCP Syn alert based on the queries provided in <a href="#Fc02-Defining_an_Alert"><strong>Defining an Alert</strong></a> below.</li> <li>On the web server where you put your PHP script, start a packet capture on port 80 so that you can see the pcap example of an HTTP POST with JSON: <code class="language-text">tcpdump -i eth0 port 80 -vvv -A -s0 -w /tmp/json-test.pcap</code></li> <li>In the Kentik Detect portal, go to the Alert Dashboard (click Alerts on the top navbar). When your alert enters alarm state it will be displayed as a row in the dashboard’s table, and an alert notification will be sent to the URL of the PHP script. At that point, stop the packet capture.</li> <li>To see the unparsed JSON in the notification from the alerting system (similar to JSON output below), you can read the JSON file: <code class="language-text">tcpdump -vvv -A -s0 -r /tmp/json-test.pcap</code></li> <li>In your web browser, open the file at the following URL to see the alert information extracted using PHP. The output should look like the example shown in Parsed output below: <code class="language-text">/tmp/testFile.txt</code></li> </ol> <p><strong>JSON-parsing PHP Script</strong></p> <p>This PHP script listens for HTTP POST messages, parses the JSON body and writes the results from the primary query and two supplemental queries to a file. This basic example can be extended to use the query results to activate ticketing/NOC notifications, mitigation policies, and more. The script file can be downloaded from the <a href="https://github.com/Kentik/integrations">Kentik integrations directory</a> on GitHub.</p> <p><strong>Defining an Alert</strong></p> <p>Needless to say, before we can test the PHP parsing script we have to have an alert that sends a notification to parse.
The following queries are for an example alert that returns the following types of values from its primary and supplemental queries:</p> <ul> <li>primary: destination IP, device name</li> <li>supplemental 1: layer 4 destination port</li> <li>supplemental 2: destination device interface</li> </ul> <p>Example primary query:</p> <div class="gatsby-highlight" data-language="sql"><pre class="language-sql"><code class="language-sql"><span class="token keyword">select</span> <span class="token function">now</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> i_device_name<span class="token punctuation">,</span> ipv4_dst_addr <span class="token keyword">as</span> ipv4_dst_addr $ _key_<span class="token punctuation">,</span> <span class="token function">round</span><span class="token punctuation">(</span><span class="token function">sum</span><span class="token punctuation">(</span>both_pkts<span class="token punctuation">)</span> <span class="token operator">/</span><span class="token operator">%</span> lookbackseconds <span class="token operator">%</span> <span class="token punctuation">)</span> <span class="token keyword">as</span> f_sum_both_pkts $ _out_pps<span class="token punctuation">,</span> <span class="token function">round</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token function">sum</span> <span class="token punctuation">(</span>both_bytes<span class="token punctuation">)</span> <span class="token operator">/</span><span class="token operator">%</span> lookbackseconds <span class="token operator">%</span><span class="token operator">/</span> <span class="token number">1000000</span><span class="token punctuation">)</span><span class="token operator">*</span><span class="token number">8</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f_sum_both_bytes $ _out2_mbps <span class="token keyword">from</span> all_devices <span class="token keyword">where</span> protocol <span class="token operator">=</span> <span class="token number">6</span> <span class="token operator">and</span> tcp_flags <span class="token operator">=</span> <span class="token number">2</span> <span class="token operator">AND</span> i_device_type <span class="token operator">!=</span> <span class="token string">'host'</span> <span class="token operator">AND</span> src_flow_tags <span class="token operator">NOT</span> <span class="token operator">LIKE</span> <span class="token string">'%MYNETWORK%'</span> <span class="token operator">and</span> ctimestamp <span class="token operator">></span> <span class="token operator">%</span> lookbackseconds <span class="token operator">%</span> <span class="token keyword">group</span> <span class="token keyword">by</span> i_device_name<span class="token punctuation">,</span> ipv4_dst_addr <span class="token keyword">having</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token function">sum</span> <span class="token punctuation">(</span>both_pkts<span class="token punctuation">)</span> <span class="token operator">/</span><span class="token operator">%</span> lookbackseconds <span class="token operator">%</span> <span class="token punctuation">)</span> <span class="token operator">></span> <span class="token number">20000</span><span class="token punctuation">)</span></code></pre></div> <p>Example supplemental query 1:</p> <div class="gatsby-highlight" 
data-language="sql"><pre class="language-sql"><code class="language-sql"><span class="token keyword">select</span> i_device_name<span class="token punctuation">,</span> l4_dst_port<span class="token punctuation">,</span> <span class="token function">sum</span><span class="token punctuation">(</span>both_bytes<span class="token punctuation">)</span> <span class="token keyword">as</span> f_sum_both_bytes <span class="token keyword">from</span> all_devices <span class="token keyword">where</span> ipv4_dst_addr<span class="token operator">=</span><span class="token string">'%key%'</span> <span class="token operator">and</span> protocol<span class="token operator">=</span><span class="token number">6</span> <span class="token operator">and</span> tcp_flags<span class="token operator">=</span> <span class="token number">2</span> <span class="token operator">and</span> ctimestamp <span class="token operator">></span> <span class="token operator">%</span>lookbackseconds<span class="token operator">%</span> <span class="token keyword">group</span> <span class="token keyword">by</span> i_device_name<span class="token punctuation">,</span> l4_dst_port <span class="token keyword">order</span> <span class="token keyword">by</span> f_sum_both_bytes <span class="token keyword">limit</span> <span class="token number">10</span></code></pre></div> <p>Example supplemental query 2:</p> <div class="gatsby-highlight" data-language="sql"><pre class="language-sql"><code class="language-sql"><span class="token keyword">select</span> i_device_name<span class="token punctuation">,</span> input_port<span class="token punctuation">,</span> <span class="token function">sum</span><span class="token punctuation">(</span>both_bytes<span class="token punctuation">)</span> <span class="token keyword">as</span> f_sum_both_bytes <span class="token keyword">from</span> all_devices <span class="token keyword">where</span> ipv4_dst_addr<span class="token operator">=</span><span class="token string">'%key%'</span> <span class="token operator">and</span> protocol<span class="token operator">=</span><span class="token number">6</span> <span class="token operator">and</span> tcp_flags<span class="token operator">=</span> <span class="token number">2</span> <span class="token operator">and</span> ctimestamp <span class="token operator">></span> <span class="token operator">%</span>lookbackseconds<span class="token operator">%</span> <span class="token keyword">group</span> <span class="token keyword">by</span> i_device_name<span class="token punctuation">,</span> input_port <span class="token keyword">order</span> <span class="token keyword">by</span> f_sum_both_bytes <span class="token keyword">limit</span> <span class="token number">10</span></code></pre></div> <p><strong>JSON Output</strong></p> <p>An example of pre-parsed notification JSON is shown below. The JSON is constructed as an array containing a set of elements, some of which may be key:value pairs and others of which may themselves be arrays, including arrays for the main query result and the results of the two supplemental queries. In the example, line breaks have been inserted between these query arrays to make them easier to see.</p> <p><strong><em>Note:</em></strong> For each return value in the supplemental query an additional array will be created under <code class="language-text">supl_sql_one_value</code> and <code class="language-text">supl_sql_two_value</code>. 
For instance, if an attack targets a single destination IP address but multiple destination ports, each destination port will be listed in its own sub-array, with the index <code class="language-text">[X]</code> incrementing for each entry. The example <code class="language-text">supl_sql_one_value</code> below includes one destination port, at index <code class="language-text">[0]</code>; if the attack had targeted two ports, a second sub-array at index <code class="language-text">[1]</code> would follow. To process each array, add logic to your PHP script that reads each entry in turn.</p> <div class="gatsby-highlight" data-language="sql"><pre class="language-sql"><code class="language-sql">Array <span class="token punctuation">(</span> <span class="token punctuation">[</span>message_type<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> CHALERT <span class="token punctuation">[</span>severity<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> CRITICAL <span class="token punctuation">[</span><span class="token keyword">type</span><span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> ALARM <span class="token punctuation">[</span>alert_id<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">3483</span> <span class="token punctuation">[</span>event_id<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">215027</span> <span class="token punctuation">[</span>key_value<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">38.118</span><span class="token number">.79</span><span class="token number">.72</span> <span class="token punctuation">[</span>key_name<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> ipv4_dst_addr <span class="token punctuation">[</span>out1_name<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> pps <span class="token punctuation">[</span>out1_value<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">137</span> <span class="token punctuation">[</span>out2_name<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> mbps <span class="token punctuation">[</span>out2_value<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">2</span> <span class="token punctuation">[</span>alert_start<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">1439312245</span> <span class="token punctuation">[</span>alert_end<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">0</span> <span class="token punctuation">[</span>notification_sent<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">1439312713</span> <span class="token punctuation">[</span>alert_name<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> JSON_test2 
<span class="token punctuation">[</span>alert_desc<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token punctuation">[</span>query_result<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> Array <span class="token punctuation">(</span> <span class="token punctuation">[</span>f_sum_both_bytes$_out2_mbps<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">2</span> <span class="token punctuation">[</span>f_sum_both_pkts$_out_pps<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">137</span> <span class="token punctuation">[</span>i_device_name<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> cat2_readnews <span class="token punctuation">[</span>ipv4_dst_addr$_key_<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">38.118</span><span class="token number">.79</span><span class="token number">.72</span> <span class="token punctuation">[</span>now<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">2015</span><span class="token operator">-</span><span class="token number">08</span><span class="token operator">-</span><span class="token number">11</span> T17: <span class="token number">04</span>:<span class="token number">53.804936</span>Z <span class="token punctuation">)</span> <span class="token punctuation">[</span>clear_comment<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token punctuation">[</span>main_sql<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token keyword">select</span> <span class="token function">now</span><span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> i_device_name<span class="token punctuation">,</span> ipv4_dst_addr <span class="token keyword">as</span> ipv4_dst_addr$_key_<span class="token punctuation">,</span> <span class="token function">round</span><span class="token punctuation">(</span><span class="token function">sum</span><span class="token punctuation">(</span>both_pkts<span class="token punctuation">)</span><span class="token operator">/</span><span class="token operator">%</span>lookbackseconds <span class="token operator">%</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f_sum_both_pkts$_out_pps<span class="token punctuation">,</span> <span class="token function">round</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token function">sum</span> <span class="token punctuation">(</span>both_bytes<span class="token punctuation">)</span><span class="token operator">/</span><span class="token operator">%</span>lookbackseconds<span class="token operator">%</span><span class="token operator">/</span><span class="token number">1000000</span><span class="token punctuation">)</span><span class="token operator">*</span><span class="token number">8</span><span class="token punctuation">)</span> 
<span class="token keyword">as</span> f_sum_both_bytes$_out2_mbps <span class="token keyword">from</span> all_devices <span class="token keyword">where</span> protocol<span class="token operator">=</span><span class="token number">6</span> <span class="token operator">AND</span> i_device_type <span class="token operator">!=</span> <span class="token string">'host'</span> <span class="token operator">AND</span> dst_flow_tags <span class="token operator">like</span> <span class="token string">'%MYNETWORK%'</span> <span class="token operator">and</span> ctimestamp <span class="token operator">></span> <span class="token operator">%</span>lookbackseconds<span class="token operator">%</span> <span class="token keyword">group</span> <span class="token keyword">by</span> i_device_name<span class="token punctuation">,</span> ipv4_dst_addr <span class="token keyword">having</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token function">sum</span> <span class="token punctuation">(</span>both_pkts<span class="token punctuation">)</span><span class="token operator">/</span><span class="token operator">%</span>lookbackseconds<span class="token operator">%</span><span class="token punctuation">)</span> <span class="token operator">></span> <span class="token number">30</span><span class="token punctuation">)</span> <span class="token operator">or</span> <span class="token punctuation">(</span><span class="token punctuation">(</span><span class="token function">sum</span> <span class="token punctuation">(</span>both_bytes<span class="token punctuation">)</span><span class="token operator">/</span><span class="token operator">%</span>lookbackseconds<span class="token operator">%</span><span class="token operator">/</span><span class="token number">1000000</span><span class="token punctuation">)</span><span class="token operator">*</span><span class="token number">8</span><span class="token punctuation">)</span> <span class="token operator">></span> <span class="token number">50</span> <span class="token keyword">order</span> <span class="token keyword">by</span> f_sum_both_bytes$_out2_mbps <span class="token keyword">DESC</span> <span class="token keyword">limit</span> <span class="token number">2</span> <span class="token punctuation">[</span>supl_sql_one<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token keyword">select</span> i_device_name<span class="token punctuation">,</span> l4_dst_port<span class="token punctuation">,</span> <span class="token function">sum</span><span class="token punctuation">(</span>both_bytes<span class="token punctuation">)</span> <span class="token keyword">as</span> f_sum_both_bytes <span class="token keyword">from</span> all_devices <span class="token keyword">where</span> ipv4_dst_addr<span class="token operator">=</span><span class="token string">'%key%'</span> <span class="token operator">and</span> protocol<span class="token operator">=</span><span class="token number">6</span> <span class="token operator">and</span> ctimestamp <span class="token operator">></span> <span class="token operator">%</span>lookbackseconds<span class="token operator">%</span> <span class="token keyword">group</span> <span class="token keyword">by</span> i_device_name<span class="token punctuation">,</span> l4_dst_port <span class="token keyword">order</span> <span class="token keyword">by</span> f_sum_both_bytes <span class="token keyword">limit</span> <span 
class="token number">10</span> <span class="token punctuation">[</span>supl_sql_two<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token keyword">select</span> i_device_name<span class="token punctuation">,</span> l4_dst_port<span class="token punctuation">,</span> <span class="token function">sum</span><span class="token punctuation">(</span>both_bytes<span class="token punctuation">)</span> <span class="token keyword">as</span> f_sum_both_bytes <span class="token keyword">from</span> all_devices <span class="token keyword">where</span> ipv4_dst_addr<span class="token operator">=</span><span class="token string">'%key%'</span> <span class="token operator">and</span> protocol<span class="token operator">=</span><span class="token number">6</span> <span class="token operator">and</span> ctimestamp <span class="token operator">></span> <span class="token operator">%</span>lookbackseconds<span class="token operator">%</span> <span class="token keyword">group</span> <span class="token keyword">by</span> i_device_name<span class="token punctuation">,</span> l4_dst_port <span class="token keyword">order</span> <span class="token keyword">by</span> f_sum_both_bytes <span class="token keyword">limit</span> <span class="token number">10</span> <span class="token punctuation">[</span>supl_sql_one_value<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> Array <span class="token punctuation">(</span> <span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> Array <span class="token punctuation">(</span> <span class="token punctuation">[</span>f_sum_both_bytes<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">5849088</span> <span class="token punctuation">[</span>i_device_name<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> cat2_readnews <span class="token punctuation">[</span>l4_dst_port<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">0</span> <span class="token punctuation">)</span> <span class="token punctuation">)</span> <span class="token punctuation">[</span>supl_sql_two_value<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> Array <span class="token punctuation">(</span> <span class="token punctuation">[</span><span class="token number">0</span><span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> Array <span class="token punctuation">(</span> <span class="token punctuation">[</span>f_sum_both_bytes<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">5849088</span> <span class="token punctuation">[</span>i_device_name<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> cat2_readnews <span class="token punctuation">[</span>input_port<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token number">30</span> <span class="token 
punctuation">)</span> <span class="token punctuation">)</span> <span class="token punctuation">[</span>debug<span class="token punctuation">]</span> <span class="token operator">=</span><span class="token operator">></span> <span class="token punctuation">)</span></code></pre></div> <p><strong>Parsed Output</strong></p> <p>The following example shows the parsed output you can expect to see in testFile.txt. Because supplemental queries are specified in the alert that’s generating the notification, the parsed output includes values from the arrays corresponding to the supplemental queries (see the pre-parsed output in <a href="#Fc02-JSON_Output"><strong>JSON Output</strong></a>):</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">ALARM CRITICAL alert_id: 3483 event_id: 350074 key name: ipv4_dst_addr key value: 204.186.16.136 device_name: gateway_nyc_xyz_net l4_dst_port_sup1-1: 0 l4_dst_port_sup1-2: device_input_int_sup2-1: 30 device_input_int_sup2-2: source_IP_addresssup1-1: (not used for this test) source_IP_addresssup1-2: (not used for this test)</code></pre></div> <p>So there you have it: a simple approach to using the output from the Kentik alerting system. By modifying and/or extending the PHP script you can shape the output into whatever form you need for your particular use-case. For further information on the possible use cases and how to implement, contact your Kentik Solutions Engineer.</p><![CDATA[A Compelling Cloud Approach to Network Visibility]]><![CDATA[Taken together, three additional attributes of Kentik Detect — self-service, API-enabled, and multi-tenant — further enhance the fundamental advantages of Kentik’s cloud-based big data approach to network visibility.]]>https://www.kentik.com/blog/a-compelling-cloud-approach-to-network-visibilityhttps://www.kentik.com/blog/a-compelling-cloud-approach-to-network-visibility<![CDATA[Alex Henthorn-Iwane]]>Thu, 24 Sep 2015 19:34:40 GMT<p>I’m excited to be writing my first blog post as VP Marketing at Kentik. I’ve spent most of my career in networking or network management, including stints at Packet Design, a routing and traffic analytics company, as well as CoSine Communications and Lucent Technologies (by way of Livingston Enterprises). Working at Kentik allows me to apply those experiences at a startup with an exceptionally compelling story: Kentik is rewriting the rules of network visibility with a cloud service driven by big data technology.</p> <p>Before Kentik, network visibility has been dominated by single or multi-tier appliance architectures. As the volume of network metric data grows exponentially, the inadequacy of these prior approaches has become obvious. Kentik Detect, on the other hand, uses a big-data engine running on a scale-out, back-end infrastructure cluster, and is designed for either SaaS (public cloud) or on-premises (private cloud) deployment. By leveraging cloud efficiencies, Kentik Detect delivers scale, speed, time-to-value, and affordability that can’t be matched by older designs.</p> <p>The importance of speed and scale for effective visibility has been discussed in previous posts from Kentik’s founders and from Jim Frey, our VP Product. So I thought I’d reflect on Kentik from a slightly different angle: the additional, valuable reasons for a cloud approach to network visibility. 
If we define “cloud” in the broadest sense as self-service, API-enabled, and multi-tenant, we can see why Kentik Detect is so compelling.</p> <p><strong>Self-service friendly</strong></p> <p>Kentik Detect not only ingests vast amounts of network data, it also offers customers multiple points of self-service access to that data: via Postgres, REST API, or the web portal. The meaning of self-service at Kentik is deep, because even in the web UI there’s a distinct emphasis on enabling a high degree of flexibility. Contextual, nested selection and filtering of data is available throughout. To boot, even if you’re not a SQL nerd, the SQL behind every portal-based analysis can be accessed and passed along to others with more SQL expertise, so they can self-serve the data programmatically without having to write queries from scratch. Pretty nice.</p> <p><strong>API-enabled</strong></p> <p>I learned the importance of APIs early on when I worked on creating a “carrier-class” RADIUS AAA policy server product back in my days at Livingston Enterprises and Lucent. The standard way of configuring policies at that time was basically an ACL, and initial commercial products simply wrapped a GUI around that concept. But that wasn’t a solid enough foundation on which to build a large services business. So the wizards in our team came up with a plug-in workflow approach that became Alcatel-Lucent’s PolicyFlow™ scripting language. Sure, we had a GUI for simpler use cases, but that scripting language allowed our customers to use the product to implement new service creation in a very unconstrained fashion.</p> <p>In my observation, the value of programmatic approaches like the above is often overlooked. In most network management products, for example, the focus is on the GUI, with the API being at best an afterthought. Very few network management products dogfood their API, placing the API way behind the functionality and performance curve. For example, I once worked closely with a customer that had built a native language portal for its network ops team to get at network management data, but they never built that capability into an API. I don’t mean to knock the engineering or technical sales teams; they often complained about neglecting the API. It was largely a matter of habits and perceived business priorities. But it was also because the GUI was totally separate from the API, which is quite typical of enterprise software. You only have so many engineering hours to go around, and the GUI simply has to work, so the API gets pushed down the stack over and over again.</p> <p>At Kentik, the GUI and the API are one. SQL queries are the foundation. This is a beautiful thing because, unlike some products where the query language is proprietary, SQL is universal and standard. Both the REST API and the GUI are derived directly from SQL queries. Not only is that incredibly efficient from an engineering point of view, but it also means that all three are always on par with each other. It means that Kentik transcends “dogfooding” because it’s all steak (interpret that as a carnivorous or vegan steak as you wish). Furthermore, with continuous deployment, new features like extended query filtering options show up in the UI, but since they’re all built on SQL, previously written queries will always keep working.</p> <p><strong>Multi-tenant</strong></p> <p>Multi-tenancy is, of course, one of the prerequisites for SaaS. 
But you won’t find true multi-tenancy in offerings that masquerade as cloud-based but are actually a series of separate, single-tenant enterprise software instances deployed on VMs on either a public cloud or private cloud IaaS. Multi-tenancy and true cloud require some level of standardization of the service offering, even when each tenant’s service can be heavily customized to meet specific use cases. That’s the case for Kentik Detect, which is important because it means that all end-users directly benefit from Kentik’s continuous deployment of new functionality. Moreover, if you’re looking to offer Kentik Detect as an embedded service to your end-users, you’re not looking at some brutally awkward deployment architecture. It will work because it was designed to do that.</p> <p>Taken together, these three additional attributes of Kentik Detect — self-service, API-enabled, and multi-tenant — further enhance the fundamental advantages of Kentik’s cloud-based big data approach to network visibility. And that’s the take-away for this first post. It was fun writing it; I’m having fun being here already. And I’m looking forward to speaking with you about Kentik Detect, and how network visibility in the cloud can enable your organization to maximize the value of its network data.</p><![CDATA[PostgreSQL Foreign Data Wrappers]]><![CDATA[Relational databases like PostgreSQL have long been dominant for data storage and access, but sometimes you need access from your application to data that's either in a different database format, in a non-relational database, or not in a database at all. As shown in this "how-to" post, you can do that with PostgreSQL's Foreign Data Wrapper feature.]]>https://www.kentik.com/blog/postgresql-foreign-data-wrappershttps://www.kentik.com/blog/postgresql-foreign-data-wrappers<![CDATA[Ian Pye]]>Fri, 11 Sep 2015 16:34:28 GMT<h2 id="extending-a-postgresql-datastore-with-fdws">Extending a PostgreSQL Datastore With FDWs</h2> <p>This how-to looks at using the Foreign Data Wrapper feature of PostgreSQL to enable access from your application to data that’s not in a PostgreSQL relational datastore. The how-to covers the following topics:</p> <ul> <li><a href="#Fc01-About_Foreign_Data_Wrappers">About Foreign Data Wrappers</a></li> <li><a href="#Fc01-Example_Implementation_WhiteDB">Example Implementation: WhiteDB</a></li> <li><a href="#Fc01-Setting_the_Environment">Setting the Environment</a></li> <li><a href="#Fc01-Into_the_Code">Into the Code</a></li> <li><a href="#Fc01-Some_Tricky_Bits">Some Tricky Bits</a></li> <li><a href="#Fc01-And_Beyond">And Beyond…</a></li> </ul> <h3 id="about-foreign-data-wrappers">About Foreign Data Wrappers</h3> <p>Relational databases like PostgreSQL (PG) have long been dominant for data storage and access, but sometimes you need access from your application to data that’s either in a different database format, in a non-relational database, or not in a database at all. In PostgreSQL, this capability is provided by Foreign Data Wrappers (FDWs), which support pluggable data backends. FDW backends can be a surprisingly powerful tool when your data model isn’t classically relational but you still want all the nice things that come with PostgreSQL (aggregates, client libraries, authentication, group by, etc.).</p> <p>Leveraging PostgreSQL’s support for ANSI SQL and secure client libraries like JDBC and ODBC, FDWs in PG support a wide range of applications. 
They provide access to key:value stores like MongoDB, to ACID guarantees when accessing remote MySQL or PostgreSQL servers, and to web services like Twitter and Philips Hue smart light bulbs. A mostly complete <a href="https://wiki.postgresql.org/wiki/Foreign_data_wrappers">list of implementations</a> is available from the PostgreSQL Wiki.</p> <p>FDWs are implemented using callback functions. In this primer we’ll show how to use FDWs to front-end your own datastores, and to allow JOINs with native PG data and data stored in other FDW-accessible systems. We use FDWs this way at Kentik as part of the Kentik Data Engine (KDE) that powers Kentik Detect, the massively scalable big data-based SaaS for network visibility.</p> <p>FDWs enable Kentik Detect to use SQL-compatible syntax and to take advantage of PostgreSQL’s authentication, SSL, and libpq functionality. They also allow us to simplify some sections of the code by relying on PG’s ability to combine multiple result sets. Kentik Detect supports multi-petabyte datastores spread across tens or hundreds of nodes, but to users these stores look like SELECTs from normal PG tables. Queries run in parallel across the cluster(s), with multi-tenancy features like rate-limiting and “query fragment” caching that are critical for our use cases but aren’t found in PG itself.</p> <p>So far, the biggest hurdle that aspiring PostgreSQL hackers face in writing FDWs of their own is the lack of documentation; the best documentation has typically been the actual FDW implementations. Hopefully this primer will help make things easier. So let’s dig into how to implement an FDW…</p> <h3 id="example-implementation-whitedb">Example Implementation: WhiteDB</h3> <p>FDWs come in two main forms: writable and read only. The only difference is that support for writes requires a few additional callback functions.</p> <p>I’ve posted on GitHub a simple <a href="https://github.com/Kentik/wdb_fdw">writable FDW example</a> for WhiteDB (WDB), which is a project that stores data in shared memory (see <a href="#Fc01-httpwwwwhitedborg">http://www.whitedb.org</a>). 
This FDW, wdb_fdw, allows PostgreSQL to read and write into WDB managed memory.</p> <p>While WDB is a key:value store, and therefore provides only a subset of what you could do with full-on SQL database functionality, wdb_fdw enables you to read, write, and delete using statements like the following:</p> <div class="gatsby-highlight" data-language="sql"><pre class="language-sql"><code class="language-sql"><span class="token keyword">INSERT</span> <span class="token keyword">INTO</span> my_table <span class="token punctuation">(</span><span class="token keyword">key</span><span class="token punctuation">,</span> <span class="token keyword">value</span><span class="token punctuation">)</span> <span class="token keyword">VALUES</span> <span class="token punctuation">(</span><span class="token string">'key'</span><span class="token punctuation">,</span> <span class="token string">'value'</span><span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token keyword">SELECT</span> <span class="token operator">*</span> <span class="token keyword">FROM</span> my_table <span class="token keyword">WHERE</span> <span class="token keyword">key</span> <span class="token operator">=</span> <span class="token string">'key'</span><span class="token punctuation">;</span>
<span class="token keyword">DELETE</span> <span class="token keyword">FROM</span> my_table <span class="token keyword">WHERE</span> <span class="token keyword">key</span> <span class="token operator">=</span> <span class="token string">'key'</span><span class="token punctuation">;</span></code></pre></div> <p><strong>Note: The code posted to GitHub to support this article has the following limitations:</strong></p> <ul> <li>It only supports adding and deleting values based on a key.</li> <li>Integers are cast to TEXT (string) when displayed by PostgreSQL.</li> </ul> <h3 id="setting-the-environment">Setting the Environment</h3> <p>To get started, let’s set up our dev environment:</p> <ul> <li>Assuming a debian/ubuntu setup, <code class="language-text">apt-get install postgresql-server-dev-9.X</code> (the latest release is currently 9.4).</li> <li>Install WhiteDB from source.</li> <li>When these are done, you should be able to:</li> </ul> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash"><span class="token function">git</span> clone https://github.com/Kentik/wdb_fdw.git
<span class="token builtin class-name">cd</span> wdb_fdw
<span class="token function">make</span>
<span class="token function">sudo</span> <span class="token function">make</span> <span class="token function">install</span></code></pre></div> <p>Diving into the source code, first note a lot of boilerplate in the Makefile. Major things to change are the flags:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">SHLIB_LINK = -lwgdb -lsasl2
EXTENSION = wdb_fdw</code></pre></div> <p>Everything else can be left as it is from the defaults. 
This works because the following magic line pulls in all of the machinery PostgreSQL needs to build an FDW:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)</code></pre></div> <p>Once things are installed, create a new table by installing the FDW extension (actually a .so file) for your database and then creating a table using the FDW as its data engine:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">CREATE EXTENSION wdb_fdw;
CREATE SERVER wdb FOREIGN DATA WRAPPER wdb_fdw OPTIONS (address '1000', size '1000000');
CREATE USER MAPPING FOR PUBLIC SERVER wdb;
CREATE FOREIGN TABLE _table_name_ (key TEXT, value TEXT) SERVER wdb;</code></pre></div> <p>In the <code class="language-text">CREATE SERVER</code> call, <code class="language-text">address</code> is a shared memory segment address, and <code class="language-text">size</code> is how many bytes to grab. If you try to write more rows than can fit into the address you create here, bad things happen.</p> <p><em>Note:</em> PostgreSQL will reload the .so file once per client process. So when you make a change and re-install this file, make sure to exit and re-connect to your psql client to pick up the latest code.</p> <h3 id="into-the-code">Into the Code</h3> <p>Now that the extension is working, we can dive right into the meat of the code:</p> <div class="gatsby-highlight" data-language="bash"><pre class="language-bash"><code class="language-bash">src/wdb_fwd.c</code></pre></div> <p>A few notes on PostgreSQL’s code conventions:</p> <ul> <li>PG uses the Datum type as a general placeholder for any data type to be stored.</li> <li>PG has two major collection types: linked list (LIST *) and hash table (HTAB *).</li> <li>In PG, almost every function gets its arguments passed in as a set of void*. Figuring out what exactly you are getting and how to use it boils down to grepping in the PG source code for example uses. This is about as entertaining as it sounds. The best places I know of to look are the contributed code in PG’s source distribution and other FDWs online.</li> <li>PG implements its own memory management with palloc and pfree. These do what you would expect from the std lib.</li> </ul> <p>In <code class="language-text">wdb_fdw</code>, the first code block you come to is this:</p> <div class="gatsby-highlight" data-language="go"><pre class="language-go"><code class="language-go"><span class="token function">PG_FUNCTION_INFO_V1</span><span class="token punctuation">(</span>wdb_fdw_handler<span class="token punctuation">)</span><span class="token punctuation">;</span>
<span class="token function">PG_FUNCTION_INFO_V1</span><span class="token punctuation">(</span>wdb_fdw_validator<span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre></div> <p><code class="language-text">wdb_fdw_validator()</code> is what makes the <code class="language-text">CREATE SERVER</code> call work. It parses the <code class="language-text">OPTIONS</code> clause and sets up a <code class="language-text">wdbTableOptions</code> struct.</p> <p>Note that options come in as a linked list. 
The following macro is how PostgreSQL iterates over a LIST* value:</p> <div class="gatsby-highlight" data-language="go"><pre class="language-go"><code class="language-go"><span class="token function">foreach</span><span class="token punctuation">(</span>cell<span class="token punctuation">,</span> options_list<span class="token punctuation">)</span> <span class="token punctuation">{</span>
    DefElem <span class="token operator">*</span>def <span class="token operator">=</span> <span class="token punctuation">(</span>DefElem <span class="token operator">*</span><span class="token punctuation">)</span> <span class="token function">lfirst</span><span class="token punctuation">(</span>cell<span class="token punctuation">)</span><span class="token punctuation">;</span></code></pre></div> <p>Moving on, <code class="language-text">wdb_fdw_handler()</code> is the real code of the FDW. It sets up the FDW and lays out all of the callback functions supplied. The only required functions are these:</p> <div class="gatsby-highlight" data-language="go"><pre class="language-go"><code class="language-go">fdwroutine<span class="token operator">-</span><span class="token operator">></span>GetForeignRelSize <span class="token operator">=</span> wdbGetForeignRelSize<span class="token punctuation">;</span> <span class="token comment">// How many rows are in a foreign table?</span>
fdwroutine<span class="token operator">-</span><span class="token operator">></span>GetForeignPaths <span class="token operator">=</span> wdbGetForeignPaths<span class="token punctuation">;</span> <span class="token comment">// Bookkeeping function, which makes the PG planner able to do its thing.</span>
fdwroutine<span class="token operator">-</span><span class="token operator">></span>GetForeignPlan <span class="token operator">=</span> wdbGetForeignPlan<span class="token punctuation">;</span> <span class="token comment">// Another bookkeeping function for the planner.</span>
fdwroutine<span class="token operator">-</span><span class="token operator">></span>BeginForeignScan <span class="token operator">=</span> wdbBeginForeignScan<span class="token punctuation">;</span> <span class="token comment">// As you might expect, called at the beginning of each scan over a foreign table.</span>
fdwroutine<span class="token operator">-</span><span class="token operator">></span>IterateForeignScan <span class="token operator">=</span> wdbIterateForeignScan<span class="token punctuation">;</span> <span class="token comment">// Called for every row. When this function returns null, it means that the scan should stop.</span>
fdwroutine<span class="token operator">-</span><span class="token operator">></span>ReScanForeignScan <span class="token operator">=</span> wdbReScanForeignScan<span class="token punctuation">;</span> <span class="token comment">// Sometimes PG may need to re-scan a table.</span>
fdwroutine<span class="token operator">-</span><span class="token operator">></span>EndForeignScan <span class="token operator">=</span> wdbEndForeignScan<span class="token punctuation">;</span> <span class="token comment">// At the end, put all of your cleanup code here.</span></code></pre></div> <p>These functions are called in this order by PostgreSQL, during the course of running <code class="language-text">SELECT * FROM my_table</code>.</p> 
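<p>To make that lifecycle concrete, here’s a minimal sketch of the shape an iterate callback can take. This is a simplified, hypothetical example rather than the actual wdb_fdw code: <code class="language-text">ExampleScanState</code> and <code class="language-text">example_fetch_next()</code> are stand-ins for whatever per-scan state and cursor read your backend provides.</p> <div class="gatsby-highlight" data-language="c"><pre class="language-c"><code class="language-c">/* Hypothetical per-scan state, allocated in BeginForeignScan. */
typedef struct ExampleScanState
{
    void *cursor;   /* hypothetical handle into the backend store */
    char *key;      /* most recently fetched key */
    char *value;    /* most recently fetched value */
} ExampleScanState;

/* Hypothetical cursor read: returns true while rows remain. */
static bool example_fetch_next(void *cursor, char **key, char **value);

/* Sketch of an iterate callback for a two-column (key TEXT, value TEXT)
 * table. Returning an empty slot tells PostgreSQL the scan is finished. */
static TupleTableSlot *
exampleIterateForeignScan(ForeignScanState *node)
{
    TupleTableSlot   *slot  = node-&gt;ss.ss_ScanTupleSlot;
    ExampleScanState *state = (ExampleScanState *) node-&gt;fdw_state;

    /* Always clear the slot first; we either fill it or return it empty. */
    ExecClearTuple(slot);

    if (example_fetch_next(state-&gt;cursor, &amp;state-&gt;key, &amp;state-&gt;value))
    {
        slot-&gt;tts_values[0] = CStringGetTextDatum(state-&gt;key);   /* column 0: key */
        slot-&gt;tts_isnull[0] = false;
        slot-&gt;tts_values[1] = CStringGetTextDatum(state-&gt;value); /* column 1: value */
        slot-&gt;tts_isnull[1] = false;
        ExecStoreVirtualTuple(slot);
    }
    return slot;
}</code></pre></div> <p>In the same sketch, <code class="language-text">BeginForeignScan</code> would palloc the state struct and open the cursor, and <code class="language-text">EndForeignScan</code> would close it.</p>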
<p>Here’s a closer look at the functions:</p> <ul> <li><code class="language-text">GetForeignRelSize</code>, <code class="language-text">GetForeignPaths</code>, and <code class="language-text">GetForeignPlan</code> are all stubs, which don’t really do anything for our purposes. More information about them is in the comments above each function in the code.</li> <li><code class="language-text">wdbBeginForeignScan</code> is given the options defined in the <code class="language-text">CREATE SERVER</code> clause. It uses these to open a handle into the WDB table. This handle is then used at each iteration (<code class="language-text">IterateForeignScan</code>) to read or write into the WDB table. And, as you might expect, it is closed in the <code class="language-text">EndForeignScan()</code> function.</li> <li><code class="language-text">IterateForeignScan</code> is called repeatedly, and it has the option of either adding a row to the result set or returning null. When it returns <code class="language-text">NULL</code>, PG assumes that all of the data has been returned.</li> </ul> <p>Additional functions to know about include:</p> <ul> <li><code class="language-text">FillTupleSlot()</code> is a helper function used by IterateForeignScan to actually give data to PostgreSQL. This function gets two pointers: <code class="language-text">Datum *columnValues</code> and <code class="language-text">bool *columnNulls</code>. The i’th column value goes in <code class="language-text">columnValues[i]</code>. If a column is null, however, <code class="language-text">columnNulls[i]</code> is set to true, and <code class="language-text">columnValues[i]</code> is left null as well. This is called once for each row added to a result set.</li> <li><code class="language-text">ExecForeignInsert</code> must be implemented to support inserts.</li> <li><code class="language-text">ExecForeignUpdate</code> and <code class="language-text">ExecForeignDelete</code> handle updates and deletes respectively.</li> </ul> <p>And that’s it, at least for simpler use cases — go out and create!</p> <h3 id="some-tricky-bits">Some Tricky Bits</h3> <p>At this point the careful reader will start looking at the code and realize that a lot of functions have been left unmentioned, in particular functions to parse the WHERE clause of a query and to handle data types.</p> <p>The function <code class="language-text">ApplicableOpExpressionList()</code> in <code class="language-text">src/wdb_query.c</code> shows how both of these can be accomplished. Taken from the FDW for MongoDB (<code class="language-text">github.com/citusdata/mongo_fdw</code>) by the good folks at CitusData, this function iterates over a list of WHERE clauses:</p> <div class="gatsby-highlight" data-language="text"><pre class="language-text"><code class="language-text">foreach(restrictInfoCell, restrictInfoList) {</code></pre></div> <p>The function returns a list of the clauses that can be used by WDB. For example, integer and string comparisons are returned, but LIKE conditions are rejected because WDB doesn’t support them. The list of valid clauses is then passed to <code class="language-text">BuildWhiteDBQuery()</code>, which uses them to build a WDB query struct.</p> 
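<p>The shape test at the heart of that filtering can be sketched as follows. This is a simplified, hypothetical version of the idea; the function name and the exact checks are illustrative, not the code from the repo. Only simple “column operator constant” expressions are accepted:</p> <div class="gatsby-highlight" data-language="c"><pre class="language-c"><code class="language-c">/* Simplified, hypothetical applicability test: accept only "Var op Const"
 * expressions such as key = 'abc'; reject everything else (LIKE, OR trees, etc.). */
static bool
exampleClauseIsApplicable(RestrictInfo *restrictInfo)
{
    Expr   *clause = restrictInfo-&gt;clause;
    OpExpr *opExpr;
    Node   *left;
    Node   *right;

    if (!IsA(clause, OpExpr))
        return false;

    opExpr = (OpExpr *) clause;
    if (list_length(opExpr-&gt;args) != 2)
        return false;

    left  = (Node *) linitial(opExpr-&gt;args);
    right = (Node *) lsecond(opExpr-&gt;args);

    /* A table column compared against a constant is usable by the backend. */
    return IsA(left, Var) &amp;&amp; IsA(right, Const);
}</code></pre></div> <p>A fuller implementation would also accept the reversed “constant operator column” form and check the operator itself against what the backend supports.</p> <p>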
<code class="language-text">EncodeConstantValue()</code> is then used to map from PostgreSQL data types into types that WDB can understand, using a big switch statement.</p> <p>There are, of course, many other tricks and tips, including how to handle <code class="language-text">GROUP BY</code>, which we’ll cover in future posts.</p> <h3 id="and-beyond">And Beyond</h3> <p>The above primer is just a small taste of how Foreign Data Wrappers can be applied to extend a PostgreSQL datastore. At Kentik, we’ve found FDWs built on top of PG to be both stable and scalable, greatly increasing the capacity and capabilities of Kentik Detect. We heartily recommend FDWs if you have similar needs.</p><![CDATA[The Stirring Simplicity of SaaS for Network Visibility]]>https://www.kentik.com/blog/the-stirring-simplicity-of-saas-for-network-visibilityhttps://www.kentik.com/blog/the-stirring-simplicity-of-saas-for-network-visibility<![CDATA[Jim Frey]]>Mon, 31 Aug 2015 18:07:28 GMT<p>Let’s face it – we live in a world of increasingly complex software products and systems, upon which organizations are absolutely dependent to function. While software-centricity has allowed phenomenal rates of innovation and growth in productivity, software systems can be notoriously expensive to own and operate. Research going back to Barry Boehm’s seminal text on <a href="http://www.amazon.com/gp/product/0138221227/ref=olp_product_details?ie=UTF8&#x26;me=">Software Economics</a> estimates that up-front costs of software comprise as little as 25% of the long-term cost of ownership (TCO). Much of the cost equation involves deployment, configuration, and ongoing maintenance of the software and the hardware to host/run it.</p> <p>Network and security monitoring solutions are a prime example. Keeping today’s networks up, running, and safe is a major challenge given their scale, traffic growth, and criticality. This has resulted in a wide array of complex, highly functional and highly capable solutions coming to market, which carry with them not only large up-front licensing price tags, but often heavy costs for installation and administration.</p> <p>To simplify deployment, many such products are offered in an appliance model, where the software comes preloaded and preconfigured (to a degree) on a hardware platform. This is helpful, but comes with some common downsides – namely the high up-front capital outlay required, the limited lifetime of hardware (can you say “forklift upgrade?”) and the fact that such approaches do little to reduce the pain of systems administration for software updates and patches.</p> <p>How can we do better? The public/private cloud approach can help reduce hardware costs, <em>if</em> the product of interest can run in a VM and <em>if</em> you can get your management data to your cloud provider of choice. A promising alternative is Software-as-a-Service (SaaS). SaaS is a proven and accepted model for many software product sectors, including CRM, APM, ERP, and more. SaaS offers some compelling advantages, including:</p> <ul> <li>Blazingly fast deployment: Our experience is that basic monitoring can take as little as 10-15 minutes from first keystroke to first data graph.</li> <li>OPEX instead of CAPEX: Taking subscription-based bites at license costs instead of a big upfront swallow is a great option for most shops.</li> <li>Better/faster support: The SaaS provider looks after the viability and resilience of the system 24x7, applying patches and new features with little or no delay. 
No more “sorry, we can’t reproduce your problem” support black holes and no more “we recommend you upgrade your operating system” brick walls.</li> <li>Easy scalability: Again, the SaaS provider is responsible for making sure adequate compute and storage resources are available to deliver committed levels of service. No more worrying about whether you have enough RAM or enough disk for the system.</li> </ul> <p>To be fair, there are some disadvantages as well, or at least perceived disadvantages. The two we hear about most often are:</p> <ul> <li>Data control: Many shops worry about their sensitive data being held and protected by an external entity. At Kentik, we have overcome such concerns with most organizations by providing secure data transport and assured multi-tenant partitioning. Thinking of analogous services helps too. Consider Salesforce here – if your CRM data is considered to be safe with an external provider, then in most cases your network flow data will be too. And if it’s just too scary, we bring out private hosting options (though few teams end up needing to go that far).</li> <li>OPEX instead of CAPEX: While most organizations like the subscription operational expense option for license fees, some only have the budget available for software systems on an occasional capital basis. As with most SaaS providers, we’re glad to get creative in those situations.</li> </ul> <p><strong>The Time for SaaS Network Monitoring is Now</strong></p> <p>Before joining Kentik, I spent a couple of decades in the network management sector, largely focused on traditional product architectures that required local deployment and administration. In the past several years, I’ve been intrigued by the utter lack of SaaS options among high-end network management solutions. As an industry analyst at EMA, I conducted direct <a href="http://www.enterprisemanagement.com/research/asset.php/2749/Managing-Networks-in-the--Age-of-Cloud,-SDN,-and--Big-Data:-Network-Management-Megatrends-2014">primary research</a> on the topic in 2014, and found that on average, 80% of enterprises were amenable to SaaS models for network management – there just weren’t many choices available.</p> <p>There are a few SaaS options out there already for basic network infrastructure monitoring, such as <a href="https://www.auvik.com">Auvik</a> and <a href="http://www.logicmonitor.com/">LogicMonitor</a>. But choices are fewer and farther between when it comes to network traffic monitoring. <a href="http://www.appneta.com/">AppNeta</a> offers a hybrid SaaS solution that does network and application performance monitoring, but it is designed for small-medium enterprise scale only. Similarly, <a href="https://polygraph.io/">Polygraph.io</a> offers SaaS-based flow monitoring, but is limited in its ability to scale for large managed environments.</p> <p>Kentik is the first to take on ultra high-volume flow monitoring using the SaaS approach and make it work for massive scale and near-real-time (sub-minute) results. On the scale front, we are already monitoring environments that regularly generate in excess of 1 million flows per second, and our back end SaaS platform is currently ingesting over 40 billion flow records per day. Our clients are getting immediate results when they start using the service, and our operations team constantly monitors the health and success of each managed environment.</p> <p>There is beauty in simple answers to complex problems. 
As long as you don’t have to sacrifice capability or assume uncomfortable levels of risk, the real question around SaaS is no longer “why?” The real question is “why not?”</p><![CDATA[Introducing Kentik]]>https://www.kentik.com/blog/introducing-kentikhttps://www.kentik.com/blog/introducing-kentik<![CDATA[Avi Freedman]]>Tue, 30 Jun 2015 08:00:13 GMT<p>We founded Kentik to make life easier for the network and application operators that run the modern web. Our first service, Kentik Detect, is an infrastructure data analytics service that is scalable, powerful, flexible, open, and easy to use. Our customers — including Yelp, Box, OpenDNS, and Neustar — are using Kentik Detect to detect, alert, and dig into the efficiency, availability, and security of their infrastructure and applications.</p> <p>Here’s a bit of background on how we got to where we are today:</p> <p>Managing networks well requires visibility into traffic flows, network peering and interconnection, fast and accurate attack detection, performance analytics, and security forensics. But the tools available to operators have been… challenged — for decades. A common reaction we’ve all heard to the tools people have been using was some version of “they’re pretty much horrible but better than nothing.”</p> <p>While it’s been almost 20 years since Cisco created NetFlow, the options for even the most basic analytics remain grim. Available choices for “solutions” consist largely of enterprise software or appliances, single-machine open source software, or more recently, work done by in-house tools groups trying to build platforms on top of existing big data engines like Hadoop or Elastic.</p> <p>I had been a frustrated network operator myself, and had heard grumbling for years about the available tools, but it wasn’t until a few years ago that friends like Dan Ellis (now Kentik’s CTO) hammered home to me that there really wasn’t any modern, scalable, and affordable option for infrastructure analytics. The problem wasn’t getting the data — most switches and routers can generate decent traffic flow data via sFlow, NetFlow, or IPFIX. The problem is what to do with all that data. Receiving, storing, and processing it flexibly, in real time, and with granular retention is a huge challenge at today’s network scale and complexity. The gap was so big that most large-scale companies we talked to were trying to build their own solutions, or had first versions, with long roadmaps of basic functionality yet to be completed.</p> <p>As we fired up a first version of Kentik’s platform and asked for victims to help us test it, we also discovered that most of our potential users (and now customers) had a strong preference for a cloud-hosted service. 
They wanted to be able to send data without having to stand up servers and install software, and then easily and quickly access it via a client, a portal, or API, and integrate into their existing tools and operational and alerting work flows.</p> <p>Specifically, operators said that they’d like to see:</p> <ul> <li>High speed ingest of high resolution NetFlow, combined with routing data.</li> <li>Instant availability of that data for DDoS and anomaly detection.</li> <li>Long-term retention and availability at high resolution.</li> <li>A scalable architecture with open access to the data and analytics.</li> <li>A roadmap to take richer sets of flow-like data from load balancers, server agents, and sensors to support deeper performance and security use cases.</li> <li>Usability for all of the different operational groups: network, server, application performance, and security.</li> </ul> <p>After taking the feedback and working with some first users, Ian Applegate, Ian Pye, and I decided to start Kentik (then CloudHelix) in 2014 to address these requirements. Working with data sent by leading ISPs and web operators, we built an operating data platform, and first applications for analytics and alerting. We were soon joined by a fourth co-founder, Justin Biegel, who took over our customer outreach. We raised our $3.1m seed round in September of 2014, grew from four to now 21 people, and started growing a paid customer base in February of this year.</p> <p>Today I’m thrilled to announce the two major pieces of Kentik news mentioned above:</p> <ul> <li>First, the launch and general availability of our core product: Kentik Detect. We’re excited that our initial paying Kentik Detect customers include companies such as Yelp, Box, OpenDNS, and Neustar as well as other service providers and web properties. If you’ve been frustrated with your network visibility tools, please ping us — we’d love to understand your requirements and set up a Kentik Detect trial!</li> <li>Second, our $12.1m Series A funding. Led by Vivek Mehra of August Capital, this round represents a more than doubling of funding commitments from our seed investors — including First Round Capital, DCVC, WIN Funding, Tahoma Ventures, Central Electric, Engineering Capital — and our industry friends who helped support us as investors.</li> </ul> <p>As part of the funding round, Vivek, who brings deep technical and business expertise as CTO and co-founder of Cobalt Networks and as General Partner at August Capital, has joined our board of directors. We’re thrilled to welcome him and August Capital to the Kentik team! August is a firm with extensive experience and strong commitment to enterprise and infrastructure companies, including some of the companies we most admire (hi, Artur!).</p> <p>We’ll be using our new funding primarily for product development and core company growth. We have some very exciting products planned for the next few quarters that we’re already working on with customers. Again, please talk to us if you’d like to be part of Kentik’s road map conversation.</p> <p>Check out the <a href="/news/kentik-launches-from-stealth">Kentik Launch news release</a> to learn more about our product launch and funding, and hear first-hand from some of our investors and customers!</p> <p>Also — and no surprise — we’re hiring. 
Please ping me if you’re a data dreamer with strong hands-on back end or front end experience, or you’re at a company building “classic” solutions and have been thinking about how great it’d be to work at a fast-paced company with smart customers and a deep road map.</p> <p>Lastly, please check back here for future posts about how network operations are evolving, and what Kentik is doing to meet the challenge! The Kentik team will be blogging about our functionality and road map, and will also be presenting tutorials about infrastructure and operational topics.</p> <p>Ω</p><![CDATA[20 Years of Flying Blind]]>https://www.kentik.com/blog/20-years-of-flying-blindhttps://www.kentik.com/blog/20-years-of-flying-blind<![CDATA[Dan Ellis]]>Tue, 30 Jun 2015 07:59:31 GMT<p>It was 1994 and I was running one of the first multi-city ISPs in the country. We were one of SprintLink’s first BGP customers, and we had dozens of Cisco and Wellfleet routers. What we didn’t have was a way to answer any of the many urgent network utilization questions we faced every day:</p> <ul> <li>Where was our users’ traffic coming from? Which types of customers had traffic traversing expensive links?</li> <li>Where should we order our next T1? Should it be a DS3? Where should it plug into the network?</li> <li>Who’s transferring data with our servers that shouldn’t be?</li> <li>Were our servers talking to each other on weird ports?</li> <li>Were we under attack? Were we compromised?</li> <li>Should we interconnect with a local competing ISP?</li> </ul> <p>I was young, I was new to the networking world, and everyone was new to the Internet. I just needed a “top -b -d 5 -i > networktraffic.txt.” As it was, to get info for a network link that was 200 miles away, I literally had to drive those 200 miles, enable a span port, risk crashing our 3Com switch, capture packets, and manually decipher IP addresses from a text dump. Universities and big enterprise companies had been running networks for years, I reasoned, so there must be something easier that would give us remote visibility. But as much as I searched in Gopher, I couldn’t find the solution.</p> <p>Five years later, just before the end of the world (Y2K), I had moved up to running one of the first high-speed data cable networks in the country. We had DS3s, Gigabit Ethernet, and a company-owned, multi-state dark fiber network of thousands of miles. And we had hundreds of “big” Cisco routers that were capable of generating Cisco’s new NetFlow network data. But NetFlow implementation in routers was so unstable that I still had basically no idea of where my traffic was coming from or going to. It was like piloting a ship by holding my finger up to the wind. I spent endless nights changing routing and enabling/disabling links to try to determine traffic patterns. All I needed was a little window on the side of the fiber to show me where the bits were going.</p> <p>Early in the new millennium NetFlow was finally somewhat stable. If I ran a brand-new version of router software with 276 serious bugs and caveats, I’d be able to get a flow of information out of the router, the equivalent of telephone “call records.” I thought: this is exactly what I need. I tried it, and while it didn’t work well, it did get me some data. But I had nowhere to send or collect it, and no way to work with it. Damn. So close.</p> <p>Between 2001 and 2011, NetFlow export from routers matured. But the ability to collect it and visualize it did not. 
There was some open-source software that worked for low volumes of traffic, and some over-priced, feature-poor commercial systems, but nothing that worked well. The rate of growth in traffic far outpaced any improvement in NetFlow tools. I kept thinking that there must be some viable collection/visualization solution out there, but I couldn’t find it.</p> <p>Meanwhile, starting in about 2008, I began having regular conversations with my good friend and fellow ex-ISP operator, Avi Freedman, about the state of traffic visibility. We agreed that it was a critical requirement to run a network, and that there wasn’t a good solution. The conversations went kind of like this:</p> <p><strong>Dan</strong>: “NetFlow still stinks.”<br> <strong>Avi</strong>: “Did you buy the $500k solution that doesn’t do what you need yet?”<br> <strong>Dan</strong>: “No, did you build me one yet?”<br> <strong>Avi</strong>: “Give me a list of what you need it to do.”<br> <strong>Dan</strong>: “OK.”</p> <p>And that went on for years, every 6 months. And, yes, I sent him the list every time.</p> <p>By 2012, everything was in the cloud and big-data had prevailed (whatever that means). I was running Netflix’s CDN, and guess what? Same tools, same vendor evaluations, same problem: still no solid NetFlow solution.</p> <p>I ran into Avi in 2013 and he informed me that he was going to build a data platform that could consume massive amounts of NetFlow, make it available in real-time, and provide both a UI and an API. When he invited me to test it in his lab, I said that what I really wanted was to see it running in the cloud. Was the whole thing for real? Frankly, I was doubtful.</p> <p>By mid-2014 Avi was working with Kentik co-founders Ian Applegate and Ian Pye. He said to me, “Look, we’ve got a prototype. It’s a SaaS platform capable of consuming 1M fps with no pre-aggregation, and of returning a query answer in &#x3C;1s.” So I asked the engine the same questions I’d been wanting answers to back in 1994, and in 1999 and 2004 and 2012… and it worked! I could quickly visualize where traffic was coming from and going to. I could drill into a “bump” on the graph and within a minute figure out if it was legit or not. I could see all the weird traffic a server was doing that wasn’t expected, and where it was going to. I could see who the heavy users were, who they were talking to AND tell if those people were expensive bits or cheap bits. In short, it was real.</p> <p>That early version was just a prototype, bugs and all. But looking under the hood I could see that it had a solid, scalable, open foundation — something that had been missing from every previous NetFlow tool I’d ever evaluated.</p> <p>A few months later I joined Kentik.</p> <p>So that’s how I got involved in what I think will be a major advance in the way that network operators are able to understand what’s happening over their infrastructure. Over the next several months I’ll be posting about how to use NetFlow and the Kentik products to get that understanding, which is a key part of operating a successful large-scale network. 
So we’ll look next time at the top 5 ways to use Kentik Detect to get a 360-degree view of your network.</p><![CDATA[Fifty Shades of Network Visibility]]>https://www.kentik.com/blog/fifty-shades-of-network-visibilityhttps://www.kentik.com/blog/fifty-shades-of-network-visibility<![CDATA[Jim Frey]]>Tue, 30 Jun 2015 07:58:13 GMT<p>After 20+ years in the network management sector, spanning both enterprises and service providers, you like to think that you’ve seen it all. So by the time I was first exposed to what Kentik was doing, I was pretty sure that I understood every angle and approach that is or could be taken to establish network/traffic visibility. Frankly, my first impression was that Kentik didn’t sound all that new or different. But the more I learned, the more I realized that Kentik truly was unique.</p> <p>Network management tools and technologies have evolved over time, but they always revolve around two key objectives: visibility and control. While control is a worthy topic in itself (I’ll come back to it in future posts), there is a long, varied, many-shaded story to tell around network visibility. There are many, many ways to establish visibility into the network, and a lot of compelling benefits in terms of operational resilience.</p> <p>As an analyst for six years at <a href="http://www.enterprisemanagement.com/">EMA</a>, I conducted periodic research studies on <a href="http://www.enterprisemanagement.com/research/asset.php/2749/Managing-Networks-in-the--Age-of-Cloud,-SDN,-and--Big-Data:-Network-Management-Megatrends-2014">Megatrends in Network Management</a>, poking and probing to see if and how network management tools, technologies, and practices were evolving. And while almost everyone with whom I talked used monitoring tools — ranging from basic SNMP to logs to NetFlow/xFlow to real-time packet inspection — there was a glaring gap in the adoption of application/traffic visibility by front-line network operations. The data was there in their tools, but NetOps either didn’t want to be distracted with that next layer of detail or they were afraid to inundate their operators with what can be a fire hose of data, including performance-related alerts/alarms.</p> <p>As an executive at <a href="http://www.netscout.com">NetScout</a>, I was a big advocate of using the company’s network performance management technology for sustained real-time monitoring. But only a small slice of the installed base embraced the Operations use case; most were focused on troubleshooting or planning &#x26; engineering. And despite successes in large enterprises and wireless service providers, it never gained traction in web-scale commercial organizations. Those shops typically preferred to build their own tools rather than make the requisite (and substantial) investments in third-party packet monitoring probes/appliances.</p> <p>Late in 2014, I decided to research another cool/emerging trend: the use of big data technologies in network and infrastructure management. I had always believed that network visibility was a big data problem, particularly for live traffic and performance monitoring. On the one hand you have multiple sources and types of valuable data. On the other you have relentlessly increasing network size, bandwidth, and complexity. Combine the two and you meet the traditional “three Vs” definition of big data: volume, velocity, and variety.
What wasn’t clear was whether and how the existing tools-vendor community would use big data technology to address network visibility.</p> <p>I went out in search of solutions, and found plenty who were willing to talk about big data or to send their own data into big data back ends. But precious few were actually using big data tech directly. I ran across Kentik (then called CloudHelix) and immediately thought I had found a match. “What exactly is the solution?” I asked. “Network visibility using NetFlow at scale,” was the response.</p> <p>At first I was pretty disappointed. Hadn’t this been done already? Weren’t there already plenty of tools out there that could handle millions of flows per second of <a href="https://en.wikipedia.org/wiki/NetFlow">NetFlow</a>/<a href="https://en.wikipedia.org/wiki/SFlow">sFlow</a>/<a href="https://en.wikipedia.org/wiki/IP_Flow_Information_Export">IPFIX</a>? The answer was yes, but not in the way that Kentik was approaching the problem. No other NetFlow/xFlow solution in the mainstream market was using a big data architecture, nor delivering a massively scalable approach as <a href="https://en.wikipedia.org/wiki/Software_as_a_service">SaaS</a>. Further, no one else retained full raw NetFlow, which is a hard-and-fast prerequisite for troubleshooting and security forensics.</p> <p>There were some parallel efforts using high-volume NetFlow/xFlow for security monitoring, but none of those were designed to address NetOps. The Kentik solution, on the other hand, was built from the ground up specifically to address NetOps use cases. The founders had all been directly involved in running huge networks at places such as Akamai, Netflix, and Cloudflare, and they drew on their own experiences to design the Kentik solution. These guys knew what was most important for the NetOps environment: visibility, quick access to data, and fast answers to important questions.</p> <p>While this all had me intrigued, what put me over the top was this: Kentik, just 14 months from founding, already had paying customers and a boatload of prospects in trial. And these were not a bunch of struggling startups themselves — they were big household names. While I’m not allowed to share all of the big names here, having companies like Yelp, Box, and OpenDNS on the list confirmed that Kentik had found a real and viable gap and was successfully filling it.</p> <p>What Kentik is doing aims directly at the heart of what I’ve been working toward for over two decades: network visibility that is practical, useful, and cost-effective, built by operators for operators, using leading-edge technologies and backed by a team with a passion for success. When they asked me to be part of that team, how could I say no? I’ve been a visibility geek for a long time, and this was a company where I could channel that passion into better answers for all.</p> <p>As VP Product, I will be helping to organize and prioritize product roadmaps and requirements, as well as to establish alliances with a range of potential industry partners. We’re just at the beginning of the Kentik story, and I look forward to sharing it with you as it unfolds, shade by shade.</p>