Session Information
Join Ell Marquez, Niall Murphy, Jeff Smith, and Sasha Rosenbaum as they discuss topics related to Cloud Engineering.
Presenters
- Ell Marquez, Linux and Security Advocate
- Niall Murphy, Reliability Consultant and Infrastructure Aesthetician
- Jeff Smith, Director, Production Operations, Centro
- Sasha Rosenbaum, Red Hat
Hello everybody. Thank you for being with us here today. We are at Cloud Engineering Summit hosted by Pulumi. And this is an amazing panel on the Manage track. I’m Sasha Rosenbaum. I work for Red Hat. I’m in technical sales of OpenShift, which is the best version of Kubernetes you can find out there. And with me today are some very, very awesome people. So I’m going to let them introduce themselves.
Um, Ell, would you like to get us started? It’s always fun to be the first one. Hey, I’m Ell Marquez. I am the Linux and Security Advocate at Intezer, which sadly is a little-known field of actually talking about Linux security. The thing I am most proud of, though, is I am an advocate for Operation Safe Escape. We’re a 501(c)(3) helping victims of domestic violence escape their abuser when targeted by tech.
So just big plug here, if you, or if anyone, you know, is in this situation, please reach out. It’s so important that people know they’re not alone. Wow Ell, I’m so glad I met you. I will keep your name in mind for, for those situations cause they definitely happen. And then Niall, would you like to introduce yourself? Sure.
So, uh, I work in the SRE field. I’ve been in internet infrastructure since the mid-nineties, which dates me as well as the lack of hair. If you have come across my name before, it’s probably because of the SRE books, of which I was the instigator and editor, and a bunch of other things. I’m currently a reliability consultant and infrastructure aesthetician, is what my Twitter bio now reads. And you can find me on twitter.com/Niall. So you basically do Botox treatments for people’s infrastructure.
[Laughter]
Config file pruning and so on and so forth. That’s what we do. Alright. And Jeff, would you like to introduce yourself? Sure, uh, my name is Jeff Smith. I’m the director of production operations at a company called Centro. We’re a digital advertising software platform and we are hiring, which I will obnoxiously remind you of throughout the panel. I also recently wrote a book, Operations Anti-Patterns, DevOps Solutions. Feel free to buy 15 copies and hand them out for Christmas presents. It is actually somewhere behind me, um, buried in that bookshelf.
But yeah, me and Jeff are here in Chicago, so we actually get to meet in person. And now I’m almost blurry, but I’m going to go with this. [Laughter] Uh, so let’s get started with the first question. Um, actually, for the first question, since we’re in the Manage track, we thought that it would be interesting to talk about, um, what do people do in day two operations, right? So like you have launched everything that you have. You’ve deployed things, you’ve configured things. You went through all of the pain and now you’re starting to manage that infrastructure as it runs. And you have to worry about things like availability and stuff like that.
So, um, what are the key components needed for day two operations of modern services? And Niall, would you like to get us started? Sure. So I think some of the fundamental things are probably gonna fall in line with this theoretical framework called the Dickerson Hierarchy, which turns up in the SRE book, which is by Dickerson, but is not necessarily a hierarchy. And so anyway, some of that model probably needs to be a little bit changed, but basically it says, if you’re running a production service, there’s some hierarchy of needs, a little bit like Maslow’s hierarchy of needs, and at the bottom is monitoring. You gotta be monitoring your stuff. Otherwise you don’t know if it’s up or down, or what it’s doing.
And I suppose these days we probably would put observability in there as well. Although in theory, you can draw a bright line between the ability to interrogate your systems kind of arbitrarily, which is what observability basically means, and just kind of checking up or down, which is relatively simple monitoring. But the things that sit on top of monitoring, I mean, more or less everything sits on top of monitoring in some way, need that data. What we’re talking about is incident response, post-incident review, like postmortems and stuff like that, capacity planning, looking at performance, latency, measurement, all of that kind of stuff. You basically need to have situational awareness of what your service and systems are doing, what they’re talking to, how it’s being affected, error rates, all of that kind of stuff.
And then you need to be able to react to that. So there’s this other theoretical construct called an OODA loop, which I think derives from the US military: observe, orient, decide, and act. So basically looking at what’s happening to your service by observability, deciding what to do based on that. Like maybe there’s some error condition, or maybe you need to increase your capacity, otherwise you’ll run out in a month or two. And then act on that.
And that loop is one way of understanding the, the kind of demands that are involved with running any significant online service. So I, I think you’re talking about like kind of being a very mature state from ordering services, right? And I think, you know, the future is here, but it’s not evenly distributed. And we have a lot of companies, who are like in a very good state and actually, you know, doing observability and, and are very advanced as in, in what they’re approaching. And we have some companies just trying to catch up from the nineties. Right? And there’s a lot of that out there that like, that people are like actually, their actual availability is like two nines and they’re struggling with that.
And um, I, I kind of wonder how we get there and, and the modern expectations are different, right? Like we, we have, you know, we can’t, we can’t say we’re close down for the weekend for maintenance. Like that’s no longer, I still see people do it, but it’s no longer a thing that you should or could be doing ah, with a fair chance of, of being a modern service. So I don’t know, Jeff, if you want to talk about it from the standpoint, of like a real world company trying to get to the modern state.
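As a rough illustration of what the "nines" mentioned in this discussion mean in practice, the sketch below converts an availability target into the downtime it allows per year. The function name and the 365-day year are assumptions made for the example; nothing here comes from the panel itself.

```python
# Minutes in a non-leap year (an assumption made for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(availability: float) -> float:
    """Return the minutes of downtime per year permitted at a given
    availability, expressed as a fraction between 0 and 1."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    targets = [("two nines", 0.99), ("three nines", 0.999), ("five nines", 0.99999)]
    for label, availability in targets:
        minutes = allowed_downtime_minutes(availability)
        print(f"{label}: about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, two nines leaves room for several days of downtime a year, while five nines leaves only a few minutes, which is one way to see why each extra nine is expensive.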
Yeah. You know, I, I agree with Niall, you know, in terms of like observability being important, but I, you know, I think the day two conversation um, really is it depends, right? Because I think everyone who’s launching a new service, or a migrating to the cloud or anything like that, they’re doing it for a particular set of reasons.
And I think those reasons will inform what the day two stuff is. Because once you migrate um, you quickly start to realize the pains that you’re experiencing, because as much testing as you do, nothing is like production, right? So then day two comes along and suddenly you’ve got this list of things that are problematic and causing your organization pain, and you need to address those first. So for like me, for example, when we were migrating to AWS and migrating a bunch of services there, monitoring was actually part of our day one. And we did a lot of work in getting a lot of that observability um, and that alerting and that incident response process down solid before we went live. And there’s a lot of other organizations that feel comfortable, you know, solving that with day two.
So I think it’s largely going to depend on what it is that you’re optimizing for. So a big thing that we were optimizing for was cost, right? So a lot of the work that we did was around environment cleanup, environment management, uh, resource expiration. That’s a lot of things that bigger organizations are like, you know what, we’re going to deal with that later. And we will just continue to spend, right, so that we can empower other business operations and business functionality.
And not to say that that’s wrong. I think it’s all a trade off. Right. So you may want to say, well, you know, we’ll, we’ll spend that extra dollars because we know that we’re empowering developers, to be able to do these other things faster. And there’s a dollar value to that.
So that’s a long-winded way of saying it depends. Um. I think the frameworks that Niall mentioned um are, are super important and something that you should consider when you’re trying to evaluate that stuff. You know, I love looking at the, the OODA loop is a perfect example, right? Because it is this sort of structured way of going about and approaching um, incident management. But it’s also something that you can do outside of incident management and just sort of look at, you know, your overall strategy um, in a, in a larger context, right? So what is the day two things that we should be looking at? Well, you know, orient observe, decide, and then act right.
Look and figure out what it is that is causing us pain. And, and how do we go about addressing that? And I think your pain points will sort of bubble up to the top really fast. Once you go live. I actually, I, I hate to bring it back to Niall, but like cost is the conversation we were having offline. Right? And the, the thing that always pops up is like, no one, unless you’re a Google, maybe, maybe, um you don’t have unlimited funds to, to throw at your infrastructure.
Right. And so cost is always a factor. And so what happens usually with your management is your management wants the five nines, but they don’t want to pay for it. And so this is a conversation you have to have, right? You have to explain it like, there is a cost to every single, you know, nine that we add, like a quarter nine or half a nine, that we add to our availability. And um, I think one of the things that the Google SRE, um, book, uh, maybe helps us structure a little bit
is that conversation about costs. And I think everyone wants to jump into the debate, actually. So Ell, why don’t you jump into that? All right, I’m going to go completely contrary to everything everyone just said. You know, and this is going to sound rude. And I don’t mean to, but I don’t speak the, you know, alphabet soup of, you know, the SRE world, you know, the quick deployments world.
And, you know, that’s like all the different models and everything. But one thing that I feel that everyone has missed and everyone’s gotten short, is the whole concept of security anywhere in this, right. We keep talking about cost and, you know, reliability in everything being up. But, you know, it takes a couple of minutes, for an attacker to be able to find a misconfiguration within an environment. And we’re talking about cost and quick deployment, that’s going to happen.
I mean, historically we see it. And you talk about, you know, this whole concept of mature companies; those are the ones that are the best-known attack vectors. Whereas I think companies that are just transitioning to the cloud are in the best place of anyone, because they already have this information at hand and they can completely build their pipeline with security in mind. So yes, we have, you know, the whole concept of having to have, you know, reliability, right? And having to have everything up, which is great. But if what you have is something that is full of holes because we tried to cut costs, then it’s not worth it. Right? All you’re doing is inserting something into your runtime, inserting something that can become a target to pivot into everything else, even if it’s old infrastructure.
I think the most important thing here, when you’re talking about costs, is not ensuring that you don’t have, you know, 50 web servers or whatever it is that you need, but ensuring that you haven’t cut corners when it comes to, you know, the entire security posture of your company. So Jeff first. I, you know, so I’ve got two things to respond to now. So I’m going to start with what Ell said. I completely agree regarding the security conversation, but at the same time, um, and this is a super unpopular opinion.
I know this, but this is the reality that we live in, right? Security is just another piece of the puzzle that we’re trading off against, based on the organization and their needs. Right? So I can be launching a new product. Right? And the thing is, I can have all of the sort of security pipelines and everything in place, but if there’s no one that’s actually responding to that, right? If I’m not staffed appropriately or don’t have the time and/or energy to dedicate to those things, you know, being aware of it is nice, but not reacting to it is terrible. And that’s the thing that you see in a lot of organizations where it’s like, oh, we’re going to get the software. We’re going to get vulnerability management.
We’re going to get scanning. And then you get it and you just get bombarded by these emails, like, yeah, you haven’t patched in four years. And they’re like, yeah, we’re going to get to that. Right? But from an organization perspective, you may just not be there yet. So, you know, I, I think it also is, is part of the trade-off paradigm and it’s a terrible, terrible thing, to trade off of, right.
Because it’s like, in no other scenario would security be negotiable. Right? It’s like, well, you know, I want my kids to wear a seatbelt, but you know [Laughs]. I, I think that you’ve 100% just made my argument, like what should be part of day two. You’ve already spent that part on day one focusing on it by not having that as a direct, like the first thing you do and, you know, day two, having that security operations, having somebody monitor it. That that’s exactly what I’m saying.
Like that’s the issue. Does that make sense? Yeah, it makes sense. I guess the thing is um, if you have a hundred dollars, right. And you’re trying to build a new product, right. And 40 of those dollars have to be consumed by security. Right? That becomes a calculation that you have to make. Right? And you go like, well, we’re just starting off. We’re trying to build a customer base. Right? Like maybe we, we take the risk and, and, and, you know, we focus on actually building the product out a bit more. And then the conversation is always, we’ll come back to that later.
And I’m not saying it’s the right decision. I’m just saying, those are the, that’s the reality of the conversations that I think are, are happening. And the question becomes, and this goes back to the original thing that I was talking about is I think it’s incumbent on us as leaders and managers is how do we translate this risk, into something that’s actionable, by people that are making decisions? And a lot of time, that boils down to not only just the financial component, but the, the likelihood of occurrence. Right? So it’s this thing where it’s like, you know, it’s easy to say like, well, you know, if we get compromised, then it could be expensive. Right? All right.
Well, you know, let’s quantify that though. Right? You know, okay. There’s a 30% chance that we’re going to be compromised. And in that 30% chance, the cost is going to be between 25,000 and 2.5 million. Right? If we give it that, go ahead.
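The kind of quantification Jeff describes here can be sketched as simple expected-loss arithmetic. The figures are the ones he quotes (a 30% chance of compromise, with a cost between 25,000 and 2.5 million); the function name is hypothetical, and multiplying probability by cost is only the crudest possible model, not a method the panel endorses.

```python
def expected_loss_range(probability: float, low_cost: float, high_cost: float) -> tuple[float, float]:
    """Return the (low, high) expected loss for an event with the given
    probability and a cost somewhere between low_cost and high_cost."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if low_cost > high_cost:
        raise ValueError("low_cost must not exceed high_cost")
    return probability * low_cost, probability * high_cost


if __name__ == "__main__":
    # Figures quoted in the discussion; currency is left unspecified.
    low, high = expected_loss_range(0.30, 25_000, 2_500_000)
    print(f"Expected loss between {low:,.0f} and {high:,.0f}")
```

A range like this gives decision-makers a number to weigh against the cost of the security work, which is the translation of risk into something actionable that the discussion is asking for.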
So sorry. I, I just, I thought, you know, you were finished talking, but, but I’m going to jump in, and, and I think that, so this is something that came out of Twitter thread because best things come out of twitter threads, but um, we were talking about like, we need an SLA for security, right? So, so the way we quantify this cost and availability space is we put an SLA on it. And we say, we are financially responsible for X, Y, Z. But if we didn’t meet our targets, we kind of need a way to, put the dollars against security risk and say, Hey, we quantify this.
And now, you know, Mr. CSO or Mrs. CEO, like, yes, like there’s a 1% chance that you will be in jail tomorrow. If we don’t patch this, like I I’m, you like, we have to put dollars behind it. And that’s that, I think that will be the way to have this conversation in a, in a, in a better fashion.
You know, I promise I’ll let Niall talk, but real world scenario, I swear, real-world scenario that I’ve heard twice is exactly what we talked about Jeff. I have a company that goes, okay, I have to be the first to market. Like we have to meet this deadline. Okay? But we also need to be secure. Well, if we be first to market, this is how much we’ll make.
If we get popped, this is how much we’ll make. What I hear is if you know what company profits, you know, users, customers, if their data gets breached and it gets out there, oh, we’ll just have to pay this much, the rest they handle on their own, but we, the company can make this much. And that’s what really bothers me. And, and I agree with you completely. I think, I think there is a moral component to this.
Right? But, you know, I think realistically for companies, especially larger companies, it’s the cost of doing business. Right? And I think part of it is incumbent on us; as users and consumers we need to react to that. Right? Like I haven’t stopped shopping at Target. I still shop at Target. Right? I’m still giving money to these companies that are like, oh, wow, I cannot believe that they had this terrible security breach.
Ooh, look a sale. Right? [Laughs] So, so I think it’s also incumbent on us to, to, to show with our dollars that, hey, this stuff is important um and, and not to sort of dismiss it as like, you know, well, you know, they had this breach, but everyone’s having a breach, and I really liked that feature. So it’s worth it. Well that’s the thing though, right? Like this is capitalism and it’s like negative externalities. Right? And, and, and we, we, we keep doing this, not just with security, with everything, right.
It’s like, w we can make this much dollars. And these people might, you know, be in a terrible situation if something goes wrong, but like, it’s not me, so I don’t care. [Laughs] So, you know, it, it, it keeps happening with all of our decisions and we kind of keep encouraging people to, to make these decisions. Also, like when it comes to everyone has had a breach, I’m like, we should probably give up on PII altogether. Like, I mean, like, like my information is on the internet in so many ways from so many breaches, like, I can just be like, you shouldn’t be able to verify my identity by like, knowing what my, my cat’s name was, like, because it’s just all out there.
Like, you know what I mean? It’s just like, um, yeah. Okay. Super great. Okay. Niall, finally, finally, okay, go ahead. So you must understand that I live in socialist Europe, [Laughter] where there is a regulation which says that I am in control of my data. And I get to ask the company to do various things with the data, including deleting it, giving it to me, et cetera, et cetera, et cetera. Y’all can say your data is strewn all over the internet, and that’s all totally fine, and so PII is not useful anymore. I’m going to speak to you from my ivory tower of wonderfulness and say, yes, actually GDPR solves a lot of problems, and actually GDPR also creates a lot of problems, and the impact of the legislation.
Well, like there’s a lot of ideological conversations that go on about this. People saying it impedes kind of startups. It, it concentrates the power in larger organizations, which are able to afford the teams and resources that go towards managing the privacy of data and so on, so forth. It’s a very important concentration of economic power argument. But I will say that I think regulation as a tool for managing this is, is overlooked in the American context, and I will leave it there.
I mean, I completely agree there because the problem, and, you know, it’s funny, I gave a talk on ethics and I was talking to my CEO about this, and, you know, it’s ironic, an ad guy talking about ethics, but hear me out. [Laughs] We were talking about ethics in the industry. And our CEO, Shawn Riegsecker, was so adamant that we need government regulation, because without that, it’s sort of this race to the bottom, right? Like you can be the company that decides, I’m going to, you know, set my foot down and we’re going to do all of this great security stuff, or all of this great privacy stuff in the ad tech industry. But if the rest of the industry says, well, we’re not going to do that. Right? They could end up eating your lunch.
So I agree. Regulation sort of levels the playing field and sort of sets up like, exactly like, okay, this is the standard by which everyone needs to play. And, you know, I think making security part of that, as well as, you know, sort of like leveraging the fines in such a way that it’s not just the cost of doing business, but like, there’s, you know, some actual uh, uh skin in the game around it, because if you can just sort of write it off in a budgetary context, then you know, yeah, a lot of people are going to do that.
Well, like if you, if you think about the, the mechanism by which economic externalities, are turned into internalities, the mechanism for that is regulation. Like the market, ain’t going to solve this. There’s no incentive for the market to solve this, because like personally, we continue to shop at Target. Well, I do not continue to shop at Target because Target has done all way, but you got the idea. So anyway. Um, so, okay. Ell, I know you have opinions on regulations, so why don’t you jump in? I do too. So we’re just going to continue going into this. I’m desperate to hear them so. [Laughter]
You know, the whole concept of regulations is great in theory. Like even when it comes to the EU, we recently had them fine, what was it, 220-something-odd million euros, when it comes to WhatsApp having their data leak. Okay, cool. But you know what? My data is still out there. It’s already there. It’s already been breached. They had a fine, and it’s not really doing preventative measures. And I could give example after example. And so this whole concept that regulation will solve everything, it would be that utopian society. Right? And don’t go into Utopia the book, just the concept of utopia. So anyways, that’s all I had to say: all of this is great in terms of the theoretical and in a perfect world, but that’s not what we’re actually seeing.
So, so I think personally, first of all, regulation is the only way to regulate the market. Right? Otherwise we would have monopoly in every single, you know, situation because that power concentrates over time. Right? So this is a perfect example. Like we do regulate a lot of things and it does work when we regulate things. It’s not perfect, but it does work. And it, again, it creates the only incentive companies have to actually invest in something like security compliance and things like that. What I think though, so like referring to books, there’s this book called The Inevitable and, and it’s, it’s really nice in general, just kind of different take on where the internet is going and like w what, how it all started and where are we going to end up.
I think, from my standpoint, what he raised in this book is like, this is already happening. It’s already happened. Like, we can’t take back our data. We can’t take back our data, not just from breaches, but also like from Facebook, or Google, or whoever else knows everything that I’ve done since high school. Right? Like it just, I, I cannot, no GDPR is going to help me get the data back.
Right? It’s already there. So what we have to do is evolve as a society, for this new world, where everything is out there, and you can find pictures of me in, in high school and use them if I run for president, which I can’t do, by the way, I’m an immigrant, so. [Laughter] But anywho uh, it, it, like, I think we have to be realistic. Like, I I’m, I’m always pragmatic. Right? So it’s the same conversation for security.
If we live in a wishful-thinking world, and we say like, we have to secure everything, like, here, it’s going to cost this little company 2 million to have a proper security posture, they’re not going to do it because they don’t have the money. Realistically, they just can’t, even if they wanted to, even if they, like, had the best intentions at heart. So we have to make it easy for them. Right? So like, I’m a big fan of trying to automate security as much as we can, because that allows a developer that’s writing the code to click on buttons, like, you know, instrument the code to be more secure. And like, yes, Jeff, maybe no one’s looking at the alerts, but maybe we need to automate that too.
Right? So we have a signal-to-noise ratio that’s much better than we have right now. So, there’s a lot of things we could do if we accept that the reality is what it is, and we have to address it in ways that are doable. Quick clarification, nothing else: I wasn’t saying I’m anti-regulation. Just put that out there. I wasn’t saying that. Alright, go on. We didn’t, we didn’t think so, like no, for sure. Um, okay. I’m going to move on to the next question, [Laughs] because believe it or not, this was the first question we ever talked about.
Um, so this one actually, I guess, is, is interesting, because um, what do we think everyone gets wrong when trying to run reliable systems? Like what’s the biggest mistake, you see out there in the industry, in your companies or other companies? And Jeff, I’m going to start with you. Um, I would say the biggest thing that people, companies get wrong. And I haven’t been to every company obviously, but the biggest thing that people get wrong is that reliability is this thing that we slap on at the end. You know, it’s like, we, we go through this phase, we build it. Then we like, okay, now, you know, now that it’s in production, let’s make it reliable.
As opposed to thinking about it as a feature of the product. Right? Same thing with security, honestly, it’s like at the very end where like, okay, now let’s talk about security. So it’s like, how do we bake these requirements, you know, back into the actual requirements of the product? Because if we’re always at the tail end of this, reliability is always going to be behind. And there’s a lot of things that, you know, especially with, you know, modern practices and things like that, there are things that applications need to take into account to be reliable.
They’re not things that we can always do just with infrastructure. Um, you know, sometimes we end up, you know, uh, creating cover for poor applications through infrastructure, by saying like, you know, oh, well, we’ll dynamically scale when these things run out of memory and fall over, right, so that we’re not losing traffic or anything like that. But then the other question is like, well, you know, when are we going to spend time to figure out why we’re running out of memory and falling over all the time, right? How do we build good metrics into the platform so that we’re emitting very specific bits of data, right? As opposed to, well, memory and CPU look great, how are we making sure that the applications are emitting metrics that verify not just error conditions, but that things are actually working the way they’re supposed to be working, right? So it’s easy to say like, oh, this thing broke, but we should also be saying, this thing was successful, or this message was received and this message was processed, right.
As opposed to just one end of the equation. So I think it goes back to getting reliability into the design phase of the product. And, and this is going to sound again, unpopular, product needs to be on the hook for that reliability, because if product is, assuming product is powerful in your organization, different orgs have different product structures. But if the product org is the one that’s setting the prioritization that is, you know, responsible for reliability as well, guess where their view on reliability is going to change. Oh, okay.
Now that I’m on the hook for the five nines. Yeah. Let’s put that feature into the sprint that is going to fix this database instability and things like that. Um, so, you know, it goes back to design and I, you know, I’m, I’m putting product on the hook.
I, I had to agree with like every word that you said. I feel, I have a feeling that this will be a question, where we just all agree on. I don’t know, Niall, do you want to jump in? Uh sure. I mean, we kind of give you nine fives, so we can kind of give you five nines. You can probably get me five nines, sorry. Other way around.
Other way, uh. Jeff, I was going to respond to you and say like, it’s, it’s, I’m not quite sure who you thought that opinion was going to be unpopular with. Cause I’m not sure if anyone on this panel, but I, I will say that I think the general point that you’re making that concerns about everything other than feature engineering at the moment are peripheral in most kind of product conversations, organizational prioritization, conversations, all of those kinds of things, right? There’s basically feature engineering, for which it is perceived.
You know, each next step of feature engineering will bring more money into the company. Each next thing which is not feature engineering will not bring money into the company and therefore is a lower priority. So what I’m seeing in the industry today, when I talk to my clients and a bunch of other folks in the industry, is this core of feature engineering, and we have more or less every other priority kind of peripherally around it, all petitioning what they imagine, you know, the central team to be, to give them the resources in order to do the thing that they’re supposed to do. And I just think that’s a fairly hopeless situation to be in really, because you can’t ever approach it in a meaningful kind of way, or expect to even get to an 80/20-style ratio of all of the things you need to do in order to make the product reliable, secure, et cetera, et cetera. So to my mind, we really need a new, well, we need a new paradigm.
But we also need to understand our current situation a bit better. I think we need a kind of a physics of software, or maybe a biology of software, so we understand what the trade-offs are between the elements in this space, and we’re in a position to make a much more informed decision than the really simple Boolean model that a lot of engineering leadership has right now, which is: one equals money, good,
and zero equals no money, bad. And actually the space is way more like a real number space, or a field of rationals or something, rather than this Boolean model. I honestly think that this is not about biology of software, but about psychology of humans, right? And something that both of you have mentioned is incentives, and incentives are what drives the business. And so if you pay some people to, you know, keep the lights on and some other people to deliver features, and you put your focus on the people who deliver features, then guess what’s going to happen, right. You’re going to accumulate tech debt as you go, every single day.
And like, no, one’s going to care about it until again, a CEO ends up in jail, over a breach or something. I don’t know, um you know, Ell, do you want to jump into this one as well? Sure. I mean, to avoid being repetitive, but I think, you know, we already established that everybody agrees with this. I think the biggest issue we’re having is outsourcing ops. Like we’ve, we are basically getting to the point where like, you know, operations side of the house is just extra amount of money.
We can completely outsource this. We’re giving almost all of that responsibility to our devs. We’re having that “well, it didn’t work, just redeploy it.” Without operations, you don’t have anyone to look at it and go, okay, why are we really hitting these memory caps? You know, is it actually what you’re looking at? Or, to go back to the security mind, do you have a crypto miner that’s suddenly gotten in there? Is it actually the software? Or, even if we don’t put that concept of security in it, what’s actually going on with the metrics? What are you actually seeing? Is it that you’re not deploying enough resources, or is it that they’re not allocated correctly?
When you have an in-house ops team and not something belonging in the cloud, they get to know your environment and kind of what your software is supposed to do. It’s really funny when you talk about the biology of software, because the way that I was taught to look at it, when I came to this world, is DNA. Right? You have your base DNA, which is your original plan of what everything’s going to look like. And then you have your runtime, what it actually evolved into. So I think without having someone actually look at not “well, it was supposed to be this” but “what is it actually?”, you’re just setting yourself up for failure.
And I think that’s a huge point, because, you know, we always tend to latch on to what we think we designed, and then not re-evaluate for what it actually is. And a lot of us are running systems and still looking at them through the optimistic lens of what we intended, instead of what they are actually doing. And it’s a common thing with systems, right? Systems are doing what they’re intended to do, not what they’re designed to do. Right? They’re just sort of self-reinforcing. So I think that’s a huge point. I think the other thing too is that if the pandemic has taught us anything, it is that, as humans, we’re not good at evaluating risk.
Right? So I think part of that is our, you know, internalization of risk, and getting better at not only presenting the risk, but then consuming that and making intelligent choices and decisions off of it. So I’m going to pivot just a little bit on this one. And, like, I super love everything that everybody said. Also, I will say that I’m a big believer that there’s no perfect architecture, right? Whatever you design your system for is going to evolve over time. And so yelling about clean code is not very helpful, because you don’t know what tomorrow is going to bring you, right?
And you can’t architect for perfect scale. You don’t know what your features are going to evolve into, what your customers are going to do, all that stuff. But I think what I keep hearing from everybody is that we have to get observability right, right? We have to be able to understand what our code does, understand what issues our software is having, and have it consumable by humans. Right? In a way that, again, signal-to-noise ratio, right?
We’re not drowning in useless alerts about, like, CPU is over 90%. Well, is that a problem or is it not, right? All of that stuff. So I want to ask a controversial question related to the previous question, which is: who should carry a pager in your company? Yes. [Laughter]
I always say it starts with the person that has the power to make the change, because I feel like if there isn’t skin in the game, it’s always easy to defer. Right? So when dev started getting the pager, suddenly the memory errors went away. [Laughs] Right? Because no one wants to be woken up in the middle of the night.
No one wants to be at the barbecue and then have to restart a server. Right? And that was a lot of the frustration of ops, where you would get paged and there would be nothing that you could particularly do about the problem. Right? And it’s like, well, I can’t really do anything about it. So I think the pager should be carried by the person that has the power to actually make change. And that’s something that, in ops, we had to learn: once we started giving the pager to dev, guess what, we had to give them the access too, because that’s the same argument, right? Oh, I got paged, I know the service needs to be restarted, but I’m not allowed to restart it, because I can’t connect to that server or whatever.
Right? So it’s a two-way street, but for me, it starts with whoever has the power to enact change. And once that pain starts to get felt, suddenly tickets start showing up in sprints. So I’m going to bring the other side of this conversation, though. I fully believe that dev should carry a pager, and I’m a dev. And, you know, I fully believe that product should carry a pager periodically, right? Once a month, experience the pain that support goes through, right? You don’t have to live that life every day.
You have to know what your customers are calling your support about, or what’s blowing up, or what’s paging people in the middle of the night. But there’s a lot of conversation about this. And every time it comes up, there are a lot of people who say, well, you know, I’m skilled at this. I’ve studied computer science for the last 10 years, whatever. Right.
And I know how to write C++ code, and I shouldn’t have to worry about rebooting servers or something like that. Right? And it’s like, it’s just a waste of expertise, or something like that. So is that a concern? Should we listen to that? Yeah. This is going to sound like I’m making a joke, but I’m being 100% serious. Who should be paged? Whoever makes enough money to actually care that they’re being paged.
I’ve seen too many situations where it’s like, oh, I’ll get to it, hold on. And that gets delayed. The best way I ever saw it work, and I know it’s different for every company, is the first page went to that serve slot. They got on and they’re like, okay, I can’t do something, but this is what I’m seeing. The second page went out to the team. We’re talking site, you know, the lead architect, we’re talking product, we’re talking to the dev. There’s a meeting, everybody’s involved in it, everybody’s having the discussion. With that many people being called into it,
you better believe that alerts and issues started dropping, because nobody wanted to be the reason that everybody got woken up, as well as why they were there. Interesting. Do we need to define a pager too? [Laughs] I think, you know, every time I use the word pager, I’m like, do people know what we’re talking about? Maybe PagerDuty is out there and people actually know what PagerDuty is, but, like, being on call, right? Responding to software. We definitely want to define it.
Yeah, I was being facetious, just because, you know, whenever I use a term like pager, or “oh, I got beeped,” people are like, oh, you sound so old. It’s like when people still call it iTunes instead of Apple Music. [Laughs] For sure. For sure. Niall, do you want to say anything on the topic? Yep.
I certainly do. [Laughs] I have a lot of feelings about on-call, some of which I’ve recorded in previous talks. I think the first thing to say is that there’s this thing called the wisdom of production, which is a thing in SRE theory. It basically says more or less what Jeff was saying at the start: that being connected, in a very direct and real way, to the consequences of your actions earlier in the SDLC is actually a good thing for the product, for your expertise, for your development as a human being and a designer of software, and so on and so forth. The only reason why you wouldn’t do it is, of course, because it sucks and it’s terrible.
And it has terrible effects on human beings and all of those kinds of things. So it tends to be that the terrible-effects bit dominates the discussion, as opposed to the better software craftspersonship component of that discussion, which I think is something that we need to change in the industry. And there are a lot of things we could do about changing some of that balance that I think we’re not doing. But the other thing I would say about carrying the pager is about the kind of implicit social hierarchies we’re talking about here, and have talked about earlier in this conversation. There is a kind of a, I’ll use the word Marxist, but I actually mean class-based, analysis of on-call which could take place here, which is to say: feature engineers at the core of an organization, bringing in money, revenue, et cetera, et cetera, are perceived as being a higher social class than others.
And so we’re all petitioning for attention, resources, et cetera, et cetera, from this situation. There is a surprising amount of analysis of conventional software development and that whole process, and of how humans, I mean, coming back to your point about human psychology, how humans operate in groups and then companies and so on, that’s missing when we think about: what does it mean to be a good software developer? What does it mean to work well in a team? What does it mean to run a service effectively? And I think the fact that there’s such a variety of models for how we do on-call, which really is very strongly varied across a lot of the industry, and even the fact that there are the DevOps folks, you build it you run it, ITIL, SRE, like, all of this variety in the space of our highly crumbled production concerns, is really a reflection of the fact that these things are in tension when people are trying to figure out how to live, how to work. And I don’t think we’ve got there yet.
I think this needs way more attention than it gets. So I completely agree with you on the classism. Right? And there’s a lot of classism in how we run our teams. And it’s really interesting, because I’ve been a dev, and then I was what we now call a DevOps engineer, maybe, you know? And then, you know, I’m in technical sales now. Right? So every team is like, well, I’m not marketing, marketing sucks. I’m not sales, sales is only invested in money.
Like, I’m not dev, I’m ops, I’m better, I care about systems, whatever. Every team has a lot of swagger, and it usually comes at the expense of other teams. And then there’s definitely this classism, that feature devs usually get the most investment from the companies, right? I think that the big tech companies are actually doing this right, in the sense of trying to remediate that stance and make everybody more aligned in terms of incentives, so that there’s not a huge disparity between all of these roles in terms of pay, in terms of the investment that they make into people’s careers, and stuff like that.
So this is the conversation, I think. And I don’t remember, Jeff, if you were in it, but we’ve been running DevOpsDays Chicago for the last seven, eight years, I don’t know at this point. And at one point I was kind of getting frustrated, and I was like, did we actually make any difference in the industry? Like, you know, I’m dealing with the same stuff every day. Did we do anything in the decade that the term DevOps has existed? And actually a lot of people came back to me with, like, yes we did, because ops is now DevOps or SRE, right?
We keep this evolution. Those are people that are valued. Those are people that do interesting work. And those are people that get paid to do interesting work. Right? Which was not the case 10, 15 years ago, when it was kind of like ops is the janitor of the dev. Right? And so I think we are definitely making progress in the right direction in the industry. And I don’t know if anyone wants to jump into this one. Yeah, no, I completely agree. And, you know, we are getting more focused on ops, and I think part of it, as terrible as it sounds, is that a lot of people don’t understand what it is operations is doing and providing for the organization.
And that’s where that sort of resource contention comes from. Because if you’ve ever been in an organization where the system becomes unstable and you’re having outages, suddenly ops has every resource they need, because now people understand what it is that ops is doing and what they’re providing. And ops is given a platform to say, like, we don’t have this, we don’t have that, we need this. Now suddenly it’s like, hire three people, right? We’ve got a war room. Here’s a dev team dedicated to this. Because we’ve felt the impact in a way that, you know, a lot of organizations only understand financially.
So I think we’re definitely making progress in that respect, because people are understanding the power that comes from having a solid ops organization. Because the problem with ops is, if you’re doing your job, no one notices. [Laughs] Right? It’s just this sort of invisible background thing that happens. And it’s not spectacular that the system didn’t crash, right? No one gets rewarded for the near misses. Right? So I definitely think things are getting better. I think operations leaders need to do a better job, again, of quantifying the value that we’re providing, quantifying the risks and the trade-offs that we’re making, so that people understand what it is that we’re bringing to the organization.
So, Ell, I don’t know if you wanted to jump into it. And again, security is also a class of people that gets prioritized based on certain circumstances. Right? So I don’t know if you want to speak to that. You know, having worked on the ops side of the house for quite a while, I have some opinions there. But to go back a minute, I thought this was hilarious, because I totally get that it was a joke, but your whole concept of, like, “I don’t know if people know what I mean by pager,” that’s actually one of the key examples of what I’m talking about when we talk about this whole platform, right?
It’s supposed to be the managed platform, in a managed panel, yet everything has focused around the concept of SRE. Like, there is more to managing a server, an environment, depending on where the company goes. And I recently heard a statistic that there are 10,000 developers per one security person. And like I said, basically ops are being pushed out, so their numbers aren’t any better.
So the whole concept of, well, they have to quantify it, they have to look at their metrics. Okay, great. How do we keep up with you, with our numbers, if we’re not involved in the conversation? One of the biggest issues in all of this is communication. If no one speaks the same language, and not everyone has visibility into it, aren’t we putting unfair or unrealistic expectations on everyone else inside of the house?
I mean, absolutely, right? For me, though, and this may be unsatisfactory, it’s still part of that, you know, how do we quantify it? Right? So for example, security, for the longest time, was an ops concern amongst the many other things, as if it wasn’t, or couldn’t be, its own discipline. Right? So once we start to quantify that and say, like, hey, here are the pain points, right? Here are the things that the security team would inform us on in terms of, you know, best practices, policy, even regulation, right? Suddenly, once clients start asking, you know, what is your posture, what is your regulatory compliance stance, right? Then suddenly it’s like, oh, we need a security team. Right? Because the value of the team becomes crystallized in a way that we weren’t doing before, when the value wasn’t just about potential new revenue, but about risk to existing revenue.
And, you know, there was a term that I heard, and now I’ve got brain fog, so my mind is blanking on the person. But revenue protection was a term that I heard to define things like maintenance and security and things like that, you know? So that helps to frame it, in the sense that, hey, look, the risk isn’t just theoretical. If it happens, there are potential dollars at stake. And I still just think we need to do a better job of communicating that. And in most organizations, it starts with the operations teams, because we don’t start with a security team, which is sort of indicative of exactly the problem you’re talking about.
Right? It’s like, you know, we don’t even have this thing that is super important, even in the mind, even in the smallest capacity. So it starts with the operations team, and then, you know, we’ll continue as the security organization grows. So how do we make it better? Again, real world, right? We don’t have unlimited budgets. And the statistic that you quoted, Ell, I used to quote it too, when we were talking about why we should automate security. Right? Because that’s how bad the situation is. We have so many devs working on features that have never heard of security.
Again, if you look at my personal example, the first time I played a CTF, I was like, oh, okay, I’ve done all of these. Like, all of them, you know what I mean? And no one ever told me anything. And, you know, I did a computer science degree, and no one ever taught me about this. Like, where did we all miss the training while trying to onboard people? So is there a way for us to make the situation better for ops, for security, for observability too, for all of the things we should care about when we’re developing software?
I know that my answer’s going to seem like, okay, really, we can’t do this, but it’s a long-term plan. And it’s the whole concept, and I got told to keep it PG-13, so: screw DevSecOps, we are one team. We have one goal. And in the end, and this is a whole other topic, which maybe we can get into on a different panel, we need to have one set of tools to begin with. And that’s, hey, you keep talking about, you know, adding security from the beginning; how much security training have you had? There are 50 CVEs that come out per day.
How much are you expected to have? So if, even from the beginning, we’ve got ops in the conversation, we’ve got security in the conversation, everybody’s going to learn from each other. So it’s not a quick solution, but in the long run, devs will be able to automate security better, and security teams will be able to give you insights as this is automated. So we can’t just put everything on one side of the house when it comes to building. And to that point too, another thing that I see a lot in organizations when I talk to people is, especially in security,
the organization has zero knowledge about security. And what I mean by that is, so often we rely on the engineers, and the network of their engineers, to know about CVEs, right? To know about critical vulnerabilities and things like that. And it’s like, oh, we know to patch because I’m up on Twitter, right, and I see people talking about it. But does the company have a way to know about these things and take the appropriate action, right? What are the systems in place that allow that to happen? And a lot of times when you ask that, there are a lot of blank stares, right? There is no tooling.
There is no process around surfacing that stuff. So, you know, I think one of the things we can do is start with: what does the organization know, minus some superstar engineer leaving, right? And because she’s gone, all of the knowledge about potential vulnerabilities is gone with her. How do we embed that in the organization? The other thing I was thinking about, one thing we’ve experimented with a bit, is confidence intervals in terms of quantifying risk, right? So when someone says, what’s the risk of us not patching this thing, I say, well, between $25,000 and $7 million, and they’re like, whoa, why such a huge range? And it’s like, well, because we’re not sure. So that represents how we have a lack of confidence about how big this impact is. If we spend a little time and energy, I can narrow that range for you. Right?
But do we want to make that investment? So we did that specifically with looking at going multi-region active-active, right? So I say, well, you know, I don’t know how much it’s going to cost, but I know it’s going to be between, like, 250 grand and, like, 3 million a year. Right? That could very quickly get evaluated, and be like, well, all of those numbers are higher than anything we might have to pay out for the SLA.
Right? So maybe we don’t do it now, and we table it until later. And that evaluation took me, you know, maybe a couple of days’ worth of work. But once we did that, we had enough data to say, okay, we don’t even need to know a more specific range, because we already know that this is ludicrous for us right now. But we might come back and revisit it.
So how do we do that with security as well? Like, you know, oh, what’s the cost of the vulnerability? Well, it’s a remote access exploit. So, you know, it could be everything, right? Or it could be nothing. We can look into it more, or we can spend the two days, stop feature work, and patch this thing.
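The range-based evaluation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Range` type, the `decide` function, the SLA payout ceiling, and the cost of two days of patching are hypothetical and were not given in the session. Only the 250 grand to 3 million and the $25,000 to $7 million ranges come from the discussion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A low/high dollar estimate; a wide range means low confidence."""
    low: float
    high: float

    def __post_init__(self):
        if self.low > self.high:
            raise ValueError("low must not exceed high")


def decide(cost: Range, loss: Range) -> str:
    """Compare the cost of acting against the loss it would prevent.

    Only when the two ranges overlap is it worth spending time and
    energy to narrow the estimate further.
    """
    if cost.low > loss.high:
        return "defer"             # even the cheapest fix exceeds the worst loss
    if cost.high < loss.low:
        return "invest"            # even the priciest fix beats the mildest loss
    return "narrow the range"      # ranges overlap, so more analysis is needed


# Multi-region active-active: cost range is from the talk, the SLA
# payout ceiling of $100,000 is a hypothetical figure.
multi_region = decide(cost=Range(250_000, 3_000_000), loss=Range(0, 100_000))

# Patching: loss range is from the talk, the cost of two days of
# stopped feature work ($5,000 to $10,000) is a hypothetical figure.
patching = decide(cost=Range(5_000, 10_000), loss=Range(25_000, 7_000_000))

print(multi_region, patching)
```

With these assumed figures, the multi-region proposal is deferred without any further narrowing of the range, which mirrors the point that a couple of days of estimation was enough to decide.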
So quantifying risk is a recurrent theme. Niall, do you want to jump into the, sorry, Ell. No. I just want to say two things, I suppose. I can’t hear you now. There’s, like, a ton of static. You’re breaking up super badly. I think we have to put you on mute, unless you can change your speaker. While he’s doing that, I can throw out the statistic that I wanted to give you, which I think, Jeff, you’ll love: right now, with the current process, it takes 280 days to actually detect a breach, and that’s not even to react to it. And I think part of it is we keep talking about patching.
How long does it take for a patch to come out? How long does it take for the issue to be found? This is all added time that we can’t really just focus on. So I thought that, you know, the DevOps side of the house, or the dev side of the house, y’all might be interested in how long it actually takes to get there.
Yeah. That is a, yeah, that is a, I can’t even get my words together. [Laughter] I’m so kind of like, hahhhhh, about that stat. Oh, that was cool. And again, we’re living in the reality where breaches are a fact of life, right? You have been breached, your company has been breached, like, it’s breached right now. There’s some malicious actor inside your infrastructure. You just haven’t found them yet. Like, that’s the case for everybody all the time.
And again, that actually also maybe devalues the conversation, right? Because when people realize that, there are two ways you can react to it: you know, we either just say, like, okay, that’s fine, or we try to fix it, which is almost undoable. And so, as Jeff says, we’re bad as humans at evaluating risks. So it’s an interesting conversation. Niall, do you want to try to speak again? Is this better?
Actually, not. Maybe your speaker is still going through the wrong, um, there’s a gear icon that you can click on to see. And so we’re kind of getting to the end of the panel, but it’s been super fun. I’ve really enjoyed this, and I wish we could have in-person conferences, which we’re starting to go back to, and actually have this conversation in person, which would be fantastic. But I’m glad that technology affords us a way to actually do this remotely and, you know, keep talking to each other, because I think the best things are born in conversations. Now, Niall, do you want to try again?
3, 2, 1. Yes, you’re fine. Yes, very quickly. I just want to make the point: we’ve been talking a lot about economics, externalities, incentives, and so on. The insurance pricing model of figuring out how much it’s appropriate to spend in the case of risk, et cetera, actually doesn’t capture the full extent of what’s going on. To say, like, this thing is worth a million dollars, and there’s a 1% chance of being unreliable or exploited or whatever, therefore we spend 1% of a million dollars: that probably doesn’t capture the rising value of what you’re doing, the long-term value of the escape of the data, and a whole bunch of other things. It’s a very instantaneous, point-in-time kind of model, which I don’t think captures the holistic value.
And the more I think about where the industry has to go, the more I think about holistic understandings, whole life cycle understandings, and how that’s really the key to attacking any of these problems. The other thing I wanted to say is that I think reliability and security, and other domains as well, are very, very similar in terms of how they’re approached. I mean, even on a technical basis, a lack of reliability can often be proxied into a lack of security in some sense. And I think that, rather than accepting the fate by which we are kind of milled down into separate groups, which have our own separate ways of being and incentive structures, and don’t talk to each other, we need to aggregate ourselves in order to have any kind of a realistic chance of addressing these problems.
But that’s my soap box. Alright. So I wanted to ask everyone now, in the last couple of minutes that we have, to give us a parting sentence, right? So leave us with something: a piece of advice, or, you know, a complaint, or something that people can do, or a book you should read, like, whatever it is that’s your fancy today. So I’m going to start with Jeff. Ooh, I guess I would start with iterate, right? Like, we talk about a lot of things, and where you are in your journey could be radically different, and don’t feel that you have to fast-forward to the end, or to the perfect state.
Right? For a lot of the things that we’re talking about, these are big changes. These are big organizational changes that are going to take time. So don’t be afraid to bite a piece off, because here’s the thing about progress: it is addictive, right? So the thing that you do today that makes life a little bit better is going to make you hungrier to make that thing a little bit better. And then that thing a little bit better.
So you could quickly go from “we have zero security awareness” to “oh, now we have CVE reporting, but we don’t have the time to fix them.” But sooner or later you’re going to be like, alright, now let’s talk about fixing them, right? Then it’s going to move to, well, how do we make sure that we just don’t have them, and these fixes are happening automatically? So just starting small will sort of whet the appetite to get better, and better, and better. So if you’re stuck in this paralysis, just start biting something off small. And then that will be the thing that fuels that addiction.
And you’ll continue to improve. Alright, Niall, but you have to stick to one sentence, because we have a minute. Read the book Accelerate. Alright, and Ell? It’s all about runtime. Whether it’s automating and deploying, whether it’s your software, or whether it’s malicious or unauthorized code and attack, that’s exactly where you need to start. You need to start with your runtime.
Nice. And so my parting thoughts would be: be kind to other humans and examine the incentives, alright? Go and examine what people get paid on, and you will find where problems come from. And this has been super fun. I’m super glad that we did this, very exciting. So this has been a part of Cloud Engineering Summit, hosted by Pulumi. And thank you so much to Jeff, and Ell, and Niall. It’s been my pleasure. And again, I’m Sasha. You can find me on Twitter if you want to. Bye, everybody.