Code[ish]

Salesforce Engineering

A podcast from the team at Salesforce Engineering, exploring code, technology, tools, tips, and the life of the developer.

All Episodes

118. Why Writing Matters for Engineers

In this episode, Ian, Laura, and Wesley talk about the importance of communication skills, specifically writing, for people in technical roles. Ian calls writing the single most important meta skill you can have. And the good news is that you can get better at it with deliberate practice! Ian and Wesley both come from engineering backgrounds but have moved into more writing-intensive roles as their careers have progressed. Laura is an instructional designer with experience across many industries. They all agree that writing plays several important roles, whether it's to educate, persuade, or even mark a decision. So if writing is such a critical part of engineering work, how can you get better at it? Laura offers a handful of practices, including providing context, supplying the appropriate level of detail for the audience, using stories or analogies, incorporating repetition, and finding a good editor (even if it's yourself coming back to a piece with fresh eyes). The guests close the episode by sharing some of their favorite resources for improving communication skills, which are listed below. Links from this episode: “Programming as Theory Building” by Peter Naur; Example of an RFC process; Illusion of Explanatory Depth; The Sense of Style by Steven Pinker; Write and Organize for Deeper Learning by Patti Shank; Tech Writing course from Google

Aug 3, 2021

43 min

117. Open Source with Jim Jagielski

This episode is hosted by Alyssa Arvin, Senior Program Manager for Open Source at Salesforce, with guest Jim Jagielski, the newest member of Salesforce’s Open Source Program Office (OSPO). They talk about Jim’s early explorations into open source software during his time as an actual rocket scientist at NASA and his role in the formation of the Apache Software Foundation. Next, they discuss getting started in open source, specifically, how to find the right open source community for you to start contributing to. They suggest looking for a code of conduct that the project members take seriously to make sure you’re joining a community that is welcoming and takes diversity and inclusion seriously. Who’s part of an open source community? Well, that would be more than just the contributors--it’s also the project’s end users, even companies who consume it. Those companies have a responsibility to support the projects they use, to contribute back and provide feedback to keep making it better. As an individual contributor (IC), contributing to open source can be part of your growth plan! Leveraging open source contributions to grow your skill helps you become a better employee. Jim encourages companies to adopt as frictionless a process as possible for employees contributing to open source. Salesforce sees open source as a strategic advantage for the company. It’s a way of driving culture, of ensuring that teams collaborate and communicate and, in the process of doing that, drive innovation to benefit not only the individuals who contribute but the company as well. How important is open source to your corporate culture? That will drive how you go about building an Open Source Program Office (OSPO). It really is, at the end of the day, a cultural shift. Finally, Jim shares concrete tips for getting started with your first open source project. 
He suggests “lurking” in the community and checking their bug tracker for issues marked as “good for newbies.” Most projects have a handful of people who are signed up to be mentors and can help you out. Also look for something like a contributing.md file that makes it clear how you can get involved and what the future will hold for you as you get more involved. Alyssa closes with the comment that she’s excited to work with and learn from Jim, and we are too! Expect to hear more from him on future podcast episodes. Links from this episode: Open Source at Salesforce; Apache Software Foundation; Open Source Initiative; People Powered by Jono Bacon; TODO Group; InnerSource Commons; The Apache Way

Jun 22, 2021

28 min

116. Success From Anywhere

This episode of Codeish includes Greg Nokes, distinguished technical architect with Salesforce Heroku, and Lisa Marshall, Senior Vice President of TMP Innovation & Learning at Salesforce. Lisa manages a team within technology and product that focuses on overall employee success in attracting technical talent and creating a great onboarding experience. The impact of remote work: Salesforce is looking at various work configurations across remote and in-office options. Lisa shares, "In the past 12 months, we've been thinking about what the future will look like. What do our employees want? What do our leaders want for different worker types?" In addition to fully remote and in-office workers, there are flex workers who "come into office maybe one, two, three days a week to work with their Scrum teams, or maybe even one day, every other week. You come to an office to work together when it makes sense for you and your team for collaboration and other ways." She notes there's a lot to learn from workers like Greg, who has been working remotely for 12 years. Greg notes, "It took me years to figure out how to work successfully from home and how to have home not encroach on work and work not encroach on home." After unsuccessfully working from the couch, he found he needed an office with a door. Greg stresses that remote work in the pandemic is not the same as remote work at other times. "One of my joys was going to a coffee shop, having a really good cup of coffee and sitting there without headphones on, just listening to people talk while I would write and just that background noise. And I really miss that. So I want to make sure that everybody who's been forced to go remote knows that the present is not a great example of remote work. It's a lot different, and it's a lot harder." Lisa and her team have been talking with other companies that are fully remote and stress that the experience of working fully remote during the pandemic "... isn't normal.
We know we all want to see each other. We want to get back together at times where it makes sense." Part of this is focusing on "the things that we can do right now that we want to keep doing in the future when things start to open up." Greg Nokes asserts that a remote-first work approach differs in organizations where remote work is an afterthought. He gives the example of a group of San Francisco employees sending a lunch invitation over a messaging platform, "...and then everyone in San Francisco signed off and when they signed back on, I'm like, ‘What happened?’ They'll go, ‘Well, so-and-so said, where do we want to get lunch? And then we all talked about it in the coffee shop; we're all sitting in, and then we went to lunch together.’ And we're like, ‘that's not remote.’" Lisa Marshall shares the need for intentional inclusivity. "We all know how horrible it feels when you're in a meeting. And when you're a remote person, and others are in the room, and it's very hard sometimes to get a word in edgewise, it's difficult to hear all the common things." Her team is working on organizational guidelines, including team agreements on how people want to work together. One senior leadership team has decided their weekly team meetings will be 100% remote because they found they communicate better when they're all online versus some co-located. How will offices look in the future? Lisa believes the majority of the office will be flex in the future. "So we're looking at how do we want to configure our spaces to support the kinds of work people want to do in the office? What kind of different technologies can we use? What kind of seating arrangements around couches or different pods or other considerations for building in those spaces to be truly about collaboration versus only individual work?" Lisa's team is also focused on trying apps and tools to see what works and start rolling the tech out to other locations. 
Greg Nokes shares, "The last year has been a tremendous inflection point. And it's given us the ability to re-examine what work is and how we get it done. And I think folks that are just going to go back to the way it was before are really missing out." What are the unique challenges engineering teams face in a distributed/in-person environment? Dev teams are already agile. Lisa asks how they can adapt to remote work in a way that "doesn't burn people out from staring at screens all the time?" She also believes that release planning will change in response to remote work by breaking it up into "smaller increments virtually to do your planning, whether it's two hours a day and having those chunks of time to work together." Fun is important, and recognizing that people can't work non-stop. But we're all pretty tired of Zoom happy hours. Salesforce recently had a paint party and magicians for parents with kids. Equally important is protecting maker time, where developers need to be heads down to get things done without any meetings. Any advice for new remote workers or new hires? Lisa stresses the importance of onboarding new hires. Part of this is about having fun and building relationships through hanging out virtually together, creating an opportunity for new hires to ask questions about who to go to. "And if you don't build that in, it's really hard to just accomplish that because your work is going to get prioritized based on the tasks that you have." Greg also commends the idea of cross-team get-togethers as an opportunity for diverse opinions.

Jun 8, 2021

30 min

115. Demystifying the User Experience with Performance Monitoring

In this episode of Codeish, Greg Nokes, distinguished technical architect with Salesforce Heroku, talks with Innocent Bindura, a senior developer at Raygun, about performance monitoring. Raygun provides tools and utilities for developers to improve software quality through crash reporting and browser and application performance monitoring. According to Innocent, the absence of crash reports does not mean that software is performing well. Software can work but not be optimal. Thus, Innocent takes a holistic view: “I look at the size of my audience, and if it's something sizable, that gets a lot of traffic, for example, a shopping cart that gets a lot of traffic on a Black Friday. I would want to be in a comfort zone when I know that during the peak periods my application is still performing, so I tend to look at the end-user, how their experience looks like during very high peak periods. And from there I start working my way back to the technology that is supporting that application.” Raygun really shines in monitoring the time spent in different functions and helping to improve the performance of highly hit endpoints. This includes performance telemetry of browser pages, the current application running, and server-side performance application monitoring. Raygun has lightweight SDKs or lightweight providers that can be injected into code. These provide a catch-all to deal with unhandled exceptions. They also encourage best practices for developers. Greg asks how to track a user's journey through the application in order to see the endpoints being hit and the user experience. A RUM (real user monitoring) tool can provide opt-in user information. In the case of JavaScript, an SDK is integrated with code to create a session ID that follows the user through every single page that they visit. This internal ID can also be associated with crash reports.
Over time, Raygun can provide a complete picture of how the user session performed “from the point they visited your page, logged in, visited a couple of pages, and then left your application. The crash reports and the traces relating to that particular user are also tied up with that session on the Raygun side.” Innocent highlights a sampling strategy that reduces the noise of APM data. Raygun also provides a bird's-eye application view with aggregated stats on application performance: “For the RUM product, you will have each page aggregated over time, regardless of how many users you've had in a period of time. You want to look at the individual sessions. That information is aggregated and you're able to see, for example, your median, your P90, and P99.” Innocent focuses on the P99 figure because “whoever is in there has had a terrible time, and that forms the basis of my investigations. I want to know why there are so many sessions in that P99, and that P99 is probably a six or seven-second load time. I want to move that to a sub-three-second.” Innocent provides a definition of P99 for new customers undergoing the journey of performance optimization. Next, Innocent asserts that decisions should be based on numbers and empirical evidence. He has found that the use of actionable data has enabled him to redesign applications and focus on the mission-critical command needed in real time. Innocent concludes: “I think the life of a developer is an interesting one. We fit in everywhere situations permit, and we definitely take different routes to develop our careers. But ultimately what we should all be concerned about is the quality of the products that I produce. This definitely reflects on my capability as a software developer. What sets me apart from the next developer is not the number of cool techniques I can do with code, it's delivering a product that actually works and what better way of knowing what works when you actually measure things.
Everybody should live by the philosophy of assuming nothing, measure everything. Everything and everything should be measured.” Links from this episode: Raygun
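To make the median, P90, and P99 figures mentioned above concrete, here is a minimal sketch of a nearest-rank percentile over page-load times. It is only an illustration: the function names, the nearest-rank definition, and the sample data are assumptions, and nothing here reflects how Raygun computes its own aggregates.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100.

    Returns the smallest sample such that at least p percent of all
    samples are less than or equal to it.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Ceiling of p * n / 100 using integer arithmetic (no float rounding).
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(load_times_ms):
    """Aggregate a list of load times into the three figures discussed."""
    return {
        "median": percentile(load_times_ms, 50),
        "p90": percentile(load_times_ms, 90),
        "p99": percentile(load_times_ms, 99),
    }


if __name__ == "__main__":
    # Hypothetical load times in milliseconds: mostly fast, with a slow tail.
    sessions = [800] * 85 + [2500] * 13 + [6500] * 2
    print(summarize(sessions))
```

With data shaped like this, the median looks healthy while the P99 exposes the sessions that had "a terrible time", which is why the tail figure is a useful starting point for investigation.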

May 11, 2021

26 min

114. Beyond Root Cause Analysis in Complex Systems

In this episode of Codeish, Marcus Blankenship, a Senior Engineering Manager at Salesforce, is joined by Robert Blumen, a Lead DevOps Engineer at Salesforce. During their discussion, they take a deep dive into the theories that underpin human error and complex system failures and offer fresh perspectives on improving complex systems. Root cause analysis is the method of analyzing a failure after it occurs in an attempt to identify the cause. This method looks at the fundamental reasons that a failure occurs, particularly digging into issues such as processes, systems, designs, and chains of events. Complex system failures usually begin when a single component of the system fails, requiring nearby "nodes" (or other components in the system network) to take up the workload or obligation of the failed component. Complex system breakdowns are not limited to IT. They also exist in medicine, industrial accidents, shipping, and aeronautics. As Robert asserts: "In the case of IT, [systems breakdowns] mean people can't check their email, or can’t obtain services from a business. In other fields of medicine, maybe the patient dies, a ship capsizes, a plane crashes." The 5 WHYs: The 5 WHYs root cause analysis is about truly getting to the bottom of a problem by asking “why” five levels deep. Using this method often uncovers an unexpected internal or process-related problem. Accident investigation can represent both simple and complex systems. Robert explains, "Simple systems are like five dominoes that have a knock-on effect. By comparison, complex systems have a large number of heterogeneous pieces. And the interaction between the pieces is also quite complex. If you have N pieces, you could have N squared connections between them in an IT system." He further explains, "You can lose a server, but if you're properly configured to have retries, your next level upstream should be able to find a different service.
That's a pretty complex interaction that you've set up to avoid an outage." In the case of a complex system, generally, there is not a single root cause for the failure. Instead, it's a combination of emergent properties that manifest themselves as the result of various system components working together, not as a property of any individual component. An example of this is the deadliest accident in aviation history. Two 747s were flying to Gran Canaria airport. However, the airport was closed after a bomb exploded, and the planes were diverted to Tenerife. The airport in Tenerife was unaccustomed to handling 747s. Inadequate radar and fog compounded a combination of human errors such as misheard commands. One plane began its takeoff roll while the other was still on the runway, and the two collided. Robert talks about Dr. Cook, who wrote about the dual role of operators. "The dual role is the need to preserve the operation of the system and the health of the business. Everything an operator does is with those two objectives in mind." They must take calculated risks to preserve outputs, but this is rarely recognized or complimented. Another characteristic of complex systems is that they are perpetually in a partially broken state. You don't necessarily discover this until an outage occurs. Only through the post-mortem process do you realize there was a failure. Humans are imperfect beings and are naturally prone to making errors. And when we are given responsibilities, there is always the chance for error. What's a more useful way of thinking about the causes of failures in a complex system? Robert gives the example of a tree structure or acyclic graph showing one node at the edge, representing the outage or incident. If you step back one layer, you might not ask what is the cause, but rather what were contributing causes? In this manner, you might find multiple contributing factors that interconnect as more nodes grow.
With this understanding, you can then look at the system and say, "Well, where are the things that we want to fix?" It’s important to remember that if you find 15 contributing factors, you are not obligated to fix all 15; only three or four of them may be important. Furthermore, it may not be cost-effective to fix everything. One approach is to take all of the identified contributing factors, rank them by some combination of their impact and costs, then decide which are the most important. What is some advice for people who want to stop thinking about their system in terms of simple systems and start thinking about them in terms of complex systems? Robert Blumen suggests understanding that you may have a cognitive bias toward focusing on the portions of the system that influenced decision-making. What was the context that that person was facing at the time? Did they have enough information to make a good decision? Are we putting people in impossible situations where they don't have the right information? Was there adequate monitoring? If this was a known problem, was there a runbook? What are ways to improve the human environment so that the operator can make better decisions if the same set of factors occurs again?
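The idea of ranking contributing factors "by some combination of their impact and costs" can be sketched in a few lines. Everything below is an assumption made for illustration: the 1 to 5 scales, the impact-per-cost score, and the example factors are hypothetical, and the episode does not prescribe a particular formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    """One contributing factor identified in a post-mortem (hypothetical model)."""
    name: str
    impact: int  # assumed scale: 1 (minor) to 5 (severe)
    cost: int    # assumed scale: 1 (cheap to fix) to 5 (expensive to fix)


def rank_factors(factors, top=None):
    """Order factors by impact per unit of cost, highest first.

    Ties are broken by name so the ordering is deterministic.
    """
    ranked = sorted(factors, key=lambda f: (-f.impact / f.cost, f.name))
    return ranked if top is None else ranked[:top]


if __name__ == "__main__":
    found = [
        Factor("no runbook for known failure", impact=4, cost=1),
        Factor("missing alert on queue depth", impact=5, cost=2),
        Factor("rewrite legacy scheduler", impact=5, cost=5),
    ]
    for factor in rank_factors(found, top=2):
        print(factor.name)
```

A ratio is only one possible combination; a team might instead weight impact more heavily or apply a fixed budget, and the ranking is an input to a discussion rather than a decision in itself.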

Apr 27, 2021

26 min

113. Principles of Pragmatic Engineering

Karan Gupta, Senior Vice President of Engineering at Shift Technologies, joins host Marcus Blankenship, Senior Manager of Software Engineering at Heroku, in this week's episode. Karan shares his career trajectory, which includes founding aliceapp.ai, a fast, privacy-first recording and transcription service for investigative journalism, and acting as an advisor for various companies, including Alphy, a platform for women's career advancement. A concept important to Karan is pragmatic engineering. Pragmatic engineering is about having "an oversized impact on the business by applying the right technology at the right time". It's about the technology, the process of creating that technology, and its impact on the underlying business. For example, building an electric car is cool, but producing a version in which people feel safe? That's engineering that changes the world forever. According to Karan, these are the key things that matter in development: Fast-ness (speed), Function (capabilities provided), Form (how it looks and feels), and Fabrication (how it is built on the inside). He recalls the value of the snake game on 404 pages and the value of intentionality, saying "once you add a feature, it's probably going to be there forever. It's probably going to need maintenance and love and care forever. So do we really want to put it in?" He talks about design and the balance between form and function, such as designing something aesthetically pleasing versus easy to use. Then, there's fabrication: "How well can we make it? Can we deliver it quickly? And can others maintain it?" Sometimes using off-the-shelf software and well-proven frameworks is the most effective approach, and "Perfect is the enemy of good enough." Karan stresses the importance of being a learning organization. "Be open to picking up what's out there to help make more informed choices, especially if the choice is to stick with the tried and tested."
Good engineers are always open to learning about what new things are coming out and open to different opinions, frameworks, and ways of thinking. Links from this episode: Shift Technologies; Alphy; AliceApp

Apr 13, 2021

36 min

112. Managing Public Key Infrastructure within an Enterprise

This episode features a conversation between Robert Blumen, DevOps engineer at Salesforce, and Matthew Myers, principal public key infrastructure (PKI) engineer at Salesforce. Matthew shares his experience running a certification authority (CA) within the Salesforce enterprise. He shares the rationale for the decision to take the CA in-house, explaining that becoming a certificate authority means you can become the master of your universe by establishing internal trust. A private or in-house CA can act in ways not dissimilar to a public CA but can issue its own certificates, trusted only by internal users and systems. Using a public certificate authority can be expensive at scale, particularly for enterprises with millions (or even billions) of certificates. An enterprise CA, by contrast, can be an important cost-saving measure. It adds a granular level of control in certificate issuing, such as naming conventions and the overall lifecycle. You can effectively have as many CAs as you can afford to maintain as well as the ability to separate them by use case and environment. Further, having the ability to control access to data and to verify the identities of people, systems, and devices in-house helps avoid cybersecurity challenges such as the recent SolarWinds supply chain attack. Matthew notes that information within a PKI is potentially insecure “as the information gets disclosed to the internet and printed on the actual certificates which leave them vulnerable to experienced hackers.” Matthew stresses the importance of onboarding and people management and the need to ensure staff don’t buy SSL certificates externally. He also offers some thoughts for businesses considering the DIY route, discussing the advantages and limitations of open source resources such as OpenSSL and Let's Encrypt. Identity mapping and tracking are particularly important as you’re giving people, systems, and services certificates that will eventually expire.
Matthew shares the benefits of a central identity store, its core features, and how it works in tandem with the PKI. There’s also the need to know how many certificates you have in the wild at any given time. Running the revocation infrastructure for a PKI means that you're inserting yourself in the middle of every single deal, because if you’re doing it correctly everything needs to validate that the certificates are genuine. When you have a real possibility of slowing down others’ connections, you want to ensure that your supporting infrastructure is positioned in such a way that you are providing those responses as quickly as possible. Network latency becomes a very real thing. Auditability and the ability to trust a certificate authority are paramount. The service that creates and maintains a PKI should provide records of its development and usage so that an auditor or third party can evaluate it. Links from this episode: Salesforce; Wikipedia page on Public Key Infrastructure; Wikipedia page on Certificate Authorities; OpenSSL; Let’s Encrypt
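The points about tracking expiring certificates and knowing how many are "in the wild" can be illustrated with a small inventory sketch. This is an assumption-laden example, not a description of Salesforce's tooling: the `IssuedCert` record, its fields, and the 30-day window are all hypothetical, and a real deployment would read this data from the CA's issuance database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class IssuedCert:
    """Hypothetical inventory record for one certificate issued by an internal CA."""
    serial: str
    owner: str           # person, system, or service the certificate identifies
    not_after: datetime  # expiry time (timezone-aware)
    revoked: bool = False


def in_the_wild(certs, now):
    """Certificates that are neither revoked nor expired at time `now`."""
    return [c for c in certs if not c.revoked and c.not_after > now]


def expiring_within(certs, now, window):
    """Live certificates that expire within `window`, soonest first."""
    soon = [c for c in in_the_wild(certs, now) if c.not_after - now <= window]
    return sorted(soon, key=lambda c: c.not_after)


if __name__ == "__main__":
    now = datetime(2021, 3, 30, tzinfo=timezone.utc)
    inventory = [
        IssuedCert("01", "build-agent-7", now + timedelta(days=10)),
        IssuedCert("02", "payments-api", now + timedelta(days=200)),
        IssuedCert("03", "old-laptop", now + timedelta(days=5), revoked=True),
        IssuedCert("04", "retired-service", now - timedelta(days=1)),
    ]
    print(len(in_the_wild(inventory, now)), "certificates in the wild")
    for cert in expiring_within(inventory, now, timedelta(days=30)):
        print("renew soon:", cert.serial, cert.owner)
```

Tying each record to an owner is what makes the report actionable, which is where a central identity store comes in.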

Mar 30, 2021

50 min

111. Gift Cards for Small Businesses

This episode is a conversation between Heroku developer advocate Chris Castle and James Dong, developer and owner of Last Minute Gear. The business enables San Francisco residents to buy, rent, and borrow clothing and outdoor gear for activities such as camping, snow sports, and climbing. During the early days of the pandemic, the business was forced to close to comply with shelter-in-place regulations. There was an outpouring of support for small businesses, but not everyone has a Venmo account or wants to donate to a GoFundMe appeal. While many used the pandemic to catch up on Netflix and banana bread baking, James spent a day coding a website and platform where businesses could sell gift cards. It not only helped his own anxiety and insomnia but helped brick-and-mortar businesses like gyms and restaurants (and his own shop) to still earn revenue. It allowed customers to purchase gift cards to be redeemed once businesses reopened. While other platforms with this functionality already existed, James’ project included business-critical functions, such as processing payments and gift cards. James talks about his experiences of anxiety and insomnia, which acted as catalysts in making his website operational in just one day. Support from Stripe and Heroku meant there were no fees, so all money generated went to the businesses. The conversation offers interesting insights into the value of using a decision log to document ideas and milestones, as well as notes and commit messages to explain why particular decisions were made at certain points in time. It’s also a great example of what can happen when developers build projects that help others in need. Links from this episode: Last Minute Gear, James’ outdoor sports store; Gift Cards for Small Businesses

Mar 16, 2021

29 min

110. Scaling a Bernie Meme

This episode is a conversation led by Greg Nokes, a Product Manager with Salesforce; Dan Mehlman, a Director of Technical Architecture for Salesforce; Mike Rose, a Director of Technical Architecture for Salesforce; and Jack Ziesing, a Technical Architect with Salesforce. They're interviewing Nick Sawhney, a college student who saw an opportunity to make his friends laugh and built something that grew beyond his wildest dreams. At the 2021 US Inauguration, a single shot of Bernie Sanders sitting in a chair captured the hearts of many on the Internet. People everywhere were photoshopping him into the unlikeliest of places. Nick utilized his Python skills and quickly built a Heroku app that would allow users to place Bernie anywhere in the world, by adding him to any image available on Google Street View. To say the app was a success is an understatement. Inundated by tweets and distracted by press requests, Nick couldn't devote the time needed to keep the app stable and operational. He sent out a desperate tweet for help, which was picked up by none other than Dan and Mike, who recruited Jack to help Nick with his operational issues. They paired together in a number of ways, optimizing Nick's Python code, securing its authentication logic, and autoscaling dynos in order to handle the waves of traffic. All of these rapid changes allowed Nick to step back and engage with fans on where they'd like to take Bernie next. In addition to a newfound gratitude towards Heroku's team, Nick learned a few lessons from this experience. He was really humbled by the willingness of the engineering community to donate their time and knowledge to help with his issues. It's also inspired him to create videos to teach others how they can mitigate scaling issues in their architecture before they become a problem. He's also hoping to create some open source tools to monitor things like server costs and availability issues for other small projects.

Feb 18, 2021

28 min

109. Meditation for the Curious Skeptic

Chris Castle, a developer advocate at Heroku, is joined in conversation by Andrew Lenards, a 20-year programming veteran and meditation coach. Andrew believes that meditation is the practice of familiarizing one's mind with its various states. Concentration is the ability to place attention on something for as long as desired. Clarity is about identifying the sensory experiences in your body. Equanimity is about accepting the state of the world around you. In programming terms, mindfulness becomes a sort of monitoring and observability tool for our bodies. Andrew suggests that curious listeners focus on materials from secular sources. Also, the benefits of meditation may only come after quite a bit of time. The inclination of most starting practitioners is to quit before investing enough time to see the benefits. Even if you feel like you're doing it "wrong" or feeling your mind get distracted, the core tenet of the practice is to not judge yourself. This in turn will help bring about the calmness which meditation can offer. Links from this episode: "The Mind Explained"; Niksen is a methodology focusing on "doing nothing"; Mindfulness-based Stress Reduction; "The Art of Noticing"; Search Inside Yourself; “99% Invisible” podcast; Meditation information on Andrew's site: lenards.us and Afternoon Idle - 15 min guided meditation via a YouTube live stream; Headspace, Calm and Ten Percent Happier are just a few apps which can help your meditation practice

Feb 2, 2021

29 min

108. Building Community with the Wicked CoolKit

Nowadays, the internet is so huge that it can be hard for people to find others who share their niche interests. But when they do find that rare kindred spirit, it can feel like a magical moment. Lynn Fisher and design agency &yet have been exploring ways to help people build community around their passions (which can sometimes be a little “weird”). The team launched a project called “Find Your Weirdos” that incorporates different tools, sites, and techniques for helping people connect with their fellow weirdos. Their project also helps companies connect with customers through niche interests. Inspired by the Weirdos project, the &yet team envisioned ways to help Heroku developers connect — and the Wicked CoolKit was born. The kit harkens back to the earlier days of the internet, when simple, fun web widgets and tools helped people connect without all the noise of today’s mega social platforms. The initial version of the kit offers a new take on a few nostalgic web widgets, including: Developer trading cards — Echoing the retro joy of collecting baseball cards or playing card-based games, this widget allows developers to create their own profile card. They can specify their personal bio, coding skills, niche interests, “feats of strength,” and more, and share it within an elegantly designed UI. Themed stickers — A perennial favorite, stickers are a colorful way to identify interests, such as baking or woodworking. Users can download stickers to use as they wish, or add a sticker to their trading card that links to other people’s cards that have the same sticker. Webring — Years ago, fans and friends would use a webring to share a collection of websites dedicated to a specific topic. The kit brings the old school webring into the modern context and allows people to easily share and access web resources. Hit counter — Everyone wants to know how many visitors came to their site. The old-fashioned hit counter is a fun way to track and display page visits. 
The higher the number, the more likely people will want to engage with the site (and the developer behind it). The Wicked CoolKit is fully open source and available to use. Links from this episode Lynnandtonic.com — Lynn’s personal website. wickedcoolkit.com — Home of the Wicked CoolKit Show, don’t tell — the story behind the Wicked CoolKit. Find.yourweirdos.com — a series of essays on how companies connect with customers through sharing mutual niche interests. Face.camp — an app that connects to Slack for people to capture and post animated gifs. Wegotchu.cards — digital cards that people can pass around and sign.

Jan 26, 2021

26 min

I Was There: Stories of Production Incidents II

Corey Martin leads the discussion with two developers about production incidents they were personally involved in. Their goal is to inform listeners about how they discovered these issues, how they resolved them, and what they learned along the way. Ifat Ribon is a Senior Developer at LaunchPad Lab, a web and mobile application development agency headquartered in Chicago. For one of their clients, they developed an application to assist with the scheduling of janitorial services. It was built with a fairly simple Ruby on Rails backend, leveraging Sidekiq to process background jobs. As part of its feature set, the app would send text messages to let employees know their schedule for the week; these schedules were assembled by querying the database several times, fetching frequencies and availabilities of workers. Unfortunately, a client noticed a discrepancy between how many notices were being sent out and how many jobs they knew they had: of the 400 jobs total, only 150 had notifications. It turned out that all of the available database connections were being exhausted, but that was only half of the issue. Sidekiq was attempting to process far too many jobs at once, and each of these jobs was responsible for connecting to the database, exhausting the available pool. The solution Ifat settled on was to reduce the number of parallel jobs processed while increasing the number of connections to the database. From this experience, she also learned the importance of understanding how all these different systems interconnect. Christopher Ostrowski is Chief Technology Officer at Dutchie, an e-commerce platform for the cannabis industry. One Christmas Eve, while celebrating with his family, Chris began receiving pager notifications warning him about some sluggish API response times. Since it didn't really have any significant end user impact, he ignored it and went back to the festivities.
As the night went on, the warnings became significant alerts, and he pulled together a response team with colleagues to figure out what was going on. By all accounts, the website was functioning, but curiously, the rate of orders began to drop off. Through some investigation, they realized what was going on. Customers' orders were assigned random, non-sequential six-digit numbers. Dutchie was about to track its one-millionth order, a huge milestone. Before any order is created, though, the app generates a six-digit number and tries to find one that doesn't already exist. The database was constantly being hit, as fewer and fewer six-digit numbers were available for use. The solution ended up being rather simple: the order number limit was increased to nine digits. Although they had monitoring in place, the data was reported in aggregate; even though the "create order" API was slow, all of the others were fast, keeping the average within tolerable levels. Christopher's solution to avoid this in the future was to set up more groupings for "essential" API endpoints, to alert the team sooner about latency issues on core business functionality. Links from this episode: LaunchPad Lab is a web and mobile application development agency; Dutchie is an e-commerce platform for the cannabis industry
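The order-number incident above comes down to simple probability, which a short sketch can make visible. The code below is an illustration under stated assumptions, not Dutchie's implementation: the function names are hypothetical, an in-memory set stands in for the database, and each loop iteration stands in for one lookup against it.

```python
import random


def expected_attempts(used: int, space: int) -> float:
    """Mean number of random draws needed to find an unused ID.

    Each draw succeeds with probability (space - used) / space, so the
    number of draws is geometric with mean space / (space - used).
    """
    if not 0 <= used < space:
        raise ValueError("used must satisfy 0 <= used < space")
    return space / (space - used)


def new_order_number(taken: set, digits: int, rng: random.Random):
    """Draw random IDs until one is free; return (order_number, attempts)."""
    space = 10 ** digits
    if len(taken) >= space:
        raise RuntimeError("ID space exhausted")
    attempts = 0
    while True:
        attempts += 1  # in a real app, each attempt would be a database query
        candidate = rng.randrange(space)
        if candidate not in taken:
            taken.add(candidate)
            return candidate, attempts


if __name__ == "__main__":
    for used in (500_000, 900_000, 990_000, 999_000):
        six = expected_attempts(used, 10 ** 6)
        nine = expected_attempts(used, 10 ** 9)
        print(f"{used:>7} orders: ~{six:,.0f} tries with 6 digits, ~{nine:.3f} with 9")
```

The retry cost stays small for most of the ID space and then climbs steeply near exhaustion, which matches the way the slowdown appeared suddenly around a milestone. Widening the space to nine digits pushes that cliff far away; a sequential or database-generated ID would avoid the retry loop altogether, at the cost of making order numbers guessable.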

Jan 19, 2021

29 min