Honeycomb's CTO and coauthor of Database Reliability Engineering, Charity Majors, joins me on this episode of Semaphore Uncut to share her insights on observability and going beyond logs and dashboards to better understand the systems we build.
It's hard to track down problems in modern distributed systems. For events that we're able to foresee, we have to implement logging, metrics, and performance monitoring. The data ends up scattered across several services, which doesn't help when you get a call that your service is down. For unforeseen events, it's even worse, as we often have no data to reason with.
Is it possible for the system to provide us with enough information to diagnose unknown unknowns from a single origin? What's the future of measuring the quality of microservices in production? Listen to this episode, or watch it on YouTube below.
Also, connect with me and Charity on Twitter @darkofabijan @mipsytipsy @semaphoreci @honeycombio.
Watch this episode on YouTube
Edited Transcript
Darko: (00:16) Hello everyone. Welcome to Semaphore Uncut, a show where we talk about engineering topics, products and people behind those products. My name is Darko Fabijan and I'm your host today. I'm a co-founder of Semaphore. Today with us we have Charity Majors, who is joining us live. So hello Charity, nice to have you on the show.
Charity: (00:35) Hi. Thanks for having me.
Darko: (00:38) Yeah. Please go ahead and introduce yourself.
Charity: (00:40) I am the co-founder of Honeycomb, currently CTO. I was CEO for three long years until recently. We are a company with an observability product that helps you understand what's actually happening in these crazy complex systems that we keep inflicting on the universe, without having to ship new code to handle things you didn't know about in advance.
The origins of Honeycomb: an undebuggable system at Facebook
Darko: (01:09) Okay. And maybe before diving deep into the technical topics, can you tell us a bit more about Honeycomb and the product?
Charity: (01:12) Yeah. Co-founder Christine and I were both early engineers at Parse, the mobile backend as a service. Love Parse, rest in peace. We were acquired by Facebook in 2015. Around the time we got acquired, I was coming to the horrified realization that we had built a system that was basically undebuggable, by some of the best engineers in the world doing all of the right things. Yet every day people were coming to us: "Parse is down." We'd be like, "Parse is not down. Behold my wall full of dashboards. They're all great. Everything's cool, right?" Because, baby, we're doing a hundred thousand requests per second. A mobile app's traffic isn't huge. Maybe they're doing like 50 requests per second, or four. They'd never even show up in my time series graphs. So I'd have to dispatch an engineer, or go debug it myself, to figure out exactly what had gone wrong, and whether it was their fault, our fault, or a combination of the two.
Charity: (02:03) It would take a day sometimes, or more, to figure out what was actually going wrong in each case. Our productivity ground to a halt. We stopped shipping; we were just trying to understand our product. I tried everything out there. The problem with logs is that you have to know what to search for beforehand, and if it's a new problem, you don't know what to search for. The problem with metrics is that they aggregate at write time, and you can't break down by high-cardinality dimensions like, say, user ID. So it was a very manual and awful process. The first thing that helped us start to dig our way out of this pit was this tool at Facebook called Scuba, which is not a pretty tool. I would go so far as to say it's actively hostile to users.
Charity: (02:49) But Scuba did one thing really well: it let you slice and dice in basically real time, on dimensions of arbitrarily high cardinality. Cardinality meaning the number of unique elements in a set. So the highest possible cardinality will always be a unique ID. First name, last name: high cardinality. Gender is low cardinality, and species is very low, I assume. Other solutions didn't support that. We started getting our data into Scuba, and our time to understand these complex scenarios just dropped like a rock, from days to seconds, maybe a minute. It became a support problem, not even an engineering problem. That made a huge impact on me. So much so that, when I was leaving Facebook, planning to go be an engineering manager at Stripe or Slack, I suddenly realized: "Oh shit. I don't even know how to engineer anymore without this stuff we've written."
Charity: (03:31) We'd built around Scuba, because it's not just about incident response; it's like my five senses. It's how I decide what to build: by instrumenting something, looking at the impact, what it's going to affect, and then I write it. I'm in this constant conversation with my code: is it doing what I thought it would do? Is it behaving as I expected? Does anything else look weird? The idea of going back to metrics and logs was unthinkable, like using ed instead of an actual editor. But at the time we thought that this was a platform problem. Christine and I started working on this, and for a year we really thought it was a platform problem, because platforms have this characteristic where it's one of many thousands of apps to me. But to you, the customer who wrote a big check expecting our solution to solve your needs, it's your world. It's everything.
Charity: (04:56) Some of this is self-induced, but whether it's containers or schedulers, or polyglot persistence, or the proliferation of mobile devices, all of these things are high-cardinality problems, and everybody needs a different solution. According to the control theory definition, observability is just the ability to understand what's going on in the inner workings of the system by observing it from the outside. Not by knowing in advance and writing custom code to handle it. Not by any of these things that work for known unknowns. It's really about being able to ask any question of your systems without having to ship custom code to handle it.
Charity: (05:43) This was a mind-blowing thing to me, because it really spoke to the shift from known unknowns (which we'd had in the days of the LAMP stack) to unknown unknowns (which we have with the distributed systems of today). The problems we have to deal with are like this infinitely long, thin tail of things that almost never happen, except one time they do. And it's not a good use of our time and effort to invest in a dashboard that will help us find that problem immediately the next time, or to monitor and check for it. We're handling all these things as one-offs, like there's some end in sight, and there's just not. So that was the original insight that led to Honeycomb, and also to us taking a pretty aggressive stance that this was something different. And that observability is something that the industry needs to know and respect as a technical term, not just as a generic synonym for telemetry.
Does your job end when you push to master?
Charity: (06:35) I think of observability as a technical term because you can look at a tool and just say, "Does this give me observability or does it not?" And if it does pre-aggregation, it doesn't give you observability, because you've gathered your data in a way that prohibits you from asking a new question. Same with indexes. You need to be able to do read-time aggregation of the raw data in order to have that flexibility. So anybody who's not offering that is not doing observability. And I think the reason it's taken off is that so many people have seen themselves, and their problems, reflected in this distinction.
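To make that distinction concrete, here is a minimal sketch in Go (not Honeycomb's actual engine; all names and values are illustrative) of what read-time aggregation over raw events buys you: any field, including a high-cardinality one like user_id, can become a group-by key at query time, whereas a pre-aggregated counter has already discarded it.

```go
package main

import "fmt"

// Event is one raw, wide, structured event; nothing is aggregated away.
type Event map[string]any

// groupCount aggregates at read time: the group-by key is chosen at query
// time, so even high-cardinality fields like user_id remain queryable.
func groupCount(events []Event, field string) map[any]int {
	counts := make(map[any]int)
	for _, e := range events {
		counts[e[field]]++
	}
	return counts
}

func main() {
	events := []Event{
		{"user_id": "u-42", "endpoint": "/push", "status": 500},
		{"user_id": "u-42", "endpoint": "/push", "status": 500},
		{"user_id": "u-7", "endpoint": "/query", "status": 200},
	}
	// A write-time counter of requests by status could never answer
	// "which user is failing?"; user_id was discarded when it was written.
	fmt.Println(groupCount(events, "user_id")) // e.g. map[u-42:2 u-7:1]
	fmt.Println(groupCount(events, "status"))  // e.g. map[500:2 200:1]
}
```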
Darko: (07:05) Yeah. It's an interesting journey, and definitely in the area of scratching your own itch.
Charity: (07:10) Oh God, yes.
Darko: (07:11) That's when you get really motivated.
Charity: (07:16) Yeah. The thing is, this comes at the right time, I think. It's just in the past three years that I feel like we really arrived at a consensus that software engineers need to be on call for their own systems. This was not an accepted answer three years ago, but we've learned as an industry that this is the way to build systems and support them in a way that scales, in a way that is not miserable for the humans who have to tend them. And so the person who has the original intent for what they're trying to build, in their head, goes and watches it all the way out to where the code is interacting in real time with users. You're the only person who really knows what you're expecting to see.
Charity: (07:58) You have to take it all that way. You can't just lob it over the wall. You can't just say, "My job is done when I've merged to master." The ops team doesn't have your original intent. You don't necessarily have their skill sets. So I feel like this is kind of the second coming of DevOps, in a way. The first wave of DevOps was all about "Ops people must learn to write code." Like, "Yeah, absolutely. Message received." And we do now. But the second wave of DevOps is very much about, "Okay, software engineers, it's your turn. It's time to learn to write operable services, and it's time to learn to run them." I'm not saying that all ops roles are going to dissolve and go away, but it's increasingly becoming an internal consulting area of expertise, where we're here to help you as software engineers run your own services using our expertise, not to do it for you.
Charity: (08:52) Because in that direction lies misery and pages, waking up every night. A lot of people are really afraid that that's what being on call means, that that's what I'm asking them to do. I want to be clear that it's not. I'm over 30; I don't want to get woken up in the middle of the night either. But the thing is that we can make it so that no one has to get woken up, if that person with the original intent is babysitting it all the way to the end. If we just raise our standards for what we accept, in terms of the abuse that we're willing to sign up for as engineers.
Positioning observability between logs, metrics and APM
Darko: (09:19) Great. Let me try to restate some of the things that you've shared, for myself and hopefully for some of our viewers and listeners too. The problem that we have with metrics and logs is that we must decide to implement them. We must benchmark certain parts of our code and decide, "Okay, this was not benchmarked, let's introduce these metrics. Then let's put it on some dashboard somewhere and wait for enough data to arrive." With logs, it's a very similar process. "We have a bug; what's the best thing to do when something is really complicated? Let's add a couple of lines of logs, and wait for the next occurrence." We would want to get away from that problem. Because, as you said, we cannot figure everything out in advance.
Charity: (10:05) It's fundamentally reactive.
Darko: (10:06) Exactly.
Charity: (10:06) You're always reacting to something.
Darko: (10:09) Okay. To solve this challenge in practical terms, there are some frameworks and tools like Istio. That's maybe the only one that I know. Apart from tools like New Relic, where you add some library into your application and they gather everything over time. So in that area where New Relic sits, what are some other options?
Charity: (10:39) New Relic is an APM, Application Performance Monitoring. And Istio is a service mesh. And then there's tracing. Tracing is incredibly important if you're using microservices, because ordering is so important. So there are two things here. First of all, I see observability as sitting right smack in the middle of monitoring and metrics, logs, and APM. Honestly, I believe in the next couple of years you're going to see all three of those categories go away, because they were all premature optimizations. Hardware was very expensive, so they had to optimize for something up front, when the data was being written.
Charity: (11:11) What you want is to not have to write that data out to three different places, because then you, as a human, are sitting there in the middle, copy/pasting IDs from tool to tool, trying to track down a single problem. That's just nuts. It's expensive. It's unwieldy. It relies on humans. You want there to be one source of truth, and you want to be able to go from a very high level, like the dashboards monitoring has, to a very low level, like the logs, without jumping between tools.
Charity: (11:35) So I think that observability is ultimately going to make all of those categories disappear, or become one. With APM, you're absolutely right that tools of the future will have to come from your code. You're going to need to install a library or something, and you're going to need to do some amount of manual effort, not zero. Because magic is never going to give you insights into your code. You know your code. I don't know your code. I can do a lot of guessing, and that's going to get you a long way. It's going to give you your great top-ten graphs, which is what New Relic gets you, right? Those beautiful top-ten graphs. But then you hit a wall. You're like, okay, cool, I care about this graph, but for this user? You can't do it. Right?
A new way of capturing runtime data in the age of microservices
Charity: (12:11) So the Honeycomb way, and I think that this is becoming the industry-standard way, which I'm stoked about, is: when the request enters a service, we initialize an empty, arbitrarily wide row of structured data. And then we pre-populate it with everything that we know about that request, or can infer from the environment, from the language internals, and the request parameters that were handed in; everything that we know.
Charity: (12:34) Then, throughout the life of that request in that service, you, as the developer, can basically do a printf() of anything that you know is going to be interesting: shopping cart IDs, user IDs, anything where you're like, "This is going to be useful to me for debugging in the future." You just stash it into that blob. And then, at the end, when it's ready to exit or error, it ships off to Honeycomb as one single, very wide (usually hundreds of dimensions) structured data blob. And if you have, like, 12 microservices, you're going to have one of those blobs for the edge, one per service, and maybe one for each database call.
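For a sense of what that pattern looks like in code, here is a minimal sketch in Go using Honeycomb's open-source libhoney-go client. The endpoint, field names, and config values are illustrative, not prescribed; real instrumentation would usually live in shared middleware.

```go
package main

import (
	"net/http"
	"time"

	libhoney "github.com/honeycombio/libhoney-go"
)

func main() {
	// Placeholder credentials; use your own write key and dataset.
	libhoney.Init(libhoney.Config{WriteKey: "YOUR_KEY", Dataset: "my-service"})
	defer libhoney.Close()

	http.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		ev := libhoney.NewEvent() // one empty, arbitrarily wide row per request

		// Pre-populate with everything known at request entry.
		ev.AddField("endpoint", r.URL.Path)
		ev.AddField("method", r.Method)
		ev.AddField("remote_addr", r.RemoteAddr)

		// Throughout the request, stash anything useful for future debugging.
		ev.AddField("user_id", r.Header.Get("X-User-ID"))
		ev.AddField("cart_id", r.URL.Query().Get("cart"))

		// ... actual request handling would happen here ...
		w.WriteHeader(http.StatusOK)

		// On exit (or error), ship the whole blob as one wide event.
		ev.AddField("duration_ms", float64(time.Since(start).Milliseconds()))
		ev.Send()
	})
	http.ListenAndServe(":8080", nil)
}
```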
Charity: (13:07) That gives you a really powerful amount of context. So when you're debugging these systems, it turns out that the hardest part is almost never debugging the code. It's figuring out which part of the system the code that you need to debug lives in. And if you have this rich context for the entire path of your request, it allows you to zero in and pinpoint it immediately. Say, like, which five things have to go wrong in order for this spike of errors to happen, right? You've got all the data packaged in the right way for you to get that really rapid wisdom out of it.
Charity: (13:37) And it turns out that, since tracing is so important, well, traces are just events with some ordering, right? So you basically get that for free. If you're using the Honeycomb library, you get all of the span IDs and everything emitted, so you just switch visualizations. You're slicing and dicing, trying to isolate an error. Oh, I found it! Cool. Let me trace it. Oh, there's a problem in the trace. Okay, now let me zoom out and see who else is impacted by this.
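That "events with some ordering" idea is easy to see in a sketch. The trace.* field names below follow Honeycomb's documented manual-tracing conventions; everything else here is made up for illustration:

```go
package main

import "fmt"

var nextID int

func newID() string { nextID++; return fmt.Sprintf("id-%d", nextID) }

// spanFields shows that a trace span is just a wide event plus a few
// ordering fields layered on top of the usual debugging dimensions.
func spanFields(traceID, parentID, name string, durationMs float64) map[string]any {
	return map[string]any{
		"trace.trace_id":  traceID,  // shared by every span in one request
		"trace.span_id":   newID(),  // unique per unit of work
		"trace.parent_id": parentID, // orders the spans into a tree
		"name":            name,
		"duration_ms":     durationMs,
	}
}

func main() {
	traceID := newID()
	root := spanFields(traceID, "", "http_request", 120.0)
	child := spanFields(traceID, root["trace.span_id"].(string), "db_call", 35.0)
	fmt.Println(root)
	fmt.Println(child)
}
```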
Instead of opening five tabs
Charity: (14:01) So you've gotten away from that thing where you're storing it in four different places, and the human is hopping between tools. When there's just one tool, it just gives you observability, and tracing is included. But it really does start with that library that you build into your code, which gives you the insights from the inside out. You've got the software explaining itself back out to you, the developer.
Charity: (14:20) And then, once you've found where in the system the problem is, then you can go debug it, with something like GDB. Stepping through functions is out of scope for this kind of tool. Way out of scope. But it tells you where the problem is happening, and you have all of the context of the request at that point, so you can feed that into your local debugger and find the actual problem.
Darko: (14:39) Okay. So when you said the request is coming in, for example "Give me the sign-in page", you have something at the level of a process running in whatever programming language, and those two talk together?
Charity: (15:02) Yeah. It's just a library in your code, right? We provide all the helpers. And other people have done this without using Honeycomb. They've implemented the same thing, where they initialize an empty data blob at the beginning. They pre-populate it. Then they stuff things in throughout the life of the request in that service, and then they fire it off. This is just the best way we've discovered, as an industry, to find where in your system the problems are, full stop.
Darko: (15:22) Okay. Yeah. Sounds very powerful. I mean, what you said, I can totally relate to that. There are five tools. There is a PagerDuty call coming in, you open five tabs…
Charity: (15:34) And you have to pay to store it so many times! It's not a good use of money, either. I believe that observability should be a dollar well spent. I think it should generally be, like, 10 to 30 percent of your infra costs. You should spend that much on observability, but not on every single tool. In total, right? So you really want something that can bundle up as many functions as possible. Right now, you've got all of these people who are charging you like they're your only tool. But, in fact, you need all these different tools. It's kind of painful. But I believe that the industry is headed in the right direction.
Every developer is now a service reliability engineer
Darko: (16:07) I can share a war story. The first version of Semaphore was a single Rails application. At the end, it was close to a hundred thousand lines of code, using lots of memory, and all that. When we were creating the second version, we used Elixir as our main language, and we have, like, 20 services running. We were getting close to launching, and we used Kubernetes in production for the first time. We delayed our launch by maybe a month and a half, at least, until we installed and learned to use Istio in our Kubernetes cluster.
Charity: (16:45) Yeah, yeah. Yep.
Darko: (16:46) It's probably possible to use Kubernetes without Istio, but I would rather not.
Charity: (16:52) Yeah. Agree.
Darko: (16:54) There's another thing that you mentioned that I wanted to ask about. For a monolithic application, the line is relatively sharp between when it's working and when it's not. For instance, it's not booting at all, or the queue is full.
Charity: (17:12) Yes, yes.
Darko: (17:13) Who's going to tackle that incident? And you have that huge code base with all the features, any of which can cause a problem.
Charity: (17:22) Right.
Darko: (17:22) In our case it was clear who was on call; there was a group of people, and then there were other groups of people who were just not on call. Another thing that was not surprising: a developer develops a new service, and in the end it's just an operating system process. And when it's time to ship it, engineers pretty much have no clue. Does it require four vCPUs, or eight? Does our application need 16 GB of RAM, or 4? Like, no clue.
Charity: (17:53) Yes! Yeah.
Darko: (17:56) Now with Kubernetes and containers, you pretty much have to reserve your capacity.
Charity: (18:03) Yes, you do.
Darko: (18:04) At least in my view of the world right now, that's the main influence: every developer is now a system reliability engineer.
Charity: (18:15) The abstractions have gotten very leaky, right? Now you have to care about those things. You have to think about them, or you're just going to be screwed. Absolutely agree.
Charity: (18:23) I think that part of the reason it's taken us this long to agree that software engineers should be on call is that, in the past, we've asked them to be on call, and then we've given them ops tools to debug their code with. Ops tools speak the language of free memory and uptime. Translating that to the world of variables and endpoints takes work; it's a different language. And you were basically asking them to do two jobs, right? Do your job! Also, learn this other job and do it at the same time.
Charity: (18:53) Some exceptional engineers did it, and do it well. Most engineers, and I don't blame them, were just like, "Hell, no." Right? Which is why Honeycomb is very much designed to speak to engineers in the language of variables and endpoints, the things that they spend every day thinking about. But it's definitely true that, like I was saying, DevOps is kind of saying to software engineers: ops is now part of your job. And I would argue that that's a good thing, because ops has always been the engineering role most aligned with user happiness. It can be very easy for software engineers to construct an ivory tower, where they don't feel the pain or the consequences of what they've shipped. That tower's being torn down. It's going away. And I think this is, overall, a very good thing. But there's definitely some pain in the meantime.
Charity: (20:05) You mentioned developers on call, and the lines between roles. This is a very hard problem. If you've got a monolith, and you've got 20 developers, you can't have a rotation with 20 people on call. That rotation is so long everyone's on call like twice a year. They're going to forget everything in between those times, right?
Charity: (20:21) There is a case study that I found last night, of how a team took a monolith and three teams of software engineers, with three SREs supporting them, and divided up the types of alerts: okay, I own these, you own these. And it being a monolith, there was kind of no way to protect each other; you're all going to get the top-level alerts, the app isn't performing well. But I'm going to take the Elasticsearch ones. You're going to take the MongoDB ones. And that kind of works, because you've got three people on call at any given time.
Defining on-call duty for microservices
Charity: (20:56) But most people are starting to look at the shift to microservices now. Microservices give you more tools for doing on-call differently. If you've done it correctly, you'll have a service in front of every data store. Data stores are the number one cause of infection seeping through the layers, because if a data store goes down, everything starts queuing up, waiting for that data store, and everybody's getting paged, right? Which is why you have to take that and put it in a service that's a level up, and make it so only the people who are responsible for that data store get paged. Then you can start to separate out who's responsible for the app and who's responsible for the data store. And if you've done it Uber-style, and you have a shit ton of tiny little services, you can start to group up. Like, okay, this team is going to own these four or five services.
Charity: (21:39) I feel like one to two services per team member is the absolute max. And, really, that's talking, like, two or three in active development, and the rest have to be pretty stable if you're going to go beyond that. It's definitely possible to take the philosophy a little too far. But I think, all in all, it's the right direction for us to be taking steps in: learning how to isolate these services cleanly from each other, so that we can craft on-call policies that only impact the people responsible. Because the key to designing an on-call rotation that is effective and doesn't burn people out is making sure that every single alert you get is actionable, something you can fix and make never happen again. Right? Because every time you get paged, you should be going, "Huh. This is new. I don't understand this." It's the death of on-call if you're like, "Oh, that again. Oh, that again." That will kill your team. You have to pay that down. If it's, "Oh, that again, and I can't fix it because it's somebody else's problem," that is 10 times worse. That will burn people out like nothing.
The transition to shared responsibility
Darko: (22:42) I agree completely. Do you have any predictions about how this will play out? There are so many developers in the world that haven't been on call. "I've developed this feature, shipped it. Not something that I'm going to worry about."
Charity: (22:56) It's a cycle. I feel like there's an understandable period where people are just repelled by the idea because it's so bad for ops teams. You're the worker climbing out of that pit and talking about it and telling people, "No, a better world is possible." And it is possible. I've seen teams who never get into that pit. My teams, we consider it a crisis if someone gets woken up. We post-mortem it. We make it so it doesn't happen again. We respect their time and their sleep.
Charity: (23:23) I've also seen teams who were way deep in the pits of terribleness, and they've clawed their way out, and it's been better. Because the amazing thing is that once you get out of that hole, you have so many more cycles to think about what's best for your users. You can spend your time more efficiently. Firefighting is just lost time out of your life.
Charity: (23:43) I feel like there are three legs to this stool. There are ops teams: we have to stop being gatekeepers. We have to stop blocking people. We have to stop building a glass castle. We have to start building a playground, which is why I say test in prod. We have to get used to being up to our elbows in it. Every engineer who is shipping to prod should be looking at prod every single day, so they know what normal feels like, and they know what wrong feels like, and they know how to debug it and how to get to a known good state. That's the bar of operational skill that every developer who's shipping to prod should have. Everyone should know how to debug, how to get to a known good state, how to deploy.
Charity: (24:17) Ops people need to stop being gatekeepers, and we need to start inviting people in. We need to start sharing our knowledge and educating, and stop seeing ourselves as the people who do things, and start seeing ourselves as the people who empower people to do things.
Charity: (24:37) Software engineers need to be willing to be hurt again. Take a risk on love, right? I know you've been hurt before, but I swear to you, you'll get hooked on it. The dopamine hit of, "Oh, I found it. I fixed it. I made it better for that user," and you're seeing the impact of your work, that is addictive.
Charity: (24:55) What I've seen is that once people have experienced that level of control and power and empathy with their users, they find it very hard to go back. They don't want to go back to a place where they're insulated from it ever again, because it's so much more visceral and real, and they can see the impact of what they're doing. That's very motivating to every engineer that I know of. Some are so scarred, like, "I've been woken up so many times, never again. Just, please." But you have to be willing to try again. It's up to you, too. We need you, and the original intent in your head, to help us dig ourselves out of this pit.
Charity: (25:25) The third part is management. There's no on-call situation that will ever work if management is not carving out enough project development time, like continuous development time, for things to actually get fixed. I know that interferes with product shipping cycles in the short run. You just have to get aggressive about it. You have to shield your team. You have to carve out that time. Let them dig themselves out of that hole, so that you'll have so many more cycles freed up to spend on product, and so many fewer cycles going down the toilet debugging problems in prod.
Charity: (25:57) This is the job for line managers. It is not reasonable for you to expect your team to be on call if you are not carving out the time for them to fix their shit. Then you're just asking them to go to the salt mines every day. If any engineers are working under those conditions, I would encourage them to quit their jobs and go somewhere where they do have air cover. It takes all three, but it's doable, and it's a better world.
Darko: (26:18) You presented this very nicely, and I agree with you that it will bring a better future.
Service Level Objectives: agreed metrics of quality
Darko: (26:51) As the last thing I wanted to ask, what are some of the features that you are planning to ship in Honeycomb that you are most excited about?
Charity: (26:58) Oh, boy. Yes. Two things. We're shipping a tool for SLOs as a beta right now. We're always talking about being able to go from the big picture, what's happening at a high level, into the weeds to see individual requests. A lot of people get stuck on the question of how much is too much. Or, like, management is very concerned about this one user, and engineers are like, "It doesn't matter." You need to have some common language there, where you all agree this is what matters, and below this line engineers are responsible for delivering, and above this line managers are responsible for making sure that it's the right line. That is what SLOs are.
Charity: (27:31) SLOs are service level objectives: a few service level indicators where you all agree this is the quality of service that you commit to providing for your users. As long as you are hitting that line, anything that you do in engineering is fine. Solve it as you want. This is how we create that crisp level of abstraction, so that everyone gets what they need, and nobody feels micromanaged, and nobody feels completely abandoned. You agree on this number, and then engineering can go and build it however they need to.
Charity: (27:57) SLOs sound deceptively simple. They are not. It quickly devolves into arguments about what "good" actually means, and over what window of time. So we're building this into Honeycomb so that you can just pick: any query that you run, you can say, "This is an SLI. This is something I care about." And then out of your SLIs, we will compute your SLO. So you can see if you're hitting it or not, or if you're on track to run out of your budget before your time is up.
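The arithmetic behind that is worth seeing once. Here is a minimal sketch (the objective, window, and counts are made up, and this is not Honeycomb's implementation): an SLI classifies each event as good or bad, the SLO is a target fraction over a window, and the error budget is how many bad events you can still afford.

```go
package main

import "fmt"

// sloStatus computes SLO compliance and remaining error budget from raw
// event counts over a window. All numbers here are illustrative.
func sloStatus(good, total int, objective float64) (compliance, budgetLeft float64) {
	compliance = float64(good) / float64(total)
	allowedBad := (1 - objective) * float64(total) // the whole error budget
	usedBad := float64(total - good)
	budgetLeft = 1 - usedBad/allowedBad // fraction of the budget remaining
	return
}

func main() {
	// Hypothetical SLI: "request returned non-5xx in under 300 ms." Say
	// 997,200 of 1,000,000 events in the 30-day window were good, against
	// a 99.9% objective.
	compliance, budgetLeft := sloStatus(997200, 1000000, 0.999)
	fmt.Printf("compliance: %.4f%%\n", compliance*100)
	fmt.Printf("error budget left: %.0f%%\n", budgetLeft*100)
	// Output shows the budget is overspent (negative): per the conversation
	// above, that is the signal to pause features and fix reliability.
}
```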
Charity: (28:35) This is the most powerful tool that you could have in your arsenal for allocating time correctly. When I was saying that managers have to carve out the time for their engineers to fix things: how much? How good is good enough? Engineers are always going to want to spend time refactoring, making things better, more elegant. How much is enough? That's where SLOs come in.
Charity: (28:53) The SLO is the number that you have agreed upon. So if the quality of service has been brutally bumpy for the past 30 days and you're running out of budget, that's what your manager uses as the hammer: "Okay. Sorry, product development. We're about to stop. My team needs time to fix things." When things are going pretty well and engineers are agitating because they really want to do this thing that isn't directly tied to product development, that's when managers can go, "You're going to have to wait for that, team, because we're actually meeting our objectives, and it's time for us to make progress on the roadmap." This is the only way that I know of to make this relationship not fraught and painful. You agree on the number. You ship it. You build to it. You're done.
Charity: (29:32) Nobody in the industry has actually done this well yet. Liz Fong-Jones just joined our team a month or two ago from Google, and she had SLOs there. So she has been leading the product development process for us. We're building an SLO product that I would be proud to run myself in production.
You can throw data away
Charity: (29:54) The second thing that we're building, which I'll say a little bit less about: there's this myth in this industry that you cannot throw away data. The log vendors are like, "Keep every log line," and the metrics vendors are like, "We keep every metric." Bullshit. Either you're throwing away data at ingestion time by aggregating, or you're throwing away data afterwards by sampling. There is no other way. No company in this world is going to pay for an observability stack that is as large as or larger than their production stack. It's just not going to happen.
Charity: (30:22) We've kind of lost the muscles and the language, because our vendors have been monopolizing this conversation, and it's mostly been the pre-aggregation types plus some logging vendors. We want to reintroduce sampling. We want to do it in a way that helps elevate the level of discourse and speaks to people like they're engineers. This is not outside of your scope to comprehend; sampling matters.
Charity: (30:44) We don't have the libraries. We don't have the language. We obviously are a tool where, if you don't sample at some level, it's going to be absolutely unaffordable. But that's fine, because what percentage of the 200s to your domain do you actually care about? Almost none. You care about trends. You care about errors and outliers. All of this can be done incredibly cost-effectively with intelligent sampling that weights common things like 200s as less important than things like 500s. So this is not a hard product to develop from an engineering perspective. It's a difficult thing to develop from a language and marketing and educational perspective. It's three years in the making, so I'm really pretty excited about it.
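Here is a minimal sketch of that kind of weighted sampling (the rates and traffic mix are made up): errors are always kept, boring 200s are kept one in a hundred, and every kept event carries its sample rate so counts can be re-weighted back up at query time.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleRate returns how many similar events one kept event stands for.
// The rates are illustrative: keep every error, keep 1 in 100 healthy 200s.
func sampleRate(status int) int {
	if status >= 500 {
		return 1 // errors and outliers: always keep
	}
	return 100 // common, boring successes: mostly drop
}

// shouldKeep decides whether to ship this event, and with what weight.
func shouldKeep(status int) (keep bool, rate int) {
	rate = sampleRate(status)
	return rand.Intn(rate) == 0, rate
}

func main() {
	stored, represented := 0, 0
	for i := 0; i < 100000; i++ {
		status := 200
		if i%200 == 0 { // roughly 0.5% of requests fail, purely for illustration
			status = 500
		}
		if keep, rate := shouldKeep(status); keep {
			stored++
			// Each kept event records its sample rate, so the backend can
			// multiply counts back up and still report accurate totals.
			represented += rate
		}
	}
	fmt.Printf("stored %d events representing ~%d requests\n", stored, represented)
}
```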
Darko: (31:27) Okay. So what's the ETA?
Charity: (31:33) SLOs are being beta-tested right now by some customers, and if anyone wants to try it, they should hit us up. The sampling stuff, maybe a month. We're a startup. We move pretty fast when we decide to do something.
Darko: (31:51) Great. Sounds all very exciting.
Charity: (31:54) Thanks for having me on. This has been really fun.
Darko: (31:56) Yeah, for me, also. I learned a lot, and it was great to hear your thinking process and how you see infrastructure ops and where it's going. It looks very promising to me.
Charity: (32:11) Thanks.
Darko: (32:11) Thank you very much. See you!
Charity: (32:13) Bye!