The PerimeterX R&D team works every day to ensure that each of our products is functioning around the clock to safeguard your digital business. PerimeterX Senior VP of R&D Amir Shaked joins us to give a behind-the-scenes look at how our internal culture helps bring our suite of solutions to life and keeps them evolving. Listen to the full podcast episode on your favorite streaming platform here.
Great to have you, Amir. Let’s talk about how the development team at PerimeterX functions, and how they tackle obstacles. First, let’s address the framework you use for this process. What do you mean by a culture of learning, or a learning culture?
Amir: When I say a “learning culture,” I mean instilling a healthy culture of debriefs and positive discussions that bring understanding out of difficult situations and ultimately minimize internal backfires.
Behind the scenes here at PerimeterX, we have a large-scale, cloud-based microservices environment, with over 15,000 cores and 300 microservices in total. And like any production environment, we encounter challenges all the time. This is the essence of chaos engineering, and while a lot can be said about the technical aspects of randomly breaking things to find gaps, reality can always surprise you.
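To make the chaos-engineering idea concrete, here is a minimal sketch of its core move: randomly terminating a replica and then checking whether the system still has capacity. The service names, replica counts, and health rule are all hypothetical, not PerimeterX's actual tooling; a real setup would act on an orchestrator such as Kubernetes rather than an in-memory dictionary.

```python
import random

# Hypothetical service registry: service name -> number of running replicas.
# A real chaos experiment would query the orchestrator instead.
SERVICES = {
    "enforcer": 3,
    "detector": 5,
    "reporter": 2,
}

def kill_random_replica(services, rng=random):
    """Simulate a chaos experiment: terminate one replica of a random service."""
    name = rng.choice(sorted(services))
    services[name] -= 1
    return name

def is_healthy(services, min_replicas=1):
    """The experiment 'passes' if every service still has enough replicas."""
    return all(count >= min_replicas for count in services.values())

victim = kill_random_replica(SERVICES)
print(f"killed one replica of {victim}; healthy={is_healthy(SERVICES)}")
```

The point of the exercise is the second function: you only learn something when you verify, after the deliberate failure, that the system degraded gracefully instead of falling over.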
Every production system I’ve worked on has experienced issues — sometimes due to code changes, other times due to third-party providers having their entire infrastructure crash. These struggles are normal, and they lead us to constantly seek ways to learn and improve how we do things, how we protect the system, and how we provide the best service to our customers.
How would you describe the issues you’ve witnessed with production systems, and how you’ve learned from them? Let’s take it from the beginning of your time here at PerimeterX.
Amir: When I joined the company, I set a goal to enable rapid deployment and rapid changes, so we could provide the most adequate and up-to-date solution. We want to be able to adapt quickly; oftentimes a good DevOps culture is the differentiator and a competitive edge for a company. Our team aims for zero downtime and for any error or mistake to happen only once. Once means we had a chance to learn; twice means we didn't learn the first time around.
Downtime and repeating errors are critical factors, but you also need to look ahead at how you'll scale 10x or 100x. What is a minor risk today could be catastrophic in the future, and in a fast-growing company that future can be next week. It can blindside you.
If you're not planning ahead, both in your engineering and in your company culture, you won't be ready. This was the root cause of the initiative and why we wanted to build a better learning culture.
What’s the solution that you’ve proposed for these issues?
Amir: With that starting point and destination in mind, we set off to establish a new process within the team of how we analyze every kind of failure and what we do during the analysis. We conduct a debrief and then follow up.
When you have an outage or problem, set aside time to analyze it and decide how you want to address it. In theory that's simple; in practice it usually isn't. When an issue hits, emotions are rattled and spirits are sometimes down, and there is a lot of fine print that differs between types of issues. So when you have an incident, make time to analyze what happened and focus on a specific process.
Why focus on the process? Because a bad system will beat a good person every time. And assuming you have the right foundation of engineers, if you fix the process, it will lead to a great resolution.
Can you give me an example of such an incident?
Amir: Let's start with an example which I'm sure anyone who owns a production environment has experienced. Then I’ll move on to how it relates to the process.
You have an incident. A customer is complaining about something misbehaving in their environment, and they think it might be related to you, so they're calling support.
Support is trying to analyze and understand, and after a while they realize they don't know what to do with the issue. They'll page the engineering team, as they should.
The engineering team wakes up because it's the middle of the night and they're in another time zone. They work to analyze what's going on, they find a problem, they fix it. Then they go back to sleep and move on to other tasks the next day.
If you end the process there, you are certain to experience a similar issue again from a similar root cause. So you should ask yourself: why did this issue happen and what can we do better? What can we do to avoid seeing this issue or any similar cases in the future?
In one specific case, code was deployed into production by mistake. But how? An engineer merged the code into the main branch. It failed some tests, and it was late, so he decided to leave it as is and keep working on it the next day, believing it wouldn't be deployed.
What he didn't know was that a DevOps engineer had added a process earlier that week which automatically deployed that code to production whenever that specific microservice needed to auto-scale. That night there was a significant usage increase, which spun up more instances running the buggy code.
Now here lies the issue. You can focus a lot on why the buggy code was merged into main, or why the auto-scaling was added. But if you focus too much on why he did it or what he was thinking, you'll miss the entire issue: "Wait a minute, the process is flawed. How could he merge code into production without understanding that it was going to be deployed?" There is meaning behind specific repositories and the specific ways you manage code.
So you fix the process instead. In simple cases like this, the fix is to align all engineers on one rule: the main branch equals deployment to production, always, for any service. The way they approach merging branches to main will change drastically.
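The "main equals production" rule can be sketched in a few lines. This is an illustrative model of the decision logic, not PerimeterX's actual pipeline; the function names and the `tests_passed` flag are assumptions made for the example.

```python
# Hypothetical deploy-decision logic under the "main == production" rule.
def should_deploy(branch: str, tests_passed: bool) -> bool:
    """Anything on main ships, regardless of test status or anyone's intent.

    Note that `tests_passed` is deliberately ignored here: once code is on
    main, auto-scaling can deploy it at any time. That is the failure mode
    from the story above.
    """
    return branch == "main"

def can_merge_to_main(tests_passed: bool) -> bool:
    """The only safe gate is therefore *before* the merge."""
    return tests_passed

# Code merged to main with failing tests is deployable...
assert should_deploy("main", tests_passed=False) is True
# ...so the process fix is to block that merge in the first place.
assert can_merge_to_main(tests_passed=False) is False
```

The design point is that the gate moves from deploy time to merge time: once everyone knows main always ships, a failing test on a branch simply blocks the merge, and the late-night "I'll fix it tomorrow" scenario can't reach production.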
Fix problems by addressing the process. Don't over-judge what a specific employee did or didn't do. They were just doing their job, and tomorrow it could happen to a different employee.
I like how you’re simultaneously preserving accountability while also creating a better system for people to iterate on your processes. So how do you allow employees to learn from this type of incident?
Amir: There are 4 steps:
- The incident: As the organization matures, the team will open incidents even for seemingly minor things, just to follow up and learn from them.
- The resolution: You provide an immediate resolution to the issue.
- The follow-up: 24 to 72 hours afterwards, depending on how much time the team has to sleep, you do a debrief.
- The checkpoint: Two weeks after the debrief, we review the action items that came out of it and make sure they were incorporated, especially the immediate tasks.
Let's talk about conducting a debrief. This isn't a standard retrospective, as it usually follows an incident that may have been severe in impact. When do you debrief?
Amir: You debrief every time you think the process and/or system didn’t perform well enough.
Ask a lot of questions. The question I ask first is: what happened? Let's create a detailed timeline of events from the moment it really started. Not the moment somebody complained, or when someone raised the alarm, but from the moment the issue started to roll into production. This could be when someone merged the code, when someone changed the query, or when the third-party provider we were using started to crash and updated their status page.
It’s especially important to ask, “What is the impact?” to foster a learning environment. This helps convey a message to the entire engineering team to prioritize different aspects of the issue. Get as full of a scope as you can. Understand the big picture and that everything is connected. That is vital to help everyone understand why we're delving into the problem, and why it is so important.
Now, after you have the story and the facts, you start to analyze and brainstorm how to handle that better in the future.
The first two questions you need to ask are:
- Did we identify the issue in under a certain amount of time? Let's say it's five minutes. Why five minutes? It's not arbitrary; we want a specific goal for how fast we do things.
- Did we completely fix the problem in under an hour?
So, did we identify the issue in under five minutes? Sometimes we did, sometimes we didn't. Did we fix it in under an hour? In under 10 minutes? Did we need to do anything at all, or was it resolved automatically with no need for analysis? Once you answer no to either of those two questions, the follow-up should be: "Okay, we understand the full picture. What do we need to do? What do we need to change? What do we need to develop so that we can answer yes to both questions?"
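Those two questions reduce to comparing timestamps against targets. Here is a minimal sketch; the timestamps are invented, and the five-minute and one-hour defaults are the example targets from the questions above. Note the clock starts when the issue begins rolling into production, not when someone complains.

```python
from datetime import datetime

# Hypothetical incident timestamps.
started  = datetime(2021, 6, 1, 2, 0)   # bad change starts rolling into production
detected = datetime(2021, 6, 1, 2, 4)   # first alert fires
resolved = datetime(2021, 6, 1, 2, 50)  # fix deployed and verified

def meets_targets(started, detected, resolved,
                  detect_minutes=5, resolve_minutes=60):
    """Answer the two debrief questions with a yes/no per target."""
    detection_minutes  = (detected - started).total_seconds() / 60
    resolution_minutes = (resolved - started).total_seconds() / 60
    return {
        "detected_in_time": detection_minutes <= detect_minutes,
        "resolved_in_time": resolution_minutes <= resolve_minutes,
    }

answers = meets_targets(started, detected, resolved)
```

Any "no" in the result is what feeds the follow-up discussion: what do we need to change or develop so that the next incident produces two yeses?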
This part, while seemingly simple, led to a drastic culture change over time.
This framework conveys to everyone that the focus is on the process and the system, not on any specific person. Whoever caused the incident on a given day is irrelevant.
How quickly do you see the changes reflected in your employees’ behavior?
Amir: Naturally, any culture change takes time. A lack of trust can attach to the process or to the questions themselves. When you implement changes, people may ask themselves, "Is there a hidden agenda behind this? How will it not become a blame game?" This can be completely resolved if you implement the debrief process properly and consistently.
When you're trying to understand why a problem occurred, people will often say, "He or she did something wrong," when the real issue is something else. You go through your proper processes, review everything, set action items for everybody, and think things are going to be better. Then you have the same problem all over again a few weeks or months later. How did that happen?
You then see that the action items weren't being followed up on. The resolution is very simple: establish checkpoints. After a debrief, set checkpoints every two to three weeks, or whatever timeframe works for you, to make sure the immediate action items are handled.
My personal approach: we label each JIRA ticket that comes out of a debrief, and hold a monthly review of all open debrief items to decide what is no longer relevant and what has to be prioritized.
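The monthly review boils down to filtering tickets by label and status. The ticket records below are invented for illustration; a real setup would pull them from the JIRA API by label, but the selection logic is the same.

```python
# Hypothetical ticket records standing in for a JIRA query result.
tickets = [
    {"key": "OPS-101", "labels": ["debrief"], "status": "Open"},
    {"key": "OPS-102", "labels": ["debrief"], "status": "Done"},
    {"key": "OPS-103", "labels": ["feature"], "status": "Open"},
]

def open_debrief_items(tickets, label="debrief"):
    """Items for the monthly review: born from a debrief and still open."""
    return [t["key"] for t in tickets
            if label in t["labels"] and t["status"] != "Done"]

review_queue = open_debrief_items(tickets)
```

A consistent label is what makes this cheap: one saved filter surfaces every debrief commitment that is still open, so the review meeting starts from the list rather than from memory.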
Another critical move we’ve made to resolve future issues is implementing proper communication on a wide scale. You need to make sure everybody knows there was a debrief. We publish all of our debriefs internally for all to access. It's exposed to everyone: the details of what happened and what we are going to do to make it better. This helps bridge the gap of trust and show that everything is very transparent and visible.
Consistent implementation of this process builds trust, and that trust has earned team members' buy-in on the value of the learning culture.
What are the main things you would want folks to take away from this?
Amir: There are four main takeaways from this culture of debriefs:
- Avoid blame. If you see blame starting to creep into your team, you have to intervene and stop it politely. Do so calmly, but stop it to keep things on track and remove blame from the process altogether.
- Go easy on the "why" questions. It's important to understand why somebody did something, but if you press too hard on the "why," you can create resentment or self-doubt in employees. It can sound to them like you're criticizing and judging how they behaved.
- Be consistent. Stay on top of your checkpoints and timeframes for each incident, and enforce the process to the same appropriate degree for each issue.
- Keep calm. Rest assured knowing there’s a path forward. This always helps in creating a better environment for change.