You Build It, You Own It

Do you remember that retrospective where a customer reported a bug, and the discussion was around whose fault it was? QA, Engineering, a gap in the PRD?

I hate that discussion, and it's avoidable when you have the correct values and work procedures in place, with the right expectations from everyone.

Shift-left quality assurance has perks and risks. I'll describe my takeaways from building and scaling an engineering team with no QA team from day one, and the impact it had on engineering practices, quality procedures, and speed of deployment.

This approach is commonly known as "You Build It, You Run It," but I prefer the term "You Build It, You Own It." Ownership implies more than an end-to-end cycle in which developers ship code and run their own Docker containers. It emphasizes the sense of ownership required when something breaks in the middle of the night or when a bug is reported. It also covers the long stretch of maintaining a system as it changes (security, scale, bugs).

Set expectations with your engineers:

Like many things in life, setting expectations makes everything simpler.

  1. When it breaks, you fix it.
  2. When it fails in the middle of the night, you wake up and fix it.
  3. Done (also) means code is deployed, metrics added, and alerts are set.
  4. If it breaks, we will prioritize resources to make sure it doesn't break again.
  5. It's okay to fail. Make sure we detect failure fast and restore faster.

Don't hire QA

Don’t hire QA employees, especially manual ones. It's ok to add automation engineers or a QA expert to help define what a good test is. Just make sure your developers know what is expected.

Once expectations are clear, they will add the needed tests to ensure things won't break. Still, things outside of our control will change during the feature's lifetime. So even if we have all the tests in the world, the feature can still fail in the future.

When the team building the service owns it, they will create the most effective automation for testing and deploying it.

Test-driven design

I keep hearing about “test-driven design” as the solution for quality. It's a good technique in my experience, but it isn’t right for every scenario. When test-driven design is treated as a theological concept and considered the only way to write quality code, you will ship code slower, overburden yourself with unnecessary tests, and focus on the wrong thing: testing everything instead of what's really important.

If anything, I prefer behavior-driven design when it comes to quality. It focuses on testing the behaviors you want to ship to customers, not every piece of code behind them.

Avoid burnout

Your engineers’ happiness matters. As we add more responsibility, and especially on-call duty, it can be a slippery slope leading to burnout and attrition.

It's crucial to track alerts as a KPI, broken down by off hours, work hours, and weekends. Use the tracking to make sure repeating issues are being addressed and to validate that alerts are raised only when immediate action is required (as opposed to notifications that can be handled during regular work hours). Hold yourself (and your managers) accountable for keeping that metric in place and for prioritizing tasks that reduce alerts.
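To make the KPI concrete, here is a minimal sketch of how alerts could be bucketed and repeat offenders surfaced. The 09:00-18:00 working hours, the `bucket` and `alert_kpi` names, and the `(alert_name, timestamp)` input shape are all assumptions for illustration, not a description of any specific tool.

```python
from collections import Counter
from datetime import datetime

# Assumed working hours (local time); adjust to your team's schedule.
WORK_START_HOUR = 9
WORK_END_HOUR = 18


def bucket(fired_at: datetime) -> str:
    """Classify an alert timestamp as weekend, work hours, or off hours."""
    if fired_at.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return "weekend"
    if WORK_START_HOUR <= fired_at.hour < WORK_END_HOUR:
        return "work_hours"
    return "off_hours"


def alert_kpi(alerts: list[tuple[str, datetime]]) -> dict:
    """Count alerts per time bucket and list alert names that fired more than once."""
    per_bucket = Counter(bucket(ts) for _, ts in alerts)
    per_name = Counter(name for name, _ in alerts)
    repeating = sorted(name for name, count in per_name.items() if count > 1)
    return {"per_bucket": dict(per_bucket), "repeating": repeating}
```

Reviewing the `repeating` list regularly is one way to decide which maintenance tasks to prioritize, and a growing `off_hours` or `weekend` count is an early warning sign for burnout.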

Pay attention to how you roll out alerts. New alerts are often noisy at first, so we start them as notifications (quiet alerts) and make sure they are tuned properly before promoting them to real alerts. This process reduced the noise dramatically.
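The quiet-first rollout can be sketched as a flag on the alert rule itself. This is an illustrative model only: `AlertRule`, `route`, and the `"ok"` / `"notify"` / `"page"` outcomes are hypothetical names, and a real monitoring system would express the same idea in its own configuration.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold: float
    quiet: bool = True  # new rules start as notifications until they are tuned


def route(rule: AlertRule, value: float) -> str:
    """Decide what to do with a metric value under the given rule.

    Returns "ok" when the threshold is not exceeded, "notify" for a quiet
    rule (handled during work hours), and "page" for a promoted rule.
    """
    if value <= rule.threshold:
        return "ok"
    return "notify" if rule.quiet else "page"
```

Promoting a rule is then a one-line, reviewable change (`quiet=False`) made only after the threshold has proven itself.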

You can read more about this topic here

But sometimes you need QA

No, you don't. That sentence was thrown at me many times when certain parts of the system had more errors. As I see it, it's not the solution: once you go back to splitting coding and testing into separate responsibilities, you will quickly find yourself in the ancient discussion of what the right ratio of QA to developers is.

If you hear that sentence, chances are that something was built wrong or is lacking automated tests or the right maintenance metrics. That's a good time to gather the team and refocus on finding the root causes (design, process, ownership) and fixing them.

With that being said, certain parts of what you build might require more rigorous testing than others if the tolerance for error is very low (for us it was the SDK we built and shipped to customers). Such areas require more rigor and process around automation and validation. We even went the extra mile to define an internal RFC to make sure all our SDKs conform to the same behavior. (This is a place where a QA expert can help.)

Metrics and alerts

These two come together. Measure what matters and what makes an impact. Convert it to proper KPIs, and set notifications and alerts if anything moves in the wrong direction. These can be anything from common operational metrics such as cost, CPU, or RAM to product-specific metrics such as time on page, users per day, etc.

Tracking these alerts helps highlight when something is broken in your system. It's also the best indicator that something outside of your control changed, letting you know it's time to do some maintenance (going back to the concept of "you build it, you own it") and get that service working again.

Infra and config as code

Once you have good coding practices in place (enforced code reviews, proper code branches, and continuous integration with tests, plus continuous deployment if possible), the next source of many bugs and production errors is everything outside your code. One example is configuration, whether of infrastructure, back office, or customer-specific settings. Managing these as code, with a review process, auditing, rollback, and even automated tests, metrics, and alerts, will help keep your system in even better shape.
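As one illustration of automated tests for configuration, a small validator can run in continuous integration against every config change before it is merged. The field names below (`timeout_seconds`, `rollout_percent`, `alert_channel`) and the `validate_config` function are hypothetical; the point is only that config gets the same checks as code.

```python
def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config is valid."""
    errors = []

    # bool is a subclass of int in Python, so reject it explicitly.
    timeout = config.get("timeout_seconds")
    if isinstance(timeout, bool) or not isinstance(timeout, (int, float)) or timeout <= 0:
        errors.append("timeout_seconds must be a positive number")

    percent = config.get("rollout_percent")
    if isinstance(percent, bool) or not isinstance(percent, int) or not 0 <= percent <= 100:
        errors.append("rollout_percent must be an integer between 0 and 100")

    channel = config.get("alert_channel")
    if not isinstance(channel, str) or not channel.strip():
        errors.append("alert_channel must be a non-empty string")

    return errors
```

A CI step could load each config file, call `validate_config`, and fail the build if the returned list is not empty, so a bad value is caught in review instead of in production.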

Feature flag and gradual rollout

Feature flags and gradual rollouts are two practices that, used in combination, give you a lot of confidence and peace of mind when rolling out changes (code or configuration). With the right metrics and alerts, you can deploy a change, enable it for a portion of your customers, validate that everything is working properly, and then continue the rollout to everyone.
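One common way to enable a flag for a portion of customers is to hash each customer into a stable bucket and compare it with the rollout percentage. The sketch below assumes that approach; `rollout_bucket`, `is_enabled`, and the flag and customer identifiers are illustrative names, not the API of any particular feature-flag product.

```python
import hashlib


def rollout_bucket(flag_name: str, customer_id: str) -> int:
    """Map a customer to a stable bucket in the range 0-99 for a given flag."""
    key = f"{flag_name}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag_name: str, customer_id: str, rollout_percent: int) -> bool:
    """Return True if the flag is on for this customer at the given rollout percentage."""
    return rollout_bucket(flag_name, customer_id) < rollout_percent
```

Because the bucket depends only on the flag name and the customer, a customer who got the change at 10% keeps it at 50%, and setting the percentage back to 0 turns the change off for everyone without a deployment.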

Debrief, learn, and improve

I have written (and talked) about this many times. Using debriefs after failures highlights what's missing in the processes or technology you are using. It's what helped me build the list above over the years: these are the items that repeatedly came up as the most effective ways to never experience the same failure or bug a second time.

It's not for everyone

I have seen engineers become much more capable under this culture. It drives the team to be more professional, understand the scope of the software they ship and think about maintainability in the long term.

But it's not for everyone, and that's ok. Sometimes engineers see themselves only as code writers. And if they can't change that mentality, both of you will be disappointed. Tell candidates what is expected and how things work.

I'd love to hear your thoughts about this subject!

© PerimeterX, Inc. All rights reserved.