Random but Interesting Posts about Nothing in Particular: Why Modern Software is Buggy

Monday, October 3, 2016

Why Modern Software is Buggy

There are several factors that have led to Microsoft (and other large companies) releasing software that is buggy:

1) The internet makes updates easy.

Ahh, the delicious irony. Because software can be updated at any moment, the desire to fix any individual bug has gone down dramatically. In the days before 0-day patches, software was shipped on physical media. If there was a bug in your product, that bug would likely live forever since the internet wasn't a thing. We'd go into ship room and argue passionately about whether or not bugs needed to be fixed, and the question was always "Will we fix this bug now, or never?".

Thus, even a bug that affected only a small number of users gained a certain level of gravity because that bug would never be fixed, except for a small glimmer of hope that it was addressed in the next release (unlikely, since we'd just say "well, we shipped this before - why is it important now?").

Now in theory you can fix a bug at 10am, push it into code review, and people can see the fix literally hours later. Why sit in a room and have big arguments over a bug that will go away soon?

Except in reality bugs don't get fixed that fast. And bugs create more bugs. And sometimes bugs are one-way doors. But never mind all that. We can fix it in the next sprint!

2) People don't want to pay for software.

More and more programs are being created by communities for free or being given away by large companies for free to help monetize ad traffic. Competition for eyeballs is fierce, fierce, fierce.

But software is very expensive to develop. In order to hire the best talent you have to pay top dollar. An average software engineer with ~5 years at a big-four company is level 61 or 62 (SDE2) and earning $120-140k a year in base salary. That's before their $20k bonus and $15k of stock, not to mention health benefits, 401k, and other assorted perks. Folks who have hit principal and above are clearing $250k easily in total compensation before benefits.

Now you're in a situation where you want to give your software away for free. And you're in a situation where bugs don't matter as much. So how do you save a couple of million dollars a year? Get rid of (half of) your testing staff.

Why pay someone to test your software when you can convince the public to test it for you? Call it a preview program and... boom! free resources! People will file bug reports for you, and by adding instrumentation into the build you can also find bugs programmatically. You also get a ton more diversity in hardware, better app compat testing, better/more globalization and localization testing, etc. And it's FREE!

This is a fantastic theory, until the bug reports start coming in. They are largely terrible. Most of the useful info in bug reports is unstructured data that requires some hefty natural language parsing or a human eyeball to read and interpret. Some bug reports are literally things like 'clikeed the botton and nottthing'. WTF? What do you do with that?

You ignore it, that's what you do. You start paying much more attention to the bugs that are being filed internally by people who are (forcibly) dogfooding the product. The result is that you've distributed the testing from a small group of experts to a wide group of tech-savvy non-experts. You've also randomized your dev staff because they need to stop what they're doing and spend a goodly amount of their day filing bugs.

3) Everyone is metric-based, but nobody knows what the metrics are or what they mean.

Managers are in love with measuring things. Much telemetry. So data. Except the ability to get data has vastly outpaced the ability to understand the data. Even sampling at 1% or less, Microsoft gets petabytes of data on a constant basis about what's happening with Windows users. No human can grok that data in its raw form. Someone needs to enrich that data, visualize it, provide context for it, and determine how that data should be acted upon. Those people, by and large, don't exist at Microsoft.

We're hiring for it as fast as we can, and the QE staff (bless their hearts) are trying to become data scientists. But no.

You get into a room and someone puts up a chart. Then everyone spends 30 minutes doing an interpretive discussion about what the chart means. Everyone attacks the data and wants undeniable evidence the numbers are correct. Rightfully so, because often the numbers have turned out to be wrong due to bad SQL, bad assumptions, events in the wrong place, event sample mismatch, or a host of reasons.

Even if the data is assumed to be correct, what does it mean? We released a patch last week and usage went up. Yay! Oh, well last week was also back-to-school week, so maybe usage went up because more machines were coming online. Can we see this data normalized for number of machines? No, that's another slice of data that we'd have to go off and produce.
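To make the back-to-school problem concrete, here's a minimal sketch with made-up numbers. The session and machine counts are hypothetical (not real telemetry), and `sessions_per_machine` is just an illustrative helper. It shows how raw usage can go up while usage per active machine actually goes down.

```python
def sessions_per_machine(sessions, machines):
    """Usage normalized by the number of machines that were online."""
    return sessions / machines


# Hypothetical weekly totals, invented for illustration only.
week_before_patch = {"sessions": 10_000_000, "machines": 2_000_000}
week_after_patch = {"sessions": 12_000_000, "machines": 2_500_000}

# Raw view: total sessions compared week over week.
raw_change = week_after_patch["sessions"] / week_before_patch["sessions"] - 1

# Normalized view: sessions per online machine compared week over week.
normalized_change = (
    sessions_per_machine(**week_after_patch)
    / sessions_per_machine(**week_before_patch)
    - 1
)

print(f"raw usage change:         {raw_change:+.1%}")
print(f"per-machine usage change: {normalized_change:+.1%}")
```

With these invented numbers the chart everyone is staring at says "usage is up", while the normalized slice says each machine is using the product a bit less. Which one is the truth depends on a slice of data nobody has produced yet.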

Our crashes-per-million-sessions numbers are down, that's good. Well, no. That's bad because we think it means people who are crashing are just using the product less, therefore the people that are left aren't the people that are crashing. We didn't get more stable, we just lost users. Maybe.
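Here's a rough sketch of how that can happen, again with invented numbers. Assume two hypothetical groups of users: a big group that rarely crashes and a small group that crashes a lot. Nothing about the product changes between the two snapshots; the crash-prone group just runs it a tenth as often.

```python
def crashes_per_million(sessions, crashes):
    """Crashes per million sessions."""
    return crashes * 1_000_000 / sessions


def fleet_rate(groups):
    """Aggregate crashes-per-million over a list of (sessions, crashes) groups."""
    total_sessions = sum(sessions for sessions, _ in groups)
    total_crashes = sum(crashes for _, crashes in groups)
    return crashes_per_million(total_sessions, total_crashes)


# Hypothetical (sessions, crashes) per group, invented for illustration.
stable_users = (900_000, 900)             # 1,000 crashes per million sessions
crashy_users_before = (100_000, 5_000)    # 50,000 crashes per million sessions
crashy_users_after = (10_000, 500)        # same rate, a tenth of the sessions

rate_before = fleet_rate([stable_users, crashy_users_before])
rate_after = fleet_rate([stable_users, crashy_users_after])

print(f"before: {rate_before:.0f} crashes per million sessions")
print(f"after:  {rate_after:.0f} crashes per million sessions")
```

The headline number drops sharply even though each group crashes exactly as often as it did before. The only thing that changed is who is still showing up in the data.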

How does this translate to buggier software though? Well, in order to fix a bug you need to provide data showing that fixing the bug will make the product better (slight simplification). We have all this data, so surely if a bug is important you'll be able to provide strong data-backed justification. Except, no, for all the reasons above.

So now you have a situation where managers want data before they'll fix a bug. And they correctly state that the data exists. But nobody really knows how to get them that data, so nobody can make a strong case for a bug. Thus anyone that wants to punt a bug can do so trivially by simply asking the developer to prove the bug is important. That should be easy, right?

There are a myriad of other, smaller, reasons I could speak to ('Everyone does it this way', 'The data shows that customers don't actually care about quality, they care about the perception of quality' (this is true, by the way), 'We need to be fast') but the three bullets above capture the heart of the issue.
