This one’s likely to get a bit controversial :).
There is an unfortunate tendency among test leads to measure the performance of their testers by the number of bugs they report.
As best as I’ve been able to figure out, the logic works like this:
Test Manager 1: “Hey, we want to have concrete metrics to help in the performance reviews of our testers. How can we go about doing that?”
Test Manager 2: “Well, the best testers are the ones that file the most bugs, right?”
Test Manager 1: “Hey that makes sense. We’ll measure the testers by the number of bugs they submit!”
Test Manager 2: “Hmm. But the testers could game the system if we do that – they could file dozens of bogus bugs to increase their bug count…”
Test Manager 1: “You’re right. How do we prevent that then? – I know, let’s just measure them by the bugs that are resolved “fixed” – the bugs marked “won’t fix”, “by design” or “not reproducible” won’t count against the metric.”
Test Manager 2: “That sounds like it’ll work, I’ll send the email out to the test team right away.”
Sounds good, right? After all, the testers are going to be rated by an absolute value based on the number of real bugs they find – not the bogus ones, but real bugs that require fixes to the product.
The problem is that this idea falls apart in reality.
Testers are given a huge incentive to find nit-picking bugs – instead of finding significant bugs in the product, they try to find the bugs that increase their number of outstanding bugs. And they get very combative with the developers if the developers dare to resolve their bugs as anything other than “fixed”.
So let’s see how one scenario plays out using a straightforward example:
My app pops up a dialog box with the following:
Plsae enter you password: _______________
Where the edit control is misaligned with the text.
Without a review metric, most testers would file a bug with a title of “Multiple errors in password dialog box”, which then would call out the spelling error and the alignment error on the edit control.
They might also file a separate localization bug because there’s not enough room between the prompt and the edit control (separate because it falls under a different bug category).
But if the tester has their performance review based on the number of bugs they file, they now have an incentive to file as many bugs as possible. So the one bug morphs into two bugs – one for the spelling error, the other for the misaligned edit control.
This version of the problem is a total and complete nit – it’s not significantly more work for me to resolve one bug than it is to resolve two, so it’s not a big deal.
But what happens when the problem isn’t a real bug? Remember, bugs that are resolved “won’t fix” or “by design” don’t count against the metric, so the tester can’t flood the bug database with bogus bugs to artificially inflate their bug count. Consider this exchange:
Tester: “When you create a file when logged on as an administrator, the owner field of the security descriptor on the file is set to BUILTIN\Administrators, not the current user.”
Me: “Yup, that’s the way it’s supposed to work, so I’m resolving the bug as by design. This is because NT considers all administrators as interchangeable, so when a member of BUILTIN\Administrators creates a file, the owner is set to the group to allow any administrator to change the DACL on the file.”
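As an aside, here’s a rough sketch of how you could check this behavior for yourself – it’s not production code (the path is just a placeholder and error handling is abbreviated), and whether the owner actually comes out as BUILTIN\Administrators depends on the Windows version and how the machine is configured. The idea is to create a file while running as an administrator, read back the owner field of its security descriptor, and compare it against the well-known BUILTIN\Administrators SID:

#include <windows.h>
#include <aclapi.h>
#include <stdio.h>

int main(void)
{
    /* Create a file while running as an administrator (placeholder path). */
    HANDLE h = CreateFileW(L"C:\\temp\\owner-test.txt", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }
    CloseHandle(h);

    /* Read back the owner field of the file's security descriptor. */
    PSID owner = NULL;
    PSECURITY_DESCRIPTOR sd = NULL;
    DWORD err = GetNamedSecurityInfoW(L"C:\\temp\\owner-test.txt", SE_FILE_OBJECT,
                                      OWNER_SECURITY_INFORMATION,
                                      &owner, NULL, NULL, NULL, &sd);
    if (err != ERROR_SUCCESS) {
        printf("GetNamedSecurityInfo failed: %lu\n", err);
        return 1;
    }

    /* Build the well-known BUILTIN\Administrators SID (S-1-5-32-544) and compare. */
    SID_IDENTIFIER_AUTHORITY ntAuthority = SECURITY_NT_AUTHORITY;
    PSID admins = NULL;
    AllocateAndInitializeSid(&ntAuthority, 2, SECURITY_BUILTIN_DOMAIN_RID,
                             DOMAIN_ALIAS_RID_ADMINS, 0, 0, 0, 0, 0, 0, &admins);

    printf("Owner is BUILTIN\\Administrators: %s\n",
           EqualSid(owner, admins) ? "yes" : "no");

    FreeSid(admins);
    LocalFree(sd);
    return 0;
}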
Normally the discussion ends here. But when the tester’s going to have their performance review score based on the number of bugs they submit, they have an incentive to challenge every bug resolution that isn’t “Fixed”. So the interchange continues:
Tester: “It’s not by design. Show me where the specification for your feature says that the owner of a file is set to the BUILTIN\Administrators account.”
Me: “My spec doesn’t. This is the way that NT works; it’s a feature of the underlying system.”
Tester: “Well then I’ll file a bug against your spec since it doesn’t document this.”
Me: “Hold on – my spec shouldn’t be required to explain all of the intricacies of the security infrastructure of the operating system – if you have a problem, take it up with the NT documentation people”.
Tester: “No, it’s YOUR problem – your spec is inadequate, fix your specification. I’ll only accept the “by design” resolution if you can show me the NT specification that describes this behavior.”
Me: “Sigh. Ok, file the spec bug and I’ll see what I can do.”
So I have two choices – either I document all these subtle internal behaviors (and security has a bunch of really subtle internal behaviors, especially relating to ACL inheritance) or I chase down the NT program manager responsible and file bugs against that program manager. Neither of which gets us closer to shipping the product. It may make the NT documentation better, but that’s not one of MY review goals.
In addition, it turns out that the “most bugs filed” metric is often flawed in the first place. The tester that files the most bugs isn’t necessarily the best tester on the project. Often the tester that is the most valuable to the team is the one that goes the extra mile, spends time investigating the underlying causes of bugs, and files bugs with detailed information about possible causes. But they’re not the most prolific testers, because they spend the time to verify that they have a clean reproduction and good information about what is going wrong. They spend the time they would otherwise have spent finding nit bugs making sure that the bugs they find are high quality – they find the bugs that would have stopped us from shipping, and not the “the florblybloop isn’t set when I twiddle the frobjet” bugs.
I’m not saying that metrics are bad. They’re not. But basing people’s annual performance reviews on those metrics is a recipe for disaster.
Somewhat later: After I wrote the original version of this, a couple of other developers and I discussed it a bit at lunch. One of them, Alan Ludwig, pointed out that one of the things I missed in my discussion above is that there should be two halves to a performance review:
MEASUREMENT: Give me a number that represents the quality of the work that the employee is doing.
EVALUATION: Given the measurement, is the employee doing a good job or a bad job? In other words, you need to assign a value to the metric – how relevant is the metric to your performance?
He went on to discuss the fact that any metric is worthless unless it is periodically reevaluated to determine how relevant it still is – a metric is only as good as its validity.
One other comment that was made was that absolute bug count metrics cannot be a measure of the worth of a tester. The tester that spends two weeks and comes up with four buffer overflow errors in my code is likely to be more valuable to my team than the tester that spends the same two weeks and comes up with 20 trivial bugs. Using the severity field of the bug report was suggested as a metric, but Alan pointed out that this only worked if the severity field actually had significant meaning, and it often doesn’t (it’s often very difficult to determine the relative severity of a bug, and often the setting of the severity field is left to the tester, which has the potential for abuse unless all bugs are externally triaged, which doesn’t always happen).
By the end of the discussion, we had all agreed that bug counts were an interesting metric, but they couldn’t be the only metric.
http://blogs.msdn.com/b/larryosterman/archive/2004/04/20/116998.aspx