Since the last update of our software, we've heard from a very small number of new users who were unable to create an account. This is a big deal for this small percentage of people. It is possible that the people reporting this problem are only a fraction of the people experiencing it, since many people will just drop off if they can't create an account. Even so, we know that most people don't see this problem. We've also not seen this problem during extensive testing.
Except that we have seen this problem in testing. A few times in dozens and dozens of attempts. Each time, we retraced the exact steps, and it worked fine the next time. It's what we in the testing business call "a fluke".
But, back to the business at hand - how can we reliably recreate this most elusive of bugs? The first thing to do is to find out all we can from the people who experience the bug. Using our best detective skills, we should try to determine how the people who see the bug are different from those who don't. It could be something about how they are interacting with our product or something about their environment (physical resources, other software, even their physical location). If you can isolate what the difference is, the bug should be easy to reproduce.
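If you manage to collect even a handful of details from affected and unaffected users, a quick tally can point your detective work in the right direction. Here is a minimal sketch in Python, assuming each report has been boiled down to a flat dictionary of attributes. The function name, the attribute names, and the sample data are all made up for illustration.

```python
from collections import Counter


def distinguishing_attributes(failing, passing, min_gap=0.5):
    """Return (attribute, value, fail_rate, pass_rate) tuples for every
    attribute/value pair whose frequency differs between the two groups
    by at least min_gap, largest gap first."""

    def rates(reports):
        # Fraction of reports in the group that carry each (attribute, value) pair.
        counts = Counter()
        for report in reports:
            for key, value in report.items():
                counts[(key, value)] += 1
        total = len(reports) or 1  # avoid dividing by zero for an empty group
        return {pair: n / total for pair, n in counts.items()}

    fail_rates = rates(failing)
    pass_rates = rates(passing)

    results = []
    for pair in set(fail_rates) | set(pass_rates):
        f = fail_rates.get(pair, 0.0)
        p = pass_rates.get(pair, 0.0)
        if abs(f - p) >= min_gap:
            results.append((pair[0], pair[1], f, p))

    results.sort(key=lambda r: abs(r[2] - r[3]), reverse=True)
    return results


if __name__ == "__main__":
    # Hypothetical data: what we learned from users who hit the bug...
    failing = [
        {"browser": "A", "locale": "tr-TR"},
        {"browser": "B", "locale": "tr-TR"},
    ]
    # ...and from users who did not.
    passing = [
        {"browser": "A", "locale": "en-US"},
        {"browser": "B", "locale": "en-US"},
        {"browser": "A", "locale": "de-DE"},
    ]
    for attribute, value, f, p in distinguishing_attributes(failing, passing):
        print(f"{attribute}={value}: {f:.0%} of failing vs {p:.0%} of passing")
```

With so few reports this only suggests where to look first, but an attribute that shows up in every failing report and in no passing one is a good place to start.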
If you can't isolate the difference, then it is time to bring in your testing skills. Three approaches you can try:
- "The Crazy Uncle" - Test unusual user flows. Since most people don't see the bug, perhaps we need to try something weird and unexpected to make it occur. Never underestimate a user's ability to use your software in a way that is totally unimaginable.
- "The Big Snafu" - Make what can go wrong, go wrong. This means limiting or disabling your system's resources while going through the use case. This could include such things as unplugging from the network halfway through, removing the install CD at the wrong time, or devoting all but a sliver of your system's RAM to some application other than the one you are testing.
- "Brute Force" - Test the usual flow many, many, many times. It could be that what is happening is some sort of race condition that will only occur 1 out of 100 times. This approach is extremely time-consuming and boring (and therefore a good candidate for automation), but if you can make it fail when everything else is normal, that is important information that the developer can use to fix it.
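The "Brute Force" approach in particular is easy to hand off to a script. Below is a minimal sketch in Python of what that automation could look like. The `run_many` harness and the `make_flaky_attempt` stand-in are hypothetical names invented for this example; in real use you would replace the stand-in with a function that drives your actual account-creation flow and returns True when it succeeds.

```python
import random
from collections import Counter


def run_many(attempt, runs=1000):
    """Call attempt(i) for i in range(runs) and tally the outcomes.

    attempt should return True on success and False on failure.
    An exception is also recorded as a failure, along with its repr.
    """
    outcomes = Counter()
    failures = []
    for i in range(runs):
        try:
            ok = bool(attempt(i))
            detail = None
        except Exception as exc:
            ok = False
            detail = repr(exc)
        outcomes["pass" if ok else "fail"] += 1
        if not ok:
            # Keep the run number so the failing attempt can be examined later.
            failures.append((i, detail))
    return outcomes, failures


def make_flaky_attempt(failure_rate=0.01, seed=42):
    """Stand-in for the real test step: fails at random, roughly
    failure_rate of the time. Purely simulated, for demonstration."""
    rng = random.Random(seed)

    def attempt(i):
        return rng.random() >= failure_rate

    return attempt


if __name__ == "__main__":
    outcomes, failures = run_many(make_flaky_attempt(), runs=1000)
    print(f"passed: {outcomes['pass']}, failed: {outcomes['fail']}")
    for run_number, detail in failures:
        print(f"  run {run_number} failed ({detail})")
```

Recording which runs failed, and not just how many, matters here: the run number lets you line each failure up with logs and timestamps, which is the kind of information a developer needs to chase a race condition.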
If after all this you still can't reproduce the bug, then your choices are to keep trying, live with it, or hope the developers can find it by carefully reviewing the code. None of these is especially appealing, but no one promised you a pleasure cruise.
Does anyone else have suggestions on ways to find the unfindable bug? Please add them to the comments if you do.

