Since the last update of our software, we've heard from a very small number of new users who were unable to create an account. This is a big deal for this small percentage of people. It is possible that the people reporting this problem are only a fraction of the people experiencing it, since many people will just drop off if they can't create an account, but even so, we know that most people don't see this problem. We've also not seen this problem during extensive testing.
Except that we have seen this problem in testing. A few times in dozens and dozens of attempts. Each time, the exact steps were recreated, and it worked fine the next time. It's what we in the testing business call "a fluke".
But, back to the business at hand - how can I solidly recreate this most elusive of bugs? The first thing to do is to find out all we can from the people that experience the bug. Using our best detective skills, we should try to determine how the people who see the bug are different from those who don't. It could be something about how they are interacting with our product or something about their environment (physical resources, other software, even their physical location). If you can isolate what the difference is, it should be easy to reproduce.
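As a rough sketch of that detective work, suppose you have gathered a few details from each user, both the ones who hit the bug and the ones who didn't. The Python below simply counts how often each detail shows up alongside a failure and ranks the most suspicious ones first. The report layout (`failed`, `env`) and the function names are made up for illustration; they are not from our product.

```python
from collections import Counter


def failure_rates(reports):
    """Map each (attribute, value) pair to (failures, total, failure rate)."""
    totals = Counter()
    failures = Counter()
    for report in reports:
        for attribute, value in report["env"].items():
            key = (attribute, value)
            totals[key] += 1
            if report["failed"]:
                failures[key] += 1
    return {
        key: (failures[key], totals[key], failures[key] / totals[key])
        for key in totals
    }


def rank_suspects(reports, min_samples=2):
    """Sort attribute values by failure rate, highest first.

    Values seen fewer than min_samples times are skipped, since one
    report tells you very little.
    """
    rates = failure_rates(reports)
    suspects = [(key, stats) for key, stats in rates.items()
                if stats[1] >= min_samples]
    return sorted(suspects, key=lambda item: item[1][2], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data, invented for this example.
    sample = [
        {"failed": True, "env": {"browser": "Safari", "os": "Windows"}},
        {"failed": True, "env": {"browser": "Safari", "os": "Mac"}},
        {"failed": False, "env": {"browser": "Firefox", "os": "Windows"}},
        {"failed": False, "env": {"browser": "Firefox", "os": "Mac"}},
    ]
    for (attribute, value), (failed, total, rate) in rank_suspects(sample):
        print(f"{attribute}={value}: {failed}/{total} failed ({rate:.0%})")
```

With only a handful of reports this is a hint about where to look, not proof, but it can point your next test session in the right direction.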
If you can't isolate the difference, then it is time to bring in your testing skills. Three approaches you can try:
- "The Crazy Uncle" - Test unusual user flows. Since most people don't see the bug, perhaps we need to try something weird and unexpected to make it occur. Never underestimate a user's ability to use your software in a way that is totally unimaginable.
- "The Big Snafu" - Make what can go wrong, go wrong. This means limiting or disabling your system's resources while going through the use case. This could include such things as unplugging from the network halfway through, removing the install CD at the wrong time, or devoting all but a sliver of your system's RAM to some application other than what you are testing.
- "Brute Force" - Test the usual flow many, many, many times. It could be that what is happening is some sort of race condition that will only occur 1 out of 100 times. This approach is extremely time-consuming and boring (and therefore a good candidate for automation), but if you can make it fail when everything else is normal, that is important information that the developer can use to fix it.
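Brute Force is the easiest of the three to hand to a script. Here is a minimal Python sketch, not our actual test harness: `run_many` and `make_flaky_signup` are made-up names, and the flaky signup is a stand-in that fails at random so the loop has something to catch. In real use you would pass in a function that drives your own account-creation flow.

```python
import random


def run_many(flow, attempts):
    """Call flow(attempt_number) repeatedly and collect every failure.

    Returns a list of (attempt_number, exception) pairs so the developer
    can see exactly which runs failed and why.
    """
    failures = []
    for attempt in range(1, attempts + 1):
        try:
            flow(attempt)
        except Exception as error:  # record the failure and keep going
            failures.append((attempt, error))
    return failures


def make_flaky_signup(failure_rate, seed):
    """Build a stand-in signup flow that fails at random (illustration only)."""
    rng = random.Random(seed)  # seeded so a run can be repeated exactly

    def signup(attempt):
        if rng.random() < failure_rate:
            raise RuntimeError(f"account creation failed on attempt {attempt}")

    return signup


if __name__ == "__main__":
    failures = run_many(make_flaky_signup(failure_rate=0.01, seed=42), 1000)
    print(f"{len(failures)} failures in 1000 attempts")
    for attempt, error in failures:
        print(f"  attempt {attempt}: {error}")
```

Passing the attempt number into the flow makes it easy to create a unique user name on each run, and keeping the exception for every failure gives the developer something concrete to look at.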
If after all this you still can't reproduce the bug, then your choices are to keep trying, live with it, or hope the developers can find it by scrutinizing and carefully reviewing the code. None of this is especially appealing, but no one promised you a pleasure cruise.
Does anyone else have suggestions on ways to find the unfindable bug? Please add them to the comments if you do.


5 comments:
Hi,
Not sure if My top 5 ways to reproduce a hard to reproduce bug could have helped you in this case. Though your description of the bug's behavioral patterns leaves not much chance of it being a Heisenbug, still I would have liked to explore that possibility as well!
Happy Testing...
-Debasis
Hi Andy
Are you testing a closed system or are you testing it live? If you can't reproduce it reliably you'll never know for sure what caused it. If controlled environs can't reproduce, maybe it's something in the live application. Maybe it's not even your code that causes it, maybe it's Amazon? Maybe it's the ISP? Maybe it's user config? Maybe I'm reminding you of all the things you've already thought of?
If you see it unreliably, I'd probably be thinking it's what you called a race problem. And I'd be thinking it's a race through teh intarweb. I'd be looking at Amazon, for at least a little while.
Just had a thought... what about the pc clock? If the local clock is running at a time different from the server, will the server or pc ignore some input because it's happened locally "in the future"? Meh, I dunno. I can barely make Wordpress do what I want... but troubleshooting is what it is, and it works sometimes across "platforms".
My number one resource for finding those insane bugs is to beg the developers to create a build with a lot of info logging, put it in a test stack, and play back access logs against it (using a time period where we're pretty sure the bug happened to someone). That has almost always turned up at least enough evidence to reliably reproduce it.
Log playback is such an underused tool for web QA folks, but it's actually really good at shaking out some of those weird-flow bugs.
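A rough Python sketch of the log playback idea, under some loud assumptions: the access log is in Common Log Format, and `send` is a hypothetical function you supply that issues the request against your test stack and returns the HTTP status. The function names here are invented for illustration. Access logs in this format do not record request bodies, so form posts such as account creation would need the body captured some other way.

```python
import re
from datetime import datetime

# Common Log Format, e.g.:
# 10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /signup HTTP/1.0" 200 2326
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_line(line):
    """Return a dict for one log line, or None if it doesn't match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["time"], TIME_FORMAT),
        "method": match["method"],
        "path": match["path"],
        "status": int(match["status"]),
    }


def requests_in_window(lines, start, end):
    """Yield parsed entries whose timestamp falls between start and end."""
    for line in lines:
        entry = parse_line(line)
        if entry is not None and start <= entry["time"] <= end:
            yield entry


def replay(entries, send):
    """Replay entries through send(method, path).

    Returns (entry, new_status) pairs where the test stack answered
    differently from what the live log recorded.
    """
    mismatches = []
    for entry in entries:
        status = send(entry["method"], entry["path"])
        if status != entry["status"]:
            mismatches.append((entry, status))
    return mismatches
```

Any mismatch between the logged status and the replayed one is a lead worth chasing, and the extra logging in the test build should show what happened around it.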
Could it be due to data? I once saw data cause unexpected bugs. Just a thought. -Candanna
thanks for sharing information