Tuesday, April 03, 2007

How not to implement an automation platform

The task that my team works on is the end-to-end test and integration of a communications network. Struggling through a period of mass layoffs, when there weren't enough minds and bodies available to perform even the minimum tasks required to get a robust product out to our customers, we took the forward-looking approach and dedicated some of our minds and time to implementing an automation platform. The idea was that it would let us handle a larger workload by shifting repetitive tasks to automation, give us greater test coverage, and free us to focus our energies on the interesting, non-repetitive work.

One thing I immediately learned is that engineers like to implement automation platforms. I liked it. It's fun. Challenging. The meetings were full of energy and ideas. I also learned that engineers do not like debugging automation platforms. Once the software and hardware were up and running and the platform was proven functional, working on stability issues was the last thing anyone wanted to do. The answer from everyone was always "Well, it worked for me. Must be your setup/testcase/server/coverage." It was never the automation platform that was the problem.

Work on the platform started in 2004, and this week the whole project was finally put to rest. What follows is the post-mortem: what not to do, and what to do, when implementing an automation platform.
1. There were some repetitive tasks being done at the time we undertook the automation platform, but they were not large: maybe a day's worth of work every month. The thought when this started was that those repetitive tasks would grow, or that we would run automated tests more often if an automation platform was available. There was also a range of tests that the platform could cover that we were not doing at the time.
It turned out that the tests being run repeatedly were not unmasking bugs. We were running the tests that always passed. A warm fuzzy feeling followed, but nothing else. After a while, it became apparent that the tests were a waste of time, and they were stopped. However, no one made the link to the work on the automation platform. We continued working on it even though the initial requirement was no longer valid. The thought was to keep looking forward without really looking forward.
It also turned out that the problems customers were reporting were not ones automation could uncover. So the range of tests that we weren't running, but that potentially could be automated, was not at all essential to the stability or robustness of the product. Yet again, the link to the work on the automation platform was not made. It seemed like the work already in progress could not be stopped; time and effort already spent demanded more time and effort. It was the sunk-cost trap, apparent now, but not at the time.
Finally, no one likes doing repetitive tasks, even if they're simple to do. That was the reason to undertake the automation platform. By the same token, no one takes on more repetitive tasks just because they're made easier. They're still repetitive. It's still doing the same thing again, and the overhead of testing is still there. I've never heard an engineer say "I've got some time. Let me run some automated testcases." I've often heard an engineer say "F****** automated testcases again. F***"

2. Testing an end-to-end network, where software on multiple nodes is updated frequently, is probably not the ideal environment for an automation platform. But that was one of the key drivers to begin with: we had so many changes to the network that we needed a low-labor way to check that basic functionality still worked. However, the testing required to cover the software changes was always so specific that the effort to automate those tests never made sense. There seemed to be a lot of tests across a lot of different testplans that looked alike (probably because basic tests were always included and testplans were copied from one another), but the bulk of the work was tests that were unique and time-consuming. Tests that could not be automated. So the automation platform wasn't really making a dent in our workload.

Configuring an automation system amid so many changes also required quite a bit of work. Automation does not respond well to large-scale changes, and the work required to keep the system up to date was substantial. In the end, the total effort saved was nil, or maybe even negative. The lack of minds and bodies was exacerbated by the need for reconfiguring; automation was not really saving us time. People were giving up on the automation platform because so much reconfiguring and tweaking was needed before their runs. Since there was no central person responsible for the platform, whoever needed an automation run was also responsible for making sure it was in working order before starting. That produced a lot of frustration, and people began giving up on the platform before they had really used it.

3. We did not have the budget for a dedicated system for the automation platform. We basically had to run our tests on a system that was simultaneously under test by a lot of other users: a system with multiple nodes, each being tested by multiple people at once. Initially, the thought was that we would either run automation when the system was not being used by others, overnight for instance, or carve out our own system within the system that could be shielded from other users. Neither ever fully worked.

Resetting the system to a known starting point after use by many users was a nightmare. Things were changing that should not have been. I think we could have made this work if there had been people managing the changes, but the lack of minds and bodies precluded us from setting that up. In our environment, managing changes meant sending around an email saying something was going to be changed unless anyone objected. Rarely did anyone object.

Automation does not deal well with non-standard starting points. Changes to the system under test can and do have severe effects on test results, and when you run whole suites of testcases against a drifted system, they produce totally incoherent results.

I have read several documents that say a dedicated system is needed for automation. I do not think this is absolutely necessary; you can get around it by having a set of people manage the changes to the system. However, a non-dedicated system with every tester managing their own changes is a recipe for disaster.
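
In hindsight, one cheap defense on a shared system is a pre-run sanity check that refuses to start a suite unless the SUT matches a recorded baseline, so a drifted system produces a loud refusal instead of incoherent results. Here is a minimal sketch in Python; the baseline file, the node names, and the ssh-based version probe are all hypothetical stand-ins for whatever your own platform exposes:

    import json
    import subprocess
    import sys

    BASELINE_FILE = "baseline.json"   # hypothetical: recorded known-good state per node
    NODES = ["node-a", "node-b"]      # hypothetical node names

    def node_state(node):
        # Hypothetical probe: ask the node for its software version over ssh.
        # Replace with whatever query your platform actually supports.
        out = subprocess.run(["ssh", node, "cat", "/etc/sw-version"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()

    def main():
        with open(BASELINE_FILE) as f:
            baseline = json.load(f)
        states = {n: node_state(n) for n in NODES}
        drift = {n: (baseline.get(n), s)
                 for n, s in states.items() if s != baseline.get(n)}
        if drift:
            # Refuse to run: results from a drifted system are incoherent anyway.
            for node, (want, got) in drift.items():
                print(f"{node}: expected {want!r}, found {got!r}", file=sys.stderr)
            sys.exit(1)
        print("SUT matches baseline; safe to start the suite.")

    if __name__ == "__main__":
        main()

A check like this doesn't replace people managing changes, but it at least turns a silent bad run into an immediate, attributable failure.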

4. Writing testcases for automation was deemed one of the easier activities. We handed it off to a couple of contractors who, with a few inputs from us, basically came up with the framework, wrote the testcases, verified them, and passed them to us for a second round of verification. Since they were being paid per testcase, the robustness problems they were facing didn't really bother them. To them, it was never the testcase that was the problem; it was the SUT or the automation platform. Of course, some of their failures did involve both of those things, but the design of their testcases only exacerbated the issues. We also only ever saw a list of the testcases being automated, with no further information. In the end, we got a set of testcases that sometimes passed and sometimes failed, even when there were no changes to the SUT. They were useless. We did get a framework for writing testcases, but that may have been a curse in disguise, as we never got around to totally rewriting the testcases from the ground up; we felt we had some work already done.
The testcases have since been through a few revisions and now look nothing like the testcases as originally written; one can get consistent results from them. But not with all the other problems still in place.
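
The usual cure for that kind of pass/fail lottery is a testcase skeleton that establishes its own preconditions, skips loudly when they can't be met, and returns the SUT to a known state afterward, rather than assuming the previous testcase left things clean. A minimal sketch using Python's unittest, with a fake SUT driver standing in for the real one (every name here is hypothetical):

    import unittest

    class FakeSut:
        """Stand-in for the real SUT driver; everything here is hypothetical."""
        def is_idle(self):
            return True
        def place_call(self, peer):
            return type("Call", (), {"connected": True})()
        def reset(self):
            pass

    class BasicCallTest(unittest.TestCase):
        def setUp(self):
            # Establish preconditions explicitly instead of assuming them.
            self.sut = FakeSut()
            if not self.sut.is_idle():
                self.skipTest("SUT not idle: refusing to produce a bogus result")

        def test_basic_call(self):
            call = self.sut.place_call("node-b")
            self.assertTrue(call.connected, "call failed to connect")

        def tearDown(self):
            # Return the SUT to a known state for the next testcase.
            self.sut.reset()

    if __name__ == "__main__":
        unittest.main()

Paying per testcase rewards none of this discipline, which is a large part of why we got the results we did.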

Moral of the story: Automation is not a substitute for minds and bodies. It only shifts the work, and the work is continual if the SUT is continuously changing.
