Software Industry Performance: What You Measure Is What You Get

Charles Symons, Common Software Measurement International Consortium
The choice of performance metrics, estimating methods, and processes impacts software supplier performance. Improving current metrics and methods and educating customers can lead to better all-around performance.

Software suppliers excel at selling the benefits of their products and services and at emphasizing how customers can improve business performance. However, suppliers rarely use their own internal performance measurements to support their selling propositions. When you look at the data, it’s easy to see why, and it’s worth understanding the reasons.

Bearing in mind the old adage “what you measure is what you get,” it’s worth exploring how current practice in software metrics and estimating methods and processes might influence industry performance. I propose a package of improvements to performance metrics and estimating methods and processes aimed at overcoming the identified weaknesses. The package can’t solve all the industry’s performance problems, of course, but it should help customers obtain better performance from their suppliers.

Industry Performance: The Evidence

My observations draw from publicly available surveys of overall industry performance and from my own experience with performance measurement and estimating. Many individual software suppliers who have invested in software process improvement (SPI) have impressively used measurements to guide and demonstrate performance improvement and to estimate accurately. These exemplars show what can be done, but they don’t tell us anything about overall industry performance.

As software industry performance indicators, I use the following in this article:

  • Productivity: the ratio of size of software delivered to effort.
  • Speed of delivery: the ratio of size of software delivered to elapsed time.
  • Quality of delivered software: the ratio of defects recorded post-delivery to size of software (that is, defect density).
  • Delivery to budget: the ratio of actual cost to estimated cost.
  • Delivery to time: the ratio of actual elapsed time to estimated elapsed time.

Many other important performance indicators are possible, of course—moreover, any one project, depending on its goals, will rank some of these tradable indicators as more important than the others. However, given the goal of examining overall industry performance, I recommend exploring all these high-level indicators as a set.
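To make these definitions concrete, here is a minimal Python sketch of how the five indicators might be computed for a single completed project. The record fields, units, and the choice of COSMIC Function Points as the size measure are illustrative assumptions on my part, not a prescribed standard.

    from dataclasses import dataclass

    @dataclass
    class ProjectRecord:
        size_cfp: float             # delivered functional size (COSMIC Function Points assumed)
        effort_hours: float         # total project effort
        elapsed_months: float       # calendar time from start to delivery
        post_delivery_defects: int  # defects recorded after delivery
        estimated_cost: float
        actual_cost: float
        estimated_months: float

    def indicators(p: ProjectRecord) -> dict:
        """The five high-level performance indicators for one completed project."""
        return {
            "productivity (CFP/hour)": p.size_cfp / p.effort_hours,
            "speed of delivery (CFP/month)": p.size_cfp / p.elapsed_months,
            "defect density (defects/CFP)": p.post_delivery_defects / p.size_cfp,
            "delivery to budget (actual/estimated cost)": p.actual_cost / p.estimated_cost,
            "delivery to time (actual/estimated months)": p.elapsed_months / p.estimated_months,
        }

    # A hypothetical 500 CFP project, for illustration only.
    print(indicators(ProjectRecord(500, 4000, 9.0, 25, 400_000, 520_000, 7.0)))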

When you examine the software industry’s performance on these five indicators, it’s uneven and, at first sight, puzzling. For example, how do we reconcile the following?

  • Delivery to time and budget is notoriously poor.
  • Productivity doesn’t show much improvement over time, and little is known about industry speed of delivery trends.
  • Delivered software quality is astoundingly good in several domains but variable in others.

All this is true, even though SPI activities have been practiced in many cases for 20 years. And, although customers pay heavily and repeatedly for the software industry’s failings, it’s very rare that a major external supplier actually suffers serious financial harm. However, internal suppliers don’t survive perceived poor performance for long; their services get outsourced. (Most of the time, I make no distinction between the behavior and performance of external and internal software suppliers because the differences generally aren’t relevant for this article.)

Delivery to Time and Budget

Regularly since 1994, the Standish Group has reported on software projects undertaken in public and private sectors by internal suppliers.(1) These CHAOS reports generally classify about 15 to 24 percent of projects as failures (that is, they weren’t completed or the results weren’t used), up to 50 percent as challenged (that is, they weren’t completed within 10 percent of the estimated time or budget), and only about 33 percent as successful.

A recent study of 214 projects in the EU found comparably bad results.(2) The projects were from multiple industry sectors, undertaken between 1998 and 2005, at a total cost of 1.2 billion Euros (approximately US $2 billion at the time). Roughly one-quarter of the projects failed—that is, they were never completed—and, of the remainder, one-third (another quarter of the total) over-ran budget or schedule by at least 10 percent.

Researchers have questioned the Standish methodology, and the European study gives no details of its methodology, so it’s possible that their findings exaggerate the problem. However, projects undertaken for the UK public sector showed even worse results than these broad surveys suggest. According to a Financial Times report,(3) only “approximately 30 percent” of government IT projects are delivered on time and budget. Supporting this figure is a well-documented analysis of 105 software contracts between UK public-sector customers and external suppliers that failed or over-ran.(4) For these contracts, which had a total value of 29.5 billion UK pounds (approximately US$50 billion at the time), 30 percent were terminated, 57 percent experienced cost overruns averaging 30.5 percent, and 33 percent suffered major delays. Yet the external suppliers undertaking all these projects operate worldwide and would call themselves “world-class.”

Better results have been reported. The best that I’ve found are from a survey of 412 projects in the UK,(5) which indicated performance figures roughly twice as good (or half as bad?) as those reported by Standish. But it’s notable that project managers volunteered the survey data from their own projects, and these managers had a high average level of industry experience. So, we might expect the data to be biased toward the better performers. Suppliers of commercial benchmarking services also report better data for over-runs related to time and budget, with medians of typically 20 to 30 percent. However, organizations that participate in external benchmarking studies almost certainly do not represent the industry as a whole, and projects that fail completely usually aren’t submitted for benchmarking. Whichever sets of results you prefer, taken together, it’s obvious that overall industry performance is poor in this area.

Let’s try to put the cost of these failings in perspective. Estimates by Standish for the US market and by others for the European Union put the annual cost of write-offs due to software system project failures and cost over-runs somewhere around US$100 billion for each market. (These figures’ origins are unclear; they could be exaggerated, and they almost certainly include hardware and other costs associated with these projects.) These figures are comparable to the amount collectively written off by a few large investment banks in 2008 due to the global credit crunch. Why, in contrast to the outpouring of anger over the credit meltdown, is there not a greater outcry demanding retribution from the software industry? Customers seem to accept that annual losses on this scale are par for the course. Construction industry projects often suffer hefty cost and time over-runs, but imagine the landscape if 10 to 20 percent of their projects were failures.

Productivity (and Speed of Delivery)

A widely quoted Business Week article asserts that the US software industry’s productivity has actually been declining over the past 15 years or so.(6) However, what Business Week calls “productivity” is actually “industry value-added output divided by industry resource used.” This metric is economically important but difficult to interpret in this context.

In the follow-up discussions(7) of the Business Week article, various software suppliers claimed that their technology had led to major increases in real productivity or that the software’s complexity had increased in recent years, which explained the apparent lack of progress in productivity. Nobody defined complexity in these discussions; the comments merely implied that productivity metrics weren’t working properly.

Publicly available industry-wide trend data on software development productivity is quite sparse, mostly from the business application software domain and covering internal supplier performance. An analysis of measurements on more than 4,000 development projects world-wide from 1995 to 2005, collected by the International Software Benchmarking Standards Group (ISBSG), found “no trace of ongoing improvement” in productivity.(8) However, over this period, the average team size for projects roughly doubled, which the authors suggest is due to software development’s “growing intricacy.”

Some commercial benchmarking service suppliers show data indicating that their clients’ development productivity has doubled over the past 10 to 15 years—but again, it’s unlikely that these clients represent the industry as a whole. The situation probably isn’t universally as dire as these various reports indicate. For example, widespread adoption of packaged software (such as COTS and major ERP software) probably resulted in higher productivity than was achieved when developing the previous generation of custom software. But organizations don’t often measure productivity of package-implementation projects, so their contribution to industry performance is missed.

Interestingly, although time to market, and thus speed of delivery, is often more important economically than productivity, there doesn’t seem to be any industry trend data at all on this performance parameter. Adopting rapid and agile approaches to software development could result in earlier (and thus, more valuable) delivery of some functionality than traditional waterfall approaches, but it remains to be seen if the overall speed of delivery of the final, total requirements is faster.

Quality

In contrast to the story so far, delivered software is often of sufficient quality to support its intended use at the time of deployment. However, the world’s communications, transportation, and financial services systems are under constant cyber attacks that exploit both software design and implementation defects. In these domains, the supplier must continue to address lifecycle and complex software supply-chain issues long after initial delivery, either to satisfy safety-critical needs (someone might die otherwise) or to avoid the risk of huge financial losses. Resilient software can also be produced when the supplier is suitably incentivized by a contract that makes it expensive to deliver defects.

Software quality is an opaque property. Many customers are at an information disadvantage about software quality, which erodes their bargaining power. In addition, customers may prefer software with a lower initial cost and choose to compensate for, or cope with, usability and security issues over time. In fact, some widely used software is full of defects and frustrating to use, yet still enables vastly improved health, safety, and productivity.

Industry Investment in SPI

Models for helping software suppliers improve their processes and performance have been around for more than 20 years. The industry was initially slow to adopt these models, but they’re now gaining widespread acceptance.

The quality movement pre-dates formal SPI models, and much of SPI’s focus has been on improving quality. Judging by reports from SEPG conferences, the early adopters were often in the same domains that must aim to produce defect-free software. What’s surprising is the relative lack of emphasis by organizations undertaking SPI efforts on improving productivity and delivery to time and budget.

At the 2007 European SEPG conference, researchers asked 49 participants what actions were receiving the most attention.(9) The response was that improving product quality was receiving the highest priority, followed by improving productivity and improving predictability. But when asked what should be happening, participants reversed the order of these priorities. There is hope!

Industry Financial Performance

The cost of the software industry’s failings seems to fall almost entirely on customers. While the fortunes of individual commercial software suppliers fluctuate with their successes and difficulties, how often do suppliers go out of business due to incompetence? The survey of UK public-sector outsourcing contracts(4) noted that the profit margin of the “world-class” suppliers on these contracts was almost always above 10 percent (and as high as 25 percent in some cases).

For in-house suppliers, the picture is bleaker. If ever top management suspects that their performance isn’t competitive, in-house suppliers are at risk of being outsourced.

Metrics and Estimating Methods: Impact on Performance

Various studies have shown that few organizations succeed long-term in gathering software metrics and using them to improve performance and predictability. Often the metrics that are gathered are the easiest to collect.

Of our five main performance indicators, measuring quality by counting defects and comparing and reporting defect densities across projects (using counts of LOC to normalize the size factor) is one of the easiest—and therefore, most common—metrics activities. Furthermore, as I’ve already shown, suppliers often have strong incentives to produce defect-free software—for example, when there are safety considerations, or when customers can withhold payment to a supplier until defects are eliminated.

Defects are also easily understood by all parties to a software contract. The customer might not get defect-free software if he lacks leverage over the supplier or if he sets another performance parameter (such as achieving a target delivery date) as the highest priority. But generally, producing defect-free software isn’t constrained by any difficulties of measuring defects.

Contrast this observation with the difficulty of reliably measuring software project productivity and speed of delivery, the difficulty customers have in understanding these metrics, and the correspondingly poor industry performance. Both parameters need a technology-independent measure of the delivered software’s size as a measure of project work-output. Function Points are still the most commonly used functional size measure, but even so their use isn’t widespread and tends to be limited to business application software.

Allan Albrecht’s proposal to measure the size of software requirements (as opposed to counting LOC or other program artifacts) required a great leap of lateral thinking.(10) However, his metric, which assumes a very limited size range for any one functional component and which was calibrated for one specific software domain, is now showing its limitations. Still, it’s a remarkable tribute to Albrecht’s ideas that a metric established more than 25 years ago in an IBM development group is still usable. (The method is now maintained by the International Function Point Users Group and is called the IFPUG method.)

Measuring project productivity isn’t just a question of measuring work-output. It’s often just as difficult to reliably and consistently measure the corresponding work-input (that is, effort). Then there are the nontrivial problems of how to interpret measurements of productivity and speed of delivery and present them for decision-making. For example, presenting productivity measurements without considering the possible trade-offs with speed of delivery and quality can be highly misleading—yet, it’s often done. Consequently, all large collections of productivity measurements that I’ve ever seen show a huge spread of results. The ISBSG productivity figures, for example, for developments using 3GL languages on mainframe technology, typically show a spread of a factor of 10 between the 10th and 90th percentiles. Part of this spread is surely due to real differences in productivity across projects, but much must be due to the weaknesses of the metrics and of the raw data, and the difficulty of presenting productivity measurements independently of speed, quality, and the many other influences.

Estimating methods rely on data collections to derive their algorithms—for example, for how productivity varies with size, programming language, and technology platform. But how many estimating methods automatically advise the user on the inherent uncertainty in an estimate due to the spread of the raw data used to derive the method’s algorithms?
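As a rough sketch of what such advice could look like (an illustrative approach under my own assumptions, not the algorithm of any particular estimating tool), an estimate can be presented as a range derived from the spread of the organization’s own historical productivity data rather than as a single point figure:

    import statistics

    def effort_estimate_with_range(size_cfp, productivity_history):
        """Turn a single size estimate into an effort range that reflects the
        spread of historical productivity (delivered CFP per hour), instead of
        one falsely precise point figure."""
        deciles = statistics.quantiles(productivity_history, n=10)  # 9 cut points
        p10, p90 = deciles[0], deciles[-1]                          # 10th and 90th percentiles
        p50 = statistics.median(productivity_history)
        # Higher productivity means lower effort for the same size, and vice versa.
        return {
            "optimistic_hours": size_cfp / p90,
            "likely_hours": size_cfp / p50,
            "pessimistic_hours": size_cfp / p10,
        }

    # Hypothetical local history with roughly a tenfold spread between the
    # 10th and 90th percentiles, as in the benchmark data discussed above.
    history = [0.05, 0.07, 0.10, 0.12, 0.15, 0.20, 0.28, 0.35, 0.45, 0.55]
    print(effort_estimate_with_range(500, history))

Even this crude range makes the inherent uncertainty visible to the customer, which is precisely what many tools fail to do.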

When starting to estimate effort and time for a new project, the first challenge is to somehow quantify the requirements, and then to use past measurements of productivity and speed of delivery to convert size to effort and time. Formal estimating methods should, and mostly do, let the user calibrate the method using performance data collected in the user’s own organization. But given the difficulties we’ve been discussing, few organizations have the skills and patience to do this calibration.

Proprietary and publicly available estimating methods and tools often are very sophisticated and can be especially valuable for evaluating what-if scenarios. However, what if, as commonly happens, the approach to estimating consists of the following?

  • Applying a primitive method of measuring a functional size, or hazarding a guess at the SLOC;
  • measuring an incomplete and unstable set of requirements;
  • entering this size (or even worse, a functional size converted to SLOC) into a “black box” estimating tool that hasn’t been calibrated to local performance; and, finally,
  • setting the estimates of effort and time in stone, with no understanding of the real uncertainty.

Would we seriously expect the estimates to be realistic? Actually, the problem isn’t that industry is necessarily poor at delivering on time and budget; it’s that customers and suppliers agree to unrealistic estimates that aren’t much better than guesswork in the first place.

Everyone accepts that the customer must have some idea of costs before a project proceeds too far. But everyone also knows that it’s impossible to get requirements right the first time. So why do rough, initial estimates get cast in stone too early in software projects? The answer must lie with the poor way we integrate the process of using metrics for estimating with the processes of determining requirements and project decision-making, coupled with the interests of the parties involved in these processes.

This integration is usually weak. I recently met the software metrics managers of two “world-class” software suppliers, both of whom manage very large databases of project performance measurements. One told me that he had analyzed the effort estimates of all the current projects in his area and found that 50 percent of the total effort was categorized as “contingency.” I asked the other manager, “How do you estimate if you get an RFP where the requirements don’t have enough detail to do a proper estimate?” The answer was, “We make our best estimate and then add 150 percent contingency.”

Turning to the customer side, those negotiating software contracts—usually senior managers, lawyers, and accountants—often lack in-depth understanding of how to measure and control performance, or of the uncertainties of estimating. My best illustration of the difficulties that can arise from a poorly written outsourcing contract is the immortal words of an IS manager: “We used to have them maintain our systems on a T&M contract, and we couldn’t get rid of them. Now we have them on a fixed-price contract, and we can’t find them.” Up against an experienced software sales team, the typical customer is easily outclassed.

Examining current practices also helps us understand why quality usually receives more attention than delivery on time and budget. (Meeting a quality target therefore isn’t just driven by imperatives such as safety-critical requirements.) Suppliers must meet the finally agreed requirements, or they’ll be fired, or not paid, or will lose money during the warranty period. But because the customer changes his mind on the requirements after agreeing on the initial estimates, the supplier gets a get-out-of-jail-free card for not meeting the delivery targets.

Figure 1 summarizes my conclusion—that a causal chain exists, linking current practices in software metrics, estimating methods and processes to observed project performance.

Of course, not all “challenged” projects follow this path exactly. For example, where meeting an agreed delivery date becomes paramount, testing might be curtailed and quality will suffer. Either way, a customer who makes an early commitment to an unsound estimate is storing up trouble for later.

Figure 1 A causal chain linking measurement and estimating to actual performance. Estimates are made and committed to before the requirements are stable. But when the customer realizes the need for changes, the supplier is free to renegotiate the estimated delivery date and costs. The supplier then focuses on the finally agreed-upon requirements. The outcome, then, is that quality is acceptable, costs escalate, and the delivery date goes out the window.

The Way Forward

Major progress in science and technology has always required improvements in measurement. This is especially true for software metrics, estimating methods, and processes. Some practices haven’t advanced for a generation.

Creating a Solution

The elements of a solution for these measurement and estimating challenges already largely exist. What we need is a concerted effort to package and implement them. I wouldn’t suggest for one moment that adopting this package will solve all the performance problems I’ve described here. Clearly, adopting other good practices in project management, requirements determination, and so on is also vital to project success. But when we have a set of known weaknesses, and remedies largely exist to overcome the weaknesses, then it seems sensible to adopt these remedies as well.

The elements of the package are as follows.

A credible, open method for sizing software functional requirements. The IFPUG method might be adequate in certain circumstances, but it has its weaknesses. The method designed by the Common Software Measurement International Consortium (COSMIC), an international group of software metrics experts, should satisfy most needs.(11) It’s designed for use in both business and real-time software domains (but not mathematically intensive software), and you can use it, for example, to size the requirements of multicomponent, distributed software systems in any software architecture layer.

Briefly, the method assumes that the functional requirements of a piece of software to be measured contain functional processes; each is triggered by a user (a person, a hardware device, or another piece of software) informing the software that an event has occurred to which the software must respond. A functional process is composed of movements of data groups between the software and its users and between the software and persistent storage. You measure a functional process’s size as the count of its data movements, with a minimum of two COSMIC Function Points (CFPs), but no maximum size. The size of a piece of software is the sum of the sizes of its functional processes.
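The counting rule just described can be sketched in a few lines of Python. The process names and data structures below are my own illustrative assumptions; only the counting principle (one CFP per data movement, summed over functional processes) follows the COSMIC definition above.

    # The four COSMIC data movement types: Entries and Exits move data groups
    # between the software and its users; Reads and Writes move data groups
    # between the software and persistent storage.
    VALID_MOVEMENTS = {"Entry", "Exit", "Read", "Write"}

    def functional_process_size(movements):
        """Size in CFP = count of data movements. A valid functional process has
        at least two (its triggering Entry plus at least one other movement),
        and there is no upper limit."""
        if not set(movements) <= VALID_MOVEMENTS:
            raise ValueError("unknown data movement type")
        return len(movements)

    def software_size(functional_processes):
        """The size of a piece of software is the sum of the sizes of its
        functional processes."""
        return sum(functional_process_size(p) for p in functional_processes)

    # Hypothetical example: two functional processes of a small business application.
    enter_new_customer = ["Entry", "Read", "Write", "Exit"]             # 4 CFP
    produce_monthly_report = ["Entry", "Read", "Read", "Read", "Exit"]  # 5 CFP
    print(software_size([enter_new_customer, produce_monthly_report]))  # 9 CFP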

Open methods of estimating software project effort and duration. However sophisticated an estimating method might be in terms of the number of variables and trade-offs that it can account for, it’s just as important that the method advises on the uncertainty of its estimates due to uncertainties in the input data, algorithms, unknowns, or risk. The ISBSG estimating methods satisfy many of these needs,(12) but this is the area that still has the greatest need for further improvement—that is, for better, open, industry-standard estimating methods, ideally of varying levels of sophistication, usable for different conditions and understandable by software customers.

We also need more publicly available benchmark data based on a modern functional sizing method, to support organizations’ use of estimating methods. However, each organization should aim to collect performance data to establish its own benchmarks and use these to calibrate its estimating methods.
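As one simple illustration of what such calibration might involve (a sketch under my own assumptions, not the ISBSG method itself), an organization could fit a power-law effort model to its own completed projects and use it for first-cut estimates:

    import math, statistics

    def calibrate_effort_model(local_projects):
        """Fit effort = a * size^b to an organization's own completed projects
        (size in CFP, effort in hours) by linear regression in log-log space.
        This is one common simple form of local calibration, not a standard."""
        xs = [math.log(size) for size, _ in local_projects]
        ys = [math.log(effort) for _, effort in local_projects]
        slope, intercept = statistics.linear_regression(xs, ys)
        return math.exp(intercept), slope    # a, b

    def estimate_effort(a, b, size_cfp):
        return a * size_cfp ** b

    # Hypothetical local history of (size in CFP, effort in hours).
    history = [(100, 900), (250, 2600), (400, 4500), (800, 10400), (1200, 17000)]
    a, b = calibrate_effort_model(history)
    print(f"effort ~= {a:.1f} * size^{b:.2f}; 500 CFP -> {estimate_effort(a, b, 500):.0f} hours")

The point is not the particular model form but that the coefficients come from the organization’s own data rather than from an uncalibrated “black box.”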

An open process for applying the estimating methods. The Australian “Southern Scope” process meets this need.(13) Simply put, it starts when the customer issues an initial statement of requirements. The supplier bids a fixed price per unit functional size and estimates the total price using a size estimated from the initial customer requirements. As the requirements evolve in size, the total price varies in proportion, but the price per unit functional size remains fixed. The customer therefore bears the risk of varying the size of the requirements; the supplier bears the risk of bidding the unit price based on its knowledge of the customer’s needs and of its own capabilities.
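A minimal sketch of the commercial mechanics, with invented numbers, may make the risk split clearer:

    def total_price(unit_price_per_cfp, size_cfp):
        """Under a unit-price contract the total price scales with the measured
        functional size; the price per CFP stays fixed."""
        return unit_price_per_cfp * size_cfp

    # Hypothetical bid: the supplier quotes a fixed 800 (in some currency) per CFP
    # against an initial statement of requirements sized at 500 CFP.
    unit_price = 800
    print(total_price(unit_price, 500))   # 400000: the initial estimated price

    # The requirements later grow to 650 CFP. The total price rises in proportion
    # (the customer bears the scope risk), but the unit price is unchanged (the
    # supplier bears the risk of having quoted a realistic unit price).
    print(total_price(unit_price, 650))   # 520000: the revised price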

And the Benefits?

As I stated earlier, no current large-scale survey data support my claims about this package, and not all elements of the package will be equally important in all circumstances. However, there’s already evidence, some anecdotal, of achievable benefits.

The COSMIC functional size measurement method had its greatest initial take-up in the domain of real-time software, which had not, hitherto, had an open functional size measurement method designed for that domain. Already several users have reported successful use of CFP sizes as input to recalibrated in-house estimating methods, resulting in improvement over earlier methods, particularly for estimating larger software projects.

Software sizing accuracy is important in project estimating, where a 10-percent error in an estimated size normally translates into a 10-percent error in the estimated effort. For software whose components’ sizes are all within a limited range, the IFPUG size scale could still be adequate. But some measurements using the COSMIC method have revealed extremely large single functional processes of up to 70 CFPs in business application software and up to 100 CFPs in avionics software. (The IFPUG size scale for a functional process ranges from 3 to 7 FP.)

One interesting case is of a large, global bank that had invested significantly in SPI but showed almost no improvement in productivity when using IFPUG sizes to measure project work-output. The problem was that the average size of the bank’s software functions, measured using the IFPUG method, had moved toward the upper limits of the method’s size scale. After substituting COSMIC sizes, which have no artificial upper limits and which reflected the growing size of the bank’s functional processes, the newer applications were found to be larger relative to the older ones, and the revised measurements revealed that productivity had in fact improved over the period. So, how much real productivity improvement is the industry missing by using inadequate work-output metrics?
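The distortion in the bank’s figures can be illustrated with invented numbers. The sketch below crudely caps the size of any one functional process at 7, as a simplification of the IFPUG method’s 3-to-7 range, and treats the count of data movements as a rough stand-in for size on both scales, purely to show the effect of the cap on measured productivity:

    def productivity(size_units, effort_hours):
        return size_units / effort_hours

    # Two hypothetical functional processes: a modest one with 5 data movements
    # and a very large one with 70 (as observed in some business and avionics software).
    movements_small, movements_large = 5, 70
    effort_small, effort_large = 50, 700   # hours, roughly proportional to real size

    # COSMIC: size = number of data movements, with no upper limit per process.
    print(productivity(movements_small, effort_small),        # 0.10 CFP/hour
          productivity(movements_large, effort_large))        # 0.10 CFP/hour: no change

    # Capped scale: any one process scores at most about 7 size units.
    print(productivity(min(movements_small, 7), effort_small),  # 0.10 per hour
          productivity(min(movements_large, 7), effort_large))  # 0.01 per hour: an apparent collapse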

Extensive use of the Southern Scope process in Australia has led to a threefold improvement in price per unit functional size compared with traditional requirements management, estimating, and bidding processes, and the average cost of project over-runs has improved from 84 percent to less than 10 percent.(14) These impressive results deserve much wider attention. But why, if these metrics methods and processes are so good, does the market not rush to adopt them? Unfortunately, the balance of incentives and knowledge between software customers and suppliers doesn’t help adoption, and there is huge (and understandable) inertia in the field of software metrics.

Current practices don’t incentivize suppliers to deliver all-around high performance in delivering software. Furthermore, while suppliers can continue to make healthy profits with current practices, why give more power to their customers by improving the transparency of their bids or by agreeing to share measurements of their internal performance with their customers? At the same time, users of existing metrics, their advisors, and suppliers of estimating tools have few incentives to change while they can continue to exploit their investment in their existing assets. Software customers ought to be driving their suppliers to obtain improved all-around performance, but they’re mostly ignorant of the possibilities for improvement, and their decision-makers don’t understand software metrics. Serious information asymmetry exists between software customers and suppliers.

The Southern Scope process provides the customer with levers to control project scope, price performance, cost, and schedule. The process is fair to both the customer and the supplier. If the customer does a poor job of articulating his requirements so that the software size increases significantly, then costs will rise proportionately and the delivery date will slip, although unit price will remain constant. Alternatively, if the supplier doesn’t achieve the unit-price target, its profitability will suffer. The customer learns about the suppliers’ performance via the quoted unit price, so much of the information-asymmetry is redressed. This process is the most important element in this package.

Our hope must be that those in the software industry who call themselves professionals will give a higher priority to improving software metrics and estimating practices, and to educating their customers on available levers. Then, software customers can use their purchasing power to require suppliers to deliver and demonstrate better all-around performance. The prize is enormous, however you measure it.

References

  1. Standish Group, CHAOS Report, 2009; www.standishgroup.com/newsroom/chaos_2009.php.
  2. J. McManus and T. Wood-Harper, “A Study in Project Failure,” BCS, June 2008; www.bcs.org/server.php?show=ConWebDoc.19584.
  3. N. Timmins, “Suppliers Agree to Cut IT Costs for Whitehall,” Financial Times, 1 Dec. 2006.
  4. D. Whitfield, Cost Over-Runs, Delays, and Terminations: 105 Outsourced Public Sector ICT Projects, research report 3, European Services Strategy Unit, Dec. 2007.
  5. C. Sauer, A. Gemino, and B.H. Reich, “The Impact of Size and Volatility on IT Project Performance,” Comm. ACM, vol. 50, no. 11, 2007, pp. 79–84.
  6. J. Weber et al., “Industry Outlook 2004,” Business Week, 12 Jan. 2004; www.businessweek.com/magazine/content/04_02/b3865601.htm.
  7. R. Groth, “Is the Software Industry’s Productivity Declining?” IEEE Software, vol. 21, no. 6, 2004, pp. 92–94.
  8. Z. Jiang, P. Naudé, and C. Comstock, “An Investigation on the Variation of Software Development Productivity,” Int’l J. Computer & Information Science & Eng., vol. 1, no. 2, 2007, pp. 72–81.
  9. A. Rainer, M. Muhammad, and S. Rule, “Report on a Survey Conducted at the ESEPG Conference 2007,” Software Measurement Services, 2008; www.measuresw.com/library/Papers/Others/ESEPG2007%20QuestionnaireReport%20v1.pdf.
  10. A.J. Albrecht, “Measuring Application Development Productivity,” IBM Applications Development Symp., 1979, pp. 83–92.
  11. “The COSMIC Functional Size Measurement Method Version 3.0.1: Measurement Manual,” Common Software Measurement International Consortium, May 2009; www.cosmicon.com/portal/public/COSMIC%20Method%20v3.0.1%20Measurement%20Manual.pdf.
  12. Practical Project Estimation, 2nd ed., International Software Benchmarking Standards Group, 2004.
  13. “Southern SCOPE: Avoiding Software Budget Blowouts,” Government of the State of Victoria, e-Government Resource Centre; www.egov.vic.gov.au.
  14. P.R. Hill, “Software Development Projects in Government: Performance, Practices, and Predictions,” Int’l Software Benchmarking Standards Group, Jan. 2004; www.ifpug.org/about/SoftwareInGovernment.pdf.
Charles Symons is semiretired after 50 years in computing. He’s president of the Common Software Measurement International Consortium, and has specialized for the past 25 years in improving measurement and estimating for software activities. A graduate in physics (BSc, Birmingham University, UK), he has worked as a scientific programmer, managed large data centers, been responsible for IS standards-setting, and has led consulting studies on IS strategy and improving the performance of the IS function in many parts of the world. He resides in the UK and the French Alps. Contact him at cr.symons@btinternet.com.
The software industry’s overall performance is uneven and, at first sight, puzzling. Delivery to time and budget is notoriously poor, and productivity shows limited improvement over time—yet quality can be amazingly good. Customers largely bear the costs of the poor aspects of performance. Many factors drive this performance. This article explores whether causal links exist between the overall observed performance and the commonly used performance metrics, estimating methods and processes, and the way these incentivize suppliers. The author proposes a set of possible improvements to current metrics and estimating methods and processes, and concludes that software professionals must educate their customers on the levers that are available to obtain a better all-round performance from their suppliers.
performance measures, cost estimation, productivity, quality

 
