Alan Cox on writing better software

A large part of the software industry has never heard of the science of quality assurance - or if it has, it doesn't believe in it. Thus spake Alan Cox, Wales' most famous Red Hat employee and one of the most influential voices in the IT world. Currently wrapping up his MBA at Swansea University, it's clear that Cox has been spending a lot of time thinking about what the software world can learn from everyone else about quality.
Cox was speaking at the launch of an advanced technical computing group for Wales, run by IT Wales, part of Swansea University's computer science department. IT Wales' other activities include running events for SMEs in South and West Wales, and working to retain IT skills in Wales by matchmaking computer science graduates with Welsh businesses.
The advanced technical computing group aims to bring best practice to Welsh software engineers from organisations such as the British Computer Society, the Natural Computing Forum and the Welsh e-Science Centre. Activities kick off in January 2005.
Cox, a graduate of Swansea University, discussed a number of trends which are allowing developers to produce better quality software. While some of these trends relate specifically to the computing world, others are simply a case of that world putting into practice the kinds of techniques which have been seen as essential in traditional industry for some time.
Starting with the statement that "all software sucks", Cox compared software engineering to its counterpart on the hardware side of the equation, where the economic incentives for getting it right first time are indisputable; with hardware, a single error can cost millions.
Using microprocessor manufacturers as an example, Cox said, "They put over 100 million gates/transistors on a tiny piece of silicon. On that piece of silicon there are more lines than there are on a roadmap of London - and they work. There are very very few errors in a microprocessor."
When software doesn't work the way it should, it's easy and cheap to ship an upgrade or a patch to the users, who are then inclined to accept buggy software as the normal state of affairs, Cox said.
Even though there has been a movement for some time to introduce traditional engineering concepts such as quality assurance to software development, Cox sees today's software engineering as "the art of writing large bad programs rather than small bad programs".
Of the much-vaunted 'holy grail' of reusable objects, Cox said, "As far as I'm concerned these all generally suck too. Part of the problem is that they're sold as products and the original ideas behind a lot of reusable products is that you wrote it once. If you write it once, it has to do everything. If it does everything it's complicated, and if it's complicated, it's broken. That's not always the case but it is quite frequently the case."
As for QA, "Everybody in the real world will agree - the moment a project is behind deadline, quality assurance tends to go out the window. People go through the specification and everything marked 'optional' becomes 'version 2', and everything marked 'QA needed' becomes, 'we'll find out from the users if it works,'" Cox said.
Another factor that's led to the current state of affairs is the strategy of canny software companies which ship bad software as quickly as possible, on the basis that once the end user has one piece of software for the job it becomes harder to switch to another. In that context, Cox considers Microsoft's release of early versions of MS Windows a very sound economic and business decision.
Compounding the situation even further is the incentive for businesses to deny all knowledge and point fingers when software errors are uncovered. If there are several parties responsible for the maintenance of a piece of software, he said, it's in everybody's interests that the other person fixes the bug because the customer will assume that whoever fixes the bug was responsible for it. Most businesses, particularly SMEs, don't have that luxury.
Happily, it seems there are good reasons why this situation can't go on much longer. One large incentive for improving matters is security. "We're looking at very large numbers of PCs being taken over every day, used as zombie machines, fed software which makes them dial the internet via Ghana, and, in particular, something known as zero-day holes. In other words, someone finding a security flaw and exploiting it before the rest of the world knows."
"The update side is becoming a problem. You take a WinXP machine, you plug it onto the internet, on average you have 20 minutes before it is infected with something, if it's not behind a firewall. That is considerably less time than you need just to download the updates. These are becoming economic issues, because they're starting to cost businesses all over the world astronomical amounts of money."
So, how does one make the world a better place by writing better software? For starters, Cox says, we need to accept that humans are fallible and that software engineers, no matter how well trained, will make large numbers of mistakes in their software - so we should start using the right tools to keep the error count as low as possible.
Here, then, are Alan Cox's hot tips and tools for writing better software...
Execute-only code: One of the classic ways of attacking a web server with a known security hole is to feed the server a command that triggers the hole and contains a piece of code to be run as a result. Cox cited recent developments in microprocessor design which allow execute-only and read-only areas of memory, providing protection against such attacks: for instance, data fed in to trigger a security hole won't run if it isn't in executable memory.
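A minimal Linux sketch of what that protection means in practice (our illustration, not code from the talk): the "payload" lands in memory that was never marked executable, so the hardware refuses to run it.

    #include <sys/mman.h>
    #include <cstring>

    int main() {
        const std::size_t page = 4096;

        // Allocate a page that is readable and writable but NOT executable.
        void *buf = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) return 1;

        // Pretend an attacker managed to write machine code here.
        std::memset(buf, 0xC3, page);    // 0xC3 is the x86 'ret' opcode

        // On NX-capable hardware the jump faults instead of executing the
        // injected bytes, because the page was never given PROT_EXEC.
        auto payload = reinterpret_cast<void (*)()>(buf);
        payload();                       // SIGSEGV here, by design
    }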
Firewalling by default: "Red Hat has been doing this for four years now, Microsoft is finally doing it, Apple has been reasonably intelligent about this for a long time as well. You don't leave your front door open just in case you need to walk in and out. It's much much safer to have your front door shut. So by having firewalling by default, it actually allows users to accept, there is probably insecure software on my computer system. It may have bugs in it. But if the rest of the world can't get at my software, I don't care - not too much."
Languages are very important, particularly when it comes to the issue of memory allocation.
"If computer programmers get the memory allocation wrong, why are we letting the computer programmers do the memory allocation? The computer can do this. The world has moved on since the design of languages like Fortran and C."
"So for other newer languages, we have garbage collection, we have sensible memory allocation, and this means we can take things away from the programmer, so that providing the language has done it right, the programmer cannot make that mistake anymore. And this works out incredibly effectively when you look at the kind of bugs you get in software. Even when just doing it by getting programming interfaces right, we see huge improvements."
"I looked at this for some of the Linux desktop code. And instead of using standard C functions for a lot of the memory handling for text, it has a library which doesn't allow the programmer to screw it up. If you look at the history of this kind of error, almost none of them occurred in desktop [environment] compared to a very large number that were found elsewhere in applications on Linux. So it tells us that using the right tools works."
Validation tools: "They used to be very expensive, they're getting a lot cheaper. So we know for example if a given function takes a lock, it should also get rid of the lock in all paths. So one of the cases where the error code forgets to do things, we catch."
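Clang's -Wthread-safety checker is one freely available tool of this kind; a minimal sketch using its annotations on an illustrative Mutex wrapper (declarations only, since only the compile-time check matters here):

    #include <mutex>

    class __attribute__((capability("mutex"))) Mutex {
    public:
        void lock()   __attribute__((acquire_capability()));
        void unlock() __attribute__((release_capability()));
    private:
        std::mutex m_;
    };

    Mutex m;
    int shared_counter __attribute__((guarded_by(m)));

    void update(bool fast_path) {
        m.lock();
        ++shared_counter;
        if (fast_path)
            return;     // clang -Wthread-safety warns: 'm' is still held
                        // when this path leaves the function
        m.unlock();
    }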
Type safety: "Things like type safety are now taken for granted. When I was an undergraduate at Swansea University, we thought it was a novelty when the C compiler told you if you passed a floating value to a function instead of an integer."
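Today's type systems can go further and encode meaning rather than mere representation, so whole categories of swapped-argument bugs simply fail to compile; an illustrative sketch:

    // Wrapper types cost nothing at runtime but let the compiler reject
    // arguments passed in the wrong units or the wrong order.
    struct Celsius    { double v; };
    struct Fahrenheit { double v; };

    Fahrenheit to_fahrenheit(Celsius c) {
        return Fahrenheit{c.v * 9.0 / 5.0 + 32.0};
    }

    // to_fahrenheit(Fahrenheit{98.6});   // compile error: wrong type
    // to_fahrenheit(Celsius{37.0});      // OK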
Tainting: "The idea is that when you've got untrusted data, you actually tell the computer this data is untrusted, because then you can look through how the untrusted data is used, and what other data it creates. And you can look for cases where you're doing stuff with untrusted data that you shouldn't be - like relying on it. And so we catch human mistakes before we ship them to the consumer."
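Perl's taint mode does this at the language level; in a statically typed language the same idea can be sketched as a wrapper type whose raw value is unreachable until a validator has passed it (all names here are our own):

    #include <optional>
    #include <string>
    #include <utility>

    template <typename T>
    class Tainted {
    public:
        explicit Tainted(T v) : v_(std::move(v)) {}

        // The only way to extract the value is through an explicit check,
        // so "used untrusted data without validating it" cannot compile.
        template <typename Validator>
        std::optional<T> validated(Validator ok) const {
            if (ok(v_)) return v_;
            return std::nullopt;
        }

    private:
        T v_;
    };

    // Everything read from the network starts life tainted:
    Tainted<std::string> read_query_param(const char *name);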
Rule verification: "If you have rules in your software, you know how certain bits of it should behave, you can start to use software in some cases to verify or to validate these rules."
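At the cheap end of the spectrum the rules can be written straight into the code as assertions, which every test run then verifies; heavier tools (static analysers, model checkers) can prove the same statements without running anything. A toy instance:

    #include <cassert>

    struct Account { long balance_cents; };

    // Rule: a withdrawal never drives the balance negative.
    void withdraw(Account &a, long amount_cents) {
        assert(amount_cents >= 0 && "rule: amounts are non-negative");
        assert(a.balance_cents >= amount_cents && "rule: no overdrafts");
        a.balance_cents -= amount_cents;
        assert(a.balance_cents >= 0 && "rule: balance stays non-negative");
    }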
Good interfaces: "This is another surprisingly effective one. If you look at a lot of other businesses - if you're a car manufacturer and you find you've got a lot of faulty cars coming off the production line because someone's put a part in backwards, the first thing you do is make a new version of that part which has a knob on it or something so it won't fit backwards. That's the immediate reaction. So we've started to do this kind of thing in software. So we have things that are simple and hard to misuse."
"An example of this is, with locking, instead of having one function for taking a lock and another function for releasing the lock, which inevitably means that someone always has an error handling or an unusual case where they forget, you have a single function which calls another function locked; it takes the lock, calls the function, and drops the lock. All of a sudden it's another mistake you can't make because the computer won't let you, because fundamental to your language, fundamental to the way you're coding, is the idea that this lock must be released. And it turns out you can do a lot of these things in languages like C++ by being a bit clever."
Defensive interfaces: "Locks with corrupt flags is another example. One of the things the telco industry cares about is that systems stay up. So eventually your software crashes with somebody owning the lock - someone currently has the sole right to some critical data structure. And in this case what the telecoms people do with newer systems is that after a certain amount of time, the system has a watchdog, much like your video recorder does. If the video recorder or your DVD player crashes, it just reboots after a certain amount of time, as if nothing has happened. This is great until you've got locking, and you kill a particular part of your phone switch and it owns some critical part of the system."
"[With] defensive interfaces, I can now take a lock and I can be told, 'I'm giving you this lock, but be aware that something terrible happened to the last user of it' - which means that when you take this lock you can actually start to take defensive actions."
Mathematical models: "People have started to use mathematical models for things like defect rates. Turns out all the models exist - the large part of industry that actually makes physical objects has known about them for a considerable number of years. They tell you interesting things like when you should release software beta. Providing you've got a good estimate of the cost of finding faults yourself, and the quality of the fault finding relative to your beta testers, you can actually do the maths to tell you when you should be going into beta testing."
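A deliberately toy version of that calculation (our own numbers and framing, not a model Cox presented): go to beta once a remaining fault costs more to find in-house than it costs to handle when a beta tester hits it.

    #include <cstdio>

    int main() {
        double cost_internal = 800.0;  // assumed: QA cost to find+fix one fault
        double cost_beta     = 250.0;  // assumed: handling cost per beta report
        double beta_yield    = 0.4;    // assumed: share of faults beta finds

        // Effective cost per fault actually eliminated through beta:
        double effective_beta = cost_beta / beta_yield;

        std::printf("internal %.0f vs beta %.0f per fault -> %s\n",
                    cost_internal, effective_beta,
                    cost_internal > effective_beta ? "release the beta"
                                                   : "keep testing in-house");
    }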
Scripted debugging: "Traditionally you think of your debugger as something that you use after your software has crashed. But a debugger turns out to be very useful in quality assurance, because you have a lot of things in your software which you can't easily inspect. You can actually use a debugger as part of your QA testing to go in at the end of the run and say, are all the internal values right? Does the software appear to have behaved as we expected on the inside as well as on the outside?"
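With gdb this is just a command file run in batch mode from the test suite; the breakpoint location and variable names below are hypothetical:

    # check-internals.gdb - inspect internal state as a QA step
    break shutdown               # hypothetical function hit at end of a run
    run < test-input.txt
    print open_file_count        # internal value: should be 0 at shutdown
    print cache_hit_count        # did the run behave as expected inside?
    quit

    # invoked from the test suite as:
    #   gdb -batch -x check-internals.gdb ./myapp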
Brute force testers: "These are beta testers, and users of dot-zero versions of software, of course. And tools like CrashMe, which is one of the ones we use for Linux. And there are application level equivalents of this. The basic idea is, generate random input, feed it to the application, keep doing this until the application breaks. It's surprisingly effective. In a recent study they did this with Windows application software, feeding random Windows events to it, so effectively it simply sat there at full computer speed continuously clicking randomly, closing and opening dialog boxes, picking menu items, and typing. And about half the Windows software they subjected to this particular torture, crashed."
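The core of such a tool is remarkably small; a sketch, where parse() stands in for whatever entry point is under test:

    #include <random>
    #include <string>

    std::string random_bytes(std::mt19937 &rng, std::size_t n) {
        std::uniform_int_distribution<int> byte(0, 255);
        std::string s(n, '\0');
        for (char &c : s) c = static_cast<char>(byte(rng));
        return s;
    }

    // Keep feeding random garbage to the code under test; a crash or a
    // failed assertion means the tester has found a bug.
    void fuzz(void (*parse)(const std::string &)) {
        std::mt19937 rng(12345);    // fixed seed, so failures reproduce
        for (int i = 0; i < 100000; ++i)
            parse(random_bytes(rng, 64));
    }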
Root cause analysis: "I've got a friend who works on aeroplanes, and he has the wonderful job of, when a piece of an aeroplane falls off, cracks, or something before it was supposed to, they go to him and say 'why did it happen?'. And it's then not a case of saying 'oh, this analysis is wrong', it's saying 'how did this analysis come to be wrong? How did it make this wrong decision? Where else have we made this decision?' People are starting to do this with software."
"The OpenBSD Project started doing it with security in particular, and found it very effective. Every time somebody found a mistake, they'd take the entire software base for these systems - bear in mind, working in the open source world you have a lot of source code, so it's much easier - and you look, with the aid of automated search tools, for every other occurrence of the same problem, in all your software. Because if someone's made a mistake once, we know lots of other people will have made the mistake.
"All of this sort of analysis then leads back to things like, what tools didn't we use? Are our interfaces wrong? And because you're able to actually start digging in and get data, you can start to understand not only the 'oh, it's failed, I'll fix it', sort of the car mechanic approach to software maintenance, but actually the need do the kinds of things that should be done and which go on elsewhere, where you say 'Why did this fail? Where else have we got this? Where else will it fail? What should I do proactively? How do I change the software component involved so it can't happen again, or so that it blows up on the programmer when they make the mistake, not blows up on the user when they run the software?".

Document trails: "I've worked for several large software companies before I worked for Red Hat, and trying to answer questions like 'Who wrote the first version of this software?' and 'What other code is this function in?' can be interesting."
"So you're looking at an ISDN router and you say, that's a security hole. And you have no idea where else this code appears in your company's product line. So you have no ability to test all the cases. Someone has to test each one individually, and possibly get it wrong, possibly find the code. So document trails are also a big help; where did this code come from, where is it going, what things do we know programmers get wrong with it? Actually carrying the documentation around with this software not only makes you get the documentation right so you can tell the programmer, by the way, people always get this wrong, but more importantly, you can fix it so they can't get it wrong. Because after all, programmers don't read documentation - you know that."
Rigorous reviews: "The effect of having to explain it to a second person is sometimes truly startling, as people try to explain what the code is doing and then realise that what they've written doesn't do the same thing."
Statistics: "And the final one which turns out to be really useful is statistics. Because if you've got enough copies of a piece of software out there, you can actually do statistical analysis, and so we've been doing this now with Linux, and you can start asking questions like: is there a 90% probability that all of these mysterious crashes with this kind of pattern happened on a machine with a particular physical device, like a particular SCSI controller? Did 90% of them happen on a machine with a USB keyboard? We've actually pinned down hardware problems in this way - in one case we managed to pin down a fault in a particular brand of disk drive, because we looked at it and realised it was directly correlated with this particular make of disk. And we went to the disk vendor, who ignored us, and eventually enough Windows people hit the problem that Microsoft went to the disk vendor, whereupon it got fixed."
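The analysis itself can start as simply as counting which hardware co-occurs with a crash signature; a sketch with hypothetical report fields:

    #include <map>
    #include <string>
    #include <vector>

    struct CrashReport {
        std::string signature;              // e.g. the oops/backtrace pattern
        std::vector<std::string> devices;   // hardware present on the machine
    };

    // For one crash signature, count how often each device appears. A
    // device present in ~90% of matching reports is the prime suspect.
    std::map<std::string, int>
    suspects(const std::vector<CrashReport> &reports, const std::string &sig) {
        std::map<std::string, int> counts;
        for (const auto &r : reports)
            if (r.signature == sig)
                for (const auto &d : r.devices)
                    ++counts[d];
        return counts;
    }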

Source: http://www.pingwales.co.uk/2004/10/07/Cox-on-better-software.html


This post comes from a ChinaUnix blog; for the original, see: http://blog.chinaunix.net/u/6165/showart_67687.html