Linguistic method
From Wikipedia, the free encyclopedia
In software engineering, the linguistic method is an approach to developing software based on awareness of formal computer science and information theory. It requires a number of issues to be simultaneously understood: Speed, Redundancy, Language, and Data.
Contents |
[edit] Speed
An algorithm can be thought of as an information channel with input, output, and bandwidth. The individual operations of the algorithm process quantifiable amounts of information, and high performance can be achieved by maximizing the information yield of individual operations. Also, simple diagnostics in large software can easily identify those parts of programs that waste cycles.
The simplest and most effective tool for diagnosing performance problems in single-thread software is a debugger that can instantly halt the program when the programmer strikes a key. Unlike a breakpoint, it happens at a random time in the execution of the program. When the program is halted, the programmer examines state of the program, including the call stack, and asks "Why was the program doing this particular thing?". This is repeated several times, up to 20.[clarify] If there is some program construct responsible for a significant part of the execution time, it will be seen, and can be dealt with. Such things are not usually "hotspots" (tight inner loops where the program counter is often found). More commonly, subroutines are being called excessive numbers of times, doing repeated unnecessary calculation. As the typical calling depth of subroutines increases, the time spent tends to increase exponentially, because the opportunity for extra calls exists at every level.
[edit] Redundancy
Information theory deals with the subject of redundancy in error-correcting codes. Errors in information transmission can be detected and corrected if sufficient extra bits are added to the code, and if there is a mechanism that uses them for detecting and correcting errors. To change a valid code word A into a valid code word B requires changing some number N of bits. If fewer than N bits are changed, the result is a code word that is not correct.
In a program, information is stored in data structures. Generally, if data structures are being built or modified, multiple instructions or statements must be executed. Changing valid data structure A into valid data structure B may require N modifications to the data structure. If fewer than N modifications are done, the data structure is invalid. So data structure has redundancy, and the number N is a measure of it. In the same way, program source code has redundancy. If valid program A requires N source code editing operations (say, insertion or deletion of statements) to be changed to valid program B, and fewer than N editing operations are done, the result is an incorrect program.
In the programming realm, it is important to minimize redundancy (i.e. the number of steps N needed to make changes correctly) in both data structure and in program source code. By contrast, redundancy in error-correcting codes is carefully added as part of a thought-out scheme for detecting or correcting errors. (Without such a scheme, redundancy does not help; if it is large it even hurts because it increases the exposure for errors to occur.) As a seldom-achievable ideal, if N can be reduced to 1, errors due to incomplete changes can be eliminated, and it could be suspected that most errors are of this type. When redundancy cannot be eliminated, techniques that help manage it should be considered, such as Differential Execution.
[edit] Language
If a language is understood to be not only the syntax, but the vocabulary available to the programmer, then computer languages like C, Fortran, C++, etc. are only base languages. The language that the programmer works in consists of this, plus all the classes, variables, subroutines, methods, macros, etc. that have been defined as part of the project. If these things are defined in a very different way, it makes a huge difference in the actual surface language in which the programmer works.
So, programmers are language designers, and the languages design are targeted at the problems they are solving. It is beneficial for programmers to understand how to design their languages so that, when requirements change, those changes can be implemented with the smallest possible number of editing operations (N). (Certainly, one can hope for the best that all requirements will be specified ahead of time, but one should also plan for the worst, that requirements will change up until the day of completion.)
When languages are designed so as to minimize N for the class of problems at hand, they become "problem-oriented languages". One characteristic of such languages is that they may have dramatically less explicit data structure. There may or may not be plenty of data structure underneath, but once the language is near completion the programmer does not need to refer to it. Another characteristic is that they become "declarative" rather than "imperative". The source code tends to state the problem, rather than the solution. The solution is inherent in the work that goes into building the language because the language solves all problems capable of being stated in it.
[edit] Data
Data exists to be read by programs that do things with it. The complexity of the program may vary, from simple, such as reading the data and copying it or adding up the numbers in it, to complex, like an interpreter reading the data and using it as instructions to carry out complex activities. Nevertheless, data controls to some extent the behavior of the programs that read it. The behavior of the programs can be classified according to what kind of temporary storage they need, just as automata are classified according to the temporary storage they need, whether it be a fixed set of registers, a pushdown stack, or arbitrary storage like a Turing machine tape. Thus the distinction between data and program is only a matter of perspective, just as in Einstein's physics the distinction between matter and energy is only a matter of perspective.[dubious ][vague] Therefore, when data structure is being designed, it is equally correct to say a language is being designed, and vice versa. This may seem a vacuous observation until it is seen that it opens up a spectrum of ways to represent information and process it.
[edit] Methodology
Below is a rough outline of a methodology for applying the linguistic method. It consisting of asking a series of questions:
- What information needs to be worked with, and what temporary storage is needed to work with it? This identifies where the information falls in the spectrum from simple to complex processing.
- When is the information acquired? Is it acquired at development time, installation time, or run time? This determines the options for representing the information as source code, installation script, or run-time data structure.
- Who provides the information? This may put constraints on how the information is represented.
Answers to these questions determine the space of possible implementation approaches, from which a good one can be chosen.
[edit] Characteristics
- Minimum data structure
- Explicit run-time data structure (i.e. data structure that programmers need to code to) is minimized. This minimizes redundancy of the source code, because having to manipulate explicit data structure is a major contributor to source code.
- Maximum performance
- Since one option for representing information is to represent it as compiled source code, it can be processed at high speed in that case. Even if the information must be represented in run-time data structure, programs that operate on it may dispense with overly general techniques that sap cycles. For example, programs have made use of hash tables classes, complete with storage allocation and freeing, for lists where the expected number of members is fewer than 10.
- Minimize defects
- When the surface language minimizes redundancy for the problem domain, the exposure to programmer errors in implementing requirements is minimized. This not only minimizes defects, it speeds implementation.
[edit] Comparison to object oriented programming
The linguistic method is not in opposition to object oriented programming (OOP), but can use OOP as one of multiple implementation techniques. The tension is with the idea that OOP is such a fundamental good idea that all further growth in software engineering must be based on it. In practice, when OOP is used as the thinking method for software design, it tends to result in great numbers of classes, lots of data structure, and correspondingly massive code (large N, both for data and for source code). Such code, due to many layers of excessively general abstractions implemented by method (subroutines) tends to result in performance problems.
By contrast, if OOP is used as a useful technique instead of the end-all concept, it can be used well by being used sparingly. The essential difference is that, rather than thinking in terms of the objects that exist at run-time, a run-time perspective, in the Linguistic Method the thinking is explicitly about the surface form of the source language, i.e. a development-time perspective. One does not ask "What kinds of objects will there be?" but "What do we need to say?".
[edit] References
- Dunlavey, "Building Better Applications: a Theory of Efficient Software Development" International Thomson Publishing ISBN 0-442-01740-5, 1994.