Preprocessing

From Wikipedia, the free encyclopedia

Preprocessing is the act of processing data before it is parsed. In many situations it makes sense to handle input in several stages rather than in a single pass. One such situation is where humans are the parsers; another is computer programming. The term is also used in data mining and neural network software, where it refers to the stage before a model is constructed, when data is analyzed, transformed, and filtered.

Computer programming

As the name suggests, preprocessing is performed by a preprocessor. The preprocessor modifies the data according to preprocessing directives that are usually placed in the input data itself. For instance, in the C programming language, where preprocessing directives are marked with a '#' at the beginning of the line, the preprocessor can be used to implement macros, to include external files at different points in the file, or to select the blocks of code that are sent to the compiler. The selection criteria can include the processor type (e.g. to resolve integer representation problems), the availability of function calls (so that a work-around can be provided if one is missing), and user preferences.

When a preprocessor supports macros, each call to a macro in the code is replaced by the full body of the macro before the code is sent to the compiler. Because no function call takes place at run time, this can be useful where speed is more important than the size of the binary code. It is also useful when you need something that expands to more than a function call could express, for instance a case block.

Preprocessing is useful for solving portability issues: depending on the target platform (which is specified to the preprocessor by a command-line argument), the application will contain platform-specific code. For instance, when compiled for Linux, the application would read its configuration options from a file named .conf, whereas when compiled for Microsoft Windows, it would read the configuration from the registry.
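The directive-driven selection and macro substitution described above can be illustrated with a toy, line-oriented preprocessor. This is a minimal sketch in Python, not the C preprocessor: it supports only a simplified `#define`, `#ifdef`, `#else` and `#endif`, and the function name `preprocess`, the symbol `LINUX` and the macro `CONFIG_NAME` are hypothetical names chosen for the example.

```python
import re


def preprocess(text, defined=None):
    """Toy preprocessor: handles #define, #ifdef, #else and #endif only."""
    macros = dict(defined or {})  # symbols supplied from outside, like a command-line option
    output = []
    active = [True]  # stack: is each enclosing conditional block being emitted?
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#ifdef "):
            name = stripped.split()[1]
            active.append(active[-1] and name in macros)
        elif stripped == "#else":
            current = active.pop()
            # The else branch is emitted only if the parent is active
            # and the ifdef branch was not.
            active.append(active[-1] and not current)
        elif stripped == "#endif":
            active.pop()
        elif stripped.startswith("#define "):
            if active[-1]:
                parts = stripped.split(None, 2)
                macros[parts[1]] = parts[2] if len(parts) > 2 else ""
        elif active[-1]:
            # Replace whole-word occurrences of every known macro name.
            for name, value in macros.items():
                pattern = r"\b" + re.escape(name) + r"\b"
                line = re.sub(pattern, lambda match, v=value: v, line)
            output.append(line)
    return "\n".join(output)


source = """#define CONFIG_NAME settings
#ifdef LINUX
read_file(CONFIG_NAME)
#else
read_registry(CONFIG_NAME)
#endif"""

print(preprocess(source, {"LINUX": "1"}))  # prints: read_file(settings)
print(preprocess(source))                  # prints: read_registry(settings)
```

Only the branch that matches the supplied symbols reaches the output, so the later stage (the compiler, in the C case) never sees the code for the other platform.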

Data mining

In data mining, preprocessing refers to data preparation, in which you select variables and samples, construct new variables, and transform existing ones.

Variable selection

Ideally, you would take all the variables/features you have and use them as inputs. In practice, this doesn’t work very well. One reason is that the time it takes to build a model increases with the number of variables. Another reason is that blindly including columns can lead to incorrect models. Often your knowledge of the problem domain can let you make many of these selections correctly. For example, including ID numbers as predictor variables will at best have no benefit and at worst may reduce the weight of other important variables.
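As a minimal sketch of this kind of selection, the following Python function drops columns that the analyst names explicitly (such as ID numbers) and columns that hold a single constant value. The function name `select_variables`, the column names and the sample rows are all made up for illustration; real projects usually combine such rules with further domain knowledge.

```python
def select_variables(rows, exclude=()):
    """Return the column names worth keeping as model inputs.

    rows    -- list of dicts, one dict per sample
    exclude -- column names ruled out by domain knowledge (e.g. ID numbers)
    """
    kept = []
    for name in rows[0].keys():
        if name in exclude:
            continue  # the analyst knows this column has no predictive meaning
        values = [row[name] for row in rows]
        if len(set(values)) <= 1:
            continue  # a constant column cannot help to distinguish samples
        kept.append(name)
    return kept


# Hypothetical data set: customer_id is an identifier, country never varies.
rows = [
    {"customer_id": 101, "age": 34, "country": "NO", "income": 52000},
    {"customer_id": 102, "age": 51, "country": "NO", "income": 61000},
    {"customer_id": 103, "age": 27, "country": "NO", "income": 39000},
]

print(select_variables(rows, exclude={"customer_id"}))  # prints: ['age', 'income']
```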

Sample selection

As with variables, ideally you would want to make use of all the samples you have. In practice, however, samples may have to be removed to produce good results. You usually want to discard data points that are clearly outliers. In some cases you might want to include them in the model, but often they can be ignored based on your understanding of the problem. In other cases you may want to emphasize certain aspects of your data by duplicating samples.
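Both operations can be sketched in a few lines of Python. The outlier rule used here, a modified z-score based on the median absolute deviation with a cut-off of 3.5, is one common convention and an assumption of this example, not something the text prescribes. The function names and the sample values are likewise hypothetical.

```python
import statistics


def remove_outliers(values, threshold=3.5):
    """Drop values whose modified z-score exceeds the threshold."""
    median = statistics.median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return list(values)  # no spread to measure against; keep everything
    return [v for v in values if abs(0.6745 * (v - median) / mad) <= threshold]


def emphasize(samples, is_important, copies=2):
    """Duplicate the samples selected by is_important so they weigh more."""
    result = []
    for sample in samples:
        result.extend([sample] * (copies if is_important(sample) else 1))
    return result


readings = [10.1, 9.8, 10.3, 10.0, 9.9, 55.0]  # made-up measurements
print(remove_outliers(readings))  # prints: [10.1, 9.8, 10.3, 10.0, 9.9]
print(emphasize([1, 2, 3], lambda s: s == 2, copies=3))  # prints: [1, 2, 2, 2, 3]
```

Whether a flagged value should really be discarded remains a judgment about the problem domain; the code only automates the mechanical part.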

Variable construction

It is often desirable to create additional inputs based on raw data. For instance, forecasting demographics using the GDP per capita ratio, rather than GDP and population as separate inputs, can yield better results. While in theory many adaptive systems can learn such relationships autonomously, in practice helping the system by incorporating external knowledge can make a big difference.
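The GDP per capita example can be sketched as follows. The field names `gdp`, `population` and `gdp_per_capita`, and the figures in the sample records, are invented for illustration only.

```python
def add_gdp_per_capita(records):
    """Return copies of the records with a derived gdp_per_capita field."""
    result = []
    for record in records:
        enriched = dict(record)  # copy, so the raw data stays untouched
        enriched["gdp_per_capita"] = record["gdp"] / record["population"]
        result.append(enriched)
    return result


# Made-up figures, chosen only to keep the arithmetic easy to check.
countries = [
    {"name": "A", "gdp": 1000.0, "population": 50},
    {"name": "B", "gdp": 1000.0, "population": 200},
]

for row in add_gdp_per_capita(countries):
    print(row["name"], row["gdp_per_capita"])
```

The two hypothetical countries have the same GDP but very different ratios (20.0 and 5.0), which is the information the constructed variable makes explicit to the model.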

Variable transformation

Data often needs to be transformed in one way or another before it can be put to use in a system. Typical transformations are scaling and normalization, which put the variables in a restricted range, a necessity for efficient and precise numerical computation.
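Two common forms of these transformations are min-max scaling, which maps a variable onto the range 0 to 1, and standardization, which shifts it to zero mean and unit standard deviation. The sketch below uses only the Python standard library; the function names are hypothetical, and mapping a constant column to zeros is a choice made for this example.

```python
import statistics


def min_max_scale(values):
    """Linearly map values onto the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Shift and scale values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    if spread == 0:
        return [0.0 for _ in values]
    return [(v - mean) / spread for v in values]


print(min_max_scale([2.0, 4.0, 6.0]))  # prints: [0.0, 0.5, 1.0]
print(standardize([2.0, 4.0, 6.0]))
```

In practice the minimum, maximum, mean and standard deviation are computed on the data used to build the model and then reused unchanged for any new data, so that all inputs are transformed consistently.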
