Quex

Developer(s): Dr.-Ing. Frank-Rene Schäfer
Stable release: 0.59.6 / July 14, 2011
Operating system: Cross-platform
Type: Lexical analyzer generator
License: LGPL
Website: quex.sourceforge.net

Quex is a lexical analyzer generator that creates C and C++ lexical analyzers. Significant features include the ability to generate lexical analyzers that operate on Unicode input, the creation of direct-coded (non-table-based) lexical analyzers, and the use of inheritance relationships between lexical analysis modes.

Features

Direct coded lexical analyzers

Quex uses the traditional steps of lexical analyzer generation: Thompson construction to create nondeterministic finite automata from the regular expressions, subset construction to convert them into a deterministic finite automaton, and Hopcroft optimization to reduce the number of states to a minimum. These mechanisms, though, have been adapted to deal with character sets rather than single characters, which reduces the computation time significantly. Since the Unicode character set consists of many more code points than plain ASCII, these optimizations are necessary in order to produce lexical analyzers in a reasonable amount of time.

Instead of constructing a table-based lexical analyzer, in which the transition information is stored in a data structure, Quex generates C/C++ code that performs the transitions directly. Direct coding creates lexical analyzers that structurally resemble typical hand-written lexical analyzers more closely than table-based lexers do. Direct-coded lexers also tend to perform better than analogous table-based lexical analyzers.
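As an illustration of what direct coding means, the following hand-written C++ sketch (not actual Quex output) implements a recognizer for the pattern [0-9]+ as labeled states connected by conditional jumps; a direct-coded analyzer has this kind of structure, whereas a table-based one loops over a transition table indexed by state and input character.

#include <cstddef>
#include <cstdio>

// Illustrative sketch only, not code generated by Quex: a direct-coded
// recognizer for the regular expression [0-9]+. Each DFA state becomes a
// label; transitions become conditional jumps instead of table lookups.
std::size_t match_digits(const char* p)
{
    const char* begin = p;

STATE_0:                                    // initial state: requires at least one digit
    if (*p >= '0' && *p <= '9') { ++p; goto STATE_1; }
    return 0;                               // no match

STATE_1:                                    // accepting state: absorb further digits
    if (*p >= '0' && *p <= '9') { ++p; goto STATE_1; }
    return static_cast<std::size_t>(p - begin);    // length of the match
}

int main()
{
    std::printf("%zu\n", match_digits("1234+"));   // prints 4
    return 0;
}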

Unicode input alphabets

Quex can handle input alphabets that contain the full Unicode code point range (0 to 10FFFFh). This is complemented by the ability to use Unicode property expressions inside regular expressions. For example, Unicode code points with the binary property XID_Start can be matched with the expression \P{XID_Start} or \P{XIDS}. Quex can also generate code that calls iconv or ICU to perform character conversion. Quex relies directly on the databases as they are delivered by the Unicode Consortium; updating to a new release of the standard consists only of copying the corresponding database files into the respective directory of the Quex installation.
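As a sketch of how such property expressions can be used, a mode matching Unicode identifiers might be written as follows (the mode name and token ID are placeholders; the XID_Continue property is assumed in addition to the XID_Start property mentioned above):

mode IDENTIFIER {
    // An identifier: a code point with the XID_Start property followed
    // by code points with the XID_Continue property.
    \P{XID_Start}\P{XID_Continue}*   => QUEX_TKN_IDENTIFIER(Lexeme);
}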

Lexical analysis modes

Like traditional lexical analyzers (e.g. Lex and Flex), Quex supports multiple lexical analysis modes in a lexer. In addition to pattern actions, Quex modes can specify event actions: code to be executed on events such as entering or exiting a mode, or whenever a match is found. Quex modes can also be related by inheritance, which allows modes to share common pattern and event actions.
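Both facilities can be sketched as follows, using the conventions of the example further below (the mode names, token IDs, and the on_entry/on_exit handlers are illustrative placeholders):

mode BASE {
    on_entry { /* executed when a mode derived from BASE is entered */ }
    on_exit  { /* executed when such a mode is left */ }

    "#"[^\n]*   => QUEX_TKN_COMMENT(Lexeme);   // pattern shared via inheritance
}

mode PROGRAM : BASE {                          // PROGRAM inherits from BASE
    "if"        => QUEX_TKN_IFSYM;
    "while"     => QUEX_TKN_WHILESYM;
}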

Sophisticated buffer handling

Quex provides sophisticated mechanisms for buffer handling and reload that are at the same time efficient and flexible. Quex provides interfaces that allow users to plug in virtually any character set converter. The converters are activated only on demand, that is, when new buffer filling is required. By default, Quex can plug in the iconv library. By means of this backbone Quex is able to analyze a huge set of character encodings.
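The general idea can be sketched as follows; this is plain illustrative C++ and not Quex's actual buffer-filler interface, whose type and function names differ:

#include <cstddef>
#include <cstdint>

// Illustrative only: a pluggable converter used by an on-demand buffer filler.
struct Converter {
    virtual ~Converter() {}
    // Convert raw input bytes into code points; return the number of code points written.
    virtual std::size_t convert(const char* raw, std::size_t raw_n,
                                std::uint32_t* out, std::size_t out_n) = 0;
};

struct BufferFiller {
    Converter* converter;
    explicit BufferFiller(Converter* c) : converter(c) {}

    // Called only when the analyzer has exhausted the current buffer content.
    std::size_t fill(const char* raw, std::size_t raw_n,
                     std::uint32_t* buffer, std::size_t buffer_n)
    {
        return converter->convert(raw, raw_n, buffer, buffer_n);
    }
};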

Example

Quex follows the syntax of the classical tools lex and flex for the description of regular expressions. The example given in the article on Flex can be translated into Quex source code as follows:

header {
   #include <cstdlib>  // for 'atoi()'
   #include <cstring>  // for 'strdup()'
}
 
define {
   digit     [0-9]
   letter    [a-zA-Z]
}
 
mode X :
<skip:     [ \t\n\r]> 
{
    "+"                  => QUEX_TKN_PLUS;       
    "-"                  => QUEX_TKN_MINUS;      
    "*"                  => QUEX_TKN_TIMES;      
    "/"                  => QUEX_TKN_SLASH;      
    "("                  => QUEX_TKN_LPAREN;     
    ")"                  => QUEX_TKN_RPAREN;     
    ";"                  => QUEX_TKN_SEMICOLON;  
    ","                  => QUEX_TKN_COMMA;      
    "."                  => QUEX_TKN_PERIOD;     
    ":="                 => QUEX_TKN_BECOMES;    
    "="                  => QUEX_TKN_EQL;        
    "<>"                 => QUEX_TKN_NEQ;        
    "<"                  => QUEX_TKN_LSS;        
    ">"                  => QUEX_TKN_GTR;        
    "<="                 => QUEX_TKN_LEQ;        
    ">="                 => QUEX_TKN_GEQ;        
    "begin"              => QUEX_TKN_BEGINSYM;   
    "call"               => QUEX_TKN_CALLSYM;    
    "const"              => QUEX_TKN_CONSTSYM;   
    "do"                 => QUEX_TKN_DOSYM;      
    "end"                => QUEX_TKN_ENDSYM;     
    "if"                 => QUEX_TKN_IFSYM;      
    "odd"                => QUEX_TKN_ODDSYM;     
    "procedure"          => QUEX_TKN_PROCSYM;    
    "then"               => QUEX_TKN_THENSYM;    
    "var"                => QUEX_TKN_VARSYM;     
    "while"              => QUEX_TKN_WHILESYM;   
 
    {letter}({letter}|{digit})* => QUEX_TKN_IDENT(strdup(Lexeme));
    {digit}+                    => QUEX_TKN_NUMBER(atoi(Lexeme));
    .                           => QUEX_TKN_UNKNOWN(Lexeme);
}

The brief token senders via the "=>" operator set the token ID of a token object to the token ID that follows the operator. The arguments inside the brackets are used to set the contents of the token object. Note that skipping whitespace can be achieved via skippers, which are optimized to pass over specific character sets quickly (see the "<skip: ...>" tag). For more sophisticated token actions, C-code sections can be provided, such as

    ...
    {digit}+       {
           if( is_prime_number(Lexeme) )     ++prime_number_counter;
           if( is_fibonacci_number(Lexeme) ) ++fibonacci_number_counter;  
           self.send(QUEX_TKN_NUMBER(atoi(Lexeme)));
    }
    ...

which might be used to gather statistics about the numbers that occur in the analyzed code.
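On the host side, the generated analyzer delivers the sent tokens through a receive loop. The following C++ sketch shows the typical usage pattern; the engine name MyLexer, the generated header name, and the exact receive interface (including the TERMINATION token) follow Quex's documented conventions but are assumptions that depend on the Quex invocation and version:

#include "MyLexer"   // placeholder: header generated by Quex for an engine named 'MyLexer'

int main(int argc, char** argv)
{
    // Placeholder names: the generated analyzer class is named after the engine.
    quex::MyLexer  qlex(argc > 1 ? argv[1] : "example.txt");
    quex::Token*   token_p = 0;

    do {
        qlex.receive(&token_p);   // assumed interface: makes 'token_p' point to the next token
        /* ... inspect token_p->type_id() and the token contents here ... */
    } while( token_p->type_id() != QUEX_TKN_TERMINATION );

    return 0;
}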
