Obfuscated code

From Wikipedia, the free encyclopedia

Obfuscate redirects here; for the Discipline from the "Vampire: The Masquerade"/"World of Darkness" fictional setting please see Discipline (World of Darkness)#Obfuscate.

Obfuscated code is source code that is (usually intentionally) very hard to read and understand. Some languages are more prone to obfuscation than others. C, C++ and Perl are most often cited as easily obfuscatable languages. Macro preprocessors are often used to create hard to read code by masking the standard language syntax and grammar from the main body of code. The term shrouded code has also been used.

There are also programs known as obfuscators that may operate on source code, object code, or both, for the purpose of deterring reverse engineering.

Obfuscating code to prevent reverse engineering is typically done to manage risks that stem from unauthorized access to source code. These risks include loss of intellectual property, ease of probing for application vulnerabilities and loss of revenue that can result when applications are reverse engineered, modified to circumvent metering or usage control and then recompiled. Obfuscating code is, therefore, also a compensating control to manage these risks. Although reverse engineering always existed in computer software, in recent times it has become a significant problem. The exposure is far greater in newer computing environments such as Java and Microsoft's .NET.[citation needed]

Another definition of obfuscating code is as a development process included into a larger development lifecycle process. In addition to obfuscating code, the obfuscation process includes well-defined steps and technologies for debugging, managing distributed development, patch management and third party library integration and other tasks that complete the obfuscation process. This definition is required to ensure precision and clarity around the net reduction in risk that obfuscated code provides by accounting for and minimizing the potential for adding complexity, cost or resource requirements.

Yet another definition of obfuscating code involves reducing the size of Java class files. When building MIDlets using Java ME, an obfuscator can be used to reduce the size of the JAR file.

Contents

[edit] Recreational obfuscation

Code is sometimes obfuscated deliberately for recreational purposes. There are programming contests which reward the most creatively obfuscated code: the International Obfuscated C Code Contest, Obfuscated Perl Contest, International Obfuscated Ruby Code Contest and Obfuscated PostScript Contest.

There are many varieties of interesting obfuscations ranging from simple keyword substitution, use/non-use of whitespace to create artistic effects, clever self-generating or heavily compressed programs, or programs that are valid and operate similarly in multiple programming languages.

Short obfuscated Perl programs printing "Just another Perl hacker" or something similar are often found in signatures of Perl programmers. These are informally known as JAPHs, and the origin of this practice is generally credited to Randal Schwartz.

[edit] Examples

Take this infamous example from Internet lore:


#include <stdio.h>
main(t,_,a)char *a;{return!0<t?t<3?main(-79,-13,a+main(-87,1-_,
main(-86,0,a+1)+a)):1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l,+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1)
  :0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);}

Although unintelligible at first glance, it is a legal C program which when compiled and run will generate the 12 verses of The 12 Days of Christmas. It actually contains all the strings required for the poem in an encoded form inlined in the code. The code then iterates through the 12 days displaying what it needs to.

Another example is a program's source listing that was formatted to resemble an empty tic-tac-toe board. Each pass through the program modified the sourcecode to show a turn in the game, to be executed for the next move.

Yet another example is this short program that generates mazes of arbitrary length:


char*M,A,Z,E=40,J[40],T[40];main(C){for(*J=A=scanf(M="%d",&C);
--            E;             J[              E]             =T
[E   ]=  E)   printf("._");  for(;(A-=Z=!Z)  ||  (printf("\n|"
)    ,   A    =              39              ,C             --
)    ;   Z    ||    printf   (M   ))M[Z]=Z[A-(E   =A[J-Z])&&!C
&    A   ==             T[                                  A]
|6<<27<rand()||!C&!Z?J[T[E]=T[A]]=E,J[T[A]=A-Z]=A,"_.":" |"];}

Note the shape of the corridors in the program. Modern C compilers don't allow constant strings to be overwritten, which can be avoided by changing the first line to


char M[2],A,Z,E=40,J[40],T[40];main(C){for(*J=A=scanf("%d",&C);

Versions of gcc (the GNU Compiler for C) prior to 4.0 can compile the program in its original form if the -fwritable-strings flag is used.

A famous example for Perl is


@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print

This slowly displays the text "Just another Perl / Unix hacker", multiple characters at a time, with delays. An explanation can be found here.

Some Python examples can be found in the official Python programming FAQ.

Yet another famous C example is:


_(__,___,____){___/__<=1?_(__,___+1,____):!(___%__)?_(__,___+1,0):___%__==___/
__&&!____?(printf("%d\t",___/__),_(__,___+1,0)):___%__>1&&___%__<___/__?_(__,1+
___,____+!(___/__%(___%__))):___<__*__?_(__,___+1,____):0;}main(){_(100,0,0);}

This program prints out the prime numbers less than 100. While this program is extremely difficult to follow, it is the result of several simple transformations from the following elementary program:


void primes(int cap) {
  int i, j, composite;
  for(i = 2; i < cap; i++) {
    composite = 0;
    for(j = 2; j < i; j++) 
      composite += !(i % j);
    if(!composite)
      printf("%d\t", i);
  }
}

int main() { 
  primes(100);
}

The first set of transformations aim to reduce the primes function into a single statement: (1) the meat of the function is rewritten as a single while loop with a sequence of cascading if-else structures inside of it; (2) the while loop is then rewritten as recusion; (3) the cascading if-else statements is rewritten as a single conditional statement. These transformations serve to obfuscate the code considerably. The remaining transformations purposefully hide the details of the function by: (4) removing all intermediate variables; (5) renaming the few variables left to _, ___, etc.; and, finally, (6) removing all whitespace and unnecessary parentheses.


The first transformation takes advantage of the well-known property that any simple function can be transformed into a single while loop followed by a series of cascading if-else statements. The program above is first transformed by rewriting the primes function in this form:


void primes(int cap) { 
  int i, j, composite, t = 0;
  while(t < cap * cap) {
    i = t / cap;
    j = t++ % cap;
    if(i <= 1);
    else if(j == 0)
      composite = 0;
    else if(j == i && !composite)
      printf("%d\t",i);
    else if(j > 1 && j < i)
      composite += !(i % j);  
  }
}

int main() {
  primes(100);
}

Another well-known property is that any while loop can be replaced by a series of recursive calls. The program is further transformed by replacing the while loop with recursion. Note that this requires adding two parameters to the primes function's header. Also note the consolidation of two statements into a list in the third if block. Finally note that one additional if-else block must be added to capture the situation in the non-recursive version where program flow fails all of the if-conditions and the while loop would simply continue with t incremented by one:


void primes(int cap, int t, int composite) {
  int i,j;
  i = t / cap;
  j = t % cap;
  if(i <= 1)
    primes(cap,t+1,composite);
  else if(j == 0)
    primes(cap,t+1,0);
  else if(j == i && !composite)
    (printf("%d\t",i), primes(cap,t+1,0));
  else if(j > 1 && j < i)
    primes(cap,t+1, composite + !(i % j));
  else if(t < cap * cap)
    primes(cap,t+1,composite);
}

int main() {
  primes(100,0,0);
}

The program is next transformed by changing the variable names to single letters and replacing the if-else structure with conditional statements (e.g. if(A) B else if(C) D else E becomes A ? B : C ? D : E):


void primes(int m, int t, int c) {
  int i,j;
  i = t / m;
  j = t % m;
  (i <= 1) ? primes(m,t+1,c) : (j == 0) ? primes(m,t+1,0) : (j == i && !c) ? 
  (printf("%d\t",i), primes(m,t+1,0)) : (j > 1 && j < i) ? 
  primes(m,t+1,c + !(i % j)) : (t < m * m) ? primes(m,t+1,c) : 0;
}

int main() {
  primes(100,0,0);
}

Next the program is transformed by removing the intermediate variables i and j and replacing them with (t / m) and (t % m), respectively:


void primes(int m, int t, int c) {
  ((t / m) <= 1) ? primes(m,t+1,c) : !(t % m) ? primes(m,t+1,0) : 
  ((t % m)==(t / m) && !c) ? (printf("%d\t",(t / m)), primes(m,t+1,0)) : 
  ((t % m)> 1 && (t % m) < (t / m)) ? primes(m,t+1,c + !((t / m) % (t % m))) : 
  (t < m * m) ? primes(m,t+1,c) : 0;
}

int main() {
  primes(100,0,0); 
}

Next the program is transformed by renaming the function primes and the variables m, t, and c to _, __, ___, and ____, respectively:


void _(int __, int ___, int ____) {
  ((___ / __) <= 1) ? _(__,___+1,____) : !(___ % __) ? _(__,___+1,0) : 
  ((___ % __)==(___ / __) && !____) ? (printf("%d\t",(___ / __)), 
  _(__,___+1,0)) : ((___ % __) > 1 && (___ % __) < (___ / __)) ? 
  _(__,___+1,____ + !((___ / __) % (___ % __))) : (___ < __ * __) ? 
  _(__,___+1,____) : 0;
} 

int main() {
  _(100,0,0); 
}

Finally, the program is transformed by removing white space, type declarations, and unambiguous parentheses to come up with the fully obfuscated version:


_(__,___,____){___/__<=1?_(__,___+1,____):!(___%__)?_(__,___+1,0):___%__==___/
__&&!____?(printf("%d\t",___/__),_(__,___+1,0)):___%__>1&&___%__<___/__?_(__,1+
___,____+!(___/__%(___%__))):___<__*__?_(__,___+1,____):0;}main(){_(100,0,0);}

This program runs on most systems. The limiting factor generally will be stack overflow because setting the equivalent of the cap parameter to 100 generally results in 100^2 recursive calls. This sort of obfuscation by program transformation is relatively easy to apply and can be performed on many simple programs.

[edit] Obfuscation by code morphing

Obfuscation by code morphing refers to obfuscating machine language or object code rather than obfuscating the source code.

This is achieved by completely replacing a section of the compiled code with an entirely new block that expects the same machine state when it begins execution as the previous section, and will leave with the same machine state after execution as the original. However, a number of additional operations will be completed as well as some operations with an equivalent effect.

Code morphing makes disassembly of a distributed program more difficult. However, by adding unnecessarily complicated operations, the execution time of the program is increased. For that reason, code morphing should be limited to critical portions of a program and not be used on an entire application.

Code morphing is often used in obfuscating the copy protection or other checks that a program makes to determine whether it is a valid, authentic installation, or a pirated copy.

[edit] Obfuscation in malicious software

Spammers frequently use obfuscated JavaScript or HTML code in spam messages. The obfuscated message, when displayed by an HTML-capable e-mail client, appears as a reasonably normal message -- albeit with obnoxious JavaScript behaviors such as spawning pop-up windows. However, when the source is viewed, the obfuscations make it far more difficult for investigators to discern where the links go, or what the JavaScript code does. For this same reason JavaScript obfuscation is also often used by malware authors to conceal parts of code that run browser exploits, or that redirect to pages containing exploits.

Automated JavaScript obfuscators are currently available on the market. Although the effectiveness of these tools may vary, and the channels through which they are exchanged range from the legal to the manifestly malicious, they are all created for the purpose of confounding casual viewers or inexperienced investigators.

The techniques use JavaScript's dynamic nature -- a piece of code is stored as an encrypted string, which is decrypted and evaluated. This may be done several times. Other techniques include insertion of dummy code, as well as dummy HTML links to legitimate pages.

[edit] Disadvantages of obfuscation

[edit] When used alone

No obfuscator known today provides any guarantees on the difficulty of reverse engineering, and this seems to be an inherent issue (see for example, this paper). Thus, obfuscators do not provide security of a level similar to modern encryption schemes, and should be used with other measures in tandem, in cases where security is of high importance.

Additionally, the most common software reverse engineering attacks target copy protection schemes. These schemes generally rely heavily on existing operating system procedure calls, making basic code obfuscation easily bypassed using the same tools used with unobfuscated code.

[edit] Debugging

Obfuscated code is extremely difficult to debug. Variable names will no longer make sense, and the structure of the code itself will likely be modified beyond recognition. This fact generally forces developers to maintain two builds: One with the original, unobfuscated source code that can be easily debugged, and another for release. While both builds should be tested to make sure they perform identically, the second build is generally reliably constructed from the first by an obfuscator.

Obviously this limitation does not apply to intermediate language (Java, C#, etc.) obfuscators, which generally work on compiled assemblies rather than on source code.

[edit] Portability

Obfuscated code often depends on the particular characteristics of the platform and compiler, making it difficult to manage if either change.

[edit] Conflicts with Reflection APIs

Reflection is a set of APIs in various languages that allow an object to be examined or created just by knowing its classname at run-time. Many obfuscators allow specified classes to be exempt from renaming; and it is also possible to let a class be renamed and call it by its new name. However, the former option places limits on the dynamism of code, while the latter adds a great deal of complexity and inconvenience to the system.

[edit] Obfuscating software

A vast variety of tools exists to perform or assist with code obfuscation. These include experimental research tools created by academics, hobbyist tools, commercial products written by professionals, and Open-source software. There are also deobfuscators, which do the reverse.

Software obfuscation tools include specialized obfuscators to demonstrate a relatively limited technique, more general obfuscators which attempt a more thorough obfuscation, and combined-function tools which obfuscate code as part of a larger goal such as software licensing enforcement.

[edit] Explanation

Programs written in languages such as C++ or Pascal are compiled into the machine language of a given computer before they become a program. Programmers write "source code", computers run "machine code" so this conversion is necessary. There is (generally) a one way transformation from source code to machine code. Machine code is not encrypted and is easy for anyone to see, but the format is so tedious for humans that reverse-engineering efforts are slow and painful.

Java and .NET languages (C#, Visual Basic, etc) take a different approach to compilation. They are far easier to reverse engineer because they do not compile to machine code, they compile into intermediate code.

Microsoft recommends using the Script Encoder to obfuscate the ASP files, so in case the web server is compromised, the hacker (or cracker) would be unable to find out how your ASP applications work. The Script Encoder works also on JScript and VBScript files. Note that the encoded JScript is only functional in Internet Explorer. However, in the documentation, it states that "Note that this encoding only prevents casual viewing of your code; it will not prevent the determined hacker from seeing what you've done and how.". In fact, it is indeed possible, as shown by the Windows Script Decoder, created on August 1, 2000 by Mr. Brownstone.

The Code Morphing is multilevel technology containing hundreds of unique code transformation patterns. In addition this technology includes the special layer that transforms some commands into Virtual Machine commands (like P-Code). Code Morphing turns binary code into an undecipherable mess that is not similar to normal compiled code, and completely hides execution logic of the protected code.

[edit] See also

[edit] External links