"Who can afford to do professional work for nothing? What hobbyist can put three man-years into programming, finding all bugs, documenting his product, and distributing it for free?"
- Bill Gates in his "Open Letter to Hobbyists," 1976. Excerpt from Free as in Freedom: Richard Stallman's Crusade for Free Software
The recent Microsoft Windows source code leak has raised serious concerns in security and intellectual property protection circles. Software, an intangible yet highly valuable commodity, is now present in organizations in various forms. Its theft and disassembly are more dangerous to an organization than the burglary of any replaceable asset. Theft or leakage, in part or in full, damages an organization's credibility, increases the risk that bugs in the code will be exploited, and hands competitors an unfair advantage.
It's rare that a proprietary OS's source code gets posted on the Internet, but if this worries you, how about distributing a program that retains most or all of the information present in the original source? With constant advancements in operating system architectures, we now live in a world where the major platforms, .NET and J2EE, rely on virtual machines (VMs) to execute intermediate code, which can, in principle, run on any machine. The slogan 'Write Once, Run Anywhere' sounds very attractive, but because the code is exposed through a virtual machine, various security measures need to be taken. Since the intermediate language can be disassembled back into source, your highly valued commodity is in danger. In this article, I'll discuss the potential perils in the VM arena, how virtual machines work, what code obfuscation is, how open source relates to intellectual property, and what the steps in the execution of a CLR-based program are. We'll also look at how .NET's Reflection APIs work and how to read a Portable Executable with them. Welcome to the uncharted waters of .NET.
A virtual machine, as its name suggests, emulates a hardware machine in software. Its architecture is not bound to any physical machine; instead, it is supported by an interpreter, which executes the code. A VM provides a security "sandbox" to protect the underlying resources. The idea of a VM is not new; it dates back to the mid-1960s and IBM's VM emulation work, the lineage that leads to IBM System 370 (S/370) and IBM System 390 (S/390). This chronology is discussed by Andrew Tanenbaum and William Stallings, as well as on GMU's Web site under the history of virtual machines. Knuth's famous MMIX, a 64-bit RISC machine used in his multi-volume "The Art of Computer Programming", a classic text in computer science, is another example.

Figure: Schematic Architecture of a Virtual Machine
This idea was widely and commercially publicized when Sun introduced Java with the slogan "write once, run anywhere". As a result, the JVM (Java Virtual Machine) has become the standard model of virtual machine-based execution. Having code deployed and executed on machines with disparate architectures and different processor types without any extra effort is a developer's dream come true. The idea was widely accepted, and although it was criticized for program speed and code vulnerability, Java kept thriving with its machine-independent byte code and JVM.
Code exposure is inherent in the architecture of virtual machines, whether the unit is a Java class or a .NET assembly. Code isn't compiled into machine code but into an intermediate form, which is later executed by the virtual machine. The intermediate language mnemonics retain much information about the original source and can be transformed back into something very close to it. This intermediate step can't be avoided if the code is to be portable, and it is where reverse engineering gets easier. Since Java is the established forerunner in the use of a virtual machine, a large amount of text is already available on the architecture of the JVM and on Java code obfuscation. Code obfuscation is the main topic of this article, so I'll focus on the .NET platform instead. For further reading about the JVM, refer to the references section.
In Microsoft's .NET Framework, the fundamental unit of deployment and execution is the assembly. An assembly consists of all managed types and resources combined. Managed simply means code that targets, and is executed by, the Common Language Runtime (CLR). Managed code has various benefits, for instance automated memory management, garbage collection, thread management, and type safety, but these are beyond the scope of this article. Managed code also carries the metadata that helps disassemblers reverse engineer the intermediate language code and extract the original source. The Microsoft counterpart of byte code is MSIL, or Microsoft Intermediate Language. It's Microsoft's implementation of ECMA's Common Intermediate Language (ECMA, the European Computer Manufacturers Association, is a European trade organization that issues its own standards and is a member of the ISO). The subtle differences between the two are described in Don Box's Essential .NET, which is well worth reading.
Shared Source Common Language Infrastructure (Rotor)
This discussion would not be complete without mentioning the Shared Source CLI, codenamed Rotor. Shared Source CLI is the archive of implementation source code for the ECMA C# and CLI specifications. It's a free, open source version of the .NET Framework and C# compiler distributed by Microsoft. Its supported systems include FreeBSD, Mac OS X and, of course, Windows. More information on Rotor can be found on MSDN and its release could be downloaded from .

Figure: Steps in compilation of a .NET source file.
MSIL is converted into machine code by a JIT (Just-In-Time) compiler just before the code is invoked. JIT compilation is one of the two common execution models. The other, pre-compilation, works by generating a memory image of the whole compiled source, whereas JIT compilation uses memory more economically, since only the necessary components are loaded into memory instead of the whole code. It also supports interoperability and portability by extending the code's reach to dissimilar architectures.
The following is the MSIL (IL for short) source code for the HelloWorld program.
/* IL Code for HelloWorld.exe */
.assembly extern mscorlib {}    // Reference to the core library
.assembly HelloWorld
{
    .ver 1:0:0:0    /* The Assembly Version */
}
.module HelloWorld.exe    // Hello World Module declaration

// Class declaration
.class public auto ansi HelloWorld extends [mscorlib]System.Object
{
    // Static Method declaration
    .method static void HelloWorld() cil managed
    {
        .entrypoint
        ldstr "Hello World."    // Loading the string
        call void [mscorlib]System.Console::WriteLine(string)    // Calling static method to print string
        ret
    }
}
Listing: HelloWorld.il
If you are familiar with assembly language for the 8086, IL will look pretty similar; ldstr reminds me of loading the accumulator register, and of assembly's 1-1 mapping with machine code. Here it pushes a reference to a hard-coded string onto the evaluation stack, which is then printed using the static WriteLine method of System.Console. IL can be written in any text editor and compiled using ilasm.exe, which comes with the .NET Framework. Ilasm.exe, the IL assembler, generates a PE (Portable Executable) file from MSIL source, as can be seen in the screenshot below.

Figure: Compilation of IL using ILASM.exe
A collection of command line (and GUI based) tools ships with the .NET Framework. You may find the complete list useful for reference purposes. Detailed information and the specifications of MSIL are also available at MSDN. To execute the HelloWorld.exe file created by ILASM.exe, type the filename at the command prompt as shown below.

Figure: Executing HelloWorld.exe
After reading Simon Robinson's Advanced .NET Programming, I wrote an IL program and compiled it using ILASM.exe. I found myself as excited as I was when I first used TASM (Turbo Assembler) or MASM (Microsoft Assembler), or when I coded inline assembly in Turbo C++ 3.0 in a university lab to change the monitor resolution by calling an interrupt. While ILASM is exciting, there is an ILDASM too: the IL Disassembler, the bad guy. In the next example we will see how to disassemble a VB .NET program.
Imagine an ideal world where nothing is lost in translation, where everyone speaks the same language or, to be more precise, the same second language. MSIL and the MS .NET platform are that ideal world. csc is the command line compiler for C# and vbc the one for Visual Basic .NET; both translate source code into MSIL to be executed on the CLR.
'// Importing the System namespace
Imports System

'// Declaring HelloWorld Namespace
Namespace HelloWorld
    '// Declaring HelloWorld Class
    Class HelloWorld
        Shared Sub Main()
            Console.WriteLine("HelloWorld from VB") '// Calling Static WriteLine function
        End Sub
    End Class
End Namespace
Listing: HelloWorld.vb
The simple and self-evident code above just prints the string "HelloWorld from VB" to the console. To compile this code, we use vbc.exe, which comes with the .NET Framework.

Executing HelloWorld.

This process may appear mundane, and you might be wondering what the point of this trivial exercise is. Go to Run (or the Visual Studio .NET command prompt) and type ILDASM. Provided your path is set correctly, the following utility will run.
Figure: ILDASMing the HelloWorld.Exe
On opening HelloWorld.exe, which was just generated from HelloWorld.vb, you can see that its source code is pretty much exposed. The left pane lists the namespace, class and functions, each of which can be explored in further detail. ILDASM uses different icons to represent modules and their corresponding members. With this much information retrievable from a deployment module, something that was almost gibberish in the pre-VM era, translating back to source code is entirely possible. Namespace declarations, class signatures, and function definitions can all easily be explored using ILDASM or various third party decompilers: Lutz Roeder's .NET Reflector, Salamander decompiler, and Anakrino, to name a few. A detailed listing can be found in the references section at the end of this article.
To understand the need for code obfuscation, it is also important to comprehend what metadata a Portable Executable holds and how it gets used. As stated before, the basic unit of deployment in .NET is an assembly. An assembly contains:
- A Portable Executable (mandatory)
- Any number of optional Portable Executable modules
- Any number of optional resource files
The Portable Executable format isn't new either; it has been with us since the advent of Win32. It's an extended version of Unix COFF (Common Object File Format), introduced in Unix System V. Later, the Executable and Linkable Format (ELF) superseded COFF on UNIX. Microsoft's specification of the Portable Executable and Common Object File Format is the reference for this topic. At the end, I've also provided various links for further study of the Portable Executable file format's structure, its verification and validity, vulnerabilities and formal specification.
Roughly speaking, a Portable Executable has the following layout: headers followed by the COFF body, which is further divided into the various sections shown below.

Table: CLR Module Format
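The header part of this layout can be checked by hand. The following is a minimal sketch, not part of the original article's code: it relies on the documented PE facts that a file begins with the DOS signature "MZ" and that the 4-byte little-endian value at offset 0x3C points to the "PE\0\0" signature. The class name PeSignature and the method name LooksLikePe are hypothetical, and File.ReadAllBytes assumes .NET Framework 2.0 or later.

```csharp
using System;
using System.IO;

// Hypothetical helper: checks only the two signatures, nothing else.
static class PeSignature
{
    // Returns true if the buffer starts with a DOS "MZ" header whose
    // field at offset 0x3C points at a "PE\0\0" signature.
    public static bool LooksLikePe(byte[] image)
    {
        if (image == null || image.Length < 0x40) return false;
        if (image[0] != (byte)'M' || image[1] != (byte)'Z') return false;

        // Offset of the PE signature, stored little-endian at 0x3C.
        int peOffset = image[0x3C]
                     | (image[0x3D] << 8)
                     | (image[0x3E] << 16)
                     | (image[0x3F] << 24);
        if (peOffset < 0 || peOffset > image.Length - 4) return false;

        return image[peOffset] == (byte)'P'
            && image[peOffset + 1] == (byte)'E'
            && image[peOffset + 2] == 0
            && image[peOffset + 3] == 0;
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: PeSignature <file>");
            return;
        }
        byte[] image = File.ReadAllBytes(args[0]);
        Console.WriteLine(LooksLikePe(image) ? "PE signature found" : "not a PE file");
    }
}
```

Run against the HelloWorld.exe built earlier, this should report that the signature was found; it says nothing about the CLR header or metadata, which live further inside the file.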
To explore an assembly (a PE file such as an EXE or a DLL), the .NET Framework provides the Reflection APIs, which are used to discover type definitions at runtime. They expose different aspects of a type's definition at design time and at runtime. These APIs fall into two groups, which MSDN describes as follows:
"The System.Reflection.Emit namespace contains classes that allow a compiler or tool to emit metadata and Microsoft intermediate language (MSIL) and optionally generate a PE file on disk. The primary clients of these classes are script engines and compilers. The System.Reflection namespace contains classes and interfaces that provide a managed view of loaded types, methods, and fields, with the ability to dynamically create and invoke types."
To demonstrate the reflection API, here's a simple example. In this code, I've declared an integer x and then initialized an object of class Type, which holds the type of x, Int32. This shows that the type of a variable can be discovered at runtime. Even when x is cast to a type higher in the hierarchy, i.e. Object, GetType still returns the same type, System.Int32. Last but not least, I instantiated an object of class instanceRetriever, i.e. the class itself, and asked for its type. The Reflection API returned instanceRetriever.
using System;
using System.Reflection;

class instanceRetriever
{
    public static void Main(String[] args)
    {
        int x = 1;
        Type t = x.GetType();
        Console.WriteLine(t.Name);                    // short type name of x
        Object obj = x;                               // cast up to Object
        Console.WriteLine(obj.GetType().ToString());  // still the runtime type of x
        Console.WriteLine(new instanceRetriever().GetType().ToString());
    }
}
Listing: InstanceRetriever.cs

Figure: Running InstanceRetriever
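Once a Type object is in hand, the same namespace can also list its members. The following is a small sketch of that idea, assuming a made-up class Greeter as the target; Greeter, MemberLister and DeclaredMethodNames are hypothetical names introduced for illustration only, while Type.GetMethods and BindingFlags are standard System.Reflection members.

```csharp
using System;
using System.Reflection;

// Hypothetical sample type to inspect.
class Greeter
{
    public void SayHello() { Console.WriteLine("Hello"); }
    public int Add(int a, int b) { return a + b; }
}

static class MemberLister
{
    // Returns the names of the public instance methods declared directly
    // on the given type (inherited members such as ToString are excluded).
    public static string[] DeclaredMethodNames(Type type)
    {
        MethodInfo[] methods = type.GetMethods(
            BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly);

        string[] names = new string[methods.Length];
        for (int i = 0; i < methods.Length; i++)
        {
            names[i] = methods[i].Name;
        }
        Array.Sort(names, StringComparer.Ordinal);  // stable, culture-independent order
        return names;
    }

    static void Main()
    {
        foreach (string name in DeclaredMethodNames(typeof(Greeter)))
        {
            Console.WriteLine(name);
        }
    }
}
```

The method names come straight out of the assembly's metadata, which is exactly the information a disassembler relies on.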
This dynamic recognition of type is useful in late binding and on the fly code generation and execution. In the next detailed example, I'll demonstrate through a C# application how to open an assembly and read its methods and types without using ILDASM. It's like writing a simpler version of ILDASM. We'll call this PEManifest or Portable Executable Manifestation Engine.
Eric Lippert writes in his Visual Basic Security Handbook: "Source code ends up in the hands of outsiders in many ways. In a more security-conscious era, an increasing number of customers are demanding individual, independent review of source code. It could be subpoenaed if you are sued, or fall into the hands of attackers if they successfully attack."