Tuesday, September 7, 2010

Introduction to ClamAV's Low Level Virtual Machine (LLVM)

Users of prior versions of ClamAV may have noticed a drastic increase in the size of the tarball with the introduction of 0.96. This is due to the addition of a bytecode interpreter and a JIT compiler based on the Low Level Virtual Machine (LLVM). Together they greatly extend ClamAV's detection capabilities by letting it interpret or execute bytecode. Not much documentation exists yet on how to write bytecode for ClamAV and take advantage of the tremendous flexibility it offers (I will try to fix that). If you want to write your own bytecode for ClamAV, you will need to configure ClamAV to load unsigned bytecode (bytecode shipped by ClamAV is digitally signed, and by default only signed bytecode is loaded).

If you already have ClamAV installed, even the latest version, you will have to remove it:
sudo make uninstall
(Alternatively you can keep your existing ClamAV installed, and just build a new ClamAV without installing it.)

Get the latest version of ClamAV here. Untar the archive and run the commands
./configure --enable-unsigned-bytecode && make && sudo make install
Note the configure option --enable-unsigned-bytecode. Without it, ClamAV will refuse to load your custom bytecode and produce this warning:
LibClamAV Warning: Only loading signed bytecode, skipping load of unsigned bytecode!
Now get the bytecode compiler by running the command
git clone git://git.clamav.net/git/clamav-bytecode-compiler
This will create a folder called clamav-bytecode-compiler that contains everything needed to compile ClamAV bytecode, including documentation in the subfolder doc (the latest compiler documentation can always be accessed here). Make sure to follow the instructions in the README file to build the compiler.

Here's a case study to see how ClamAV bytecode can come in handy: an integer overflow vulnerability in an old version of OpenOffice (CVE-2008-2238). The vulnerability came about due to the way OpenOffice used to parse Enhanced Metafiles (EMF). The specification for the EMF file format is available here. An EMF metafile is composed of a series of variable-length records called EMF records. An EMF record has the following format:
Offset   Size   Description
------------------------------------
0x0000   4      Record Type
0x0004   4      Record Size
0x0008   N      Type-Specific Data
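Because every record starts with its type and size, a parser can hop from one record to the next by adding Record Size to the current offset. The following is a minimal host-side sketch of that walk in plain C (not ClamAV bytecode). The names read_le32 and count_records are hypothetical, and the sketch assumes that Record Size counts the whole record, including the 8-byte type and size fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value, independent of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: count the records of a given type by following
   the Record Size field. Returns -1 if the record chain is malformed. */
static int count_records(const uint8_t *buf, size_t len, uint32_t wanted_type)
{
    size_t pos = 0;
    int count = 0;

    assert(buf != NULL);
    while (pos < len) {
        uint32_t type, size;

        if (len - pos < 8)
            return -1;               /* truncated record header */
        type = read_le32(buf + pos);
        size = read_le32(buf + pos + 4);
        if (size < 8 || size > len - pos)
            return -1;               /* size too small or past end of data */
        if (type == wanted_type)
            count++;
        pos += size;                 /* jump to the next record */
    }
    return count;
}
```

The bytecode later in this post takes a simpler route and searches for the record type bytes directly instead of walking the record chain.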
There is a record called EMR_EXTTEXTOUTW which has the following format:
Offset   Size   Description
------------------------------------
0x0000   4      Record Type: EMR_EXTTEXTOUTW (0x00000054)
0x0004   4      Record Size
0x0008   16     Bounds
0x0018   4      iGraphicsMode
0x001c   4      exScale
0x0020   4      eyScale
0x0024   N      EmrText (variable)
The EmrText block has the following format:
Offset   Size   Description
-----------------------------
0x0000   8      Reference
0x0008   4      Chars OR nLen
0x000C   4      OffString
......
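Putting the two layouts together, Chars sits at 0x24 + 0x08 = 0x2C, that is 44 bytes from the start of an EMR_EXTTEXTOUTW record. Here is a small host-side sketch in plain C that makes this arithmetic explicit. The constant names and the get_chars helper are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    EMR_EXTTEXTOUTW_TYPE   = 0x54, /* Record Type value */
    EMRTEXT_OFFSET         = 0x24, /* EmrText within the record */
    EMRTEXT_CHARS_OFFSET   = 0x08, /* Chars within EmrText */
    CHARS_OFFSET_IN_RECORD = EMRTEXT_OFFSET + EMRTEXT_CHARS_OFFSET /* 44 */
};

/* Read a 32-bit little-endian value, independent of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: store Chars in *out and return 1 if rec is an
   EMR_EXTTEXTOUTW record long enough to contain it; return 0 otherwise. */
static int get_chars(const uint8_t *rec, size_t rec_len, uint32_t *out)
{
    assert(rec != NULL && out != NULL);
    if (rec_len < (size_t)CHARS_OFFSET_IN_RECORD + 4)
        return 0;
    if (read_le32(rec) != EMR_EXTTEXTOUTW_TYPE)
        return 0;
    *out = read_le32(rec + CHARS_OFFSET_IN_RECORD);
    return 1;
}
```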
Without getting into the details of why, I'll just say that there is an integer overflow condition if the value of Chars is equal to or greater than 0x80000000.
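To get a feel for why values with the top bit set are dangerous, here is a generic illustration in plain C. It is not OpenOffice's actual code. It only assumes, for the sake of the example, a parser that computes a buffer size for 16-bit characters using 32-bit arithmetic. Both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical size computation: bytes needed for `chars` 16-bit
   characters, done in 32 bits. Unsigned arithmetic wraps modulo 2^32. */
static uint32_t utf16_bytes_32bit(uint32_t chars)
{
    return chars * 2u;
}

/* The check used for detection: is the top bit of Chars set? */
static int chars_is_suspicious(uint32_t chars)
{
    return chars >= 0x80000000u;
}
```

Under that assumption, a Chars value of 0x80000000 yields a computed size of 0 and 0x80000001 yields 2, so an allocation based on the result would be far too small for the character count the file claims.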

Fire up your favorite text editor and create a file called emf_CVE-2008-2238.c.

Start off by specifying the type of file you are targeting (more information about target types here):
TARGET(0)
Next we declare the .ndb style pattern we will be looking for in EMF files as we attempt to identify the ones that may be trying to leverage the vulnerability. Based on the specification for the EMF format, the first record in the metafile is always an EMF header record (type 0x01), and 40 bytes into the record is a signature field that must read " EMF" (a space followed by the letters EMF). Let's declare this signature and delimit it with the macros SIGNATURES_DECL_BEGIN and SIGNATURES_DECL_END:
SIGNATURES_DECL_BEGIN
DECLARE_SIGNATURE(emr_header)
SIGNATURES_DECL_END
The definitions are delimited by the macros SIGNATURES_DEF_BEGIN and SIGNATURES_END:
SIGNATURES_DEF_BEGIN
DEFINE_SIGNATURE(emr_header, "0:01000000{37}454d46")
SIGNATURES_END
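The pattern reads as follows: at offset 0 match the bytes 01 00 00 00, skip exactly 37 bytes, then match 45 4d 46 (the ASCII letters EMF, which therefore sit at offsets 41 to 43). As a rough host-side equivalent in plain C, with matches_emr_header being a hypothetical name and not part of ClamAV:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical re-implementation of "0:01000000{37}454d46".
   Returns 1 if buf matches the pattern, 0 otherwise. */
static int matches_emr_header(const uint8_t *buf, size_t len)
{
    static const uint8_t head[4] = {0x01, 0x00, 0x00, 0x00};
    static const uint8_t tail[3] = {0x45, 0x4d, 0x46}; /* "EMF" */
    const size_t gap = 37;                             /* the {37} wildcard */

    assert(buf != NULL);
    if (len < sizeof head + gap + sizeof tail)
        return 0;
    return memcmp(buf, head, sizeof head) == 0 &&
           memcmp(buf + sizeof head + gap, tail, sizeof tail) == 0;
}
```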
We then define a function called logical_trigger() which is a must for bytecode that is triggered by a logical signature:
bool logical_trigger()
{
return matches(Signatures.emr_header);
}
If needed, you can combine multiple signatures here with boolean and comparison operators. See the format of .ldb signatures for more details, or the compiler's documentation. In this case the function simply returns true if the emr_header signature is matched. If logical_trigger returns true, the function entrypoint is called; it returns an int. I have attempted to explain the detection logic of the function through the embedded comments below:
/* This is the bytecode function that is actually executed when the logical signature is matched */
int entrypoint(void)
{
    uint8_t emf_exttextoutw[4] = "\x54\x00\x00\x00"; /* Record Type of EMF record EMR_EXTTEXTOUTW */
    int pos = 0;              /* Cursor position in file */
    uint32_t Chars_value = 0; /* Value of the attribute Chars */
    uint8_t Chars[4];         /* Chars attribute. See format for EmrText block */

    while (1)
    {
        /* Find an EMF record EMR_EXTTEXTOUTW */
        pos = file_find(emf_exttextoutw, 4);

        /* If EMF record EMR_EXTTEXTOUTW cannot be found */
        if (pos == -1)
            break;

        /* Move the cursor 44 bytes forward, to the start of Chars */
        seek(pos + 44, SEEK_SET);

        /* Read Chars, which is 4 bytes long, little endian. Stop on a short read. */
        if (read(Chars, sizeof(Chars)) != (int)sizeof(Chars))
            break;

        /* Convert to the host system's endianness. cli_readint32 is part of the ClamAV API.
           If your system is already little endian it does nothing (just reads
           the value), and if your system is big endian it swaps the bytes. See the definition
           of cli_readint32 in other.h in the libclamav folder of your ClamAV installation. */
        Chars_value = cli_readint32(Chars);

        if (Chars_value >= 0x80000000)
        {
            foundVirus("CVE-2008-2238");
            break;
        }

        /* Advance by 1 position in the file */
        seek(pos + 1, SEEK_SET);
    }
    return 0;
}
Here's the code in its entirety. Use it as a template to write your own bytecode or, as an exercise, compile it and use a hex editor to create a file that triggers this bytecode signature.
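If you would rather generate the test file than type it into a hex editor, here is a minimal host-side sketch in plain C. It builds a byte layout that should satisfy both the emr_header pattern and the Chars check, and it includes a simplified stand-in for the scan loop so the layout can be checked without ClamAV. All names (TRIGGER_LEN, build_trigger, scan_like_entrypoint) are hypothetical, and the result is not a valid EMF image, only the smallest layout this sketch assumes will match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TRIGGER_LEN = 92 }; /* 44 bytes of header + 48 bytes of record */

/* Read a 32-bit little-endian value, independent of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Fill out[] with the test layout. Returns the number of bytes
   written, or 0 if the buffer is too small. */
static size_t build_trigger(uint8_t *out, size_t cap)
{
    assert(out != NULL);
    if (cap < TRIGGER_LEN)
        return 0;
    memset(out, 0, TRIGGER_LEN);
    out[0] = 0x01;               /* header record type 0x00000001 */
    memcpy(out + 40, " EMF", 4); /* signature field at offset 40 */
    out[44] = 0x54;              /* EMR_EXTTEXTOUTW type at offset 44 */
    out[44 + 44 + 3] = 0x80;     /* Chars = 0x80000000, little endian */
    return TRIGGER_LEN;
}

/* Simplified stand-in for the bytecode's loop: find each 54 00 00 00
   and test the 32-bit value 44 bytes after it. */
static int scan_like_entrypoint(const uint8_t *buf, size_t len)
{
    static const uint8_t marker[4] = {0x54, 0x00, 0x00, 0x00};
    size_t pos;

    assert(buf != NULL);
    for (pos = 0; pos + sizeof marker <= len; pos++) {
        if (memcmp(buf + pos, marker, sizeof marker) == 0 &&
            len - pos >= 48 &&
            read_le32(buf + pos + 44) >= 0x80000000u)
            return 1;
    }
    return 0;
}
```

The buffer returned by build_trigger can be written to disk with fwrite and then scanned with a ClamAV build that loads the compiled bytecode.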

Finally, before you run off and start writing your own code, keep in mind that you are writing code in C. That means you can introduce buffer overflow vulnerabilities, infinite loops and so on. Check, double check, heck, triple check your code before you start using it in a production environment. That said, ClamAV does have some measures in place to keep bytecode from running out of control: memory accesses are bounds checked, bytecode execution has timeouts, and bytecode is run with stack smashing protection. When any of these violations is detected at runtime, bytecode execution is stopped and ClamAV continues to execute normally. Still, these protections are not guaranteed to be perfect, so you should check your code anyway!

2 comments:

Jun said...

really, you allow C code in your bytecode????


this looks like a horrible idea! a right way is to invent a new language that is type safe, so you can make sure that people can never introduce any BOF into their ClamAV. or if you dont want a new language, use smt available like Python, Lua, ...

again, from a technical point of view, this is a very bad decision!

edwin said...

The language we compile is more accurately described as a "C-like" language, but we only accept a subset of C (see the manual for details).
Memory accesses through pointers are bounds checked, loops have timeouts introduced, no external calls are allowed, and in general the compiler rejects anything that:
- can't be checked at compile time as being safe
- and can't be checked at runtime to be safe

If a bounds is violated at runtime, the bytecode's execution is stopped and a message is shown, but ClamAV keeps running normally.

That doesn't mean we won't add support for other languages, or even write a new language. However that'd probably be done by compiling to the existing bytecode format.

As for using existing languages: Lua was slow, and python wouldn't be any safer since it allows calls to external C libraries.