Pic.1 - Here we see the HTML where the embedded Windows Media Player is called with baby.mid as content, followed by the start of the obfuscated shellcode.
Pic.2 - Here we see the end of the shellcode and the call that deobfuscates it.
Pic.3 - Finally, this function starts the exploit with a call to play the MIDI file on launch.
The importance of this exploit cannot be overstated: MIDI is a relatively uncomplicated format, and a simple 1-byte overflow was found to be enough to allow remote code execution in the context of Windows Media Player. This is not an exploit that takes hours of testing and sophisticated knowledge to use. All that's needed is rudimentary knowledge of the (open) MIDI file format and basic HTML, since MIDI files can be embedded in webpages and played as soon as the page loads. In this blog post I will take you through how detection is achieved and how the VRT (in particular Alain Zidouemba, Patrick Mullen and myself) worked to parse and detect MIDI files leveraging CVE-2012-0003, including MIDI files that use an encoding designed to reduce file size, which caused false positives for other detection devices.
Without going into too much detail, the MIDI file format is simple and lightweight, and hearkens back to the year 1982 with MIDI 1.0. Since then, there have been revisions and updates, even competing sub-standards. A MIDI file is a big-endian file format that starts with a 4-byte identifier: "MThd" (0x4D546864). Inside the file, individual tracks can be identified by the bytes "MTrk" (0x4D54726B). We are interested in the data inside these tracks. After a 4-byte chunk size field (the size of the track), we get into the track event data where MIDI events are defined. MIDI events consist of a delta-time field, an event type field and up to 3 bytes as parameters for that event type. Delta-time fields are variable length, with a flag at bit 7 of each byte to determine whether the next byte is a continuation of the delta-time or the actual event type. Some events have no delta-time values and their delta-time fields are always set to 00.
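To make the layout concrete, here is a minimal Python sketch of the two pieces described above: walking the chunks to find each "MTrk" and decoding a variable-length delta-time. It is an illustration only, not the VRT detection code; the names read_vlq and iter_tracks are made up for this example, and it assumes a well-formed file with no truncated chunks.

```python
import struct


def read_vlq(data, pos):
    """Decode a MIDI variable-length quantity starting at data[pos].

    Bit 7 of each byte is the continuation flag; the low 7 bits carry the
    value, most significant group first. Returns (value, next_pos).
    """
    value = 0
    for _ in range(4):  # MIDI files limit these fields to 4 bytes
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:  # flag clear: this was the last byte
            return value, pos
    raise ValueError("variable-length quantity longer than 4 bytes")


def iter_tracks(data):
    """Yield (start, size) of the event data of every MTrk chunk."""
    if data[:4] != b"MThd":
        raise ValueError("not a MIDI file")
    pos = 0
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        # The chunk size is a 4-byte big-endian integer after the identifier.
        (size,) = struct.unpack(">I", data[pos + 4:pos + 8])
        if chunk_id == b"MTrk":
            yield pos + 8, size
        pos += 8 + size
```

For example, read_vlq(bytes([0x81, 0x48]), 0) returns (200, 2): the first byte has bit 7 set, so the second byte still belongs to the delta-time, and the event type starts at offset 2.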
Using this blog as a reference we can be more specific about the vulnerability. Of all the MIDI event types, the ones we will concentrate on are Note Off (0x8), Note On (0x9) and Note Aftertouch (0xA). All three have 2 parameters of 1 byte each. The first of these is the note number, which is valid between 0 and 127. A 1-byte field can hold twice that range, however, and that's where Windows Media Player has a problem. So, for a start, detection has to find every event type with a high nibble of 0x8, 0x9 or 0xA in order to judge whether the note number field is over 0x7F. Detection must then contend with variable-length delta-time fields. And as if that wasn't enough, there is another encoding for note events that complicates things, called MIDI Running Status. This is a type of run-length encoding where, as long as the status byte/event type doesn't change, the parser assumes all subsequent events are of the same type. Additionally, in an attempt to be even more efficient, MIDI files using running status can simply set the 2nd parameter of the Note On event to 00 to turn off the note instead of using the Note Off event. This complicates detection because we peg all our content matching on those status types, as they are the vulnerable ones, and the vulnerable fields are no longer at a predictable distance from the identifying content.
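The two decisions described above can be sketched in a few lines of Python. This is a hedged illustration rather than the actual detection logic: effective_status and note_number_out_of_range are hypothetical helper names, and the sketch only covers the byte that follows a delta-time.

```python
from typing import Optional, Tuple

# High nibbles of the three event types of interest.
NOTE_EVENT_NIBBLES = {0x8, 0x9, 0xA}  # Note Off, Note On, Note Aftertouch


def effective_status(first_byte: int,
                     running_status: Optional[int]) -> Tuple[int, bool]:
    """Work out the event type for the byte that follows a delta-time.

    If bit 7 is set, the byte is a new status byte and is consumed.
    Otherwise it is already the first parameter and the previous status
    is reused (running status). Returns (status, status_byte_consumed).
    """
    if first_byte & 0x80:
        return first_byte, True
    if running_status is None:
        raise ValueError("data byte with no running status in effect")
    return running_status, False


def note_number_out_of_range(status: int, note_number: int) -> bool:
    """True when a note event carries a note number above 0x7F."""
    return (status >> 4) in NOTE_EVENT_NIBBLES and note_number > 0x7F
```

With these helpers, effective_status(0x3C, 0x90) reports that the byte 0x3C is a parameter of a Note On event even though no 0x9 status byte precedes it, and note_number_out_of_range(0x9F, 0xB2) is true.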
For both Snort and ClamAV, we have the ability to write detection routines in C. ClamAV uses LLVM, and the code is JIT-compiled at load time on systems that the JIT supports. On systems that it doesn't support, the code is interpreted. This allows for detailed parsing of a file type. Alain Zidouemba wrote the following detection code, and goes through it with us. I encourage you to download BC.Exploit.CVE_2012_0003-1.c and follow along.
We start by declaring sig1 and sig2, two signatures that we will use to identify MIDI files. sig1 will match "MThd" at the beginning of a file and sig2 will match "MTrk" anywhere in the file. Our trigger condition is a file that matches both sig1 and sig2. If that triggering condition is met, through our code, ClamAV will do the following for every track ("MTrk") in the MIDI file:
- Read the chunk size, which is a 4-byte big-endian value that comes immediately after “MTrk”.
- Skip the delta-time, which is a variable-length value field. It determines when an event should be played relative to the track's last event. For our parsing purposes, we don’t need to know what the delta-time is. We only need to know where it ends so that we can skip it.
- Next, we read the event type. It can be a value between 0x80 and 0xFF.
- For event types between 0xB0 and 0xEF, we properly parse the MIDI channel events and parameters.
- For event type 0xFF, we are dealing with what are called meta-events. They are events that aren’t sent or received over MIDI ports, yet we need to parse them properly. And that’s what we do.
- Event type 0xF0 usually defines a Normal System Exclusive Event. These are events used to control MIDI hardware or software that require special data bytes that will follow their manufacturer's specifications. These are the most common type of SysEx event and are used to hold a single block of manufacturer-specific data. The last byte transmitted is 0xF7 to indicate the end of the event.
- Event type 0xF0 sometimes defines a Divided System Exclusive Event. This is when a large amount of SysEx data in a Normal SysEx Event could cause the following MIDI Channel Events to be transmitted after the time they should be played. In that case, the last byte is not 0xF7, which indicates that the SysEx data is not finished and will be continued in an upcoming Divided SysEx Event. Any following Divided SysEx Events before the final one use a similar format to the first, only the start byte is 0xF7 instead of 0xF0 to signal continuation of SysEx data. The final block follows the same format as the continuation blocks, except the last data byte is 0xF7 to signal the completion of the divided SysEx data. Again, we are most concerned about just parsing this data properly.
- If the event type is “Note On”, “Note Off”, or “Note Aftertouch” (in other words, if the higher nibble is 0x8, 0x9 or 0xA), check to see if the note number, the first parameter of the event, is greater than 0x7F. If that’s the case we have the vulnerable condition!
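The steps above can be sketched end to end in Python. To be clear, this is not the ClamAV bytecode: it is a simplified stand-in for illustration, scan_midi and scan_track are made-up names, and it assumes the file stores SysEx events as a start byte followed by a variable-length size. It checks the note number (the first parameter), as in the vulnerability description earlier in this post.

```python
import struct


def read_vlq(data, pos):
    """Decode a variable-length quantity; returns (value, next_pos)."""
    value = 0
    for _ in range(4):
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos
    raise ValueError("variable-length quantity longer than 4 bytes")


def scan_track(track):
    """Return True if a note event in this track has a note number > 0x7F."""
    pos, running = 0, None
    try:
        while pos < len(track):
            _, pos = read_vlq(track, pos)      # skip the delta-time
            if track[pos] & 0x80:              # explicit event type
                status = track[pos]
                pos += 1
            elif running is not None:          # running status: reuse type
                status = running
            else:
                return False                   # malformed; give up
            if status == 0xFF:                 # meta-event: type, size, data
                pos += 1
                size, pos = read_vlq(track, pos)
                pos += size
                running = None
            elif status in (0xF0, 0xF7):       # SysEx: size, data
                size, pos = read_vlq(track, pos)
                pos += size
                running = None
            elif status >= 0xF0:
                return False                   # not expected in a file
            else:                              # channel event, 0x80..0xEF
                running = status
                kind = status >> 4
                if kind in (0x8, 0x9, 0xA) and track[pos] > 0x7F:
                    return True                # the vulnerable condition
                # Program change (0xC) and channel pressure (0xD) take one
                # parameter byte; the other channel events take two.
                pos += 1 if kind in (0xC, 0xD) else 2
    except (IndexError, ValueError):
        return False                           # truncated or malformed
    return False


def scan_midi(data):
    """Trigger on MThd at the start plus MTrk anywhere, then scan tracks."""
    if not (data.startswith(b"MThd") and b"MTrk" in data):
        return False
    pos = data.find(b"MTrk")
    while pos >= 0 and pos + 8 <= len(data):
        (size,) = struct.unpack(">I", data[pos + 4:pos + 8])
        end = min(pos + 8 + size, len(data))
        if scan_track(data[pos + 8:end]):
            return True
        pos = data.find(b"MTrk", end)
    return False
```

Because the sketch walks every event instead of matching byte patterns, bytes inside a multi-byte delta-time that happen to look like a note status are never mistaken for one.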
Pic.4 - The Metasploit exploit with the Note On event “9F” and the overflow of the note number field “B2”.
The Windows Media Player MIDI overflow vulnerability is a great example of a vulnerability being disclosed by the vendor, hackers realizing the usefulness of the vulnerability, and in-the-wild code appearing a short while after. Thankfully, both ClamAV and Snort cover the baby.mid in-the-wild exploit, exploits generated by the Metasploit module, as well as POC exploits. On the ClamAV side, the signature BC.Exploit.CVE_2012_0003-1 provides coverage for the MIDI file, while CVE_2012_0003-1 through 3 cover the HTML, the payload and the .exe that is downloaded. On the Snort side, there is coverage with rule sids 20900, 21159, and 21167.
MD5s of samples found in the wild up to now: