Thursday, July 10, 2014

Hybrid Analysis - NextGen Technology for Advanced Malware Payload Detection

As malware evolves, the era of pure dynamic analysis systems is coming to an end. What potential does Hybrid Analysis have?

by Jan Miller (jan(dot)miller(at)payload-security.com)
What you will learn…
  • About malware analysis challenges
  • What Hybrid Analysis is about
  • Why Hybrid Analysis is successful
What you should know…
  • Basic knowledge of x86 Assembly
  • Basic knowledge of Malware Analysis

Introduction

The Internet connects a wide range of personal computers for private and business purposes. Many of them run Microsoft Windows on x86-compatible architectures, with Windows holding roughly 90% of the desktop market (NetMarketShare, 2014). Such monocultures are an extremely attractive environment for malware attacks. Today, malware often appears in the form of highly complex Trojan systems that come with exploit kits and very sophisticated anti-detection measures. Both the number of infections and the awareness in the industry are greater than ever: there are about 4 million new infections per month (SecureList, 2014). The worm MyDoom.X alone caused damages of about $38.5 billion, and that was in 2006 (Borglund, 2014). Lately, partly due to the NSA scandal, awareness of IT security has grown considerably, and IT security is attracting heavy investment.
Classical malware detection methods were based on pure static code analysis, such as finding a specific byte pattern and matching it against a known database of “malicious signatures”. Static analysis can be described (in the most general sense) as code analysis without execution of the target payload. In response, malware authors started releasing packed/encrypted or even polymorphic software that rendered classical methods ineffective. Consequently, anti-virus (AV) vendors, CERTs/CIRTs and malware researchers started developing and using dynamic analysis systems. Dynamic analysis can be described (in the most general sense) as code analysis during execution or emulation of the target payload. This was a huge evolutionary step, because an appropriately instrumented execution environment allows the observer to see the behavior of the target software after the malware has unpacked its security layers. Today, dynamic analysis systems run the target software in virtual environments with hardware acceleration support (such as VMware or VirtualBox) in order to observe the malware behavior at runtime. These often automated systems are called “Sandbox” analysis systems, as they represent an isolated execution environment for malware that simulates a real victim’s machine [1]. With systems such as VirtualBox, the virtual machine (VM) can be restored to a clean state by loading predefined snapshot files, which allows numerous malware samples to be executed in sequence without manually rebuilding the infected machine. Of course, malware authors have adapted to the growth of Sandbox systems and introduced a variety of VM detection methods. If a VM environment is detected, the malware may behave differently than it would in the wild and not show its true behavior. The malicious functionality that is not observed is what we call dormant code. These evasion techniques range from delayed execution (so-called “time bombs”) to complex system/hardware state detection methods.
For example, if the real payload is not executed within a reasonable amount of time, the analysis system will give up on the analysis and potentially miss valuable information. Thus, dormant code detection is a vital prerequisite for Sandbox systems. Analysis results get even better when dormant code is analyzed in depth using runtime context information.
Combining static and dynamic analysis (typically termed Hybrid Analysis) in a fully automated, scalable and performant analysis environment is the next generation of malware forensics and detection. In this article, we will take a look at which dynamic analysis data is necessary to understand dormant code and how we can combine it with static analysis to extract in-depth behavior information.

Terminology

This chapter outlines the most important terms, so that all readers share the same understanding when the terms are used later in the article.

Static Analysis

Static analysis can be described in the most general sense as code analysis without execution of the target payload. The target code (the analysis input data) may be a compiled binary file or a human-readable format, such as program source code, scripting language files or any other type of machine code representation. N. Ayewah et al. define static analysis as a method that “(…) examines code in the absence of input data and without running the code, and can detect potential security violations (…), runtime errors (…) and logical inconsistencies (…).” (Nathaniel Ayewah, David Hovemeyer, J. David Morgenthaler, John Penix and William Pugh, 2008).
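The classical signature matching mentioned in the introduction is the simplest form of static analysis, and a small sketch shows why packing defeats it. The following Python snippet is only an illustration: the signature names and byte patterns are made up, and the one-byte XOR “packer” is a toy stand-in for real packers or encryptors.

```python
from typing import Dict, List, Tuple

# Hypothetical signature database: name -> byte pattern (illustrative values only).
SIGNATURES: Dict[str, bytes] = {
    "demo_marker_a": bytes.fromhex("deadbeef"),
    "demo_marker_b": b"EVIL_PAYLOAD",
}


def scan_bytes(data: bytes, signatures: Dict[str, bytes] = SIGNATURES) -> List[Tuple[str, int]]:
    """Return (signature name, offset) for every occurrence of every pattern."""
    hits: List[Tuple[str, int]] = []
    for name, pattern in signatures.items():
        offset = data.find(pattern)
        while offset != -1:
            hits.append((name, offset))
            offset = data.find(pattern, offset + 1)
    # Sort by offset so the result follows the file layout.
    return sorted(hits, key=lambda hit: hit[1])


def xor_pack(data: bytes, key: int) -> bytes:
    """Toy 'packer': XOR every byte with a one-byte key."""
    return bytes(b ^ key for b in data)


sample = b"\x00\x01" + bytes.fromhex("deadbeef") + b"xx" + b"EVIL_PAYLOAD"
print(scan_bytes(sample))                  # both patterns are found
print(scan_bytes(xor_pack(sample, 0x5A)))  # the packed copy yields no hits
```

The scanner never executes the input, which is what makes it static. The same property is its weakness: once the bytes on disk are transformed, the patterns only reappear in memory at run time.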

Dynamic Analysis

Dynamic analysis can be described in the most general sense as code analysis during execution or emulation of the target payload. The techniques involved are usually implemented by tools such as execution visualizers, system observing tools (e.g. malicious behavior detection, intrusion detection, performance observation, etc.), profilers or other types of behavior analysis tools (e.g. sandbox systems). The primary technique for performing dynamic analysis is instrumentation of the target code or its host (i.e. instrumenting the Operating System to enable system-level profiling of the suspect application) in order to profile the target code’s behavior (Kendall, 2007). Instrumentation refers to techniques that insert additional code for analysis purposes (instrumentation code) into the target code in order to measure performance, detect bugs or intercept code flow so that certain behavior patterns can be analyzed. In malware analysis, behavior patterns are often the most interesting.
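The idea of instrumentation can be sketched in a few lines of Python. This is a conceptual analogy only: real monitors detour native API functions inside the target process, whereas here a wrapper is placed around a method of a hypothetical `FakeRegistry` class that stands in for an OS API. All names in the snippet are invented for illustration.

```python
import functools
from typing import Any, Callable, Dict, List, Tuple

# Each log entry: (function name, positional arguments, return value).
CallLog = List[Tuple[str, tuple, Any]]


def instrument(func: Callable[..., Any], log: CallLog) -> Callable[..., Any]:
    """Wrap func so that every call is recorded and then passed through."""
    @functools.wraps(func)
    def hooked(*args: Any, **kwargs: Any) -> Any:
        result = func(*args, **kwargs)              # run the original code
        log.append((func.__name__, args, result))   # instrumentation code
        return result
    return hooked


class FakeRegistry:
    """Hypothetical stand-in for an OS API that a target program calls."""

    def __init__(self) -> None:
        self.values: Dict[str, str] = {}

    def set_value(self, key: str, value: str) -> bool:
        self.values[key] = value
        return True


def target_program(registry: FakeRegistry) -> None:
    """Hypothetical target: writes an auto-start style entry."""
    registry.set_value(r"Software\Run\demo", r"C:\demo.exe")


log: CallLog = []
registry = FakeRegistry()
registry.set_value = instrument(registry.set_value, log)  # install the hook
target_program(registry)
print(log)
```

The target program runs unchanged and is unaware of the hook, while the observer obtains the call name, its parameters and the return value.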

Dormant Code

Dormant code or dormant functionality in malicious programs is payload/code that is not observed during dynamic analysis. In the context of malware, dormant code (not to be confused with “Software rot”) may be hiding very interesting behavior that was not executed during analysis for whatever reason (e.g. due to virtual machine detection, a command and control server not being available, a long initial sleeping delay, etc.). We can say that every pure dynamic analysis reporting “no malicious behavior” always contains some kind of dormant code (as the executed code coverage is never 100%) and sometimes malicious dormant code. As the “false negative” case is to be avoided at all costs (i.e. thinking something is clean that is not), it makes sense to invest resources into detecting dormant code. This can be achieved, for example, by adding a static analysis layer on top of memory snapshots.
On a side note: process memory context changes constantly. Thus, it is necessary to take memory snapshots at an intelligent point in time or with a high frequency to “catch” e.g. unpacked code or injected shellcode. In a “perfect” world with quantum processors, an analysis system would be able to observe any memory change and instantly analyze the entire process address space for all potentially executable code locations without impacting performance. Unfortunately, we do not have quantum computers and therefore need to rely on heuristics and shortcuts, which leaves room for mistakes. For example, analysis systems that run through thousands of files per day have to abide by an analysis time limit. If nothing happens within the first ~5-10 minutes, the system moves on to the next file and heuristics have to do the job. Thus, the better the underlying algorithms and the overall performance of the system are, the more files can be analyzed in a more complete and less error-prone fashion. Of course, scalable systems and a lot of hardware can compensate for bad implementations to some degree, but in the real world there is always a hardware limit, and other bottlenecks surface on large parallel systems. In other words, quality starts at the lowest level, with a flexible architecture in mind.
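One simple example of a static check on a memory snapshot is looking for executable images that only exist in memory. The sketch below searches a raw dump for the DOS “MZ” magic and verifies that the `e_lfanew` field at offset 0x3C points to a “PE\0\0” signature. It is a minimal heuristic written for this article's context, not a description of how any particular engine does it, and the helper `make_fake_image` merely builds a synthetic, non-functional buffer for demonstration.

```python
import struct
from typing import List


def find_pe_images(dump: bytes) -> List[int]:
    """Return the offsets in dump at which a plausible PE image starts."""
    offsets: List[int] = []
    pos = dump.find(b"MZ")
    while pos != -1:
        # The DOS header is 0x40 bytes; e_lfanew (offset of the PE header)
        # is a 32-bit little-endian value stored at DOS header offset 0x3C.
        if pos + 0x40 <= len(dump):
            (e_lfanew,) = struct.unpack_from("<I", dump, pos + 0x3C)
            pe_pos = pos + e_lfanew
            if dump[pe_pos:pe_pos + 4] == b"PE\x00\x00":
                offsets.append(pos)
        pos = dump.find(b"MZ", pos + 1)
    return offsets


def make_fake_image(size: int = 0x80) -> bytes:
    """Build a synthetic buffer that only carries valid MZ/PE markers."""
    image = bytearray(size)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, 0x40)  # e_lfanew points to offset 0x40
    image[0x40:0x44] = b"PE\x00\x00"
    return bytes(image)


# 0x100 filler bytes, one synthetic image, then a stray "MZ" without a header.
dump = b"\x90" * 0x100 + make_fake_image() + b"MZ junk"
print([hex(offset) for offset in find_pe_images(dump)])
```

A check like this is cheap enough to run on every snapshot frame, which matters given the time limits discussed above. Malware that wipes or fakes its headers would evade it, so it can only be one heuristic among several.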

Hybrid Analysis

Hybrid Analysis (HA) is what we call the intelligent combination of static and dynamic analysis. It is a technology or method that integrates run-time data extracted from dynamic analysis into a static analysis algorithm in order to detect behavior or malicious functionality that would otherwise be much harder to find. Often, the dynamic “helper data” consists of memory snapshots and runtime API symbol data (memory reference address values), which are fed as input into a sophisticated static analysis engine (possibly including data flow analysis). For example, if a dormant code sequence executes an indirect call, it would not be possible to resolve the called function address without knowing the value read from the referenced memory location at the point in time of execution [2]. Even if we knew the value, it would not be possible to associate the called function address with a system call if a mapping of memory references to symbol information were not available for the specific execution environment [3].
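The indirect call example can be made concrete with a small sketch. On x86, `call dword ptr [imm32]` is encoded as the bytes FF 15 followed by the 32-bit address of the memory slot holding the target. The code below looks for that encoding in a snapshot, reads the slot value from the same snapshot and maps it to a name through a symbol table. It is a deliberately naive illustration: it scans raw bytes instead of disassembling (so it can produce false positives), the `Snapshot` class is hypothetical, and the addresses used in the example are invented.

```python
import struct
from typing import Dict, List, Optional, Tuple


class Snapshot:
    """Hypothetical memory snapshot: one contiguous region loaded at base."""

    def __init__(self, base: int, data: bytes) -> None:
        self.base = base
        self.data = data

    def read_u32(self, address: int) -> Optional[int]:
        """Read a little-endian 32-bit value, or None if out of range."""
        offset = address - self.base
        if 0 <= offset <= len(self.data) - 4:
            return struct.unpack_from("<I", self.data, offset)[0]
        return None


def resolve_indirect_calls(snapshot: Snapshot,
                           symbols: Dict[int, str]) -> List[Tuple[int, str]]:
    """Find `call dword ptr [imm32]` (FF 15) and name the call targets."""
    resolved: List[Tuple[int, str]] = []
    pos = snapshot.data.find(b"\xff\x15")
    while pos != -1:
        if pos + 6 <= len(snapshot.data):
            (slot,) = struct.unpack_from("<I", snapshot.data, pos + 2)
            target = snapshot.read_u32(slot)  # value only known at run time
            if target is not None:
                name = symbols.get(target, "unknown_%08X" % target)
                resolved.append((snapshot.base + pos, name))
        pos = snapshot.data.find(b"\xff\x15", pos + 1)
    return resolved


# Invented layout: the call at 0x400000 reads its target from slot 0x400010.
code = b"\xff\x15" + struct.pack("<I", 0x400010) + b"\x90" * 10
slot_value = struct.pack("<I", 0x77DD1234)                 # invented API address
symbols = {0x77DD1234: "ADVAPI32.dll!RegCreateKeyExW"}     # invented mapping
print(resolve_indirect_calls(Snapshot(0x400000, code + slot_value), symbols))
```

Without the snapshot the slot content is unknown, and without the symbol table the resolved value is just a number. Both pieces of run-time data are needed before static analysis can label the call.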

Hybrid Analysis in Action

In the previous chapter, Hybrid Analysis and its associated terms were outlined briefly. In this chapter, we will apply Hybrid Analysis techniques to an example malware sample and evaluate the results in order to take a look at the practical side of the topic.

Tools

Before we get to the experimental results, the involved tools will be outlined briefly.

VirtualBox

For our example malware analysis, we will be using VirtualBox as our preferred virtual machine environment. From the main page Oracle states that “VirtualBox is a powerful x86 and AMD64/Intel64 virtualization product for enterprise as well as home use. Not only is VirtualBox an extremely feature rich, high performance product for enterprise customers, it is also the only professional solution that is freely available as Open Source Software under the terms of the GNU General Public License (GPL) version 2.” (VirtualBox) Sounds good? It is good. Definitely good enough to show what HA is about.

StaticStream

StaticStream is our preferred static analysis engine, as it can take dynamic data (such as memory snapshots and symbol data) and put it together using HA technology. On the webpage, it is described as follows: “StaticStream is a high-performance static analysis engine that is written in C++ and can analyze x86 PE files, memory dumps or shellcode. It uses a novel approach of combining dynamic data with state of the art static analysis techniques in order to detect and understand dormant code. It offers a wide range of configuration options and regular updates.” (Payload Security)

Dynamic Analysis Tools

For run-time data capturing we are going to use the AREE (Automatic Reverse Engineering Engine) Manager and Monitor binaries. These are two in-house tools used at Payload Security to generate dynamic data when running malware. They work similarly to the Cuckoo Sandbox monitor library “CuckooMon” in the sense that they detour calls at the application level. The Manager is used to load configuration data and start the analysis. The Monitor is a DLL file that is injected into the initial malware process and applies user-level hooks to catch system API calls. Whenever the malware tries to inject itself into another process (e.g. using a remote thread or other techniques), the Monitor is applied to the new target process as well. In order for our experiment to be successful, injected shellcode, memory dumps, process context (loaded modules, registry accesses, mutants, etc.) and symbol information (module exports) are logged before the malware is able to modify/taint the data. Why did we use our own tools? We decided to use them only because the generated dynamic data has a format that StaticStream understands, which lets us show more easily how HA works. If you want to replicate our experiment and try out the tools, feel free to contact us.

Hybrid Analysis vs. Matsnu Trojan

Now that we know about the tools involved, let us take a look at real malware and see HA in action. For our “experiment”, we decided to use a Trojan called Matsnu [4], which encrypts files on the target drive in order to hold the data for ransom. These are the steps we will be taking:
  • Install a VirtualBox instance with a typical OS, such as Windows XP
  • Load Matsnu sample on the virtual machine drive
  • Run Matsnu sample using AREEv2Mgr and inject AREEv2Mon monitor library
  • Let the analysis run for a couple of seconds (it is enough) and grab the generated run-time data
  • Take the grabbed run-time data and use it to analyze memory snapshots using HA technology
  • Evaluate the results and draw a conclusion
First, let us install Windows XP and load Matsnu on the main drive. The following screenshot shows the system after setup shortly before an analysis.


Figure 1: Start Screen after Installing Windows XP and loading “matsnu” on the main drive

As we can see, there is a “shared folder” (release) open with the Manager ready to start the Matsnu application. We also notice that Matsnu uses a PDF icon in order to mislead the Windows user into thinking it is a document and not an executable. As file extensions are hidden by default, we cannot tell at first sight that it is an executable.
In the next screenshot, we see the Manager open and use the command “.run C:/Matsnu” to start the analysis manually. There is also a command-line interface, but it is not covered here.


Figure 2: Running “matsnu” from the Manager using the interactive mode

At this point, we can already observe that an output folder “AREE” has been created on the C: drive. It will contain all the dynamic analysis information. Also, the Matsnu file is missing. Checking the captured files in the “AREE” folder, we find that this is implemented using a dynamically created batch file, which deletes itself after deleting the original file “Matsnu.exe” on the C: drive. The batch file is executed from a duplicated process so that the original file is not in use by the OS. This is the batch file content:

:l
if not exist "C:\Matsnu.exe" goto e
del /Q /F "C:\Matsnu.exe"
goto l
:e
del /Q /F "C:\DOCUME~1\mjkdmjmj\APPLIC~1\5176313.bat"
All in all, the malicious process duplicates itself upon startup and deletes the original file, but continues to exist. From the user’s point of view the PDF file has simply disappeared, and the malware authors probably assume that the user will continue with daily business without giving much thought to what happened.
After running the sample for a couple of seconds, we abort the analysis, quit the VM and take a look at the captured dynamic data. This is what the dynamic data folder looks like.

Figure 3: Dynamic Data Folder

The “api” folder contains system calls and parameters, the “bin” folder contains captured files (e.g. the *.bat file mentioned above), the “ctx” folder contains environment data (such as loaded modules, their symbols, registry accesses, etc.), the “dmp” folder contains memory snapshots of multiple frames and the “shc” folder contains extracted shellcodes. The “monprocs.csv” file contains an overview of all monitored processes. In this case, the contents are similar to the following (reduced version):
15539444-00013192,"INJECT_NEW","c:\Matsnu.exe","\Device\HarddiskVolume1\Matsnu.exe","<date>"
15540015-00013280,"INJECT_EXISTING","C:\WINDOWS\system32\cmd.exe","\Device\HarddiskVolume1\WINDOWS\system32\cmd.exe","<date>"
15540115-00001528,"INJECT_EXISTING","C:\WINDOWS\Explorer.EXE","\Device\HarddiskVolume1\WINDOWS\explorer.exe","<date>"
We quickly see that Matsnu first runs the batch file and then injects itself into “explorer.exe” where it remains to execute most of its payload. This makes manual debugging with e.g. OllyDbg more difficult.
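A process overview like the one above is easy to post-process. The following sketch parses the three sample lines with Python’s `csv` module and lists the processes the sample injected into. The column names in `FIELDS` are assumptions made for this example, since the file format is not documented here; the first column is treated as an opaque identifier.

```python
import csv
import io
from typing import Dict, List

# Assumed column names; the format itself is not specified in this article.
FIELDS = ["monitor_id", "action", "image_path", "device_path", "timestamp"]

# The reduced sample lines shown above ("<date>" is a placeholder there too).
SAMPLE = r'''
15539444-00013192,"INJECT_NEW","c:\Matsnu.exe","\Device\HarddiskVolume1\Matsnu.exe","<date>"
15540015-00013280,"INJECT_EXISTING","C:\WINDOWS\system32\cmd.exe","\Device\HarddiskVolume1\WINDOWS\system32\cmd.exe","<date>"
15540115-00001528,"INJECT_EXISTING","C:\WINDOWS\Explorer.EXE","\Device\HarddiskVolume1\WINDOWS\explorer.exe","<date>"
'''


def parse_monprocs(text: str) -> List[Dict[str, str]]:
    """Parse the overview into one dictionary per monitored process."""
    reader = csv.DictReader(io.StringIO(text.strip()), fieldnames=FIELDS)
    return [dict(row) for row in reader]


def injected_targets(rows: List[Dict[str, str]]) -> List[str]:
    """Return the image paths of existing processes that were injected into."""
    return [row["image_path"] for row in rows if row["action"] == "INJECT_EXISTING"]


rows = parse_monprocs(SAMPLE)
for path in injected_targets(rows):
    print(path)
```

Such a list is a natural starting point for deciding which process dumps to hand to the static analysis stage first.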
Consequently, we first analyze the memory dump files (ignoring all system files) from the explorer.exe process using symbol memory references and module information as “context information”, which is one of the ideas of Hybrid Analysis. Specifically, we start StaticStream and let it analyze the last frame of the process (i.e. the last “dump” we logged before quitting the VM), because it often contains code sequences that are already unpacked. The following is StaticStream’s output in shortened form (it passes roughly 1.66 million instructions, including data flow, in about 3.3 seconds):
Welcome to AREE v2.1
Starting analysis ...
Adding undefined memory file 15540115-00001528.00000002.15561486.2B90000.00000040.mdmp (POI: 0, Executable: 1) for later analysis
Found a hidden PE file in memory file 15540115-00001528.00000002.15561486.3730000.00000002.mdmp at 3730000
Analyzing in-memory binary file 15540115-00001528.00000002.15561486.3730000.00000002.mdmp
Analyzing 1 exports
1 of 1 exports accepted
No packed files could be detected
Running heuristic scan on binary file 15540115-00001528.00000002.15561486.3730000.00000002.mdmp
Generating final analysis report
Number of passed instructions: 1660669
Finished analysis in 3276 ms with a throughput of 445 KB/s
This is an excerpt of what one output folder with stream files containing disassembly listings looked like (human-readable output is the default behavior):

Figure 4: Streams Folder File Listing

Hand-browsing some of the stream files quickly reveals that one portion of the streams contains encrypted payload and another portion contains unencrypted payload. Here are some of the more interesting functions that could be post-processed to generate behavior signatures or used as an entrypoint for additional manual analysis:

Figure 5: Persistence using RegCreateKeyEx

The above “code sequence” (or “Stream”) shows the call to RegCreateKeyExW in ADVAPI32.dll, which pure static analysis would not detect because the memory reference of the indirect call could not be resolved. In this case, a registry key was created and a registry value was set during execution, as indicated by the dynamic analysis registry logfile (i.e. the associated code sequence is not dormant code):


Figure 6: Persistence using Registry

Converting the hex values to ASCII reveals the following path:
C:\Documents and Settings\mjkdmjmj\Application Data\Microsoft\qfpvideo.exe
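The hex-to-ASCII conversion itself takes only a few lines of Python. The snippet is a sketch under the assumption that the logged value stores one byte per character; the short hex string in the usage line was derived from the first characters of the path above and is not copied from the logfile.

```python
def hex_to_text(hex_string: str, encoding: str = "ascii") -> str:
    """Decode a hex dump (whitespace between bytes is allowed) into text."""
    raw = bytes.fromhex(hex_string)
    return raw.decode(encoding)


# "43 3a 5c 44 6f 63" are the ASCII codes of the first six path characters.
print(hex_to_text("43 3a 5c 44 6f 63"))
```

If a value were stored as a wide-character string instead, passing `encoding="utf-16-le"` would be the corresponding adjustment.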
Matsnu obviously tries to survive a reboot by adding itself to the auto-start registry, which is a very common technique. Checking more streams quickly turned up another interesting entrypoint: the function that encrypts the Command & Control server requests before the data is sent over an alternate HTTP connection.


Figure 7: Encrypting Payload before C&C request

The code location above is a good starting point to check cross-references and intercept the encrypted key creation (of course, this requires a flexible monitor system). Also, please note that a run-time capturing mechanism located at the kernel level would not be able to capture the unencrypted data without hooking into user mode and thereby becoming detectable again.
Today, more and more malware uses encrypted traffic (not only HTTPS, but with the payload itself being encrypted as well). This makes it necessary to move closer to the malware code itself, as encryption/decryption of important data happens at the application level.
On a side note, the HA technology also revealed the following C&C server IP addresses using the alternate HTTP port 8080:
50.31.146.134:8080
204.197.254.94:8080
78.129.181.191:8080
27.124.127.10:8080
173.203.112.215:8080
50.97.99.2:8080
103.25.59.120:8080
5.135.208.53:8080
50.31.146.109:8080
204.93.183.196:8080
… and a lot more interesting dormant code sequences, which are not outlined here.
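Extracted endpoints such as the C&C addresses listed above are typical raw material for indicators. The sketch below validates the “ip:port” strings with Python’s `ipaddress` module and turns them into a lookup set. The function names are made up for this example, and a real deployment would feed the result into whatever blocklist or signature format is in use.

```python
import ipaddress
from typing import Iterable, Set, Tuple

Endpoint = Tuple[ipaddress.IPv4Address, int]

# The C&C endpoints listed above.
C2_LINES = [
    "50.31.146.134:8080", "204.197.254.94:8080", "78.129.181.191:8080",
    "27.124.127.10:8080", "173.203.112.215:8080", "50.97.99.2:8080",
    "103.25.59.120:8080", "5.135.208.53:8080", "50.31.146.109:8080",
    "204.93.183.196:8080",
]


def parse_endpoints(lines: Iterable[str]) -> Set[Endpoint]:
    """Parse 'ip:port' strings; invalid addresses or ports raise ValueError."""
    endpoints: Set[Endpoint] = set()
    for line in lines:
        host, _, port = line.strip().rpartition(":")
        endpoints.add((ipaddress.IPv4Address(host), int(port)))
    return endpoints


def is_known_c2(host: str, port: int, endpoints: Set[Endpoint]) -> bool:
    """Check whether an observed connection matches a known endpoint."""
    return (ipaddress.IPv4Address(host), port) in endpoints


endpoints = parse_endpoints(C2_LINES)
print(is_known_c2("50.31.146.134", 8080, endpoints))
print(is_known_c2("50.31.146.134", 80, endpoints))
```

Matching on the address and port pair rather than on the address alone keeps the indicator as specific as the extracted data allows.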

Conclusion

Although the Matsnu Trojan is not the most sophisticated malware available today, it is a good example, because it reflects typical, state-of-the-art aspects. Its network communication uses encrypted payloads, it tries to hide its payload by injecting itself into a variety of processes, it decrypts its payload inside the explorer process, which makes manual debugging difficult, and so forth. Using run-time data capturing tools we were able to extract a lot of information, including dormant code and complete symbol information. Of course, the dynamic analysis tool was required to follow the malware into the explorer process and remain undetected. As a next step, the static analysis engine StaticStream associated the run-time data and quickly generated code sequences for post-processing, allowing us to find valuable analysis entrypoints and behavior data that a pure dynamic analysis engine would not have seen.
In general, we can say that static analysis works well if the data to be analyzed is not encrypted, not obfuscated and available in a more or less complete form. Sadly, this is rarely the case with malware today. Dynamic analysis works well too, but it misses dormant code and therefore potentially malicious functionality. As we cannot make any qualified statements about the unknown, it is impossible for a pure dynamic analysis system to state safely that a file is benign/clean, because the real payload may never have been executed. Thus, new Hybrid Analysis (HA) technologies are not only a necessity, but part of a future solution in the battle against malware. Due to the additional overhead imposed by hybrid technologies, very efficient and performance-oriented algorithms are necessary, especially at large scale.

Summary

In this article we outlined that today’s malware development is opening up new challenges for malware analysis systems. In the early days, simple static byte patterns were enough to detect and classify malware. Then, as malware became more sophisticated, dynamic analysis systems that observe run-time behavior surfaced. These systems have evolved and are a powerful tool today, but their impact is becoming more and more limited. Today, neither static nor dynamic analysis alone is an effective weapon against modern malware. Dynamic analysis environments are being detected and/or malicious dormant code is not being analyzed due to time constraints or unpredictable code flow behavior. Using intelligent algorithms and Hybrid Analysis (HA) technologies, the best of both worlds can be put together: first-pass checks, analyzing/logging run-time behavior, as well as detecting and understanding dormant code functionality. In this article we showed that Hybrid Analysis is an answer, provided that the captured run-time data is of sufficient quality and the static analysis engine is flexible enough to produce usable analysis results that can be post-processed to generate signatures or indicators.

About the Tools

In this article we put focus on a static analysis engine called StaticStream. It is a product of Payload Security and makes automatic and efficient Hybrid Analysis available to dynamic analysis systems and analysts. Its easy interface, high configurability and flexible data stream processing architecture make it an interesting option to upgrade any dynamic analysis system for challenges today and tomorrow.

On the Web

More information on StaticStream is available on the web at www.payload-security.com.

About the author

Jan Miller is a specialist in static binary analysis algorithms, reverse engineering and malware signatures. He is the CEO and founder of Payload Security UG (haftungsbeschränkt). Over the past two years, he has focused on Android-based malware, as well as on implementing Hybrid Analysis technologies for a leading dynamic analysis system.

Table of Figures

Figure 1: Start Screen after Installing Windows XP and loading “matsnu” on the main drive.
Figure 2: Running “matsnu” from the Manager using the interactive mode.
Figure 3: Dynamic Data Folder.
Figure 4: Streams Folder File Listing.
Figure 5: Persistence using RegCreateKeyEx.
Figure 6: Persistence using Registry.
Figure 7: Encrypting Payload before C&C request.

Bibliography

Borglund, J. (2014, April). Top 5 Most Costly Viruses of All Time. Retrieved April 2014, from TopTen Reviews: http://anti-virus-software-review.toptenreviews.com/top-5-most-costly-viruses-of-all-time-pg5.html
Cuckoo Sandbox. (n.d.). Malwr - Malware Analysis by Cuckoo Sandbox. Retrieved June 24, 2014, from https://malwr.com/analysis/YjQzNzExNjcwMDQyNDBhMmJmOTFhN2Y4ODk5ZmQ0NGM/
Kendall, K. (2007). Practical Malware Analysis. Mandiant, Intelligent Information Security.
Nathaniel Ayewah, David Hovemeyer, J. David Morgenthaler, John Penix and William Pugh. (2008). Experiences Using Static Analysis to Find Bugs.
NetMarketShare. (2014, April). Desktop Operating System Market Share. Retrieved April 2014, from http://www.netmarketshare.com/
Payload Security. (n.d.). Payload-Security.com - Combining Static and Dynamic Analysis Intelligently. Retrieved June 24, 2014, from http://www.payload-security.com/
SecureList. (2014, April). Internet threats statistics. Retrieved April 2014, from SecureList: http://www.securelist.com/en/statistics#/en/map/oas/month
VirtualBox. (n.d.). Oracle VM VirtualBox. Retrieved June 24, 2014, from https://www.virtualbox.org/

Footnotes


[1] Executing malware on a prepared physical machine is possible as well, of course.
[2] Using a memory snapshot from a later point in time is possible as well, if the value remains unchanged.
[3] The “specific execution environment” qualification is important, because techniques such as ASLR (Address Space Layout Randomization) make system API function addresses unpredictable. As such, we always need to understand detected dormant code in the process context of a specific execution environment.
[4] SHA256 e008e161cce090242262fc977b6fe707d3058cdaa3b5d5c3bab24c8c6b05ce9e