As malware evolves, the era of pure dynamic analysis systems
is coming to an end. What potential does Hybrid Analysis have?
by Jan Miller (jan(dot)miller(at)payload-security.com)
What you will learn…
- About malware analysis challenges
- What Hybrid Analysis is about
- Why Hybrid Analysis is successful
What you should know…
- Basic knowledge of x86 Assembly
- Basic knowledge of Malware Analysis
Introduction
The Internet connects a wide range of personal computers for private
and business purposes that often run Microsoft Windows on x86-compatible architectures,
with Windows holding about 90% of the desktop market (NetMarketShare, 2014). Such a monoculture is an extremely attractive environment for
numerous malware attacks. Today, malware often appears in the form of highly
complex Trojan systems that come with exploit kits and very sophisticated
anti-detection measures. The number of infections and the awareness in the
industry are larger than ever. Today, there are about 4 million new infections
per month (SecureList, 2014). The worm MyDoom.X alone caused
damages of about $38.5 billion, and that was in 2006 (Borglund, 2014). Lately, partly due to the NSA scandal,
awareness of IT security has been growing considerably, and IT security is
becoming a market that attracts heavy investment.
Classical
malware detection methods were based on pure static code analysis, such as
finding a specific byte pattern and matching it against a known database of
“malicious signatures”. Static analysis can be described (in the most general
sense) as
code analysis without
execution of the target payload. In turn, malware authors started releasing
packed/encrypted or even polymorphic software that rendered classical methods
worthless. Consequently, anti-virus (AV) vendors, CERTs/CIRTs and malware
researchers started developing and using dynamic analysis systems. Dynamic
analysis can be described (in the most general sense) as
code analysis during execution or emulation of the target payload.
This was a huge step in evolution, because when the execution environment is
instrumented appropriately, it allows the observer to see the target software
behavior after the malware unpacks its security layers. Today, dynamic analysis
systems run the target software on virtual environments with hardware
acceleration support (such as VMware or VirtualBox), in order to observe the
malware behavior during runtime. These often automatic systems are called
“Sandbox” analysis systems, as they represent an isolated execution environment
for malware that simulates a real victim’s machine [1].
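To make the classical signature approach mentioned above concrete, the following is a minimal sketch of byte-pattern matching against a signature database. The signature names, the patterns and the XOR "packer" are invented for illustration and do not come from any real AV product; the point is only that a trivial encoding layer is enough to hide a known pattern from a purely static scanner.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical signature database: name -> byte pattern.
# None marks a wildcard position that matches any byte.
SIGNATURES: Dict[str, Sequence[Optional[int]]] = {
    "demo.dropper": [0xDE, 0xAD, None, 0xEF],
    "demo.stub": [0x90, 0x90, 0xCC],
}


def matches_at(data: bytes, offset: int, pattern: Sequence[Optional[int]]) -> bool:
    """Return True if the pattern matches the data starting at offset."""
    if offset + len(pattern) > len(data):
        return False
    return all(p is None or data[offset + i] == p for i, p in enumerate(pattern))


def scan(data: bytes,
         signatures: Dict[str, Sequence[Optional[int]]] = SIGNATURES) -> List[str]:
    """Return the names of all signatures found anywhere in the data."""
    return [
        name
        for name, pattern in signatures.items()
        if any(matches_at(data, off, pattern) for off in range(len(data)))
    ]


plain = b"\x00\xde\xad\x42\xef\x00"      # contains the demo.dropper pattern
packed = bytes(b ^ 0x5A for b in plain)  # same payload behind a trivial XOR layer

print(scan(plain))   # the signature is found
print(scan(packed))  # the XOR layer hides it from the scanner
```

A dynamic system sidesteps this problem because it observes the payload after it has removed its own protective layer.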
Using systems such as VirtualBox, the virtual machine (VM) state can be
restored to a clean state by loading predefined snapshot files, thus allowing
execution of numerous malware samples in sequence without the need to restore
the infected machine. Of course, malware authors have adapted to the growth of
Sandbox systems and introduced a variety of VM detection methods. If a VM
environment can be detected, the malware may behave differently than it would in
the wild and not show its true behavior. The unobserved malicious
functionality is what we call dormant code. These evasion techniques range
from delayed execution – so called “time bombs” – to complex system/hardware
state detection methods. For example, if the real payload is not executed within
a reasonable amount of time – the analysis system will give up on the analysis
and potentially miss valuable information. Thus, dormant code detection is a
vital prerequisite to Sandbox systems. Analysis results get even better when
dormant code is analyzed in-depth using runtime context information.
Combining
both static and dynamic analysis (commonly called Hybrid Analysis) in a fully automated, scalable and performant
analysis environment is the next generation in malware forensics and detection
algorithms. In this article, we will take a look at what dynamic analysis data
is necessary to understand dormant code and how we can combine it with static
analysis to extract in-depth behavior information.
Terminology
In
this chapter the most important terms are outlined, so that all readers
share the same understanding when the terms are used in the article.
Static Analysis
Static
analysis can be described in the most general sense as code analysis
without execution of the target payload. The target code (the analysis
input data) may be a compiled binary file or a human-readable format, such as
program source code, scripting language files or any other type of machine code
representation. N. Ayewah et al. define static analysis as a method that “(…) examines code in the absence of input
data and without running the code, and can detect potential security violations
(…), runtime errors (…) and logical inconsistencies (…).” (Nathaniel Ayewah, David Hovemeyer, J. David
Morgenthaler, John Penix and William Pugh, 2008).
Dynamic Analysis
Dynamic
analysis can be described in the most general sense as code analysis
during execution or emulation of the target payload. Involved techniques
are usually implemented by tools such as execution visualizers, system
observing tools (e.g. malicious behavior detection, intrusion detection,
performance observation, etc.), profilers or other types of behavior analysis
tools (e.g. sandbox systems). The only known technique used for performing
dynamic analysis is instrumentation of the target code or its host (i.e.
instrumenting the Operating System to enable system-level profiling of the
suspect application), in order to profile the target code’s behavior (Kendall, 2007). Instrumentation
refers to techniques that insert additional code for analysis purposes (or instrumentation
code) into the target code, in order to measure client performance, detect
bugs or intercept code-flow in order to analyze certain behavior patterns. In
malware analysis, behavior patterns are often the most interesting.
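As a language-level analogy for instrumentation, the sketch below wraps a function so that every call is logged before the result is handed back to the caller. The function `create_file` is a hypothetical stand-in for a system API, and the wrapper is only a simplified model of what a real user-level hook does with detoured machine code.

```python
import functools
from typing import Any, Callable, List, Tuple

# Behavior log filled by the instrumentation code:
# (function name, positional arguments, return value)
CALL_LOG: List[Tuple[str, tuple, Any]] = []


def instrument(func: Callable) -> Callable:
    """Wrap func so that each call is recorded in CALL_LOG."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)  # run the original code
        CALL_LOG.append((func.__name__, args, result))
        return result
    return wrapper


def create_file(path: str) -> int:
    """Hypothetical stand-in for a system API; returns a fake handle."""
    return 42


# "Install the hook": from now on every caller reaches the wrapper first.
create_file = instrument(create_file)


def target_payload() -> None:
    """The code under observation; it is unaware of the instrumentation."""
    create_file("C:\\temp\\a.txt")


target_payload()
print(CALL_LOG)
```

The observed program is unchanged; only the entry point it calls has been replaced, which is the same idea a monitor DLL applies to system API functions.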
Dormant Code
Dormant
code or dormant functionality in
malicious programs is payload/code that is not observed during dynamic
analysis. In the context of malware, dormant code (not to be confused with
“Software rot”) may be hiding very interesting behavior that was not executed
during analysis for whatever reason (e.g. due to virtual machine detection, a
command and control server not being available, a long initial sleeping delay,
etc.). We can say that every pure dynamic analysis containing “no malicious
behavior” always contains some kind of dormant code (as the executed code
coverage is never 100%) and sometimes malicious
dormant code. As the “false negative” case is to be avoided at all cost (i.e.
thinking something is clean that is not), it makes sense to invest resources
into detecting dormant code. This can be achieved by adding e.g. an additional
static analysis layer on memory snapshots.
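Conceptually, dormant code is what a static pass over a memory snapshot finds minus what the dynamic trace saw executing. The following sketch expresses that as a set difference; the function addresses are made-up values, and a real system would of course work on much richer data than bare entry points.

```python
from typing import Iterable, List


def find_dormant_functions(discovered: Iterable[int],
                           executed: Iterable[int]) -> List[int]:
    """Functions found statically in the snapshot but never seen executing."""
    return sorted(set(discovered) - set(executed))


def code_coverage(discovered: Iterable[int], executed: Iterable[int]) -> float:
    """Fraction of statically discovered functions that were executed."""
    found = set(discovered)
    if not found:
        return 0.0
    return len(found & set(executed)) / len(found)


# Hypothetical function entry points (illustrative addresses only).
discovered_in_snapshot = [0x3731000, 0x3731200, 0x3731480, 0x3731900]
executed_during_run = [0x3731000, 0x3731480]

dormant = find_dormant_functions(discovered_in_snapshot, executed_during_run)
print([hex(a) for a in dormant])
print(code_coverage(discovered_in_snapshot, executed_during_run))
```

The dormant list is the natural work queue for the additional static analysis layer.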
On a
side-note: process memory context constantly changes. Thus, it is necessary to
take memory snapshots at an intelligent point in time or with a high frequency
to “catch” e.g. unpacked code or injected shellcode, etc. In a “perfect” world
with quantum processors, an analysis system would be able to observe any memory
change and instantly analyze the entire process address space for all
potentially executable code locations and not make an impact on the
performance. Unfortunately, we do not have quantum computers and as such need
to rely on heuristics and shortcuts, leaving room for mistakes. For example,
analysis systems that run through thousands of files per day have an analysis
time limit that they have to abide by. If nothing happens within the first
~5-10 minutes, it is off to the next file and heuristics have to do the job.
Thus, the better the underlying algorithms and the overall performance
of the system are, the more files can be analyzed in a more complete and
error-reduced fashion. Of course, scalable systems and a lot of hardware can
compensate for bad implementations to some degree, but real-world hardware is
always limited and other bottlenecks surface on large parallel
systems. In other words, quality starts at the lowest level, with a flexible architecture in mind.
Hybrid Analysis
Hybrid Analysis (HA) is what we call the intelligent combination of static and
dynamic analysis. It is a technology or method that integrates run-time data
extracted from dynamic analysis into a static analysis algorithm to detect
behavior or malicious functionality that would otherwise be hard to find. Often, the
dynamic “helper data” consists of memory snapshots and runtime API symbol data
(memory reference address values), which are fed as input to a
sophisticated static analysis engine (possibly including data flow analysis).
For example, if a dormant code sequence executes an indirect call, it would not
be possible to resolve the called function address without knowing the value
read from a memory location at the point in time of execution [2].
Even if we knew the value, it would not be possible to associate the called
function address with a system call, if a mapping of memory references to
symbol information is not available for the specific execution environment [3].
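The indirect-call example can be sketched in a few lines. The code below assumes a 32-bit x86 instruction of the form `call dword ptr [imm32]` (opcode bytes FF 15), a memory snapshot with a known base address, and a symbol map captured at run time. The snapshot contents, the target address and the helper names are hypothetical; this is not how StaticStream is implemented, only an illustration of why both pieces of dynamic data are needed.

```python
import struct
from typing import Dict, Optional


def read_u32(snapshot: bytes, base: int, address: int) -> Optional[int]:
    """Read a little-endian 32-bit value at a virtual address, if inside the snapshot."""
    offset = address - base
    if offset < 0 or offset + 4 > len(snapshot):
        return None
    return struct.unpack_from("<I", snapshot, offset)[0]


def resolve_indirect_call(snapshot: bytes, base: int, call_address: int,
                          symbols: Dict[int, str]) -> Optional[str]:
    """Resolve `call dword ptr [imm32]` (FF 15 imm32) to a symbol name."""
    offset = call_address - base
    if offset < 0 or snapshot[offset:offset + 2] != b"\xff\x15":
        return None  # not the instruction form handled by this sketch
    slot = read_u32(snapshot, base, call_address + 2)  # address of the pointer
    if slot is None:
        return None
    target = read_u32(snapshot, base, slot)  # run-time value stored in the slot
    if target is None:
        return None
    return symbols.get(target)  # needs the environment-specific symbol map


BASE = 0x3730000        # illustrative snapshot base address
API_ADDRESS = 0x77DD1234  # hypothetical run-time address of the API

# Build a tiny fake snapshot: the call at BASE, the pointer slot at BASE + 0x10.
image = bytearray(0x20)
image[0:6] = b"\xff\x15" + struct.pack("<I", BASE + 0x10)
image[0x10:0x14] = struct.pack("<I", API_ADDRESS)
snapshot = bytes(image)

symbol_map = {API_ADDRESS: "ADVAPI32.dll!RegCreateKeyExW"}

print(resolve_indirect_call(snapshot, BASE, BASE, symbol_map))
print(resolve_indirect_call(snapshot, BASE, BASE, {}))  # no symbol data: unresolved
```

Without the snapshot the pointer slot has no value, and without the symbol map the value is just a number.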
Hybrid Analysis in Action
In this chapter we will apply Hybrid Analysis techniques to an example malware sample
and evaluate the results in order to take a look at the practical side of the
topic. In the previous chapter, Hybrid Analysis and its associated terms were
outlined briefly.
Tools
Before
we get to the experimental results, the involved tools will be outlined
briefly.
VirtualBox
For
our example malware analysis, we will be using VirtualBox as our preferred
virtual machine environment. On its main page, Oracle states that
“VirtualBox is a powerful x86 and
AMD64/Intel64 virtualization product for enterprise as well as home use. Not only is VirtualBox an
extremely feature rich, high performance product for enterprise customers, it
is also the only professional solution that is freely available as Open Source
Software under the terms of the GNU General Public License (GPL) version 2.” (VirtualBox) Sounds good? It
is good. Definitely good enough to show what HA is about.
StaticStream
StaticStream
is our preferred static analysis engine, as it can take dynamic data (such as
memory snapshots, symbol data) and put it together using HA technology. From
the webpage, it is described as follows: “StaticStream
is a high-performance static analysis engine that is written in C++ and can
analyze x86 PE files, memory dumps or shellcode. It uses a novel approach of
combining dynamic data with state of the art static analysis techniques in
order to detect and understand dormant code. It offers a wide range of configuration
options and regular updates.” (Payload Security)
Dynamic Analysis Tools
For
run-time data capturing we are going to use the AREE (Automatic Reverse
Engineering Engine) Manager and Monitor binaries. These are two in-house tools
used at Payload Security to generate dynamic data when running malware. These
tools work similarly to the Cuckoo Sandbox monitor library “CuckooMon” in the
sense that they detour calls at the application level, whereby the Manager is
used to load configuration data and start the analysis. The monitor is a DLL
file that is injected into the initial malware process and user-level hooks are
applied to catch system API calls. Also, whenever the malware tries to inject
itself into another process (e.g. using a remote thread or other techniques),
the monitor is applied to the new target process. In order for our experiment
to be successful, injected shellcode, memory dumps, process context (loaded
modules, registry accesses, mutants, etc.) and symbol information (module
exports) are logged before the malware is able to modify/taint the data. Why
did we use our own tools? Basically, we only decided to use them, because the
generated dynamic data has a preferred format that is understandable to
StaticStream and we can show how HA works more easily. If you want to replicate
our experiment and want to try out the tools, feel free to contact us.
Hybrid Analysis vs. Matsnu Trojan
Now that
we know about the tools involved, let us take a look at real malware and see HA
in action. For our “experiment”, we decided to use a Trojan called Matsnu [4] that encrypts files on the target drive in order to hold the unencrypted data for
ransom. These are the steps we will be taking:
- Install a VirtualBox instance with a typical OS,
such as Windows XP
- Load Matsnu sample on the virtual machine drive
- Run Matsnu sample using AREEv2Mgr and inject
AREEv2Mon monitor library
- Let the analysis run for a couple of seconds (it
is enough) and grab the generated run-time data
- Take the grabbed run-time data and use it to
analyze memory snapshots using HA technology
- Evaluate the results and draw a conclusion
First,
let us install Windows XP and load Matsnu on the main drive. The following
screenshot shows the system after setup shortly before an analysis.
Figure 1: Start Screen after Installing Windows XP
and loading “matsnu” on the main drive
As
we can see, there is a “shared folder” (release) open with the Manager ready to
start the Matsnu application. Also, we notice that Matsnu is using a PDF icon
in order to mislead the Windows user into thinking it is dealing with a
document and not an executable. As file extensions are hidden by default, we
cannot know at first sight that it is an executable.
In
the next screenshot we see the Manager open and the command “.run
C:/Matsnu” being used to start the analysis manually. There is also a command-line interface,
but that is not outlined here.
Figure 2: Running “matsnu” from the Manager using
the interactive mode
At
this point we can already observe an output folder “AREE” that has been created
on the C: drive. It will contain all the dynamic analysis information. Also,
the Matsnu file is missing. Checking the captured files in the “AREE” folder,
we find that this is implemented using a dynamically created batch file, which
deletes itself after deleting the original file “Matsnu.exe” on the C: drive.
Also, the batch file is executed from a duplicated process so that the original
file is not in use by the OS. This is the batch file content:
:l
if not exist "C:\Matsnu.exe" goto e
del /Q /F "C:\Matsnu.exe"
goto l
:e
del /Q /F "C:\DOCUME~1\mjkdmjmj\APPLIC~1\5176313.bat"
All in all, the malicious process duplicates itself upon
startup, deletes the original file, but continues to exist. The PDF file is
missing for the user and the malware authors probably assume that the user
will continue with daily business without giving any thought to what happened.
After running the sample for a couple of seconds, we abort the
analysis, quit the VM and take a look at the captured dynamic data. This is what
the dynamic data folder looks like.
Figure 3: Dynamic Data Folder
The “api” folder contains system calls and parameters, the
“bin” folder contains captured files (e.g. the *.bat file mentioned above), the
“ctx” folder contains environment data (such as loaded modules, their symbols,
registry accesses, etc.), the “dmp” folder contains memory snapshots of
multiple frames and the “shc” folder contains extracted shellcodes. The “monprocs.csv”
file contains an overview of all monitored processes. In this case, the
contents are similar to the following (reduced version):
15539444-00013192,"INJECT_NEW","c:\Matsnu.exe","\Device\HarddiskVolume1\Matsnu.exe","<date>"
15540015-00013280,"INJECT_EXISTING","C:\WINDOWS\system32\cmd.exe","\Device\HarddiskVolume1\WINDOWS\system32\cmd.exe","<date>"
15540115-00001528,"INJECT_EXISTING","C:\WINDOWS\Explorer.EXE","\Device\HarddiskVolume1\WINDOWS\explorer.exe","<date>"
We
quickly see that Matsnu first runs the batch file and then injects itself into
“explorer.exe” where it remains to execute most of its payload. This makes
manual debugging with e.g. OllyDbg more difficult.
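The process overview is easy to post-process. The sketch below parses the three records shown above and lists the order in which processes came under observation. The column names are our own guesses at the meaning of each field, since the file has no header row, and the first field is treated as an opaque record identifier.

```python
import csv
import io
import ntpath
from typing import List, NamedTuple


class MonitoredProcess(NamedTuple):
    # Field names are assumptions; the CSV itself carries no header.
    record_id: str
    action: str
    image_path: str
    device_path: str
    timestamp: str


# The reduced monprocs.csv content from the article.
SAMPLE = r'''15539444-00013192,"INJECT_NEW","c:\Matsnu.exe","\Device\HarddiskVolume1\Matsnu.exe","<date>"
15540015-00013280,"INJECT_EXISTING","C:\WINDOWS\system32\cmd.exe","\Device\HarddiskVolume1\WINDOWS\system32\cmd.exe","<date>"
15540115-00001528,"INJECT_EXISTING","C:\WINDOWS\Explorer.EXE","\Device\HarddiskVolume1\WINDOWS\explorer.exe","<date>"'''


def parse_monprocs(text: str) -> List[MonitoredProcess]:
    """Parse monprocs-style CSV text into typed records, skipping blank lines."""
    return [MonitoredProcess(*row) for row in csv.reader(io.StringIO(text)) if row]


def injection_chain(records: List[MonitoredProcess]) -> List[str]:
    """Lower-cased image names in the order the monitor was applied."""
    return [ntpath.basename(r.image_path).lower() for r in records]


records = parse_monprocs(SAMPLE)
print(" -> ".join(injection_chain(records)))
```

`ntpath` is used instead of `os.path` so that Windows paths are split correctly even when the analysis host runs another operating system.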
Consequently,
we first try to analyze the memory dump files (ignoring all system files) from
the explorer.exe process using symbol memory references and module information
as “context information”, which is one of the ideas of Hybrid Analysis.
Specifically, we start StaticStream letting it analyze the last frame of the
process (i.e. the last “dump” we logged before quitting the VM), because it
often contains already unpacked code sequences. The following is StaticStream’s
output in shortened form (processing roughly 1.66 million instructions including
data flow in an impressive ~3 seconds):
Welcome to AREE v2.1
Starting analysis ...
Adding undefined memory file 15540115-00001528.00000002.15561486.2B90000.00000040.mdmp (POI: 0, Executable: 1) for later analysis
…
Found a hidden PE file in memory file 15540115-00001528.00000002.15561486.3730000.00000002.mdmp at 3730000
…
Analyzing in-memory binary file 15540115-00001528.00000002.15561486.3730000.00000002.mdmp
Analyzing 1 exports
1 of 1 exports accepted
No packed files could be detected
…
Running heuristic scan on binary file 15540115-00001528.00000002.15561486.3730000.00000002.mdmp
…
Generating final analysis report
Number of passed instructions: 1660669
Finished analysis in 3276 ms with a throughput of 445 KB/s
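The log line about a hidden PE file hints at a check that is easy to sketch. A common heuristic is to look for the DOS header magic “MZ” and verify that the `e_lfanew` field at offset 0x3C points to a “PE\0\0” signature. The code below applies that heuristic to a synthetic dump; it is an assumption about what such a check might look like, not a description of StaticStream's actual detection logic.

```python
import struct
from typing import List


def find_embedded_pe(dump: bytes) -> List[int]:
    """Return offsets where an MZ header points to a valid PE signature."""
    hits = []
    pos = dump.find(b"MZ")
    while pos != -1:
        if pos + 0x40 <= len(dump):
            # e_lfanew: offset of the PE header, relative to the MZ header
            e_lfanew = struct.unpack_from("<I", dump, pos + 0x3C)[0]
            pe_off = pos + e_lfanew
            if (e_lfanew > 0 and pe_off + 4 <= len(dump)
                    and dump[pe_off:pe_off + 4] == b"PE\x00\x00"):
                hits.append(pos)
        pos = dump.find(b"MZ", pos + 1)
    return hits


# Synthetic test data: a decoy "MZ" at offset 0 and a minimal header at 0x100.
minimal_header = (b"MZ" + b"\x00" * 0x3A       # DOS header up to offset 0x3C
                  + struct.pack("<I", 0x40)    # e_lfanew -> PE signature at 0x40
                  + b"PE\x00\x00")
dump = b"MZ" + b"\xcc" * 0xFE + minimal_header + b"\x00" * 16

for offset in find_embedded_pe(dump):
    print(f"possible PE image at dump offset {offset:#x}")
```

A production check would go on to validate the file header and section table, since the two magic values alone can occur by chance.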
This is an excerpt of what one output folder with stream files containing
disassembly listings looked like (human-readable output is the default
behavior):
Figure 4: Streams Folder File Listing
Hand-browsing some of the stream files quickly reveals that one
portion of the streams contains encrypted payload and one portion contains
unencrypted payload. Here are some of the more interesting functions that could
be used for post-processing to generate behavior signatures or used as an
entrypoint for an additional manual analysis:
Figure 5: Persistence using RegCreateKeyEx
The above “code sequence” (or “Stream”)
shows the call to RegCreateKeyExW in ADVAPI32.dll that would otherwise not be
detected using pure static analysis, as the indirect call memory reference
would not be resolved. In this case, a registry key was created and a registry
value was set during execution, as indicated by the dynamic analysis
registry logfile (i.e. the associated code sequence is not dormant code):
Figure 6: Persistence using Registry
Converting the hex values to ASCII reveals the following
path:
C:\Documents and Settings\mjkdmjmj\Application Data\Microsoft\qfpvideo.exe
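The hex-to-text conversion can be sketched as follows. The logged hex values from the figure are not reproduced here, so the demo simply encodes the path above and decodes it again. The function also handles UTF-16LE, because Windows stores registry strings in that encoding; which of the two encodings the logfile actually used is an assumption left open.

```python
import re


def hex_to_text(hex_dump: str) -> str:
    """Decode a dump of hex byte pairs (any separators) into text."""
    pairs = re.findall(r"[0-9A-Fa-f]{2}", hex_dump)
    data = bytes(int(pair, 16) for pair in pairs)
    # Every second byte zero suggests UTF-16LE text in the ASCII range.
    if len(data) % 2 == 0 and all(b == 0 for b in data[1::2]):
        text = data.decode("utf-16-le", errors="replace")
    else:
        text = data.decode("ascii", errors="replace")
    return text.rstrip("\x00")  # drop string terminators


path = r"C:\Documents and Settings\mjkdmjmj\Application Data\Microsoft\qfpvideo.exe"

# Round trip: build a space-separated hex dump from the path, then decode it.
ascii_dump = " ".join(f"{b:02x}" for b in path.encode("ascii"))
utf16_dump = ",".join(f"{b:02x}" for b in (path + "\x00").encode("utf-16-le"))

print(hex_to_text(ascii_dump))
print(hex_to_text(utf16_dump))
```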
Matsnu
obviously tries to survive a reboot by adding itself to an auto-start
registry key, which is a very common technique. Checking more streams, we
quickly found another interesting entrypoint: the function that encrypts the
Command & Control server requests before sending the data over HTTP on an
alternate port.
Figure 7: Encrypting Payload before C&C request
The code location above is a good starting point to check cross-references
and intercept the encrypted key creation (of course, this requires a flexible
monitor system). Also, please note that a run-time capturing mechanism
located at the kernel level would not be able to capture the
unencrypted data without hooking into user mode and becoming detectable
again.
Today, more and more malware is using encrypted traffic (not
only HTTPS, but the payload itself being encrypted as well), making it
necessary to move closer to the malware code itself, as encryption/decryption of
important system data happens at the application level.
On a side note, the HA technology also revealed the following
C&C server IP addresses using the alternate HTTP port 8080:
50.31.146.134:8080
204.197.254.94:8080
78.129.181.191:8080
27.124.127.10:8080
173.203.112.215:8080
50.97.99.2:8080
103.25.59.120:8080
5.135.208.53:8080
50.31.146.109:8080
204.93.183.196:8080
… and a lot more interesting
dormant code sequences, which are not outlined here.
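Endpoints like the ones listed above lend themselves to automatic extraction. The following sketch pulls `IP:port` pairs out of arbitrary text, validates them and removes duplicates. The listing format in the demo is invented, and only two of the addresses from the list above are reused; how StaticStream actually reports such strings is not shown in this article.

```python
import ipaddress
import re
from typing import List, Tuple

ENDPOINT_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3}):(\d{1,5})\b")


def extract_endpoints(text: str) -> List[Tuple[str, int]]:
    """Return unique, valid (IPv4 address, port) pairs in order of appearance."""
    found: List[Tuple[str, int]] = []
    for host, port_text in ENDPOINT_RE.findall(text):
        try:
            ipaddress.IPv4Address(host)  # rejects e.g. 999.1.1.1
        except ValueError:
            continue
        port = int(port_text)
        if 0 < port <= 65535 and (host, port) not in found:
            found.append((host, port))
    return found


# Hypothetical disassembly strings; the layout is for illustration only.
listing = (
    'db "50.31.146.134:8080",0\n'
    'db "204.197.254.94:8080",0\n'
    'db "999.1.1.1:80",0\n'
    'db "50.31.146.134:8080",0\n'
)

for host, port in extract_endpoints(listing):
    print(f"{host}:{port}")
```

The resulting pairs can be fed directly into network indicators or blocklists.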
Conclusion
Although
the Matsnu Trojan is not the most
sophisticated malware available today, it is a good example, because it
exhibits typical, state-of-the-art traits. Its network communication uses
encrypted payloads, it tries to hide its payload by injecting itself into a
variety of processes, it decrypts its payload inside explorer.exe, making manual
debugging difficult, and so forth. Using some run-time data capturing tools we
were able to extract a lot of information, including dormant code and complete
symbol information. Of course, the dynamic analysis tool was required to follow
the malware into explorer.exe and remain undetected. As a next step, the static
analysis engine StaticStream associated the run-time data and quickly generated code
sequences for post-processing, allowing us to find valuable analysis
entrypoints and behavior data otherwise unseen by a pure dynamic analysis
engine.
In
general we can say that static analysis works well if the data to be analyzed is
not encrypted, not obfuscated and available in a more or less complete form.
Sadly, this is rarely the case with malware today. Furthermore, we can
say that dynamic analysis works well too, but it misses dormant code and
potentially malicious functionality. As we cannot make any qualified statements
about the unknown, it is impossible for a pure dynamic analysis system to
safely state that a file is benign/clean, because the real
payload may never have been executed. Thus, new Hybrid Analysis (HA) technologies are not
only a necessity, but part of a future solution in the battle against malware. Due
to the additional overhead imposed by hybrid technologies, very efficient and
performance-oriented algorithms are necessary, especially when operating at a large
scale.
Summary
In
this article we outlined that today’s malware development is opening up new
challenges for malware analysis systems. In the early days, simple static
analysis byte patterns were enough to detect and classify malware. Then, as
malware became more sophisticated, dynamic analysis systems that observed
run-time behavior surfaced. The dynamic analysis systems have evolved and are a
powerful tool today, but their impact is becoming more and more limited. Today,
neither static nor dynamic analysis alone is an effective weapon against modern
malware. Either dynamic analysis environments are detected, or
malicious dormant code is not analyzed due to time constraints or
unpredictable code flow behavior. Using intelligent algorithms and Hybrid
Analysis (HA) technologies, the best of both worlds can be put together: first-pass
checks, analyzing/logging run-time behavior, as well as detecting and
understanding dormant code functionality. In this article we showed that Hybrid
Analysis is an answer, if the run-time data captured has a sufficient quality
and the static analysis engine is flexible enough to produce usable analysis
results that can be post-processed to generate signatures or indicators.
About the Tools
In
this article we focused on a static analysis engine called StaticStream. It
is a product of Payload Security and makes automatic and efficient Hybrid
Analysis available to dynamic analysis systems and analysts. Its easy
interface, high configurability and flexible data stream processing
architecture make it an interesting option to upgrade any dynamic analysis
system for challenges today and tomorrow.
About the author
Jan
Miller is a specialist in static binary analysis algorithms, reverse engineering
and malware signatures. He is the CEO and founder of Payload Security UG
(haftungsbeschränkt). In the past two years, he has focused on Android
based malware, as well as implementing Hybrid Analysis technologies for a
leading dynamic analysis system.
Table of Figures
Figure 1: Start Screen after Installing Windows XP and
loading “matsnu” on the main drive.
Figure 2: Running “matsnu” from the Manager using the
interactive mode.
Figure 3: Dynamic Data Folder.
Figure 4: Streams Folder File Listing.
Figure 5: Persistence using RegCreateKeyEx.
Figure 6: Persistence using Registry.
Figure 7: Encrypting Payload before C&C request
Bibliography
Borglund, J. (2014, April). Top 5 Most Costly
Viruses of All Time. Retrieved April 2014, from TopTen Reviews:
http://anti-virus-software-review.toptenreviews.com/top-5-most-costly-viruses-of-all-time-pg5.html
Cuckoo Sandbox. (n.d.). Malwr - Malware Analysis
by Cuckoo Sandbox. Retrieved June 24, 2014, from
https://malwr.com/analysis/YjQzNzExNjcwMDQyNDBhMmJmOTFhN2Y4ODk5ZmQ0NGM/
Kendall, K. (2007). Practical Malware Analysis.
Mandiant, Intelligent Information Security.
Nathaniel Ayewah, David Hovemeyer, J. David
Morgenthaler, John Penix and William Pugh. (2008). Experiences Using
Static Analysis to Find Bugs.
NetMarketShare. (2014, April). Desktop Operating
System Market Share. Retrieved April 2014, from
http://www.netmarketshare.com/
Payload Security. (n.d.). Payload-Security.com -
Combining Static and Dynamic Analysis Intelligently. Retrieved June 24,
2014, from http://www.payload-security.com/
SecureList. (2014, April). Internet threats
statistics. Retrieved April 2014, from SecureList:
http://www.securelist.com/en/statistics#/en/map/oas/month
VirtualBox. (n.d.). Oracle VM VirtualBox.
Retrieved June 24, 2014, from https://www.virtualbox.org/