10

I'm working on a web app that tells users whether the JavaScript they entered is malicious. I'm using this article (Examples of malicious javascript) for reference.

Is it possible to create an equation whose coefficients (for example, weights on the number of occurrences of uncommon characters) give a strong indication that a given piece of code is malicious or benign?

If not, is there another method that I could use?

Suhass
  • I doubt that. Whether or not there *"is another method that you can use"* depends on why you are trying to do this. – Philipp Aug 17 '15 at 07:37
  • You seem to be working on the same kind of problem that anti-virus vendors have been studying for years now... – WhiteWinterWolf Aug 17 '15 at 09:23
  • What you want seems to be impossible, but nothing prevents you from running this JS in a one-time virtual environment and then, by watching what the code does, defining which actions you would consider malicious (e.g. initiating a download, messing with local files in the virtual environment). I don't know if this is feasible or efficient, but it seems simpler. – Freedo Aug 17 '15 at 23:43
  • Glad that my thesis was mentioned here (under "Machine learning based classifiers" and "Dynamic methods"). I would like to suggest a [work](https://www.cs.ucsb.edu/~vigna/publications/2010_cova_kruegel_vigna_Wepawet.pdf) by Marco Cova et al. on the analysis and detection of malicious JavaScript (focused on drive-by downloads). – Birhanu Eshete Sep 13 '15 at 15:07

2 Answers

48

You are attempting something impossible. If it were that easy, web malware would have died out long ago.

If you want to use mathematical tools to track malicious JavaScript code, you first need to know which features JavaScript malware employs. Once you understand these features, you will see that it is impossible to capture anything meaningful in one or several mathematical equations. So let's take a look at the common features of JavaScript attacks:

  1. Server-side polymorphism

Literally meaning "many shapes", polymorphism is a technique malware authors use to evade signature-based detectors. Polymorphism is called server-side when the engine that produces many different copies of the malware is hosted on a compromised web server (Server-Side Polymorphism: Crime-Ware as a Service Model (CaaS)). The Simulated Metamorphic Encryption Generator (SMEG) version 1.0 was the first engine developed to implement polymorphism for computer viruses, in the early 1990s (Parallel analysis of polymorphic viral code using automated deduction system).

  2. Code obfuscation

Another feature common to malicious JavaScript code is obfuscation, which is nearly always used. This common factor does not make things any simpler, because innocuous JavaScript code is obfuscated too (for instance, some developers do not want others to understand their carefully written JavaScript functions, since the HTML and JavaScript source of a page is easy to read). Along with server-side polymorphism, code obfuscation is a technique malware authors widely use to circumvent antivirus scanners. A myriad of techniques can be used to obfuscate JavaScript code, such as string reversing, Unicode and Base64 encoding, string splitting and document object model (DOM) interaction (Malware with your Mocha? Obfuscation and anti-emulation tricks in malicious JavaScript.).
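To illustrate why signature matching struggles against these techniques, here is a minimal Python sketch that rewrites one JavaScript statement using three of the methods listed above. The payload, the signature and all function names are hypothetical and chosen only for illustration; Python is used merely to generate and inspect the JavaScript text.

```python
import base64

# Hypothetical sample statement and the literal signature a naive scanner might search for.
PAYLOAD = 'document.write("<iframe src=//example.invalid>")'
SIGNATURE = "document.write"

def js_reverse(src: str) -> str:
    # String reversing: the script stores the text backwards and flips it at run time.
    escaped = src[::-1].replace('"', '\\"')
    return f'eval("{escaped}".split("").reverse().join(""))'

def js_split(src: str, size: int = 4) -> str:
    # String splitting: the text is cut into short chunks that are concatenated at run time.
    chunks = [src[i:i + size] for i in range(0, len(src), size)]
    quoted = ['"' + chunk.replace('"', '\\"') + '"' for chunk in chunks]
    return "eval(" + "+".join(quoted) + ")"

def js_base64(src: str) -> str:
    # Base64 encoding: the browser's atob() decodes the text at run time.
    encoded = base64.b64encode(src.encode("utf-8")).decode("ascii")
    return f'eval(atob("{encoded}"))'

variants = {
    "reverse": js_reverse(PAYLOAD),
    "split": js_split(PAYLOAD),
    "base64": js_base64(PAYLOAD),
}

for name, code in variants.items():
    # The literal signature is present in the plain payload but in none of the variants.
    print(name, SIGNATURE in code)
```

All three variants still run the same statement in a browser, yet none of them contains the literal text the scanner is looking for. A benign minifier or packer can produce output that looks much the same.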

  3. Code unfolding

Code unfolding is the mechanism by which new code is introduced at run time. In JavaScript, this is done by calling functions such as document.write() and eval() to execute obfuscated portions of code (Weaknesses in Defenses Against Web-Borne Malware).
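As a rough illustration, a static scanner could list the textual call sites of the two functions named above. This is a minimal sketch, assuming a simple regular expression is good enough to find the calls; the function name and the sample script are hypothetical.

```python
import re
from typing import List, Tuple

# Matches textual calls to eval( or document.write( that are not part of a longer identifier.
SINK_PATTERN = re.compile(r"(?<![\w$])(eval|document\.write)\s*\(")

def find_unfolding_calls(js_source: str) -> List[Tuple[int, str]]:
    """Return (line_number, function_name) for each textual call that introduces code at run time."""
    hits = []
    for lineno, line in enumerate(js_source.splitlines(), start=1):
        for match in SINK_PATTERN.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits

# Hypothetical sample: two direct calls and one call hidden behind a computed property name.
sample = "\n".join([
    'var s = "al" + "ert(1)";',
    "eval(s);",
    'document.write("<b>hi</b>");',
    'var e = window["ev" + "al"]; e(s);',
])

print(find_unfolding_calls(sample))
```

The scan reports the direct calls on lines 2 and 3 but misses line 4, where the function name is only assembled at run time. That gap is why purely textual checks are easy to evade.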

  4. Heap spray

This attack mainly targets web browsers. The attacker's script fills the heap with many copies of attacker-controlled data (typically shellcode), so that a memory corruption vulnerability that redirects execution into the heap is likely to land on that data and lead to remote code execution (BuBBle: A Javascript Engine Level Countermeasure against Heap-Spraying Attacks).

  5. Drive-by download

Drive-by download attacks consist of downloading and executing or installing malicious programs without the user's consent. Such attacks work by exploiting vulnerabilities in browsers, in their add-ons or plugins such as ActiveX controls, or in unpatched software such as Acrobat Reader and Adobe Flash Player (Drive-by download attacks: effect and detection methods, MSc Information Security).

  6. Multi execution paths

It is possible to trigger an action only if certain conditions are fulfilled. Such conditions could be the arrival of a given date or the existence of a file on the system on which the malware is meant to run. Another well-known example is a denial of service attack that is launched only once the number of nodes in the botnet has reached a certain value. That is the notion of multi execution paths (Exploring Multiple Execution Paths for Malware Analysis).
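The consequence for dynamic analysis can be shown with a toy model. The sketch below is not real malware analysis code: the trigger date, the marker file and the action names are all hypothetical, and the function only models which actions an observer would see on a given run.

```python
from datetime import date

def observed_actions(today: date, marker_file_present: bool) -> list:
    """Model a script whose malicious branch is gated by two hypothetical triggers."""
    actions = ["render_page"]
    # Hypothetical triggers: a date has been reached and a marker file exists on the host.
    if today >= date(2015, 12, 25) and marker_file_present:
        actions.append("download_payload")
    return actions

# A sandbox run before the trigger date, on a clean machine, sees only benign behavior.
sandbox_run = observed_actions(date(2015, 8, 17), marker_file_present=False)

# The same script on a victim machine after the trigger date takes the other path.
victim_run = observed_actions(date(2015, 12, 25), marker_file_present=True)

print(sandbox_run, victim_run)
```

A detector that observes a single execution sees only one of these paths, which is the problem the cited work on exploring multiple execution paths addresses.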

  7. Implicit conditionals

This technique is mainly used against detectors based on dynamic analysis. The main idea is to execute a set of instructions while hiding the condition that triggers it (Weaknesses in Defenses Against Web-Borne Malware).

Given these common features and tactics used by JavaScript malware, if you want to detect this type of malware as you asked, you first need to study the state of the art in detection methods. Various methods have been developed to detect web (JavaScript) malware. We can divide them into two main categories:

  1. Machine learning based classifiers
    • Features: Distinguishing features are extracted from HTML and JavaScript code and then used to train a machine learning classifier. The premise of this approach is that malicious webpages are likely to differ from benign ones (Thesis: Effective Analysis, Characterization, and Detection of Malicious Web Pages).
    • Advantages: Lightweight approach, useful for analyzing websites in bulk.
    • Drawbacks: Ineffective against obfuscated JavaScript code and useless against new malicious code patterns or zero-day attacks.
  2. Dynamic methods
    • Features: Based on dynamic behavior analysis, these techniques are implemented using either proxies, where a page is rendered to the visitor only after its safety has been checked, or a sandboxing environment relying on honeyclients (Same thesis: Effective Analysis, Characterization, and Detection of Malicious Web Pages).
    • Advantages: Efficient against zero-day attacks and obfuscated code.
    • Drawbacks: Resource- and time-consuming. Sandboxing environments rely on low-interaction honeyclients, which are themselves based on virus signatures and thus suffer from the same disadvantages as the static methods.

What you have tried to do belongs to the first category.

Now that you are informed about all this, it can be useful to study some of the available tools dedicated to this purpose before implementing your own technique. Let me mention four important tools among many others:

  1. ZOZZLE

ZOZZLE relies on Bayesian classification of features taken from the JavaScript abstract syntax tree (AST). It is classified as a mostly static web malware detector because it embeds another engine that supervises JavaScript code execution at run time. Its authors claim a very low false positive rate of 0.0003% and the ability to process over one megabyte of HTML and JavaScript code per second. The tool is intended to be used as a browser plugin; its aim is to protect browsers against heap spray attacks.

How does ZOZZLE operate? The following figure summarizes its core (ZOZZLE: Fast and Precise In-Browser JavaScript Malware Detection):

[Figure: overview of ZOZZLE's core]

  • Extraction and labeling phase: The classifier needs training data, which is extracted from obfuscated JavaScript code. Instead of developing an efficient de-obfuscation technique, the authors intercept calls to the Compile function located in the jscript.dll library. This is a smart way to obtain plain JavaScript code, because Compile is called each time a <SCRIPT> or <IFRAME> tag is processed or eval() or document.write() is called; each such call also defines a code context. Each code context is saved to the hard drive for further analysis.

  • Feature selection: Each code context is labeled as benign or malicious, and candidate features are drawn from its JavaScript AST. The features are pre-selected with a chi-squared (χ²) statistic computed from the following counts:

  • A: malicious context with feature
  • B: benign context with feature
  • C: malicious context without feature
  • D: benign context without feature

  • Classification: A Bayesian classifier is used because, although it may seem dated, it gives good results in practice and is fast.
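The feature selection and classification steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not ZOZZLE's implementation: `chi_squared` is the textbook Pearson statistic for a 2×2 table built from the counts A to D (the exact form used in the cited paper may differ, for example in its normalization), the classifier is a plain Bernoulli naive Bayes with Laplace smoothing, and the feature strings and training samples are hypothetical.

```python
import math
from collections import Counter

def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared for a 2x2 table.

    a: malicious contexts with the feature, b: benign contexts with it,
    c: malicious contexts without it, d: benign contexts without it.
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the feature carries no signal
    return n * (a * d - b * c) ** 2 / denominator

def train(samples):
    """samples: list of (set_of_features, label) pairs."""
    class_counts = Counter(label for _, label in samples)
    feature_counts = {label: Counter() for label in class_counts}
    vocabulary = set()
    for features, label in samples:
        feature_counts[label].update(features)
        vocabulary |= features
    return class_counts, feature_counts, vocabulary

def classify(model, features):
    """Return the label with the highest log-probability under naive Bayes."""
    class_counts, feature_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for feature in vocabulary:
            # Laplace smoothing keeps every probability strictly between 0 and 1.
            p = (feature_counts[label][feature] + 1) / (n + 2)
            score += math.log(p if feature in features else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled contexts; real features would come from the AST.
training = [
    ({"loop:shellcode", "string:%u0c0c"}, "malicious"),
    ({"loop:shellcode", "call:unescape"}, "malicious"),
    ({"call:getElementById"}, "benign"),
    ({"call:getElementById", "call:addEventListener"}, "benign"),
]
model = train(training)

print(chi_squared(90, 10, 10, 90))
print(classify(model, {"string:%u0c0c", "call:unescape"}))
print(classify(model, {"call:getElementById"}))
```

A feature would be kept only when its statistic exceeds a chosen threshold. For one degree of freedom, 10.83 corresponds to a significance level of 0.001; which threshold to use is a design choice and is assumed here, not taken from the text above.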

  2. Prophiler

Prophiler follows the static approach to detecting web malware. It combines static analysis of features of the HTML and JavaScript code, including uniform resource locators (URLs). It then uses machine learning techniques to train a classifier that decides whether a webpage embeds malicious content. Prophiler does not analyze suspicious webpages any further itself; instead, it forwards them to third-party tools such as Wepawet (Prophiler: A Fast Filter for the Large-Scale Detection of Malicious Web Pages).

  3. SpyProxy

SpyProxy follows the dynamic analysis principles. It monitors the active content of webpages within a virtual machine before deciding to render them to the visitor or not. The architecture of SpyProxy is illustrated through this figure (SpyProxy: Execution-based Detection of Malicious Web Content):

[Figure: SpyProxy architecture, showing steps (a), (b) and (c)]

  • (a): The proxy performs a static analysis of the requested page. If it judges the page likely to be malicious, it forwards it to the virtual machine (VM). Basically, only pages with active content are forwarded to the VM.
  • (b): The virtual machine loads the suspicious pages to monitor their activities.
  • (c): Only benign pages are returned to the proxy, which in turn forwards them to the user's browser.
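The three steps above can be sketched as a small decision function. This is a simplified sketch, not SpyProxy's code: the static check is reduced to a regular expression for active content, and `run_in_vm` is a hypothetical callable standing in for the virtual machine, assumed to return True when the page misbehaved.

```python
import re

# Step (a), simplified: treat a page as needing a VM check if it has active content.
ACTIVE_CONTENT = re.compile(r"<\s*(script|object|embed|applet|iframe)\b", re.IGNORECASE)

def needs_vm_check(html: str) -> bool:
    return ACTIVE_CONTENT.search(html) is not None

def proxy_fetch(html: str, run_in_vm):
    """Return the page to forward to the browser, or None if it is blocked.

    run_in_vm is a hypothetical callable for step (b): it loads the page in a
    virtual machine and returns True if malicious activity was observed.
    """
    if needs_vm_check(html) and run_in_vm(html):
        return None  # malicious pages are never forwarded
    return html      # step (c): benign pages go back to the user's browser

# Stub VM used only for this illustration; it records which pages it was asked to load.
vm_calls = []

def fake_vm(html: str) -> bool:
    vm_calls.append(html)
    return "exploit()" in html

static_page = proxy_fetch("<p>hello</p>", fake_vm)
benign_script = proxy_fetch("<script>render()</script>", fake_vm)
malicious_script = proxy_fetch("<script>exploit()</script>", fake_vm)
print(static_page, benign_script, malicious_script, len(vm_calls))
```

Skipping the VM for pages without active content keeps the expensive dynamic check off the common path, which is the motivation for step (a).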

  4. IceShield

ICESHIELD performs in-line dynamic code analysis using a set of heuristics to identify attack attempts. Its authors take an inventory of the attacks that usually target the DOM properties of a website and are carried out by injecting JavaScript into the website's source code. ICESHIELD supervises the running JavaScript code by predefining a set of rules related to function calls and applying heuristics to them in order to determine whether the script is malicious (IceShield: Detection and Mitigation of Malicious Websites with a Frozen DOM).

  • This is most likely the best answer I've seen on Information Security. It provides several terms which can be used for further research as well as a rich description of each term. It presents existing solutions with similar goals and explains how they work and could work. – Alex Aug 17 '15 at 15:14
  • One very small addition to this great answer: code obfuscation is mostly used by developers to reduce file sizes and load times, not to hide their pretty code from viewers. Only insane people would think that "hiding" is the reason, if you can "unhide" ([deobfuscate](http://reverseengineering.stackexchange.com/a/4562/13928)) it within a few clicks. – trejder Oct 23 '15 at 06:32
1

Your attempt to recognize malicious JavaScript will certainly fail at the character level. I doubt there is any difference in the first place, and even if there were, the author could easily obfuscate their code and get a mostly uniform distribution of characters.

I believe a more fruitful approach would be to detect combinations of particular libraries being used, and how often functions from each library are called. Perhaps there will be some patterns typical of malicious code (for example, authentication functions are called but no real work is done), but that's just a wild guess.
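A quick experiment shows how little a character-level statistic says about behavior. The sketch below is only an illustration: the definition of "uncommon" (anything other than ASCII letters, digits and whitespace) and the sample statement are assumptions, and Python is used just to measure the JavaScript text.

```python
import base64
import string

# Assumption for this sketch: "common" characters are ASCII letters, digits and whitespace.
COMMON = set(string.ascii_letters + string.digits + string.whitespace)

def uncommon_ratio(source: str) -> float:
    """Fraction of characters in the source that fall outside the common set."""
    if not source:
        return 0.0
    return sum(ch not in COMMON for ch in source) / len(source)

# The same hypothetical statement, written plainly and wrapped in Base64.
plain = 'alert("hi")'
encoded = base64.b64encode(plain.encode("utf-8")).decode("ascii")
wrapped = f'eval(atob("{encoded}"))'

print(uncommon_ratio(plain), uncommon_ratio(wrapped))
```

Both scripts do the same thing when run in a browser, yet the wrapped version has a lower share of uncommon characters than the plain one. Since the author of the code chooses the encoding, the author also chooses the score.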

Dmitry Grigoryev
  • Obfuscation itself produces predictable characters that non-malicious code is not likely to have. Most of the malicious JavaScript I've read does not use unusual libraries. – schroeder Aug 17 '15 at 20:35