Hacked: Can a UTF-8 encoded script execute non-UTF-8 characters?

Question

To be honest, I'm not really sure of the best title for this question, or of its full scope, but the motivation behind it is:

Motivation

Assume your server was hacked. You open your UTF-8 encoded PHP script and find a block or several lines of characters that mean nothing to you, most of them rendering in the UTF-8 character set as "????????????? hacker.ru".

I'm trying to get a grasp on what this could be and do:

Thoughts I'm considering

Perhaps the font selected in the text editor doesn't support those characters?
Perhaps those characters were copied and pasted from a non-UTF-8 document into the UTF-8 document?
- Crude example:
  - non-UTF-8 binary 1111->A
  - UTF-8 binary 1111->B
  - Effectively copying bits that don't map properly
Is there a way to properly display those characters?
My priority question about these characters: can I assume that these non-mapping characters do nothing (i.e., they don't execute, and so do no damage)?
Are programming languages multilingual?
- Can I write PHP in Russian?
- Can I write PHP in English and Russian in the same file?
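
The "copied from non-UTF-8" idea and the display question can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the stray text was Russian saved as Windows-1251; the sample word and the list of candidate encodings are hypothetical and are not taken from the hacked file.

```python
# Hypothetical sample: a Russian word stored as Windows-1251 bytes, not UTF-8.
raw = "привет".encode("cp1251")

# Reading those bytes as UTF-8 cannot map them to characters. With
# errors="replace" each unmappable byte becomes U+FFFD, which many editors
# render as "?" or a box.
as_utf8 = raw.decode("utf-8", errors="replace")


def try_decodings(data, candidates=("utf-8", "cp1251", "koi8_r")):
    """Decode the same bytes under several candidate encodings.

    Returns a dict of encoding name -> decoded text, or None when the bytes
    are not valid in that encoding. The candidate list is only a guess.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


decoded = try_decodings(raw)
for name, text in decoded.items():
    print(name, "->", repr(text))
```

If one of the candidates yields readable text, that is a hint about the encoding the bytes were written in. It says nothing about whether the surrounding code is harmful.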

Assumption: if I or anyone else opens a UTF-8 encoded file and types into it, in any language or character set, the editor will map the characters and display them properly.
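
A rough way to check that assumption against a suspicious file is to look for lines whose bytes are not valid UTF-8. The sketch below is a minimal example under stated assumptions: the helper name `find_invalid_utf8_lines` and the sample content are hypothetical, and a real check would read the script's bytes from disk (for example with `open(path, "rb")`).

```python
def find_invalid_utf8_lines(data: bytes):
    """Return 1-based numbers of the lines whose bytes are not valid UTF-8."""
    bad = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(lineno)
    return bad


# Hypothetical file content: English and Russian typed as UTF-8 (lines 1-2),
# followed by a line pasted in as Windows-1251 bytes (line 3).
sample = (
    "<?php echo 'hello';\n".encode("utf-8")
    + "// привет\n".encode("utf-8")
    + "// привет hacker.ru\n".encode("cp1251")
)

print(find_invalid_utf8_lines(sample))
```

In this sample only the third line is reported. English and Russian typed as UTF-8 coexist in the same file without trouble, while the pasted Windows-1251 bytes do not form valid UTF-8.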

Can anyone shine some light on this subject?

Note: if you notice that there is suddenly PHP code on your webserver which you didn't put there, figuring out what it does should not be your primary priority. For more information, see [How do I deal with a compromised server?](https://security.stackexchange.com/questions/39231/how-do-i-deal-with-a-compromised-server) — Philipp, May 15 '17 at 18:22
@Philipp thank you, excellent read, which I read fully and took notes. However, this question is specific to the actual point made in item 3. of Understand the problem fully: "Examine the 'attacked' systems again, this time to understand where the attacks went, so that you understand what systems were compromised in the attack. Ensure you follow up any pointers that suggest compromised systems could become a springboard to attack your systems further." — Timothy L.J. Stewart, May 15 '17 at 19:13

score 3 · Answer 1 · answered May 15 '17 at 20:19

In short, yes: UTF-8 characters can be used in an attack chain. A few cases have crossed my path. From what I have read about this method of attack, it is the last step used to "drop shell" on a native system at the end of an attack chain. A quick Google search turned up this article:

"Using UTF-8 Encoding to Bypass Validation Logic"

The article goes on to explain exactly how this particular method is used. Here's the executive summary.

"This attack is a specific variation on leveraging alternate encodings to bypass validation logic. This attack leverages the possibility to encode potentially harmful input in UTF-8 and submit it to applications not expecting or effective at validating this encoding standard making input filtering difficult. UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. Legal UTF-8 characters are one to four bytes long. However, early version of the UTF-8 specification got some entries wrong (in some cases it permitted overlong characters). UTF-8 encoders are supposed to use the "shortest possible" encoding, but naive decoders may accept encodings that are longer than necessary. According to the RFC 3629, a particularly subtle form of this attack can be carried out against a parser which performs security-critical validity checks against the UTF-8 encoded form of its input, but interprets certain illegal octet sequences as characters."

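The overlong-encoding problem described in the summary can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from the article: `naive_decode_2byte` is a hypothetical, deliberately unsafe decoder, and the byte pair `C0 AF` is the commonly cited overlong form of `/`.

```python
def naive_decode_2byte(b1: int, b2: int) -> str:
    """Decode a two-byte UTF-8 sequence without rejecting overlong forms.

    Deliberately unsafe: it only strips the marker bits and joins the payload.
    """
    return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))


# Overlong two-byte form of '/' (U+002F); the shortest form is the byte 0x2F.
overlong_slash = bytes([0xC0, 0xAF])

# A validation step that inspects the raw bytes finds no '/' to block.
filter_sees_slash = b"/" in overlong_slash

# The naive decoder still turns the pair into '/'.
naive = naive_decode_2byte(overlong_slash[0], overlong_slash[1])

# A strict decoder, such as Python's built-in one, rejects the sequence.
try:
    overlong_slash.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

print(filter_sees_slash, repr(naive), strict_rejects)
```

The gap between what the byte-level check sees and what the naive decoder produces is the bypass the summary describes. Validating after strict decoding closes it.
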
This subject has come up before, although I cannot find the use case at the time of writing. If I find it later, I will add it as a comment.

What I remember is an attack chain that had gained entry to a native system and sat dormant. It had got through the firewall, but further access was denied. The program then transformed itself into a UTF-8 file and waited until it could get past the security software. Once the hole was opened, because of the character bit coding, the Python program was able to open a command prompt, or "drop shell". The attacker then had full root access. It then opened a gateway for another part of the program lying in wait.

It's very similar to the study "Using UTF-8 Encoding to Bypass Validation Logic" in the way it masks itself and becomes unreadable to the security software. The methods of attack are very similar.

Methods of Attack
1. Injection
2. Protocol Manipulation
3. API Abuse

In the case I had read about, the objective of the program was to penetrate as deeply as possible. If blocked, it transformed itself into UTF-8 and performed its attacks in that prescribed manner. If successful, it opened a gateway to another part of the malware lying farther down the attack chain.

Got to say, it's interesting to think about. The way I see it, if you can write code that outperforms the limited scope of another system, your attack will succeed. If the defending code has constraints and the attacking code has choices and options that were never programmed into the scope of the defender, then it's an obvious loss for the defender, especially when the attacker can be an AI or machine learning system.

The article is attributed to the CAPEC Content Team, The MITRE Corporation, 2014-06-23, Internal_CAPEC_Team.
