Malware prevention of data received in a POST endpoint

Question

Problem

We have an HTTPS API with an endpoint, POST data. How can we check that the data received is not malicious or otherwise harmful?

To consider (please read)

POST data receives 2 payload parameters: key and data. key can be any valid string (it even allows a file extension, e.g. malicious.bat) and data is a string or multipart/form-data.
GET data returns the data as a downloadable file whose file name is the key parameter, and sets the Content-Type to application/octet-stream.
This allows a client to post something like key: danger.bat, data: !#/bin/bat\ndb_destroy; and, when it is retrieved, a file named danger.bat is downloadable.
In the current implementation (for whatever reason) I can only implement a new POST data endpoint from scratch, so as not to break current API clients. I am quite free on that side, but I can't do much on GET data.
A virus scan is out of the question, due to the high volume of this API.
It's clear that (unless someone knows otherwise) there is no guarantee of securing the API; what we need here is the best prevention achievable on a best-effort basis.

Question

What is the best way to validate the data and scan it for possible malware, viruses, XSS (cross-site scripting) and other malicious content sent by a client over HTTP through an API?
What are the possible dangers of letting a client create/read any type of string in a database without checking its content?

Scenario

With an API, a client can connect to the endpoint /postdata, which currently accepts any Content-Type. The data can be sent as form-data or URL-encoded base64, as any type of string or bytes. Once the request is received, there is no check: the data is only converted from a string into base64 bytes and then inserted into a database as bytes, more specifically into a Couchbase database. The user also specifies a name for the data.

In another endpoint called /getData the client can retrieve that data. So a sending client can send things like (as strings):

HTML files with possible JavaScript in them:
- Name: virus.html
- Data: <!DOCTYPE html><html lang="en"><head></head><body><script>alert("Virus!!!!")</script></body></html>
Shell scripts
- Name: virus.bat
- Data: #!/bin/bash echo "Virus";
Images
- Simply converted into base64 if sent as a URL parameter or as a field in form-data (multipart/form-data)
Executables as bytes.
Shell scripts

Also worth pointing out: once this data is in the database, it is not processed or used, only released to the client when it requests it via the /getData endpoint.
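To make the flow concrete, here is a minimal sketch of the behaviour described in this scenario. It is not the real service: the in-memory dict stands in for the Couchbase bucket, and the function names `post_data` and `get_data` are hypothetical.

```python
import base64

# Assumption: a plain dict stands in for the Couchbase bucket.
_db = {}


def post_data(key: str, data: str) -> None:
    """Store the payload as the scenario describes: no validation at all."""
    # The string is only converted to base64 bytes before being stored.
    _db[key] = base64.b64encode(data.encode("utf-8"))


def get_data(key: str) -> dict:
    """Return what the download endpoint is described as sending back."""
    return {
        # The file name is the client-chosen key, taken verbatim.
        "filename": key,
        "content_type": "application/octet-stream",
        "body": base64.b64decode(_db[key]),
    }


post_data("virus.html", '<script>alert("Virus!!!!")</script>')
response = get_data("virus.html")
print(response["filename"], response["body"])
```

Whatever goes in comes back out byte for byte under the name the uploader picked, which is the whole concern of this question.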

The challenge I'm facing is that there is currently no check on the string sent by the client, so it could technically send malicious data. But how can you check a string against all the different possible dangers?

Example

schroeder actually gave a good example in the comments:

how is the risk different from hosting an html file on OneDrive, Google Drive, Dropbox, etc.?

Now, the back-end service I'm dealing with is not like the above-mentioned apps; however, a similar use case would be Google Drive, with you as the security developer:

user_a puts virus.html in his drive and shares it on the web.
user_b clicks on the link to virus.html, downloads it without any problem, and his PC blows up, or whatever malicious thing virus.html was meant to do.

Where is the risk?

I can see 2

How would you implement a check (if any is possible) for user_a when he uploads virus.html? Can you prevent it somehow?
How would you implement a check (if any is possible) for user_b when he downloads virus.html?

What's the risk?

For example, someone can send data in the form of HTML (like the example above) and also give it a name with an .html file extension. If you then request that data, the file is downloaded.

Structure

API Back-end: Python
Database: Couchbase with Bytes data type

If you need to allow all possible strings, then you can't check for all possible threats. If you can allow only specific types of data, then you check for that. As stated, I'm not sure this question is answerable. You also have to know how the *client* processes the data in order to determine what the potential impacts are in how the string is processed. — schroeder, Jan 02 '23 at 15:27
We only know the incoming request, from the headers and the data field (that is just a string). There is almost no checking of the content. Then we just save it into a NoSQL DB. On the GET that reads from the database, the headers are set to `application/octet-stream`. If the data was a string like `html` and the name was `file.html`, then the client will receive it as HTML. You could open the file, which opens the browser, etc. — Federico Baù, Jan 02 '23 at 15:38
@schroeder is it clear what the problem is? If you need more info please let me know. I know what I am asking sounds weird, but let's say that this is something already built, the possible changes are minimal, and that's more or less all there is to know. Given this information, is the situation even salvageable? — Federico Baù, Jan 02 '23 at 15:40
@schroeder basically yes. There is a back-end built that allows any type of data in the `data` field of its endpoint. It is easily sent by adding data encoded in the URL, like `?data=blablaIamMaliciousBlabla`, or in form-data — Federico Baù, Jan 02 '23 at 15:41
So, how is the risk different from hosting an html file on OneDrive, Google Drive, Dropbox, etc.? — schroeder, Jan 02 '23 at 16:10
The difference is that I am the developer responsible right now for thinking about this possible risk :D. Actually it is the same thing I was wondering: how are these other companies implementing a possible solution in their own back-ends (hence why I posted this question here)? I could actually give an example with Google Drive (I'll add it to the question in 2 minutes). To be honest, I think in my case the biggest flaw lies in how the system was initially designed. — Federico Baù, Jan 02 '23 at 16:14
They don't worry about that risk. You are not responsible for everything everyone does on your system, that's why I've been trying to get you to define the risk. This might simply be something that is not your responsibility. — schroeder, Jan 02 '23 at 16:17
@schroeder ok ok, but my back-end service is not them! Let's say that everything is internal, and there is some type of hack/malware (I don't recall the name right now) that is introduced by the people actually developing (normally by mistake). So, is there a risk? Yes! One employee could upload something by mistake, for example, blow up the entire database, and millions are gone. So there may at least be some way to prevent it. Yes, the user made a mistake, which is human, but can we prevent it? — Federico Baù, Jan 02 '23 at 16:24
But that's a totally different risk and something that you can counter with something besides "filter all the potentially bad things!" I'm afraid that there is not going to be a concise, contained, easy-to-answer list of filters to apply. — schroeder, Jan 02 '23 at 19:30
@schroeder ok. Yes, I don't mean to find a final solution, honestly, but more of a "best effort" to salvage something that has been badly implemented since day 1, and to get an outside point of view. In the end I will make a decision, which right now is at least to put in more constraints, for example disallowing some content types and file extensions. Not 100% risk-free, but better than nothing — Federico Baù, Jan 02 '23 at 21:51
I think the most you can do in this scenario is to prevent injections by using prepared statements, and then running an anti-virus scan on any files that are uploaded. If these are just blobs, then there shouldn't be a risk of execution on the server-side, but once converted to a file (I guess that happens only client-side?) it's up to the end-user to check for viruses. You can't force a client-side check. DO NOT rely on provided file extension if accepting uploaded files. You should set both file name and extension server-side. — pcalkins, Jan 12 '23 at 21:56
In my question one requirement is: **A virus scan is out of question, due to high volume of this API.** | *"You can't force a client-side check."*: well, that's clear, I never mentioned that. | *"You should set both file name and extension server-side."*: hmm, why? The client will send a `key` that could include a file extension and then, when requesting it, expects the same `key`. The data is stored in a database (Couchbase) and never as an actual file. One of my solutions was to allow only specific file extensions rather than all of them. — Federico Baù, Jan 13 '23 at 06:29
please clarify whether the user must retain their chosen filename? or whether your service can take control and rename? you can have hundreds of char in the url (or say ~255 char if you ever decide to go file-system backed instead of db), so it could be worth storing a hash of the content along with a sig. that can be verified (~130 b64 char) - `upload > scan > hash_sha256 > sign_ed25519 > rename > store` .. you can use the hashes to do continuous lookups against any "potentially unwanted" listings and respond accordingly, and as an added bonus, you can detect probable duplicate content — brynk, Jan 15 '23 at 00:08
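The constraint floated in the comments above (accepting only certain file extensions and content types on a new POST endpoint) could be sketched roughly as follows. The allow-lists, the key pattern and the function name `validate_upload` are all hypothetical choices for illustration, and, as pcalkins cautions, an extension check says nothing about what the bytes actually contain.

```python
import re

# Hypothetical allow-lists; a real service would choose its own.
ALLOWED_EXTENSIONS = {".txt", ".json", ".csv", ".png", ".jpg"}
ALLOWED_CONTENT_TYPES = {
    "application/x-www-form-urlencoded",
    "multipart/form-data",
}

# A simple name with exactly one extension: rejects "a.txt.bat", "../x", etc.
KEY_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,100}\.[A-Za-z0-9]{1,10}")


def validate_upload(key: str, content_type: str) -> list:
    """Return a list of human-readable problems; empty means accepted."""
    errors = []
    if not KEY_PATTERN.fullmatch(key):
        errors.append("key must be a simple name with a single extension")
    else:
        extension = "." + key.rsplit(".", 1)[1].lower()
        if extension not in ALLOWED_EXTENSIONS:
            errors.append(f"extension {extension} is not allowed")
    # Drop parameters such as "; boundary=..." before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ALLOWED_CONTENT_TYPES:
        errors.append(f"content type {media_type} is not allowed")
    return errors


print(validate_upload("notes.txt", "multipart/form-data; boundary=abc"))
print(validate_upload("danger.bat", "multipart/form-data"))
```

This only narrows what names and request types are accepted; a renamed executable still passes, so it is a constraint rather than a malware check.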

score 4 · Accepted Answer · answered Jan 13 '23 at 09:37

What's the risk?

For example, someone can send data in the form of HTML (like the example above) and also give it a name with an .html file extension. If you then request that data, the file is downloaded.

You ask a very good question here, and then... don't answer it. The situation, as described, is not a risk. (Or rather, it's not a risk that you need to worry about). Downloading a file is safe; browsers do it constantly all over the place and it does no harm whatsoever. People downloading and then executing a file that they don't (or shouldn't) trust is on their own heads, and it's not your job to block it any more than it is e.g. GitHub's job to prevent people from downloading a library with a vulnerability in it, even though that library could make their software insecure. You aren't (or at least, absolutely should not be) making any claims about the safety of the downloadable files; it's up to your users to exercise a modicum of caution and sense.

With that said, the claim that virus scanning is out of the question is weird; virus scans are very fast, and trivial to parallelize (not for any individual file, but across files). For literally any size of network pipe, you should easily be able to set up enough hardware behind it to virus scan every single upload, even at max bandwidth usage, via load balancing. This isn't necessarily worth doing; antivirus scans miss a ton of stuff, and also have lots of false positives, and might give some people a false sense of security and other people a frustrating time trying to upload benign content. Antivirus also increases the attack surface of your server such that if somebody finds an antivirus scanner flaw, they can use it to attack the server; your current scheme of treating uploads as opaque blobs is much safer. But if you really care about "best effort" prevention of malware uploads, you can use AV software. The AV scanner can probably work on arbitrary file streams with a little effort, but you can also just have the scanner hosts create temporary files, scan them, and then pass them on to the database after they pass (deleting the temp file once it's in the DB). Do this on a RAM disk and you don't even need to worry about SSD wear or similar.
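The temp-file approach could look roughly like the sketch below. It assumes ClamAV's `clamscan` command-line scanner is installed (exit code 0 means no match, 1 means a match, 2 means an error); that choice of scanner, the `scan_then_store` name and the dict standing in for the database are illustrative assumptions, not something this answer prescribes.

```python
import os
import subprocess
import tempfile

# Assumption: ClamAV's CLI is available on the scanner host.
DEFAULT_SCAN_CMD = ("clamscan", "--no-summary")


def scan_then_store(data, key, store, scan_cmd=DEFAULT_SCAN_CMD, tmp_dir=None):
    """Write the upload to a temp file, scan it, and store it only if clean.

    `store` is any dict-like object standing in for the database.
    `tmp_dir` can point at a RAM-backed directory (e.g. /dev/shm on Linux).
    """
    fd, path = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # The scanner receives the temp file path as its last argument.
        result = subprocess.run([*scan_cmd, path], capture_output=True)
        if result.returncode != 0:
            # Match found or scanner error: fail closed and store nothing.
            return False
        store[key] = data
        return True
    finally:
        # The temp file is removed whether or not the scan passed.
        os.remove(path)
```

Because each call is independent, requests can be spread across as many scanner hosts as the upload volume needs.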

As for dangers of letting people upload content:

Resource exhaustion (denial of service/cost increases). Unless you charge people for their upload/storage size, your site will probably get used to store people's terabytes of videos and whatnot. Storage is cheap and de-duplication tools exist, but the Internet can definitely generate more uploads than you can store.
Authentication of uploads. If uploads are identified only by a string, then either there's nothing to prevent one user from overwriting another user's upload (not necessarily maliciously; could easily be an accident)... or people will be constantly needing to find unique names to give their uploads. The latter isn't necessarily a disaster (you see it already with e.g. AWS S3 buckets) but it's a bad UX. You really should have a way to associate an upload with a specific user, and require a way to specify the unique upload when downloading (rather than just using the name string). This also makes it much easier to handle authorization of changes (overwrites, deletions, etc.) to uploads.
Authorization of downloads. If you're specifically creating a service where everybody can download anything uploaded, that's your prerogative and you can make the download links as usable as you want, but most of the time people want at least a little control over who can find the download links. That means either per-user authorizations (requiring downloaders be authenticated), or adding a unique, unpredictable token to the download URL, or adding a whole system of cryptographically signed authorizations to the download request (as seen in e.g. AWS S3 signed requests).
Inline document opening. You should set Content-Disposition: attachment (probably with the file name included) to prevent the file from being opened as content of your own domain. Downloading the file and then opening it from client device storage is generally no problem; file: URIs are treated as a special, low-privilege origin that can't do anything any other page on the web couldn't do - which isn't much, thanks to same origin policy - and in any case "don't run files you don't trust" remains very good advice.
Illegal content. The archetypal example is child pornography; you do not want to be found to host certain kinds of content. Avoiding that is basically the same as running antivirus (and more important); there's tools for detecting illegal content, and you should probably run them on all uploads.
Abuse of the service. Whatever terms of use you're going to have, it'll be up to you to enforce them. People will try all sorts of things - in the early days of gmail there was a project to use your gmail mailbox as an online file system, because it gave gigabytes of storage for free - and that's without even considering people specifically looking to attack your system with e.g. DDOS. You need a way to detect misuse of the system and limit or block offenders.
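A few of the points above (per-user upload identity, unpredictable download tokens, and `Content-Disposition: attachment`) can be sketched with the standard library alone. The function names and the `user_id::key` layout are hypothetical; the header format follows the usual `filename` plus `filename*` convention for non-ASCII names.

```python
import re
import secrets
from urllib.parse import quote


def storage_key(user_id: str, key: str) -> str:
    """Namespace the client-chosen key by uploader so users cannot collide."""
    return f"{user_id}::{key}"


def new_download_token() -> str:
    """Unpredictable token to embed in a download URL (32 random bytes)."""
    return secrets.token_urlsafe(32)


def download_headers(key: str) -> dict:
    """Headers that force a download instead of inline rendering."""
    # Conservative ASCII fallback: quotes, CR/LF and the like become "_".
    fallback = re.sub(r"[^A-Za-z0-9._-]", "_", key) or "download"
    # Percent-encoded form carries the original name for clients that support it.
    encoded = quote(key, safe="")
    return {
        "Content-Type": "application/octet-stream",
        "Content-Disposition": (
            f'attachment; filename="{fallback}"; filename*=UTF-8\'\'{encoded}'
        ),
    }


print(download_headers("virus.html")["Content-Disposition"])
```

Sanitising the name before it reaches the header also keeps a hostile key from injecting extra header lines.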

score 1 · Answer 2 · answered Jan 12 '23 at 21:19


how can you check a string with all the different possible dangers?

You can't because:

that would mean you have a perfect AV product (which hasn't happened yet, all AV products have false negatives).
an attacker could very well encrypt their very malicious scripts in a secure (impossibly hard to decrypt without knowing the key) way and store them using your API, then send a victim a less malicious script that knows the key and downloads, decrypts and executes the very bad scripts from your API.
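The second point can be illustrated with a toy: a content check can only match bytes it recognises, and transformed bytes no longer match. In this sketch the signature list and `naive_scan` are made-up stand-ins for a real scanner, and a repeating XOR key stands in for real encryption (it is not secure, it only shows the effect).

```python
from itertools import cycle

# Toy signature list; real scanners are far more sophisticated.
SIGNATURES = [b"<script>", b"#!/bin/bash"]


def naive_scan(data: bytes) -> bool:
    """Return True if any known byte pattern appears in the data."""
    return any(signature in data for signature in SIGNATURES)


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """Stand-in for encryption: XOR with a repeating key (its own inverse)."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


payload = b'<script>alert("Virus!!!!")</script>'
key = b"k3y"
stored = xor_bytes(payload, key)

print(naive_scan(payload))                 # the plain upload is flagged
print(naive_scan(stored))                  # the transformed upload is not
print(xor_bytes(stored, key) == payload)   # whoever holds the key recovers it
```

The server only ever sees `stored`, so no amount of inspection at upload time can tell what it will become on the client.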

Your best bet would be to try to minimize the risk by using some kind of AV scanning (the specifics of which depend on a lot of things) and by having a way for people to report files or links to you.

answered by Zicar

So I understand that by AV you mean antivirus, right? Anyway, I see what you mean and I appreciate the answer; however, as I stated in the question: *A virus scan is out of question, due to high volume of this API.* Unless: *There is a way to implement a very lightweight antivirus that doesn't require external APIs*, which I don't think exists. So @zicar, your answer is valuable but I cannot accept it so far. I also think an AV is best, but it cannot be implemented. So, *what would you do?* Let's say you can't use AV: what else would you do? What is the best you can do to prevent this? – Federico Baù Jan 13 '23 at 06:21
It's clear that without an AV (and probably even with one) there is no guarantee that your API is secure, but this doesn't mean that you should not put some validations/preventions in place. So what I am looking for is: *What preventions / best effort would you implement?* – Federico Baù Jan 13 '23 at 06:33

Well, if you just want some lightweight prevention you could scan your data with [yara rules](https://yara.readthedocs.io/en/stable/); there are a lot of projects and repositories of rules ([this](https://github.com/InQuest/awesome-yara) curated list is just a tiny part of them) and there is also a [python module](https://yara.readthedocs.io/en/stable/yarapython.html) you could use. – Zicar Jan 13 '23 at 11:48
But, bear in mind there might be some false positives and also some clearly malicious files that might not be detected at all (it's not perfect, but it's the best effort solution in my opinion). Also, from experience, if you go this route I'd advise you to update the rules regularly and compile all the rules you want to scan with inside one big file when updating and in your API just load the compiled rules after each update (you could also cache the results using the hash of the uploaded data). – Zicar Jan 13 '23 at 11:58
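The caching idea in the last comment (keying scan results by a hash of the uploaded data, and reloading rules after each update) might look like the sketch below. The scanner is deliberately pluggable: `toy_scan` is a made-up stand-in, and with yara-python the `scan_fn` would presumably wrap a call such as `rules.match(data=...)` on rules loaded once per update. `CachedScanner` itself is a hypothetical name.

```python
import hashlib
from typing import Callable, Dict, List


class CachedScanner:
    """Run a scan function at most once per distinct payload."""

    def __init__(self, scan_fn: Callable[[bytes], List[str]]):
        self._scan_fn = scan_fn
        self._cache: Dict[str, List[str]] = {}

    def reload_rules(self, scan_fn: Callable[[bytes], List[str]]) -> None:
        """Swap in freshly compiled rules; old verdicts may be stale, so drop them."""
        self._scan_fn = scan_fn
        self._cache.clear()

    def scan(self, data: bytes) -> List[str]:
        """Return the names of matched rules, using the cache when possible."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._cache:
            self._cache[digest] = self._scan_fn(data)
        return self._cache[digest]


def toy_scan(data: bytes) -> List[str]:
    """Stand-in for a real rule engine: one hard-coded pattern."""
    return ["has_script_tag"] if b"<script>" in data else []


scanner = CachedScanner(toy_scan)
print(scanner.scan(b"<html><script>alert(1)</script></html>"))
print(scanner.scan(b"plain text"))
```

On a high-volume API with many repeated uploads, the hash lookup means identical payloads cost one scan rather than one per request.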
