All your xor'ed data are belongs to us

Deriving plaintext from ciphertext only

May 16, 2021

Disclaimer: The views, methods, and opinions expressed at Anchored Narratives are those of the author and do not necessarily reflect the official policy or position of my employer.

https://images.fineartamerica.com/images-medium-large-5/mystery-gate-olivier-le-queinec.jpg — Cover: Mystery Gate Photograph by Olivier Le Queinec

Background

After reading my last article on the StrongPity malware, my former colleague from Hoffmann Investigations and good friend Oscar Vermaas reached out with an interesting malware problem he worked on a couple of weeks ago. During our Hoffmann days, I spend many nights with Oscar working on the famous Diginotar breach investigation. Although our report has never been published, I still believe it is one of the best digital forensic reconstructions we ever did. I have worked on many Advanced Persistent Threat cases, but the Diginotar case is still one of the highlights of my career. Although Oscar’s passion is more into breaking security, he is an expert that can really break down a problem and solve it. Therefore the next anchored narrative will be more technical but focuses on XOR, frequency analysis, and a bit of crypto. But hang in there this exciting puzzle is also going to be solved. Very happy to have Oscar onboard with the next anchored narrative!

Introduction

Rule 1 of cryptanalysis - check for plaintext

Last week I had to de-obfuscate some javascript. After removing the first layer of obfuscation, I was left with a decrypt function and a large blob of ciphertext.

Figure 1 shows a similar sample, found on Pastebin. It is in fact the code that is the result of removing the first layers of obfuscation. The key of the decrypt function is 10 bytes long and initialized. However, the key is mangled at runtime with some XOR-values before it is actually used in the decrypt function.

Figure 1:Encryption function found in malware sample on Pastebin

The key is dependent on the source code text of the javascript itself. This means that if, for example, for debugging purposes, you put a console.log somewhere, the text of the function changes, which in turn changes the key.

Also, the runtime key mangling is dependent on the body text of some of the calling functions, and the first layer of obfuscation (not shown here) makes it hard to follow the call stack, let alone pick the correct function text which is mixed in with the key. Actually, in the Pastebin post, the calling functions are not even there.

The screenshot below shows the second part of the paste. It is the application of the decrypt function on the ciphertext.

Figure 2: Screenshot of the Pastebin sample

So basically, we have a decryption (also encryption because of symmetry) function, an unknown key of 10 bytes, and the ciphertext. The encrypt/decrypt function combines parts of the key in several rounds, which at first sight seems to make the encryption stronger. However, as we shall see in this article, it does not add any real security. Better yet, it leaks plaintext bytes at fixed positions, independent of the key used (as long as it has length 10).

The structure of this write-up is as follows

Some basic XOR properties
Character frequency counts of text
Analysis of the malware encryption/decryption function
Ciphertext-only attack based on weakness in encryption function
Some afterthoughts

Note that this blog does not claim that the method discussed here is the most efficient. We are aware of sandboxes, debuggers, etc which would also help determine the functionality of this code.

This article intends to, hopefully, be fun and educational. The intended audience is anyone who reverses malware sometimes or has ever participated in a Capture The Flag assignment and liked it.

Related work

Independently of this write-up, another article on the same encryption method was published a while ago. I was notified by RJM after having written a substantial amount of this blog. Therefore, we included a link to the article for transparency and more background. Although the linked article contains a lot of the same material as what is written here, I still think my article has something to add on the topic;

the author’s “faithful re-implementation” of the encrypt/decrypt function is (strictly speaking) incorrect;
The article does describe the weakness of the encryption, but IMHO not to the fullest extend. Neither does it demonstrate exploitation of the weakness by an attack on the ciphertext (although the author does include an encryption demo of a plaintext of his choosing).

RJM also found another article in which the author shows the process of finding the key by analyzing the javascript at runtime in a debugger. Although an interesting read in itself, the method is very different from the one described here.

Some basic XOR properties

First, some basics which we'll need later on (feel free to skip to the analysis). The XOR operation is defined on two bits p and q as shown in the table below:

p  q  | p xor q

0  0  |   0

0  1  |   1

1  0  |   1

1  1  |   0

It has some interesting properties;

First look at row 0 and 3:

  p  q  |   p xor q

  0  0  |     0

  1  0  |     1

It tells us that p XOR 0 == p, so XOR-ing p with 0 evaluates to p. (fun fact 1)

Look at row 0 and 4:

p  q  | p xor q

0  0  |   0

1  1  |   0

It shows that if p and q have the same value, the result will be 0. In other words, if you XOR p with the same p, the result is 0. (fun fact 2)

The XOR operation is “commutative”, meaning that p XOR q == q XOR p. Thus, we can swap operands (just like, for example, addition but not division).

Lastly, XOR is “associative”, meaning that (A XOR B) XOR C == A XOR (B XOR C). So, it does not matter which components of an expression we evaluate first. You can compare this to simple addition, for example, (3+5)+7 == 3+(5+7) == (3+7)+5 (fun fact 3)

Now, a simple encrypt and decrypt method may be defined as follows. In the example below P is plaintext, C is ciphertext and K is a key.

To encrypt P, we have:

C = P xor K

To decrypt C, we do:

C xor K              substitute (P xor K) for C

== (P xor K) xor K   apply funfact 3

== P xor (K xor K)   apply funfact 2

== P xor 0           apply funfact 1

== P

Another useful trick is to XOR both sides of an equation with the same value, eg:

A xor B = C

A xor B xor B = C xor B     (XOR both sides with B)

A = C xor B                 (B xor B cancels)

When XOR operates on bytes (or words, double words, etc), it operates on the individual bits of the bit expansion of the operands.

The modulo operator

Many programming languages support the modulo operation (often denoted by ‘%‘). We discuss it here because it will be used later on. The value of p % n is the remainder of the integer division p/n.

Some examples;

7 % 5 = 2, because 7/5 = 1 with remainder 2

14 % 7 = 0, because 14/7 = 2 with remainder 0

Another example maybe your digital clock, which shows the hours modulo 12 or perhaps modulo 24.

If we have to loop through fixed-length arrays multiple times, the modulo operator is convenient to use;

Suppose you have an array A of length l, and you have some incremental variable i. You may iterate through the array several times by using A[i % l]. Every time i “falls of the end” of the array (that happens when i is a multiple of l), it “wraps around” to the beginning of the array.

Frequency analysis

An effective way to attack a simple (!) (substitution) cipher is to perform a frequency count on the bytes in the ciphertext. The frequency count can help us make educated guesses, instead of blindly trying all possible keys/plaintext bytes.

It is known that the most common characters in standard English text are ETAOI and white space.

Let's copy the text of RJM's previous article and paste it into a text file to get a feel for plaintext English text (Most screenshots that follow are generated on a Linux console).

Figure 3: Screenshot of normal English text from Strongpity article

To perform the frequency count I normally use grep, sort, wc and uniq.

The line below first greps single characters from the file and puts each on its own line. After that, they are sorted and uniqued to group and count them. Lastly, they are sorted by their count. To generate a "top 10", 'head' is used.

Figure 4: Screenshot of the frequency count of characters in Strongpity article.

As we can see, the result matches almost perfectly what we expected (The empty spot behind the number 2817 is the space character).

Another interesting statistic is the number of unique bytes. We just count the number of lines;

Figure 5: Screenshot of the textfile contains 88 unique bytes

As we can see, there are only 88 different bytes in the file. A decent encrypted file will likely have 256 unique values, with an even distribution.

Now, if we are presented with some ciphertext, and some byte value is the most common, and there is a clear uneven distribution, you might guess that that byte value represents one of our top 10 characters (Obviously, this may differ per language and type of data, so you need to take that into account).

For fun, let's test it. In the image below, I first write one random byte to a file "keyfile". I then xor that value with the contents of the plaintext article.

Doing a frequency count gives us, as expected, different bytes, but the same statistics.

Figure 6: Screenshot of the “keyvalue“ 0xec applied to the plaintext. The frequency count stays the same.

And indeed, the key matches the most frequent "encrypted" byte, which is 0xcc.

Figure 7: XOR of most common byte with the ordinal value of a space, reveals the key.

The little python xor script I use is this:

The script uses the "generalized" version to handle keyfiles of arbitrary length (remember the mod operator to wrap to the beginning of the key when we've reached the end of the key).

Let's try with a key of length 16 (bs=16 in the dd command in the screenshot below). The results are now not so nice.

Figure 9: XOR-ing data with 16 random bytes

As you can see, we now have 254 different values, and the count is much flatter (the extreme case would be if we used a key as long as the plaintext, resulting in a one-time pad).

However, if we know the key length and have enough ciphertext compared to the length of the key, we can still use the same trick. The difference is that we have to create as many different frequency counts as there are key bytes.

Basically, we divide the ciphertext into several smaller groups of ciphertext. Bytes that were encrypted with the same key byte end up in the same group.

We can write a program for this, but we might as well try it on the command line. In the next screenshot, we make a frequency count of every 16th byte (notice the change in the c flag in the xxd command), starting at offset 0.

Figure 10: Screenshot of frequency count on every 16th byte

Again, we see that we've returned to a frequency count with a large spike in it. Let's guess that that corresponds to a space, and calculate what key byte would have been used when our guess is correct.

Figure 11: Guessing a space for the most common value of every 16th byte (0xc2) reveals the first keybyte (0xe2).

It should be 0xe2. As you can see I also calculated for the guesses 'e' and 't', just in case.

However, the precaution was unnecessary, since the 0xe2 was indeed the first keybyte (which is revealed by the hexdump of “keyfile“).

We can do the same for the 1st, 17th, 33th, ... ciphertext bytes to get the second keybyte and so on until we completely recovered the key.

The cipher described above is very similar to the Vigenere cipher, although he uses modular addition instead of XOR. The principle of the attack is the same, however.

Finding the length of the key, in case it is unknown, can be achieved by analyzing repeating groups in the ciphertext. The common divisor of the differences between the offsets of such groups is likely the key length.

See, for a good explanation; https://en.wikipedia.org/wiki/Kasiski_examination

Key generation (side note)

In the previous examples, I generated the keys by reading from /dev/urandom (the command line "dd if=/dev/urandom .... ").

You could also imagine that someone would generate the 16 bytes, for example, by using the MD5 hash of some password. Or possibly the first 16 bytes of a million iterations of SHA256 hash on a salted password. Or by tossing a fair coin 128 times. Or a combination, perhaps xor, of all of the above.

If not already obvious, it's important to realize that, in the current case, it does not matter how this key material was generated, because an attacker doesn't care and just looks at the frequencies and calculates the key from educated guesses instantly (however, if the key material is predictable and easier to guess than performing the analysis on the ciphertext, I suppose the attacker will just predict the key).

More rounds

The above cipher does not get any better by adding rounds.

For example, the cipher does not get any harder to crack if we perform the encryption multiple times. By doing so, the "effective key" will just be an XOR of all keys used in the various rounds.

The picture below shows that I applied one more round with a different key (keyfile2) to the result of the first round. As expected, the count does not differ at all from the single-round version;

Figure 12: Adding a second round does not change the count

So, the "effective key" of round 1 and round 2 combined, is the ciphertext 0x12 XOR 0x20 => 0x32.

Indeed, this is the XOR of the first byte of key1 and the first byte of key 2, as shown below.

Figure 13: Screenshot XOR key1 and first bye of key2

The analyst does not care about the two keys. The analyst only needs to care about the net effect (the "effective key") of the application of the key material in several rounds

The special case

In the previous example, when the keybytes in two rounds are the same, the net effect of the keybytes combined will be 0 (p XOR p == 0). You could say that, in that case, the second round undoes the encryption in the first round.

In the example below, I changed the last byte of keyfile2 to match the last byte of keyfile1. This ‘new‘ key was stored in keyfile3 and is used as the second round on the intermediate result of the previous example.

Figure 14: Using the same keybyte in the second round reveals part of the plaintext

If you compare the result of the second round with the plaintext, you’ll see that every 16th byte matches the plaintext (shown below).

Figure 15: Plaintext match of every 16th byte

Now, we can take a look at our malware sample.

Malware analysis

Here’s a screenshot of the encrypt/decrypt function from the Pastebin post:

Figure 16: Screenshot of the encrypt/decrypt function

The function consists of 10 (the length of “key”) rounds. Every round will use as its input the result of the previous round. The first round works on plaintext or ciphertext, depending on whether it’s encrypting or decrypting. Here, I just call it “text“.

In the first round, the last keybyte k9 (tmpKey=[k9]), is XOR-ed with the entire text.
In the second round, the last two keybytes (tmpKey=[k8,k9]) are XOR-ed with the results of the previous round, in the following way; The first text byte is XOR-ed with k8 (tmpKey[0 % 2]), the second text byte is XOR-ed with k9 (tmpKey[1 % 2]), the third text byte is XOR-ed with k8 (tmpKey[2 % 2]), the fourth text byte with k9 (tmpKey[3 % 2]), and so on until we reach the end of the text. More general, the ith text byte is XOR-ed with tmpKey[i % 2], for 0 <= i < length(text).
In the third round, the last three keybytes (tmpKey=[k7,k8,k9]) are XOR-ed with the results of the previous round, in the following way; The first text byte is XOR-ed with k7 (tmpKey[0 % 3]), the second text byte is XOR-ed with k8 (tmpKey[1 % 3]), the third text byte is XOR-ed with k9 (tmpKey[2 % 3]), the fourth text byte with k7 (tmpKey[3 % 3]), and so on until we reach the end of the text. More general, the ith text byte is XOR-ed with tmpKey[i % 3], for 0 <= i < length(text).
In the nth round, the last n keybytes (tmpKey=[k10-n,..,k9]), are XOR-ed with the results of the previous round as text byte[i] XOR tmpKey[i%n].

Note that the length of tmpKey is the same as the round number (starting to count from 1). So, as the round number increases, the tmpKey will be [k9], [k8,k9], [k7,k8,k9], … [k0,k1,k2, … k8, k9]. The article we linked to erroneously describes this as [k0],[k0,k1], [k0,k1,k2,] .. etc. The principle is the same, but in a real-life situation, this will mess up the analysis.

Figure 17: The slice method in action with key = [1,2,3]. In every round, tmpKey is written to the console.

In the table below it is illustrated as follows. The first line are the plaintext bytes (p1, .., p2824), the last line are the ciphertext bytes. The rows in between are the round keybytes:

p1 p2 p3 p4 p5 p6 p7 p8 p9 ..... p2824

k9 k9 k9 k9 k9 k9 k9 k9 k9 ..... k9

k8 k9 k8 k9 k8 k9 k8 k9 k8 ..... k8

k7 k8 k9 k7 k8 k9 k7 k8 k9 ..... k?

k6 k7 k8 k9 k6 k7 k8 k9 k6 ..... k?

k0 k1 k2 k3 k4 k5 k6 k7 k8 ..... k?

c1 c2 c3 c4 c5 c6 c7 c8 c9 ..... c2824

The table above should be read in vertical order, for example:

c1 = p1 XOR k9 XOR k8 XOR k7 XOR k6 XOR k5 XOR k4 XOR k3 XOR k2 XOR k1 XOR k0

c2 = p2 XOR k9 XOR k9 XOR k8 XOR k7 XOR k6 XOR k5 XOR k4 XOR k3 XOR k2 XOR k0

Basically, the 10 keybytes are combined together in 10 rounds to create 2520 different “combined“ keys. The number 2520 is the least common multiplier (LCM) of 1,2,3,4,5,6,7,8,9,10. This makes sense because to get a repeating column, the roundkeys (tmpKey) must repeat at that same position. The position of where this happens has to be divisible by the length of the roundkey of round 1, by the length of the roundkey of round 2, etc. So, the position is the smallest number divisible by 1,2,3,4,5,6,7,8,9,10, which is the LCM.

That is annoying because 2520 different keys for a ciphertext of length 2800 is too much. Yes, you can use the repeat after 2520 to attack the first and last 300 bytes of the ciphertext (Think "onetime pad with key re-use”), but it still is a tough cookie.

Now comes the main observation; while the combination of the 10 keys seems to make the encryption harder to crack, it also weakens it.

Reducing keys

Remember that A xor A = 0. This means that if in an XOR column above, there is an even number of the same keybytes, we can remove them because they cancel.

If there is an odd number of the same keybyte, we just handle this case as if there were only one occurrence of that keybyte. This works, because the number of other occurrences of that keybyte will then again be even, and we can cancel them as well.

Since the order of the applied keybytes does not matter, we can also sort the keybytes by their position in the key.

If we apply this to two different combined keys, we find that many reduce to the same effective key. For example;

k1 k1 k2 k2 k3 k3 k3 k2 k7 k7 => k2 k3

k1 k1 k3 k8 k8 k8 k8 k2 k9 k9 => k2 k3

The next screenshot shows the output of my python script, enumerating the combined keys, and reducing them to a canonical form. Note that ‘0‘, ‘1‘ etc, are just symbolic names for the first, second, etc keybytes. Their actual values (which are unknown anyway) do not matter.

Note that a combined key will never reduce to exactly one keybyte. If that were the case, there would be an odd number of occurrences of that keybyte, and the rest would need to cancel in order to end up with just that keybyte. This impossible since the length is 10, thus even, and there would be at least one other keybyte with an odd count. This keybyte would then also be included in the final reduced key. So, the best we can do is reducing to a combination of at least two keybytes.

This reduces the number of 2520 combined keys to 297 effective combined keys. This is something you can work with.

Figure 19: Sorting and counting the reduced keys shows that there are 297 unique reduced keys

Better yet, at position 2519, the XOR column consists of 10 k9's. This means at that position there is a plaintext byte.

I developed a small python program to find all positions where all key bytes cancel each other (for example: k9 k9 k9 k9 k1 k2 k1 k2 k2 k1). This results in 99 plaintext bytes, scattered around in the ciphertext.

Figure 20: Offsets in the ciphertext where plaintext bytes will be found.

If we group and count all the unique reduced keys, we can find the offsets where the same reduced key will be applied. The picture below shows how often specific reduced keys are applied.

Figure 21: The reduced key k8,k9 will be applied 77 times in the ciphertext. There are 99 occurrences of plaintext (the empty reduced key).

Figure 22: The offsets of the 99 plaintext bytes. Other offsets from other reduced keys are not shown here.

Note that all these relations hold, without knowing the actual values of those bytes.

With the reduced set of combined keys, we can now do a frequency count of all ciphertext values per combined key. A frequency count is a common attack on these kinds of ciphers.

Basically, for the most common ciphertext value per combined key, you guess it is, for example, whitespace or 'e' or 't' or some other common letter. With that assumption, you calculate the combined key from the ciphertext and guessed plaintext.

If the guess was correct and apply the combined key to the other positions of the ciphertext where that key applies, you will get a result that makes sense. If you get garbage at other points, it means the guess was incorrect.

If you can make a guess based on the surrounding leaked plaintext bytes, that's even better.

An actual sample

Here’s a snippet of the ciphertext from the Pastebin URL after having it transformed from its ASCII HEX representation to binary:

First, let us find the 99 plaintext bytes we get for free. The figure below shows a snippet. The snippet includes the plaintext bytes as their actual values. All other ciphertext bytes are represented as 0x00 for readability.

I also counted the most frequent ciphertext character per reduced key. The results are a bit disappointing. Maybe we have bad luck, but the sample size is not convenient in relation to the size of the combined keys. It is actually smaller than 2520. But let’s see how far we can get.

Figure 25: Some stats on the frequency counts for reduced keys.

The figure above shows that the key 89 (k8k9) is applied at 56 positions in the ciphertext. The ciphertext contains at those positions 28 unique values. The value 0x0c is the most common with a count of 6. So, it would make sense to guess that k8k9 = 0x0c XOR 0x20, where 0x20 is the ASCII value for a space. If the guess is correct, we solve 56 positions. If the guess is incorrect, we might try for the characters ETAOI.

Here is an example; in the figure below, I made a guess for k8k9 as 0x0c XOR 0x20.

Figure 26: Result of guessing a space for 0x0c

We already knew the X and b (offsets 0x651 and 0x653), as that was leaked plaintext. Our guess did not improve much (at positions 0x652 and 0x664, unprintable characters appeared). I then guessed ‘e‘, but still no luck. But see what happens when we guess ‘t‘.

Figure 27: Result of guessing a ‘t‘ for 0x0c

As you can see, we get “XOb“, “HT“,”tp”. We can safely assume for now, that k8k9 is 0x78 (= ord(‘t‘) XOR 0x0c).

The python scripts I use for this, allow me to “hardwire” a key value for a given reduced key. The data structure is pretty simple, namely an associative array of the form {reducedkey1: keyvalue1, reducedkey2 keyvalue2, … }. If a reduced key is found in this array, it will override all other (automated frequency-based) guesses. Obviously, plaintext bytes will also never be altered by guesses.

Based on the guess above, I hardcode my first guess: { '89': 0x78 }.

Let’s try another frequency-based guess. The space and ‘e’ did not work, but again ‘t’ worked.

Ignoring the other changes in the file, we see that “XObject” is slowly starting to emerge.

I created a small program that allows me to supply an offset in the file and a guess for the plaintext byte at that position. The program will then supply me with the reduced key at that offset and its value. The figure below shows how I make the guesses for “jec“ in “XObject“. The program returns the values I need to add to my fixed reduced key table.

Figure 29: Generating a keytable, based on educated guesses.

Combining keys

Although the above process seems to work (in this case), it is a bit slow, and it is easy to get stuck. However, note that if we solve the keys k8k9 and k7k8, we also have solved k7k8, since k8k9 XOR k7k8 = k7k9 (the k8 cancels). Not only gives us these reduced keys “for free“, but also a way to break ties or (in)validate guesses.

I created a small python script, which takes a keymapping as input and returns an expanded keymapping as output. The process is a bit brute force, as the script generates a powerset of all reduced keys found so far, combines their values by XOR, and reduces their concatenation. The following screenshot shows this in action.

As you can see, the program creates a pretty impressive new table from only 7 guessed reduced keys. After pasting this generated key in the decrypt script, we get this result:

The surprise

Now, from the above, one could conclude that it’s going to take a while to finish the remaining part of the ciphertext. At least, this is what I initially thought. However, and in hindsight this may have been a bit obvious, if we add one new reduced key mapping to the “key combiner”, it doubles the translation table (the powerset grows exponentially with respect to the number of elements in the set). I am not sure it always doubles. I can imagine that sometimes you add (partially) redundant information, leading to redundant results that will be filtered out in the end.

However, in this case, I added a ‘u‘ near the end to generate “unction“ (see screenshot above). From this change I created a new keymapping. Near “unction” we then find “typeo“. After adding the ‘f‘ for “typeof“ I created a new keymapping. The large generated table is shown in the screenshot below.

Figure 32: Guessing the ‘f‘ and creating the final keymapping

The plaintext

I put the new keymapping in my decrypt script and tried again. I recovered the plaintext:

Figure 33: The decrypted text of https://pastebin.com/p356DXGu

So, without knowing anything of the key, but by exploiting some flaws in the crypto algorithm, it was possible to decrypt a ciphertext from scratch by just analyzing the ciphertext and the encryption algorithm.

Afterthoughts

I may need to think about this more, but I find it curious that I only needed to guess 9 bytes. That is less than the keylength of the original unknown key. I am not sure if this is always the case, or that I just got lucky with this instance.

Although this cipher was relatively easy to break, we have to consider that we were lucky with the plaintext. If the plaintext was obfuscated as well, it would have been much harder to spot words like “Object“. This is essential for the technique we discussed because it involves making educated guesses. Another layer of obfuscation would have made the guessing way harder. Knowing that the plaintext would most likely be javascript was very helpful too, of course.

Also, do note that in this instance, the space did not seem to be the most common character at all! Something to keep in mind. Here’s the top 20 frequency count after decryption (fun fact; the round brackets are balanced):

Figure 34: Top 20 frequency count of decrypted sample

The cipher is ultimately a substitution cipher. Many ciphers employ transposition (moving plaintext bytes to other positions) as well. Had the author employed a simple transposition based on the key, I most likely would not have been able to recover the plaintext. This relatively simple extra step makes the problem much harder.

Had the malware author used an odd length key, the plaintext leakage would have been gone.

Although the javascript has been decrypted, the C2 is still obfuscated in the javascript source code. If you analyze the javascript, you will find that you need to reverse the URL at the beginning and consider every second character, starting at offset 1. You will then get “http://ftmp.pendidikann.info/mdjvrhg.php“. I did not further investigate this, but a quick search on VirusTotal suggests that it lives pretty much under the radar.

Figure 35: The obfuscated URL on VirusTotal.

I did not publish the source code of my scripts. The reason for this is mainly because it is a bit of a mess. I have several copies of slightly different versions, all doing slightly different things. They are not documented at all, and error checking is missing. On the other hand, I don’t think they are very hard to develop yourself. If you’re really interested, just contact me.

Anchored Narratives on Threat Intelligence and Geopolitics