Demystifying JSCrush

Posted on Apr 16, 2012

Some of you may have seen Philip Buchanan’s award winning. Autumn Evening entry for JS1K 2012 Love. While skimming the source I saw that he had used Aivo Paas’s JSCrush to compress the code. The JSCrush website is intriguing with the source code minified and then passed through JSCrush itself (so it can be submitted to JS1K). The JSCrush version used in the page, in <script> tags is the JSCrushed version. As a weekend project I tried to ‘reverse engineer’ JSCrush and understand it. It took me about 4 hours. What follows is a walk-through of the process.
JSCrush is a very interesting JavaScript program, liberally abusing eval(), global variables and insane levels of nesting to achieve a sort of compression.
Remember that your browser’s web development tools are indispensible for activities like this. I made extensive use of Firebug.

The de-obfuscation

I chose to start with the compressed version in the <script> tag rather than the plain text in the upper text field.
The syntax is clearly wrong for JavaScript, and it is all stuck in a string assigned to _. The part at the end is interesting though (properly formatted).
For every character in $ it is splitting _ on the character, using with to make the resulting array the scope. Then joining the pieces using the last piece and reassigning to _. For example:

_ = "HelloRWorldRCrushedRAB"
$='R'

var temp = _.split($) // => temp = ['Hello', 'World', 'Crushed', 'AB']
var last = temp.pop() // => last = 'AB',
                      //    temp = ['Hello', 'World', 'Crushed']
_ = temp.join(last)   // => _ = 'HelloABWorldABCrushed'

Remember this step, it is key to how JSCrush works. These steps are repeated for every character in $, after which the ‘decompressed’ output is the minified source code:
You can see this for yourself by putting a console.log(_) just before the eval(_).
Now we’ve a fair idea of how JSCrush is doing decompression. Compressed scripts are stored in _, decompressed using the loop and then executed using eval().
The next thing I did was to un-minify the source (manually):
One change I’ve made is the call to setTimeout(). I’ve converted it to a function to make it easier to read, and directly used the script tags innerHTML, since I had the decompressed source in the tag. The JSCrush code generates the textareas and button as part of it’s run and assumes body.children[9] to be the <script> tag with the JSCrush compressed source. Hence it replaces the eval call with the program source itself so that the inner eval() call in setTimeout() extracts the decompressed source and puts it as the value of the first textarea. It then calls L(), the JSCrush crushing function to compress the original code back, so that you get the compressed version of JSCrush in the lower text field. Mind-boggling.
The setTimeout() without a time simply causes the code to be executed after the script has finished evaluating completely.

Understanding

Now that we’ve decompressed code, it still has the scars of minification – single letter variable names and no comments. Time to start reading the code. Line 1 is just setting up the HTML for the user. Lines 2-4 is the first interesting piece. The array Q is being populated with all the ASCII characters, in reverse order! The characters \n, \r, \\, ' and " are excluded, as are \0 and DEL, so that Q has 121 characters. Rather than using a readable if statement, Aivo is using the fact that && is ‘short-circuiting’ in JavaScript. Much space saving here.
Next we come to the definition of L. Line 12 just removes blank lines, whitespace and single line comments (except those following code). It also escapes backslashes so that the code is ready to be put into a string later. This is assigned to the letters i and s. Be warned from here, in the goal of smaller size, variables frequently change their meaning to promote reuse. s is always going to point to the code, but i is used as a counter all over the place.
Next, B is half the length of the program, m is the empty string. Line 15 is where it starts getting interesting. The pattern:

encodeURI(string).replace(/%../g, 'i')

occurs thrice in the code. Its task is to get the byte length of the string rather than the number of characters that string.length gives. In ASCII there is no difference, but if there are Unicode characters, they may occupy 2-4 bytes. encodeURI will replace each byte with a ‘%xx’ code with xx being the hexadecimal byte value. Replacing this with the single letter ‘i’ will get us one ‘i’ for every byte, so that the length of the resulting string is the byte length. This was one of the many clever tricks present in the JSCrush code. They might be well known, but this is the first time I came across it.
The initialization in the for loop is only to save a byte, it does not affect the loop itself in any way. Similarly the m = c + m call can be moved to the end of the loop body. This construct will generate the decompression sequence contained in $. This for loop is then actually an infinite while loop.
Line 43 is again a trade-off of readability for size. Here it is in a cleaner form:

c = 0
i = 121
while (!c && i) {
 if (! (~s.indexOf(Q[i])))
  c = Q[i]
 --i;
}

~ is binary NOT. If Q[i] is not found in the source, then indexOf will return -1, NOT -1 = 0 and !0 === true, so that this code is actually saying:

For every character Qi in Q in reverse:
    If the source does NOT have the character in it:
        c = Qi

Or, c is set to an ASCII character that is not present in the program. Initially it will be ASCII 1, then perhaps 2 and so on. This ‘c’ is now the character that will be used to join the pieces obtained in Lines 20-32. This is one round. When all the characters have been used up, compression stops (Line 18).
Lines 12-32 basically try to find long, repetitive strings that can be replaced with a single character, to get the best compression. JSCrush follows a brute-force approach to find these segments. With single variable names, the code is a mess, so here is a cleaned up version which makes things much clearer:
Lines 9-28 try to find segments which repeat atleast twice in the code. Longer segments will give better compression, so we try all of them. For segments of length 1, we try every character in the string, for segments of length 2, we try every pair and so on. If it repeats we keep track of the segment count.
The segmentLengthBound = longestSegmentLength (B=Z) bit is interesting, and it took me some thinking to figure it out. It relies on the following facts:

The longest segment in the current source is longestSegmentLength.
Splitting by something, and then joining by a character not in the source will not lead to creation of longer segments.

So we can restrict segmentLength of the next round to segmentLengthBound.
Lines 32-41 choose the best segment to substitute in this round. The expression (R=o[i])*j-R-j-1 may seem cryptic, until you look a little later in the code where the split and join is done, and you remember how JSCrush works. R * j is the number of bytes we will remove by replacing this segment. But to join the split, we’ll need one character for every repetition, followed by one character to separate the segment suffix itself. The conditional asks if this leads to actual, and better, compression than what we already know of. If no such segment was found, we are done compressing. Otherwise we split by the segment, join the pieces by the join character and tack on the segment at the end. One round done!
Once multiple rounds have been done, the script is compressed, and only some trivial things remain. The value of B is now changed to store the quotes (double or single), based on which are fewer. Since the compressed program is stored in a string, using the quotes that appear less times means less ’\’ to substitute, each of which costs a byte. We then prepare the boilerplate, setting _ to the now compressed source, setting $ to the decompression sequence m and adding the evaluation code. The savings accomplished are announced too.
One trick I picked up in the code is forcing a certain digit view precision.

i/S * 100

would give a float percentage with many digits after the floating point. Instead multiplying by 1000, gets us two digits in the integral positions, bitwise OR-ing with 0 casts to an integer, losing the floating point digits, then dividing by 100 gets us the two digits we want.

Summary

JSCrush works by:

Finding the first unused ASCII character to act as the join
Finding the substring of the program text that gives the best space savings if its repetitions are all replaced by the ASCII character from 1.
Splitting the source on 2 and joining the pieces using 1, tacking on 2 to the end. This string replaces the original source.
Repeating 1, 2 and 3 until no more savings are possible or we’re all out of ASCII.
Wrapping the compressed source into a string, then using the list of join ASCII characters to unroll the string.
Unrolling is performed by splitting on every ASCII character used in 3, extracting the original repeated substring 2 from the split and joining the parts.

I hope this (long) post was interesting and educational. If you have feedback, a comment would be great.