Tycon Mismatch was our team for the ICFP 2001 Programming Contest. Most of us are (current or former) developers for the TILT project at Carnegie Mellon University.

Team Members
Perry Cheng
Tom Murphy
Charlie Smart
Dave Swasey
Joe Vanderwaart

I believe we all put in at least one all-nighter, but many of us were not working on the contest the whole weekend.

The contest task was to write an optimizer for an HTML-like markup language. We had to take a well-formed input and produce a (hopefully) smaller document with the same meaning. This task is quite nontrivial, particularly due to tags like <PL>, which cancels out other style-setting tags.

There seem to be two reasonable strategies for this task. One is to do local optimizations to the document that we're given, in the style of the examples given in the task description. The other is to compute the meaning of the document and then try to reconstruct that as efficiently as possible.
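To make the second strategy concrete, here is a hedged sketch in SML of what a document's "meaning" could look like (the record fields are our own guesses at a plausible context type, not the task specification's):

```sml
(* Hypothetical context type: the style attributes in force at a
   given character.  Field names and types are illustrative only. *)
type context = { bold : bool, emph : bool, italic : bool,
                 strong : bool, tt : bool, underline : int,
                 size : int option, color : string option }

(* The meaning of a document: each visible character paired with
   the context it is rendered in. *)
type meaning = (char * context) list
```

Two documents are equivalent exactly when their meanings coincide, so a reencoder is free to emit any smaller document with the same meaning.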

Our original intention was to do a hybrid of these two methods. We planned to attempt local optimizations on the original document and a reencoded one, and after a few rounds of optimizations abandon one to focus our efforts on whichever seemed more promising. The reencoding optimization was risky, though, so we all wanted to work on local optimization passes for the first few days. An hour before the contest deadline, we decided to submit two separate entries for the two approaches. Both are written in SML, and were compiled for speed with MLton.

  • (\x.x)

This was our premier entry. It consisted of the following optimization passes:

  • WhiteSpace - removes redundant whitespace
  • Space pushing - rearranges spaces in a "desirable" way
  • Tag flattening - optimizes the internal ADT (flattens lists)
  • Scoped Elim - removes redundant tags (as in task description)
  • Useless - removes tags that set attributes that are never used
  • Overlap inversion - see task description
  • Color Nesting - see task description
  • Hoist - factors out commuting tags which are repeated in sequence (e.g., <6><B>x</B></6><5><B>y</B></5> becomes <B><6>x</6><5>y</5></B>)
  • Joe's Redund - another pass at redundant tag removal
  • Attrthenpl - Special case of Useless
  • Fusetags - removes <B></B> and </B><B>
  • <PL> Shortcut - see task description
  • Color/Size Redund - special case of Scoped Elim

    Some of these passes are well suited to a tree representation (especially Useless, Hoist, and Scoped Elim); others are much easier in a linear representation. Consequently, we have two intermediate languages which the optimization driver automatically converts between.
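As a hedged illustration (these datatypes are simplified stand-ins, not our actual intermediate languages), the two representations and one direction of the conversion might look like:

```sml
(* Tree form: suits scoped passes like Useless, Hoist, Scoped Elim. *)
datatype tree = Text of string
              | Tag of string * tree list      (* <t> ... </t> *)

(* Linear form: suits local rewrites like Fusetags. *)
datatype token = Chars of string
               | Open of string                (* <t>  *)
               | Close of string               (* </t> *)

(* flatten : tree list -> token list *)
fun flatten ts =
    List.concat
      (map (fn Text s => [Chars s]
             | Tag (t, body) => Open t :: flatten body @ [Close t])
           ts)
```

Going the other way is a small parse, which is safe because well-formedness of the input is guaranteed.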

    Our entry last year was disqualified for being incorrect, so we really wanted to have a functioning program this year. Owing to our work on certifying compilers, we knew that mechanically checking our output would effectively prevent mistakes. To that end, we run a semantic validator over the intermediate expressions after each optimization pass. Rogue, malfunctioning optimizations are dynamically disabled (with a nod to Camls 'R Us '99), and non-optimizations (passes that make the tree larger!) are ignored. Our program spends most of its time verifying its own work. Since we have two intermediate languages, we actually have two separate semantic checkers, giving even more redundancy!
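A minimal sketch of that validate-and-keep discipline (hypothetical and heavily simplified — a "document" here is just a list of lexemes, and sameMeaning stands in for the real semantic checkers):

```sml
type doc = string list

(* Stand-in checker: compares visible text only (lexemes not
   starting with "<"); the real validators compare fully styled
   meanings. *)
fun sameMeaning (d, d') =
    let fun vis ls =
          String.concat (List.filter (fn l => not (String.isPrefix "<" l)) ls)
    in vis d = vis d' end

fun docSize (d : doc) = foldl (fn (l, n) => String.size l + n) 0 d

(* Run one pass; keep its output only if it is correct and smaller. *)
fun runPass (d, pass) =
    let val d' = pass d
    in  if not (sameMeaning (d, d')) then d        (* rogue pass: discard *)
        else if docSize d' >= docSize d then d     (* non-optimization *)
        else d'
    end

(* The driver folds every pass over the document. *)
fun optimize passes d = foldl (fn (p, d) => runPass (d, p)) d passes
```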

    The judges have begun posting preliminary results, and this entry seems to be reasonably competitive. It was not eliminated. It ties for first place on the trivial inputs, takes first place on input #5 ("random") by a pretty significant margin, and lands around 8th place for the other tests.

    See sorted results.

    Grab our submission for more details. (Also included are some fun tools.)

  • reencode

    This entry attempts to build the optimal document representing a particular meaning, using dynamic programming. We were really concerned about running out of time here (and rightfully so; even though the algorithm is polynomial in time and space, the constants are large), so reencode only loosely approximates the optimal solution. It would have been most effective as part of the hybrid approach described in our initial plan. From our readme:

    This program uses a randomized dynamic programming algorithm to recode the input document from scratch. It is a modified version of (what is believed to be) an O(n³) optimal solution (whose constant is too large). It instead runs in O(n log n) time and produces only approximate results.
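The exact algorithm that reencode approximates can be sketched (in our own notation, as a hedged reconstruction) as a memoized recurrence:

```sml
(* cost (i, j) ctx = the cheapest markup rendering characters
   i..j-1 of the target meaning when the ambient style context is
   ctx: either split the span, or wrap it in a tag that moves the
   context closer to what the characters need.

     cost (i, j) ctx =
       min ( cost (i, k) ctx + cost (k, j) ctx      for i < k < j
           , tagCost t + cost (i, j) (apply t ctx)  for each tag t )

   (Base case: if every character in i..j-1 already wants exactly
   ctx, the cost is j - i.)  Memoizing on (i, j, ctx) gives the
   large-constant polynomial algorithm; reencode instead randomizes
   the choice of split points, trading optimality for roughly
   O(n log n) running time. *)
```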

    Bundled with the reencode are a handful of our other optimizations (Whitespace, Redundant, Useless, and Hoist) to further tweak the approximate result.

    Unfortunately, this entry ran out of time on the giant "exhaustive" test case and was disqualified. I'm pretty sure it was running fine on our 5 MB test files (in 3 minutes), so there must be something special about "exhaustive" that hurt us. It didn't perform as well as (\x.x) on the other tests, anyway.

    Grab our submission for more details.

    We had fun, and we'll see you next year!