How the Use of Scala's Features Affects Compile Time

by Bill Venners

February 12, 2013

Summary

The compiletime project is an attempt to better understand the relationship between the use of Scala's features and compile time. This article gives a quick overview of what we've learned so far.

Long build times are a common complaint among Scala users. As the author of ScalaTest, I have been concerned about both the compile and run time of tests, since these often account for a significant portion of build times. Having seen first-hand significant build times on projects that use ScalaTest, and having noticed several complaints about slow build times with specs2, I decided to try to better understand the problem. The result is the compiletime project.

This weekend at the Northeast Scala Symposium, I led an open-space session on the compiletime project, and was spontaneously joined by Grzegorz Kossakowski from Typesafe. Grzegorz has developed scalac-aspects, a tool to help investigate compiler performance. After our session we spent a couple hours applying his tool to the compiletime project to try to understand one of the effects I was seeing.

In this article, I'll give a quick overview of what we've learned so far about how the use of Scala's features influences compile times. (If you want to try taking the measurements yourself, follow the instructions in the aside, How to run the compiletime scripts.)

Eating dessert first: what we learned

The compiletime project consists of several scripts that generate Scala source files. Each file contains one test class that uses various combinations of Scala language features via the test frameworks that use them: JUnit, TestNG, ScalaTest, and specs2. (ScalaCheck is not included because it is dissimilar enough from the others to make an apples-to-apples comparison difficult, and apples-to-apples is required to ensure we are attributing differences in compile time to the Scala language feature or features whose use is causing those differences.) Once the source files are generated, the scripts compile them with scalac and measure the compile times. Lastly, a charting script generates graphs to better visualize the results.

The fastest to compile was JUnit, followed closely by TestNG and then ScalaTest. specs2 was significantly slower. So far the evidence indicates the differences in test compile time are influenced primarily by the number of implicits in scope, the number of by-names used, and whether or not methods are being mixed into the test class by extending a supertrait versus inherited by extending a superclass.

Most likely the influence of the number of implicits in scope is in some way multiplied by the actual number of implicit applications, because at each implicit application the compiler must ensure one and only one implicit “heals” the candidate compiler error. In other words, the compiler can't stop looking at implicits once it finds one that solves the type error at hand; it must keep on looking to make sure one or more other implicits don't also solve that same type error. Thus the compile time cost of implicit use is likely related to the number of implicits in scope multiplied by the number of times the compiler must check all those implicits (i.e., the number of implicit applications). I say “likely,” because as yet we haven't written a script to verify this theory. If someone wants to contribute a new script to the project, this would be a good one to contribute.
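The theory above can be written down as a back-of-the-envelope model. This is only a sketch of the hypothesis, not a measured result: the names ImplicitCostModel and candidateChecks are made up for illustration, and the model assumes the compiler really does examine every implicit in scope at every implicit application.

```scala
object ImplicitCostModel {
  // Hypothetical model: each implicit application makes the compiler consider
  // every implicit in scope, so the work is the product of the two numbers.
  def candidateChecks(implicitsInScope: Int, implicitApplications: Int): Long =
    implicitsInScope.toLong * implicitApplications.toLong

  def main(args: Array[String]): Unit = {
    val applications = 1000 // e.g., one implicit application per test
    // 5 and 35 are the implicit counts this article reports for Spec alone
    // and for Spec with ShouldMatchers.
    for (inScope <- Seq(5, 35))
      println(s"$inScope implicits in scope: ${candidateChecks(inScope, applications)} candidate checks")
  }
}
```

Under this model, going from 5 to 35 implicits in scope multiplies the checking work by seven for the same number of tests. A script that varies the two inputs independently would show whether real compile times follow that product.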

Also, most likely it is the number of function literals in general, not just the use of by-names, that influences compile time. But again, we have not actually written a script that tests that theory. This would be another good candidate for a contribution to the project.

Lastly, the compile time cost of mixing in a trait compared to extending a superclass likely exists because the compiler must insert forwarding methods for all the methods declared in the trait into the body of the subclass when mixing in, but when extending a superclass it does not. Instead, the subclass simply inherits the superclass implementations of those methods. We did observe that the larger the number of methods to be mixed in, the more the compile time increased. As JUnit and TestNG don't use traits (being Java frameworks), this was a difference observed between ScalaTest and specs2 only. Because specs2's mutable and immutable Specification traits declare more methods than ScalaTest's various style traits, the compilation of specs2 styles is being slowed down more by this than that of ScalaTest's styles.
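One way to look at the forwarding methods yourself is to ask the JVM which methods are physically declared in each class file. The sketch below uses made-up stand-ins (Helpers, HelpersBase, MixesInTrait, ExtendsClass) rather than real framework types. Whether forwarders show up for the mixed-in case depends on the compiler version and flags you use, so no particular output is claimed for that class.

```scala
object ForwarderDemo {
  // Stand-in for a "style trait" with a few concrete methods.
  trait Helpers {
    def one: Int = 1
    def two: Int = 2
    def three: Int = 3
  }

  // Stand-in for a "style class" that mixes the trait in exactly once.
  abstract class HelpersBase extends Helpers

  class MixesInTrait extends Helpers     // trait mixed in directly
  class ExtendsClass extends HelpersBase // methods inherited from a superclass

  // Names of the methods physically declared in the class file for `c`.
  def declaredMethodNames(c: Class[_]): Seq[String] =
    c.getDeclaredMethods.toSeq.map(_.getName).sorted

  def main(args: Array[String]): Unit = {
    println("MixesInTrait declares: " + declaredMethodNames(classOf[MixesInTrait]))
    println("ExtendsClass declares: " + declaredMethodNames(classOf[ExtendsClass]))
  }
}
```

ExtendsClass adds no members of its own, so it declares no methods and simply inherits from HelpersBase. Any methods listed for MixesInTrait are the ones the compiler had to generate for that class.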

Nevertheless, I think users of both test frameworks should probably be extending “style classes” instead of mixing in “style traits.” In the next milestone release of ScalaTest 2.0, I will likely rename all the style traits by appending a “Like” suffix, and make the old name a class. So for example, I'll rename trait FunSuite to FunSuiteLike, then create a class named FunSuite that extends FunSuiteLike and does nothing else. (I expect this change to break little if any code, because every use of ScalaTest styles I've seen already uses extends with the trait, which will continue to compile if that trait is changed to a class.) The compiletime data suggests this will reduce the compile time of each ScalaTest test class by 0.15 to 0.2 seconds. The improvement promises to be bigger for specs2 users, about 0.5 seconds per test class, because the specs2 style traits require more methods to be mixed in compared to ScalaTest's. So I think this may be a good change for specs2 to make as well.

Switching from extending a trait to a class is something users of either test framework can do locally without waiting for the next version of the framework. Just make a class that extends Specification or FunSuite, whatever style trait you are currently using, then have your test classes extend that superclass instead of extending the style trait directly. If you try this, please measure the difference in compile time before and after and email the scalatest-users mailing list, so I can find out to what extent this predicted improvement actually happens in practice.
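The local workaround might look like the following sketch. To keep it self-contained, StyleTrait is a made-up stand-in for whichever framework trait you use (FunSuite, Specification, and so on), and StyleBase is a hypothetical name for the one class you would add to your project. In real code you would extend the framework's trait instead of StyleTrait.

```scala
object StyleClassPattern {
  // Hypothetical stand-in for a framework style trait. It only records test
  // names; the by-name body is ignored in this sketch.
  trait StyleTrait {
    private var registered = Vector.empty[String]
    def test(name: String)(body: => Unit): Unit = registered :+= name
    def testNames: Vector[String] = registered
  }

  // Defined once per project: the only place the trait is mixed in.
  abstract class StyleBase extends StyleTrait

  // Each test class now extends the class rather than the trait.
  class FirstSuite extends StyleBase {
    test("increment 1") { assert(1 + 1 == 2) }
  }

  class SecondSuite extends StyleBase {
    test("increment 2") { assert(2 + 1 == 3) }
  }
}
```

The test classes themselves change by one identifier only, which makes it easy to switch back and forth when measuring the before and after compile times.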

The approach

The compiletime project is composed of several Scala scripts. When run, each script creates a directory named after the script. The allTestsInOneFile.scala script, for example, creates a directory named allTestsInOneFile. Inside this top-level directory each script creates three subdirectories named generated, output, and stat. The script places generated Scala source files in subdirectories of generated, deposits class files resulting from compiling the source files in subdirectories of output, and writes csv files containing measurements in stat. The project also contains a script named google-chart.scala that generates web pages containing graphs of the collected measurements. For each compilation script, the charting script deposits an html file in the appropriate top-level directory. For example, the google-chart.scala script will create a file named allTestsInOneFile-graph.html in the allTestsInOneFile top-level directory. You can then open these html files in a browser to inspect the results.

The scripts call scalac directly (using the scalac that is available on the path) instead of using sbt or some other build tool. One of the main themes of the approach taken by compiletime is the (attempted, anyway) elimination of all but one difference in the Scala source files, so that we can attribute differences in observed compile time to that one difference in the Scala source. For this reason we didn't want sbt in there, to be sure we were observing behaviors of the Scala compiler and not sbt.
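As a rough illustration of calling scalac directly and timing it, here is a minimal sketch. It is not the project's actual code: CompileTimer and its methods are names invented for this example. It measures wall-clock time for a fresh scalac process (JVM startup included) and assumes the output directory already exists.

```scala
import scala.sys.process._

object CompileTimer {
  // Runs `block` and returns its result with the elapsed wall-clock milliseconds.
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Command line that compiles `sources` into `outputDir` using whichever
  // scalac is found on the PATH. The directory must already exist.
  def scalacCommand(outputDir: String, sources: Seq[String]): Seq[String] =
    Seq("scalac", "-d", outputDir) ++ sources

  // Launches the compiler as a separate process; returns (exit code, elapsed ms).
  def compile(outputDir: String, sources: Seq[String]): (Int, Long) =
    timed(scalacCommand(outputDir, sources).!)
}
```

A call such as CompileTimer.compile("output", Seq("generated/ExampleSpec.scala")) would then give one data point. The file and directory names here are placeholders.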

The runem.sh script that drives the measurements looks like this:

JAVA_OPTS="-server -Xmx1024M -Xms128M"
export JAVA_OPTS
scala tenTestsPerFile.scala
JAVA_OPTS="-server -Xmx2048M -Xms256M"
export JAVA_OPTS
scala allTestsInOneFile.scala
scala testsIn100Files.scala
scala dataTables.scala
scala allMethodTestsInOneFile.scala
scala assertTestsInOneFile.scala
scala allClassTestsInOneFile.scala
scala google-chart.scala

We set the maximum heap size to 1 gigabyte and the initial heap size to 128 megabytes (the -Xmx and -Xms options, respectively) for the tenTestsPerFile script, which is the smallest compilation problem. We run the other scripts at double those values: a 2 gigabyte maximum heap and a 256 megabyte initial heap. We found these values did the best job of showing the differences without requiring an inconveniently long time to run the scripts. Under these memory settings, the specs2 immutable style will often produce an out-of-memory error. If you give it more memory, the specs2 immutable style will go farther, but the compile times for this style appear to increase exponentially with the amount of test code being compiled. Thus if you increase the memory settings, you should probably be prepared to not use your computer for many hours, for example by letting runem.sh run overnight while you sleep. We were actually unsuccessful at getting the specs2 immutable style to go all the way in most cases, so if you succeed in doing so, please post your graphs somewhere on the web and point us to them on scalatest-users so we can see what those actual numbers are.

Let's graph something!

In the remainder of this article, I'll summarize how we came to the conclusions I described previously in the Eating dessert first: what we learned section. The graphs shown below were generated by running runem.sh on my Macintosh Powerbook laptop, which has 8 gigabytes of total memory, a 2.5 gigahertz processor, and a solid-state drive. (Because the goal of this exercise is not to measure absolute compiler performance as much as to measure differences in compiler performance, I think the actual hardware used is not as important so long as it has sufficient memory to prevent swapping and a solid-state drive to quickly accept all the generated class files.) We used the latest versions of all test frameworks and of Scala at the time, except for ScalaTest, which was run with a snapshot of the latest version of trunk. In that snapshot, two additional implicits needed for a new feature coming in 2.0.M6 have been added to each style trait (which should slow compilation slightly compared to 2.0.M5b, the latest released ScalaTest 2.0 milestone version), and the two styles measured were classes, not traits (which should speed compilation slightly compared to 2.0.M5b). The versions were:

Scala	2.10
ScalaTest	2.0.M6-SNAP8
JUnit	4.11
Hamcrest	1.3
TestNG	6.8
specs2	1.13
scalaz	6.0.1

The compile-time cost of implicits and mixins

Several of the scripts focus on placing varying numbers of similar tests in the same test class. The first script we wrote took this approach because we were attempting to reproduce the problem described by Daniel Spiewak on ScalaWags: a test class (in a 5000-line source file) that contained one line of code per test, an assertion using must. We were never able to observe the magnitude of the problem Daniel described (we never saw his actual code), but nevertheless were able to measure differences and isolate causes this way. Since most test classes aren't this large, though, we also wrote a few scripts that generate multiple source files in the same directory, to make sure we weren't accidentally measuring artifacts of just placing all tests in the same source file.

The assertTestsInOneFile.scala script takes this all-tests-in-one-file approach to isolate the compile-time cost of two factors: the number of implicits in scope and whether you mix in a trait or extend a class. All of the generated tests use ScalaTest's Spec trait, in which tests are methods, not functions, so by-names are not involved. The script generates three variants: one that uses an assertion in each test, another that mixes in ShouldMatchers and uses a matcher expression in each test, and a third that imports ShouldMatchers._ and uses the same matcher expression in each test. Here's a snippet from the top of an example of the variant that uses assertions:

package SpecTripleEqual

import org.scalatest._

class ExampleSpec extends Spec  {
  object `Scala can ` { 
    def `increment 1` {
      assert(1 + 1 === 2)
    }
    def `increment 2` {
      assert(2 + 1 === 3)
    }
    def `increment 3` {
      assert(3 + 1 === 4)
    }
    // ...

As you can see, the assertion only exercises Scala's ability to add two integers. The goal here was again to keep the variables to a minimum. Essentially only the test framework itself is really being compiled here, for different styles of testing. Here's a snippet from the file that mixes in ShouldMatchers:

package SpecMixinShould

import org.scalatest._

class ExampleSpec extends Spec  with ShouldMatchers {
  object `Scala can ` {
    def `increment 1` {
      1 + 1 should be (2)
    }
    def `increment 2` {
      2 + 1 should be (3)
    }
    def `increment 3` {
      3 + 1 should be (4)
    }
    // ...

And here's the third variant that imports from ShouldMatchers instead of mixing it in:

package SpecImportShould

import org.scalatest._
import matchers.ShouldMatchers._

class ExampleSpec extends Spec  {
  object `Scala can ` { 
    def `increment 1` {
      1 + 1 should be (2)
    }
    def `increment 2` {
      2 + 1 should be (3)
    }
    def `increment 3` {
      3 + 1 should be (4)
    }
    // ...

Before I show you the results, it is helpful to look at the actual difference between using assertions and matchers. In this case, each assertion and each matcher expression used requires one implicit application. Each assertion requires an implicit conversion to invoke the === method on Int. Each matcher expression requires an implicit conversion to invoke the should method on Int. So that's the same. But there is a difference in the number of implicits in scope. You can determine this number with one line of code using the reflection library in Scala 2.10, as the countImplicits method below does:

scala> import scala.reflect.runtime.universe._
import scala.reflect.runtime.universe._

scala> def countImplicits(tp: Type): Int = tp.members.count(_.isImplicit)
countImplicits: (tp: reflect.runtime.universe.Type)Int

scala> countImplicits(typeOf[org.scalatest.Spec])
res0: Int = 5

scala> countImplicits(typeOf[org.scalatest.Spec with org.scalatest.matchers.ShouldMatchers])
res1: Int = 35

The first call to countImplicits shows that Spec declares five implicit members. But when you mix in ShouldMatchers, the number of implicits jumps to thirty-five. The extra thirty implicit members are required to implement ScalaTest's Matchers DSL. (Note: since 1.0, the simplest ScalaTest style traits, namely FunSuite, FunSpec, PropSpec, FeatureSpec, and now Spec, have declared only three implicits by default. This will increase to five in 2.0.M6, the next milestone release, but should drop back down to three for 2.0 final. Down the road, after we add assert macros and go through a deprecation cycle for the soon-to-be legacy === operator, the default implicit count for these simplest styles should drop to zero.) The number of implicits in scope will also be thirty-five for the variant that imports the members of ShouldMatchers: five inherited from Spec and thirty imported from the ShouldMatchers companion object.

Now the results. The data for this script is displayed across two graphs to see trends in both the small and large. The first graph shows compile times for each of the three variants for classes containing 0 tests, 10 tests, 20 tests, ... up to 100 tests:

The second graph shows compile times for classes containing 0 tests, 100 tests, 200 tests, ... up to 1000 tests.

The x-axis in both graphs is the number of tests in each test class. The y-axis is the compile time in milliseconds for each of the test classes. These show two consistent trends: 1) assertions compile faster than matchers, and 2) importing ShouldMatchers compiles slightly faster than mixing it in.

If you hover your mouse over the lines, little windows will pop up containing the actual compile time. If you hover over the times for 1000 tests, for example, you'll see it is 7.625 seconds for assertions, 8.517 seconds for imported matchers, and 8.760 seconds for mixed-in matchers. The conclusions to which I leapt given this data are 1) the 0.892 second difference between assertions and imported matchers is primarily caused by the compiler needing to inspect an extra 30 implicits for each of the 1000 implicit applications in the matchers case, and 2) the 0.243 second difference between imported and mixed-in matchers is primarily caused by the compiler needing to wire up the ShouldMatchers methods in class ExampleSpec.
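To put those differences on a per-test scale, the arithmetic can be worked through as below. The three timings are the ones quoted above. Attributing the differences entirely to implicits and to mixing in is the hypothesis being illustrated, not an established fact, and the object and value names are invented for this sketch.

```scala
object PerTestCost {
  // Compile times in seconds for 1000 tests, as read from the graph.
  val assertions = 7.625
  val importedMatchers = 8.517
  val mixedInMatchers = 8.760

  val tests = 1000
  val extraImplicits = 30 // 35 in scope with ShouldMatchers versus 5 with Spec alone

  // Extra milliseconds per test attributed (by hypothesis) to the extra implicits.
  val implicitCostPerTestMs: Double = (importedMatchers - assertions) * 1000 / tests

  // The same cost spread evenly over each of the 30 extra implicits.
  val costPerImplicitPerTestMs: Double = implicitCostPerTestMs / extraImplicits

  // Difference attributed (by hypothesis) to mixing in rather than importing,
  // for the single test class compiled in this measurement.
  val mixinCostSeconds: Double = mixedInMatchers - importedMatchers

  def main(args: Array[String]): Unit = {
    println(f"implicits: $implicitCostPerTestMs%.3f ms per test")
    println(f"           $costPerImplicitPerTestMs%.4f ms per implicit per test")
    println(f"mixin:     $mixinCostSeconds%.3f s for the test class")
  }
}
```

Seen this way, the implicit-related cost is a bit under a millisecond per test, while the mix-in cost is a fixed amount paid once for the class regardless of how many tests it contains.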

To get an idea of how many methods there are, you can use a similar trick to the one used to count implicits:

scala> def countMembers(tp: Type): Int = tp.members.size
countMembers: (tp: reflect.runtime.universe.Type)Int

scala> countMembers(typeOf[org.scalatest.Spec])
res4: Int = 104

scala> countMembers(typeOf[org.scalatest.Spec with org.scalatest.matchers.ShouldMatchers])
res5: Int = 272

As the numbers reveal, the Scala compiler must add 168 (i.e., 272 - 104) more members to each test class in the mix-in case.

The fastest way to compile in each test framework

The allMethodTestsInOneFile.scala script attempts to capture the fastest way to compile in each test framework: JUnit, TestNG, ScalaTest, and specs2. As JUnit and TestNG really only support one style each, the fastest way is, of course, that one style. The fastest way to compile ScalaTest is its Spec style, in which tests are methods, not functions, combined with expectResult, the assertion construct that does not require an implicit resolution. The fastest way to compile specs2 is its mutable style. To do this comparison, we again generated all tests in one file. Here's a snippet from the top of one of the JUnit test classes:

package JUnit

import org.junit.Assert.assertEquals
import org.junit.Test

class ExampleSpec {

    @Test def increment1() {
      assertEquals(2, 1 + 1)
    }
    @Test def increment2() {
      assertEquals(3, 2 + 1)
    }
    @Test def increment3() {
      assertEquals(4, 3 + 1)
    }
    // ...

The test class for TestNG is very similar:

package TestNG

import org.testng.annotations.Test
import org.testng.AssertJUnit.assertEquals

class ExampleSpec {

    @Test def increment1() {
      assertEquals(2, 1 + 1)
    }
    @Test def increment2() {
      assertEquals(3, 2 + 1)
    }
    @Test def increment3() {
      assertEquals(4, 3 + 1)
    }
    // ...

Here's the test class for ScalaTest's Spec:

package SpecSpec

import org.scalatest.Spec

class ExampleSpec extends Spec {

    def `increment 1` {
      expectResult(2) { 1 + 1 }
    }
    def `increment 2` {
      expectResult(3) { 2 + 1 }
    }
    def `increment 3` {
      expectResult(4) { 3 + 1 }
    }
    // ...

And last but not least, the test class for specs2's mutable Specification:

package SpecificationSpecification

import org.specs2.mutable._

class ExampleSpec extends Specification {

    "increment 1" in {
      1 + 1 must be equalTo (2)
    }
    "increment 2" in {
      2 + 1 must be equalTo (3)
    }
    "increment 3" in {
      3 + 1 must be equalTo (4)
    }
    // ...

This time test classes were only generated with 0, 100, 200, ..., to 1000 tests, respectively. The graph looks like:

As before, the x-axis indicates the number of tests in the test class. The y-axis indicates the compile time. If you hover your mouse over the lines at 1000 tests, you'll see that the JUnit class took 4.329 seconds. If you're persistent you may be able to find a spot where it will pop up TestNG's compile time, which is 4.352 seconds. It is slightly slower, but so close to JUnit as to be hard to find with the mouse. ScalaTest's Spec is easier to find with the mouse, because it is nearly half a second slower: 4.766 seconds. The specs2 mutable Specification is also easy to find with the mouse, as it is 33.337 seconds, roughly seven to eight times slower to compile than the others.

The specs2 mutable Specification compile time is so much slower because it has several disadvantages compared to the others in this measurement. First, it uses test functions, not test methods, so the compiler has to deal with one extra by-name per test. In addition, each test has an implicit application to add the must method, compounded by a large number of implicits to pore through each time. The other three have neither. And the number of implicits is quite high:

scala> countImplicits(typeOf[org.specs2.mutable.Specification])
res1: Int = 149

Whereas extending Spec by itself brings 5 implicits into scope, and extending Spec with ShouldMatchers brings 35 implicits into scope, extending specs2's mutable Specification brings 149 implicits into scope. And of course JUnit and TestNG, being Java frameworks, bring no implicits into scope. For comparison, here's how many implicits you get by default in Scala from Predef:

scala> countImplicits(typeOf[Predef.type])
res2: Int = 78

So specs2 mutable style's 149 is quite a few implicits. The other disadvantage of the specs2 mutable style in this measurement is that the test class is mixing in a trait, Specification, whereas the others are not. The JUnit and TestNG test classes don't extend anything, and the ScalaTest class extends Spec, which is a class in the version used, 2.0.M6-SNAP8. Here's how many members the specs2 mutable test class ends up with:

scala> countMembers(typeOf[org.specs2.mutable.Specification])
res4: Int = 795

For comparison, the member count for ScalaTest's WordSpec with MustMatchers, which provides the most similar DSL to specs2's mutable Specification, is quite a bit smaller:

scala> countMembers(typeOf[org.scalatest.WordSpec with org.scalatest.matchers.MustMatchers])
res5: Int = 282

I believe this accounts for the observation that switching from mixing in a trait to extending a class improves compile times for specs2 by around 0.5 seconds, but for ScalaTest only around 0.15 to 0.2 seconds. The compiler has roughly 2.8 times as many members to wire up in the specs2 case (795 versus 282).

The main takeaway from this chart is the realization that how you use Scala's features, as well as how the libraries you've chosen use Scala's features, has a big impact on your compile times.

The other takeaway for me, of course, was why the heck does ScalaTest Spec require nearly a half second more to compile than JUnit? The only significant difference between the two test classes is that the ScalaTest class extends org.scalatest.Spec, and the JUnit class extends java.lang.Object. When I experimented, I discovered that this was only true if I extended Spec, not if I just extended org.scalatest.Assertions, which was sufficient to get the class to compile. This did not make any sense to me, nor to Grzegorz Kossakowski, so after our session at nescala, we spent a couple hours applying his scalac-aspects tool to investigate. We didn't figure it out in the time we had, but hopefully once we do, either I can do something in Spec, or something can be done in the Scala compiler, to close that gap.

Comparing apples to apples

The previous measurement did an apples-to-apples comparison of JUnit, TestNG, and ScalaTest compile times, because the test classes were very similar. The specs2 test class, on the other hand, was quite different because specs2 offers nothing similar: it has no assertions and no way to write tests as methods. To do an apples-to-apples comparison with specs2, therefore, we needed to use the ScalaTest styles that are most similar to specs2. The testsIn100Files.scala script attempts to do this. It compares both mutable and immutable specs2 styles, and both WordSpec and Spec from ScalaTest, using ScalaTest's MustMatchers. Instead of placing all tests in one file, however, each data point involves compiling 100 source files, each containing one test class. The number of tests in each test class is increased from 0, 10, 20, ... 100. Here's a snippet from one of the Spec classes:

package SpecMust

import org.scalatest._
import matchers.MustMatchers._

class ExampleSpec10 extends Spec  {
  object `Scala can ` {
    def `increment 1` {
      1 + 1 must be (2)
    }
    def `increment 2` {
      2 + 1 must be (3)
    }
    def `increment 3` {
      3 + 1 must be (4)
    }
    // ...

Here's a snippet from one of the WordSpec classes:

package WordSpecMust

import org.scalatest._
import matchers.MustMatchers._

class ExampleSpec10 extends WordSpec  {
  "Scala" can {
    "increment 1" in {
      1 + 1 must be (2)
    }
    "increment 2" in {
      2 + 1 must be (3)
    }
    "increment 3" in {
      3 + 1 must be (4)
    }
    // ...

Here's a snippet from one of the mutable Specification classes:

package mSpecification

import org.specs2.mutable._

class ExampleSpec10 extends Specification {
  "Scala" can {
    "increment 1" in {
      1 + 1 must be equalTo (2)
    }
    "increment 2" in {
      2 + 1 must be equalTo (3)
    }
    "increment 3" in {
      3 + 1 must be equalTo (4)
    }
    // ...

And lastly, here's one of the immutable Specification classes in its entirety:

package iSpecification

import org.specs2._

class ExampleSpec10 extends Specification { def is =
  "Scala can"  ^
    "increment 1"  ! e1^
    "increment 2"  ! e2^
    "increment 3"  ! e3^
    "increment 4"  ! e4^
    "increment 5"  ! e5^
    "increment 6"  ! e6^
    "increment 7"  ! e7^
    "increment 8"  ! e8^
    "increment 9"  ! e9^
    "increment 10"  ! e10^
    end
def e1 = 1 + 1 must be equalTo (2)
def e2 = 2 + 1 must be equalTo (3)
def e3 = 3 + 1 must be equalTo (4)
def e4 = 4 + 1 must be equalTo (5)
def e5 = 5 + 1 must be equalTo (6)
def e6 = 6 + 1 must be equalTo (7)
def e7 = 7 + 1 must be equalTo (8)
def e8 = 8 + 1 must be equalTo (9)
def e9 = 9 + 1 must be equalTo (10)
def e10 = 10 + 1 must be equalTo (11)

}

Here are the results:

The x-axis indicates how many tests are in each of 100 test classes. So at the data point labeled 10, one thousand tests are being compiled: 10 tests in each of 100 files. At the data point labeled 100, ten thousand tests are being compiled: 100 tests in each of 100 files. The y-axis, as usual, is compile time in milliseconds. The reason the specs2 immutable style curve, the red line, drops after 50 is that the compiler crashes with an out-of-memory error. More on that later.

One difference that is isolated by this measurement is the cost of by-names. The only significant difference between ScalaTest's Spec (the blue line) and WordSpec (the green line) is that in Spec, tests are methods whereas in WordSpec, tests are functions. If you hover your mouse over the right end of both those curves, you'll see it took 49.657 seconds to compile 10,000 test (by-name) functions, but only 25.639 seconds to compile 10,000 test methods—about half the time.
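To make the method-versus-function distinction concrete, here is a small sketch of WordSpec-like registration. Registry and its members are invented for illustration and are not ScalaTest's implementation. The point is only that each by-name argument has to be packaged by the compiler as a function value that can be run later, which is extra work per test that a plain method definition does not need.

```scala
object ByNameDemo {
  // WordSpec-like registration: `body` is by-name, so the compiler wraps the
  // argument at each call site in a function that is evaluated later.
  final class Registry {
    private var tests = Vector.empty[(String, () => Unit)]
    def in(name: String)(body: => Unit): Unit = tests :+= (name -> (() => body))
    def size: Int = tests.size
    def runAll(): Unit = tests.foreach { case (_, run) => run() }
  }

  // Returns how many test bodies had executed (before runAll, after runAll).
  def demo(): (Int, Int) = {
    var executed = 0
    val registry = new Registry
    registry.in("increment 1") { assert(1 + 1 == 2); executed += 1 }
    registry.in("increment 2") { assert(2 + 1 == 3); executed += 1 }
    val before = executed
    registry.runAll()
    (before, executed)
  }
}
```

Nothing runs at registration time; the two bodies execute only when runAll() is called. Note that how closures are compiled has changed across Scala versions, so the per-closure cost observed with Scala 2.10 may not carry over unchanged to later compilers.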

One way to measure this difference in by-names is to look at how many class files are generated, because the Scala compiler will emit a class file for each by-name. Here's a graph showing the number of class files generated by running this script:

As you can see, the number of class files for Spec actually stays constant at 200 class files. That's one class file for each of the 100 test classes, and one class file for the `Scala can` singleton object in each test class. By contrast, in addition to the same 200 class files you get with Spec, with WordSpec you get one more class file per test. So the count grows linearly along with the number of tests being compiled. If you hover your mouse at the high end of the green line, where 10,000 tests are being compiled, for example, you see 10,200 class files were generated. That value represents the baseline 200 class files shared with Spec, plus one class file for each of the 10,000 tests.
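The class file counts described above follow a simple formula, sketched here with invented names (ClassFileCount, specClassFiles, wordSpecClassFiles). It covers only the Spec and WordSpec cases explained in the text and assumes the 100 test classes used by the testsIn100Files.scala script.

```scala
object ClassFileCount {
  val testClasses = 100

  // Spec: one class file per test class plus one per `Scala can` singleton object.
  def specClassFiles: Int = 2 * testClasses

  // WordSpec: the same baseline plus one class file per by-name test.
  def wordSpecClassFiles(testsPerClass: Int): Int =
    specClassFiles + testClasses * testsPerClass

  def main(args: Array[String]): Unit =
    for (n <- 0 to 100 by 10)
      println(s"$n tests per class: Spec=$specClassFiles WordSpec=${wordSpecClassFiles(n)}")
}
```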

Now consider the difference between the green (ScalaTest WordSpec) and the yellow (specs2 mutable Specification) line. If you hover your mouse over the high end of the yellow line, you'll see that specs2 managed to generate 20,000 more class files for almost identical source code. It does this because the matcher statement itself is generating two class files via by-names. This was described by Daniel Spiewak on the ScalaWags podcast:

...all of the assertions have lambdas on both sides, because they are by-names in both the receiver and the parameters...

For a more detailed explanation on the cause, see this excerpt of a tech talk I gave at Twitter in Dec, 2011.

Note that although we are counting class files, it doesn't appear that the actual writing of those class files significantly impacts compiler performance, at least on a solid-state drive. I reached that conclusion with the help of Kirk Pepperdine, who helped me monitor just what the compiler was doing while compiling these files. Daniel Spiewak indicated on the ScalaWags episode that he'd come to the same conclusion:

It wasn't I/O cost; it was all CPU-bound.

Note also that “1 + 1 must be (2)” does not compile in specs2, nor can you write “1 + 1 must equal (2)” in specs2, as you can in ScalaTest. You can write “1 + 1 must equalTo (2)” in specs2, but that's not valid English grammar, so we used “1 + 1 must be equalTo (2)”, which is valid English grammar. So one other difference between the green and yellow lines is that the matcher statement is not identical, but my belief is that the significant difference in compile time is predominantly caused by the extra implicits in scope and the two extra by-names used by specs2's matchers. (If someone wants to write another script to try to verify that belief, that might also make a good contribution to the project.)

Because of the increased use of by-names, the larger number of implicits in scope, and the act of mixing in a trait rather than extending a class, the mutable Specification style tended to need around 2.5 times more time to compile than WordSpec with MustMatchers. If you hover your mouse over the high end of the green line, for example, you'll find that WordSpec with MustMatchers required 49.657 seconds to compile 10,000 tests, whereas the mutable Specification required 140.682 seconds to compile very similar test code.

To understand what's going on with the immutable Specification style, it is helpful to look at another chart. This one measures the total size of the class files generated:

As you can see from this and the previous charts, whereas the count of class files goes up linearly for the immutable Specification style, the actual size appears to increase exponentially as you add more tests. If you look back two charts, you'll see that the compile time for the immutable Specification style tends to follow the same upward-bending curve until the compiler runs out of memory. Hover your mouse at the peak and you'll find that the compiler emitted 202,902,807 bytes worth of class files to compile 5,000 tests in the immutable style, which is 193.503 megabytes. We did try increasing the memory above the settings in runem.sh, and sure enough, the compiler can manage to compile more tests written in the immutable Specification style, but because that curve keeps bending upwards, it takes a very long time to complete. I don't understand why the class file size is increasing. It would be interesting to find out why, both so it can be fixed if possible in specs2, and so the rest of us will know what to avoid in future designs.
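If you want to collect class file counts and sizes for your own builds, something along these lines would do it. This is a sketch written for this article, not the code the compiletime scripts use. ClassFileStats, Stats, and measure are hypothetical names, and it relies on scala.jdk.CollectionConverters, so it needs Scala 2.13 or later.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object ClassFileStats {
  final case class Stats(count: Int, totalBytes: Long) {
    // Size in megabytes, where one megabyte is 1024 * 1024 bytes.
    def megabytes: Double = totalBytes / (1024.0 * 1024.0)
  }

  // Walks `dir` recursively and totals every regular file ending in ".class".
  def measure(dir: Path): Stats = {
    val stream = Files.walk(dir)
    try {
      val sizes = stream.iterator().asScala
        .filter(p => Files.isRegularFile(p) && p.getFileName.toString.endsWith(".class"))
        .map(p => Files.size(p))
        .toList
      Stats(sizes.size, sizes.sum)
    } finally stream.close()
  }
}
```

Pointing measure at a compiler output directory gives both numbers at once, and megabytes uses the same 1024 * 1024 conversion that turns 202,902,807 bytes into roughly 193.503 megabytes.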

Conclusion

So that's essentially what we've learned so far. If you run the scripts, please look for any places where you think we might be measuring something besides what we think we're measuring. If you find something, please let us know on the scalatest-users mailing list.

The most obvious-sounding conclusion you can make from this data is that if you give the Scala compiler more work to do, it requires more time to do the work. I think the main takeaway for library designers is to use Scala's features with some degree of restraint. Although Scala features like implicits and by-names can certainly enhance developer productivity, if they aren't used with care the productivity boost can end up being attenuated by unnecessarily long compiles.

Acknowledgments

The compiletime scripts were primarily written by Chee Seng Chuah. Thanks also to Kirk Pepperdine and Grzegorz Kossakowski for their assistance with the project, and to Coda Hale and Daniel Spiewak for sharing information about the test classes that were taking an inordinately long time to compile.

Aside: How to run the compiletime scripts

You can easily run the compiletime scripts and generate the graphs yourself. If you have the inclination, please inspect them and look for places where we may be miscalculating the measurements or misinterpreting the results. You can also tweak the scripts and rerun them to try out your own theories. To run the scripts yourself:

Clone the project
git clone https://github.com/bvenners/compiletime.git
Cd into the directory
cd compiletime
Run the scripts
sh < runem.sh

The graphs will be placed in HTML files in subdirectories named after the scripts.

Aside: Complaints about test compile times

For context, here are some of the complaints I noticed over time about compile times for specs or specs2, which is what originally motivated this project: a desire to understand how design choices might influence compile times. I have never actually heard a similar user complaint about ScalaTest, JUnit, or TestNG. Personally I have been deeply involved in two major Scala projects, both of which use ScalaTest (one of which is ScalaTest itself). Productivity on both of these ScalaTest projects would definitely benefit from a reduction in compile time. I have never personally worked on a Scala project that uses JUnit, TestNG, specs, or specs2.

One of the first complaints came from Coda Hale's leaked email about why Yammer was moving away from Scala:

Ditching Specs2 for my little JUnit wrapper meant that the main test class for one of our projects (~600-700 lines) no longer took three minutes to compile or produced 6MB of .class files.

Another complaint came from Mark McBride in a talk about Scala usage at Twitter:

Another thing with specs, just from a performance standpoint, having worked on the streaming API, which has the largest investment in specs, doing a clean of the streaming API on a non-SSD takes an appreciable amount of time, like close to a minute, because specs emits a class file for every little statement specs does, because they are all closures. So I think we ended up with like 45,000 class files generated for our test suite for the streaming API.

There were also occasional tweets from random specsN users, such as this conversation between Stephan Schmidt and Pawel Dolega:

Schmidt: Why are specs2/SBT tests so damn slow?
Dolega: We had a similar problem with specs2; a fair part of the answer for us was (point 5 of Coda Hale's email). To be clear--in our case "slow" referred to compilation, not execution.

And Dan Spiewak described a compiler performance problem with Precog's specs2 tests in the second ScalaWags episode:

If you have a file that has an enormous number of lambdas in it--a good example of this is a large specs test suite, you've got your outer lambdas, then every single example is a lambda, then all of the assertions have lambdas on both sides, because they are by-names in both the receiver and the parameters--if you have a file that has a zillion lambdas in it, it can take an enormous amount of time to compile. We had a 5000 line file that took about 15 minutes to compile--one file.
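To make the "by-names in both the receiver and the parameters" point concrete, here is a minimal sketch of the shape being described. It is not the specs2 API; ByNameSketch, Expectation, expect, example, and tick are hypothetical names invented for illustration. Every expression passed to a by-name parameter is wrapped in its own closure by the compiler, and in the Scala versions current when this article was written, each such closure was compiled to a separate anonymous class file.

```scala
// Hypothetical mini-DSL for illustration only; not taken from specs2.
object ByNameSketch {
  // Counts how many wrapped expressions have actually been evaluated.
  var evaluations = 0

  // Evaluates nothing special; just records that its argument was computed.
  def tick[A](a: A): A = { evaluations += 1; a }

  // The receiver side: `actual` is by-name, so the call site passes a closure.
  final class Expectation[A](actual: => A) {
    // The parameter side: `expected` is by-name too, so a second closure.
    def mustEqual(expected: => A): Boolean = actual == expected
  }

  def expect[A](actual: => A): Expectation[A] = new Expectation(actual)

  // Each example body is itself passed by name and kept as a function.
  def example(name: String)(body: => Boolean): () => Boolean = () => body

  def main(args: Array[String]): Unit = {
    // One example with one assertion already involves several closures:
    // the example body, the receiver expression, and the expected expression.
    val test = example("addition") {
      expect(tick(1 + 1)).mustEqual(tick(2))
    }
    println("evaluated before running: " + evaluations)
    println("passed: " + test())
    println("evaluated after running: " + evaluations)
  }
}
```

Nothing inside the example body runs until the returned function is invoked, which is what the closures buy the test framework. Multiply the handful of closures in this one assertion by thousands of assertions in a file and you get the kind of source the quote is describing.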

About the author

Bill Venners is president of Artima, Inc., publisher of Artima Developer (www.artima.com). He is author of the book, Inside the Java Virtual Machine, a programmer-oriented survey of the Java platform's architecture and internals. His popular columns in JavaWorld magazine covered Java internals, object-oriented design, and Jini. Active in the Jini Community since its inception, Bill led the Jini Community's ServiceUI project, whose ServiceUI API became the de facto standard way to associate user interfaces to Jini services. Bill is also the lead developer and designer of ScalaTest, an open source testing tool for Scala and Java developers, and coauthor with Martin Odersky and Lex Spoon of the book, Programming in Scala.