Comparing Golang, Scala, Elixir, Ruby, and now Python3 for ETL: Part 2

A year ago, I wrote the same program in four languages to compare their productivity when performing ETL (extract-transform-load). Read about part 1 here and feel free to check out the source code.

The code has changed, the languages have evolved, and the hardware now includes an SSD. So, where are they now?

Results

Ruby w/ Celluloid (Global Interpreter Lock Bound, single core) 43.7s
JRuby w/ Celluloid 15.8s
Ruby w/ grosser/parallel (not GNU Parallel) 10.9s
Python w/ Pool 12.7s
Scala 8.8s
Scala w/ Substring (Skipped regex for performance analysis) 8.3s
Golang 32.8s
Golang w/ Substring (Skipped regex for performance analysis) 7.8s
Elixir 21.8s

Recap

The original goal was not to see how fast each language could go. Rather, it was to measure the length of time needed to write a solution and subjectively measure the maintainability of said solution, all while learning each language’s gotchas on the way. But, in the end, everyone wants benchmarks.

It was assumed that runtimes would all be approximately the same, since this should have been an IO-bound problem. So why care about the speed of the language? Well, on my old MacBook Pro with a 5200 RPM HDD, this was not true and, surprisingly, it still isn’t on my SSD.

The Hardware

MacBook Pro 2.3GHz i7 (quad core) with 16GB RAM and SSD

The Problem

We have ~40M tweets spanning multiple files, with each tweet tagged with its New York City neighborhood. We want to discover which neighborhoods care the most about the New York Knicks by searching for the term knicks.

Questions and Concerns from Part 1

  1. Why was I writing to an intermediary file? Why didn’t I do it all in memory? Well, now I do.

    This comparison was derived from a larger ETL process that spanned multiple computers and therefore used intermediary files to pass along the information. This cookie-cutter experiment has no need for this, so it has been removed.

  2. Why am I using regex and not a simple string search (GoLang’s regex sucks in 1.x.x)?

    The implementations should be consistent across all languages for a fair comparison. Even though the problem is simply searching for knicks, I wanted the implementations to have the flexibility to perform more powerful searches. That being said, Golang’s regexp package performs dramatically worse than the regex engines in the other languages, so I included results using strings.Contains.

  3. In Scala, why did I use Akka instead of the lighter Parallel Collections?

    Because I love Akka.
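To make the trade-off in question 2 concrete, here is a small Go sketch in which the regex search and the substring search sit behind the same function type, so either can be swapped in. The matcher type and both constructor names are invented for this example and are not taken from the repo.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matcher reports whether a tweet's text is a hit.
// (Hypothetical type, introduced only for this sketch.)
type matcher func(text string) bool

// regexMatcher keeps the flexibility of arbitrary patterns.
func regexMatcher(pattern string) matcher {
	re := regexp.MustCompile(pattern) // compile once, reuse for every tweet
	return re.MatchString
}

// substringMatcher gives up that flexibility for a plain substring search.
func substringMatcher(term string) matcher {
	return func(text string) bool {
		return strings.Contains(text, term)
	}
}

func main() {
	tweets := []string{"the knicks won", "pizza in soho"}
	matchers := []matcher{regexMatcher("knicks"), substringMatcher("knicks")}
	for _, m := range matchers {
		for _, t := range tweets {
			fmt.Println(m(t))
		}
	}
}
```

For a literal term like knicks the two matchers agree on every input; they only diverge once the pattern uses real regex features, which is the flexibility I wanted to keep.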

Implementation Changes

Ruby

  • Ruby version is now 2.2.2.
  • No longer uses GNU Parallel, but instead uses grosser/parallel to span multiple cores.
  • Implementation no longer writes to intermediary file.

Scala

  • Upgraded to Scala 2.11.5 and Akka 2.3.10.
  • Reduction no longer writes to intermediary file.
  • Still uses Akka. If you think the Parallel Collections library would be a better fit, which it very well might be, please feel free to contribute a pull request.

Python

  • Version python3-3.4.3
  • A new Python implementation has been added for comparison’s sake.
  • The Pool object allows one to run the program on multiple processes and sidestep the Global Interpreter Lock (GIL). It is a pretty great alternative to my use of GNU Parallel with Ruby in part 1.

Elixir

  • Updated to Elixir version 1.0.4
  • Reduction no longer writes to intermediary file.
  • Actor model is beautiful in Elixir.
  • No significant performance improvement when using String.contains? instead of regex.
  • Profiled with exprof but didn’t see any low-hanging fruit (I welcome any feedback here). Elixir Profiling
  • Changing Map.merge(...) to HashDict.merge(...) made a dramatic difference. It speaks to the youth of Elixir.

From the website:

Note: Maps were recently introduced into the Erlang VM with EEP 43. Erlang 17 provides a partial implementation of the EEP, where only “small maps” are supported. This means maps have good performance characteristics only when storing at maximum a couple of dozens keys. To fill in this gap, Elixir also provides the HashDict module which uses a hashing algorithm to provide a dictionary that supports hundreds of thousands keys with good performance.

Golang

  • Updated Golang to 1.4.2.
  • Initial performance was a disappointing 30s+, so I dug in and used pprof to profile the code. Golang Profiling
  • Go’s Regular Expression engine really is as slow as a previous commenter mentioned. Switching to strings.Contains took it to ~7s.
  • They’ve been hyped before, and I’m going to hype them again: Golang’s channels are fantastic.

A modification to the GoLang implementation liberally uses channels as a FIFO queue to great effect:

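// Spawn starts N goroutines running fn. When the last one finishes,
// it runs every whendone function in order. Requires "sync/atomic".
func Spawn(N int, fn func(), whendone ...func()) {
  waiting := int32(N)
  for k := 0; k < N; k++ {
    go func() {
      fn()
      // The goroutine that drops the counter to zero is the last to finish.
      if atomic.AddInt32(&waiting, -1) == 0 {
        for _, done := range whendone {
          done()
        }
      }
    }()
  }
}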

Usage

Note that the filenames channel acts as a thread-safe queue.

filenames := make(chan string, *procs)

Spawn(*procs, func() {
  for filename := range filenames {
    file, err := os.Open(filename)
    ...
  }
}, func() { fmt.Println("Done") })
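The fragment above leaves out how the queue is filled and closed. Here is a minimal, self-contained sketch of the whole pattern. It is not the code from the repo: processAll, the finished channel, and the file names are all invented for illustration, and the workers only record each name instead of opening a file.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Spawn starts n goroutines running fn and, once the last one returns,
// runs each whendone callback.
func Spawn(n int, fn func(), whendone ...func()) {
	waiting := int32(n)
	for k := 0; k < n; k++ {
		go func() {
			fn()
			if atomic.AddInt32(&waiting, -1) == 0 {
				for _, done := range whendone {
					done()
				}
			}
		}()
	}
}

// processAll (hypothetical helper) drains a queue of names with the given
// number of workers and returns the set of names that were handled.
func processAll(names []string, workers int) map[string]bool {
	queue := make(chan string, workers) // the channel is the FIFO queue
	finished := make(chan struct{})     // closed by the whendone callback
	var mu sync.Mutex
	seen := make(map[string]bool)

	Spawn(workers, func() {
		// Each worker pulls names until the channel is closed and empty.
		for name := range queue {
			mu.Lock()
			seen[name] = true // stand-in for opening and scanning the file
			mu.Unlock()
		}
	}, func() { close(finished) })

	for _, name := range names {
		queue <- name
	}
	close(queue) // lets the workers' range loops end
	<-finished   // wait for the last worker to finish
	return seen
}

func main() {
	// Placeholder file names, not the real data set.
	files := []string{"tweets-1.txt", "tweets-2.txt", "tweets-3.txt"}
	fmt.Println(len(processAll(files, 2))) // prints 3
}
```

Closing the channel is what ends the workers' range loops, so no separate "stop" signal is needed. A sync.WaitGroup would do the same job as the atomic counter; I kept the counter here to stay close to Spawn as shown.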

Conclusion

  • It’s always a challenge (or a lot of fun) attempting to write the same thing in two languages, let alone five. Each language’s idioms sway an implementation in a particular direction. Long story short, there are still a lot of discrepancies between the implementations.
  • Elixir and Golang have matured dramatically in a year’s time.
  • It is damn difficult to parse Scala code when you’ve been away for a while. It’s just… dense.
  • My previous conclusion still holds up. Check it out (it’s at the bottom of the link).
  • This whole experiment has lived far longer than I thought it would.

Think you can do better? Want to see another language? Contribute.

Submit a pull request with your code changes and I’ll update the doc.

Thanks

Thanks to all those who contributed to the repo:
