ETL Language Showdown Part 3

14 Nov 2015

This article is a continuation of the Extract-Transform-Load (ETL) showdown series. Check these posts out for additional context and check out the repo.

Analyses and discussions done in this ETL series have led to the following language pull requests:

In this post, we will compare a Map Reduce solution to count the number of times knicks is mentioned in ~40M tweets spanning multiple files. This originally came about as a way to compare vanilla GIL-bound Ruby implementations against Scala and Golang. It has since evolved into a repo of idiomatic ETL solutions for a variety of languages (ten at the time of this writing).

Langauges Covered

Language	Owner
Ruby
Golang
Scala
Nim
Node
PHP
Erlang	potatosalad
Elixir	josevalim
Rust	potatosalad
Python
C#	mganss
shell	mganss
perl	sitaramc

Hype up your language by improving the current implementation. Give me a shout by raising an issue to become an owner.

Shortcuts led to apple and orange comparisons

During this adventure, my Golang solution wasn’t as performant as I had hoped because the regex library couldn’t compete with Scala’s. As a result, I introduced the first apple to orange comparison by allowing Golang to use substrings.

One can’t help but find the fastest solution possible, and this story continued with other languages and pull requests. Reference implementations were accompanied with optimized solutions taking shortcuts, and I couldn’t turn away such sweet code.

Even when sticking to the reference solution, it has becoming increasingly difficult to keep the implementations consistent across languages because certain idioms sway solutions in different directions.

Examples of shortcuts

PHP takes a shortcut by using stripos that only works on ASCII, not Unicode.
Regex comparison as opposed to substring checks.
Loading file contents into memory as opposed to streaming.
Having the master or supervisor keep track of the results so there is effectively no reduction step.
Leveraging binary pattern matching as opposed to regular regex.

Here are the previous implementations described in part 2 of the ETL showdown:

Takeaway

The results have a healthy amount of variety. Going into part 1, it was believed that the results of multi-threaded approaches would be the same because the problem would be IO bound. Well, that was wrong. Instead of using this as a speed comparison across languages, use this repo as a reference for writing idiomatic ETL solutions in the language of your choice. Check out the repo.

In order to promote consistency across implementations, I’ve introduced the following:

Rules of Reference Implementation

Stream input from files.
Use Regular Expressions to check for the presence of knicks.
Have multiple mappers, but one reducer.
Each individual worker holds its results in a hash and sends that final hash back for reduction.

New School: Language Additions

New School Observations and Shortcuts

We’ve since added even more languages, each with their own nuances. Nim is a newcomer I’ve never heard of but it performed surprisingly well. Below you’ll find some notes on each implementation.

Erlang

Leverages Binary Pattern Matching.
Holds all data in memory as opposed to streaming input line by line.

Nim

Implementation seems identical to the ruby version (streaming input with regex but using multiple cores) and it is blazing fast. The first I’ve heard of Nim.

Rust

Regex solution is identical to reference implementation.
Substring solution only deals with ASCII.
Man is it fast. Would love someone to take another look at the Rust implementation to see how it might be taking shortcuts against the reference implementation.

PHP

Only doing substring checks against ASCII.
Single threaded.
Included just to grow the library of implementations. Improvements and pull requests are welcome.

Node JS

Uses cluster to get around the Global Interpreter Lock (GIL).
The workers communicate a match to the cluster master, effectively skipping the reduction step.

Call To Action

The number of languages being covered has grown past my ability to maintain. If you would like to own a particular language’s implementation, please let me know and feel free to submit PR’s my way.
Another area of interest is memory consumption. Tracking memory consumption can be tricky. For example, runtimes can aggresssively grab memory despite the application not using all of it. It would be great to track peak memory usage for the runs, so that we can now plot memory consumption along with execution time.

dimroc blog See Experiments