Counting Crowds and Lines

Updated with video footage of the CUHK Mall Dataset:

The ML and site for this post can be found at

In Union Square, NYC, there’s an untoppable burger joint named Shake Shack that’s always crowded. A group of us would obsessively check the Shake Cam around lunch to figure out whether the trip was worth it.

Shake Cam

14 person line, not bad

Rather than do this manually (come on, it’s nearly 2018), it would be great if this could be done for us. Then, to take that idea further, imagine being able to measure foot traffic on a month-to-month basis or to measure the impact of a new promotional campaign.

Count Alpha

Object detection has received a lot of attention in the deep learning space, but it’s ill-suited for highly congested scenes like crowds. In this post, I’ll talk about how I implemented a multi-scale convolutional neural network (CNN) for crowd and line counting.

Why not object detection

Region-based CNNs (R-CNN) find objects by classifying candidate regions of the image, much like a sliding window. High-density crowds are ill-suited to this approach because of heavy occlusion:


Failed attempt with an off-the-shelf (no retraining) TensorFlow R-CNN

Further exploration of this approach led me to TensorBox, but it too had issues with high congestion and large crowd counts.

Density Maps to the rescue

Rather than a sliding window, density maps (aka heat maps) estimate the likelihood of a head being at a location:

UCF Original Dense Crowd Ground Truth

Crowd photo from the UCF Dataset

3406 vs 3408? Pretty close!

What’s happening here?

Multi-scale CNN

Based on the paper on multi-scale convolutional neural networks (CNN) for crowd counting, the ground truth is generated by taking the head annotations, setting each annotated pixel to one, and then Gaussian blurring the image. The model is then trained to output these blurred images, or density maps. Because a normalized blur preserves the total pixel mass, the sum of all pixels in a density map is the crowd count prediction. Read the paper for more insight.
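To make the ground truth step concrete, here is a minimal sketch, assuming NumPy and head annotations given as (row, col) pixel coordinates inside the image. The function names and the `sigma` value are my own placeholders, not taken from the paper or from this project's code. Instead of setting pixels to one and blurring afterwards, it stamps a normalized Gaussian at each head, which gives the same result for heads away from the border. Blobs that get clipped at the edge are renormalized so every head still contributes exactly one to the sum.

```python
import numpy as np


def gaussian_kernel(sigma, radius):
    """Return a (2*radius+1) x (2*radius+1) Gaussian kernel that sums to 1."""
    ax = np.arange(-radius, radius + 1)
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def density_map(heads, height, width, sigma=4.0):
    """Build a ground-truth density map from (row, col) head annotations.

    Assumes every head lies inside the image. Each head adds a blob whose
    pixels sum to 1, so the whole map sums to the number of heads.
    """
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    density = np.zeros((height, width), dtype=np.float64)
    for row, col in heads:
        # Clip the blob's footprint to the image bounds.
        r0, r1 = max(row - radius, 0), min(row + radius + 1, height)
        c0, c1 = max(col - radius, 0), min(col + radius + 1, width)
        # Take the matching slice of the kernel.
        patch = kernel[r0 - (row - radius):r1 - (row - radius),
                       c0 - (col - radius):c1 - (col - radius)]
        # Renormalize so a head near the border still counts as one person.
        density[r0:r1, c0:c1] += patch / patch.sum()
    return density


if __name__ == "__main__":
    heads = [(10, 10), (0, 0), (30, 45)]  # made-up annotations
    dm = density_map(heads, height=48, width=64)
    print(round(dm.sum(), 6))  # crowd count recovered from the map
```

The model's prediction is read the same way: sum the pixels of the predicted map to get the count.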

Let’s look at density maps applied to the Shake Cam. Don’t worry about the color switch from blue to white for the density maps.

Dense Crowd Ground Truth

The sum of the pixel values is the size of the crowd

As you can see above, we have:

  1. The annotated image courtesy of AWS Mechanical Turk.
  2. The calculated ground truth by setting head locations to one and then Gaussian blurring.
  3. The model’s prediction after being trained with ground truths.

How to get the images?

From your neighborhood Shake Shack Cam of course.

How to annotate the data?

The tried and true AWS Mechanical Turk, with a twist: a mouse click annotates a head as shown below:

Head Annotator

I went ahead and modified the bbox-annotator to be a single click head annotator.

How to count the line?

A line isn’t merely people in a certain space; it is people standing next to each other to form a contiguous group. For now, I simply feed the density map into a three-layer fully connected (FC) network that outputs a single number, the line count.
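As a rough illustration of the shape of such a network, here is a forward-pass sketch in plain NumPy. The hidden layer sizes, the ReLU activations, the initialization, and the function names are all assumptions for the example, not the actual configuration behind this post. The weights here are random, so the output means nothing until the network is trained against the labeled line counts.

```python
import numpy as np


def init_line_counter(input_size, hidden1=64, hidden2=16, seed=0):
    """Create random weights for a three-layer fully connected network.

    The layer sizes are illustrative guesses. Returns a list of
    (weights, bias) pairs.
    """
    rng = np.random.default_rng(seed)
    sizes = [input_size, hidden1, hidden2, 1]
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
        params.append((weights, np.zeros(n_out)))
    return params


def predict_line_count(params, density):
    """Flatten a density map and run it through the FC layers."""
    x = density.reshape(-1)
    for i, (weights, bias) in enumerate(params):
        x = x @ weights + bias
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on the hidden layers only
    return float(x[0])  # a single number: the estimated line count


if __name__ == "__main__":
    density = np.zeros((48, 64))  # stand-in for a predicted density map
    params = init_line_counter(density.size)
    print(predict_line_count(params, density))
```

Feeding the density map rather than the raw image keeps the input small and lets this second network focus on where people are standing instead of what they look like.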

Gathering data for that also ended up being a task in AWS Mechanical Turk.

Here are some examples of where lines aren’t immediately obvious:

Line Not Hot Line Not Hot

Making a product out of data science

This is all good fun working on your development box, but how do you host it? This will be a topic for another blog post, but the short story is:

  1. Make sure it doesn’t look bad! Thanks to the design work done by Steve @
  2. Use Vue.js and D3 to visualize the line count.
  3. Create a Docker image with your static assets and Conda dependencies.
  4. Deploy to GCP with Kubernetes on Google Container Engine.
  5. Periodically run a background job to scrape the Shake Cam image and run a prediction.
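The background job in the last step can be sketched with nothing but the standard library. This is a simplified stand-in, not the deployed setup: `run_periodically`, its parameters, and the idea of passing in `fetch`, `predict`, and `store` callables are all hypothetical, and the real image URL and model call are left for the caller to supply.

```python
import time
import urllib.request


def fetch_image(url, timeout=10):
    """Download the raw bytes of the current webcam frame."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def run_periodically(fetch, predict, store, interval_seconds,
                     iterations=None, sleep=time.sleep):
    """Run fetch -> predict -> store, pausing interval_seconds between runs.

    iterations=None loops forever; a number stops after that many runs.
    The sleep function is injectable so the loop can be tested without waiting.
    """
    done = 0
    while iterations is None or done < iterations:
        try:
            store(predict(fetch()))
        except Exception as error:
            # One failed scrape should not kill the job.
            print(f"scrape failed: {error}")
        done += 1
        if iterations is None or done < iterations:
            sleep(interval_seconds)


if __name__ == "__main__":
    counts = []
    # Dummy fetch and predict stand in for the webcam and the model.
    run_periodically(lambda: b"fake-image", len, counts.append,
                     interval_seconds=0, iterations=2)
    print(counts)
```

In a real deployment, `fetch` would wrap `fetch_image` with the camera URL and `predict` would call the model. A scheduler such as a Kubernetes CronJob could replace the loop entirely.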

Count Alpha

I did the extra credit step of having a Rails application interact with the ML service via gRPC, while integration testing with PyCall. Not necessary, but I’m very happy with the setup.

Unexpected Challenges

The following challenges have contributed to erroneous line predictions:

  1. Umbrellas. Not a head but still a person.
  2. Shadows. Around noon there can be some strong shadows resembling people.
  3. Winter Darkness. It gets much darker much sooner in November and December. Yet the model was trained predominantly with images of people in daylight.
  4. Winter Snow. Training data never had snow, and now we have mistakes like this:

Count Mistaking Snow

As I discover more of these scenarios, I’ll know what data to gather for a model retraining.

Check it out

Feel free to drop a line below if you have any questions.
