Sports Illustrated Cover Image Analysis with AWS Rekognition

I’ve been exploring some of the lesser-known Amazon Web Services (AWS) tools recently, and when I saw the in-console demo of Rekognition, the image labeling/classification service, I knew I needed to find an application for testing.

Then I forgot about it for a few weeks.

And THEN! I came across this archive of all the Sports Illustrated covers in the Sports Illustrated Vault. The rest happened in a whirlwind.

Project Design

Here’s the project plan outline:

  1. Gather all Sports Illustrated cover images and save them to S3.
  2. Write a Python script, using the boto3 AWS SDK, to submit all of the cover images to the Rekognition API and run the following services:
    • Generate a list of labels for each image (as generic as “sports”, but hopefully including specific sport, and maybe some individual item recognition).
    • Analyze the image for faces, including facial characteristics like whether the face is smiling, wearing glasses, or has a mustache/beard.
    • Analyze images for celebrity face matches.
    • Analyze images for suggestive content (yep, think about the SI swimsuit issue – more on that later).
  3. Store the giant mess of response data in a PostgreSQL database.
  4. Write an R script to run some analyses on that giant mess of data.
  5. Generate some cool plots to include in this post.
  6. Build a Shiny app to let people do their own analysis.

Classifying Images

With all the cover images stored in S3, it was really easy to write a Python script with boto3 that would run each image through Rekognition.

Here’s the client setup and a few lines that get a list of objects in the given S3 bucket:

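import boto3

client_s3 = boto3.client('s3')
client_rekognition = boto3.client('rekognition')

bucket_name = 'sports-illustrated-covers'

# list_objects_v2 returns at most 1,000 keys per call,
# so use a paginator to collect every object in the bucket
paginator = client_s3.get_paginator('list_objects_v2')

object_keys = []
for page in paginator.paginate(Bucket = bucket_name):
    object_keys.extend(page.get('Contents', []))

object_count = len(object_keys)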

Then it’s a matter of looping over each object and submitting it to the Rekognition API endpoints. For example, image labeling and face detection:

for object_i in range(object_count):
    object_key = object_keys[object_i]['Key']

    # Both calls below take the same reference to the image in S3
    image = {
        'S3Object': {
            'Bucket': bucket_name,
            'Name'  : object_key
        }
    }

    # Image labels: at most 20 per cover, each with at least 75% confidence
    labels = client_rekognition.detect_labels(
        Image = image,
        MaxLabels = 20,
        MinConfidence = 75)

    for label in labels['Labels']:
        label_confidence = label['Confidence']
        label_name = label['Name']

    # Face detection: 'ALL' requests every facial attribute, not just the default subset
    faces = client_rekognition.detect_faces(
        Image = image,
        Attributes = ['ALL'])

    for face in faces['FaceDetails']:
        face_confidence = face['Confidence']
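
The loop above covers two of the four services in the project plan. The other two (celebrity matches and suggestive content) can be called the same way. Here is a minimal sketch, not the exact script used for this post: the helper names, the tuple layout, and the 75 confidence threshold for moderation labels are my own assumptions, while `recognize_celebrities` and `detect_moderation_labels` are the boto3 Rekognition client methods.

```python
def s3_image(bucket_name, object_key):
    """Build the Image argument shared by the Rekognition calls."""
    return {'S3Object': {'Bucket': bucket_name, 'Name': object_key}}


def celebrity_rows(client_rekognition, bucket_name, object_key):
    """Return one (key, name, match_confidence) tuple per recognized celebrity face."""
    response = client_rekognition.recognize_celebrities(
        Image=s3_image(bucket_name, object_key))
    return [
        (object_key, celeb['Name'], celeb['MatchConfidence'])
        for celeb in response.get('CelebrityFaces', [])
    ]


def moderation_rows(client_rekognition, bucket_name, object_key, min_confidence=75):
    """Return one (key, label, parent_label, confidence) tuple per moderation label.

    min_confidence=75 is an assumed threshold, chosen to mirror the label call.
    """
    response = client_rekognition.detect_moderation_labels(
        Image=s3_image(bucket_name, object_key),
        MinConfidence=min_confidence)
    return [
        (object_key, label['Name'], label.get('ParentName', ''), label['Confidence'])
        for label in response.get('ModerationLabels', [])
    ]
```

Both helpers take the client as an argument, so they can be called inside the same per-object loop and the returned tuples handed to whatever code writes to the database.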

That ran for a while, not that I was in a huge rush. Once the database was ready, though, I dove in to see what Rekognition could detect.

Note: all of what follows is a pretty soft analysis, focused more on trends over time and the relative comparison between different groups than the actual numbers. It’s also a blog-ready version of what could eventually be a much longer piece – I couldn’t answer every interesting question here!

Note, part two: I do not own any of the Sports Illustrated cover images featured in this post, and all credits for those images go to Sports Illustrated. I only included them directly to avoid cross-linking to SI’s page and adding to their bandwidth load.


 

Image Labels

First up: while browsing the early years of cover images, I was surprised by how many don’t have an obvious connection to what we think of as sports these days.

Some covers only feature artwork, a human face with no obvious connection to a sport, or even wildlife:

 

So the first question was whether the frequency of labels like “Sport”, “Sports”, and “Team Sports” increases over time:

sports-labels-over-time
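
For anyone who wants to reproduce this kind of measurement, here is a rough sketch of the calculation in Python (the analysis for this post was done in R, so this is an illustration, not the original code). It assumes a hypothetical `cover_years` mapping from each cover's key to its year and a list of `(key, label)` pairs; neither name comes from the actual database schema.

```python
from collections import defaultdict

SPORTS_LABELS = frozenset({'Sport', 'Sports', 'Team Sports'})


def sports_label_rate_by_year(cover_years, label_rows, sports_labels=SPORTS_LABELS):
    """Share of each year's covers that received at least one sports label.

    cover_years: dict mapping cover key -> year (hypothetical layout)
    label_rows:  iterable of (cover key, label name) pairs
    """
    covers_per_year = defaultdict(int)
    for year in cover_years.values():
        covers_per_year[year] += 1

    # A cover counts once, no matter how many of the sports labels it received.
    sports_covers = {
        key for key, name in label_rows
        if name in sports_labels and key in cover_years
    }

    sports_per_year = defaultdict(int)
    for key in sports_covers:
        sports_per_year[cover_years[key]] += 1

    return {
        year: sports_per_year[year] / total
        for year, total in sorted(covers_per_year.items())
    }
```

Dividing by all covers in a year, not only the covers that received some label, keeps years with many unlabeled covers from looking artificially sporty.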

The rate did increase over time…for a while. And then there’s a dip – what happened between 1980 and 2005? Was it truly a decline in sports showing up on the cover of SI, or is Rekognition having more difficulty detecting sports?

My own theory is that covers became busier, less like photographs of a single subject, and more like advertisements with big text blocks, multiple inset photos, and sharp shapes. This could make it less obvious to an image recognition system that the main topic is sports.

The labels “Poster”, “Brochure”, “Flyer”, and “Magazine” are four of the fifteen most common labels returned by Rekognition for the cover images. Let’s overlay the frequency of those labels and see if they account for the drop in assignment of sports-focused labels:

sports-labels-vs-advertising-labels

Nope! Back to the drawing board on that theory.

What about individual sports? Let’s focus on the big four American sports first (baseball, basketball, football, and hockey):

major-sports-over-time

I’m surprised by how jumpy this graph is – hockey seems most steady but still goes off and on at regular intervals. But also:

  1. What made baseball shoot up in 2007, Barry Bonds? (pun intended…)
  2. Why do basketball and baseball covers drop down to near 0 around the year 2000? Is it truly fewer cover features, or did Rekognition fail to label them correctly?
  3. Does football sell more magazines? Wouldn’t surprise me.

Next, if we condense cover dates into a time of year, it seems reasonable to expect covers to feature sports that are in season, right? As expected, we see more baseball covers in summer (rising into fall for the World Series), football in fall (rising into winter for the Super Bowl), and basketball and hockey in late winter and early spring, as their regular seasons end and playoffs draw attention:

major-sports-seasons
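
Condensing cover dates into a time of year is simple if the date can be read from the image file name. The sketch below assumes that the eight-digit run in names like `Sports_Illustrated_726179_20060213-001-775` is the issue date in YYYYMMDD form; that is my reading of the file names, not something documented by SI, and the function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: the issue date is the 8-digit run between "_" and "-" in the key.
DATE_PATTERN = re.compile(r'_(\d{8})-')


def cover_date(object_key):
    """Return the cover date parsed from an object key, or None if absent or invalid."""
    match = DATE_PATTERN.search(object_key)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), '%Y%m%d').date()
    except ValueError:
        return None


def covers_by_month(object_keys):
    """Count covers per calendar month (1-12), skipping keys without a usable date."""
    counts = Counter()
    for key in object_keys:
        date = cover_date(key)
        if date is not None:
            counts[date.month] += 1
    return counts
```

Passing in only the keys that received a given label (say, "Baseball") gives that sport's count for each month, which can then be divided by the total number of covers in that month.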

I was surprised by how far baseball extends in this view of the data – every other sport has a period where there is essentially no hope of making the cover, but baseball manages to show up even in the middle of winter.

I thought we should at least mention less popular sports (in the US at least), so here’s a similar plot over time of covers labeled “Soccer”, “Tennis”, “Swimming”, or “Cyclist”:

minor-sports-over-time

I don’t see much here, but it’s obvious that there’s a Lance Armstrong effect in the 2000s, and surprising that there aren’t more Michael Phelps covers in 2008 and 2012 – turns out he’s been on the cover a lot (see next section), but isn’t actually swimming in most of those photos.


Celebrity Face Recognition

Rekognition can also pick out celebrity faces, so the first test is: who are the most-recognized celebrity faces on covers?

Note: this is not an official count, just the counts based on what Rekognition can detect – this could be more like a ranking of whose face is most easily detectable.

  1. LeBron James (15)
  2. Larry Bird (14)
  3. Magic Johnson (9)
  4. Tiger Woods (8)
  5. Muhammad Ali (7)
  6. Michael Phelps (7)
  7. Tied at 5: Shaquille O’Neal, Pete Rose, Patrick Ewing, Yao Ming, Mark McGwire, Sugar Ray Leonard, Johnny Damon (!!!)

To give you an idea of how rare it is to be featured on multiple covers, let alone five or more:

celebrity-cover-appearances

But some faces are harder to distinguish, especially if a football facemask or hockey helmet is in the way, or the photo captures a strained facial expression, as in pitching or swinging a bat.

For example, Hines Ward’s face wasn’t detected here (and note that Johnny Damon only makes our list above because his face is featured twice on this cover, showing before and after he cut his giant mane and beard).

Sports_Illustrated_726179_20060213-001-775
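
The Johnny Damon case shows that the ranking depends on whether you count detected faces or distinct covers. Here is a small sketch of both counts, assuming rows whose first two fields are the cover key and the celebrity name (a hypothetical layout, not the actual table):

```python
from collections import Counter


def celebrity_counts(rows, per_cover=False):
    """Count celebrity detections.

    rows: iterable of tuples whose first two fields are (cover key, celebrity name),
          one row per detected face.
    per_cover: if True, a celebrity is counted at most once per cover.
    """
    pairs = [(row[0], row[1]) for row in rows]
    if per_cover:
        # Collapse repeated detections of the same person on the same cover.
        pairs = set(pairs)
    return Counter(name for _key, name in pairs)
```

The list above counts faces, so a cover with two photos of the same player adds two to that player's total; with `per_cover=True` it would add one.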

 

So obviously LeBron James, Larry Bird, and Magic Johnson have earned their 38 combined facial detections, but I’d argue there’s an advantage to playing a sport where you can be photographed close-up during the game, AND your face isn’t obscured by a helmet or facemask.

As another example, even someone as recognizable as Ken Griffey, Jr. got missed in favor of two background faces, which were incorrect (see image below).

bad-celebrity-match

Since faces can be detected anywhere in the images, it wasn’t fair to run an analysis of facial detection rate against sport – as we saw in the Hines Ward cover, the main sport is obviously football, but the two faces detected were the same baseball player featured twice.

But I thought it would be interesting to see if celebrity faces were recognized at an overall increasing rate over time:

celebrities-over-time

Indeed they are, and I can think of several explanations:

  1. More frequent feature of celebrities (as opposed to the nature/outdoors covers shown earlier).
  2. Better close-up photography resolution.
  3. More recent celebrities should have larger sets of sample facial photos that Rekognition can use as training data – detecting them should be easier than a golfer from 1959 with fewer photos available.

What’s the story behind the issue in 1987 with 17 celebrity faces? It’s an edition where SI analyzed Major League Baseball salaries, and featured a bunch of players on the cover:

Sports_Illustrated_702433_19870420-001-775

Suggestive Content Detection

I mentioned it briefly above: Rekognition has a service that detects adult/suggestive images.

How well would it pick out the Sports Illustrated swimsuit covers?

I’m actually not going to include any examples here, because it could get this page/site flagged as one featuring adult content, but if you insist, you’ve already got a link to the covers.

And would Rekognition flag anything else a little too suggestive, like, say, shirtless and underwater Cal Ripken, Jr.?

Sports_Illustrated_702934_19950807-001-775

So who are the celebrities tagged in images that are labeled suggestive? Some might surprise you:

Daniela Peštová, Muhammad Ali, Nomar Garciaparra (!), Edith Södergran, Pelé, Jerzy Gorgoń, Kathy Ireland, Larry Bird (!), Tyler Hansbrough (!!!)

Forget the swimsuit models – why is Larry Bird on a suggestive cover, and even Pelé?

Rekognition is pretty conservative here: if you pass it a bare chest (Muhammad Ali), or Nomar Garciaparra in the cover shown below, it’s going to say this is suggestive content, even though to an average viewer, this probably isn’t.

Sports_Illustrated_702794_20010305-001-775
Weird.

But if you were using this on a website to detect images that shouldn’t be shown to kids, I suppose that conservative approach is important.

If we aggregate flagged cover dates by time of year, there’s a high density in January and February, which is when the swimsuit issue is released each year:

moderation-label-time-of-year

So it’s almost certainly picking up on swimsuit issues, but there are other covers too that are making this picture a bit noisier overall.

Facial Traits

How common have human faces (not limited to celebrity faces) been on Sports Illustrated covers?

faces-over-time

Two covers with 100 faces? Seriously? Let’s find those covers and see if Rekognition actually got those right:

The first two covers shown above had 100 faces detected (actually more, but 100 is the limit per API call), and the third had 82 faces detected. I was pretty sure this would reveal some error in the detection algorithm, but this looks right to me.

It’s interesting that ensemble-style covers with 50+ faces never occurred until around 2004, but there have been 11 since then.

So what’s the trend like for female faces being included on SI covers? Again, this isn’t an official count, just a smoothed trend line for the mean confidence of female faces appearing:

female-faces-over-time

I doubt we’d expect 50%, even if some would say that’s ideal, but there is an upward tick in the last 10-15 years.

More features on the cover means more attention to women’s sports, right?

Well, there was criticism that a non-swimsuit-edition cover in 2000 featuring Anna Kournikova really had no obvious connection to sports, and a 2005 cover featuring Jennie Finch was only slightly more focused on the fact that she’s an actual athlete:

 

So if we look at the covers where Rekognition returned a strong confidence in one gender or the other, how often would those labeled “Female” also be labeled “Sports”?

For comparison, the same measurement for “Male” is shown on the graph below:

female-male-faces-sports
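
A sketch of how this measurement could be computed for one set of covers (for example, a single year's worth). The 90 confidence cutoff for "strong confidence" and the row layouts are assumptions on my part; `'Female'` and `'Male'` are the gender values Rekognition returns in its face details.

```python
def sports_share_by_gender(face_rows, label_rows,
                           min_confidence=90.0, sports_labels=('Sports',)):
    """Share of covers with a confidently gendered face that also carry a sports label.

    face_rows:  iterable of (cover key, gender value, gender confidence), one per face
    label_rows: iterable of (cover key, label name)
    min_confidence: assumed cutoff for a "strong" gender confidence
    Returns {'Female': share or None, 'Male': share or None}.
    """
    sports_covers = {key for key, name in label_rows if name in sports_labels}

    covers_by_gender = {'Female': set(), 'Male': set()}
    for key, gender, confidence in face_rows:
        if confidence >= min_confidence and gender in covers_by_gender:
            covers_by_gender[gender].add(key)

    return {
        gender: (len(covers & sports_covers) / len(covers) if covers else None)
        for gender, covers in covers_by_gender.items()
    }
```

A cover with both a confident female face and a confident male face counts toward both groups, which seems reasonable for a comparison like this but is a choice worth stating.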

This was surprising to me – there really isn’t much difference, and both lines follow the trend from the first graph in this post: first a sharp increase in “Sports” labels, then a decline from 1980 until around 2000, and a sharp increase since.

Again, this doesn’t prove that covers from 1980-2000ish were not as often about sports, just that Rekognition didn’t label them sports-related as often.

 

Next Steps

That’s all for this post – I think there are several follow-up questions, and I also want to look at some of the more common image labels that aren’t sports-related (for example, animals/outdoors/military themes), and run some graphs on things like eyeglasses, sunglasses, beards, and mustaches over time.

 


Lastly, if you want the data set, send a message via the contact form – I don’t want to make it publicly available, but am willing to share for research purposes.
