Sports Illustrated Cover Image Analysis with AWS Rekognition

I’ve been exploring some of the lesser-known Amazon Web Services (AWS) tools recently, and when I saw the in-console demo of Rekognition, the image labeling/classification service, I knew I needed to find an application for testing.

Then I forgot about it for a few weeks.

And THEN! I came across this archive of all the Sports Illustrated covers in the Sports Illustrated Vault. The rest happened in a whirlwind.

Continue reading “Sports Illustrated Cover Image Analysis with AWS Rekognition”

Order Dependence in as.Date

I was working on a task a few days ago that required converting a lot of timestamps into dates (not times, just dates). Thankfully, as.Date seemed to be working fine as a default approach, but then an error showed up:

Error in charToDate(x) : 
  character string is not in a standard unambiguous format

Fair enough, nobody expects to get through parsing dates without a few hangups.

So I focused in on the vector that was causing the trouble, and didn’t see anything obviously wrong – those timestamps looked just like the other columns of timestamps that had worked thus far.

There were some blanks, but there were blanks in the other columns too, and besides, as.Date("") is just NA, right?

It depends!

Continue reading “Order Dependence in as.Date”

Base R When Possible, Packrat When Not

I gave a talk at Monday night’s Winston-Salem R Users Group that covered a lot of the base R package, and also showed a brief demo of how to use packrat when packages are necessary and the code has to be shared across multiple team members and/or environments.

The idea came from a comment at a previous meeting about the dangers of trying to maintain common versions of code within and across teams, not just to avoid surprising errors, but also to ensure reproducibility.

So my recommended approach was:

  1. Learn as much as you can about the details of the base package. It’s a huge package, and a lot of common needs can be handled simply and effectively.
  2. When you need a package (and there are certainly useful and necessary packages), use a system like packrat to keep dependencies systematically managed.

Most of the content wouldn’t be a surprise to daily R users, but I did throw in some things that either 1) surprised me when I first learned them, or 2) increased my productivity so much that I think everyone should know them.

Continue reading “Base R When Possible, Packrat When Not”

Growth and Development in Wake County, NC

I grew up in Raleigh, North Carolina, the state capital and seat of Wake County, which contains several other municipalities.

When I was in middle school in the year 2000, I would have laughed if you had told me any of the following:

  1. The county’s population would grow by 43.5%, from 627k to 900k, in the next ten years.
  2. The county’s population would actually reach 1 million in the next fifteen years.
  3. Cary would become the seventh-most-populous municipality in North Carolina.
  4. Holly Springs and Fuquay-Varina would be considered reasonable suburbs of Raleigh.
  5. By 2017 Wake County would have 172 public schools (27 high schools) for 155,000 students.

Well, here we are. Wake County and Raleigh have made a million of those “best places to live/work/retire” lists since the year 2000, and the county has seen growth that puts it in the top ten or fifteen fastest-growing counties in the country, depending on how you measure.

So what does that rate of development look like on a map of land parcels? Good question; keep reading for more…

Continue reading “Growth and Development in Wake County, NC”

Query Cache Hashing in SQL Server

I attended a SQL Server user group about two weeks ago, and the topic was query optimization, presented by Kevin Kline of SentryOne.

I found the subject pretty interesting – there are a lot of tricks to SQL Server queries that can really impact how quickly queries return.

I want to follow up on some of the very subtle cases he demonstrated, but one thing that really caught my eye was his explanation that query execution plans are cached using a hash of the query text.

No problem, right? Actually, the following queries are all hashed to different keys, meaning that despite returning the exact same results, they wouldn’t share the same cached plan.

/* All of the following four queries will be cached separately! */

SELECT * FROM table1;
select * from table1;
SELECT *  FROM table1;
select *
  from table1;

So if you want to take advantage of caching, consistent style really matters!

But why is this? Why isn’t SQL Server upcasing/lowcasing queries, stripping returns, and collapsing sequential whitespace before calculating the hash that will be the cache key?

Seems easy enough – does anyone know why they don’t?
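
To make the question concrete, here is a minimal Python sketch of the normalization described above. It is not how SQL Server computes its cache keys; the SHA-256 hash and the helper names cache_key and normalize are assumptions made up for illustration. It only shows that the four query strings differ byte for byte, and that a naive normalization collapses them to one key.

```python
import hashlib
import re

# The four query strings from the example above.
queries = [
    "SELECT * FROM table1;",
    "select * from table1;",
    "SELECT *  FROM table1;",
    "select *\n  from table1;",
]


def cache_key(query_text: str) -> str:
    """Hash the raw query text (hypothetical stand-in for a real cache key)."""
    return hashlib.sha256(query_text.encode("utf-8")).hexdigest()


def normalize(query_text: str) -> str:
    """Naive normalization: collapse whitespace runs and upper-case everything."""
    return re.sub(r"\s+", " ", query_text).strip().upper()


raw_keys = {cache_key(q) for q in queries}
normalized_keys = {cache_key(normalize(q)) for q in queries}

print(len(raw_keys))         # 4 distinct keys for the raw text
print(len(normalized_keys))  # 1 key after normalization

# One possible complication: upper-casing also rewrites string literals,
# so two queries that may mean different things end up with the same key.
a = normalize("SELECT * FROM t WHERE name = 'Bob';")
b = normalize("SELECT * FROM t WHERE name = 'BOB';")
print(a == b)  # True
```

The last check hints that a safe normalizer would have to understand the query well enough to leave literals and quoted identifiers alone, which is more work than a regex pass. Whether that is the actual reason is still an open question.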

Implementing SHA-1 in Python

Earlier this year researchers at Google were able to generate two PDFs with the same SHA-1 digest, and the world became reasonably worried about the security of the hashing algorithm.

So even though I’ll likely never be using SHA-1 in the future (and more importantly, would never use my own implementation in a real-world project), I thought I’d sit down with the spec and see if I could implement it in Python, which I haven’t been using as much as I want to lately.

Thankfully NIST also provides a short example case to check against.
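
As a reference point, here is a small check against Python's built-in hashlib. It assumes the example case is the well-known one-block message "abc" from the FIPS 180 examples; the text above does not say which example is meant, so treat the choice of message as an assumption.

```python
import hashlib

# Assumed test vector: the one-block message "abc" and its published digest.
message = b"abc"
expected = "a9993e364706816aba3e25717850c26c9cd0d89d"

# hashlib's SHA-1 serves as the reference a hand-written version should match.
digest = hashlib.sha1(message).hexdigest()
print(digest == expected)
```

A hand-written implementation can be compared against hashlib.sha1 in the same way for any other input.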

So let’s begin!

Continue reading “Implementing SHA-1 in Python”

Copying R Environments

I’ve been working on a codebase that relied on storing a lot of objects in R environments, mainly because of the potential speed improvements with large numbers of objects.

See this article for a pretty good explanation.

After a recent spec change, I needed to start looping around a block of code that was previously using a single environment to store objects. The easiest approach was to create an initial base environment to use at the start of each iteration of the loop, and then create a copy of that environment that would be specific to the loop iteration.

But I got a surprise!

All of the result environments looked like the last iteration.

First, let’s look at how this works when you start with a base list, make a copy, and modify the copy:

Continue reading “Copying R Environments”