Solving the SQL Murder Mystery

Posted on

I saw this SQL Murder Mystery appear on Hacker News recently, thought it sounded fun, and figured I’d do a quick write-up of how I worked through it.

If you want to follow along, go ahead and download the SQLite database (which is copyright NUKnightLab and redistributed here under the MIT license). You’ll need some kind of SQLite client to interact with it (I just used the sqlite3 CLI tool).

In addition to the database, it’s very helpful to start with a prompt:

A crime has taken place and the detective needs your help. The detective gave you the crime scene report, but you somehow lost it. You vaguely remember that the crime was a murder that occurred sometime on Jan. 15, 2018 and that it took place in SQL City. Start by retrieving the corresponding crime scene report from the police department’s database. If you want to get the most out of this mystery, try to work through it only using your SQL environment and refrain from using a notepad.

Let’s start by seeing what tables are available. The sqlite3 CLI uses meta-commands that start with a dot, like this:

sqlite> .tables
crime_scene_report      get_fit_now_check_in    interview
drivers_license         get_fit_now_member      person
facebook_event_checkin  income                  solution

Okay, let’s start with finding our crime scene report. First, we’ll need to know what the data looks like. We can learn about this with the .schema command:

sqlite> .schema crime_scene_report
CREATE TABLE crime_scene_report (
        date integer,
        type text,
        description text,
        city text
    );

Okay, seems pretty straightforward. The only thing I’m not quite sure about is how the date is being represented -- it’s just stored as an integer. A UNIX timestamp perhaps? Let’s sample the data:

sqlite> select date from crime_scene_report limit 5;
date
20180115
20180115
20180115
20180215
20180215

Okay, seems it’s just being stored as YYYYMMDD. Let’s take a crack at finding the crime scene report! We know the type (murder) and the city (SQL City). Let’s be generous with the date and assume it was sometime in January of 2018:

sqlite> select * from crime_scene_report
   ...> where type = 'murder'
   ...> and city = 'SQL City'
   ...> and date between 20180101 and 20180131;
date type description city
20180115 murder Security footage shows that there were 2 witnesses. The first witness lives at the last house on "Northwestern Dr". The second witness, named Annabel, lives somewhere on "Franklin Ave". SQL City

Great, there’s only one row that matches our broad date criteria! Let’s see if we can track down these witnesses. First, let’s see how the data we need is structured:

sqlite> .schema person
CREATE TABLE person (
        id integer PRIMARY KEY,
        name text,
        license_id integer,
        address_number integer,
        address_street_name text,
        ssn integer,
        FOREIGN KEY (license_id) REFERENCES drivers_license(id)
    );
sqlite> .schema interview
CREATE TABLE interview (
        person_id integer,
        transcript text,
        FOREIGN KEY (person_id) REFERENCES person(id)
    );

Okay, so we need to find the two rows in the person table, and then use their ids to cross reference their interview text. This is “the big idea” with relational databases, joining data in several tables based on something they have in common.

We’ll start with the witness who lives on Northwestern Drive. We know that they live in “the last house,” which presumably has the highest house number on that street. We can easily find this by first filtering for only people who live on Northwestern Drive, then ordering those results by house number in descending order, and only showing the first result:

sqlite> select * from person
   ...> where address_street_name = 'Northwestern Dr'
   ...> order by address_number desc
   ...> limit 1;
id name license_id address_number address_street_name ssn
14887 Morty Schapiro 118009 4919 Northwestern Dr 111564949

Great! Now let’s find Annabel. We can use SQL’s LIKE operator to match a partial name, along with the name of their street:

sqlite> select * from person
   ...> where name like 'Annabel%'
   ...> and address_street_name = 'Franklin Ave';
id name license_id address_number address_street_name ssn
16371 Annabel Miller 490173 103 Franklin Ave 318771143

Okay, so we’ve got our person IDs: 14887 and 16371. I think we’re going to want these IDs in a bunch of upcoming queries, so let’s help our future selves out by saving their IDs as parameters (a sort of temporary variable):

sqlite> .parameter set $MORTY 14887
sqlite> .parameter set $ANNABEL 16371

Let’s grab their interviews. To do this, we’ll put joins to use for the first time so we can show their name rather than just their person ID. We’re selecting records from the interview table, but joining matching records from the person table, using the person_id column to match up the people.

sqlite> select person.name, interview.transcript
   ...> from interview
   ...> join person on person.id = interview.person_id
   ...> where person_id in ($MORTY, $ANNABEL);
name transcript
Morty Schapiro I heard a gunshot and then saw a man run out. He had a "Get Fit Now Gym" bag. The membership number on the bag started with "48Z". Only gold members have those bags. The man got into a car with a plate that included "H42W".
Annabel Miller I saw the murder happen, and I recognized the killer from my gym when I was working out last week on January the 9th.

Okay, we’ve got tons of info now! Since the car and bag might not belong to the killer, I think our best lead for narrowing things down is to see all the people who crossed paths with Annabel at the gym on January 9th, 2018. Let’s see what those tables look like:

sqlite> .schema get_fit_now_check_in
CREATE TABLE get_fit_now_check_in (
        membership_id text,
        check_in_date integer,
        check_in_time integer,
        check_out_time integer,
        FOREIGN KEY (membership_id) REFERENCES get_fit_now_member(id)
    );
sqlite> .schema get_fit_now_member
CREATE TABLE get_fit_now_member (
        id text PRIMARY KEY,
        person_id integer,
        name text,
        membership_start_date integer,
        membership_status text,
        FOREIGN KEY (person_id) REFERENCES person(id)
    );

Alright, time to look for some check-ins! We could do this in two separate queries, one to find Annabel’s Get Fit Now member ID by using her person_id, and a second query to find her check-ins using her membership_id, but we can also use a sub-query to do this in one shot:

sqlite> select check_in_time, check_out_time
   ...> from get_fit_now_check_in
   ...> where date = 20180109
   ...> and membership_id = (
   ...>   select id
   ...>   from get_fit_now_member
   ...>   where person_id = $ANNABEL);
check_in_time check_out_time
1600 1700

Looks like Annabel was at the gym from 4pm to 5pm on the 9th. Since we’re looking for someone who overlapped with Annabel at the gym, we’re looking for someone who arrived before 5pm and left after 4pm. Again, we’ll join some tables together here so we can grab their names and person IDs right away, not just their membership numbers:

sqlite> select person.id, person.name, get_fit_now_member.id,
   ...>   get_fit_now_check_in.check_in_time,
   ...>   get_fit_now_check_in.check_out_time
   ...> from get_fit_now_check_in
   ...> join get_fit_now_member on get_fit_now_member.id = membership_id
   ...> join person on person.id = person_id
   ...> where check_in_date = 20180109
   ...> and check_in_time <= 1700 and check_out_time >= 1600;
id name id check_in_time check_out_time
28819 Joe Germuska 48Z7A 1600 1730
67318 Jeremy Bowers 48Z55 1530 1700
16371 Annabel Miller 90081 1600 1700

Interesting, there were only two other gym members who were checked in for a period overlapping with Annabel on the 9th. Let’s save their IDs as well:

sqlite> .parameter set $JOE 28819
sqlite> .parameter set $JEREMY 67318

Their member numbers both start with 48Z; let’s take a look at their vehicles, presumably in the drivers_license table:

sqlite> .schema drivers_license
CREATE TABLE drivers_license (
        id integer PRIMARY KEY,
        age integer,
        height integer,
        eye_color text,
        hair_color text,
        gender text,
        plate_number text,
        car_make text,
        car_model text
    );

sqlite> select person.id, person.name, drivers_license.*
   ...> from person
   ...> join drivers_license on drivers_license.id = person.license_id
   ...> where person.id in ($JOE, $JEREMY);
id name id age height eye_color hair_color gender plate_number car_make car_model
67318 Jeremy Bowers 423327 30 70 brown brown male 0H42W2 Chevrolet Spark LS

So only Jeremy Bowers has a drivers license. And his car’s license plate does contain H42W, so it looks like we’ve found the killer! According to the instructions in the GitHub repository, we should insert our answer into the solution table, then query it:

sqlite> insert into solution values (1, 'Jeremy Bowers');
sqlite> select value from solution;
value
Congrats, you found the murderer! But wait, there's more... If you think you're up for a challenge, try querying the interview transcript of the murderer to find the real villian behind this crime. If you feel especially confident in your SQL skills, try to complete this final step with no more than 2 queries.

Aha! We did correctly identify Jeremy Bowers. Let’s see if we can connect the dots to find the mastermind! First, we’ll grab the killer’s (Jeremy’s) interview transcript:

sqlite> select transcript from interview where person_id = $JEREMY;
transcript
I was hired by a woman with a lot of money. I don't know her name but I know she's around 5'5" (65") or 5'7" (67"). She has red hair and she drives a Tesla Model S. I know that she attended the SQL Symphony Concert 3 times in December 2017.

Alright, there goes one query... one more to make it count! We’re going to be correlating data from a bunch of tables here: person, income (related by SSN, probably as a sort criterion since we don’t have an exact figure to work with), we can grab height, hair color, gender, and car make/model from the drivers licenses. It’s a bit of a risk to filter by Facebook checkins to the SQL Symphony, since we don’t know that she checked in at all, but maybe we can include the count of the number of times there was a check-in at the symphony during December. Let’s get a reminder of what these tables look like:

sqlite> .schema person
CREATE TABLE person (
        id integer PRIMARY KEY,
        name text,
        license_id integer,
        address_number integer,
        address_street_name text,
        ssn integer,
        FOREIGN KEY (license_id) REFERENCES drivers_license(id)
    );
sqlite> .schema income
CREATE TABLE income (
        ssn integer PRIMARY KEY,
        annual_income integer
    );
sqlite> .schema facebook_event_checkin
CREATE TABLE facebook_event_checkin (
        person_id integer,
        event_id integer,
        event_name text,
        date integer,
        FOREIGN KEY (person_id) REFERENCES person(id)
    );
sqlite> .schema drivers_license
CREATE TABLE drivers_license (
        id integer PRIMARY KEY,
        age integer,
        height integer,
        eye_color text,
        hair_color text,
        gender text,
        plate_number text,
        car_make text,
        car_model text
    );

And assemble our final mega-query!

sqlite> select p.id, p.name, i.annual_income, dl.height, dl.hair_color,
   ...> dl.gender, dl.car_make, dl.car_model, (
   ...>   select count(*)
   ...>   from facebook_event_checkin
   ...>   where person_id = p.id
   ...>   and event_name like '%symphony%'
   ...>   and date between 20171201 and 20171231) as num_symphonies
   ...> from person p
   ...> join income i on i.ssn = p.ssn
   ...> join drivers_license dl on dl.id = p.license_id
   ...> where dl.height between 64 and 68
   ...> and dl.hair_color like '%red%'
   ...> and car_make like '%tesla%'
   ...> and car_model like '%s%'
   ...> order by i.annual_income desc;
id name annual_income height hair_color gender car_make car_model num_symphonies
99716 Miranda Priestly 310000 66 red female Tesla Model S 3
78881 Red Korb 278000 65 red female Tesla Model S 0

Okay, so we actually got two results for red-haired people around 66" tall who make a lot of money and drive Tesla Model S’s. However, one of them attended the symphony three times in December (and makes even more money), so I think we’ve found the mastermind!

I didn’t include gender in the filter as I wasn’t sure how the data looked, and I technically would’ve needed an additional query to discover that.

sqlite> insert into solution values (1, 'Miranda Priestly');
sqlite> select value from solution;
value
Congrats, you found the brains behind the murder! Everyone in SQL City hails you as the greatest SQL detective of all time. Time to break out the champagne!

Hooray! I had a lot of fun playing through this, and would love to do another similar puzzle again sometime.


More in SQL: