Building a Retrosheet Database – Part 2

Welcome back to TechGraphs’ series on building your own Retrosheet database. If you haven’t already, go back and read part one. If you are interested in actually following along on your own system, you must do some work up front, which all mentioned in that article. If you’ve made it past that point, you’re ready to actually load some baseball data.

Before I get too far into things, I should mention that much of what I will be presenting is an updated version of Colin Wyers’ article at our sister site The Hardball Times. That article is how I first learned to do this install, so many thanks are owed to him for laying down the groundwork.

(Note: a lot of these processes will take a good deal of computing resources and time. Be prepared to wait a good deal for these to complete. I would recommend doing this during a time you won’t need to do much else on your computer.)

The first step is to download the .zip file that will contain all the batch files and folder structure you will need. There are two options; one for 32-bit machines and one for 64-bit machines. The files are essentially the same, but the file paths in the scripts vary slightly. Check back with part one if you don’t know if you have a 32- or 64-bit machine. There are instructions on finding out toward the bottom of the article.

Download the proper file and extract it into a new folder on the C:\ drive called Retrosheet, so the result looks like C:\Retrosheet. Open the folder \data\zipped, and run the file get_zip_files batch file. (Note: if you don’t know, batch files are basically strings of code that tell your computer to run multiple processes all at once, saving you a whole lot of clicking.) When you run the get_zip_files file, you’ll notice a lot of scrolling text in the Command Prompt. This is a good thing. Basically, your computer is downloading all the game files from Retrosheet and unzipping them into usable files. If you back up to the \data\unzipped folder, you’ll notice all the .EVA and .EVN files that were just downloaded. This is the format Retrosheet used for their files. We’ll be using a program called Chadwick to turn these files into MySQL data. Also in that folder are three batch files — $cevent, $cwgame, and $cwsub. Running these files will parse the data through Chadwick and place the resulting .csv files in the \data\parsed directory. Run each script one time. Be patient, these processes can take some time. You can minimize the Command Prompt window, but DO NOT CLOSE IT. Closing the window ends the process, and you’ll have to clear out the parsed directory and start over again. When the window closes, the process is done and you can run the other two processes.

Now, it’s time to get our hands a little dirty with MySQL. Open the MySQL Workbench that we installed in part one. There should only be one connection to choose from at this time.

part2snip1

Double-click and enter the root password you created to open the workbench. The first thing we need to do is create the proper schema. A SQL schema is basically the bones of the database. It creates the tables and the fields that go into those tables. It doesn’t load the actual data, but it gets everything set up to load the data in. To create the schema, click File > Open SQL Script. Navigate out to the Retrosheet directory and open the “retrosheet table schema.sql” file. It will load the SQL commands into the workbench. In the toolbar right above all the new text, you’ll see a lightning bolt. This button runs whatever is in the editor window — in this case, our schema creation. Click the lightning bolt and let the SQL commands do their magic. When it’s all done, you’ll have to refresh your schema list to see the new Retrosheet tables. One the left-hand window, next to the word Schema, you’ll see a little refresh button. Click it, and the Retrosheet schema should now be visible.

part2snip2

Now, it’s time to fill those schemas. Go to File > Open SQL Script again. This time, navigate to Retrosheet\loaders. These files tell MySQL to fill those tables with the data we extracted in the previous steps. Choose events.sql, and you’ll see the SQL statement load into the workbench. Click the executing lightning bolt again and wait. Then, wait some more. You’ll have to wait for a while.

part2snip3

See that little circle in the tab above your SQL statement? That means a process is running. Wait until that disappears before you run the games.sql and subs.sql files.

Once the other .sql files are run, you’re all done. You’ve done it, you’ve created your very own Retrosheet database.

What exactly can you do with it? Well, I think we’ll save that for the next installment. I’ll go over some basic queries and show you just what your shiney new database can do. Until then, take your time, be confident, and don’t be afraid to sound off in the comments if you have any questions.

 





David G. Temple is the Managing Editor of TechGraphs and a contributor to FanGraphs, NotGraphs and The Hardball Times. He hosts the award-eligible podcast Stealing Home. Dayn Perry once called him a "Bible Made of Lasers." Follow him on Twitter @davidgtemple.

18 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Marciano
9 years ago

So I’ve run into a problem: $cevent, $cwgame, and $cwsub don’t move anything into the parsed folder for me. Also, I had to extract the zipped Retrosheet files from \data\zipped to \data\unzipped myself, the get_zip_files batch didn’t do that on its own. Any idea what might be going wrong? It’s frustrating to get snagged at the very end here.

Marciano
9 years ago
Reply to  Marciano

You know what? I spoke too soon. Thanks for this very helpful guide.

Matt Mitchell
9 years ago

I ran the x64 version of the get_zip_files.bat script, but it only downloaded the .zip files. It did not decompress them.

Matt Mitchell
9 years ago
Reply to  Matt Mitchell

My apologies. Turns out it was a issue with my specific system setup.

For those who encounter the same problem, right click on get_zip_files.bat and select “Edit”. Check to make sure that the directories for wget and 7-zip match your system setup.

tz
9 years ago

David – this series has been great. You explained it well enough that even I could follow it.

Can’t wait for Part 3!

Mark Penman
9 years ago

Is there anyone who is using Windows Vista like the primitive neanderthal that I am?

I can install MySQL Workbench version 6.1 on vista but do I have to separately install a separate Vista-compatible MySQL server to make things work properly?

thanks

Jim
9 years ago

I knew I was too stupid to do this. I cannot even find the folder \data\zipped.

Typical of computer types, leaving out important steps.

Thanks, I guess.
Jim

Kevin A
9 years ago

Thanks so much for this tutorial! It’s been so helpful. I have almost completed it, but for some reason, when I run the $cwevent, $cwgame and $cwsub batch files, the command window opens, some stuff appears to happen in the window, and then it closes, and no data has been loaded into the “parsed” file. This leads to an error in workbench, as expected. What did I do wrong?

Thanks so much for your help.

Eric Shreve
9 years ago

The scripts that are downloaded in the “loaders” folder stop at 2009. If you want data for 2010-2014 you will have to extend the scripts to include those years.

Bart
9 years ago
Reply to  Eric Shreve

How do you do that?

Adam
9 years ago

Running into a hangup that was mentioned above….solution? Novice here, thanks in advance.

$cevent, $cwgame, and $cwsub don’t move anything into the parsed folder

Compton
9 years ago

For those stuck on the $cwevent, $cwgame and $cwsub step–where it will bring up the command prompt but not actually process anything– try the following:

When you are unzipping the retrosheet download onto your C:\ drive, unzip the contents of *only* the retrosheet(x64) file and not both the _MACOSX and retrosheet(x64) files. The result should be something that looks like C:\retrosheet\common, C:\retrosheet\data, C:\retrosheet\loaders, etc.

You can then go click on the get_zip_files .bat file and let it run. The important thing to check after this is completed is that the .EVA and .EVN files are in the C:\retrosheet\data\unzipped directory. If you do not have the contents of the retrosheet download in the proper path, then those .EVA and .EVN files will not unzip to the directory that the $cwevent, $cwgame and $cwsub .bat files were written for, and those three bat files will not run if they can’t find the .EVA and .EVN files.

Hope this helps!

Jacob
9 years ago

When will the next part of this series be released? Not tech savvy enough (yet) to query this data. Thanks!

Bill
9 years ago

Why aren’t event logs from 1932 – 1954 downloaded? Could they just be added, or is something problematic in terms of the stats tracked in earlier years?

John
8 years ago

For some reason when I run my $cevent, $cwgame, and $cwsub files, they take like half a second and don’t end up in the parsed folder. A similar thing happens when I unzip the zipped files. They don’t end up in the correct unzipped file, instead they create a new unzipped file outside the retrosheet(64x) file. In this case I just copy and paste them into the correct folder, but when it happens to the $cevent, $cwgame, and $cwsub files, I can’t find them, and I don’t even know if they parsed at all. Then, when I run the files on MySQL, I get an error message when I try to run the events file, because it cannot run the parsed file. Did this happen to anyone else or does anyone know how to fix it?
Thanks,
John

John
8 years ago
Reply to  John

Oh, Thanks Compton.
I didn’t see your solution.
The parsing worked this time!
One more question, when I was opening the events file in MySQL, instead of having green check marks in the bottom screen it said something like truncating column UNCERTAIN_PLAY_EXC_FL, with yellow warning signs. Any idea what that is?
JOhn

Daniel
8 years ago
Reply to  John

I have the same issue Jhon is explaining above, with the truncated message.
Also, instead of green check marks there is an exclamation sign on each line.

What does that means?.

Is it OK?