The UCI – the international cycling union, publishes a list of bike frames allowed in the World Tour – the series of races that comprise the premier event of professional road cycling. If there’s a premium mass market road bike, it’s on the list. And this list is a great way to analyze the development of new models. In particular, it is a great way to keep a tab on upcoming new bike releases, as often times they show up on the list before they show up in a press release (after all, you [the bike company] will show off your new bike just as the Tour de France is about to start, and you want to be absolutely sure the bike is actually legal to race in said major event).
But for all of us who love data, well the UCI doesn’t make it easy. The list is published as PDF, and only as a PDF. Given the cycling union is run by old school French people (which results in numerous problems beyond this) it is quite likely that the list does not exist in an official spreadsheet anywhere. Now PDFs may be great for sending your latest sales report in to your boss, but they are a horrible way of transporting data – for practical purposes they are basically a picture of the document.
I set out to make a script that could take this list and turn it into a .csv -the sort of file you read in Excel or better yet a programming language like R or Python. This turned out to take much longer than anticipated. I tried a number of existing python package for extracting tables from PDFs, specifically Camelot and Tabula, and ultimately settled on Tabula. These packages work, but not reliably – my final script run on a Linux computer starts by generating 21 columns, while run on Windows it generates 7 or 8 columns (no idea why, software versions are the same or similar as far as I can tell), and my script has to handle cleaning both. Mostly this work is cleaning the resulting dataframe, which often has data in the wrong columns. My solution was basically a try/except series of options where I try to pull the data from various possible columns until I get one with a date format. Then that it, date is the only machine readable thing I really care about, as humans will do the rest of the reading from there. Another problem is that the brand name isn’t in this table, you’ve got to figure it out from the first letters of the frame or fork code – which really is pretty easy because who else is going to make an Infinito from brand Bia–?
I decided to publish the results to Google Sheets. Yes, Google Sheets… Why? Well, I don’t have a personal database server, and I don’t really need one. Google has an API, excellent products, I generally trust them, and it’s free. It’s also then really easy to embed the resulting Sheet as an iframe, and it also has notification updates (and change tracking) so that when my script updates the list with new results (I only update when there are new results), I can get an email without having to put email-me code into the python script.
My script, now in production, is run as a daily cronjob on my newly built home server – a few years old gaming laptop that beats the Raspberry Pi I have been using significantly in performance, and actually only uses a bit more electric power (with no display, no wifi, no bluetooth) while giving a lot more computer power. I’ll post about that server setup eventually, I hope.
Here’s the code for the PDF processing: