Preparing the UCI Framelist

The UCI – the international cycling union, publishes a list of bike frames allowed in the World Tour – the series of races that comprise the premier event of professional road cycling. If there’s a premium mass market road bike, it’s on the list. And this list is a great way to analyze the development of new models. In particular, it is a great way to keep a tab on upcoming new bike releases, as often times they show up on the list before they show up in a press release (after all, you [the bike company] will show off your new bike just as the Tour de France is about to start, and you want to be absolutely sure the bike is actually legal to race in said major event).

But for all of us who love data, well the UCI doesn’t make it easy. The list is published as PDF, and only as a PDF. Given the cycling union is run by old school French people (which results in numerous problems beyond this) it is quite likely that the list does not exist in an official spreadsheet anywhere. Now PDFs may be great for sending your latest sales report in to your boss, but they are a horrible way of transporting data – for practical purposes they are basically a picture of the document.

I set out to make a script that could take this list and turn it into a .csv -the sort of file you read in Excel or better yet a programming language like R or Python. This turned out to take much longer than anticipated. I tried a number of existing python package for extracting tables from PDFs, specifically Camelot and Tabula, and ultimately settled on Tabula. These packages work, but not reliably – my final script run on a Linux computer starts by generating 21 columns, while run on Windows it generates 7 or 8 columns (no idea why, software versions are the same or similar as far as I can tell), and my script has to handle cleaning both. Mostly this work is cleaning the resulting dataframe, which often has data in the wrong columns. My solution was basically a try/except series of options where I try to pull the data from various possible columns until I get one with a date format. Then that it, date is the only machine readable thing I really care about, as humans will do the rest of the reading from there. Another problem is that the brand name isn’t in this table, you’ve got to figure it out from the first letters of the frame or fork code – which really is pretty easy because who else is going to make an Infinito from brand Bia–?

I decided to publish the results to Google Sheets. Yes, Google Sheets… Why? Well, I don’t have a personal database server, and I don’t really need one. Google has an API, excellent products, I generally trust them, and it’s free. It’s also then really easy to embed the resulting Sheet as an iframe, and it also has notification updates (and change tracking) so that when my script updates the list with new results (I only update when there are new results), I can get an email without having to put email-me code into the python script.

My script, now in production, is run as a daily cronjob on my newly built home server – a few years old gaming laptop that beats the Raspberry Pi I have been using significantly in performance, and actually only uses a bit more electric power (with no display, no wifi, no bluetooth) while giving a lot more computer power. I’ll post about that server setup eventually, I hope.

Here’s the code for the PDF processing:

3 thoughts on “Preparing the UCI Framelist”

  1. I dont mean to criticize, just think making codes into modules and functions(or objects) can make it much easier for future reference and developing reusable blocks. I’m challenging myself to rewrite my code into 10 line functions and blocks these days.

    1. Building objects and functions for something you are only going to do once in a program is unnecessary overhead. Nor is particularly cleaner, in fact it often takes more effort to trace through all the objects and functions than just having code that progressively does what it needs, where it needs. And as such, it still quite reusable, it’s copy + paste whether its in functions or not.
      It’s a relic of developing big apps that just doesn’t apply in most of these microservice type programs I create.

      1. I agree for one-time stuff or experiment there is no need.

        Sadly most of the stuff I wrote needs to be revamped every now and then. so I still need modules. Or else I found myself reading my own code in bewilderment. And I’m working in a team, one-time stuff creates lots of redundancy. And I ran into multiple occasions where I simply refuse to read somebody else’s code because it’s dreadful to read.

        Making functions/classes, dividing functions into different blocks according to what they do actually helps.

Leave a Comment

Your email address will not be published. Required fields are marked *