The fact that all of this data was already public meant that collating it all couldn’t be that hard, right? Wrong. Very wrong.


The Westminster Accounts is a database of databases. It takes three existing bodies of government data – all publicly available in some form online – lays them down, untangles them, moves them around, smooths them out, jumbles them back up, untangles them again, lines them up, dusts them off and spits them out as something entirely new but also almost exactly the same.

Here are some of the things we found while we untangled:



Nothing is a neat little table.

This is almost a given, but MPs do not register their interests in a neat table at the end of the week, month or year. Every two weeks (roughly), they produce a new register document. This means that, since the beginning of the current parliament in 2019, most MPs have produced 62 different documents of potentially hundreds of overlapping interests. Together, they’ve produced roughly 40,300 documents.

Duplicates and duplicates and duplicates.

MPs don’t just add new interests, they often update existing ones, too (and sometimes they update those updates). They might amend the amount of money they receive, declare they’ve resigned from a second job, or simply correct a minor typo, all of which invalidate the previous entry. Just combining everything together risks counting things more than once. To avoid duplicating figures, we had to trace the history of each entry over time.
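To make the double-counting risk concrete, here is a minimal sketch in Python. It assumes every register entry carries a stable identifier and the date of the register it appeared in. That is a generous assumption (the real registers offer no such thing, which is much of the problem), and the `entry_id` values, dates and amounts are invented for illustration.

```python
from datetime import date

# Hypothetical register rows: (entry_id, register_date, amount_gbp).
# The ids, dates and amounts are made up for this example.
rows = [
    ("mp1-job-a", date(2020, 1, 6), 1000),
    ("mp1-job-a", date(2020, 1, 20), 1500),  # same entry, amended amount
    ("mp1-gift-b", date(2020, 1, 20), 250),
]

def latest_versions(rows):
    """Keep only the most recent version of each entry."""
    latest = {}
    for entry_id, register_date, amount in rows:
        current = latest.get(entry_id)
        if current is None or register_date > current[0]:
            latest[entry_id] = (register_date, amount)
    return latest

# Summing every row counts the amended entry twice.
naive_total = sum(amount for _, _, amount in rows)

# Summing only the latest version of each entry does not.
deduped_total = sum(amount for _, amount in latest_versions(rows).values())
```

Here the naive total comes to 2,750 and the deduplicated total to 1,750. The hard part in practice is deciding which rows are versions of the same entry in the first place.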

My kingdom for a bridge.

Sometimes, in order to connect dataset A to dataset B you have to separately connect both A and B to C, a third dataset that bridges the gap between the original two. This project is rife with C-type datasets. 

One example: JCB is one of the biggest donors to the Conservative party and its MPs, but it ends up in public data under a lot of different names – “J C Bamford Excavators Ltd”, “JCB Limited”, “J.C.B Services”. (We found at least eight.) Some of these were easy to match up, but others were not. In the end, in order to accurately connect company names together for JCB and the other 3,000 sources of money in the database, we had to add in a combination of the Companies House database, some machine learning techniques, geographic location data (including a failed attempt to use Google Maps’ underlying data to work out if the headquarters of one company was in the same place as another similarly named company), and good old-fashioned manual review.
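A rough sketch of the bridging idea, in Python. This is not the pipeline we built, just the shape of it: tidy up each name, then look it up in a third table that maps name variants to a single donor. The `BRIDGE` table and the `donor-001` identifier are hypothetical stand-ins for what a source such as Companies House, plus manual review, would supply.

```python
import re

def normalise(name: str) -> str:
    """Lowercase a name and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Hypothetical bridge table (dataset C): normalised name -> canonical donor id.
# In a real project this would be built from a company register and checked by hand.
BRIDGE = {
    "jcbamfordexcavatorsltd": "donor-001",
    "jcblimited": "donor-001",
    "jcbservices": "donor-001",
}

def canonical_id(name: str):
    """Return the canonical donor id, or None if the name needs manual review."""
    return BRIDGE.get(normalise(name))

# Names as they might appear in two different datasets (A and B).
names_a = ["J C Bamford Excavators Ltd", "JCB Limited"]
names_b = ["J.C.B Services", "J. C. Bamford Excavators Ltd."]

ids_a = {canonical_id(n) for n in names_a}
ids_b = {canonical_id(n) for n in names_b}
```

Both sets end up containing only `donor-001`, even though no name in A matches any name in B letter for letter. Normalising alone would not get there: “JCB Limited” and “J.C.B Services” still look different once the punctuation is gone, which is why the bridge is needed.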

To err is human. (It’s also computer.)

This project overwhelmingly involves data generated and reported by humans, and humans make mistakes. They add extra zeros, forget decimal places, spell people’s names wrong, spell their own names wrong. Even when they’re not making factual errors, people introduce an inevitable level of variability to any dataset.

None of this is impossible to overcome, but it’s also not a problem computer programs alone can solve. In fact, computer programs often introduce errors themselves because they too are written by humans. We often found ourselves in the torturous cycle of: write code to detect human errors in data, have humans check those detections and correct those errors, write code to detect errors made by humans when making corrections, have humans look at errors detected by code…
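The “write code to detect human errors” step might look something like this Python sketch. It flags any payment that is far larger than the typical payment in the same series, the sort of thing an extra zero produces, and leaves the decision to a person. The threshold of five times the median and the payment figures are illustrative assumptions, not values from the project.

```python
from statistics import median

def flag_for_review(amounts, factor=5):
    """Return the positions of amounts at least `factor` times the median.

    Flagged values are only candidates: a human still has to decide
    whether each one is a typo or a genuinely large payment.
    """
    mid = median(amounts)
    if mid <= 0:
        return []
    return [i for i, amount in enumerate(amounts) if amount >= factor * mid]

# Invented monthly payments; the fourth looks like it gained a zero.
payments = [1200, 1250, 1200, 12000, 1300]
flagged = flag_for_review(payments)
```

With these figures only the fourth payment is flagged. A check like this cannot tell a mistake from a real windfall, which is why the humans stay in the loop.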

The rules are confusing for MPs, too.

Seriously, we don’t envy them.

Over the course of the project, we had to make a lot of decisions about how to interpret different kinds of register entries, and looking at the actual guidelines often left us less sure of what to do, not more.

No one thought this through.

You might think there are two simple ways to prevent someone using information you’re forced to give them – give them too much of it, or give them too little. Unfortunately there’s a third strategy that’s equally simple but far more effective – give them both.

We’re not saying any of the challenges in building the Westminster Accounts were deliberately engineered to stop people accessing the data they need to fully investigate the financial interests at the heart of the British government.

But that’s sort of the point.

It’s very easy to end up with a system that’s so disorganised, so disconnected, so broken that it looks as if no one has put any thought into it at all.

When this project began, we’d both had enough experience of government data to know the challenge ahead of us was immense, but neither of us quite understood the scale.

If there’s one thing we’ve learned it’s that not actively managing a country’s information ecosystem in a way that makes crucial data accessible to the public is an active act of obfuscation. You shouldn’t need an engineering background and six months’ hard labour to uncover basic facts about who’s paying the people who make your laws.

