Has it really been a year since I posted here? Oh, dear. Well. So, this Friday/Saturday I went to the FCO’s hackathon event – wonderfully titled “Who Was Our Man In Havana?” – to have a play with a dataset of British diplomats.
My goal was to try to sync this up with Wikidata in some way – there was an obvious affinity with the MPs project, and given how closely tied the diplomatic service has been to the establishment, it seemed likely there would be a lot of overlap. The objective of the event was to produce some kind of visualisation/interface, so after a bit of discussion with my team-mates we decided to clean up the data, import some of it into Wikidata, and pull it back out in enriched form.
The data cleaning was a bit of a challenge. Sev and Mohammed, my team-mates, did excellent work hacking away at the XML and eventually produced a nice, elegantly parsed version of the source data.
I uploaded this into Magnus’s mix-and-match tool, using a notional ID number which we could tie back to the records. Hammering away at mix-and-match that evening got me about 400 initial matches to work with. While I was doing this, Sev and Mohammed expanded the XML parsing to include all the positions held plus dates, tied back to the notional IDs in mix-and-match.
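The notional-ID step can be sketched roughly like this – a hedged Python sketch, since the real record structure and the exact import columns differed in practice; the `FCO` + number scheme and the field names here are illustrative assumptions, not the real data:

```python
# Sketch: assign notional IDs to parsed diplomat records and emit a TSV
# suitable for a mix-and-match import (external ID, name, description).
# The record fields and ID scheme are assumptions for illustration.
import csv
import io

def make_mix_and_match_tsv(records):
    """records: iterable of dicts with 'name' and 'notes' keys (assumed)."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for notional_id, rec in enumerate(records, start=1):
        # The notional ID ties mix-and-match matches back to our parsed XML.
        writer.writerow([f"FCO{notional_id:04d}", rec["name"], rec["notes"]])
    return out.getvalue()

records = [
    {"name": "Arthur Example", "notes": "Minister to Ruritania 1905-1910"},
]
print(make_mix_and_match_tsv(records))
```

The point of the synthetic ID is simply to give each record a stable handle, so that whatever matching happens in mix-and-match can later be joined back to the richer parsed data.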
On Saturday, I wrote a script to pull down the mix-and-match records, line them up with the expanded parsing data, and put that into a form that could be used for QuickStatements. Thankfully, someone had already established a clear data model for diplomatic positions, so I was able to build on that to work out how to handle the positions without having to invent it from scratch.
The upload preparation was necessarily a messily manual process – I ended up compromising on a script that generated a plain TSV, which I could feed into a spreadsheet and then manually look up (e.g.) the relevant Wikidata IDs for positions. With more time we could have put together something that automatically looked up position IDs in a table and produced a formatted sheet (or even pushed it out through something like wikidata-cli), but I wanted a semi-manual approach for this stage so I could keep an eye on the data and check it looked sensible. (Thanks at this point also to @tagishsimon, who helped with the matching and updating on mix-and-match.) And then I started feeding it in, lump by lump. Behold, success!
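The matched-records-to-QuickStatements step can be sketched like this – a minimal sketch, assuming our rows carry (person Qid, position Qid, start year, end year); P39 is "position held" and P580/P582 are the start/end time qualifiers, with `/9` marking year precision:

```python
# Sketch: turn matched rows into QuickStatements V1 command lines.
# The row shape (person, position, start year, end year) is an assumption
# about our parsed data, not the real intermediate format.
def quickstatements_lines(rows):
    lines = []
    for person, position, start, end in rows:
        parts = [person, "P39", position]  # P39 = position held
        if start:
            # Wikidata time format, /9 = year precision
            parts += ["P580", f"+{start}-00-00T00:00:00Z/9"]
        if end:
            parts += ["P582", f"+{end}-00-00T00:00:00Z/9"]
        lines.append("\t".join(parts))
    return lines

for line in quickstatements_lines([("Q1234", "Q5678", "1905", "1910")]):
    print(line)
```

Each tab-separated line is one claim with its qualifiers, which is why a spreadsheet made a comfortable intermediate stage for eyeballing the data before pasting batches into QuickStatements.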
While I was doing this, Mohammed assembled a front-end display, which used vue.js to format and display a set of ambassadors drawn from a Wikidata SPARQL query. It concentrated on a couple of additional things to demonstrate the enrichment available from Wikidata – a picture and some notes of other non-ambassadorial positions they’d held.
To go alongside this, as a demonstration of other linkages that weren’t exposed in our tool, I knocked up a couple of quick visualisations through the Wikidata query tool: a map of where British ambassadors to Argentina were born (mainly the Home Counties and India!), or a chart of where ambassadors/High Commissioners were educated (Eton, perhaps unsurprisingly, making a good showing). It’s remarkable how useful the query service is for whipping up this kind of visualisation.
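For flavour, the birthplace map came from a query along these lines – the position item here is a placeholder, not the real Wikidata ID, and the exact query we used differed; `#defaultView:Map` tells the query service to render coordinate results as a map:

```python
from urllib.parse import quote

# Illustrative SPARQL for a birthplace map. wd:QXXXXX is a placeholder for
# the "Ambassador to Argentina" position item - look up the real ID first.
BIRTHPLACE_MAP_QUERY = """#defaultView:Map
SELECT ?person ?personLabel ?birthplace ?coords WHERE {
  ?person wdt:P39 wd:QXXXXX .     # P39 = position held (placeholder item)
  ?person wdt:P19 ?birthplace .   # P19 = place of birth
  ?birthplace wdt:P625 ?coords .  # P625 = coordinate location
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}"""

def query_service_link(sparql):
    # Deep link into the Wikidata Query Service UI with the query pre-filled.
    return "https://query.wikidata.org/#" + quote(sparql)

print(query_service_link(BIRTHPLACE_MAP_QUERY))
```

Swapping `#defaultView:Map` for `#defaultView:BubbleChart` and P19 for P69 (educated at) gives the education chart with almost no other changes – which is exactly why the query service is so handy for whipping these up.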
We presented this on Saturday afternoon and it went down well – we won a prize! A bottle of wine and – very appropriately – mugs with the famed Foreign Office cat on them. A great weekend, even if it did mean an unreasonably early Saturday start!
So, some thoughts on the event in conclusion:
- It was very clear how well a range of skills worked together at an event like this. I don’t think any of us could have produced the result on our own.
- A lot of time – not just our group, but everyone – was spent parsing and massaging the (oddly structured) XML. Had the main lists been available as a CSV/TSV, this might have been a lot quicker. I certainly wouldn’t have been able to get anywhere with it myself.
- On the data quality note, we were lucky that the record names were more or less unique strings, but an ID number for each record, inserted when the original XML was generated, might have saved a bit of time.
- A handful of people could go from a flat file of names, positions, and dates to about a thousand name–position pairs on Wikidata, some informative queries, and a prototype front-end viewer with a couple of days of work – and some of that could have been bypassed with cleaner initial data. This is really promising for future work along these lines.
And on the Wikidata side, there are a few modelling questions this has thrown up:
- I took the decision not to split postings by diplomatic rank – e.g. someone who was officially the “Minister to Norway” (1905–1942) conceptually held the same post as someone who was “Ambassador to Norway” (1942–2018). If desired, we can represent the rank as a qualifier on the statement (e.g. subject has role: “chargé d’affaires”). This seemed to make the most sense – “ambassadors with a small ‘a’”.
- The exception to this is High Commissioners, who are currently modelled in parallel to Ambassadors – the same hierarchy, duplicated. This lets us find all the HCs without simply treating them as “Ambassadors with a different job title”.
However, this may not be a perfect approach, as some posts have switched from High Commissioner to Ambassador and back again (e.g. Zimbabwe) when the country left and rejoined the Commonwealth. At the moment these are modelled by picking one form per country and sticking to it, with the option of qualifiers as above, but a better approach might be needed in the long run.
- Dates as given are the dates of service. A few times – especially in the 19th century, when journeys were more challenging – an ambassador was appointed but never travelled out to take up the post. These have been imported with no start/end dates, but this isn’t a great solution. Arguably they could have a start/end date in the same year and a qualifier to say they did not take up the post; alternatively, you could make a case that they should not be listed as ambassadors at all.
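As a concrete illustration of the rank-as-qualifier option above: in QuickStatements V1 a qualifier simply extends the claim line, so the rank rides along with the position statement. All the Q-ids here are placeholders, not real items:

```python
# Sketch: a position-held claim with a rank qualifier, as one
# QuickStatements V1 line. P2868 = subject has role. Q-ids are placeholders.
statement = "\t".join([
    "Q1234",            # the diplomat (placeholder)
    "P39", "Q5678",     # position held: e.g. "Ambassador to Norway" (placeholder)
    "P2868", "Q91011",  # subject has role: e.g. "chargé d'affaires" (placeholder)
])
print(statement)
```

The same mechanism would cover the not-taken-up-post case – a qualifier on an otherwise normal position claim – which is why it feels like the least disruptive of the options.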