A long time ago I made a system for health statistics and I was demoing it for the stakeholders who of course knew a lot about epidemiology. One of them asked if I could use my system to find out the most common cause of death in Sweden over a time period. I was eager to show off my system and generated the query.
The result was flat foot.
I didn't feel that sure about my system after that.
So what had happened?
I had tons and tons of statistical material with gender, ages and cause of death over several years. The cause of death was marked with a diagnostic number, a so called ICD (International Classification of Disease). This ICD-code was versioned so you had a ICD-6, ICD-7, ICD-8 and so on.
Now, what I didn't know was that Sweden made a shift from ICD-7 to ICD-8 at a certain point of time. My stakeholders (stake holders?) knew this of course and set the trap with a smile.
In ICD-7 the code 746 stands for flat foot but in ICD-8 the code 746 stands for congenital anomalies of heart. So when I summarized the statistics using ICD-7 terminology for ICD-8 data... well, I guess you get the idea.
ICD is an interesting example of taxonomy, the science of classification. Other types of classifications can be the futile attempt to classify the internet into a hierarchy of subjects by yahoo, the classification of plants by Linnae or the subject classification at a library (the strange combinations of letter like Pcj:k that somehow describes a books subject).
Classifications is hierarchic by nature, a subdivision from all into smaller and smaller parts. An approach that is easy but has drawbacks when something fits equally well in two or more classes. (consider a book that is both about History and Math for example)
Classifications also change over time as we saw with the flat foot case. In ICD-8 flat foot had moved from 746 to 736. A change that is vital to know about in order to get correct statistics.
So each version of classification connects to the previous version. In ICD-7 the flatfoot at code 746 points to the 736 flat foot in ICD-8.
Other diagnoses had one code in ICD-7 and got several codes in ICD-8.
Ischaemic Heart Disease for example was 420 in ICD-7 but was covered by the codes 410-414 in ICD-8.
This of course made it impossible to know which ICD-8 diagnose a person with the ICD-7 diagnose of Ischaemic Heart Disease had.
All that could be said was that it was one of the diagnoses between 410 and 414 and maybe, maybe if we knew the relative distribution between the diagnoses 410-414 we could guess that it was 40% chance of 410, 15% chance of 411 and so on stumbling through fuzzy logic.
The opposite could also happen of course, that two classes in the old version is represented by a single class in the new. A join.
Similarly some classes may not have a representation in the new version and totally new classes could appear. You will not find HIV in the ICD-7 because originates from 1955 when the disease was unknown.
To complicate matters even more there can be several different classifications that each has their own versioning with forks and joins, but who's classes also connect to classes in other taxonomies.
With birds you have the Sibley-Ahlquist classification that sees the species differently from the traditional Clemens classification.
The national symbol of New Zeeland, the kiwi bird, is considered to be part of the kiwi order Apterygiformes in Clemens but in the Sibley-Ahlquist it is seen as a part of the ostrich order Struthioniformes.
Still, a kiwi is a kiwi and there is a very strong relationship between the kiwi class in the Clemens classification and its Sibley-Ahlquist sibling.
So how do we fit this thing called classifications into SQL Server?
We have seen that a classification consists of a hierarchy of classes that are connected to other versions of the classification and also to other classes in totally different classifications.
We do have pretty good possibilities to implement hierarchies in SQL Server, but a hierarchy only covers one version of a classification. If we want to be able to track changes and translate between different versions of a taxonomy, a hierarchy is not enough because it is not a hierarchy. It is a graph. And maybe it is a directed acyclic graph and maybe, maybe even a weighted variant.
You can put graphs in a relational database, but it is painful.
Better to use a dedicated graph database such as Neo4J or maybe use a mix of graph engine and a relational database.
More on this next time.