January 27, 2013

Taxonomy pt 2

In my last entry I discussed the special constraints and problems that occur when you need to implement a classification system that changes over time.

This time we will take the discussion a little bit deeper and look at the basic API of a taxonomy component.

A taxonomy is a graph where each taxon has a limited life span and is traceable through previous revisions of the taxonomy.
If I would use birds as an example (and I really like to do that). The Armenian gull, Larus armenicus, is considered a specie by The Association of European Rarities Committees.
It was first considered a sub specie of Herring Gull (L. argentatus), but after that specie was split into European Herring Gull, Larus argentatus, American Herring Gull, Larus smithsonianus, Caspian Gull, Larus cachinnans, Yellow-legged Gull, Larus michahellis, Vega Gull, Larus vegae and the Armenian Gull, Larus armenicus.

To complicate stuff further another taxonomy, namely birdlife.org, doesn't consider it to be a valid specie but lumps it together with Yellow-legged Gull (Larus michahellis).



So... depending on when you see this gull and which taxonomy you use it can be either a Herring gull, an Armenian gull or a Yellow-legged gull and then you're only checking two taxonomies and believe me, there are more out there...

What does this tell us?

  1. A taxon has a time span.
  2. A taxon is derived from one or more taxons.
  3. A taxon is dependent of its taxonomy and several parallel taxonomies may exist.
  4. To find a taxon from its key (the latin name in this case), you will need to know the key, the time and the taxonomy.
To find out that the Armenian gull nowadays is considered a sub specie of Yellow-legged gull in Birdlife, I would have to backtrack the taxonomy graph to find the Armenian gull and then follow it to present time to see that it has been included in Yellow-legged gull. 

Even though this example is about birds, the same will apply more or less to any other type of taxonomy.

In pseudo-code the core functions would be:
  • Taxon InsertTaxon(Taxon taxon, List<Taxon> ancestors, DateTime validFrom) to insert a taxon based on zero or more ancestors.
  • Taxon FindTaxon(String key, DateTime when, Taxonomy string) to find a taxon given a key, time and a certain taxonomy.
The signature can of course differ, but the basic design will be the same.

Next time we will take a look on how we can implement this.

Til then... Bye, bye!

January 14, 2013

The deadly flat foot

A long time ago I made a system for health statistics and I was demoing it for the stakeholders who of course knew a lot about epidemiology. One of them asked if I could use my system to find out the most common cause of death in Sweden over a time period. I was eager to show off my system and generated the query.

The result was flat foot.
I didn't feel that sure about my system after that.

So what had happened?

I had tons and tons of statistical material with gender, ages and cause of death over several years. The cause of death was marked with a diagnostic number, a so called ICD (International Classification of Disease). This ICD-code was versioned so you had a ICD-6, ICD-7, ICD-8 and so on.

Now, what I didn't know was that Sweden made a shift from ICD-7 to ICD-8 at a certain point of time. My stakeholders (stake holders?) knew this of course and set the trap with a smile.

In ICD-7 the code 746 stands for flat foot but in ICD-8 the code 746 stands for congenital anomalies of heart. So when I summarized the statistics using ICD-7 terminology for ICD-8 data... well, I guess you get the idea.

ICD is an interesting example of taxonomy, the science of classification. Other types of classifications can be the futile attempt to classify the internet into a hierarchy of subjects by yahoo, the classification of plants by Linnae or the subject classification at a library (the strange combinations of letter like Pcj:k that somehow describes a books subject).

Classifications is hierarchic by nature, a subdivision from all into smaller and smaller parts. An approach that is easy but has drawbacks when something fits equally well in two or more classes. (consider a book that is both about History and Math for example)

Classifications also change over time as we saw with the flat foot case. In ICD-8 flat foot had moved from 746 to 736. A change that is vital to know about in order to get correct statistics.

So each version of classification connects to the previous version. In ICD-7 the flatfoot at code 746 points to the 736 flat foot in ICD-8.

Other diagnoses had one code in ICD-7 and got several codes in ICD-8.

Ischaemic Heart Disease for example was 420 in ICD-7 but was covered by the codes 410-414 in ICD-8.

A fork.

This of course made it impossible to know which ICD-8 diagnose a person with the ICD-7 diagnose of  Ischaemic Heart Disease had.
All that could be said was that it was one of the diagnoses between 410 and 414 and maybe, maybe if we knew the relative distribution between the diagnoses 410-414 we could guess that it was 40% chance of 410, 15% chance of 411 and so on stumbling through fuzzy logic.

The opposite could also happen of course, that two classes in the old version is represented by a single class in the new. A join.

Similarly some classes may not have a representation in the new version and totally new classes could appear. You will not find HIV in the ICD-7 because originates from 1955 when the disease was unknown.

To complicate matters even more there can be several different classifications that each has their own versioning with forks and joins, but who's classes also connect to classes in other taxonomies.

With birds you have the Sibley-Ahlquist classification that sees the species differently from the traditional Clemens classification.

The national symbol of New Zeeland, the kiwi bird, is considered to be part of the kiwi order Apterygiformes in Clemens but in the Sibley-Ahlquist it is seen as a part of the ostrich order Struthioniformes.

Still, a kiwi is a kiwi and there is a very strong relationship between the kiwi class in the Clemens classification and its Sibley-Ahlquist sibling.

So how do we fit this thing called classifications into SQL Server?

We have seen that a classification consists of a hierarchy of classes that are connected to other versions of the classification and also to other classes in totally different classifications.

We do have pretty good possibilities to implement hierarchies in SQL Server, but a hierarchy only covers one version of a classification. If we want to be able to track changes and translate between different versions of a taxonomy, a hierarchy is not enough because it is not a hierarchy. It is a graph. And maybe it is a directed acyclic graph and maybe, maybe even a weighted variant.

You can put graphs in a relational database, but it is painful.

Better to use a dedicated graph database such as Neo4J or maybe use a mix of graph engine and a relational database.

More on this next time.

January 6, 2013

A nice example on ink functionality and how to save and load images in Win 8 RT

Hi,

A new year and although I didn't made a promise to blog more this year, it still feels like I have a little more energy this year. After all, the world didn't vanish the 21st of December and the European Union hasn't collapsed yet.

So I got a question on how to load and save bitmaps in Windows 8 RT and planned to blog about it.

However... one of my rules in coding is to never ever try something new without googling first. So I did.

Here is a nice sample on using the ink and how to do a lot more than I would have done if I would have made it.

Hmm... now to some serious coding!