Wikipedia categories ≠ ontology
I think I’m probably stating the obvious here. If we follow a single path from an article such as Tom Cruise up through the category hierarchy in Wikipedia, we find out that he is merely a theory…
Tom Cruise → 1962 births → 1960s births → 20th century births → Births by year → People → Humans → Apes → Primates → Mammals → Vertebrates → Chordates → Animals → Eukaryotes → Organisms → Life → Core issues in ethics → Ethics → Branches of philosophy → Philosophy → Belief → Spirituality → Human behaviour → Behaviour → Branches of psychology → Psychology → Interdisciplinary fields → Academic disciplines → Academia → Education → Personal development → Personal life → Self → Metaphysics → Reality → Philosophical concepts → Philosophical terminology → Terminology → Vocabulary → Language → Communication → Social psychology → Social philosophy → Philosophical movements → Movements → Ideologies → Epistemology → Philosophy of science → Analytic philosophy → 20th century philosophy → 20th century → 2nd millennium → Millennia → Years → Chronology → Measurement → Scientific observation → Data collection → Data management → Computer data → Computer storage → Computer memory → Digital media → Digital technology → Electronics → Electromagnetism → Special relativity → Relativity → Theoretical physics → Theories → …
And yes, this isn’t completely irrelevant. It relates to my honours research work. It means that the Wikipedia category hierarchy is only useful as a folksonomy, or perhaps only for a very small number of levels above each article…
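To make the "small depth" idea concrete, here is a minimal sketch in Python. It assumes a toy parent map copied by hand from the first few steps of the chain above; the name PARENTS and the function ancestors_within are made up for illustration and are not part of any Wikipedia tool or API. It collects only the categories within a fixed number of steps above an article.

```python
from collections import deque

# Hypothetical toy data: the first few links of the chain above,
# stored as category -> list of parent categories.
PARENTS = {
    "Tom Cruise": ["1962 births"],
    "1962 births": ["1960s births"],
    "1960s births": ["20th century births"],
    "20th century births": ["Births by year"],
    "Births by year": ["People"],
    "People": ["Humans"],
    "Humans": ["Apes"],
}


def ancestors_within(node, max_depth, parents=PARENTS):
    """Return {category: depth} for categories at most max_depth steps above node."""
    seen = {node: 0}          # also guards against cycles in the category graph
    queue = deque([node])
    while queue:
        current = queue.popleft()
        depth = seen[current]
        if depth == max_depth:
            continue          # do not climb any higher than the cutoff
        for parent in parents.get(current, []):
            if parent not in seen:
                seen[parent] = depth + 1
                queue.append(parent)
    del seen[node]            # the article itself is not one of its ancestors
    return seen


print(ancestors_within("Tom Cruise", 3))
```

With a cutoff of 3 the result stops at "20th century births", well before the chain wanders off into apes and ethics. Where the cutoff should sit is a guess, not something the chain above tells us.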
Not sure what you mean here. All this shows is that one category ultimately connects to all the others. But even if there were some rigorous taxonomy, wouldn’t the same chain apply, since ultimately it would divide up Everything and thereby create a path between any 2 nodes?
Comment by Michael — 22 June, 2008 @ 2:58 pm
“Connection” is a broader relationship than what is depicted here. Categorisation of A in B usually means that B subsumes A; i.e. A (or all the articles contained in A) is part of B. For many Wikipedia categorisations this is the case, but not for many others. Still, we’re talking about containment or subsumption, not arbitrary paths between nodes.
I don’t know what you mean by dividing up Everything. Wikipedia does have categories that are meant to act as the root of their article categorisation system, e.g. Main topic classifications, Fundamental, Articles and Contents. But many of Wikipedia’s categorisations are thematic rather than taxonomic, and thematic ties are much broader.
Compare this with, for instance, WordNet, a carefully designed lexical-semantic ontology. In WordNet there are very few root nodes for each part of speech; e.g. I think all nouns are rooted in “entity”, below which are a number of “unique beginners” categorised under “physical entity”, “abstraction” and “thing” (i.e. unnamed entities). Such constraints mean that the ontology is contrived in some places, but still, the point is that there is nothing clear about the semantics of Wikipedia’s hierarchy…
And I possibly have a solution which would find an approximately taxonomic subgraph of Wikipedia’s category graph, but I’ll report on that when I have something to report on.
Comment by Joel — 22 June, 2008 @ 4:37 pm
Hmm, interesting… but from the looks of that a ‘folksonomy’ could still be kinda useful, particularly if you worked out how to prune some of the ‘bad’ links. It’s all about probability anyway - you could just degrade the value of the connection the longer the chain is, which is the nice way of doing the small depth thing.
(I had a quick traceback from Tom Cruise myself, and found he was a kind of ‘Underpopulated category’, which itself was a kind of ‘Very large category’.)
Comment by James — 22 June, 2008 @ 8:00 pm
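One way to read James’s suggestion, sketched in Python: give each ancestor in a chain a weight that shrinks with its distance from the article. The geometric decay and the default factor of 0.5 are assumptions picked for illustration; the comment does not say how the value should degrade, and chain_weights is a made-up name.

```python
def chain_weights(chain, decay=0.5):
    """Weight the k-th ancestor in a chain by decay**k, with k starting at 1.

    chain[0] is the article itself and gets no weight.
    """
    return {category: decay ** k for k, category in enumerate(chain[1:], start=1)}


chain = ["Tom Cruise", "1962 births", "1960s births", "20th century births"]
for category, weight in chain_weights(chain).items():
    print(f"{weight:.3f}  {category}")
```

A hard depth cutoff is the special case where the weight drops straight to zero past some level; a smooth decay just makes that boundary less abrupt.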
Yup… I do have a way of pruning the bad stuff… and I might try it out in a few days.
Comment by Joel — 22 June, 2008 @ 10:25 pm