So I started a conversation with the clever and oh-so helpful Mike Steckel from International SEMATECH about thesauri and their kinfolk. It seems he learned a ton from the Argus seminar, and was kind enough to share some of that learning with me.

It proved to be tremendously helpful. You wouldn’t believe how little there is about organization tools in plain English for ordinary people. Kudos to Mike and the former argonauts!

I reproduce it below in hopes it helps some other poor lost fool.

ME: http://webreview.com/1999/07_09/strategists/07_09_99_3.shtml

Despite having read this, and having followed multiple links from Google, I can’t really sort out how a controlled vocabulary is so different from a thesaurus (they seem to be used almost interchangeably) and why it is useful. It seems to me (please, please correct me) that a controlled vocabulary could hinder information retrieval if used without a thesaurus. So is it just the basis for one? Or?

MIKE: A thesaurus is a kind of controlled vocabulary. There are many ways to control vocabulary as a way to avoid the confusion of synonyms. A minor example would be something like this:

Use “CONGO, Democratic Republic of” rather than ZAIRE

A thesaurus is a very advanced way of controlling vocabulary and in general shows:

1. Equivalence – variants and preferred terms
2. Hierarchical – Broader and Narrower
3. Associative – “see also” references

Hope that helps.

I’m trying hard to sort out the metadata kinds for my book. It’s quite a struggle. So many things I know internally, but have never expressed.

I find it interesting that a thesaurus is a subset of controlled vocabularies. What are other sorts of controlled vocabularies? Would a misspelling list be one?

You mean like DrugStore.com taking people who type “tilenol” directly to “Tylenol?” Sure, that would be a controlled vocabulary. You are taking the terms the user uses and translating them to something controlled. The Congo example below could be called an authority list, maybe the “least controlled” controlled vocabulary. Somewhere in between an authority list and a thesaurus is a taxonomy, which is often just a list of controlled terms taken directly from a series of documents. A thesaurus would include all of the spaces and relationships so that the list of terms is “complete.” An analogy (from a friend of mine) of how complete a thesaurus would be is the way the periodic table predicted certain elements that did not yet exist. Does this make sense?

An article: http://www.dlib.org/dlib/november98/11batty.html

Peter and Lou have written a lot on this.

A huge series of links on the subject.


but doesn’t a taxonomy also add in hierarchy? I noticed in one of the things I was looking at a thesaurus was said to have hierarchy, which I had not thought belonged to that beast.

I’m still trying to figure out relationships between controlled vocabs, thesauri, taxonomies, and keywords…

sigh. messy, innit?

Yup. it sure is.

Normally a taxonomy does add in hierarchy, but it does not attempt to be a complete representation of something.

Check out the ASIS thesaurus:

or the Art and Architecture Thesaurus (faceted! Cool!):

These attempt to be fully descriptive of the activities of their field. People often think of Roget’s when they think about a thesaurus, but this is basically something different.

A taxonomy is smaller, but usually does contain hierarchies. “Taxonomy” is often thrown around in KM circles and is the most abused word in your list.

Both of these (taxonomy and thesaurus) are controlled vocabularies. In my case, I have a thesaurus of semiconductor manufacturing terms that I assign to documents for information retrieval. When I take a term from the thesaurus and put it on a document I am giving the document a keyword. When the user searches for a variant of the keyword, like, he calls it a reticle when we use the term MASK (they generally mean the same thing), we can pull the documents with MASK assigned and give them to him. The thesaurus tells us “when you see reticle, pretend it is MASK.” A taxonomy would do the same thing.
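The reticle/MASK lookup described above can be sketched in a few lines of Python. This is only an illustration under assumptions, not Mike's actual system: the `VARIANTS` table, the `INDEX` contents, the document IDs, and the function names are all made up for the example.

```python
# Hypothetical variant table: each entry word (lowercased) points to its preferred term.
VARIANTS = {
    "reticle": "MASK",
    "mask": "MASK",
}

# Hypothetical index produced by the indexer: preferred term -> documents it was assigned to.
INDEX = {
    "MASK": ["doc-101", "doc-207"],
}


def preferred_term(word):
    """Return the preferred term for a word, or None if the thesaurus does not know it."""
    return VARIANTS.get(word.strip().lower())


def search(word):
    """Translate the user's word to its preferred term, then pull the indexed documents."""
    term = preferred_term(word)
    if term is None:
        return []
    return INDEX.get(term, [])


print(search("reticle"))  # ['doc-101', 'doc-207']
```

The point of the sketch is that documents are only ever tagged with the preferred term; the variants live in one table, so adding a new synonym never means re-tagging content.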

wow. I’m digging all this.
I am finding myself having a hard time shutting up!

I would say that this is a taxonomy that would be familiar:

Keith assigns keywords to the documents he puts here. The thing he draws from is the taxonomy. The terms are taken more or less from the material and organized; it is hard to do this without some hierarchy involved.

So, is the hierarchy on the front page of yahoo a taxonomy?

is yahoo a place where we can start to talk about relationships.. like I search, right? well, my keyword is matched against what…? I get pages, but I also get categories….. what’s going on here….

Yahoo is without doubt a taxonomy. Your keywords are matched against the taxonomy and you get the matches, but you also get the “See also” categories for further browsing. The hierarchy can show all of the ways your term hits these kinds of relationships:

Equivalence – variants and preferred terms
Hierarchical – Broader and Narrower
Associative – “see also” references

I don’t know what you mean by “taxonomy of organization tools”

I’ll try, but I doubt this will format well….

controlled vocabulary
thesaurus        taxonomy
^                        ^
spelling  synonyms    category keywords

or some such… you know, a diagram showing the hierarchical relationships among the organization tools….

Maybe a “controlled vocabulary” taxonomy? How about as a Hierarchy?

Authority list — lowest level — no hierarchy, just preferred terms, a way to tell the system “CA” is the same as “California”

Taxonomy — middle level — hierarchy, pulled from material, may have gaps if there is no content. You would be able to tell that San Francisco is a narrower term and California is a broader term. If there is no content relating to Santa Clara, then Santa Clara would not be a term. This is the highest level necessary
for most websites.

Thesaurus — Highest level — Peter Morville called this the “Rolls-Royce of controlled vocabularies” at a seminar I went to. It would attempt to include all California cities as a subset of California. In other words each city would have California as a broader term. It would also show related terms such as cities that are near each other, or something like that. Generally useful only to very large sites. By the way there are two kinds — pre-enumerative and post-enumerative (faceted), but don’t worry about that yet.

Authority list is new to me… do you have an example?

I don’t know of an example in practice, but I know what it is in theory. People don’t think about this level too much, but it is a way to show simple equivalence relationships. The examples from the Argus seminar were these (hope this formats):

Preferred     Variants                          Authority
AZ            Ariz, Arizona, 85XXX              US Postal Service
IBM           Intl Bus Machines, Big Blue       NY Stock Exchange
Nyctalopia    Night blindness, Moon blindness   National Library of Medicine

“Big Blue” is the same as IBM. This says “when you see ‘Big Blue’ it means IBM”

If you had a medical site used by both doctors who might use “Nyctalopia” and consumers who might look for “Night Blindness,” this would be a way to link them together.
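An authority list like the one in the table is small enough to sketch as a plain lookup. The following is a minimal illustration, assuming case-insensitive matching; the names `AUTHORITY_LIST`, `build_lookup`, and `resolve` are hypothetical, and the "85XXX" ZIP pattern is left out because it would need pattern matching rather than a simple lookup.

```python
# Authority list from the seminar examples: preferred term -> (variants, authority).
AUTHORITY_LIST = {
    "AZ": (["Ariz", "Arizona"], "US Postal Service"),
    "IBM": (["Intl Bus Machines", "Big Blue"], "NY Stock Exchange"),
    "Nyctalopia": (["Night blindness", "Moon blindness"], "National Library of Medicine"),
}


def build_lookup(authority_list):
    """Flatten the list so every variant (and the preferred term itself) maps to the preferred term."""
    lookup = {}
    for preferred, (variants, _authority) in authority_list.items():
        lookup[preferred.lower()] = preferred
        for variant in variants:
            lookup[variant.lower()] = preferred
    return lookup


LOOKUP = build_lookup(AUTHORITY_LIST)


def resolve(term):
    """Return the preferred term, or the term unchanged if it is not in the list."""
    return LOOKUP.get(term.strip().lower(), term)


print(resolve("Big Blue"))         # IBM
print(resolve("night blindness"))  # Nyctalopia
```

Note that there is no hierarchy anywhere in this structure, which is what makes it the "least controlled" kind of controlled vocabulary.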

ahhh. that’s actually what I thought a thesaurus is. damn roget’s!

Yeah, forget that guy!

My understanding is that “Thesaurus” was used for what we now call “Dictionary” and that Roget used it in 1852 as a way to give the user a choice among several terms. In the early 1950s people started to use it in the opposite way, as a prescriptive limitation on the terms used.

I have a book here in my office that says “thesaurus” is a Latin form of a Greek word meaning “Treasure Store.” I like that!

Also, FYI…for me, the process of finding a keyword from my thesaurus and applying it to some content is “indexing.”

at which point I asked if I could reproduce this in the blog, and he graciously agreed. Thanks Mike!


    tess lispi

    This is how I understand how the different components (taxonomies, thesauri, keywords, metadata, controlled vocabularies) relate to each other:

    Taxonomies are classifications (an arrangement in groups or categories with established criteria). Taxonomies can become navigation in a web site or tables and fields in databases.

    Thesauri are controlled vocabularies. This is the framework that is often built from a combination of the metadata categories (or keywords) and the taxonomies. This is what I have put in screen and content decks to be entered by the engineers. Usually gleaned from the content in an assessment.

    Keywords are words that are essentially repeated often or have a particular emphasis on a site or in a document or a database. These can be what librarians call access points to the information.

    Metadata uses keywords and taxonomies to create a high-level view of the information on a site or in a database. The metadata is the information framework which everything on the site relates to. These can be what librarians call access points to the information. They can follow the Dublin Core element set (if you need a copy of the latest DC, I would be happy to send it to you).

    Controlled vocabularies are thesauri. These are all the variants of a word or phrase the way it is found in the site or database (e.g. misspellings, words with similar meaning and inter-related terms).

    This is the way I understand them. I have been doing quite a bit of research in this area since I have been a woman-of-leisure. I have been accepted into graduate school in the LIS program and imagine that I will get a fuller understanding as the next 2 years pass!

    I hope that this has helped you. If I can clarify any more let me know. I have some docs to use and send if necessary. This has been straight from my head, so if you need documented quotes I may be able to scrounge them up. :>


    How I understand it: (essentially the same)

    – a controlled vocabulary is simply anything that says: “Use this word instead of this”. It doesn’t have structure. If you want to group your controlled vocabularies (like in: cities, things to do, …) you need to make separate ones for each group.

    – a thesaurus is a controlled vocabulary with some added stuff: it doesn’t just say use this instead of this, but also gives a way of showing (roughly) hierarchy (broader and narrower terms), related terms and variants. However, it has the same problem that you can’t group your terms into blobs.

    – a topic map is yet more advanced than a simple controlled vocabulary or a thesaurus: it allows for a fantastic complexity of organising your metadata (things like scope, or relationships), and it’s very strong in the way the data can be manipulated by a computer (You can automatically MERGE topic maps for example, try that with a thesaurus). (see http://easytopicmaps.com for more info)

    – a taxonomy is anything that organises things into categories.

    – metadata is just the generic term for data about data. All the above are forms of metadata.

    So if you use a controlled vocabulary, or even a thesaurus, you will probably also need to build a taxonomy. A topic map can include all of that (and more)

    Mike Steckel

    I love this conversation. I just want to clarify.

    “Taxonomies are classifications (an arrangement in groups or categories with established criteria)”

    “- a taxonomy is anything that organises things into categories”

    What are you organizing into categories here? The controlled terms. It is difficult to organize your terms without establishing some sort of hierarchy eventually. The extent to which you decide to do this determines whether you have a taxonomy or a thesaurus. The thesauri I keep in my cube are huge, like textbooks. Taxonomies work for most sites perfectly well.

    I have just started reading about topic maps and they look thrilling. Thanks for setting up the site, Peter!

    Excellent comments. I’m with Christina on this — still on the beginner level with taxonomy vs. controlled vocab vs. thesaurus — but the one thing that I read a while ago that really helped clarify everything was this Argus presentation (PDF) by Chris Farnum on IA for Intranets. There’s lots of great info there on general IA and applying it to Intranets, but pages 11-23 hit on all the stuff we’re talking about here. It isn’t as in-depth but it’s a nice overview that helped sort out all of the terms for me.

    “I’m a little late to this discussion…

    Can anyone direct me to info about how Controlled Vocabs/Thesauri/ etc. are actually implemented in search engines? How do I get it from a spreadsheet onto the server–what does my database team need to do? Part of it is associating content items one-by-one with preferred terms, but how do I, for example, implement “correct spellings”? (i.e. “tilenol” is “Tylenol”)”


    Re: tilenol.

    I’m struggling with the same problem and have come to the conclusion that brute force is probably the only way to catch most mis-spellings or slight differences in terminology.

    For smaller vocabularies, I had some success comparing search words with what was in the database using a string-distance algorithm, but this wouldn’t scale.
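The comparison described above might look something like the sketch below. It assumes "the distance algorithm" means edit (Levenshtein) distance, which the comment does not actually say; the function names, the sample vocabulary, and the `max_distance` threshold are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def closest_term(word, vocabulary, max_distance=2):
    """Return the vocabulary term nearest to word, or None if nothing is close enough."""
    word = word.lower()
    best_term, best_distance = None, max_distance + 1
    for term in vocabulary:  # one comparison per term: this is the part that does not scale
        distance = edit_distance(word, term.lower())
        if distance < best_distance:
            best_term, best_distance = term, distance
    return best_term


print(closest_term("tilenol", ["Tylenol", "Aspirin", "Ibuprofen"]))  # Tylenol
```

Every query word is compared against every term, so the cost grows with the size of the vocabulary, which fits the scaling complaint above.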

    I can see the potential benefits of having a comprehensive vocabulary, but at the moment they are proving to be laborious to author, and having done the hard work, quite difficult to integrate with a search engine.

    For example, I have a taxonomy which might have “Usability” and “Human Computer Interface” and “HCI” in it…and someone searches for “Human Computer Interface HCI Tools Software”, how do I pick out which lumps of the search query are terms and which are just words?
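One possible answer to the question above is a greedy longest-match pass over the query: at each word, try the longest phrase that is a known term and fall back to shorter ones. This is a sketch of that one technique, not a claim about how any particular search engine does it; `split_query` and its return shape are made up for the example.

```python
def split_query(query, terms):
    """Greedy longest-match: split a query into known taxonomy terms and leftover words."""
    known = {term.lower(): term for term in terms}
    max_words = max((len(term.split()) for term in terms), default=0)
    words = query.split()
    matched, leftover = [], []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for size in range(min(max_words, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size]).lower()
            if phrase in known:
                matched.append(known[phrase])
                i += size
                break
        else:
            # No phrase starting at this word is a term, so treat it as a plain word.
            leftover.append(words[i])
            i += 1
    return matched, leftover


terms = ["Usability", "Human Computer Interface", "HCI"]
print(split_query("Human Computer Interface HCI Tools Software", terms))
# (['Human Computer Interface', 'HCI'], ['Tools', 'Software'])
```

Greedy matching can guess wrong when terms overlap, so a real implementation would probably need tie-breaking rules; the sketch only shows the basic idea.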


    What a bunch of self-important crap. How brilliant to just decide to redefine what common understanding and dictionaries say the meanings of words are, such that any educated person has to fight continual cognitive dissonance to glean any understanding of the bafflegab you all are talking about.

    Controlled vocabulary? Thesauri are controlled vocabularies with added hierarchical information? Puh-leaze.

    David Locke

    A taxonomy is a hierarchy, yes. But it consists of two different things: the decisions that classify (taxons), and the things being classified (infons). There are the terms and there is some logic.

    The taxons define the categories. The infons occupy the categories. Taxons are the branches of the decision tree. Infons are the leaves of the decision tree.

              Color?           [taxon]
            /   |   \
         Red  Green  Blue      [infons]

    Infons can become taxons by adding resolution to the infon:

                 Color?              [taxon]
               /   |   \
              /  Green  Blue         [infons]
            Red                      [taxon]
           /   \
    Light Red   Dark Red             [infons]

    And, taxons can become infons:

    Color [infon]

    The roles depend on focus or granularity. Am I shopping for clothes or shirts?

    Knowledge grows by converting infons to qualitative and later quantitative taxons. As this knowledge is explicated, new terminology gets created.

    On Topic Maps, scope on a Topic Map is context. A word like demo means different things depending on context. It’s descriptive data in marketing. It is a tape or CD if you are a band.

    David Locke

    Well, dah.

    A few more points on taxonomy.

    Shopping is said to be navigation. And, navigation is said to be taxonomy. So it’s only about terms when we think about keywords and metadata. There is a reality about decisions we make in real time lining out a taxonomy.

    Life is a taxonomy, a navigation. Make the good decisions and end up, hopefully and not macroeconomically, where you wanted to go.

    That taxon/infon terminology didn’t come from me. I found them on the EDS website where they were evolving an ecommerce practice.

    I’ve used a variant of them when I think about ontologies. Ontologies are related to taxonomies, but no, not the same thing. Ontologies are fuzzier. 🙂

    David Locke

    Here is something I ran across. Sorry, don’t know where.

    Controlled Vocabulary – All of the following:

    Authority – Preference

    Authority = Preferred Term + Variant

    Taxonomy – Hierarchical

    Taxonomy = Authority + Parent + Children

    Thesaurus – Associative

    Thesaurus = Taxonomy + Associated Terms
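The three "equations" above read naturally as layered record types, each adding fields to the one before. Here is a minimal sketch of that reading using Python dataclasses; the class and field names are made up, and the sample entry (including the "SF" variant and Santa Clara as an associated term) is illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AuthorityEntry:
    """Authority = preferred term + variants."""
    preferred: str
    variants: List[str] = field(default_factory=list)


@dataclass
class TaxonomyEntry(AuthorityEntry):
    """Taxonomy = authority + parent + children."""
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)


@dataclass
class ThesaurusEntry(TaxonomyEntry):
    """Thesaurus = taxonomy + associated terms."""
    associated: List[str] = field(default_factory=list)


# Illustrative entry: San Francisco has California as its broader term.
san_francisco = ThesaurusEntry(
    preferred="San Francisco",
    variants=["SF"],
    parent="California",
    associated=["Santa Clara"],
)
print(san_francisco.parent)  # California
```

Because each level inherits from the one below it, a thesaurus entry can be used anywhere an authority entry is expected, which matches the idea that all three are controlled vocabularies.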


    As to why anyone goes to the trouble? It is essential to find it before you can use it. Ambiguity abounds and the machines can’t handle it. For example,

    I looked at my demos today.

    What the heck does that mean? Demos as in multimedia presentations. Or, Demos as in demographics.

    Different scopes, different classifications, would have been needed to make the meaning clear. Read more on computational linguistics. They focus on ambiguity. They are not there yet.

    This stuff is for machines. One of the advantages of SGML was the ability to do semantic-based post processing of text. Text is still being treated as a blob. We do not have any understanding of the content of strings. All of this stuff is extrinsic. The next revolution will be the intrinsic understanding of text. But, to get there ambiguity will have to be conquered.

    Controlled Vocabularies in all their forms are extrinsic.

    Topic Maps are still extrinsic. They associate keywords to specific strings in the text using a link anchor. Many technologies today are about moving closer and closer to the intrinsic. The tag will, however, always be extrinsic.

    Topic Map = Thesaurus + RDF (or weaker URL)

    Scope is hierarchical. Scope would let us decide what demo meant in that particular context.

    If Scope(Marcom), then demo is unknown
    If Scope(Marketing Research), then demo is demographics
    If Scope(Selling), then demo is demonstration
    If Scope(Promotion), then demo is demonstration
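The scope rules above amount to a lookup keyed by both the word and the scope. The sketch below shows just that lookup in Python; it is not Topic Map or XTM syntax, the names `SCOPED_MEANINGS` and `meaning` are hypothetical, and it treats scopes as a flat set of labels rather than modelling the hierarchy mentioned above.

```python
# Word -> scope -> meaning, taken from the four rules above.
# Marcom is deliberately absent, so it falls through to "unknown".
SCOPED_MEANINGS = {
    "demo": {
        "Marketing Research": "demographics",
        "Selling": "demonstration",
        "Promotion": "demonstration",
    },
}


def meaning(word, scope):
    """Return what a word means within a scope, or "unknown" if no rule covers it."""
    return SCOPED_MEANINGS.get(word.lower(), {}).get(scope, "unknown")


print(meaning("demo", "Marketing Research"))  # demographics
print(meaning("demo", "Marcom"))              # unknown
```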

    Now, what I said about RDF above is complicated by the fact that XTM and RDF are complementary. Some members in the TopicMap community want to use XTM, instead of RDF.

    Ultimately, TopicMaps are meant for machine consumption.


    On getting dictionaries into databases, you need a license to do that unless you are writing your definitions from scratch. The dictionary publishers already have terminology databases. They should be selling their content as a webservice. Try WordNet.

    David Locke

    One point on the maintenance of context:

    Content management systems (CMS) deliver structured text. The text is stored in a blob in the database. Within that blob the content can be structured (tagged) or unstructured (untagged). If it is untagged, the CMS will still deliver it in a tagged form at the level of the container holding the text.

    If the CMS stores paragraphs, then you get paragraphs. If the CMS stores pages, then you get pages. The CMS delivers textual blobs at some container-based resolution.

    When you write structured text, you cannot assume that a specific container will follow or precede your text. There are no transitions in structured or non-linear text. So making the assumption that you can determine the context of the text from adjacent text doesn’t work. This is another reason why controlled vocabularies, taxonomies, and thesauri are important. It’s not just selling or purchasing or discoverability or wayfinding. It is meaning itself.

    Digital-to-analog processes take linear things like music and turn them into non-linear simulations of the real thing. The same is true of text on the web if it is delivered by a CMS.

    The larger the storage granularity of that text, the more likely you will be able to disambiguate, but at the paragraph, sentence, or sub-sentence level clarity becomes more difficult.

    The scope of a keyword in a Topic Map is trying to put back what the digitalization process removed.

