Understanding metadata in Digital Asset Management

Digital Asset Management, or DAM for short, is not as easy to understand as many people think. Understanding a DAM project starts with understanding the difference between assets and metadata. The assets are easily understood: a PDF, a video clip, an image. It's the metadata that makes people's eyes glaze over.

"Metadata is data about data". How often have you read this definition? It's the right description, but unfortunately it's also too general to be useful. Metadata can be subdivided into categorisation or taxonomy and "others".

An example tells more than theory, so here goes: you're a classical music publisher and you're digitising your entire collection of recordings from all over the world. You're using a DAM for that.

The first thing you will need is an understanding of the difference between metadata and taxonomies. Metadata implies that assets are tagged with labels. Any attribute or element that helps to define or describe a particular image, document, presentation or spreadsheet is considered metadata. Some metadata is generated by a computer. If you add a song from a ripped CD to iTunes, you can opt to have iTunes look up its tags in the Gracenote CDDB, a CD metadata database built largely from user submissions.

Taxonomy is the practice and science of classification. A taxonomy, or taxonomic scheme, is a particular classification. Defining and using a taxonomy allows users to categorise assets. Ideally, they will do so using a controlled vocabulary. A controlled vocabulary is a set of terms that has been authorised for use by a librarian, a database manager, or some other person, preferably at senior management level (at least for this task).
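To make the idea of a controlled vocabulary concrete, here is a minimal sketch in Python. The field, the list of terms and the function name are assumptions made up for this illustration; they do not come from any particular DAM product.

```python
# Hypothetical controlled vocabulary for a "genre" field. The terms are
# illustrative only and would normally be maintained by the librarian.
APPROVED_GENRES = {"Oratorio", "Symphony", "Concerto", "Chamber music"}


def validate_term(term: str, vocabulary: set[str]) -> str:
    """Return the term unchanged if it is authorised, otherwise raise."""
    if term not in vocabulary:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    return term
```

Calling validate_term("Oratorio", APPROVED_GENRES) passes the term through, while an unauthorised term such as "Oratorium" is rejected before it reaches the database.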

Some best practices exist for working with a taxonomy. A taxonomy should have a specific use and a logical hierarchy. It should be easy to understand for users in different divisions or departments. It should conform to other published taxonomy standards when possible (e.g. those defined by Taxonomy Warehouse), not overlap with other defined metadata, and be as free of acronyms and abbreviations as possible. If possible, it should also not nest deeper than five levels.
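The five-level guideline is easy to check automatically. The sketch below assumes a taxonomy stored as nested Python dictionaries, where an empty dictionary marks a leaf; that representation and the sample categories are hypothetical and chosen only to keep the example short.

```python
MAX_LEVELS = 5  # the nesting limit suggested above


def depth(taxonomy: dict) -> int:
    """Count the levels of a taxonomy given as nested dicts (leaf = {})."""
    if not taxonomy:
        return 0
    return 1 + max(depth(children) for children in taxonomy.values())


def within_limit(taxonomy: dict, limit: int = MAX_LEVELS) -> bool:
    """True if the taxonomy does not nest deeper than the limit."""
    return depth(taxonomy) <= limit


# Hypothetical taxonomy for the classical music publisher.
recordings = {
    "Classical": {
        "Choral": {"Oratorio": {}, "Mass": {}},
        "Orchestral": {"Symphony": {}},
    }
}
```

Here depth(recordings) is 3 (Classical, then Choral or Orchestral, then the leaf categories), so the taxonomy stays well within the limit.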

Machines can use metadata to act upon your assets. For example, if a metadata field changes when an asset has been downloaded to a local workstation, it can be used as a trigger for an accounting system so the download can be properly charged for.
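A minimal sketch of such a trigger follows. It assumes a hypothetical Asset class whose metadata fields notify registered listeners when they change, and a hypothetical "download_count" field; a real DAM would expose its own event or webhook mechanism, and the ledger list merely stands in for the accounting system.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Listener = Callable[["Asset", str, Any, Any], None]


@dataclass
class Asset:
    asset_id: str
    metadata: dict[str, Any] = field(default_factory=dict)
    listeners: list[Listener] = field(default_factory=list)

    def set_field(self, name: str, value: Any) -> None:
        """Store a metadata value and notify listeners if it changed."""
        old = self.metadata.get(name)
        self.metadata[name] = value
        if old != value:
            for listener in self.listeners:
                listener(self, name, old, value)


def make_billing_listener(ledger: list) -> Listener:
    """Build a listener that records a charge whenever the (hypothetical)
    download_count field changes. The ledger stands in for accounting."""
    def on_change(asset: Asset, name: str, old: Any, new: Any) -> None:
        if name == "download_count":
            ledger.append((asset.asset_id, new))
    return on_change
```

To use it, append make_billing_listener(ledger) to an asset's listeners; every later change to download_count then adds one entry to the ledger, while changes to other fields are ignored.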

Taxonomy allows for the same, but at the level of whole categories rather than individual fields.

The problems with metadata

Metadata problems occur when different people are allowed to fill in the fields as they see fit. A controlled vocabulary does not solve this entirely. It only enforces the use of the same terms in specific fields. No system exists that can check whether the term you enter in a field is an accurate description of the field's topic.

For example, and sticking to the music business, the Gracenote CD database contains several different entries for the same "Album" field, all of them containing the same literal words but in a different order. As a music consumer, I'll import the publisher's Elgar, The Apostles CD album into iTunes only to find the tracks of the two original CDs scattered all over the place, with one track residing (according to the database) on a completely different disc.

This particular problem can be solved by throwing technology at it (intelligence or, more specifically, linguistics), but with other fields the solution may be less obvious.
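For the word-order case, even a crude normalisation goes a long way. The sketch below is a simple stand-in for the linguistic approach, not a description of what Gracenote or iTunes actually do: it treats two album titles as the same when they contain the same words, regardless of order, case or punctuation. The sample titles are invented.

```python
import re
from collections import defaultdict


def album_key(title: str) -> tuple[str, ...]:
    """Order-insensitive key: lower-cased words, sorted alphabetically."""
    words = re.findall(r"\w+", title.lower())
    return tuple(sorted(words))


def group_albums(titles: list[str]) -> dict[tuple[str, ...], list[str]]:
    """Group titles that consist of the same words in any order."""
    groups: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for title in titles:
        groups[album_key(title)].append(title)
    return dict(groups)
```

With this key, "Elgar: The Apostles" and "The Apostles, Elgar" end up in the same group. The obvious weakness is that two genuinely different albums whose titles happen to use the same words would be merged as well, which is why this only works for fields where that risk is small.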

The only real solution therefore lies on the input side. If you let users of the DAM system fill in descriptive fields by typing free text, they will often make mistakes that are hard to track, and just as often they will type in slightly different content. For new assets, therefore, you must find a way to beat them to it. That's where the manager (e.g. a librarian) comes into play. As soon as our music publisher has completed a new album that's ready to go on sale, the system admin will type in the exact description and/or the allowed deviations.

Ideally, the system will be implemented in such a way that only those descriptions are offered as choices in the data entry field. This will work quite well within organisational boundaries. However, it won't work with external databases such as Gracenote, where everybody does as they like.
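One way to picture this is a lookup table that the admin fills in once per album and that drives the entry form. The sketch below is hypothetical: the table, the sample descriptions and the function names are assumptions for illustration only.

```python
from typing import Optional

# Hypothetical table maintained by the system admin: every allowed entry
# (the exact description or an allowed deviation) maps to the exact one.
APPROVED = {
    "Elgar: The Apostles": "Elgar: The Apostles",   # exact description
    "The Apostles (Elgar)": "Elgar: The Apostles",  # allowed deviation
}


def picklist(approved: dict[str, str]) -> list[str]:
    """The descriptions a user may choose from in the entry form."""
    return sorted(set(approved.values()))


def resolve(entry: str, approved: dict[str, str]) -> Optional[str]:
    """Map an entry to its exact description, or None if not allowed."""
    return approved.get(entry.strip())
```

A form built on picklist(APPROVED) only ever offers the exact description, and resolve() lets imports that carry an allowed deviation be stored under that same description. Anything else returns None and can be sent back to the librarian.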

The problem is aggravated by the use of the popular metadata type known as "tags". Tags can be anything, from one-word to whole-sentence labels. The original idea was to have keywords that search engines can find. Practice is different. Tags are a nightmare if you don't impose restrictions, and most sites where you can enter them don't.

Perhaps tags should be avoided in a DAM system because they add little value. One of the disadvantages of tags is their inherent lack of context. A tag doesn't convey any context at all. An article tagged with the keyword "Red" can mean different things that only become clear after reading the article. By then, the time you hoped to save by using a DAM system in the first place has vanished into thin air.

For those who think the tag "Red" refers to a colour, I might add that Red may also refer to a company that manufactures cameras with which blockbuster movies have been shot.