On the Myth of Metadata

Like many IT professionals, I spend a lot of time these days dealing with metadata. Or perhaps I should say “metadata” (with the quotation marks), because I’ve come to the conclusion that there’s no such thing as metadata, and the sooner we accept that assertion, the better off our data practices will be.

It has long been a truism that “one person’s data is another person’s metadata.” While brief and mildly witty, this glib assertion hints at an important truth: sometimes it’s the stuff we think of as metadata that we want to analyze, rather than the “operational” data that we normally think of as important. This is especially true in the era of big data, where a meta-analysis of all that stuff can yield important insights. A classic example is the now-defunct NSA telephone metadata analysis program. When we think of telephone surveillance, we think of listening to the conversation to find out what people are up to. Alice calls Bob and talks about plans to rob a bank; the police have tapped Alice’s phone and now they can thwart the robbery. But that’s a very precise bit of analysis, and it can only prevent the bank robbery if either Alice or Bob is already under enough suspicion for the police to get a warrant for the phone tap.

By contrast, what a metadata analysis program like the NSA’s would do is analyze the existence of a phone call instead of its content and try to relate that information to other known information to gain some insight into trouble that is brewing before there is enough evidence to involve the criminal justice system. For example, by analyzing phone logs we learn that Alice calls Bob, then Bob calls Carol, and some time later David calls Carol. This pattern occurs repeatedly, and every time it occurs there is a bank robbery within one week after David calls Carol. Based on this information, it is reasonable to conclude that Alice, Bob, Carol, and David may have something to do with the string of bank robberies. From then on, we can watch for that pattern of telephone calls, and the next time it occurs, alert local banks to the likelihood of a robbery within the following week.
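
To make that concrete, here is a minimal sketch in Python of what such a pattern scan might look like. Everything in it, from the names to the log format to the one-week window, is invented for illustration; a real program would run over billions of call records rather than a hand-typed list.

```python
from datetime import datetime, timedelta

# Hypothetical call-log records: (timestamp, caller, callee).
# Only who called whom, and when -- no call content is involved.
calls = [
    (datetime(2024, 3, 1, 9, 15), "Alice", "Bob"),
    (datetime(2024, 3, 1, 11, 40), "Bob", "Carol"),
    (datetime(2024, 3, 3, 16, 5), "David", "Carol"),
    (datetime(2024, 3, 4, 10, 0), "Eve", "Frank"),
]

# The sequence we have learned to associate with an impending robbery.
PATTERN = [("Alice", "Bob"), ("Bob", "Carol"), ("David", "Carol")]

def find_pattern(calls, pattern):
    """Return the time of the final call if the pattern occurs in order."""
    idx = 0
    last_time = None
    for when, caller, callee in sorted(calls):
        if (caller, callee) == pattern[idx]:
            last_time = when
            idx += 1
            if idx == len(pattern):
                return last_time
    return None

hit = find_pattern(calls, PATTERN)
if hit is not None:
    window_end = hit + timedelta(weeks=1)
    print(f"Pattern matched; elevated risk until {window_end:%Y-%m-%d}")
```

Note that the analysis never touches the content of a single call; the who-called-whom records are the data being analyzed.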

The same concept applies to any large data set. In the case of documents, perhaps we want to analyze how many documents the typical analyst produces in a year, or whether there are some months where document production is higher than other months, or any other metric that may come to mind.
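
As a sketch of that kind of metric, here is one way to count documents per author per calendar month; the records and field values are, of course, invented.

```python
from collections import Counter
from datetime import date

# Hypothetical corpus records: (author, creation date) for each document.
documents = [
    ("analyst_a", date(2024, 1, 12)),
    ("analyst_a", date(2024, 1, 30)),
    ("analyst_b", date(2024, 2, 3)),
    ("analyst_a", date(2024, 2, 17)),
]

# Documents produced per author, per month.
per_author_month = Counter((author, d.strftime("%Y-%m")) for author, d in documents)

for (author, month), count in sorted(per_author_month.items()):
    print(f"{author}  {month}  {count}")
```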

The problem is that for the past 15 or 20 years, we’ve been capturing and managing document metadata largely as separate XML files. This document metadata is often captured using the Dublin Core metadata specification or one of its derivatives, like the DoD Discovery Metadata Specification (DDMS). It is captured, stored, and managed separately from the documents it describes, and is often thought of as the means for discovering data. The trouble is that this is an old paradigm, based on a manual process that never imagined the information management power we take for granted today.
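
For readers who have never handled one, a stand-alone metacard in this style looks roughly like the following. This is a toy Dublin Core-flavored record, not a conformant DDMS document, read here with Python’s standard XML parser; note that everything it says about the document, including the author’s contact details, is just a string.

```python
import xml.etree.ElementTree as ET

# An illustrative "metacard": a separate record of strings describing
# a document that lives somewhere else entirely.
METACARD = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Quarterly Threat Assessment</dc:title>
  <dc:creator>Alice Example, +1-555-0100</dc:creator>
  <dc:date>2024-03-01</dc:date>
  <dc:subject>banking</dc:subject>
</metadata>
"""

DC = "{http://purl.org/dc/elements/1.1/}"
card = ET.fromstring(METACARD)
for element in card:
    print(element.tag.replace(DC, "dc:"), "=", element.text)
```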

Dublin Core is a product of the library science discipline, and is essentially an attempt to digitize the old-style library card catalog system. Traditionally, card catalogs indexed books by title, author, and subject (which is another term for “keyword”). Libraries did this because building an index by hand was the only option, and indexing the full contents of every book by hand was impossible. With modern technology, indexing the entire contents of a book is a trivial exercise, so maintaining a separate subject index might be convenient for some uses, but it is hardly necessary to find information within a large corpus of text data.
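
To underline how cheap full-text indexing has become, here is a toy inverted index in a few lines of Python. Production systems use engines such as Lucene, but the principle is the same: every word in the corpus becomes its own access point, with no hand-curated subject list required.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,;:!?")].add(doc_id)
    return index

# A toy corpus standing in for the full text of a document collection.
docs = {
    "doc1": "Metadata describes other data.",
    "doc2": "The card catalog indexed books by title, author, and subject.",
    "doc3": "Full-text indexing makes a separate subject index unnecessary.",
}

index = build_index(docs)
print(index["subject"])   # {'doc2', 'doc3'}
```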

By the same token, using linked data technologies like RDF, we don’t need to create and maintain separate databases of metadata elements that describe our data holdings in order to enable search and discovery. Instead, we can store all of the original information and link it to other data we maintain to provide an interconnected ecosystem of data that improves our ability to analyze and understand our ever-expanding data holdings. This will have the added benefit of reducing data redundancy and improving our ability to keep data updated.
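
As a rough sketch of what that looks like in practice, using the rdflib library and URIs that are purely illustrative, a document can point at the authoritative record for its author, and discovery becomes a query over the linked graph rather than a lookup in a separate metadata store.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

g = Graph()

# Hypothetical URIs; in a real deployment these would resolve to
# authoritative records in the enterprise's own data ecosystem.
report = URIRef("https://example.org/docs/report-42")
author = URIRef("https://example.org/people/alice")

# The document links to its author; the author's details live in one place.
g.add((report, DCTERMS.title, Literal("Quarterly Threat Assessment")))
g.add((report, DCTERMS.creator, author))
g.add((author, RDF.type, FOAF.Person))
g.add((author, FOAF.name, Literal("Alice Example")))
g.add((author, FOAF.phone, URIRef("tel:+1-555-0100")))

# Discovery is a SPARQL query over the linked data itself.
results = g.query("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?doc ?name WHERE {
        ?doc dcterms:creator ?person .
        ?person foaf:name ?name .
    }
""")
for doc, name in results:
    print(doc, name)
```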

One of the main problems with the Dublin Core / DDMS style of metadata management is that each “metacard” is a separate file, and its contents are a series of strings. So the information about a document’s author is just the name, phone number, etc. written out in each metacard. This makes it both difficult and expensive to update that information if some part of it changes. That is a serious problem in cases where contacting the original author of a document is important, such as intelligence analysis.

A better solution is to stop creating stand-alone files or databases that maintain this information, and instead link authoritative data repositories together. So instead of including the name and phone number of a document’s author in a metacard, insert a link to a single copy of the author’s contact information, such as an entry in an enterprise registry or a Friend-of-a-Friend (FOAF) file. That way, when the person’s phone number changes it only needs to be changed in one place and not in every metacard.
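
Continuing the illustrative rdflib sketch from above, the payoff is that any number of documents can reference the same authoritative record, so a changed phone number is exactly one update.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF

g = Graph()

# One authoritative record for Alice: the only place her details live.
alice = URIRef("https://example.org/people/alice")
g.add((alice, FOAF.name, Literal("Alice Example")))
g.add((alice, FOAF.phone, URIRef("tel:+1-555-0100")))

# Any number of documents simply point at that record.
for n in range(1, 4):
    g.add((URIRef(f"https://example.org/docs/report-{n}"), DCTERMS.creator, alice))

# When Alice's number changes, set() replaces the old foaf:phone triple,
# and every document that links to her sees the new value.
g.set((alice, FOAF.phone, URIRef("tel:+1-555-0199")))
```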

This same principle can be applied to nearly every type of data an enterprise deals with: instead of capturing and managing a new pile of stuff as “metadata,” just treat every piece of data that is captured as a first-class citizen of the data ecosystem and manage it as such.