Reconsidering Data

Master data management has firmly embedded itself into the collective leadership psyche. Well, the terminology has. But whether decision-makers really understand what they should be managing is another question entirely.

I spend a lot of time in data meetings, usually involving government organizations. The majority of the time is spent worrying about “data tagging.” The discussions focus on what should be tagged, which tags to use, etc. Some of these discussions have been going on for fifteen years or more. I’ve seen the same argument over the same issue run on for over five years with no decision. That’s just a waste of time; there’s no other word for it.

The problem is that we’re trying to manage the wrong things, and we’re doing that because of some unexamined assumptions. Understanding that, and thinking clearly about possible solutions, will require understanding a little bit about how we got here.

In 1995, the Dublin Core Metadata Initiative (DCMI) was established to define metadata standards for documents on the World Wide Web. In a nutshell, its goal was to define an XML version of a library card catalog so that documents could be electronically cataloged, simplifying search and retrieval.

Dublin Core was a good idea for its time. But that time has passed: technology has moved on, and too many data managers have not noticed. Think about it: library catalogs date as far back as 700 BC, and the modern card catalog as we used to know it dates to 1780. Works were indexed by subject, author, and title. This made sense in the days when it was impossible to index the entirety of each work.

Modern technology has made the old-fashioned card catalog obsolete, replaced by an electronic catalog. By the same token, the catalog structure itself is now obsolete. The card catalog was devised because it was impractical to create detailed indices of an entire work; the solution was to mine out specific search terms: title, subject, author, and important keywords from the text. Even into the early 2000s this made sense, because scanning and indexing an entire work was impractical for large collections. That is no longer the case. A document of a hundred pages can be fully indexed in less time than it takes to write this sentence. For documents that are reasonably well structured, it is even feasible to identify the author and title automatically.
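To make that concrete, here is a minimal sketch in Python of fully indexing a document's text. The file name and document identifier are hypothetical, and the tokenization is deliberately crude; the point is simply that every word becomes searchable, not just the handful of terms a cataloger chose to record.

```python
from collections import defaultdict
import re

def build_index(doc_id, text):
    """Map every term in the document to the positions where it occurs."""
    index = defaultdict(list)
    for position, term in enumerate(re.findall(r"\w+", text.lower())):
        index[term].append((doc_id, position))
    return index

# Indexing the full text of a document is trivial work for a modern machine.
# "report.txt" and "report-001" are stand-ins for a real file and identifier.
with open("report.txt", encoding="utf-8") as f:
    index = build_index("report-001", f.read())

print(index["metadata"])  # every position where the word "metadata" appears
```

A real search engine adds stemming, ranking, and storage, but none of that changes the basic economics: the whole work can be indexed, so there is no need to pre-select a few catalog fields.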

And yet we persist in managing document metadata as if we were still beholden to a physical card catalog. Why?

It would be better to toss out, or drastically strip down, standards like Dublin Core and replace them with a more comprehensive approach that assumes the entire contents of a work will be fully indexed and electronically searchable (which is the current reality, whether we want to recognize it or not). Continuing to treat document metadata as something unique is the source of endless problems.

In reality, digital data comes in only one of two forms: as a binary object (a picture, a Word document, etc.) or as tuples. All the text inside a document can be treated as one or more tuples for purposes of managing data. Even if it is stored as a unit (i.e., a file containing text), tuples that describe that body of text can be easily mined out of it.
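As a rough illustration, here is a sketch of mining descriptive tuples out of a body of text. The document identifier and the particular attributes chosen are illustrative assumptions, not a proposed standard; what matters is the shape of the result, plain (subject, attribute, value) tuples that any tuple store or relational table can hold alongside the rest of our data.

```python
import re

def describe(doc_id, text):
    """Mine descriptive tuples out of a body of text. Each fact about the
    document becomes a plain (subject, attribute, value) tuple, the same
    shape as any other row of data we already manage."""
    words = re.findall(r"\w+", text)
    first_line = text.strip().splitlines()[0] if text.strip() else ""
    return [
        (doc_id, "word_count", len(words)),
        (doc_id, "distinct_terms", len({w.lower() for w in words})),
        (doc_id, "first_line", first_line),  # often the title in well-structured text
    ]

# "memo-042" is a hypothetical identifier; any text source would do.
print(describe("memo-042", "Reconsidering Data\nMaster data management has ..."))
```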

Instead of managing a card catalog standard, we should be focused on managing tuples.