Reconsidering Data

Master data management has firmly embedded itself into the collective leadership psyche. Well, the terminology has. But whether decision-makers really understand what they should be managing is another question entirely.

I spend a lot of time in data meetings, usually involving government organizations. The majority of the time is spent worrying about “data tagging.” The discussions focus on what should be tagged, which tags to use, etc. Some of these discussions have been going on for fifteen years or more. I’ve seen the same argument over the same issue run on for over five years with no decision. That’s just a waste of time; there’s no other word for it.

The problem is that we’re trying to manage the wrong things, and we’re doing that because of some unexamined assumptions. Understanding that, and thinking clearly about possible solutions, will require understanding a little bit about how we got here.

In 1994, the Dublin Core Metadata Initiative (DCMI) was established to define metadata standards for documents on the World Wide Web. In a nutshell, its goal was to define an XML version of a library card catalog so that documents could be electronically cataloged, simplifying search and retrieval.

Dublin Core was a good idea for its time. But its time has passed. Technology has passed it by, but too many data managers have not noticed. Think about it: library catalogs date as far back as 700 BC, and the modern card catalog as we used to know it dates to 1780.  Works were indexed by subject, author, and title. This made sense in the days when it was impossible to index the entirety of each work.

Modern technology has made the old-fashioned card catalog obsolete, replacing it with an electronic catalog. By the same token, the catalog structure itself is obsolete due to the march of technology. The card catalog was devised because it was impractical to create detailed indices of an entire work. The solution was to mine out specific search terms: title, subject, author, and important keywords in the text. Even into the early 2000s this made sense; scanning and indexing an entire work was impractical for large bodies of work. That is no longer the case. A document of a hundred pages can be fully indexed in less time than it takes to write this sentence. For documents that are reasonably well-structured, it is even feasible to identify the author and title.
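To make the point about full indexing concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only: the function names, the sample documents, and the crude word-splitting rule are my own assumptions, and a real search engine does far more (stemming, ranking, phrase queries).

```python
import re
from collections import defaultdict

WORD = re.compile(r"[a-z0-9']+")  # assumed, deliberately simple tokenizer


def build_index(documents):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in WORD.findall(text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical documents, invented for this example.
documents = {
    "report-1": "Quarterly budget review for the regional office",
    "report-2": "Regional flood response plan and budget estimate",
}
index = build_index(documents)
print(search(index, "budget regional"))
print(search(index, "flood"))
```

Every word in every document becomes a search term, so nobody has to decide in advance which keywords deserve a place on the card.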

And yet we persist in managing document metadata as if we were still beholden to a physical card catalog. Why?

It would be better to toss out, or drastically strip down, standards like Dublin Core and replace them with a more comprehensive approach that assumes the entire contents of a work will be fully indexed and electronically searchable (which is the current reality, whether we want to recognize it or not). Continuing to treat document metadata as something unique is the source of endless problems.

In reality, digital data only comes in one of two forms: as a binary object (a picture, a Word document, etc.) or as tuples. All the text inside a document can be treated as one or more tuples for purposes of managing data. Even if it is stored as a unit (i.e., a file containing text), tuples that describe that body of text can be easily mined out of it.
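As a rough sketch of what mining tuples out of a body of text might look like, the Python below turns a small document into (document, attribute, value) tuples. The layout it assumes (title on the first line, an optional "By ..." line for the author) is purely hypothetical, as are the function name and the sample text.

```python
import re


def mine_tuples(doc_id, text):
    """Return (doc_id, attribute, value) tuples describing a text document.

    Assumes the first non-empty line is the title and that the author,
    if present, appears on a later line of the form "By <name>".
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    tuples = []
    if lines:
        tuples.append((doc_id, "title", lines[0]))
    for line in lines[1:]:
        match = re.match(r"by\s+(.+)", line, flags=re.IGNORECASE)
        if match:
            tuples.append((doc_id, "author", match.group(1)))
            break  # take only the first matching line
    tuples.append((doc_id, "word_count", len(text.split())))
    return tuples


# Hypothetical sample document.
sample = """Flood Response Plan
By Alice Example

Crews will stage equipment at the depot."""

for row in mine_tuples("doc-1", sample):
    print(row)
```

The file is still stored as a unit, but everything we would otherwise have typed into a catalog record is derived from it on demand.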

Instead of managing a card catalog standard, we should be focused on managing tuples.


On the Myth of Metadata

Like many IT professionals, I spend a lot of time these days dealing with metadata. Or perhaps I should say “metadata” (with the quotation marks). Because I’ve come to the conclusion that there’s no such thing as metadata, and the sooner we accept that assertion the better off our data practices will be.

It has long been a truism that “one person’s data is another person’s metadata.” While both brief and mildly witty, this glib assertion hints at an important truth: sometimes it is the stuff we think of as metadata that we want to analyze, instead of the “operational” data that we normally think of as important. This is especially true in the era of big data, where a meta-analysis of all that stuff can yield important insights. A classic example of this is the now-defunct NSA telephone metadata analysis program. When we think of telephone surveillance, we think of listening to the conversation to find out what people are up to. Alice calls Bob and talks about plans to rob a bank; the police have tapped Alice’s phone and now they can thwart the robbery. But that’s a very precise bit of analysis that can only prevent the bank robbery if either Alice or Bob is already under enough suspicion for the police to get a warrant for the phone tap.

By contrast, what a metadata analysis program like the NSA’s would do is analyze the existence of a phone call instead of its content and try to relate that information to other known information to gain some insight into trouble that is brewing before there is enough evidence to involve the criminal justice system. For example, by analyzing phone logs we learn that Alice calls Bob, then Bob calls Carol, and some time later David calls Carol. This pattern occurs repeatedly, and every time it occurs there is a bank robbery within one week after David calls Carol. Based on this information, it is reasonable to conclude that Alice, Bob, Carol, and David may have something to do with the string of bank robberies. From that information, we can look for that pattern of telephone calls and infer that the next time it happens we should alert local banks to the likelihood of a robbery within the next week.
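The Alice, Bob, Carol, and David example can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of how any real program worked: the call log format (day, caller, callee), the day numbers, the robbery dates, and the function names are all invented for illustration.

```python
# The call sequence from the example, in the order it must occur.
PATTERN = [("Alice", "Bob"), ("Bob", "Carol"), ("David", "Carol")]


def pattern_completions(calls, pattern=PATTERN):
    """Return the days on which the full call sequence completes, in order.

    Calls that are not the next expected step are simply ignored.
    """
    completions = []
    step = 0
    for day, caller, callee in sorted(calls):
        if (caller, callee) == pattern[step]:
            step += 1
            if step == len(pattern):
                completions.append(day)
                step = 0
    return completions


def followed_by_event(completions, event_days, window=7):
    """Keep completions that had an event within `window` days afterwards."""
    return [
        day for day in completions
        if any(day < event <= day + window for event in event_days)
    ]


# Hypothetical call log: (day number, caller, callee).
calls = [
    (1, "Alice", "Bob"), (2, "Bob", "Carol"), (5, "David", "Carol"),
    (20, "Alice", "Bob"), (21, "Bob", "Carol"), (22, "Erin", "Frank"),
    (25, "David", "Carol"), (40, "Alice", "Bob"),
]
robberies = [9, 30]  # hypothetical days on which a bank was robbed

completions = pattern_completions(calls)
print(completions)
print(followed_by_event(completions, robberies))
```

Note that nothing here looks at what anyone said. The only inputs are the facts that calls took place, which is exactly the kind of data we usually file away as “metadata.”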

The same concept applies to any large data set. In the case of documents, perhaps we want to analyze how many documents the typical analyst produces in a year, or whether there are some months where document production is higher than other months, or any other metric that may come to mind.
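A minimal sketch of that kind of document analysis, assuming each document record carries nothing more than an author and a creation date. The records and function names are hypothetical.

```python
from collections import Counter
from datetime import date
from statistics import mean

# Hypothetical records: (author, creation date) for each document.
records = [
    ("Alice", date(2023, 1, 14)),
    ("Alice", date(2023, 3, 2)),
    ("Bob", date(2023, 3, 9)),
    ("Bob", date(2023, 3, 28)),
    ("Carol", date(2023, 11, 5)),
]


def documents_per_analyst(records):
    """Count how many documents each analyst produced."""
    return Counter(author for author, _ in records)


def documents_per_month(records):
    """Count how many documents were produced in each calendar month."""
    return Counter(created.month for _, created in records)


per_analyst = documents_per_analyst(records)
per_month = documents_per_month(records)

typical = mean(per_analyst.values())  # average documents per analyst
busiest_month, busiest_count = per_month.most_common(1)[0]

print(per_analyst)
print(f"typical analyst: {typical:.2f} documents")
print(f"busiest month: {busiest_month} ({busiest_count} documents)")
```

In this view the author and date fields are not a description of the data. They are the data being analyzed.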

The problem is that for the past 15 or 20 years, we’ve been capturing and managing document metadata largely as separate XML files. This document metadata is often captured using the Dublin Core Metadata specification or one of its derivatives like the DoD Discovery Metadata Specification (DDMS). It is captured, stored, and managed separately from the documents it describes, and is often thought of as the means for discovering data. But this is an old paradigm based on a manual process that never imagined the information management power we take for granted today.

Dublin Core is a product of the library science discipline, and is essentially an attempt to digitize the old-style library card catalog system. Traditionally, card catalogs indexed books by title, author, and subject (which is another term for “keyword”). Libraries did this because indexing the contents of every book had to be done by hand, which made it impossible in practice. With modern technology, indexing the entire contents of a book is a trivial exercise, so maintaining a separate subject index might be convenient for some uses but it is hardly necessary to find information within a large corpus of text data.

By the same token, using linked data technologies like RDF, we don’t need to create and maintain separate databases of metadata elements that describe our data holdings in order to enable search and discovery. Instead, we can store all of the original information and link it to other data we maintain to provide an interconnected ecosystem of data that improves our ability to analyze and understand our ever-expanding data holdings. This will have the added benefit of reducing data redundancy and improving our ability to keep data updated.

One of the main problems with the Dublin Core / DDMS style of metadata management is that each “metacard” is a separate file, and its contents are a series of strings. So the information about a document’s author is just the name, phone number, etc. written out in each metacard. This makes it both difficult and expensive to update that information if some part of it changes. That is a serious problem in cases where contacting the original author of a document is important, such as intelligence analysis.

A better solution is to stop creating stand-alone files or databases that maintain this information, and instead link authoritative data repositories together. So instead of copying the name and phone number of a document’s author into a metacard, insert a link to a single copy of the author’s contact information, such as an entry in an enterprise registry or a Friend-of-a-Friend (FOAF) file. That way, when the person’s phone number changes it only needs to be changed in one place and not in every metacard.
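Here is a small Python sketch of the difference, using plain dictionaries to stand in for an enterprise registry and a document store. The identifiers, names, and phone numbers are made up, and a real system would more likely use URIs and a linked data store than in-memory dictionaries.

```python
# Hypothetical authoritative registry: one record per person.
registry = {
    "person:alice": {"name": "Alice Example", "phone": "555-0100"},
}

# Documents hold a link (the registry key), not a copy of the contact details.
documents = {
    "doc-1": {"title": "Flood Response Plan", "author": "person:alice"},
    "doc-2": {"title": "Quarterly Budget Review", "author": "person:alice"},
}


def author_contact(doc_id):
    """Follow the document's author link to the single authoritative record."""
    return registry[documents[doc_id]["author"]]


print(author_contact("doc-1")["phone"])

# The phone number changes: update it once, in the registry.
registry["person:alice"]["phone"] = "555-0199"

# Every document that links to the author now resolves to the new number.
print(author_contact("doc-1")["phone"])
print(author_contact("doc-2")["phone"])
```

With string-based metacards, the same change would mean finding and editing every file that mentions the author.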

This same principle can be applied to nearly every type of data an enterprise deals with: instead of capturing and managing a new pile of stuff as “metadata,” just treat every piece of data that is captured as a first-class citizen of the data ecosystem and manage it as such.