On the Myth of Metadata

Like many IT professionals, I spend a lot of time these days dealing with metadata. Or perhaps I should say “metadata” (with the quotation marks). Because I’ve come to the conclusion that there’s no such thing as metadata, and the sooner we accept that assertion the better off our data practices will be.

It has long been a truism that “one person’s data is another person’s metadata.” While both brief and mildly witty, this glib assertion hints at an important truth: that sometimes it’s the stuff we think of as metadata that we want to analyze, instead of the “operational” data that we normally think of as important. This is especially true in the era of big data, where a meta-analysis of all that stuff can yield important insights. A classic example of this is the now-defunct NSA telephone metadata analysis program. When we think of telephone surveillance, we think of listening to the conversation to find out what people are up to. Alice calls Bob and talks about plans to rob a bank; the police have tapped Alice’s phone and now they can thwart the robbery. But that’s a very precise bit of analysis that can only prevent the bank robbery if either Alice or Bob is already under enough suspicion for the police to get a warrant for the phone tap.

By contrast, what a metadata analysis program like the NSA’s would do is analyze the existence of a phone call rather than its content, and relate that information to other known information to gain some insight into trouble that is brewing before there is enough evidence to involve the criminal justice system. For example, by analyzing phone logs we learn that Alice calls Bob, then Bob calls Carol, and some time later David calls Carol. This pattern occurs repeatedly, and every time it occurs there is a bank robbery within one week of David calling Carol. It is reasonable to conclude that Alice, Bob, Carol, and David may have something to do with the string of robberies, so the next time the calling pattern appears we should alert local banks to the likelihood of a robbery within the following week.
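
To make the idea concrete, here’s a toy sketch of that kind of pattern scan. The names, the log format, and the whole setup are invented for illustration; real traffic analysis is obviously far more sophisticated than a hard-coded sequence check.

```java
import java.util.List;

// A toy sketch of the pattern scan described above (names and log format are
// made up for illustration only).
record Call(String caller, String callee) {}

class PatternScan {
    // Returns true if Alice->Bob, then Bob->Carol, then David->Carol appear
    // in that order anywhere in a chronologically sorted call log.
    static boolean suspiciousPattern(List<Call> log) {
        int stage = 0;
        for (Call c : log) {
            if (stage == 0 && c.caller().equals("Alice") && c.callee().equals("Bob")) stage = 1;
            else if (stage == 1 && c.caller().equals("Bob") && c.callee().equals("Carol")) stage = 2;
            else if (stage == 2 && c.caller().equals("David") && c.callee().equals("Carol")) return true;
        }
        return false;
    }
}
```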

The same concept applies to any large data set. In the case of documents, perhaps we want to analyze how many documents the typical analyst produces in a year, or whether there are some months where document production is higher than other months, or any other metric that may come to mind.

The problem is that for the past 15 or 20 years, we’ve been capturing and managing document metadata largely as separate XML files. This document metadata is often captured using the Dublin Core Metadata specification or one of its derivatives like the DoD Discovery Metadata Specification (DDMS). It is captured, stored, and managed separately from the documents it describes, and is often thought of as the means for discovering data. The trouble is that this is an old paradigm based on a manual process that never imagined the information management power we take for granted today.

Dublin Core is a product of the library science discipline, and is essentially an attempt to digitize the old-style library card catalog system. Traditionally, card catalogs indexed books by title, author, and subject (which is another term for “keyword”). Libraries did this because indexing by hand was the only option, and hand-indexing the full contents of every book was impossible. With modern technology, indexing the entire contents of a book is a trivial exercise, so maintaining a separate subject index might be convenient for some uses, but it is hardly necessary to find information within a large corpus of text data.
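
To give a sense of how trivial it is: a toy inverted index, written here in Java purely for illustration, is only a few lines, and it indexes every word of every document rather than a hand-picked list of subjects.

```java
import java.util.*;

// A toy inverted index: maps each word to the set of documents containing it.
class TinyIndex {
    private final Map<String, Set<String>> index = new HashMap<>();

    // Split the document text into words and record each one.
    void add(String docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                index.computeIfAbsent(word, w -> new HashSet<>()).add(docId);
            }
        }
    }

    // Which documents contain this word?
    Set<String> find(String word) {
        return index.getOrDefault(word.toLowerCase(), Set.of());
    }
}
```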

By the same token, using linked data technologies like RDF, we don’t need to create and maintain separate databases of metadata elements that describe our data holdings in order to enable search and discovery. Instead, we can store all of the original information and link it to other data we maintain to provide an interconnected ecosystem of data that improves our ability to analyze and understand our ever-expanding data holdings. This will have the added benefit of reducing data redundancy and improving our ability to keep data updated.

One of the main problems with the Dublin Core / DDMS style of metadata management is that each “metacard” is a separate file, and its contents are a series of strings. So the information about a document’s author is just the name, phone number, etc. written out in each metacard. The problem is that this makes it both difficult and expensive to update that information if some part of it changes. This is a serious problem in cases where contacting the original author of a document is important, such as intelligence analysis.

A better solution is to stop creating stand-alone files or databases that maintain this information, and instead link authoritative data repositories together. So instead of including the name and phone number of a document’s author into a metacard, insert a link to a single copy of the author’s contact information, such as an entry in an enterprise registry or a Friend-of-a-Friend file. That way, when the person’s phone number changes it only needs to be changed in one place and not in every metacard.
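
As a rough sketch of what that looks like in RDF (using Apache Jena; the IRIs and the registry itself are invented for the example, and any triple store or FOAF file would serve just as well):

```java
import org.apache.jena.rdf.model.*;

// Sketch: the document points at one authoritative resource for its author
// instead of copying the author's name and phone number into a metacard.
public class LinkedAuthorExample {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();

        Property creator = m.createProperty("http://purl.org/dc/terms/creator");
        Property name    = m.createProperty("http://xmlns.com/foaf/0.1/name");
        Property phone   = m.createProperty("http://xmlns.com/foaf/0.1/phone");

        // One authoritative record for the author (e.g., an enterprise registry entry).
        Resource author = m.createResource("http://registry.example.org/people/jdoe")
                .addProperty(name, "Jane Doe")
                .addProperty(phone, "+1-555-0100");

        // The document merely links to it; if the phone number changes,
        // only the registry entry needs to be updated.
        m.createResource("http://docs.example.org/report-42")
                .addProperty(creator, author);

        m.write(System.out, "TURTLE");
    }
}
```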

This same principle can be applied to nearly every type of data an enterprise deals with: instead of capturing and managing a new pile of stuff as “metadata,” just treat every piece of data that is captured as a first-class citizen of the data ecosystem and manage it as such.

Initial Lessons Learned with BFO

Building an ontology using BFO was a bit tricky. It takes some time to get one’s head around the way BFO models the world. For example, there is a sharp distinction between information and its representation (as in, the same information could be contained in both a word processing document and a slide presentation). Some of this is made a little easier by using BFO as an upper-level ontology together with a mid-level ontology such as the Common Core Ontologies being developed for use by the US Army. (See this paper for a discussion of their application.)

But getting my head wrapped around how BFO represents information wasn’t the half of it.

Once I had what I believed was a good ontology in place, meaning that it passed the consistency checks of at least one reasoner (such as Fact++ or Pellet), I needed to add data to it. That’s where the real fun began. I have several tens of thousands of records representing ships and position information reports that need to be converted from the existing concept-based ontology into the BFO-based version (perhaps hundreds of thousands, I haven’t really counted).

To do the conversion, I’m using the Manchester OWL API. It’s robust and pretty straightforward, and I’ve used it for other projects, so it was a natural choice for me. (For anyone who objects that a Java application is awfully heavy for doing what is essentially text conversion, you’re probably right. But I’m not a real programmer; I only impersonate one from time to time when it’s necessary for a given task. And I’ve got deadlines to meet for the work project that I’m doing this for, so I went with what I know.)
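
Stripped of all the record-parsing, the core of the conversion boils down to something like the snippet below. The IRIs are placeholders rather than the real ontology; the point is just that the OWL API makes it straightforward to mint an individual and assert it into a class.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

// Sketch of the conversion step: create a named individual for a ship record
// and assert it as an instance of a class. IRIs below are placeholders.
public class AddShip {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = manager.getOWLDataFactory();

        String base = "http://example.org/ships#";  // placeholder namespace
        OWLOntology ont = manager.createOntology(IRI.create("http://example.org/ships"));

        OWLClass watercraft = df.getOWLClass(IRI.create(base + "Watercraft"));
        OWLNamedIndividual ship = df.getOWLNamedIndividual(IRI.create(base + "uss_example"));

        // Assert that the individual is an instance of the class.
        manager.addAxiom(ont, df.getOWLClassAssertionAxiom(watercraft, ship));
    }
}
```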

Actually adding the data revealed several inconsistencies in my ontology, all centered around that very explicit distinction between information and its representation. Thankfully, a colleague who is doing much of the development work on the Common Core Ontologies was able to help me out.

I’ve managed to update all of the data about individual ships to the new ontology; now I need to update the position information.


Investigating Basic Formal Ontology

Recently at work I learned of an initiative called the Basic Formal Ontology. It’s an interesting take on ontology development that has captured my interest, at least for the moment.

When I learned ontology development, I was taught a concept-centric approach. That is, the ontology developer creates classes that represent concepts of interest in the domain being modeled, creates data properties that define the attributes of each class, and creates object properties that describe the relationships between classes. To take a simple example, an ontology describing Navy ships might be broken down into combatant and noncombatant classes, and this is perfectly valid provided it meets the needs of the ontology developer and the eventual system user.

In contrast, BFO takes a very different approach. BFO is focused on modeling reality, that is, those things that objectively exist in time and space. If one wanted to model Navy ships, then the model might include watercraft or ship classes, but it would not include terms like “combatant” as classes, because “combatant” is a role that a ship assumes, not a thing in and of itself.
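
To make that concrete, here’s a rough sketch of how the role idea plays out in OWL. The class and property names are my own placeholders, not official BFO or Common Core terms; the point is that the ship bears a combatant role rather than being an instance of a Combatant class.

```java
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

// Sketch only: placeholder class and property names, not official BFO/CCO terms.
public class ShipRoleExample {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = manager.getOWLDataFactory();
        String ns = "http://example.org/navy#";  // placeholder namespace
        OWLOntology ont = manager.createOntology(IRI.create("http://example.org/navy"));

        OWLClass watercraft = df.getOWLClass(IRI.create(ns + "Watercraft"));
        OWLClass combatantRole = df.getOWLClass(IRI.create(ns + "CombatantRole"));
        OWLObjectProperty bearerOf = df.getOWLObjectProperty(IRI.create(ns + "is_bearer_of"));

        OWLNamedIndividual ship = df.getOWLNamedIndividual(IRI.create(ns + "uss_example"));
        OWLNamedIndividual role = df.getOWLNamedIndividual(IRI.create(ns + "uss_example_combatant_role"));

        manager.addAxiom(ont, df.getOWLClassAssertionAxiom(watercraft, ship));
        manager.addAxiom(ont, df.getOWLClassAssertionAxiom(combatantRole, role));
        // The ship bears the role; the role is a separate entity, not a subclass of the ship.
        manager.addAxiom(ont, df.getOWLObjectPropertyAssertionAxiom(bearerOf, ship, role));
    }
}
```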

It’s certainly a different way of looking at ontologies. Much of this is because the developer of BFO, Dr. Barry Smith, is a philosopher by trade and training, and not an information systems specialist. He runs the National Center for Ontological Research (NCOR), a project of the State University of New York at Buffalo. He specifically developed BFO to support scientific research, hence the focus on modeling reality instead of someone’s perception of a particular domain.

I’m not 100% certain that BFO is a better way than concept-based ontologies. But I think I see some intriguing possibilities, so I’m going to dig into it and see how things go. BFO has a lot of traction within the biomedical community, so that tells me someone thinks it has value. It’s a bit tricky to wrap one’s head around it at first, but now that I’m getting used to it I think I’m on the verge of actually being productive with it. An invaluable resource in understanding the BFO approach to ontology development is the book Building Ontologies with Basic Formal Ontology from MIT Press.

I guess I’ll see if BFO is worth the trouble or not.

Finally, an end. And perhaps, a new beginning…

It has been a long time since I’ve updated this blog, mostly due to frustration.

It took way too long, but I am finally finished. During most of last summer and early fall, I was growing very frustrated with my co-directors. I was practically begging for a date to defend my dissertation, but they seemed to be dragging their feet for some reason. Then, in late October, they suddenly decided it was time for me to finish. (I think a faculty meeting had something to do with it; I suspect the provost lit a fire under them because I had been in the program ten years.)

So the fall became a mad scramble to align schedules to arrange a pre-defense and then a defense. Sadly, one of my committee members (Dr. Andrew Sage) had passed away in the fall. This made my scheduling easier as I only had to juggle the schedules of three remaining committee members plus the Dean. But one of my other committee members was out of the country for almost three weeks, so that didn’t help.

In the end, I could not arrange to do my defense before the deadline for a fall 2014 finish, but I was able to arrange it for the last day before the University closed for the year.

On December 18, 2014, I successfully defended my dissertation, Method and Models to Enable Automated Optimal Service Composition. Both my children were able to attend because their schools had let out for the winter, and my wife and her mother were also there as well as several friends and colleagues.

It’s good to finally be done. To finally, after 10 long years, be “Dr. McDowall.”

So, now what? Well, there is still much to be done in my day job. But what about this site? I certainly don’t need to keep a dissertation journal any more. So I think I will follow in the footsteps of many others in this Internet age and convert it into a personal blog. Who knows, maybe some day it will lead to a paid writing gig.

Frustration

I prepared yet another revised introduction section for my advisers to review. I sent it to both the senior adviser and the junior adviser. The senior said he’d wait until I had made any additional changes the junior wanted before he commented on it. This was last week. I was supposed to meet with the junior adviser on Friday, 3/7, before he left town for the week.

He never contacted me to confirm a meeting time, but when I pinged him he said he’d call me Monday in the afternoon or early evening. He didn’t call, which left me in an extraordinarily foul mood. By Wednesday I had calmed down enough to contact him and very politely point out that he had not called. He apologized and stated it was because he was busy. Well, cry me a frickin’ river–I’m pretty damned busy too, but I make time to meet with him at his convenience. Anyway, he said he’d be back the following week but would be at NIST through Wednesday, and we could either meet Thursday or by phone before that. I told him I would be available whenever he was back and at almost any time for a phone call.

I still haven’t heard from him.

At this point, the calendar is starting to drive events. If I don’t get a pre-defense scheduled soon, I will run out of time to complete the pre-defense and the public defense before the end of the semester. That is why I was beyond irate at the junior adviser not calling when he was supposed to.

I am not sure what I will do if I do not make some very rapid progress very soon. I am giving very serious consideration to dropping out of the program if I cannot finish this semester. After nearly 10 years of work, and tens of thousands of dollars in tuition, that would be turning my back on a lot of hard work. But I feel like I’m being strung along with my professors having no regard for the amount of time and money I have sunk into this. They are used to dealing with graduate students who have to go through this entire gauntlet to meet the qualifications for getting a job as a professor. And that’s not what I’m doing this for.

I do know one thing: if I do leave the program, that departure will go through the dean’s office and quite possibly the university president’s office. I want them to understand that it’s not that I couldn’t hack it, but rather that I think they’re deliberately trying to milk me for all the tuition they can get from me.

Yet Another New Draft

Having met with both of my advisors almost two weeks ago and convinced the younger one that I’ve done enough work to claim victory, I’ve begun working with him on yet another draft of my dissertation. I won’t be throwing away all of my old work, but I am starting with a new outline and going from there.

The young guy is very much a stickler for getting the introduction and first couple of sections right, with the goal of convincing readers that I’ve got all my crap together and what follows will be a genuine contribution to the field.

I’ve been trying to meet with him weekly, or twice a week if I can, because there’s a LOT to get done if I am to complete my pre-defense by early April. Two (or was it three?) meetings got snowed out in the past couple of weeks, but I finally got to meet with him this past Monday. He had a lot of changes, almost a complete re-write of the introduction section, but it sounds like a good way to structure things.

I scheduled another meeting with him for yesterday (Thursday) afternoon to go over the re-written introduction. He clearly liked it a lot better than the previous version, but he still had more changes for me to work on. It’s a pain in some respects, but Susan informs me it was worth it because she went through the same process with him and had zero changes between her pre-defense and final defense. Hopefully she’s right.

In the meantime, I still need to schedule a date for my pre-defense. I sent a request for open dates to my committee, but I only got one reply. I guess I’ll just pick a date and see who objects.

Lots of Progress

I haven’t updated this in a while, but it’s not because I haven’t been making progress.

I spent the fall semester continuing to meet weekly with the same committee member. We made a tremendous amount of progress, successfully developing a service composition optimization model in OPL. He devoted so much time to it that my adviser suggested I consider asking him to be a co-adviser for my dissertation. I gather that being a dissertation adviser is something of a feather in a professor’s cap, and he did help me a lot, so it seemed like the right thing to do.

He agreed to be a co-adviser, which was the easy part. Now comes the hard part: getting two professors to A) agree that I have done enough to graduate, and B) agree on the content and format of my dissertation. I had already spoken to the more senior adviser at the end of the fall semester, and he agreed that I had enough work done. So we hashed out a dissertation outline, and over the break I started filling it in. After meeting with him again at the start of this semester, I got confirmation that I can finish this semester. (Frankly, I told him that this was my last semester one way or another. I’ve been at this for 10 years, and I’m broke and tired.)

I let the co-adviser know about this at the start of this semester, and his immediate comment was that we need to discuss how much I’ve done to see if I have made enough of a contribution for a dissertation. Let the fun begin.

Details, details, details…

Over the summer I’ve been meeting weekly with one of my committee members who is an expert on the subject of decision guidance systems. That is, helping someone make the best choice from among complex alternatives. I’m working with him to devise a way to select the best possible combination of services to complete a task when there are several valid combinations available. I’m basing the selection on a Quality of Service model I’ve developed that includes factors such as the cost to invoke a service, the time the service will take to complete, etc.

It’s been quite an interesting experience, though frustrating at times. The problem is that selecting the best combination of services is an NP-hard problem. In order to make an accurate selection in a reasonable time, we will be using an optimization engine developed by IBM that uses a language called OPL (Optimization Programming Language). To make use of that language, we need a precise model of the problem space.
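
For a sense of why a real solver is needed, here is a deliberately naive sketch, in Java rather than OPL and with an invented setup: it enumerates every combination of candidate services for a set of tasks and keeps the cheapest combination that meets a deadline. The number of combinations grows exponentially with the size of the process, which is exactly the problem.

```java
import java.util.*;

// Naive brute-force selection, purely to illustrate the combinatorial blow-up.
class BruteForceComposition {
    record Candidate(String name, double cost, double duration) {}

    // tasks: for each task, the list of candidate services that could perform it.
    // Returns the cheapest combination meeting the deadline, or null if none does.
    static List<Candidate> best(List<List<Candidate>> tasks, double deadline) {
        List<Candidate> best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        int[] choice = new int[tasks.size()];
        while (true) {
            double cost = 0, duration = 0;
            List<Candidate> current = new ArrayList<>();
            for (int i = 0; i < tasks.size(); i++) {
                Candidate c = tasks.get(i).get(choice[i]);
                current.add(c);
                cost += c.cost();
                duration += c.duration();   // assume tasks run sequentially
            }
            if (duration <= deadline && cost < bestCost) { bestCost = cost; best = current; }
            // Advance the "odometer" over all combinations of candidates.
            int i = 0;
            while (i < choice.length && ++choice[i] == tasks.get(i).size()) { choice[i++] = 0; }
            if (i == choice.length) return best;
        }
    }
}
```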

So over the summer we have been working to define precisely what is meant by a “process model” and a “service composition.” And this is precise in the mathematical sense: rigorously defined in such a way that the definitions can be evaluated for exact results. The difficulty is that any given service that is used in a process may be a “virtual service” — that is, a composition of services that is offered up as a pre-made package. (This is different from an atomic service, which is an offering that cannot be broken down into any smaller unit.)

In order to accurately calculate the quality of service for a given composition, we have to calculate the QoS for every constituent atomic service within all of the virtual services, and this must be decomposed recursively for all virtual services that may be invoked.

So we’ve had to create recursive definitions for a process and a service (as it turns out, any process is itself just a virtual service, so that simplified the problem a tiny bit). Once that was done, we had to develop recursive definitions for calculating the QoS of each service and rolling those up. Cost is easy, in that it’s a sum of the costs of each constituent service. But things like duration are more difficult, as some services may run in parallel.
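
In code form the recursion looks roughly like the sketch below (simplified, and again in Java rather than OPL): cost always sums over constituents, while duration sums for sequential constituents and takes the maximum over parallel ones.

```java
import java.util.List;

// Simplified sketch of recursive QoS roll-up for atomic and virtual services.
abstract class Service {
    abstract double cost();      // summed over constituents
    abstract double duration();  // sum or max, depending on sequencing
}

class AtomicService extends Service {
    private final double cost, duration;
    AtomicService(double cost, double duration) { this.cost = cost; this.duration = duration; }
    double cost() { return cost; }
    double duration() { return duration; }
}

class VirtualService extends Service {
    private final List<Service> parts;
    private final boolean parallel;
    VirtualService(List<Service> parts, boolean parallel) { this.parts = parts; this.parallel = parallel; }
    double cost() {
        // Cost always adds up, no matter how deep the nesting goes.
        return parts.stream().mapToDouble(Service::cost).sum();
    }
    double duration() {
        return parallel
            ? parts.stream().mapToDouble(Service::duration).max().orElse(0)  // parallel: longest branch
            : parts.stream().mapToDouble(Service::duration).sum();           // sequential: total
    }
}
```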

It’s been interesting and enlightening, but it has certainly been tedious. I am generally good about paying attention to details, but this has been a whole new kind of detail-oriented for me.

About the Picture

The picture across the top of this site is of the aircraft I flew in the Marines, a Sikorsky CH-53E Super Stallion, refueling in flight from a C-130 Hercules. In-flight refueling in a helicopter is a delicate operation. More than a few pilots have earned new call signs by cutting off the refueling probe. There’s a picture here that shows just how close the two aircraft are during the refueling process.

Waiting

I’ve been in a bit of a holding pattern for a month or so. I met with one of my committee members to discuss how to select an optimal service composition from among those available. As it turns out, a selection of this type is an NP-complete problem. Given that this committee member specializes in decision guidance systems, asking him for help on this task is a good idea.

I met with him and explained the sort of problem to him. I don’t think he realized how far along in my research I am (having an operating prototype and all), but after bringing my advisor into the discussion he seemed to get the idea that I was a little beyond the planning stages. Ultimately he asked for some formal definitions of the problem space (e.g., how I define a “business process”). So I produced those definitions and ran them by my advisor, who liked them.

Unfortunately, the decision guidance expert has been very busy lately. Heck, it seems like he’s always busy. He’s the committee chair for a friend, and she has a bear of a time cornering him and getting substantive feedback.

After a month of trying, I finally managed to get on his schedule for this Friday (he’s been out of the country for almost two weeks, so I can understand some of the delay). Hopefully, it being the summertime, he won’t have any other students clobbering his time when I get there for my 11 AM appointment. More importantly, hopefully he’ll have more for me than “refine these definitions blah blah blah.” I need some concrete direction so I can get this show on the road.