You, as a decision maker or implementer, are frequently bombarded with the message that the data in your organization is its lifeblood, and in the modern digital age, that data has immense value to your enterprise. That data can help to create a better understanding of the financial activity within your organization, and can help you better understand (and subsequently capture) customers, to the extent that one of the most popular new initiatives in the last couple of years has been a move towards a “customer 360” application that lets you see your customers (and potential customers) from many different perspectives. Or at least that’s the intent.
What is less well understood by managers, data analysts, and technologists, though, is the value of metadata. This is partly because, in a typical relational database, the metadata about any given table is largely structural and occupies a very tiny proportion of what is actually stored in the database. It is usually set up by a database administrator or perhaps a programmer, and consists of column (or field) names, a handful of simple structural types (strings, integers, floating-point numbers, currencies, dates, and so forth), some cardinality information (whether a property describes a required, optional, one-to-many, or zero-to-many association), and typically a primary or foreign key relationship to another table.
These are all forms of metadata – in this case they tell the database how the information is structured, but more importantly, this is data (types, cardinality, constraints) that describes some aspect of the data in the individual cells of each row of a given table. With relational databases, this metadata is practically invisible because it is so intimately connected with the mechanics of database storage. While you generally need to know a table’s column names to perform queries, such metadata is typically hidden beneath the facade of a user interface, or perhaps presented as a spreadsheet, and quite frequently only the programmer actually knows the deepest layers of metadata.
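As a minimal sketch of what this structural metadata looks like, here is the kind of table definition and introspection query a database exposes, using Python’s built-in sqlite3 module (the table and column names here are invented for illustration):

```python
import sqlite3

# Create an in-memory database with a small, hypothetical customer table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        id      INTEGER PRIMARY KEY,  -- key metadata
        name    TEXT    NOT NULL,     -- required-field (cardinality) metadata
        email   TEXT,                 -- optional field
        created DATE                  -- type metadata
    )
""")

# The database's own metadata: column names, types, nullability, keys.
for cid, name, ctype, notnull, default, pk in conn.execute(
        "PRAGMA table_info(customer)"):
    print(name, ctype,
          "required" if notnull or pk else "optional",
          "primary key" if pk else "")
```

Everything the loop prints is metadata, not data: it describes the cells a row may hold without containing any customer at all.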
So long as data remained within its respective databases, this intimate relationship between programmers and metadata wasn’t really that important to the average business user. However, about fifteen years ago, as data interchange became a reality with the emergence of XML (and later JSON), it became apparent that no two databases (and very few programs) arrived at the same definitions for things. This in turn spurred an interest in establishing standards for data interchange, which then led to the logical (if perhaps erroneous) belief that you could set up an enterprise data model, in which the whole organization used the same terms to describe the same things.
XML in particular caused a significant revolution. First, it let you attach attributes to elements (which correspond roughly to column names), providing additional information that qualifies each element’s value. This meant that you could add things like unit information, descriptions, and data sources – all things that were difficult to add to relational tables. Moreover, XML provided the first (and arguably best) way of creating separate documents, called schemas, that described the structure and constraints acting on a document. At first, most XML documents were intended to model, well, documents, such as web pages, books and so forth, but it didn’t take long before the idea of using XML to pass data structures took hold, as it solved one of the biggest issues with binary encoding schemes – the difficulty of specifying metadata. XML (and later JSON) schemas had two advantages: they provided a consistent language for describing structural metadata independent of the data itself, and because the schema document was independent, you only needed a single reference to its location for a validation tool to determine whether the information you were transmitting matched the schema.
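To sketch what attributes add, here is a single XML element carrying a value plus qualifying metadata, parsed with Python’s standard xml.etree module (the element and attribute names are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A value qualified by metadata: its units and where it came from.
doc = ET.fromstring(
    '<temperature units="celsius" source="weather-station-12">21.5</temperature>'
)

value = float(doc.text)           # the data itself
units = doc.get("units")          # metadata: unit information
source = doc.get("source")        # metadata: data provenance
print(value, units, source)
```

In a relational table, holding the same qualifiers would typically require extra columns or a separate lookup table; here they travel with the value itself.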
This approach, using what’s often called semi-structured data, also had a few other significant advantages over pure relational databases. First, you could encapsulate a data structure using containment, making it possible to store deeply nested information in a form that could be readily transmitted from computer to computer. Additionally, it laid the groundwork for the notion of global identifiers – data keys that existed outside the context of a specific database – based on an abstraction of the URL, which had been such a phenomenally successful mechanism for identifying resources globally on the web. Finally, it meant that you could look for patterns in the data based upon their structural relationships, even without necessarily knowing the semantics of specific tags, something facilitated on the XML side by an increasingly powerful transformation engine called XSLT.
At the same time, organizations had been grappling with the challenge of managing their growing number of digital assets. Most such digital asset management systems (DAMs) started out using the familiar paradigm of files and folders, but as the number of files climbed above a few hundred, file/folder organization began to prove less than satisfactory. The next major jump came with the use of search engines to locate documents, which helped considerably – but given that larger documents especially might contain a great deal of text, text search worked best when you could also apply some kind of relevance score to determine the degree to which a specific sequence of terms was relevant to the text overall.
Even so, straight text search by itself has significant limitations. Corpora of documents tend to be thematically similar, meaning that relevancy scores often provide only a very rough ranking of the likelihood of finding the document you are looking for. This approach also doesn’t do much good when it comes to images, video, or audio media. However, a more generalized keyword approach can work in those circumstances, where you attach specialized hashtags to each image, article, or related piece of content. These keywords can come from a specific taxonomy (an ordered grouping of tags) or can form an open-ended folksonomy. One of Twitter’s most significant innovations was the introduction of #hashtag entries that signalled to the Twitter database that those particular words were intended as privileged descriptors, a practice that Facebook, LinkedIn, Instagram, Pinterest, and many other social media sites have since adopted.
One benefit that hashtags offer as a way of providing metadata (and one that even today is not used to its fullest extent) is the ability to narrow down searches by querying against multiple hashtags. For instance, on LinkedIn (and other social media such as Twitter), I usually put together a list of at least five distinct tags for every article I write or link that I note in a post. I also add one more tag, #theCagleReport. By combining that last tag with something like #galaxy or #artificialIntelligence (to which I usually also add #AI as an alias), I can find all the content I’ve either written or posted about via a keyword search. Controlled taxonomies (especially ones that can handle synonyms and acronyms) can ensure that you don’t end up making your taxonomy matrix too sparse, though in my experience folksonomies – open taxonomies where terms can be added by anyone – are surprisingly effective at providing the same capabilities, especially if you have some way of managing term mergers in the background.
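This kind of multi-tag narrowing amounts to a set intersection over tagged posts. A minimal sketch (the posts and most of the tags here are invented for illustration):

```python
# Each post maps to the set of hashtags applied to it.
posts = {
    "post-1": {"#theCagleReport", "#AI", "#artificialIntelligence"},
    "post-2": {"#theCagleReport", "#galaxy", "#astronomy"},
    "post-3": {"#AI", "#ethics"},
}

def find(*tags):
    """Return the posts carrying every one of the given hashtags."""
    wanted = set(tags)
    return sorted(p for p, t in posts.items() if wanted <= t)

# One tag casts a wide net; adding a second narrows the result.
print(find("#theCagleReport"))
print(find("#theCagleReport", "#AI"))
```

Each additional tag in the query can only shrink the result set, which is exactly why a personal “signature” tag combined with a topic tag works as a precise filter.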
These techniques have helped to tame the digital asset space considerably, though even there that space is still evolving. Taxonomy management allows you to create more complex relationships. For instance, if I write an article about #galaxies, I may also want to indicate that #galaxies contain #stars and are contained by #galacticClusters, that all of these are covered under the domain of #astronomy, that #galaxies are also related to #darkMatter and #blackHoles, and that #MilkyWay, #Andromeda, and #Triangulum are all instances of #galaxies. This means that if I search for #MilkyWay, while I’ll pick up articles specific to our galaxy, I will also (at a lower level of relevance) pick up related articles on galaxies, on black holes, and so on.
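One way to sketch that broadening: walk outward from the search tag through the taxonomy’s relationships, discounting relevance with each hop. The relationship table below mirrors the astronomy example; the half-per-hop decay is an invented weighting, not a standard:

```python
# Directed relationships drawn from the astronomy example.
related = {
    "#MilkyWay": ["#galaxies"],                       # instance of
    "#galaxies": ["#stars", "#galacticClusters",
                  "#astronomy", "#darkMatter", "#blackHoles"],
}

def expand(tag, weight=1.0, decay=0.5, scores=None):
    """Score every tag reachable from `tag`, halving relevance per hop."""
    if scores is None:
        scores = {}
    if scores.get(tag, 0.0) >= weight:   # already reached by a better path
        return scores
    scores[tag] = weight
    for neighbor in related.get(tag, []):
        expand(neighbor, weight * decay, decay, scores)
    return scores

print(expand("#MilkyWay"))
```

A search for #MilkyWay thus scores galaxy-specific articles at full relevance, general #galaxies articles at half, and #blackHoles or #darkMatter articles at a quarter.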
It is here where taxonomies begin to turn into ontologies. A taxonomy, when you get right down to it, is a mechanism for the classification of information in a system. While relationships certainly exist within taxonomies, most tend to be implicit. An ontology, on the other hand, is a more formal way of describing taxonomies as well as other data structures. In particular, an ontology explicitly identifies all of those relationships (what are called predicates in ontology land). Taking the same statement, the emphasis shifts to the predicates themselves: #galaxies contain #stars and are contained by #galacticClusters, all of these are covered under the domain of #astronomy, #galaxies are also related to #darkMatter and #blackHoles, and #MilkyWay, #Andromeda, and #Triangulum are all instances of #galaxies. In general, ontologies identify classes, predicates, and attributes, and specify how each is related to the others through formal logic.
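Those same statements can be recast as subject-predicate-object triples, the way an ontology holds them. This is a hand-rolled sketch – a real system would use RDF and a triple store, and the predicate names here are invented – but it shows how explicit predicates make simple inference possible:

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("#galaxies",   "contains",    "#stars"),
    ("#galaxies",   "containedBy", "#galacticClusters"),
    ("#galaxies",   "inDomain",    "#astronomy"),
    ("#galaxies",   "relatedTo",   "#darkMatter"),
    ("#galaxies",   "relatedTo",   "#blackHoles"),
    ("#MilkyWay",   "instanceOf",  "#galaxies"),
    ("#Andromeda",  "instanceOf",  "#galaxies"),
    ("#Triangulum", "instanceOf",  "#galaxies"),
]

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# A simple inference: every instance of #galaxies is also in #astronomy.
for subj, _, _ in query(p="instanceOf", o="#galaxies"):
    triples.append((subj, "inDomain", "#astronomy"))

print(query(p="inDomain"))
```

The inference step is the payoff: because “instance of” and “in domain” are explicit, machine-readable predicates, new facts about #MilkyWay surface without anyone stating them directly.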
Ontologies are powerful in part because they specify the rules of a language, especially fabricated languages such as those used to describe business processes and entities. Identifying the entities involved in a book or website is by itself no small feat, but if you can also identify the relationships between those entities, then you bridge one of the most subtle yet significant divides in computer science – the difference between data and documents. This has several huge implications – you can treat blocks of narrative content (text documents) as data structures that provide actionable information a computer can understand. Similarly, you can go the other route and treat data as documents, ultimately making it possible to develop a narrative around any data, in effect letting that data tell its story in a human-readable form.
This in turn also plays out in one of the most difficult tasks in data management – dealing with the complexities of mapping properties from one database (or data source) to another. This is complicated by the fact that most relational data (and even a lot of semi-structured data, such as XML or JSON) lacks metadata that helps to identify key pieces of information – whether two properties describe the same thing, whether two enumerations (such as two different ways of describing astronomical information) actually refer to the same concept, and so forth. This also overlaps with areas such as master data management (where you are trying to determine whether two resources with different keys from different systems are in fact the same thing) and dimensional analysis (determining, when such information is not necessarily known, the scale of a given value in terms of given units).
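A toy sketch of such a mapping, with invented field names, showing the two problems together: renaming a property between systems, and the unit-scale conversion that dimensional analysis has to resolve when metadata is missing:

```python
# Source record from hypothetical system A; names and units are invented.
record_a = {"cust_name": "Acme Corp", "revenue_k_usd": 1250}

# Mapping metadata: for each source field, the target property name
# plus a value transform (here, thousands of dollars -> dollars).
mapping = {
    "cust_name":     ("customerName",     lambda v: v),
    "revenue_k_usd": ("annualRevenueUSD", lambda v: v * 1000),
}

# Apply the mapping to produce system B's view of the same entity.
record_b = {target: transform(record_a[source])
            for source, (target, transform) in mapping.items()}
print(record_b)
```

The point is that the mapping itself is metadata: once captured explicitly, it can be reviewed, versioned, and reused, instead of living as one-off conversion code scattered across integrations.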
In many respects, what has emerged in the last couple of years is an awareness that the problem domains of data and documents overlap to a significant degree, which means that solutions to these complex domain problems in one arena (text analysis) can be employed in the other (data analysis). What’s more, there’s a growing awareness that machine learning and ontology management also overlap; machine learning can go a long way towards identifying overlapping classifications and can consequently grow a unified taxonomy that can in turn be codified within an ontological approach.
The overall field for this form of metadata management is called semantics, and it covers everything from taxonomy and ontology design to master data and identity management, inferencing (surfacing new information based upon previously learned facts), text enrichment, natural language processing, and even certain parts of machine learning.
So having gone into detail about what exactly metadata is, the question comes back to a critical one for your company or agency – what does that metadata provide to you as the manager of a digital organization? What is the value proposition for metadata?
Metadata serves several purposes, each of which makes its own contribution to the value of a company:
- Better Organization and Search. Metadata makes it possible for everyone within an organization to find the data, whatever that data may be, within that organization. I think it can be safely said that most organizations have only a bare inkling about what data they have collected, let alone where it is located and who manages it. This provides value because it reduces duplication of effort and resources throughout your company or agency, it makes it easier to track who is responsible for the quality and timeliness of that data, it identifies potentially marketable content, and it identifies holes in your data strategy that can be filled with investment in quality data sources.
- Reduced Software Costs. One thing that getting a handle on your metadata will do is to significantly reduce your overall software bill. We live in an era where most data has been reduced to four primary formats: XML (including HTML), JSON, CSV (relational tables and Excel and similar spreadsheets), and RDF, with RDF being an abstraction layer that can manage the encoding of information to the other formats. RDF also provides a means to build a cohesive microservices strategy based upon resource types, instances and constraints.
- Improved SEO and Marketing. Metadata and semantics sit at the heart of contemporary SEO (search engine optimization). Most large-scale search engines (Google, Bing, Facebook, Amazon, Baidu, and many others) are now using semantic “rich snippets” to encode both data and structural and contextual metadata within web sites and mobile apps. This can go a long way towards turning search engines into product or service catalog servers.
- Richer Data Analytics. Most data scientists spend upwards of 90% of their time locating, cleansing, and reconfiguring data from various sources before they can even begin the process of analysis. Managing metadata directly improves the quality and fungibility of the data that metadata describes, meaning that the information generated by your data scientists and analysts is likely to be more accurate and timely, and those same people can spend more time actually mining the data for meaningful insights rather than compensating for bad data – and that means better operational information for future planning and decision making.
- Easier Governance. Metadata and data governance are heavily intertwined. Most data governance is focused on annotational data – provenance (where information comes from), dominance (what constitutes a primary or golden record), authoritativeness (who is responsible for ensuring the integrity of the data), definition (how exactly the information is defined, often including citational context), temporality (when the data was first known), and purpose (why the data was captured in the first place). Note the journalism-like focus of governance – who, what, when, where, why, and so forth. This isn’t accidental – governance is largely about telling the “story” of data and its relevance to the organization.
- Understanding the External Data Environment. Metadata management also involves understanding (and integrating) the external data environment: data from trading partners, sentiment analysis from social media, sensor data, government and regulatory data, and so forth. Through the use of emerging metadata standards such as schema.org, there’s a consensus slowly building about common structures – organizations, individuals, contracts and transactions, vehicles, biomedical information and so forth that increasingly gives an 80% solution for data interchange with the outside world.
The data within your organization is important, but the metadata within your organization is the language that it speaks to itself and to others, about the state of the company or agency, about the products that are created and the customers that buy them, and about the world that the organization exists in.