Big Data proponents seem to love water metaphors – data lakes, data streams, the velocity of data, the whole notion that data flows through your system seamlessly if only you use their specialized product du jour.
Take, for instance, the concept of data siloization. I come originally from the Midwest, where silos were generally full of corn and soybeans, so it took me a while to realize that data silos were actually being equated to water silos – if you could just break those silos down, then everything would be peachy keen, and the data could go on flowing again.
In reality, data silos represent three distinct challenges, only one of which is addressed by the new wave of data applications. The first problem is one of protocol – how do you represent your data? Is it in relational tables? XML? JSON? Key/value pairs? Is it held in Oracle SQL, Transact-SQL, MySQL or PostgreSQL? For the most part, these are questions of representation, and while they spawn fervent arguments in various camps, they are generally resolvable by creating translators between JSON and XML, or between normalized and denormalized renditions of tables coming from relational data systems.
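Such a translator can be quite simple in the flat case. The following is a minimal sketch, assuming a flat JSON record with no nesting; a production translator would also have to handle nested objects, arrays, attributes, and namespaces.

```python
# Minimal JSON-to-XML protocol translator (flat records only).
import json
import xml.etree.ElementTree as ET

def json_to_xml(json_text, root_name="record"):
    """Convert a flat JSON object into an XML string, one child per key."""
    root = ET.Element(root_name)
    for key, value in json.loads(json_text).items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")

print(json_to_xml('{"city": "Chicago", "zip": "60601"}'))
# <record><city>Chicago</city><zip>60601</zip></record>
```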
The second challenge, the harder one, is making narrative documents more readily queryable from a data perspective – trying to get significant nuggets of meaning or insight from such documents that can then be structured in a more computationally friendly form. This problem is solved more by text analysis tools, and generally requires mapping to some internal model (or ontology) in order to extract meaning.
The third challenge facing most organizations is the need to access information from one data system and utilize that data in another data system via a consistent mechanism. As enterprises become more service oriented, building a centralized service hub for data becomes increasingly desirable, and is a critical prerequisite to comprehensive analytics.
A Data Service Hub (DSH) is the realization of this third challenge. I’ve seen (and built) more than a few of these over the years, and find that there are always some misconceptions about what exactly such a hub is. Indeed, I’d argue that there are two principal kinds of data service hubs: common repositories and canonical (or semantic) repositories.
A common repository virtualizes the data from disparate data systems into a single repository that can be queried using a specific query language, typically with the expectation that the output will be in a variety of different protocols (XML, JSON, CSV, etc.). Common repository systems will usually create “graphs” or collections that echo the data model of the source systems.
These are weak hubs – beyond some minor exceptions to identify referential links between systems, there is no real attempt to identify common data structures between one graph and another. Common repositories are comparatively easy to write, and for many applications they are sufficient – they abstract the source data from its initial database into a common queryable format.
A semantic repository builds on a common repository – you need the first in order to have the second. In this case, once you have established a single internal representation of external database systems, there is usually a second “semantification” process that accomplishes several tasks.
The first is to map different tables and fields (or their JSON or XML equivalents) to a central canonical model. This canonical model is typically an enterprise-wide model in which the business objects of the enterprise are defined in relationship to one another (their entity-relationship (ER) map).
The second is to put into place an overall accessible logical model that can be used to create inferences describing subclasses and sub-properties, among other things. For instance, in such a model for a retail company, the canonical model could assert that a SKU is a subclass of a product ID, and could recognize that a shipping address and a billing address are both subclasses of a general address (and as such can use most of the same interfaces).
Additionally, such a canonical model makes it possible to organize and centralize controlled vocabularies. As a simple example, one database might identify the gender of a person through the notation “F” or “M”, while another may use a numeric code (1 or 2), and another may use (“woman” or “man”) or (“female” or “male”).
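In the simplest case, reconciling such a vocabulary amounts to a lookup table mapping every local code onto a single canonical concept. A sketch, assuming the numeric codes follow the ISO/IEC 5218 convention (1 = male, 2 = female) – the actual assignment would have to be confirmed per source system:

```python
def canonical_gender(raw):
    """Normalize a source-specific gender code to the canonical term."""
    codes = {
        "f": "female", "m": "male",          # single-letter codes
        "1": "male", "2": "female",          # numeric codes (ISO/IEC 5218 assumed)
        "woman": "female", "man": "male",    # word forms
        "female": "female", "male": "male",
    }
    return codes.get(str(raw).strip().lower())

print(canonical_gender("F"))   # female
print(canonical_gender(2))     # female
print(canonical_gender("man")) # male
```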
By concentrating on the conceptual meaning of these controlled vocabularies, a conceptual hub can recognize when such concepts are equivalent, so that when a person queries on “female owned businesses” the system is able to recognize that the querent is looking for a for-profit organization whose principal stakeholder (a person) is a woman.
Thus, a semantic hub is capable of making inferences about data without requiring a programmer to write a specific dedicated query for that particular request. Indeed, this ability to write and perform natural language queries against data is one of the big advantages of semantic data hubs.
Many data vendors tout schemaless databases as the big advantage of working with their products, but in reality all this means is that it is not necessary to establish a schema when the data is loaded. There is still a model there (even the contents of a novel have models; they just aren’t explicitly identified as such).
The problem with “schemaless” systems, though, is that schemas are not bad things. Schemas tell you not only basic things, such as whether 12345 is a number or a string of characters (which determines whether 1235 gets sorted ahead of 12345, as a number, or behind it, as a string), but also what the superstructures of that data are – an address, for instance, comprises several fields such as street, city and postal code, as well as references to states or countries. Schemas tell the computer, as well as human users, what constitutes valid or acceptable data and what relationships and properties mean in computational terms.
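The sorting point is easy to demonstrate – the same two values order differently depending on whether the schema says “string” or “number”:

```python
values = ["12345", "1235"]

# Lexicographic (string) order: compared character by character,
# so "12345" sorts before "1235" because '4' < '5' at position four.
print(sorted(values))           # ['12345', '1235']

# Numeric order: 1235 < 12345.
print(sorted(values, key=int))  # ['1235', '12345']
```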
To illustrate this, consider a data-centric document with the labels and data obfuscated:
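Something like the following, perhaps – a hypothetical record with nonsense labels and opaque values (the labels here are the ones used in the walkthrough further on; the field values other than the ID and the first date are illustrative inventions):

```python
# A hypothetical obfuscated record: nonsense labels, opaque values.
record = {
    "shadlu": {
        "goshen": 91205,
        "gushni": "Anna",
        "asjefrt": "03021939",
        "bsfjfrt": "11252014",
        "koshino": {
            "t1": "123 Elm St",
            "t2": "Springfield",
            "t3": "60629",
        },
    }
}
```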
That, in fact, is what a typical NoSQL data record looks like to a computer. There are a couple of relationships that can be inferred from hash structures, and others that could be reduced to 50/50 guesses, but this tells you next to nothing about what is being modeled or what rules are associated with that modeling.
The semantification process tries to determine a schema based upon a canonical model. For instance, through a number of different heuristics (as well as an occasional human assist), it may be able to determine that integers are actually pretty rare within databases, especially ones that don’t seem to repeat for a given field. This means that chances are pretty good that these integers are actually either primary or foreign keys, with the likelihood being high that primary keys will be the first column in a table while foreign keys are likely to come later (this isn’t always true, but it’s a good first guess).
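The key-spotting heuristic can be sketched in a few lines: flag any column whose values are all integers and never repeat (the column names here are hypothetical):

```python
def likely_keys(rows):
    """Return column names whose values are all unique integers."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    return [
        name for name, vals in columns.items()
        if all(isinstance(v, int) for v in vals) and len(set(vals)) == len(vals)
    ]

rows = [
    {"id": 1, "dept_id": 7, "age": 34},
    {"id": 2, "dept_id": 7, "age": 34},
    {"id": 3, "dept_id": 9, "age": 51},
]
print(likely_keys(rows))  # ['id'] – dept_id and age both repeat
```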
Some detective work can then be used to look at known structures in a canonical model that share certain similarities in links. Two dates are indicated if you assume that numeric strings which start or end with 19XX or 20XX indicate years, and dates typically tend to be paired as start and end dates. Depending on where the data comes from (and with a quick check of the data itself), this points to the other four digits being month and day respectively, with any value above 12 clarifying which is which, and cultural trends (the US is usually month first, Europe is usually day first) acting as backup.
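The first half of that heuristic – locating a plausible four-digit year at either end of an eight-digit string – is easy to sketch; disambiguating month from day (any value above 12 must be the day) would be a further step on top of this:

```python
import re

def guess_date(s):
    """Return (year, a, b), where a and b are the two remaining
    two-digit fields, or None if no plausible year is found."""
    m = re.fullmatch(r"(19\d\d|20\d\d)(\d\d)(\d\d)", s)
    if m:  # year-first layout, e.g. YYYYMMDD
        return m.group(1), m.group(2), m.group(3)
    m = re.fullmatch(r"(\d\d)(\d\d)(19\d\d|20\d\d)", s)
    if m:  # year-last layout, e.g. US MMDDYYYY or European DDMMYYYY
        return m.group(3), m.group(1), m.group(2)
    return None

print(guess_date("03021939"))  # ('1939', '03', '02') – month/day still ambiguous
print(guess_date("19390302"))  # ('1939', '03', '02')
```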
Other modeling patterns can also be used. When you have arrays of objects, this is usually a related set of like items having a compositional relationship – if you take out the parent, the child has no context to refer to (an address, for instance, is almost always compositional). Singleton pointers, on the other hand, are usually pointers to controlled vocabularies. Tables which consist of nothing but pointers are almost certainly join tables, establishing many-to-many (or in some cases many-to-one) relationships, especially when there’s no other qualifying metadata.
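A toy classifier along these lines might look as follows – the rules are heuristic guesses, not firm conclusions, and the category names are my own labels:

```python
def classify(value):
    """Guess the modeling role of a field value by its shape."""
    if isinstance(value, list) and all(isinstance(v, dict) for v in value):
        return "compositional"       # array of objects: contained parts
    if isinstance(value, int):
        return "vocabulary-or-key"   # singleton pointer: vocab term or key
    return "literal"                 # plain data value

print(classify([{"street": "Elm"}]))  # compositional
print(classify(7))                    # vocabulary-or-key
print(classify("Springfield"))        # literal
```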
The point of all of this is that such heuristics can narrow down the likelihood that a given structure is similar to a canonical structure, even without obvious textual semantics.
In a learning heuristic system, these would be presented to a human business analyst for confirmation, at which point the system would also associate identities – “shadlu” corresponds to “person”, “goshen” would be a person ID for that record, “gushni” would be a given name (a comparison against a comprehensive list of first names could confirm this), “asjefrt” would be “birth date”, “bsfjfrt” would be “death date”, and “koshino” would be an address structure. There would likely be some transformation involved – lexical dates may be converted to ISO 8601 date formats, for instance.
The canonical model would also be updated to recognize that, in this particular database, these terms (or minor variations of them) have equivalent meanings – at that point the data hub has established a stored map of the associations, and the presence of those specific terms elsewhere might indicate not only a single field equivalence but enough context to circumvent additional processing. Put another way, semantic data hubs have the ability to learn other databases over time.
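For the running example, the learned map could be stored as simply as a dictionary from source terms to canonical names (the canonical names here are assumptions on my part):

```python
# Analyst-confirmed field map: obfuscated source terms -> canonical names.
FIELD_MAP = {
    "shadlu": "person",
    "goshen": "personID",
    "gushni": "givenName",
    "asjefrt": "birthDate",
    "bsfjfrt": "deathDate",
    "koshino": "address",
}

def translate(record):
    """Rename known source fields to their canonical equivalents,
    passing unrecognized fields through unchanged."""
    return {FIELD_MAP.get(k, k): v for k, v in record.items()}

print(translate({"gushni": "Anna", "asjefrt": "03021939"}))
# {'givenName': 'Anna', 'birthDate': '03021939'}
```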
The other role of the canonical model is to create a consistent mechanism for identifying relationships that would make querying this data easier. For instance, such a canonical model could indicate that shadlu:91205 is of type Person, and that the assertion
shadlu:91205 model1:asjefrt “03021939”
is the same as
person:91205 canon:birthDate "1939-03-02"^^xs:date
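That rewrite can be sketched as a small transformation step – the subject, predicate and namespace names come from the example above, while the conversion logic itself is a simplified assumption (a real hub would drive this from the learned map rather than hard-coding one predicate):

```python
from datetime import datetime

def canonicalize(subject, predicate, value):
    """Rewrite one source assertion into its canonical form."""
    if predicate == "model1:asjefrt":  # learned: this field is a birth date
        # Source stores lexical MMDDYYYY; emit an ISO 8601 typed literal.
        iso = datetime.strptime(value, "%m%d%Y").date().isoformat()
        return (subject.replace("shadlu:", "person:"),
                "canon:birthDate",
                f'"{iso}"^^xs:date')
    return (subject, predicate, value)

print(canonicalize("shadlu:91205", "model1:asjefrt", "03021939"))
# ('person:91205', 'canon:birthDate', '"1939-03-02"^^xs:date')
```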
This means in practice that rather than needing to know relationship names or field names for every data source, a person need only query with the canonical relationships and field names, regardless of source. This in turn makes it much easier to query across multiple databases, makes it easier to combine information from multiple sources and provides for the kind of natural language queries that are the holy grail of information seekers.
So given this, which is the best strategy to pursue for data hubs? This depends in part upon the degree to which the data hub represents a single ontological space – in effect, how comprehensive the data model is intended to be. A health care ontology will perforce focus far more upon modeling diseases, conditions, physiology and medical history, while a health insurance ontology will be much more likely to concentrate on claims, memberships, plans and providers.
When you have a single focus, a canonical model is generally more feasible, while if you’re creating a general hub without clear ideas about who will use it or how, it is usually better to start with a slightly enriched common repository that can then be opened up to handle multiple models as the data itself makes the relationships obvious.
This in turn leads to one of the biggest advantages of semantic canonical hubs: they are learning systems. Designed properly, semantic hubs can expand as more systems are brought on board, can be changed to reflect changes in the understanding of the model (which is not always obvious at the beginning of a project), and can in turn identify areas where such ontologies overlap in significant ways (such as care providers in the health care/health insurance ontologies).
I’ll be exploring semantic data hubs in far more detail in subsequent articles.
Kurt Cagle is the Founder and CEO of Semantical LLC, a semantics and enterprise metadata management company. He has worked within the data science sphere for the last fifteen years, with experience in data modeling, ETL, semantic web technologies, data analytics (primarily R), XML and NoSQL databases (with an emphasis on MarkLogic), data governance and data lifecycle, web and print publishing, and large scale data integration projects. He has also worked as a technology evangelist and writer, with nearly twenty books on related web and semantic technologies and hundreds of articles.