What is the actual utility of Big Data? The term as coined was deliberately ambiguous, perhaps in part because it was more marketing than a full realization that there has been a qualitative change in the nature of business data. Perhaps it is time to think about what the nature of all that big data really is.
One class of Big Data is streaming data – Twitter feeds, Facebook status messages, RSS feeds and so forth. Significantly, most of this streaming data is an artifact of a process – the process of running Twitter or Facebook, of running a blogging system or news aggregator. This is similar to another frequent use case of Big Data – analysing system logs, which are, in effect, records of what a particular system is doing. In this regard, these streams are effectively indications of changes of state of a particular system. To put this into programmer terms, such feeds are what are known as signals.
A signal indicates that there is a qualitative change in the state of a system.
Web programming makes use of signals all the time – a person clicks on a mouse, and the mouse in turn sends a signal (my mouse button has been pressed and released, or mouse_click). The mouse in this case is a simple system with a fairly minimal number of states – mouse_down, mouse_up, mouse_over, mouse_out, mouse_click, perhaps a dozen overall.
The application in turn will create an event trap that initiates a process – an event handler – that gets passed details about the event in question: where the mouse was, which button was pressed, which object within the system was top-most in the hierarchy. Events can be consumed (the buck stops here) or passed on to lower-level handlers, until the event is eventually consumed by the system itself.
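To make the mechanics concrete, here is a minimal Python sketch of an event chain with consumption. The handler names, event names, and event details are hypothetical, chosen only to mirror the mouse example above.

```python
# Minimal sketch of event propagation with consumption (hypothetical names).

class Event:
    def __init__(self, name, **details):
        self.name = name          # e.g. "mouse_click"
        self.details = details    # position, button, top-most target, etc.
        self.consumed = False

class Handler:
    def __init__(self, name, handles, consume=False):
        self.name = name
        self.handles = handles    # event names this handler cares about
        self.consume = consume    # "the buck stops here"

    def handle(self, event):
        if event.name in self.handles:
            print(f"{self.name} handled {event.name} at {event.details.get('pos')}")
            event.consumed = self.consume

def dispatch(event, handlers):
    """Pass the event down the handler chain until someone consumes it."""
    for handler in handlers:
        handler.handle(event)
        if event.consumed:
            return handler.name
    return "operating_system"     # unconsumed events fall through to the OS

chain = [
    Handler("button_widget", {"mouse_click"}, consume=True),
    Handler("window", {"mouse_click", "mouse_over"}),
]
winner = dispatch(Event("mouse_click", pos=(120, 45), button="left"), chain)
```

A `mouse_click` stops at the button widget, which consumes it; a `mouse_over` is observed by the window but, unconsumed, falls through to the system – which is exactly the fate of most events.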
The vast majority of all events in the system are never processed by any application except the operating system. This is an important point to remember, because it says a great deal about dealing with big data systems; to wit, most of the signals produced in a big data environment are noise, and will never otherwise be processed. One of the biggest problems most big data developers face is that they see this as a data storage problem, when in fact the goal is not to store everything, but rather to identify and capture (in as close to real time as possible) the fact that a significant signal occurred, and to act accordingly.
A good case in point is a stock market ticker, which is, at the end of the day, just another feed. Abstracted out, a stock ticker is a state machine that describes the price of a given stock at any moment in time. In a pure analytics approach, you would record the price change at every interval and then perform some kind of time series analysis on all of the data points.
However, while there is some value in this, most applications do not need all of that data – it’s noise. Instead, what most stock traders are looking for are inflection points – did the stock break a certain value on the upside (or downside), is the momentum of a stock increasing or decreasing, and so forth.
The overwhelming majority of all signals are (and should be) ignored by applications.
Put another way, a stock analyzer will maintain a record of perhaps a few dozen or even a few hundred samples, with such samples being taken at large enough intervals to smooth out noise (more frequent sampling is not always better sampling). Then the state machine will retain first, second, third and perhaps fourth order differentials – changes not only in state, but in how fast the state is changing (and in what direction), and in how fast that change of the change is happening (and in what direction) – enough to give a qualitative idea about the behavior of the stock.
With this information, there are signals that such a stock state machine can then emit: “I’m becoming more expensive”, “My value is dropping faster than normal”, “My price is decreasing, but more slowly than it was earlier”. These are more complex signals than those for the mouse (though not by all that much), but they are still signals for which event handlers can be written. The event handler gets additional information about the stock – its current price, its nth order derivatives, metadata about the stock (what its symbol is, for instance) – that can then be passed in for processing.
In most cases, the signal will be ignored – if the stock is range bound, buying or selling it will likely be a waste of money given transactional fees. If, however, the stock starts accelerating significantly (its second derivative suddenly kicks into overdrive), then a signal handler would invoke the BUY or SELL function, depending upon the direction of the first order derivative.
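The sampling, the nth order differentials, and the handler described above can be sketched in Python. The window size, the “overdrive” threshold, and the signal names are illustrative assumptions, not real trading parameters.

```python
from collections import deque

# Sketch of a stock state machine that retains a bounded sample window,
# computes finite-difference derivatives, and emits qualitative signals.

class StockStateMachine:
    def __init__(self, symbol, window=50):
        self.symbol = symbol
        self.samples = deque(maxlen=window)   # a few dozen samples, not every tick

    def sample(self, price):
        """Record one (suitably spaced) sample and return any emitted signals."""
        self.samples.append(price)
        d1 = self.derivative(1)   # velocity: direction of the price movement
        d2 = self.derivative(2)   # acceleration: change in that movement
        signals = []
        if d1 is not None and d1 > 0:
            signals.append("becoming_more_expensive")
        if d2 is not None and abs(d2) > 1.0:  # assumed "overdrive" threshold
            signals.append("accelerating")
        return signals

    def derivative(self, order):
        """Nth-order finite difference over the most recent samples."""
        diffs = list(self.samples)
        for _ in range(order):
            if len(diffs) < 2:
                return None
            diffs = [b - a for a, b in zip(diffs, diffs[1:])]
        return diffs[-1]

def on_signal(machine, signal):
    """Event handler: ignore range-bound noise, act only on acceleration."""
    if signal == "accelerating":
        action = "BUY" if machine.derivative(1) > 0 else "SELL"
        return f"{action} {machine.symbol}"
    return None   # most signals are, correctly, ignored
```

Feeding in a gently rising then sharply jumping price series triggers the acceleration signal, and the handler's BUY or SELL follows the sign of the first derivative.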
The same process can be run on aggregates, except that it is the aggregate behavior of the group that is emitting the signal, not any one particular member. This, in fact, is exactly what an index is – an aggregate of stocks that behaves very much like a single entity, or state machine. Such entities have the same memory stack, but require an intermediate level of processing (typically abstracted into some kind of object facade).
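One way to picture such an object facade, under the assumption of an equal-weighted basket (the class names and the rising-level signal are hypothetical):

```python
from statistics import mean

class MemberStock:
    def __init__(self, symbol):
        self.symbol = symbol
        self.price = None

class IndexFacade:
    """Facade treating a basket of stocks as one state machine: signals come
    from the aggregate level, never from any single member."""
    def __init__(self, name, members):
        self.name = name
        self.members = members
        self.history = []    # the aggregate's own memory stack

    def tick(self, prices):
        for member, price in zip(self.members, prices):
            member.price = price
        level = mean(m.price for m in self.members)   # equal weighting assumed
        self.history.append(level)
        if len(self.history) >= 2 and self.history[-1] > self.history[-2]:
            return f"{self.name}_rising"
        return None
```

Note that the index can emit a rising signal even while one member falls – the signal belongs to the aggregate, not to any individual stock.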
Business Intelligence systems should actually work the same way, though they frequently don’t. You can create a model of a system, operating close to real time, but the more complex that model, the more costly and more out of sync the model will be. On the other hand, in general what most people actually need out of a BI system is not a snapshot of the complex system but some way to have that system generate signals that can be captured and handled.
“Sales in the biotech sector increased 5% in three different regions”, “The USD-EUR exchange rate is in free fall”, “The cost of critical components from four different vendors has been increasing 2.5% m/m for the last three months”, “#myBigDataBizCo hashtag has appeared 45,000 times in the last 12 hours on Twitter, most with a negative sentiment bias”. The values matter, but what matters more are the trends, the inflection points, frequencies, and similar metrics. These are business signals.
In business, signals are very important – they represent that the status quo is changing, and your business needs to respond accordingly.
Now, what you do with those business signals is part of your business strategy, and is where your analysts earn their keep. If a supply chain is indicating fragility, do you broaden your base of suppliers or redesign your product away from the problematic sector? No BI tool can tell you that, though you can build decision support tools that will show what has or has not worked when certain signals occurred. However, even there, you have to have a way of generating a consistent set of signals so that you can, as much as possible, compare apples to apples.
A signals based approach to Big Data application development thus makes sense when the data in question is representative of a system’s state. The trick is in determining what the system is, but once you do that, and can have that system emit meaningful signals, then you’ve created a critical component in building a responsive system.
Of course, if you have one system that is a signal emitter and another that is a signal consumer (has event handlers), then you are building a more complex system that can intercommunicate. Such systems move “business intelligence” beyond “simply” being able to see the state of a business, into the realm of being truly responsive and adaptive to changing business conditions.
Kurt Cagle is the Founder and CEO of Semantical LLC, a semantics and enterprise metadata management company. He has worked within the data science sphere for the last fifteen years, with experience in data modeling, ETL, semantic web technologies, data analytics (primarily R), XML and NoSQL databases (with an emphasis on MarkLogic), data governance and data lifecycle, web and print publishing, and large scale data integration projects. He has also worked as a technology evangelist and writer, with nearly twenty books on related web and semantic technologies and hundreds of articles.