Semantics of Social Media

I’ve been busy recently on a number of social media platforms – LinkedIn, Facebook, Twitter, Pinterest, Instagram, Tumblr and others. Part of this comes from setting up my own business (and my proclivity towards blogging, which has become a passion of mine). However, another part is due to my interest in the Semantic Web, and the dawning realization that all of these social media networks are becoming, in effect, a dynamic semantic web.

Hashtags as Semantic Concepts

People working in the Big Data space know full well the power of activity feeds. A significant percentage of Hadoop projects that are not database conversions involve pulling in data feeds from these platforms in order to create an ad-hoc sentiment analysis base. Hashtags have been creeping into various sites that don’t actually make any use of them (though most are now starting to because of the practice), because a whole generation of Twitter users now uses them to push their longer works into Twitter’s “news” environment. In effect, hashtags, person identifiers and shortened links are establishing a new paradigm for information organization.

Consider Twitter, which began to establish the practice as a way to get past the limit of 140 characters. The earliest Twitter feeds actually did nothing with hashtags (or #hashtags, as they are more commonly displayed), starting only with the use of back-referenced user identifiers (@kurt_cagle) and URIs which people had to shorten externally. However, within a year, all three forms, as well as responses, retweets and other forms of enrichment, had not only become standard in Twitter, but had become essential to its DNA, providing ways of indexing content via free-form folksonomies.

Other features (such as de-referencing image and video sources) eventually followed, adding additional dimensions and extending the ecosystem as other platforms became repositories for media: most notably Instagram for image content, and Vine, which made it possible to embed short mobile videos. By opening up their APIs, most of these social media companies in turn provided the foundation for mobile and web apps, and it’s likely that most people looking at these apps see these data feeds as a source for enhancing these core services.

This is actually one reason why Twitter never needs to turn a dime in profit – it exists as the foundational technology on which any number of other startups are built, and most are perfectly willing to help fund Twitter in order to keep the ecosystem dynamic.

For all that, semantic folks have been puzzling out Twitter for some time, because it has created a linked data infrastructure that is clearly a graph and just as clearly does not follow the complex rules-based systems that so much of semantics relies on. Part of this has to do with the fact that most users of Twitter et alia have never heard of the Semantic Web, and most would be hard-pressed to even describe what it does if they have.

However, take a look at a typical Twitter message, this one from data scientist Kirk Borne (@KirkDBorne):

Lack of #DataScientists poses @SKA_telescope problem: http://goo.gl/YrvpMd #BigData #Astronomy #DataScience (see also @LSST)

This entry (chosen pretty much at random) has the following characteristics:

  • @user-references (@SKA_telescope)
  • #topical-references (#BigData)
  • shortened links (http://goo.gl/YrvpMd)
  • Account originator (@KirkDBorne)
  • Message identifier (not shown, but every Twitter message has one, typically a GUID: f9732f07-ab02-46d4-b869-215e36576dc5, which I’ll shorten to f9732f07)
  • A time stamp (not shown)
  • Retweet-reference (not shown, but typically an internal link to a previous tweet).
  • Additional human text message.

Originally, most tweets started as human text messages, but over time, the textual content has eroded to the point where it provides useful color, but not necessarily much else. You can think of the text message as the content without the semantic decoration:

Lack of DataScientists poses SKA_telescope problem: [link] BigData Astronomy DataScience (see also LSST)

This can be useful for text analysis, but the very brevity of the message means that there is a somewhat condensed grammar in play that requires a different kind of lexical parser. We’ll come back to this concept in a bit, but it’s worth focusing on the other components first.
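As a rough illustration of how the decoration can be separated from the text, here is a minimal Python sketch. The regular expressions are simplified assumptions rather than Twitter’s own entity rules, the function name parse_tweet is invented for this example, and the tweet string is reconstructed from the stripped version shown above.

```python
import re

# Simplified patterns for the three kinds of decoration discussed above.
# These are assumptions for illustration, not Twitter's official grammar.
MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
LINK = re.compile(r"https?://\S+")


def parse_tweet(text):
    """Split a tweet into mentions, hashtags, links and undecorated text."""
    links = LINK.findall(text)
    # Replace links first so a '#' or '@' inside a URL is not misread.
    plain = LINK.sub("[link]", text)
    mentions = MENTION.findall(plain)
    hashtags = HASHTAG.findall(plain)
    # Drop the sigils but keep the words themselves.
    plain = re.sub(r"[@#](\w+)", r"\1", plain)
    return {"mentions": mentions, "hashtags": hashtags,
            "links": links, "text": plain}


tweet = ("Lack of #DataScientists poses @SKA_telescope problem: "
         "http://goo.gl/YrvpMd #BigData #Astronomy #DataScience "
         "(see also @LSST)")
parsed = parse_tweet(tweet)
print(parsed["text"])
```

The "text" entry comes out as the undecorated message, while the other three entries hold the pieces that carry the semantics.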

Semantically, a tweet has as its subject itself, which we can get from the GUID. For convenience’s sake I’ll use an abbreviated form:
tweet:f9732f07

to represent this particular tweet message.

With this in mind, you can use the notations above as shortcuts for different predicates:

@SKA_telescope
tweet:f9732f07  prop:refUser  user:SKA_telescope.

#BigData
tweet:f9732f07  prop:refTopic  topic:BigData.
topic:BigData    skos:prefLabel "BigData".

http://goo.gl/YrvpMd
tweet:f9732f07  prop:about  <http://goo.gl/YrvpMd>.

From @KirkDBorne
tweet:f9732f07  prop:fromUser user:KirkDBorne.

In Feed @kurt_cagle
tweet:f9732f07  prop:inFeed user:kurt_cagle.

Retweet
tweet:f9732f07  prop:inRefTo tweet:c192a1e3.

Sent on 2015-08-08
tweet:f9732f07  prop:sent "2015-08-08T02:30:16"^^xs:dateTime.

Text
tweet:f9732f07  prop:text """Lack of DataScientists poses SKA_telescope problem: [link] BigData Astronomy DataScience (see also LSST)"""^^xs:string.

This actually provides a huge amount of information just with this tiny set of predicates. Twitter also has APIs for resolving user names to get more detailed information, including name, user name, information about followers and so forth (more info at https://dev.twitter.com/rest/reference/get/users/show), each of which can also be translated into similar semantics.
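The mapping from decorations to predicates is mechanical enough to sketch in a few lines of Python. This is only an illustration: the function tweet_to_triples and its argument names are hypothetical, and the prop:, user: and topic: prefixes simply follow the shorthand used in the listing above.

```python
def tweet_to_triples(tweet_id, sender, sent, mentions, hashtags, links):
    """Return (subject, predicate, object) tuples for a single tweet."""
    subject = f"tweet:{tweet_id}"
    triples = [
        (subject, "prop:fromUser", f"user:{sender}"),
        (subject, "prop:sent", f'"{sent}"^^xs:dateTime'),
    ]
    # One statement per @user, #topic and link found in the tweet.
    triples += [(subject, "prop:refUser", f"user:{name}") for name in mentions]
    triples += [(subject, "prop:refTopic", f"topic:{tag}") for tag in hashtags]
    triples += [(subject, "prop:about", f"<{url}>") for url in links]
    return triples


triples = tweet_to_triples(
    tweet_id="f9732f07",
    sender="KirkDBorne",
    sent="2015-08-08T02:30:16",
    mentions=["SKA_telescope", "LSST"],
    hashtags=["DataScientists", "BigData", "Astronomy", "DataScience"],
    links=["http://goo.gl/YrvpMd"],
)
# Render each tuple in the same one-statement-per-line style used above.
lines = [f"{s}  {p}  {o}." for s, p, o in triples]
print("\n".join(lines))
```

Keeping the statements as plain tuples until the last moment makes it easy to switch the output syntax later without touching the extraction logic.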

Changing Perspectives

Consider, for instance, some of the more subtle relationships that exist within a tweet. It’s unusual to have more than one link within a tweet. This strongly suggests that within each tweet this primary link is what the tweet is about, while the hashtags and user links provide categorization. When no link is present, the focus of the tweet is typically on the @users mentioned, but this is a considerably weaker pattern.

Links typically point to web pages or media, but it’s not completely straightforward to determine when two links point to the same resource. URL rewriters are usually used to compress links, and different people may end up using different rewriters to capture the same link in different ways. This is actually typical of semantic web information, where the same resource (a specific web page) may have multiple “addresses”.

For this reason, it’s usually necessary to resolve a link, perform some post cleanup (removing parametric information except for obvious identifiers) and then establish a convention for representing the links within your database consistently. It’s this cleanup work that is usually the most onerous part of trying to convert these links into semantic IRIs. If two links can be determined to point to the same thing, a third IRI, one that doesn’t necessarily represent a web address, will then be used to stand for both of the URLs (as well as any future URLs that also point to the same web page).
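The cleanup step might look something like the following Python sketch. It assumes the shortened link has already been resolved to its final URL (following redirects is left out here), and the set of query parameters worth keeping, the link: prefix and both function names are assumptions made up for the example.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters assumed to identify the resource; everything else
# (tracking tags and the like) is dropped. The list is illustrative only.
KEEP_PARAMS = {"id", "p", "v"}


def canonicalize(url):
    """Normalise an already-resolved URL so equivalent links compare equal."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k in KEEP_PARAMS]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path,
                       urlencode(sorted(query)), ""))


def link_iri(url, registry):
    """Return one shared IRI for every URL with the same canonical form."""
    key = canonicalize(url)
    return registry.setdefault(key, f"link:{len(registry) + 1}")


registry = {}
first = link_iri("HTTP://Example.com/story/?utm_source=twitter&id=42#top", registry)
second = link_iri("http://example.com/story?id=42&utm_medium=social", registry)
print(first, second)
```

Both calls return the same minted identifier, which plays the role of the "third IRI" described above.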

Let’s use a caret (^) character to represent a link (and while we’re at it, we’ll use % to represent a tweet and * to represent a relationship). With this in place you can start creating some additional suppositions. For instance, it becomes possible to invert the subject so that the focus is on the article – which I’ve named ^dataSciArticle1 here.

%tweet1  *hasLink  ^dataSciArticle1.
^dataSciArticle1  *hasURL  "http://goo.gl/YrvpMd".
^dataSciArticle1  *inTopic   #DataScientists.
^dataSciArticle1  *inTopic   #DataScience.
^dataSciArticle1  *inTopic   #Astronomy.
^dataSciArticle1  *inTopic   #BigData.
^dataSciArticle1  *referredBy @KirkDBorne.
^dataSciArticle1  *refersTo    @SKA_telescope.
^dataSciArticle1  *refersTo    @LSST.
^dataSciArticle1  *summary   "Lack of DataScientists poses SKA_telescope problem: BigData Astronomy DataScience (see also LSST)".

By the way, these are all triples in the semantic sense. I’ve created a shorthand notation, so that it’s not in any one language (Turtle or JSON-LD or RDF/XML, for instance), but it is a representation of simple statements. One of the beauties of RDF is that the language itself is a conceptual framework – syntax is up to you.

Now, let’s go to Twitter and search for #datascience and #astronomy (tags are not case sensitive). Here’s another from Kirk Borne.

Aug 11

Huge discovery potential with release of #Astronomy #BigData: [link] #OpenData #DataScience

Once again, we can encapsulate these as assertions (using ! to indicate dates, though I’m rapidly running out of top row characters).

%tweet2  *hasLink  ^dataSciAstr2.
%tweet2  *publishedOn  !2015-08-11.
^dataSciAstr2  *hasURL   "".
^dataSciAstr2  *inTopic  #Astronomy.
^dataSciAstr2  *inTopic  #BigData.
^dataSciAstr2  *inTopic  #OpenData.
^dataSciAstr2  *inTopic  #DataScience.
^dataSciAstr2  *referredBy @KirkDBorne.
^dataSciAstr2  *summary  "Huge discovery potential with release of Astronomy BigData".
^dataSciAstr2  *hasImage ^dataSciArticle2Image1.
^dataSciAstr2  *hasImage ^dataSciArticle2Image2.
^dataSciArticle2Image1 *hasURL "http://goo.gl/vED8lX".
^dataSciArticle2Image2 *hasURL "http://goo.gl/FOtDS1".

Notice I’ve added a new property hasImage, which would likely be derived from a bit of screen scraping of the source. We can also do the same with the first tweet:
^dataSciArticle1  *hasImage ^dataSciArticle1Image1.
^dataSciArticle1Image1 *hasURL   "http://goo.gl/uhWElC".

And just to round things out, a third post in the same vein, this time a commented retweet of the second post.

Aug 11

Am really excited @KirkDBorne to see the release of #Astronomy #BigData #OpenData #DataScience: [link]

%tweet3  *hasLink  ^dataSciAstr2RT.
%tweet3  *publishedOn  !2015-08-11.
%tweet3  *retweets  %tweet2.
^dataSciAstr2RT  *hasURL   "".
^dataSciAstr2RT  *inTopic  #Astronomy.
^dataSciAstr2RT  *inTopic  #BigData.
^dataSciAstr2RT  *inTopic  #OpenData.
^dataSciAstr2RT  *inTopic  #DataScience.
^dataSciAstr2RT  *retweetBy @sarveshgupta89.
^dataSciAstr2RT  *mentions @KirkDBorne.
^dataSciAstr2RT  *summary  "Am really excited KirkDBorne to see the release of Astronomy BigData OpenData DataScience".

So, 420 characters later (3 x 140 characters per tweet), we’ve pulled out a lot of potential semantic information. We can convert this with a script (which I won’t cover here) that uses regular expressions to turn the symbols into namespaced prefixes, then load the result into a semantic triple store using N3 notation.
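The conversion script itself is not part of this article, but purely as an illustration, one way to approach it in Python is shown below. The prefix names (tweet:, link:, prop:, topic:, user:), the xs:date datatype for ! dates and the function name to_prefixed are all assumptions, and the sketch expects straight double quotes around literals.

```python
import re

# Sigil-to-prefix mapping; the prefix names are assumptions for illustration.
PREFIXES = {"%": "tweet:", "^": "link:", "*": "prop:",
            "#": "topic:", "@": "user:"}

# Match a quoted literal first, otherwise a sigil followed by a name or date.
TOKEN = re.compile(r'"[^"]*"|[%^*#@!][\w-]+')


def to_prefixed(line):
    """Rewrite one shorthand statement using namespaced prefixes."""
    def convert(match):
        token = match.group(0)
        if token.startswith('"'):
            return token  # leave quoted literals untouched
        if token.startswith("!"):
            return f'"{token[1:]}"^^xs:date'  # dates become typed literals
        return PREFIXES[token[0]] + token[1:]
    return TOKEN.sub(convert, line)


print(to_prefixed("%tweet2  *publishedOn  !2015-08-11."))
print(to_prefixed("^dataSciArticle1  *inTopic   #BigData."))
```

Matching quoted literals before sigils keeps a # or @ inside a summary string from being rewritten by mistake.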

Querying Social Media Semantics

What can you do with this kind of information? Quite a bit as it turns out. For instance, let’s say that you sample thousands of tweets around the keywords #astronomy and #bigData. You can use a SPARQL query to determine which terms most commonly co-occur with these:

select ?term (count(?term) as ?count) where {
    ?article :inTopic ?term.
    ?article :inTopic ?searchTerm.
    filter (!sameTerm(?term, ?searchTerm))
}
group by ?term
order by desc(?count)
limit 50,
{searchTerm: (#astronomy,#bigData)}

#data 821
#dataScience 425
#dataAnalytics 391
#dataScientist 353
#dataScientists 328
#statistics 295
#galaxies 252
#survey 212

This is a form of clustering analysis. If terms occurred at random, every term in the list would have roughly the same count, but terms that are more relevant occur more frequently, and hence creep up to the top of the list. The accuracy of this relevance ranking tends to increase with the number of tweets sampled, while each search term you add reduces the overall number of samples.
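The same kind of co-occurrence count can be sketched outside a triple store. The Python below is an approximation in the spirit of the query rather than a translation of it: it counts each related tag once per article that carries at least one of the search tags, and the function name and sample data are invented for the example.

```python
from collections import Counter


def cooccurring_terms(articles, search_terms, limit=50):
    """Rank tags found on articles that share at least one search term.

    `articles` maps an article id to the set of tags attached to it.
    Tags are compared case-insensitively.
    """
    wanted = {term.lower() for term in search_terms}
    counts = Counter()
    for tags in articles.values():
        tags = {tag.lower() for tag in tags}
        if tags & wanted:
            # Count every other tag on this article, not the search tags.
            counts.update(tags - wanted)
    return counts.most_common(limit)


sample = {
    "a1": {"Astronomy", "BigData", "DataScience"},
    "a2": {"astronomy", "OpenData", "DataScience"},
    "a3": {"Cooking"},
}
ranking = cooccurring_terms(sample, ["astronomy", "bigData"])
print(ranking)
```

With only three made-up articles the ranking is trivial, but the shape of the result is the same as the list above: a term followed by the number of times it co-occurred.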

Similarly, you can use this query to make recommendations. In this case, this involves performing an inference in order to capture the counts:

insert {
        ?survey :hasLink ?link.
        ?survey :hasTerm ?term.
        ?survey :hasCount ?count.
        ?survey :type   class:Survey.
        ?survey :lastModifiedBy ?date.
        }
where {
        {select ?article ?term (count(?term) as ?count) where {
              ?article :inTopic ?term.
              ?article :inTopic ?searchTerm.
              filter (!sameTerm(?term, ?searchTerm))
              }
         group by ?article ?term
         having (count(?term) > 10)
        }
        ?article :hasLink ?link.
        ?link    :hasURL  ?url.
        bind (now() as ?date)
        bind (uuid() as ?survey)
} ,
{searchTerm: (#astronomy,#bigData)}

For a given term related to the parametric terms, this will let you look up each article with that term, showing how many times it’s been mentioned in a tweet, a fair proxy for the popularity of the article for that particular term. You can also order these by last update and take only the most recent for a given term and article, and you can also periodically delete any “survey” objects that are older than a certain date, just to keep them from overwhelming your system.
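The housekeeping described here (keep the newest survey per article and term, discard the stale ones) is sketched below in Python over plain dictionaries. The field names link, term and modified, the 30-day default and the function name are assumptions; in a triple store the same thing would be done with a delete query.

```python
from datetime import datetime, timedelta


def prune_surveys(surveys, now, max_age_days=30):
    """Keep the newest survey per (link, term) and drop expired ones."""
    cutoff = now - timedelta(days=max_age_days)
    latest = {}
    for survey in surveys:
        if survey["modified"] < cutoff:
            continue  # older than the retention window
        key = (survey["link"], survey["term"])
        if key not in latest or survey["modified"] > latest[key]["modified"]:
            latest[key] = survey
    return list(latest.values())


now = datetime(2015, 8, 12)
surveys = [
    {"link": "link:1", "term": "datascience", "count": 12, "modified": datetime(2015, 8, 1)},
    {"link": "link:1", "term": "datascience", "count": 15, "modified": datetime(2015, 8, 10)},
    {"link": "link:2", "term": "statistics", "count": 11, "modified": datetime(2015, 6, 1)},
]
kept = prune_surveys(surveys, now)
print(kept)
```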

The Business Case For Semantics

Using similar queries, you can also track how often a given article gets cited (and can create time series of these to show citation evolution over time), can watch retweet cascades (the tree formed when an article gets retweeted by multiple people) to determine how influential a given person is in specific segments, or how relevant a topic is to different people. One benefit that you get with semantics is that not only can you get aggregates, but you can also look at behaviors of specific relationships over time, something which often gets overlooked in traditional analytic circles.
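As a small illustration of the retweet cascade idea, the following Python sketch counts how many tweets sit below each tweet in its retweet tree, which is one crude measure of reach. It assumes the cascade really is a tree (no cycles) and takes (retweet, original) pairs such as those produced by the *retweets relationship; the function name and sample identifiers are made up.

```python
from collections import defaultdict


def cascade_sizes(retweets):
    """Return, per tweet, the number of tweets below it in its retweet tree.

    `retweets` is a list of (retweet_id, original_id) pairs.
    """
    children = defaultdict(list)
    for child, parent in retweets:
        children[parent].append(child)

    sizes = {}

    def count(node):
        # Each direct retweet counts once, plus everything beneath it.
        if node not in sizes:
            sizes[node] = sum(1 + count(c) for c in children.get(node, []))
        return sizes[node]

    for node in list(children):
        count(node)
    return sizes


sizes = cascade_sizes([("tweet3", "tweet2"), ("tweet4", "tweet2"),
                       ("tweet5", "tweet3")])
print(sizes)
```

Summing these sizes over all tweets from one account, grouped by topic, would give a first approximation of how influential that account is in a given segment.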

Sentiment analysis can often be done by looking at the text abstracts after removing the semantic components, and attempting to identify words and phrases that indicate approval or disapproval. This in turn can be tagged back into the tweet, possibly as a metric between 1 (wild about something or someone) and -1 (loathe something or someone). Other networks besides Twitter also have “like” capabilities (and very occasionally dislike, though those usually tend to disappear pretty quickly when implemented), but textual analysis is often a more reliable indicator – people may be reacting to the news of the event, not the event itself.
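A toy version of such a score is sketched below in Python. The lexicon and its weights are invented for illustration, and averaging the matched words is only one of many possible ways to land in the range from -1 to 1; a real system would use a far larger lexicon or a trained model.

```python
import re

# Toy lexicon; the words and weights are invented for illustration.
LEXICON = {"excited": 1.0, "love": 1.0, "great": 0.5,
           "problem": -0.5, "loathe": -1.0}


def sentiment(text):
    """Average the weights of known words; 0.0 when none are found."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[word] for word in words if word in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


positive = sentiment("Am really excited KirkDBorne to see the release of "
                     "Astronomy BigData OpenData DataScience")
negative = sentiment("Lack of DataScientists poses SKA_telescope problem")
print(positive, negative)
```

The resulting number can then be attached to the tweet as one more statement, alongside the topic and user references.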

Pinterest is another example of a highly mineable social media platform, and is especially useful for attempting to identify what is in a given image. Pinterest boards are natural categorization mechanisms and, unlike Twitter, Pinterest doesn’t require the addition of a hashtagged folksonomy. Indeed, if you see a picture as being simply a resource, then the techniques for semantically encoding it vs. a tweet and its links are almost identical.

The beauty of social media semantics is that so much of what gets posted provides very valuable information about what people think. Political data scientists regularly mine Twitter and Facebook both to see how well their candidates are doing and to catch potential danger signals (discontent, rumors and so forth) early.

Businesses can see whether customers like or dislike new products or services, or whether the launch of a new product even hits the radar. For entertainment companies, semantic mining of social media can make it possible to tweak a flagging series by introducing or retiring characters and storylines, or can determine if a certain genre of game is becoming oversaturated in the market. Such efforts also provide a deeper dive into whether ad spends are effective or not, can help predict early on when something is about to go viral, and can pinpoint those influencers who have an outsized ability to change the market with a good word or bad review.

Semantics is one of the best sets of tools on the planet for working with graphs, and social media platforms are fundamentally graphs in nature. As such, business analysts and data scientists should spend some time becoming familiar with tools such as RDF, OWL and SPARQL, in addition to more traditional analytics functions. While most forms of analytics can tell you how much, semantics can increasingly answer the question “Why?”.


Authored by:
Kurt Cagle

Kurt Cagle is the Founder and CEO of Semantical LLC, a semantics and enterprise metadata management company. He has worked within the data science sphere for the last fifteen years, with experience in data modeling, ETL, semantic web technologies, data analytics (primarily R), XML and NoSQL databases (with an emphasis on MarkLogic), data governance and data lifecycle, web and print publishing, and large scale data integration projects.  He has also worked as a technology evangelist and writer, with nearly twenty books on related web and semantic technologies and hundreds of articles.
