Millions of posts are published on Tumblr every day. Understanding the topical structure of this massive collection of data is a fundamental step toward connecting users with the content they love, as well as answering important philosophical questions, such as “cats vs. dogs: who rules on social networks?”
As a first step in this direction, we recently developed a post-categorization workflow that aims to associate posts with broad-interest categories, where the list of categories is defined by Tumblr’s on-boarding topics.
Posts are heterogeneous in form (video, images, audio, text) and consist of semi-structured data (e.g. a textual post has a title and a body, but the actual textual content is unstructured). Luckily, our users do a great job of summarizing the content of their posts with tags. As the distribution below shows, more than 50% of posts are published with at least one tag.
However, tags define micro-interest segments that are too fine-grained for our goal. Hence, we editorially aggregate tags into semantically coherent topics: our on-boarding categories.
We also compute a score that represents the strength of each (tag, topic) affiliation, based on approximate string matching and semantic relationships.
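As a rough illustration, the string-matching half of such an affiliation score could be sketched with Python’s difflib; everything below is an assumption for illustration, not the production logic, and the semantic-relationship signal is out of scope here:

```python
from difflib import SequenceMatcher

def affiliation_score(tag: str, topic: str) -> float:
    """Hypothetical sketch: score a (tag, topic) pair by approximate
    string matching on the lowercased strings. Returns a value in [0, 1]."""
    return SequenceMatcher(None, tag.lower(), topic.lower()).ratio()

# Exact and near matches score high, unrelated strings low.
print(affiliation_score("fashion", "Fashion"))  # 1.0
```

In practice a single similarity ratio is too crude on its own, which is why the real score also folds in semantic relationships between tags and topics.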
Given this input, we can compute a score for each pair (post, topic) as a weighted sum over the post’s tag features:

score(p, t) = Σ f ∈ tag-features(p) q(f, p) · w(f, t)

where:
w(f,t) is the (tag, topic) affiliation score, or zero if the pair (f,t) is not in the dictionary W.
tag-features(p) contains features extracted from the tags associated with the post: the raw tag, a “normalized” tag, and n-grams.
q(f,p) is a weight in [0,1] that takes into account the source of the feature f in the post p.
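The three ingredients above can be combined into a toy scorer. The dictionary W, the feature-extraction helper, and the q weights below are all illustrative assumptions, not the real data:

```python
# Hypothetical dictionary W of (tag, topic) -> affiliation score.
W = {
    ("game of thrones", "Television"): 0.9,
    ("got", "Television"): 0.6,
    ("dragon art", "Art"): 0.7,
}

def tag_features(post_tags):
    """Toy feature extraction: the raw tag plus a 'normalized' form,
    each with an assumed source weight q. The real pipeline also adds n-grams."""
    feats = []
    for tag in post_tags:
        feats.append((tag, 1.0))                  # raw tag, full weight
        feats.append((tag.strip().lower(), 0.8))  # normalized tag, assumed q = 0.8
    return feats

def score(post_tags, topic):
    """score(p, t) = sum over features f of q(f, p) * w(f, t)."""
    return sum(q * W.get((f, topic), 0.0) for f, q in tag_features(post_tags))

print(score(["Game Of Thrones"], "Television"))  # ~0.72 with these toy numbers
```

Only the normalized feature hits W here, so the score is the product of its q weight and the dictionary entry.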
The drawback of this approach is that it relies heavily on the dictionary W, which is far from complete.
To address this issue we exploit another source of data: RelatedTags, an index that provides a list of similar tags for each tag by exploiting co-occurrence patterns. For each pair (tag, topic) in W, we propagate the affiliation with the topic to the tag’s top related tags, smoothing the affiliation score w to reflect the fact that these propagated (tag, topic) entries could be noisy.
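A minimal sketch of this propagation step, assuming a dict-based RelatedTags index and an arbitrary smoothing factor (both assumptions, not the production values):

```python
# Hypothetical RelatedTags index: tag -> top related tags (by co-occurrence).
related_tags = {
    "game of thrones": ["got", "jon snow", "daenerys"],
}

def propagate(W, related_tags, decay=0.5):
    """Propagate each (tag, topic) affiliation to the tag's top related tags,
    damping the score by `decay` (an assumed smoothing factor) because the
    propagated entries can be noisy."""
    expanded = dict(W)
    for (tag, topic), w in W.items():
        for rel in related_tags.get(tag, []):
            key = (rel, topic)
            # Keep the strongest score if the entry already exists.
            expanded[key] = max(expanded.get(key, 0.0), w * decay)
    return expanded

W = {("game of thrones", "Television"): 0.9}
W2 = propagate(W, related_tags)  # adds ("jon snow", "Television"), etc.
```

The `max` keeps an existing hand-curated entry from being overwritten by a weaker propagated one.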
This computation is followed by a filtering phase that removes (post, topic) entries with a low confidence score. Finally, the category with the highest score is associated with the post.
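The filtering-plus-argmax step might look like this (the confidence threshold is a made-up value, not the one used in production):

```python
def categorize(topic_scores, threshold=0.3):
    """Drop low-confidence (post, topic) entries, then pick the
    highest-scoring remaining category. Returns None when no topic
    survives the filter, i.e. the post stays uncategorized."""
    kept = {t: s for t, s in topic_scores.items() if s >= threshold}
    if not kept:
        return None
    return max(kept, key=kept.get)

print(categorize({"Television": 0.72, "Art": 0.1}))  # Television
```

Returning None for low-confidence posts trades coverage for precision, which matters when the labels feed downstream recommendations.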
This unsupervised approach to post categorization runs daily on posts created the day before. The next step is to assess the alignment between the predicted category and the most appropriate one.
The results of an editorial evaluation show that our framework identifies a relevant category in most cases, but they also highlight some limitations, such as limited robustness to polysemy.
We are currently looking into improving the overall performance by exploiting NLP techniques such as word embeddings and by integrating the extraction and analysis of visual features into the processing pipeline.
Some fun with data
What is the distribution of posts published on Tumblr? Which categories drive the most engagement? To answer these and other questions, we analyzed the categorized posts over a period of 30 days.
Almost 7% of categorized posts belong to Fashion, with Art as the runner-up.
The category that drives the most engagement is Television, which accounts for over 8% of the reblogs on categorized posts.
However, when we normalize by the number of posts published, the category with the highest average engagement per post is Gif Art, followed by Astrology.
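The normalization behind this last comparison is simply total reblogs divided by post count per category; a toy version with made-up numbers (not the actual Tumblr figures):

```python
# Toy per-category counts; the values are illustrative, not real data.
stats = {
    "Television": {"posts": 1200, "reblogs": 9600},
    "Gif Art":    {"posts": 150,  "reblogs": 3000},
}

def avg_engagement(stats):
    """Average engagement per post: normalize total reblogs by post count."""
    return {cat: s["reblogs"] / s["posts"] for cat, s in stats.items()}

print(avg_engagement(stats))
```

With these numbers, a low-volume category (20 reblogs per post) beats a high-volume one (8 reblogs per post) once volume is normalized away, which is exactly the Gif Art effect described above.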
Last but not least, here are the stats you all have been waiting for!! Cats are winning on Tumblr… for now…