There’s something really interesting about how we do language on the internet. We do a lot of writing to each other, and the vast majority of that is through electronic means - email, texts, Twitter. But when researchers study what that writing looks like, they tend to focus overwhelmingly on Twitter, even though it’s nowhere near the most popular way of talking online: a 2012 paper found that while Twitter makes up only 0.16% of electronic communications, it gets over 74% of the discussion. And often, researchers state explicitly that they’re treating Twitter as pretty much equivalent to how people use language in texting or other social media.
Why is this? Well, it’s pretty obvious: it’s easy to get a lot of data about how people use Twitter, because tweets are pretty much all publicly available already. By contrast, research on SMS (i.e. texts) is much less common: despite nearly 38% of electronic communications going via text - over 21 billion texts daily! - fewer than 15% of publications actually look at SMS data. And the reason for that is just as obvious: texts are private, and thus harder to get a good sample of.
That wouldn’t matter if Twitter and SMS language really were pretty much equivalent, as so many of these studies claim. But is that true? Recently, I went to a presentation by Claudia Brugman and Thomas Conners at the University of Maryland’s Language Science Center. They’d been doing research comparing Twitter and SMS corpora in Indonesian, to see if people use language differently in these different media.
Why Indonesian? Well, because it’s very different from English. Indonesian has pretty free word order, fluid word classes, and no marking for number or tense. English has had a lot of computational linguistics research done on it, including some on SMS usage, but no one’s really done this research yet on languages that are typologically different from English. Indonesian’s a good candidate, because Indonesians actually use social media a lot. They make up the 3rd biggest population of active Twitter users, as well as #4 for Facebook. There are more mobile subscriptions in use in Indonesia than there are people there, 308.2 million vs 255.5 million. It’s a big group of people.
And they do similar things to English speakers with non-standard spellings in their texts. We may have yaaasss (or your favourite spelling of it); they have their word for agreement, sip, turning into siiippp, or the word for before, dulu, turning into duluuu. And they have spellings that try to match the pronunciations of the trendy, prestigious Jakarta dialect, as well: the phrase for thanks, terima kasih, becomes makaci, or the word for want, mau, appears as mo. This is all pretty familiar if you see much English texting.
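For anyone who wants to poke at this kind of data, stretched spellings like the ones above are one of the easier non-standard forms to spot automatically. Here’s a minimal Python sketch, not anything from the presentation: the function names are made up for illustration, and the rule (squash any character repeated three or more times down to one) is a rough assumption. It won’t help with respellings like makaci or mo, which need an actual word list.

```python
import re

# Hypothetical helper: any word character repeated 3+ times in a row
# is treated as an expressive lengthening and squashed to one copy.
# Runs of exactly two are left alone, since doubled letters can be
# part of a word's ordinary spelling.
ELONGATION = re.compile(r"(\w)\1{2,}")

def collapse_elongation(word):
    """Return the word with every run of 3+ identical characters reduced to one."""
    return ELONGATION.sub(r"\1", word)

def is_elongated(word):
    """True if the word contains a run of 3+ identical characters."""
    return ELONGATION.search(word) is not None

for w in ["siiippp", "duluuu", "yaaasss", "dulu"]:
    print(w, is_elongated(w), collapse_elongation(w))
```

Squashing to a single character is a blunt choice: it recovers sip and dulu, but it would mangle any word whose standard spelling has a double letter at the stretched spot, so a real system would check candidates against a dictionary.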
But are there differences between how they use language on Twitter vs. in texts? Well… yes. They compared a corpus of 3000 SMS messages with one of about 1000 tweets, and found that the text messages were more casual, in just about every way. Word usage was more casual: in texts, people used a more casual negation, gak, instead of tidak, and they used a lot more non-standard forms overall, 21.1% in texts to 5.4% in tweets. The tweets drew on a larger variety of words, too, and were nearly twice as long as the texts. And the tweets weren’t just longer, they were also more complex: over half the tweets had markers for subordinate clauses, but only about 20% of the texts did. The texts also left out more words that were clear from the context, which means more non-standard word orders.
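To make those measures a bit more concrete, here’s a rough Python sketch of how you might compute the same kinds of numbers for two piles of messages. It is not the researchers’ method: the tokenizer is a crude regex, the function and key names are made up, the list of non-standard forms is a tiny hypothetical one built from the examples in this post, and the messages are invented toy strings rather than real data.

```python
import re

def tokenize(message):
    """Lowercase a message and split it into word tokens (a crude assumption)."""
    return re.findall(r"\w+", message.lower())

def corpus_stats(messages, nonstandard_forms):
    """Compute simple per-corpus measures over a list of message strings."""
    tokens = [t for m in messages for t in tokenize(m)]
    if not messages or not tokens:
        return {"mean_length": 0.0, "type_token_ratio": 0.0, "nonstandard_rate": 0.0}
    return {
        # average number of word tokens per message
        "mean_length": len(tokens) / len(messages),
        # distinct words divided by total words (vocabulary variety)
        "type_token_ratio": len(set(tokens)) / len(tokens),
        # share of tokens found in the non-standard word list
        "nonstandard_rate": sum(t in nonstandard_forms for t in tokens) / len(tokens),
    }

# Hypothetical word list and invented toy messages, for illustration only.
NONSTANDARD = {"gak", "makaci", "mo"}
toy_sms = ["gak mau", "makaci", "mo dulu"]
toy_tweets = ["tidak mau dulu", "terima kasih"]

print("sms:", corpus_stats(toy_sms, NONSTANDARD))
print("tweets:", corpus_stats(toy_tweets, NONSTANDARD))
```

One caveat if you try this on real corpora: a type-token ratio tends to drop as a corpus gets bigger, so comparing 3000 messages against 1000 would need equal-sized samples or some other correction before the vocabulary numbers mean much.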
So overall, this suggests that the general idea that Twitter can stand in for other social media in linguistic analysis isn’t really correct. It looks like tweets are closer to composed, written language than to super casual stuff like texts, which makes sense: texts go straight to friends, whereas tweets are at least in theory available to anyone.
And it means that there’s more of a challenge in building good corpora and computational systems for dealing with language coming from different sources. That’s likely part of where research like this will be going: trying to work out how to automatically code and tag all these different types of language. But when it comes to this, it looks like texts are going to be a level harder than tweets, not exactly the same. And it shows, too, that people everywhere use different media to talk in different ways, which is really cool. ^_^