Feature Hashing, or the “hashing trick”

Feature hashing, or the “hashing trick,” is a clever method of dimensionality reduction for NLP: a hash function maps each feature (a word or an n-gram, say) straight to an index in a fixed-size vector, so no vocabulary has to be stored and the dimensionality stays bounded however many distinct features turn up. This is a good blog post with the fundamentals of how and why the hashing trick works when working with a large, sparse set of vectors:

Hashing Language

Feature hashing is an elegant solution to the otherwise hairy problem of fighting the curse of dimensionality. It turned out to be extremely useful for a project I’m currently working on for a course at Columbia: Computational Models of Social Meaning.

Scikit-learn has an implementation of the hashing trick (FeatureHasher, plus HashingVectorizer for raw text) if you’d like to read more about it.
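
To make the idea concrete, here is a minimal sketch of the hashing trick in plain Python. It is not scikit-learn's implementation: the function name hashed_vector, the use of MD5 as the hash, and the bucket count are all assumptions chosen for illustration.

```python
import hashlib


def hashed_vector(tokens, n_buckets=16, signed=False):
    """Return a fixed-length count vector for `tokens` via the hashing trick."""
    vec = [0] * n_buckets
    for tok in tokens:
        # Use a stable hash; Python's built-in hash() is salted per process.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h % n_buckets
        # Optionally take a sign from the top bit, so that tokens which
        # collide in one bucket tend to cancel instead of piling up.
        sign = -1 if signed and (h >> 63) & 1 else 1
        vec[index] += sign
    return vec


# Hypothetical example document.
doc = "the cat sat on the mat".split()
print(hashed_vector(doc, n_buckets=8))
```

The vector length is fixed by n_buckets no matter how many distinct words appear, which is the whole point. The price is that different words can collide in the same bucket.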

A hidden gem in Manning and Schütze: what to call 4+-grams?

I’m a longtime fan of Chris Manning and Hinrich Schütze’s “Foundations of Statistical Natural Language Processing.” I’ve learned from it, I’ve taught from it, and I still find myself thumbing through it from time to time. Last week, I wrote a blog post on SXSW titles that involved looking at n-grams of different lengths, including unigrams, bigrams, trigrams and … well, what do we call the next one up? Manning and Schütze devoted an entire paragraph to it on page 193, which I absolutely love and thought would be fun to share for those who haven’t seen it.

Before continuing with model-building, let us pause for a brief interlude on naming. The cases of n-gram language models that people usually use are for n=2,3,4, and these alternatives are usually referred to as a bigram, a trigram, and a four-gram model, respectively. Revealing this will surely be enough to…

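
For readers who have not worked with n-grams, here is a small sketch of how the unigrams, bigrams, trigrams and four-grams of a title could be extracted. The helper name ngrams and the sample title are made up for illustration and do not come from the original post.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of a token sequence, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 n-grams (none if n > L).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical title, split on whitespace.
title = "how to build a better chatbot".split()
for n in (1, 2, 3, 4):
    print(n, ngrams(title, n))
```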

Detecting Social Power in Written Dialog

This semester, I’ll be working on a project at Columbia CCLS with Prof. Owen Rambow and Vinod Prabhakaran. It falls under the broad category of discourse analysis. Specifically, we’ll be looking into detecting displays of power from email threads and discussion boards. Here’s some prior work that explains the subject better:

Written Dialog and Social Power: Manifestations of Different Types of Power in Dialog Behavior

Extracting social meaning through text analysis is an interesting subject, and I’m excited to get started on it.

Filler words and function words

Today, I found this interesting article on NPR:

Our Use Of Little Words Can, Uh, Reveal Hidden Interests

Here’s a short excerpt:

“When two people are paying close attention, they use language in the same way,” he says. “And it’s one of these things that humans do automatically.”

Pennebaker has counted words to better understand lots of things. He’s looked at lying, at leadership, at who will recover from trauma.

Here is Prof. Pennebaker’s web page, discussing some of the details of his findings:

The World of Words

An excerpt:

Style-related words can also reveal basic social and personality processes, including:

  • Lying vs telling the truth. When people tell the truth, they are more likely to use 1st person singular pronouns. They also use more exclusive words like except, but, without, excluding. Words such as this indicate that a person is making a distinction between what they did do and what they didn’t do. Liars have a problem with such complex ideas.
  • Dominance in a conversation. Analyze the relative use of the word “I” between two speakers in an interaction. Usually, the higher status speaker will use fewer “I” words.
  • Social bonding after a trauma. In the days and weeks after a cultural upheaval, people become more self-less (less use of “I”) and more oriented towards others (increased use of “we”).
  • Depression and suicide-proneness. Public figures speaking in press conferences and published poets in their poetry use more 1st person singular when they are depressed or prone to suicide.
  • Testosterone levels. In two case studies, it was found that when people’s testosterone levels increased rapidly, they dropped in their use of references to other people.
  • Basic self-reported personality dimensions. Multiple studies are now showing that style-related words do much better than chance at distinguishing people who are high or low in the Big Five dimensions of personality: neuroticism, extraversion, openness, agreeableness, and conscientiousness.
  • Consumer patterns. By knowing people’s linguistic styles, we are able to predict (at reasonable rates), their music and radio station preference, liking for various consumer goods, car preferences, etc.
  • And much, much more.

And finally, here’s a link to the paper published in The Journal of Language and Social Psychology:

Um . . . Who Like Says You Know: Filler Word Use as a Function of Age, Gender, and Personality

I find it fascinating that they were able to extract this information without using any complicated analysis of syntax, as far as I can tell.
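
To show how little machinery this kind of word counting needs, here is a rough sketch that compares the rate of “I” words between two speakers, in the spirit of the dominance bullet in the excerpt above. The pronoun list, the tokenizer and the two sample messages are my own assumptions. This is not LIWC’s dictionary or method.

```python
import re

# Assumed list of first-person singular words (not LIWC's actual category).
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def tokenize(text):
    """Lowercase the text and split it into words, keeping contractions whole."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)


def i_word_rate(text):
    """Percentage of words that are first-person singular ("I'm" counts as "I")."""
    words = tokenize(text)
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.split("'")[0] in FIRST_PERSON_SINGULAR)
    return 100.0 * hits / len(words)


# Hypothetical two-speaker exchange.
speaker_a = "I think I could send my draft tomorrow, if that works for me."
speaker_b = "Send the draft by noon. We will review it then."
print(i_word_rate(speaker_a), i_word_rate(speaker_b))
```

According to the excerpt, the speaker with the lower rate would usually be the higher-status one, though a two-line exchange like this is far too small to conclude anything.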

I played with the free-to-use, public version of LIWC (Linguistic Inquiry and Word Count). It reports scores for a handful of word categories, but doesn’t draw any conclusions from them. I fed it the “I Have A Dream” speech by Martin Luther King, Jr. Here were my results:

Details of Writer: 34 year old Male
Date/Time: 1 September 2014, 2:43 pm

LIWC Dimension                 Your Data   Personal Texts   Formal Texts
Self-references (I, me, my)    4.08        11.4             4.2
Social words                   6.58        9.5              8.0
Positive emotions              3.74        2.7              2.6
Negative emotions              0.79        2.6              1.6
Overall cognitive words        2.27        7.8              5.4
Articles (a, an, the)          8.50        5.0              7.2
Big words (> 6 letters)        18.03       13.1             19.6

The text you submitted was 882 words in length.

The numbers don’t have units, so I’m not sure how I’m supposed to interpret them. Nonetheless, it’s interesting to compare the contents of the speech to “personal” and “formal” texts in relative terms, I suppose.
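
One plausible reading, and it is only an assumption on my part, is that each score is the percentage of the text’s words that fall into that category. A quick check is to multiply the first number on each row of my results by the 882-word length and see whether the implied counts come out close to whole numbers. The names below are mine; the figures are copied from the results above.

```python
TOTAL_WORDS = 882  # length reported by the tool

# First score on each row of the results above.
speech_scores = {
    "Self-references": 4.08,
    "Social words": 6.58,
    "Positive emotions": 3.74,
    "Negative emotions": 0.79,
    "Overall cognitive words": 2.27,
    "Articles": 8.50,
    "Big words": 18.03,
}


def implied_counts(scores, total_words):
    """Word counts implied by treating each score as a percentage of all words."""
    return {name: pct * total_words / 100 for name, pct in scores.items()}


for name, count in implied_counts(speech_scores, TOTAL_WORDS).items():
    # Show the implied count and how far it is from the nearest whole number.
    print(f"{name}: {count:.2f} (off by {abs(count - round(count)):.3f})")
```

Every implied count lands within 0.05 of a whole number (about 36 self-references, for example), which is what rounding percentages to two decimals would produce. That fits the percentage reading, although it doesn’t prove it.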

I looked around on the Internet, and found a Reddit comment referring to the work of Fairclough, Van Dijk and Wodak.

Here’s one article about Critical Discourse Analysis, the category this type of study falls under: Teun A. Van Dijk – Critical Discourse Analysis. From that article:

Critical analysis of conversation is very different from an analysis of news reports in the press or of lessons and teaching at school.

This may be why “I Have A Dream” wasn’t a good example to use to play with the LIWC tool: it is a prepared speech, not a conversation.