A hidden gem in Manning and Schutze: what to call 4+-grams?


I’m a longtime fan of Chris Manning and Hinrich Schutze’s “Foundations of Statistical Natural Language Processing.” I’ve learned from it, I’ve taught from it, and I still find myself thumbing through it from time to time. Last week, I wrote a blog post on SXSW titles that involved looking at n-grams of different lengths, including unigrams, bigrams, trigrams and … well, what do we call the next one up? Manning and Schutze devoted an entire paragraph to the question on page 193, which I absolutely love and thought would be fun to share for those who haven’t seen it.

Before continuing with model-building, let us pause for a brief interlude on naming. The cases of n-gram language models that people usually use are for n=2,3,4, and these alternatives are usually referred to as a bigram, a trigram, and a four-gram model, respectively. Revealing this will surely be enough to…
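For anyone who hasn’t worked with n-grams, here is a minimal Python sketch of what the terms refer to: an n-gram is a run of n consecutive tokens. The function name `ngrams` and the sample sentence are my own illustration, not anything from the book or the original post.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # A sequence of length L has L - n + 1 n-grams (none if it is shorter than n).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    # Hypothetical sample text, split on whitespace for simplicity.
    tokens = "what do we call the next one up".split()
    for n in (1, 2, 3, 4):
        print(n, ngrams(tokens, n)[:2])
```

With n=4 this produces the four-token runs that the naming question is about.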


Detecting Social Power in Written Dialog

This semester, I’ll be working on a project at Columbia CCLS with Prof. Owen Rambow and Vinod Prabhakaran. It falls under the broad category of discourse analysis. Specifically, we’ll be looking into detecting displays of power from email threads and discussion boards. Here’s some prior work that explains the subject better:

Written Dialog and Social Power: Manifestations of Different Types of Power in Dialog Behavior

Extracting social meaning from text analysis is an interesting subject, and I’m excited to get started on it.

How I Start

I recently found this website on Hacker News called “How I Start.” It’s a series of tutorials for learning new programming languages. Right now, there are only a few, but I think they’re expanding to add more.

I’m learning Go for one of my projects this semester, so their tutorial for Go is very useful:

How I Start. – Go (with Peter Bourgon)

He’s setting up a web server, querying a weather API, and displaying some results.

I’ll be setting up an identical toy webserver soon, and I’ll post my results here.


It seems that Satoshi Nakamoto has been hacked or somehow compromised. This led me to learn more about Bitcoin and its history so far.

Like many others, I’m kicking myself for not buying a few Bitcoins back in 2009. I was a college student then, and I had no disposable income. Still, I was considering buying one or two, just to walk myself through the process and use an exchange or a client. Maybe it’s a good thing I didn’t get involved in it at all, since its legality and widespread acceptance are still big issues.

Anyway, I’m trying to drown out that regret by satisfying my fascination with the concept of cryptocurrency.

First, here’s the original whitepaper about Bitcoin:

Bitcoin: A Peer-to-Peer Electronic Cash System

This Bitcoin series on Khan Academy is a good introduction as well:

Bitcoin: What Is It?

This Wikipedia article has some interesting background:

History of Bitcoin

And finally, take a look at these photos taken at a Bitcoin farm:

Gallery: Inside a Top Bitcoin Mine in China


Just a quick productivity tip today.

Noise is probably my #1 distractor. I find it very difficult to concentrate in noisy environments, and since I moved from Seattle to New York, it has become even more of a problem.

But ambient noise, with some soft instrumental music layered on top, is the best way to protect myself from noisy distractions. Here’s what I use for ambient noise:

Noisli

This page is one of the pinned tabs on my browser. I like to mix thunder, rain (at a low volume), and the crackling fire. The rain noise gives the right amount of noise at the right frequencies. Plus, the thunder and fire give it some texture. I find this combination far superior to simple white noise, which actually gives me a headache.

I haven’t used the visuals or the text editor, so I can’t speak for their effectiveness. But I recommend that you try out Noisli’s ambient noise features.

Character encoding

Every software developer needs to know the basics of character encoding. However, I find it a very dry and dull topic. So here are some entertaining introductions to it.

First, a video explaining Unicode, UTF-8, and its elegance.

Now, read this popular Joel on Software blog post:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
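To make the idea concrete, here is a small Python sketch of what UTF-8 does: each Unicode code point becomes between one and four bytes, and plain ASCII characters keep their single-byte values. The helper name `utf8_hex` and the sample characters are my own choices for illustration.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a string as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))


if __name__ == "__main__":
    # Escapes are used so the example does not depend on the file's own encoding.
    samples = [
        "A",           # U+0041, plain ASCII
        "\u00e9",      # U+00E9, e with acute accent
        "\u20ac",      # U+20AC, euro sign
        "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
    ]
    for ch in samples:
        print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")

    # Decoding the same bytes with the wrong encoding gives mojibake.
    raw = "\u00e9".encode("utf-8")
    print(raw.decode("utf-8"), "vs", raw.decode("latin-1"))
```

The last line shows the classic failure mode: the bytes are fine, but reading them with the wrong encoding turns one character into two.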


A demo of neural networks

I enjoyed this demo of how exactly neural nets work:

A visual proof that neural nets can compute any function

I liked the interactive demo. I feel like I have a better intuitive understanding of what weight and bias mean.

I took a machine learning course during my undergrad, and it was focused more on formulas and implementing machine learning algorithms. I felt that it didn’t really give me a fully intuitive understanding of how these algorithms really worked. I managed to implement the algorithms in Matlab, and I was able to step through them and look at the result of each iteration, but it always seemed a bit too complex to grasp.

This article, with its interactive elements, walks you through how sigmoid functions work and what they actually do.
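Here is a rough Python sketch of the single sigmoid neuron the demo lets you play with, assuming one input, one weight, and one bias. The names `sigmoid` and `neuron` and the specific numbers are mine, chosen only to show the effect: a large weight makes the output look like a step, and the step sits at x = -bias / weight.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, weight, bias):
    """Output of a single sigmoid neuron with one input."""
    return sigmoid(weight * x + bias)


if __name__ == "__main__":
    # Illustrative values: the step lands at -bias / weight = 0.4.
    weight, bias = 1000.0, -400.0
    for x in (0.0, 0.3, 0.39, 0.41, 0.5, 1.0):
        print(f"x={x:<4} output={neuron(x, weight, bias):.4f}")

    # A small weight gives a gentle slope instead of a step.
    print(neuron(0.5, 1.0, 0.0))
```

Changing the bias slides the step left or right, while changing the weight controls how sharp it is.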

Learning Python – Plans for a meal picker

I have some basic experience with Python — which is to say I’m a beginner. I want to become more fluent in it. So, I plan to write a toy script using Python to help me decide what to eat for breakfast, lunch, and dinner.

I want to start it off as simple as possible. I want to model it after a lunchtime restaurant picker I wrote (in Perl) when I was working at Amazon. I manually pre-categorized a list of restaurants, placing each one in tier-1, tier-2, or tier-3. The script would then randomly pick one restaurant from each tier, and then choose one out of those 3 using a weighted random number.
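A first pass at that logic in Python might look like the sketch below. The restaurant names, the tier weights, and the function name `pick_restaurant` are all placeholders I made up; the original Perl script may have weighted things differently.

```python
import random

# Hypothetical data: tier number -> restaurants in that tier.
TIERS = {
    1: ["Thai place", "Taco truck"],
    2: ["Sandwich shop", "Pho spot"],
    3: ["Food court", "Pizza by the slice"],
}

# Assumed weights: tier 1 is three times as likely as tier 3.
TIER_WEIGHTS = {1: 3, 2: 2, 3: 1}


def pick_restaurant(tiers, weights, rng=random):
    """Pick one candidate per tier, then choose among them by tier weight."""
    candidates = {tier: rng.choice(names) for tier, names in tiers.items()}
    tier_ids = list(candidates)
    tier_weights = [weights[tier] for tier in tier_ids]
    chosen_tier = rng.choices(tier_ids, weights=tier_weights, k=1)[0]
    return candidates[chosen_tier]


if __name__ == "__main__":
    print(pick_restaurant(TIERS, TIER_WEIGHTS))
```

Passing a seeded `random.Random` instance as `rng` makes the picks repeatable, which is handy for testing.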

Some simple improvements will come in the next iteration. I plan to perhaps incorporate what’s in my fridge and pantry. I’m not sure how to model that data and store it, though. It needs to be human-readable, so I’m thinking a text file. That’s how I stored my list of restaurants and their tiers. For my new script, I could write a frontend tool to manage this data, but that might be over-engineering this simple script.

Anyway, the purpose is just to get familiar with Python, so it doesn’t need to get too complex.

.bash_profile madness on OSX

I ran into some issues setting up my terminal correctly on OSX. Namely, whenever I opened a new tab in Terminal, my ~/.bashrc script didn’t get executed. So, I had no syntax highlighting, and I had none of my aliases.

It turns out that Terminal doesn’t run ~/.bashrc on its own. This StackExchange answer explains why:

Why doesn’t .bashrc run automatically?

Terminal opens a login shell, and a bash login shell reads ~/.bash_profile rather than ~/.bashrc. So I followed their advice and sourced ~/.bashrc from ~/.bash_profile.