Feature Hashing, or the “hashing trick”

Feature hashing, or the “hashing trick,” is a clever method of dimensionality reduction for NLP: it uses a hash function to map an arbitrarily large, sparse feature space (such as a vocabulary) into a fixed number of dimensions, with no dictionary to build or store. This blog post covers the fundamentals of how and why the hashing trick works on a large set of sparse vectors:

Hashing Language

Feature hashing is an elegant solution to the otherwise hairy problem of fighting the curse of dimensionality. It turned out to be extremely useful for a project I’m currently working on for a course at Columbia: Computational Models of Social Meaning.

scikit-learn has an implementation of the hashing trick (FeatureHasher and HashingVectorizer) if you’d like to read more about it.
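To make the idea concrete, here is a minimal sketch in shell. It is not how scikit-learn does it (that library uses a signed MurmurHash3); this version assumes the CRC from cksum is a good enough hash for illustration, and the function name hash_features is made up. Each token is hashed to one of N buckets and the tokens per bucket are counted, so any text becomes a count vector of fixed length N.

```shell
# Hashing-trick sketch: tokens on stdin -> "bucket: count" lines.
# Assumption: cksum's CRC stands in for a real feature hash.
hash_features() {
    n_buckets=$1
    # One token per line, then hash each token into a bucket index
    tr -s '[:space:]' '\n' |
    while IFS= read -r token; do
        [ -n "$token" ] || continue
        crc=$(printf '%s' "$token" | cksum | cut -d ' ' -f 1)
        echo $((crc % n_buckets))
    done |
    # Count how many tokens landed in each bucket
    sort -n | uniq -c | awk '{ print $2 ": " $1 }'
}

# Example: a 16-dimensional representation of a short sentence
echo 'the cat sat on the mat' | hash_features 16
```

Buckets that receive no tokens are simply not printed, which is the sparse representation. Two different tokens can collide in the same bucket; with enough buckets this costs little accuracy in practice, which is the trade-off the trick relies on.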

Git ignore

I recommend adding this .gitignore file as soon as you create a new Git repository:

Octocat’s .gitignore: Some common .gitignore configurations

As you might know, Git has a pretty complex ignore system, and a .gitignore entry has no effect on files that are already tracked in the repo. Adding this .gitignore file to your repo immediately after you create it might save you some headaches down the line.
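If you do end up needing to ignore a file that Git already tracks, one common approach is to remove it from the index while leaving it on disk. A minimal sketch; the helper name untrack and the file name debug.log are only for illustration:

```shell
# Stop tracking paths that Git already knows about, without deleting
# them from the working tree.
untrack() {
    git rm -r --cached --quiet -- "$@"
}

# Example (hypothetical file name):
#   echo 'debug.log' >> .gitignore
#   untrack debug.log
#   git add .gitignore
#   git commit -m 'Stop tracking debug.log'
```

After the commit, the file stays in your working directory but Git no longer reports changes to it.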

Setting up swap space on a Linux server

Recently, I set up an EC2 host. It was a smooth and relatively quick process.

However, the problem with the default image is that it doesn’t come with swap space. You can verify this by running:

swapon -s

I found this tutorial online that walks you through creating a swap file, setting it up, and using it as your swap space:

All About Linux Swap Space

That post doesn’t mention it, so I’ll say it here: you need to use sudo in front of each of those commands. The tutorial also skips a permissions step. When I enabled the swap file, swapon gave me this warning:

$ sudo swapon /swapfile
swapon: /swapfile: insecure permissions 0644, 0600 suggested.

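A swap file can hold pages from any process’s memory, so it should be readable only by root. I tightened the permissions like this: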

$ sudo chmod 0600 /swapfile

At the very end, you can check whether you were successful using the swapon -s command:

$ swapon -s
Filename    Type    Size     Used  Priority
/swapfile   file    1048572  0     -1
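For reference, the whole sequence can be sketched as a single function. This is a common way to do it and not necessarily the exact commands from the tutorial: I am assuming dd for allocation and a 1024 MiB default size, and the function name create_swapfile is made up.

```shell
# Sketch: create and enable a swap file.
# Defaults (/swapfile, 1024 MiB) are assumptions for illustration.
create_swapfile() {
    swap_path=${1:-/swapfile}
    size_mib=${2:-1024}

    # Allocate the file as zero-filled blocks (no sparse holes)
    sudo dd if=/dev/zero of="$swap_path" bs=1M count="$size_mib" &&
    # Restrict access before enabling, which avoids the swapon warning
    sudo chmod 0600 "$swap_path" &&
    # Write the swap signature, then turn the swap area on
    sudo mkswap "$swap_path" &&
    sudo swapon "$swap_path"
}

# Usage (needs sudo rights):
#   create_swapfile /swapfile 1024
#   swapon -s
```

Setting the permissions before running swapon means the "insecure permissions" warning never appears.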

Removing empty lines using `sed`

Consider this snippet from a file:






It’s a list of integers (the full file has 122 of them), but the delimiter is three newline characters. I needed to get rid of the empty lines in this file.

I thought of it as replacing three \n characters with one, so it seemed like a good place to use the sed command.

However, according to this FAQ for sed, a plain substitution can’t use \n to match the newline at the end of a line:

5.10. Why can’t I match or delete a newline using the \n escape sequence? Why can’t I match 2 or more lines using \n?

The \n will never match the newline at the end-of-line because the newline is always stripped off before the line is placed into the pattern space. To get 2 or more lines into the pattern space, use the ‘N’ command or something similar (such as ‘H;…;g;’).

Sed works like this: sed reads one line at a time, chops off the terminating newline, puts what is left into the pattern space where the sed script can address or change it, and when the pattern space is printed, appends a newline to stdout (or to a file). If the pattern space is entirely or partially deleted with ‘d’ or ‘D’, the newline is not added in such cases. Thus, scripts like

       sed 's/\n//' file       # to delete newlines from each line
       sed 's/\n/foo\n/' file  # to add a word to the end of each line

will never work, because the trailing newline is removed before the line is put into the pattern space. To perform the above tasks, use one of these scripts instead:

       tr -d '\n' < file              # use tr to delete newlines
       sed ':a;N;$!ba;s/\n//g' file   # GNU sed to delete newlines
       sed 's/$/ foo/' file           # add "foo" to end of each line

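Basically, sed reads the input line by line and strips the trailing newline before the script runs, so a simple substitution never sees the newlines. Collapsing multiple newlines into one would require pulling several lines into the pattern space with N, as in the FAQ’s GNU sed example.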

However, a regex trick I found was to detect empty lines and delete them:

sed '/^$/d' FILE

This regex looks for any line that ends immediately after it begins, then deletes that line. It worked perfectly for my needs.
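Here is a small self-contained demo of the empty-line deletion, plus an alternative using tr -s, which squeezes each run of repeated newlines down to one. The sample integers are made up, and the function names are only for illustration.

```shell
# Delete empty lines from the given files, or from stdin if none given
strip_empty_lines() {
    sed '/^$/d' "$@"
}

# Alternative: squeeze every run of newlines on stdin into a single one
squeeze_newlines() {
    tr -s '\n'
}

# Made-up sample: integers separated by three newlines each
printf '12\n\n\n7\n\n\n42\n' | strip_empty_lines
# prints 12, 7 and 42 on consecutive lines
```

The two differ at the edges: if the input starts with empty lines, tr -s still leaves one empty line at the top, while the sed version removes them all. Neither touches lines that contain only spaces or tabs.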

Using an older version of the JDK on Mac OS X

This StackOverflow thread, How to Revert to Java 1.6, gives instructions.

I ran this command and got this result:

prem$ /usr/libexec/java_home -v '1.6*'

That directory doesn’t seem to exist according to ls:

prem$ ls /Library/Java/JavaVirtualMachines
./ ../ jdk1.7.0_67.jdk/

However, running the given command to change the JDK actually seemed to work:

prem$ export JAVA_HOME=`/usr/libexec/java_home -v '1.6*'`
prem$ java -version
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-466.1-11M4716)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-466.1, mixed mode)

It’s still pretty mysterious to me how this works.
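One way to dig into it, as a sketch: java_home can list every JVM it knows about along with its install path, and those paths are not limited to /Library/Java/JavaVirtualMachines. My guess, and it is only a guess, is that Apple’s own Java 6 lives under /System/Library/Java/JavaVirtualMachines instead. The function names below are made up for illustration.

```shell
# macOS only: /usr/libexec/java_home is Apple's JDK locator.
list_jdks() {
    # -V prints every matching JVM together with its install path
    /usr/libexec/java_home -V
}

use_jdk() {
    # $1 is a version pattern such as '1.6*'
    jdk_home=$(/usr/libexec/java_home -v "$1") || return 1
    export JAVA_HOME="$jdk_home"
}

# Usage:
#   list_jdks
#   use_jdk '1.6*' && java -version
```

Comparing the paths printed by list_jdks with the ls output above should show where the 1.6 JDK actually sits.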

Riding with the Stars: Passenger Privacy in the NYC Taxicab Dataset

In my previous post, Differential Privacy: The Basics, I provided an introduction to differential privacy by exploring its definition and discussing its relevance in the broader context of public data release. In this post, I shall demonstrate how easily privacy can be breached and then counter this by showing how differential privacy can protect against this attack. I will also present a few other examples of differentially private queries.

The Data

There has been a lot of online comment recently about a dataset released by the New York City Taxi and Limousine Commission. It contains details about every taxi ride (yellow cabs) in New York in 2013, including the pickup and drop off times, locations, fare and tip amounts, as well as anonymized (hashed) versions of the taxi’s license and medallion numbers. It was obtained via a FOIL (Freedom of Information Law) request earlier this year and has been making waves in the…
