Lessons Learned from Establishing an Internal Knowledge Base, Part II

In the previous post, I talked about Agolo’s search for a system to use as our internal knowledge base. We settled on DokuWiki in the end.

In this post, I’ll talk about the challenges and issues we faced after choosing to proceed with DokuWiki.

Installation

Installing DokuWiki was more challenging than expected. My teammate, Tom, did all of the installation and server setup work. Even though we used Bitnami to manage the installation, we still had some ramping up to do when configuring the server. None of us is intimately familiar with the LAMP stack (minus the M, in this case). So, we faced some challenges right away. We were okay with this because it’s just a one-time issue. Eventually, we overcame these issues and installed the wiki on an Azure virtual machine.

Plugins

We had to install a number of plugins to truly make the wiki useful. We knew this had to be done, and the plugins are one of the main reasons we chose Dokuwiki in the first place.

The challenge then became picking which plugins we should install. I spent some time browsing the list of the most popular plugins here.

These are the most useful plugins we’ve seen so far:

I consider these plugins to be absolutely essential to take full advantage of DokuWiki. Being able to paste images directly into the WYSIWYG editor and being able to list all pages under a namespace are essential. We have an “Add new page” widget on the wiki’s landing page, as well as an Indexmenu of the root namespace that allows us to see the high-level pages and namespaces.

However, we are careful not to install too many plugins. The more plugins we install, the more we run the risk of plugins negatively affecting each other. Also, if a plugin becomes deprecated, it just causes maintenance hassles when it comes time to update the wiki software.

 Backup

We do a nightly backup to a GitHub repo. It’s a simple shell script run as a cron job. All it does is a git add, commit, and push to a private repository on GitHub. Git is smart enough to only create a new commit when there are changes, so our private repo’s commit history stays relatively clean.

I’ve set it up so only our text and images get backed up. However, it should be possible to backup the entire DokuWiki directory from your virtual machine.

Markdown vs. Wiki Syntax

Almost all of my teammates are familiar with GitHub-flavored Markdown. Unfortunately, DokuWiki only supports Wiki syntax out of the box. These two markup languages are similar enough that it caused our users a lot of confusion and an unnecessarily steep learning curve.

The Markdowku plugin helps to reduce this friction, but it also causes some other issues. The WYSIWYG editor becomes unreliable — bulleted lists and bold vs. italics have conflicting markup in Markdown and Wiki Syntax.

The editor is already difficult to use in the first place (no live preview, and no undo after you click one of the buttons, for example). This conflict further complicates the matter and frustrates users.

Design

Wiki software’s UX feels outdated by at least a decade in 2016. So, it is absolutely imperative to install the Bootstrap3 template. Not only does it make the wiki look and feel much more modern, it also allows you to install themes from Bootswatch. This goes a long way in making users feel welcome when they use the wiki. Without a well-designed, modern, clean theme, users are immediately turned off and are less likely to look at the wiki.

This brings me to my final and most important point.

 Expectation Management and Learning Curve

Some intangibles became apparent once we installed the wiki. The most important takeaway is users are naturally disinclined to start using a new tool when they are already acquainted with an old one.

In this case, we had already been using a hodge-podge of Dropbox, Google Drive, Slack, and Evernote. We had sort of found a kludgy system that worked for us in most cases. It just wasn’t scalable, so I advocated strongly for wiki. While we got buy-in from everyone, it is simply difficult to get everyone to enthusiastically and consistently use the wiki. It takes time to use it to its full extent.

If the user has never used a wiki before, it can be intimidating, frustrating, and boring to learn all at the same time. It would feel like a waste of time to them when they could just create a Google Doc and Slack the link to someone.

I was excited when we first launched our wiki. Over the next couple of weeks, I saw that my coworkers weren’t using the wiki to its fullest potential. While they had a positive attitude toward it, they are not as excited as I was. I had to do some introspection to find out why I thought the wiki is an essential tool, and why my coworkers aren’t as enthusiastic. I came up with the following reasons.

First, I hadn’t taken the Network Effect into account. At my previous company, I quickly learned to use the internal wiki because it was actively read and contributed to. When everyone is already using the wiki on a daily basis to accomplish almost every non-trivial task, a new user is more likely to spend the time to learn it as well. The learning curve is not a huge deterrent because the rewards are tangible and obvious. However, because we are all ramping up on the wiki at the same time, each of us doesn’t see the full benefits from everyone else’s wiki usage yet. We are not yet reinforcing each other’s use of this important tool.

Secondly, I was involved in the research and installation of the company wiki, so I had some idea about DokuWiki’s internal workings. So, a concept like namespaces seemed natural to me. However, to a general user, the idea that a namespace can sometimes be a page but not always is quite confusing. In addition, page creation involves much more brainpower and typing when you make use of namespaces. This is in stark contrast with something like Evernote, where page creation is 1-click and doesn’t require the user to think of a hierarchy of namespaces. This is an instance of having to unlearn UX design patterns that have emerged in the last decade since wiki software was dominant.

Most importantly, Wiki has a relatively steep learning curve. This is especially true in 2016 when the wiki paradigm has so many competitors that have leapfrogged over it in terms of UX and ease-of-use. When a user tries to use a wiki software, they have to unlearn a decade’s worth of UX design patterns.

It takes a while for users to grok the idea of an interconnected web of documents as the model for a knowledge base. It’s much easier to think of it as a set of files inside folders (like Evernote, Google Drive, and Dropbox). If that’s what the user is expecting, then they will see wiki as a very poor substitute for the more modern cloud storage services. So, it becomes important to coach them and mold their thinking to see the advantages of a highly-scalable web of plaintext with an auto-generated table of contents, tagging, and easy hyperlinking.

If you are in charge of installing an internal knowledge base at your organization, make sure you invest time in coaching. Write some easy-to-understand introductory wiki pages, fill it with screenshots, and encourage everyone to read them. Create a sandbox page and a namespace for each of the existing users. Schedule 30 minutes for everyone to sit down and play with the editor and learn the syntax. Do your best to make the learning curve shallow. Do not underestimate the effects of a well-designed, modern theme and a helpful set of plugins. Make the experience as welcoming as possible for the users.

Lessons Learned from Establishing an Internal Knowledge Base

If you’re unfamiliar with the concept, a knowledge base is a collection of easily-searchable and well-organized documents to which anyone in the company can contribute. The purpose is to document institutional knowledge in a centralized place to which everyone has access.

At companies I’ve previously worked at, the internal wiki has been immensely useful for me. It was the first place I went to if I ran into any issues, technical or non-technical. If an answer didn’t exist in the wiki, I’d write a page myself. I would also create pages for myself and add my daily notes in the hope that it might help me or someone else someday. And it often did.

At Agolo, I advocated to start a knowledge base for our company. After a lot of investigation and deliberation, we decided to choose DokuWiki. I’m learning a lot from the process.

This blog post is an attempt to organize and distill the options we looked at before settling on DokuWiki. Hopefully, it helps to guide you through a similar process at your organization. In the next post, I will write about the challenges and lessons learned from using our DokuWiki.

The following are all the options we considered when picking a knowledge base software for our startup.

1. Wiki

I heavily advocated for this tool because of my previous experience with it.

We considered a number of alternative wiki offerings. First, we looked for a hosted solution that was preferably free. We looked at Wikia, but it didn’t offer private, internal wikis.

We looked at Gollum, and possibly creating an empty GitHub repository just for its wiki, but decided against it. It doesn’t have a large enough community or a library of plugins to choose from. Also, it doesn’t have the ability to tag pages as far as I could tell. But it was a very strong contender.

We also considered Confluence, but decided it was too complex and too expensive. We don’t use JIRA or any other Atlassian products, so we felt we wouldn’t be making full use of it for the amount of money it costs. If not for the cost, Confluence has everything we were looking for.

How.dy’s Slack Wiki looks ideal from what we could tell. It uses Markdown, it has a simple and modern UI, and it’s free. It has Slack integration and uses Slack for authentication. It also uses flat text files as its storage engine. It might not have search, so that was one drawback. The main deterrent to using it, however, was that it is not available to the public. The blog post says that it will be opened up, but I could not find a follow-up post that announces its availability.

And last but not least, we investigated using MediaWiki, which is what Wikipedia is built on. This is a very full-featured wiki software. It’s tried-and-true in its usage at Wikipedia. Most people are very familiar with its UI. The reason we didn’t choose it is because it is too powerful for our needs. We wanted our wiki to be lightweight, easy to install, and not too overwhelming. In addition, we wanted to avoid using a full-fledged database for the storage engine. So, MediaWiki was out.

2. Gitbook

One of my coworkers recently used Gitbook to write a book, and he was very enthusiastic about it. So, we considered having our knowledge base in Gitbook format as well. The idea was each page of the knowledge base would be its own chapter, or sub-chapter, or sub-sub-chapter, depending on where it fits in a global hierarchy of documents.

The advantages would be that it uses Github-flavored Markdown, which all of us are already pretty familiar with. It has a modern look-and-feel, which most wiki offerings do not. It also forces us to think in a hierarchical structure when creating pages.

However, I felt that it had too many drawbacks. I’d say that the forced hierarchical nature is a drawback in itself. Knowledge bases should have as little friction as possible for page creation. If creating a page meant thinking about where it fits in the global structure of the knowledge base, page creation would become less frequent.

In addition, moving from read mode to write/edit mode feels very sluggish — this transition has clearly not been optimized. When writing a book, an author doesn’t often switch between editing and reading. So, this makes sense for that use case. A knowledge base should be much more permissive of switching from reading to editing. This would encourage more participation.

Another disadvantage is that in my mind, a knowledge base should be littered with links to other pages in the knowledge base. In Gitbooks, linking to another chapter is not a frequent use case. This is because the paradigm it’s based on is books, which are read linearly from one chapter to the next.

Another big reason why I advocated against Gitbooks is because it is not scalable. Having one chapter per document might work in the first few months. However, as our company grows, so will our knowledge base. Having 500 chapters would become cumbersome if every new chapter had to fit into an existing hierarchy. Also, the list of chapters would become totally useless.

And finally, having a private Gitbook costs $7 per month.

So, we decided not to use Gitbook for our company knowledge base.

3. Evernote

Some of us are already heavy users of Evernote. So, we considered just creating a notebook where all of us would keep adding notes.

We already use Evernote as a collaborative tool for some specific purposes. For example, we store meeting notes in Evernote notebooks. This is made especially easy because some of us use the Scannable app to take a photo of a hand-written page of notes and make it searchable in Evernote. This is a huge advantage.

Another advantage of using Evernote is the extremely low friction of creating and editing notes. There is no notion of edit mode vs. read-only mode, so users will be encouraged to edit whichever page they’re reading. This is a highly desirable effect in a knowledge base. Also, because it is an application, it is extremely mobile-friendly in addition to being highly performant on our laptops.

However, I felt there is an inherent lack of structure when it comes to dumping everything into Evernote. This is on the opposite end of the spectrum from Gitbook. To me, it would cause more problems by making it too easy to create new pages — it would become more difficult to find old pages.

In addition, having every document in the knowledge base live under the same notebook also seems problematic. If it were possible to have a shareable hierarchy of notebooks, Evernote would be more viable in my eyes. Unfortunately, Stacks are not shareable. It would be easier to organize the knowledge base into smaller categories.

So, while Evernote is a strong contender, we decided against it.

4. Pivotal Bookbinder

We weren’t too familiar with the principles behind Bookbinder, and the documentation doesn’t seem to be detailed enough for us to get acquainted. In addition, the setup seemed like a barrier to entry to us.

In addition, we could not tell if this software supported the tagging and categorization of pages.

5. Sharepoint

In addition to being pricey, Sharepoint is also too heavyweight for our needs. We aren’t extensive users of the MS ecosystem. In addition, it seems too complex for our needs with a steep learning curve.

While Sharepoint also has an option to create a Wiki, it does not support Markdown. So, we did not want to make the commitment to using Sharepoint.

Screenshot 2016-05-10 21.19.25

My Experience at TechCrunch Disrupt Hackathon 2016

Last weekend, I participated in the TechCrunch Disrupt Hackathon in New York City. Here’s my demo.

Screenshot 2016-05-10 21.19.25.png

The story of how I got on that stage with that project is slightly more complicated.

I originally went to the hackathon as part of a team: me, Tom, Lowell, Shabnum, and Scott.

The hackathon took place at the Brooklyn Cruise Terminal, a very industrial-looking place.

We were one of the first teams to arrive, so we got to pick a good table.

Our project was in EdTech, and we called it Mindset. Scott has written about it here. My job was to set up and implement the Natural Language Processing backend server and its API endpoint. The application would send the server a syllabus, and the server would parse it into topics, tag each topic, extract dates and deadlines, and return a nice data structure with all of this information. Its topic extraction would be powered by IBM Watson’s Concept Insights API.

The hackathon began at around 1:30 PM on Saturday and the deadline to submit projects was 9:30 AM on Sunday. We worked on it without facing any real problems all through Saturday afternoon and into Saturday night.

The hackathon had tables for 89 teams total, plus a number of booths for sponsors.

Soon enough, it was past midnight. We were starting to get tired but we were fueled by three things: teamwork, our goal to complete the project, and caffeinated beverages.

IMG_1774.JPG

Still going strong at 2:15 AM.

Before we knew it, 4 AM rolled around. Some of my teammates went home to take naps or freshen up. Some found places to curl up. Regardless, we were driven by a singular purpose: submit our project before the 9:30 AM deadline and wow the judges at the 60-second demo.

Finally, at around 5:30 AM, we had completed what we’d set out to do! Our website was up and running, my NLP server was making calls to IBM Watson and interpreting the results correctly, and our backend server was fully functional and robust.

My team started prepping for the demo. I didn’t need to be involved, so I was left to my own devices. I was wide awake at this point, and I had around 4 hours to burn, so I decided to do some work on a project I’d been thinking about for a while.

I had been planning to make a Twitter bot that uses Agolo‘s API to summarize the contents of any URL you tweet at it. This would be a follow-up to a similar Slack bot that I created a few weeks ago. I thought to myself, what better time to get started on this project than 6 AM at the TechCrunch hackathon after having stayed up all night?

I got to work on it. I picked Python because I have some experience working with Tweepy, a Twitter library. I knew that I had to circumvent the 140-character limit somehow, so I had the idea to use images to display the summary. I used the Python Imaging Library (PIL) for that.

I set up the Twitter account, got my code running on my AWS server, and started testing it. I had to make a number of tweaks to the way I was using PIL in order to make the text look good enough to demo. PIL doesn’t automatically do word wrap, so I had to find a way to insert newlines into the text where it made sense.

Finally, with around 20 minutes left until the deadline, I hacked together a working Python script that could achieve my project’s goal!

I submitted it, deathly-tired, forgetting one important detail: submitting a project means I have to give a demo onstage. I was about to fall asleep, but this realization was a shot of adrenaline that kept me awake.

It was time for the demos to start.

IMG_1775.JPG

The auditorium from halfway back.

I sat in the audience and mentally prepared some things to say at my demo. I checked, double-checked, and triple checked that my project was working.

Meanwhile, my teammates Tom and Shabnum went up to present Mindset. They did a wonderful job despite the technical difficulties they faced. I was proud to see the end result of a long night of hard work being presented up onstage.

IMG_1776.JPG

Tom and Shabnum setting up the laptop for their demo.

They called up the next batch of presenters to wait backstage. There was a 10-minute break during this period, so I got to practice my speech a little and meet some of the other presenters.

Backstage at the control booth. The black wall with the green lights at the top is the stage’s backdrop.

Finally, I was next in line to go onstage, set up my laptop, and wait for the previous presenter to finish.

Photos taken just offstage. I was balancing my open laptop in one arm as I took these pictures. I probably should not have taken this risk.

Then, I presented. I don’t remember most of it. The lack of sleep, combined with the adrenaline, put me in a state where I was giving an impassioned presentation of my project instead of paying attention to the hundreds of faces looking at me from the audience.

I walked offstage and back to my seat. I came down from the rush, and my tiredness finally took over. It took a lot of effort to finish watching the rest of the presentations and the awards ceremony. Then, I finally stepped outside for the first time in many hours.

Finally, sunlight and fresh air. Well, as fresh as it gets in NYC.

It was sunny for the first time in a week. It was a strange feeling to finally feel the sun on my skin after many days, punctuated by an intense experience like that.

I somehow made my way home.

Then, I slept for 14 hours.

All in all, it was a really fun experience. It was like a marathon, but with my team to keep it light and make it enjoyable. My 6 AM decision to start working on my own project turned out well, but I wasn’t in my right mind when I made that choice. However, sleep-deprived-me chose to take a big risk instead of playing it safe, and that’s a lesson I can learn from him. My main takeaway is to challenge myself and push my boundaries whenever possible, because the reward is often underestimated and risk is often overestimated.

Tay, Twitter Bots, and the Value Alignment Problem

Recently, Microsoft launched a bot on Twitter that learns to speak from anyone who speaks to it. The results were disastrous on multiple levels:

Trolls turned Tay, Microsoft’s fun millennial AI bot, into a genocidal maniac

First, let’s look at some of the many reasons why it’s a bad thing this happened:

  • On a purely business level, this is a PR disaster for Microsoft. In the present-day culture of instant outrage, this was the perfect news story. The headline “Microsoft Builds Racist Robot” is a guaranteed clickthrough. It makes Microsoft look either evil, negligent, or incompetent.
  • On a user experience level, this bot makes wide swaths of the population feel excluded and attacked. That’s simply bad UX.
  • On an infosec level, it has wide-open attack vectors. The most obvious one is you can get it to tweet anything if you use the following phrase: “repeat after me.” It’s the most obvious injection attack vector possible.
  • On a social level, hate speech already has enough of a platform. This bot was turned into an amplifier for the most deplorable parts of humanity.
    • Even if it were just some pranksters from 4Chan messing with the bot as a joke, it has the unintended side-effect of making visible hateful, fringe viewpoints beyond their proportional representation in society.
    • It was promoted by Microsoft, one of the largest corporate presences in the world by any measure.
    • It was on Twitter — a platform not only with millions of users, but one that’s closely watched by every news agency and blog to be amplified to audiences worldwide.

As bots and other self-sustaining agents become more prevalent in day-to-day life, they absolutely need to deal with these issues.

Why did this happen? How can we avoid this?

For something more clear-cut, let’s take a look at a similar snafu that happened with Google Photos last summer:

Google Photos Mistakenly Labels Black People ‘Gorillas’

This algorithm did not have the knowledge of context and history and racial issues that a human would have. It was simply working with a collection of training data and statistical models. In essence, it was matching new input with its knowledge of old input and producing the most probable output. As with all statistical modeling, it has some error rate, and just like sometimes, it would mislabel a chair as a stool, it mislabeled this input as well.

It’s tempting to say that algorithms are neutral.

They are not.

Machine learning algorithms, by definition, are biased. They have to be. If they were neutral, they would have no better results than flipping a coin. They have to have bias built into them. What builds this bias into the statistical models? Training data and those who design the algorithm. And, as much as we in the software industry would like to believe otherwise, both of those things have complicated relationships with the real world.

Data is not a perfect representation of the real world. A dataset is highly dependent on the choices, conscious and unconscious, made by those who collected the data. If your training data seems comprehensive (e.g. every photo indexed by Google Image Search), that’s when you have to be careful.

How do you know you have enough photos of dark-skinned people to be able to distinguish them from animals that they’ve been historically associated with as a means to oppression and dehumanization? If your test data is equally biased, you can’t, until it blows up out in the real world when a real-life dark-skinned person tries to use it. This is especially true if your team of computer scientists, data scientists, and software engineers are full of people who don’t have experience with these issues socially. That brings us back to Tay.

Here is a breakdown of Tay’s failures in the context of a larger culture where these issues are generally not visible:

The Ongoing Lessons of Tay

Particularly, take a look at this quote:

A long time ago, I observed that there are hundreds of NLP papers on sentiment classification, and less than a dozen on automatically identifying online harassment. This is how the NLP community has chosen to prioritize its goals. I believe we are all complicit in this, and I am embarrassed and ashamed.

This is a consequence of the free market. There is a business demand for sentiment analysis tools (to classify customer reviews of products as positive or negative, for example), but no demand for anti-harassment technology. Research with an immediate business impact is prioritized over research with long-term social and business (PR) consequences. The skeptical response is: “Why is this bad in the long run? Why not let the free market take care of it? If ethical algorithm design becomes a priority, it will automatically become prioritized.”

I’m not convinced this is true.

This line of thinking follows the ideology of utilitarian ethics, which has many problems of its own. For example, take a look at this article. You can justify a lot of morally unsound behavior and decisions with utilitarianism.

Another reason we should not always let market forces rule public goods (like society’s body of research and publically available algorithms) is because it is a short-sighted force of nature. As humans, we should have more of an interest in our long-term survival. Here are some situations where the free market has, is, and will fail us:

  • environmental concerns
  • sustainable energy usage for the long term
  • market bubbles and crashes, ruining individual lives
  • child labor
  • investment in space travel for us to become a multi-planetary species to reduce the chances of annihilation

The free market has worked mostly well for us until now. However, the lack of focus on the long term is troubling especially now that we live in such an abstract, accelerating world. Each individual has far-reaching powers unimaginable to anyone even half a century ago. We are inching ever-closer to creating algorithms that have significant impact on our day-to-day lives. This brings us to the Value Alignment Problem.

Here is the Arbital page for Value Alignment Problem. In essence here is what it is: How do we design systems (particularly self-sufficient software systems such as AGI) that has motivation to do its best to help humanity? How do we align its values with values possessed by the best of our species (for a well-thought-out definition of “best”)?

In the far (but not too far) future, this issue will suddenly become an emergency if not dealt with now. The Machine Intelligence Research Institute (MIRI) is starting to tackle some of these problems, but the free market is not.

The free market is not set up to deal with issues like the Value Alignment Problem. It needs to be solved by forces outside the market. Government is the most obvious candidate, but a government run by the governed often has trouble solving large, abstract problems. Maybe we need more organizations like MIRI. Maybe we need more individuals willing to get involved in civic hacking even as just a hobby. I don’t know what the solution is but I do know the market will have nothing to do with it until it’s too late.

Let’s get back to Tay. What should the Tay team have done differently?

Tay is a relatively simple Twitter bot. Twitter already has a tight-knit, conscientious community of botmakers, all of whom already deal with ethical questions pretty well. The easiest thing in the world for Microsoft to do would have been looking into prior art before creating a Twitter bot. Here is an article containing interviews with some of the more prominent botmakers:

How to Make a Bot That Isn’t Racist

Microsoft’s engineers failed to do their due diligence before launching Tay, and this failing points to much larger issues that we are all about to face.

 

GitHub for reviewing code

At my company, we’ve been using GitHub’s builtin code review tool. I totally agree with this assessment. GitHub would reap huge dividends by focusing more UX resources on their code review functionality.

Wrong Side of Memphis

A couple of weeks ago we started (in my current job) to use GitHub internally for our projects. We were already using git, so it sort of make sense to use GitHub, as it is very widespread and used in the community. I had used GitHub before, but only as a remote repository and to get code, but without much interaction with the “GitHub extra features”. I must say, I was excited about using it, as I though that it will be a good step forward in making the code more visible and adding some cool features.

One of the main uses we have for GitHub is using it for code reviews. In DemonWare we peer-review all the code, which really improves the quality of the code. Of course, peer-review is different from reviewing the code in an open software situation, as it is done more often and I suspect…

View original post 1,552 more words

The temptation of arrogance

I wanted to change the default alert sound for Calendar events on OSX Yosemite. I found a thread on discussions.apple.com about it. First, I was surprised to find there’s no way to customize the sound effect in the UI. Second, I was shocked to find this reply in the thread:

Let’s think about this.

This user is sincerely making the case that updating a config value in an obscure file (XCode recommended, by the way, because we don’t know the side effects) is a better solution than selecting an option from a menu in the Calendar app.

I’ve also seen this kind of arrogance in StackOverflow threads and other technical forums. When someone asks how to do something, someone replies to ask, “Why would you ever want to do that?”

It’s always disappointing to me to see technical knowledge being wielded as a weapon like this. A lot of people in the tech community lack the tiny amount of discipline required to act politely toward those who know less. It casts all of us in a bad light.

Data-Driven Product Development

Data Driven Products Now!

Here’s a video of this talk. In this blog post, I will be using some screenshots from the above presentation.

In this presentation deck, former Etsy developer Dan McKinley outlines two very important ideas:

  1. Data-driven strategies of project management
  2. Data-driven tactics of product management

First, the Agile-esque process of iterative development using prototyping and A/B testing at key milestones is an interesting approach. This is difficult to do in both small and large companies due to different reasons. In small companies, the resources required for the discipline of this kind of development is too big. At large companies, a single developer or product manager would not have enough control to apply this process in a realistic way, especially given constraints from other teams, QA, designers, and management.

Screenshot 2015-12-17 00.11.13.png

 

This is a great ideal to shoot for. But personally, I don’t know how plausible it is. It requires buy-in from everyone involved. McKinley even briefly mentions this when he talks about discussions with his designer, in which he promises to polish up the prototype in the second phase (“Refinement”) of development.

The second great concept in this talk is the tactical use of simple arithmetic and statistics to make decisions about products and features. While the previous idea’s concern is the quality of a particular product in development, this idea pertains to the nitty-gritty of picking which products to develop in the first place.

Screenshot 2015-12-17 00.15.25.png

This back-of-the-envelope calculation, he outlines, has saved him from spending weeks or months in design, development, testing, deployment, and analysis. This idea seems fundamental and possibly obvious to experienced product managers. However, it is tempting to bypass the rigor this level of analysis requires when everyone involved is swept up in the excitement of a cool new feature.

While these two ideas a vital enough in themselves, this presentation gives us this diagram as icing on the cake:

Screenshot 2015-12-17 00.21.54.png

These charts demonstrate the change in priorities that must happen as a company grows. There are two reasons for this necessity:

  1. Risk mitigation – reducing the number of moonshot ideas being implemented
  2. The absolute importance of using data to make product decisions

And, as McKinley mentions at the beginning of the talk, a dangerous pitfall is to mis-categorize these three by forming an opinion and then finding some pieces of data to back it up. The approach should have more scientific rigor: it should start with a dispassionate hypothesis based on prior data, and through the process of prototyping and A/B testing, the hypothesis should either stand and be refined, or easily discarded after a falsification.

Installing a cursor position custom plugin in Sublime Text 3 on OSX

I couldn’t find any useful resource about how to install a custom Python script plugin in Sublime Text 3. It took me a while to figure it out, but the solution seems easy in retrospect.

In Sublime Text, the bottom right corner of the window displays the line and column position of the cursor. For example, it might say:

Line 11, Column 35

..to denote that the cursor is currently at the 35th character on the 11th line of the file.

My goal was to get a counter that shows an absolute counter of the cursor’s position from the beginning of the file. I wanted this:

Character 1324, Line 11, Column 35

After some digging around on the internet, I found this plugin on GitHub, which was actually taken from this post on StackOverflow.

However, all the information about how to actually install this on OSX seemed to be outdated. They wanted me to save the Python script as a file in directories that did not exist. If I created the directories required and restarted Sublime Text, I would either see no effect or get an ugly error message.

The solution turned out to be simple. I just had to open Sublime Text, and find the right menu item:

Tools > New Plugin...

Then, I just pasted the script from GitHub into the window and saved it as charPosition.py.

Now, the bottom-right corner of my Sublime Text window looks like this:

Screenshot 2015-11-20 21.13.52

Is Software Engineering Really Engineering?

Programmers: Stop Calling Yourselves Engineers

This article raises some valid points about whether or not software engineering really is engineering.

While software does have some overlap with traditional engineering disciplines, I believe that it’s fundamentally different because it usually doesn’t involve physical materials. Software is similar to mathematics in that it’s only one step removed from pure ideas. As Fred Brooks wrote in Mythical Man-Month:

The programmer, like the poet, works only slightly removed from pure thought-stuff. He builds his castles in the air, from air, creating by exertion of the imagination. Few media of creation are so flexible, so easy to polish and rework, so readily capable of realizing grand conceptual structures…

This allows for practices like Agile development and continuous deployment. Combined with the rise of scripting languages in the past few years (NodeJS and Python powering backends, for example), it has become easier to deploy poorly-engineered software to production. That’s not necessarily Agile’s or scripting languages’ fault. It’s just a consequence of a diminished need for rigor.

Test-driven development, extreme programming, and other approaches reintroduces some of the rigor, but I don’t believe that’s enough. These processes are difficult to faithfully implement in a real-world setting, which waters down their effectiveness.

I believe the key to solving this issue is a change in attitude on the part of software engineers. Even though it’s easy to give into poor quality production software and loose standards for unit testing and integration testing, software engineering teams and their managers need to take a stronger stance on shipping quality products. The emphasis on design in the current Apple-dominated consumer tech world is a great model to emulate. We should be as insistent on good software as we are on good design. In the end, shipping high-quality software is rewarding for both users and engineers.

Feature Hashing, or the “hashing trick”

Feature hashing, or the “hashing trick,” is a clever method of dimensionality reduction that uses some of the important aspects of a good hash function to do some otherwise heavy lifting in NLP. This is a good blog post with the fundamentals of how and why the hashing trick works when working with a large, sparse set of vectors:

Hashing Language

Feature hashing is an elegant solution to the otherwise hairy problem of fighting the curse of dimensionality. It turned out to be extremely useful for a project I’m currently working on for a course at Columbia: Computational Models of Social Meaning.

Scikit-Learn has an implementation of the hashing trick if you’d like to read more about it.