Introducing HNInstant

[Image: Hacker News Instant logo]

I recently launched my new site HNInstant, and these are the nuggets of knowledge that I picked up.

What I learned

Everything I learned on this project is beyond the scope of this article because it’s just way too much, so I’ll hit the key non-technical points.

Build something and show it to the world

This is key. This is the first time I’ve ever gone all the way through on a project and put it up on Hacker News, and watching the points and feedback stream in was indescribable. Nothing could take my eyes away from the screen as the points and comments racked up. Even the stinging comments from British people bashing the “colour” of my website felt amazing (nothing against British people, pretty much everyone bashed the color of my website). The 4 weeks of hard work paid off in those 2 hours. This is why it is great to be an engineer. Find a need and fill it.

Bust through the lulls

Every project has these, and I’ve hit them many times. It’s right when you finish coding all the “fun” stuff and you have to start the “boring” stuff. It reminds me of the days when I had gymnastics practice 6 times a week and my dad would tell me it’s the days you don’t want to go that you get stronger. In the same vein, pushing through all the boring and monotonous days of making uninteresting changes will open the door to success. For me, it was making small design changes, writing the about page (always extremely painful to write), making the webpage compatible with all devices, and last but definitely not least, fixing darned bugs. Whatever it is, push through it! I like writing a list of all the boring stuff that needs to get done and sticking it to my monitor. That way, when I have a spare moment or two, I can knock a few things off the list.

Make sure your website is pristine before posting to HN

I learned this one the hard way. People have no shame digging into the flaws of your project (as it should be). You have one shot on HN, so do it right.

Appreciate

Bask in the glory of your hard work. There is nothing more satisfying. Then fix the site.

What I used:

Host – Me

I wiped my old desktop and installed Ubuntu Server on it. This was nice because I didn’t have to pay ridiculous amounts to keep the site up. I’ll write about getting a server up and running in a later post, since it wasn’t exactly trivial, but here’s a good guide to get you started.

Framework – Django

An easy choice for me since I was already familiar with it. Django is very easy to set up and quick to develop with, which was exactly what I was looking for.

Database – MongoDB

This was a tough call since I’d never used it before, and after reading this, I was unsure. However, I clearly wasn’t going to stress MongoDB as much as other sites do, and since I was looking to store a tree as my backend data structure, it ultimately seemed like the obvious choice.

How I did it:

The scraper

The scraper was the most essential piece of my site, since it was the program that analyzed the data and organized it properly.

Parsing and discovering the relevance of each article was the most difficult part for me. I’ve had limited exposure to natural language processing in my career at Stanford (thank goodness I found Python’s nltk package). After opening a link scraped off the site (which I did using PyQuery to grab the score, link, and title), I used a simple parser to extract the content rather than sidebar information and such. To do this, you can read the article here. Simply put, it measures the text-to-HTML-tag ratio to decide whether a block of text is important or not (more text than HTML markup usually means it’s content rather than a navigation bar).
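
To give a feel for the heuristic, here’s a minimal sketch in Python. BeautifulSoup and the 0.5 threshold are illustrative stand-ins, not what my parser actually used:

from bs4 import BeautifulSoup

def extract_content(html, threshold=0.5):
    # Keep blocks whose visible text dominates their markup.
    soup = BeautifulSoup(html, 'html.parser')
    kept = []
    for tag in soup.find_all(['p', 'div', 'td']):
        text = tag.get_text(' ', strip=True)
        total = len(str(tag))  # visible text plus the surrounding markup
        if total and len(text) / float(total) > threshold:
            kept.append(text)
    return '\n'.join(kept)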

Once I had all the text, I used the nltk package to tag every word with its part of speech. After tagging, I kept only the nouns. Then I stored the stemmed version of each word, along with the original word and its frequency, in a dictionary. I repeated this same process with the titles of the articles. This part I could definitely use some guidance on, so any suggestions would be welcome.
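
In code, that step looks roughly like this (a minimal sketch, assuming nltk’s tokenizer and tagger data are downloaded; the exact dictionary layout here is illustrative):

import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def noun_counts(text):
    # Tag every word with its part of speech, then keep only the nouns.
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    counts = {}
    for word, tag in tagged:
        if tag.startswith('NN'):  # Penn Treebank noun tags: NN, NNS, NNP, NNPS
            stem = stemmer.stem(word.lower())
            entry = counts.setdefault(stem, {'word': word, 'freq': 0})
            entry['freq'] += 1
    return counts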

Once I had parsed all my data, I constructed a trie (a prefix tree) to store it in. The node structure was as follows:

# node structure
node = {
    '_id': <auto-generated id from MongoDB>,  # -1 for all root nodes
    '_char': <the character of the node>,
    '_parent': <the parent node's id>,
    '_children': <array of (child_character, id) tuples>,
    '_docs': <array of doc dictionaries>,
}

# doc dictionary
doc = {<title>: [<link>, <score>, <word>, <timestamp>]}

Then to store a word I’d simply walk down the tree. For example, suppose I was storing the word ‘web’:

‘w’ – search for the character ‘e’ in the children array of ‘w’; if it exists, load it, and if it doesn’t, create a new child node ‘e’
‘e’ – search for the character ‘b’ in the children array of ‘e’; if it exists, load it, and if it doesn’t, create a new child node ‘b’
‘b’ – no more letters in the word; if the node exists, add the doc dictionary to its _docs, and if it doesn’t exist, create a new node and add the doc dictionary to its _docs
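
Here’s a rough sketch of that walk (assuming pymongo; the database and collection names are made up, and for simplicity this version finds each child by querying on _parent and _char instead of scanning the parent’s _children array):

from pymongo import MongoClient

nodes = MongoClient().hninstant.nodes  # hypothetical database/collection names

def insert_word(word, doc):
    parent_id = -1  # -1 marks root nodes, per the structure above
    for ch in word:
        node = nodes.find_one({'_parent': parent_id, '_char': ch})
        if node is None:
            # The child doesn't exist yet, so create it...
            new_id = nodes.insert_one({
                '_char': ch,
                '_parent': parent_id,
                '_children': [],
                '_docs': [],
            }).inserted_id
            # ...and record the (child_character, id) tuple on the parent.
            if parent_id != -1:
                nodes.update_one({'_id': parent_id},
                                 {'$push': {'_children': [ch, new_id]}})
            parent_id = new_id
        else:
            parent_id = node['_id']
    # No more letters: attach the doc dictionary to the final node.
    nodes.update_one({'_id': parent_id}, {'$push': {'_docs': doc}})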

In addition to the trie, I also used another table: a collection of nodes containing just the title, score, link, and timestamp. This was used by the “what’s hot” button to quickly grab the highest-scoring articles.

The server side code

The server side code simply handled two types of requests: an ajax call from a search, and an ajax call from the “what’s hot” button.

To handle a search, I would first break up the query and stem all the words. Then, for each word, I’d traverse the trie. If nothing was found, I’d return nothing; if I landed on a node with no docs, I would find the nearest node using BFS (this is how I got my suggestions). Repeating this gave me a resultant set of docs for each word in the query. After processing all the words, I’d take the intersection of the sets to get the titles that match every query term, then simply hand the data over to the front end. This is convenient since it takes O(n), where n is the length of the query, to look up all the words, and intersecting the resulting sets is cheap. It made for a very quick lookup, but I fear it’ll scale badly: I have to return all the results, since it’s hard to return a minimal set while keeping track of which articles have already been displayed. In other words, it’s really hard to do pagination when I’m working with sets rather than an ordered data structure.
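
The gist of the handler, as a sketch (lookup_docs is a hypothetical stand-in for the trie walk plus the BFS fallback described above):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def search(query, lookup_docs):
    result = None
    for word in query.split():
        stem = stemmer.stem(word.lower())
        docs = lookup_docs(stem)  # set of doc titles matching this stem
        if not docs:
            return set()  # a term with no matches means no results at all
        # Intersect so every query term has to match.
        result = docs if result is None else result & docs
    return result if result is not None else set()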

The “what’s hot” button used the other table, where the nodes contained only the title, link, score, and timestamp. This made it easy to run a simple query that grabs the highest-scoring articles from the past two weeks, limited to 100 results. This feature proved pivotal since it gives the site a function even when the user doesn’t know what to search for.
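
Roughly, that query looks like this (a sketch assuming a hypothetical articles collection with timestamps stored as Unix seconds):

import time

def whats_hot(articles):
    # Highest-scoring articles from the past two weeks, capped at 100 results.
    cutoff = time.time() - 14 * 24 * 60 * 60
    return list(articles.find({'timestamp': {'$gte': cutoff}})
                        .sort('score', -1)
                        .limit(100))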

The front end

This was easy thanks to jQuery. All that was needed was an ajax request to the server whenever someone typed a letter or hit the “what’s hot” button. It was also necessary to limit the number of requests in a given time period in case the user typed quickly: I’d send an ajax request at most every 0.3 seconds, and anything faster than that would wait until the typing slowed or stopped. I also did a few cool tricks with CSS by looking at a few blogs here and there.

Future plans:

  • Add an option to swap the orange and the white (many complaints on this)
  • Add link to the comment thread
  • Find more relevant articles when a user searches
  • Fix various bugs
