Text parser - shash7

On my university campus there are loads of free food events happening. Unfortunately, I tend to miss most of them because of my lecture timings.

I always wondered how many are happening on a particular day, because so many groups and societies organize them. Unfortunately, there is no central location where these events are posted.

So over the last week I created a Node.js app which collects the last 20 posts from these groups and pages using the Facebook API and searches them for keywords like ‘free’, ‘bbq’, ‘pizza’, ‘beer’, etc.

Here’s the tech behind it.

The architecture is a bit different from my other side projects. The frontend is a plain HTML page on BitBalloon, and it gets its data via a REST API exposed by my Node.js server hosted on OpenShift.
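To give a feel for that split, here is a minimal sketch of what such an endpoint could look like using only Node's built-in http module. The /events path, the handleRequest name and the in-memory events array are all assumptions made for illustration; the real server reads from the database and may be structured quite differently. Since the static page and the API live on different hosts, the sketch also sends a CORS header so the browser is allowed to read the response.

```javascript
const http = require('http');

// Hypothetical stand-in for the events that the real app keeps in its database.
const events = [
  { text: 'Free pizza at the Student Lounge at 12:30', score: 1.2 },
];

// Assumed route: GET /events returns every stored event as JSON.
function handleRequest(req, res) {
  if (req.method === 'GET' && req.url === '/events') {
    res.writeHead(200, {
      'Content-Type': 'application/json',
      // The frontend is served from another host, so allow cross-origin reads.
      'Access-Control-Allow-Origin': '*',
    });
    res.end(JSON.stringify(events));
    return;
  }
  res.writeHead(404, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ error: 'not found' }));
}

const server = http.createServer(handleRequest);
// server.listen(8080); // uncomment to actually serve requests
```

The static page could then call this endpoint with a plain fetch or XMLHttpRequest and render the list on the client.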

First, it gets the posts and scans them for relevant keywords. Then it assigns a score of 0.5 to most keywords it finds. Stronger keywords like ‘free’ get a score of 1.

Then all the scores are added up and a decreasing function is applied, so the total grows more slowly than the raw sum. So 0.5 + 0.5 + 0.5 is not 1.5 but rather 1.0022.

This is to combat keyword stuffing. It works surprisingly well too. The average score almost flatlines at about 1.6.
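To make the scoring step concrete, here is a minimal sketch. The names rawScore and dampen, the exact keyword list, and above all the curve inside dampen are assumptions: the actual decreasing function isn't given here, so this saturating exponential will not reproduce the 1.0022 figure. It only illustrates the idea that each extra keyword adds less than the one before.

```javascript
// Weights as described in the text: most keywords score 0.5, 'free' scores 1.
// The keyword list itself is only the handful of examples mentioned earlier.
const KEYWORD_WEIGHTS = new Map([
  ['free', 1],
  ['bbq', 0.5],
  ['pizza', 0.5],
  ['beer', 0.5],
]);

// Add up the weight of every keyword occurrence in the post.
function rawScore(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.reduce((sum, w) => sum + (KEYWORD_WEIGHTS.get(w) || 0), 0);
}

// Assumed dampening curve (NOT the app's real function): close to the raw
// sum for small values, then levelling off towards `cap`.
function dampen(raw, cap = 1.6) {
  return cap * (1 - Math.exp(-raw / cap));
}

function score(text) {
  return dampen(rawScore(text));
}
```

With a curve like this, a post that repeats ‘pizza’ twenty times can never score higher than the cap, which is the whole point of the anti-stuffing step.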

Then I filter out posts whose score is below 1. Finally, I run another function to extract the time and location from the text, and all that data is saved in the MongoDB database.
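The extraction step can be sketched with a couple of regular expressions. Both patterns and the function names extractTime and extractLocation are illustrative assumptions, not the code the app actually runs: the time pattern only understands clock times such as 11:30 or 5:00 pm, and the location pattern naively grabs the capitalised words that follow "at" or "at the".

```javascript
// Clock times like "11:30", "9:05" or "5:00 pm" (assumed format).
const TIME_PATTERN = /\b([01]?\d|2[0-3]):([0-5]\d)\s*(am|pm)?/i;

// Naive heuristic: capitalised words right after "at" / "at the".
const PLACE_PATTERN = /\bat (?:the )?([A-Z][\w'-]*(?: [A-Z][\w'-]*)*)/;

// Return the first time mentioned in the text, or null if there is none.
function extractTime(text) {
  const match = TIME_PATTERN.exec(text);
  return match ? match[0].trim() : null;
}

// Return the first place-like phrase in the text, or null if there is none.
function extractLocation(text) {
  const match = PLACE_PATTERN.exec(text);
  return match ? match[1] : null;
}
```

A heuristic this simple will miss plenty of real posts (dates, "noon", lowercase room names), so treat it as a starting point rather than a parser.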

Turns out, there is a lot of room for improvement.

Strike 1

The first implementation was not bad. It did work, but in the process it caught a lot of unwanted junk.

The best analogy I can give is that of a fishing ship which catches sardines but also catches a lot of other inedible fish. The only problem is that sardines make up less than 50% of the catch.

Take this sample post as an example:

There is FREE ENTRY for any University student to the Blue Diamond Stakes today.
Pre-drinks start at the Racecourse Hotel at 11:30 where you can drown all of your post-Beach Day blues..

It matched ‘free’ and ‘drinks’. Bam, strike #1.
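The false positive is easy to reproduce. The snippet below is a sketch with an assumed keyword set and a hypothetical matchedKeywords helper; it simply splits the post into lowercase words and keeps the ones that are keywords. Note how the hyphen in "Pre-drinks" is treated as a word break, so ‘drinks’ counts as a hit.

```javascript
// Assumed keyword set, built from the keywords mentioned in this post.
const KEYWORDS = new Set(['free', 'bbq', 'pizza', 'beer', 'drinks']);

const SAMPLE_POST =
  'There is FREE ENTRY for any University student to the Blue Diamond Stakes today. ' +
  'Pre-drinks start at the Racecourse Hotel at 11:30 where you can drown all of ' +
  'your post-Beach Day blues..';

// Return every keyword occurrence in the text, in order of appearance.
function matchedKeywords(text) {
  const words = text.toLowerCase().match(/[a-z]+/g) || [];
  return words.filter((word) => KEYWORDS.has(word));
}

const hits = matchedKeywords(SAMPLE_POST); // ['free', 'drinks']
```

Nothing in a bag-of-keywords match can tell "free entry" apart from "free food", which is why the threshold had to get smarter.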

Strike 2

There was clearly a need to refine the algorithm. The scoring system needed a revamp.

So we have two scores. One is the score we get from the parser and the other one is the compare score, which is a constant (1).

So: score > compare score (which means the post is about free food).

I realized the compare score needs to be refined. Having a static value of 1 or even 1.5 doesn’t change anything. Using our previous analogy, it just makes our net smaller or larger. We need to refine our net.

This time I observed there is a correlation between the length of a post and its actual score. So I made the compare score a dynamic value.

Now the formula to compute the compare score is like this:

var quantifier = 1;

var compareScore = quantifier + ((quantifier / 16) * postLength / 64);

This gives a value slightly greater than 1 if the post is short and a bigger value if the post is longer. Note that the growth is linear rather than exponential: with the quantifier set to 1, the compare score works out to 1 + postLength / 1024.
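Wrapped in functions, the dynamic threshold might look like the sketch below. The formula is the one given above; the function names, and the choice to measure post length as the number of characters in the text, are assumptions.

```javascript
// Threshold a post's score has to beat. Same formula as above, with the
// quantifier exposed as an optional parameter.
function compareScore(postLength, quantifier = 1) {
  return quantifier + (quantifier / 16) * postLength / 64;
}

// A post counts as "free food" only if its keyword score beats the
// length-dependent threshold. Using text.length as the post length is an
// assumption made for this sketch.
function isFreeFoodPost(score, text) {
  return score > compareScore(text.length);
}
```

For example, a 64-character post needs a score above 1.0625, while a 1024-character post needs a score above 2, so long announcements that merely mention a keyword or two no longer slip through.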

After putting this in production, it works like a charm. Now it filters out most of the useless posts.

Take a look at the app in production. I may have to pull it in the future.