Finding Nearby Points of Interest with the Google Places API

Demo @ http://www.gurchet-rai.net/places/
 
The Google Places API is a web service that returns yelp-like information (ratings, reviews, contact info) based on your location. Using the HTML5 geolocation API described in my last post, I’ll go over how I made a simple little app that lets you view details for places near you.

Getting a Map Near You

The Google Maps V3 Javascript API can be used to easily get a map centered around your location.

First, include the maps API and the places API:

1
<script type="text/javascript" src="http://maps.googleapis.com/maps/api/js?libraries=places&sensor=false"></script>

Using the coordinates from the geolocation API (described here), construct a map as follows:

1
2
3
4
5
6
var loc = new google.maps.LatLng(coords.latitude, coords.longitude);
var map = new google.maps.Map(document.getElementById("map_canvas"), {
              mapTypeId: google.maps.MapTypeId.ROADMAP,
              center: loc,
              zoom: 13
});

Note that the dom element (#map_canvas) must have a defined width and height.

Getting Local Points of Interest

Now that we have a map, we can integrate the Google Places API and build a localized search. The following will get up to 20 results within 5 km of the maps center point (paging is ignored for this example).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
var resultList = [];
var service = new google.maps.places.PlacesService(map);  
var request = {
    location: map.getCenter(),
    radius: '5000',
    query: <search string>            
};

service.textSearch(request, function(results, status, pagination){
    if (status == google.maps.places.PlacesServiceStatus.OK) {
        resultList = resultList.concat(results);
        plotResultList();
    }
});

Plotting the Result List

To plot the result list from above, I wrote the following function to drop marker overlays.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
var overlays = [];
       
function plotResultList(){
    var iterator = 0;                  
    for(var i = 0; i < resultList.length; i++){
        setTimeout(function(){
                   
            var marker = new google.maps.Marker({
                position: resultList[iterator].geometry.position,
                map: map,
                title: resultList[iterator].name,
                animation: google.maps.Animation.DROP
            });

            overlays.push(marker);
                   
            iterator++;
        }, i * 75);
    }
}

Show Additional Details for each Place

The Google Places API can be used to fetch information like contact info, ratings, and reviews. Here’s some code to overlay some of this info when a marker is clicked.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
var infoWindow = new google.maps.InfoWindow();     
google.maps.event.addListener(marker, 'click', function() {
    infoWindow.close();
    var request = {
        reference: resultList[iterator].reference
    };

    service.getDetails(request, function(place, status){
        var content = "<h2>" + name + "</h2>";
        if(status == google.maps.places.PlacesServiceStatus.OK){    
            if(typeof place.rating !== 'undefined'){
                content += "<p><small>Rating: " + place.rating + "</small></p>";
            }    
           
            if(typeof place.formatted_address !== 'undefined'){
                content += "<br><small>" + place.formatted_address + "</small>";
            }
           
            if(typeof place.formatted_phone_number !== 'undefined'){
                content += "<br><small><a href='tel:" + place.formatted_phone_number + "'>" + place.formatted_phone_number + "</a></small>";                                
            }
           
            if(typeof place.website !== 'undefined'){
                content += "<br><small><a href='" + place.website + "'>website</a></small>";
           
            }
        }                            
       
        infoWindow.setContent(content);
        infoWindow.open(map, marker);        
    });
});

Getting a Users Location using the HTML5 Geolocation API

Demo @ http://www.gurchet-rai.net/geo/
 
I spent some time today fiddling around with the HTML5 Geolocation API for an upcoming project I’m working on and I thought I’d share some stuff I learned. In this post, I’ll briefly go over the API and the variability in accuracy I experienced when testing on a few different devices/browsers.

Using the API

The Geolocation API is a high-level client-side API that returns a users longitude and latitude using the best available location information source (GPS, cell IDs, RFID, MAC address, Wi-Fi, bluetooth, or IP address). Here’s some code to get a users location:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
<script type="text/javascript">
try {
    if (navigator.geolocation) {
        /*
            getCurrentPosition takes up to three arguments:
                a successCallback
                a errorCallback (optional)
                PositionOptions (optional)

            PositionOptions is an object with three properties:
                enableHighAccuracy (boolean - default is false
                                    if true, response times may be slowed
                                    and battery consumption (on mobiles) may
                                    increase)
                timeout (in ms - default is infinity)
                maximumAge (in ms - default is 0
                            if > 0, the API tries to fetch a cached position)
        */

        navigator.geolocation.getCurrentPosition (function(position){
            /*
                the position object contains a 'coords' and 'timestamp' object;
                coords contains:
                    latitude
                    longitude
                    accuracy (in metres with 95% confidence)
                    altitude (in metres)
                    altitudeAccuracy (in metres)
                    heading (degrees clockwise relative to true north)
                    speed (in m/s - more relevant for the watchPosition function
                           which is not described here)
            */

        },
        function(error){
            /*
                The error object has a 'code' and 'message' property
                error.code can be:
                    1 (permission denied),
                    2 (position unavailable),
                    3 (timeout)        
            */
     
        }
    } else {
        /* browser doesn't support geolocation */  
    }
} catch (e) {
 /* error handling here */
}
</script>

Error Handling

Several errors need to be handled when using the geolocation API.

A permission denied error (error code 1) may occur if the user denies the location lookup request. Note that this error will also be triggered if the user chooses ‘always deny’ on an alternate website making a geolocation request.

A position unavailable error (error code 2) may occur if the call to navigator.getCurrentPosition() fails. I was able to trigger this error on my phone by disabling Google Location Services.

Finally, the timeout error (error code 3) will occur if PositionOptions.timeout (if set) is exceeded.

Accuracy

I experimented with the geolocation API on a few different browsers and devices. On my laptop, Chrome, Firefox, and Safari gave comparable results with an accuracy within 20 meters. IE, on the other hand, seemed to be be off by about 1.6 km. I was able to get down to within 5 meters on my cell phone (Galaxy Nexus) with GPS enabled.

Integrating with Google Maps

For the purposes of this demo, I used the coordinates returned by the API to plot the users current location using the Google Maps Static Maps API. You can see a demo of this here.

News Aggregator Updates – Getting Tweets Related to Article Content

App @ http://www.gurchet-rai.net/news

Just a small update today – the news aggregator I built a few weeks ago can now retrieve tweets related to each news cluster (just click the small bird :-) ).

All I’m doing is taking the three most important words in each cluster (generated via tf-idf weighting) to build a search query that I use to search Twitter by making a YQL client-side call.

I’m also experimenting with OpenCalais to get a richer set of semantic metadata. I’m planning on using the additional metadata to allow browsing by topic/event type, people/organizations involved, and location. Although the metadata isn’t being used for anything useful right now, you can see some of it in the markup (span class=”meta”).

Syntactic Clustering of News Headlines

App: http://www.gurchet-rai.net/news/
Source on github

I’ve been using Google Reader to keep up with around 75 RSS feed subscriptions. Unfortunately there’s a lot of overlap between articles, especially articles from traditional news outlets. Google News groups similar articles so I wanted to see if I could replicate this behaviour for my RSS feeds (Yea, I know that I could just grab the RSS feed straight from google news or any other aggregator but where’s the fun in that ;-) ).

Here, I look at the syntactic similarity between RSS items from 31 news outlets in an attempt to categorize the news by subject.
 

Fetching RSS Feeds

To get my feeds, I used the YQL rss.multi.list data table. which parses any RSS/atom feed and standardizes it into XML, HTML, or JSON. I retrieved my feeds in XML then used XPath to parse individual RSS feed items, stripped any HTML, and built RSSItem objects that I could easily manipulate programmatically.
 

Preprocessing

For each RSSItem object, I built a length-normalized map of word frequencies for the 20 most ‘important’ words. Instead of finding the frequency of words as-is, I stemmed each word with a Porter stemmer which reduces derived/inflected words to their root form. I then evaluate the ‘importance’ of each word – if the word occurs frequently in a document but relatively infrequently in the entire corpus, then its important; if the word occurs frequently across many documents, its unimportant (this is the general premise of tf-idf weighting).
 

Clustering

To cluster feeds together, I take the intersection of the most important words for each feed. If each document shares at least 4 ‘important’ words (works out to ~20-40% of ‘important’ words shared), they’re considered to be on the same topic. Here’s the code that does the clustering:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
public static List<DocumentCluster> cluster(List<DocumentCluster> clusters){
    return cluster(clusters, MIN_WORDS_MATCH, 5);
}

private static List<DocumentCluster> cluster(List<DocumentCluster> clusters, int minMatch, int numIterations){
    if(numIterations < 0){
        return clusters;
    }
    int numClustered = 0;
    for(int i = 0; i < clusters.size(); i++){
        DocumentCluster c1 = clusters.get(i);
        if(c1.size() > 0){
            for(int j = i + 1; j < clusters.size(); j++){
                DocumentCluster c2 = clusters.get(j);
                if(c2.size() > 0){
                    int numShared = getNumSharedWords(c1,c2);
                    if(numShared > minMatch){
                        c1.merge(c2);
                        c2.removeDocuments();
                        numClustered++;
                    }
                }
            }
        }
    }
   
    ArrayList<DocumentCluster> clusterList = new ArrayList<DocumentCluster>();
    for(int i = 0; i < clusters.size(); i++){
        if(clusters.get(i).size()>0)
            clusterList.add(clusters.get(i));
    }
   
    //no clusters formed on this iteration
    if(numClustered == 0){
        return clusterList;
    }
    return cluster(clusterList, minMatch, numIterations -1);
}

public static int getNumSharedWords(DocumentCluster c1, DocumentCluster c2){
    Set<WordStem> topWords2 = new HashSet<WordStem>(c2.getTopWords());
    Set<WordStem> topWords1 = new HashSet<WordStem>(c1.getTopWords());
    topWords1.retainAll(topWords2);
    return topWords1.size();
}

 

Analysis

I pulled news from 31 sources focused primarily on top stories, world, and national news. From the 526 RSS news items, 306 clusters were formed with at least 2 news stories each. Of the 306, 21 were incorrectly clustered (stories were unrelated to those in the cluster). Of the remaining 220 stories which weren’t part of a cluster,  89 should have been in a cluster. Not being one to give up an opportunity to draw a graph, here’s the above info represented visually :-)

document_classification_breakdown (1)

True positives are news items that were correctly clustered. False positives are news items that were incorrectly clustered. True negatives were correctly identified as a unique story (not part of a cluster). False negatives were incorrectly identified to be unique or part of a cluster that should have been merged with another cluster.

The majority of true negatives were lifestyle/editorial style stories. False negatives were mostly articles with quotes/anecdotes in the headline. I tried to decrease the number of false negatives by lowering the number of words that need to be shared but this inevitably led to a decrease in specificity.

Although improvements could probably be made by looking at semantic similarity or by considering a larger body of text (the average rss item contained about 50 words of which only ~20 were ‘important’), syntax-based comparisons provide results that I’d consider ‘good enough’ .

Retrieving Yahoo! Finance Information using YQL

YQL is a query language that lets you retrieve data from across the web using an SQL-like syntax. By making a simple REST query using a standardized syntax, you can receive XML- or JSON-formatted data from a bunch of web service APIs. You can play around with YQL here

I wrote a YQL wrapper in python which queries the Yahoo! Finance open data tables for financial data. Using this script, you can get current and historical financial info, news feeds, and options data for any ticker symbol.

Usage

The below examples assume the following import/initialization steps have been done. You can get most data with just a line of code.

import stockretriever
stocks = stockretriever.StockRetriever()


Getting Current Stock Summary Information

info = stocks.get_current_info(["YHOO","AAPL","GOOG","MSFT"])

get_current_info() uses the yahoo.finance.quotes datatable to get all of the stock information presented in the main table on a typical stock page and a bunch of data from the key statistics page.
Here’s what a typical result-set looks like:

 [{
    "symbol": "YHOO",
    "Ask": null,
    "AverageDailyVolume": "25755200",
    "Bid": null,
    "AskRealtime": "0.00",
    "BidRealtime": "0.00",
    "BookValue": "9.184",
    "Change_PercentChange": "0.00 - 0.00%",
    "Change": "0.00",
    "Commission": null,
    "ChangeRealtime": "0.00",
    "AfterHoursChangeRealtime": "N/A - N/A",
    "DividendShare": "0.00",
    "LastTradeDate": "6/29/2010",
    "TradeDate": null,
    "EarningsShare": "0.557",
    "ErrorIndicationreturnedforsymbolchangedinvalid": "N/A",
    "EPSEstimateCurrentYear": "0.68",
    "EPSEstimateNextYear": "0.77",
    "EPSEstimateNextQuarter": "0.16",
    "DaysLow": null,
    "DaysHigh": null,
    "YearLow": "13.88",
    "YearHigh": "19.12",
    "HoldingsGainPercent": "- - -",
    "AnnualizedGain": "-",
    "HoldingsGain": null,
    "HoldingsGainPercentRealtime": "N/A - N/A",
    "HoldingsGainRealtime": null,
    "MoreInfo": "cnsprmiIed",
    "OrderBookRealtime": "N/A",
    "MarketCapitalization": "19.446B",
    "MarketCapRealtime": null,
    "EBITDA": "1.323B",
    "ChangeFromYearLow": "+0.16",
    "PercentChangeFromYearLow": "+1.15%",
    "LastTradeRealtimeWithTime": "N/A - <b>14.04</b>",
    "ChangePercentRealtime": "N/A - 0.00%",
    "ChangeFromYearHigh": "-5.08",
    "PercebtChangeFromYearHigh": "-26.57%",
    "LastTradeWithTime": "Jun 29 - <b>14.04</b>",
    "LastTradePriceOnly": "14.04",
    "HighLimit": null,
    "LowLimit": null,
    "DaysRange": "N/A - N/A",
    "DaysRangeRealtime": "N/A - N/A",
    "FiftydayMovingAverage": "15.4047",
    "TwoHundreddayMovingAverage": "16.1174",
    "ChangeFromTwoHundreddayMovingAverage": "-2.0774",
    "PercentChangeFromTwoHundreddayMovingAverage": "-12.89%",
    "ChangeFromFiftydayMovingAverage": "-1.3647",
    "PercentChangeFromFiftydayMovingAverage": "-8.86%",
    "Name": "Yahoo! Inc.",
    "Notes": "-",
    "Open": null,
    "PreviousClose": "14.04",
    "PricePaid": null,
    "ChangeinPercent": "0.00%",
    "PriceSales": "3.00",
    "PriceBook": "1.53",
    "ExDividendDate": "12-May-04",
    "PERatio": "25.21",
    "DividendPayDate": "N/A",
    "PERatioRealtime": null,
    "PEGRatio": "1.47",
    "PriceEPSEstimateCurrentYear": "20.65",
    "PriceEPSEstimateNextYear": "18.23",
    "Symbol": "YHOO",
    "SharesOwned": null,
    "ShortRatio": "1.40",
    "LastTradeTime": "4:00pm",
    "TickerTrend": " ====== ",
    "OneyrTargetPrice": "19.48",
    "Volume": "0",
    "HoldingsValue": null,
    "HoldingsValueRealtime": null,
    "YearRange": "13.88 - 19.12",
    "DaysValueChange": "- - 0.00%",
    "DaysValueChangeRealtime": "N/A - N/A",
    "StockExchange": "NasdaqNM",
    "DividendYield": null,
    "PercentChange": "0.00%"
}]


Getting Current Stock News

news = stocks.get_news_feed('YHOO')

get_news_feed() uses the rss data table to get rss feeds under the Headlines and Financial Blogs headings on a typical stock page. Here’s an example result-set:

[
    {
     "pubDate": "Wed, 30 Jun 2010 06:45:23 Etc/GMT",
     "title": "Update: Amazon Back To Normal; So Where's The Explanation? (at Barron's Online)",
     "link": "http://us.rd.yahoo.com/finance/external/xbarronsblog/rss/SIG=13fvfc6kn/*http%3A//blogs.barrons.com/techtraderdaily/2010/06/30/update-amazon-back-to-normal-so-wheres-the-explanation/?mod=yahoobarrons",
     "description": null
    },
    {
     "pubDate": "Wed, 30 Jun 2010 01:56:30 Etc/GMT",
     "title": "[$$] Andreessen Horowitz Leads $20 Million Investment in Foursquare (at The Wall Street Journal)",
     "link": "http://us.rd.yahoo.com/finance/external/wsj/rss/SIG=12icuornn/*http%3A//online.wsj.com/article/SB10001424052748704103904575337411584163440.html?ru=yahoo&mod=yahoo_hs",
     "description": "Andreessen Horowitz Leads Foursquare Investment Foursquare, a mobile-based startup that lets people \"check in\" at bars, restaurants and other places through their smartphones, has raised $20 million from a group of venture capitalists that will help fuel its expansion."
    },
    {
     "pubDate": "Wed, 30 Jun 2010 00:59:13 Etc/GMT",
     "title": "[$$] Foursquare Raises New Funds (at The Wall Street Journal)",
     "link": "http://us.rd.yahoo.com/finance/external/wsj/rss/SIG=12ircfrro/*http%3A//online.wsj.com/article/SB10001424052748703374104575337434242710928.html?ru=yahoo&mod=yahoo_hs",
     "description": "Foursquare Raises New Funds Foursquare, a location-based social networking startup, raised $20 million from a group of venture capitalists to help fuel its expansion."
    }
]


Getting Historical Stock Information

news = stocks.get_historical_info('YHOO')

get_historical_info() uses the csv datatable to retrieve all available historical data on a typical historical prices page. Here’s an example result-set (truncated):

[
    {
     "Date": "2010-06-29",
     "Open": "463.44",
     "High": "464.55",
     "Low": "451.12",
     "Close": "454.26",
     "Volume": "3502100",
     "AdjClose": "454.26"
    },
    {
     "Date": "2010-06-28",
     "Open": "472.59",
     "High": "477.55",
     "Low": "469.01",
     "Close": "472.08",
     "Volume": "1762300",
     "AdjClose": "472.08"
    }
]


Getting Options Data

options = stocks.get_options_data('YHOO')

get_options_data() uses the yahoo.finance.options table to retrieve call and put options from the options page. Here’s a typical result-set:

[
    {
     "symbol": "YHOO",
     "option": [
      {
       "symbol": "YHOO100717C00007500",
       "type": "C",
       "strikePrice": "7.5",
       "lastPrice": "7.90",
       "change": "0",
       "changeDir": null,
       "bid": "NaN",
       "ask": "NaN",
       "vol": "1",
       "openInt": "363"
      },
      {
       "symbol": "YHOO100717C00010000",
       "type": "C",
       "strikePrice": "10",
       "lastPrice": "5.20",
       "change": "0",
       "changeDir": null,
       "bid": "NaN",
       "ask": "NaN",
       "vol": "0",
       "openInt": "486"
      }
    }
]

Summary

The script is available here. Although the python script provides quick and easy access to a bunch of finance data, it negates many of the benefits of using YQL in the first place; namely, you can’t add your own conditions to the where clause, do joins, or apply aggregate operators. I’ll write a post sometime in the future on some of the cool stuff you can do with YQL.

A 16-Step Sequencer in Javascript

You can find the sequencer here.

I was bored today so I built a 16-step sequencer in javascript which uses sounds from 5 different instruments that I ripped from Reason. I was planning on using the HTML5 tag for audio playback but codec support is inconsistent across browsers – Chrome, Firefox, and Opera support Ogg audio, Safari supports AAC through quicktime and IE doesn’t have <audio> tag support at all. Instead of encoding audio in a bunch of different formats or using flash, I decided to use SoundManager2 which wraps the HTML5 audio api and uses flash as a failsafe. Since this is just a toy project, I didn’t mix the audio as you’d get in a real sequencer so you’ll occasionally notice clipping and samples playing off-beat. I wouldn’t recommend HTML5 audio or the above library for time-sensitive playback since you can’t get very low latency.

How to Find Every Word in Word Jumble-Style Games

Scramble is a word game where players try to find as many words as they can in a 4×4 or 5×5 grid of letters. I’ve been playing this game more than I’d like to admit to so in an attempt to hopefully ween me off the game, I’ve coded a solution in Java which finds every word in the grid.

Setting Up the Problem

I set the problem up using an undirected, unweighted, connected graph.
Here’s my graph implementation using an adjacency list:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;


public class Graph<T> {
    private HashMap<T, HashSet<T>> adj;
   
    public Graph(){
        adj = new HashMap<T, HashSet<T>>();
    }
   
    public void addEdge(T from, T to){
        if(from != null && to != null){
            if(!hasVertex(from)) addVertex(from);
            if(!hasVertex(to)) addVertex(to);
            adj.get(from).add(to);
            adj.get(to).add(from);
        }
    }
   
    public void addVertex(T c){
        adj.put(c, new HashSet<T>());
    }
   
    public boolean hasVertex(T c){
        return adj.get(c) != null;
    }
   
    public Set<T> getVertices(){
        return adj.keySet();
    }
   
    public Set<T> getAdjacentVertices(T c){
        return adj.get(c);
    }
   
    public String toString(){
        String out = "";
        String graph = "";
        for(T c : adj.keySet()){
            out += c+":";
            for(T a : adj.get(c)){
                out += a+" ";
            }
            graph += out+"\n";
            out = "";
        }
        return graph;
    }
}

Since the same letters can appear multiple times in the grid, I represented each vertex using a Node object rather than simple Strings (letting me override equals()). Here’s the code for the Node class:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
public class Node {
    private String val;
    private static int ct = 0;
    private int count;
   
    public Node(String val){
        this.val = val;
        this.count = ct++;
    }
   
    public String getValue(){
        return val;
    }
   
    public boolean equals(Object otherObj){
        if(this == otherObj) return true;
        if(otherObj == null) return false;
        if(getClass() != otherObj.getClass()) return false;
        Node other = (Node)otherObj;
        return other.count == this.count;
    }
   
}

The next step is to build the graph representation of the grid. I decided to read the grid from file into a 2d array and use that to create the necessary vertices/edges:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
public static final int GAME_SIZE = 4;
public static final int BUFFER_SIZE = 2;
...
private Graph<Node> buildGraph(String file) throws Exception{
    Node[][] board = readBoardTo2DArray(file);
    Graph<Node> g = new Graph<Node>();
    for(int i = 1; i < GAME_SIZE + 1; i++){
        for(int j = 1; j < GAME_SIZE + 1; j++){
            Node from = board[i][j];
            g.addEdge(from, board[i-1][j]);
            g.addEdge(from, board[i-1][j-1]);
            g.addEdge(from, board[i-1][j+1]);
            g.addEdge(from, board[i+1][j]);
            g.addEdge(from, board[i+1][j-1]);
            g.addEdge(from, board[i+1][j+1]);
            g.addEdge(from, board[i][j-1]);
            g.addEdge(from, board[i][j+1]);
        }
    }
    return g;
}

private Node[][] readBoardTo2DArray(String file) throws Exception{
    Node[][] board = new Node[GAME_SIZE + BUFFER_SIZE][GAME_SIZE + BUFFER_SIZE];
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line = null;
    int row = 1;
    while((line = br.readLine()) != null){
        String[] cline = line.split(" ");
        for(int i = 0; i < cline.length; i++){
            board[row][i+1] = new Node(cline[i]);
        }
        row++;
    }
    return board;
}

Finding Every Word in the Graph

Now that we have a graph representation of the board, we can find every word using a variant of depth-first search(DFS).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
private Set<String> words;
...
public Set<String> findWords(){
    try{
        openConnection();
        HashMap<Node, Boolean> visited = new HashMap<Node, Boolean>();
        for(Node l : g.getVertices())
            visited.put(l, false);
   
        for(Node l : g.getVertices()){
            buildWords(g, l, visited, l.getValue());
        }
    }catch(Exception e){
        System.out.println(e.getMessage());
    }finally{
        try{
            closeConnection();
        }catch(Exception e){}
    }
    return words;
}

private void buildWords(Graph<Node> g, Node n, HashMap<Node, Boolean> visited, String word){
    visited.put(n, true);
    for(Node v: g.getAdjacentVertices(n)){
        String currWord = word + v.getValue();
        if(!visited.get(v)){
            visitWord(currWord);
            if(wordsMatch(currWord)){
                buildWords(g, v, visited, currWord);
            }
        }
    }
    visited.put(n, false);
}

private void visitWord(String word){
    if(isWord(word) && !words.contains(word)){
        words.add(word);   
    }
}

In findWords(), we initialize the visited HashMap which is used to keep track of the vertices we’ve seen on the current path and to prevent the same vertices from being used multiple times for the same word. From there, we call buildWords() with each of the vertices (treating each of the vertices as the root).
buildWords() constructs depth-first trees – Node n is the node we’re currently visiting and String word is the word that we’re constructing. First, we set the node to visited and then we travel down a path. To keep from proceeding down a path with no words, we apply a simple constraint to check that words exist with the letters on the current path. Without the constraint in place, the algorithm can be used to find every unique path in a graph.
Traditionally in DFS, we continually search deeper until a goal vertex is found or until we reach a terminal vertex with no children whilst marking vertices we’ve seen to keep from falling into infinite loops. Taking a coloring approach, we can color unvisited nodes white; visited nodes are colored two ways – gray vertices have some adjacent white vertices whereas all vertices adjacent to black vertices have been visited. By coloring vertices in this way, each vertex is visited at most once.
In our case, we don’t need to differentiate between visited vertices since rather than just traversing the graph to visit each vertex once, we want to potentially traverse every path in the graph. Fortunately for us, doing so requires only a single modification to DFS (asides from using a simpler coloring approach) – we reset the current vertex to not visited once we backtrace which allows for its inclusion in other paths.

The Word List to Test Against

I used JDBC to connect to a mysql database built using the English word list from ispell. I modified the word list by removing words less than 2 characters long and words with non-alphabetical characters. Unfortunately, I don’t know which word list is being used in Scramble and so many of the words found via my code are considered invalid by the game and vice versa.

Conclusion

My solution takes <5 seconds to find all words in a game of Scramble and I'm guessing I could increase the performance by forking off calls to buildWords. Although it's a bit clumsy to use while playing since the games are timed, I managed to come in first a bunch of times while playing online multiplayer - my code has sufficiently sucked the fun out of the game and so hopefully I won't be playing it as much =). Now if only I could code a solution to Paper Toss :p.

All code is available here. Note that a Java properties file (database.properties) is required with the following keys: jdbc.drivers, jdbc.url, jdbc.username, and jdbc.password. The jumble.txt file demonstrates the required format for the puzzle (space between each vertex; newline for each row).

Collaborative Filtering: A Movie Recommendation App

I built a reallllly simple movie recommendation engine using the movielens dataset (~10k movies, 10 million ratings from around 20k users). Just type in a movie you really liked thats in the genre you feel like watching now. My algorithm is super simple – I take the Euclidean distance between the user score and the stored scores then I filter by genre. For a first attempt, the results seem decent.

Try the recommendation engine here

EDIT: I played around with this some more and…its not very good lol…comedies rank horribly on this. Going by genres and rankings alone doesn’t seem to be sufficient to get good recommendations, especially for comedies which tend to have more polar user rankings (see the Napolean dynamite problem). I’ll be adding more metadata later.

Setting up a custom php install on Dreamhost

I’ve been using Dreamhost for a week and so far its been great. Since I knew that my site would get minimal traffic and I probably won’t be using it to do anything cpu-intensive, I figured shared hosting would be just fine. For $10 for the first year (promo code: 777), I got a domain and this plan.
I was hoping to spend today playing around with php; unfortunately I quickly ran into some troubles when I tried to include libraries not included in the shared Dreamhost install. So I decided to install my own copy of php – here’s how I did it:

1. Enable shell access from cpanel

Users->Manage Users->edit->select Shell Account

2. Download and build php

Use this script, change line 14 to your domain, and execute to build and install php with a bunch of useful libs.

3. Update .htaccess

1
2
3
4
5
6
7
8
9
Options +ExecCGI
AddHandler php-cgi .php
Action php-cgi /cgi-bin/php.cgi

<FilesMatch "^php5?.(ini|cgi)$">
Order Deny,Allow
Deny from All
Allow from env=REDIRECT_STATUS
</FilesMatch>

4. change some environment variables

Execute these commands on command line:

1
2
3
pear config-set php_bin /home/your-username-here/php5/bin/php
export PHP_PEAR_PHP_BIN=/home/your-username-here/php5/bin/php
export PATH=/home/your-username-here/pear/:/home/your-username-here/php5/bin:$PATH

5. Update your php.ini file

Update include_path to include the pear php_dir variable

1
pear config-get php_dir

and any other settings you’d like to change. upload_max_filesize and post_max_size are usually the properties people want to increase.

6. Test install via call to phpinfo()