Twitter Search is Now 3x Faster


http://engineering.twitter.com/2011/04/twitter-search-is-now-3x-faster_1656.html

Advertisements

Depth First Search – DFS


This is the second post in the series of tree/graph traversal. As in the last post breadth first search I used a concrete example with sample code in Java to explain the traversal algorithm. So without further a due here is the java code for depth first search (dfs)

	public void depthFirstTraversal(Node rootNode) {

		Stack s = new Stack();
		s.add(rootNode);
		rootNode.visited = true;
		while(!s.isEmpty()){
			Node n = s.pop();
			System.out.print(n.data +  " ");
			for(Node adj : n.adjacentNodes){
				if(!adj.visited){
					adj.visited = true;
					s.push(adj);
				}
			}
		}
	}

As opposed to breadth first traversal, depth first traversal uses stack to push the newly expanded node instead of Queue. For further reference read this

Breadth First Search – BFS


There are plethora of articles on the web explaining the breadth first search . But there aren’t implementations available especially in Java. Here is the java implementation of the breadth first search. Each vertex in the graph is represented by a struct Node.

	public class Node {

		public String data; // data element
		public boolean visited=false; // flag to track the already visited node
		public List adjacentNodes = new LinkedList(); // adjacency list

		public Node(String data){
			this.data = data;
		}

		public void addAdjacentNode(final Node node){
			adjacentNodes.add(node);
			node.adjacentNodes.add(this);
		}

	}

There are different ways to store graphs and the method used is this blog post is using adjacency list to store the neighboring vertices instead of adjacency matrix. You can read more about storing graphs here

the above shown Node is self-explanatory; data is the data element of the vertex, visited a boolean flag to track already visited node and the List as the adjacent list.

The breadth first search algo is very simple and is clearly explained in this article. Here goes the implementation

public class Graph {

	private List nodes = new ArrayList();

	public void breadthFirstTraversal(Node rootNode){

		Queue q = new LinkedList();
		q.add(rootNode);
		rootNode.visited=true;
		while(!q.isEmpty()){
			Node n = (Node)q.poll();
			System.out.print(n.data + " ");
			for(Node adj : n.adjacentNodes){
				if(!adj.visited){
					adj.visited=true;
					q.add(adj);
				}
			}
		}

	}

}

The graph class represent the graph of nodes with each node maintaining it’s own adjacent list; the breadth first search is one of the use cases of queue. the code is very simple and I guess there is no need for me to explain it, for some one who is not really familiar here goes the algorithm.

- enqueue the start node to a Queue
- make the start node as visited
- while queue is not empty
  - dequeue the node lets say u
  - print or whatever you want to
  - for every adjacent node v of u
      - if v is not already visited
          - mark v as visited
          - enqueue v to the Queue

You can easily test the graph using this test program (graph example straight from wikipedia)

	public static void main(String[] args) {

		Node frankfurt = new Node("frankfurt");
		Node mannheim = new Node("mannheim");
		Node wurzburg = new Node("wurzburg");
		Node stuttgard = new Node("stuttgard");
		Node kassel = new Node("kassel");
		Node karlsruhe = new Node("karlsruhe");
		Node erfurt = new Node("erfurt");
		Node numberg = new Node("numberg");
		Node augsburg = new Node("augsburg");
		Node munchen = new Node("munchen");

		Graph g = new Graph();

		g.nodes.add(frankfurt);
		g.nodes.add(mannheim);
		g.nodes.add(wurzburg);
		g.nodes.add(stuttgard);
		g.nodes.add(kassel);
		g.nodes.add(karlsruhe);
		g.nodes.add(erfurt);
		g.nodes.add(numberg);
		g.nodes.add(augsburg);
		g.nodes.add(munchen);

		frankfurt.addAdjacentNode(mannheim);
		frankfurt.addAdjacentNode(wurzburg);
		frankfurt.addAdjacentNode(kassel);

		mannheim.addAdjacentNode(karlsruhe);

		karlsruhe.addAdjacentNode(augsburg);

		augsburg.addAdjacentNode(munchen);

		munchen.addAdjacentNode(kassel);
		munchen.addAdjacentNode(numberg);

		wurzburg.addAdjacentNode(erfurt);
		wurzburg.addAdjacentNode(numberg);

		numberg.addAdjacentNode(stuttgard);

		g.breadthFirstTraversal(frankfurt);
	}

Next blog post would be for Depth First Search – DFS

MinHash – Java implementation


here is java implementation of MinHash

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class MinHash<T> {

	private int hash[];
	private int numHash;
    
	/**
	 * 
	 */
	public MinHash(int numHash){
		this.numHash = numHash;
        hash = new int[numHash];

        Random r = new Random(11);
        for (int i = 0; i < numHash; i++){
            int a = (int)r.nextInt();
            int b = (int)r.nextInt();
            int c = (int)r.nextInt();
            int x = hash(a*b*c, a, b, c);
            hash[i] = x;
        } 
    }
	

    public double similarity(Set<T> set1, Set<T> set2){

        int numSets = 2;
        Map<T, boolean[]> bitMap = buildBitMap(set1, set2);
        
        int[][] minHashValues = initializeHashBuckets(numSets, numHash);

        computeMinHashForSet(set1, 0, minHashValues, bitMap);
        computeMinHashForSet(set2, 1, minHashValues, bitMap);

        return computeSimilarityFromSignatures(minHashValues, numHash);
    }
    
	/**
	 * 
	 */
	private static int[][] initializeHashBuckets(int numSets, int numHashFunctions) {
		int[][] minHashValues = new int[numSets][numHashFunctions];

        for (int i = 0; i < numSets; i++) {
        	for (int j = 0; j < numHashFunctions; j++) {
        		minHashValues[i][j] = Integer.MAX_VALUE;
            }
        }
        return minHashValues;
    }
	 
	/**
	 * 
	 * @param minHashValues
	 * @param numHashFunctions
	 * @return
	 */
	private static double computeSimilarityFromSignatures(int[][] minHashValues, int numHashFunctions) {
		int identicalMinHashes = 0;
        for (int i = 0; i < numHashFunctions; i++){
            if (minHashValues[0][i] == minHashValues[1][i]) {
                identicalMinHashes++;
            }
        }
        return (1.0 * identicalMinHashes) / numHashFunctions;
    }
	
	/**
	 * 
	 * @param x
	 * @param a
	 * @param b
	 * @param c
	 * @return
	 */
	private static int hash(int x, int a, int b, int c) {
        int hashValue = (int)((a * (x >> 4) + b * x + c) & 131071);
        return Math.abs(hashValue);
    }
	
	
    private void computeMinHashForSet(Set<T> set, int setIndex, int[][] minHashValues, Map<T, boolean[]> bitArray){

    	int index=0;
    	
    	for(T element : bitArray.keySet()) { // for every element in the bit array
    		for (int i = 0; i < numHash; i++){ // for every hash
    			if(set.contains(element)) { // if the set contains the element
    				int hindex = hash[index]; // get the hash
    				if (hindex < minHashValues[setIndex][index]) { 
    					// if current hash is smaller than the existing hash in the slot then replace with the smaller hash value
    					minHashValues[setIndex][i] = hindex;
    				}
    			}
    		}
    		index++;
    	}
    }
    
	/**
	 * 
	 * @param set1
	 * @param set2
	 * @return
	 */
	public Map<T,boolean[]> buildBitMap(Set<T> set1, Set<T> set2){
		
		Map<T,boolean[]> bitArray = new HashMap<T,boolean[]>();
		
		for(T t : set1){
			bitArray.put(t, new boolean[]{true,false});
		}
		
		for(T t : set2){
			if(bitArray.containsKey(t)){
				// item is not present in set1
				bitArray.put(t, new boolean[]{true,true});
			}else if(!bitArray.containsKey(t)){
				// item is not present in set1
				bitArray.put(t, new boolean[]{false,true});
			}
		}
		
		
		return bitArray;
	}
	
	
	public static void main(String[] args){
		Set<String> set1 = new HashSet<String>();
		set1.add("FRANCISCO");
		set1.add("MISSION");
		set1.add("SAN");
		
		Set<String> set2 = new HashSet<String>();
		set2.add("FRANCISCO");
		set2.add("MISSION");
		set2.add("SAN");
		set2.add("USA");

		MinHash<String> minHash = new MinHash<String>(set1.size()+set2.size());
		System.out.println(minHash.similarity(set1, set2));
	}

MinHash


There are quite a few posts on the web explaining min hash specially this and this. Both of these posts are excellent and clearly explains min hash. My objective of this post is to take an example and explain step-by-step min-hash algorithm that does not permute the input but uses n-hash functions to get similarity score.

Lets suppose there are two sets of data S1 {“THIS”,”IS”,”ME”} and S2 {“THAT”,”IS”,”ME”}. Lets create a bit map index for this set of data with rows as union of Set S1 and Set S2 and column values as bits (1,0) representing presence or absence of data row in the set

THIS  1 0
THAT 0 1
IS     1 1
ME 1 1

Bit “1″ represent presence in the data set. For example, “THIS” 1 0 means “THIS” is present in the data set S1 but not in data set S2, “IS” is present in the both data sets S1 and S2 etc. etc.

Pick n-hash functions (n is random number), n number of hash functions could be any +ve integer >= S1.size()+S2.size(). For this example lets says n = S1.size() + S2.size().

[0][1][2][3][4][5] => {2222,12332,45432,45426,2124,8656}

For each column, hash function keep a slot. With our example we have 2 columns and 6 hash functions

minHashSlots => [0][0]  [0][1] [0][2] [0][3] [0][4] [0][5], [1][0]  [1][1] [1][2] [1][3] [1][4] [1][5]

Follow this algorithm:

For each row, each column with value 1, for each hash function if the hash value h[row index] < minHashSlot[column][row index] then replace the minHashSlot[column][row index] = h[rowIndex]. Here is the formal alogrithm

1. prepare the bitMap
2. initialize the h[n] => with n hash functions
3. initialize minHashSlots[column count][n] with Int.maxValue
4. rowIndex = 1
5. for every row in the bit map
5.1 for every column with value 1
5.1.1 for every hash
5.1.1.1. hash[rowIndex] < minHashSlot[column][rowIndex] then set minHashSlot[column][rowIndex]=hash[rowIndex]
5.1.1.2 rowIndex++
6 minHash = count of similar hash / total hashes

I am not sure if this is correct explanation of min-hash but whatever I heard this is quite close to being correct.