User:Prasenjitmukherjee


Prasen Mukherjee

I live in Bangalore and work for AOL. I am interested in data mining, mathematics, physics, religion and lots of other random topics.

What do I do

Architecting Mobile Platform Solutions

  • Mobile AIM Platform
    • Wireless Village/IMPS

Spearheading R&D initiatives

  • Leading R&D initiatives by collaborating with premier research institutes
  • Leading open source Machine Learning initiatives (http://lucene.apache.org/mahout)

Ad monetization (Mobile/Social-Networking)

  • Interact with and influence different business units to aid revenue generation and monetization
  • Apply technology to gauge trends and user behaviour
    • Spot market opportunities

Thoughts on Personalized Recommender Systems

Recommendation System (common to Mobile/Desktop)

  • Recommend (news) articles
  • Recommend upcoming events
  • Personalized browsing - The idea here is to dynamically learn the user's immediate intent (in the current session, as he browses more and more pages) from his past history, and recommend articles based on it.

Temporal profiling of user log data

  • Analyze the distribution of personal topics (probably extracted via PLSI, LDA, etc.) over time
  • Use the above distribution data to group temporally similar users (a minimal sketch follows the reference list below)
  • Reference papers:
    • Andrew McCallum's Publications
    • A Continuous-Time Model of Topic Co-occurrence Trends. Xuerui Wang, Wei Li, and Andrew McCallum. AAAI Workshop on Event Detection, 2006. Captures the time distributions not only of topics, but also of their co-occurrences. For example, while NLP and ML have both been around for a long time, their co-occurrence has been rising recently. The model is effectively a combination of the Pachinko Allocation Model (PAM) and Topics-Over-Time (TOT).
    • Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. Xuerui Wang and Andrew McCallum. Conference on Knowledge Discovery and Data Mining (KDD), 2006. A new LDA-style topic model that models trends over time. The meaning of a topic remains fixed and reliable, but its prevalence over time is captured, and topics may thus focus in on co-occurrence patterns that are time-sensitive. Unlike other work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps. The paper reports improvements in topic salience and in the ability to predict the time given the words.
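
A minimal sketch of the grouping step above, assuming each user's logs have already been reduced to one topic distribution per time bin (e.g. by running LDA per month). The profiles and the 0.9 threshold are illustrative, not real data:

    # Group users whose topic distributions evolve similarly over time.
    import numpy as np

    # user -> (n_bins x n_topics) matrix: topic mix per time bin (toy data)
    user_profiles = {
        "u1": np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]),
        "u2": np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]),
        "u3": np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]),
    }

    def temporal_similarity(a, b):
        """Mean cosine similarity of two users' topic mixes, bin by bin."""
        num = (a * b).sum(axis=1)
        den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
        return float(np.mean(num / den))

    # Greedy grouping: join the first group whose seed user is similar enough.
    groups = []
    for user, profile in user_profiles.items():
        for group in groups:
            if temporal_similarity(user_profiles[group[0]], profile) > 0.9:
                group.append(user)
                break
        else:
            groups.append([user])

    print(groups)  # [['u1', 'u2'], ['u3']]: u1/u2 share a temporal pattern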

Personalized Summarization

  • Can be applied to mobile devices
  • Perform document summarization (snippet generation) based on the user profile (a small sketch follows this list)
  • Describe a user based on his past behaviour. This requires NLP techniques.
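
A hedged sketch of the snippet-generation idea: rank a document's sentences by overlap with terms weighted from the user's past behaviour, and return the top ones. The profile weights and the document below are made up for illustration:

    # Profile-biased extractive snippet: pick sentences the user cares about.
    import re

    def personalized_snippet(document, profile_weights, max_sentences=2):
        """profile_weights: term -> weight learned from the user's history."""
        sentences = re.split(r"(?<=[.!?])\s+", document.strip())
        def score(sentence):
            words = re.findall(r"\w+", sentence.lower())
            return sum(profile_weights.get(w, 0.0) for w in words)
        ranked = sorted(sentences, key=score, reverse=True)
        return " ".join(ranked[:max_sentences])

    profile = {"cricket": 2.0, "bangalore": 1.5, "score": 1.0}  # hypothetical
    doc = ("The markets closed higher today. Bangalore hosted a cricket match. "
           "The final score surprised everyone.")
    print(personalized_snippet(doc, profile))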

Use session data to aid unsupervised clustering

  • Each session can be thought of as representative of a single user intent (aka topic), since generally (there are always exceptions) we do related things within a session: searching for a travel destination, followed by hotel rates, followed by air tickets. Making this assumption greatly reduces the high-dimensionality problem.
  • Now, instead of clustering on document/word tuples, we can create a representative vector for each user session (call it a session-vector in term-space) and apply the clustering logic to session/word tuples. We can use LSH to see the distribution of session-vectors across different hash buckets, which can be used to estimate the a priori number of clusters for latent-variable algorithms like PLSI/LDA (see the sketch after this list).
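
A small sketch of the session-vector idea, using random-hyperplane LSH (an assumed choice of hash family; the sessions are toy examples). The number of well-populated buckets gives a rough prior on the cluster count:

    # Hash term-space session-vectors and count how the buckets fill up.
    import numpy as np
    from collections import Counter

    sessions = [
        ["goa", "beach", "hotel", "flight"],
        ["goa", "hotel", "flight", "fares"],
        ["python", "numpy", "tutorial"],
        ["numpy", "tutorial", "examples"],
    ]
    vocab = sorted({w for s in sessions for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    def session_vector(session):
        v = np.zeros(len(vocab))
        for w in session:
            v[index[w]] += 1.0  # term frequency within the session
        return v

    rng = np.random.default_rng(0)
    planes = rng.standard_normal((8, len(vocab)))  # 8 random hyperplanes

    def bucket(v):
        return tuple(bool(x) for x in planes @ v > 0)  # sign pattern = bucket id

    counts = Counter(bucket(session_vector(s)) for s in sessions)
    print(len(counts), "buckets:", counts)  # rough a priori cluster count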

Conditional Random Field (CRF)

  • Comparison with HMM
    • In an HMM, observations (emitted entities) are assumed independent given the hidden states
    • Unlike a CRF, which is a discriminative model, an HMM is a generative model: it is based on the joint probability distribution p(s,o) and requires a model of p(o)
  • Comparison with MEMM
    • An MEMM is a directed graphical model, meaning that hidden states depend ONLY on past history
    • Like a CRF, an MEMM is a discriminative model: it is based on the conditional probability distribution p(s|s',o) and doesn't require a model of p(o), since the observation o is already given (the three factorizations are summarized below)
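
For context, the three factorizations side by side (standard textbook forms, added here as a reminder; s is the state sequence, o the observation sequence):

  • HMM (generative, joint): p(s,o) = \prod_t p(s_t|s_{t-1}) \, p(o_t|s_t)
  • MEMM (discriminative, locally normalized): p(s|o) = \prod_t p(s_t|s_{t-1},o_t)
  • CRF (discriminative, globally normalized): p(s|o) = \frac{1}{Z(o)} \exp\left( \sum_t \sum_k \lambda_k f_k(s_{t-1},s_t,o,t) \right)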


Locality Sensitive Hashing (LSH)

CAVEAT: These are all my personal observations. I appreciate constructive comments.


An interesting technique that does the following:

  • Speeds up clustering
    • Avoids pairwise distance/similarity computation
  • Near-duplicate detection
  • Tries to solve the "Curse of Dimensionality" problem
    • Query time is largely independent of corpus size
  • More applicable to multimedia indexing
    • Unlike text indexing, multimedia retrieval follows QueryByExample, hence LSH is directly applicable
  • Less applicable to text indexing
    • Text queries are very short (1-2 words), so there will be very few hits, since very few documents consist of just those 1-2 words

Basic LSH Algorithm

  • Use functions of the form
    • g(p) = \langle h_1(p), h_2(p), \ldots, h_k(p) \rangle
  • Preprocessing:
    • Select g_1, \ldots, g_L
    • For all p \in P, hash p to buckets g_1(p), \ldots, g_L(p)
  • Query:
    • Retrieve points from buckets g_1(q), g_2(q), \ldots until
      • Total points <= 2L, or all L buckets have been scanned
    • Return the top k points by doing an exact similarity computation over the (at most) 2L candidates (see the sketch after this list)
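
A Python sketch of the preprocessing and query steps above, with random hyperplanes standing in for the member hash functions h_i (an assumption; any LSH family works), k = 4 bits per table and L = 3 tables:

    # Build L hash tables, each keyed by a k-bit g_j(p), then query.
    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(1)
    dim, k, L = 16, 4, 3
    hyperplanes = [rng.standard_normal((k, dim)) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]

    def g(j, p):
        # g_j(p) = <h_1(p), ..., h_k(p)>, one sign bit per hyperplane
        return tuple(bool(x) for x in hyperplanes[j] @ p > 0)

    # Preprocessing: hash every point into bucket g_j(p) of table j
    points = rng.standard_normal((100, dim))
    for i, p in enumerate(points):
        for j in range(L):
            tables[j][g(j, p)].append(i)

    def query(q, topk=5):
        # Retrieve from buckets g_1(q), g_2(q), ... until 2L candidates
        candidates = []
        for j in range(L):
            candidates.extend(tables[j][g(j, q)])
            if len(candidates) >= 2 * L:
                break
        # Rank the small candidate set by exact cosine similarity
        sims = [(q @ points[i] / (np.linalg.norm(q) * np.linalg.norm(points[i])), i)
                for i in set(candidates)]
        return sorted(sims, reverse=True)[:topk]

    print(query(points[0]))  # the point itself should rank first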


LSH hash functions

The basic definition can be found at Locality_sensitive_hashing.

Hamming Hash Function

This is the first LSH hash function mentioned in Indyk-Motwani '98, the original paper on LSH. It seems most applicable to computing image similarity (color-histogram matching, etc.); a small sketch follows.
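
A tiny bit-sampling sketch of that family: each hash function reads one randomly chosen bit position, so for a single bit Pr[h(u) = h(v)] = 1 - HammingDistance(u,v)/d. The bit vectors below stand in for thresholded color histograms (an illustrative assumption):

    # Bit-sampling LSH over Hamming space (Indyk-Motwani '98 style).
    import random

    random.seed(0)
    d, k = 12, 4
    positions = random.sample(range(d), k)  # one g built from k sampled bits

    def g(bits):
        return tuple(bits[i] for i in positions)

    u = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    v = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]  # Hamming distance 2 from u
    print(g(u), g(v), "collide:", g(u) == g(v))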

Minwise Permutation

The initial intent was to do similarity computation using a probabilistic technique; it was later used as an efficient LSH. This is a set-based approach, while all the other approaches are vector/space-based. Being set-based, unlike the vector-space approaches, it does not take into account the relative distribution of term frequencies within a document. It is most applicable to clustering, duplicate detection, etc. in a text corpus.

  • Article link: Minwise Independent Permutations, Broder et al.
  • Metric: Jaccard Coefficient
  • h(d) = \min_{w \in d} \{ \pi(w) \}
  • Pr[h(v) = h(u)] = J(v,u), where J is the Jaccard Coefficient
  • Good for text/document similarity (a small sketch follows)
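
A small sketch that estimates J(v,u) as the fraction of simulated permutations on which the two word sets' minima agree (toy documents; the 200-permutation count is an arbitrary illustrative choice):

    # Minwise hashing: Pr[min of pi(a) == min of pi(b)] = Jaccard(a, b).
    import random

    def minhash_signature(words, seeds):
        # One simulated permutation per seed, via a seeded word hash
        return [min(words, key=lambda w: hash((seed, w))) for seed in seeds]

    random.seed(0)
    seeds = [random.random() for _ in range(200)]
    a = {"goa", "beach", "hotel", "flight", "fares"}
    b = {"goa", "beach", "hotel", "resort"}

    sig_a = minhash_signature(a, seeds)
    sig_b = minhash_signature(b, seeds)
    estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(seeds)
    true_j = len(a & b) / len(a | b)
    print(f"estimated J = {estimate:.2f}, true J = {true_j:.2f}")  # ~0.50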

p-Stable Distribution

Uses properties of the stable distribution family. This seems to be the most talked-about scheme these days, with wide applications including text retrieval, (sub)image retrieval, clustering (text, media), and duplicate detection (text, media). Andoni-Indyk '06 extended the projection space from 1 dimension to t dimensions and introduced Leech-lattice LSH, based on Voronoi diagrams and the Leech lattice. A sketch for p = 2 follows the formulas below.

  • Article link: LSH Scheme Based on p-Stable Distributions, Indyk et al.
  • Metric: l_p(v_1, v_2); basically it honours the l_p^d norm
  • h_{\mathbf{a},b}(\mathbf{v}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + b}{r} \right\rfloor
  • Pr[h(v) = h(u)] = \int_{0}^{r} \frac{1}{c} f_p(t/c) \left(1 - \frac{t}{r}\right) dt, where
    • c = \| v - u \|_p, and
    • f_p(t) denotes the probability density of the absolute value of the p-stable distribution
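
A sketch for p = 2, where the 2-stable distribution is the Gaussian, so \mathbf{a} has i.i.d. Gaussian entries (the bucket width r = 4.0 is an arbitrary illustrative choice):

    # p-stable LSH for l2: h(v) = floor((a.v + b) / r), a ~ N(0, I).
    import numpy as np

    rng = np.random.default_rng(0)
    dim, r = 8, 4.0
    a = rng.standard_normal(dim)   # 2-stable (Gaussian) projection
    b = rng.uniform(0, r)          # random offset in [0, r)

    def h(v):
        return int(np.floor((a @ v + b) / r))

    v = rng.standard_normal(dim)
    u_near = v + 0.05 * rng.standard_normal(dim)  # small l2 distance
    u_far = v + 5.0 * rng.standard_normal(dim)    # large l2 distance
    print(h(v), h(u_near), h(u_far))  # the near point usually collides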