User:Prasenjitmukherjee


Prasen Mukherjee

I live in Bangalore and work for AOL. I am interested in data mining, mathematics, physics, religion and lots of other random topics.

What do I do

Architecting Mobile Platform Solutions

  • Mobile AIM Platform
    • Wireless Village/IMPS

Spearheading R&D initiatives

  • Leading R&D initiatives by collaborating with premier research institutes
  • Leading open source Machine Learning initiatives (http://lucene.apache.org/mahout)

Ad monetization (Mobile/Social-Networking)

  • Interact with and influence different business units to aid revenue generation and monetization
  • Apply technology to gauge trends and user behaviour
    • Spot market opportunities

Thoughts on Personalized Recommender Systems

Recommendation System (common to Mobile/Desktop)

  • Recommend (news) articles
  • Recommend upcoming events
  • Personalized browsing - The idea here is to dynamically learn the user's immediate intent (in the current session, as he browses more and more pages) from his past history, and recommend articles based on it.

Temporal profiling of user log data

  • Analyze the distribution of personal topics (probably extracted via PLSI, LDA, etc.) over time
  • Use the above distribution data to group temporally similar users (a minimal sketch follows the reference list below)
  • Reference papers:
    • Andrew McCallum's Publications
    • A Continuous-Time Model of Topic Co-occurrence Trends. Xuerui Wang, Wei Li, and Andrew McCallum. AAAI Workshop on Event Detection, 2006. Captures the time distributions not only of topics, but also of their co-occurrences. For example, while NLP and ML have both been around for a long time, their co-occurrence has been rising recently. The model is effectively a combination of the Pachinko Allocation Model (PAM) and Topics-Over-Time (TOT).
    • Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. Xuerui Wang and Andrew McCallum. Conference on Knowledge Discovery and Data Mining (KDD), 2006. A new LDA-style topic model that models trends over time. The meaning of a topic remains fixed and reliable, but its prevalence over time is captured, and topics may thus focus in on co-occurrence patterns that are time-sensitive. Unlike other work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps. The paper reports improvements in topic salience and in the ability to predict the time given the words.
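
A minimal sketch of the grouping step above, assuming each user's logs have already been reduced to one topic distribution per time bin (e.g. by running LDA per month). The profiles and the 0.9 threshold are illustrative, not real data:

    # Group users whose topic distributions evolve similarly over time.
    import numpy as np

    # user -> (n_bins x n_topics) matrix: topic mix per time bin (toy data)
    user_profiles = {
        "u1": np.array([[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]),
        "u2": np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]),
        "u3": np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1]]),
    }

    def temporal_similarity(a, b):
        """Mean cosine similarity of two users' topic mixes, bin by bin."""
        num = (a * b).sum(axis=1)
        den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
        return float(np.mean(num / den))

    # Greedy grouping: join the first group whose seed user is similar enough.
    groups = []
    for user, profile in user_profiles.items():
        for group in groups:
            if temporal_similarity(user_profiles[group[0]], profile) > 0.9:
                group.append(user)
                break
        else:
            groups.append([user])

    print(groups)  # [['u1', 'u2'], ['u3']]: u1/u2 share a temporal pattern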

Personalized Summarization

  • Can be applied to mobile devices
  • Perform document summarization (snippet generation) based on the user profile (a small sketch follows this list)
  • Describe a user based on his past behaviour. This requires NLP techniques.
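
A hedged sketch of the snippet-generation idea: rank a document's sentences by overlap with terms weighted from the user's past behaviour, and return the top ones. The profile weights and the document below are made up for illustration:

    # Profile-biased extractive snippet: pick sentences the user cares about.
    import re

    def personalized_snippet(document, profile_weights, max_sentences=2):
        """profile_weights: term -> weight learned from the user's history."""
        sentences = re.split(r"(?<=[.!?])\s+", document.strip())
        def score(sentence):
            words = re.findall(r"\w+", sentence.lower())
            return sum(profile_weights.get(w, 0.0) for w in words)
        ranked = sorted(sentences, key=score, reverse=True)
        return " ".join(ranked[:max_sentences])

    profile = {"cricket": 2.0, "bangalore": 1.5, "score": 1.0}  # hypothetical
    doc = ("The markets closed higher today. Bangalore hosted a cricket match. "
           "The final score surprised everyone.")
    print(personalized_snippet(doc, profile))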

Use session data to aid unsupervised clustering

  • Each session can be thought of as representative of a single user intent (aka topic), since generally (there are always exceptions) we do related things within a session: searching for a travel destination, followed by hotel rates, followed by air tickets. Making this assumption greatly reduces the high-dimensionality problem.
  • Now, instead of clustering on document/word tuples, we can create a representative vector for each user session (call it a session-vector in term-space) and apply the clustering logic to session/word tuples. We can use LSH to see the distribution of session-vectors across different hash buckets, which can be used to estimate the a priori number of clusters for latent-variable algorithms like PLSI/LDA (see the sketch after this list).
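
A small sketch of the session-vector idea, using random-hyperplane LSH (an assumed choice of hash family; the sessions are toy examples). The number of well-populated buckets gives a rough prior on the cluster count:

    # Hash term-space session-vectors and count how the buckets fill up.
    import numpy as np
    from collections import Counter

    sessions = [
        ["goa", "beach", "hotel", "flight"],
        ["goa", "hotel", "flight", "fares"],
        ["python", "numpy", "tutorial"],
        ["numpy", "tutorial", "examples"],
    ]
    vocab = sorted({w for s in sessions for w in s})
    index = {w: i for i, w in enumerate(vocab)}

    def session_vector(session):
        v = np.zeros(len(vocab))
        for w in session:
            v[index[w]] += 1.0  # term frequency within the session
        return v

    rng = np.random.default_rng(0)
    planes = rng.standard_normal((8, len(vocab)))  # 8 random hyperplanes

    def bucket(v):
        return tuple(bool(x) for x in planes @ v > 0)  # sign pattern = bucket id

    counts = Counter(bucket(session_vector(s)) for s in sessions)
    print(len(counts), "buckets:", counts)  # rough a priori cluster count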

Conditional Random Field (CRF)

  • Comparison with HMM
    • In an HMM, observations (emitted entities) are assumed independent given the hidden states
    • Unlike a CRF, which is a discriminative model, an HMM is a generative model: it is based on the joint probability distribution p(s,o) and requires a model of p(o)
  • Comparison with MEMM
    • An MEMM is a directed graphical model, meaning that hidden states depend ONLY on past history
    • Like a CRF, an MEMM is a discriminative model: it is based on the conditional probability distribution p(s|s',o) and doesn't require a model of p(o), since the observation o is already given (the three factorizations are summarized below)
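
For context, the three factorizations side by side (standard textbook forms, added here as a reminder; s is the state sequence, o the observation sequence):

  • HMM (generative, joint): p(s,o) = \prod_t p(s_t|s_{t-1}) \, p(o_t|s_t)
  • MEMM (discriminative, locally normalized): p(s|o) = \prod_t p(s_t|s_{t-1},o_t)
  • CRF (discriminative, globally normalized): p(s|o) = \frac{1}{Z(o)} \exp\left( \sum_t \sum_k \lambda_k f_k(s_{t-1},s_t,o,t) \right)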


Locality Sensitive Hashing (LSH)

CAVEAT: These are all my personal observations. I appreciate constructive comments.


An interesting technique that does the following:

  • Speeds up clustering
    • Avoids pairwise distance/similarity computation
  • Near-duplicate detection
  • Tries to solve the "Curse of Dimensionality" problem
    • Query time is largely independent of corpus size
  • More applicable to multimedia indexing
    • Unlike text indexing, multimedia retrieval follows QueryByExample, hence LSH is directly applicable
  • Less applicable to text indexing
    • Text queries are very short (1-2 words), so there will be very few hits, since very few documents consist of just those 1-2 words

Basic LSH Algorithm

  • Use functions of the form
    • g(p) = \langle h_1(p), h_2(p), \ldots, h_k(p) \rangle
  • Preprocessing:
    • Select g_1, \ldots, g_L
    • For all p \in P, hash p to buckets g_1(p), \ldots, g_L(p)
  • Query:
    • Retrieve points from buckets g_1(q), g_2(q), \ldots until
      • Total points <= 2L, or all L buckets have been scanned
    • Return the top k points by doing an exact similarity computation over the (at most) 2L candidates (see the sketch after this list)
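
A Python sketch of the preprocessing and query steps above, with random hyperplanes standing in for the member hash functions h_i (an assumption; any LSH family works), k = 4 bits per table and L = 3 tables:

    # Build L hash tables, each keyed by a k-bit g_j(p), then query.
    import numpy as np
    from collections import defaultdict

    rng = np.random.default_rng(1)
    dim, k, L = 16, 4, 3
    hyperplanes = [rng.standard_normal((k, dim)) for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]

    def g(j, p):
        # g_j(p) = <h_1(p), ..., h_k(p)>, one sign bit per hyperplane
        return tuple(bool(x) for x in hyperplanes[j] @ p > 0)

    # Preprocessing: hash every point into bucket g_j(p) of table j
    points = rng.standard_normal((100, dim))
    for i, p in enumerate(points):
        for j in range(L):
            tables[j][g(j, p)].append(i)

    def query(q, topk=5):
        # Retrieve from buckets g_1(q), g_2(q), ... until 2L candidates
        candidates = []
        for j in range(L):
            candidates.extend(tables[j][g(j, q)])
            if len(candidates) >= 2 * L:
                break
        # Rank the small candidate set by exact cosine similarity
        sims = [(q @ points[i] / (np.linalg.norm(q) * np.linalg.norm(points[i])), i)
                for i in set(candidates)]
        return sorted(sims, reverse=True)[:topk]

    print(query(points[0]))  # the point itself should rank first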


LSH hash functions

The basic definition can be found at Locality_sensitive_hashing.

Hamming Hash Function

This is the first LSH hash function mentioned in Indyk-Motwani '98, the original paper on LSH. It seems most applicable to computing image similarity (color-histogram matching, etc.); a small sketch follows.
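
A tiny bit-sampling sketch of that family: each hash function reads one randomly chosen bit position, so for a single bit Pr[h(u) = h(v)] = 1 - HammingDistance(u,v)/d. The bit vectors below stand in for thresholded color histograms (an illustrative assumption):

    # Bit-sampling LSH over Hamming space (Indyk-Motwani '98 style).
    import random

    random.seed(0)
    d, k = 12, 4
    positions = random.sample(range(d), k)  # one g built from k sampled bits

    def g(bits):
        return tuple(bits[i] for i in positions)

    u = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
    v = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]  # Hamming distance 2 from u
    print(g(u), g(v), "collide:", g(u) == g(v))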

Minwise Permutation

The initial intent was to do similarity computation using a probabilistic technique; it was later used as an efficient LSH. This is a set-based approach, while all the other approaches are vector/space-based. Being set-based, unlike the vector-space approaches, it does not take into account the relative distribution of term frequencies within a document. It is most applicable to clustering, duplicate detection, etc. in a text corpus.

  • Article link: Minwise Independent Permutations, Broder et al.
  • Metric: Jaccard Coefficient
  • h(d) = \min_{w \in d} \{ \pi(w) \}
  • Pr[h(v) = h(u)] = J(v,u), where J is the Jaccard Coefficient
  • Good for text/document similarity (a small sketch follows)
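
A small sketch that estimates J(v,u) as the fraction of simulated permutations on which the two word sets' minima agree (toy documents; the 200-permutation count is an arbitrary illustrative choice):

    # Minwise hashing: Pr[min of pi(a) == min of pi(b)] = Jaccard(a, b).
    import random

    def minhash_signature(words, seeds):
        # One simulated permutation per seed, via a seeded word hash
        return [min(words, key=lambda w: hash((seed, w))) for seed in seeds]

    random.seed(0)
    seeds = [random.random() for _ in range(200)]
    a = {"goa", "beach", "hotel", "flight", "fares"}
    b = {"goa", "beach", "hotel", "resort"}

    sig_a = minhash_signature(a, seeds)
    sig_b = minhash_signature(b, seeds)
    estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(seeds)
    true_j = len(a & b) / len(a | b)
    print(f"estimated J = {estimate:.2f}, true J = {true_j:.2f}")  # ~0.50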

p-Stable Distribution

Uses properties of the stable distribution family. This seems to be the most talked-about scheme these days, with wide applications including text retrieval, (sub)image retrieval, clustering (text, media), and duplicate detection (text, media). Andoni-Indyk '06 extended the projection space from 1 dimension to t dimensions and introduced Leech-lattice LSH, based on Voronoi diagrams and the Leech lattice. A sketch for p = 2 follows the formulas below.

  • Article link: LSH Scheme Based on p-Stable Distributions, Indyk et al.
  • Metric: l_p(v_1, v_2); basically it honours the l_p^d norm
  • h_{\mathbf{a},b}(\mathbf{v}) = \left\lfloor \frac{\mathbf{a} \cdot \mathbf{v} + b}{r} \right\rfloor
  • Pr[h(v) = h(u)] = \int_{0}^{r} \frac{1}{c} f_p(t/c) \left(1 - \frac{t}{r}\right) dt, where
    • c = \| v - u \|_p, and
    • f_p(t) denotes the probability density of the absolute value of the p-stable distribution
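
A sketch for p = 2, where the 2-stable distribution is the Gaussian, so \mathbf{a} has i.i.d. Gaussian entries (the bucket width r = 4.0 is an arbitrary illustrative choice):

    # p-stable LSH for l2: h(v) = floor((a.v + b) / r), a ~ N(0, I).
    import numpy as np

    rng = np.random.default_rng(0)
    dim, r = 8, 4.0
    a = rng.standard_normal(dim)   # 2-stable (Gaussian) projection
    b = rng.uniform(0, r)          # random offset in [0, r)

    def h(v):
        return int(np.floor((a @ v + b) / r))

    v = rng.standard_normal(dim)
    u_near = v + 0.05 * rng.standard_normal(dim)  # small l2 distance
    u_far = v + 5.0 * rng.standard_normal(dim)    # large l2 distance
    print(h(v), h(u_near), h(u_far))  # the near point usually collides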