User:Prasenjitmukherjee
Prasen Mukherjee
I live in Bangalore and work for AOL. I am interested in data mining, mathematics, physics, religion and lots of other random topics.
What I do
Architecting Mobile Platform Solutions
- Mobile AIM Platform
- Wireless Village/IMPS
- Open Mobile SDK effort (http://dev.aol.com/openmobile)
- AirMedia Platform
Spearheading R&D initiatives
- Leading R&D initiatives by collaborating with premier research institutes
- Leading open source Machine Learning initiatives (http://lucene.apache.org/mahout)
Ad monetization (Mobile/Social-Networking)
- Interact with and influence different business units to aid revenue generation and monetization
- Apply technology to gauge trends and user behaviour
- Spot market opportunities
Thoughts on Personalized Recommender Systems
Recommendation System (common to Mobile/Desktop)
- Recommend (news) articles
- Recommend upcoming events
- Personalized browsing - The idea here is to learn the user's immediate intent dynamically (in the current session, as he browses more and more pages) from his past history, and recommend articles based on that; a sketch follows below.
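A minimal sketch of this session-intent idea, assuming topic vectors already exist for pages and for the user's long-term profile; all names, weights, and numbers here are illustrative, not a settled design:

import numpy as np

def update_intent(intent, page_topics, decay=0.8):
    """Blend the running session-intent vector with each newly browsed page,
    decaying older evidence so the immediate intent dominates."""
    return decay * intent + (1 - decay) * page_topics

def recommend(intent, candidate_topics, k=2):
    # Rank candidate articles by cosine similarity to the current intent
    sims = candidate_topics @ intent / (
        np.linalg.norm(candidate_topics, axis=1) * np.linalg.norm(intent) + 1e-12)
    return np.argsort(-sims)[:k]

# Seed the session with the long-term profile, then adapt per page view
intent = np.array([0.5, 0.1, 0.4])                # profile over 3 topics
for page in [np.array([0.9, 0.05, 0.05])] * 3:    # user keeps reading topic-0 pages
    intent = update_intent(intent, page)

candidates = np.eye(3)                            # one pure article per topic
print(recommend(intent, candidates))              # the topic-0 article ranks first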
Temporal profiling of user log data
- Analyze the distribution of personal topics (probably extracted via PLSI, LDA, etc.) over time
- Use the above distribution data to group temporally similar users (see the sketch after the reference papers)
- Reference papers:
- Andrew McCallum's Publications
- A Continuous-Time Model of Topic Co-occurrence Trends. Xuerui Wang, Wei Li, and Andrew McCallum. AAAI Workshop on Event Detection, 2006. Captures the time distributions not only of topics but also of their co-occurrences. For example, notice that while NLP and ML have both been around for a long time, their co-occurrence has been rising recently. The model is effectively a combination of the Pachinko Allocation Model (PAM) and Topics-Over-Time (TOT).
- Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. Xuerui Wang and Andrew McCallum. Conference on Knowledge Discovery and Data Mining (KDD) 2006. A new LDA-style topic model that models trends over time. The meaning of a topic remains fixed and reliable, but its prevalence over time is captured, and topics may thus focus in on co-occurrence patterns that are time-sensitive. Unlike other work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps. Improvements in topic saliency and the ability to predict time given words.
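A minimal sketch of the grouping step above, assuming per-user topic distributions over time buckets have already been extracted (e.g., by running PLSI/LDA per time slice); the use of k-means here is an illustrative stand-in, not something taken from the papers:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
num_users, num_topics, num_buckets = 100, 5, 12

# user_topic_time[u, t] = topic distribution of user u in time bucket t,
# e.g. estimated by PLSI/LDA on that user's logs for that slice
user_topic_time = rng.dirichlet(np.ones(num_topics), size=(num_users, num_buckets))

# Flatten each user's topic-over-time surface into one feature vector
features = user_topic_time.reshape(num_users, -1)

# Group temporally similar users: those whose topics rise and fall together
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(features)
print(labels[:10])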
Personalized Summarization
- Can be applied to mobile devices
- Perform document summarization (snippet generation) based on the user profile.
- Describe a user based on his past behavior. This requires NLP techniques.
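A minimal sketch of profile-biased snippet generation, using a bag-of-words term-weight profile as a stand-in for the NLP-derived user description mentioned above; the profile weights and example text are illustrative:

import re
from collections import Counter

def personalized_snippet(doc, profile, max_sentences=1):
    """Return the sentences that best match the user's term-weight profile."""
    sentences = re.split(r'(?<=[.!?])\s+', doc)
    def score(sent):
        words = Counter(w.lower() for w in re.findall(r'\w+', sent))
        return sum(profile.get(w, 0.0) * c for w, c in words.items())
    ranked = sorted(sentences, key=score, reverse=True)
    return ' '.join(ranked[:max_sentences])

# Profile learned from past behavior: this user reads about cricket, not markets
profile = {'cricket': 2.0, 'match': 1.5, 'bangalore': 1.0}
doc = ("Stocks fell sharply on Monday. The cricket match in Bangalore "
       "drew record crowds. Analysts expect volatility to continue.")
print(personalized_snippet(doc, profile))  # picks the cricket/Bangalore sentence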
Use Session data to aid unsupervised clustering
- Each session could be thought of as representative of a single user intent (aka topic), since generally (there are always exceptions) we do related things within a session: searching for a travel destination, followed by searching for hotel rates, followed by searching for air tickets. Making this assumption greatly reduces the problem of high dimensionality.
- Now, instead of clustering on document/word tuples, we can create a representative vector for each user session (call it a session-vector in term-space) and apply the clustering logic on session/word tuples. We can use LSH to see the distribution of session-vectors across different hash buckets, which can be used to estimate the a-priori number of clusters for latent-variable algorithms like PLSI/LDA; a sketch follows below.
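A minimal sketch of this idea, using random-hyperplane LSH on synthetic session-vectors and counting well-populated buckets as a rough prior on the number of latent topics; the corpus, bit width, and threshold are all illustrative:

import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
num_sessions, vocab_size, k_bits = 500, 1000, 12

# session_vectors[s] = term-frequency vector aggregated over one user session
session_vectors = rng.poisson(0.05, size=(num_sessions, vocab_size)).astype(float)

# One k-bit random-hyperplane hash: similar sessions tend to share a signature
planes = rng.normal(size=(k_bits, vocab_size))
signatures = [tuple((planes @ v > 0).astype(int)) for v in session_vectors]

# Buckets holding more than a handful of sessions hint at a recurring intent;
# their count can seed the a-priori number of clusters for PLSI/LDA
bucket_sizes = Counter(signatures)
num_topics_estimate = sum(1 for size in bucket_sizes.values() if size >= 5)
print(num_topics_estimate)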
ConditionalRandomField (CRF)
- Nice Articles
- http://en.wikipedia.org/wiki/Hidden_Markov_model - Pretty good article on HMMs, an understanding of which is extremely important to really understand CRFs
- http://www.cs.huji.ac.il/course/2004/learns/ConditionalRandomFields2.ppt
- Comparison with HMM
- In an HMM, observations (emitted entities) are conditionally independent given the hidden states
- Unlike a CRF, which is based on a discriminative model, an HMM is based on a generative model: it models the joint probability distribution p(s,o) and requires a model of p(o)
- Comparison with MEMM
- A MEMM is a directed graph, which means that hidden states depend ONLY on past history
- Like a CRF, a MEMM is also based on a discriminative model: it models the conditional probability distribution p(s|s',o) and doesn't require a model of p(o), since the observation o is already given. The standard factorizations are summarized below.
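For orientation, here are the standard textbook factorizations side by side (these are the usual forms, not taken from the slides above); s denotes the hidden state sequence and o the observations:

% HMM: generative -- models the joint distribution of states and observations
p(s, o) = \prod_t p(s_t \mid s_{t-1}) \, p(o_t \mid s_t)

% MEMM: discriminative, locally normalized, directed (past history only)
p(s \mid o) = \prod_t p(s_t \mid s_{t-1}, o_t)

% CRF: discriminative, globally normalized over the whole sequence
p(s \mid o) = \frac{1}{Z(o)} \exp\left( \sum_t \sum_k \lambda_k f_k(s_t, s_{t-1}, o, t) \right)

The global normalizer Z(o) is what lets the CRF avoid the label-bias problem that the MEMM's per-state normalization introduces.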
LocalitySensitiveHashing (LSH)
CAVEAT :: These are all my personal observations. I appreciate constructive comments.
An interesting technique for the following:
- Speeds up Clustering
- It avoids pairwise distance/similarity computation
- Near Duplicate Detection
- Tries to solve the "Curse of Dimensionality" problem
- As query time is (nearly) independent of corpus size
- More applicable to multimedia indexing
- Unlike text indexing, multimedia indexing follows QueryByExample, hence LSH is directly applicable
- Less applicable to text indexing
- As text queries are very short (1-2 words) while documents are not, a query's hash will collide with very few documents, so there will be very few hits
Basic LSH Algorithm
- Use hash functions of the form g_j(p) = ( h_1(p), ..., h_k(p) ), where each h_i is drawn at random from an LSH family H
- Preprocessing:
- Select g_1 ... g_L independently at random
- For every point p in the dataset, hash p to the buckets g_1(p), ..., g_L(p)
- Query (point q):
- Retrieve the points from buckets g_1(q), ..., g_L(q) until
- Total points <= 2L, or all L buckets have been scanned
- Return the top k points by doing an exact similarity computation among the (at most 2L) retrieved points
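A minimal sketch of the preprocessing/query loop above, assuming random-hyperplane hash functions (for cosine similarity) as the LSH family; the class name and parameters are illustrative:

import numpy as np
from collections import defaultdict

class BasicLSH:
    """Minimal LSH index: L tables, each keyed by a k-bit hyperplane hash g_j."""
    def __init__(self, dim, k=8, L=10, seed=0):
        rng = np.random.default_rng(seed)
        # planes[j] holds the k random hyperplanes that define g_j
        self.planes = [rng.normal(size=(k, dim)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.L = L

    def _g(self, j, p):
        # g_j(p) = (h_1(p), ..., h_k(p)), each h_i = sign of a random projection
        return tuple((self.planes[j] @ p > 0).astype(int))

    def index(self, points):
        self.points = np.asarray(points, dtype=float)
        for idx, p in enumerate(self.points):
            for j in range(self.L):
                self.tables[j][self._g(j, p)].append(idx)

    def query(self, q, k=5):
        # Scan buckets g_1(q)..g_L(q), stopping once 2L candidates are collected
        candidates = set()
        for j in range(self.L):
            candidates.update(self.tables[j][self._g(j, q)])
            if len(candidates) > 2 * self.L:
                break
        # Exact cosine similarity only among the retrieved candidates
        sims = [(float(self.points[i] @ q) /
                 (np.linalg.norm(self.points[i]) * np.linalg.norm(q) + 1e-12), i)
                for i in candidates]
        return [i for _, i in sorted(sims, reverse=True)[:k]]

Note that the query only ever touches the at-most-2L retrieved candidates, which is what makes query cost independent of corpus size.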
LSH hash functions
Basic definition can be found in Locality_sensitive_hashing
Hamming Hash Function
This is the first LSH hash function mentioned in Indyk-Motwani's original 1998 paper on LSH. It seems more applicable to computing image similarity (color histogram matching, etc.).
- Article link : Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality (1998) Indyk-Motwani
- Metric : Hamming distance
- h(v) = v_i, i.e., the i-th bit of vector v in d-dimensional Hamming space, with i drawn uniformly at random from {1, ..., d}
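A minimal sketch of bit sampling, concatenated into a k-bit signature as in the basic algorithm above; the vectors and parameters are illustrative:

import random

def make_bit_sampler(d, k, seed=42):
    """Concatenate k bit-sampling hashes: g(v) = (v[i1], ..., v[ik])."""
    rng = random.Random(seed)
    idx = [rng.randrange(d) for _ in range(k)]
    return lambda v: tuple(v[i] for i in idx)

# Two bit vectors at Hamming distance 1 in d=8 dimensions
u = [1, 0, 1, 1, 0, 1, 0, 0]
v = [1, 0, 1, 1, 0, 1, 1, 0]
g = make_bit_sampler(d=8, k=3)
# Each sampled bit matches with probability 1 - H(u,v)/d = 7/8,
# so the two signatures collide with probability (7/8)^3 ~ 0.67
print(g(u), g(v), g(u) == g(v))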
Minwise Permutation
The initial intent was to do similarity computation using a probabilistic technique; it was later used as an efficient LSH. This is a set-based approach, whereas all the other approaches are vector/space based. Because it is set based, unlike the vector-space approaches it does not take the relative distribution of term frequencies within a document into account. More applicable to clustering, duplicate detection, etc. in a text corpus.
- Article link : Minwise Independent Permutations Broder et al.
- Metric : Jaccard Coefficient
- Pr[h(v) = h(u)] = J(v,u), where J = Jaccard Coefficient
- Good for text/document similarity
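A minimal MinHash sketch, simulating the random permutations with salted hashes (a common shortcut, not Broder's construction verbatim); the token sets and signature length are illustrative:

import random

def minhash_signature(token_set, num_perm=128, seed=0):
    """For each simulated permutation, keep the minimum hash over the set."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_perm)]
    return [min(hash((salt, tok)) for tok in token_set) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    # Pr[h(A) = h(B)] = J(A, B), so the match rate estimates the Jaccard coefficient
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set("the quick brown fox jumps over the lazy dog".split())
b = set("the quick brown fox leaps over a lazy dog".split())
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
# true Jaccard: |A intersect B| / |A union B| = 7/10 = 0.7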
p-Stable Distribution
Uses properties of the stable distribution family. Seems to be the most talked about these days, with wide applications including text retrieval, (sub)image retrieval, clustering (text, media), and duplicate detection (text, media). Andoni-Indyk-06 extended the projection space from 1 dimension to t dimensions and introduced Leech-lattice LSH based on Voronoi diagrams and the Leech lattice.
- Article link : LSH Scheme Based on p-Stable Distributions Indyk et al.
- Metric : l_p(v1,v2). Basically it honours the l_p norm
- h_{a,b}(v) = floor( (a.v + b) / r ), where a is a d-dimensional vector whose entries are drawn i.i.d. from a p-stable distribution, and b is drawn uniformly from [0, r]
- Pr[ h(v1) = h(v2) ] = integral from 0 to r of (1/c) f_p(t/c) (1 - t/r) dt, where c = ||v1 - v2||_p
- f_p(t) denotes the probability density of the absolute value of the p-stable distribution
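A minimal sketch of the p=2 (Gaussian) case of the hash above; the bucket width r and the example vectors are illustrative:

import numpy as np

def make_pstable_hash(dim, r=4.0, seed=0):
    """p-stable LSH for the l2 norm: h(v) = floor((a.v + b) / r),
    with a ~ N(0, I) (the 2-stable distribution) and b ~ Uniform[0, r]."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=dim)
    b = rng.uniform(0, r)
    return lambda v: int(np.floor((a @ v + b) / r))

h = make_pstable_hash(dim=3)
near1 = np.array([1.0, 2.0, 3.0])
near2 = near1 + 0.01          # tiny l2 perturbation
far   = np.array([10.0, -5.0, 8.0])
print(h(near1), h(near2), h(far))  # nearby points usually share a bucket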