Talk:Query expansion

From Wikipedia, the free encyclopedia

I forgot about this when writing the first draft, but I really should mention the emerging trend of 'personalized results' where sites are reweighted. For example, in language, a word form can have multiple word senses. To the search engine, the user's desired sense of the word is ambiguous, because users do not bother to give all the details or be specific enough. The goal is to identify which sense, or senses, of the word form the user had in mind and intended to search for. For example, the user might enter the word 'bank' and be referring to a financial institution or to the side of a river. Without disambiguation, irrespective of other ranking factors, both types of documents are likely to be recalled. The user sees this as poor search quality, even though, for the most part, the search engine is powerless to do anything other than 'alert' the user that he can be more specific to improve his results. Alternatively, using word sense disambiguation techniques, the search engine can supplement the query by specifying the set of senses of each word that the user probably meant. This can be inferred from the other words in the query, or from the history of the user's searches or the websites visited from the search engine. For example, if the user typed in 'river bank', a bigram, the search engine can feed that context to the word sense disambiguation algorithm (which is likely to be a probabilistic model partially based on bigrams and trigrams!), infer the proper, specific sense of bank, and filter out documents that use alternate senses. Or it may be that the user has historically searched for nature-related documents, the dominant theme under which his intended sense of bank falls, and not financial documents. There are approaches that use the theme of the user's previous queries or viewed results to assist in the disambiguation.
I know the company Autonomy does this with its DRE (Dynamic Reasoning Engine), which is really just using a naive Bayes classification algorithm to assist in reranking. Google offers 'personalized search' and even shows you the search history (although most Googlers probably do not see the correlation). This is definitely an important part of query expansion, and eventually this concept should be incorporated. I should come up with some references first. Josh Froelich 17:16, 9 December 2006 (UTC)
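
To make the 'river bank' example above concrete, here is a minimal sketch of sense selection from query context plus search history. It is only an illustration: the sense inventory, the cue words, the half weight given to history, and the names SENSES, disambiguate and expand_query are all made up for this comment, and it uses plain cue-word overlap rather than the naive Bayes or n-gram models mentioned above.

```python
from collections import Counter

# Hypothetical toy sense inventory: each sense of a word has a few cue words.
SENSES = {
    "bank": {
        "financial_institution": {"money", "loan", "account", "deposit", "finance"},
        "river_side": {"river", "water", "shore", "fishing", "nature"},
    }
}


def disambiguate(word, query_terms, history_terms=()):
    """Pick the sense of `word` whose cue words best overlap the context.

    Returns None when the word is unknown or there is no evidence at all.
    """
    senses = SENSES.get(word)
    if not senses:
        return None
    # Other words in the query count fully as evidence.
    context = Counter(t.lower() for t in query_terms if t.lower() != word)
    scores = {}
    for sense, cues in senses.items():
        query_score = sum(context[c] for c in cues)
        # Terms from earlier searches count at half weight (an arbitrary choice).
        history_score = sum(1 for t in history_terms if t.lower() in cues)
        scores[sense] = query_score + 0.5 * history_score
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def expand_query(query, history_terms=()):
    """Tag each ambiguous query term with its inferred sense, if any."""
    terms = query.split()
    expanded = []
    for term in terms:
        sense = disambiguate(term.lower(), terms, history_terms)
        expanded.append(f"{term}#{sense}" if sense else term)
    return expanded


if __name__ == "__main__":
    print(expand_query("river bank"))
    print(expand_query("bank"))
    print(expand_query("bank", history_terms=["fishing", "nature", "loan"]))
```

With no context and no history, 'bank' is left untagged, which corresponds to the case where the engine can only ask the user to be more specific.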

You might consider augmenting Information retrieval (or a more appropriately focused article related to query details) with information on the impact of term weighting on the results. In the 'Query expansion' article, you might focus on the potential for different expansion paths, strategies for path selection and avoidance, and the consequences of a resulting path that violates the Principle of least astonishment. --User:Ceyockey (talk to me) 17:46, 9 December 2006 (UTC)

--- Another note. SQL interpreters augment SQL queries and fill in unwritten information. It is purely technical and obvious, and typically only done to help the query handler build the proper parse tree. For example, SQL-92 is a lazy syntax in the sense that, in limited cases, it allows leaving out the table name prefix before each column name in a SELECT query. This was done to make the lives of SQL authors easier and to make SQL queries shorter, and it is rather easy for the query handler to pick up and fill in the missing parts, if it is even necessary. I do not think this is important enough to acknowledge. Josh Froelich 21:10, 9 December 2006 (UTC)

Also, XSL, for querying XML as a dataset, supports something like this. Still not appropriate.

--- In the case of other media-related searches, like using an analog signal as input in an attempt to find fuzzily similar signals (different in frequency and amplitude but similar according to some criteria), the system may augment the signal. Imagine how detectives might use such a system to take a recorded phone conversation and automatically identify the speakers (speaker recognition or speech recognition). Probably not appropriate. Also not appropriate is how ISDN lines and the cable infrastructure of major countries work using repeater signals. I would rule these out. This article has a focus within information retrieval only. Josh Froelich 21:10, 9 December 2006 (UTC)

--- I am a little queasy about using the word 'seed'. I know what it means, but I am a tech geek. I understand how it is a characteristic of an original query that is about to be transformed. But I don't think the average joe has the slightest clue what we mean when we say 'seed'. Might want to remove this. Josh Froelich 21:10, 9 December 2006 (UTC)

--- >given that no user wants even more results to comb through, regardless of the precision. - May I suggest you reword this, e.g. 'many users' instead of 'no user'? I know that many of our clients in the legal profession, law enforcement agencies, etc. would rather have all the possible relevant matches returned than face losing a case or worse... Ray3055 20:54, 27 September 2007 (UTC)

Precision - recall tradeoff

Hi, First time for me, so I hope that's the way to do it:

The following sentence in the article doesn't seem right: This is due to the nature of the equation of how precision is calculated, in that a larger recall implicitly causes a decrease in precision, given that factors of recall are part of the denominator.

The precision P = tp/(tp+fp) can actually increase with recall. Example: with tp=1 and fp=1, P=0.5. If QE increases recall by retrieving one more correct doc, then tp=2 and P=2/3.

Typically, as in the QE case, greater recall introduces more errors (fp), and the precision of the added part is lower than the original precision. So in practice increasing recall often does decrease precision, but that's not the equation's fault. Sodagou (talk) 09:40, 5 April 2008 (UTC)
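
The arithmetic above can be checked with a small sketch. The total of 4 relevant documents and the 'typical' case (expansion adds one correct and three incorrect docs) are assumptions made up for illustration; only the tp=1, fp=1 and tp=2, fp=1 figures come from the example above.

```python
from fractions import Fraction


def precision(tp, fp):
    """P = tp / (tp + fp): the share of retrieved docs that are correct."""
    return Fraction(tp, tp + fp)


def recall(tp, total_relevant):
    """R = tp / (number of relevant docs in the collection)."""
    return Fraction(tp, total_relevant)


# Assumed for illustration: the collection holds 4 relevant documents.
TOTAL_RELEVANT = 4

# Original query: 1 correct and 1 incorrect doc retrieved.
before = (precision(1, 1), recall(1, TOTAL_RELEVANT))

# QE adds only one more correct doc: recall and precision both rise.
lucky = (precision(2, 1), recall(2, TOTAL_RELEVANT))

# QE adds one correct doc and three incorrect ones (assumed 'typical' case):
# recall rises the same amount, but precision falls.
typical = (precision(2, 4), recall(2, TOTAL_RELEVANT))

if __name__ == "__main__":
    for name, (p, r) in [("before", before), ("lucky", lucky), ("typical", typical)]:
        print(f"{name}: precision={p}, recall={r}")
```

Recall is the same in the last two cases, so whether precision falls depends on how many errors the expansion brings in, not on the form of the equation.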