Monday, July 30, 2012

Highlighting relevant snippet in document search


Problem Statement

Let X be a set of documents. Let Q be a set of query keywords.  A search for query Q ranks the document as per relevance and returns the most relevant document. A common practice is to highlight the relevant snippets in the document. A snippet is a part of the document that is most relevant to the query. Usually, a snippet will consists query keywords and other contextual words. For a given document D and query Q, design an algorithm to find and highlight the most relevant snippet.

Example

Document =>
Java provides built-in support for multithreaded programming. A multithreaded program contains two or more parts that can run concurrently. Each part of such a program is called a thread, and each thread defines a separate path of execution. A multithreading is a specialized form of multitasking. Multitasking threads require less overhead than multitasking processes.
  
Query => "Multithreading Java"


Output=>
[Start-Highlight] Java provides built-in support for multithreaded programming. A multithreaded program[End-Highlight] contains two or more parts that can run concurrently. Each part of such a program is called a thread, and each thread defines a separate path of execution. A multithreading is a specialized form of multitasking. Multitasking threads require less overhead than multitasking processes.


Source code

Java source code with test cases. 

Instructions for executing the program

1. BRIEF DESCRIPTION

This program performs exact case sensitive matching of given query words to doc words.
It does not perform any stemming, spell check, synonym matching, and  has no support
for regular expressions and logical conditions in query.

Following characters are used as delimeters:
{' ','?','!',',','.',';',':','+','*','-','/','{', '}','[',']','(',')'}
Whenever, the program find a matching query keyword in the program, it
extracts a snippet of character length 150. Thus, a list of snippets are
generated. Snippets are ranked using a score based technique.
The highest snippet is returned as a result with the query keywords highlighted.
If there is no exact match in the page/document for any of the query keywords, NO
result is returned

2. CODE FILES IN THE APPLICATION
Constants.java
Controller.java
ErrorMessages.java
RenderOutput.java
SnippetMaker.java
SnippetRanker.java
Tokenizer.java

TestCases.java


3. ENVIRONMENT REQUIREMENT
Language: Java
Need JRE 1.4 or higher


4. COMPILATION
   a) create a folder snippetHighlight  (case sensitive)
   b) copy all the *.java files into it
   c) change to  snippetHighlight  directory
   d) type command
      javac Constants.java  ErrorMessages.java  Tokenizer.java SnippetMaker.java SnippetRanker.java  RenderOutput.java Controller.java TestCases.java


If the system fails to compile giving error that a class is not found than please fix the classpath as described in following article: http://download.oracle.com/javase/1.3/docs/tooldocs/win32/javac.html


5. EXECUTION WITH TEST CASES
   a) change to parent directory of snippetHighlight folder
   b) Type command
       java snippetHighlight.TestCases 0
       java snippetHighlight.TestCases 1
       java snippetHighlight.TestCases 2
       java snippetHighlight.TestCases 3
       These are four test cases to run the program.
     

6. PROVIDING OWN TEST CASES

   1) Look into file TestCases.java for details
 
   Each test method creates an instance of class  Controller
   Then it passes two strings.
   First string is doc string
   Second string is list of query keywords
 
   Here is the code snippet
 
   Controller  control = new Controller();
   control.snippetHighlight(pageStr, queryStr);    


7. NON SUPPORTED  FUNCTIONALITIES
a. This program does not provide functionality for spell checking on query or page.
  This can be included using edit distance check using dictionary and character sampling.
b. This program has no support for operators like OR, AND, REGEX, etc. for query.
c. This program does not perform stemming.
d. This program does not use synonyms of query word.
  This can be done using word graph like WORDNET or a Dictionary. This program provides framework in the code to plug this service.

1 comment: