Problem Statement
Let X be a set of documents. Let Q be a set of query keywords. A search for query Q ranks the document as per relevance and returns the most relevant document. A common practice is to highlight the relevant snippets in the document. A snippet is a part of the document that is most relevant to the query. Usually, a snippet will consists query keywords and other contextual words. For a given document D and query Q, design an algorithm to find and highlight the most relevant snippet.Example
Document =>Java provides built-in support for multithreaded programming. A multithreaded program contains two or more parts that can run concurrently. Each part of such a program is called a thread, and each thread defines a separate path of execution. A multithreading is a specialized form of multitasking. Multitasking threads require less overhead than multitasking processes.
Query => "Multithreading Java"
Output=>
[Start-Highlight] Java provides built-in support for multithreaded programming. A multithreaded program[End-Highlight] contains two or more parts that can run concurrently. Each part of such a program is called a thread, and each thread defines a separate path of execution. A multithreading is a specialized form of multitasking. Multitasking threads require less overhead than multitasking processes.
Source code
Java source code with test cases.Instructions for executing the program
1. BRIEF DESCRIPTIONThis program performs exact case sensitive matching of given query words to doc words.
It does not perform any stemming, spell check, synonym matching, and has no support
for regular expressions and logical conditions in query.
Following characters are used as delimeters:
{' ','?','!',',','.',';',':','+','*','-','/','{', '}','[',']','(',')'}
Whenever, the program find a matching query keyword in the program, it
extracts a snippet of character length 150. Thus, a list of snippets are
generated. Snippets are ranked using a score based technique.
The highest snippet is returned as a result with the query keywords highlighted.
If there is no exact match in the page/document for any of the query keywords, NO
result is returned
2. CODE FILES IN THE APPLICATION
Constants.java
Controller.java
ErrorMessages.java
RenderOutput.java
SnippetMaker.java
SnippetRanker.java
Tokenizer.java
TestCases.java
3. ENVIRONMENT REQUIREMENT
Language: Java
Need JRE 1.4 or higher
4. COMPILATION
a) create a folder snippetHighlight (case sensitive)
b) copy all the *.java files into it
c) change to snippetHighlight directory
d) type command
javac Constants.java ErrorMessages.java Tokenizer.java SnippetMaker.java SnippetRanker.java RenderOutput.java Controller.java TestCases.java
If the system fails to compile giving error that a class is not found than please fix the classpath as described in following article: http://download.oracle.com/javase/1.3/docs/tooldocs/win32/javac.html
5. EXECUTION WITH TEST CASES
a) change to parent directory of snippetHighlight folder
b) Type command
java snippetHighlight.TestCases 0
java snippetHighlight.TestCases 1
java snippetHighlight.TestCases 2
java snippetHighlight.TestCases 3
These are four test cases to run the program.
6. PROVIDING OWN TEST CASES
1) Look into file TestCases.java for details
Each test method creates an instance of class Controller
Then it passes two strings.
First string is doc string
Second string is list of query keywords
Here is the code snippet
Controller control = new Controller();
control.snippetHighlight(pageStr, queryStr);
7. NON SUPPORTED FUNCTIONALITIES
a. This program does not provide functionality for spell checking on query or page.
This can be included using edit distance check using dictionary and character sampling.
b. This program has no support for operators like OR, AND, REGEX, etc. for query.
c. This program does not perform stemming.
d. This program does not use synonyms of query word.
This can be done using word graph like WORDNET or a Dictionary. This program provides framework in the code to plug this service.
source code link does not work
ReplyDelete