Building a Boolean Search Engine
implement the methods AND, OR, NOT and perform some queries. We have a small corpus (document collection) of Amazon.com reviews on electronics. After your page is generated, you can check out the new branch:
$ cd krithivasanchandran/BooleanSearchEngine
$ git fetch origin
$ git checkout BooleanSearchEngine
Corpus (document collection): mini_proj/all.txt
This contains the raw reviews. Each line is a review/document in our context. There are total 1248 documents. Documents in theposting list should be indexed from 1 (and not 0) where the first document refers to the first line in mini_proj/all.txt. The corpus is also available in document matrix format: mini_proj/docs.txt with associated vocabulary mapping mini_proj/vocab_map.txt generated using a helper method DatasetFormatter.
Pre-processing: DatasetFormatter.java
This class provides preprocessing (parsing, generating the vocabulary map, filtering stopwords based on counts, etc.) already implemented to get you started. You are required to work on the formatted data produced by this class. Your best bet is to use it directly within the constructor of BooleanRetrieval.java class as follows:
public BooleanRetrieval() throws Exception{
// Initialize variables and Format the data using a pre-processing class and
set up variables
invIndex = new HashMap<String, Set<Integer>>();
DatasetFormatter formater = new DatasetFormatter();
formater.textCorpusFormatter("./all.txt");
docs = formater.getDocs();
vocab = formater.getVocab();
void createPostingList()
You are required to build the inverted index in the method void createPostingList() for this preprocessed corpus. Note that you are required to work with the pre-processed data, i.e., output of DatasetFormatter.textCorpusFormatter()defined in Section 2 OR the document matrix/vocabulary defined by the files mini_proj/docs.txt and mini_proj/vocab_map.txt (which are the same). Do NOT work with the raw corpus mini_proj/all.txt and re-parse!
void createPostingList() {
//Initialze the inverted index with a SortedSet (so that the later additions become
easy!)
for(String s:vocab){
invIndex.put(s, new TreeSet<Integer>());
}
//for each doc
for(int i=0; i<docs.length; i++){
//for each word of that doc
for(int j=0; j<docs[i].length; j++){
//Get the actual word in position j of doc i
String w = map.get(docs[i][j]);
//Now simply update the posting list of this word w in the invIndex
by adding the document i
Set intersection(Set a, Set b)
This method accepts two posting lists using the Set collections and returns the intersection of them. You are required to implement it using the algorithm for intersection discussed in class.
Set<Integer> intersection(Set<Integer> a, Set<Integer> b){
/*
First convert the posting lists from sorted set to something we
can iterate easily using an index. I choose to use ArrayList<Integer>.
Once can also use other enumerable.
*/
ArrayList<Integer> PostingList_a = new ArrayList<Integer>(a);
ArrayList<Integer> PostingList_b = new ArrayList<Integer>(b);
TreeSet result = new TreeSet();
//Set indices to iterate two lists. I use i, j
int i = 0;
int j = 0;
while(i!=PostingList_a.size() && j!=PostingList_b.size()){
…
Then we can use the result of intersection to perform a AND query as follows:
Set <Integer> evaluateANDQuery(String a, String b){
return intersection(invIndex.get(a), invIndex.get(b));
}
Evaluating/checking posting lists in the inverted index
Let us get a feel of the corpus by exploring some examples. Assuming you have implemented the inverted index correctly. Numbering document indices starting at 1 (in our corpus), the posting lists for the following words should be as follows:
freeze -> [129, 313, 442, 493, 912] (i.e., the term “freeze” is present in documents
129, 313, etc.)
ctrl -> [23, 47, 138, 232, 498, 882, 993, 1106, 1124, 1218]
wifi -> [46, 66, 113, 158, 746, 1248]
cpu -> [604, 800, 959, 1156]
Evaluating/checking AND queries
Assuming you have implemented the intersection method correctly, then we should expect the following results for the following queries:
mouse AND wifi -> [113, 158] (i.e., there are two documents in our corpus which contain
both the terms “mouse” and “wifi”. And these are document at line 113 and 158 in all.txt)
mouse AND scrolling -> [80, 86, 348, 1029]
errors AND report -> [] (i.e., there are no documents in the collection containing both
the terms “errors” and “report”)
Implementing UNION and NOT operations on posting lists
Implement the method Set union(Set a, Set b) to evaluate and retrieve documents using the OR Boolean query. Union operation on queries mouse OR keyboard like simply means that we should retrieve all documents that appear either in the posting list of the term “mouse” or “keyboard”. You can use the evaluateORQuery() method implemented to return the result of an OR query containing two terms. Set evaluateORQuery(String a, String b){ return union(invIndex.get(a), invIndex.get(b)); } Assuming you have implemented union correctly, we should expect to get the following results for these queries:
youtube OR reported -> [19, 67, 122, 227, 313, 342, 377, 507, 659, 825, 846]
errors OR report -> [115, 251, 471, 508, 674, 784, 821, 1068, 1080, 1111, 1158, 1245]
hell OR movie -> [3, 89, 147, 192, 235, 262, 336, 342, 558, 766, 864, 882, 1120, 1244]
Final Output
$> project_executable query_type query_string output_file_path
where query_type takes on one of the 4 values: PLIST, AND, OR, AND-NOT. Query_string takes the actual query and output_file_path is the filename with path in ./ that should report the result of your query.
Example runs:
$> project_executable PLIST cpu cpu_plist.txt
Should produce
cpu -> [604, 800, 959, 1156]
as the contents of cpu_plist.txt
$> project_executable AND mouse AND scrolling and_result.txt
Should produce
mouse AND scrolling -> [80, 86, 348, 1029]
as the contents of and_result.txt
$> project_executable AND-NOT Lenovo AND (NOT logitech) and_not_result.txt
Should produce
lenovo AND (NOT logitech) -> [360, 373, 379, 451, 517, 540, 787, 869, 942, 1055, 1146]
as the contents of and_not_result.txt
Authors and Contributors
This is my @KrithivasanChandran GitHub profile.
Support or Contact
Having trouble with running the project please contact me on : krithivasanchandran@gmail.com