Commit 9e70e2d: Merge pull request #16 from dkutin/readme (Readme Changes)
2 parents 957deba + f200cf4

README.md (+199 -16)
# CSI4107 Assignment 1

## Reference

http://www.site.uottawa.ca/~diana/csi4107/A1_2021/A1_2021.htm
## Group Members

Dmitry Kutin - 300015920
Dilanga Algama - 8253677
Joshua O Erivwo - 8887065

## Task Distribution

Dmitry Kutin
- Step 1, Step 3 (Improvements), Step 5, README.

Dilanga Algama
- Step 2, Step 3 (Initial Implementation), Step 4.

Joshua O Erivwo
- Step 3, README.

## Setting up

Prerequisites:

1. `python3` installed and executable.
2. `nltk` libraries installed. All of these can be downloaded using `python3` -> `import nltk` -> `nltk.download(...)`:
   - `corpus`
   - `tokenize`
   - `stem.porter`

**Note: These will be downloaded and imported by `main.py` during execution.**
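For reference, a minimal sketch of fetching the NLTK data by hand from a `python3` shell (the package names `stopwords` and `punkt` are assumptions about which downloads back the `corpus` and `tokenize` modules; the Porter stemmer ships with NLTK and needs no extra data):

```python
import nltk

nltk.download('stopwords')  # used via nltk.corpus.stopwords
nltk.download('punkt')      # used via nltk.tokenize.word_tokenize
```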
## Project Overview

The `assets/` directory contains all the information provided for this assignment:
- `tweet_list.txt` - contains the list of documents (tweets).
- `stop_words.txt` - a collection of stopwords.
- `test_queries.txt` - a collection of 49 test queries.
- `Trec_microblog11-qrels.txt` - the provided relevance feedback file.

The `dist/` directory is where the results of the execution are stored:
- `Results.txt` - a collection of all 49 test queries and their corresponding relevant documents, ordered from highest to lowest relevance.
- `trec_eval.txt` - the output of the Trec Eval run; a detailed comparison of `Results.txt` against `Trec_microblog11-qrels.txt`.

## Execution

Once all of the prerequisites are met, the program can be run with:

**```python3 main.py```**

This will generate `Results.txt` in the `dist/` directory in the following format:

```
Topic_id Q0 docno rank score tag
1 Q0 30198105513140224 1 0.588467208018523 myRun
1 Q0 30260724248870912 2 0.5870127971399565 myRun
1 Q0 32229379287289857 3 0.5311552466369023 myRun
```

## Evaluation

To evaluate the effectiveness of our Microblog retrieval system:

- Run the `eval.sh` script. This creates a text file called `trec_eval.txt`, which lists the overall performance measures across all queries as a whole.

- Run the `full-eval.sh` script to see every trec_eval measure for each query. This creates a text file called `trec_eval_all_query.txt`, which lists all the measures the trec_eval module has to offer for each query run through the code.

## Functionality

Our task for this assignment was to implement an Information Retrieval (IR) system for a collection of documents (Twitter messages). A quick recap of what our code does as a whole:

1. We import both data files, one with the test queries and the other with the list of tweets, parse the information into a form our Python code can read, and organize the words for our functions (we use dictionaries to store the data). This step also runs the words through stemming and stop-word removal, which stems every word and removes stopwords from the word list.

2. We create an inverted-index dictionary for all the words in each tweet in the list of tweets. During this process we calculate the `idf` for each word and the `tf-idf` of each word within each tweet, which are also added to the dictionary created in this step.

3. We calculate the idf and tf-idf for the words in the test queries. We then combine these with the measures from step 2 to calculate the lengths of the queries and tweets, and use those lengths to compute the `CosSim` of each tweet, i.e. its similarity score against the query (a toy sketch follows this list). We order the tweets in a dictionary from highest to lowest similarity score and pass this information to step 4.

4. We use the data calculated in step 3 to write a text file (`Results.txt`) in the format specified in the assignment.
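To make steps 2 and 3 concrete, here is a toy `CosSim` computation for one query and one tweet, assuming the usual tf-idf weighting and cosine formula (the term weights below are made up; this is an illustration, not the exact code in `result.py`):

```python
import math

# Hypothetical tf-idf weights for the terms of a query q and a tweet d.
# Terms missing from either side contribute 0 to the dot product.
q = {'haiti': 1.2, 'earthquak': 0.9}               # query tf-idf weights
d = {'haiti': 0.8, 'earthquak': 1.1, 'aid': 0.5}   # tweet tf-idf weights

# Dot product over the shared terms.
dot = sum(w * d[t] for t, w in q.items() if t in d)

# Vector lengths: the "length of the query" and "length of the document".
q_len = math.sqrt(sum(w * w for w in q.values()))
d_len = math.sqrt(sum(w * w for w in d.values()))

cos_sim = dot / (q_len * d_len)
print(round(cos_sim, 4))  # ~0.897; tweets are ranked by this score per query
```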
## Algorithms, Data Structures, and Optimizations

Our implementation of the information retrieval system was based on the guidelines provided in the assignment. The project contains five Python files holding the functions used to implement the IR system.

### Project Specific Files

#### `main.py`:
This file contains the `main()` function. In `main()`, we start by importing the functions used to implement the IR system. The first step imports the tweets and the queries from the `assets` folder; during this import, `step 1: preprocessing` is performed by the `filterSentence` function, which is called directly inside the import functions. After importing and filtering the tweets and queries, we build the `inverted index` for the tweets. We then compute the `document length` for each tweet from the index, and likewise the lengths of the queries read from the text file. In the `retrieval` function, the cosine similarity scores are calculated and ranked in descending order. To make clear what `main()` is doing, a set of print statements notifies the user when preprocessing and document ranking are done, and when the result file has been created.
#### `preprocess.py`:
This file implements `step 1: preprocessing` and `step 2: indexing` in Python. The functions below are implemented in `preprocess.py` (an inverted-index sketch follows the list):
- `isNumeric(subject)`: checks whether a string contains numerical values.
- `importTweets()`: imports the tweets from the collection. We first open the text file, then filter each sentence with our `filterSentence` function.
- `importQuery()`: imports the queries from the collection, using the same process as `importTweets()`.
- `filterSentence(sentence)`: filters sentences from tweets and queries. This function builds a list of `stopwords` and then `tokenizes` each word in the sentence, removing any numerics, punctuation, or stopwords contained in the list. Each imported tweet and query runs through `NLTK's stopword list`, our `custom stopword list` (which includes `URLs and abbreviations`), and the provided `stopword list`. After this step, each word is tokenized and stemmed with the `Porter stemmer`. The use of `tokenization`, `stopwords`, and the `Porter stemmer` is discussed in depth under the `Additional Libraries` section.
- `buildIndex(documents)`: builds the inverted index with an entry for each word in the vocabulary. The data structure used is the hash map; in Python, dictionaries are the equivalent of hash maps, and we use them to store the processed data for the documents and queries. We initialize the `inverted index` and the `word_idf` as empty dictionaries that are returned at the end. The next step stores the frequency of each word inside the already filtered documents. The final step calculates the `IDF` and the `TF-IDF` for all words contained in a document.
- `lengthOfDocument(index, tweets)`: calculates the document length for each tweet.
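As a reference for the shape of the data, here is a minimal sketch of a `buildIndex`-style function (hypothetical names, and the log base for `IDF` is an assumption; the real implementation may differ in detail):

```python
import math

def build_index(documents):
    """documents maps doc_id -> list of filtered tokens.
    Returns (inverted_index, word_idf): inverted_index maps
    word -> {doc_id: tf-idf}, and word_idf maps word -> idf."""
    inverted_index, word_idf = {}, {}

    # First pass: raw term frequency of each word per document.
    for doc_id, tokens in documents.items():
        for token in tokens:
            postings = inverted_index.setdefault(token, {})
            postings[doc_id] = postings.get(doc_id, 0) + 1

    # Second pass: idf per word, then tf-idf per (word, document) pair.
    n_docs = len(documents)
    for word, postings in inverted_index.items():
        word_idf[word] = math.log2(n_docs / len(postings))
        for doc_id, tf in postings.items():
            postings[doc_id] = tf * word_idf[word]

    return inverted_index, word_idf
```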
#### `result.py`:
This file contains the function that calculates the cosine similarity values for the set of documents against each query and then ranks the similarity scores in descending order. Dictionaries are the main structure used to store the values of the `query_index`, `retrieval`, and `query_length`. The function consists mainly of `for` loops. We first count the occurrences of each token in a query, then calculate the `TF-IDF` and the `length of the query`. With those calculations in hand, we compute the `CosSim` values and `rank the documents` in the specified order.
#### `write.py`:
This file contains the procedure for implementing `step 4`. The function creates a table for each of the results generated in `result.py` and stores it in the `dist` folder as a text file.

### Additional Libraries

#### Prettytable (`prettytable.py`):

A helper library used to format the output for the `Results.txt` file; used in the implementation of `write.py`.
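For example, a small sketch of laying out result rows with `prettytable` (illustrative only; `write.py` may configure the table differently):

```python
from prettytable import PrettyTable

table = PrettyTable()
table.field_names = ["Topic_id", "Q0", "docno", "rank", "score", "tag"]
table.add_row([1, "Q0", "30198105513140224", 1, 0.588467208018523, "myRun"])
print(table)  # write.py saves output in this shape to dist/Results.txt
```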
#### NLTK:

Three NLTK components were used in preprocessing; a combined sketch follows this section.

#### PorterStemmer
The Porter stemmer was an external resource used in the implementation of `filterSentence(sentence)`. It normalizes each token that is created: stemming removes the morphological and inflexional endings from words in the text file.

#### Stopwords
Stopwords were also used in the preprocessing of the data. Since stopwords are common words that generally do not contribute to the meaning of a sentence, we filter them out, as can be seen in the `filterSentence(sentence)` function.

#### Tokenizer
We tokenized our data in `filterSentence(sentence)` to provide a link between queries and documents. Tokens are sequences of alphanumeric characters separated by non-alphanumeric characters; tokenization is performed as part of preprocessing (a `step 1` requirement).
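Putting the three pieces together, a minimal sketch of a `filterSentence`-style pipeline (a hypothetical stand-in with a two-word custom stopword list; the real function also strips URLs, numerics, and punctuation against all three stopword lists):

```python
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english')) | {'rt', 'http'}  # custom additions

def filter_sentence(sentence):
    tokens = word_tokenize(sentence.lower())
    # Keep alphabetic, non-stopword tokens, then stem each one.
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop_words]

print(filter_sentence("BBC World Service staff cuts"))
# -> ['bbc', 'world', 'servic', 'staff', 'cut']
```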
## Final Result Discussion

The following is the evaluation of our system using the trec_eval script, comparing our results (`dist/Results.txt`) with the expected results from the provided relevance feedback file.

```
runid                 all  myRun
num_q                 all  49
num_ret               all  39091
num_rel               all  2640
num_rel_ret           all  2054
map                   all  0.1634
gm_map                all  0.0919
Rprec                 all  0.1856
bpref                 all  0.1465
recip_rank            all  0.3484
iprec_at_recall_0.00  all  0.4229
iprec_at_recall_0.10  all  0.3001
iprec_at_recall_0.20  all  0.2653
iprec_at_recall_0.30  all  0.2195
iprec_at_recall_0.40  all  0.2025
iprec_at_recall_0.50  all  0.1770
iprec_at_recall_0.60  all  0.1436
iprec_at_recall_0.70  all  0.1230
iprec_at_recall_0.80  all  0.1027
iprec_at_recall_0.90  all  0.0685
iprec_at_recall_1.00  all  0.0115
P_5                   all  0.1714
P_10                  all  0.1796
P_15                  all  0.1796
P_20                  all  0.1776
P_30                  all  0.1714
P_100                 all  0.1406
P_200                 all  0.1133
P_500                 all  0.0713
P_1000                all  0.0419
```

From an overall perspective, the results seemed okay, though not as strong as we had hoped. The MAP score, which represents the overall performance of our searching, came out to `16.3%`, with a `P_10` of `0.17`. The MAP score improved after we re-evaluated `result.py`: we optimized our retrieval and ranking after discovering some anomalies in how queries were scored against the inverted index, which raised the MAP score slightly to the number recorded above. When performing searches manually, the results looked noticeably more relevant, and the numbers began to make more sense.

## Results from Queries 3 and 20

### Query 3

```
3 Q0 32333726654398464 1 0.69484735460699 myRun
3 Q0 32910196598636545 2 0.6734426036041226 myRun
3 Q0 35040428893937664 3 0.5424091725376433 myRun
3 Q0 35039337598947328 4 0.5424091725376433 myRun
3 Q0 29613127372898304 5 0.5233927588038552 myRun
3 Q0 29615296666931200 6 0.5054085301107222 myRun
3 Q0 32204788955357184 7 0.48949945859699995 myRun
3 Q0 33711164877701120 8 0.47740062368197117 myRun
3 Q0 33995136060882945 9 0.47209559331399364 myRun
3 Q0 31167954573852672 10 0.47209559331399364 myRun
```

### Query 20

```
20 Q0 33356942797701120 1 0.8821317020383918 myRun
20 Q0 34082003779330048 2 0.7311611336720092 myRun
20 Q0 34066620821282816 3 0.7311611336720092 myRun
20 Q0 33752688764125184 4 0.7311611336720092 myRun
20 Q0 33695252271480832 5 0.7311611336720092 myRun
20 Q0 33580510970126337 6 0.7311611336720092 myRun
20 Q0 32866366780342272 7 0.7311611336720092 myRun
20 Q0 32269178773708800 8 0.7311611336720092 myRun
20 Q0 32179898437218304 9 0.7311611336720092 myRun
20 Q0 31752644565409792 10 0.7311611336720092 myRun
```

## Vocabulary

Our vocabulary size was `88422` tokens.

Below is a sample of 100 tokens from our vocabulary:

```['bbc', 'world', 'servic', 'staff', 'cut', 'fifa', 'soccer', 'haiti', 'aristid', 'return', 'mexico', 'drug', 'war', 'diplomat', 'arrest', 'murder', 'phone', 'hack', 'british', 'politician', 'toyota', 'reca', 'egyptian', 'protest', 'attack', 'museumkubica', 'crash', 'assang', 'nobel', 'peac', 'nomin', 'oprah', 'winfrey', 'half-sist', 'known', 'unknown', 'white', 'stripe', 'breakup', 'william', 'kate', 'fax', 'save-the-da', 'cuomo', 'budget', 'super', 'bowl', 'seat', 'tsa', 'airport', 'screen', 'unemploymen', 'reduc', 'energi', 'consumpt', 'detroit', 'auto', 'global', 'warm', 'weather', 'keith', 'olbermann', 'job', 'special', 'athlet', 'state', 'union', 'dog', 'whisper', 'cesar', 'millan', "'s", 'techniqu', 'msnbc', 'rachel', 'maddow', 'sargent', 'shriver', 'tribut', 'moscow', 'bomb', 'gifford', 'recoveri', 'jordan', 'curfew', 'beck', 'piven', 'obama', 'birth', 'certifica', 'campaign', 'social', 'media', 'veneta', 'organ', 'farm', 'requir', 'evacu', 'carbon', 'monoxid']```