Keywords4CV is a Python-based tool designed to help job seekers optimize their resumes and LinkedIn profiles for Applicant Tracking Systems (ATS) and human recruiters. It analyzes a collection of job descriptions and extracts the most important and relevant keywords, enabling users to tailor their application materials to specific job roles. By incorporating these keywords, users can significantly increase their chances of getting noticed by both ATS and recruiters, leading to more interview opportunities.
## State of Production
**Not ready! Under active development.**
Currently at version `0.09 (Alpha)`. While significant improvements have been made, critical issues persist (see Known Issues below), and the tool is not yet production-ready.
## Features
* **Enhanced Keyword Extraction:** Identifies key skills, qualifications, and terminology from job descriptions using advanced NLP techniques (spaCy, NLTK, scikit-learn). Now preserves multi-word skills (e.g., "machine learning") through entity recognition.
* **TF-IDF Analysis with Advanced Weighting:** Computes Term Frequency-Inverse Document Frequency (TF-IDF) scores, combined with *whitelist boosting* and *keyword frequency*. Configurable weights allow for fine-tuning keyword importance.
* **Synonym Expansion:** Leverages WordNet to suggest synonyms, expanding keyword coverage and improving semantic matching.
* **Highly Configurable:** Uses a `config.yaml` file for extensive customization (an illustrative sketch follows this feature list):
    * Stop words (with options to add and exclude specific words). More comprehensive default stop word list.
    * Skills whitelist: Extensive and customizable whitelist of technical and soft skills, now integrated into the spaCy entity ruler for better preservation.
    * Keyword categories and semantic categorization: Group keywords into meaningful categories (e.g., "Technical Skills", "Soft Skills") with improved semantic accuracy via word vectors.
    * N-gram range: Adjusted to `[1, 2]` in `0.09` for concise, actionable phrases (configurable).
    * Similarity threshold: Increased to `0.65` in `0.09` for stricter semantic categorization (configurable).
    * Weighting for TF-IDF, frequency, and whitelist boost: Customize the scoring mechanism.
* **Keyword Summary:** Aggregated keyword scores, job counts, average scores, and assigned categories for a high-level overview.
* **Job Specific Details:** Detailed table (formerly a pivot table) showing keyword scores, TF-IDF, frequency, category, and whitelist status per job title.
* **Robust Input Validation:** Rigorous validation of job descriptions, handling empty titles, descriptions, incorrect data types, and encoding issues. Clear error messages and logging.
* **User-Friendly Command-Line Interface:** `argparse` provides a clear and easy-to-use interface.
* **Comprehensive Error Handling and Logging:** Detailed logging to `ats_optimizer.log` with improved error handling for configuration, input, and memory issues.
* **Multiprocessing for Robust Analysis:** Core analysis uses multiprocessing for stability and timeout control.
* **Efficient Caching:** Uses `functools.lru_cache` and `OrderedDict` to optimize performance, with cache invalidation on config changes.
* **spaCy Pipeline Optimization:** Disables unnecessary components (`parser`, `ner`) and adds a `sentencizer` for efficiency and consistency.
* **Automatic NLTK Resource Management:** Ensures WordNet and other resources are downloaded if missing.
* **Memory Management and Chunking:** Adaptive chunking and memory monitoring prevent memory errors for large datasets.
* **Extensive Unit Testing:** Includes a test suite in `tests/`, though it is currently unreliable (see Known Issues).
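To make the configuration options above concrete, here is a minimal, illustrative `config.yaml` sketch. The key names mirror the options documented in this README; the values (apart from the `ngram_range` and `similarity_threshold` defaults noted above) are placeholder assumptions, not the shipped defaults.

```yaml
# Illustrative sketch only -- values are placeholder assumptions, not the shipped defaults.
stop_words_add: ["responsibilities", "preferred"]
stop_words_exclude: ["it"]            # e.g., keep "IT" out of the stop word list
skills_whitelist:
  - "python"
  - "machine learning"
  - "rest apis"
keyword_categories:
  Technical Skills: ["python", "machine learning", "rest apis"]
  Soft Skills: ["communication", "teamwork"]
output_dir: "output"
ngram_range: [1, 2]                   # default in 0.09
similarity_threshold: 0.65            # default in 0.09
weighting:
  tfidf_weight: 0.7
  frequency_weight: 0.3
  whitelist_boost: 1.5
min_desc_length: 100
min_jobs: 2
max_desc_length: 100000
```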
## How it Works
1. **Input:** Accepts a JSON file (e.g., `job_descriptions.json`) with job titles as keys and descriptions as values.
2. **Preprocessing:** Cleans text (lowercasing, URL/email removal), tokenizes, lemmatizes, and caches results.
3. **Keyword Extraction:**
    * Matches `skills_whitelist` phrases as `SKILL` entities using spaCy.
    * Generates n-grams (default `[1, 2]` in `0.09`) and filters out noise (single-letter tokens, stop words).
    * Expands with WordNet synonyms.
4. **Keyword Weighting and Scoring:** Combines TF-IDF, frequency, and whitelist boost (configurable weights); see the sketch after this list.
5. **Semantic Keyword Categorization:** Assigns categories using substring matching and semantic similarity (threshold `0.65` in `0.09`).
6. **Output Generation:** Produces Excel reports:
    * **Summary:** Ranked keywords with `Total_Score`, `Avg_Score`, `Job_Count`.
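The sketch below illustrates, under stated assumptions, how steps 3 and 4 fit together: whitelist phrases are matched as `SKILL` entities with spaCy's `EntityRuler`, and each candidate term receives a score combining TF-IDF, frequency, and a whitelist boost. It is not the project's actual implementation; the model name (`en_core_web_sm`), weight values, and variable names are assumptions.

```python
# Minimal sketch of steps 3-4 (illustrative; not keywords4cv.py itself).
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

skills_whitelist = ["machine learning", "python", "rest apis"]
weights = {"tfidf_weight": 0.7, "frequency_weight": 0.3, "whitelist_boost": 1.5}  # assumed values

# Step 3: match whitelist phrases as SKILL entities (parser/ner disabled, sentencizer added).
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "SKILL", "pattern": p} for p in skills_whitelist])

jobs = {"Software Engineer": "proficient in python, rest apis and machine learning"}
whitelist_hits = {
    title: {ent.text for ent in nlp(desc).ents if ent.label_ == "SKILL"}
    for title, desc in jobs.items()
}

# Step 4: score 1-2 word n-grams as a weighted mix of TF-IDF and frequency,
# boosted when the term was matched against the whitelist.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
matrix = vectorizer.fit_transform(jobs.values())
terms = vectorizer.get_feature_names_out()

for (title, desc), row in zip(jobs.items(), matrix.toarray()):
    for term, tfidf_score in zip(terms, row):
        if tfidf_score == 0:
            continue
        boost = weights["whitelist_boost"] if term in whitelist_hits[title] else 1.0
        score = (weights["tfidf_weight"] * tfidf_score
                 + weights["frequency_weight"] * desc.count(term)) * boost
        print(f"{title}: {term} -> {score:.3f}")
```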
"Software Engineer": "Proficient in Java, Spring, REST APIs..."
107
92
}
108
93
```
2. **Customize `config.yaml`:** Adjust `skills_whitelist`, `ngram_range`, `similarity_threshold`, etc.

3. **Run the script** (an example invocation follows this list), passing:

    * `-i` or `--input`: Path to the input JSON file (required).
    * `-c` or `--config`: Path to the configuration file (optional, defaults to `config.yaml`).
    * `-o` or `--output`: Path to the output Excel file (optional, defaults to `results.xlsx`).
4. **Review Output:** Check `results.xlsx` and `ats_optimizer.log`.
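A typical invocation, using the repository's default file names and the flags listed in step 3, might look like this:

```bash
python keywords4cv.py -i job_descriptions.json -c config.yaml -o results.xlsx
```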
### Running Unit Tests
```bash
cd tests
pytest
```
**Note:** Tests are currently unreliable (see Known Issues).
## Known Issues (Version 0.09)
- **Incorrect Keyword Display:** The summary shows single-word keywords with uniform scores (e.g., `1.42192903`) and misses multi-word phrases from `skills_whitelist`.
- **Unreliable Unit Tests:** The test suite fails consistently and lacks coverage.
- **Whitelist Inconsistency:** Many whitelisted terms are marked `FALSE` in the output.
- **Low TF-IDF Variance:** Scores lack differentiation, possibly due to scoring bugs.
## Repository Structure
* `keywords4cv.py`: Main script.
* `config.yaml`: Configuration file.
* `README.md`: This documentation.
* `output/`: Auto-generated directory for `results.xlsx` and `ats_optimizer.log`.
* `requirements.txt`: Dependency list (update with `pip freeze > requirements.txt`).
* `tests/`: Unit tests (currently unreliable).
* `job_descriptions.json`: Sample input.
## Contributing
1. Fork the repository, create a branch, and make your changes.
2. Update or add tests in `tests/`.
3. Run `pytest` (despite the known test issues) and commit.
4. Submit a pull request with a detailed description.
0 commit comments