Version 0.09 (Alpha) - 21/02/2025
Pre-release
This release builds on the foundation laid in 0.05 and 0.051, introducing significant enhancements to keyword extraction, semantic analysis, and overall robustness of the ATS Optimizer. While progress has been made on precision and functionality, some known issues remain unresolved, and new challenges have emerged.
Major Changes
Enhanced Keyword Extraction
- Entity Ruler Integration: Added whitelisted phrases from `skills_whitelist` as `SKILL` entities in the spaCy pipeline, preserving multi-word skills (e.g., "machine learning") during tokenization (see the sketch after this list).
- Improved N-Gram Generation: Updated `_generate_ngrams` to filter out single-letter tokens and stop words, ensuring cleaner and more relevant keyword sets for TF-IDF analysis.
- Refined Keyword Extraction: Enhanced `extract_keywords` to combine preserved `SKILL` entities with tokenized keywords, improving accuracy for technical terms.
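The snippet below is a minimal sketch of how whitelisted phrases can be registered as `SKILL` entities via spaCy's `EntityRuler`; the model name, the example whitelist, and the case-insensitive matching option are illustrative assumptions, not the project's exact implementation.

```python
import spacy

# Illustrative whitelist; in the project this comes from the skills_whitelist config key.
skills_whitelist = ["machine learning", "product management", "product owner"]

nlp = spacy.load("en_core_web_sm")  # assumed model for illustration

# Add an EntityRuler before the statistical NER so whitelisted phrases take precedence.
# phrase_matcher_attr="LOWER" is an assumption: it makes matching case-insensitive.
ruler = nlp.add_pipe("entity_ruler", before="ner", config={"phrase_matcher_attr": "LOWER"})
ruler.add_patterns([{"label": "SKILL", "pattern": phrase} for phrase in skills_whitelist])

doc = nlp("Experience with machine learning and product management is required.")
skills = [ent.text for ent in doc.ents if ent.label_ == "SKILL"]
print(skills)  # ['machine learning', 'product management']
```

Because the ruler runs before `ner`, the multi-word spans survive as single entities instead of being split into individual tokens downstream.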
Semantic Analysis Improvements
- Enabled by Default: Set `semantic_validation: True` in the config to filter keywords based on context, reducing irrelevant terms.
- Stricter Similarity Threshold: Increased `similarity_threshold` from `0.6` to `0.65` for more precise semantic categorization (a sketch of this check follows the list).
- N-Gram Range Adjustment: Reduced `ngram_range` and `whitelist_ngram_range` from `[1, 3]` to `[1, 2]` to focus on shorter, actionable phrases.
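A hedged sketch of the kind of context check semantic validation can perform, using vector similarity between a candidate keyword and the job text with the new `0.65` cutoff. The function name, the `en_core_web_md` model, and the whole-document comparison are assumptions for illustration, not the actual implementation.

```python
import spacy

# Requires a model with word vectors (assumption: en_core_web_md).
nlp = spacy.load("en_core_web_md")

SIMILARITY_THRESHOLD = 0.65  # raised from 0.6 in this release


def semantically_valid(keyword: str, context_text: str,
                       threshold: float = SIMILARITY_THRESHOLD) -> bool:
    """Keep a keyword only if it is similar enough to the surrounding job text."""
    kw_doc = nlp(keyword)
    ctx_doc = nlp(context_text)
    if not kw_doc.has_vector or not ctx_doc.has_vector:
        return True  # no vectors available; don't discard on missing data
    return kw_doc.similarity(ctx_doc) >= threshold


job_text = "We need a data scientist with strong machine learning experience."
print(semantically_valid("machine learning", job_text))  # likely True
print(semantically_valid("carpentry", job_text))         # likely below the cutoff
```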
Robustness and Debugging
- TF-IDF Matrix Validation: Improved `_create_tfidf_matrix` with pre-vectorization filtering of invalid tokens (e.g., single-letter words) and added debug logging for better traceability (see the sketch after this list).
- SpaCy Pipeline: Added `sentencizer` to the model loading process for consistent sentence boundary detection.
- Code Cleanup: Removed the unused `Pool` import from `multiprocessing` and streamlined imports for better maintainability.
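A rough sketch, under assumed names and a placeholder stop-word set, of pre-vectorization filtering plus debug logging in the spirit of `_create_tfidf_matrix`; the real function's signature and internals may differ.

```python
import logging

from sklearn.feature_extraction.text import TfidfVectorizer

logger = logging.getLogger(__name__)

STOP_WORDS = {"and", "or", "the", "a", "of", "with"}  # placeholder; the project uses its own set


def _filter_tokens(tokens):
    """Drop single-letter tokens and stop words before vectorization."""
    kept = [t for t in tokens if len(t) > 1 and t.lower() not in STOP_WORDS]
    dropped = set(tokens) - set(kept)
    if dropped:
        logger.debug("Filtered %d invalid tokens: %s", len(dropped), sorted(dropped))
    return kept


def create_tfidf_matrix(tokenized_docs, ngram_range=(1, 2)):
    """Build a TF-IDF matrix from pre-tokenized, pre-filtered documents."""
    cleaned = [" ".join(_filter_tokens(doc)) for doc in tokenized_docs]
    vectorizer = TfidfVectorizer(ngram_range=ngram_range, lowercase=True)
    matrix = vectorizer.fit_transform(cleaned)
    logger.debug("TF-IDF matrix shape: %s", matrix.shape)
    return matrix, vectorizer.get_feature_names_out()
```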
New Features
- Preservation of Whitelisted Skills: Multi-word skills from `skills_whitelist` are now recognized as entities, ensuring they remain intact in the output (e.g., "product management" instead of split terms).
- Dynamic N-Gram Filtering: Automatically excludes noisy n-grams containing stop words or single characters (see the sketch after this list).
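A minimal illustration of this kind of n-gram filtering; the stop-word set is a placeholder and the function body is not taken from the project code.

```python
from typing import List, Set

STOP_WORDS: Set[str] = {"of", "and", "the", "with", "to"}  # placeholder stop-word set


def _generate_ngrams(tokens: List[str], n: int) -> List[str]:
    """Generate n-grams, skipping any that contain a stop word or single-character token."""
    ngrams = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if any(len(tok) <= 1 or tok.lower() in STOP_WORDS for tok in window):
            continue  # dynamic filtering: drop noisy n-grams
        ngrams.append(" ".join(window))
    return ngrams


tokens = ["experience", "with", "machine", "learning", "and", "llms"]
print(_generate_ngrams(tokens, 2))  # ['machine learning']
```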
Resolved Issues
- None of the issues from previous versions have been fully resolved yet (see Known Issues).
Known Issues
- [Critical, Unresolved from 0.05] Incorrectly Displayed Keywords in Excel: The `Summary` sheet shows single-word keywords (e.g., "science", "cross") with suspiciously uniform scores (e.g., `1.42192903` for multiple terms), indicating a potential scoring or aggregation issue. Multi-word phrases from `skills_whitelist` are not consistently appearing as expected.
- [Critical, Unresolved from 0.05] Unreliable Unit Tests: The test suite still fails consistently and lacks coverage for critical components such as keyword preservation and scoring.
- [New, High Priority] Inconsistent Whitelist Application: The `Detailed Scores` sheet shows many keywords marked as `In Whitelist: FALSE` despite being in `skills_whitelist` (e.g., "product owner"), suggesting an issue with whitelist matching or entity recognition (a small diagnostic sketch follows this list).
- [New, Medium Priority] Low TF-IDF Variance: TF-IDF scores in the `Detailed Scores` sheet are often identical (e.g., `0.049574662`), indicating potential issues with document differentiation or scoring normalization.
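As a diagnostic aid for the whitelist issue above, a hypothetical normalization check like the one below can show whether mismatches are due to simple casing or whitespace differences rather than entity recognition; the `normalize` rule is an assumption, not the project's matching logic.

```python
def normalize(term: str) -> str:
    """Hypothetical normalization: lowercase and collapse internal whitespace."""
    return " ".join(term.lower().split())


skills_whitelist = ["Product Owner", "Machine Learning"]
extracted_keywords = ["product owner", "machine  learning", "leveraging llms"]

whitelist_normalized = {normalize(s) for s in skills_whitelist}
for kw in extracted_keywords:
    in_whitelist = normalize(kw) in whitelist_normalized
    print(f"{kw!r}: In Whitelist = {in_whitelist}")
```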
Sample Output Analysis
- Summary Page: Keywords like "science", "cross", and "technical" have identical `Total_Score` (`1.42192903`) and `Avg_Score` (`0.473976343`) across 3 jobs, suggesting a possible bug in score calculation or keyword weighting.
- Detailed Scores: Multi-word phrases (e.g., "product owner", "leveraging llms") appear, but their `In Whitelist` status is inconsistent, and scores are low (`0.034702263`), possibly due to TF-IDF dilution or the whitelist boost not applying correctly.
Dependencies
nltk
numpy
pandas
spacy
scikit-learn
pyyaml
psutil
hashlib (Python standard library)
Future Improvements
- [High Priority] Fix Keyword Display and Scoring: Address the uniform scoring and missing multi-word keywords in the Excel output.
- [High Priority] Overhaul Unit Tests: Develop comprehensive tests to validate entity recognition, whitelist application, and scoring accuracy.
- [Medium Priority] Enhance Whitelist Boost: Ensure `whitelist_boost` (1.5) is consistently applied to whitelisted terms (a sketch of one possible application follows this list).
- [Medium Priority] Optimize TF-IDF: Investigate the low variance in TF-IDF scores and improve differentiation across documents.
- Further refine semantic filtering for domain-specific terms.
- Enhance logging to pinpoint scoring and categorization issues.
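A minimal sketch of how a `whitelist_boost` multiplier of 1.5 could be applied on top of per-term scores; the scoring structure here is assumed for illustration and is not the project's actual pipeline.

```python
WHITELIST_BOOST = 1.5  # configured boost factor


def apply_whitelist_boost(scores, whitelist, boost=WHITELIST_BOOST):
    """Multiply the score of every whitelisted term by the boost factor."""
    whitelist_lower = {term.lower() for term in whitelist}
    return {
        term: score * boost if term.lower() in whitelist_lower else score
        for term, score in scores.items()
    }


scores = {"product owner": 0.034702263, "cross": 0.049574662}
boosted = apply_whitelist_boost(scores, ["product owner"])
print(boosted)  # "product owner" boosted to ~0.052; "cross" unchanged
```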
What's Changed
- Create bandit.yml by @DavidOsipov in #17
- Update shundor/python-bandit-scan digest to ab1d87d by @renovate in #18
- Enhance ATS Optimizer with Improved Keyword Extraction and Semantic Analysis by @DavidOsipov in #19
- Update README.md by @DavidOsipov in #20
Full Changelog: 0.051...0.09