Version 0.24 (Alpha version)
Pre-release
Major Enhancements and New Features:
- Comprehensive Configuration Validation: Implemented a robust two-stage configuration validation system. This uses the `schema` library for initial YAML structure validation (ensuring correct keys, data types, and relationships) and Pydantic for runtime validation and type coercion. This significantly improves the reliability and user-friendliness of the script by catching configuration errors early and providing informative error messages. The `config_validation.py` module encapsulates this logic. Pydantic models are used extensively throughout, ensuring type safety and data integrity (a minimal sketch of the two-stage idea is shown below).
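As a rough illustration of the two-stage approach, the flow looks roughly like the sketch below. The key names and defaults are assumptions chosen for the example, not the project's actual models:

```python
# Illustrative two-stage validation sketch (not the project's actual models).
# Stage 1: structural check of the parsed YAML with the `schema` library.
# Stage 2: runtime validation and type coercion with Pydantic.
from schema import And, Optional, Schema, SchemaError
from pydantic import BaseModel, ValidationError

# Hypothetical subset of config keys, for illustration only.
STRUCTURE = Schema(
    {
        "text_processing": {
            Optional("fuzzy_before_semantic"): bool,
            Optional("context_window_size"): And(int, lambda n: n > 0),
        },
    },
    ignore_extra_keys=True,
)

class TextProcessingConfig(BaseModel):
    fuzzy_before_semantic: bool = True
    context_window_size: int = 1

def validate_config(raw: dict) -> TextProcessingConfig:
    try:
        STRUCTURE.validate(raw)                                         # stage 1: keys and types
        return TextProcessingConfig(**raw.get("text_processing", {}))  # stage 2: coercion
    except (SchemaError, ValidationError) as exc:
        raise SystemExit(f"Invalid configuration: {exc}") from exc
```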
- Advanced Keyword Extraction and Filtering:
  - Fuzzy Matching Integration: Integrated `rapidfuzz` for fuzzy matching of keywords against a whitelist (or the expanded set of skills). This allows for variations in spelling and phrasing, improving recall. Configuration options include the matching algorithm (`ratio`, `partial_ratio`, `token_sort_ratio`, `token_set_ratio`, `WRatio`), minimum similarity score, and allowed POS tags (see the sketch after this list).
  - Configurable Processing Order: Added the `fuzzy_before_semantic` option (`text_processing` section in `config.yaml`). This allows users to choose whether fuzzy matching is applied before or after semantic validation, providing greater flexibility in the keyword extraction pipeline.
  - Phrase-Level Synonym Handling: Introduced support for phrase-level synonyms (e.g., "product management" with synonyms ["product leadership", "product ownership"]). Synonyms can be loaded from a static JSON file (`phrase_synonyms_path`) or fetched from an API (`api_endpoint`, `api_key`). This significantly expands the ability to capture relevant skills expressed in different ways. The `SynonymEntry` Pydantic model enforces data integrity for static synonyms.
  - Improved Contextual Validation: Enhanced semantic validation using a configurable context window (`context_window_size`). The script now considers the surrounding sentences (respecting paragraph breaks) to determine if a keyword is used in the relevant context. This reduces false positives. The sentence splitting logic now handles bullet points and numbered lists more robustly.
  - POS Tag Filtering: Added more granular control over POS tag filtering with the `pos_filter` and `allowed_pos` options. This allows users to specify which parts of speech are considered for keyword extraction and fuzzy matching.
  - Trigram Optimization: Implemented a `TrigramOptimizer` to improve the efficiency of n-gram generation and candidate selection. This uses an LRU cache to store frequently used trigrams, reducing redundant computations.
  - Dynamic N-gram Generation: The `_generate_ngrams` function is now cached and handles edge cases more robustly (e.g., invalid input `n`).
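For illustration, a fuzzy whitelist match with `rapidfuzz` could look like the following. The whitelist, scorer choice, and cutoff are examples, not the script's defaults:

```python
# Illustrative only: fuzzy-match candidate terms against a whitelist with rapidfuzz.
from rapidfuzz import fuzz, process

whitelist = ["product management", "data analysis", "stakeholder management"]
candidates = ["product managment", "data analytics", "gardening"]

for term in candidates:
    # WRatio is one of the configurable scorers; 85 is an example cutoff.
    match = process.extractOne(term, whitelist, scorer=fuzz.WRatio, score_cutoff=85)
    if match:
        matched_text, score, _ = match
        print(f"{term!r} -> {matched_text!r} (score {score:.0f})")
    else:
        print(f"{term!r} -> no match above cutoff")
```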
- Adaptive Chunking and Parameter Tuning:
  - Smart Chunker: Introduced a `SmartChunker` class that uses a Q-learning algorithm to dynamically adjust the chunk size based on dataset statistics (average job description length, number of texts) and system resource usage (memory). This helps to optimize performance and prevent out-of-memory errors (see the sketch after this list).
  - Auto Tuner: Added an `AutoTuner` class that automatically adjusts parameters (e.g., `chunk_size`, `pos_processing`) based on metrics (recall, memory usage, processing time) and the trigram cache hit rate. This allows the script to adapt to different datasets and hardware configurations.
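As a rough sketch of the Q-learning idea behind chunk-size adaptation (the state buckets, actions, and reward below are invented for illustration and do not mirror the real `SmartChunker`):

```python
# Toy Q-learning chunk sizer, for illustration only (not the real SmartChunker).
import random
from collections import defaultdict

ACTIONS = (-200, 0, 200)  # decrease, keep, or increase the chunk size

class ToyChunkSizer:
    def __init__(self, chunk_size=800, alpha=0.3, epsilon=0.1):
        self.chunk_size = chunk_size
        self.alpha, self.epsilon = alpha, epsilon
        self.q = defaultdict(float)   # (state, action) -> estimated reward
        self._last = None

    def _state(self, avg_doc_len, mem_used_fraction):
        # Coarse buckets keep the Q-table tiny.
        return (avg_doc_len // 500, round(mem_used_fraction, 1))

    def choose(self, avg_doc_len, mem_used_fraction):
        state = self._state(avg_doc_len, mem_used_fraction)
        if random.random() < self.epsilon:
            action = random.choice(ACTIONS)                              # explore
        else:
            action = max(ACTIONS, key=lambda a: self.q[(state, a)])      # exploit
        self.chunk_size = max(100, self.chunk_size + action)
        self._last = (state, action)
        return self.chunk_size

    def feedback(self, reward):
        # e.g. reward = throughput bonus minus a penalty for high memory usage
        if self._last is None:
            return
        state, action = self._last
        self.q[(state, action)] += self.alpha * (reward - self.q[(state, action)])
```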
- Intermediate Result Saving and Checkpointing:
  - Configurable Intermediate Saving: Implemented robust intermediate saving of results (summary and detailed scores) to disk. This allows for resuming processing after interruptions and prevents data loss in case of errors. The `intermediate_save` section in `config.yaml` controls the format (`feather`, `jsonl`, `json`), save interval, working directory, and cleanup behavior.
  - Data Integrity Checks: Added checksum verification (using `xxhash`) for intermediate files to ensure data integrity. A checksum manifest file (`checksums.jsonl`) is created and used to verify the integrity of the saved data (see the sketch after this list).
  - Streaming Data Aggregation: Implemented a streaming data aggregation approach for combining intermediate results. This allows the script to handle very large datasets that don't fit in memory. The `_aggregate_results` function handles both lists and generators of DataFrames.
  - Schema Validation and Appending: The code now validates the schema of intermediate files (especially for feather and jsonl) and can append new chunks to existing files.
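A minimal sketch of writing a chunk and recording its checksum in a manifest (file names and record fields here are assumptions, not the script's exact layout):

```python
# Illustrative only: save one chunk as JSONL and append its checksum to a manifest.
from pathlib import Path
import srsly
import xxhash

def save_chunk(records, chunk_id: int, workdir: str = "working_dir"):
    out_dir = Path(workdir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_path = out_dir / f"chunk_{chunk_id:05d}.jsonl"

    srsly.write_jsonl(chunk_path, records)                      # one JSON object per line
    digest = xxhash.xxh64(chunk_path.read_bytes()).hexdigest()  # checksum of the written file

    # Append a manifest entry so the chunk can be verified before aggregation.
    with open(out_dir / "checksums.jsonl", "a", encoding="utf-8") as manifest:
        manifest.write(srsly.json_dumps({"file": chunk_path.name, "xxh64": digest}) + "\n")
    return chunk_path, digest
```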
- Enhanced Error Handling and Logging:
  - Custom Exceptions: Defined custom exceptions (`ConfigError`, `InputValidationError`, `CriticalFailureError`, `AggregationError`, `DataIntegrityError`) for more specific error handling and reporting.
  - Comprehensive Error Handling: Added extensive error handling throughout the script, including checks for invalid input, file I/O errors, API errors, memory errors, and data integrity issues.
  - Improved Logging: Enhanced logging to provide more informative messages about the script's progress, warnings, and errors. This includes logging of configuration parameters, dataset statistics, processing times, memory usage, and cache hit rates.
  - Strict Mode: Added a `strict_mode` option (in `config.yaml`) that, when enabled, causes the script to raise exceptions on certain errors (e.g., invalid input, empty descriptions) instead of logging warnings and continuing (see the sketch after this list).
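The exception names above come from `exceptions.py`; the strict-mode handling below is an illustrative pattern only, not the script's actual code:

```python
# Illustrative strict-mode pattern (exception name from the release notes;
# the handling logic here is a sketch, not the actual implementation).
import logging

class InputValidationError(Exception):
    """Raised when a job description fails input validation."""

logger = logging.getLogger("keywords4cv")

def handle_empty_description(job_title: str, strict_mode: bool) -> None:
    message = f"Empty description for job: {job_title!r}"
    if strict_mode:
        raise InputValidationError(message)    # fail fast when strict_mode is on
    logger.warning("%s -- skipping", message)  # otherwise log and continue
```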
- Code Refactoring and Optimization:
  - Modular Design: Refactored the code into smaller, more manageable classes and functions (e.g., `ParallelProcessor`, `TrigramOptimizer`, `SmartChunker`, `AutoTuner`).
  - Type Hinting: Added type hints throughout the code to improve readability and maintainability.
  - Memory Management: Implemented various memory management techniques, including explicit garbage collection (`gc.collect()`), releasing spaCy Doc objects after processing, and using generators for streaming data processing.
  - Caching: Used `lru_cache` and `LRUCache` to cache frequently used computations (e.g., term vectorization, n-gram generation, fuzzy matching).
  - Parallel Processing: Leveraged `concurrent.futures.ProcessPoolExecutor` for parallel processing of job descriptions, significantly improving performance on multi-core systems.
  - Dynamic Batch Size: The batch size for spaCy processing is now dynamically calculated, considering available memory and the configured `memory_scaling_factor` (see the sketch after this list).
  - GPU Memory Check: Added an optional check for available GPU memory (if `use_gpu` and `check_gpu_memory` are enabled). If GPU memory is low, it can either disable GPU usage or reduce the number of workers.
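A sketch of the dynamic batch-size idea (the formula and bounds below are assumptions for illustration; the real calculation may differ):

```python
# Illustrative only: derive a spaCy batch size from available memory.
import psutil

def dynamic_batch_size(avg_doc_bytes: int,
                       memory_scaling_factor: float = 0.25,
                       min_batch: int = 16,
                       max_batch: int = 2048) -> int:
    """Scale the batch size with free RAM and clamp it to sane bounds."""
    available = psutil.virtual_memory().available   # bytes of free memory
    budget = available * memory_scaling_factor      # fraction allowed for the pipeline
    batch = int(budget // max(avg_doc_bytes, 1))
    return max(min_batch, min(batch, max_batch))

# Example usage with nlp.pipe:
# for doc in nlp.pipe(texts, batch_size=dynamic_batch_size(avg_doc_bytes=4_000)):
#     ...
```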
- Refactored TF-IDF Matrix Creation: The TF-IDF matrix creation is now more efficient and robust. The vectorizer is fitted only once (with optional sampling for large datasets), and keyword sets are pre-validated.
- Consistent Hashing: The caching system now uses a `cache_salt` to ensure that cache keys are unique across different runs and configurations. The salt can be set via an environment variable (`K4CV_CACHE_SALT`) or in the `config.yaml` file (see the sketch after this list).
- Improved Keyword Categorization: Keyword categorization logic is enhanced, and a configurable `default_category` is used for terms that cannot be categorized. The `categorization_cache_size` option controls the cache size for term categorization.
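For the consistent-hashing change, a salted cache key might be built roughly like this. The key layout is illustrative; only the `K4CV_CACHE_SALT` variable name comes from the notes above:

```python
# Illustrative only: build a salted cache key so keys differ across runs/configs.
import os
import xxhash

def cache_key(term: str, config_fingerprint: str) -> str:
    salt = os.environ.get("K4CV_CACHE_SALT", "default-salt")  # or taken from config.yaml
    payload = f"{salt}|{config_fingerprint}|{term}".encode("utf-8")
    return xxhash.xxh64(payload).hexdigest()

# Same term, different salt or config -> different key, so stale entries are not reused.
print(cache_key("product management", config_fingerprint="v0.24-defaults"))
```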
Bug Fixes:
- Fixed several issues related to data loading, validation, and processing.
- Improved error handling and logging in various parts of the script.
- Addressed potential memory leaks and improved overall memory management.
- Corrected issues with chunk size calculation and Q-table updates.
- Fixed inconsistencies in the application of the whitelist boost.
- Resolved issues with intermediate file saving and loading.
- Addressed errors during vectorization and score calculations.
Known Issues:
- NOTE: At this point in time, the script does not work. This release aims to introduce critical architectural changes.
Dependencies:
- nltk
- pandas
- spacy (>=3.0.0 recommended)
- scikit-learn
- pyyaml
- psutil
- hashlib (standard library; replaced by xxhash for checksums)
- requests
- rapidfuzz
- srsly
- xxhash
- cachetools
- pydantic (>=2.0 recommended, but v1 is supported)
- schema
- pyarrow
- numpy
- itertools
Future Improvements:
- Explore the use of Dask for distributed processing.
- Continue to refine the reinforcement learning algorithms for adaptive parameter tuning.
- Add more comprehensive unit tests.
- Improve documentation and user guide.
- Consider adding support for other input formats (e.g., CSV, text files).
- Explore the use of more advanced NLP techniques (e.g., transformer-based models).
How to Upgrade:
- Backup your existing `config.yaml` and `synonyms.json` files.
- Replace the old script files (`keywords4cv_*.py.txt`, `exceptions.py.txt`, `config_validation.py.txt`) with the new versions.
- Carefully review the updated `config.yaml.truncated.txt` file. There are many new configuration options and changes to existing ones. You will need to merge your existing configuration with the new template. Pay close attention to the following sections:
  - `validation`
  - `text_processing` (especially `phrase_synonym_source`, `phrase_synonyms_path`, `api_endpoint`, `api_key`, `fuzzy_before_semantic`)
  - `whitelist` (especially `fuzzy_matching`)
  - `hardware_limits`
  - `optimization`
  - `caching` (especially `cache_salt`)
  - `intermediate_save`
  - `advanced`
- If you are using a static synonym file, update its format to match the `SynonymEntry` model (see the documentation and the sketch after this list).
- Install any new dependencies: `pip install rapidfuzz srsly xxhash cachetools pydantic schema pyarrow`.
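A hypothetical static synonym entry and the kind of Pydantic model that could enforce it (the field names are assumptions based on the example in the notes above; check the documentation for the actual `SynonymEntry` schema):

```python
# Hypothetical sketch of a SynonymEntry-style model; the real field names and
# constraints are defined by the project, so treat this only as an illustration.
from typing import List
from pydantic import BaseModel

class SynonymEntrySketch(BaseModel):
    term: str             # e.g. "product management"
    synonyms: List[str]   # e.g. ["product leadership", "product ownership"]

entry = SynonymEntrySketch(
    term="product management",
    synonyms=["product leadership", "product ownership"],
)
print(entry.term, entry.synonyms)
```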
Breaking Changes:
- The configuration file format has changed significantly. You will need to update your `config.yaml` file.
- The `SynonymEntry` format in `synonyms.json` is now enforced using Pydantic.
- The `hashlib` library has been replaced with `xxhash` for checksum calculation.
- The intermediate file format and naming conventions have changed.
- The `max_workers` parameter is now also used within the `nlp.pipe` function.
- The `analyzer._load_all_intermediate` function now returns a generator.
- The `_create_tfidf_matrix` function's parameters have changed.
- The `_calculate_scores` function now yields results instead of returning a list.
What's Changed
- Update LICENSE by @DavidOsipov in #21
- Delete test_keywords4cv.py by @DavidOsipov in #22
- Delete ats_optimizer.log by @DavidOsipov in #23
- Add files via upload by @DavidOsipov in #24
Full Changelog: 0.09...0.24