
Version 0.24 (Alpha version)

Pre-release
@DavidOsipov released this 28 Feb 19:13
· 17 commits to main since this release
1619c8b

Major Enhancements and New Features:

  • Comprehensive Configuration Validation: Implemented a robust two-stage configuration validation system: the schema library performs initial YAML structure validation (correct keys, data types, and relationships), and Pydantic handles runtime validation and type coercion. This catches configuration errors early and produces informative error messages, making the script more reliable and easier to use. The config_validation.py module encapsulates this logic, and Pydantic models are used extensively throughout to ensure type safety and data integrity (a minimal sketch of the two-stage flow appears after this list).
  • Advanced Keyword Extraction and Filtering:
    • Fuzzy Matching Integration: Integrated rapidfuzz for fuzzy matching of keywords against a whitelist (or the expanded set of skills), tolerating variations in spelling and phrasing and improving recall. Configuration options include the matching algorithm (ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio), the minimum similarity score, and the allowed POS tags (see the rapidfuzz sketch after this list).
    • Configurable Processing Order: Added the fuzzy_before_semantic option (text_processing section in config.yaml). This allows users to choose whether fuzzy matching is applied before or after semantic validation, providing greater flexibility in the keyword extraction pipeline.
    • Phrase-Level Synonym Handling: Introduced support for phrase-level synonyms (e.g., "product management" synonyms: ["product leadership", "product ownership"]). Synonyms can be loaded from a static JSON file (phrase_synonyms_path) or fetched from an API (api_endpoint, api_key). This significantly expands the ability to capture relevant skills expressed in different ways. The SynonymEntry Pydantic model enforces data integrity for static synonyms.
    • Improved Contextual Validation: Enhanced semantic validation using a configurable context window (context_window_size). The script now considers the surrounding sentences (respecting paragraph breaks) to determine if a keyword is used in the relevant context. This reduces false positives. The sentence splitting logic now handles bullet points and numbered lists more robustly.
    • POS Tag Filtering: Added more granular control over POS tag filtering with the pos_filter and allowed_pos options. This allows users to specify which parts of speech are considered for keyword extraction and fuzzy matching.
    • Trigram Optimization: Implemented a TrigramOptimizer to improve the efficiency of n-gram generation and candidate selection. It uses an LRU cache to store frequently used trigrams, reducing redundant computation (see the caching sketch after this list).
    • Dynamic N-gram Generation: The _generate_ngrams function is now cached and handles edge cases, such as invalid values of n, more robustly.
  • Adaptive Chunking and Parameter Tuning:
    • Smart Chunker: Introduced a SmartChunker class that uses a Q-learning algorithm to dynamically adjust the chunk size based on dataset statistics (average job description length, number of texts) and system resource usage (memory). This helps to optimize performance and prevent out-of-memory errors.
    • Auto Tuner: Added an AutoTuner class that automatically adjusts parameters (e.g., chunk_size, pos_processing) based on metrics (recall, memory usage, processing time) and the trigram cache hit rate. This allows the script to adapt to different datasets and hardware configurations.
  • Intermediate Result Saving and Checkpointing:
    • Configurable Intermediate Saving: Implemented robust intermediate saving of results (summary and detailed scores) to disk. This allows for resuming processing after interruptions and prevents data loss in case of errors. The intermediate_save section in config.yaml controls the format (feather, jsonl, json), save interval, working directory, and cleanup behavior.
    • Data Integrity Checks: Added checksum verification (using xxhash) for intermediate files. A checksum manifest (checksums.jsonl) records a digest for each saved chunk and is used to verify the data when it is loaded back (see the checksum sketch after this list).
    • Streaming Data Aggregation: Implemented a streaming data aggregation approach for combining intermediate results. This allows the script to handle very large datasets that don't fit in memory. The _aggregate_results function handles both lists and generators of DataFrames.
    • Schema Validation and Appending: The code now validates the schema of intermediate files (especially for feather and jsonl) and can append new chunks to existing files.
  • Enhanced Error Handling and Logging:
    • Custom Exceptions: Defined custom exceptions (ConfigError, InputValidationError, CriticalFailureError, AggregationError, DataIntegrityError) for more specific error handling and reporting.
    • Comprehensive Error Handling: Added extensive error handling throughout the script, including checks for invalid input, file I/O errors, API errors, memory errors, and data integrity issues.
    • Improved Logging: Enhanced logging to provide more informative messages about the script's progress, warnings, and errors. This includes logging of configuration parameters, dataset statistics, processing times, memory usage, and cache hit rates.
    • Strict Mode: Added a strict_mode option in config.yaml that, when enabled, causes the script to raise exceptions on certain errors (e.g., invalid input, empty descriptions) instead of logging warnings and continuing.
  • Code Refactoring and Optimization:
    • Modular Design: Refactored the code into smaller, more manageable classes and functions (e.g., ParallelProcessor, TrigramOptimizer, SmartChunker, AutoTuner).
    • Type Hinting: Added type hints throughout the code to improve readability and maintainability.
    • Memory Management: Implemented various memory management techniques, including explicit garbage collection (gc.collect()), releasing spaCy Doc objects after processing, and using generators for streaming data processing.
    • Caching: Used lru_cache and LRUCache to cache frequently used computations (e.g., term vectorization, n-gram generation, fuzzy matching).
    • Parallel Processing: Leveraged concurrent.futures.ProcessPoolExecutor for parallel processing of job descriptions, significantly improving performance on multi-core systems (a minimal sketch follows this list).
    • Dynamic Batch Size: The batch size for spaCy processing is now dynamically calculated, considering available memory and the configured memory_scaling_factor.
    • GPU Memory Check: Added an optional check for available GPU memory (if use_gpu and check_gpu_memory are enabled). If GPU memory is low, it can either disable GPU usage or reduce the number of workers.
  • Refactored TF-IDF Matrix Creation: The TF-IDF matrix creation is now more efficient and robust. The vectorizer is fitted only once (with optional sampling for large datasets), and keyword sets are pre-validated.
  • Consistent Hashing: The caching system now uses a cache_salt so that cache keys stay unique across different runs and configurations. The salt can be set via the K4CV_CACHE_SALT environment variable or in config.yaml (see the cache-key sketch after this list).
  • Improved Keyword Categorization: Keyword categorization logic is enhanced, and a configurable default_category is used for terms that cannot be categorized. The categorization_cache_size allows controlling the cache size for term categorization.
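
To make the two-stage configuration validation concrete, here is a minimal sketch of the flow described above. The specific keys (validation, intermediate_save) and model fields are illustrative assumptions, not the exact schema enforced by config_validation.py.

```python
import yaml
from pydantic import BaseModel, Field, ValidationError, conint
from schema import And, Optional, Schema, SchemaError

# Stage 1: structural validation of the raw YAML with the `schema` library.
# The keys shown here are illustrative, not the script's full schema.
RAW_SCHEMA = Schema(
    {
        "validation": {Optional("strict_mode", default=False): bool},
        Optional("intermediate_save"): {
            "format": And(str, lambda s: s in ("feather", "jsonl", "json")),
            "save_interval": And(int, lambda n: n > 0),
        },
    },
    ignore_extra_keys=True,
)

# Stage 2: runtime validation and type coercion with Pydantic.
class IntermediateSave(BaseModel):
    format: str = "feather"
    save_interval: conint(gt=0) = 100

class AppConfig(BaseModel):
    strict_mode: bool = False
    intermediate_save: IntermediateSave = Field(default_factory=IntermediateSave)

def load_config(path: str) -> AppConfig:
    with open(path, encoding="utf-8") as fh:
        raw = yaml.safe_load(fh)
    try:
        checked = RAW_SCHEMA.validate(raw)  # catches wrong keys/types with a readable error
    except SchemaError as exc:
        raise SystemExit(f"Invalid config structure: {exc}") from exc
    try:
        return AppConfig(
            strict_mode=checked["validation"]["strict_mode"],
            intermediate_save=checked.get("intermediate_save", {}),
        )
    except ValidationError as exc:
        raise SystemExit(f"Invalid config values: {exc}") from exc
```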
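
The fuzzy-matching step can be pictured with rapidfuzz roughly as follows. The helper is a hypothetical sketch; only the scorer names and the idea of a configurable cutoff come from the options listed above.

```python
from rapidfuzz import fuzz, process

# Map the config's algorithm names onto rapidfuzz scorers.
SCORERS = {
    "ratio": fuzz.ratio,
    "partial_ratio": fuzz.partial_ratio,
    "token_sort_ratio": fuzz.token_sort_ratio,
    "token_set_ratio": fuzz.token_set_ratio,
    "WRatio": fuzz.WRatio,
}

def fuzzy_whitelist_match(candidate, whitelist, algorithm="WRatio", min_score=85):
    """Return the best whitelist hit for a candidate keyword, or None below the cutoff."""
    hit = process.extractOne(
        candidate, whitelist, scorer=SCORERS[algorithm], score_cutoff=min_score
    )
    return hit[0] if hit else None  # extractOne returns (match, score, index) or None

# Example: a misspelled candidate still resolves to the whitelisted phrase.
print(fuzzy_whitelist_match("produkt managment", ["product management", "data analysis"]))
```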
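
The trigram cache can be imagined as a cachetools LRU cache sitting in front of the n-gram generator. The cache size and function names below are illustrative, not the TrigramOptimizer's actual internals.

```python
from cachetools import LRUCache, cached

# Illustrative cache; the real TrigramOptimizer manages its own instance and hit-rate stats.
_trigram_cache = LRUCache(maxsize=10_000)

@cached(_trigram_cache, key=lambda tokens, n=3: (tuple(tokens), n))
def generate_ngrams(tokens, n=3):
    """Return the n-grams of a token sequence, guarding against invalid n."""
    if n < 1 or n > len(tokens):
        return []
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```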
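
A minimal sketch of the checksum manifest idea, assuming one JSON line per intermediate chunk written via srsly and hashed with xxhash; the file layout and field names are assumptions, not the script's exact format.

```python
from pathlib import Path

import srsly
import xxhash

def record_checksum(chunk_path: Path, manifest: Path) -> None:
    """Append a {file, checksum} line to the manifest for a freshly written chunk."""
    digest = xxhash.xxh64(chunk_path.read_bytes()).hexdigest()
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(srsly.json_dumps({"file": chunk_path.name, "checksum": digest}) + "\n")

def verify_chunks(manifest: Path, workdir: Path) -> bool:
    """Re-hash every listed chunk and compare it with the recorded checksum."""
    for entry in srsly.read_jsonl(manifest):
        data = (workdir / entry["file"]).read_bytes()
        if xxhash.xxh64(data).hexdigest() != entry["checksum"]:
            return False
    return True
```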
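
Parallel processing of job descriptions follows the standard concurrent.futures pattern sketched below; the worker function is a placeholder for the script's real per-chunk extraction.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def extract_keywords(chunk):
    """Placeholder for the real per-chunk keyword extraction."""
    return [text.lower().split() for text in chunk]

def process_in_parallel(chunks, max_workers=4):
    """Fan chunks out to worker processes and restore the original order afterwards."""
    indexed_results = []
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(extract_keywords, chunk): i for i, chunk in enumerate(chunks)}
        for future in as_completed(futures):
            indexed_results.append((futures[future], future.result()))
    return [result for _, result in sorted(indexed_results)]
```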
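
Finally, the cache_salt mechanism amounts to mixing a per-run or per-configuration salt into every cache key. A minimal sketch, assuming the salt is read from K4CV_CACHE_SALT with a config.yaml fallback; the key layout itself is illustrative.

```python
import os

import xxhash

def get_cache_salt(config: dict) -> str:
    """Environment variable wins; otherwise fall back to the value in config.yaml."""
    return os.environ.get(
        "K4CV_CACHE_SALT", config.get("caching", {}).get("cache_salt", "")
    )

def make_cache_key(term: str, salt: str) -> str:
    """Mix the salt into every key so caches from different runs/configs never collide."""
    return xxhash.xxh64(f"{salt}:{term}".encode("utf-8")).hexdigest()
```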

Bug Fixes:

  • Fixed several issues related to data loading, validation, and processing.
  • Improved error handling and logging in various parts of the script.
  • Addressed potential memory leaks and improved overall memory management.
  • Corrected issues with chunk size calculation and Q-table updates.
  • Fixed inconsistencies in the application of the whitelist boost.
  • Resolved issues with intermediate file saving and loading.
  • Addressed errors during vectorization and score calculations.

Known Issues:

  • NOTE: The script does not work at this point in time; this release aims to introduce critical architectural changes.

Dependencies:

  • nltk
  • pandas
  • spacy (>=3.0.0 recommended)
  • scikit-learn
  • pyyaml
  • psutil
  • hashlib (no longer used; checksums are now computed with xxhash)
  • requests
  • rapidfuzz
  • srsly
  • xxhash
  • cachetools
  • pydantic (>=2.0 recommended, but v1 is supported)
  • schema
  • pyarrow
  • numpy
  • itertools

Future Improvements:

  • Explore the use of Dask for distributed processing.
  • Continue to refine the reinforcement learning algorithms for adaptive parameter tuning.
  • Add more comprehensive unit tests.
  • Improve documentation and user guide.
  • Consider adding support for other input formats (e.g., CSV, text files).
  • Explore the use of more advanced NLP techniques (e.g., transformer-based models).

How to Upgrade:

  1. Backup your existing config.yaml and synonyms.json files.
  2. Replace the old script files (keywords4cv_*.py.txt, exceptions.py.txt, config_validation.py.txt) with the new versions.
  3. Carefully review the updated config.yaml.truncated.txt file. There are many new configuration options and changes to existing ones. You will need to merge your existing configuration with the new template. Pay close attention to the following sections:
    • validation
    • text_processing (especially phrase_synonym_source, phrase_synonyms_path, api_endpoint, api_key, fuzzy_before_semantic)
    • whitelist (especially fuzzy_matching)
    • hardware_limits
    • optimization
    • caching (especially cache_salt)
    • intermediate_save
    • advanced
  4. If you are using a static synonym file, update its format to match the SynonymEntry model (see the documentation and the example after these steps).
  5. Install any new dependencies: pip install rapidfuzz srsly xxhash cachetools pydantic schema pyarrow.
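
For step 4, the sketch below shows what a phrase-level synonym record might look like after validation. The field names are assumptions inferred from the "product management" example in the release notes; treat the documentation's SynonymEntry definition as authoritative.

```python
from typing import List

from pydantic import BaseModel

class SynonymEntry(BaseModel):
    """Hypothetical shape of one entry in synonyms.json; check the docs for the real fields."""
    term: str
    synonyms: List[str]

entry = SynonymEntry(
    term="product management",
    synonyms=["product leadership", "product ownership"],
)
print(entry.model_dump())  # Pydantic v2; use entry.dict() on v1
```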

Breaking Changes:

  • The configuration file format has changed significantly. You will need to update your config.yaml file.
  • The SynonymEntry format in synonyms.json is now enforced using Pydantic.
  • The hashlib library has been replaced with xxhash for checksum calculation.
  • The intermediate file format and naming conventions have changed.
  • The max_workers parameter is now also used within the nlp.pipe function.
  • The analyzer._load_all_intermediate function now returns a generator.
  • The _create_tfidf_matrix function's parameters have changed.
  • The _calculate_scores function now yields results instead of returning a list.

What's Changed

Full Changelog: 0.09...0.24