How script v0.26 works
This document provides a complete, in-depth analysis of the keywords4cv.py
script, version 0.26. We'll cover every aspect, from high-level design to low-level implementation details, including justifications, performance considerations, and potential limitations.
`keywords4cv.py` is a Python script designed to extract, analyze, and categorize keywords from job descriptions. It is intended to mimic and enhance the functionality of an Applicant Tracking System (ATS), providing sophisticated capabilities beyond simple keyword matching. The core goal is to identify the most relevant skills and qualifications from a set of job descriptions, enabling efficient filtering and ranking of candidates.
Key Features and Design Principles:
- Accuracy: The script prioritizes accurate keyword identification, minimizing both false positives (irrelevant terms) and false negatives (missing relevant terms). It achieves this through a multi-layered approach combining:
  - Named Entity Recognition (NER): Using spaCy's pre-trained models.
  - N-gram Generation: Capturing multi-word phrases.
  - Fuzzy Matching: Handling spelling variations and synonyms.
  - Semantic Validation: Ensuring contextual relevance using word embeddings.
  - Part-of-Speech (POS) Filtering: Focusing on relevant grammatical categories.
  - Whitelist and Negative List: Boosting known skills and filtering out unwanted terms.
- Efficiency: Designed for large-scale processing, the script incorporates several optimization techniques:
  - Parallel Processing: Distributing the workload across multiple CPU cores.
  - Batch Processing: Processing job descriptions in batches to leverage spaCy's `nlp.pipe` efficiency.
  - Caching: Storing frequently accessed data (fuzzy matches, semantic validations, etc.) to avoid redundant computations.
  - HashingVectorizer: Using a memory-efficient method for TF-IDF vectorization.
  - BK-Tree: Employing a BK-tree data structure for fast approximate string matching.
  - Generators: Using generators to avoid loading large datasets into memory.
- Configurability: Extensive customization options via a YAML configuration file, allowing users to tailor the script's behavior to specific needs and datasets.
- Robustness: Comprehensive error handling, input validation, and data integrity checks to ensure reliable operation.
- Maintainability: Modular design with well-defined classes and functions, promoting code reuse and ease of modification.
- Extensibility: The architecture is designed to be extensible, allowing for the addition of new features and algorithms.
The script follows a modular architecture, with distinct components responsible for different aspects of the pipeline. This section outlines the major components and their interactions.
2.1. High-Level Workflow:
- Initialization: Load configuration, initialize spaCy model, preprocessor, keyword extractor, canonicalizer, parallel processor, and other components.
- Input: Load job descriptions from a JSON file.
- Preprocessing: Sanitize job descriptions (validate titles and descriptions).
- Chunking: Divide job descriptions into smaller chunks for processing.
- Keyword Extraction (per chunk):
  - Extract keywords using spaCy's NER, n-gram generation, fuzzy matching, and semantic validation.
  - Canonicalize keywords (deduplication, abbreviation expansion, clustering).
  - Calculate TF-IDF scores.
  - Categorize keywords.
- Intermediate Saving (Optional): Save results to disk (Feather, JSONL, or JSON format).
- Metrics Calculation: Calculate performance metrics (recall, precision, F1-score, etc.).
- Model Update: Adjust parameters (chunk size, POS processing) based on metrics.
- Aggregation: Combine results from all chunks.
- Output: Save final results to an Excel file.
- Metrics Report (Optional): Generate a comprehensive HTML report with visualizations.
2.2. Core Components:
- `OptimizedATS` (Main Class):
  - Coordinates the entire workflow.
  - Holds references to all other components.
  - Manages configuration, initialization, data loading, processing, and output.
- `EnhancedTextPreprocessor`:
  - Performs text cleaning and normalization (lowercasing, removing URLs, special characters, etc.).
  - Manages stop words (loading, adding, excluding).
  - Caches preprocessed text.
- `AdvancedKeywordExtractor`:
  - The core keyword extraction engine.
  - Uses spaCy for tokenization, POS tagging, lemmatization, and NER.
  - Generates n-grams.
  - Performs fuzzy matching using an enhanced BK-tree.
  - Applies semantic validation using word embeddings.
  - Calculates keyword scores (TF-IDF, frequency, whitelist boost, section weights).
  - Detects the section of a job description where a keyword appears.
  - Categorizes keywords.
- `KeywordCanonicalizer`:
  - Deduplicates and standardizes keywords.
  - Expands abbreviations.
  - Resolves overlapping n-grams.
  - Optionally clusters similar terms using embeddings.
- `ParallelProcessor`:
  - Manages parallel processing of job descriptions using `multiprocessing`.
  - Determines the optimal number of worker processes based on system resources.
- `TrigramOptimizer`:
  - Caches trigram candidates to improve the efficiency of n-gram generation.
- `SmartChunker`:
  - Adaptively determines chunk sizes for processing using a reinforcement learning approach (Q-learning).
- `AutoTuner`:
  - Tunes parameters (e.g., chunk size, POS processing strategy) based on performance metrics.
- `SemanticValidator`:
  - Provides a unified interface for validation checks (POS, semantic, negative keywords).
- `EnhancedBKTree`:
  - An optimized BK-tree implementation with caching for fast fuzzy matching.
- `CacheManager`:
  - Manages caching with different backend implementations (currently supports in-memory caching).
- `MetricsReporter` and `KeywordMetricsEvaluator`:
  - Calculate and report a comprehensive set of performance metrics.
2.3. Data Flow Diagram:
```
[Command Line Arguments] --> [Configuration File (YAML)] --> [OptimizedATS (Initialization)]
                                                                        |
                                                                        V
[Job Descriptions (JSON)] --> [OptimizedATS (Input)] --> [Sanitization] --> [Chunking]
                                                                        |
                  +-----------------------------+-----------------------+
                  |                             |                       |
                  V                             V                       V
              [Chunk 1]                     [Chunk 2]      ...      [Chunk N]
                  |                             |                       |
                  |    Parallel Processing (per chunk):                 |
                  |      [Keyword Extraction]                           |
                  |      [Canonicalization]                             |
                  |      [TF-IDF & Scoring]                             |
                  |      [Intermediate Saving (Optional)]               |
                  |                             |                       |
                  +-----------------------------+-----------------------+
                                                |
                                                V
                                    [Metrics Calculation]
                                                |
                                                V
                                        [Model Update]
                                                |
                                                V
                                 [Aggregation (All Chunks)]
                                                |
                                                V
                                   [Final Results (Excel)]
                                                |
                                                V
                              [Metrics Report (HTML, Optional)]
```
3.1.1. Purpose:
This module is the gatekeeper for the configuration. It ensures that the `config.yaml` file (or any file specified with the `-c` flag) is valid before the main script attempts to use it. This prevents cryptic errors later on.
3.1.2. Key Classes and Functions:
- `ConfigError` (Exception): A custom exception raised for configuration errors.
- Pydantic Models (e.g., `ValidationConfig`, `DatasetConfig`, `TextProcessingConfig`, etc.): These models define the structure and data types of the configuration (see the sketch below). They use Pydantic's features:
  - `Field(...)`: Specifies default values, constraints (e.g., `ge=1` for "greater than or equal to 1"), and aliases (e.g., `format_` for the `format` field).
  - `field_validator`: Defines custom validation logic (e.g., checking that n-gram ranges are valid, validating API settings).
  - `ConfigDict(extra="forbid")`: Prevents the user from adding extra, undefined fields to the configuration, maintaining strictness.
- `config_schema` (Schema): This uses the `schema` library to define the overall structure of the configuration. It complements the Pydantic models by providing a higher-level view of the expected structure. It uses:
  - `Schema(...)`: Defines the main schema.
  - `SchemaOptional(...)`: Marks optional keys.
  - `And(...)`: Combines multiple validation rules.
  - `Or(...)`: Specifies allowed values (e.g., for `phrase_synonym_source`).
  - Lambda Functions: Used for simple, inline validation checks (e.g., `lambda n: n >= 1`).
- `validate_config(config)`: Performs the actual validation:
  - Validates the YAML structure using `config_schema.validate(config)`.
  - Creates a `Config` instance (Pydantic model) from the validated data, triggering Pydantic's validation and type coercion.
  - Performs top-level checks (e.g., ensuring that `keyword_categories` is not empty).
- `validate_config_file(config_path)`: Loads the YAML file, handles file errors (`FileNotFoundError`, `yaml.YAMLError`), and calls `validate_config`.
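The snippet below is a minimal sketch of what one of these Pydantic models might look like, showing `Field`, `field_validator`, and `ConfigDict(extra="forbid")` in combination. The field names and defaults are illustrative assumptions, not the script's actual schema.

```python
# Illustrative sketch only -- the real models define many more fields;
# names, defaults, and constraints here are assumptions.
from pydantic import BaseModel, ConfigDict, Field, field_validator


class TextProcessingConfig(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject undefined keys

    spacy_model: str = Field(default="en_core_web_lg")
    ngram_range: tuple[int, int] = Field(default=(1, 3))
    pos_filter: list[str] = Field(default_factory=lambda: ["NOUN", "PROPN", "ADJ"])

    @field_validator("ngram_range")
    @classmethod
    def check_ngram_range(cls, value: tuple[int, int]) -> tuple[int, int]:
        low, high = value
        if low < 1 or high < low:
            raise ValueError("ngram_range must satisfy 1 <= min <= max")
        return value
```

Because `extra="forbid"` is set, a stray or misspelled key in the YAML fails fast with a clear error instead of being silently ignored.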
3.1.3. Why This Approach?
- Defense in Depth: Uses two validation libraries (`schema` and `pydantic`) for extra robustness. `schema` is good for structural checks, while `pydantic` excels at type validation and custom logic.
- Early Failure: Validation happens before any significant processing, preventing wasted computation on invalid configurations.
- Clear Error Messages: Both `schema` and `pydantic` provide informative error messages, making it easier for users to fix configuration problems.
- Type Safety: Pydantic enforces type hints, reducing the risk of type-related errors.
- Code as Documentation: The Pydantic models serve as clear documentation of the configuration options and their expected types and values.
3.1.4. Example Configuration Snippet (from `config.yaml.truncated.txt`):
```yaml
validation:
  allow_numeric_titles: True
  empty_description_policy: "warn"
  title_min_length: 2
  title_max_length: 100
  min_desc_length: 60
  text_encoding: "utf-8"
text_processing:
  spacy_model: "en_core_web_lg"
  ngram_range: [1, 3]
  pos_filter: ["NOUN", "PROPN", "ADJ"]
  semantic_validation: True
  phrase_synonym_source: "static"
  phrase_synonyms_path: "synonyms.json"
```
3.1.5. Facts and Figures:
- The `schema` library is lightweight and fast, suitable for quick structural checks.
- Pydantic is more feature-rich, supporting complex validation rules, type coercion, and data serialization.
- The combination of `schema` and `pydantic` provides a good balance between performance and expressiveness.
This is the heart of the system. Let's break it down further.
3.2.1. `OptimizedATS` Class:
- Purpose: The central controller. It's responsible for:
  - Loading and managing the configuration.
  - Initializing all other components (spaCy, preprocessor, keyword extractor, etc.).
  - Orchestrating the data flow (loading, preprocessing, chunking, extracting, scoring, saving).
  - Cleaning up intermediate files.
- Key Attributes:
  - `config`: The validated configuration dictionary.
  - `nlp`: The loaded spaCy model.
  - `preprocessor`: An instance of `EnhancedTextPreprocessor`.
  - `keyword_extractor`: An instance of `AdvancedKeywordExtractor`.
  - `keyword_canonicalizer`: An instance of `KeywordCanonicalizer`.
  - `processor`: An instance of `ParallelProcessor`.
  - `trigram_optim`: An instance of `TrigramOptimizer`.
  - `chunker`: An instance of `SmartChunker`.
  - `tuner`: An instance of `AutoTuner`.
  - `working_dir`: A `Path` object representing the working directory for intermediate files.
  - `run_id`: A unique identifier for the current run (using `xxhash`).
  - `checksum_manifest_path`: The path to the checksum manifest file.
- Key Methods (with the reasoning behind each):
  - `__init__(config_path)`: The constructor. Its careful initialization order is crucial:
    - Load Config: Loads the configuration first. Everything else depends on the configuration.
    - Validate Config: Validates the basic structure.
    - Initialize NLP: Loads the spaCy model. This is a potentially expensive operation, so it's done early.
    - Initialize Preprocessor: Creates the preprocessor, which depends on the `nlp` object.
    - Initialize Keyword Extractor: Creates the keyword extractor, which depends on `nlp` and the preprocessor.
    - Initialize Keyword Canonicalizer: Creates the canonicalizer, which depends on `nlp`.
    - Initialize Parallel Processor: Creates the parallel processor, which depends on the keyword extractor and `nlp`.
    - Initialize Optimizer: Creates the trigram optimizer, which depends on the keyword extractor.
    - Initialize Chunker/Tuner: Creates the chunker and tuner.
    - Working Directory/Run ID: Sets up the working directory and generates a unique run ID.
    - Initialize Categories: Pre-computes category vectors for semantic categorization.
  - `analyze_jobs(job_descriptions)`: The main analysis loop. It orchestrates the entire process, from sanitization to saving results. The use of a loop and intermediate saving allows for processing very large datasets that wouldn't fit in memory. The loop also allows for adaptive parameter tuning.
  - `_create_chunks(job_descriptions)`: Splits the job descriptions into chunks. Chunking is essential for:
    - Parallel Processing: Each chunk can be processed by a separate worker process.
    - Memory Management: Processing smaller chunks reduces memory usage.
    - Adaptive Chunking: The `SmartChunker` can adjust the chunk size dynamically.
  - `_process_chunk(chunk)`: Processes a single chunk. This is where the core keyword extraction, canonicalization, and scoring happen. The use of generators (`extract_keywords`, `_calculate_scores`) minimizes memory usage.
  - `_create_tfidf_matrix(...)`: Creates the TF-IDF matrix using `HashingVectorizer`. The vectorizer is initialized once and reused for all chunks.
  - `_calculate_scores(...)`: Calculates scores for each keyword in each job description. It uses the TF-IDF values, frequency information, whitelist status, and section weights. The use of generators and optimized data structures (sets, dictionaries) improves efficiency.
  - `_save_intermediate(...)`: Saves intermediate results to disk. This allows the script to resume processing if it's interrupted. The use of checksums ensures data integrity.
  - `_load_all_intermediate(...)`: Loads intermediate results.
  - `_aggregate_results(...)`: Combines results from all chunks into final summary and details DataFrames.
  - `_cleanup_intermediate()`: Deletes the intermediate files.
  - `sanitize_input(jobs)`: Cleans the input job descriptions, handling invalid titles and descriptions according to the configuration.
3.2.2. `EnhancedTextPreprocessor` Class:
- Purpose: Handles text preprocessing tasks, including:
  - Lowercasing.
  - Removing URLs, email addresses, and special characters.
  - Normalizing whitespace.
  - Managing stop words (including adding and excluding words).
- Key Attributes:
  - `config`: The configuration dictionary.
  - `nlp`: The spaCy model.
  - `stop_words`: A set of stop words.
  - `regex_patterns`: Compiled regular expressions for cleaning text.
  - `_cache`: An `LRUCache` for caching preprocessed text.
  - `cache_salt`: A salt for the cache key.
  - `config_hash`: A hash of the relevant configuration parameters, used for cache invalidation.
- Key Methods:
  - `preprocess(text)`: Performs the preprocessing steps on a single text.
  - `preprocess_batch(texts)`: Preprocesses a list of texts.
  - `tokenize_batch(texts)`: Tokenizes a list of texts.
  - `_calculate_config_hash()`: Calculates a hash of the relevant configuration parameters.
  - `_load_stop_words()`: Loads stop words from the config.
- Why this approach? (A minimal sketch of the pattern follows below.)
  - Centralized Preprocessing: All preprocessing logic is encapsulated in a single class, making it easier to maintain and modify.
  - Caching: Caching preprocessed text significantly improves performance, especially when the same text appears multiple times.
  - Configurable Stop Words: The user can customize the list of stop words.
  - Regular Expressions: Using compiled regular expressions improves the efficiency of text cleaning.
  - Unicode Handling: Using `casefold()` instead of `lower()` provides more robust handling of Unicode characters.
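The following is a minimal sketch of this preprocessing pattern (casefold, compiled regexes, LRU-cached results). The specific regexes and cache size are illustrative choices, not the script's actual values.

```python
# Sketch of the preprocessing pattern described above; regexes and cache size
# are illustrative assumptions.
import re
from cachetools import LRUCache

_URL_RE = re.compile(r"https?://\S+")
_EMAIL_RE = re.compile(r"\S+@\S+")
_NON_ALNUM_RE = re.compile(r"[^a-z0-9\s\-]")
_WS_RE = re.compile(r"\s+")

_cache: LRUCache = LRUCache(maxsize=10_000)


def preprocess(text: str) -> str:
    """Casefold, strip URLs/emails/special characters, and normalize whitespace."""
    cached = _cache.get(text)
    if cached is not None:
        return cached
    cleaned = text.casefold()
    cleaned = _URL_RE.sub(" ", cleaned)
    cleaned = _EMAIL_RE.sub(" ", cleaned)
    cleaned = _NON_ALNUM_RE.sub(" ", cleaned)
    cleaned = _WS_RE.sub(" ", cleaned).strip()
    _cache[text] = cleaned
    return cleaned
```

Repeated descriptions (common when the same posting appears under several titles) hit the cache and skip the regex work entirely.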
3.2.3. `AdvancedKeywordExtractor` Class:
- Purpose: The core keyword extraction engine.
- Key Attributes:
  - `config`: The configuration dictionary.
  - `nlp`: The spaCy model.
  - `preprocessor`: An instance of `EnhancedTextPreprocessor`.
  - `phrase_synonyms`: A dictionary of phrase synonyms.
  - `all_skills`: A set of all known skills (including synonyms and n-grams).
  - `category_vectors`: A dictionary mapping category names to their centroid vectors.
  - `bk_tree`: An instance of `EnhancedBKTree` for fuzzy matching.
  - `validator`: An instance of `SemanticValidator`.
- Key Methods:
  - `extract_keywords(texts)`: Extracts keywords from a list of texts. This is a generator function, yielding results for each text as they become available. Inside `extract_keywords`:
    - `docs = list(self.nlp.pipe(texts))`: Processes the texts using spaCy's `nlp.pipe` for efficient batch processing. This performs tokenization, POS tagging, lemmatization, and entity recognition.
    - Iterates through the `docs` and corresponding `texts`.
    - Entity Extraction: Extracts entities labeled as "SKILL".
    - Tokenization, Lemmatization, and Preprocessing:
      - Identifies spans of skill entities.
      - Processes non-entity tokens, filtering by POS, stop words, and length.
      - Preprocesses the non-entity tokens (lowercasing, removing URLs, etc.).
    - N-gram Generation: Generates n-grams (up to the configured `ngram_range`) from the processed tokens.
    - Keyword Filtering:
      - Combines entity keywords and generated n-grams.
      - Filters out short keywords and keywords consisting entirely of stop words.
    - Staged Filtering:
      - Stage 1: Whitelist: Separates the keywords that are direct matches to the whitelist (`self.all_skills`).
      - Stage 2: Fuzzy Matching: Applies fuzzy matching only to the non-whitelisted keywords using the enhanced BK-tree (`self.bk_tree.find`). Filters matches based on a minimum similarity threshold and allowed POS tags.
      - Stage 3: Semantic Filtering (Optional): If `semantic_validation` is enabled:
        - Combines whitelisted and fuzzy-matched keywords.
        - Filters out negative keywords.
        - Calls `_is_in_context` to check if the keyword is semantically relevant to its context, using cosine similarity between the keyword's embedding and the context window's embedding.
    - Yields a tuple `(original_tokens, filtered_keywords)` for each job description.
  - `_apply_fuzzy_matching_and_pos_filter(keyword_lists)`: Applies fuzzy matching and POS filtering.
  - `_semantic_filter(keyword_lists, docs)`: Applies semantic filtering.
  - `_is_in_context(keyword, doc)`: Checks if a keyword is semantically relevant to its context.
  - `_get_context_window(sentences, keyword)`: Extracts a context window around a keyword.
  - `_detect_keyword_section(keyword, text)`: Detects the section of a job description where a keyword appears.
  - `_process_term(term)`: Processes a term using spaCy.
  - `_validate_fuzzy_candidate(candidate)`: Validates a fuzzy match candidate.
  - `_load_phrase_synonyms()`: Loads phrase synonyms from a file or API.
  - `_load_and_process_all_skills()`: Loads, preprocesses, and expands all skills from the configuration.
  - `_generate_synonyms(skills)`: Generates synonyms for skills (using phrase-level synonyms and WordNet).
  - `_init_categories()`: Initializes category vectors for semantic categorization.
  - `_get_term_vector(term)`: Gets the vector representation of a term.
  - `_semantic_categorization(term)`: Categorizes a term based on its semantic similarity to category centroids.
  - `_generate_ngrams(tokens, n)`: Generates n-grams from a list of tokens.
- Why this approach?
  - Multi-Stage Filtering: The staged filtering approach (whitelist, fuzzy matching, semantic validation) improves accuracy by progressively refining the set of keywords (see the sketch after this list).
  - Efficiency: Fuzzy matching and semantic validation are only applied to a subset of keywords, reducing computational cost.
  - Contextual Relevance: Semantic validation ensures that keywords are relevant to the specific job description.
  - Flexibility: The user can configure various parameters (e.g., fuzzy matching threshold, semantic similarity threshold, POS filter).
  - spaCy Integration: Leverages spaCy's efficient NLP pipeline and pre-trained models.
  - BK-Tree: Uses a BK-tree for fast fuzzy matching.
  - Generators: Uses generators for memory efficiency.
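The sketch below illustrates the three-stage filter in isolation. `bk_tree`, `all_skills`, and `is_in_context` stand in for the script's own components, and the BK-tree is assumed to return candidate strings within an edit-distance threshold; thresholds are illustrative.

```python
# Sketch of the staged filtering described above (whitelist -> fuzzy -> semantic).
from rapidfuzz import fuzz


def staged_filter(candidates, all_skills, bk_tree, context_doc, is_in_context,
                  max_edit_distance=2, min_similarity=85, use_semantic=True):
    # Stage 1: exact whitelist matches need no further string matching.
    whitelisted = [kw for kw in candidates if kw in all_skills]
    remaining = [kw for kw in candidates if kw not in all_skills]

    # Stage 2: fuzzy-match only the leftovers against the skill BK-tree.
    fuzzy_matched = []
    for kw in remaining:
        matches = bk_tree.find(kw, max_edit_distance)
        best = next((m for m in matches if fuzz.ratio(kw, m) >= min_similarity), None)
        if best is not None:
            fuzzy_matched.append(best)

    # Stage 3 (optional): keep only terms that fit their surrounding context.
    keywords = whitelisted + fuzzy_matched
    if use_semantic:
        keywords = [kw for kw in keywords if is_in_context(kw, context_doc)]
    return keywords
```

The ordering matters: the cheap set lookup removes most candidates before the comparatively expensive fuzzy and embedding checks run.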
3.2.4. `KeywordCanonicalizer` Class:
- Purpose: Deduplicates and standardizes keywords.
- Key Attributes:
  - `nlp`: The spaCy model.
  - `config`: The configuration dictionary.
  - `abbreviation_map`: A dictionary mapping abbreviations to their expanded forms.
  - `canonical_cache`: A cache for canonical forms.
  - `stats`: A dictionary for tracking statistics (duplicates found, terms canonicalized, clusters formed).
- Key Methods:
  - `canonicalize_keywords(keywords, all_skills=None)`: Performs the canonicalization process:
    - Normalization: Lowercases and removes extra whitespace.
    - Abbreviation Expansion: Expands abbreviations (e.g., "ML" -> "machine learning").
    - N-gram Overlap Resolution: Resolves overlapping n-grams (e.g., "machine learning" vs. "machine" and "learning").
    - Embedding-Based Clustering (Optional): Groups similar terms using DBSCAN clustering and selects a representative term for each cluster.
  - `_normalize_keyword(keyword)`: Normalizes a keyword.
  - `_expand_abbreviations(keywords)`: Expands abbreviations.
  - `_resolve_ngram_overlaps(keywords)`: Resolves overlapping n-grams.
  - `_cluster_similar_terms(keywords, all_skills=None)`: Clusters similar terms using embeddings.
  - `_select_cluster_representative(...)`: Selects a representative term for each cluster.
- Why this approach?
  - Improved Accuracy: Canonicalization reduces noise and improves the accuracy of keyword analysis by grouping together variations of the same keyword. For example, "machine learning", "ML", and "MachineLearning" would all be canonicalized to "machine learning".
  - Efficiency: Reduces the number of unique keywords, improving the performance of subsequent processing steps.
  - Flexibility: The canonicalization process is configurable (e.g., enabling/disabling clustering, setting similarity thresholds).
  - DBSCAN Clustering: DBSCAN is a density-based clustering algorithm that's well-suited for grouping similar terms based on their embeddings. It doesn't require specifying the number of clusters in advance (a minimal sketch follows below).
  - Prioritization: The cluster representative selection prioritizes terms from the whitelist, then longer terms, and finally terms closest to the cluster centroid.
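Below is a minimal sketch of the embedding-plus-DBSCAN step, assuming spaCy vectors and a cosine metric. The `eps`/`min_samples` values are illustrative, and the representative selection is simplified to "longest member" rather than the whitelist/length/centroid prioritization described above.

```python
# Sketch of embedding-based keyword clustering with DBSCAN; thresholds and
# representative selection are simplified assumptions.
import numpy as np
import spacy
from sklearn.cluster import DBSCAN

nlp = spacy.load("en_core_web_lg")


def cluster_keywords(keywords, eps=0.25, min_samples=2):
    vectors = np.array([nlp(kw).vector for kw in keywords])
    labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(vectors)

    canonical = {}
    for label in set(labels):
        members = [kw for kw, lbl in zip(keywords, labels) if lbl == label]
        if label == -1:                      # DBSCAN noise points stay as-is
            canonical.update({kw: kw for kw in members})
        else:                                # simplified: longest member represents the cluster
            representative = max(members, key=len)
            canonical.update({kw: representative for kw in members})
    return canonical


# cluster_keywords(["machine learning", "ml", "deep learning"]) maps whichever
# variants fall within eps of each other to a single representative.
```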
3.2.5. `ParallelProcessor` Class:
- Purpose: Manages parallel processing of job descriptions.
- Key Attributes:
  - `config`: The configuration dictionary.
  - `keyword_extractor`: An instance of `AdvancedKeywordExtractor`.
  - `nlp`: The spaCy model.
  - `disabled_pipes`: A list of disabled spaCy pipeline components.
  - `complexity_cache`: A cache for text complexity scores.
- Key Methods:
  - `get_optimal_workers(texts)`: Determines the optimal number of worker processes based on text complexity, available memory (CPU and GPU), and configuration parameters.
  - `extract_keywords(texts)`: Extracts keywords from a list of texts using parallel processing.
  - `_process_text_chunk(texts)`: Processes a chunk of texts using the initialized spaCy model and keyword extractor.
  - `_chunk_texts(texts, chunk_size)`: Splits a list of texts into chunks.
- Why this approach?
  - Efficiency: Parallel processing significantly speeds up the processing of large datasets by distributing the workload across multiple CPU cores.
  - Resource Management: `get_optimal_workers` dynamically adjusts the number of workers based on available resources, preventing excessive memory usage.
  - `multiprocessing.Pool`: Uses `multiprocessing.Pool` with an initializer function (`init_worker`) to avoid repeatedly loading the spaCy model in each worker process (a minimal sketch of this pattern follows below).
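The sketch below shows the initializer pattern on its own: each worker loads the spaCy model once and reuses it for every chunk it receives. The chunking, model name, and disabled components are illustrative.

```python
# Sketch of the multiprocessing.Pool initializer pattern described above.
import multiprocessing as mp
import spacy

_worker_nlp = None  # populated once per worker by init_worker


def init_worker(model_name: str = "en_core_web_lg") -> None:
    global _worker_nlp
    _worker_nlp = spacy.load(model_name, disable=["parser"])


def process_chunk(texts: list[str]) -> list[list[str]]:
    # Each worker reuses the model loaded by init_worker.
    return [[tok.lemma_ for tok in doc if not tok.is_stop]
            for doc in _worker_nlp.pipe(texts)]


if __name__ == "__main__":
    chunks = [["Python developer needed"], ["Knowledge of SQL required"]]
    with mp.Pool(processes=2, initializer=init_worker) as pool:
        results = pool.map(process_chunk, chunks)
    print(results)
```

Loading the model in the initializer rather than inside the task function avoids paying the multi-second model-load cost for every chunk.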
3.2.6. `TrigramOptimizer` Class:
- Purpose: Optimizes trigram candidate generation by caching frequently occurring trigrams.
- Key Attributes:
  - `config`: The configuration dictionary.
  - `nlp`: The spaCy model.
  - `cache`: An `LRUCache` for caching trigram candidates.
  - `hit_rates`: A deque for tracking cache hit rates.
  - `keyword_extractor`: An instance of `AdvancedKeywordExtractor`.
  - `preprocessor`: The text preprocessor.
- Key Methods:
  - `get_candidates(text)`: Gets trigram candidates for a given text, using the cache if possible.
  - `_generate_ngrams(tokens, n)`: Generates n-grams from a list of tokens.
  - `_adjust_cache_size()`: Adjusts the cache size based on the hit rate.
- Why this approach?
  - Efficiency: Caching trigram candidates reduces the number of calls to the more expensive n-gram generation and filtering logic.
  - Adaptive Cache Size: The cache size is adjusted dynamically based on the hit rate.
3.2.7. `SmartChunker` Class:
- Purpose: Adaptively determines chunk sizes for processing using a reinforcement learning approach (Q-learning).
- Key Attributes:
  - `config`: The configuration dictionary.
  - `q_table`: An `LRUCache` storing Q-values for different states.
  - `timestamps`: A dictionary tracking the last time each state was encountered.
  - `decay_factor`: The decay factor for the Q-table.
  - `learning_rate`: The learning rate for Q-learning.
  - `reward_history`: A deque storing recent rewards.
  - `state_history`: A list storing recent states.
- Key Methods:
  - `get_chunk_size(dataset_stats)`: Determines the optimal chunk size based on the current state.
  - `update_model(reward, chunk_size=None)`: Updates the Q-table based on the received reward.
- Why this approach?
  - Adaptive Chunking: The chunk size is adjusted dynamically based on dataset statistics and system resources, optimizing for both processing speed and memory usage.
  - Reinforcement Learning: Q-learning allows the script to learn the optimal chunk size over time (a simplified sketch follows below).
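A simplified, bandit-style sketch of the learning loop is shown below: the state summarizes dataset/resource statistics, the action is a chunk size, and the update nudges the stored value toward the observed reward. The state features, reward shape, action set, and hyperparameters are assumptions for illustration; the script's own update also applies decay and history tracking.

```python
# Simplified Q-style update for adaptive chunk sizing; values are illustrative.
import random
from collections import defaultdict

ACTIONS = [50, 100, 200, 400]            # candidate chunk sizes
q_table = defaultdict(float)             # (state, action) -> estimated value
LEARNING_RATE = 0.1
EPSILON = 0.1                            # exploration rate


def get_chunk_size(state: tuple) -> int:
    if random.random() < EPSILON:
        return random.choice(ACTIONS)    # explore occasionally
    return max(ACTIONS, key=lambda a: q_table[(state, a)])  # otherwise exploit


def update_model(state: tuple, action: int, reward: float) -> None:
    # One-step update: Q(s, a) <- Q(s, a) + lr * (reward - Q(s, a))
    old = q_table[(state, action)]
    q_table[(state, action)] = old + LEARNING_RATE * (reward - old)


# A state might bucket average description length and free memory:
state = ("medium_docs", "high_mem")
size = get_chunk_size(state)
update_model(state, size, reward=0.8)    # reward blends recall, memory, and time
```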
3.2.8. `AutoTuner` Class:
- Purpose: Tunes parameters (e.g., chunk size, POS processing strategy) based on performance metrics.
- Key Attributes:
  - `config`: The configuration dictionary.
- Key Methods:
  - `tune_parameters(metrics, trigram_hit_rate)`: Adjusts parameters based on metrics.
- Why this approach?
  - Automated Optimization: The script can automatically adjust its parameters to improve performance for different datasets.
3.2.9. `SemanticValidator` Class:
- Purpose: Provides a unified interface for applying validation checks to keywords (POS, semantic, negative keywords).
- Key Attributes:
  - `config`: The configuration dictionary.
  - `nlp`: The spaCy model.
  - `semantic_validation`: Whether semantic validation is enabled.
  - `similarity_threshold`: The minimum similarity score for semantic validation.
  - `negative_keywords`: A set of negative keywords.
  - `allowed_pos`: A set of allowed POS tags.
  - `_validation_cache`: A cache for validation results.
- Key Methods:
  - `validate_term(term, context_doc=None)`: Performs all validation checks.
  - `_validate_pos(doc)`: Checks POS tags.
  - `_validate_semantics(term_text, context_doc)`: Checks semantic relevance (a minimal sketch of this check follows below).
- Why this approach?
  - Consistency: Ensures that all validation checks are applied consistently.
  - Modularity: Separates the validation logic from the keyword extraction logic.
  - Efficiency: Caches validation results.
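The core semantic check is a cosine similarity between a term's embedding and the embedding of its context window. The sketch below assumes spaCy vectors; the threshold and helper names are illustrative.

```python
# Sketch of the semantic-relevance check described above; threshold is illustrative.
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")


def is_semantically_relevant(term: str, context_text: str, threshold: float = 0.4) -> bool:
    term_vec = nlp(term).vector
    context_vec = nlp(context_text).vector
    denom = np.linalg.norm(term_vec) * np.linalg.norm(context_vec)
    if denom == 0:                       # out-of-vocabulary term or empty context
        return False
    similarity = float(np.dot(term_vec, context_vec) / denom)
    return similarity >= threshold


# e.g. is_semantically_relevant("kubernetes", "experience deploying containerized services")
```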
3.2.10. `EnhancedBKTree` Class:
- Purpose: Provides an optimized BK-tree implementation for fast fuzzy matching.
- Key Attributes:
  - `bk_tree`: The underlying `pybktree.BKTree` object.
  - `cache`: An `LRUCache` for caching fuzzy matching results.
- Key Methods:
  - `find(query, threshold, limit=None)`: Finds items within a given Levenshtein distance of the query.
- Why this approach?
  - Efficiency: BK-trees are efficient for approximate string matching. The `EnhancedBKTree` adds caching to further improve performance (a minimal sketch of this pattern follows below).
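A minimal sketch of the "BK-tree plus LRU cache" pattern, assuming `pybktree` for the tree and `rapidfuzz` for the distance function. The real `EnhancedBKTree` may differ in its return format and cache-key scheme; cache size and distance choice here are illustrative.

```python
# Sketch of a cached BK-tree wrapper, not the script's actual EnhancedBKTree.
from cachetools import LRUCache
from pybktree import BKTree
from rapidfuzz.distance import Levenshtein


class CachedBKTree:
    def __init__(self, items, cache_size=5000):
        self.tree = BKTree(Levenshtein.distance, items)
        self.cache = LRUCache(maxsize=cache_size)

    def find(self, query: str, threshold: int):
        key = (query, threshold)
        if key in self.cache:
            return self.cache[key]
        # pybktree returns (distance, item) pairs within the given edit distance
        matches = [item for _, item in self.tree.find(query, threshold)]
        self.cache[key] = matches
        return matches


skills = CachedBKTree(["python", "pytorch", "pandas"])
print(skills.find("pythn", threshold=1))   # -> ['python']
```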
3.2.11. `CacheManager` Class:
- Purpose: Manages caching with different backend implementations.
- Key Attributes:
  - `backend`: The cache backend to use (e.g., `MemoryCacheBackend`).
  - `namespace`: Namespace for keys to avoid collisions.
- Key Methods:
  - `get(key)`: Retrieves a value from the cache.
  - `set(key, value, ttl=None)`: Stores a value in the cache.
  - `delete(key)`: Removes a value from the cache.
  - `clear()`: Clears the cache.
  - `get_or_compute(key, compute_func, ttl=None, *args, **kwargs)`: Gets a value from the cache or computes it if not found.
4.1. spaCy's `Doc` Object:
- The core data structure used for representing processed text.
- Contains tokens, POS tags, lemmas, named entities, word vectors, and sentence boundaries.
- Provides efficient access to linguistic annotations.
4.2. Sets (`set`):
- Used extensively for storing collections of unique items (e.g., `all_skills`, `negative_keywords`, `stop_words`).
- Provide O(1) average time complexity for membership checks (`in` operator).
4.3. Dictionaries (`dict`):
- Used for storing key-value pairs (e.g., `config`, `phrase_synonyms`, `category_vectors`).
- Provide O(1) average time complexity for key lookups.
4.4. Lists (`list`):
- Used for storing ordered sequences of items (e.g., `original_tokens`, `filtered_keywords`).
4.5. Deques (`collections.deque`):
- Used for storing limited-size histories (e.g., `reward_history`, `hit_rates` in `TrigramOptimizer`).
- Provide O(1) time complexity for appending and popping elements from both ends.
4.6. `LRUCache` (from `cachetools`):
- Used for caching various intermediate results (e.g., fuzzy matching results, semantic validation results, trigram candidates).
- Implements a Least Recently Used (LRU) cache eviction policy, automatically removing the least recently used items when the cache is full.
- Provides O(1) average time complexity for get and set operations.
4.7. `HashingVectorizer` (from `sklearn.feature_extraction.text`):
- Used for converting a collection of text documents (in this case, lists of keywords) to a matrix of TF-IDF features.
- Uses a hashing trick to map features to indices, avoiding the need to store a vocabulary in memory. This makes it more memory-efficient than `TfidfVectorizer`, especially for large vocabularies.
- The `HashingVectorizer` is initialized with (see the construction sketch below):
  - `ngram_range`: Specifies the range of n-grams to consider (taken from the configuration).
  - `n_features`: The maximum number of features (keywords) to keep (taken from the configuration).
  - `dtype`: The data type of the matrix elements (set to `np.float32` for memory efficiency).
  - `lowercase`: Set to `False` because lowercasing is already handled during preprocessing.
  - `tokenizer`: Set to `lambda x: x` because the input is already a list of tokens.
  - `preprocessor`: Set to `lambda x: x` because no further preprocessing is needed.
  - `norm`: Set to `'l2'` for consistent normalization.
  - `alternate_sign`: Set to `False` to prevent feature cancellation.
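The following shows how a vectorizer with these settings might be constructed. The specific values (n-gram range, feature count) are illustrative; in the script they come from the configuration keys listed above.

```python
# How the vectorizer described above might be constructed; values are illustrative.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(
    ngram_range=(1, 3),        # from config: text_processing.ngram_range
    n_features=2 ** 16,        # upper bound on the hashed feature space
    dtype=np.float32,          # smaller memory footprint than float64
    lowercase=False,           # preprocessing already lowercased the tokens
    tokenizer=lambda x: x,     # input is already a list of tokens
    preprocessor=lambda x: x,  # no further preprocessing needed
    norm="l2",                 # consistent normalization
    alternate_sign=False,      # avoid feature cancellation
)

# Each "document" is a pre-tokenized keyword list:
docs = [["python", "machine learning"], ["sql", "python"]]
matrix = vectorizer.transform(docs)
print(matrix.shape)            # (2, 65536) sparse matrix
```

Because the vectorizer is stateless, `transform` can be called directly on every chunk without fitting a vocabulary first.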
4.8. BK-Tree (using `pybktree` and `EnhancedBKTree`):
- A tree-based data structure for fast approximate string matching.
- Organizes strings based on their Levenshtein distance to each other.
- Allows for efficient searching of strings within a given edit distance of a query string.
- The `EnhancedBKTree` class wraps the `pybktree.BKTree` object and adds caching to further improve performance.
4.9. DBSCAN (from `sklearn.cluster`, used indirectly in `KeywordCanonicalizer`):
- A density-based clustering algorithm.
- Used for grouping similar keywords based on their word embeddings.
- Doesn't require specifying the number of clusters in advance.
- Parameters (e.g., `eps`, `min_samples`) are configurable.
4.10. Q-Learning (in `SmartChunker`):
- A reinforcement learning algorithm used for adaptively determining chunk sizes.
- Maintains a Q-table that stores the estimated value of taking a particular action (choosing a chunk size) in a given state (dataset statistics and system resources).
- Updates the Q-table based on the received rewards (a combination of recall, memory usage, and processing time).
4.11. Pandas DataFrames:
- Used for storing and manipulating tabular data (e.g., keyword scores, aggregated results).
- Provide efficient data analysis and manipulation capabilities.
4.12. NumPy Arrays:
- Used for numerical computations (e.g., calculating cosine similarity, averaging vectors).
- Provide efficient array operations.
Let's walk through the main workflow (`run_analysis` and `analyze_jobs`) again, adding more nuance and detail:
- Initialization (`initialize_analyzer`):
  - Configuration Loading and Validation: The configuration file is loaded and validated twice: once using `schema` for structural checks and again using `pydantic` for type and constraint checks. This two-layered approach ensures robustness. The `ConfigError` exception is raised if validation fails.
  - spaCy Model Loading: The specified spaCy model (`en_core_web_lg` by default) is loaded. The script handles potential `OSError` exceptions (e.g., if the model is not found) and attempts to download the model if necessary. GPU usage is enabled if available and configured. Essential pipeline components (`sentencizer`, `lemmatizer`, `entity_ruler`) are added if they are missing and not explicitly disabled. The entity ruler is populated with patterns for skills and section headings.
  - Component Initialization: Instances of `EnhancedTextPreprocessor`, `AdvancedKeywordExtractor`, `KeywordCanonicalizer`, `ParallelProcessor`, `TrigramOptimizer`, `SmartChunker`, and `AutoTuner` are created. These components are interconnected through dependency injection.
  - Working Directory and Run ID: A working directory is created for intermediate files, and a unique run ID is generated using `xxhash`.
- Job Description Loading (`load_job_data`):
  - The job descriptions are loaded from the specified JSON file. `FileNotFoundError` and `json.JSONDecodeError` are handled.
- Sanitization (`analyzer.sanitize_input`):
  - Job titles and descriptions are validated. Non-string titles are handled based on the `allow_numeric_titles` setting. Empty or non-string descriptions are handled based on the `empty_description_policy` setting ("warn", "error", "allow"). In "strict" mode, `InputValidationError` is raised for invalid input.
- Dataset Statistics Calculation (`analyzer._calc_dataset_stats`):
  - The average length of job descriptions and the total number of descriptions are calculated. `np.nanmean` is used to handle potential empty descriptions.
- Chunk Size Determination (`analyzer.chunker.get_chunk_size`):
  - The `SmartChunker` determines the optimal chunk size based on the dataset statistics, available system resources (memory), and its internal Q-table. The Q-table is updated using a reinforcement learning approach (Q-learning).
- Chunking (`analyzer._create_chunks`):
  - The job descriptions are split into chunks of the determined size.
- Main Processing Loop (Iterating through Chunks):

  a. Keyword Extraction (`self.processor.keyword_extractor.extract_keywords`):
    - Parallel Processing: The `ParallelProcessor` distributes the processing of chunks across multiple worker processes.
    - spaCy Processing: Each worker process loads the spaCy model (using `init_worker`) and processes the text using `nlp.pipe` for efficient batch processing. This generates `Doc` objects containing tokens, POS tags, lemmas, named entities, and word vectors.
    - Entity Extraction: Entities labeled as "SKILL" are extracted.
    - Tokenization and Preprocessing: Non-entity tokens are filtered based on POS tags, stop words, and length. The remaining tokens are lemmatized and preprocessed (lowercased, URLs and special characters removed).
    - N-gram Generation: N-grams (sequences of tokens) are generated up to the configured maximum length.
    - Staged Filtering:
      - Whitelist: Keywords that are direct matches to the whitelist (`all_skills`) are identified.
      - Fuzzy Matching: Fuzzy matching is applied only to the non-whitelisted keywords using the `EnhancedBKTree`. Matches are filtered based on a minimum similarity threshold and allowed POS tags. The `rapidfuzz` library provides fast fuzzy matching algorithms.
      - Semantic Validation (Optional): If enabled, the script extracts a context window around each keyword and calculates the cosine similarity between the keyword's embedding and the context window's embedding. Keywords with similarity below a configured threshold are filtered out. The `SemanticValidator` class handles this logic and caches results.
    - Generator Output: A tuple of `(original_tokens, filtered_keywords)` is yielded for each job description.

  b. Keyword Canonicalization (`analyzer.keyword_canonicalizer.canonicalize_keywords`):
    - All keywords (original and filtered) from the chunk are collected.
    - Normalization: Keywords are lowercased and extra whitespace is removed.
    - Abbreviation Expansion: Abbreviations are expanded based on a predefined mapping.
    - N-gram Overlap Resolution: Overlapping n-grams are resolved, prioritizing longer n-grams.
    - Embedding-Based Clustering (Optional): If enabled, similar keywords are grouped using DBSCAN clustering based on their spaCy embeddings. A representative keyword is selected for each cluster.
    - Caching: Canonical forms can be cached (optional).
    - The script maps each original keyword to its canonical form.

  c. TF-IDF Matrix Creation (`analyzer._create_tfidf_matrix`):
    - A TF-IDF matrix is created from the canonicalized keywords using `HashingVectorizer`. This matrix represents the importance of each keyword in each job description. The `HashingVectorizer` is initialized with parameters from the configuration (n-gram range, maximum features, etc.).

  d. Keyword Scoring (`analyzer._calculate_scores`):
    - Scores are calculated for each keyword in each job description.
    - The base score is calculated as a weighted combination of the TF-IDF value and the keyword's frequency in the job description.
    - A whitelist boost is applied to keywords that are present in the `all_skills` set.
    - Section-specific weights are applied based on the section of the job description where the keyword appears (e.g., a "requirements" section might have a higher weight). A hedged sketch of this scoring combination follows this list.
    - The keyword's category is determined using semantic similarity to pre-calculated category centroids.

  e. Intermediate Saving (Optional):
    - If `intermediate_save` is enabled and the save interval is reached, the results (summary and details DataFrames) are saved to disk in the specified format (Feather, JSONL, or JSON).
    - Checksums of the saved files are calculated and stored in a manifest file.

  f. Metrics Calculation and Model Update:
    - `analyzer._calc_metrics()` calculates performance metrics (original recall, expanded recall, precision, F1-score, memory usage, time per job).
    - `analyzer.tuner.tune_parameters()` adjusts parameters (chunk size, POS processing strategy) based on the metrics and the trigram cache hit rate.
    - `analyzer.chunker.update_model()` updates the Q-table for adaptive chunking based on the calculated reward.
- Garbage Collection:
  - `gc.collect()` is called to explicitly release memory.
- Loading and Aggregating Intermediate Results:
  - `analyzer._verify_intermediate_checksums()` verifies the integrity of the intermediate files by comparing their checksums to the values stored in the manifest file.
  - `analyzer._load_all_intermediate()` loads the intermediate results from disk.
  - `analyzer._aggregate_results()` combines the results from all chunks into final summary and details DataFrames.
- Saving Results:
  - `save_results()` saves the final DataFrames to an Excel file.
- Metrics Report (Optional):
  - If the `--metrics-report` flag is set, a comprehensive metrics report is generated (HTML, plots, JSON).
- spaCy: Chosen for its speed, accuracy, and comprehensive features (tokenization, POS tagging, NER, word embeddings). `en_core_web_lg` is a good balance between size and accuracy.
- NLTK: Used primarily for accessing WordNet for synonym generation and for downloading necessary corpora.
- scikit-learn (`HashingVectorizer`): `HashingVectorizer` is preferred over `TfidfVectorizer` for its memory efficiency, especially with large vocabularies. It avoids storing a vocabulary in memory by using a hashing trick.
- NumPy: Used for efficient numerical operations and array manipulation.
- Pandas: Used for data manipulation and analysis (DataFrames).
- rapidfuzz: A fast fuzzy string matching library, used for its performance compared to other options like `fuzzywuzzy`.
- pybktree: Provides a BK-tree implementation, which is efficient for approximate string matching.
- cachetools (`LRUCache`): Provides a simple and efficient in-memory cache with a least recently used (LRU) eviction policy.
- xxhash: A fast non-cryptographic hash algorithm, used for generating unique IDs and checksums.
- requests: Used for making HTTP requests to the synonym API (if configured).
- tenacity: A retry library, used for retrying API calls with exponential backoff.
- structlog: Used for structured logging, making it easier to parse and analyze log messages.
- pydantic: Used for configuration validation and data modeling, providing type safety and clear documentation.
- schema: Used for structural validation of the configuration file.
- srsly: Used for serializing data to JSON and JSONL formats.
- pyarrow: Used for reading and writing Feather and Parquet files.
- matplotlib and seaborn: Used for generating visualizations in the metrics report.
-
- Parallel Processing: The use of `multiprocessing` is crucial for performance, especially for large datasets. The `ParallelProcessor` class dynamically determines the optimal number of worker processes based on system resources.
- Batch Processing: spaCy's `nlp.pipe` is used for efficient batch processing of text.
- Caching: Extensive caching is used to avoid redundant computations:
  - `EnhancedTextPreprocessor`: Caches preprocessed text.
  - `AdvancedKeywordExtractor`: Caches term vectors, fuzzy matching results, and semantic validation results.
  - `KeywordCanonicalizer`: Caches canonical forms.
  - `TrigramOptimizer`: Caches trigram candidates.
  - `SemanticValidator`: Caches validation results.
- HashingVectorizer: `HashingVectorizer` is used for memory-efficient TF-IDF vectorization.
- BK-Tree: The BK-tree provides fast approximate string matching.
- Generators: Generators are used extensively to avoid loading large datasets into memory.
- Optimized Data Structures: Sets, dictionaries, and deques are used for efficient operations.
- Adaptive Chunking: The `SmartChunker` dynamically adjusts chunk sizes to balance processing speed and memory usage.
- GPU Usage: The script can leverage GPU acceleration for spaCy processing if a GPU is available and configured.
- spaCy Model Dependency: The script's performance and accuracy are dependent on the chosen spaCy model. Different models may have different strengths and weaknesses.
- Language Support: The script is primarily designed for English text. Supporting other languages would require using different spaCy models and potentially modifying the preprocessing and keyword extraction logic.
- Synonym Generation: The current synonym generation relies on WordNet and a static list of phrase synonyms (or an external API). More sophisticated synonym generation techniques could be explored.
- Context Window Size: The fixed context window size for semantic validation might not be optimal for all cases. A more dynamic approach to context extraction could be considered.
- Section Detection: The current section detection relies on regular expressions and a predefined list of section headings. More robust section detection methods could be investigated.
- Reinforcement Learning: The Q-learning algorithm used for adaptive chunking could be further refined and optimized.
- API Dependency: If the synonym API is unavailable or slow, it can impact the script's performance. More robust error handling and fallback mechanisms could be implemented.
- Memory Usage: Although the script includes several memory optimizations, processing very large datasets can still require significant memory.
- Scalability: While the script uses parallel processing, scaling to extremely large datasets might require a distributed processing framework (e.g., Dask, Spark).
`keywords4cv.py` (v0.26) is a sophisticated and well-engineered keyword extraction and analysis tool. It combines a range of NLP techniques, optimization strategies, and robust error handling to provide accurate, efficient, and configurable processing of job descriptions. The script's modular design, extensive configuration options, and comprehensive metrics reporting make it a valuable tool for ATS development and related tasks. The detailed explanations and justifications provided in this document should give a thorough understanding of the script's inner workings.