How script v0.1 works
- Script Overview and Purpose:
The script, `Keywords4CV.py`, is designed to analyze a set of job descriptions, extract relevant keywords and skills, categorize those keywords, and calculate scores for each keyword based on TF-IDF, frequency, and presence in a whitelist. The ultimate goal is to help candidates identify the most important skills to highlight in their CVs/resumes to improve their compatibility with Applicant Tracking Systems (ATS).
- Data Input Stage:
2.1. Source:
The script accepts input data from a JSON file specified via the `-i` or `--input` command-line argument. The default file path is `job_descriptions.json`.
2.2. Data Format:
The input JSON file should have the following structure:
```json
{
  "Job Title 1": "Job Description 1",
  "Job Title 2": "Job Description 2",
  ...
}
```
It's a dictionary where keys are job titles (strings) and values are the corresponding job descriptions (strings).
Example:
```json
{
  "Data Scientist": "We are looking for a data scientist...",
  "Software Engineer": "The ideal candidate will have experience..."
}
```
2.3. Data Loading/Ingestion:
The `load_job_data` function handles the data loading (a sketch follows this list):
- It takes the input file path as an argument.
- It uses the `open()` function in a `with` statement to open the file, specifying `encoding="utf-8"` to handle various character sets correctly. This is crucial for robust text processing.
- It uses `json.load(f)` to parse the JSON data from the file object `f` into a Python dictionary.
- It includes error handling:
  - `json.JSONDecodeError`: Catches errors if the file doesn't contain valid JSON and exits the script with an error code.
  - `FileNotFoundError`: Catches errors if the specified input file doesn't exist and exits the script with an error code.
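A minimal sketch of this loading logic, assuming log wording and exit codes that may differ from the actual script:

```python
import json
import logging
import sys


def load_job_data(input_file: str) -> dict:
    """Load job titles and descriptions from a JSON file (sketch of the behavior above)."""
    try:
        with open(input_file, encoding="utf-8") as f:
            return json.load(f)
    except json.JSONDecodeError as exc:
        logging.error("Invalid JSON in %s: %s", input_file, exc)
        sys.exit(1)
    except FileNotFoundError:
        logging.error("Input file not found: %s", input_file)
        sys.exit(1)
```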
2.4. Libraries/Modules Used (Input Stage):
- `json`: For parsing the JSON input file.
- `sys`: For exiting the script in case of errors.
- `argparse`: For handling command-line arguments.
- `logging`: For logging error messages.
- Data Processing Pipeline (Step-by-Step):
Let's break down the processing within the `ATSOptimizer` class, focusing on the key methods:
Initialization (`__init__`)
- Step 0: Configuration Loading and Setup
  - 3.0.1. Input Data for this Step: The path to the configuration file (default: `config.yaml`).
  - 3.0.2. Operations Performed:
    - Calls `self._load_config(config_path)` to load the YAML configuration file. This method also sets default values for many configuration options if they are not explicitly provided in the config file, ensuring the script has sensible defaults.
    - Calls `self._validate_config()` to check that the loaded configuration conforms to the expected schema (data types, required keys, valid ranges). This helps prevent runtime errors due to incorrect configuration.
    - Loads the spaCy model specified in the configuration (default: `en_core_web_lg`) using `self._load_and_configure_spacy_model()`. This includes handling potential download issues and model loading failures, with a fallback to `en_core_web_md`.
    - Adds the entity ruler to the spaCy pipeline, using skills from the configuration (see the sketch after this list).
    - Initializes the `EnhancedTextPreprocessor` and `AdvancedKeywordExtractor` objects, passing the loaded configuration and spaCy model.
    - Creates a working directory for intermediate files (if enabled), using settings from `config.yaml`.
  - 3.0.3. Output Data of this Step: An initialized `ATSOptimizer` object with the configuration, spaCy model, text preprocessor, and keyword extractor loaded.
  - 3.0.4. Storage of Intermediate Results: The loaded configuration is stored in `self.config`, the spaCy model in `self.nlp`, the preprocessor and keyword extractor in `self.preprocessor` and `self.keyword_extractor`, respectively, and the working directory path in `self.working_dir`.
  - 3.0.5. Libraries/Modules Used (Step 0): `yaml`, `spacy`, `logging`, `sys`, `psutil`, `pathlib`, `uuid`, `json`, `blake3`, `srsly`.
  - 3.0.6. Computational Complexity: O(1) for loading the configuration and initializing objects. The complexity of loading the spaCy model depends on the model size, but this happens only once.
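To make the entity-ruler step concrete, here is a hedged sketch; the pipe position, `phrase_matcher_attr` setting, and example skill list are assumptions, while the `SKILL` label matches the one used later during keyword extraction:

```python
import spacy

# Hedged sketch: attach an entity ruler whose patterns come from the configured skills list.
nlp = spacy.load("en_core_web_lg")
ruler = nlp.add_pipe("entity_ruler", before="ner", config={"phrase_matcher_attr": "LOWER"})
skills = ["python", "machine learning", "sql"]  # stand-in for config.yaml's skills list
ruler.add_patterns([{"label": "SKILL", "pattern": skill} for skill in skills])

doc = nlp("Experience with Python, SQL, and machine learning is required.")
print([(ent.text, ent.label_) for ent in doc.ents])
```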
Main Analysis (`analyze_jobs`)
- Step 1: Input Validation
  - 3.1.1. Input Data for this Step: A dictionary of job titles and descriptions (loaded from the JSON input file).
  - 3.1.2. Operations Performed:
    - Calls `self._validate_input(job_descriptions)` to:
      - Check if the input is a list and convert it to a dictionary if necessary.
      - Iterate through each job title and description.
      - Call `self._validate_title` to ensure the title is a string, not empty, and within the configured length limits.
      - Call `self._validate_description` to ensure the description is a string, within the configured length limits, and not too short (word count), and to handle encoding issues. It also removes URLs, email addresses, and control characters (see the sketch after this list).
      - Collect errors for invalid titles or descriptions.
      - Ensure that at least `min_jobs` (from `config.yaml`) valid job descriptions remain after validation.
  - 3.1.3. Output Data of this Step: A dictionary containing only the valid job titles and descriptions. Invalid entries are discarded.
  - 3.1.4. Storage of Intermediate Results: The validated job descriptions are stored in a new dictionary (`valid_jobs`).
  - 3.1.5. Libraries/Modules Used (Step 1): `re`, `logging`.
  - 3.1.6. Computational Complexity: O(n), where n is the number of job descriptions; each job description is validated once.
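As an illustration of the cleanup performed during description validation, a hedged sketch (the actual regexes, length limits, and word-count threshold live in the script and its config):

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_description(text: str, min_words: int = 10) -> str | None:
    """Strip URLs, emails, and control characters; reject descriptions that are too short."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = CONTROL_RE.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text if len(text.split()) >= min_words else None
```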
- Step 2: Chunking Decision (if necessary)
  - 3.2.1. Input Data: The validated dictionary of job titles and descriptions.
  - 3.2.2. Operations Performed:
    - Calls `self._needs_chunking(job_descriptions)`, which checks:
      - whether the number of jobs exceeds `auto_chunk_threshold`;
      - whether current memory usage (from `psutil.virtual_memory().percent`) exceeds `memory_threshold`.
  - 3.2.3. Output Data: A boolean value indicating whether to use chunking (see the sketch after this list).
  - 3.2.4. Storage: No intermediate storage.
  - 3.2.5. Libraries/Modules: `psutil`.
  - 3.2.6. Computational Complexity: O(1); it performs a few simple checks.
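A hedged sketch of that decision; the default thresholds here are placeholders for the corresponding `config.yaml` values:

```python
import psutil


def needs_chunking(job_descriptions: dict, auto_chunk_threshold: int = 100,
                   memory_threshold: float = 70.0) -> bool:
    """Chunk when there are many jobs or memory is already under pressure."""
    too_many_jobs = len(job_descriptions) > auto_chunk_threshold
    memory_pressure = psutil.virtual_memory().percent > memory_threshold
    return too_many_jobs or memory_pressure
```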
- Step 3: Chunked Analysis (if needed) or Internal Analysis
  - If chunking is required (and intermediate saving is enabled): `_analyze_jobs_chunked` is called (see the parallel-processing sketch after this step).
    - 3.3.1. Input Data: The validated dictionary of job titles and descriptions.
    - 3.3.2. Operations Performed:
      - Calculates a chunk size using `self._calculate_chunk_size`, based on available memory and the average job description size.
      - Splits the job descriptions into chunks.
      - Uses a `ProcessPoolExecutor` to process each chunk in parallel:
        - For each chunk, it calls `self._process_chunk_wrapper`, which in turn calls `self._process_chunk`.
        - `_process_chunk`:
          - Extracts keywords from the chunk using `self.keyword_extractor.extract_keywords(texts)`.
          - Creates a TF-IDF matrix using `self._create_tfidf_matrix(texts, keyword_sets)`.
          - Calculates keyword scores using `self._calculate_scores(dtm, features, keyword_sets, chunk)`.
          - Creates Pandas DataFrames for summary and detailed results.
        - Intermediate results are saved to disk at intervals specified by `save_interval` in `config.yaml`. The format (JSON, JSONL, or Feather) is also determined by the configuration.
      - After processing all chunks, it loads the intermediate results from disk, combines them, and aggregates the results.
    - 3.3.3. Output Data: A tuple of two Pandas DataFrames: a summary DataFrame and a detailed DataFrame.
    - 3.3.4. Storage of Intermediate Results: Intermediate results are stored in files in the `working_dir`, with names that include the `run_id` and chunk index.
    - 3.3.5. Libraries/Modules Used: `concurrent.futures`, `pandas`, `srsly`, `pathlib`, `shutil`, `dask` (if enabled in the configuration).
    - 3.3.6. Computational Complexity: Depends on the number of chunks and the size of each chunk. Keyword extraction and TF-IDF calculation are the most computationally intensive parts; parallel processing with `ProcessPoolExecutor` significantly reduces the overall processing time.
  - If chunking is not required (or intermediate saving is disabled): `_analyze_jobs_internal` is called.
    - 3.4.1. Input Data: The validated dictionary of job titles and descriptions.
    - 3.4.2. Operations Performed:
      - Extracts keywords using `self.keyword_extractor.extract_keywords(texts)`.
      - Creates a TF-IDF matrix using `self._create_tfidf_matrix(texts, keyword_sets)`.
      - Calculates keyword scores using `self._calculate_scores(dtm, features, keyword_sets, job_descriptions)`.
      - Creates Pandas DataFrames for summary and detailed results (similar to `_process_chunk`).
    - 3.4.3. Output Data: A tuple of two Pandas DataFrames: a summary DataFrame and a detailed (pivot) DataFrame.
    - 3.4.4. Storage of Intermediate Results: No intermediate storage to disk.
    - 3.4.5. Libraries/Modules Used: `pandas`, `psutil`.
    - 3.4.6. Computational Complexity: Depends on the number of job descriptions and the length of each description. Keyword extraction and TF-IDF calculation are the most computationally intensive steps.
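To illustrate the parallel chunked path, here is a minimal sketch; `process_chunk` is a stand-in for the script's `_process_chunk`, and the chunk size is fixed rather than derived from memory:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import islice


def chunk_dict(jobs: dict, chunk_size: int):
    """Yield successive chunks of the job dictionary."""
    it = iter(jobs.items())
    while chunk := dict(islice(it, chunk_size)):
        yield chunk


def process_chunk(chunk: dict) -> list[dict]:
    # Placeholder for keyword extraction, TF-IDF, and scoring on one chunk.
    return [{"Job Title": title, "n_words": len(desc.split())} for title, desc in chunk.items()]


def analyze_chunked(jobs: dict, chunk_size: int = 50) -> list[dict]:
    results = []
    with ProcessPoolExecutor() as executor:
        for chunk_result in executor.map(process_chunk, chunk_dict(jobs, chunk_size)):
            results.extend(chunk_result)
    return results


if __name__ == "__main__":  # guard required for ProcessPoolExecutor on spawn-based platforms
    jobs = {"Data Scientist": "We are looking for a data scientist with Python and SQL skills."}
    print(analyze_chunked(jobs, chunk_size=1))
```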
- Step 4: Keyword Extraction (within `_process_chunk` or `_analyze_jobs_internal`)
  This is handled by the `AdvancedKeywordExtractor` class's `extract_keywords` method:
  - 3.4.x.1. Input Data: A list of job description texts.
  - 3.4.x.2. Operations Performed:
    - Uses spaCy's `nlp.pipe` for efficient batch processing of the job descriptions.
    - Extracts named entities with the label "SKILL" (using the entity ruler added during initialization).
    - Tokenizes the text, excluding any part of an extracted skill entity.
    - Preprocesses the remaining text using the `EnhancedTextPreprocessor` (lowercasing; removing URLs, emails, special characters, and extra whitespace).
    - Generates n-grams (based on `ngram_range` from the configuration) from the preprocessed tokens (see the sketch after this list).
    - Combines the extracted entities and n-grams.
    - Filters the keywords, removing any that contain only stop words or single-character words.
    - Optionally applies semantic filtering (`_semantic_filter`) if `semantic_validation` is enabled in the configuration. This checks whether keywords are semantically related to their context, using cosine similarity between the keyword's embedding and the section's embedding.
    - Ensures whitelist recall using `ensure_whitelist_recall`. This uses fuzzy matching to find terms from the `skills_whitelist` within the job descriptions and adds them to the extracted keywords, even if they weren't initially extracted.
  - 3.4.x.3. Output Data: A list of lists, where each inner list contains the extracted keywords for the corresponding job description.
  - 3.4.x.4. Storage of Intermediate Results: Intermediate results (preprocessed text, tokens, n-grams) are stored in memory within the `AdvancedKeywordExtractor` and `EnhancedTextPreprocessor` objects. The `EnhancedTextPreprocessor` uses an LRU cache (`_cache`) to store preprocessed text, reducing redundant preprocessing.
  - 3.4.x.5. Libraries/Modules Used: `spacy`, `re`, `concurrent.futures`, `rapidfuzz`, `nltk`.
  - 3.4.x.6. Computational Complexity: Depends on the length of the job descriptions and the configured n-gram range. spaCy's `nlp.pipe` provides significant performance improvements for batch processing. The semantic filtering and whitelist recall steps add computational overhead but are optimized using vector operations and fuzzy matching.
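The n-gram generation and stop-word filtering can be sketched as follows; the stop-word set and `ngram_range` default are placeholders, and the real extractor additionally handles entities, semantic filtering, and whitelist recall:

```python
from nltk.util import ngrams

STOP_WORDS = {"and", "with", "of", "in", "the", "a"}  # stand-in for the configured stop words


def generate_keyword_ngrams(tokens: list[str], ngram_range: tuple[int, int] = (1, 3)) -> set[str]:
    """Build n-grams from preprocessed tokens, dropping grams made only of stop words or single characters."""
    candidates = set()
    for n in range(ngram_range[0], ngram_range[1] + 1):
        for gram in ngrams(tokens, n):
            if all(tok in STOP_WORDS or len(tok) == 1 for tok in gram):
                continue
            candidates.add(" ".join(gram))
    return candidates


print(generate_keyword_ngrams(["experience", "with", "machine", "learning", "pipelines"]))
```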
- Step 5: TF-IDF Calculation (within `_process_chunk` or `_analyze_jobs_internal`)
  This is handled by the `_create_tfidf_matrix` method:
  - 3.5.x.1. Input Data: The list of job description texts and the keyword sets for each job description.
  - 3.5.x.2. Operations Performed:
    - Initializes a `TfidfVectorizer` from scikit-learn. Crucially, it is configured not to re-tokenize the input, treating the provided `keyword_sets` as pre-tokenized lists. This is achieved by setting `tokenizer=lambda x: x` and `preprocessor=lambda x: x` (see the sketch after this list).
    - Calls `fit_transform` on the vectorizer with the `keyword_sets` to create the TF-IDF matrix.
    - Gets the feature names (terms) from the vectorizer.
  - 3.5.x.3. Output Data: The TF-IDF matrix (a sparse matrix) and a list of feature names (terms).
  - 3.5.x.4. Storage of Intermediate Results: The `TfidfVectorizer` object is created and used within this function.
  - 3.5.x.5. Libraries/Modules Used: `sklearn.feature_extraction.text` (specifically `TfidfVectorizer`), `numpy`.
  - 3.5.x.6. Computational Complexity: Depends on the size of the vocabulary (number of unique terms) and the number of job descriptions; the `fit_transform` call is the most computationally expensive part.
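A hedged sketch of that vectorizer setup, passing the already-extracted keyword lists through unchanged:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Identity tokenizer/preprocessor so the pre-extracted keyword lists are used as-is.
vectorizer = TfidfVectorizer(
    tokenizer=lambda x: x,
    preprocessor=lambda x: x,
    token_pattern=None,  # silence the "token_pattern is unused" warning
    lowercase=False,
)

keyword_sets = [
    ["python", "machine learning", "sql"],
    ["python", "data analysis", "communication"],
]
dtm = vectorizer.fit_transform(keyword_sets)
features = vectorizer.get_feature_names_out()
print(features, dtm.shape)
```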
- Step 6: Score Calculation (within `_process_chunk` or `_analyze_jobs_internal`)
  This is handled by the `_calculate_scores` method:
  - 3.6.x.1. Input Data: The TF-IDF matrix, feature names, keyword sets, and the original job descriptions (dictionary).
  - 3.6.x.2. Operations Performed:
    - Iterates through the non-zero entries in the TF-IDF matrix (using the COO format for efficiency; see the sketch after this list).
    - For each keyword, it calculates a score based on:
      - the TF-IDF value (weighted by `tfidf_weight`);
      - the frequency of the keyword in the job description (weighted by `frequency_weight`). Note: this uses binary presence (0 or 1) rather than the raw count;
      - whether the keyword is in the whitelist (if so, the score is multiplied by `whitelist_boost`).
    - Applies section weights based on the section in which the keyword is found (using `_detect_keyword_section`).
    - Determines the category of the keyword.
  - 3.6.x.3. Output Data: A list of dictionaries, where each dictionary represents a keyword and contains its score, TF-IDF value, frequency, category, whitelist status, and the job title it was found in.
  - 3.6.x.4. Storage of Intermediate Results: The results are accumulated in a list (`results`).
  - 3.6.x.5. Libraries/Modules Used: None (beyond built-in Python functions and data structures).
  - 3.6.x.6. Computational Complexity: O(m), where m is the number of non-zero entries in the TF-IDF matrix.
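A simplified, hedged version of the scoring loop; section weighting and category lookup are omitted, and the weight names mirror the config keys mentioned above:

```python
from scipy.sparse import coo_matrix


def calculate_scores(dtm, features, job_titles, whitelist,
                     tfidf_weight=0.7, frequency_weight=0.3, whitelist_boost=1.5):
    """Score each non-zero (job, term) cell of the TF-IDF matrix."""
    results = []
    coo = coo_matrix(dtm)  # COO format exposes (row, col, value) triples directly
    for row, col, tfidf in zip(coo.row, coo.col, coo.data):
        term = features[col]
        presence = 1  # binary presence (0/1) rather than raw frequency, per the description above
        score = tfidf_weight * tfidf + frequency_weight * presence
        in_whitelist = term in whitelist
        if in_whitelist:
            score *= whitelist_boost
        results.append({
            "Keyword": term,
            "Job Title": job_titles[row],
            "Score": score,
            "TF-IDF": tfidf,
            "In Whitelist": in_whitelist,
        })
    return results
```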
- Step 7: Result Aggregation (within `_analyze_jobs_chunked` or `_analyze_jobs_internal`)
  - In `_analyze_jobs_chunked`:
    - 3.7.1. Input Data: A list of summary DataFrames (one for each processed chunk).
    - 3.7.2. Operations Performed:
      - If Dask is enabled:
        - Creates a Dask DataFrame from the list of Pandas DataFrames.
        - Performs a groupby operation on the Dask DataFrame to aggregate the results (sum and mean of "Total_Score", sum of "Job_Count").
        - Computes the result, converting the Dask DataFrame to a Pandas DataFrame.
        - Includes a checksum validation to ensure the Dask computation is correct.
      - If Dask is not enabled:
        - Concatenates the list of Pandas DataFrames into a single DataFrame.
        - Performs a groupby operation on the concatenated DataFrame (similar to the Dask case).
    - 3.7.3. Output Data: A single summary DataFrame.
    - 3.7.4. Storage of Intermediate Results: None (beyond the intermediate storage of the chunks themselves).
    - 3.7.5. Libraries/Modules Used: `pandas`, `dask` (if enabled), `numpy`.
    - 3.7.6. Computational Complexity: If Dask is used, the complexity depends on the number of chunks and the size of each chunk; Dask parallelizes the aggregation. If Dask is not used, the complexity is O(n log n), where n is the total number of keywords across all chunks (due to the groupby operation).
  - In `_analyze_jobs_internal`:
    - 3.8.1. Input Data: The list of dictionaries produced by `_calculate_scores`.
    - 3.8.2. Operations Performed (see the sketch after this list):
      - Creates a Pandas DataFrame from the list of dictionaries.
      - Performs a groupby operation on the DataFrame to aggregate the results (sum and mean of "Score", count of unique "Job Title").
      - Renames the columns of the resulting DataFrame.
      - Creates a pivot table.
    - 3.8.3. Output Data: A summary DataFrame and a pivot table DataFrame.
    - 3.8.4. Storage of Intermediate Results: None.
    - 3.8.5. Libraries/Modules Used: `pandas`.
    - 3.8.6. Computational Complexity: O(n log n), where n is the number of keywords (due to the groupby operation).
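The non-Dask aggregation is essentially a pandas groupby; a hedged sketch using the column names from the output description:

```python
import pandas as pd

detailed = pd.DataFrame([
    {"Keyword": "python", "Job Title": "Data Scientist", "Score": 1.2},
    {"Keyword": "python", "Job Title": "Software Engineer", "Score": 0.9},
    {"Keyword": "sql", "Job Title": "Data Scientist", "Score": 0.7},
])

# Aggregate per keyword, then build the per-job pivot table.
summary = (
    detailed.groupby("Keyword")
    .agg(Total_Score=("Score", "sum"),
         Avg_Score=("Score", "mean"),
         Job_Count=("Job Title", "nunique"))
    .reset_index()
)
pivot = detailed.pivot_table(index="Keyword", columns="Job Title", values="Score", fill_value=0)
print(summary)
print(pivot)
```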
- Intermediate Data Handling:
- In-memory caching: The `EnhancedTextPreprocessor` uses an LRU (Least Recently Used) cache (`_cache`) to store the results of text preprocessing. This avoids redundant preprocessing of the same text. The cache size is configurable via `cache_size` in `config.yaml`. The cache is cleared if the configuration changes (detected using a BLAKE3 hash).
- Whitelist caching: The `_load_or_create_whitelist` method in `AdvancedKeywordExtractor` caches the expanded whitelist to a file (`whitelist_cache.json`) using `srsly`. This avoids regenerating the whitelist on every run. The cache is validated using a BLAKE3 hash of the `skills_whitelist` from the configuration (see the sketch after this list).
- Intermediate file storage (optional): If `intermediate_save.enabled` is set to `True` in `config.yaml`, the script saves intermediate results to disk at regular intervals during chunked processing. This is handled by the `_analyze_jobs_chunked` method and the `_save_intermediate` function.
  - The results are saved to files in the `intermediate_save.working_dir` directory.
  - The file format can be JSON, JSONL, or Feather, specified by `intermediate_save.format`.
  - The filenames include a unique `run_id` and a chunk index.
  - The `_cleanup_intermediate` function removes these intermediate files after processing is complete (if `intermediate_save.cleanup` is `True`).
- The script uses the term vectors generated by spaCy and caches them.
- The script categorizes terms and caches them.
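A hedged sketch of the hash-validated whitelist cache; the cached file layout and the `expand` callback are assumptions, while the file name and hashing library come from the description above:

```python
from pathlib import Path

import blake3
import srsly

CACHE_FILE = Path("whitelist_cache.json")


def load_or_create_whitelist(skills_whitelist: list[str], expand) -> list[str]:
    """Reuse the cached expanded whitelist only if the configured skills are unchanged."""
    config_hash = blake3.blake3(" ".join(sorted(skills_whitelist)).encode("utf-8")).hexdigest()
    if CACHE_FILE.exists():
        cached = srsly.read_json(CACHE_FILE)
        if cached.get("hash") == config_hash:
            return cached["whitelist"]
    expanded = expand(skills_whitelist)  # e.g. add synonyms/variants of each skill
    srsly.write_json(CACHE_FILE, {"hash": config_hash, "whitelist": expanded})
    return expanded
```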
- Final Output Stage:
5.1. Storage Location:
The final results are saved to an Excel file specified via the `-o` or `--output` command-line argument. The default file path is `results.xlsx`. This is handled by the `save_results` function.
5.2. Output Format:
The output is an Excel file with two sheets:
- Summary: Contains aggregated keyword statistics:
  - Keyword: The extracted keyword.
  - Total_Score: The sum of scores for the keyword across all job descriptions.
  - Avg_Score: The average score for the keyword.
  - Job_Count: The number of job descriptions in which the keyword appears.
- Detailed Scores: A pivot table showing the score of each keyword in each job description.
5.3. Purpose of Output:
The output is intended to be used by job seekers or recruiters to identify the most important keywords and skills for a set of job descriptions. The summary sheet provides an overview of the most frequent and highly-scored keywords. The detailed sheet allows for a more granular analysis of keyword scores across individual job descriptions.
5.4. Libraries/Modules Used (Output Stage):
- `pandas`: For creating and manipulating DataFrames and writing to Excel (see the sketch below).
- `pathlib`: For file path manipulation.
- `shutil`: For disk space checks.
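A hedged sketch of writing the two-sheet workbook described above; the sheet names follow the output description, and the real `save_results` additionally checks available disk space:

```python
import pandas as pd


def save_results(summary_df: pd.DataFrame, details_df: pd.DataFrame,
                 output_file: str = "results.xlsx") -> None:
    """Write the summary and detailed pivot DataFrames to one Excel workbook."""
    with pd.ExcelWriter(output_file) as writer:
        summary_df.to_excel(writer, sheet_name="Summary", index=False)
        details_df.to_excel(writer, sheet_name="Detailed Scores")
```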