How script v0.1 works

  1. Script Overview and Purpose:

The script, Keywords4CV.py, is designed to analyze a set of job descriptions, extract relevant keywords and skills, categorize those keywords, and calculate scores for each keyword based on TF-IDF, frequency, and presence in a whitelist. The ultimate goal is to help candidates identify the most important skills to highlight in their CVs/resumes to improve their compatibility with Applicant Tracking Systems (ATS).

  2. Data Input Stage:

2.1. Source:

The script accepts input data from a JSON file specified via the -i or --input command-line argument. The default file path is "job_descriptions.json".

2.2. Data Format:

The input JSON file should have the following structure:

{
    "Job Title 1": "Job Description 1",
    "Job Title 2": "Job Description 2",
    ...
}

It's a dictionary where keys are job titles (strings) and values are the corresponding job descriptions (strings).

Example:

{
  "Data Scientist": "We are looking for a data scientist...",
  "Software Engineer": "The ideal candidate will have experience..."
}

2.3. Data Loading/Ingestion:

The load_job_data function handles the data loading:

  • It takes the input file path as an argument.
  • It uses the open() function in a with statement to open the file, specifying encoding="utf-8" to handle various character sets correctly. This is crucial for robust text processing.
  • It uses json.load(f) to parse the JSON data from the file object f into a Python dictionary.
  • It includes error handling:
    • json.JSONDecodeError: Catches errors if the file doesn't contain valid JSON and exits the script with an error code.
    • FileNotFoundError: Catches errors if the specified input file doesn't exist and exits the script with an error code.
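
As a minimal sketch of this loading behavior (exact log messages and exit codes in Keywords4CV.py may differ):

import json
import logging
import sys

logger = logging.getLogger(__name__)

def load_job_data(input_file: str) -> dict:
    """Load the {job title: job description} dictionary from a JSON file."""
    try:
        with open(input_file, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        logger.error("Input file not found: %s", input_file)
        sys.exit(1)
    except json.JSONDecodeError as exc:
        logger.error("Invalid JSON in %s: %s", input_file, exc)
        sys.exit(1)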

2.4. Libraries/Modules Used (Input Stage):

  • json: For parsing the JSON input file.
  • sys: For exiting the script in case of errors.
  • argparse: For handling command-line arguments.
  • logging: For logging error messages.

  3. Data Processing Pipeline (Step-by-Step):

Let's break down the processing within the ATSOptimizer class, focusing on the key methods:

Initialization (__init__)

  • Step 0: Configuration Loading and Setup
    • 3.0.1. Input Data for this Step: The path to the configuration file (default: config.yaml).
    • 3.0.2. Operations Performed:
      • Calls self._load_config(config_path) to load the YAML configuration file. This method also sets default values for many configuration options if they are not explicitly provided in the config file. This ensures the script has sensible defaults.
      • Calls self._validate_config() to check that the loaded configuration conforms to the expected schema (data types, required keys, valid ranges). This helps prevent runtime errors due to incorrect configuration.
      • Loads the spaCy model specified in the configuration (default: en_core_web_lg) using self._load_and_configure_spacy_model(). This includes handling potential download issues and model loading failures, with a fallback to en_core_web_md.
      • Adds the entity ruler to the spaCy pipeline, using skills from the configuration.
      • Initializes the EnhancedTextPreprocessor and AdvancedKeywordExtractor objects, passing the loaded configuration and spaCy model.
      • Creates a working directory for intermediate files (if enabled), using settings from config.yaml.
    • 3.0.3. Output Data of this Step: An initialized ATSOptimizer object with the configuration, spaCy model, text preprocessor, and keyword extractor loaded.
    • 3.0.4. Storage of Intermediate Results: The loaded configuration is stored in self.config. The spaCy model is stored in self.nlp. The preprocessor and keyword extractor are stored in self.preprocessor and self.keyword_extractor, respectively. The working directory path is in self.working_dir.
    • 3.0.5. Libraries/Modules Used (Step 0): yaml, spacy, logging, sys, psutil, pathlib, uuid, json, blake3, srsly.
    • 3.0.6. Computational Complexity: O(1) for loading the configuration and initializing objects. The complexity of loading the spaCy model depends on the model size, but this happens only once.
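
A condensed sketch of this initialization flow, assuming illustrative default values and a simplified fallback (the real _load_config, _validate_config, and _load_and_configure_spacy_model do considerably more, including schema validation and entity-ruler setup):

import logging
import sys

import spacy
import yaml

logger = logging.getLogger(__name__)

# Illustrative defaults only; the real config.yaml defines many more options.
DEFAULTS = {"ngram_range": [1, 3], "cache_size": 5000, "whitelist_boost": 1.5}

def load_config(config_path: str = "config.yaml") -> dict:
    """Read the YAML config and fill in defaults for any missing keys."""
    try:
        with open(config_path, encoding="utf-8") as f:
            config = yaml.safe_load(f) or {}
    except FileNotFoundError:
        logger.error("Config file not found: %s", config_path)
        sys.exit(1)
    for key, value in DEFAULTS.items():
        config.setdefault(key, value)
    return config

def load_spacy_model(name: str = "en_core_web_lg"):
    """Load the configured spaCy model, falling back to the medium model."""
    try:
        return spacy.load(name)
    except OSError:
        logger.warning("Could not load %s, falling back to en_core_web_md", name)
        return spacy.load("en_core_web_md")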

Main Analysis (analyze_jobs)

  • Step 1: Input Validation

    • 3.1.1. Input Data for this Step: A dictionary of job titles and descriptions (loaded from the JSON input file).
    • 3.1.2. Operations Performed:
      • Calls self._validate_input(job_descriptions) to:
        • Check if the input is a list and convert it to a dictionary if necessary.
        • Iterate through each job title and description.
        • Call self._validate_title to ensure the title is a string, not empty, and within the configured length limits.
        • Call self._validate_description to ensure the description is a string, within the configured length limits, and not too short (by word count); this check also handles encoding issues and removes URLs, email addresses, and control characters.
        • Collect errors for invalid titles or descriptions.
        • Ensure that at least min_jobs (from config.yaml) valid job descriptions remain after validation.
    • 3.1.3. Output Data of this Step: A dictionary containing only the valid job titles and descriptions. Invalid entries are discarded.
    • 3.1.4. Storage of Intermediate Results: The validated job descriptions are stored in a new dictionary (valid_jobs).
    • 3.1.5. Libraries/Modules Used (Step 1): re, logging.
    • 3.1.6. Computational Complexity: O(n), where n is the number of job descriptions. It iterates through each job description once.
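
A simplified sketch of this validation pass, assuming illustrative length limits (the real _validate_title and _validate_description read their limits and min_jobs from config.yaml and collect per-entry error messages):

import re

URL_OR_EMAIL = re.compile(r"https?://\S+|\S+@\S+")

def validate_input(job_descriptions: dict,
                   max_title_len: int = 100,        # illustrative limit
                   min_desc_words: int = 20) -> dict:  # illustrative limit
    """Return only the entries whose title and description pass basic checks."""
    valid_jobs = {}
    for title, desc in job_descriptions.items():
        if not isinstance(title, str) or not title.strip() or len(title) > max_title_len:
            continue
        if not isinstance(desc, str):
            continue
        cleaned = URL_OR_EMAIL.sub(" ", desc)           # strip URLs and email addresses
        cleaned = re.sub(r"[\x00-\x1f]", " ", cleaned)  # strip control characters
        if len(cleaned.split()) < min_desc_words:
            continue
        valid_jobs[title] = cleaned
    return valid_jobs
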
  • Step 2: Chunking Decision (if necessary)

    • 3.2.1. Input Data: The validated dictionary of job titles and descriptions.
    • 3.2.2. Operations Performed:
      • Calls self._needs_chunking(job_descriptions) which checks:
        • If the number of jobs exceeds auto_chunk_threshold.
        • If current memory usage (using psutil.virtual_memory().percent) exceeds memory_threshold.
    • 3.2.3. Output Data: A boolean value indicating whether to use chunking.
    • 3.2.4. Storage of Intermediate Results: No intermediate storage.
    • 3.2.5. Libraries/Modules Used: psutil.
    • 3.2.6. Computational Complexity: O(1). It performs a few simple checks.
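
A sketch of this decision, assuming hypothetical threshold values (the real values come from config.yaml):

import psutil

def needs_chunking(job_descriptions: dict,
                   auto_chunk_threshold: int = 100,          # hypothetical default
                   memory_threshold: float = 70.0) -> bool:  # percent, hypothetical default
    """Chunk when the batch is large or system memory is already under pressure."""
    if len(job_descriptions) > auto_chunk_threshold:
        return True
    return psutil.virtual_memory().percent > memory_threshold
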
  • Step 3: Chunked Analysis (if needed) or Internal Analysis

    • If chunking is required (and intermediate saving is enabled): _analyze_jobs_chunked is called.

      • 3.3.1. Input Data: The validated dictionary of job titles and descriptions.
      • 3.3.2. Operations Performed:
        • Calculates a chunk size using self._calculate_chunk_size based on available memory and average job description size.
        • Splits the job descriptions into chunks.
        • Uses a ProcessPoolExecutor to process each chunk in parallel:
          • For each chunk, it calls self._process_chunk_wrapper which in turn calls self._process_chunk.
          • _process_chunk:
            • Extracts keywords from the chunk using self.keyword_extractor.extract_keywords(texts).
            • Creates a TF-IDF matrix using self._create_tfidf_matrix(texts, keyword_sets).
            • Calculates keyword scores using self._calculate_scores(dtm, features, keyword_sets, chunk).
            • Creates Pandas DataFrames for summary and detailed results.
          • Intermediate results are saved to disk at intervals specified by save_interval in config.yaml. The format (JSON, JSONL, or Feather) is also determined by the configuration.
        • After processing all chunks, it loads the intermediate results from disk, combines them, and aggregates the results.
      • 3.3.3. Output Data: A tuple of two Pandas DataFrames: a summary DataFrame and a detailed DataFrame.
      • 3.3.4. Storage of Intermediate Results: Intermediate results are stored in files in the working_dir with names including the run_id and chunk index.
      • 3.3.5. Libraries/Modules Used: concurrent.futures, pandas, srsly, pathlib, shutil, dask (if enabled in the configuration).
      • 3.3.6. Computational Complexity: The complexity depends on the number of chunks and the size of each chunk. Keyword extraction and TF-IDF calculation are the most computationally intensive parts. Parallel processing with ProcessPoolExecutor significantly reduces the overall processing time.
    • If chunking is not required (or intermediate saving is disabled): _analyze_jobs_internal is called.

      • 3.4.1. Input Data: The validated dictionary of job titles and descriptions.
      • 3.4.2. Operations Performed:
        • Extracts keywords using self.keyword_extractor.extract_keywords(texts).
        • Creates a TF-IDF matrix using self._create_tfidf_matrix(texts, keyword_sets).
        • Calculates keyword scores using self._calculate_scores(dtm, features, keyword_sets, job_descriptions).
        • Creates Pandas DataFrames for summary and detailed results (similar to _process_chunk).
      • 3.4.3. Output Data: A tuple of two Pandas DataFrames: a summary DataFrame and a detailed (pivot) DataFrame.
      • 3.4.4. Storage of Intermediate Results: No intermediate storage to disk.
      • 3.4.5. Libraries/Modules Used: pandas, psutil.
      • 3.4.6. Computational Complexity: The complexity depends on the number of job descriptions and the length of each description. Keyword extraction and TF-IDF calculation are the most computationally intensive steps.
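
The fan-out pattern used in the chunked path can be sketched as follows; process_chunk stands in for the script's _process_chunk, and the chunk-size calculation, intermediate saving, and result aggregation described above are omitted:

from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def split_into_chunks(job_descriptions: dict, chunk_size: int):
    """Yield successive sub-dictionaries of at most chunk_size jobs."""
    items = iter(job_descriptions.items())
    while True:
        chunk = dict(islice(items, chunk_size))
        if not chunk:
            break
        yield chunk

def analyze_in_chunks(job_descriptions: dict, chunk_size: int, process_chunk):
    """Process each chunk in a separate worker process and collect the per-chunk results."""
    chunks = list(split_into_chunks(job_descriptions, chunk_size))
    with ProcessPoolExecutor() as executor:
        return list(executor.map(process_chunk, chunks))
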
  • Step 4: Keyword Extraction (within _process_chunk or _analyze_jobs_internal)

    This is handled by the AdvancedKeywordExtractor class's extract_keywords method:

    • 3.4.x.1 Input Data: A list of job description texts.
    • 3.4.x.2 Operations Performed:
      • Uses spaCy's nlp.pipe for efficient batch processing of the job descriptions.
      • Extracts named entities with the label "SKILL" (using the entity ruler added during initialization).
      • Tokenizes the text, excluding tokens that fall within an extracted skill entity.
      • Preprocesses the remaining text using the EnhancedTextPreprocessor (lowercasing, removing URLs, emails, special characters, and extra whitespace).
      • Generates n-grams (based on ngram_range from the configuration) from the preprocessed tokens.
      • Combines the extracted entities and n-grams.
      • Filters the keywords, removing any that contain only stop words or single-character words.
      • Optionally applies semantic filtering (_semantic_filter) if semantic_validation is enabled in the configuration. This checks if keywords are semantically related to their context using cosine similarity between the keyword's embedding and the section's embedding.
      • Ensures whitelist recall using ensure_whitelist_recall. This uses fuzzy matching to find terms from the skills_whitelist within the job descriptions and adds them to the extracted keywords, even if they weren't initially extracted.
    • 3.4.x.3 Output Data: A list of lists, where each inner list contains the extracted keywords for a corresponding job description.
    • 3.4.x.4 Storage of Intermediate Results: Intermediate results (preprocessed text, tokens, n-grams) are stored in memory within the AdvancedKeywordExtractor and EnhancedTextPreprocessor objects. The EnhancedTextPreprocessor uses an LRU cache (_cache) to store preprocessed text, reducing redundant preprocessing.
    • 3.4.x.5 Libraries/Modules Used: spacy, re, concurrent.futures, rapidfuzz, nltk.
    • 3.4.x.6 Computational Complexity: The complexity depends on the length of the job descriptions and the configured n-gram range. spaCy's nlp.pipe provides significant performance improvements for batch processing. The semantic filtering and whitelist recall steps add computational overhead, but are optimized using vector operations and fuzzy matching.
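
A sketch of the n-gram generation and stop-word filtering steps, using a toy stop-word set (the real pipeline also folds in spaCy "SKILL" entities, semantic filtering, and fuzzy whitelist recall, which are omitted here):

STOP_WORDS = {"and", "or", "the", "a", "in", "of", "to", "with"}  # toy subset for illustration

def generate_ngrams(tokens: list[str], ngram_range: tuple[int, int] = (1, 3)) -> set[str]:
    """Build all n-grams whose length falls inside the configured range."""
    low, high = ngram_range
    ngrams = set()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[i:i + n]))
    return ngrams

def filter_keywords(candidates: set[str]) -> list[str]:
    """Drop candidates made up entirely of stop words or single characters."""
    kept = []
    for phrase in candidates:
        words = phrase.split()
        if all(w in STOP_WORDS or len(w) == 1 for w in words):
            continue
        kept.append(phrase)
    return sorted(kept)
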
  • Step 5: TF-IDF Calculation (within _process_chunk or _analyze_jobs_internal)

    This is handled by the _create_tfidf_matrix method:

    • 3.5.x.1 Input data: List of job description texts and the keyword sets for each job description.
    • 3.5.x.2 Operations Performed:
      • Initializes a TfidfVectorizer from scikit-learn. Crucially, it's configured to not re-tokenize the input, treating the provided keyword_sets as pre-tokenized lists. This is achieved by setting tokenizer=lambda x: x and preprocessor=lambda x: x.
      • Calls fit_transform on the vectorizer with the keyword_sets to create the TF-IDF matrix.
      • Gets the feature names (terms) from the vectorizer.
    • 3.5.x.3 Output Data: The TF-IDF matrix (a sparse matrix) and a list of feature names (terms).
    • 3.5.x.4 Storage of Intermediate Results: The TfidfVectorizer object is created and used within this function.
    • 3.5.x.5 Libraries/Modules Used: sklearn.feature_extraction.text (specifically TfidfVectorizer), numpy.
    • 3.5.x.6 Computational Complexity: The complexity of TF-IDF calculation depends on the size of the vocabulary (number of unique terms) and the number of job descriptions. The fit_transform method is the most computationally expensive part.
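
The key detail here, passing identity functions so the pre-tokenized keyword lists are used as-is, looks roughly like this (the script's exact vectorizer options may differ):

from sklearn.feature_extraction.text import TfidfVectorizer

def create_tfidf_matrix(keyword_sets: list[list[str]]):
    """Build a TF-IDF matrix over pre-tokenized keyword lists without re-tokenizing them."""
    vectorizer = TfidfVectorizer(
        tokenizer=lambda x: x,      # input is already a list of keywords
        preprocessor=lambda x: x,   # preprocessing was done upstream
        lowercase=False,
        token_pattern=None,         # silence the unused-token_pattern warning
    )
    dtm = vectorizer.fit_transform(keyword_sets)   # sparse document-term matrix
    features = vectorizer.get_feature_names_out()  # column index -> term
    return dtm, features
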
  • Step 6: Score Calculation (within _process_chunk or _analyze_jobs_internal)

    This is handled by the _calculate_scores method:

    • 3.6.x.1. Input Data: The TF-IDF matrix, feature names, keyword sets, and the original job descriptions (dictionary).
    • 3.6.x.2. Operations Performed:
      • Iterates through the non-zero entries in the TF-IDF matrix (using the COO format for efficiency).
      • For each keyword, it calculates a score based on:
        • The TF-IDF value (weighted by tfidf_weight).
        • The frequency of the keyword in the job description (weighted by frequency_weight). Note: This uses binary presence (0 or 1) rather than raw count.
        • Whether the keyword is in the whitelist (if so, the score is multiplied by whitelist_boost).
      • Applies section weights based on the section in which the keyword is found (using _detect_keyword_section).
      • Determines the category of the keyword.
    • 3.6.x.3. Output Data: A list of dictionaries, where each dictionary represents a keyword and contains its score, TF-IDF value, frequency, category, whitelist status, and the job title it was found in.
    • 3.6.x.4. Storage of Intermediate Results: The results are accumulated in a list (results).
    • 3.6.x.5. Libraries/Modules Used: None (beyond built-in Python functions and data structures).
    • 3.6.x.6. Computational Complexity: O(m), where m is the number of non-zero entries in the TF-IDF matrix.
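
A sketch of this scoring loop over the sparse matrix, assuming hypothetical weights and omitting the section-weight and category lookups described above:

def calculate_scores(dtm, features, keyword_sets, job_titles, whitelist,
                     tfidf_weight=0.7, frequency_weight=0.3, whitelist_boost=1.5):
    """Score every non-zero (job, keyword) cell of the TF-IDF matrix."""
    # The weight values above are illustrative; the real ones come from config.yaml.
    results = []
    coo = dtm.tocoo()  # COO format exposes (row, col, value) triples directly
    for row, col, tfidf in zip(coo.row, coo.col, coo.data):
        keyword = features[col]
        presence = 1 if keyword in keyword_sets[row] else 0  # binary presence, not raw count
        score = tfidf_weight * float(tfidf) + frequency_weight * presence
        in_whitelist = keyword in whitelist
        if in_whitelist:
            score *= whitelist_boost
        results.append({
            "Keyword": keyword,
            "Job Title": job_titles[row],
            "Score": score,
            "TF-IDF": float(tfidf),
            "Frequency": presence,
            "In Whitelist": in_whitelist,
        })
    return results
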
  • Step 7: Result Aggregation (within _analyze_jobs_chunked or _analyze_jobs_internal)

    • In _analyze_jobs_chunked:

      • 3.7.1. Input Data: A list of summary DataFrames (one for each processed chunk).

      • 3.7.2. Operations Performed:

        • If Dask is enabled:
          • Creates a Dask DataFrame from the list of Pandas DataFrames.
          • Performs a groupby operation on the Dask DataFrame to aggregate the results (sum and mean of "Total_Score", sum of "Job_Count").
          • Computes the result, converting the Dask DataFrame to a Pandas DataFrame.
          • Includes a checksum validation to ensure the Dask computation is correct.
        • If Dask is not enabled:
          • Concatenates the list of Pandas DataFrames into a single DataFrame.
          • Performs a groupby operation on the concatenated DataFrame (similar to the Dask case).
      • 3.7.3. Output Data: A single summary DataFrame.

      • 3.7.4. Storage of Intermediate Results: None (beyond the intermediate storage of chunks themselves).

      • 3.7.5. Libraries/Modules Used: pandas, dask (if enabled), numpy.

      • 3.7.6. Computational Complexity: If Dask is used, the complexity depends on the number of chunks and the size of each chunk. Dask parallelizes the aggregation. If Dask is not used, the complexity is O(n log n), where n is the total number of keywords across all chunks (due to the groupby operation).

    • In _analyze_jobs_internal:

      • 3.8.1. Input Data: The list of dictionaries produced by _calculate_scores.
      • 3.8.2. Operations Performed:
        • Creates a Pandas DataFrame from the list of dictionaries.
        • Performs a groupby operation on the DataFrame to aggregate the results (sum and mean of "Score", count of unique "Job Title").
        • Renames the columns of the resulting DataFrame.
        • Creates a pivot table.
      • 3.8.3. Output Data: A summary DataFrame and a pivot table DataFrame.
      • 3.8.4. Storage of Intermediate Results: None.
      • 3.8.5. Libraries/Modules Used: pandas.
      • 3.8.6. Computational Complexity: O(n log n), where n is the number of keywords (due to the groupby operation).
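
A sketch of the non-Dask aggregation path, assuming the column names used in the final output (section 5); the Dask branch applies an equivalent groupby on a Dask DataFrame:

import pandas as pd

def summarize(results: list[dict]) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Aggregate per-keyword scores and build the per-job pivot table."""
    df = pd.DataFrame(results)
    summary = (
        df.groupby("Keyword")
          .agg(Total_Score=("Score", "sum"),
               Avg_Score=("Score", "mean"),
               Job_Count=("Job Title", "nunique"))
          .reset_index()
          .sort_values("Total_Score", ascending=False)
    )
    details = df.pivot_table(index="Keyword", columns="Job Title",
                             values="Score", aggfunc="sum", fill_value=0)
    return summary, details
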
  4. Intermediate Data Handling:
  • In-memory caching: The EnhancedTextPreprocessor uses an LRU (Least Recently Used) cache (_cache) to store the results of text preprocessing. This avoids redundant preprocessing of the same text. The cache size is configurable via cache_size in config.yaml. The cache is cleared if the configuration changes (detected using a BLAKE3 hash).
  • Whitelist caching: The _load_or_create_whitelist method in AdvancedKeywordExtractor caches the expanded whitelist to a file (whitelist_cache.json) using srsly. This avoids regenerating the whitelist on every run. The cache is validated using a BLAKE3 hash of the skills_whitelist from the configuration.
  • Intermediate file storage (optional): If intermediate_save.enabled is set to True in config.yaml, the script saves intermediate results to disk at regular intervals during chunked processing. This is handled by the _analyze_jobs_chunked method and the _save_intermediate function.
    • The results are saved to files in the intermediate_save.working_dir directory.
    • The file format can be JSON, JSONL, or Feather, specified by intermediate_save.format.
    • The filenames include a unique run_id and a chunk index.
    • The _cleanup_intermediate function removes these intermediate files after processing is complete (if intermediate_save.cleanup is True).
  • Term-vector caching: the term vectors generated by spaCy are cached for reuse.
  • Category caching: keyword category assignments are cached so that repeated terms are not re-categorized.
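
The whitelist-cache validation described above can be sketched like this; expand_whitelist and the exact cache file layout are hypothetical stand-ins for the script's own logic:

from pathlib import Path

import blake3
import srsly

def whitelist_hash(skills_whitelist: list[str]) -> str:
    """Fingerprint the configured whitelist so a config change invalidates the cache."""
    return blake3.blake3("\n".join(sorted(skills_whitelist)).encode("utf-8")).hexdigest()

def load_or_create_whitelist(skills_whitelist, expand_whitelist,
                             cache_path: str = "whitelist_cache.json") -> set:
    """Reuse the cached expanded whitelist when its hash still matches the configuration."""
    expected = whitelist_hash(skills_whitelist)
    cache_file = Path(cache_path)
    if cache_file.exists():
        cached = srsly.read_json(cache_file)
        if cached.get("hash") == expected:
            return set(cached["terms"])
    terms = set(expand_whitelist(skills_whitelist))  # hypothetical expansion callback
    srsly.write_json(cache_file, {"hash": expected, "terms": sorted(terms)})
    return terms
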
  5. Final Output Stage:

5.1. Storage Location:

The final results are saved to an Excel file specified via the -o or --output command-line argument. The default file path is "results.xlsx". This is handled by the save_results function.

5.2. Output Format:

The output is an Excel file with two sheets:

  • Summary: Contains aggregated keyword statistics:

    • Keyword: The extracted keyword.
    • Total_Score: The sum of scores for the keyword across all job descriptions.
    • Avg_Score: The average score for the keyword.
    • Job_Count: The number of job descriptions in which the keyword appears.
  • Detailed Scores: A pivot table showing the score of each keyword in each job description.
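
A sketch of this writing step, assuming the sheet names listed above (the real save_results also performs disk-space checks via shutil, omitted here):

import pandas as pd

def save_results(summary: pd.DataFrame, details: pd.DataFrame,
                 output_file: str = "results.xlsx") -> None:
    """Write the summary and detailed pivot table to one Excel workbook."""
    with pd.ExcelWriter(output_file) as writer:
        summary.to_excel(writer, sheet_name="Summary", index=False)
        details.to_excel(writer, sheet_name="Detailed Scores")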

5.3. Purpose of Output:

The output is intended to be used by job seekers or recruiters to identify the most important keywords and skills for a set of job descriptions. The summary sheet provides an overview of the most frequent and highly-scored keywords. The detailed sheet allows for a more granular analysis of keyword scores across individual job descriptions.

5.4. Libraries/Modules Used (Output Stage):

  • pandas: For creating and manipulating DataFrames and writing to Excel.
  • pathlib: For file path manipulation.
  • shutil: For disk space checks.