Alpha version

Pre-release
@DavidOsipov DavidOsipov released this 19 Feb 18:14
· 36 commits to main since this release
0521f98

Changelog

Version 0.01 (Alpha) - 19/02/2025

This is the initial Alpha release of the Job Keyword Analyzer. It provides a robust and optimized solution for extracting keywords from job descriptions, calculating TF-IDF scores, and categorizing keywords.

Major Features:

  • Keyword Extraction:
    • Extracts keywords from job descriptions using TF-IDF.
    • Supports configurable n-gram extraction (unigrams, bigrams, trigrams, etc.).
    • Includes a user-managed whitelist of skills and synonyms for improved accuracy.
    • Performs Named Entity Recognition (NER) to identify relevant entities (configurable).
    • Generates synonyms using WordNet and spaCy's lemmatization.
  • Keyword Categorization:
    • Categorizes keywords based on semantic similarity using spaCy's word embeddings.
    • Falls back to direct keyword matching for improved accuracy.
    • Uses a configurable similarity threshold.
  • Analysis and Output:
    • Calculates adjusted TF-IDF scores, incorporating whitelist boosts and term frequency.
    • Generates two output files:
      • A summary table with total TF-IDF, job count, and average TF-IDF for each keyword.
      • A detailed pivot table showing keyword scores for each job title.
    • Saves results to an Excel file.
  • Configuration and Customization:
    • Highly configurable via a config.yaml file:
      • skills_whitelist: List of important skills.
      • stop_words, stop_words_add, stop_words_exclude: Control over stop word removal.
      • ngram_range: Configurable range for n-gram extraction.
      • whitelist_ngram_range: Configurable range for n-gram extraction for the whitelist.
      • allowed_entity_types: Specify which NER entity types to extract.
      • keyword_categories: Define custom categories and associated keywords.
      • weighting: Adjust the weights for TF-IDF, frequency, and whitelist boosts.
      • spacy_model: Specify the spaCy model to use.
      • cache_size: Configure the size of the preprocessing cache.
      • max_desc_length: Maximum length of job descriptions.
      • min_desc_length: Minimum length of job descriptions.
      • min_jobs: Minimum number of job descriptions required.
      • similarity_threshold: Threshold for semantic similarity categorization.
      • timeout: Timeout for analysis.
  • Usability:
    • Command-line interface (using argparse) for easy execution.
    • Interactive whitelist management (add/remove skills and synonyms).
    • Comprehensive logging for debugging and monitoring.
  • Robustness and Optimization:
    • Extensive error handling with custom exception classes.
    • Input validation to prevent common errors.
    • Memory safety checks to avoid crashes on large inputs.
    • Analysis timeout to prevent indefinite execution.
    • Optimized text preprocessing with caching and batch processing.
    • Efficient synonym generation and n-gram extraction.

Dependencies:

  • Python 3.8+
  • nltk
  • numpy
  • pandas
  • spacy
  • scikit-learn
  • PyYAML
  • psutil
  • hashlib (part of the Python standard library; no separate installation required)

Known Issues:

  • This is an Alpha release, so there may be undiscovered bugs.
  • Performance may vary depending on the size and complexity of the job descriptions.
  • The semantic similarity categorization relies on the quality of the spaCy word embeddings.
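The threshold-based semantic categorization behaves roughly like this sketch, which uses toy NumPy vectors in place of spaCy word embeddings; the category names, vectors, and `SIMILARITY_THRESHOLD` value are illustrative only.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.7  # mirrors the similarity_threshold config option

# Toy 3-d vectors standing in for spaCy word embeddings.
category_vectors = {
    "programming": np.array([1.0, 0.1, 0.0]),
    "databases":   np.array([0.0, 1.0, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def categorize(keyword_vec):
    # Pick the most similar category; fall below the threshold and no category is assigned.
    best, score = max(
        ((name, cosine(keyword_vec, vec)) for name, vec in category_vectors.items()),
        key=lambda pair: pair[1],
    )
    return best if score >= SIMILARITY_THRESHOLD else "uncategorized"

print(categorize(np.array([0.9, 0.2, 0.0])))  # strongly aligned with "programming"
print(categorize(np.array([0.5, 0.5, 0.5])))  # ambiguous vector
```

With real embeddings, keywords the model represents poorly (rare or domain-specific terms) tend to land below the threshold, which is why embedding quality matters here.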

Future Improvements:

  • Further optimization of keyword extraction and scoring.
  • Improved handling of rare or domain-specific keywords.
  • Enhanced user interface.
  • Integration with other ATS systems.
  • More sophisticated synonym generation techniques.

How to Run:

  1. Install the required dependencies: pip install nltk numpy pandas spacy scikit-learn pyyaml psutil, then download the spaCy model named in your configuration (for example, python -m spacy download en_core_web_md; semantic categorization needs a model that ships with word vectors).
  2. Create a config.yaml file to customize the analysis (see the documentation for details).
  3. Create a JSON file containing your job descriptions (see examples in the documentation).
  4. Run the script from the command line: python your_script_name.py -i input.json -o output.xlsx -c config.yaml
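For step 2, a minimal config.yaml might look like the following. All values shown are illustrative, not the tool's defaults; see the option list above for what each key controls.

```yaml
skills_whitelist:
  - python
  - machine learning
stop_words_add: [responsibilities, candidate]
ngram_range: [1, 3]
whitelist_ngram_range: [1, 2]
allowed_entity_types: [ORG, PRODUCT]
keyword_categories:
  programming: [python, java, c++]
  databases: [sql, postgresql]
weighting:
  tfidf: 0.7
  frequency: 0.3
  whitelist_boost: 1.5
spacy_model: en_core_web_md
similarity_threshold: 0.7
timeout: 600
```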

We welcome feedback and contributions!

Full Changelog: https://github.com/DavidOsipov/Keywords4Cv/commits/0.01