Alpha version
Pre-release
Changelog
Version 0.01 (Alpha) - 19/02/2025
This is the initial Alpha release of the Job Keyword Analyzer. It provides a robust and optimized solution for extracting keywords from job descriptions, calculating TF-IDF scores, and categorizing keywords.
Major Features:
- Keyword Extraction:
- Extracts keywords from job descriptions using TF-IDF.
- Supports configurable n-gram extraction (unigrams, bigrams, trigrams, etc.).
- Includes a user-managed whitelist of skills and synonyms for improved accuracy.
- Performs Named Entity Recognition (NER) to identify relevant entities (configurable).
- Generates synonyms using WordNet and spaCy's lemmatization.
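The configurable n-gram extraction described above can be sketched in plain Python. This is a minimal illustration of the technique, not the project's actual code; the function name and whitespace tokenization are assumptions:

```python
def extract_ngrams(text, ngram_range=(1, 2)):
    """Return all n-grams within the configured (min_n, max_n) range."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    ngrams = []
    for n in range(lo, hi + 1):
        # Slide a window of size n over the token list
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

# Unigrams and bigrams from a short phrase
print(extract_ngrams("machine learning engineer", (1, 2)))
# → ['machine', 'learning', 'engineer', 'machine learning', 'learning engineer']
```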
- Keyword Categorization:
- Categorizes keywords based on semantic similarity using spaCy's word embeddings.
- Falls back to direct keyword matching for improved accuracy.
- Uses a configurable similarity threshold.
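The categorization logic can be sketched as follows. The real tool uses spaCy word embeddings; here, toy NumPy vectors stand in for them, and the function names, category names, and the match-then-similarity ordering are illustrative assumptions:

```python
import numpy as np

def categorize(keyword, categories, vectors, threshold=0.6):
    """Assign a keyword to a category: direct keyword matching first,
    then cosine similarity between vectors above a configurable threshold."""
    # Direct keyword matching
    for name, members in categories.items():
        if keyword in members:
            return name
    # Semantic-similarity fallback
    kv = vectors.get(keyword)
    if kv is None:
        return None
    best_name, best_score = None, threshold
    for name, members in categories.items():
        for member in members:
            mv = vectors.get(member)
            if mv is None:
                continue
            score = float(np.dot(kv, mv) / (np.linalg.norm(kv) * np.linalg.norm(mv)))
            if score > best_score:
                best_name, best_score = name, score
    return best_name

# Toy vectors standing in for spaCy embeddings
vectors = {
    "python": np.array([1.0, 0.1]),
    "java": np.array([0.9, 0.2]),
    "teamwork": np.array([0.0, 1.0]),
}
categories = {"Programming": ["python"], "Soft Skills": ["teamwork"]}
print(categorize("java", categories, vectors))  # similar to "python" → "Programming"
```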
- Analysis and Output:
- Calculates adjusted TF-IDF scores, incorporating whitelist boosts and term frequency.
- Generates two output files:
- A summary table with total TF-IDF, job count, and average TF-IDF for each keyword.
- A detailed pivot table showing keyword scores for each job title.
- Saves results to an Excel file.
- Configuration and Customization:
- Highly configurable via a config.yaml file:
  - skills_whitelist: List of important skills.
  - stop_words, stop_words_add, stop_words_exclude: Control over stop word removal.
  - ngram_range: Configurable range for n-gram extraction.
  - whitelist_ngram_range: Configurable n-gram range for the whitelist.
  - allowed_entity_types: Specify which NER entity types to extract.
  - keyword_categories: Define custom categories and associated keywords.
  - weighting: Adjust the weights for TF-IDF, frequency, and whitelist boosts.
  - spacy_model: Specify the spaCy model to use.
  - cache_size: Configure the size of the preprocessing cache.
  - max_desc_length: Maximum length of job descriptions.
  - min_desc_length: Minimum length of job descriptions.
  - min_jobs: Minimum number of job descriptions required.
  - similarity_threshold: Threshold for semantic similarity categorization.
  - timeout: Timeout for analysis.
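A minimal config.yaml using these keys might look like the following. The values shown are illustrative defaults for the sketch, not the project's shipped configuration:

```yaml
skills_whitelist:
  - python
  - machine learning
stop_words_add:
  - responsibilities
stop_words_exclude:
  - it
ngram_range: [1, 3]
whitelist_ngram_range: [1, 2]
allowed_entity_types: [ORG, PRODUCT]
keyword_categories:
  Programming: [python, java, sql]
  Soft Skills: [communication, teamwork]
weighting:
  tfidf: 0.7
  frequency: 0.2
  whitelist_boost: 0.1
spacy_model: en_core_web_sm
cache_size: 1000
max_desc_length: 100000
min_desc_length: 50
min_jobs: 2
similarity_threshold: 0.6
timeout: 600
```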
- Usability:
- Command-line interface (using argparse) for easy execution.
- Interactive whitelist management (add/remove skills and synonyms).
- Comprehensive logging for debugging and monitoring.
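The argparse interface can be sketched as below. Flag names follow the invocation shown under "How to Run"; the long option names and help strings are assumptions:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Analyze keywords in job descriptions.")
    parser.add_argument("-i", "--input", required=True, help="JSON file of job descriptions")
    parser.add_argument("-o", "--output", required=True, help="Excel file for results")
    parser.add_argument("-c", "--config", default="config.yaml", help="YAML configuration file")
    return parser

# Parse a sample command line (falls back to the default config path)
args = build_parser().parse_args(["-i", "input.json", "-o", "output.xlsx"])
print(args.input, args.output, args.config)
```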
- Robustness and Optimization:
- Extensive error handling with custom exception classes.
- Input validation to prevent common errors.
- Memory safety checks to avoid crashes on large inputs.
- Analysis timeout to prevent indefinite execution.
- Optimized text preprocessing with caching and batch processing.
- Efficient synonym generation and n-gram extraction.
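The preprocessing cache can be sketched with functools.lru_cache. The function body here is a placeholder for the real spaCy cleanup/lemmatization pipeline:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)  # corresponds to the cache_size config option
def preprocess(text: str) -> str:
    """Placeholder for the real cleanup/lemmatization step."""
    return " ".join(text.lower().split())

preprocess("Senior  Python Developer")
preprocess("Senior  Python Developer")  # second call is served from the cache
info = preprocess.cache_info()
print(info)  # hits=1, misses=1
```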
Dependencies:
- Python 3.8+
- nltk
- numpy
- pandas
- spacy
- scikit-learn
- PyYAML
- psutil
- hashlib (Python standard library; no installation required)
Known Issues:
- This is an Alpha release, so there may be undiscovered bugs.
- Performance may vary depending on the size and complexity of the job descriptions.
- The semantic similarity categorization relies on the quality of the spaCy word embeddings.
Future Improvements:
- Further optimization of keyword extraction and scoring.
- Improved handling of rare or domain-specific keywords.
- Enhanced user interface.
- Integration with other ATS systems.
- More sophisticated synonym generation techniques.
How to Run:
- Install the required dependencies:
pip install nltk numpy pandas spacy scikit-learn pyyaml psutil
- Create a config.yaml file to customize the analysis (see the documentation for details).
- Create a JSON file containing your job descriptions (see examples in the documentation).
- Run the script from the command line:
python your_script_name.py -i input.json -o output.xlsx -c config.yaml
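A job-descriptions file for the -i flag might look like the following. The exact schema is documented with the project; the title-to-description mapping shown here is an illustrative assumption:

```json
{
  "Data Scientist": "We are looking for experience with Python, machine learning, and statistics.",
  "Backend Engineer": "Strong Python and SQL skills, plus REST API design experience."
}
```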
We welcome feedback and contributions!
Full Changelog: https://github.com/DavidOsipov/Keywords4Cv/commits/0.01