Alpha version
Pre-release
Changelog
Version 0.01 (Alpha) - 19/02/2025
This is the initial Alpha release of the Job Keyword Analyzer. It provides a robust and optimized solution for extracting keywords from job descriptions, calculating TF-IDF scores, and categorizing keywords.
Major Features:
- Keyword Extraction:
- Extracts keywords from job descriptions using TF-IDF.
- Supports configurable n-gram extraction (unigrams, bigrams, trigrams, etc.).
- Includes a user-managed whitelist of skills and synonyms for improved accuracy.
- Performs Named Entity Recognition (NER) to identify relevant entities (configurable).
- Generates synonyms using WordNet and spaCy's lemmatization.
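The configurable n-gram extraction described above can be sketched in plain Python. This is a minimal illustration of the technique, not the project's actual code; the function name and whitespace tokenization are assumptions:

```python
def extract_ngrams(text, ngram_range=(1, 2)):
    """Return all n-grams within the configured (min_n, max_n) range."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    ngrams = []
    for n in range(lo, hi + 1):
        # Slide a window of size n over the token list
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

# Unigrams and bigrams from a short phrase
print(extract_ngrams("machine learning engineer", (1, 2)))
# → ['machine', 'learning', 'engineer', 'machine learning', 'learning engineer']
```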
- Keyword Categorization:
- Categorizes keywords based on semantic similarity using spaCy's word embeddings.
- Falls back to direct keyword matching for improved accuracy.
- Uses a configurable similarity threshold.
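The categorization logic can be sketched as follows. The real tool uses spaCy word embeddings; here, toy NumPy vectors stand in for them, and the function names, category names, and the match-then-similarity ordering are illustrative assumptions:

```python
import numpy as np

def categorize(keyword, categories, vectors, threshold=0.6):
    """Assign a keyword to a category: direct keyword matching first,
    then cosine similarity between vectors above a configurable threshold."""
    # Direct keyword matching
    for name, members in categories.items():
        if keyword in members:
            return name
    # Semantic-similarity fallback
    kv = vectors.get(keyword)
    if kv is None:
        return None
    best_name, best_score = None, threshold
    for name, members in categories.items():
        for member in members:
            mv = vectors.get(member)
            if mv is None:
                continue
            score = float(np.dot(kv, mv) / (np.linalg.norm(kv) * np.linalg.norm(mv)))
            if score > best_score:
                best_name, best_score = name, score
    return best_name

# Toy vectors standing in for spaCy embeddings
vectors = {
    "python": np.array([1.0, 0.1]),
    "java": np.array([0.9, 0.2]),
    "teamwork": np.array([0.0, 1.0]),
}
categories = {"Programming": ["python"], "Soft Skills": ["teamwork"]}
print(categorize("java", categories, vectors))  # similar to "python" → "Programming"
```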
- Analysis and Output:
- Calculates adjusted TF-IDF scores, incorporating whitelist boosts and term frequency.
- Generates two output files:
- A summary table with total TF-IDF, job count, and average TF-IDF for each keyword.
- A detailed pivot table showing keyword scores for each job title.
- Saves results to an Excel file.
- Configuration and Customization:
- Highly configurable via a config.yaml file:
  - skills_whitelist: List of important skills.
  - stop_words, stop_words_add, stop_words_exclude: Control over stop word removal.
  - ngram_range: Configurable range for n-gram extraction.
  - whitelist_ngram_range: Configurable n-gram range for the whitelist.
  - allowed_entity_types: Specify which NER entity types to extract.
  - keyword_categories: Define custom categories and associated keywords.
  - weighting: Adjust the weights for TF-IDF, frequency, and whitelist boosts.
  - spacy_model: Specify the spaCy model to use.
  - cache_size: Configure the size of the preprocessing cache.
  - max_desc_length: Maximum length of job descriptions.
  - min_desc_length: Minimum length of job descriptions.
  - min_jobs: Minimum number of job descriptions required.
  - similarity_threshold: Threshold for semantic similarity categorization.
  - timeout: Timeout for analysis.
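A minimal config.yaml using these keys might look like the following. The values shown are illustrative defaults for the sketch, not the project's shipped configuration:

```yaml
skills_whitelist:
  - python
  - machine learning
stop_words_add:
  - responsibilities
stop_words_exclude:
  - it
ngram_range: [1, 3]
whitelist_ngram_range: [1, 2]
allowed_entity_types: [ORG, PRODUCT]
keyword_categories:
  Programming: [python, java, sql]
  Soft Skills: [communication, teamwork]
weighting:
  tfidf: 0.7
  frequency: 0.2
  whitelist_boost: 0.1
spacy_model: en_core_web_sm
cache_size: 1000
max_desc_length: 100000
min_desc_length: 50
min_jobs: 2
similarity_threshold: 0.6
timeout: 600
```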
- Usability:
- Command-line interface (using argparse) for easy execution.
- Interactive whitelist management (add/remove skills and synonyms).
- Comprehensive logging for debugging and monitoring.
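The argparse interface can be sketched as below. Flag names follow the invocation shown under "How to Run"; the long option names and help strings are assumptions:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Analyze keywords in job descriptions.")
    parser.add_argument("-i", "--input", required=True, help="JSON file of job descriptions")
    parser.add_argument("-o", "--output", required=True, help="Excel file for results")
    parser.add_argument("-c", "--config", default="config.yaml", help="YAML configuration file")
    return parser

# Parse a sample command line (falls back to the default config path)
args = build_parser().parse_args(["-i", "input.json", "-o", "output.xlsx"])
print(args.input, args.output, args.config)
```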
- Robustness and Optimization:
- Extensive error handling with custom exception classes.
- Input validation to prevent common errors.
- Memory safety checks to avoid crashes on large inputs.
- Analysis timeout to prevent indefinite execution.
- Optimized text preprocessing with caching and batch processing.
- Efficient synonym generation and n-gram extraction.
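The preprocessing cache can be sketched with functools.lru_cache. The function body here is a placeholder for the real spaCy cleanup/lemmatization pipeline:

```python
from functools import lru_cache

@lru_cache(maxsize=1000)  # corresponds to the cache_size config option
def preprocess(text: str) -> str:
    """Placeholder for the real cleanup/lemmatization step."""
    return " ".join(text.lower().split())

preprocess("Senior  Python Developer")
preprocess("Senior  Python Developer")  # second call is served from the cache
info = preprocess.cache_info()
print(info)  # hits=1, misses=1
```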
Dependencies:
- Python 3.8+
- nltk
- numpy
- pandas
- spacy
- scikit-learn
- PyYAML
- psutil
- hashlib (Python standard library; no installation required)
Known Issues:
- This is an Alpha release, so there may be undiscovered bugs.
- Performance may vary depending on the size and complexity of the job descriptions.
- The semantic similarity categorization relies on the quality of the spaCy word embeddings.
Future Improvements:
- Further optimization of keyword extraction and scoring.
- Improved handling of rare or domain-specific keywords.
- Enhanced user interface.
- Integration with other ATS systems.
- More sophisticated synonym generation techniques.
How to Run:
- Install the required dependencies:
pip install nltk numpy pandas spacy scikit-learn pyyaml psutil
- Create a config.yaml file to customize the analysis (see the documentation for details).
- Create a JSON file containing your job descriptions (see examples in the documentation).
- Run the script from the command line:
python your_script_name.py -i input.json -o output.xlsx -c config.yaml
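A job-descriptions file for the -i flag might look like the following. The exact schema is documented with the project; the title-to-description mapping shown here is an illustrative assumption:

```json
{
  "Data Scientist": "We are looking for experience with Python, machine learning, and statistics.",
  "Backend Engineer": "Strong Python and SQL skills, plus REST API design experience."
}
```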
We welcome feedback and contributions!
Full Changelog: https://github.com/DavidOsipov/Keywords4Cv/commits/0.01