This project aims to identify the most important words in the German language from a dataset containing 3 million sentences. The goal is to preprocess the data, analyze word frequencies, and determine key words that are essential for learning German.

amirdarvishi/German_essential_words

German Words Analysis

Project Overview

This project aims to identify the most important words in the German language using a dataset of 3 million sentences. The main objectives are to preprocess the data, analyze word frequencies, and determine key words for learning German. Additionally, the project includes data quality checks to ensure the dataset's integrity.
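As a rough illustration of the preprocessing and frequency-analysis steps described above, here is a minimal sketch. The function names `preprocess` and `word_frequencies`, the tokenization regex, and the choice to lowercase every word are assumptions made for this example. They are not taken from the repository's `src/` modules, which may work differently.

```python
import re
from collections import Counter

# Assumption: a token is a run of German letters (including umlauts and ß).
TOKEN_RE = re.compile(r"[a-zäöüß]+")


def preprocess(sentence):
    """Lowercase a sentence and return its word tokens (hypothetical helper)."""
    return TOKEN_RE.findall(sentence.lower())


def word_frequencies(sentences):
    """Count how often each token appears across an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(preprocess(sentence))
    return counts


# Tiny in-memory sample; the real dataset would be read line by line from data/.
sample = [
    "Der Hund läuft.",
    "Die Katze schläft, und der Hund auch.",
]
freqs = word_frequencies(sample)
print(freqs.most_common(3))
```

Lowercasing merges capitalized nouns with identically spelled lowercase words, which may or may not be desirable for German. Treat this as one possible design choice, not the project's documented behavior.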

Directory Structure

  • data/: Contains the dataset file german_sentences_sample.txt.
  • src/: Contains the source code for preprocessing, analyzing, and quality checking the data.
    • preprocess.py: Preprocessing functions for cleaning the text data.
    • analyze.py: Functions for analyzing word frequencies.
    • quality_check.py: Functions for checking data quality.
  • tests/: Contains test cases for the source code.
    • test_preprocess.py: Tests for preprocessing functions.
    • test_analyze.py: Tests for analysis functions.
    • test_quality_check.py: Tests for data quality check functions.
  • requirements.txt: Required Python packages.
  • README.md: Project documentation.
  • main.py: Main script to run the project.

Setup Instructions

  1. Clone the repository:

    git clone https://github.com/amirdarvishi/German_essential_words.git
    cd German_essential_words
  2. Download the dataset from this link and place it in the data/ directory.

  3. Install the required packages:

    pip install -r requirements.txt
    
  4. Run the main script:

    python main.py
    

Data Quality Checks

The data quality check function provides metrics on:

  • Total number of sentences
  • Number of empty sentences
  • Number of repeated sentences

These metrics help in understanding the quality and cleanliness of the dataset.
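The three metrics above could be computed roughly as follows. This is a sketch under stated assumptions, not the code in `quality_check.py`. The function name `quality_metrics` and the dictionary keys are hypothetical. A sentence is treated as empty if it contains only whitespace, and "repeated" is taken to mean every occurrence after the first.

```python
from collections import Counter


def quality_metrics(sentences):
    """Return total, empty, and repeated sentence counts (hypothetical helper)."""
    total = 0
    empty = 0
    seen = Counter()
    for sentence in sentences:
        total += 1
        text = sentence.strip()
        if not text:
            # Assumption: whitespace-only lines count as empty sentences.
            empty += 1
        else:
            seen[text] += 1
    # Assumption: each extra occurrence beyond the first counts as one repeat.
    repeated = sum(n - 1 for n in seen.values() if n > 1)
    return {
        "total_sentences": total,
        "empty_sentences": empty,
        "repeated_sentences": repeated,
    }


sample = ["Guten Morgen.", "", "Guten Morgen.", "Wie geht's?", "   "]
metrics = quality_metrics(sample)
print(metrics)
```

The function makes a single pass over its input, so it can also be fed a file object line by line. Note that it keeps one counter entry for every distinct non-empty sentence, so memory use grows with the number of unique sentences.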
