German Words Analysis

Project Overview

This project aims to identify the most important words in the German language from a dataset of 3 million sentences. Its main objectives are to preprocess the data, analyze word frequencies, and determine the key words for learning German. It also includes data quality checks to verify the dataset's integrity.
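As a rough illustration of the frequency analysis described above, the sketch below lowercases each sentence, extracts alphabetic tokens, and counts them. The names `tokenize` and `top_words` and the regular expression are assumptions made for this example; the actual functions in src/preprocess.py and src/analyze.py may be named and implemented differently.

```python
import re
from collections import Counter

# Hypothetical tokenizer pattern: lowercase Latin letters plus German umlauts and ß.
# The repository's own preprocessing may use different rules.
WORD_RE = re.compile(r"[a-zäöüß]+")


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return WORD_RE.findall(sentence.lower())


def top_words(sentences, n=10):
    """Return the n most frequent words across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts.most_common(n)


# Tiny made-up sample, used only to show the call.
sample = [
    "Der Hund läuft im Park.",
    "Die Katze schläft, und der Hund bellt.",
]
print(top_words(sample, n=2))  # [('der', 2), ('hund', 2)]
```

Raw frequency is only one possible measure of a word's importance; lowercasing also merges capitalized nouns with sentence-initial words, which may or may not be desirable for German.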

Directory Structure

data/: Contains the dataset file german_sentences_sample.txt.
src/: Contains the source code for preprocessing, analyzing, and quality checking the data.
- preprocess.py: Preprocessing functions for cleaning the text data.
- analyze.py: Functions for analyzing word frequencies.
- quality_check.py: Functions for checking data quality.
tests/: Contains test cases for the source code.
- test_preprocess.py: Tests for preprocessing functions.
- test_analyze.py: Tests for analysis functions.
- test_quality_check.py: Tests for data quality check functions.
requirements.txt: Required Python packages.
README.md: Project documentation.
main.py: Main script to run the project.

Setup Instructions

Clone the repository:

git clone https://github.com/yourusername/german-words-analysis.git
cd german-words-analysis

Download the dataset from this link and place it in the data/ directory.
Install the required packages:
```
pip install -r requirements.txt
```
Run the main script:
```
python main.py
```

Data Quality Checks

The data quality check function provides metrics on:

- Total number of sentences
- Number of empty sentences
- Number of repeated sentences

These metrics help in understanding the quality and cleanliness of the dataset.
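The three metrics could be computed along the following lines. This is a minimal sketch, not the code in src/quality_check.py: the function name `quality_metrics` is hypothetical, and it assumes that whitespace-only lines count as empty and that every occurrence of a sentence after its first counts as a repeat.

```python
from collections import Counter


def quality_metrics(sentences):
    """Return total, empty, and repeated sentence counts (hypothetical helper)."""
    stripped = [s.strip() for s in sentences]
    non_empty = [s for s in stripped if s]
    counts = Counter(non_empty)
    # Assumption: each occurrence beyond the first is counted as one repeat.
    repeated = sum(c - 1 for c in counts.values() if c > 1)
    return {
        "total": len(stripped),
        "empty": len(stripped) - len(non_empty),
        "repeated": repeated,
    }


# Made-up input, used only to show the call.
lines = ["Guten Morgen.", "", "Guten Morgen.", "Wie geht es dir?", "   "]
print(quality_metrics(lines))  # {'total': 5, 'empty': 2, 'repeated': 1}
```

For a dataset of several million sentences, the same counts can be accumulated while streaming the file line by line instead of loading everything into a list first.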

amirdarvishi/German_essential_words
