The repository implements the method described in the paper
Li, K., Mai, F., Shen, R., & Yan, X. (2020). Measuring corporate culture using machine learning. Review of Financial Studies, forthcoming.
The code is tested on Ubuntu 18.04 and macOS Catalina.
This repository was forked to improve on the authors' existing implementation as part of a project assigned during my stint as a Research Assistant.
While exploring the repository as preliminary work for the project, the word2vec model was unable to vectorize words outside its training vocabulary, so support was added for Facebook's fastText model, which looks at the character n-grams making up each word rather than whole words. This allows words that aren't in the corpus to be vectorized (see the sketch after the instructions below).
To implement fastText instead of word2vec:
- Run `python clean_and_train.py --mode=fasttext` in step 2 of "Running the Code".
- Run `python create_unseen_dict.py` instead of `python create_dict.py` in step 3 of "Running the Code".
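For intuition on why fastText helps with out-of-vocabulary words, here is a minimal gensim sketch. The toy corpus and parameters are made up for illustration and are not the repo's training setup (argument names follow gensim 4.x; older versions use `size`/`iter`):

```python
from gensim.models import FastText

# Toy corpus: two short tokenized "sentences" (illustrative only)
sentences = [["employees", "value", "teamwork"], ["we", "reward", "integrity"]]

# Train a tiny fastText model; character n-grams let it build vectors for unseen words
model = FastText(sentences, vector_size=32, window=3, min_count=1, epochs=10)

# "teamworks" never appears in the corpus, but its n-grams overlap with "teamwork",
# so fastText can still return a vector (plain word2vec would raise a KeyError here)
print(model.wv["teamworks"][:5])
```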
The code requires

- Python 3.6+
- The required Python packages, which can be installed via `pip install -r requirements.txt`
- Stanford CoreNLP v3.9.2, downloaded and uncompressed. Newer versions may work, but they are not tested. Either set the environment variable `CORENLP_HOME` to the location of the uncompressed folder, or edit the following line in `global_options.py` to point to that location, for example:
  `os.environ["CORENLP_HOME"] = "/home/user/stanford-corenlp-full-2018-10-05/"`
- The requirements for CoreNLP itself; for example, you need to have Java installed.
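An optional pre-flight check (not part of the repo) that the CoreNLP location is set before running the pipeline:

```python
import os

# Confirm that CORENLP_HOME points at the uncompressed CoreNLP folder
corenlp_home = os.environ.get("CORENLP_HOME")
if not corenlp_home or not os.path.isdir(corenlp_home):
    raise RuntimeError("Set CORENLP_HOME, or edit global_options.py, to the uncompressed CoreNLP folder")
print(f"Using CoreNLP at {corenlp_home}")
```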
We included some example data in the `data/input/` folder. The three files are

- `documents.txt`: Each line is a document (e.g., each earnings call). Each document needs to have line breaks removed. The file has no header row.
- `document_ids.txt`: Each line is a document ID (e.g., a unique identifier for each earnings call). A document ID cannot contain `_` or whitespace. The file has no header row.
- (Optional) `id2firms.csv`: A CSV file with three columns (`document_id`: str, `firm_id`: str, `time`: int). The file has a header row.
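A small optional sanity check (not part of the repo) that the input files follow this format:

```python
from pathlib import Path

# One document per line, with a matching ID on each line of document_ids.txt
docs = Path("data/input/documents.txt").read_text(encoding="utf-8").splitlines()
doc_ids = Path("data/input/document_ids.txt").read_text(encoding="utf-8").splitlines()

assert len(docs) == len(doc_ids), "documents.txt and document_ids.txt must have the same number of lines"
assert all("_" not in i and not any(ch.isspace() for ch in i) for i in doc_ids), \
    "document IDs may not contain '_' or whitespace"
```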
NEW: `data_to_input.py` has been added to read PDF files and write them to a text file for parsing.

Instructions:

- Put all the PDF files in a directory.
- Edit line 9 to point to the directory containing the PDF files.
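As a rough sketch of the idea only (the actual `data_to_input.py` may differ; pdfminer.six and the directory path below are assumptions):

```python
from pathlib import Path
from pdfminer.high_level import extract_text  # any PDF text extractor works; pdfminer.six shown here

PDF_DIR = Path("/path/to/pdfs")  # hypothetical; data_to_input.py takes this from line 9

with open("data/input/documents.txt", "w", encoding="utf-8") as docs_f, \
     open("data/input/document_ids.txt", "w", encoding="utf-8") as ids_f:
    for pdf_path in sorted(PDF_DIR.glob("*.pdf")):
        text = extract_text(str(pdf_path))
        docs_f.write(" ".join(text.split()) + "\n")  # one document per line, line breaks removed
        ids_f.write(pdf_path.stem.replace("_", "-").replace(" ", "-") + "\n")  # IDs without "_" or spaces
```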
You can configure global options in `global_options.py`. The most important options are perhaps:
- The RAM allocated for CoreNLP
- The number of CPU cores
- The seed words
- The max number of words to include in each dimension. Note that after filtering and deduplication (each word can only be loaded under a single dimension), the number of words will be smaller.
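For orientation, here is an illustrative excerpt of the kind of settings involved. The variable names and values are approximate assumptions; consult `global_options.py` for the real ones:

```python
# Illustrative only -- check global_options.py for the actual variable names and defaults
RAM_CORENLP = "8G"     # memory given to the CoreNLP server
N_CORES = 4            # CPU cores used for parsing and training
SEED_WORDS = {         # seed words per culture dimension (truncated example)
    "integrity": ["integrity", "ethics", "accountability"],
    "teamwork": ["teamwork", "collaboration", "cooperation"],
}
N_WORDS_DIM = 500      # max words per dimension before filtering and deduplication
```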
1. Use `python parse.py` to parse the raw documents with Stanford CoreNLP. The parsed files are output in the `data/processed/parsed/` folder:
   - `documents.txt`: Each line is a sentence.
   - `document_sent_ids.txt`: Each line is an ID in the format `docID_sentenceID` (e.g. doc0_0, doc0_1, ..., doc1_0, doc1_1, doc1_2, ...). Each line corresponds to the same line in `documents.txt`.
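   An optional sketch (not part of the repo) showing how the ID format can be used to regroup parsed sentences under their original documents:

   ```python
   from collections import defaultdict

   # Each sentence ID is "docID_sentenceID"; pair it with the sentence on the same line
   sentences_by_doc = defaultdict(list)
   with open("data/processed/parsed/documents.txt", encoding="utf-8") as sents, \
        open("data/processed/parsed/document_sent_ids.txt", encoding="utf-8") as sent_ids:
       for sentence, sent_id in zip(sents, sent_ids):
           doc_id, _, sent_idx = sent_id.strip().rpartition("_")  # e.g. "doc0_12" -> ("doc0", "12")
           sentences_by_doc[doc_id].append((int(sent_idx), sentence.strip()))
   ```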
2. Use `python clean_and_train.py` to clean the parsed `documents.txt`, removing stopwords and named entities. The program then learns corpus-specific phrases using gensim and concatenates them. Finally, the program trains the `word2vec` model. The options can be configured in the `global_options.py` file. The program outputs the following three files:
   - `data/processed/unigram/documents_cleaned.txt`: Each line is a sentence. NERs are replaced by tags. Stopwords, 1-letter words, punctuation marks, and pure numeric tokens are removed. MWEs and compound words are concatenated.
   - `data/processed/bigram/documents_cleaned.txt`: Each line is a sentence. 2-word phrases are concatenated.
   - `data/processed/trigram/documents_cleaned.txt`: Each line is a sentence. 3-word phrases are concatenated. This is the final corpus for training the word2vec model and scoring.

   The program also saves the following gensim models:
   - `models/phrases/bigram.mod`: phrase model for 2-word phrases
   - `models/phrases/trigram.mod`: phrase model for 3-word phrases
   - `models/w2v/w2v.mod`: word2vec model
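   A minimal sketch for inspecting the saved models with gensim (the example tokens and the seed word are assumptions and may not be in your vocabulary):

   ```python
   from gensim.models import Word2Vec
   from gensim.models.phrases import Phrases

   # Load the saved models (paths from the list above)
   w2v = Word2Vec.load("models/w2v/w2v.mod")
   bigram = Phrases.load("models/phrases/bigram.mod")

   # Nearest neighbours of a seed word, if it survived cleaning and is in the vocabulary
   if "teamwork" in w2v.wv:
       print(w2v.wv.most_similar("teamwork", topn=5))

   # Apply the bigram model to a tokenized sentence; detected phrases get joined with "_"
   print(bigram[["employee", "stock", "option", "plan"]])
   ```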
3. Use `python create_dict.py` to create the expanded dictionary. The program outputs the following file:
   - `outputs/dict/expanded_dict.csv`: A CSV file with the number of columns equal to the number of dimensions in the dictionary (five in the paper). The column headers are the dimension names.

   (Optional): It is possible to manually remove items from or add items to `expanded_dict.csv` before scoring the documents.
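   A short sketch for inspecting (and, if desired, editing) the expanded dictionary, assuming pandas is available:

   ```python
   import pandas as pd

   # Each column is one culture dimension; cells are the expanded words/phrases
   expanded = pd.read_csv("outputs/dict/expanded_dict.csv")
   print(expanded.columns.tolist())
   print(expanded.head())

   # Words can be deleted or added here before running score.py, then written back:
   # expanded.to_csv("outputs/dict/expanded_dict.csv", index=False)
   ```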
4. Use `python score.py` to score the documents. Note that the output scores for the documents are not adjusted by the document length. The program outputs three sets of scores:
   - `outputs/scores/scores_TF.csv`: using raw term counts or term frequency (TF),
   - `outputs/scores/scores_TFIDF.csv`: using TF-IDF weights,
   - `outputs/scores/scores_WFIDF.csv`: using TF-IDF with log normalization (WFIDF).

   (Optional): It is possible to use additional weights on the words (see `score.score_tf_idf()` for details).
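   For intuition only, here is how the three weighting schemes relate for a single dictionary term in one document; the exact IDF variant used by `score.py` may differ:

   ```python
   import math

   # tf = raw count of the term in the document, df = number of documents containing the term
   def tf_weight(tf):
       return tf                                   # scores_TF.csv

   def tf_idf_weight(tf, df, n_docs):
       return tf * math.log(n_docs / df)           # scores_TFIDF.csv

   def wf_idf_weight(tf, df, n_docs):
       wf = 1 + math.log(tf) if tf > 0 else 0      # log-normalized term frequency
       return wf * math.log(n_docs / df)           # scores_WFIDF.csv
   ```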
5. (Optional): Use `python aggregate_firms.py` to aggregate the scores to the firm-time level. The final scores are adjusted by the document lengths.
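   A hedged sketch of the aggregation idea only; the column names below are assumptions, and the actual logic lives in `aggregate_firms.py`:

   ```python
   import pandas as pd

   # Merge document-level scores with firm/time IDs, sum per firm-time, then adjust by length
   scores = pd.read_csv("outputs/scores/scores_TF.csv")          # assumed to include document_id and document_length
   id2firms = pd.read_csv("data/input/id2firms.csv")
   merged = scores.merge(id2firms, on="document_id", how="inner")

   dims = [c for c in merged.columns if c not in ("document_id", "firm_id", "time", "document_length")]
   grouped = merged.groupby(["firm_id", "time"])[dims + ["document_length"]].sum()
   firm_scores = grouped[dims].div(grouped["document_length"], axis=0)  # length-adjusted firm-time scores
   ```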