pyspark-dataframe-searcher

Taiwo's pyspark dataframe feature extraction problem.

Problem description

This problem was originally posted on StackOverflow. It involves searching a two-dimensional array for matching values, and it can be regarded as a two-dimensional pattern matching problem (Ref. [1]).

For example:

Given a pyspark dataframe of the form:

id  col1  col2 col3 col4
------------------------
as1  4    10    4    6
as2  6    3     6    1
as3  6    0     2    1
as4  8    8     6    1
as5  9    6     6    9

The objective is to search col2 through col4 of the pyspark dataframe for the values that appear in col1, and to return the id (row name) and the column name of every match. The feasible solutions are the following:

In col1, 4 is found in (as1, col3)
In col1, 6 is found in (as1, col4), (as2, col3), (as4, col3), (as5, col2), (as5, col3)
In col1, 8 is found in (as4, col2)
In col1, 9 is found in (as5, col4)

Hint: Treat col1 as a set of distinct values, i.e. {4, 6, 8, 9}.

Solution

This repository attempts to solve the problem stated above using pyspark.

Windows users, run the following commands:

git clone https://github.com/taiwotman/pyspark-dataframe-searcher.git

spark2-submit --py-files job_searchdataframe.zip main.py --search >> output_`date +\%m\%d\%y\%T`.txt

Unix users, run the shell script run.ksh:

chmod 777 run.ksh

./run.ksh

Output

Refer to the generated output text file (for example, output_031219.txt in this repository).

Requirements

pyspark>=2.4.0

Installation

pip install pyspark

To do:

Improve the runtime (i.e. end time minus start time).
Improve the search to obtain the rows and columns (i.e. x,y coordinates) of the feasible solutions.
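The two to-do items can be sketched in plain Python, leaving Spark aside so that the idea stays visible. The function below returns zero-based (row index, column index) pairs for each distinct col1 value, and the surrounding lines measure the runtime as end time minus start time. The function name `find_coordinates` and its parameters are hypothetical. The timing covers only this in-memory call and says nothing about the runtime of a Spark job.

```python
import time

header = ["id", "col1", "col2", "col3", "col4"]
table = [
    ["as1", 4, 10, 4, 6],
    ["as2", 6, 3, 6, 1],
    ["as3", 6, 0, 2, 1],
    ["as4", 8, 8, 6, 1],
    ["as5", 9, 6, 6, 9],
]


def find_coordinates(table, header, key_col="col1",
                     search_cols=("col2", "col3", "col4")):
    """Map each distinct key_col value to the (row, column) index pairs where it occurs."""
    key_idx = header.index(key_col)
    col_idxs = [header.index(c) for c in search_cols]
    targets = {row[key_idx] for row in table}   # col1 treated as a set
    found = {t: [] for t in sorted(targets)}
    for y, row in enumerate(table):             # y: zero-based row index
        for x in col_idxs:                      # x: zero-based column index in header
            if row[x] in targets:
                found[row[x]].append((y, x))
    return found


start = time.perf_counter()
coords = find_coordinates(table, header)
elapsed = time.perf_counter() - start           # runtime = end time - start time

for value, positions in coords.items():
    print(value, positions)
print(f"elapsed: {elapsed:.6f} s")
```

Here the pair (4, 2) under the value 6 means row index 4 (as5) and column index 2 (col2). Indices are used instead of names so that the result can be read directly as coordinates into the table.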

Reference

[1] J. Alwidian, H. Abu-Mansour and M. Ali, "Efficient algorithm for two dimensional pattern matching problem (non-square pattern)," 2012 International Conference on Information Technology and e-Services, Sousse, 2012, pp. 1-8. doi: 10.1109/ICITeS.2012.6216622

Want to contribute?

Open an issue
Fork repository
Send me a message
