Developing an efficient algorithm for two-dimensional pattern matching problem (non-square pattern) using Pyspark

taiwotman/pyspark-dataframe-searcher

pyspark-dataframe-searcher

Taiwo's pyspark dataframe feature extraction problem.

Problem description

This problem can be found on StackOverflow. It involves searching a two-dimensional array for a set of values, so it can be regarded as a two-dimensional pattern matching problem (Ref. [1]).

For example:

Given a pyspark dataframe of the form:

id  col1  col2 col3 col4
------------------------
as1  4    10    4    6
as2  6    3     6    1
as3  6    0     2    1
as4  8    8     6    1
as5  9    6     6    9

The objective is to search col2 through col4 of the pyspark dataframe for the values in col1 and to return the id (row name) and column name of every match. The feasible solutions are the following:

In col1, 4 is found in (as1, col3)
In col1, 6 is found in (as1, col4), (as2, col3), (as4, col3), (as5, col2), (as5, col3)
In col1, 8 is found in (as4, col2)
In col1, 9 is found in (as5, col4)

Hint: Assume that col1 is a set: {4,6,8,9} i.e. distinct
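The search described above can be modelled in plain Python before moving it to Spark. The sketch below is an illustration under stated assumptions, not the repository's implementation: the names `ROWS`, `COLUMNS`, and `find_matches` are hypothetical, and the data is the sample table from the problem description. It collects the distinct col1 values, then scans col2 through col4 of every row for them.

```python
from collections import defaultdict

# Sample table from the problem description: (id, col1, col2, col3, col4).
COLUMNS = ["id", "col1", "col2", "col3", "col4"]
ROWS = [
    ("as1", 4, 10, 4, 6),
    ("as2", 6, 3, 6, 1),
    ("as3", 6, 0, 2, 1),
    ("as4", 8, 8, 6, 1),
    ("as5", 9, 6, 6, 9),
]


def find_matches(rows, columns, key_column="col1",
                 search_columns=("col2", "col3", "col4")):
    """Map each distinct key_column value to the (id, column) cells holding it."""
    id_idx = columns.index("id")
    key_idx = columns.index(key_column)
    # Treat the key column as a set, as the hint suggests.
    targets = {row[key_idx] for row in rows}

    matches = defaultdict(list)
    for row in rows:
        for name in search_columns:
            value = row[columns.index(name)]
            if value in targets:
                matches[value].append((row[id_idx], name))
    return dict(matches)


matches = find_matches(ROWS, COLUMNS)
for value in sorted(matches):
    print(f"In col1, {value} is found in {matches[value]}")
```

On the sample table, row as5 holds 6 in both col2 and col3, so the sketch reports both cells for the value 6. One way to carry the same idea over to PySpark is to unpivot col2 through col4 into (id, column, value) rows and join them against the distinct col1 values.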

Solution

An attempt is made to solve the above problem statement using pyspark.

Windows users, run the following commands:

git clone https://github.com/taiwotman/pyspark-dataframe-searcher.git

spark2-submit --py-files job_searchdataframe.zip main.py --search >> output_`date +\%m\%d\%y\%T`.txt

Unix users, run the shell script run.ksh:

chmod 777 run.ksh

./run.ksh

Output

Refer to the output text file written by the run (the output_<date and time>.txt file named in the spark2-submit command above).

Requirements

pyspark>=2.4.0

Installation

pip install pyspark

To do:

  1. Improve the runtime (i.e. end time minus start time).
  2. Improve the search to obtain the rows and columns (i.e. x, y coordinates) of the feasible solutions.
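Both items can be prototyped in plain Python before touching the Spark job. The sketch below is only an illustration under assumptions: `TABLE` holds the numeric columns col1 through col4 of the sample dataframe, `find_coordinates` is a hypothetical helper, x is taken to be the zero-based row index and y the zero-based column index (with col1 at y = 0), and the runtime is measured as end time minus start time with `time.perf_counter`.

```python
import time

# Numeric part of the sample dataframe: columns col1..col4 for rows as1..as5.
TABLE = [
    [4, 10, 4, 6],
    [6, 3, 6, 1],
    [6, 0, 2, 1],
    [8, 8, 6, 1],
    [9, 6, 6, 9],
]


def find_coordinates(table, key_col=0):
    """Return (x, y) = (row index, column index) of every cell outside
    key_col whose value also appears somewhere in key_col."""
    targets = {row[key_col] for row in table}
    return [
        (x, y)
        for x, row in enumerate(table)
        for y, value in enumerate(row)
        if y != key_col and value in targets
    ]


start = time.perf_counter()
coords = find_coordinates(TABLE)
elapsed = time.perf_counter() - start  # runtime = end time minus start time

print(coords)
print(f"elapsed: {elapsed:.6f} s")
```

Building the set of col1 values once keeps each cell lookup at constant expected cost, so the scan is linear in the number of cells. Timings from this sketch say nothing about the Spark job itself, which has its own startup and shuffle costs.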

Reference

  1. J. Alwidian, H. Abu-Mansour and M. Ali, "Efficient algorithm for two dimensional pattern matching problem (non-square pattern)," 2012 International Conference on Information Technology and e-Services, Sousse, 2012, pp. 1-8. doi: 10.1109/ICITeS.2012.6216622

You want to be a contributor?
