[ENH] pivot_longer_spec #1362

Merged
merged 37 commits into from
Jul 3, 2024

Conversation

samukweku
Collaborator

@samukweku samukweku commented May 10, 2024

PR Description

Please describe the changes proposed in the pull request:

  • Improve performance when sort_by_appearance is True.
  • Add pivot_longer_spec, which allows unpivoting by hand and gives more granular control over how the final dataframe looks in long form.
  • General refactoring.

import pandas as pd
import janitor as jn

In [11]: events = pd.DataFrame(
    ...:             {
    ...:                 "country": ["United States", "Russia", "China"],
    ...:                 "vault_2012_f": [
    ...:                     48.132,
    ...:                     46.366,
    ...:                     44.266,
    ...:                 ],
    ...:                 "vault_2012_m": [46.632, 46.866, 48.316],
    ...:                 "vault_2016_f": [
    ...:                     46.866,
    ...:                     45.733,
    ...:                     44.332,
    ...:                 ],
    ...:                 "vault_2016_m": [45.865, 46.033, 45.0],
    ...:                 "floor_2012_f": [45.366, 41.599, 40.833],
    ...:                 "floor_2012_m": [45.266, 45.308, 45.133],
    ...:                 "floor_2016_f": [45.999, 42.032, 42.066],
    ...:                 "floor_2016_m": [43.757, 44.766, 43.799],
    ...:             }
    ...:         )

In [12]: events
Out[12]:
         country  vault_2012_f  vault_2012_m  ...  floor_2012_m  floor_2016_f  floor_2016_m
0  United States        48.132        46.632  ...        45.266        45.999        43.757
1         Russia        46.366        46.866  ...        45.308        42.032        44.766
2          China        44.266        48.316  ...        45.133        42.066        43.799

[3 rows x 9 columns]
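
To make "unpivoting by hand" concrete, here is a minimal pandas-only sketch of the idea behind a spec: a small table with one row per wide column, stating which output column its values go to and which labels it carries. It assumes a tidyr-style layout with `.name` and `.value` columns. `unpivot_with_spec` is a hypothetical helper written for this illustration; it is not pivot_longer_spec or its implementation, it only handles a single output value column, and it uses a trimmed copy of the events frame.

```python
import pandas as pd

# Trimmed copy of the events frame above (three of the eight score columns).
events = pd.DataFrame(
    {
        "country": ["United States", "Russia", "China"],
        "vault_2012_f": [48.132, 46.366, 44.266],
        "vault_2012_m": [46.632, 46.866, 48.316],
        "floor_2012_f": [45.366, 41.599, 40.833],
    }
)

# Hand-written spec: one row per wide column.
# ".name" is the wide column label, ".value" is the output column that
# receives its values, and the remaining columns are labels for the long form.
spec = pd.DataFrame(
    {
        ".name": ["vault_2012_f", "vault_2012_m", "floor_2012_f"],
        ".value": ["score", "score", "score"],
        "event": ["vault", "vault", "floor"],
        "year": [2012, 2012, 2012],
        "gender": ["f", "m", "f"],
    }
)


def unpivot_with_spec(df, spec, index):
    """Hypothetical helper: melt the columns named in the spec, then attach its labels."""
    value_names = spec[".value"].unique().tolist()
    if len(value_names) != 1:
        raise ValueError("this sketch handles a single output value column only")
    value_name = value_names[0]
    long = df.melt(
        id_vars=index,
        value_vars=spec[".name"].tolist(),
        var_name=".name",
        value_name=value_name,
    )
    labels = spec.drop(columns=".value")
    # A left merge keeps the row order produced by melt.
    long = long.merge(labels, on=".name", how="left")
    extra = [col for col in labels.columns if col != ".name"]
    return long[[*index, *extra, value_name]]


long_events = unpivot_with_spec(events, spec, index=["country"])
print(long_events)
```

Because the labels come from the spec rather than from parsing the column names, each wide column can be relabelled, retyped, or left out by editing a single row of the spec.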

events = pd.concat([events]*100_000)

# dev
In [848]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=False)
62.9 ms ± 361 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [849]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=True)
165 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


# PR
In [842]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=False)
53.2 ms ± 264 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [843]: %timeit events.pivot_longer(index='country', names_to=['event','year','gender'], names_sep='_',sort_by_appearance=True)
48 ms ± 486 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
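
For readers unfamiliar with the option, the sketch below shows the row order that sort_by_appearance=True is understood to produce (all values of the first original row, then the second, and so on), using plain pandas and NumPy. The reshape-based permutation is just one way to get that order without a sort; it is an assumption made for illustration and not necessarily what this PR does internally.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(
    {
        "country": ["United States", "Russia"],
        "vault_2012_f": [48.132, 46.366],
        "vault_2012_m": [46.632, 46.866],
    }
)

# melt stacks column by column: every row of vault_2012_f, then every row of vault_2012_m.
long = wide.melt(id_vars="country", var_name="name", value_name="score")

n_rows = len(wide)
n_cols = len(long) // n_rows

# Positions in `long` form an (n_cols, n_rows) grid; reading that grid in
# column-major order visits each original row's values together.
order = np.arange(len(long)).reshape(n_cols, n_rows).ravel(order="F")
by_appearance = long.iloc[order].reset_index(drop=True)
print(by_appearance)
```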

Performance test for lots of columns (YMMV):

events = pd.DataFrame(
     ...:             {
     ...:                 "country": ["United States", "Russia", "China"],
     ...:                 "vault_2012_f": [
     ...:                     48.132,
     ...:                     46.366,
     ...:                     44.266,
     ...:                 ],
     ...:                 "vault_2012_m": [46.632, 46.866, 48.316],
     ...:                 "vault_2016_f": [
     ...:                     46.866,
     ...:                     45.733,
     ...:                     44.332,
     ...:                 ],
     ...:                 "vault_2016_m": [45.865, 46.033, 45.0],
     ...:                 "floor_2012_f": [45.366, 41.599, 40.833],
     ...:                 "floor_2012_m": [45.266, 45.308, 45.133],
     ...:                 "floor_2016_f": [45.999, 42.032, 42.066],
     ...:                 "floor_2016_m": [43.757, 44.766, 43.799],
     ...:             }
     ...:         )

events
         country  vault_2012_f  vault_2012_m  vault_2016_f  vault_2016_m  floor_2012_f  floor_2012_m  floor_2016_f  floor_2016_m
0  United States        48.132        46.632        46.866        45.865        45.366        45.266        45.999        43.757
1         Russia        46.366        46.866        45.733        46.033        41.599        45.308        42.032        44.766
2          China        44.266        48.316        44.332        45.000        40.833        45.133        42.066        43.799

events = events.set_index('country')
events = pd.concat([events.add_suffix(f'_{num}') for num in range(100)],axis=1)
events = pd.concat([events]*10_000)
events = events.reset_index()
In [143]: events.shape
Out[143]: (30000, 801)

# dev 
In [147]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=True)
2.85 s ± 34.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [148]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=False)
687 ms ± 7.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


# this PR
In [13]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=True)
420 ms ± 3.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [14]: %timeit events.pivot_longer('country', names_to=['event','year','gender','num'],names_sep='_',values_to='score', names_transform={'year':int}, sort_by_appearance=False)
470 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
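
As a rough illustration of what the names_to, names_sep and names_transform arguments in the calls above ask for, the pandas-only sketch below splits two sample column labels on "_" and casts the year part to int. It mimics the intended result on the labels alone and is not the library's code path.

```python
import pandas as pd

# Two labels in the style produced by add_suffix(f'_{num}') above.
columns = ["vault_2012_f_0", "floor_2016_m_99"]

# names_sep='_' with names_to=['event', 'year', 'gender', 'num']
parts = pd.Series(columns).str.split("_", expand=True)
parts.columns = ["event", "year", "gender", "num"]

# names_transform={'year': int}
parts = parts.astype({"year": int})
print(parts)
```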

This PR resolves #1361.

PR Checklist

Please ensure that you have done the following:

  1. Open the PR from a feature branch on your fork. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  2. If you're not on the contributors list, add yourself to AUTHORS.md.
  3. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests pass
  • Making sure that code coverage doesn't go down

Relevant Reviewers

Please tag maintainers to review.

@samukweku samukweku added the enhancement New feature or request label May 10, 2024
@samukweku samukweku requested review from ericmjl, Zeroto521, thatlittleboy and a team May 10, 2024 02:31
@samukweku samukweku self-assigned this May 10, 2024
@ericmjl
Member

ericmjl commented May 10, 2024

@samukweku samukweku marked this pull request as draft June 8, 2024 13:52
@samukweku samukweku marked this pull request as ready for review June 14, 2024 01:47
@ericmjl
Member

ericmjl commented Jul 3, 2024

Ok, I just had a chance to look through the PR. Super high quality work! There was one file that was a tad too long where the implementation happened; I'm going to trust that it works fine. Otherwise, thank you for keeping the code test coverage high, @samukweku!

@ericmjl
Member

ericmjl commented Jul 3, 2024

I am going to approve. Please do the honors of merging!

@samukweku
Collaborator Author

@ericmjl thanks for the feedback. I need to figure out how to break up PRs like this into smaller chunks.

@samukweku samukweku merged commit 2521ce7 into dev Jul 3, 2024
4 checks passed
@samukweku samukweku deleted the samukweku/pivot_longer_spec branch July 3, 2024 08:24
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

perf slower when sort_by_appearance is True for pivot_longer
2 participants