Optimizing data operations is crucial for efficient processing of large datasets. Three common methods in Pandas are:
- Vectorization: Operating directly on entire arrays, leveraging libraries like NumPy. Highly recommended for large datasets, where it delivers by far the best performance.
- Iterating with iterrows(): Looping row by row through a DataFrame with a for loop. Usually far slower than vectorization, because each individual row and column access carries Python-level overhead.
- Using apply(): Applying a function along an axis of the DataFrame, useful for complex or custom operations. Still slower than vectorization, since the function is called once per row or column.
Importance of Optimization in Data Operations:
- Efficiency: Enhances performance to process data quickly and effectively.
- Scalability: Ensures algorithms can handle growing datasets without compromising execution time.
- Computational Resources: Efficient utilization of computational resources is crucial for applications requiring fast and accurate processing.
- Productivity: Optimization reduces task time, boosting productivity in data analysis and development.
Since efficiency on very large data is crucial, I put together a brief code-level comparison to see which method is most optimal. For the test, I created a DataFrame of considerable size (10,000 rows) and performed a sum between its two equally sized columns, measuring the time each method takes to run the operation once. Let's go through the code in more detail:
- We will import the libraries needed for this test.
import pandas as pd
import numpy as np
import timeit
- We will create the example DataFrame, with the dimensions mentioned above.
df = pd.DataFrame({'A': np.random.randint(1, 100, 10000),
                   'B': np.random.randint(1, 100, 10000)})
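(Optional, not part of the timing.) A quick sanity check that the DataFrame was built with the expected dimensions:
print(df.shape)   # expected: (10000, 2)
print(df.head())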
- We will define the different methods to be tested, the ones mentioned at the beginning (a faster iteration alternative, itertuples(), is also sketched right after these definitions).
3.1. Vectorization
def vectorization():
    df['C'] = df['A'] + df['B']
3.2. Iterrows
def iterrows():
    for index, row in df.iterrows():
        df.at[index, 'C'] = row['A'] + row['B']
3.3. Apply
def apply():
    df['C'] = df.apply(lambda row: row['A'] + row['B'], axis=1)
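As an aside beyond the three methods above: itertuples() is a common lighter-weight alternative to iterrows(), since it yields plain namedtuples instead of constructing a Series per row. A minimal sketch, for comparison only (itertuples_sum is my own naming, not part of the original test):
def itertuples_sum():
    # Namedtuple access avoids per-row Series construction, so this is
    # typically much faster than iterrows(), though still slower than
    # true vectorization
    df['C'] = [row.A + row.B for row in df.itertuples(index=False)]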
- Each function will be run once, measuring the time taken to verify which one is the most optimal.
vectorized_time = timeit.timeit(vectorization, number=1)
iterrows_time = timeit.timeit(iterrows, number=1)
apply_time = timeit.timeit(apply, number=1)
print(f"Vectorization: {vectorized_time}s")
print(f"Iterrows: {iterrows_time}s")
print(f"Apply: {apply_time}s")
Running the test, the results were:
However, considering that the load may become heavier and would need to be run several times, each method was also timed over 10 iterations (a sketch of that measurement follows after the next paragraph). The results were:
Imagine you have to perform much larger operations on your DataFrame over a long period of time: the time benefit of using vectorization over the other available options is enormous.
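For reference, the 10-iteration measurement is just a change to timeit's number argument, reusing the functions defined above; a minimal sketch:
vectorized_time_10 = timeit.timeit(vectorization, number=10)
iterrows_time_10 = timeit.timeit(iterrows, number=10)
apply_time_10 = timeit.timeit(apply, number=10)
print(f"Vectorization (x10): {vectorized_time_10}s")
print(f"Iterrows (x10): {iterrows_time_10}s")
print(f"Apply (x10): {apply_time_10}s")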
Next, a large DataFrame is created, and a complex function is applied to it using both the apply() method and vectorized operations. The goal is to compare the performance of these two approaches. By streamlining the code and enhancing its efficiency, we aim to demonstrate the benefits of vectorization over traditional iterative methods in data manipulation tasks. Let's delve into the optimized script to explore these concepts further.
- First, the libraries.
import pandas as pd
import numpy as np
import timeit
- Then, the new DataFrame.
np.random.seed(0)
df_large = pd.DataFrame(np.random.randint(1, 100, size=(100000, 3)),
                        columns=['A', 'B', 'C'])
- The experiment for the apply function.
def setup_large():
    # Recreate the large DataFrame
    df_large = pd.DataFrame(np.random.randint(1, 100, size=(100000, 3)),
                            columns=['A', 'B', 'C'])

    def complex_func(row):
        if row['A'] > row['B']:
            return row['A'] + row['C']
        else:
            return row['B'] - row['C']

    return df_large, complex_func

def apply_function_large():
    df_large, complex_func = setup_large()
    df_large['D_apply'] = df_large.apply(complex_func, axis=1)
- The experiment for the vectorization.
def vectorized_function_large():
    df_large, _ = setup_large()
    df_large['D_vect'] = np.where(df_large['A'] > df_large['B'],
                                  df_large['A'] + df_large['C'],
                                  df_large['B'] - df_large['C'])
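np.where handles this two-branch condition; if complex_func grew a third branch, np.select would be the usual vectorized counterpart. A minimal sketch with a hypothetical extra branch for the tie case (not part of the original experiment):
def vectorized_select_large():
    df_large, _ = setup_large()
    conditions = [
        df_large['A'] > df_large['B'],
        df_large['A'] == df_large['B'],   # hypothetical tie branch
        df_large['A'] < df_large['B'],
    ]
    choices = [
        df_large['A'] + df_large['C'],
        df_large['C'],                    # hypothetical value for ties
        df_large['B'] - df_large['C'],
    ]
    # np.select picks, per row, the choice of the first matching condition
    df_large['D_sel'] = np.select(conditions, choices)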
- Finally, the time taken for each function.
time_apply_large = timeit.timeit(apply_function_large, number=1)
time_vect_large = timeit.timeit(vectorized_function_large, number=1)
print(f'Apply: {time_apply_large}s')
print(f'Vectorization: {time_vect_large}s')
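One caveat: because setup_large() draws fresh random numbers on each call, the two timed functions operate on different data. To confirm both approaches produce identical values, run them once on a shared DataFrame; a minimal sketch:
df_check, complex_func = setup_large()
df_check['D_apply'] = df_check.apply(complex_func, axis=1)
df_check['D_vect'] = np.where(df_check['A'] > df_check['B'],
                              df_check['A'] + df_check['C'],
                              df_check['B'] - df_check['C'])
# Both columns should match exactly
print(df_check['D_apply'].equals(df_check['D_vect']))  # expected: True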
For this comparison, the results in terms of time consumed were:
To conclude, vectorization is generally faster and more efficient in terms of performance, especially for simple operations. However, apply() can be more flexible and useful for applying complex functions that cannot easily be expressed as vectorized operations. In general, vectorization is recommended whenever possible, but apply() can be a useful tool when vectorization is not practical or sufficient.
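As a closing illustration of that last point (a hypothetical example, not from the benchmarks above): row-wise logic with Python-level error handling is hard to express as a single array expression, and apply() stays the pragmatic choice:
import json

records = pd.DataFrame({'payload': ['{"x": 1}', 'not json', '{"x": 3}']})

def extract_x(row):
    # A try/except around a Python-only parser resists simple vectorization
    try:
        return json.loads(row['payload'])['x']
    except (ValueError, KeyError):
        return None

records['x'] = records.apply(extract_x, axis=1)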