Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bohrium 100 times slower than numpy, time "unaccounted for" #633

Open
jamesavery opened this issue Aug 2, 2019 · 3 comments
Open

Bohrium 100 times slower than numpy, time "unaccounted for" #633

jamesavery opened this issue Aug 2, 2019 · 3 comments
Assignees

Comments

@jamesavery
Copy link
Contributor

jamesavery commented Aug 2, 2019

When running with Bohrium or bh107, a crazy amount of time is being spent in "unaccounted for". The following code runs in 0.27 seconds in pure Numpy, but takes 25 seconds with both bohrium and bh107, despite only doing Bohrium-friendly operations, and the generated kernels being efficient. Where does the time go, and how do we fix this?

Switch between the backends by setting backend to "numpy", "bohrium", or "bh107".

#!/usr/bin/python
from time import time
import numpy  as np
import sys

backend = "numpy"
if backend == "bohrium":
    import bohrium as bh
    def to_np(A): return A.copy2numpy()
    def bh_flush(): bh.flush()
    def linspace(*args): return bh.linspace(*args)
elif backend == "bh107":
    import bh107 as bh
    def to_np(A): return A.copy2numpy()
    def bh_flush(): bh.flush()
    def linspace(*args): return bh.array(np.linspace(*args)) # No linspace in bh107
else:
    bh = np
    def to_np(A):   return A
    def bh_flush(): return
    def linspace(*args): return np.linspace(*args)    

# -------- SET UP SIMULATION PARAMETERS --------
n      = 2000
g      = 9.81
h0     = 10 
dx     = 0.5
dt     = 0.01

# -------- ALLOCATE MEMORY --------
dtype=np.float64
u          = bh.zeros(n,dtype=dtype)
eta        = bh.zeros(n,dtype=dtype)
h          = bh.zeros(n,dtype=dtype)

# -------- THE ACTUAL CODE --------
def Dx0(f,dx): return (f[2:]-f[1:-1])*(1/dx)  # d/dx on Grid 0
def Dx1(f,dx): return (f[1:-1]-f[:-2])*(1/dx) # d/dx on Grid 1

def simulation_step_naive():
#    eta_x    =  (eta[2:]-eta[1:-1])*(1/dx) # bh107 can't multiply from left
    eta_x    =  Dx0(eta,dx)
    u[1:-1]    += eta_x*dt*(-g)     # Forward-Euler time integration
    bh_flush()

    h        = eta+h0
    h_x      = Dx0(h,dx)
    u_x      = Dx1(u,dx)
    eta_t    = u[1:-1]*h_x*-1 - u_x*h[1:-1];  # Unary minus doesn't work in bh107
    
    eta[1:-1]  += eta_t*dt      # Forward-Euler time integratio
    bh_flush()


# -------- INITIALIZE AND RUN --------    
xs = linspace(0,(n-1)*dx,n)
eta[:] = bh.exp((xs-500)**2/(2*50**2)*-1)

bh_flush(); T0 = time()
for i in range(10000):
    simulation_step_naive()

bh_flush(); T1 = time()
print(round(T1-T0,4),"seconds")
@jamesavery
Copy link
Contributor Author

The problem is both in the openmp and opencl stack with pretty much the same cost (confirming that it's not execution time or fusion).

@dionhaefner
Copy link
Collaborator

This has been known for a long time (see #342).

Here is a dump of cProfile:

28.7435 seconds
         31982860 function calls (31979077 primitive calls) in 28.915 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    10000    7.001    0.001   28.577    0.003 bh_test.py:40(simulation_step_naive)
    20000    2.868    0.000    6.644    0.000 bh_test.py:37(Dx0)
    20002    2.859    0.000    2.859    0.000 {built-in method bohrium._bh.flush}
2881424/2880576    1.882    0.000    2.900    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
   140015    1.319    0.000    1.319    0.000 _info.py:947(numpy_types)
    10000    1.313    0.000    3.104    0.000 bh_test.py:38(Dx1)
  1460075    1.269    0.000    2.505    0.000 numeric.py:2035(isscalar)
   140010    1.177    0.000    3.255    0.000 array_create.py:168(empty)
  2880841    1.169    0.000    1.642    0.000 <frozen importlib._bootstrap>:416(parent)
  5962860    0.985    0.000    1.553    0.000 {built-in method builtins.isinstance}
   760052    0.887    0.000    2.499    0.000 _dtype.py:319(_name_get)
   180008    0.795    0.000    2.005    0.000 array_manipulation.py:442(broadcast_arrays)
  5770832    0.682    0.000    0.682    0.000 {built-in method builtins.hasattr}
   760052    0.646    0.000    1.611    0.000 numerictypes.py:365(issubdtype)
  1520104    0.614    0.000    0.882    0.000 numerictypes.py:293(issubclass_)
  2882516    0.473    0.000    0.473    0.000 {method 'rpartition' of 'str' objects}
   140014    0.421    0.000    1.740    0.000 {bohrium._util.dtype_support}
   480017    0.385    0.000    0.998    0.000 array_manipulation.py:477(<genexpr>)
  2280328    0.351    0.000    0.351    0.000 {built-in method builtins.issubclass}
   240016    0.297    0.000    0.693    0.000 loop.py:522(add_slide_info)
  1120040    0.295    0.000    0.295    0.000 {built-in method _abc._abc_instancecheck}
  1120040    0.274    0.000    0.569    0.000 abc.py:137(__instancecheck__)
        1    0.167    0.167   28.918   28.918 bh_test.py:2(<module>)
   180007    0.165    0.000    1.163    0.000 {built-in method builtins.all}
   120006    0.134    0.000    0.295    0.000 loop.py:363(has_iterator)
   120006    0.102    0.000    0.146    0.000 loop.py:375(check_simple_type)
   140005    0.097    0.000    0.097    0.000 loop.py:131(__init__)
   140005    0.052    0.000    0.052    0.000 loop.py:266(has_changes)
363306/363091    0.048    0.000    0.048    0.000 {built-in method builtins.len}
    20002    0.027    0.000    2.886    0.000 bh_test.py:10(bh_flush)
      224    0.024    0.000    0.024    0.000 {built-in method marshal.loads}
       23    0.014    0.001    0.017    0.001 {built-in method _imp.create_dynamic}
  430/427    0.006    0.000    0.010    0.000 {built-in method builtins.__build_class__}
     1063    0.005    0.000    0.005    0.000 {built-in method posix.stat}
        3    0.004    0.001    0.004    0.001 array_create.py:269(zeros)
      484    0.004    0.000    0.015    0.000 <frozen importlib._bootstrap_external>:1356(find_spec)
     1815    0.004    0.000    0.004    0.000 {method 'search' of 're.Pattern' objects}
       70    0.003    0.000    0.016    0.000 inspect.py:316(getmembers)
      224    0.003    0.000    0.003    0.000 {method 'read' of '_io.FileIO' objects}
      224    0.003    0.000    0.006    0.000 <frozen importlib._bootstrap_external>:914(get_data)
     4172    0.003    0.000    0.010    0.000 inspect.py:285(isroutine)
    337/1    0.002    0.000   28.918   28.918 {built-in method builtins.exec}
    23/22    0.002    0.000    0.005    0.000 {built-in method _imp.exec_dynamic}
        1    0.002    0.002    0.050    0.050 bohriumify.py:29(modules)
     7853    0.002    0.000    0.002    0.000 {built-in method builtins.getattr}
      224    0.002    0.000    0.037    0.000 <frozen importlib._bootstrap_external>:793(get_code)
   105/34    0.002    0.000    0.005    0.000 sre_parse.py:475(_parse)
      766    0.002    0.000    0.003    0.000 <frozen importlib._bootstrap>:157(_get_module_lock)
      294    0.002    0.000    0.021    0.000 <frozen importlib._bootstrap>:882(_find_spec)
     2581    0.002    0.000    0.004    0.000 <frozen importlib._bootstrap_external>:56(_path_join)
     2160    0.002    0.000    0.004    0.000 inspect.py:90(ismethoddescriptor)
    361/2    0.002    0.000    0.168    0.084 <frozen importlib._bootstrap>:978(_find_and_load)
     2581    0.001    0.000    0.002    0.000 <frozen importlib._bootstrap_external>:58(<listcomp>)
      766    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:78(acquire)
      448    0.001    0.000    0.003    0.000 <frozen importlib._bootstrap_external>:271(cache_from_source)
     5724    0.001    0.000    0.002    0.000 inspect.py:158(isfunction)
      277    0.001    0.000    0.017    0.000 <frozen importlib._bootstrap_external>:1240(_get_spec)
    264/2    0.001    0.000    0.167    0.084 <frozen importlib._bootstrap>:663(_load_unlocked)
      264    0.001    0.000    0.005    0.000 <frozen importlib._bootstrap>:504(_init_module_attrs)
      766    0.001    0.000    0.001    0.000 <frozen importlib._bootstrap>:103(release)
       21    0.001    0.000    0.003    0.000 __init__.py:316(namedtuple)
    306/2    0.001    0.000    0.168    0.084 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
     1037    0.001    0.000    0.001    0.000 {bohrium.bhary.fix_biclass_wrapper}
   198/32    0.001    0.000    0.003    0.000 sre_compile.py:71(_compile)
     4172    0.001    0.000    0.002    0.000 inspect.py:276(isbuiltin)
     3977    0.001    0.000    0.001    0.000 inspect.py:80(ismethod)

I think the time spent in simulation_step_naive and Dx0 is actually spent in Bohrium's C backend (which Python cannot profile). The rest is Python and Cython overhead inside Bohrium (isinstance, numpy_types, isscalar, ...).

Due to this overhead, I usually don't recommend using Bohrium for problem sizes < 1e6.

@dionhaefner
Copy link
Collaborator

Weirdly, the "unaccounted for" time decreases slightly also for this problem when I use OMP_NUM_THREADS=1. So it can't just be overhead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants