
What is the correct way to do matrix multiplication using ctypes?

In my current implementation, data going back and forth consumes a lot of time. Is there a way to do this more efficiently, e.g. by passing the array address and getting a pointer in return instead of regenerating the entire array with the .contents method?

cpp_function.cpp

Compile with: g++ -shared -fPIC cpp_function.cpp -o cpp_function.so

#include <iostream>
extern "C" {

double* mult_matrix(double *a1, double *a2, size_t a1_h, size_t a1_w,
                    size_t a2_h, size_t a2_w, int size)
{
    // The result has a1_h rows and a2_w columns; the caller passes its
    // element count in `size`.
    double* ret_arr = new double[size];
    for (size_t i = 0; i < a1_h; i++) {
        for (size_t j = 0; j < a2_w; j++) {
            double val = 0;
            for (size_t k = 0; k < a2_h; k++) {
                // Row-major indexing: stride by the row width of each matrix.
                val += a1[i * a1_w + k] * a2[k * a2_w + j];
            }
            ret_arr[i * a2_w + j] = val;
            // printf("%f ", ret_arr[i * a2_w + j]);
        }
        // printf("\n");
    }
    return ret_arr;
}

}

Python file to call the .so file (main.py):

import ctypes
import numpy
from time import time

libmatmult = ctypes.CDLL("./cpp_function.so")
ND_POINTER_1 = numpy.ctypeslib.ndpointer(dtype=numpy.float64, 
                                      ndim=2,
                                      flags="C")
ND_POINTER_2 = numpy.ctypeslib.ndpointer(dtype=numpy.float64, 
                                    ndim=2,
                                    flags="C")
libmatmult.mult_matrix.argtypes = [ND_POINTER_1, ND_POINTER_2,
                                   ctypes.c_size_t, ctypes.c_size_t,
                                   ctypes.c_size_t, ctypes.c_size_t,
                                   ctypes.c_int]

def mult_matrix_cpp(a,b):
    shape = a.shape[0] * a.shape[1]
    libmatmult.mult_matrix.restype = ctypes.POINTER(ctypes.c_double * shape )
    ret_cpp = libmatmult.mult_matrix(a, b, *a.shape, *b.shape , a.shape[0] * a.shape[1])
    out_list_c = [i for i in ret_cpp.contents]  # <---- regenerating a Python list, which is time consuming
    return out_list_c

size_a = (300,300)
size_b = size_a

a = numpy.random.uniform(low=1, high=255, size=size_a)
b = numpy.random.uniform(low=1, high=255, size=size_b)

t2 = time()
out_cpp = mult_matrix_cpp(a,b)
print("cpp time taken:{:.2f} ms".format((time() - t2) * 1000))
out_cpp = numpy.array(out_cpp).reshape(size_a[0], size_a[1])

t3 = time()
out_np = numpy.dot(a,b)
# print(out_np)
print("Numpy dot() time taken:{:.2f} ms".format((time() - t3) * 1000))

This solution works, but it is time consuming. Is there any way to make it faster?

Devil
  • you need to enable compiler optimizations if you want it to be fast. The default optimization level is no optimization – 463035818_is_not_an_ai Nov 29 '22 at 09:49
  • numpy.matmul: https://numpy.org/doc/stable/reference/generated/numpy.matmul.html – nick Nov 29 '22 at 09:51
  • @463035818_is_not_a_number: can you please tell me which command I should try? – Devil Nov 29 '22 at 09:52
  • @nick: thanks nick, but I want to implement that raw function in C++ myself – Devil Nov 29 '22 at 09:53
  • You're not going to beat the BLAS matrix multiply. – user2357112 Nov 29 '22 at 09:54
  • https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html – 463035818_is_not_an_ai Nov 29 '22 at 09:56
  • numpy is implemented in C. https://numpy.org/doc/stable/user/whatisnumpy.html#why-is-numpy-fast. Actually many python modules are implemented in C or C++ – 463035818_is_not_an_ai Nov 29 '22 at 10:01
  • @463035818_is_not_a_number: numpy's .dot() is a vectorized multiplication method, which is way faster compared to any looping method; thanks for the great findings – Devil Nov 29 '22 at 10:06
  • Heard a great talk on YT on python performance these days, might be of general interest: https://www.youtube.com/watch?v=vVUnCXKuNOg – nick Nov 29 '22 at 10:34
  • Please do not write your own matrix multiplication. Reuse the one from Numpy (which uses a BLAS library) or use a BLAS library directly. Optimized BLAS libraries are much more efficiently implemented than your very inefficient naive code. Moreover, it is generally not a good idea to reinvent the wheel unless the goal of the project is only to learn how to write such code (if so, the binding part is not useful for learning that, and if you do want the binding you can use a BLAS call in your function anyway). – Jérôme Richard Nov 29 '22 at 12:58
  • Does this answer help? https://stackoverflow.com/questions/71195319/how-can-i-return-a-multidimensional-array-from-a-shared-library-in-c-to-python-a/71197327#71197327 – Jérôme Richard Nov 29 '22 at 13:02
  • @JérômeRichard: Indeed it's helpful and you're correct, optimized BLAS libraries are much more efficient. – Devil Nov 30 '22 at 08:46

2 Answers


One reason for the time consumption is that the return value is not declared as an ndpointer, so it gets copied element by element into a Python list. Use the following restype instead; you then won't need the later reshape either. But take the commenters' advice and don't reinvent the wheel.

def mult_matrix_cpp(a, b):
    # Declaring the return type as a 2-D ndpointer makes ctypes hand the result
    # back as a NumPy array directly: no per-element copy, no reshape.
    libmatmult.mult_matrix.restype = numpy.ctypeslib.ndpointer(dtype=numpy.float64, ndim=2, shape=a.shape, flags="C")
    return libmatmult.mult_matrix(a, b, *a.shape, *b.shape, a.shape[0] * a.shape[1])
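
An alternative zero-copy sketch keeps restype as a plain pointer and wraps the returned address with numpy.ctypeslib.as_array, which creates a view of the C++ buffer without copying. Either way, the memory allocated with new on the C++ side is never freed, so a production version should also export a matching free function from the library; the helper name mult_matrix_cpp_view below is just for illustration.

import ctypes
import numpy

libmatmult = ctypes.CDLL("./cpp_function.so")
ND_POINTER = numpy.ctypeslib.ndpointer(dtype=numpy.float64, ndim=2, flags="C")
libmatmult.mult_matrix.argtypes = [ND_POINTER, ND_POINTER,
                                   ctypes.c_size_t, ctypes.c_size_t,
                                   ctypes.c_size_t, ctypes.c_size_t,
                                   ctypes.c_int]
# Return a raw pointer; the shape is supplied when wrapping it below.
libmatmult.mult_matrix.restype = ctypes.POINTER(ctypes.c_double)

def mult_matrix_cpp_view(a, b):
    ptr = libmatmult.mult_matrix(a, b, *a.shape, *b.shape, a.shape[0] * b.shape[1])
    # as_array shares memory with the returned pointer: no data is copied.
    return numpy.ctypeslib.as_array(ptr, shape=(a.shape[0], b.shape[1]))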
Mark Tolonen

Use restype:

def mult_matrix_cpp(a, b):
    # Same idea: an ndpointer restype returns a NumPy array with no copy.
    libmatmult.mult_matrix.restype = numpy.ctypeslib.ndpointer(dtype=numpy.float64, ndim=2, shape=a.shape, flags="C")
    return libmatmult.mult_matrix(a, b, *a.shape, *b.shape, a.shape[0] * a.shape[1])
Burak