Matrix multiplication from foundations


The foundations we will assume throughout this course are:

- Python
- matplotlib
- The Python standard library
- Jupyter notebooks and nbdev

!pip install deeplake
from pathlib import Path
import pickle, gzip, math, os, time, shutil, matplotlib as mpl, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import random

What is not-MNIST?

The not-MNIST dataset consists of glyphs extracted from freely available fonts to create a dataset similar to MNIST. It is divided into two parts: a relatively small, hand-cleaned portion of approximately 19k samples and a larger, uncleaned portion of about 500k samples. There are ten classes, the letters A-J, drawn from various fonts. Here we will use the hand-cleaned portion; the classes are:

1. A

2. B

3. C

4. D

5. E

6. F

7. G

8. H

9. I

10. J

Get data

import deeplake
ds = deeplake.load('hub://activeloop/not-mnist-small')
Opening dataset in read-only mode as you don't have write permissions.
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop/not-mnist-small

hub://activeloop/not-mnist-small loaded successfully.
images = ds.tensors['images'].numpy().astype('float32') / 255.
labels = ds.tensors['labels'].numpy().astype(int)
images.shape
(18724, 28, 28)
labels.shape
(18724, 1)
X_train, X_test, y_train, y_test = train_test_split(images, labels, test_size=0.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((14979, 28, 28), (3745, 28, 28), (14979, 1), (3745, 1))
plt.imshow(X_train[0], cmap='gray')
<matplotlib.image.AxesImage>

y_train = y_train.squeeze(-1)
y_test = y_test.squeeze(-1)
def char(y): return 'ABCDEFGHIJ'[y]
char(y_train[0])
'I'
X_train = X_train.reshape(-1, 784)
X_test = X_test.reshape(-1, 784)
X_train.shape, X_test.shape
((14979, 784), (3745, 784))
lst1 = list(X_train[0])
vals = lst1[290: 300]
vals
[1.0, 0.9843137, 1.0, 1.0, 1.0, 1.0, 0.9843137, 1.0, 0.68235296, 0.0]
def chunks(x, sz):
    for i in range(0, len(x), sz): yield x[i: i + sz]
list(chunks(vals, 5))
[[1.0, 0.9843137, 1.0, 1.0, 1.0], [1.0, 0.9843137, 1.0, 0.68235296, 0.0]]
val_iter = chunks(vals, 5)
next(val_iter)
[1.0, 0.9843137, 1.0, 1.0, 1.0]
next(val_iter)
[1.0, 0.9843137, 1.0, 0.68235296, 0.0]
next(val_iter)
StopIteration: ignored
plt.imshow(list(chunks(lst1, 28)), cmap='gray')
<matplotlib.image.AxesImage>

from itertools import islice
it = iter(vals)
islice(it, 5)
<itertools.islice>
list(islice(it, 5))
[1.0, 0.9843137, 1.0, 1.0, 1.0]
for i in islice(it, 5):
    print(i)
1.0
0.9843137
1.0
0.68235296
0.0
it = iter(lst1)
img = list(iter(lambda : list(islice(it, 28)), []))
plt.imshow(img)
<matplotlib.image.AxesImage>
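The trick here is the two-argument form of iter: iter(callable, sentinel) calls the callable repeatedly and stops as soon as it returns the sentinel, here the empty list produced once the underlying iterator is exhausted. A minimal sketch of the same idea (hypothetical values, not from the notebook):

from itertools import islice
nums = iter(range(5))
# each call pulls up to 2 items; an exhausted iterator gives [], the sentinel, which stops iteration
list(iter(lambda: list(islice(nums, 2)), []))  # [[0, 1], [2, 3], [4]]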

Matrix and Tensor

img[20][15]
1.0
class Matrix:
    def __init__(self, xs): self.xs = xs
    def __getitem__(self, idxs): return self.xs[idxs[0]][idxs[1]]
m = Matrix(img)
m[20, 15]
1.0
import torch
from torch import tensor
tensor([1, 2, 3])
tensor([1, 2, 3])
tens = tensor(img); tens[20, 15]
tensor(1.)
X_train, y_train, X_test, y_test = map(tensor, (X_train, y_train, X_test, y_test))
X_train.type()
'torch.FloatTensor'
X_train.shape
torch.Size([14979, 784])
imgs = X_train.reshape((-1, 28, 28))
imgs.shape
torch.Size([14979, 28, 28])
plt.imshow(imgs[0])
<matplotlib.image.AxesImage>

Random numbers

rnd_state = None
def seed(a):
    global rnd_state
    a, x = divmod(a, 30268)
    a, y = divmod(a, 30306)
    a, z = divmod(a, 30322)

    rnd_state = int(x) + 1, int(y) + 1, int(z) + 1
seed(457428938475)
rnd_state
(4976, 20238, 499)
def rand():
    global rnd_state
    x, y, z = rnd_state
    x = (171 * x) % 30269
    y = (172 * y) % 30307
    z = (170 * z) % 30323
    rnd_state = x, y, z
    return (x / 30269 + y / 30307 + z / 30323) % 1.0
rand(), rand(), rand()
(0.7645251082582081, 0.7920889799553945, 0.06912886811267205)
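This pair of functions is the Wichmann-Hill generator: three small linear congruential generators whose scaled outputs are combined modulo 1. It is the algorithm Python's random module used before switching to the Mersenne Twister.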
if os.fork(): print(f'In parent: {rand()}')
else:
    print(f'In child: {rand()}')
    os._exit(os.EX_OK)
In parent: 0.9559050644103264
In child: 0.9559050644103264
if os.fork(): print(f'In parent: {torch.rand(1)}')
else:
    print(f'In child: {torch.rand(1)}')
    os._exit(os.EX_OK)
In parent: tensor([0.7129])
In child: tensor([0.7129])
if os.fork(): print(f'In parent: {random.random()}')
else:
    print(f'In child: {random.random()}')
    os._exit(os.EX_OK)
In parent: 0.7579544029403025
In child: 0.1583037383506456
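The point of these fork experiments: fork copies the parent process's memory, including random-number generator state, so parent and child can produce identical "random" values unless the generator is reseeded after the fork. This is exactly the trap to watch for when spawning worker processes, for example for parallel data loading.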
plt.plot([rand() for _ in range(50)]);

plt.hist([rand() for _ in range(100_000)])
(array([10045.,  9990.,  9701., 10008.,  9934.,  9935., 10182., 10210.,
         9924., 10071.]),
 array([3.03151161e-06, 1.00002109e-01, 2.00001186e-01, 3.00000263e-01,
        3.99999341e-01, 4.99998418e-01, 5.99997495e-01, 6.99996572e-01,
        7.99995650e-01, 8.99994727e-01, 9.99993804e-01]),
 <BarContainer object of 10 artists>)

8.74 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

143 µs ± 60 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Matrix Multiplication
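Recall the definition of matrix multiplication: each entry of C = AB is the dot product of a row of A with a column of B,

$$C_{ij} = \sum_k A_{ik} B_{kj}$$

The three nested loops below compute exactly this, one scalar multiply-add at a time.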

torch.manual_seed(1)
weights = torch.randn(784, 10)
bias = torch.zeros(10)
m1 = X_test[:5]
m2 = weights
m1.shape, m2.shape
(torch.Size([5, 784]), torch.Size([784, 10]))
ar, ac = m1.shape
br, bc = m2.shape
(ar, ac), (br, bc)
((5, 784), (784, 10))
t1 = torch.zeros(ar, bc)
t1.shape
torch.Size([5, 10])
for i in range(ar):
    for j in range(bc):
        for k in range(ac):
            t1[i, j] += m1[i, k] * m2[k, j]
t1
tensor([[  8.2509, -32.2384, -31.2659,  29.6144,  37.1024, -10.6518, -18.7646,
          -9.3841,  13.5039,   3.1983],
        [ 15.3977, -19.5008,  -5.6448,  30.4349,  59.5697,   6.9771, -25.8890,
         -40.2893, -17.3914,  43.8492],
        [  7.8317,  -4.5760,   5.5933, -13.6489,  28.6691, -11.6068, -25.0293,
         -16.1800,   7.5447,  -6.0635],
        [  5.7755,  -7.0180,   5.6232, -14.6877,  40.6079,   4.7634, -18.4372,
         -18.1103,  -4.8335,  26.0490],
        [-10.4290, -20.8873, -15.5178,  11.3828,  39.0230, -28.5750, -34.5579,
          -0.3404,   8.5624,  -8.3245]])
torch.set_printoptions(precision=2, linewidth=140, sci_mode=False)
t1
tensor([[  8.25, -32.24, -31.27,  29.61,  37.10, -10.65, -18.76,  -9.38,  13.50,   3.20],
        [ 15.40, -19.50,  -5.64,  30.43,  59.57,   6.98, -25.89, -40.29, -17.39,  43.85],
        [  7.83,  -4.58,   5.59, -13.65,  28.67, -11.61, -25.03, -16.18,   7.54,  -6.06],
        [  5.78,  -7.02,   5.62, -14.69,  40.61,   4.76, -18.44, -18.11,  -4.83,  26.05],
        [-10.43, -20.89, -15.52,  11.38,  39.02, -28.58, -34.56,  -0.34,   8.56,  -8.32]])
import numpy as np
np.set_printoptions(precision=2, linewidth=140)
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac):
                c[i, j] += a[i, k] * b[k, j]
    return c
ar*bc*ac
39200
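That 39,200 is ar × bc × ac = 5 × 10 × 784: one multiply-add per iteration of the innermost loop, every one of them executed by the Python interpreter, which is why this version is so slow.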

Numba

from numba import njit
@njit
def dot(a, b):
    res = 0
    for i in range(len(a)):
        res += a[i] * b[i]
    return res
from numpy import array
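Numba compiles dot to machine code the first time it is called with a given argument type, so the first timed call pays the compilation cost (hundreds of milliseconds), while the second call reuses the cached compiled code and takes only microseconds: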
CPU times: user 248 ms, sys: 21.4 ms, total: 270 ms
Wall time: 416 ms
20.0
CPU times: user 27 µs, sys: 0 ns, total: 27 µs
Wall time: 30 µs
20.0

Now only two of our loops are running in Python, not three:

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i, j] += dot(a[i, :], b[:, j])
    return c
m1a, m2a = m1.numpy(), m2.numpy()
from fastcore.test import *
test_close(t1, matmul(m1a, m2a), eps=1e-4)

Elementwise Ops

a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
a, b
(tensor([10.,  6., -4.]), tensor([2., 8., 7.]))
a + b
tensor([12., 14.,  3.])
(a < b).float().mean()
tensor(0.67)
m = tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); m
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

Frobenius Norm
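The Frobenius norm of a matrix is the square root of the sum of the squares of all its entries:

$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$$

(m*m).sum() computes the sum of squares, and .sqrt() completes the calculation.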

sf = (m*m).sum()
sf
tensor(285)
sf.sqrt()
tensor(16.88)
m[2, :], m[:, 2]
(tensor([7, 8, 9]), tensor([3, 6, 9]))
m[2]
tensor([7, 8, 9])
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c
test_close(t1, matmul(m1, m2), eps=1e-4)
1.33 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i, j] = torch.dot(a[i, :], b[:, j])
    return c
test_close(t1, matmul(m1, m2), eps=1e-4)
906 µs ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)

Broadcasting
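Broadcasting describes how PyTorch (following NumPy) handles arithmetic between tensors of different shapes: the smaller operand behaves as if it were expanded to match the larger one, without any data actually being copied. The simplest case is a tensor and a scalar: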

a
tensor([10.,  6., -4.])
a > 0
tensor([ True,  True, False])
a + 1
tensor([11.,  7., -3.])
m
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
2 * m
tensor([[ 2,  4,  6],
        [ 8, 10, 12],
        [14, 16, 18]])

Broadcasting a vector to matrix

c = tensor([10, 20, 30]); c
tensor([10, 20, 30])
m
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
m.shape, c.shape
(torch.Size([3, 3]), torch.Size([3]))
m + c
tensor([[11, 22, 33],
        [14, 25, 36],
        [17, 28, 39]])
c + m
tensor([[11, 22, 33],
        [14, 25, 36],
        [17, 28, 39]])
t = c.expand_as(m); t
tensor([[10, 20, 30],
        [10, 20, 30],
        [10, 20, 30]])
t.storage()
UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  t.storage()
 10
 20
 30
[torch.storage.TypedStorage(dtype=torch.int64, device=cpu) of size 3]
t.stride(), t.shape
((0, 1), torch.Size([3, 3]))
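The stride of (0, 1) shows why expand_as is cheap: moving to the next row advances the storage pointer by 0 elements, so all three rows of t are views onto the same 3-element storage and no data is copied.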
c.unsqueeze(0), c[None, :]
(tensor([[10, 20, 30]]), tensor([[10, 20, 30]]))
c.shape, c.unsqueeze(0).shape
(torch.Size([3]), torch.Size([1, 3]))
c.shape, c.unsqueeze(1).shape
(torch.Size([3]), torch.Size([3, 1]))
c[None].shape, c[..., None].shape
(torch.Size([1, 3]), torch.Size([3, 1]))
c[:, None].expand_as(m)
tensor([[10, 10, 10],
        [20, 20, 20],
        [30, 30, 30]])
m + c[:, None]
tensor([[11, 12, 13],
        [24, 25, 26],
        [37, 38, 39]])
m + c[None, :]
tensor([[11, 22, 33],
        [14, 25, 36],
        [17, 28, 39]])

Broadcasting rules
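When operating on two tensors, PyTorch compares their shapes element-wise starting from the trailing dimension. Two dimensions are compatible when they are equal, or when one of them is 1 (that dimension is then broadcast); a missing leading dimension is treated as 1.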

c[None, :]
tensor([[10, 20, 30]])
c[None, :].shape
torch.Size([1, 3])
c[:, None]
tensor([[10],
        [20],
        [30]])
c[:, None].shape
torch.Size([3, 1])
c[:, None] * c[None, :]
tensor([[100, 200, 300],
        [200, 400, 600],
        [300, 600, 900]])
c[None] > c[:, None]
tensor([[False,  True,  True],
        [False, False,  True],
        [False, False, False]])

Matmul with broadcasting

digit = m1[0]
digit.shape, m2.shape
(torch.Size([784]), torch.Size([784, 10]))
digit[:, None].shape
torch.Size([784, 1])
digit[:, None].expand_as(m2).shape
torch.Size([784, 10])
(digit[:, None] * m2).shape
torch.Size([784, 10])
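This gives a whole row of the result at once: a[i, :, None] has shape (784, 1), broadcasting it against b of shape (784, 10) multiplies row i against every column of b simultaneously, and summing over dim 0 collapses the 784 products into the 10 entries of row i.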
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        c[i] = (a[i,:, None] * b).sum(dim=0)
    return c
test_close(t1, matmul(m1, m2), eps=1e-4)
276 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
tr = matmul(X_train, weights)
tr
tensor([[ 16.25, -15.36, -25.81,  ..., -21.11, -26.31,  17.92],
        [  5.01,   0.29, -23.21,  ..., -14.26,  23.98,  24.13],
        [  5.70,  -3.06,  -9.48,  ...,  -9.08,  11.74,   0.88],
        ...,
        [-12.85,  -6.30,  -4.18,  ..., -22.07,  16.46,  -0.86],
        [ 13.19, -21.23, -10.63,  ...,  -9.95,  19.52,   3.76],
        [ -1.56,  -5.29,  -8.10,  ...,  -2.18,  -0.69,   1.49]])
tr.shape
torch.Size([14979, 10])
CPU times: user 430 ms, sys: 3.76 ms, total: 434 ms
Wall time: 436 ms

Einstein Summation

Einstein summation is a compact notation for combining products and sums in a general way. The key rules are:

- Repeating letters between input arrays means that values along those axes will be multiplied together.
- Omitting a letter from the output means that values along that axis will be summed.

m1.shape, m2.shape
(torch.Size([5, 784]), torch.Size([784, 10]))
mr = torch.einsum('ik,kj->ikj', m1, m2)
mr.shape
torch.Size([5, 784, 10])
mr.sum(1)
tensor([[  8.25, -32.24, -31.27,  29.61,  37.10, -10.65, -18.76,  -9.38,  13.50,   3.20],
        [ 15.40, -19.50,  -5.64,  30.43,  59.57,   6.98, -25.89, -40.29, -17.39,  43.85],
        [  7.83,  -4.58,   5.59, -13.65,  28.67, -11.61, -25.03, -16.18,   7.54,  -6.06],
        [  5.78,  -7.02,   5.62, -14.69,  40.61,   4.76, -18.44, -18.11,  -4.83,  26.05],
        [-10.43, -20.89, -15.52,  11.38,  39.02, -28.58, -34.56,  -0.34,   8.56,  -8.32]])
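Because k appears in the output spec 'ikj', nothing is summed: mr holds every individual product, and summing over the k axis (dim 1) recovers the matrix product. Omitting k from the output lets einsum perform that sum itself: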
torch.einsum('ik,kj->ij', m1,m2)
tensor([[  8.25, -32.24, -31.27,  29.61,  37.10, -10.65, -18.76,  -9.38,  13.50,   3.20],
        [ 15.40, -19.50,  -5.64,  30.43,  59.57,   6.98, -25.89, -40.29, -17.39,  43.85],
        [  7.83,  -4.58,   5.59, -13.65,  28.67, -11.61, -25.03, -16.18,   7.54,  -6.06],
        [  5.78,  -7.02,   5.62, -14.69,  40.61,   4.76, -18.44, -18.11,  -4.83,  26.05],
        [-10.43, -20.89, -15.52,  11.38,  39.02, -28.58, -34.56,  -0.34,   8.56,  -8.32]])
def matmul(a, b): return torch.einsum('ik,kj->ij', a, b)
test_close(tr, matmul(X_train, weights), eps=1e-4)
7.54 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

pytorch op

We can use PyTorch's matrix-multiplication function or the @ operator directly.

test_close(tr, X_train@weights, eps=1e-3)
7.24 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
7.53 ms ± 501 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
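For reference, the @ operator dispatches to the same operation as torch.matmul; a quick sanity check (a minimal sketch using the tensors defined above):

# @ and torch.matmul compute the same product
test_close(X_train@weights, torch.matmul(X_train, weights), eps=1e-4)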

CUDA

def matmul(grid, a, b, c):
    i, j = grid
    if i < c.shape[0] and j < c.shape[1]:
        tmp = 0.
        for k in range(a.shape[1]): tmp += a[i, k] * b[k, j]
        c[i, j] = tmp
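This is a CPU simulation of how a CUDA kernel works: each call computes a single cell of the output, identified by its grid coordinates, and the bounds check lets the grid be larger than the output. Calling it once at grid position (0, 0) fills in only c[0, 0]; launch_kernel below then loops over every grid position, doing serially what the GPU will do for us in parallel.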
res = torch.zeros(ar, bc)
matmul((0, 0), m1, m2, res)
res
tensor([[8.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]])
def launch_kernel(kernel, grid_x, grid_y, *args, **kwargs):
    for i in range(grid_x):
        for j in range(grid_y): kernel((i, j), *args, **kwargs)
res = torch.zeros(ar, bc)
launch_kernel(matmul, ar, bc, m1, m2, res)
res
tensor([[  8.25, -32.24, -31.27,  29.61,  37.10, -10.65, -18.76,  -9.38,  13.50,   3.20],
        [ 15.40, -19.50,  -5.64,  30.43,  59.57,   6.98, -25.89, -40.29, -17.39,  43.85],
        [  7.83,  -4.58,   5.59, -13.65,  28.67, -11.61, -25.03, -16.18,   7.54,  -6.06],
        [  5.78,  -7.02,   5.62, -14.69,  40.61,   4.76, -18.44, -18.11,  -4.83,  26.05],
        [-10.43, -20.89, -15.52,  11.38,  39.02, -28.58, -34.56,  -0.34,   8.56,  -8.32]])
from numba import cuda
@cuda.jit
def matmul(a, b, c):
    i, j = cuda.grid(2)
    if i < c.shape[0] and j < c.shape[1]:
        tmp = 0.
        for k in range(a.shape[1]): tmp += a[i, k] * b[k, j]
        c[i, j] = tmp
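cuda.grid(2) returns this thread's absolute (row, column) position across the whole grid of blocks; because the grid is rounded up to whole blocks, some threads land outside the output, and the bounds check makes them do nothing.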
r = np.zeros(tr.shape)
m1g,m2g,rg = cuda.to_device(X_train), cuda.to_device(weights), cuda.to_device(r)
cuda.is_available()
True
m1g,m2g,rg = map(cuda.to_device, (X_train, weights, r))
r.shape
(14979, 10)
r
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
TPB = 16
rr, rc = r.shape
blockspergrid = (math.ceil(rr / TPB), math.ceil(rc / TPB))
blockspergrid
(937, 1)
matmul[blockspergrid, (TPB, TPB)](m1g, m2g, rg)
r = rg.copy_to_host()
test_close(tr, r, eps=1.03)
matmul[blockspergrid, (TPB, TPB)](m1g, m2g, rg)
r = rg.copy_to_host()
4.98 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
m1c, m2c = X_train.cuda(), weights.cuda()
r = (m1c @ m2c).cpu()
673 µs ± 65.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)