Matrix multiplication from foundations


The foundations we will assume throughout this course are:

- Python
- matplotlib
- The Python standard library
- Jupyter notebooks and nbdev

!pip install deeplake
from pathlib import Path
import pickle, gzip, math, os, time, shutil, matplotlib as mpl, matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import random

What is not-MNIST?

The not-MNIST dataset consists of glyphs extracted from freely available fonts to create a dataset similar to MNIST. It is divided into two parts: a relatively small, hand-cleaned portion of approximately 19k samples and a larger, uncleaned portion of about 500k samples. There are ten classes, the letters A-J, drawn from various fonts. Here we will use the hand-cleaned portion; the classes are:

1. A

2. B

3. C

4. D

5. E

6. F

7. G

8. H

9. I

10. J

Get data

import deeplake
ds = deeplake.load('hub://activeloop/not-mnist-small')
Opening dataset in read-only mode as you don't have write permissions.
This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop/not-mnist-small

hub://activeloop/not-mnist-small loaded successfully.
images = ds.tensors['images'].numpy().astype('float32') / 255.
labels = ds.tensors['labels'].numpy().astype(int)
images.shape
(18724, 28, 28)
labels.shape
(18724, 1)
X_train, X_test, y_train, y_test = train_test_split(images, labels, test_size=0.2, random_state=1)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((14979, 28, 28), (3745, 28, 28), (14979, 1), (3745, 1))
plt.imshow(X_train[0], cmap='gray')
<matplotlib.image.AxesImage>

y_train = y_train.squeeze(-1)
y_test = y_test.squeeze(-1)
def char(y): return 'ABCDEFGHIJ'[y]
char(y_train[0])
'I'
X_train = X_train.reshape(-1, 784)
X_test = X_test.reshape(-1, 784)
X_train.shape, X_test.shape
((14979, 784), (3745, 784))
lst1 = list(X_train[0])
vals = lst1[290: 300]
vals
[1.0, 0.9843137, 1.0, 1.0, 1.0, 1.0, 0.9843137, 1.0, 0.68235296, 0.0]
def chunks(x, sz):
    for i in range(0, len(x), sz): yield x[i: i + sz]
list(chunks(vals, 5))
[[1.0, 0.9843137, 1.0, 1.0, 1.0], [1.0, 0.9843137, 1.0, 0.68235296, 0.0]]
val_iter = chunks(vals, 5)
next(val_iter)
[1.0, 0.9843137, 1.0, 1.0, 1.0]
next(val_iter)
[1.0, 0.9843137, 1.0, 0.68235296, 0.0]
next(val_iter)
StopIteration: ignored
plt.imshow(list(chunks(lst1, 28)), cmap='gray')
<matplotlib.image.AxesImage>

from itertools import islice
it = iter(vals)
islice(it, 5)
<itertools.islice>
list(islice(it, 5))
[1.0, 0.9843137, 1.0, 1.0, 1.0]
for i in islice(it, 5):
    print(i)
1.0
0.9843137
1.0
0.68235296
0.0
it = iter(lst1)
img = list(iter(lambda : list(islice(it, 28)), []))
plt.imshow(img)
<matplotlib.image.AxesImage>
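The trick here is the two-argument form of iter: iter(callable, sentinel) calls the callable repeatedly and stops as soon as it returns the sentinel, here the empty list produced once the underlying iterator is exhausted. A minimal sketch of the same idea (hypothetical values, not from the notebook):

from itertools import islice
nums = iter(range(5))
# each call pulls up to 2 items; an exhausted iterator gives [], the sentinel, which stops iteration
list(iter(lambda: list(islice(nums, 2)), []))  # [[0, 1], [2, 3], [4]]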

Matrix and Tensor

img[20][15]
1.0
class Matrix:
    def __init__(self, xs): self.xs = xs
    def __getitem__(self, idxs): return self.xs[idxs[0]][idxs[1]]
m = Matrix(img)
m[20, 15]
1.0
import torch
from torch import tensor
tensor([1, 2, 3])
tensor([1, 2, 3])
tens = tensor(img); tens[20, 15]
tensor(1.)
X_train, y_train, X_test, y_test = map(tensor, (X_train, y_train, X_test, y_test))
X_train.type()
'torch.FloatTensor'
X_train.shape
torch.Size([14979, 784])
imgs = X_train.reshape((-1, 28, 28))
imgs.shape
torch.Size([14979, 28, 28])
plt.imshow(imgs[0])
<matplotlib.image.AxesImage>

Random numbers

rnd_state = None
def seed(a):
    global rnd_state
    a, x = divmod(a, 30268)
    a, y = divmod(a, 30306)
    a, z = divmod(a, 30322)

    rnd_state = int(x) + 1, int(y) + 1, int(z) + 1
seed(457428938475)
rnd_state
(4976, 20238, 499)
def rand():
    global rnd_state
    x, y, z = rnd_state
    x = (171 * x) % 30269
    y = (172 * y) % 30307
    z = (170 * z) % 30323
    rnd_state = x, y, z
    return (x / 30269 + y / 30307 + z / 30323) % 1.0
rand(), rand(), rand()
(0.7645251082582081, 0.7920889799553945, 0.06912886811267205)
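This pair of functions is the Wichmann-Hill generator: three small linear congruential generators whose scaled outputs are combined modulo 1. It is the algorithm Python's random module used before switching to the Mersenne Twister.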
if os.fork(): print(f'In parent: {rand()}')
else:
    print(f'In child: {rand()}')
    os._exit(os.EX_OK)
In parent: 0.9559050644103264
In child: 0.9559050644103264
if os.fork(): print(f'In parent: {torch.rand(1)}')
else:
    print(f'In child: {torch.rand(1)}')
    os._exit(os.EX_OK)
In parent: tensor([0.7129])
In child: tensor([0.7129])
if os.fork(): print(f'In parent: {random.random()}')
else:
    print(f'In child: {random.random()}')
    os._exit(os.EX_OK)
In parent: 0.7579544029403025
In child: 0.1583037383506456
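The point of these fork experiments: fork copies the parent process's memory, including random-number generator state, so parent and child can produce identical "random" values unless the generator is reseeded after the fork. This is exactly the trap to watch for when spawning worker processes, for example for parallel data loading.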
plt.plot([rand() for _ in range(50)]);

plt.hist([rand() for _ in range(100_000)])
(array([10045.,  9990.,  9701., 10008.,  9934.,  9935., 10182., 10210.,
         9924., 10071.]),
 array([3.03151161e-06, 1.00002109e-01, 2.00001186e-01, 3.00000263e-01,
        3.99999341e-01, 4.99998418e-01, 5.99997495e-01, 6.99996572e-01,
        7.99995650e-01, 8.99994727e-01, 9.99993804e-01]),
 <BarContainer object of 10 artists>)

8.74 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

143 µs ± 60 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Matrix Multiplication
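Recall the definition of matrix multiplication: each entry of C = AB is the dot product of a row of A with a column of B,

$$C_{ij} = \sum_k A_{ik} B_{kj}$$

The three nested loops below compute exactly this, one scalar multiply-add at a time.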

torch.manual_seed(1)
weights = torch.randn(784, 10)
bias = torch.zeros(10)
m1 = X_test[:5]
m2 = weights
m1.shape, m2.shape
(torch.Size([5, 784]), torch.Size([784, 10]))
ar, ac = m1.shape
br, bc = m2.shape
(ar, ac), (br, bc)
((5, 784), (784, 10))
t1 = torch.zeros(ar, bc)
t1.shape
torch.Size([5, 10])
for i in range(ar):
    for j in range(bc):
        for k in range(ac):
            t1[i, j] += m1[i, k] * m2[k, j]
t1
tensor([[  8.2509, -32.2384, -31.2659,  29.6144,  37.1024, -10.6518, -18.7646,
          -9.3841,  13.5039,   3.1983],
        [ 15.3977, -19.5008,  -5.6448,  30.4349,  59.5697,   6.9771, -25.8890,
         -40.2893, -17.3914,  43.8492],
        [  7.8317,  -4.5760,   5.5933, -13.6489,  28.6691, -11.6068, -25.0293,
         -16.1800,   7.5447,  -6.0635],
        [  5.7755,  -7.0180,   5.6232, -14.6877,  40.6079,   4.7634, -18.4372,
         -18.1103,  -4.8335,  26.0490],
        [-10.4290, -20.8873, -15.5178,  11.3828,  39.0230, -28.5750, -34.5579,
          -0.3404,   8.5624,  -8.3245]])
torch.set_printoptions(precision=2, linewidth=140, sci_mode=False)
t1
tensor([[  8.25, -32.24, -31.27,  29.61,  37.10, -10.65, -18.76,  -9.38,  13.50,   3.20],
        [ 15.40, -19.50,  -5.64,  30.43,  59.57,   6.98, -25.89, -40.29, -17.39,  43.85],
        [  7.83,  -4.58,   5.59, -13.65,  28.67, -11.61, -25.03, -16.18,   7.54,  -6.06],
        [  5.78,  -7.02,   5.62, -14.69,  40.61,   4.76, -18.44, -18.11,  -4.83,  26.05],
        [-10.43, -20.89, -15.52,  11.38,  39.02, -28.58, -34.56,  -0.34,   8.56,  -8.32]])
import numpy as np
np.set_printoptions(precision=2, linewidth=140)
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            for k in range(ac):
                c[i, j] += a[i, k] * b[k, j]
    return c
ar*bc*ac
39200
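That 39,200 is ar × bc × ac = 5 × 10 × 784: one multiply-add per iteration of the innermost loop, every one of them executed by the Python interpreter, which is why this version is so slow.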

Numba

from numba import njit
@njit
def dot(a, b):
    res = 0
    for i in range(len(a)):
        res += a[i] * b[i]
    return res
from numpy import array
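Numba compiles dot to machine code the first time it is called with a given argument type, so the first timed call pays the compilation cost (hundreds of milliseconds), while the second call reuses the cached compiled code and takes only microseconds: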
CPU times: user 248 ms, sys: 21.4 ms, total: 270 ms
Wall time: 416 ms
20.0
CPU times: user 27 µs, sys: 0 ns, total: 27 µs
Wall time: 30 µs
20.0

Now only two of our loops are running in Python, not three:

def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i, j] += dot(a[i, :], b[:, j])
    return c
m1a, m2a = m1.numpy(), m2.numpy()
from fastcore.test import *
test_close(t1, matmul(m1a, m2a), eps=1e-4)

Elementwise Ops

a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
a, b
(tensor([10.,  6., -4.]), tensor([2., 8., 7.]))
a + b
tensor([12., 14.,  3.])
(a < b).float().mean()
tensor(0.67)
m = tensor([[1, 2, 3], [4, 5, 6], [7, 8, 9]]); m
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

Frobenius Norm
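The Frobenius norm of a matrix is the square root of the sum of the squares of all its entries:

$$\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$$

(m*m).sum() computes the sum of squares, and .sqrt() completes the calculation.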

sf = (m*m).sum()
sf
tensor(285)
sf.sqrt()
tensor(16.88)
m[2, :], m[:, 2]
(tensor([7, 8, 9]), tensor([3, 6, 9]))
m[2]
tensor([7, 8, 9])
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i, j] = (a[i, :] * b[:, j]).sum()
    return c
test_close(t1, matmul(m1, m2), eps=1e-4)
1.33 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        for j in range(bc):
            c[i, j] = torch.dot(a[i, :], b[:, j])
    return c
test_close(t1, matmul(m1, m2), eps=1e-4)
906 µs ± 57.7 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)

Broadcasting
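Broadcasting describes how PyTorch (following NumPy) handles arithmetic between tensors of different shapes: the smaller operand behaves as if it were expanded to match the larger one, without any data actually being copied. The simplest case is a tensor and a scalar: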

a
tensor([10.,  6., -4.])
a > 0
tensor([ True,  True, False])
a + 1
tensor([11.,  7., -3.])
m
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
2 * m
tensor([[ 2,  4,  6],
        [ 8, 10, 12],
        [14, 16, 18]])

Broadcasting a vector to matrix

c = tensor([10, 20, 30]); c
tensor([10, 20, 30])
m
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])
m.shape, c.shape
(torch.Size([3, 3]), torch.Size([3]))
m + c
tensor([[11, 22, 33],
        [14, 25, 36],
        [17, 28, 39]])
c + m
tensor([[11, 22, 33],
        [14, 25, 36],
        [17, 28, 39]])
t = c.expand_as(m); t
tensor([[10, 20, 30],
        [10, 20, 30],
        [10, 20, 30]])
t.storage()
UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  t.storage()
 10
 20
 30
[torch.storage.TypedStorage(dtype=torch.int64, device=cpu) of size 3]
t.stride(), t.shape
((0, 1), torch.Size([3, 3]))
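The stride of (0, 1) shows why expand_as is cheap: moving to the next row advances the storage pointer by 0 elements, so all three rows of t are views onto the same 3-element storage and no data is copied.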
c.unsqueeze(0), c[None, :]
(tensor([[10, 20, 30]]), tensor([[10, 20, 30]]))
c.shape, c.unsqueeze(0).shape
(torch.Size([3]), torch.Size([1, 3]))
c.shape, c.unsqueeze(1).shape
(torch.Size([3]), torch.Size([3, 1]))
c[None].shape, c[..., None].shape
(torch.Size([1, 3]), torch.Size([3, 1]))
c[:, None].expand_as(m)
tensor([[10, 10, 10],
        [20, 20, 20],
        [30, 30, 30]])
m + c[:, None]
tensor([[11, 12, 13],
        [24, 25, 26],
        [37, 38, 39]])
m + c[None, :]
tensor([[11, 22, 33],
        [14, 25, 36],
        [17, 28, 39]])

Broadcasting rules
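When operating on two tensors, PyTorch compares their shapes element-wise starting from the trailing dimension. Two dimensions are compatible when they are equal, or when one of them is 1 (that dimension is then broadcast); a missing leading dimension is treated as 1.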

c[None, :]
tensor([[10, 20, 30]])
c[None, :].shape
torch.Size([1, 3])
c[:, None]
tensor([[10],
        [20],
        [30]])
c[:, None].shape
torch.Size([3, 1])
c[:, None] * c[None, :]
tensor([[100, 200, 300],
        [200, 400, 600],
        [300, 600, 900]])
c[None] > c[:, None]
tensor([[False,  True,  True],
        [False, False,  True],
        [False, False, False]])

Matmul with broadcasting

digit = m1[0]
digit.shape, m2.shape
(torch.Size([784]), torch.Size([784, 10]))
digit[:, None].shape
torch.Size([784, 1])
digit[:, None].expand_as(m2).shape
torch.Size([784, 10])
(digit[:, None] * m2).shape
torch.Size([784, 10])
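This gives a whole row of the result at once: a[i, :, None] has shape (784, 1), broadcasting it against b of shape (784, 10) multiplies row i against every column of b simultaneously, and summing over dim 0 collapses the 784 products into the 10 entries of row i.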
def matmul(a, b):
    (ar, ac), (br, bc) = a.shape, b.shape
    c = torch.zeros(ar, bc)
    for i in range(ar):
        c[i] = (a[i,:, None] * b).sum(dim=0)
    return c
test_close(t1, matmul(m1, m2), eps=1e-4)
276 µs ± 30.4 µs per loop (mean ± std. dev. of 7 runs, 50 loops each)
tr = matmul(X_train, weights)
tr
tensor([[ 16.25, -15.36, -25.81,  ..., -21.11, -26.31,  17.92],
        [  5.01,   0.29, -23.21,  ..., -14.26,  23.98,  24.13],
        [  5.70,  -3.06,  -9.48,  ...,  -9.08,  11.74,   0.88],
        ...,
        [-12.85,  -6.30,  -4.18,  ..., -22.07,  16.46,  -0.86],
        [ 13.19, -21.23, -10.63,  ...,  -9.95,  19.52,   3.76],
        [ -1.56,  -5.29,  -8.10,  ...,  -2.18,  -0.69,   1.49]])
tr.shape
torch.Size([14979, 10])
CPU times: user 430 ms, sys: 3.76 ms, total: 434 ms
Wall time: 436 ms

Einstein Summation

Einstein summation is a compact notation for combining products and sums in a general way. The key rules are:

- Repeating letters between input arrays means that values along those axes will be multiplied together.
- Omitting a letter from the output means that values along that axis will be summed.

m1.shape, m2.shape
(torch.Size([5, 784]), torch.Size([784, 10]))
mr = torch.einsum('ik,kj->ikj', m1, m2)
mr.shape
torch.Size([5, 784, 10])
mr.sum(1)
tensor([[  8.25, -32.24, -31.27,  29.61,  37.10, -10.65, -18.76,  -9.38,  13.50,   3.20],
        [ 15.40, -19.50,  -5.64,  30.43,  59.57,   6.98, -25.89, -40.29, -17.39,  43.85],
        [  7.83,  -4.58,   5.59, -13.65,  28.67, -11.61, -25.03, -16.18,   7.54,  -6.06],
        [  5.78,  -7.02,   5.62, -14.69,  40.61,   4.76, -18.44, -18.11,  -4.83,  26.05],
        [-10.43, -20.89, -15.52,  11.38,  39.02, -28.58, -34.56,  -0.34,   8.56,  -8.32]])
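Because k appears in the output spec 'ikj', nothing is summed: mr holds every individual product, and summing over the k axis (dim 1) recovers the matrix product. Omitting k from the output lets einsum perform that sum itself: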
torch.einsum('ik,kj->ij', m1,m2)
tensor([[  8.25, -32.24, -31.27,  29.61,  37.10, -10.65, -18.76,  -9.38,  13.50,   3.20],
        [ 15.40, -19.50,  -5.64,  30.43,  59.57,   6.98, -25.89, -40.29, -17.39,  43.85],
        [  7.83,  -4.58,   5.59, -13.65,  28.67, -11.61, -25.03, -16.18,   7.54,  -6.06],
        [  5.78,  -7.02,   5.62, -14.69,  40.61,   4.76, -18.44, -18.11,  -4.83,  26.05],
        [-10.43, -20.89, -15.52,  11.38,  39.02, -28.58, -34.56,  -0.34,   8.56,  -8.32]])
def matmul(a, b): return torch.einsum('ik,kj->ij', a, b)
test_close(tr, matmul(X_train, weights), eps=1e-4)
7.54 ms ± 489 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)

pytorch op

We can use PyTorch's matrix-multiplication function or the @ operator directly.

test_close(tr, X_train@weights, eps=1e-3)
7.24 ms ± 502 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
7.53 ms ± 501 µs per loop (mean ± std. dev. of 7 runs, 5 loops each)
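For reference, the @ operator dispatches to the same operation as torch.matmul; a quick sanity check (a minimal sketch using the tensors defined above):

# @ and torch.matmul compute the same product
test_close(X_train@weights, torch.matmul(X_train, weights), eps=1e-4)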

CUDA

def matmul(grid, a, b, c):
    i, j = grid
    if i < c.shape[0] and j < c.shape[1]:
        tmp = 0.
        for k in range(a.shape[1]): tmp += a[i, k] * b[k, j]
        c[i, j] = tmp
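This is a CPU simulation of how a CUDA kernel works: each call computes a single cell of the output, identified by its grid coordinates, and the bounds check lets the grid be larger than the output. Calling it once at grid position (0, 0) fills in only c[0, 0]; launch_kernel below then loops over every grid position, doing serially what the GPU will do for us in parallel.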
res = torch.zeros(ar, bc)
matmul((0, 0), m1, m2, res)
res
tensor([[8.25, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00],
        [0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00]])
def launch_kernel(kernel, grid_x, grid_y, *args, **kwargs):
    for i in range(grid_x):
        for j in range(grid_y): kernel((i, j), *args, **kwargs)
res = torch.zeros(ar, bc)
launch_kernel(matmul, ar, bc, m1, m2, res)
res
tensor([[  8.25, -32.24, -31.27,  29.61,  37.10, -10.65, -18.76,  -9.38,  13.50,   3.20],
        [ 15.40, -19.50,  -5.64,  30.43,  59.57,   6.98, -25.89, -40.29, -17.39,  43.85],
        [  7.83,  -4.58,   5.59, -13.65,  28.67, -11.61, -25.03, -16.18,   7.54,  -6.06],
        [  5.78,  -7.02,   5.62, -14.69,  40.61,   4.76, -18.44, -18.11,  -4.83,  26.05],
        [-10.43, -20.89, -15.52,  11.38,  39.02, -28.58, -34.56,  -0.34,   8.56,  -8.32]])
from numba import cuda
@cuda.jit
def matmul(a, b, c):
    i, j = cuda.grid(2)
    if i < c.shape[0] and j < c.shape[1]:
        tmp = 0.
        for k in range(a.shape[1]): tmp += a[i, k] * b[k, j]
        c[i, j] = tmp
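cuda.grid(2) returns this thread's absolute (row, column) position across the whole grid of blocks; because the grid is rounded up to whole blocks, some threads land outside the output, and the bounds check makes them do nothing.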
r = np.zeros(tr.shape)
m1g,m2g,rg = cuda.to_device(X_train), cuda.to_device(weights), cuda.to_device(r)
cuda.is_available()
True
m1g,m2g,rg = map(cuda.to_device, (X_train, weights, r))
r.shape
(14979, 10)
r
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
TPB = 16
rr, rc = r.shape
blockspergrid = (math.ceil(rr / TPB), math.ceil(rc / TPB))
blockspergrid
(937, 1)
matmul[blockspergrid, (TPB, TPB)](m1g, m2g, rg)
r = rg.copy_to_host()
test_close(tr, r, eps=1.03)
matmul[blockspergrid, (TPB, TPB)](m1g, m2g, rg)
r = rg.copy_to_host()
4.98 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
m1c, m2c = X_train.cuda(), weights.cuda()
r = (m1c @ m2c).cpu()
673 µs ± 65.5 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)