The purpose of this post is to show how multi-threading can be used to overlap data processing with the transfer of data from pageable to page-locked memory. I was motivated to examine this issue while looking at the effect of the pin_memory flag in PyTorch’s DataLoader. Let me provide some background information first.
Data loading is the first step in any machine learning (ML) application. During data loading on a GPU-enabled machine, a batch of data is loaded from disk into main memory and then transferred to the GPU, where compute-intensive tasks such as the forward pass, the backward pass (calculation of gradients) and the parameter updates are carried out. These tasks are highly parallel, i.e., many calculations can proceed at the same time, so they run very efficiently on a massively parallel compute platform such as a GPU.
Host-GPU Interconnect
GPUs are usually connected to the motherboard over a PCIe connection, and data from the main (host) memory must be transferred to the GPU memory over this PCIe link. What is the expected bandwidth of this link? To answer this question, let's look at some basic info about PCIe links.
PCIe link performance is characterized by transfer rate and encoding scheme. The transfer rate refers to the total number of bits transmitted per second, including both data and overhead bits. The encoding scheme is the ratio of data bits to the total number of bits transmitted. There have been several revisions of PCIe, varying in transfer rate and encoding scheme. The table below shows the per-lane transfer rate and throughput for PCIe 1.0, 2.0 and 3.0. For more information, see the PCIe Wikipedia article.
| Version | Encoding | Transfer Rate (per lane) | Throughput (per lane) |
|---|---|---|---|
| PCIe 1.0 | 8b/10b | 2.5 Gb/sec | 8/10 × 2.5/8 = 250 MB/sec |
| PCIe 2.0 | 8b/10b | 5 Gb/sec | 8/10 × 5/8 = 500 MB/sec |
| PCIe 3.0 | 128b/130b | 8 Gb/sec | 128/130 × 8/8 ≈ 984.6 MB/sec |
Physical PCIe links may contain from one to 32 lanes. Lane counts are written with an “x” prefix (for example, “x8” represents an eight-lane card or slot), with x16 being the largest size in common use. The lane count together with the throughput per lane gives us the expected data transfer bandwidth. For example, an x16 PCIe 3.0 connection is expected to have a maximum bandwidth of 16 × 984.6 MB/sec ≈ 15.75 GB/sec.
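As a quick sanity check, the same arithmetic can be written out in a few lines of Python (the constants below are the PCIe 3.0 values from the table; swap in the values for your own link):

# Expected bandwidth = number of lanes * per-lane throughput, where
# per-lane throughput = transfer rate * encoding efficiency / 8 bits per byte
transfer_rate_gbps = 8.0         # PCIe 3.0 per-lane transfer rate (Gb/sec)
encoding_efficiency = 128 / 130  # 128b/130b encoding
num_lanes = 16                   # an x16 slot

per_lane_mb = transfer_rate_gbps * 1000 * encoding_efficiency / 8
total_gb = num_lanes * per_lane_mb / 1000
print("per lane: {0:.1f} MB/sec, x{1}: {2:.2f} GB/sec".format(per_lane_mb, num_lanes, total_gb))
# prints: per lane: 984.6 MB/sec, x16: 15.75 GB/sec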
Identifying PCIe version and number of lanes on your system
To find out the PCIe version and number of lanes for the GPU interconnect on my system (Windows 10 x64, with an NVIDIA 1080 Ti GPU), I used the Python code below, which uses the pycuda CUDA-Python wrapper to query CUDA device info.
import pycuda.driver as cuda

cuda.init()
driver_ver = cuda.get_version()
print("CUDA Driver Version: {0}.{1}.{2}".format(driver_ver[0], driver_ver[1], driver_ver[2]))
num_cuda_devices = cuda.Device.count()
for i in range(0, num_cuda_devices):
    dev = cuda.Device(i)
    pci_bus_id = dev.pci_bus_id()
    dev_name = dev.name()
    print("device id: {0}, device name: {1}, bus_id: {2}".format(i, dev_name, pci_bus_id))
The pci_bus_id gives you the id of the PCI bus used by each GPU on your system. From the pci_bus_id, you can obtain the number of lanes as follows:
Linux
Run the Linux command lspci as shown below:
lspci -vv | grep 04:00.0 -C 100
Replace 04:00.0 with the bus id on your system, and look for the Width field under “LnkCap” in the resulting output.
Windows
Getting this info was surprisingly tricky on Windows. I tried installing the Windows Driver Kit and using the “devcon” utility that comes with it, but that was neither easy nor intuitive and I didn’t want to go down that rabbit hole. I ended up using the Windows Device Manager as follows:
Device Manager -> Display adapters -> right-click the GPU name and select “Properties” -> in the “Property” drop-down of the dialog box, “PCI max link speed” gives the PCIe version and “PCI max link width” gives the number of lanes (in hex).
Host memory types
Turning to host memory, there are two main categories – pageable (or “un-pinned”) and page-locked (or “pinned”) memory. When you allocate memory using malloc in a C program, the allocation is done in pageable memory. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver first allocates a temporary pinned host array, copies the host data to the pinned array, and then transfers the data from the pinned array to device memory, as illustrated below (see this page for more info).
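To make the two memory types concrete, here is a minimal pycuda sketch (not the benchmark script from the appendix) that transfers the same array first from pageable memory and then again after pinning it with register_host_memory:

import numpy as np
import pycuda.autoinit   # creates a CUDA context on the first available GPU
import pycuda.driver as cuda

a = np.random.randn(10000000).astype(np.float32)  # ordinary numpy array in pageable memory
mem_gpu = cuda.mem_alloc(a.nbytes)                # device-side allocation

# 1. Transfer from pageable memory: the driver stages the data through an
#    internal pinned buffer before the DMA transfer to the GPU.
cuda.memcpy_htod(mem_gpu, a)

# 2. Pin the array, then transfer: the DMA engine can now read host memory
#    directly, so the extra staging copy is avoided.
a_pin = cuda.register_host_memory(a)
cuda.memcpy_htod(mem_gpu, a_pin)

a_pin.base.unregister()
mem_gpu.free()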
Performance of host-to-device transfer from pinned and un-pinned memory
It is expected that data transfer from pinned memory to the device will have a higher bandwidth than transfer from un-pinned memory. To verify this, I wrote a Python script (see appendix) to measure the host-to-device (h2d) transfer bandwidth (BW) when the data is located in pageable vs. page-locked memory. The script uses the pycuda library, which exposes the low-level CUDA API in Python. The results for 5 data sizes are shown in the table below.
| Data Size (MB) | 20 | 40 | 80 | 120 | 160 |
|---|---|---|---|---|---|
| Transfer BW, page-locked (GB/sec) | 11.36 | 11.44 | 12.02 | 12.09 | 12.13 |
| Transfer BW, pageable (GB/sec) | 5.06 | 5.31 | 5.39 | 5.35 | 5.04 |
As expected, the transfer bandwidth from page-locked memory (~12 GB/sec) is much closer to the expected bandwidth of the PCIe link on my system than the bandwidth from pageable memory.
The figure below plots the transfer times for the two types of transfer involved:
- h2d transfer (no pinning): In this case, data is copied from pageable memory directly to GPU memory (green plot). The CUDA driver takes care of staging the data through page-locked memory.
- h2d transfer (with pinning): In this case, data is allocated in pageable memory and explicitly copied to page-locked memory using the register_host_memory function in pycuda. This host-to-host (h2h) transfer time is shown in the blue plot. The data in page-locked memory is then copied to the GPU; this transfer time is shown in the red plot. The total time is the sum of the two and is shown in the black plot.
As expected, the two end-to-end transfer times (the green and the black plots) are nearly identical. This is because for an h2d transfer from pageable memory, the CUDA driver internally performs the same transfer from pageable to page-locked memory.
We see that transferring data from unpinned to pinned memory is an expensive operation. Thus, transfer from page-locked memory is more efficient only if either the data can be allocated directly in page-locked memory, or the data processing on the GPU can be overlapped with the data transfer so that the transfer latency is partly or completely hidden. CUDA provides APIs such as cudaMallocHost() for the first method.
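In pycuda, the closest analogue of cudaMallocHost() is pagelocked_empty(), which allocates the host array directly in page-locked memory so the extra host-to-host copy never happens. A minimal sketch:

import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda

n = 10000000
# Allocate the host buffer directly in page-locked memory
# (the pycuda counterpart of cudaMallocHost in the CUDA C API).
a_pin = cuda.pagelocked_empty(n, np.float32)
a_pin[:] = np.random.randn(n)   # fill the pinned buffer in place

mem_gpu = cuda.mem_alloc(a_pin.nbytes)
cuda.memcpy_htod(mem_gpu, a_pin)  # transfers at the pinned-memory bandwidth
mem_gpu.free()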
Dataloader in PyTorch 1.0
PyTorch’s DataLoader takes the second approach. It uses multiple worker processes to load batch data from disk into pageable memory and, when the pin_memory flag is True, spins up a separate thread that copies each loaded batch into pinned memory. If pin_memory is False, the data loaded by the worker processes is returned directly when the user calls enumerate or next(iter) on the loader (see the code for the ImageNet example in the PyTorch docs). The figure below shows a schematic of the two options.
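For reference, here is a minimal sketch of how the flag is used from the training-loop side; train_dataset is a placeholder for any Dataset, and the loop follows the pattern of the ImageNet example in the PyTorch docs:

import torch
from torch.utils.data import DataLoader

# train_dataset: placeholder for your Dataset instance.
# pin_memory=True: worker processes load batches into pageable memory and a
# separate pinning thread copies each batch into page-locked memory.
# pin_memory=False: batches are returned directly from the worker processes.
loader = DataLoader(train_dataset, batch_size=64, shuffle=True,
                    num_workers=4, pin_memory=True)

for images, labels in loader:
    # With pinned batches, the host-to-device copy can be issued asynchronously.
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    # forward pass, backward pass, parameter update ...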
Threaded data transfer experiment
I wrote a Python script (see appendix) that implements a simple threaded loader to parallelize the copy to page-locked memory with the data processing. The script copies 160 MB of data 100 times using three methods:
- Data is allocated in pageable memory and copied to the GPU using pycuda’s memcpy_htod() function on the main thread
- Data is allocated in pageable memory, copied explicitly to page-locked memory using pycuda’s PageLockedMemoryPool API and then copied to the GPU on the main thread
- A separate thread is used to perform the copy to page-locked memory, while the main thread copies the data to the GPU and processes it
The data processing time is simulated by adding a sleep of 40 ms per batch (time.sleep(0.04) in the script), standing in for the time the GPU spends on the forward pass, backward pass and parameter update for each batch.
The total processing times for the three methods are shown in the table below.
| | H2D (pageable) | H2D (page-locked) | H2D (page-locked, threaded) |
|---|---|---|---|
| Total Processing Time (sec), 100 runs | 7.90 | 7.92 | 4.92 |
The threaded option is able to almost completely hide the data transfer latency. The total execution time is now dominated by the simulated processing time (40 ms × 100 = 4 sec). We can also see this by profiling the code using cProfile and visualizing the result using snakeviz. Note that the actual cumulative sleep time is not exactly 4 sec because the resolution of the time.sleep() function on Windows is ~3-4 ms.
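If you want to reproduce the profile, something along these lines works, assuming the threaded script is saved as threaded_transfer.py (a filename I made up for illustration) and snakeviz has been installed with pip:

python -m cProfile -o transfer.prof threaded_transfer.py
snakeviz transfer.prof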
Incidentally, it is not easy to profile multiple threads in Python. cProfile works well for the main thread, but doesn’t give any information about what’s happening in the other threads. I also played around with Intel VTune, but didn’t have much luck. If you know of good profiling tools for multi-threaded Python apps, let me know!
Conclusion
To conclude, in this post we looked at some basic information about PCIe links that helps you determine the expected host-to-device transfer bandwidth. We then considered an example showing that transfer from pinned memory to device memory is much more efficient than transfer from un-pinned memory, but that the un-pinned to pinned copy is itself an expensive operation. Finally, we examined how multi-threading can be used to perform the un-pinned to pinned copy concurrently with the host-to-device transfer and the data processing on the device. Given a sufficiently large processing time, this can almost completely hide the latency of the data transfer.
Hope you found this useful. Please leave a comment if you did!
Appendix
Code for transfer BW performance comparison
import pycuda.driver as cuda
from pycuda.tools import PageLockedMemoryPool
import numpy as np
import time
import ctypes
import pdb
from queue import Queue
from threading import Thread
from pycuda.tools import make_default_context
import matplotlib.pyplot as plt
import threading

# Initialize CUDA
cuda.init()
global ctx
ctx = make_default_context()  # will initialize the first device it finds
dev = ctx.get_device()


def _finish_up():
    global ctx
    ctx.pop()
    ctx = None
    from pycuda.tools import clear_context_caches
    clear_context_caches()


import atexit
atexit.register(_finish_up)

num_elems = [5000000, 10000000, 20000000, 30000000, 40000000]


# prints pci_bus_id, device name and device id for installed GPUs. You can run the Linux lspci
# command on the bus_id to obtain information about the number of PCIe lanes on that bus.
# This will give you the expected bandwidth.
def print_device_info():
    driver_ver = cuda.get_version()
    print("CUDA Driver Version: {0}.{1}.{2}".format(driver_ver[0], driver_ver[1], driver_ver[2]))
    num_cuda_devices = cuda.Device.count()
    for i in range(0, num_cuda_devices):
        dev = cuda.Device(i)
        pci_bus_id = dev.pci_bus_id()
        dev_name = dev.name()
        print("device id: {0}, device name: {1}, bus_id: {2}".format(i, dev_name, pci_bus_id))


# Helper function to copy src array to destination using ctypes memmove
def copy_np_to_pinned_memory(src, dest):
    src_ = src.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
    dest_ = dest.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
    sz = src.size * ctypes.sizeof(ctypes.c_float)
    ctypes.memmove(dest_, src_, sz)


# This function measures the time taken to transfer data from host-to-device (h2d) when:
# 1. the source is in unpinned (pageable) memory
# 2. the source is in pinned memory. In this case, we also measure the time taken to transfer
#    data from unpinned to pinned memory.
# Times are measured for different data sizes and plotted. Data transfer bandwidth is also
# calculated from the transfer times.
def compare_performance():
    # a quick warm up..
    n = 25000000
    a = np.random.randn(n).astype(np.float32)
    # allocate space on GPU
    mem_gpu = cuda.mem_alloc(a.nbytes)
    cuda.memcpy_htod(mem_gpu, a)
    # free space on GPU
    mem_gpu.free()

    h2d_nopin = []
    h2d_nopin_bw = []
    # measure timing without pinning
    for n in num_elems:
        # the data to be transferred
        a = np.random.randn(n).astype(np.float32)
        # allocate space on GPU
        mem_gpu = cuda.mem_alloc(a.nbytes)
        # only measure h2d transfer time
        start = time.perf_counter()
        cuda.memcpy_htod(mem_gpu, a)
        te = time.perf_counter() - start  # te: time elapsed
        h2d_nopin.append(te)
        h2d_nopin_bw.append(a.nbytes / (10**9 * te))  # convert to a bandwidth
        # free space on GPU
        mem_gpu.free()

    # now do pinning and measure time to pin and time to transfer
    h2h_pinned = []     # records the transfer time from unpinned -> pinned memory
    h2d_pin = []        # records the host to device transfer time with data in pinned memory
    h2d_pin_total = []  # records the total (sum of the previous two)
    h2d_pin_bw = []     # h2d_pin, converted to a bandwidth (GB/sec)
    for i, n in enumerate(num_elems):
        a = np.random.randn(n).astype(np.float32)
        # allocate space on GPU
        mem_gpu = cuda.mem_alloc(a.nbytes)
        # allocate page locked memory
        a_pin = cuda.register_host_memory(a)
        # copy data from np array to pinned memory and measure transfer time
        start = time.perf_counter()
        copy_np_to_pinned_memory(a, a_pin)
        te = time.perf_counter() - start  # te: time elapsed
        h2h_pinned.append(te)
        # measure h2d transfer time
        start = time.perf_counter()
        cuda.memcpy_htod(mem_gpu, a_pin)
        te = time.perf_counter() - start  # te: time elapsed
        h2d_pin.append(te)
        h2d_pin_bw.append(a.nbytes / (10**9 * te))
        h2d_pin_total.append(h2d_pin[i] + h2h_pinned[i])
        # free allocated pinned memory
        a_pin.base.unregister()
        # free space on GPU
        mem_gpu.free()

    fig = plt.figure()
    num_elems_mb = [x * 4 / 10**6 for x in num_elems]
    plt.plot(num_elems_mb, h2d_nopin, 'g', label='h2d transfer_time (no pinning)')
    plt.plot(num_elems_mb, h2d_pin, 'r', label='h2d transfer_time (with pinning)')
    plt.plot(num_elems_mb, h2h_pinned, 'b', label='h2h transfer_time')
    plt.plot(num_elems_mb, h2d_pin_total, 'k', label='h2d transfer_time (with pinning, total)')
    plt.legend()
    plt.xlabel('data size (MB)')
    plt.ylabel('time (sec)')
    plt.show()


if __name__ == '__main__':
    print_device_info()
    compare_performance()
Code for threaded data transfer
import pycuda.driver as cuda
from pycuda.tools import PageLockedMemoryPool
import numpy as np
import time
import ctypes
import pdb
from queue import Queue
from threading import Thread
from pycuda.tools import make_default_context
import matplotlib.pyplot as plt
import threading

# Initialize CUDA
cuda.init()
global ctx
ctx = make_default_context()  # will initialize the first device it finds
dev = ctx.get_device()


def _finish_up():
    global ctx
    ctx.pop()
    ctx = None
    from pycuda.tools import clear_context_caches
    clear_context_caches()


import atexit
atexit.register(_finish_up)


def print_device_info():
    driver_ver = cuda.get_version()
    print("CUDA Driver Version: {0}.{1}.{2}".format(driver_ver[0], driver_ver[1], driver_ver[2]))
    num_cuda_devices = cuda.Device.count()
    for i in range(0, num_cuda_devices):
        dev = cuda.Device(i)
        pci_bus_id = dev.pci_bus_id()
        dev_name = dev.name()
        print("device id: {0}, device name: {1}, bus_id: {2}".format(i, dev_name, pci_bus_id))


def copy_np_to_pinned_memory(src, dest):
    src_ = src.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
    dest_ = dest.ctypes.data_as(ctypes.POINTER(ctypes.c_float))
    sz = src.size * ctypes.sizeof(ctypes.c_float)
    ctypes.memmove(dest_, src_, sz)


# loader thread: copies the data into page-locked memory and puts it on the queue
def dataloader(a, q, mp, n):
    ctx2 = make_default_context()
    for epoch in range(num_epochs):
        a_pin = mp.allocate((n, 1), np.float32)
        copy_np_to_pinned_memory(a, a_pin)
        q.put(a_pin, block=True)
    q.put(None)
    while toKill is False:
        time.sleep(0.1)
    ctx2.pop()


# transfers data from pageable memory to device
def evaluate_paged_transfer(a, n, mem_gpu):
    start = time.perf_counter()
    for epoch in range(num_epochs):
        print('processing batch: {0}'.format(epoch))
        cuda.memcpy_htod(mem_gpu, a)
        time.sleep(0.04)
    print('h2d transfer time (from pageable memory): {0:3.4f}'.format(time.perf_counter() - start))


# transfers data from pageable memory to page-locked (pinned) memory, and then to the device
def evaluate_pagelocked_transfer(a, n, mp, mem_gpu):
    start = time.perf_counter()
    for epoch in range(num_epochs):
        print('processing batch: {0}'.format(epoch))
        a_pin = mp.allocate((n, 1), np.float32)
        copy_np_to_pinned_memory(a, a_pin)
        cuda.memcpy_htod(mem_gpu, a_pin)
        time.sleep(0.04)
        a_pin.base.free()
    print('h2d transfer time (from page-locked memory): {0:3.4f}'.format(time.perf_counter() - start))


# splitting the copy to device into a function so cProfile can show it separately
def copy_to_device(mem_gpu, batch):
    cuda.memcpy_htod(mem_gpu, batch)
    batch.base.free()


# implements threaded data transfer: consumes pinned batches from the queue,
# copies them to the device and simulates the processing
def process(q, mem_gpu):
    epoch = 0
    while True:
        batch = q.get(block=True)
        print('processing batch: {0}'.format(epoch))
        epoch = epoch + 1
        if batch is None:
            break
        # print('batch address: {0}'.format(hex(id(batch))))
        # copy data (equivalent of calling .cuda() in PyTorch)
        copy_to_device(mem_gpu, batch)
        time.sleep(0.04)  # simulates processing on the GPU (run model, calculate gradients, update parameters)


if __name__ == '__main__':
    n = 40000000    # number of FP32 data elements
    num_epochs = 100
    a = np.random.randn(n).astype(np.float32)
    mem_gpu = cuda.mem_alloc(a.nbytes)
    q = Queue(maxsize=3)
    mp = PageLockedMemoryPool()
    toKill = False  # tells the dataloader thread when to quit

    print_device_info()
    # evaluate_paged_transfer(a, n, mem_gpu)
    # evaluate_pagelocked_transfer(a, n, mp, mem_gpu)

    # Threaded transfer
    cuda_thread = Thread(target=dataloader, args=(a, q, mp, n))
    cuda_thread.start()
    start = time.perf_counter()
    process(q, mem_gpu)
    print('h2d transfer time (from pageable memory, threaded loading): {0:3.4f}'.format(time.perf_counter() - start))
    mp.free_held()
    # send a signal to the dataloader thread to die
    toKill = True
    # wait till the dataloader thread dies
    cuda_thread.join()
    print('done')