Pipelining data processing and host-to-device data transfer

The purpose of this post is to show how multi-threading can be used to overlap data processing with the transfer of data from pageable to page-locked memory. I was motivated to examine this issue while looking at the effect of the pin_memory flag in PyTorch’s dataloader. Let me provide some background information first.

Data loading is the first step in any machine learning (ML) application. During data loading on a GPU-enabled machine, a batch of data is loaded from disk into main memory and then transferred to the GPU, where compute-intensive tasks such as the forward pass, backward pass (calculation of gradients) and parameter updates are carried out. These tasks are highly parallel, i.e., many calculations can proceed at the same time, so they run very efficiently on a massively parallel compute platform such as a GPU.

Host-GPU Interconnect

GPUs are usually connected to the motherboard over a PCIe connection, and data from the main (host) memory must be transferred to the GPU memory over this PCIe link. What is the expected bandwidth of this link? To answer this question, let’s look at some basic information about PCIe links.

PCIe link performance is characterized by transfer rate and encoding scheme. The transfer rate refers to the total number of bits transmitted per second, including both data and overhead bits. The encoding scheme is the ratio of data bits to total bits transmitted. There are several revisions of PCIe, varying in transfer rate and encoding scheme. The table below shows the per-lane transfer rate and throughput for PCIe 1.0, 2.0 and 3.0. For more information, see the PCIe Wikipedia article.

 

| PCIe revision | Encoding | Transfer rate (per lane) | Throughput (per lane) |
|---------------|----------|--------------------------|-----------------------|
| PCIe 1.0 | 8b/10b | 2.5 Gb/sec | 8/10 * 2.5/8 = 250 MB/sec |
| PCIe 2.0 | 8b/10b | 5 Gb/sec | 8/10 * 5/8 = 500 MB/sec |
| PCIe 3.0 | 128b/130b | 8 Gb/sec | 128/130 * 8/8 ≈ 984.6 MB/sec |

Physical PCIe links may contain from one to 32 lanes. Lane counts are written with an “x” prefix (for example, “x8” denotes an eight-lane card or slot), with x16 being the largest size in common use. The lane count together with the per-lane throughput gives the expected data transfer bandwidth. For example, an x16 PCIe 3.0 connection is expected to have a maximum bandwidth of 16 * 984.6 MB/sec ≈ 15.75 GB/sec.

Identifying PCIe version and number of lanes on your system

To find out the PCIe version and number of lanes for the GPU interconnect on my system (Windows 10 x64, with an Nvidia 1080 Ti GPU), I used a short Python snippet based on the pycuda CUDA-Python wrapper to query CUDA device info.
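A minimal sketch of such a query, assuming pycuda is installed and that device 0 is the GPU of interest:

```python
# Minimal sketch: query PCI-related info for the first CUDA device via pycuda.
import pycuda.autoinit          # initializes the CUDA driver and creates a context
import pycuda.driver as drv

dev = drv.Device(0)             # adjust the index if you have more than one GPU
print("Device:", dev.name())
print("PCI bus id:", dev.pci_bus_id())   # e.g. "0000:04:00.0"

# Dump all device attributes; the list includes the PCI bus/device/domain ids.
for attr, value in dev.get_attributes().items():
    print(f"{attr}: {value}")
```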

The pci_bus_id gives you the id of the bus used by GPUs on your system. From the pci_bus_id, you can obtain the number of lanes as follows:

Linux

Run the Linux command lspci as shown below.
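One way to do this (running as root so that the link capability fields are visible):

```
sudo lspci -vv -s 04:00.0 | grep -i "lnkcap\|lnksta"
```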

Replace 04:00.0 with the bus id on your system and look for the Width field under “LnkCap” (and “LnkSta”) in the resulting output.

Windows

Getting this info was surprisingly tricky on Windows. I tried installing the Windows Driver Kit and using the “devcon” utility that comes with it, but that was neither easy nor intuitive and I didn’t want to go down that rabbit hole. I ended up using the Windows Device Manager as follows:

Device Manager -> Display adapters -> right-click the GPU name and select “Properties” -> on the Details tab, in the “Property” drop-down, “PCI max link speed” gives the PCIe version and “PCI max link width” gives the number of lanes (in hex).

Host memory types

Turning to host memory, there are two main categories: pageable (or “un-pinned”) and page-locked (or “pinned”) memory. When you allocate memory using malloc in a C program, the allocation is done in pageable memory. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver first allocates a temporary page-locked host array, copies the host data into it, and then transfers the data from the pinned array to device memory, as illustrated below (see this page for more info).

Performance of host-to-device transfer from pinned and un-pinned memory

It is expected that data transfer from pinned memory to the device will have a higher bandwidth than transfer from un-pinned memory. To verify this, I wrote a Python script (see appendix) to measure the host-to-device (h2d) transfer bandwidth (BW) when the data is located in pageable vs. page-locked memory. The script uses the pycuda library, which exposes the low-level CUDA API in Python. The results for five data sizes are shown in the table below.

| Data size (MB) | 20 | 40 | 80 | 120 | 160 |
|----------------|------|------|------|------|------|
| Transfer BW, page-locked (GB/sec) | 11.36 | 11.44 | 12.02 | 12.09 | 12.13 |
| Transfer BW, pageable (GB/sec) | 5.06 | 5.31 | 5.39 | 5.35 | 5.04 |

As expected, the transfer bandwidth from page-locked memory is much closer to the expected bandwidth on my system.

The figure below shows a plot of the two types of transfer involved:

  • h2d transfer (no pinning): data is copied from pageable host memory to device memory. The CUDA driver internally stages the data through page-locked memory. The total transfer time is shown in the green plot.
  • h2d transfer (with pinning): data is allocated in pageable memory and explicitly copied to page-locked memory using the register_host_memory function in pycuda. This host-to-host (h2h) transfer time is shown in the blue plot. The data in page-locked memory is then copied to the GPU; this transfer time is shown in the red plot. The total time is the sum of the two and is shown in the black plot.

As expected, the two total transfer times (the green plot and the black plot) are nearly identical. This is because for an h2d transfer from pageable memory, the CUDA driver internally performs the same transfer from pageable to page-locked memory.

We also see that transferring data from unpinned to pinned memory is itself an expensive operation. Transfer from page-locked memory is therefore more efficient only if either the data can be allocated directly in page-locked memory, or the data processing on the GPU can be overlapped with the data transfer so that the transfer latency is partly or completely hidden. CUDA provides APIs such as cudaMallocHost() to implement the first method.
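In pycuda, the analogue of allocating directly in page-locked memory is pagelocked_empty (register_host_memory, used above, instead pins an already-allocated array). A minimal sketch, sized to the 160 MB buffers used in the experiments below:

```python
# Sketch: allocate the host buffer directly in page-locked memory so the h2d
# copy runs at pinned-memory bandwidth, with no extra staging copy.
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

n = 40 * 1024 * 1024                                  # 40M float32 values = 160 MB
pinned = drv.pagelocked_empty(n, dtype=np.float32)    # page-locked host buffer
pinned[:] = np.random.rand(n).astype(np.float32)      # fill with data

gpu_buf = drv.mem_alloc(pinned.nbytes)                # device allocation
drv.memcpy_htod(gpu_buf, pinned)                      # h2d transfer from pinned memory
```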

Dataloader in PyTorch 1.0

PyTorch’s dataloader takes the second approach. It uses multiple worker processes to load batch data from disk into pageable memory and then spins up a separate thread to copy the loaded data to pinned memory (if the pin_memory flag is True). If not, the data loaded by the worker processes is returned directly when enumerate or next(iter) is called by the user (see the code for the imagenet example in the PyTorch docs). The figure below shows a schematic of the two options.
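From the user’s side the pattern looks roughly like the sketch below; the TensorDataset is just a stand-in for a real dataset, but the imagenet example in the PyTorch docs follows the same structure:

```python
# Sketch: DataLoader with pin_memory=True, plus non-blocking h2d copies.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 224, 224),       # stand-in images
                        torch.randint(0, 1000, (256,)))      # stand-in labels

loader = DataLoader(dataset, batch_size=64, num_workers=4,
                    pin_memory=True)   # workers load batches; a separate thread
                                       # copies each batch into page-locked memory

for images, target in loader:
    # Because the batch already sits in pinned memory, the h2d copy can be
    # issued asynchronously with respect to the host.
    images = images.cuda(non_blocking=True)
    target = target.cuda(non_blocking=True)
    # ... forward pass, backward pass, parameter update ...
```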

Threaded data transfer experiment

I wrote a Python script (see appendix) that implements a simple threaded loader to overlap the copy to page-locked memory with data processing. The script copies 160 MB of data 100 times using three methods:

  • Data is allocated in pageable memory and copied to the GPU using pycuda’s memcpy_htod() function on the main thread
  • Data is allocated in pageable memory, copied explicitly to page-locked memory using pycuda’s PageLockedMemoryPool API and then copied to the GPU on the main thread
  • A separate thread is used to perform the copy to page-locked memory, overlapping it with the transfer and processing on the main thread

The data processing time is simulated by adding a sleep of 400 ms, which is comparable to the backprop time for ResNet-50 on a 1080 Ti GPU with a batch size of 64.

The total processing times for the three methods are shown in the table below.

 

| | H2D (pageable) | H2D (page-locked) | H2D (page-locked, threaded) |
|---|---|---|---|
| Total processing time, 100 runs (sec) | 7.90 | 7.92 | 4.92 |

The threaded option is able to almost completely hide the data transfer latency; the total execution time is now dominated by the simulated processing time (the cumulative sleep over the 100 runs). We can also see this by profiling the code using cProfile and visualizing the result with snakeviz. Note that the actual cumulative sleep time is slightly longer than the nominal value because the resolution of the time.sleep() function on Windows is ~3-4 ms.
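For reference, the profiling steps were roughly the following (threaded_transfer.py is a placeholder name for the script):

```
python -m cProfile -o transfer.prof threaded_transfer.py
snakeviz transfer.prof
```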

Incidentally, it is not easy to profile multiple threads in Python. cProfile works well for the main thread but doesn’t give information about what’s happening on other threads. I also played around with Intel VTune, but didn’t have much luck. If you know of good profiling tools for multi-threaded Python apps, let me know!

Conclusion

To conclude, in this post we looked at some basic information about PCIe links that lets you determine the expected host-to-device data transfer bandwidth. We then considered an example showing that data transfer from pinned host memory to device memory is much more efficient than transfer from un-pinned memory, but that the unpinned-to-pinned copy is an expensive operation in itself. Finally, we examined how multi-threading can be used to perform the unpinned-to-pinned copy concurrently with the host-to-device transfer and data processing on the device. Given a sufficiently large processing time, this can almost completely hide the data transfer latency.

Hope you found this useful. Please leave a comment if you did!

Appendix

Code for transfer BW performance comparison
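A condensed sketch of the measurement rather than the original script: it times repeated memcpy_htod calls from a pageable numpy buffer and from a page-locked buffer using CUDA events:

```python
# Sketch: compare h2d bandwidth from pageable vs. page-locked host memory.
# Timing uses CUDA events; absolute numbers depend on PCIe generation and lane count.
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

def h2d_bandwidth(host_array, n_runs=20):
    """Average host-to-device bandwidth in GB/sec for the given host buffer."""
    gpu_buf = drv.mem_alloc(host_array.nbytes)
    start, end = drv.Event(), drv.Event()
    start.record()
    for _ in range(n_runs):
        drv.memcpy_htod(gpu_buf, host_array)
    end.record()
    end.synchronize()
    seconds = start.time_till(end) / 1e3          # time_till returns milliseconds
    return n_runs * host_array.nbytes / seconds / 1e9

for size_mb in (20, 40, 80, 120, 160):
    n = size_mb * 1024 * 1024 // 4                # number of float32 elements
    pageable = np.random.rand(n).astype(np.float32)        # ordinary numpy buffer
    pinned = drv.pagelocked_empty(n, dtype=np.float32)     # page-locked buffer
    pinned[:] = pageable
    print(f"{size_mb:4d} MB  pageable: {h2d_bandwidth(pageable):5.2f} GB/s  "
          f"pinned: {h2d_bandwidth(pinned):5.2f} GB/s")
```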

Code for threaded data transfer
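Again a condensed sketch rather than the original script: a worker thread stages the next batch into one of two page-locked buffers while the main thread transfers the current batch to the GPU and performs the simulated processing (the sleep):

```python
# Sketch: overlap the pageable -> page-locked copy (worker thread) with the
# h2d transfer and the simulated processing step (main thread).
import threading
import time
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv

N_RUNS = 100
PROCESS_TIME = 0.4                                # simulated per-batch processing (sec)
n = 40 * 1024 * 1024                              # 160 MB of float32 data
pageable = np.random.rand(n).astype(np.float32)   # the "freshly loaded" batch
pinned = [drv.pagelocked_empty(n, dtype=np.float32) for _ in range(2)]  # double buffer
gpu_buf = drv.mem_alloc(pageable.nbytes)

def stage(buf_idx):
    """Copy the pageable batch into one of the page-locked staging buffers."""
    pinned[buf_idx][:] = pageable

start = time.time()
stager = threading.Thread(target=stage, args=(0,))
stager.start()
for i in range(N_RUNS):
    stager.join()                                 # wait until the current batch is pinned
    current = i % 2
    if i + 1 < N_RUNS:                            # start staging the next batch
        stager = threading.Thread(target=stage, args=((i + 1) % 2,))
        stager.start()
    drv.memcpy_htod(gpu_buf, pinned[current])     # h2d transfer from page-locked memory
    time.sleep(PROCESS_TIME)                      # simulated data processing
print("total: %.2f sec" % (time.time() - start))
```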

 
