References
- 李会民 (USTC): PGI Compiler and Its Usage
- 小小河: Efficient CUDA Fortran Programming Practice
Basic Usage of the PGI Compilers
Compilers
- Commands for compiling C and C++ source files: pgcc and pgCC
- Command for compiling Fortran 77 source files: pgf77
- Commands for compiling Fortran 90 source files: pgf90, pgf901, pgf902, pgf90_ex, pgf95, and pgfortran
- Combined with the NVIDIA CUDA libraries, they can compile CUDA Fortran programs
Common options

| Option | Meaning |
| ------ | ------- |
| -c | compile only, producing object (.o) files |
| -o | name of the generated executable |
| -O | enable optimization |
| -mp | enable OpenMP |
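For example, a typical build using these options might look like the following (the source file names are illustrative):
pgf90 -c -O2 sub.f90                  # compile sub.f90 into sub.o only
pgf90 -O2 -mp -o prog main.f90 sub.o  # compile and link into an executable named prog, with OpenMP enabled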
Compiling an ordinary Fortran example
cndaqiang@girl:~/code/cuda$ cat hello.f90
PROGRAM hello
WRITE(*,*) "hello"
ENDPROGRAM hello
cndaqiang@girl:~/code/cuda$ pgf90 hello.f90 -o hello
cndaqiang@girl:~/code/cuda$ ./hello
hello
Compiling a CUDA example
cndaqiang@girl:~/code/cuda$ cat example.cuf
!attributes(global): a device subroutine; global means it can be launched from the host
attributes(global) subroutine hellocuda()
IMPLICIT NONE
INTEGER :: i,d,j
i = BlockIdx%x !block index
d = BlockDim%x !block size (threads per block)
j = ThreadIdx%x !thread index
WRITE(*,*) "Block",i,"Thread",j,"In",d
ENDSUBROUTINE
PROGRAM main
USE cudafor
IMPLICIT NONE
INTEGER :: i
call hellocuda<<<2,3>>>()
i=cudaDeviceSynchronize() !works only with USE cudafor
!control returns to the CPU as soon as hellocuda starts executing
END PROGRAM
cndaqiang@girl:~/code/cuda$ pgfortran example.cuf && ./a.out
Block 2 Thread 1 In 3
Block 2 Thread 2 In 3
Block 2 Thread 3 In 3
Block 1 Thread 1 In 3
Block 1 Thread 2 In 3
Block 1 Thread 3 In 3
Because control returns to the CPU as soon as the GPU kernel hellocuda is launched, we add i=cudaDeviceSynchronize() so that the program waits for the GPU output before continuing.
CUDA
Basic concepts
- Host: the CPU and its memory.
- Device: the GPU and its memory.
- CPU code: an implementation that uses only the CPU, i.e. code executed on the host.
- Kernel: a subroutine called from the host and executed on the device, e.g. attributes(global) subroutine hellocuda()
- Launching a kernel: calling a device routine from the host, e.g. CALL hellocuda<<<2,3>>>([arguments]); the launch configuration <<<number of thread blocks, threads per block>>> is <<<Grid,Block>>>
- Defining routines: attributes(attr) subroutine funname(), where attr is one of:
  global: executed on the device, called from the host, e.g. attributes(global) subroutine hellocuda(); the call must specify a grid with <<<>>>
  device: executed on the device, called from the device, e.g. attributes(device) subroutine hellocuda(); called directly, without <<<>>>
  host: executed on the host, called from the host; this is the default when no attribute is written, e.g. subroutine hellohost(); called directly, without <<<>>>
- Memory
  Global memory: its size is that of the GPU memory; all threads can access and modify it. The device variables we pass to kernels are stored in global memory.
  Shared memory: shared by the threads of one block; its lifetime is that of the block.
  Local memory: private to a single thread.
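A minimal sketch of how the three memory spaces show up in a kernel; the module name, array size, and launch configuration are illustrative, not from the original post:
MODULE m_mem
CONTAINS
  ! assumes it is launched as scale<<<n/64, 64>>>(x_d, n) with n a multiple of 64
  attributes(global) SUBROUTINE scale(x_d, n)
    IMPLICIT NONE
    INTEGER, value :: n
    REAL :: x_d(n)             ! device dummy argument: resides in global memory
    REAL, shared :: tile(64)   ! shared memory: one copy per thread block
    REAL :: tmp                ! ordinary local variable: private to each thread
    INTEGER :: i
    i = (BlockIdx%x - 1)*BlockDim%x + ThreadIdx%x
    tile(ThreadIdx%x) = x_d(i)
    CALL syncthreads()         ! wait until the whole block has filled tile
    tmp = 2.0 * tile(ThreadIdx%x)
    x_d(i) = tmp
  END SUBROUTINE
END MODULE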
Notes
- Device routines cannot be CONTAINed in the main program or in another subprogram
- Many of the functions and variables used here come from USE cudafor
- The CPU controls, the GPU computes
Querying the GPU properties
cndaqiang@girl:~/code/cuda$ cat checkcuda.cuf
PROGRAM checkcuda
USE cudafor
IMPLICIT NONE
type(cudaDeviceProp) :: prop
INTEGER :: nDevices=0,i,ierr
!
ierr = cudaGetDeviceCount(nDevices)
if(nDevices .EQ. 0) STOP
DO i=0,nDevices-1
WRITE(*,"('Device Number:',i0)"),i
ierr = cudaGetDeviceProperties(prop,i)
WRITE(*,"('Device Name: ',a)") TRIM(prop%name)
!compute capability, major.minor; think of it as a version number
WRITE(*,"('Compute Capability: ',i0,'.',i0)") prop%major, prop%minor
!number of multiprocessors
WRITE(*,"('Number of Multiprocessors: ',i0)") prop%multiProcessorCount
WRITE(*,"('Max Threads per Multiprocessor:',i0)") prop%maxThreadsPerMultiprocessor
WRITE(*,"('Global Memory (GB):',f9.3)" ) prop%totalGlobalMem/(1024.0**3)
WRITE(*,"(/,A)") "Execution Configuration Limits "
WRITE(*,"('Max Grid Dims: ',i0,'x',i0,'x',i0)") prop%maxGridSize
!maximum threads per block in each dimension
WRITE(*,"('Max Block Dims: ',i0,'x',i0,'x',i0)") prop%maxThreadsDim
WRITE(*,"('Max Threads per Block: ',i0)") prop%maxThreadsPerBlock
ENDDO
END PROGRAM
cndaqiang@girl:~/code/cuda$ pgf90 checkcuda.cuf && ./a.out
Device Number:0
Device Name: GeForce 940MX
Compute Capability: 5.0
Number of Multiprocessors: 3
Max Threads per Multiprocessor:2048
Global Memory (GB): 1.958
Execution Configuration Limits
Max Grid Dims: 2147483647x65535x65535
Max Block Dims: 1024x1024x64
Max Threads per Block: 1024
Concurrently resident threads = number of multiprocessors * max threads per multiprocessor; for this 940MX that is 3 * 2048 = 6144.
With a 1-D grid, a single launch can contain up to 2147483647*1024 threads (max grid dimension * max threads per block).
The same information can also be printed with pgaccelinfo:
cndaqiang@girl:~/code/cuda$ pgaccelinfo
CUDA Driver Version: 10010
NVRM version: NVIDIA UNIX x86_64 Kernel Module 435.21 Sun Aug 25 08:17:57 CDT 2019
Device Number: 0
Device Name: GeForce 940MX
Device Revision Number: 5.0
Global Memory Size: 2101870592
Number of Multiprocessors: 3
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1189 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 2000 MHz
Memory Bus Width: 64 bits
L2 Cache Size: 1048576 bytes
Max Threads Per SMP: 2048
Async Engines: 1
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
PGI Default Target: -ta=tesla:cc50
Configuring the kernel launch grid
1-D grid, 1-D blocks
!a 1-D grid with M blocks, each block containing N threads
call sub<<<M,N>>>([argument])
Multi-dimensional
TYPE(dim3) :: grid, block
grid = dim3(2,2,2)
block = dim3(1,1,1)
call hellocuda<<<grid,block>>>()
Inside a kernel, the built-in grid variables are of type dim3:
gx=GridDim%x   !grid size: number of blocks along x
gy=GridDim%y   !grid size: number of blocks along y
gz=GridDim%z   !grid size: number of blocks along z
bx=BlockIdx%x  !block index along x
by=BlockIdx%y  !block index along y
bz=BlockIdx%z  !block index along z
bDx=BlockDim%x !block size: threads per block along x
bDy=BlockDim%y !block size: threads per block along y
bDz=BlockDim%z !block size: threads per block along z
tx=ThreadIdx%x !thread index along x
ty=ThreadIdx%y !thread index along y
tz=ThreadIdx%z !thread index along z
To determine a global (absolute) thread index, you can use
bi = BlockIdx%x + &
(BlockIdx%y - 1)*GridDim%x + &
(BlockIdx%z - 1)*GridDim%y*GridDim%x
ti = (bi-1)*BlockDim%x*BlockDim%y*BlockDim%z + &
ThreadIdx%x + &
(ThreadIdx%y - 1)*BlockDim%x + &
(ThreadIdx%z - 1)*BlockDim%y*BlockDim%x
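For the common case of a 1-D grid with 1-D blocks this reduces to i = (BlockIdx%x - 1)*BlockDim%x + ThreadIdx%x. A minimal sketch of using that index for an element-wise array operation (the module, kernel, and variable names are illustrative):
MODULE m_add
CONTAINS
  attributes(global) SUBROUTINE vadd(a_d, b_d, c_d, n)
    IMPLICIT NONE
    INTEGER, value :: n
    REAL :: a_d(n), b_d(n), c_d(n)
    INTEGER :: i
    ! global 1-D thread index
    i = (BlockIdx%x - 1)*BlockDim%x + ThreadIdx%x
    ! guard: the last block may contain threads beyond n
    IF (i <= n) c_d(i) = a_d(i) + b_d(i)
  END SUBROUTINE
END MODULE
On the host this would be launched as, e.g., CALL vadd<<<(n+255)/256, 256>>>(a_d, b_d, c_d, n); the IF guard handles the case where n is not a multiple of the block size.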
When a device routine is called from inside a kernel, no launch configuration can be given (each thread calls the routine individually). The grid built-in variables are also available inside device routines; their values are those of the calling thread.
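A minimal sketch (module and routine names are illustrative): both routines live in the same module, and the device routine is called like an ordinary subroutine:
MODULE m_dev
CONTAINS
  attributes(device) SUBROUTINE times_two(x)
    IMPLICIT NONE
    REAL :: x
    x = 2.0 * x
  END SUBROUTINE

  attributes(global) SUBROUTINE scale_all(a_d, n)
    IMPLICIT NONE
    INTEGER, value :: n
    REAL :: a_d(n)
    INTEGER :: i
    i = (BlockIdx%x - 1)*BlockDim%x + ThreadIdx%x
    IF (i <= n) CALL times_two(a_d(i))   ! no <<<>>>: each thread calls it directly
  END SUBROUTINE
END MODULE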
Host/device synchronization
By default, host code and device code (global kernels) run asynchronously: in the code below, control returns to the CPU as soon as CALL hellocuda <<<2,3>>>() is launched.
a_d=1
CALL hellocuda <<<2,3>>>()
Data transfers between host and device, such as a_d=1, are synchronous (blocking): the data can only be transferred once both the CPU and the GPU have reached this point.
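A minimal sketch of this behavior, reusing the illustrative vadd kernel from the module m_add sketched earlier:
PROGRAM sync_demo
USE cudafor
USE m_add                    ! illustrative module containing the vadd kernel sketched earlier
IMPLICIT NONE
INTEGER, PARAMETER :: n = 1024
REAL :: a(n), c(n)
REAL, device :: a_d(n), b_d(n), c_d(n)
a = 1.0
a_d = a                      ! host->device copy: blocking
b_d = a_d                    ! device->device copy
CALL vadd<<<(n+255)/256, 256>>>(a_d, b_d, c_d, n)  ! kernel launch: control returns to the CPU immediately
c = c_d                      ! device->host copy: runs only after the kernel has finished
WRITE(*,*) "c(1) =", c(1)    ! prints 2.000000
END PROGRAM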
Other synchronization methods
- Add i=cudaDeviceSynchronize() in the code: the host (CPU) program continues only after all device (GPU) code issued so far has finished.
- Set an environment variable so that kernel launches synchronize the CPU and GPU:
export CUDA_LAUNCH_BLOCKING=1
Passing arguments to kernels
Pass by reference (the default)
The variables passed must reside on the device, i.e. be declared with the device attribute, e.g. INTEGER,device :: x_d(100)
Presumably only addresses of data on the GPU/device can be passed this way
Addresses of host variables cannot be passed directly
Pass by value
Pass-by-value can pass host variables; in that case the dummy argument is declared with the value attribute inside the kernel, e.g. INTEGER,value :: N
A kernel that takes dummy arguments should only be defined inside a MODULE, not as an external subroutine; otherwise the values it receives are essentially random and lead to wrong results
Example of argument passing: the external kernel fails to receive the by-value argument, the module kernel receives it correctly, and device dummy arguments work in both cases
cndaqiang@girl:~/code/cuda$ cat cudavar.cuf
attributes(global) SUBROUTINE whoIam(n,n_d)
IMPLICIT NONE
INTEGER,value :: N
INTEGER :: N_d
WRITE(*,*) "External", "N",N,"N_d",N_d
END SUBROUTINE
MODULE m_var
CONTAINS
attributes(global) SUBROUTINE whoIam_m(n,n_d)
IMPLICIT NONE
INTEGER,value :: N
INTEGER :: N_d
WRITE(*,*) "Module", "N",N,"N_d",N_d
END SUBROUTINE
END MODULE
PROGRAM testSend
use cudafor
USE m_var
IMPLICIT NONE
INTEGER :: N,i
INTEGER,device :: N_d
N=10
N_d=10
CALL whoIam<<<1,1>>>(N,N_d)
CALL whoIam_m<<<1,1>>>(N,N_d)
i=cudaDeviceSynchronize()
END PROGRAM
cndaqiang@girl:~/code/cuda$ pgfortran cudavar.cuf && ./a.out
External N 816657356 N_d 10
Module N 10 N_d 10
Performance test
Download fastcuda.cuf, compile it, and run the test.
- When the array size exceeds 1E5, the GPU (940MX) compute time is about one third of the CPU (i7-7500U) time.
- Copying to a different variable is cheaper than assigning a variable to itself.
- For different-variable copies, GPU-GPU is faster than CPU-CPU.
- The GPU<->CPU copy time is roughly the GPUx->GPUy time plus the CPUx->CPUy time.
- The default GPUx=GPUy device-to-device assignment is quite fast.
cndaqiang@girl:~/code/cuda$ pgfortran fastcuda.cuf && ./a.out
-----------------------
N 100000 CPU build, WALL 1.10000 ms
N 100000 CPU cal, WALL 0.25800 ms
N 100000 GPU build, WALL 0.12800 ms
N 100000 GPU cal, WALL 0.06400 ms
N 100000 GPU->CPU, WALL 0.33300 ms
N 100000 CPU->GPU, WALL 0.30300 ms
N 100000 CPUy->CPUx, WALL 0.20800 ms
N 100000 CPUy->CPUy, WALL 0.42600 ms
N 100000 GPUy->GPUz, WALL 0.07100 ms
N 100000 GPUy->GPUy, WALL 0.59500 ms
N 100000 Error 0.00000
-----------------------
N 100100000 CPU build, WALL 910.17902 ms
N 100100000 CPU cal, WALL 202.79601 ms
N 100100000 GPU build, WALL 45.75900 ms
N 100100000 GPU cal, WALL 59.73200 ms
N 100100000 GPU->CPU, WALL 244.66000 ms
N 100100000 CPU->GPU, WALL 251.89999 ms
N 100100000 CPUy->CPUx, WALL 210.46001 ms
N 100100000 CPUy->CPUy, WALL 495.70901 ms
N 100100000 GPUy->GPUz, WALL 31.10200 ms
N 100100000 GPUy->GPUy, WALL 499.45700 ms
N 100100000 Error 0.00000
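fastcuda.cuf itself is not reproduced here. As a minimal sketch of how GPU-side timings like those above can be measured with CUDA events from cudafor (the kernel, module name m_add, and sizes are illustrative):
PROGRAM time_kernel
USE cudafor
USE m_add                          ! illustrative module containing the vadd kernel sketched earlier
IMPLICIT NONE
INTEGER, PARAMETER :: n = 100000
REAL, device :: a_d(n), b_d(n), c_d(n)
TYPE(cudaEvent) :: t0, t1
REAL :: ms
INTEGER :: ierr
a_d = 1.0
b_d = 2.0
ierr = cudaEventCreate(t0)
ierr = cudaEventCreate(t1)
ierr = cudaEventRecord(t0, 0)      ! record the start event in the default stream
CALL vadd<<<(n+255)/256, 256>>>(a_d, b_d, c_d, n)
ierr = cudaEventRecord(t1, 0)      ! record the stop event after the kernel
ierr = cudaEventSynchronize(t1)    ! wait for the kernel and the stop event to complete
ierr = cudaEventElapsedTime(ms, t0, t1)
WRITE(*,*) "N", n, "GPU cal, WALL", ms, "ms"
END PROGRAM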
The time spent in each part can be profiled with nvprof ./a.out
Example
(python27) cndaqiang@girl:~/code/cuda$ nvprof ./a.out
2 2 2
==2410== NVPROF is profiling process 2410, command: ./a.out
... (program output)
==2410== Profiling application: ./a.out
==2410== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 509.55us 1 509.55us 509.55us 509.55us m_mycuda_hellocuda_
API calls: 99.61% 264.75ms 1 264.75ms 264.75ms 264.75ms cudaLaunchKernel
0.21% 549.89us 1 549.89us 549.89us 549.89us cudaDeviceSynchronize
0.12% 318.01us 97 3.2780us 250ns 139.94us cuDeviceGetAttribute
0.05% 123.14us 1 123.14us 123.14us 123.14us cuDeviceTotalMem
0.02% 50.688us 1 50.688us 50.688us 50.688us cuDeviceGetName
0.00% 2.9850us 3 995ns 267ns 2.0500us cuDeviceGetCount
0.00% 2.1570us 2 1.0780us 348ns 1.8090us cuDeviceGet
0.00% 464ns 1 464ns 464ns 464ns
Errors
CUDA code errors
Calling host (CPU) routines from device (GPU) code, e.g. CALL SLEEP(1) or CALL system("")
PGF90-S-0155-Calls from device code to a host subroutine are allowed only in emulation mode - sleep (hellocuda.cuf: 11)
0 inform, 0 warnings, 1 severes, 0 fatal for mycuda2
Printing a device variable on the host, e.g. WRITE(*,*) a_d
PGF90-S-0155-device data allowed in I/O statements only in emulation mode (hellocuda.cuf: 45)
0 inform, 0 warnings, 1 severes, 0 fatal for hellocuda
Device code apparently cannot use formatted output? WRITE(*,"(A)") "Hello" fails; only list-directed WRITE(*,*) "Hello" works
PGF90-S-0155-I/O statements allowed in device routines only in emulation mode (hellocuda.cuf: 12)
0 inform, 0 warnings, 1 severes, 0 fatal for mycuda
Launching an attributes(device) subprogram from the host
PGF90-S-0155-Kernel launch of attributes(device) subprogram is not allowed - gpufun (hellocuda.cuf: 43)
0 inform, 0 warnings, 1 severes, 0 fatal for main
When the file extension is written as f90 instead of cuf, the CUDA routines are not recognized
/home/cndaqiang/code/cuda/checkcuda.f90:8: undefined reference to `cudagetdevicecount_'
Passing a host variable to a kernel; if you really need to pass it, add the value attribute to the dummy-argument declaration, and the kernel must be defined inside a Module
PGF90-S-0528-Argument number 1 to whoiam_m: device attribute mismatch (cudavar.cuf: 47)
0 inform, 0 warnings, 1 severes, 0 fatal for testsend
Ordinary PGI Fortran syntax errors
A stray full-width character (space, exclamation mark, Chinese character, ...) typed after a line of code, e.g. `IMPLICIT NONE` followed by a full-width space versus plain IMPLICIT NONE
This is only a warning; the program still runs
PGF90-W-0025-Illegal character (E3) - ignored (hellocuda.cuf: 7)
A full-width (Chinese) comma typed instead of an ASCII comma
PGF90-W-0025-Illegal character (EF) - ignored (hellocuda.cuf: 12)
sqrt expects a real argument; calling it with an integer, sqrt(10), produces an error
PGF90-S-0038-Symbol, sqrt, has not been explicitly declared (fatercuda.cuf)
0 inform, 0 warnings, 1 severes, 0 fatal for fast
This article was first published on my blog @cndaqiang.
Unless otherwise stated, all articles on this blog are released under the CC BY-SA 4.0 license; please credit the source when republishing.