
CUDA Programming in Fortran (work in progress)

2020-02-29
cndaqiang

References

Li Huimin: The PGI compiler and how to use it (USTC)
Xiaoxiaohe: Efficient CUDA Fortran programming practice

Basic usage of the PGI compiler

Compilers

  • Commands for compiling C and C++ sources: pgcc and pgCC, respectively
  • Command for compiling Fortran 77 sources: pgf77
  • Commands for compiling Fortran 90 sources: pgf90, pgf901, pgf902, pgf90_ex, pgf95, and pgfortran
  • Combined with the NVIDIA CUDA libraries, they can compile CUDA Fortran programs

Options

-c  compile to an object (.o) file only
-o  name the generated executable
-O  enable optimization
-mp enable OpenMP
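
For instance (a sketch; the file names are illustrative):

pgfortran -c -O2 sub.f90                   # compile only: produces sub.o
pgfortran -mp -O2 main.f90 sub.o -o main   # link with OpenMP, name the executable main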

Example: compiling plain Fortran

cndaqiang@girl:~/code/cuda$  cat hello.f90 
PROGRAM hello
      WRITE(*,*) "hello"
ENDPROGRAM hello
cndaqiang@girl:~/code/cuda$ pgf90 hello.f90 -o hello
cndaqiang@girl:~/code/cuda$ ./hello 
 hello

Example: compiling CUDA Fortran

cndaqiang@girl:~/code/cuda$ cat example.cuf 
!attributes(global): a device subroutine; global means it can be launched from the host
attributes(global) subroutine hellocuda()
    IMPLICIT NONE
    INTEGER :: i,d,j
    i = BlockIdx%x  !block index
    d = BlockDim%x  !block size (threads per block)
    j = ThreadIdx%x !thread index
    WRITE(*,*) "Block",i,"Thread",j,"In",d
ENDSUBROUTINE

PROGRAM main
USE cudafor
IMPLICIT NONE
INTEGER :: i
call hellocuda<<<2,3>>>()
i=cudaDeviceSynchronize() !requires USE cudafor
!control returns to the CPU as soon as hellocuda starts executing
END PROGRAM
cndaqiang@girl:~/code/cuda$ pgfortran example.cuf && ./a.out 
 Block            2 Thread            1 In            3
 Block            2 Thread            2 In            3
 Block            2 Thread            3 In            3
 Block            1 Thread            1 In            3
 Block            1 Thread            2 In            3
 Block            1 Thread            3 In            3

Because control returns to the CPU as soon as the GPU kernel hellocuda is launched, we add i=cudaDeviceSynchronize() to wait for the GPU output before continuing.

CUDA

Basic concepts

  • Host: the CPU and its memory.
  • Device: the GPU and its memory.
  • CPU code: an implementation that uses only the CPU, i.e. code that executes on the host.
  • Kernel: a subroutine called from the host and executed on the device, e.g. attributes(global) subroutine hellocuda()
  • Launching a kernel: calling a device routine from the host, e.g. CALL hellocuda <<<2,3>>>([arguments]),
    using <<<number of blocks, threads per block>>>, i.e. <<<Grid,Block>>>
  • Defining a routine as attributes(attribute) subroutine funname(), where the attribute is (see the sketch after this list):
    global: executes on the device, called from the host, e.g. attributes(global) subroutine hellocuda(); the call specifies a grid with <<<>>>
    device: executes on the device, called from the device, e.g. attributes(device) subroutine hellocuda(); invoked directly, no <<<>>>
    host: executes on the host, called from the host; this is the default when no attribute is written, e.g. subroutine hellohost(); invoked directly, no <<<>>>
  • Memory
    Global memory: sized by the GPU's VRAM; all threads can read and write it, and the device variables we pass to kernels are stored here
    Shared memory: shared within each thread block, with the same lifetime as the block
    Local memory: private to the owning thread
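
A minimal sketch combining the three attributes (the names m_demo, addone, and kern are illustrative, not from the original): a device function called plainly from a global kernel, which the host launches with <<<>>>.

MODULE m_demo
CONTAINS
!device: runs on the GPU, callable only from device code, no <<<>>>
attributes(device) INTEGER FUNCTION addone(x)
    INTEGER,value :: x
    addone = x + 1
END FUNCTION
!global: runs on the GPU, launched from the host with <<<>>>
attributes(global) SUBROUTINE kern()
    IMPLICIT NONE
    INTEGER :: y
    y = addone(ThreadIdx%x)  !plain device-to-device call
    WRITE(*,*) "Thread",ThreadIdx%x,"gets",y
END SUBROUTINE
END MODULE

PROGRAM demo  !host code: the default, no attribute needed
USE cudafor
USE m_demo
IMPLICIT NONE
INTEGER :: ierr
CALL kern<<<1,4>>>()            !one block of four threads
ierr = cudaDeviceSynchronize()  !wait for the device output
END PROGRAM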

Caveats

  • Device subroutines cannot be CONTAINed in the main program or in another subprogram
  • Many functions and variables come from USE cudafor
  • The CPU orchestrates; the GPU computes

Querying GPU properties

cndaqiang@girl:~/code/cuda$ cat checkcuda.cuf 
PROGRAM checkcuda
USE cudafor
IMPLICIT NONE
type(cudaDeviceProp) :: prop
INTEGER :: nDevices=0,i,ierr
!
ierr = cudaGetDeviceCount(nDevices)
if(nDevices .EQ. 0) STOP
DO i=0,nDevices-1
    WRITE(*,"('Device Number:',i0)"),i
    ierr = cudaGetDeviceProperties(prop,i)
    WRITE(*,"('Device Name: ',a)") TRIM(prop%name)
    !compute capability, major.minor; think of it as a version number
    WRITE(*,"('Compute Capability: ',i0,'.',i0)") prop%major, prop%minor
    !number of multiprocessors
    WRITE(*,"('Number of Multiprocessors: ',i0)") prop%multiProcessorCount
    WRITE(*,"('Max Threads per Multiprocessor:',i0)") prop%maxThreadsPerMultiprocessor
    WRITE(*,"('Global Memory (GB):',f9.3)" ) prop%totalGlobalMem/(1024.0**3)
    WRITE(*,"(/,A)") "Execution Configuration Limits "
    WRITE(*,"('Max Grid Dims: ',i0,'x',i0,'x',i0)") prop%maxGridSize
    !max threads per block in each dimension
    WRITE(*,"('Max Block Dims: ',i0,'x',i0,'x',i0)") prop%maxThreadsDim
    WRITE(*,"('Max Threads per Block: ',i0)") prop%maxThreadsPerBlock
ENDDO
END PROGRAM
cndaqiang@girl:~/code/cuda$ pgf90 checkcuda.cuf && ./a.out 
Device Number:0
Device Name: GeForce 940MX
Compute Capability: 5.0
Number of Multiprocessors: 3
Max Threads per Multiprocessor:2048
Global Memory (GB):    1.958

Execution Configuration Limits 
Max Grid Dims: 2147483647x65535x65535
Max Block Dims: 1024x1024x64
Max Threads per Block: 1024

Concurrent threads = number of multiprocessors × max threads per multiprocessor, i.e. 3 × 2048 = 6144 for this 940MX.
With a 1-D grid, a single launch can request up to 2147483647 × 1024 threads (max grid size × max threads per block), though only 6144 of them run concurrently.
The same properties can be inspected with pgaccelinfo:

cndaqiang@girl:~/code/cuda$ pgaccelinfo 

CUDA Driver Version:           10010
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  435.21  Sun Aug 25 08:17:57 CDT 2019

Device Number:                 0
Device Name:                   GeForce 940MX
Device Revision Number:        5.0
Global Memory Size:            2101870592
Number of Multiprocessors:     3
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1189 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             2000 MHz
Memory Bus Width:              64 bits
L2 Cache Size:                 1048576 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     No
PGI Default Target:            -ta=tesla:cc50

Configuring the kernel launch grid

1-D grid, 1-D blocks

!a 1-D grid of M blocks, each with N threads
call sub<<<M,N>>>([argument])

Multi-dimensional

TYPE(dim3) grid,block
!
grid=dim3(2,2,2)
block=dim3(1,1,1)
call hellocuda<<<grid,block>>>()
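
This launches a 2×2×2 grid of blocks, each block containing a single thread, i.e. 8 threads in total.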

The built-in grid variables available inside a kernel are of type dim3:

    gx=GridDim%x !grid size, i.e. number of blocks
    gy=GridDim%y !grid size, i.e. number of blocks
    gz=GridDim%z !grid size, i.e. number of blocks
    bx=BlockIdx%x !block index
    by=BlockIdx%y !block index
    bz=BlockIdx%z !block index
    bDx=BlockDim%x !block size, i.e. threads per block
    bDy=BlockDim%y !block size, i.e. threads per block
    bDz=BlockDim%z !block size, i.e. threads per block
    tx=ThreadIdx%x !thread index
    ty=ThreadIdx%y !thread index
    tz=ThreadIdx%z !thread index

To compute an absolute (global) thread index, you can use:

bi =    BlockIdx%x +        &
        (BlockIdx%y - 1)*GridDim%x + &
        (BlockIdx%z - 1)*GridDim%y*GridDim%x
ti = (bi-1)*BlockDim%x*BlockDim%y*BlockDim%z + &
        ThreadIdx%x +        &
        (ThreadIdx%y - 1)*BlockDim%x + &
        (ThreadIdx%z - 1)*BlockDim%y*BlockDim%x
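
As a sketch of this index at work (the module m_fill and kernel fill are illustrative names, not from the original), here is a 1-D launch that fills a device array, with a bounds guard because the block count is rounded up:

MODULE m_fill
CONTAINS
attributes(global) SUBROUTINE fill(a,n)
    IMPLICIT NONE
    INTEGER,value :: n
    INTEGER :: a(n)  !array dummies of a kernel are device by default
    INTEGER :: ti
    ti = (BlockIdx%x-1)*BlockDim%x + ThreadIdx%x  !1-D form of the index above
    IF (ti <= n) a(ti) = ti  !guard against the rounded-up extra threads
END SUBROUTINE
END MODULE

PROGRAM testfill
USE cudafor
USE m_fill
IMPLICIT NONE
INTEGER,PARAMETER :: n=1000
INTEGER,device :: a_d(n)
INTEGER :: a(n),ierr
CALL fill<<<(n+255)/256,256>>>(a_d,n)  !enough 256-thread blocks to cover n
ierr = cudaDeviceSynchronize()
a = a_d  !device-to-host copy by assignment
WRITE(*,*) a(1),a(n)  !expect 1 and 1000
END PROGRAM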

When a device function is invoked from a kernel, no launch configuration can be given (each thread calls the function on its own). The built-in grid variables are also visible inside device functions; their values are those of the calling thread.

Host/device synchronization

By default, host code and device code (global kernels) do not execute synchronously:
in the code below, control returns to the CPU as soon as CALL hellocuda <<<2,3>>>() starts executing.

a_d=1 
CALL hellocuda <<<2,3>>>()

Data transfers between host and device, such as a_d=1, are synchronous (blocking): the transfer happens only once both the CPU and the GPU have reached that point.
Other synchronization methods:

  • Insert i=cudaDeviceSynchronize() in the code; the host (CPU) program continues only after all device (GPU) code issued so far has finished (see the timing sketch after this list)
  • Set the environment variable export CUDA_LAUNCH_BLOCKING=1 to make kernel launches synchronous
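
A minimal sketch of why the synchronization point matters when timing a launch, assuming the hellocuda kernel from the example above is compiled in the same file (cpu_time is standard Fortran):

PROGRAM timing
USE cudafor
IMPLICIT NONE
REAL :: t0,t1
INTEGER :: ierr
CALL cpu_time(t0)
CALL hellocuda<<<2,3>>>()       !returns immediately; the GPU keeps running
ierr = cudaDeviceSynchronize()  !without this line, t1-t0 measures only the launch
CALL cpu_time(t1)
WRITE(*,*) "elapsed (s):",t1-t0
END PROGRAM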

Passing arguments to kernels

By default, arguments are passed by reference.
A variable passed this way must live on the device, i.e. its original declaration must carry the device attribute, e.g. INTEGER,device :: x_d(100).
Only addresses on the GPU/device can be passed this way;
the address of a host variable cannot be passed directly.

Passing by value:
Pass-by-value can transfer host variables; declare the dummy argument inside the kernel with the value attribute, e.g. INTEGER,value :: N.
A kernel that takes value arguments must be defined inside a MODULE, not as an external routine; otherwise the received values are garbage, corrupting the computation.

Example: the external kernel fails to receive the value argument while the module version succeeds; device dummy arguments work in both cases.

cndaqiang@girl:~/code/cuda$ cat cudavar.cuf
attributes(global) SUBROUTINE whoIam(n,n_d)
IMPLICIT NONE
INTEGER,value :: N
INTEGER :: N_d
WRITE(*,*) "External", "N",N,"N_d",N_d
END SUBROUTINE

MODULE m_var
CONTAINS
attributes(global) SUBROUTINE whoIam_m(n,n_d)
IMPLICIT NONE
INTEGER,value :: N
INTEGER :: N_d
WRITE(*,*) "Module", "N",N,"N_d",N_d
END SUBROUTINE
END MODULE

PROGRAM testSend
use cudafor
USE m_var
IMPLICIT NONE
INTEGER   :: N,i
INTEGER,device :: N_d
N=10
N_d=10
CALL whoIam<<<1,1>>>(N,N_d)
CALL whoIam_m<<<1,1>>>(N,N_d)
i=cudaDeviceSynchronize()
END PROGRAM
cndaqiang@girl:~/code/cuda$ pgfortran cudavar.cuf && ./a.out 
 External N    816657356 N_d           10
 Module N           10 N_d           10

Performance test

Download fastcuda.cuf, then compile and run it.
For arrays larger than 1E5 elements, the GPU (940MX) compute time is roughly 1/3 of the CPU (i7-7500U) time.
Assignment between different variables is cheaper than assigning a variable to itself.
Between different variables: GPU-to-GPU is faster than CPU-to-CPU.
A GPU<->CPU transfer costs roughly the sum of a GPUx-to-GPUy copy and a CPUx-to-CPUy copy.
The default GPUx-to-GPUy assignment is quite fast.
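
For reference, a sketch of the assignment-based copies that the timing labels below refer to (the array names are illustrative):

PROGRAM copies
USE cudafor
IMPLICIT NONE
INTEGER,PARAMETER :: n=100000
REAL,device :: x_d(n),y_d(n)  !device (GPU) arrays
REAL :: x(n),y(n)             !host (CPU) arrays
x = 1.0
x_d = x    !CPU->GPU: host-to-device transfer
y_d = x_d  !GPUx->GPUy: device-to-device copy
y = y_d    !GPU->CPU: device-to-host transfer
x = y      !CPUx->CPUy: host-to-host copy
END PROGRAM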

cndaqiang@girl:~/code/cuda$ pgfortran fastcuda.cuf  && ./a.out 
 -----------------------
N     100000      CPU build, WALL         1.10000 ms
N     100000        CPU cal, WALL         0.25800 ms
N     100000      GPU build, WALL         0.12800 ms
N     100000        GPU cal, WALL         0.06400 ms
N     100000       GPU->CPU, WALL         0.33300 ms
N     100000       CPU->GPU, WALL         0.30300 ms
N     100000     CPUy->CPUx, WALL         0.20800 ms
N     100000     CPUy->CPUy, WALL         0.42600 ms
N     100000     GPUy->GPUz, WALL         0.07100 ms
N     100000     GPUy->GPUy, WALL         0.59500 ms
N     100000                Error         0.00000
 -----------------------
N  100100000      CPU build, WALL       910.17902 ms
N  100100000        CPU cal, WALL       202.79601 ms
N  100100000      GPU build, WALL        45.75900 ms
N  100100000        GPU cal, WALL        59.73200 ms
N  100100000       GPU->CPU, WALL       244.66000 ms
N  100100000       CPU->GPU, WALL       251.89999 ms
N  100100000     CPUy->CPUx, WALL       210.46001 ms
N  100100000     CPUy->CPUy, WALL       495.70901 ms
N  100100000     GPUy->GPUz, WALL        31.10200 ms
N  100100000     GPUy->GPUy, WALL       499.45700 ms
N  100100000                Error         0.00000

Profile the run time of each part with nvprof ./a.out

Example

(python27) cndaqiang@girl:~/code/cuda$ nvprof ./a.out
            2            2            2
==2410== NVPROF is profiling process 2410, command: ./a.out
...program output...
==2410== Profiling application: ./a.out
==2410== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  509.55us         1  509.55us  509.55us  509.55us  m_mycuda_hellocuda_
      API calls:   99.61%  264.75ms         1  264.75ms  264.75ms  264.75ms  cudaLaunchKernel
                    0.21%  549.89us         1  549.89us  549.89us  549.89us  cudaDeviceSynchronize
                    0.12%  318.01us        97  3.2780us     250ns  139.94us  cuDeviceGetAttribute
                    0.05%  123.14us         1  123.14us  123.14us  123.14us  cuDeviceTotalMem
                    0.02%  50.688us         1  50.688us  50.688us  50.688us  cuDeviceGetName
                    0.00%  2.9850us         3     995ns     267ns  2.0500us  cuDeviceGetCount
                    0.00%  2.1570us         2  1.0780us     348ns  1.8090us  cuDeviceGet
                    0.00%     464ns         1     464ns     464ns     464ns

Errors

CUDA code errors

Calling a host (CPU) routine from device (GPU) code, e.g. CALL SLEEP(1) or CALL system(""):

PGF90-S-0155-Calls from device code to a host subroutine are allowed only in emulation mode - sleep (hellocuda.cuf: 11)
  0 inform,   0 warnings,   1 severes, 0 fatal for mycuda2

Printing a device variable on the host, e.g. WRITE(*,*) a_d:

PGF90-S-0155-device data allowed in I/O statements only in emulation mode  (hellocuda.cuf: 45)
  0 inform,   0 warnings,   1 severes, 0 fatal for hellocuda

Device code apparently cannot use formatted output: WRITE(*,"(A)") "Hello" fails; only list-directed WRITE(*,*) "Hello" works:

PGF90-S-0155-I/O statements allowed in device routines only in emulation mode  (hellocuda.cuf: 12)
  0 inform,   0 warnings,   1 severes, 0 fatal for mycuda

Launching an attributes(device) subroutine from the host:

PGF90-S-0155-Kernel launch of attributes(device) subprogram is not allowed - gpufun (hellocuda.cuf: 43)
  0 inform,   0 warnings,   1 severes, 0 fatal for main

Naming the file .f90 instead of .cuf, so the CUDA functions are not recognized:

/home/cndaqiang/code/cuda/checkcuda.f90:8: undefined reference to `cudagetdevicecount_'

Passing a host variable to a kernel; if you really must pass it, add the value attribute to the dummy argument and define the kernel inside a MODULE:

PGF90-S-0528-Argument number 1 to whoiam_m: device attribute mismatch (cudavar.cuf: 47)
  0 inform,   0 warnings,   1 severes, 0 fatal for testsend

Plain-Fortran syntax errors under PGI

Stray full-width characters (spaces, exclamation marks, Chinese characters, ...) after a line of code, e.g. `IMPLICIT NONE` versus IMPLICIT NONE followed by a full-width space. This is only a warning; the code still runs:

PGF90-W-0025-Illegal character (E3) - ignored (hellocuda.cuf: 7)

A full-width comma typed in place of an ASCII comma:

PGF90-W-0025-Illegal character (EF) - ignored (hellocuda.cuf: 12)

sqrt expects a real argument; calling it with an integer, as in sqrt(10), triggers an error (write sqrt(10.0) or sqrt(REAL(n)) instead):

PGF90-S-0038-Symbol, sqrt, has not been explicitly declared (fatercuda.cuf)
  0 inform,   0 warnings,   1 severes, 0 fatal for fast

This post first appeared on my blog @cndaqiang.
Unless stated otherwise, all articles on this blog are licensed under CC BY-SA 4.0; please credit the source when reposting!

