According to TensorFlow's installation instructions, the setup needs to satisfy the following requirements:

  • CUDA Toolkit 9.0
  • NVIDIA drivers
  • cuDNN v7.0
  • A supported (CUDA-capable) GPU
  • Various dependency libraries

System preparation

Information about the Ubuntu system used:

lsb_release -a

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 17.10
Release:        17.10
Codename:       artful

Kernel version information:

uname -a

Linux ALD 4.13.0-25-generic #29-Ubuntu SMP Mon Jan 8 21:14:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Check the NVIDIA graphics card model:

lspci | grep -i nvidia

01:00.0 VGA compatible controller: NVIDIA Corporation GM107 [GeForce GTX 750 Ti] (rev a2)

You can look up the card in the CUDA GPUs support list to confirm that it supports CUDA.

Check the GCC version:

gcc --version

gcc (Ubuntu 7.2.0-8ubuntu3) 7.2.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

This GCC version is too new for CUDA, so an older GCC also has to be installed. If keeping multiple GCC versions on the system offends your sense of tidiness, you could try creating a conda virtual environment, installing the matching GCC version inside it, and compiling there. In my testing, however, installing gcc in a conda environment and then installing CUDA and bazel there can cause problems later when compiling TensorFlow, so I do not recommend that approach.

The graphics driver can be installed directly from Software & Updates -> Additional Drivers: find the appropriate driver and click to install it.
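If you prefer the command line, Ubuntu's ubuntu-drivers tool (from the ubuntu-drivers-common package) can do the same job; this is just a sketch of the equivalent steps, not what I used here:

# list detected devices and the drivers Ubuntu recommends for them
ubuntu-drivers devices
# install the recommended proprietary driver
sudo ubuntu-drivers autoinstall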

The installation may also complain about missing recommended libraries:

Missing recommended library: libGLU.so
Missing recommended library: libXmu.so

In that case install the following dependencies:

sudo apt-get install libglu1-mesa libxi-dev libxmu-dev libglu1-mesa-dev

In addition to the libraries above, you also need:

sudo apt-get install libcupti-dev

Installing CUDA 9

Download the CUDA installer from here, choosing the matching operating system and architecture; I downloaded the runfile. Then run sudo sh cuda_9.0.176_384.81_linux.run --override. The --override flag is needed because my GCC version is not officially supported; if your GCC version matches, you can leave it out. The installer asks a series of questions, and my answers were:

Logging to /tmp/cuda_install_11627.log
Using more to view the EULA.
End User License Agreement
--------------------------


Preface
-------

The Software License Agreement in Chapter 1 and the Supplement
in Chapter 2 contain license terms and conditions that govern
the use of NVIDIA software. By accepting this agreement, you
agree to comply with all the terms and conditions applicable
to the product(s) included herein.


NVIDIA Driver


Description

This package contains the operating system driver and
fundamental system software components for NVIDIA GPUs.


Do you accept the previously read EULA?
accept/decline/quit: accept

You are attempting to install on an unsupported configuration. Do you wish to continue?
(y)es/(n)o [ default is no ]: y

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?
(y)es/(n)o/(q)uit: n

Install the CUDA 9.0 Toolkit?
(y)es/(n)o/(q)uit: y

Enter Toolkit Location
 [ default is /usr/local/cuda-9.0 ]: 

Do you want to install a symbolic link at /usr/local/cuda?
(y)es/(n)o/(q)uit: y

Install the CUDA 9.0 Samples?
(y)es/(n)o/(q)uit: n

Installing the CUDA Toolkit in /usr/local/cuda-9.0 ...

===========
= Summary =
===========

Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-9.0
Samples:  Not Selected

Please make sure that
 -   PATH includes /usr/local/cuda-9.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.

***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required for CUDA 9.0 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run -silent -driver

Logfile is /tmp/cuda_install_11627.log

At the prompt "Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81?" choose n, because the driver was already installed earlier and reinstalling it here is reported to cause problems. I also chose not to install the CUDA Samples above; that part is up to you. The WARNING at the end says the installation is incomplete; others have hit the same message, and it most likely appears only because the driver was not selected here, so CUDA itself still works fine. Finally, as the summary suggests, add the installation paths to your environment by appending the following two lines to ~/.bashrc or ~/.zshrc:

export PATH=/usr/local/cuda-9.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:$LD_LIBRARY_PATH

After adding them, remember to source ~/.zshrc or ~/.bashrc so the configuration takes effect.
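You can confirm that the toolkit is now on the PATH with nvcc:

# should report Cuda compilation tools, release 9.0
nvcc --version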

Once CUDA is installed, you can use nvidia-smi to check the card and driver information:

➜  release  nvidia-smi
Wed Jan 17 15:24:35 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |
| 40%   30C    P0     1W /  38W |    294MiB /  1995MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1585      G   /usr/lib/xorg/Xorg                            15MiB |
|    0      1655      G   /usr/bin/gnome-shell                          50MiB |
|    0      2130      G   /usr/lib/xorg/Xorg                            86MiB |
|    0      2264      G   /usr/bin/gnome-shell                         137MiB |
+-----------------------------------------------------------------------------+

If you installed the CUDA Samples earlier, you can try building a few of them:

cd ~/NVIDIA_CUDA-9.0_Samples/0_Simple/asyncAPI
make

A problem may appear at this point: CUDA 9 requires a GCC version no newer than 6, so compilation fails with a version error:

"/usr/local/cuda-9.0"/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o asyncAPI.o -c asyncAPI.cu
In file included from /usr/local/cuda-9.0/bin/..//include/host_config.h:50:0,
                 from /usr/local/cuda-9.0/bin/..//include/cuda_runtime.h:78,
                 from <command-line>:0:
/usr/local/cuda-9.0/bin/..//include/crt/host_config.h:119:2: error: #error -- unsupported GNU version! gcc versions later than 6 are not supported!
 #error -- unsupported GNU version! gcc versions later than 6 are not supported!
  ^~~~~
Makefile:273: recipe for target 'asyncAPI.o' failed
make: *** [asyncAPI.o] Error 1

The common advice online is to install an older gcc and create symbolic links for CUDA to use:

sudo apt install gcc-6 g++-6
sudo ln -s /usr/bin/gcc-6 /usr/local/cuda/bin/gcc 
sudo ln -s /usr/bin/g++-6 /usr/local/cuda/bin/g++

This leaves two gcc versions installed on the system; alternatively you could downgrade the default gcc (one way to switch between versions is sketched below), but I find the symlink approach simpler.
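If you would rather switch the system-wide default compiler instead of symlinking into the CUDA directory, update-alternatives can manage both versions. This is a sketch, assuming gcc-6/g++-6 and gcc-7/g++-7 are both installed:

# register both compilers; the higher priority (70) becomes the automatic default
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-7 70 --slave /usr/bin/g++ g++ /usr/bin/g++-7
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-6 60 --slave /usr/bin/g++ g++ /usr/bin/g++-6
# interactively pick gcc-6 before building CUDA code, and switch back afterwards
sudo update-alternatives --config gcc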

Now try the build again:

➜  asyncAPI make
"/usr/local/cuda-9.0"/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o asyncAPI.o -c asyncAPI.cu
"/usr/local/cuda-9.0"/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_70,code=compute_70 -o asyncAPI asyncAPI.o 
mkdir -p ../../bin/x86_64/linux/release
cp asyncAPI ../../bin/x86_64/linux/release
➜  asyncAPI cd ../../bin/x86_64/linux/release
➜  release ./asyncAPI 
[./asyncAPI] - Starting...
GPU Device 0: "GeForce GTX 750 Ti" with compute capability 5.0

CUDA device [GeForce GTX 750 Ti]
time spent executing by the GPU: 12.30
time spent by CPU in CUDA calls: 0.03
CPU executed 44857 iterations while waiting for GPU to finish

The build completes normally and the resulting executable runs correctly, so CUDA is installed and working.
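The deviceQuery sample is another common sanity check; if the samples were installed, it prints the GPU's compute capability and should end with Result = PASS:

cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery
make
./deviceQuery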

Installing cuDNN (CUDA Deep Neural Network library)

Download the package from NVIDIA's website (registration is required before downloading). The cuDNN version must match the CUDA version. cuDNN bundles many of the library routines that deep neural networks need and accelerates those computations. After downloading, extract the archive and copy the files into the corresponding CUDA directories:

tar zxvf cudnn-9.0-linux-x64-v7.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
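A quick way to confirm which cuDNN version is now in place is to read the version macros from the copied header:

# should show CUDNN_MAJOR 7 followed by the minor and patch levels
grep -A 2 "#define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h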

Installing TensorFlow

Here conda is used for the installation, with Python 2.7.

conda create -n tensorflow python=2.7
pip install --ignore-installed --upgrade /home/l0o0/Downloads/tensorflow_gpu-1.4.1-cp27-none-linux_x86_64.whl
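Note that the wheel needs to be installed into the new environment's Python, so activate the environment before running pip (a step not shown above):

source activate tensorflow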

These installation steps follow the official documentation.

When installing with pip, the package would normally be downloaded from Google's storage; because of network problems, I downloaded the GPU-support wheel by other means first.

If the module cannot be imported after the installation above, you can try installing directly with:

pip install tensorflow-gpu
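Once either install succeeds, a quick check that the GPU build can actually see the card (using the TensorFlow 1.x API) looks like this; a GPU device should appear in the listed devices:

python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"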

Problems during installation

The installation above completed without any other problems, and importing the module worked fine:

python -c 'import tensorflow as tf'

However, when I tested a small program, a lot of problems appeared and it simply would not run. For the details, see the issue I filed on TensorFlow's GitHub. At that point TensorFlow's prebuilt packages did not yet support CUDA 9 and cuDNN 7, which is why errors such as a missing libcusolver.so.8.0 showed up at runtime. According to the notes on configuring a deep-learning server environment and the issue "CUDA 9RC + cuDNN7", there is already a TensorFlow version that supports CUDA 9 and cuDNN 7, but it has to be compiled and installed manually. I downloaded bazel-0.9.0-dist.zip from bazel's GitHub:

mkdir bazel
unzip bazel-0.9.0-dist.zip -d bazel
cd bazel
bash ./compile.sh

Once the build finishes, copy the bazel executable to somewhere on the system PATH:

sudo cp output/bazel /usr/local/bin
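Check that the copied binary is picked up:

bazel version   # should report Build label: 0.9.0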

Then download the 1.5 release candidate from TensorFlow's GitHub; this version supports CUDA 9 and cuDNN 7:

tar zxvf tensorflow-1.5.0-rc1.tar.gz
./configure

This starts a series of interactive questions; I answered them as in the example below:

You have bazel 0.9.0- (@non-git) installed.
Please specify the location of python. [Default is /home/l0o0/.miniconda2/envs/tf/bin/python]: 


Found possible Python library paths:
  /home/l0o0/.miniconda2/envs/tf/lib/python2.7/site-packages
Please input the desired Python library path to use.  Default is [/home/l0o0/.miniconda2/envs/tf/lib/python2.7/site-packages]

Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: Y
jemalloc as malloc support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.

Do you wish to build TensorFlow with XLA JIT support? [y/N]: n
No XLA JIT support will be enabled for TensorFlow.

Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.

Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.

Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.

Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.

Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 


Please specify the location where CUDA 9.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: 


Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 


Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:


Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 5.0]


Do you want to use clang as CUDA compiler? [y/N]: N
nvcc will be used as CUDA compiler.

Please specify which gcc should be used by nvcc as the host compiler. [Default is /home/l0o0/.miniconda2/envs/tf/bin/gcc]: 


Do you wish to build TensorFlow with MPI support? [y/N]: N
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: 


Add "--config=mkl" to your bazel command to build with MKL support.
Please note that MKL on MacOS or windows is still not supported.
If you would like to use a local MKL instead of downloading, please set the environment variable "TF_MKL_ROOT" every time before build.

Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: N
Not configuring the WORKSPACE for Android builds.

Configuration finished

With the configuration done, build the GPU-enabled TensorFlow pip package from inside the extracted tensorflow-1.5.0-rc1 directory; the bazel build steps follow the official documentation:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package 
WARNING: /home/l0o0/Downloads/tensorflow-1.5.0-rc1/tensorflow/core/BUILD:1807:1: in includes attribute of cc_library rule //tensorflow/core:framework_headers_lib: '../../external/nsync/public' resolves to 'external/nsync/public' not below the relative path of its package 'tensorflow/core'. This will be an error in the future. Since this rule was created by the macro 'cc_header_only_library', the error might have been caused by the macro implementation in /home/l0o0/Downloads/tensorflow-1.5.0-rc1/tensorflow/tensorflow.bzl:1138:30
WARNING: /home/l0o0/Downloads/tensorflow-1.5.0-rc1/tensorflow/contrib/learn/BUILD:15:1: in py_library rule //tensorflow/contrib/learn:learn: target '//tensorflow/contrib/learn:learn' depends on deprecated target '//tensorflow/contrib/session_bundle:exporter': No longer supported. Switch to SavedModel immediately.
WARNING: /home/l0o0/Downloads/tensorflow-1.5.0-rc1/tensorflow/contrib/learn/BUILD:15:1: in py_library rule //tensorflow/contrib/learn:learn: target '//tensorflow/contrib/learn:learn' depends on deprecated target '//tensorflow/contrib/session_bundle:gc': No longer supported. Switch to SavedModel immediately.
INFO: Analysed target //tensorflow/tools/pip_package:build_pip_package (2 packages loaded).
INFO: Found 1 target...
ERROR: /home/l0o0/.cache/bazel/_bazel_l0o0/b6b8b60fd07dbe29a0dec917256bc463/external/protobuf_archive/BUILD:265:1: Linking of rule '@protobuf_archive//:js_embed' failed (Exit 1)
/usr/bin/ld: cannot find Scrt1.o: No such file or directory
collect2: error: ld returned 1 exit status
Target //tensorflow/tools/pip_package:build_pip_package failed to build
Use --verbose_failures to see the command lines of failed build steps.

After looking into it online, the error above turned out to be caused mainly by my original setup: I had created a conda virtual environment to install gcc-6, and that GCC version together with its library paths broke the link step. I recommend not installing CUDA and bazel from inside a conda environment. Use the system GCC directly, and keep an older GCC available as well; I installed gcc-6 and simply created symbolic links to its executables in /usr/local/cuda/bin.

In the end I installed CUDA, cuDNN, and bazel with the gcc-7 that ships with Ubuntu 17.10, and then built tensorflow-1.5.0-rc1 with bazel:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package 

The build took roughly half an hour and produced an executable called build_pip_package in bazel-bin/tensorflow/tools/pip_package/. Run it to generate the .whl file that pip needs:

# write the generated whl file to /tmp/tensorflow_pkg
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-1.4.1-py2-none-any.whl

After that, TensorFlow can be used normally.
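As a final sanity check, a minimal TF 1.x session with device placement logging should show ops being assigned to the GPU (a sketch, not the exact test program mentioned above):

python - <<'EOF'
import tensorflow as tf
# log_device_placement prints which device each op is assigned to
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    a = tf.constant([1.0, 2.0, 3.0])
    b = tf.constant([4.0, 5.0, 6.0])
    print(sess.run(a + b))
EOF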