A random walk in PyTorch (3) -- My Precious!

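OK, we open with the same old question: there are so many files under aten/src, so how do we read them?

I strongly recommend trying it yourself first and working out how the code under aten is organized.

From hardest to easiest:

Can you manage without reading the readme?
Can you manage without reading the doc?
If not, go and read the doc after all.

OK, let's get started.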

Hello, ATen

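Notice that ATen is built with CMake, so let's look at CMakeLists.txt first.

It doesn't matter if you don't know CMake; nobody is asking you to write it.

Going through it: it sets some basic parameters; sets the compiler flags; looks for some dependencies, and note that the (optional or required) dependencies include CUDA, OpenMP, MAGMA, BLAS, and LAPACK; checks the CPU instruction set; and also probes for a few compiler bugs (features). Most of this is detail we need not care about. The dependencies will be used later, so it is worth knowing what each one is for, but I won't go into that here.

Next it starts handling ATen itself, and we can see: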

add_definitions(-DTH_INDEX_BASE=0)
set(TH_LINK_STYLE STATIC)
add_subdirectory(src/TH)
include_directories(
  # dense
  ${CMAKE_CURRENT_SOURCE_DIR}/src/TH
  ${CMAKE_CURRENT_SOURCE_DIR}/src/THC
  ${CMAKE_CURRENT_BINARY_DIR}/src/TH
  ${CMAKE_CURRENT_BINARY_DIR}/src/THC
  # sparse
  ${CMAKE_CURRENT_SOURCE_DIR}/src/THS
  ${CMAKE_CURRENT_SOURCE_DIR}/src/THCS
  ${CMAKE_CURRENT_BINARY_DIR}/src/THS
  ${CMAKE_CURRENT_BINARY_DIR}/src/THCS

  ${CMAKE_CURRENT_SOURCE_DIR}/src
  ${CMAKE_CURRENT_BINARY_DIR}/src)
add_subdirectory(src/THNN)
add_subdirectory(src/THS)

if(NO_CUDA)
  message("disabling CUDA because NO_CUDA is set")
  SET(CUDA_FLAG -n)
  SET(AT_CUDA_ENABLED 0)
else()
  SET(AT_CUDA_ENABLED 1)
  INCLUDE_DIRECTORIES(${CUDA_INCLUDE_DIRS})
  find_package(CUDA 5.5 REQUIRED)
  add_subdirectory(src/THC)
  add_subdirectory(src/THCUNN)
  add_subdirectory(src/THCS)
endif()

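We can see that three subdirectories are added here: src/TH, src/THNN, and src/THS. Then, depending on whether CUDA is available, src/THC, src/THCUNN, and src/THCS are added as well. That makes things very clear: there are three groups of variations. The first is the ordinary dense tensor, the second is the sparse tensor (the comments tell you as much), and the third is obviously related to neural networks, although we don't yet know what it does. Each of them also has a corresponding GPU version.

Next, the CuDNN and NNPACK dependencies are added.

Notice: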

set(cwrap_files
  ${CMAKE_CURRENT_SOURCE_DIR}/src/ATen/Declarations.cwrap
  ${CMAKE_CURRENT_SOURCE_DIR}/src/THNN/generic/THNN.h
  ${CMAKE_CURRENT_SOURCE_DIR}/src/THCUNN/generic/THCUNN.h
  ${CMAKE_CURRENT_SOURCE_DIR}/src/ATen/nn.yaml
  ${CMAKE_CURRENT_SOURCE_DIR}/src/ATen/native/native_functions.yaml
)

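A few files are added here. They are clearly special, so we should keep them in mind.

Finally src/ATen is added. test can be ignored, and we can leave contrib alone, since we are mainly looking at the core.

OK, so what is src/ATen for? From the order in which these subdirectories are added and from their names, you should be able to guess that src/ATen is the interface to the directories that hold the concrete implementations of these six tensors. You may want to verify that, though, so let's keep looking.

Search for TH.h under src/ATen and you will see it referenced in gen.py. That means TH.h is needed when some of ATen's files are generated, so ATen wraps TH. Q.E.D.

OK, to be honest, that's not how I found it at first. What I actually did was this:

In src/ATen there is ATen.h: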

#pragma once

#include "ATen/ATenGeneral.h"
#include "ATen/Allocator.h"
#include "ATen/Scalar.h"
#include "ATen/Type.h"
#include "ATen/Generator.h"
#include "ATen/Context.h"
#include "ATen/Storage.h"
#include "ATen/Tensor.h"
#include "ATen/TensorGeometry.h"
#include "ATen/Functions.h"
#include "ATen/Formatting.h"
#include "ATen/TensorOperators.h"
#include "ATen/TensorMethods.h"

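Two of these files look like they might hold something important: ATen/ATenGeneral.h and ATen/Tensor.h. The first has nothing in it and the second does not exist. If it does not exist, there is only one possibility: this header is generated.

Open ATen/CMakeLists.txt and search for Tensor, and you will find nothing. That means the file is generated in bulk. We also notice a templates folder under ATen, and it happens to contain Tensor.h.

However, we can't find anything there that references TH.h, which makes no sense, so a search is needed, and that turns up the gen.py from before. Understanding gen.py is a lot of work, because we don't really know the purpose and logic of the generated files (though it is not hard to guess). Most importantly, we already have the generated result: doc/Tensor.h.

From doc/Tensor.h we can see: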

struct Tensor : public detail::TensorBase {

...

inline Tensor toType(const Type & t) const;
inline Tensor & copy_(const Tensor & src);
inline Tensor toType(ScalarType t) const;
inline Tensor toBackend(Backend b) const;

...

}

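Tensor is a wrapper around a tensor, which may hold any of several numeric types and may live in main memory or in GPU memory. In other words, Tensor wraps TH and THC (sparse and dense are never implicitly convertible in any library) and carries out operations across different data types and architectures through one uniform set of function calls.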
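To make "one uniform set of function calls" a bit more concrete, here is a minimal sketch in plain C of a type-erased handle that picks the right typed routine at run time. Everything in it (AnyTensor, ScalarKind, any_tensor_sum) is a name I made up for illustration; it is not how ATen is implemented, only the general shape of the idea.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout: this is not ATen's API. */
typedef enum { SCALAR_FLOAT, SCALAR_DOUBLE } ScalarKind;

typedef struct {
    ScalarKind kind;   /* which element type the buffer holds */
    const void *data;  /* type-erased pointer to the elements */
    size_t count;      /* number of elements */
} AnyTensor;

/* Typed back ends: same logic, different element type. */
static double sum_float(const float *p, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) acc += p[i];
    return acc;
}

static double sum_double(const double *p, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) acc += p[i];
    return acc;
}

/* One entry point for the caller; the element type is chosen at run time. */
static double any_tensor_sum(const AnyTensor *t) {
    switch (t->kind) {
    case SCALAR_FLOAT:
        return sum_float((const float *)t->data, t->count);
    case SCALAR_DOUBLE:
        return sum_double((const double *)t->data, t->count);
    }
    assert(0 && "unknown scalar kind");
    return 0.0;
}
```

The caller only ever sees any_tensor_sum; which typed routine runs depends on the tag stored in the handle. A backend tag (CPU or GPU) could be dispatched on in the same way.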

Follow the Numpy

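Before going on, I suggest that anyone who has not used numpy, Matlab, or R (properly) read something about the basic principles behind numpy (not a tutorial). I recommend From Python to Numpy ¹, but feel free to read whatever you prefer. There should be plenty of articles like this online, all better written than mine, so I won't repeat them below.

Essentially, apart from Tensor supporting GPUs, Tensor does basically the same job as numpy's ndarray. So if you understand the basic principles of numpy, what TH is trying to do is obvious.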

DRY

Now read src/TH/THStorage.h. I suspect many people are already dizzy.

#define THStorage        TH_CONCAT_3(TH,Real,Storage)
#define THStorage_(NAME) TH_CONCAT_4(TH,Real,Storage_,NAME)

/* fast access methods */
#define TH_STORAGE_GET(storage, idx) ((storage)->data[(idx)])
#define TH_STORAGE_SET(storage, idx, value) ((storage)->data[(idx)] = (value))

#include "generic/THStorage.h"
#include "THGenerateAllTypes.h"

#include "generic/THStorage.h"
#include "THGenerateHalfType.h"

#include "generic/THStorageCopy.h"
#include "THGenerateAllTypes.h"

#include "generic/THStorageCopy.h"
#include "THGenerateHalfType.h"

Let's look at generic/THStorage.h first:

typedef struct THStorage
{
    real *data;
    ptrdiff_t size;
    int refcount;
    char flag;
    THAllocator *allocator;
    void *allocatorContext;
    struct THStorage *view;
} THStorage;

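I find the name of the real macro really misleading, but we can see that it is simply one of the numeric types defined in that pile of THGenerate*Type.h files. So it is not hard to see that THStorage just represents a block of memory: data is the memory address, size is the number of elements in the array, refcount records the reference count (so we can share the same data among several Tensors. You should see that coming, right?), flag marks some properties of the memory (did you notice that these flags are defined right above the struct definition?), and allocator and allocatorContext tell us how this memory came into being and how it will go away.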
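If reference counting on a raw buffer is new to you, here is a minimal sketch of the idea. The names (MiniStorage and its three functions) are hypothetical, the element type is fixed to double, and the allocator, flags, and thread safety are all left out, so treat it as an illustration of the refcount field rather than as TH's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature of a reference-counted buffer; not TH's code. */
typedef struct {
    double *data;      /* start of the element array */
    ptrdiff_t size;    /* number of elements */
    int refcount;      /* number of owners sharing this buffer */
} MiniStorage;

/* Allocates a zero-filled buffer owned by exactly one holder. */
static MiniStorage *mini_storage_new(ptrdiff_t size) {
    assert(size >= 0);
    MiniStorage *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->data = calloc(size > 0 ? (size_t)size : 1, sizeof *s->data);
    if (!s->data) { free(s); return NULL; }
    s->size = size;
    s->refcount = 1;
    return s;
}

/* A second holder announces that it now shares the buffer. */
static void mini_storage_retain(MiniStorage *s) {
    if (s) ++s->refcount;
}

/* Drops one reference; returns 1 if the memory was actually released. */
static int mini_storage_free(MiniStorage *s) {
    if (!s) return 0;
    if (--s->refcount > 0) return 0;
    free(s->data);
    free(s);
    return 1;
}
```

Two tensors sharing one storage would each call retain once, and the memory goes away only when the last of them calls free.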

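So what is view? A quick search of the code under torch/csrc/generic shows that it is used when data does not point to the start of the block returned by the allocator; in that case view points to the original storage. This field was added fairly late, so it is evidently not a core feature (I don't yet know what problem it solves).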

static PyObject * THPStorage_(newTHView)(THStorage *base, ptrdiff_t offset, size_t size)
{
  void *data = (char*)base->data + offset;
  THStoragePtr view(THStorage_(newWithData)(LIBRARY_STATE (real*)data, size));
  view->flag = TH_STORAGE_REFCOUNTED | TH_STORAGE_VIEW;
  view->view = base;       /* <---------- here! */
  THStorage_(retain)(LIBRARY_STATE base);
  return THPStorage_(New)(view.release());
}
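The pattern in that function (point into the middle of the base buffer, remember the base, retain it) can be sketched in a few lines of standalone C. Buf, buf_new, buf_new_view, and buf_free are hypothetical names, and the free path is my own assumption about what a view has to do, since the snippet above only shows creation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature; only the `view` idea is taken from THStorage. */
typedef struct Buf {
    char *data;          /* first byte visible through this handle */
    size_t size;         /* bytes visible through this handle */
    int refcount;
    struct Buf *view;    /* non-NULL: the memory belongs to this base buffer */
} Buf;

static Buf *buf_new(size_t size) {
    Buf *b = malloc(sizeof *b);
    if (!b) return NULL;
    b->data = calloc(size ? size : 1, 1);
    if (!b->data) { free(b); return NULL; }
    b->size = size;
    b->refcount = 1;
    b->view = NULL;
    return b;
}

static void buf_free(Buf *b) {
    if (!b || --b->refcount > 0) return;
    if (b->view) buf_free(b->view);   /* a view only drops its hold on the base */
    else free(b->data);               /* an owner releases the memory itself */
    free(b);
}

/* Creates a handle whose data starts `offset` bytes into the base buffer. */
static Buf *buf_new_view(Buf *base, size_t offset, size_t size) {
    assert(base && offset <= base->size && size <= base->size - offset);
    Buf *v = malloc(sizeof *v);
    if (!v) return NULL;
    v->data = base->data + offset;
    v->size = size;
    v->refcount = 1;
    v->view = base;
    ++base->refcount;   /* keep the base alive as long as the view exists */
    return v;
}
```

Because the view's data pointer is not something an allocator ever returned, it must never be passed to free directly; the view field is what lets the handle find the block that can be.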

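Once you understand the content of src/TH/generic/THStorage.h, you understand the purpose of this file. For basic operations we need to support float, double, int, and other data types, and the logic for all of them is exactly the same. Naturally we would think of templates, but PyTorch does not want to use templates. There are many possible reasons; I am not one of the developers, so I won't speculate. Since templates are not used, the code that differs only in type has to be generated, and that is what

#include "generic/THStorage.h"
#include "THGenerateAllTypes.h"

is for.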

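With that understood, let's see how this code is generated.

Before getting to how the code is generated, let me briefly describe a common trick for generating code with macros. This is really a C and C++ matter and has little to do with PyTorch, so I won't go into detail.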

#define TH_CONCAT_3(x,y,z) TH_CONCAT_3_EXPAND(x,y,z)
#define TH_CONCAT_3_EXPAND(x,y,z) x ## y ## z

#define TH_CONCAT_4_EXPAND(x,y,z,w) x ## y ## z ## w
#define TH_CONCAT_4(x,y,z,w) TH_CONCAT_4_EXPAND(x,y,z,w)


#define THStorage        TH_CONCAT_3(TH,Real,Storage)
#define THStorage_(NAME) TH_CONCAT_4(TH,Real,Storage_,NAME)

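What does TH_CONCAT_3 do here? It simply pastes the three tokens TH, Real, and Storage together, expanding any macros among them. As for why it detours through TH_CONCAT_3_EXPAND, see https://gcc.gnu.org/onlinedocs/cpp/Argument-Prescan.html#Argument-Prescan . The key point is that a macro argument used as an operand of ## is not expanded, so the arguments have to be expanded first and concatenated afterwards.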
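You can watch the difference yourself with a small experiment. The macro names below (PASTE3_DIRECT, PASTE3, STR) are mine, not TH's; the stringifying helper is only there so the pasted identifier can be inspected as a string.

```c
#include <string.h>

/* Pastes the argument names exactly as written: no expansion happens first. */
#define PASTE3_DIRECT(x, y, z) x##y##z

/* Two-level version: the outer macro's arguments are expanded before the
   inner macro pastes them. */
#define PASTE3_EXPAND(x, y, z) x##y##z
#define PASTE3(x, y, z) PASTE3_EXPAND(x, y, z)

/* Stringify after expansion, so the resulting identifier can be inspected. */
#define STR_(x) #x
#define STR(x) STR_(x)

#define Real Double

static const char *direct_name(void) {
    return STR(PASTE3_DIRECT(TH, Real, Storage));
}

static const char *indirect_name(void) {
    return STR(PASTE3(TH, Real, Storage));
}

/* Returns 1 when the two spellings differ, i.e. when the detour mattered. */
static int indirection_matters(void) {
    return strcmp(direct_name(), indirect_name()) != 0;
}
```

The direct version yields THRealStorage, because Real is pasted before it ever gets a chance to expand; the two-level version yields THDoubleStorage, which is what TH wants.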

Take double as an example:

#define real double
#define Real Double

real* THStorage_(data)(const THStorage*);
// is equivalent to 
double* THDoubleStorage_data(const THDoubleStorage*);

Next we want to generate the same definitions once for each type. What is the most direct way to do that? Probably something like this:

// The umbrella header for all types

#define real double
#define Real Double
#include "generic/some_header.h"
#undef real
#undef Real

#define real int
#define Real Int
#include "generic/some_header.h"
#undef real
#undef Real

In practice there may be more than two macros to define, so we might end up writing something like this:

// The umbrella header for all types

#include "def_for_double.h"
#include "generic/some_header.h"
#include "undef.h"

#include "def_for_int.h"
#include "generic/some_header.h"
#include "undef.h"

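This form is quite tedious. If we want to add a new type, we have to modify every one of these generating headers.

Of course, given that in practice a new type may never be added, this has no real drawback beyond some extra repetition.

If we want to reduce the repetition, we obviously need to make the template file to be generated the "parameter", rather than the type.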

This is exactly what

#include "generic/THStorage.h"
#include "THGenerateAllTypes.h"

is for. The file to be generated is brought in first, and THGenerateAllTypes.h then generates it for every type.
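The "template as parameter" idea can be tried out in a single file if we swap the include-based template for a function-like macro. To be clear, that is a different mechanism from what TH does (TH re-includes a header), and every name below (FOR_ALL_TYPES, SUM_TEMPLATE, MAX_TEMPLATE) is hypothetical. What carries over is the shape: the list of types is written once, and the thing to instantiate is what gets passed in.

```c
#include <assert.h>
#include <stddef.h>

#define CONCAT_EXPAND(a, b) a##b
#define CONCAT(a, b) CONCAT_EXPAND(a, b)

/* Plays the role of THGenerateAllTypes.h: the list of types is written once,
   and the "template" to instantiate is the parameter. */
#define FOR_ALL_TYPES(TEMPLATE) \
    TEMPLATE(float, Float)      \
    TEMPLATE(double, Double)    \
    TEMPLATE(int, Int)

/* Plays the role of a generic/ header: code that differs only in type. */
#define SUM_TEMPLATE(real, Real)                                   \
    static real CONCAT(Sum, Real)(const real *data, size_t n) {    \
        real acc = 0;                                              \
        for (size_t i = 0; i < n; ++i) acc += data[i];             \
        return acc;                                                \
    }

/* A second "generic header"; the type list does not need to change. */
#define MAX_TEMPLATE(real, Real)                                   \
    static real CONCAT(Max, Real)(const real *data, size_t n) {    \
        assert(n > 0);                                             \
        real best = data[0];                                       \
        for (size_t i = 1; i < n; ++i)                             \
            if (data[i] > best) best = data[i];                    \
        return best;                                               \
    }

FOR_ALL_TYPES(SUM_TEMPLATE)   /* defines SumFloat, SumDouble, SumInt */
FOR_ALL_TYPES(MAX_TEMPLATE)   /* defines MaxFloat, MaxDouble, MaxInt */
```

Adding a new type means touching FOR_ALL_TYPES only, and adding a new family of functions means writing one more template macro. The price is that the whole template lives inside a macro body, which is painful to debug; I would guess that is one reason to prefer real header files for anything long.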

Look at generic/THStorage.h:

#ifndef TH_GENERIC_FILE
#define TH_GENERIC_FILE "generic/THStorage.h"
#else

...

#define TH_STORAGE_REFCOUNTED 1
#define TH_STORAGE_RESIZABLE  2
#define TH_STORAGE_FREEMEM    4
#define TH_STORAGE_VIEW       8

typedef struct THStorage
{
    real *data;
    ptrdiff_t size;
    int refcount;
    char flag;
    THAllocator *allocator;
    void *allocatorContext;
    struct THStorage *view;
} THStorage;

#endif

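On the first include, TH_GENERIC_FILE is not yet defined, so all the header does is define TH_GENERIC_FILE as its own path.

A specific type, such as float, can be brought in with its corresponding header, for example THGenerateFloatType.h: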

#ifndef TH_GENERIC_FILE
#error "You must define TH_GENERIC_FILE before including THGenerateFloatType.h"
#endif

#define real float
#define accreal double
#define TH_CONVERT_REAL_TO_ACCREAL(_val) (accreal)(_val)
#define TH_CONVERT_ACCREAL_TO_REAL(_val) (real)(_val)
#define Real Float
#define THInf FLT_MAX
#define TH_REAL_IS_FLOAT
#line 1 TH_GENERIC_FILE
#include TH_GENERIC_FILE    /* <--------- This is the template header */
#undef accreal
#undef real
#undef Real
#undef THInf
#undef TH_REAL_IS_FLOAT
#undef TH_CONVERT_REAL_TO_ACCREAL
#undef TH_CONVERT_ACCREAL_TO_REAL

#ifndef THGenerateManyTypes
#undef TH_GENERIC_FILE
#endif

Of course, we also want to be able to bring in the definitions for all types at once (THGenerateAllTypes.h).

To keep things simple, I'll use THGenerateFloatTypes.h to explain the approach. In THGenerateFloatTypes.h:

#ifndef TH_GENERIC_FILE
#error "You must define TH_GENERIC_FILE before including THGenerateFloatTypes.h"
#endif

#ifndef THGenerateManyTypes
#define THFloatLocalGenerateManyTypes
#define THGenerateManyTypes
#endif

#include "THGenerateFloatType.h"
#include "THGenerateDoubleType.h"

#ifdef THFloatLocalGenerateManyTypes
#undef THFloatLocalGenerateManyTypes
#undef THGenerateManyTypes
#undef TH_GENERIC_FILE
#endif

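We can see that THGenerateManyTypes ensures TH_GENERIC_FILE is not undefined as soon as one individual type's definitions are done, so a single file can generate the definitions for several types.

If you find this code hard to follow, that's normal, because there is a clearer and simpler way to write it.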

## in THStorage.h

#define TH_CURRNET_GENERIC_FILE "generic/THStorage.h"
#include "THGenerateAllType.h"
#undef TH_CURRNET_GENERIC_FILE


## in THGenerateAllType.h

#include "THGenerateFloatTypes.h"
#include "THGenerateIntTypes.h"


## in THGenerateFloatTypes.h

#include "THGenerateFloatType.h"
#include "THGenerateDoubleType.h"


## in THGenerateFloatType.h

#ifndef TH_CURRNET_GENERIC_FILE
#error "You must define TH_CURRNET_GENERIC_FILE before including THGenerateFloatType.h"
#endif

#define real float
#define accreal double
#define TH_CONVERT_REAL_TO_ACCREAL(_val) (accreal)(_val)
#define TH_CONVERT_ACCREAL_TO_REAL(_val) (real)(_val)
#define Real Float
#define THInf FLT_MAX
#define TH_REAL_IS_FLOAT
#line 1 TH_CURRNET_GENERIC_FILE
#include TH_CURRNET_GENERIC_FILE
#undef accreal
#undef real
#undef Real
#undef THInf
#undef TH_REAL_IS_FLOAT
#undef TH_CONVERT_REAL_TO_ACCREAL
#undef TH_CONVERT_ACCREAL_TO_REAL


## in generic/THStorage.h

## Just the template code

The following in THStorage.h:

#include "generic/THStorage.h"
#include "THGenerateAllTypes.h"

#include "generic/THStorage.h"
#include "THGenerateHalfType.h"

#include "generic/THStorageCopy.h"
#include "THGenerateAllTypes.h"

#include "generic/THStorageCopy.h"
#include "THGenerateHalfType.h"

can be rewritten as:

#define TH_CURRNET_GENERIC_FILE "generic/THStorage.h"
#include "THGenerateAllType.h"
#include "THGenerateHalfType.h"
#undef TH_CURRNET_GENERIC_FILE

#define TH_CURRNET_GENERIC_FILE "generic/THStorageCopy.h"
#include "THGenerateAllType.h"
#include "THGenerateHalfType.h"
#undef TH_CURRNET_GENERIC_FILE

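As you can see, there is no need for all the define/undef checks; we just pass the generic header as the "parameter". So I don't quite understand why PyTorch writes it in such a cumbersome form. The only benefit I can think of is guarding against people forgetting to undef TH_CURRNET_GENERIC_FILE? It is also quite confusing, because you rarely see the same header included directly several times in a row.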

Half type

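Because we are using TH as the example here, half looks like just an ordinary float and seems pointless.

Neural networks are extremely insensitive to precision², so using a two-byte floating-point type to cut memory usage and raise throughput can speed up computation several times over³. However, FP16 (half the size of a float) is supported as a type only in CUDA (in THC).

FP16 itself is off topic, so I won't go any further.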

The tensor

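I won't explain what every line of code does; by and large it is self-evident.

There isn't much to say about generic/THStorage.h.

I'll just annotate generic/THTensor.h roughly. Nothing in it calls for special discussion, so just skim through the code.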

#ifndef TH_GENERIC_FILE
#define TH_GENERIC_FILE "generic/THTensor.h"
#else

/* a la lua? dim, storageoffset, ...  et les methodes ? */

#define TH_TENSOR_REFCOUNTED 1

typedef struct THTensor
{
    // Size of each dimension
    int64_t *size;
    // Stride of each dimension. Read the article mentioned above about numpy if you don't understand what it is for.
    int64_t *stride;
    // Dimension of tensor, e.g., for matrix, it's 2
    int nDimension;

    // Note: storage->size may be greater than the recorded size
    // of a tensor
    THStorage *storage;
    ptrdiff_t storageOffset;
    int refcount;

    char flag;

} THTensor;


/**** access methods ****/
TH_API THStorage* THTensor_(storage)(const THTensor *self);
TH_API ptrdiff_t THTensor_(storageOffset)(const THTensor *self);
TH_API int THTensor_(nDimension)(const THTensor *self);
TH_API int64_t THTensor_(size)(const THTensor *self, int dim);
TH_API int64_t THTensor_(stride)(const THTensor *self, int dim);
// Return a storage with the size of the current tensor as data. Notice `Long` is `int64_t`.
TH_API THLongStorage *THTensor_(newSizeOf)(THTensor *self);
// Return a storage with the stride of the current tensor as data.
TH_API THLongStorage *THTensor_(newStrideOf)(THTensor *self);
TH_API real *THTensor_(data)(const THTensor *self);

TH_API void THTensor_(setFlag)(THTensor *self, const char flag);
TH_API void THTensor_(clearFlag)(THTensor *self, const char flag);


/**** creation methods ****/
// New empty tensor.
TH_API THTensor *THTensor_(new)(void);
// Tensor pointing to the same storage with same view.
TH_API THTensor *THTensor_(newWithTensor)(THTensor *tensor);

/* stride might be NULL */
// If `stride` is null, it will be inferred.
// Create a new tensor pointing to the given storage, see `THTensor_(setStorageNd)`.
TH_API THTensor *THTensor_(newWithStorage)(THStorage *storage_, ptrdiff_t storageOffset_, THLongStorage *size_, THLongStorage *stride_);

// Some shorthand methods.
TH_API THTensor *THTensor_(newWithStorage1d)(THStorage *storage_, ptrdiff_t storageOffset_,
                                int64_t size0_, int64_t stride0_);
TH_API THTensor *THTensor_(newWithStorage2d)(THStorage *storage_, ptrdiff_t storageOffset_,
                                int64_t size0_, int64_t stride0_,
                                int64_t size1_, int64_t stride1_);
TH_API THTensor *THTensor_(newWithStorage3d)(THStorage *storage_, ptrdiff_t storageOffset_,
                                int64_t size0_, int64_t stride0_,
                                int64_t size1_, int64_t stride1_,
                                int64_t size2_, int64_t stride2_);
TH_API THTensor *THTensor_(newWithStorage4d)(THStorage *storage_, ptrdiff_t storageOffset_,
                                int64_t size0_, int64_t stride0_,
                                int64_t size1_, int64_t stride1_,
                                int64_t size2_, int64_t stride2_,
                                int64_t size3_, int64_t stride3_);

/* stride might be NULL */
TH_API THTensor *THTensor_(newWithSize)(THLongStorage *size_, THLongStorage *stride_);
TH_API THTensor *THTensor_(newWithSize1d)(int64_t size0_);
TH_API THTensor *THTensor_(newWithSize2d)(int64_t size0_, int64_t size1_);
TH_API THTensor *THTensor_(newWithSize3d)(int64_t size0_, int64_t size1_, int64_t size2_);
TH_API THTensor *THTensor_(newWithSize4d)(int64_t size0_, int64_t size1_, int64_t size2_, int64_t size3_);

// Copy tensor.
TH_API THTensor *THTensor_(newClone)(THTensor *self);

// Return a contiguous tensor, create a new one if necessary, see `THTensor_(isContiguous)`.
TH_API THTensor *THTensor_(newContiguous)(THTensor *tensor);

// See THTensor_(select).
TH_API THTensor *THTensor_(newSelect)(THTensor *tensor, int dimension_, int64_t sliceIndex_);

// See THTensor_(narrow).
TH_API THTensor *THTensor_(newNarrow)(THTensor *tensor, int dimension_, int64_t firstIndex_, int64_t size_);

TH_API THTensor *THTensor_(newTranspose)(THTensor *tensor, int dimension1_, int dimension2_);

// See THTensor_(unfold).
TH_API THTensor *THTensor_(newUnfold)(THTensor *tensor, int dimension_, int64_t size_, int64_t step_);

// Tensor with a different view of the same storage. This is a little tricky if the `stride` is not trivial.
TH_API THTensor *THTensor_(newView)(THTensor *tensor, THLongStorage *size);

// See `THTensor_(expand)`.
TH_API THTensor *THTensor_(newExpand)(THTensor *tensor, THLongStorage *size);

// Think of it as repeating the tensor without reallocating memory. Keep in mind that the stride can be 0.
// This is very important and widely used as `broadcast` in tensor manipulation.
TH_API void THTensor_(expand)(THTensor *r, THTensor *tensor, THLongStorage *size);
// Seems to broadcast several tensors to a common shape, but I can't find a usage.
TH_API void THTensor_(expandNd)(THTensor **rets, THTensor **ops, int count);

TH_API void THTensor_(resize)(THTensor *tensor, THLongStorage *size, THLongStorage *stride);
TH_API void THTensor_(resizeAs)(THTensor *tensor, THTensor *src);

// Resize the tensor, resize the storage if necessary.
TH_API void THTensor_(resizeNd)(THTensor *tensor, int nDimension, int64_t *size, int64_t *stride);
TH_API void THTensor_(resize1d)(THTensor *tensor, int64_t size0_);
TH_API void THTensor_(resize2d)(THTensor *tensor, int64_t size0_, int64_t size1_);
TH_API void THTensor_(resize3d)(THTensor *tensor, int64_t size0_, int64_t size1_, int64_t size2_);
TH_API void THTensor_(resize4d)(THTensor *tensor, int64_t size0_, int64_t size1_, int64_t size2_, int64_t size3_);
TH_API void THTensor_(resize5d)(THTensor *tensor, int64_t size0_, int64_t size1_, int64_t size2_, int64_t size3_, int64_t size4_);

// Just `=`, sharing storage.
TH_API void THTensor_(set)(THTensor *self, THTensor *src);

TH_API void THTensor_(setStorage)(THTensor *self, THStorage *storage_, ptrdiff_t storageOffset_, THLongStorage *size_, THLongStorage *stride_);

// Change the internal storage, resize accordingly. Resize the storage if necessary. See `THTensor_(resizeNd)`.
TH_API void THTensor_(setStorageNd)(THTensor *self, THStorage *storage_, ptrdiff_t storageOffset_, int nDimension, int64_t *size, int64_t *stride);

TH_API void THTensor_(setStorage1d)(THTensor *self, THStorage *storage_, ptrdiff_t storageOffset_,
                                    int64_t size0_, int64_t stride0_);
TH_API void THTensor_(setStorage2d)(THTensor *self, THStorage *storage_, ptrdiff_t storageOffset_,
                                    int64_t size0_, int64_t stride0_,
                                    int64_t size1_, int64_t stride1_);
TH_API void THTensor_(setStorage3d)(THTensor *self, THStorage *storage_, ptrdiff_t storageOffset_,
                                    int64_t size0_, int64_t stride0_,
                                    int64_t size1_, int64_t stride1_,
                                    int64_t size2_, int64_t stride2_);
TH_API void THTensor_(setStorage4d)(THTensor *self, THStorage *storage_, ptrdiff_t storageOffset_,
                                    int64_t size0_, int64_t stride0_,
                                    int64_t size1_, int64_t stride1_,
                                    int64_t size2_, int64_t stride2_,
                                    int64_t size3_, int64_t stride3_);

// New view with the selected dimension starting with `firstIndex_` with length `size_`. You'd better try to visualize the memory layout yourself to understand it.
TH_API void THTensor_(narrow)(THTensor *self, THTensor *src, int dimension_, int64_t firstIndex_, int64_t size_);

// Select the slice at `sliceIndex_` along the specified dimension.
TH_API void THTensor_(select)(THTensor *self, THTensor *src, int dimension_, int64_t sliceIndex_);

// !!!!!!!!! See how powerful a view can be with only as simple as `size` and `stride`.
TH_API void THTensor_(transpose)(THTensor *self, THTensor *src, int dimension1_, int dimension2_);

// Though more advanced uses are possible, most of the time it is used to split one dimension into two with no overlap. Notice there is no `fold`; can you guess why?
TH_API void THTensor_(unfold)(THTensor *self, THTensor *src, int dimension_, int64_t size_, int64_t step_);

// Remove singleton dimension.
TH_API void THTensor_(squeeze)(THTensor *self, THTensor *src);
// Remove specified dimension if it is singleton.
TH_API void THTensor_(squeeze1d)(THTensor *self, THTensor *src, int dimension_);
// Add a singleton dimension.
TH_API void THTensor_(unsqueeze1d)(THTensor *self, THTensor *src, int dimension_);

// If the current view is a contiguous (naive) view of the storage.
TH_API int THTensor_(isContiguous)(const THTensor *self);

TH_API int THTensor_(isSameSizeAs)(const THTensor *self, const THTensor *src);

// Whether the two tensors point to the same storage with the same view. Think of `==`.
TH_API int THTensor_(isSetTo)(const THTensor *self, const THTensor *src);

TH_API int THTensor_(isSize)(const THTensor *self, const THLongStorage *dims);
TH_API ptrdiff_t THTensor_(nElement)(const THTensor *self);

TH_API void THTensor_(retain)(THTensor *self);
TH_API void THTensor_(free)(THTensor *self);

// Think of `std::move`.
TH_API void THTensor_(freeCopyTo)(THTensor *self, THTensor *dst);

/* Slow access methods [check everything] */
TH_API void THTensor_(set1d)(THTensor *tensor, int64_t x0, real value);
TH_API void THTensor_(set2d)(THTensor *tensor, int64_t x0, int64_t x1, real value);
TH_API void THTensor_(set3d)(THTensor *tensor, int64_t x0, int64_t x1, int64_t x2, real value);
TH_API void THTensor_(set4d)(THTensor *tensor, int64_t x0, int64_t x1, int64_t x2, int64_t x3, real value);

TH_API real THTensor_(get1d)(const THTensor *tensor, int64_t x0);
TH_API real THTensor_(get2d)(const THTensor *tensor, int64_t x0, int64_t x1);
TH_API real THTensor_(get3d)(const THTensor *tensor, int64_t x0, int64_t x1, int64_t x2);
TH_API real THTensor_(get4d)(const THTensor *tensor, int64_t x0, int64_t x1, int64_t x2, int64_t x3);

/* Debug methods */
TH_API THDescBuff THTensor_(desc)(const THTensor *tensor);
TH_API THDescBuff THTensor_(sizeDesc)(const THTensor *tensor);

#endif
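If size and stride still feel abstract, here is a small standalone sketch of a strided view. The field names follow THTensor, but View and the view_* functions are hypothetical, the element type is fixed to double, and nothing is reference counted. It only shows how indexing, transpose, narrow, and expand fall out of a little arithmetic on size, stride, and storageOffset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DIMS 4   /* arbitrary small limit for this sketch */

/* Hypothetical miniature of a strided view; field names follow THTensor. */
typedef struct {
    const double *storage;     /* shared element buffer, not owned here */
    ptrdiff_t storageOffset;   /* index of the view's first element */
    int nDimension;
    int64_t size[MAX_DIMS];
    int64_t stride[MAX_DIMS];  /* elements to skip per step along each dimension */
} View;

/* Element lookup: storageOffset + sum over dimensions of index * stride. */
static double view_get(const View *v, const int64_t *index) {
    ptrdiff_t pos = v->storageOffset;
    for (int d = 0; d < v->nDimension; ++d) {
        assert(index[d] >= 0 && index[d] < v->size[d]);
        pos += (ptrdiff_t)(index[d] * v->stride[d]);
    }
    return v->storage[pos];
}

/* Row-major view over a buffer: the last dimension has stride 1. */
static View view_contiguous(const double *storage, int nDimension, const int64_t *size) {
    assert(nDimension >= 1 && nDimension <= MAX_DIMS);
    View v = { storage, 0, nDimension, {0}, {0} };
    int64_t step = 1;
    for (int d = nDimension - 1; d >= 0; --d) {
        v.size[d] = size[d];
        v.stride[d] = step;
        step *= size[d];
    }
    return v;
}

/* Transpose copies no data: it swaps two size/stride pairs. */
static View view_transpose(View v, int d1, int d2) {
    int64_t t;
    t = v.size[d1]; v.size[d1] = v.size[d2]; v.size[d2] = t;
    t = v.stride[d1]; v.stride[d1] = v.stride[d2]; v.stride[d2] = t;
    return v;
}

/* Narrow keeps `length` entries of dimension d, starting at `first`. */
static View view_narrow(View v, int d, int64_t first, int64_t length) {
    assert(first >= 0 && length >= 0 && first + length <= v.size[d]);
    v.storageOffset += (ptrdiff_t)(first * v.stride[d]);
    v.size[d] = length;
    return v;
}

/* Expand a singleton dimension to `size` entries by giving it stride 0. */
static View view_expand_dim(View v, int d, int64_t size) {
    assert(v.size[d] == 1);
    v.size[d] = size;
    v.stride[d] = 0;
    return v;
}

/* True when the strides are exactly the row-major ones for these sizes. */
static int view_is_contiguous(const View *v) {
    int64_t step = 1;
    for (int d = v->nDimension - 1; d >= 0; --d) {
        if (v->size[d] != 1 && v->stride[d] != step) return 0;
        step *= v->size[d];
    }
    return 1;
}
```

For a 2x3 row-major buffer the strides are (3, 1). Transposing gives sizes (3, 2) with strides (1, 3), which is no longer contiguous, and expanding a 1x3 row to 4x3 just sets the first stride to 0 so that every "row" reads the same three elements.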

I want GPU!

I had meant to finish everything in this post, but it got too long.

Next time, we'll continue with an analysis of THC.

¹ https://www.labri.fr/perso/nrougier/from-python-to-numpy/
² https://petewarden.com/2015/05/23/why-are-eight-bits-enough-for-deep-neural-networks/
³ https://devblogs.nvidia.com/mixed-precision-programming-cuda-8/