Tuesday, August 17, 2010

Using the nVIDIA FERMI for fast classification

The idea is to use the most advanced GPU of nVIDIA to increase the speed of the classifier.

GPUs are the most widely available and most powerful multiprocessors found in microcomputers. They are specifically designed for graphic applications, but the FERMI architecture of nVIDIA is more flexible than the previous G80 / GT200 architecture. In fact, this new architecture has been designed to handle more easily and efficiently the massively parallel computations needed in imaging, flow simulation and financial prediction. This new architecture looks very interesting for the linear classifier used in TexLexAn, and should (theoretically) speed up the classifier by several orders of magnitude.

Before starting this project, there are several points to check: the specialized languages for parallel computing, how to redesign the classifier routine to take advantage of a GPU, the debugging tools, and the price of graphic cards based on the GTX 400 series.

1 - The specialized languages:

CUDA [1] and OpenCL [2] are both available for Linux, with the particularity that CUDA is specific to nVIDIA while OpenCL is cross-platform and able to work with any GPU, multi-core CPU or DSP.

It seems that CUDA is better documented than OpenCL, but both languages look pretty close, so it should not be too painful to switch from one to the other.
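To give an idea of the style, here is a minimal CUDA kernel that scales an array of weights (illustrative only; the OpenCL equivalent is nearly identical, which is why switching should be manageable):

    /* Minimal CUDA kernel: scale an array of weights by a constant. */
    __global__ void scale(float *w, float k, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* global thread index */
        if (i < n)
            w[i] *= k;                                  /* one thread per element */
    }

    /* host-side launch, 256 threads per block:
       scale<<<(n + 255) / 256, 256>>>(d_w, 2.0f, n); */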

2 - Rewriting the classifier routines:
2.1 - Model to compute:

yj = aj + sum_i( bj.i * xj.i )

y1 = a1 + b1.1*x1.1 + b1.2*x1.2 + ... + b1.i*x1.i
y2 = a2 + b2.1*x2.1 + b2.2*x2.2 + ... + b2.i*x2.i
...
yj = aj + bj.1*xj.1 + bj.2*xj.2 + ... + bj.i*xj.i
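In plain C, computing this model for one class takes only a few lines (a sketch with illustrative names, not TexLexAn's actual code):

    /* y_j = a_j + sum_i( b_j.i * x_j.i ) for one class */
    float score_class(float a, const float *b, const float *x, int n)
    {
        float y = a;                  /* class constant aj */
        for (int i = 0; i < n; i++)
            y += b[i] * x[i];         /* weight bj.i times frequency xj.i */
        return y;
    }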

2.1.2 - Dictionary structure:

class_label1:cst1/w1.1\term1.1/w1.2\term1.2..../w1.i\term1.i
class_label2:cst2/w2.1\term2.1/w2.2\term2.2..../w2.i\term2.i
...
class_labelj:cstj/wj.1\termj.1/wj.2\termj.2..../wj.i\termj.i

term: a word or a sequence of words (an n-gram)
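A hedged sketch of how one line of this format could be parsed (the separators ':', '/' and '\' follow the description above; all names are illustrative):

    #include <stdio.h>
    #include <string.h>

    /* Parse one dictionary line: class_label:cst/w1\term1/w2\term2... */
    void parse_line(char *line)
    {
        char *label = strtok(line, ":");
        char *cst   = strtok(NULL, "/");
        printf("class %s, constant %s\n", label, cst);

        char *field;
        while ((field = strtok(NULL, "/")) != NULL) {
            char *sep = strchr(field, '\\');   /* split "weight\term" */
            if (sep == NULL) continue;
            *sep = '\0';
            printf("  weight %s, term %s\n", field, sep + 1);
        }
    }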

2.1.3 - Current routine: 

For the current version of TexLexAn, terms are searched inside each line of the dictionaries with the simple linear search function strstr(). It is of course a very inefficient solution, but it offers 3 advantages:
  • A small memory requirement, because only one line of the dictionary (one class) is loaded in memory.
  • The possibility to view and edit the dictionaries with a simple text editor. 
  • A simple and robust algorithm, since the dictionary is processed one line (one class) at a time. 
A more efficient structure that tries to preserve the advantages listed above has been discussed in my posts of February 2010 ( http://texlexan.blogspot.com/2010/02/speedup-strings-matching.html ). 
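For reference, the current approach boils down to something like this (a simplified sketch, not the actual TexLexAn source):

    #include <stdlib.h>
    #include <string.h>

    /* Score one dictionary line (one class) against the document's terms.
       line: "label:cst/w1\term1/w2\term2...", freq[i]: frequency of terms[i]. */
    float score_line(const char *line, float cst,
                     const char *const *terms, const float *freq, int n)
    {
        float y = cst;
        for (int i = 0; i < n; i++) {
            const char *hit = strstr(line, terms[i]);  /* linear scan of the line */
            if (hit != NULL) {
                while (hit > line && *hit != '/')      /* back up to the weight */
                    hit--;
                y += atof(hit + 1) * freq[i];          /* atof stops at '\' */
            }
        }
        return y;
    }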

2.1.4 - New routine:

The idea is to keep the current file structure of the dictionaries for its editing simplicity and its relative robustness. The structure of the data will be modified during loading into memory: the whole dictionary will be loaded and converted into a new structure that is more efficient for a parallel search. 
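One possible in-memory layout (an assumption on my part, not a final design) is a flat structure of arrays with fixed-width term slots, which maps naturally onto GPU memory and keeps accesses coalesced:

    /* Flat, GPU-friendly dictionary: one entry per (class, term) pair.
       All names and sizes are illustrative. */
    #define MAX_TERM_LEN 32

    typedef struct {
        int    nclasses;
        int    nterms;      /* total number of (class, term) pairs       */
        float *cst;         /* per-class constant aj, [nclasses]         */
        int   *class_id;    /* class index of each pair, [nterms]        */
        float *weight;      /* bj.i, [nterms]                            */
        char  *term;        /* fixed-width slots, nterms * MAX_TERM_LEN  */
    } dict_t;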

2.1.4.1 - Rethinking the structure of the data.

2.1.4.1.1 - The ideal processor architecture:

The ideal architecture of the processor reflects the model to compute [section 2.1]. Each cell stores one term and its weight, the code to compare the stored term with the searched term, and the code to compute bj.i*xj.i. Each row of cells contains the terms and the weights of one class. The results of the cells in a row are added.

Each cell runs the same code; the differences between cells are limited to the stored terms and weights. Cells are strictly independent. These particularities will help to design a simplified "many-core" processor architecture. The "raw" specification could be (a CUDA sketch follows the list):

  • Cells do not have to communicate with each other.
  • Cells work on the same data (the searched term and its frequency).
  • Cells run their program independently.
  • All cells must finish before another term can be processed.
  • Result is valid only when all cells have finished.
  • Only terms and weights change when a new dictionary is loaded.
  • Cells run the same program (string comparison and 64-bit multiplication).
  • Program should never change.
  • 8 × 64-bit registers (weight, frequency, result, hash of the stored term, hash of the searched term, 3 intermediates).
  • 2048 bits of local memory for the stored term and the searched term.
  • 4096 bits of local memory for the Levenshtein function. (*)
  • Levenshtein implementation (microcode).
  • The results of each row of cells are compared; the highest is accumulated with the previous results.
Cells will not run the instructions of their code strictly in the same order (different paths) depending on the terms compared, so the processor should have a MIMD architecture [3].
(*) memory-optimized version, O(m) space.
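A hedged CUDA sketch of one "cell" as a thread; a 64-bit hash comparison stands in for the full string / Levenshtein comparison, and all names are assumptions:

    /* One thread = one cell: compare the stored term with the searched term
       and accumulate weight * frequency into a per-block partial sum (one
       row of cells simplified as one block). partial[] is assumed
       zero-initialized before the launch. */
    __global__ void classify_term(const unsigned long long *stored_hash,
                                  const float *weight,
                                  unsigned long long searched_hash,
                                  float freq, float *partial, int nterms)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nterms) return;

        if (stored_hash[i] == searched_hash)
            atomicAdd(&partial[blockIdx.x], weight[i] * freq);  /* float atomicAdd
                                                                   needs Fermi (sm_20) */
    }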

2.1.4.1.2 - The nVIDIA FERMI architecture:

Unfortunately, the ideal processor described above does not exist; its closest equivalent is the latest generation of GPUs. It is important to understand the architecture of the GPU well and to take care of several limitations, such as the shared memory size and the memory bandwidth.

Features:


- Total of 16 streaming multiprocessors (SM).
- Total of 512 CUDA cores  (16 SMs of 32 CUDA cores).
- A total of 32 warps (each a group of 32 parallel threads) can be scheduled concurrently.
- Floating-point arithmetic: 1344 GFlops in single precision and 672 GFlops in double precision (GTX 480).
- 2 warp schedulers and 2 instruction dispatch units in each SM.
- 32 CUDA cores per SM.
- 1 ALU and 1 FPU in each CUDA core.
- 16 KB / 48 KB configurable shared memory per SM.
- Memory bandwidth: 177.4 GB/s (GTX 480).
- PCI-E 2.0 bandwidth: 500 MB/s per lane.

The ALU supports 32-bit integers.
The FPU supports single and double precision floats.
Compare, bitwise, min, max and type conversion operations are available.
Frequency: 1400 MHz (GTX 480).
2.6 GFlops single precision per CUDA core (GTX 480).
1.3 GFlops double precision per CUDA core (GTX 480).
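These figures can be checked on the actual card with the CUDA runtime API; a minimal query program:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, 0);               /* device 0 */
        printf("%s\n", p.name);
        printf("SMs:              %d\n", p.multiProcessorCount);
        printf("clock:            %d kHz\n", p.clockRate);
        printf("shared mem/block: %u bytes\n", (unsigned)p.sharedMemPerBlock);
        return 0;
    }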


Quick analysis:


- The 16 SMs allow 16 different codes to run (= 16 different kernels), and in each SM 2 warps can run the same code concurrently along 2 different paths. So only 32 independent instruction streams can run concurrently; it is a severe limitation, as we know that each string-matching function has to run its code independently (same code following different paths).

- VRAM bandwidth: 177.4 GB/s => 44.35 GWords/s at 32 bits or 22.175 GWords/s at 64 bits, so the bandwidth is about 30 times (672/22.175) lower than the floating-point capacity. At this point we know that the VRAM bandwidth is the second severe limitation.

- PCI-E bandwidth: 500 MB/s per lane is so low compared to the VRAM bandwidth that data transfers between the central RAM or CPU and the graphic card must be strictly limited.

We know that we will have to solve 2 major problems:
 - The relatively small number of parallel threads possible.
 - The memory bandwidth.
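
To respect the PCI-E limitation noted above, the dictionary should be uploaded once and kept resident in VRAM; a hedged sketch of the upload step (names are illustrative):

    #include <cuda_runtime.h>

    static float *d_weight;
    static unsigned long long *d_hash;

    /* Upload the converted dictionary once; afterwards, only the searched
       terms and their frequencies cross the PCI-E bus for each document. */
    void upload_dictionary(const float *h_weight,
                           const unsigned long long *h_hash, int nterms)
    {
        cudaMalloc((void **)&d_weight, nterms * sizeof(float));
        cudaMalloc((void **)&d_hash,   nterms * sizeof(unsigned long long));
        cudaMemcpy(d_weight, h_weight, nterms * sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaMemcpy(d_hash, h_hash, nterms * sizeof(unsigned long long),
                   cudaMemcpyHostToDevice);
    }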

In the next post, we will discuss solutions to speed up the classifier with the GPU.

[1] http://www.linux-mag.com/id/7791/1/
[2] http://www.khronos.org/opencl/
[3] http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html
[4] nVIDIA Fermi Whitepaper
[5] nVIDIA CUDA Programming Guide, version 3
[6] P. N. Glaskowsky, "NVIDIA's Fermi: The First Complete GPU Computing Architecture", Sept. 2009
