Saturday, August 21, 2010

Parallel classifier with a GPU

The objective is to speed up the classifier by using a GPU to do the string matching and the computation of the probabilities. But there are two big issues that I listed in my previous post:

- The number of parallel threads is limited to 32.
- The memory bandwidth is about 30 times lower than the computation capability.

The straightforward solution is to use the GPU for what it is essentially designed for: computing a large array of independent floating-point values with exactly the same code, the same instruction executed at the same time, on many simplified processors. In this way, we will be able to use all 512 cores.

The simplest way is to transform the terms (strings of alphanumeric characters) into values. First step: a hash routine using a parallel algorithm [1] performs this transformation. Next step: the hash values of the terms in the dictionary are compared with the hash value of the searched term. Final step: all the products wi*xi are computed. Both comparisons and multiplications are performed in parallel.

1 - First step

Example of an MD6 routine, excerpt from Faster file matching using GPGPUs [1]:

dim3 grid(num_blocks,1,1);
dim3 threads(16,1,1);
md6_compress<<<grid,threads>>>(s1,d_idata,N*(leaves/4),num_blocks);
cudaThreadSynchronize();


int tx = threadIdx.x % num_threads;  /* index of the thread within the group */
int ty = threadIdx.y;
step = tx + ty;
index = 89;
loop_body(A,rsArgs[tx],lsArgs[tx],step,S,index);
if ( tx == 15 ) /* last thread updates the round variables */
{
    N += 16;
    S = ((S << 1) ^ (S >> (W-1)) ^ (S & Smask));
}
__syncthreads();


/* MD6 compression routine */
__device__
void loop_body(md6_word* A, int rs, int ls, int step, unsigned long long int x, int i)
{
    x ^= A[i+step-t5];
    x ^= A[i+step-t0];
    x ^= ( A[i+step-t1] & A[i+step-t2] );
    x ^= ( A[i+step-t3] & A[i+step-t4] );
    x ^= ( x >> rs );
    A[i+step] = x ^ ( x << ls );
}
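
Steps two and three (the hash comparison and the products wi*xi) are not in the excerpt, but a minimal sketch of what such a kernel could look like is given below. All the names (classify_kernel, d_hashes, d_weights, d_partial) are mine, and the sketch assumes the dictionary terms were already hashed to 64-bit values on the device; each thread compares one stored hash with the searched hash, and each block reduces its products wi*xi into a partial sum:

/* Sketch: one thread per dictionary term; matching terms contribute
   weight * frequency to a per-block partial sum.                     */
__global__ void classify_kernel(const unsigned long long *d_hashes, /* hashed dictionary terms    */
                                const float *d_weights,             /* weights wi                 */
                                unsigned long long searched_hash,   /* hash of the searched term  */
                                float x,                            /* frequency xi of the term   */
                                float *d_partial,                   /* one partial sum per block  */
                                int n)                              /* number of terms per class  */
{
    extern __shared__ float sums[];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    /* comparison and multiplication, done in parallel by every thread */
    sums[threadIdx.x] = (i < n && d_hashes[i] == searched_hash) ? d_weights[i] * x : 0.0f;
    __syncthreads();

    /* tree reduction inside the block (blockDim.x must be a power of 2) */
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sums[threadIdx.x] += sums[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        d_partial[blockIdx.x] = sums[0];
}

A launch such as classify_kernel<<<num_blocks, 256, 256*sizeof(float)>>>(...) followed by a final sum of d_partial would give the contribution of the searched term to the score of one class.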

.... To be continued ....



[1] Faster file matching using GPGPUs, Deephan Mohan and John Cavazos, Department of Computer and Information Sciences, University of Delaware, June 27, 2010

Tuesday, August 17, 2010

Using the nVIDIA FERMI for a fast classification

The idea is to use nVIDIA's most advanced GPU to increase the speed of the classifier.

GPUs are the most widely available and most powerful multiprocessors for microcomputers. They are specifically designed for graphics applications, but nVIDIA's FERMI architecture is more flexible than the previous G80 / GT200 architectures. In fact, this new architecture was designed to handle more efficiently and easily the massively parallel computations needed in imaging, flow simulation and financial prediction. This new architecture looks very interesting for the linear classifier used in TexLexAn, and should (theoretically) speed up the classifier by several orders of magnitude.

Before starting this project, there are several points to check, such as the specialized languages for parallel computing, how to redesign the classifier routine to take advantage of a GPU, the debugging tools, and the price of a graphics card based on the GTX 400 series.

1 - The specialized languages:

CUDA [1] and OpenCL [2] are both available for Linux, with the particularity that CUDA is specific to nVIDIA while OpenCL is cross-platform and able to work with any GPU, multi-core CPU or DSP.

It seems that CUDA is better documented than OpenCL. But both languages look pretty close, so it will not be too painful to switch from one to the other.

2 - Rewriting the classifier routines:
2.1 - Model to compute:

yj = aj + sum_i( bj.i * xj.i )

y1 = a1 + b1.1*x1.1 + b1.2*x1.2 + ... + b1.i*x1.i
y2 = a2 + b2.1*x2.1 + b2.2*x2.2 + ... + b2.i*x2.i
...
yj = aj + bj.1*xj.1 + bj.2*xj.2 + ... + bj.i*xj.i
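
On the CPU, this model is just j dot products. A trivial serial version helps to see what has to be parallelized (my notation; the array names are illustrative, not the TexLexAn source):

/* Serial reference: score[j] = a[j] + sum over i of b[j][i] * x[j][i] */
void classify_serial(int nclasses, const int *nterms,
                     const double *a, double **b, double **x,
                     double *score)
{
    for (int j = 0; j < nclasses; j++) {
        double y = a[j];                    /* class constant aj       */
        for (int i = 0; i < nterms[j]; i++)
            y += b[j][i] * x[j][i];         /* weight * term frequency */
        score[j] = y;                       /* yj = score of class j   */
    }
}

Every product b[j][i]*x[j][i] is independent, which is exactly what a GPU is good at.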

2.1.2 - Dictionaries structure:

class_label1:cst1/w1.1\term1.1/w1.2\term1.2..../w1.i\term1.i
class_label2:cst2/w2.1\term2.1/w2.2\term2.2..../w2.i\term2.i
...
class_labelj:cstj/wj.1\termj.1/wj.2\termj.2..../wj.i\termj.i

term: word or sequence of words (n-grams) 

2.1.3 - Current routine: 

In the current version of TexLexAn, terms are searched for inside each line of the dictionaries with the simple linear search function strstr(). It is of course a very inefficient solution, but it offers 3 advantages:
  • A small memory requirement, because only one line of the dictionary (one class) is loaded in memory at a time.
  • The possibility to view and edit the dictionaries with a simple text editor.
  • A simple and robust algorithm, because only one line of the dictionary (one class) is processed at once.
A more efficient structure that tries to preserve the advantages listed above was discussed in my posts of February 2010 ( http://texlexan.blogspot.com/2010/02/speedup-strings-matching.html ).
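
Reduced to its core, the current lookup is close to the following sketch (simplified; lookup_weight is my naming, and a real implementation must also check that the matched term ends at the next '/' separator):

/* Sketch: scan one dictionary line (one class) for a term and return its
   weight. Line format, per 2.1.2: label:cst/w1\term1/w2\term2...        */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

double lookup_weight(const char *line, const char *term)
{
    char pattern[256];
    snprintf(pattern, sizeof(pattern), "\\%s", term); /* terms follow a '\' */
    const char *p = strstr(line, pattern);            /* linear search      */
    if (p == NULL)
        return 0.0;                 /* term absent from this class          */
    const char *w = p;
    while (w > line && *w != '/')   /* walk back to the '/' before the weight */
        w--;
    return atof(w + 1);             /* parse the weight, stops at the '\'   */
}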

2.1.4 - New routine:

The idea is to keep the current file structure of the dictionaries for its ease of editing and its relative robustness. The structure of the data will be modified during the loading in memory: the whole dictionary will be loaded and converted into a new structure more efficient for a parallel search.
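
A possible loader is sketched below (load_class and fnv1a64 are my placeholders; FNV-1a stands in for whatever hash is finally chosen, for example the parallel MD6 routine discussed in the next post). It converts one dictionary line into flat arrays of hashes and weights that can be copied to the GPU in a single transfer:

/* Sketch: parse one dictionary line into flat arrays (hashes + weights).
   Parsing of the class label and constant is omitted for brevity.        */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static uint64_t fnv1a64(const char *s)       /* placeholder 64-bit hash */
{
    uint64_t h = 14695981039346656037ULL;    /* FNV offset basis */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;               /* FNV prime */
    }
    return h;
}

int load_class(char *line, uint64_t *hashes, double *weights, int max)
{
    int n = 0;
    char *tok = strtok(line, "/");           /* skips "label:cst"      */
    while ((tok = strtok(NULL, "/")) != NULL && n < max) {
        char *sep = strchr(tok, '\\');       /* token is "weight\term" */
        if (sep == NULL)
            continue;
        *sep = '\0';
        weights[n] = atof(tok);
        hashes[n]  = fnv1a64(sep + 1);
        n++;
    }
    return n;                                /* number of terms loaded */
}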

2.1.4.1 - Rethinking the structure of the data.

2.1.4.1.1 - The ideal processor architecture:

The ideal architecture of the processor reflects the model to compute [see 2.1]. Each cell stores one term and its weight, the code to compare the stored term with the searched term, and the code to compute bj.i*xj.i. Each row of cells contains the terms and the weights of one class. The results of the cells in a row are added.

Each cell runs the same code; the differences between cells are limited to the stored terms and weights. Cells are strictly independent. These particularities will help to design a simplified "many-core" processor architecture. The "raw" specification could be:

  • Cells do not have to communicate with each other.
  • Cells work on the same data (the searched term and its frequency).
  • Cells run their program independently.
  • All cells must finish before another term can be processed.
  • The result is valid only when all cells have finished.
  • Only terms and weights change when a new dictionary is loaded.
  • Cells run the same program (string comparison and 64-bit multiplication).
  • The program should never change.
  • 8 x 64-bit registers (weight, frequency, result, hash of the stored term, hash of the searched term, 3 intermediates).
  • 2048 bits of local memory for the stored term and the searched term.
  • 4096 bits of local memory for the Levenshtein function. (*)
  • Levenshtein implementation (microcode).
  • The results of each row of cells are compared; the highest is accumulated with the previous results.
Cells will not run the instructions of the code strictly in the same order (different paths) depending on the terms compared, so the processor should have a MIMD architecture [3].
(*) memory-optimized version, O(m).
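
For reference, the memory-optimized variant referred to in (*) keeps only two rows of the classic dynamic-programming matrix instead of the full O(n.m) table. A minimal sketch (MAXTERM is an arbitrary bound of mine; both strings are assumed shorter than it):

/* Sketch: two-row Levenshtein distance, O(m) memory instead of O(n*m). */
#include <string.h>

#define MAXTERM 256

int levenshtein(const char *a, const char *b)
{
    int n = strlen(a), m = strlen(b);
    int prev[MAXTERM + 1], curr[MAXTERM + 1];

    for (int j = 0; j <= m; j++)
        prev[j] = j;                       /* distance from the empty prefix */
    for (int i = 1; i <= n; i++) {
        curr[0] = i;
        for (int j = 1; j <= m; j++) {
            int cost = (a[i-1] == b[j-1]) ? 0 : 1;
            int del = prev[j] + 1;         /* deletion     */
            int ins = curr[j-1] + 1;       /* insertion    */
            int sub = prev[j-1] + cost;    /* substitution */
            curr[j] = del < ins ? del : ins;
            if (sub < curr[j]) curr[j] = sub;
        }
        memcpy(prev, curr, (m + 1) * sizeof(int));
    }
    return prev[m];
}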

2.1.4.1.2 - The nVIDIA FERMI architecture:

Unfortunately, the ideal processor described above does not exist; its closest equivalent is the latest generation of GPUs. It is important to understand the architecture of the GPU well and to take care of several limitations, such as the shared memory and the memory bandwidth.

Features:


- Total of 16 streaming multiprocessors (SM).
- Total of 512 CUDA cores (16 SMs of 32 CUDA cores).
- Total of 32 warps (groups of 32 parallel threads) can run concurrently (2 per SM).
- Floating-point arithmetic: 1344 GFlops in single precision and 672 GFlops in double precision (GTX 480).
- 2 warp schedulers and instruction dispatch units in each SM.
- 32 CUDA cores per SM.
- 1 ALU and 1 FPU in each CUDA core.
- 16 KB / 48 KB configurable shared memory per SM.
- Memory bandwidth: 177.4 GB/s (GTX 480).
- PCI-E 2.0 bandwidth: 500 MB/s per lane (about 8 GB/s for a x16 slot).

The ALU supports 32-bit integers.
The FPU supports single and double precision floats.
Compare, bitwise, min, max and type conversion operations are available.
Frequency: 1400 MHz (GTX 480).
2.6 GFlops single precision per CUDA core (GTX 480).
1.3 GFlops double precision per CUDA core (GTX 480).


Quick analysis:


- The 16 SMs allow running 16 different codes (= 16 different kernels), and in each SM 2 warps can run the same code concurrently along 2 different paths. So only 32 parallel threads of execution can follow their own path concurrently; it is a severe limitation, as we know that each string-matching function has to run its code independently (same code following different paths).

- VRAM bandwidth: 177.4 GB/s => 44.35 Gwords/s (32-bit) or 22.175 Gwords/s (64-bit), so the bandwidth is about 30 times (672 / 22.175) lower than the floating-point capacity. At this point we know that the VRAM bandwidth is the second severe limitation.

- PCI-E bandwidth: even at about 8 GB/s for a x16 slot, it is so low compared to the VRAM bandwidth that data transfers between the central RAM or CPU and the graphics card must be strictly limited.

We know that we will have to solve 2 major problems:
 - The relatively small number of parallel threads possible.
 - The memory bandwidth.

In the next post, we will discuss solutions to speed up the classifier with the GPU.

[1] http://www.linux-mag.com/id/7791/1/
[2] http://www.khronos.org/opencl/
[3] http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html
[4] nVIDIA Fermi Whitepaper
[5] nVIDIA CUDA Programming guide 3
[6] nVIDIA's Fermi: The First Complete GPU Computing Architecture, P. N. Glaskowsky, Sept. 2009

Sunday, August 15, 2010

Job: Cognitive Computing

IBM has posted this job: Research staff member - cognitive computing https://jobs3.netmedia1.com/cp/job_summary.jsp?st=6316&job_id=RES-0338816
An extract: Interdisciplinary candidates with background in computer science, electrical engineering, biomedical engineering, and computational neuroscience are strongly encouraged to apply. Outstanding communication skills, ability to interact with a large, technically diverse, distributed team, demonstrated publication record, and relentless focus on project metrics and milestones are a must.

Saturday, August 14, 2010

Retrieving similar or related facts in the knowledge base

Searching for information inside a database is a very common task in computer science. Very efficient algorithms have been developed for fast searching, such as those based on tree structures, hash tables and binary search. Many software packages are designed to manage large databases very efficiently (dBase, FoxPro, Access, MySQL...); of course, they use the fast searching algorithms mentioned above.

Unfortunately, and very logically [1], fast search algorithms are based on a strict matching, so they are not applicable to finding the closest match. Approximate search methods require comparing each element in the database with the searched element and retaining the element with the smallest difference. The comparison is performed by a function that measures the similarity or the difference between the compared elements [2]. This process is expensive [3]. Current processors have from one to a hundred cores [4], so the comparison is repeated as many times as there are elements to compare in the database, divided by the number of cores.

To speed up the fuzzy matching, a simple idea could be a giant memory-coupled processor array (MCPA) structured like a neural network. Each element of the database is stored in one cell of the MCPA. The element to search for is distributed over the whole MCPA. The processor in each cell compares its stored element with the searched element and returns a value indicating the difference. The element can be a string (one word, one sequence of words or one sentence), a matrix of dots (a small portion of an image), or a short time series (a small portion of sound). The MCPA will contain the whole text, the whole image, the whole speech or song. The advantage is that only a few cycles will be required to search for an element and find its closest match.

An example:

If we are able to design an MCPA of 1 million 64-bit cells with a clock at 500 MHz, then it will be large enough to store the whole Bible (about 780,000 words), where each cell will contain one word represented by a 64-bit value [5]. It will be possible to find any misspelled word in just a few nanoseconds, let's say 10 ns.
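
As a quick sanity check on footnote [5] (my arithmetic, using the birthday approximation): with n = 780,000 words hashed into 2^64 possible values, the probability of at least one collision is roughly n^2 / 2^65 = (7.8x10^5)^2 / 3.7x10^19 ~ 1.6x10^-8, which is indeed negligible.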

Unfortunately, there are several problems. The first one is how to load 1 million cells quickly. The memory bandwidth is the main bottleneck: if the data bus has 64 lines, it will take 1 million cycles to load the whole chip. One million cycles at 500 MHz represents 2 ms, so the loading is 200,000 times longer than the searching! If we use both edges of the clock signal and a data bus of 128 lines, it will still take 50,000 times longer to load the data than to compare them. The Library of Congress has 29,000,000 volumes; how long would it take to load these books? Perhaps something between 4 and 8 hours. The solution is to use thousands of MCPA chips and to fill up each chip with several books.

The second problem is the complexity of the chip in terms of number of transistors. If each cell is a tiny 64-bit RISC processor requiring an estimated 70,000 transistors, plus 10,000 transistors for the tiny EEPROM and SRAM memories, the whole chip will contain at least 80 billion transistors. That is much more than current processors (the Itanium 9300 contains 2 billion transistors). 80 billion transistors in a chip is considerable, but should become possible in the near future, probably around 2020.

The third problem is the price. This kind of chip would cost thousands of dollars in its first years. Is it acceptable to spend so much just for fast fuzzy pattern or string matching? In fact, there is probably no other possibility: with the current silicon technology, it is difficult to increase the speed of a processor far above 3 GHz, so we have only one solution, to create a giant array of processors in one chip. Because the transfer speed of data between the memory and a large multi-core processor is a limitation, the solution is to integrate the memory and the processor in a memory-coupled processor cell.

The complexity of each memory-coupled processor cell is due to the separation of the storage of the information (the memory) and the processing of the information (the CPU). This structure requires many lines of communication between the CPU and the memory, and some extra logic to manage the flow. The ideal solution would be to integrate the memory and the logic in a single component. The crossbar latch [6], constituted of memristors [7], seems an interesting solution to combine the storage and the processing of the information.

[1]: Fast searching algorithms try to minimize the number of data items compared, ideally down to none: just find the exact match on the first shot, which is what a perfect hash table is expected to allow. In fact, it is exactly the opposite of a fuzzy search algorithm.
[2]: Edit distance between 2 strings, such as the Levenshtein distance; Hausdorff distance (smallest Euclidean distance between point sets).
[3]: Expensive in number of cycles, in energy and in time. The algorithms that compute the distances often require O(m.n) cycles.
[4]: The Tile Gx processor (TILEPro64) contains up to 100 identical 64-bit cores connected in a square matrix. The graphics processor GeForce GT200 can run 30,720 threads. The Fermi architecture http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf is far more flexible than the G80 / GT200 architecture. An application to the TexLexAn classifier will be discussed in a next post.
[5]: 64 bits (~1.84x10^19 values) is large enough to prevent the risk of collision (2 different words that a hash function would convert to the same value).
[6]: http://www.hpl.hp.com/news/2005/jan-mar/crossbar.html
[7]: http://www.hp.com/hpinfo/newsroom/press/2010/100408xa.html