- The number of threads is limited to 32.
- The memory bandwidth is 30 times lower than the computation capability.
The straightforward solution is to use the GPU for what it is essentially designed to do: "Compute a large array of independent floating-point values with exactly the same code, the same instruction executed at the same time" on many simplified processors. In this way, we will be able to use all 512 cores.
The simplest way is to transform the terms (strings of alphanumeric characters) into values. First step: a hash routine using a parallel algorithm performs this transformation. Next step: the hash values of the terms in the dictionary are compared with the hash value of the searched term. Final step: all the products wi*xi are computed. Both the comparisons and the multiplications are performed in parallel.
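The three steps can be sketched in plain C as follows. This is only an illustration under several assumptions: the hash is a 64-bit FNV-1a stand-in (not the MD6 routine discussed below), xi is taken to be 1 when the dictionary term matches the searched term and 0 otherwise, wi is taken to be a per-term weight, and the names hash_term, hash_dictionary and weighted_matches are hypothetical. Each loop iteration is independent of the others, so on the GPU each iteration would map to one thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in hash: 64-bit FNV-1a. This is an assumption for the sketch,
   NOT the MD6 routine used in the article. */
uint64_t hash_term(const char *term)
{
    uint64_t h = 14695981039346656037ULL;      /* FNV-1a offset basis */
    for (; *term != '\0'; ++term) {
        h ^= (unsigned char)*term;
        h *= 1099511628211ULL;                 /* FNV-1a prime */
    }
    return h;
}

/* First step: turn every dictionary term into a value.
   Iteration i touches only terms[i] and hashes[i]. */
void hash_dictionary(const char **terms, uint64_t *hashes, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        hashes[i] = hash_term(terms[i]);
}

/* Next and final steps: compare each dictionary hash with the hash of
   the searched term (x_i is 1.0 on a match, 0.0 otherwise, an
   assumption), then compute the product w_i * x_i. */
void weighted_matches(const uint64_t *hashes, const float *weights,
                      float *products, size_t n, uint64_t query_hash)
{
    for (size_t i = 0; i < n; ++i) {
        float x = (hashes[i] == query_hash) ? 1.0f : 0.0f;
        products[i] = weights[i] * x;
    }
}
```

With a dictionary of "gpu", "cuda", "hash" and weights 0.5, 2.0, 1.5, searching for "cuda" leaves a non-zero product only in the second slot. Keeping the comparison and the multiplication in the same loop body means a GPU thread would read each hash and weight from memory only once, which matters given the bandwidth limit noted above.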
1 - First step
Example of an MD6 routine, excerpted from "Faster file matching using GPGPUs":
int tx = threadIdx.x % num_threads;
int ty = threadIdx.y;
step = tx + ty;
index = 89;
if ( tx == 15 ) // last thread
    N += 16;
S = ((S << 1) ^ (S >> (W-1)) ^ (S & Smask));
/* MD6 compression routine */
void loop_body(md6_word* A, int rs, int ls, int step, unsigned long long int x, int i)
{
    x ^= A[i+step-t5];
    x ^= A[i+step-t0];
    x ^= ( A[i+step-t1] & A[i+step-t2] );
    x ^= ( A[i+step-t3] & A[i+step-t4] );
    x ^= ( x >> rs );
    A[i+step] = x ^ ( x << ls );
}
.... To be continued ....
Faster file matching using GPGPUs, Deephan Mohan and John Cavazos, Department of Computer and Information Sciences, University of Delaware, June 27, 2010