This sounds like it may be an issue with coalesced memory access. You should try to have consecutive threads access consecutive elements from memory.
For example, assume you're using 10 threads (numbered 0-9) and you're operating on a 10x10 element data set. It is easy to picture the data laid out in a grid like it is below, however, in memory, the way you declared it in your code, it is laid out in a linear manner, as a 100-element 1D array.
0, 1, 2, 3... 9,
10, 11, 12, 13... 19,
20, 21, 22, 23... 29,
30, 31, 32, 33... 39,
. . .
. . .
. . .
90, 91, 92, 93... 99
It sounds like your first implementation going "from top to bottom" is performing coalesced reads -- the ten threads operate on elements 0, 1, 2, 3... 9, then 10, 11, 12, 13... 19, etc. These reads are coalesced because the ten threads read ten elements that are adjacent in the 1D linear memory layout.
It sounds like your second implementation going "from left to right" may be accessing your array in an uncoalesced manner -- the ten threads operate on elements 0, 10, 20, 30... 90, then 1, 11, 21, 31... 91, etc. In this case, reads are uncoalesced because the ten consecutive threads are reading memory locations that are actually far apart. Remember, in a 1D linear memory layout, elements 12 and 22 are ten memory addresses away from one another!
The Best Practices Guide discusses the importance of coalesced access in section 3.2.1, and there's a pretty good description of coalesced accesses in this post.