
Hi, I'm using C++ / Boost ASIO and I need to inline ntohl() for performance reasons. Each data packet contains 256 int32s, hence a lot of calls to ntohl(). Has anyone done this?

Here is the compiled assembly output from VC++ 10 with all optimizations turned on:

;  int32_t d = boost::asio::detail::socket_ops::network_to_host_long(*pdw++);
mov      esi, DWORD PTR _pdw$[esp+64]
mov      eax, DWORD PTR [esi]
push     eax
call     DWORD PTR __imp__ntohl@4

I've also tried the regular ntohl() provided by winsock. Any help would be greatly appreciated.

I've also been considering the C approach of a #define macro that does simple int32 byte swaps via shifts (when the network order doesn't match the machine's order at compile time). And if anyone knows the most efficient assembly for ntohl() on the x86 / x64 architecture and can share it, that would be awesome. Eventually my code needs to be portable to ARM as well.
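For reference, a compile-time-dispatched swap along the lines described above might look like the sketch below. The function name `my_ntohl` is hypothetical; it uses the MSVC and GCC/Clang byte-swap intrinsics where available and falls back to the plain shift-and-mask approach from the question:

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#  include <stdlib.h>   // _byteswap_ulong
#endif

// Hypothetical inlined replacement for ntohl(): intrinsics where we have them,
// otherwise the portable shift-and-mask swap.
static inline uint32_t my_ntohl(uint32_t v)
{
#if defined(_MSC_VER)
    return _byteswap_ulong(v);        // MSVC intrinsic, compiles to bswap
#elif defined(__GNUC__)
    return __builtin_bswap32(v);      // GCC/Clang intrinsic: bswap on x86, rev on ARM
#else
    return (v << 24) | ((v & 0xFF00u) << 8) |
           ((v >> 8) & 0xFF00u) | (v >> 24);
#endif
}
```

On a big-endian target this would need to become a no-op, e.g. behind an endianness check in the preprocessor.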

Mark
    ... and the operating system and platform that you are using is....? –  Sep 21 '11 at 19:20
    BTW, you are being terribly wrong if you use Boost ASIO and think that a call to `ntohl` is a performance bottleneck :) –  Sep 21 '11 at 19:23
  • Windows 7 and x64 platform for now, and possibly linux x64 platform soon. Eventually the code will be on ARM Cortex-A9 platform with linux. – Mark Sep 21 '11 at 19:24
    The profiler is telling me it is. I only have one socket / one thread and its reading UDP datagrams, problem is the data rate is nearly 70MB/s or ~70,000 packets / s. The software CRC check is also a huge bottleneck but that's being moved into hardware. – Mark Sep 21 '11 at 19:27
  • Are you sure you have a custom handler allocator that avoids calls to malloc/free/new/delete? If not, how is that your profiler is showing you calls to `ntohl` but not to `malloc`? Or maybe you profile for CPU cycles taken by the process excluding blocking system calls? –  Sep 21 '11 at 19:44

3 Answers

5

The x86-32 and x86-64 platforms have a 'bswap' instruction that reverses the bytes of a 32-bit register. I don't think you'll do better than one instruction.

uint32_t asm_ntohl(uint32_t a)
{
    __asm
    {
        mov eax, a   ; load the argument
        bswap eax    ; reverse byte order; EAX is MSVC's return register,
                     ; so no explicit return statement is needed
    }
}
David Schwartz
  • Thanks, worked almost flawlessly, except that the Microsoft compiler inserted 2 instructions to backup `a` onto the stack, so I used the intrinsic `_byteswap_ulong()` [Link to topic on this.](http://stackoverflow.com/questions/2839710/how-to-inline-assembler-in-c-under-visual-studio-2010/2839772#2839772) I may switch to the Intel compiler if I'm permitted. :) – Mark Sep 21 '11 at 21:53
  • MSVC does NOT support inline assembly on x64/amd64. Intel introduced a movbe instruction with Atom, and desktop CPUs now support it too. It's recommended to test for this feature with CPUID. movbe can be used to load data from memory into a register, treating it as big-endian. – zhaorufei Mar 29 '21 at 16:05
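Since the question is about converting 256 int32s per packet, it may help to swap the whole buffer in one loop rather than call-per-value. The sketch below uses a plain shift-and-mask swap (names `bswap32` and `packet_to_host` are illustrative, not from the answer); with the intrinsics from the comments above, compilers typically emit bswap and can vectorize the loop:

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative stand-in for whichever inlined swap is used
// (_byteswap_ulong on MSVC, __builtin_bswap32 on GCC/Clang).
static inline uint32_t bswap32(uint32_t v)
{
    return (v << 24) | ((v & 0xFF00u) << 8) |
           ((v >> 8) & 0xFF00u) | (v >> 24);
}

// Convert a packet of n big-endian int32s to host order in place.
void packet_to_host(uint32_t* p, std::size_t n = 256)
{
    for (std::size_t i = 0; i < n; ++i)
        p[i] = bswap32(p[i]);
}
```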

Looking at the assembler, __imp__ntohl@4 is an import symbol from a DLL, so it is an external function and cannot be inlined.

Of course you can write your own, even as a macro; knowing that you are most likely on a little-endian Windows machine, you just need to swap the bytes.

You can find several highly optimized, more or less portable versions in glib's gtypes.h header, in the GUINT32_SWAP_LE_BE macro: glib.h
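The generic fallback in that family of macros boils down to a shift-and-mask expression. Here is a paraphrased sketch in that spirit (not the verbatim glib macro; the name `SWAP_LE_BE` is illustrative):

```cpp
#include <cstdint>

/* Portable shift-and-mask 32-bit byte swap, similar in spirit to
 * glib's GUINT32_SWAP_LE_BE fallback. Each byte is masked out and
 * moved to its mirrored position. */
#define SWAP_LE_BE(val) \
    ((uint32_t)( \
        (((uint32_t)(val) & 0x000000FFu) << 24) | \
        (((uint32_t)(val) & 0x0000FF00u) <<  8) | \
        (((uint32_t)(val) & 0x00FF0000u) >>  8) | \
        (((uint32_t)(val) & 0xFF000000u) >> 24)))
```

On x86 with GCC, glib replaces this form with inline assembly; elsewhere the compiler usually recognizes the pattern and emits bswap anyway.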

rodrigo
  • A note for everyone else: glib.h is only optimized for x86 with the GNU C compiler; for other architectures/compilers it falls back to `<<` and `|` in plain `#define` macros and regular C. – Mark Sep 21 '11 at 22:46

Please see optimizing byte swapping for fun and profit. It explains how to make it fast.

But I strongly recommend you stop worrying about it. Think about it: ASIO allocates memory to store the handler's state every time you call async_read, just for example. That is far more expensive than a call to the innocent ntohl, which is inlined on Linux by default, by the way. It looks like you have a premature optimization problem; stop immediately, or you will waste your time and resources. After all, profile your application first, then optimize it (VTune or TotalView recommended).

  • I'm using the sync_read inside a thread to avoid mallocs, passing the same pre-allocated buffer. You're right though, I am prematurely optimizing. I'll check out VTune for sure too, thanks! – Mark Sep 21 '11 at 21:08
  • @Mark: Tell you what - you are not avoiding memory allocation, even if you use the same thread and a pre-allocated buffer. A call to async_read requires additional state to be associated with the operation, and ASIO implicitly allocates/frees memory for that unless you have a custom handler allocator. So you are going the right way - profile first :-) –  Sep 21 '11 at 21:14
  • Oops, I said `sync_read` in my last post, but in actuality the call I'm making is `m_socket.receive_from( boost::asio::buffer(m_packet_buffer), m_remote_ep)`, hence no memory allocation? I think it should map directly to a BSD style socket call? – Mark Sep 21 '11 at 21:32
  • @Mark: Oops, for a sync call there is probably no alloc, since you don't have a callback that must be invoked asynchronously, but you have to double-check. Asio is a very poor performer when push comes to shove, so stay alert :) –  Sep 21 '11 at 22:43