As you say, if you can assume a 386-compatible CPU, a good option (especially for modern CPUs) is movzx dx, byte ptr [mem] / add ax, dx. If not, I guess we can pretend we're tuning for a real 8086, where code size in bytes is often more important than instruction count. (Especially on 8088, with its 8-bit bus.) So you definitely want to use xor dx, dx to zero DX (2 bytes instead of 3 for mov reg, imm16), if you can't avoid a zeroing instruction altogether.
Hoist the zeroing of DX (or DH) out of any loop, so you just mov dl, [mem] / add ax, dx. If the function only does it once, you may need to (manually) inline the function in call sites that call it in a loop, if it's small enough for that to make sense. Or pick a register where callers are responsible for having the upper half zero.
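As a rough sketch of the hoisting idea, here is a loop that adds CX bytes starting at DS:SI into AX. The loop itself, the label sum_loop, and the choice of CX as the counter are assumptions made for illustration, not something from the question; only 8086 instructions are used.

```asm
; Hypothetical example: AX += sum of CX bytes at DS:SI.
; Assumes CX != 0 on entry. sum_loop is an illustrative label.
        xor     dx, dx          ; 2 bytes: zero DX once, outside the loop
sum_loop:
        mov     dl, [si]        ; 2 bytes: DH is still 0, so DX = zero-extended byte
        add     ax, dx          ; 2 bytes: plain 16-bit add, no movzx needed
        inc     si              ; 1 byte:  advance the pointer
        loop    sum_loop        ; 2 bytes: dec CX, jump if CX != 0
```

The xor is paid for once, so each iteration only costs the mov / add pair plus loop overhead.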
As Raymond says, you can pick any other register whose high half you know to be zero at that point in your function. Perhaps you could mov cx, 4 instead of mov cl, 4 if you happened to need CL=4 for something else earlier, but you're done with CX by the time you need to add into AX. mov cx, 4 is only 1 byte longer than mov cl, 4 (3 bytes vs. 2), so you get CH zeroed for 1 extra byte of code size, vs. 2 bytes for a separate xor cx, cx.
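A minimal sketch of that trick, assuming the earlier use of CL was a shift count (the shl on BX is purely hypothetical; any use of CL=4 would do):

```asm
; Hypothetical example: reuse CX once its value is dead.
        mov     cx, 4           ; 3 bytes (vs. 2 for mov cl, 4): CL = 4 and CH = 0
        shl     bx, cl          ; illustrative earlier use of CL as a shift count
        ; ... CX's value is not needed past this point ...
        mov     cl, [si]        ; 2 bytes: CH is still 0, so CX = zero-extended byte
        add     ax, cx          ; 2 bytes
```

This only works if nothing between the mov cx, 4 and the mov cl, [si] writes CH.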
Another option is byte add/adc, but that isn't ideal for code size. (Or performance on later CPUs.)
add al, [mem] ; 2 bytes + extra depending on addr mode
adc ah, 0 ; 3 bytes
So that's 1 byte more than if you already had a spare upper-zeroed register:
mov dl, [mem] ; 2 bytes (+ optional displacement)
add ax, dx ; 2 bytes
But on the plus side, add/adc doesn't need any extra register at all.
With the pointer in SI, it's worth looking for ways to take advantage of lodsb if you're really optimizing for code-size. That does mov al, [si] / inc si (or instead dec si if DF=1), but without affecting FLAGS. Since lodsb overwrites AL, you'd want to accumulate into a different register, with AH zeroed ahead of time so you can add AX into it.
xchg ax, reg is only 1 byte, but if you need two swaps it may not pay for itself if you actually have to return in AX, not some other register.
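To make the lodsb idea concrete, here is a sketch that sums CX bytes at DS:SI into DX and hands the result back in AX. It assumes DF=0, CX != 0, and a total that starts from zero; the label sum_loop and the register choices are illustrative only.

```asm
; Hypothetical example: AX = sum of CX bytes at DS:SI.
; Assumes DF = 0 and CX != 0 on entry.
        xor     dx, dx          ; 2 bytes: DX = running total
        xor     ax, ax          ; 2 bytes: AH = 0, and lodsb only ever writes AL
sum_loop:
        lodsb                   ; 1 byte:  AL = [si], then SI++ (FLAGS untouched)
        add     dx, ax          ; 2 bytes: AX = zero-extended byte
        loop    sum_loop        ; 2 bytes: dec CX, jump if CX != 0
        xchg    ax, dx          ; 1 byte:  return the total in AX
```

Here the load plus pointer increment is 1 byte instead of 3 for mov dl, [si] / inc si, and only one xchg is needed because the total starts at zero. If AX already held a value to add to, you'd need a swap on the way in as well, which is where the savings can disappear.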