It would cost even more shuffles to unpack the high lane of a _mm256_loadu_si256 result. Most AVX2 CPUs have more load throughput than shuffle throughput, and you already need at least 1 shuffle per output vector, so your way of doing 1 load and 4 shuffles for 2 epi32 vectors is already a poor tradeoff.
If anything, it would be better to use 2x _mm256_cvtepu8_epi32 to get two vectors of inputs for cvtepi32_ps, with one load per shuffle.
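
A minimal sketch of that version, in case it helps (the helper name and output-by-pointer layout are my own choices, not anything from your code): it converts 16 uint8_t into two __m256 vectors of floats, with one 64-bit load feeding each vpmovzxbd.

    #include <immintrin.h>
    #include <stdint.h>

    // Convert 16 uint8_t starting at p into two __m256 float vectors.
    // Assumes p points at 16 readable bytes.
    static inline void u8x16_to_float(const uint8_t *p, __m256 *lo, __m256 *hi)
    {
        // Two narrow 64-bit loads: _mm_loadl_epi64 tells the compiler only 8 bytes are read.
        __m128i b_lo = _mm_loadl_epi64((const __m128i*)p);
        __m128i b_hi = _mm_loadl_epi64((const __m128i*)(p + 8));

        // One shuffle per output vector: vpmovzxbd zero-extends 8 bytes to 8 dwords.
        __m256i d_lo = _mm256_cvtepu8_epi32(b_lo);
        __m256i d_hi = _mm256_cvtepu8_epi32(b_hi);

        // vcvtdq2ps: packed int32 -> packed float.
        *lo = _mm256_cvtepi32_ps(d_lo);
        *hi = _mm256_cvtepi32_ps(d_hi);
    }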
It's a bit of a pain to use memory-source pmovzx/sx because you need to tell the compiler you're doing a narrow load into a __m128i (for safety), and some compilers won't fold that narrow load into a memory-source operand for vpmovzx. See Loading 8 chars from memory into an __m256 variable as packed single precision floats.
But apparently things have improved somewhat since I originally wrote that answer in 2015: GCC9 fixed that missed-optimization bug, so GCC9 and later fold a _mm_loadl_epi64((const __m128i*)p) into a memory source for vpmovzx. clang and ICC are fine, even old versions. MSVC still does poor code-gen with a separate vmovq, even with -arch:AVX2, even in v19.28 and "latest". (Godbolt).
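
So if you're checking compiler output on Godbolt, the thing to look for is whether the 64-bit load folds into the shuffle. Roughly what that looks like (my annotation, with the memory operand left generic):

    __m256i v = _mm256_cvtepu8_epi32(_mm_loadl_epi64((const __m128i*)p));
    // GCC9+ / clang / ICC:  vpmovzxbd ymm0, qword [mem]   ; load folded into the shuffle
    // MSVC:                 vmovq     xmm0, qword [mem]   ; separate load...
    //                       vpmovzxbd ymm0, xmm0          ; ...then a reg-source shuffle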
On Intel CPUs, vpmovzxbd ymm, qword [mem] is always 2 uops; the load can't micro-fuse (that only works for an xmm destination): https://uops.info/table.html. So you don't gain anything (except code-size) even if the compiler does manage to fold the 64-bit load into a memory operand instead of using a separate vmovq load.
But on Zen 2, that instruction has 2/clock throughput, vs. worse than 1/clock throughput for vpmovzxbd ymm, xmm (same for the wd element size, or for the sign-extending version you're using, vpmovsxwd = _mm256_cvtepi16_epi32). So you really do want the compiler to get this right if you care about Zen CPUs, especially Zen 2.