Windows offers several APIs for video and audio. This happened because newer APIs were introduced as [supposed] descendants of older ones, while the older APIs remained operational to maintain compatibility with existing applications.
Audio APIs: the waveInXxx family of functions, DirectSound, DirectShow, WASAPI
Video APIs: Video for Windows, DirectShow, Media Foundation
Video/audio APIs with support for video+audio streams and files: Video for Windows, DirectShow, Media Foundation
All of the above offer their own functions, interfaces, methods, extensibility and compatibility options. I don't think file descriptors and select apply to any of them. For specific reasons one might prefer to combine APIs; for example, only WASAPI gives fine control over audio capture, but it is an audio-only API. Audio compression and production of media files, especially video-enabled ones, is typically handled by DirectShow or Media Foundation. Since WASAPI was just singled out, below is a minimal sketch of what its capture setup looks like; every call returns an HRESULT that a real application must check, and the read loop is omitted:
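```cpp
#include <mmdeviceapi.h>
#include <audioclient.h>
#pragma comment(lib, "ole32.lib")

int main() {
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);

    IMMDeviceEnumerator* enumerator = nullptr;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), nullptr, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void**) &enumerator);

    IMMDevice* device = nullptr;
    enumerator->GetDefaultAudioEndpoint(eCapture, eConsole, &device);

    IAudioClient* audioClient = nullptr;
    device->Activate(__uuidof(IAudioClient), CLSCTX_ALL, nullptr, (void**) &audioClient);

    WAVEFORMATEX* format = nullptr;
    audioClient->GetMixFormat(&format);
    audioClient->Initialize(AUDCLNT_SHAREMODE_SHARED, 0,
                            10000000, // buffer duration: 1 second in 100-ns units
                            0, format, nullptr);

    IAudioCaptureClient* captureClient = nullptr;
    audioClient->GetService(__uuidof(IAudioCaptureClient), (void**) &captureClient);
    audioClient->Start();
    // Capture loop goes here: GetNextPacketSize / GetBuffer / ReleaseBuffer.

    CoTaskMemFree(format);
    CoUninitialize();
    return 0;
}
```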
Video and audio devices don't have file descriptors. In DirectShow and Media Foundation you obtain interfaces/objects for the respective capture device, and then you can discover device capabilities, such as supported formats, in an API-specific way. Then you either obtain the captured data yourself, or connect the capture component to another object that encodes or presents the data. Since file descriptors are not a part of the story in Windows, your question becomes basically unanswerable as asked. Apparently you are asking those familiar with both Linux and Windows development for guidelines on how to implement in Windows what you are already doing in Linux, but I am afraid you will have to end up doing it the regular Windows way, as the Windows APIs suggest and demonstrate in documentation and samples. As an example of that "API-specific way", this is roughly how you enumerate video capture devices in DirectShow and read their friendly names (a sketch with error handling trimmed; supported formats would then be queried from the capture filter's output pin via IAMStreamConfig):
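```cpp
#include <dshow.h>
#include <cstdio>
#pragma comment(lib, "strmiids.lib")
#pragma comment(lib, "ole32.lib")
#pragma comment(lib, "oleaut32.lib")

int main() {
    CoInitialize(nullptr);

    // The system device enumerator lists capture devices by category.
    ICreateDevEnum* devEnum = nullptr;
    CoCreateInstance(CLSID_SystemDeviceEnum, nullptr, CLSCTX_INPROC_SERVER,
                     IID_ICreateDevEnum, (void**) &devEnum);

    IEnumMoniker* monikers = nullptr;
    if (devEnum->CreateClassEnumerator(CLSID_VideoInputDeviceCategory,
                                       &monikers, 0) == S_OK) { // S_FALSE: no devices
        IMoniker* moniker = nullptr;
        while (monikers->Next(1, &moniker, nullptr) == S_OK) {
            IPropertyBag* properties = nullptr;
            if (SUCCEEDED(moniker->BindToStorage(nullptr, nullptr,
                                                 IID_IPropertyBag, (void**) &properties))) {
                VARIANT name;
                VariantInit(&name);
                if (SUCCEEDED(properties->Read(L"FriendlyName", &name, nullptr)))
                    wprintf(L"Capture device: %s\n", name.bstrVal);
                VariantClear(&name);
                properties->Release();
            }
            // moniker->BindToObject(..., IID_IBaseFilter, ...) would yield
            // the capture filter itself, ready to be added to a graph.
            moniker->Release();
        }
        monikers->Release();
    }
    devEnum->Release();
    CoUninitialize();
    return 0;
}
```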
--
The DirectShow and Media Foundation APIs cover the entire media processing pipeline: capture, processing, presentation. In DirectShow you build your pipeline from components called "filters" connected together (Media Foundation has a similar concept), and then the higher-level application controls their operation without touching the actual data. The filters exchange data among themselves without reporting to the application for every chunk streamed. Sketched in code, under the assumption that pSourceFilter is a capture filter bound from a device moniker (as in the enumeration snippet above), the application's role reduces to assembling and running the graph:
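```cpp
// Fragment, not a full program; error checking omitted.
IGraphBuilder* graph = nullptr;
CoCreateInstance(CLSID_FilterGraph, nullptr, CLSCTX_INPROC_SERVER,
                 IID_IGraphBuilder, (void**) &graph);

ICaptureGraphBuilder2* builder = nullptr;
CoCreateInstance(CLSID_CaptureGraphBuilder2, nullptr, CLSCTX_INPROC_SERVER,
                 IID_ICaptureGraphBuilder2, (void**) &builder);
builder->SetFiltergraph(graph);

graph->AddFilter(pSourceFilter, L"Capture");

// Connects the capture pin through whatever intermediate filters are
// needed, down to a default video renderer.
builder->RenderStream(&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video,
                      pSourceFilter, nullptr, nullptr);

IMediaControl* control = nullptr;
graph->QueryInterface(IID_IMediaControl, (void**) &control);
control->Run(); // from here on, samples flow between filters, not through the app
```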
This is why you might have a hard time finding a way to "get a raw frame". The DirectShow design assumes that raw frames are passed between filters and are not sent to the calling application; getting a raw frame is trivial for a connected filter. You are expected to express all your media data processing needs in terms of DirectShow filters, stock or customized.
Those who, for whatever reason, want to extract the media data stream from a DirectShow pipeline often use the so-called Sample Grabber filter (there are tens of questions about it on SO and MSDN forums): a stock filter that is easy to deal with and that accepts a callback reporting every piece of data streamed through it. This filter is the easiest way to extract frames from a capture device with access to the raw data. A sketch of the usual wiring follows, continuing the graph fragment above; GrabberCB is a name of my choosing, and error handling is again omitted:
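```cpp
#include <qedit.h> // ISampleGrabber/ISampleGrabberCB; deprecated header, still widely used

// Minimal callback; not reference counted, for brevity of the example.
class GrabberCB : public ISampleGrabberCB {
public:
    STDMETHODIMP QueryInterface(REFIID riid, void** ppv) {
        if (riid == IID_IUnknown || riid == IID_ISampleGrabberCB) { *ppv = this; return S_OK; }
        return E_NOINTERFACE;
    }
    STDMETHODIMP_(ULONG) AddRef() { return 1; }
    STDMETHODIMP_(ULONG) Release() { return 1; }
    // Called on the streaming thread for every media sample passing through.
    STDMETHODIMP SampleCB(double sampleTime, IMediaSample* sample) {
        BYTE* data = nullptr;
        sample->GetPointer(&data);
        long size = sample->GetActualDataLength();
        // data now points at the raw frame bytes, size bytes long.
        return S_OK;
    }
    STDMETHODIMP BufferCB(double, BYTE*, long) { return S_OK; }
};

// Wiring into the graph built earlier:
IBaseFilter* grabberFilter = nullptr;
CoCreateInstance(CLSID_SampleGrabber, nullptr, CLSCTX_INPROC_SERVER,
                 IID_IBaseFilter, (void**) &grabberFilter);
graph->AddFilter(grabberFilter, L"Grabber");

ISampleGrabber* grabber = nullptr;
grabberFilter->QueryInterface(IID_ISampleGrabber, (void**) &grabber);

AM_MEDIA_TYPE mt = {};
mt.majortype = MEDIATYPE_Video;
mt.subtype = MEDIASUBTYPE_RGB24; // ask for uncompressed RGB frames
grabber->SetMediaType(&mt);

static GrabberCB callback;
grabber->SetCallback(&callback, 0); // 0 selects SampleCB, 1 selects BufferCB

// Then route the stream through the grabber instead of straight to a renderer:
// builder->RenderStream(&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video,
//                       pSourceFilter, grabberFilter, nullptr);
```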
The standard video capture capabilities of DirectShow and Media Foundation are based on support for analog video capture devices that exist in Windows through a WDM driver. For those devices the APIs have a respective component/filter created and available, which can be connected within the pipeline. Because DirectShow is relatively easy to extend, it is possible to put other devices into the same form factor of a video capture filter; this covers third-party capture devices available through vendor SDKs, virtual cameras and the like. Once wrapped into a DirectShow filter, they become available to other DirectShow-compatible applications, which basically see no difference between an actual camera and a purely software source.