Windows offers several APIs for video and audio. This happened because newer APIs were introduced as [supposed] descendants of older ones, while the older APIs remained operational to maintain compatibility with existing applications.
Audio APIs: the waveInXxx family of functions, DirectSound, DirectShow, WASAPI
Video APIs: Video for Windows, DirectShow, Media Foundation
Video/audio APIs with support for video+audio streams and files: Video for Windows, DirectShow, Media Foundation
All of the above offer their own functions, interfaces, methods, extensibility and compatibility options. I don't think file descriptors and select apply to any of them. For specific reasons one might prefer to combine APIs; for example, only WASAPI gives fine control over audio capture, but it is an audio-only API. Audio compression and production of media files, especially video-enabled ones, is typically handled by DirectShow or Media Foundation. Since WASAPI was just singled out, below is a minimal sketch of what its capture setup looks like; every call returns an HRESULT that a real application must check, and the read loop is omitted:
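```cpp
#include <mmdeviceapi.h>
#include <audioclient.h>
#pragma comment(lib, "ole32.lib")

int main() {
    CoInitializeEx(nullptr, COINIT_MULTITHREADED);

    IMMDeviceEnumerator* enumerator = nullptr;
    CoCreateInstance(__uuidof(MMDeviceEnumerator), nullptr, CLSCTX_ALL,
                     __uuidof(IMMDeviceEnumerator), (void**) &enumerator);

    IMMDevice* device = nullptr;
    enumerator->GetDefaultAudioEndpoint(eCapture, eConsole, &device);

    IAudioClient* audioClient = nullptr;
    device->Activate(__uuidof(IAudioClient), CLSCTX_ALL, nullptr, (void**) &audioClient);

    WAVEFORMATEX* format = nullptr;
    audioClient->GetMixFormat(&format);
    audioClient->Initialize(AUDCLNT_SHAREMODE_SHARED, 0,
                            10000000, // buffer duration: 1 second in 100-ns units
                            0, format, nullptr);

    IAudioCaptureClient* captureClient = nullptr;
    audioClient->GetService(__uuidof(IAudioCaptureClient), (void**) &captureClient);
    audioClient->Start();
    // Capture loop goes here: GetNextPacketSize / GetBuffer / ReleaseBuffer.

    CoTaskMemFree(format);
    CoUninitialize();
    return 0;
}
```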
Video and audio devices don't have file descriptors. In DirectShow and Media Foundation you obtain interfaces/objects for the respective capture device, and then you can discover device capabilities, such as supported formats, in an API-specific way. Then you either obtain the captured data yourself, or connect the capture component to another object that encodes or presents the data. Since file descriptors are not a part of the story in Windows, your question becomes basically unanswerable as asked. Apparently you are asking those familiar with both Linux and Windows development for guidelines on how to implement in Windows what you are already doing in Linux, but I am afraid you will have to end up doing it the regular Windows way, as the Windows APIs suggest and demonstrate in documentation and samples. As an example of that "API-specific way", this is roughly how you enumerate video capture devices in DirectShow and read their friendly names (a sketch with error handling trimmed; supported formats would then be queried from the capture filter's output pin via IAMStreamConfig):
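```cpp
#include <dshow.h>
#include <cstdio>
#pragma comment(lib, "strmiids.lib")
#pragma comment(lib, "ole32.lib")
#pragma comment(lib, "oleaut32.lib")

int main() {
    CoInitialize(nullptr);

    // The system device enumerator lists capture devices by category.
    ICreateDevEnum* devEnum = nullptr;
    CoCreateInstance(CLSID_SystemDeviceEnum, nullptr, CLSCTX_INPROC_SERVER,
                     IID_ICreateDevEnum, (void**) &devEnum);

    IEnumMoniker* monikers = nullptr;
    if (devEnum->CreateClassEnumerator(CLSID_VideoInputDeviceCategory,
                                       &monikers, 0) == S_OK) { // S_FALSE: no devices
        IMoniker* moniker = nullptr;
        while (monikers->Next(1, &moniker, nullptr) == S_OK) {
            IPropertyBag* properties = nullptr;
            if (SUCCEEDED(moniker->BindToStorage(nullptr, nullptr,
                                                 IID_IPropertyBag, (void**) &properties))) {
                VARIANT name;
                VariantInit(&name);
                if (SUCCEEDED(properties->Read(L"FriendlyName", &name, nullptr)))
                    wprintf(L"Capture device: %s\n", name.bstrVal);
                VariantClear(&name);
                properties->Release();
            }
            // moniker->BindToObject(..., IID_IBaseFilter, ...) would yield
            // the capture filter itself, ready to be added to a graph.
            moniker->Release();
        }
        monikers->Release();
    }
    devEnum->Release();
    CoUninitialize();
    return 0;
}
```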
--
The DirectShow and Media Foundation APIs cover the entire media processing pipeline: capture, processing, presentation. In DirectShow you build your pipeline from components called "filters" connected together (Media Foundation has a similar concept), and then the higher-level application controls their operation without touching the actual data. The filters exchange data among themselves without reporting to the application for every chunk streamed. Sketched in code, under the assumption that pSourceFilter is a capture filter bound from a device moniker (as in the enumeration snippet above), the application's role reduces to assembling and running the graph:
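```cpp
// Fragment, not a full program; error checking omitted.
IGraphBuilder* graph = nullptr;
CoCreateInstance(CLSID_FilterGraph, nullptr, CLSCTX_INPROC_SERVER,
                 IID_IGraphBuilder, (void**) &graph);

ICaptureGraphBuilder2* builder = nullptr;
CoCreateInstance(CLSID_CaptureGraphBuilder2, nullptr, CLSCTX_INPROC_SERVER,
                 IID_ICaptureGraphBuilder2, (void**) &builder);
builder->SetFiltergraph(graph);

graph->AddFilter(pSourceFilter, L"Capture");

// Connects the capture pin through whatever intermediate filters are
// needed, down to a default video renderer.
builder->RenderStream(&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video,
                      pSourceFilter, nullptr, nullptr);

IMediaControl* control = nullptr;
graph->QueryInterface(IID_IMediaControl, (void**) &control);
control->Run(); // from here on, samples flow between filters, not through the app
```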
This is why you might have a hard time finding a way to "get a raw frame". The DirectShow design assumes that raw frames are passed between filters and are not sent to the calling application; getting a raw frame is trivial for a connected filter. You are expected to express all your media data processing needs in terms of DirectShow filters, stock or customized.
Those who, for whatever reason, want to extract the media data stream from a DirectShow pipeline often use the so-called Sample Grabber filter (there are tens of questions about it on SO and MSDN forums): a stock filter that is easy to deal with and that accepts a callback reporting every piece of data streamed through it. This filter is the easiest way to extract frames from a capture device with access to the raw data. A sketch of the usual wiring follows, continuing the graph fragment above; GrabberCB is a name of my choosing, and error handling is again omitted:
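```cpp
#include <qedit.h> // ISampleGrabber/ISampleGrabberCB; deprecated header, still widely used

// Minimal callback; not reference counted, for brevity of the example.
class GrabberCB : public ISampleGrabberCB {
public:
    STDMETHODIMP QueryInterface(REFIID riid, void** ppv) {
        if (riid == IID_IUnknown || riid == IID_ISampleGrabberCB) { *ppv = this; return S_OK; }
        return E_NOINTERFACE;
    }
    STDMETHODIMP_(ULONG) AddRef() { return 1; }
    STDMETHODIMP_(ULONG) Release() { return 1; }
    // Called on the streaming thread for every media sample passing through.
    STDMETHODIMP SampleCB(double sampleTime, IMediaSample* sample) {
        BYTE* data = nullptr;
        sample->GetPointer(&data);
        long size = sample->GetActualDataLength();
        // data now points at the raw frame bytes, size bytes long.
        return S_OK;
    }
    STDMETHODIMP BufferCB(double, BYTE*, long) { return S_OK; }
};

// Wiring into the graph built earlier:
IBaseFilter* grabberFilter = nullptr;
CoCreateInstance(CLSID_SampleGrabber, nullptr, CLSCTX_INPROC_SERVER,
                 IID_IBaseFilter, (void**) &grabberFilter);
graph->AddFilter(grabberFilter, L"Grabber");

ISampleGrabber* grabber = nullptr;
grabberFilter->QueryInterface(IID_ISampleGrabber, (void**) &grabber);

AM_MEDIA_TYPE mt = {};
mt.majortype = MEDIATYPE_Video;
mt.subtype = MEDIASUBTYPE_RGB24; // ask for uncompressed RGB frames
grabber->SetMediaType(&mt);

static GrabberCB callback;
grabber->SetCallback(&callback, 0); // 0 selects SampleCB, 1 selects BufferCB

// Then route the stream through the grabber instead of straight to a renderer:
// builder->RenderStream(&PIN_CATEGORY_CAPTURE, &MEDIATYPE_Video,
//                       pSourceFilter, grabberFilter, nullptr);
```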
The standard video capture capabilities of DirectShow and Media Foundation are based on support for analog video capture devices that exist in Windows through a WDM driver. For those devices the APIs have a respective component/filter created and available, which can be connected within the pipeline. Because DirectShow is relatively easy to extend, it is possible to put other devices into the same form factor of a video capture filter; this covers third-party capture devices available through vendor SDKs, virtual cameras and the like. Once wrapped into a DirectShow filter, they become available to other DirectShow-compatible applications, which basically see no difference between an actual camera and a purely software source.