Recently I started working with sockets. I realized that when reading from a network stream, you can not know how much data is coming in. So either you know in advance how many bytes have to be recieved or you know which bytes.
Since I am currently trying to implement a C# WebSocket server I need to process HTTP requests. A HTTP request can have arbitrary length, so knowing in advance how many bytes is out of the question. But a HTTP request always has a certain format. It starts with the request-line, followed by zero or more headers, etc. So with all this information it should be simple, right?
Nope.
One approach I came up with was reading all data until a specific sequence of bytes was recognized. The StreamReader class has the ReadLine
method which, I believe, works like this. For HTTP a reasonable delimiter would be the empty line separating the message head from the body.
The obvious problem here is the requirement of a (preferrably short) termination sequence, like a line break. Even the HTTP specification suggests that these two adjacent CRLFs are not a good choice, since they could also occur at the beginning of the message. And after all, two CRLFs are not a simple delimiter anyways.
So expanding the method to arbitrary type-3 grammars, I concluded the best choice for parsing the data is a finite state machine. I can feed the data to the machine byte after byte, just as I am reading it from the network stream. And as soon as the machine accepts the input I can stop reading data. Also, the FSM could immediately capture the significant tokens.
But is this really the best solution? Reading byte after byte and validating it with a custom parser seems tedious and expensive. And the FSM would be either slow or quite ugly. So...
How do you process data from a network stream when the form is known but not the size?
How can classes like the HttpListener parse the messages and be fast at it too?
Did I miss something here? How would this usually be done?