
I have already read this question and this one. They are quite helpful, but I still have some doubts about token generation in a lexical analyzer for C. Suppose the lexical analyzer encounters `int a2.5c;`. According to my understanding, 7 tokens will be generated:

int keyword
a identifier
2 constant
. special symbol
5 constant
c identifier
; special symbol

So the lexical analyzer will not report any error, and the tokens will be generated successfully.

Is my understanding correct? If not, can you please help me understand?

Also, if we declare a constant as `double a = 10.10.10;`, will it generate any lexical errors? Why?

UPDATE: Asking out of curiosity, what if the lexical analyzer encounters something like the smiley `:-)` in a program? Will it generate a lexical error? As I understand it, `:` will be treated as a special symbol, `-` as an operator, and `)` again as a special symbol.
Thank You

Hardik Modha
  • There are many resources to be found about compiler construction. The book of that name by N. Wirth is available as a free download. – too honest for this site Jul 12 '15 at 15:47
  • I appreciate your help. I have downloaded the book and will read it. :) – Hardik Modha Jul 12 '15 at 15:54
  • As far as C is concerned, `2` doesn't start a new token but is part of the identifier. And `5c` without a space between them isn't a valid token. – Jens Gustedt Jul 12 '15 at 16:08
  • In question [link](http://stackoverflow.com/questions/5535319/what-can-create-a-lexical-error-in-c#5535492), according to Ira Baxter's comment, `int 3d = 1` generates valid tokens. – Hardik Modha Jul 12 '15 at 16:18
  • So I think the lexical analyzer will not throw any error and will generate two tokens, `5` and `c`, as `constant` and `identifier` respectively. – Hardik Modha Jul 12 '15 at 16:28

2 Answers


Your first list of tokens is almost correct -- `a2` is a valid identifier, so `a` and `2` form one token rather than two.

It's true that the first example won't generate any "lexical" errors per se, although there will be a parse error at the `.`.

It's hard to say whether the error in your second example is a lexical error or a parse error. The lexical structure of a floating-point constant is pretty complicated. I can imagine a compiler that grabs a string of digits and `.` and `e`/`E` and doesn't notice until it calls the equivalent of `strtod` that there are two decimal points, meaning that it might report a "lexical error". Strictly speaking, though, what we have there is two floating-point constants in a row -- `10.10` and `.10`, meaning that it's more likely a "parse error".
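The "grab a run of characters, then ask `strtod`" approach described above can be sketched roughly as follows. This is only an illustration of the idea, not how any particular compiler does it: the function name `number_run_is_valid` and the 64-byte buffer are assumptions made for the example, and suffixes such as `f` or `L` are ignored.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: greedily collect digits, '.', 'e'/'E' (and a sign
 * directly after the exponent letter) starting at src, then let strtod
 * decide how much of that run is really one number.
 * Returns 1 if the whole run is a single constant, 0 otherwise. */
int number_run_is_valid(const char *src)
{
    char buf[64];
    size_t n = 0;

    while (src[n] != '\0' && n < sizeof buf - 1) {
        char c = src[n];
        int sign_after_exp = (c == '+' || c == '-') && n > 0 &&
                             (src[n - 1] == 'e' || src[n - 1] == 'E');
        if (!(isdigit((unsigned char)c) || c == '.' ||
              c == 'e' || c == 'E' || sign_after_exp))
            break;
        buf[n++] = c;
    }
    buf[n] = '\0';

    if (n == 0)
        return 0;               /* nothing that looks like a number */

    char *end;
    (void)strtod(buf, &end);    /* only the stopping position matters here */
    return *end == '\0';        /* leftover characters mean a malformed run */
}
```

With this sketch, the run `10.10.10` is collected as a whole, `strtod` stops at the second `.`, and the function returns 0. A lexer built this way would naturally report the problem itself, which is the "lexical error" reading of the example.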

In the end, though, these are all just errors. Unless you're taking a compiler design/construction class, I'm not sure how important it is to classify errors as lexical or otherwise.


Addressing your follow-on question, yes, `:-)` would lex as three tokens: `:`, `-`, and `)`.

Because just about any punctuation character is legal in C, there are relatively few character sequences that are lexically illegal (that is, that would generate errors during the lexical analysis phase). In fact, the only ones I can think of are:

  • Illegal character (I think the only unused ones are `` ` ``, `@`, and `$`, and many compilers accept `$` in identifiers as an extension)
  • various problems with character and string constants (missing ' or ", bad escape sequences, etc.)
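The first kind of lexical error, an illegal character, could be checked along these lines. This is a minimal sketch that assumes the lexer only cares about the basic source character set of standard C (letters, digits, 29 punctuation characters, and a few whitespace characters). The name `is_basic_source_char` is made up for the example, and a real lexer would still allow other characters inside string literals, character constants, and comments.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if c belongs to the basic source
 * character set of standard C, 0 otherwise. Characters such as
 * '`', '@' and '$' are not in that set. */
int is_basic_source_char(int c)
{
    /* The 29 graphic characters besides letters and digits. */
    static const char punct[] = "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~";

    if (c <= 0 || c > 127)
        return 0;                       /* also keeps strchr away from '\0' */
    if (isalnum(c))
        return 1;
    if (c == ' ' || c == '\t' || c == '\n' || c == '\v' || c == '\f')
        return 1;
    return strchr(punct, c) != NULL;
}
```

A lexer could call this on every character outside literals and comments and report an "illegal character" error when it returns 0.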

Indeed, almost any string of punctuation you care to bang out will make it through a C lexical analyzer, although of course it may or may not parse. (A somewhat infamous example is `a+++++b`, which unfortunately lexes as `a++ ++ + b` and is therefore a syntax error.)
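The `a+++++b` result follows from the "longest match" rule: at each position the lexer takes the longest sequence of characters that can form a token. Here is a toy sketch of that rule. It assumes a language with only identifiers, `+`, and `++`, and the function name `lex_maximal_munch` is invented for the illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy lexer: knows only identifiers, "+" and "++".
 * Writes the tokens into out, separated by single spaces, always
 * preferring the longest possible token at each position. */
void lex_maximal_munch(const char *src, char *out, size_t outsz)
{
    size_t o = 0;

    if (outsz == 0)
        return;
    while (*src != '\0') {
        size_t len = 1;                 /* default: one-character token */

        if (isspace((unsigned char)*src)) {
            src++;
            continue;
        }
        if (src[0] == '+' && src[1] == '+') {
            len = 2;                    /* "++" wins over "+" */
        } else if (isalnum((unsigned char)*src) || *src == '_') {
            while (isalnum((unsigned char)src[len]) || src[len] == '_')
                len++;
        }
        if (o + len + 2 > outsz)
            break;                      /* no room for separator, token, NUL */
        if (o > 0)
            out[o++] = ' ';
        memcpy(out + o, src, len);
        o += len;
        src += len;
    }
    out[o] = '\0';
}
```

Feeding it `a+++++b` yields `a ++ ++ + b`: the lexer never backtracks to try `a ++ + ++ b`, even though only that split would parse.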

Steve Summit
  • Thanks for the explanation. It makes some sense now. Actually, I am implementing a simple C lexical analyzer as homework, so I needed to clear up these doubts. – Hardik Modha Jul 12 '15 at 17:51
  • Asking out of curiosity, what if the lexical analyzer encounters something like the smiley `:-)` in a program? Will it generate a lexical error? As I understand it, `:` will be treated as a `special symbol`, `-` as an `operator`, and `)` again as a `special symbol`. – Hardik Modha Jul 13 '15 at 09:07

The C lexer I wrote tokenizes this as

keyid int
white " "
keyid a2
const .5
keyid c
punct ;
white "\n"

where `keyid` is a keyword or identifier, `const` is a numerical constant, and `punct` is a punctuator (`white` is white space). I would not say there is a lexical error, but there is certainly a syntax error that must be diagnosed, because an identifier followed by a numerical constant is something no grammar rule can reduce.
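For readers who want to experiment, a token split like the one listed above can be reproduced with a very small hand-written scanner. The following is not the lexer from this answer, just a hypothetical sketch: the names `next_token` and `enum kind` are invented, and constants are simplified to digits with at most one `.` (no exponents or suffixes). Note also that a standard-conforming C preprocessor would treat `.5c` as a single preprocessing number rather than splitting it, so this sketch models the simpler scheme only.

```c
#include <ctype.h>
#include <stddef.h>

enum kind { T_END, T_WHITE, T_KEYID, T_CONST, T_PUNCT };

/* Hypothetical helper: classify the token that starts at s and store
 * its length in *len. */
enum kind next_token(const char *s, size_t *len)
{
    size_t n = 0;
    unsigned char c = (unsigned char)s[0];

    if (c == '\0') {
        *len = 0;
        return T_END;
    }
    if (isspace(c)) {                       /* run of white space */
        while (isspace((unsigned char)s[n]))
            n++;
        *len = n;
        return T_WHITE;
    }
    if (isalpha(c) || c == '_') {           /* keyword or identifier */
        while (isalnum((unsigned char)s[n]) || s[n] == '_')
            n++;
        *len = n;
        return T_KEYID;
    }
    if (isdigit(c) || (c == '.' && isdigit((unsigned char)s[1]))) {
        int seen_dot = 0;                   /* allow only one '.' */
        while (isdigit((unsigned char)s[n]) || (s[n] == '.' && !seen_dot)) {
            if (s[n] == '.')
                seen_dot = 1;
            n++;
        }
        *len = n;
        return T_CONST;
    }
    *len = 1;                               /* everything else: punctuator */
    return T_PUNCT;
}
```

Calling `next_token` repeatedly on `int a2.5c;` and advancing by `*len` each time gives `int`, white space, `a2`, `.5`, `c`, and `;`, with no lexical error along the way. Rejecting the sequence is left to the parser.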

Jens