From the area of compilers, we get a host of tools to convert text files into programs. The first part of that process is often called lexical analysis, particularly for such languages as C.
A good tool for creating lexical analyzers is flex, based on the older lex program. Both take a specification file and create an analyzer, usually called lex.yy.c.
Flex is reasonably compatible with the UTF-8 encoding for Unicode.
One common characteristic of standard lexical analysis (but certainly not universal) is the assignment of types; another is passing attributes to the caller.
Upon recognizing a numerical constant, for instance, the scanner might pass back the the type (e.g., integer), the lexeme string, and the value as an C int.
Upon recognizing an identifier, the lexer might pass back the type (e.g., identifier), the lexeme string, and a pointer to information about the identifier.
Use a tool, like flex or re2c. (Example: code)
Write a one-off analyzer in your favorite programming language. (Most common strategy these days.) For example, you can use libc's strtok() function. (Example: code)
Write a one-analyzer in assembly. (Usually done for bootstrapping purposes, though lexical analysis in assembly against a mmap(2) can be an exceedingly fast technique.)
While it might be found in some libc's, you might also have to link explicitly with -lfl.
The lexer function is called yylex(), and it is quite easy to interface with bison/yacc.
*.l file --> flex --> lex.yy.c
lex.yy.c --> C compiler --> lexical analyzer
input stream --> lexical analyzer --> actions taken when rules applied
Flex source structure:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
Declaration of ordinary C variables and whatnot.
flex definitions
The form of rules are
regularexpression action
The actions are C code.
s literal string s
\c character c literally
[s] character class
^ beginning of line
[^s] characters not in character class
s? s occurs zero or one time
. any character except newline
s* zero or occurrences of s
s+ one or more occurrences of s
r|s r or s
{s} grouping
$ end of line
s{m,n} m through n occurrences of s
a* zero or more a's
.* zero or more of any char except newline
.+ one or more characters
[a-z] a lower-case letter
[a-zA-Z] any letter
[^a-zA-Z] not a letter
a.b a followed by any char then followed by b
rs|tu rs or tu
a(b|c)d abd or acd
^start "start" at the beginning of line
END$ the characters END followed by end-of-line
Actions are just C code. If it is compound, or requires more than a single line, enclose with curly braces.
Examples:
[a-z]+ printf("found word\n");
[A-Z[a-z]* { printf("found capitalized word:\n");
printf(" '%s'\n",yytext);
}
The form is simply
name definition
The name is just a word beginning with a letter (or underscore, but I don't recommend those) followed by zero or more letters, underscores, or dashes. The definition actually from the first non-whitespace character to the end of line. You can refer to it via {name}, which will expand to your definition.
For example:
DIGIT [0-9]
{DIGIT}*\.{DIGIT}+
is equivalent to
([0-9])*\.([0-9])+
%{ int num_lines = 0;
int num_chars = 0;
%}
%%
\n {++num_lines; ++num_chars;}
. {++num_chars;}
%%
int main(int argc, char **argv)
{
yylex();
printf("# of lines = %d, # of chars = %d\n",
num_lines, num_chars);
}
digits [0-9]
ltr [a-zA-Z]
alphanum [a-zA-Z0-9]
%%
(-|\+)*{digits}+ printf("found number: '%s'\n",yytext);
{ltr}(_|{alphanum})* printf("found identifier: '%s'\n",yytext);
\. printf("found character: {%s}\n",yytext);
. { /* ignore others */ }
%%
int main(int argc, char **argv)
{
yylex();
}