To continue:
As you presumably know, in order to run any computer program, you first need a translator that can convert the program into a form that the computer can run. There are three basic classes of translators: assemblers, which convert a (more or less) direct representation of the CPU's operations into the actual running form; interpreters, which read one line of a language and perform the operation directly; and compilers, which convert one language into another, usually converting a high-level language into a low-level one (assembly language or machine code). There is some overlap between these; for example, most interpreters partially compile the language into an intermediate representation which can be more easily interpreted, while many compilers target an intermediate language (variously called 'p-code', 'bytecode', or 'CIL') which is then interpreted on a pseudo-machine emulator. Because interpreting and assembling are generally seen as a subset of compiling (that is to say, you need more or less the same skills for all three, but compiling is usually the most complicated of them), most discussions of the subject assume you'll be designing a compiler.
Compiling generally has three main stages, Lexical analysis (breaking the source code into a stream of tokens), parsing (processing the token stream for it's grammatical structure), and code generation (the resulting output, whether as an executable file or a source file for a different language). Most compilers use some sort of intermediate stage between the parser and the code generator (to separate the parts specific to the language from the parts specific to the target system, making it easier to re-target the compiler to a different system), and many have some sort of optimizer either between the parser and the code generator, or after the code generator, or both. It is
possible
to write a compiler in some other way, but this is by far the best-known way to do it, and almost always the easiest.
The first step (whether you are implementing an old language or a new one) is to write a grammar for the language. Almost all modern programming languages are designed so that they can be defined with a context-free grammars, often written in some variant of Backus-Naur Form. While this only defines the syntax of the language, it serves as the 'skeleton' for the semantics, and defines the tokens which the lexer will need to accept. For almost all modern programming languages, the grammar is the core of the design, especially since much of what might otherwise be 'semantic' can be pushed off into the syntax.
You will need to then decide on the semantics of the grammar, that is to say, what the language should do with a given parsed statement. This is one of the hardest parts, as there are no generally accepted formalisms like EBNF for defining the syntax (and those formalisms that do exist are often hard to work with, and/or specific to certain types of languages). Fortunately, if you were careful about the grammar design, the semantics should not be too difficult to decide upon. While I am describing this as a linear process, it's usually an iterative one, i.e., you may need to go back and forth between the grammar and the semantics as you design them.
To write the lexer, you generally want to define a Deterministic Finite State Automata which can 'recognize' the tokens of the language. It is not too difficult to write a lexical analyzer by hand, but the task is simple enough that it can be automated through a variety of tools such as LEX. However you write it, it should turn the source text into a stream of tokens, with each token comprised of a description or type (e.g., IF_STATEMENT, FLOAT_NUMBER, IDENTIFIER, etc.) and the actual text of the lexeme. The lexer usually will also generate a symbol table of all of the identifiers (text words not otherwise recognized, such a variable names) which creates a 'placeholder' for the parser and code generator to use later.
As with the lexer, the parser can either be written by hand, or using a parser generator such as YACC or ANTLR. Generally speaking, the parser is written based directly off of the grammar. However, there is a complication: there often is more than one way to define a language's grammar, and most approaches to parsing only work with certain classes of grammars. Most hand-coded parsers use a method called 'recursive descent', which is easy to write by hand but hard to generate automatically; RD parsers need what is called an LL(1) (left-to-right, leftmost derivation) grammar, while many parser generators require an LALR(1) (left-to-right, rightmost derivation) grammar. While they define the same language, they represent it in different ways, which affect how it is processed.
Unlike lexical analysis and parsing, there really are no established methods for defining a code generator; by its nature, defining the semantics of a Turing-complete language is more difficult than defining a context-free grammar. There are a number of approaches to formal semantics around, but each of them have weaknesses which limit their acceptance. Furthermore, the code generator itself relies on a detailed understanding of the target language, and mixing up the details of the target with the semantics of the source language makes portability nearly impossible. For these reasons, among others, it is common to separate the parser from the code generator with an intermediate stage, often in the form of an intermediate language which describes the grammar productions. Recurive-descent parsers often skip this, and emit code directly, but table-driven compilers almost always have some sort of intermediate code.
Optimization is a subject unto itself, and is something of a black art; but generally speaking, there are two places where optimization can be performed, on the intermediate language or parse tree, and on the code that is emitted. The latter ('peephole optimization') is generally more limited, as a lot of the semantic information has already been discarded, but can apply optimizations that are more specific to the target language. Parse tree optimizations tend to be more abstract ones, such as common subexpression elimination, and usually apply to any kind of target.
This thread includes an attachment with the source code for a simple compiler for a subset of an older language named Algol. The compiler is written in Python, and while it is an unusual design in some respects, it should serve as an example of a working (if very limited) compiler. |