I believe this understanding is important to all developers. Developers make more efficient software when they know how their tools work: there’s always more going on under the hood of Java or Python programs than what has been written.
A compiler, the software that is usually related to a programming language, translates high-level, human-readable programming code to ones and zeros, known as machine code. It does this by a complex process of transforming data. It starts with reading an input file, the “program”, and ends with producing an executable software or other output. A transpiler follows a similar process, but instead of producing an executable software, it chooses to produce code for another human-readable language. The diagram in Figure 1 should help you visualize the relationship between code and programs, and compilers and transpilers.
The three major stages to a compiler are lexical analysis, where you break the file up into words and numbers and symbols; semantic analysis when a tree of operations is made; and finally code generation, where the tree is translated into computer instructions.
Interpreters are like compilers, but require the program source code to execute the program as it compiles it. These typically produce better error messages due to the available high-level program code. Though, compilers commonly make faster programs than interpreters.
The speed of a program is determined by how fast it is at mapping its inputs to its outputs. Since interpreters have to translate instructions and map inputs to outputs, they are usually slower at producing outputs than programs generated by compilers. Compilers generate programs that only map inputs to outputs, and don’t contain the original source code.
The compiler will open and read a text file, being the program, and transform the characters into tokens, the output of the lexical analyzer, which is better known as the tokenizer. These tokens are passed into the semantic analysis phase for interpretation after the input file has been completely tokenized.
Tokens are symbolic units which make up a program. A token, in Java, for example, may be a keyword, integer constant, left parenthesis, equal-equal operator, or string. Tokens are attributed to their type and value. For example, a left parenthesis token would be of type “Symbol”, and value ‘(’. Alternatively, for a string, it may be of type “String”, and value “the string”.
Some compilers treat comments as tokens, and others ignore them. Some languages require that instructions be separated by a line, and thus include line ends as tokens. After tokens are constructed, they’re passed onto the semantics phase.
You can imagine tokens being passed in an array. And each is in their linear order. The purpose of semantic analysis, better known as parsing, is to turn the linear order into a tree of operations. That is, instead of each token one after the other, you would have each token connected logically. A tree is the best way to do this. Each type of connection has some dependencies, and they can be recursive. For example
I’ve emphasized that semantic analysis looks for patterns. It’s important to understand that the only possible way to build a tree is to know what to look for after you find a number, or a plus sign, or an opening parenthesis. The parser is very predictable: every token guides its search.
After all of the tokens have been sifted through and put in their logical orders, this tree, made of many sub-trees, will be passed on. This tree of operations is commonly referred to as the abstract syntax tree or AST. An AST may also hold type information or generate type inference instructions.
An example of generating type inference instructions is the way the Java compiler knows when a program says
2.5 * 3 to multiply double times a double, rather than doing a complex operation of multiplying two completely unique types: a double times an integer. The conversion node is inserted in the tree after generating and analyzing it.
Functional languages make expressing trees easier. If you know a functional language, I suggest you learn compiler development using it, as opposed to an imperative language like C++ or Java. A tree can be expressed syntactically in a functional language, while an imperative language may require using lots of pointers or objects. A language like C++ would express the AST using pointers and allocations, and tokens using classes, or structs, or even the preprocessor; while a language like Java or C# would use classes and objects to represent tokens and nodes in the AST. If you need a good language to build a compiler, choose the language you are most comfortable using.
After parsing, the compiler generates instructions for an assembly language. An assembly language is a human-readable language that translates directly to binary. Unlike high-level languages, assembly languages don’t contain data structures like classes or structs or functions. Every software that runs on a computer is in binary and thus was assembled from assembly code at one point.
Assemblers are like compilers for assembly code. Compilers translate their high-level languages into the assembly for assemblers to compile into machine code. Assemblers translate the instructions nearly 1:1 into what is technically called relocatable machine code— .o or .obj files. Relocatable machine code are better known as relocatable. There is always one relocatable for each source file, e.g. each .c, .cpp, or .asm. But that means the output of an assembler is one or more non-executable files. How do we construct an executable from these files?
Linkers take relocatable as input and generate executables, library files, or a single relocatable. You can imagine a linker stitching together all of the modules to construct a single file. Because the relocatable contain addresses and have requirements on other modules, the linker rearranges the memory of these files and compiles them together into a single file.
- Consider your favourite programming language. Is it a compiler or interpreter?
- What are the advantages and disadvantages of compilers and interpreters?
- Give five examples of tokens you write every day in your programs.
- Draw or imagine a parse tree for a math formula.
- Examine these two programs that have been compiled to assembly. Which has fewer generated assembly instructions? Which do you think is faster? https://godbolt.org/z/8DckfW