Preface

Deciding to go to university was motivated by a want to understand how to program at the hardware-level. This was accomplished early in the curriculum by means of taking Computer Organization.  The course satisfied my curiosity in this regard, but it opened up a new avenue of curiosity that would continue to inspire me to succeed. Computer Organization showed me a mapping of two different languages. The mathematics of propositional logic were used to scaffold assembly.

My time learning to program up to that point had an undercurrent of realizing that given a single problem, there are many, (if not infinite), solutions. The scaffolding shown through Computer Organization confirmed this intuition and highlighted the fact that this sentiment also applies to how a given problem can be represented as well. The Algorithms course I took the following semester generalized the concept of an algorithm and made it programming-language agnostic. This would serve as a prerequisite for the course that would synthesize these concepts.

I took Translation of Programming Languages during the fall semester of 2019. Students of the course were tasked with building a compiler that Translates a high-level functional language to a low-level assembly language. Lectures consisted of covering the various components and steps in the process of translation. Despite the effort required to succeed, I look back on this course as a favorite. It helped me understand programming languages as an abstraction and thus helped trivialize the notion of learning a new programming language or framework.

Detailed below is documentation of the project. A repository for the compiler can be found via GitHub.


Project: Klein Compiler

The following acts both as documentation and as a progress journal for a compiler written in Python. The compiler's input language is a language called Klein which operates in the functional paradigm. The target language is an assembly language called Tiny Machine. Specifications of these two languages can be found in the concluding notes section of this page.

This implementation of the compiler includes all features described in the language specification except:

  • Arithmetic operations operate from right to left when stringed with operations within their respective set: ({+,-},{*,/}).
    • This likely stems from a flaw in the evaluation of a resultant abstract syntax tree
    • This can be prevented by taking full advantage of parenthesis to group respective operands.

The Repository

It's worth noting the structure of the repository for this implementation of the compiler. The key directories are as follows:

  • The home directory is KLEINcompiler/
    • Error logs are output at this level
    • Various shell scripts are housed here to test the stages of the compiler. Each represents running the compiler for each implementation stage, or phase, of the project:
      • ./kleins : scanner; creation of a set of tokens
      • ./kleinf : parse table lookups
      • ./kleinp : abstract syntax tree creation
      • ./kleinv : type checker
      • ./kleinc : code generation; the completed compiler
      To run each script run the following command:
          sh <script name> <program to be tested>
      
      For example:
          sh kleinc programs/exclusive_or.kln
      
  • The doc directory houses various forms of documentation, housed in folders with respect to the component of the compiler they belong.
  • The programs directory houses a small handful of Klein programs that were created for use of the compiler. It also houses a set of programs provided by Dr. Wallingford within the class-programs folder.
  • The tests directory houses various Klein programs used for testing the various stages of the complier.
  • The src directory houses all files used for implementation of the compiler.
    • src/drivers houses files used to run the shell scripts, noted above.

Flaws in implementation

It's worth noting flaws of implementation. These illuminate steps that should be taken to fine-tune the compiler as a future project. They ultimately are a reflection of the result of the project and mark lessons learned.

  • A lot of the logic is loaded into AST_node.py. This is the file which houses the class information for the abstract syntax tree. The layout itself is quite clever, but type-check and code-generation methods are added to them. This poses a problem in terms of separation of responsibility. The logic in AST_node.code_gen() is specific to outputting code in TM. Provided more time, exploration should occur here as a means to refactor theser methods into some exterior mechanism beside the abstract syntax tree.
  • The TM code being output is not efficient, ignoring the lack of any optimizer implementation. Most instructions that are output have the value stored in some memory position. This produces a lot of redundant instructions to IMEM and a lot of redundant data to DMEM. The most egregious lack of optimization comes from the IdentifierNode outputing a load instruction. This was an adhoc fix to print statements having no means to access an argument.
  • The error handling needs to be expanded upon. Effort was strong in the beginning of the project, but stalled out when time started becoming more of a premium and workload was shifted away from this effort. Error returns pertaining to parse errors do not help the user of a compiler as they are more geared for compiler debugging.

Implementation of Compiler: Progress Checks

The following list of phases represent a step of progress implementing each component of the compiler. Interact with the header to expand its details.

  • Associated files of implementation:
    • kleins
    • src/k_token.py
    • src/scanner.py
  • Scanner Description:

    The scanner is the portion of the compiler which takes input a Klein program as a single string and splits the string into a set of tokens. The token class is described in k_token.py. The scanner steps through each character of the program string and uses a set of conditions to determine if a resultant character, (or string of characters), is a valid token. These conditions are based on the possible terminals present in the language specification. These may either be a single character, such as the case of the unary and binary operators, or can be a string, such as any identifier or number. Thus, these conditions are based on a set of regular expressions, as noted in the scanner portion of the documents folder. These regular expressions are used to build the conditions in place, represented by the FSA diagram which is also housed in the same folder.

  • Associated files of documentation:
    • doc/scanner/regex.txt
    • doc/scanner/regex_FSA.jpg
      • doc/regex_FSA.jff is the jflapfile used to create the above jpeg
    • doc/scanner/scanner_status_check.txt
  • Date of completion: 09-20-2019
  • Associated files of implementation:
    • kleinf
    • src/parser.py
    • src/parse_table.py
  • Syntactic Analyzer Description:

    The first phase of the parser is the implementation of the component which decides if a program is syntactically correct. That is, the component makes sure that any given combination of tokens are in valid order. This is determined by the grammar of the Klein language, which has been refactored to eliminate the need for the parser to have to read-in more than one token at a time from the scanner to decide whether or not a combination of tokens is valid.

    The validity is determined by a parse table, which is determined by the first and follow sets of every declaration statement within the refactored grammar. These first and follow sets help us know every possible token which leads a nonterminal and every possible token that ends a terminal. With this information we can surmise all possible combinations while considering nonterminal composition (including nonterminals composed of other nonterminals). This is then mapped to a table using a specific algorithm which considers whether or not a grammar statement has an empty character in its first set.

    Thus, the parser has a parse stack which assumes an end of file token exists in addition to a nonterminal which represents the program as a whole. It takes a look at the top element of the stack and uses it to look up the valid combination of preceding tokens while considering the current token that is being read by the scanner. The parser walks the table using this logic, pushing terminals and nonterminals onto the parse stack based on the result of the current index of the parse table until either an invalid combination of tokens is discovered or the end-of-file token is reached.

  • Associated files of documentation:
    • doc/parser/extended_grammar.txt
    • doc/parser/first_and_follow_sets.txt
    • doc/parser/parse_table.pdf
      • doc/parser/parse_table.ods
  • Date of completion: 10-04-2019
  • Associated files of implementation:
    • kleinp
    • src/parser.py
    • src/parse_table.py
    • src/AST_node.py
  • Abstract Syntax Tree Description:

    The second phase of the parser is to build an abstract syntax tree which represents the various expressions and operations of the input program, (in the most general sense). These expressions and the resultant operations are put into data structures which will later be used to convert the statements into their equivalents during code generation.

  • What was finished:
    • The inclusion of the semantic stack and the implementation of the parse algorithm to communicate between the parse stack and the semantic stack.
    • The inclusion of semantic actions within the parse table.
    • An enumeration class in the parse table which is used to evaluate the semantic actions returned on the parse table.
    • When the semantic action is evaluated with respect to the parse algorithm, it is used to index into a dictionary called object_factory:
      • The object factory returns the relevant AST_node class which is used to construct the relevant node object within the parse algorithm.
      • The resultant node is then pushed onto the semantic stack. The resultant node is also passed the semantic stack upon construction. Logic will exist within the node's constructor to know what to do with the stack and the various nodes and relevant tokens that reside in it.
  • What is not finished:
    • The AST nodes themselves are not done. A couple of simple nodes have been finished; those that only really need to house a value - such as an identifier.
    • No error handling is in place once the AST nodes are ready for testing.
  • What else has been accomplished:
    • Errors regarding the previous phase with the parse table have been fixed.
    • Error classes have been expanded upon. Each error writes to a relevant text file the initial program and the associated stdout error message.
      • Lexical errors append the remaining program string to its error log.
      • Parse errors append the parse stack trace to its error log.
    • The scanner returning a none value when evaluating comments has been fixed.
      • Existing logic was extended to accomplish this. This has resulted in some messy code that needs to be refactored.
  • Date of completion: 10-18-2019
  • Associated files of implementation:
    • kleinp
      • To run, feed a program through kleinp. It will print back node information as the AST is being built. If there is a type error, a Semantic Error will be thrown.
    • src/parser.py
    • src/AST_node.py
  • Type Checker Description:

    The next phase of the project was to augment the abstract syntax tree with type information. This is used to make sure a user is inputting the correct type of data for a program to execute. This was handled by adding type checking methods to the ASTnode class within AST_node.py and made changes to a subclass' method where applicable. The type check occurs after the parser builds the abstract syntax tree, where the parser grabs the outer node off the semantic stack and calls process_node() which recursively descends into the composition of nodes, calling typeCheck() when applicable.

    When a type error is found, the recursive function starts to back-track, sending the error information as an error object back to the parser. If the error object exists on the semantic stack when the parser tries to call process_node(), it will instead raise an exception, feeding extra information about the program to it.

  • What is not finished:
    • Error handling needs to be expanded upon. As of now, the error messages are pretty slim.
    • Some code cleanup would be nice. This includes implementing correct OOP design such as making sure accessors are correctly called, as opposed to directly referring to an object's property. Some of the lines of code also need to be line-wrapped/escaped to allow easier viewing on smaller screen resolutions.
      • There exists some duplicate code within AST_node.py. Note the inclusion of the list functions. There is also some ad-hoc code duplication wihtin process_node(). This is in place to allow the recursive function to back-track once an error is found. The flag which causes this is the inclusion of 'errob' (an object class which needs to be renamed...) on the function_record list.
  • Date of completion: 11-01-2019
  • Associated files of implementation:
    • kleinc
    • AST_node.py
    • code-generator.py
  • Code Generator Description:

    The first phase of implementing the code generator involved priming behavior. What's meant by this is that the state of the generator was to set up the environment such that the address space for a main function is added to the runtime stack and relevant information is saved to the relevant registers. The body of the main is executed and registers are restored as per the TM specification. Lastly, functionality was put in place with respect to evaluating literals and print statements.

  • Schema of environment:
    • The Stack Frame:
      :----------------------:
      :     return addr      : 0
      :----------------------:
      :     return value     : 1
      :----------------------:
      :     arg  0           :
      :----------------------:
      :     args             :
      :----------------------:
      : register values(0-6) : - can update to only save whats modified
      :----------------------:
      :     temp space       :
      :----------------------:
      
    • Registers:
         :------------------:
      r0 :                  :
         :------------------:
       1 :                  :
         :------------------:
       2 :                  :
         :------------------:
       3 :                  :
         :------------------:
       4 :                  :
         :------------------:
       5 :                  :
         :------------------:
       6 :   top  (stack)   :
         :------------------:
       7 :    PC            :
         :------------------:
      
    • DMEM:
      0 :-------------------:
      | :                   :
      | :-------------------:
      V :                   :
      . :-------------------:
      . :                   :
        :-------------------: Top (R6)
        :        |          :
        :        |          :
        :        |          :
        :        |  Grows   :
        :        V          :
        :                   :
        :                   :
        :-------------------:
      
    • IMEM:
      0 :---------------------:
        : LDA 7 , <main addr> :
        :---------------------:
        :         .           :
        :         .           :
        :         .           :
        :         .           :
        :         .           :
        :---------------------: main
        : / / / / / / / / / / :
        : / / / / / / / / / / :
        :---------------------:
      
  • What is not finished:
    • Better register management is still under consideration.
    • Setting up three-address-code templates for each node.
  • Date of completion: 11-15-2019
  • Associated files of implementation:
    • AST_node.py
    • code-generator.py
    • kleinc
  • Phase 6 expands on the functionality of phase 5. The goal was to complete what wasn't finished. The following has been implemented:
    • Function Calls
    • Arithmetic Operations
      • Addition Operation
      • Subtraction Operation
      • Division Operation
      • Multiplication Operation
    • Boolean Connective Operations
      • And Operation
      • Or Operation
    • Boolean Comparison Operations
      • Less-Than Operation
      • Equal-To Operation
    • Unary Operations
      • Not Operation
      • Negation Operation
  • What needs to be implemented:
    • Grabbing arguments from the initial TM execution
      • The compiler does have logic that places the initial DMEM arguments into the control stack.
    • There currently exists a logical error that allows klein programs to have arguments with the same name.
    • Some more clear error handling
  • Date of completion: 12-06-2019
  • What was completed:
    • TM programs now allow arguments to be fed in through the command line.
      • The DMEM allocated to the initial main function call now factors the arguments that may exist when the TM machine runs.
      • Function declaration output no longer saves and restores register values for temporary register usage.
      • Negation and Not operation TM output is fixed.
      • Type checking error messages have been refined and are more informative.
      • A whole slew of klein programs have been created for testing purposes.
  • Date of completion: 12-13-2019

Concluding notes

The Klein language specification was written by Professor Eugene Wallingford of the University of Northern Iowa.

I chose the name to indicate the size of the language. klein is the German word for "small" or "little". It is one of the first German words I learned back in the eighth grade. - Eugene Wallingford
My looking back favorably on this course has a lot to do with having Dr. Wallingford as a professor. He was well equipped to teach the course and I always felt that he was very engaging with my level of curiosity.

Tiny Machine (TM) is an assembly language written by Kenneth Louden. The specifications for both languages are in the following expandable sections.

Klein is a small, mostly functional language that is designed specifically to be used as a manageable source language in a course on compiler design and implementation. Though small and simple, the language is Turing-complete.

Grammar

        <PROGRAM> ::= <DEFINITIONS>

        <DEFINITIONS> ::= ε
                        | <DEF> <DEFINITIONS>

                <DEF> ::= function <IDENTIFIER> ( <FORMALS> ) : <TYPE>
                             <BODY>

            <FORMALS> ::= ε
                        | <NONEMPTYFORMALS>

    <NONEMPTYFORMALS> ::= <FORMAL>
                        | <FORMAL> , <NONEMPTYFORMALS>

             <FORMAL> ::= <IDENTIFIER> : <TYPE>

               <BODY> ::= <PRINT-STATEMENT> <BODY>
                        | <EXPR>

               <TYPE> ::= integer
                        | boolean

               <EXPR> ::= <EXPR> < <SIMPLE-EXPR>
                        | <EXPR> = <SIMPLE-EXPR>
                        | <SIMPLE-EXPR>

        <SIMPLE-EXPR> ::= <SIMPLE-EXPR> or <TERM>
                        | <SIMPLE-EXPR> + <TERM>
                        | <SIMPLE-EXPR> - <TERM>
                        | <TERM>

               <TERM> ::= <TERM> and <FACTOR>
                        | <TERM> * <FACTOR>
                        | <TERM> / <FACTOR>
                        | <FACTOR>

             <FACTOR> ::= if <EXPR> then <EXPR> else <EXPR>
                        | not <FACTOR>
                        | <IDENTIFIER> ( <ACTUALS> )
                        | <IDENTIFIER>
                        | <LITERAL>
                        | - <FACTOR>
                        | ( <EXPR> )

            <ACTUALS> ::= ε
                        | <NONEMPTYACTUALS>

    <NONEMPTYACTUALS> ::= <EXPR>
                        | <EXPR> , <NONEMPTYACTUALS>

            <LITERAL> ::= <NUMBER>
                        | <BOOLEAN>

    <PRINT-STATEMENT> ::= print ( <EXPR> )

Syntax Features

  • These are the reserved words of Klein:
        integer     boolean
        true        false
        if          then       else
        not         and        or
        function    print
    
  • print is a primitive identifier. true and false are boolean literals. The rest are keywords.
  • Klein reserved words may not be used as user-defined names of functions or formal parameters.
  • Klein reserved words and identifiers are case-sensitive.
  • Identifiers must be no longer than 256 characters.
  • The following are the primitive operators and punctuation marks of Klein:
        +           -          *        /
        <           =          (        )
        ,           :          (*       *)
    
  • Klein operators and punctuation are self-delimiting.
  • Integer Literals must be in the range from 0 to 231-1, inclusive.
  • A comment in Klein begins with the characters (* and continues up to the next occurrence of the characters *). Any characters inside a comment are ignored.
  • A function many have zero or more formal parameters. The scope of a formal parameter is the body of the function. Arguments are passed by value.
  • Binary operators and function calls evaluate their arguments from left to right.
  • Whitespace consists only of blanks, tabs, and the end-of-line characters \n and \r. Whitespace characters may not appear inside a literal, identifier, keyword, or operator. Otherwise, whitespace is insignificant.

Data

All data in Klein is an integer or a boolean, and nearly every element in a program is an expression that produces an integer result or a boolean result.

Atomic Expressions

  • There are only two boolean values. The two primitive boolean literals are true and false.
  • Integer literals are strings of digits. There are no leading plus or minus signs to indicate positive or negative values; all integer literals are positive. Leading zeros are not permitted for non-zero integer literals.
  • Klein supports integer values in the range of -231 to 231-1.
  • User-defined identifiers are strings beginning with a letter and consisting of letters, digits, and the underscore.

Compound Expressions

The language provides the following kinds of expression:

  • Arithmetic
    Adds, subtracts, multiplies, or divides two integers.
         x + y
         x - y
         x * y
         x / y
    
  • Boolean Comparisons
    Compares two integers, yielding one of the boolean values true or false. < yields true if its left operand is less than its right operand, and false otherwise. = yields true if its left operand has the same value as its right operand, and false otherwise.
         x < y
         x = y
    
  • Boolean Connectives
    Negates a single boolean value, or computes the disjunction or conjunction of two boolean values. The unary not yields true if its operand is false, and false otherwise. or yields true if either its left operand or its right operand yields true, and false otherwise. and yields true if both its left operand and its right operand yield true, and false otherwise.
         not x
         x or y
         x and y
    
    or and and short-circuit evaluation when possible.
  • Conditional Selection
    Evaluates a test expression, and uses its value to select one of two expressions to evaluate. Yields the value of the first of these expressions if the test expression produces a true value, and the value of the second if the test expression yields a false value. The else clause is required.
    For example:
         if flag < 0 then
            x + y
         else
            x - y
    
    produces the sum of x and y if flag is less than 0; otherwise, it produces their difference.
  • Function call
    Applies a function to zero or more arguments, and yields the value of the expression in the body of the function. All functions return an integer value or a boolean value; Klein has no notion of a "void" function.
    For example:
         f( x+y, 1 )
    
    computes the sum of x an dy, passes that value and a 1 to the function f, and produces the value returned by applying the function to its arguments.
  • Miscellaneous
    Compound expressions can be nested to any depth.

    Note that the only user-defined identifiers in Klein ar ethe names of functions and the names of formal parameters to functions. There are no "variables".

User-Defined Functions

  • Each function declaration consists of the function's name, its formal parameters and their types, the type of the funtion, and its body.
  • Function names are unique.
  • A function may refer only to its formal parameters and to other functions.
  • A Klein function may call itself.

Primitive Functions

  • For the purposes of user interaction, Klein provides the primitive function print(expression). For example:
         print( x+y )
    
    print writes its argument on standard output, followed by a new line character.
  • Unlike all user-defined functions, the value of a call to print is undefined. For this reason, if a function contains a call to print, that call must come at the top of the function body.

Programs

  • A Klein program consists of a sequence of function definitions. Every program must define a function named main, which is called first when the program runs.
  • Users may provide arguments to main on the command line. The result returned by main is printed on standard output.
  • For example, here is a complete Klein program that computes the absolute value of its argument:
         function main( n : integer ) : integer
            if n < 0
               then -n
               else n
    
    If this program were compiled into an executable file named abs, then running it under Unix might look something like this:
        mac os x > abs -3
        3
    

TM is a simple target machine that has an architecture and instruction set complex enough to illustrate the important issues faced when writing a compiler, yet simple enough to not distract with unnecessary details.

Architecture

  • TM provides two kinds of memory:
    • instruction memory, which is read-only
    • data memory
  • Memory addresses are non-negative integers. When the machine is started, all data memory is set to 0, except for the first memory location. That location contains the value of the highest legal address.
  • An extended version of the TM interpreter is used which accepts command-line arguments to the TM program and stores them in memory locations 1 through n.
  • TM provides eight registers, numbered 0 through 7. Register 7 is the program counter. The other seven registers are available for program use. WHen the machine is started, all registers are set to 0.
  • When the machine is started, after memory and registers have been initialized, TM begins execution of the program beginning in the first location of instruction memory. The machine follows a standard fetch-execute cycle:
    • fetch the current instruction from the address indicated by the program counter
    • increment the program counter
    • execute the instruction
  • The loop terminates when it reaches a HALT instruction or when an error occurs. TM has three native error conditions:
    • IMEM_ERR, which occurs in the fetch step whenever the address of the next instruction to be executed is out of bounds
    • DMEM_ERR, which occurs in the execute step whenever the address of a memory access is out of bounds
    • ZERO_DIV, which occurs in the execute step whenever the divisor to a div is zero

Instruction Set

  • TM provides two kinds of instructions: register-only and register-memory.
  • Register-only (RO) instructions are of the form
        opcode r1,r2,r3
    
    where the ri are legal registers.
  • These are the RO opcodes:
    IN      read an integer from stdin and place result in r1; ignore operands r2 and r3
    OUT     write contents of r1 to stdout; ignore operands r2 and r3
    ADD     add contents of r2 and r3 and place result in r1
    SUB     subtract contents of r3 from contents of r2 and place result in r1
    MUL     multiply contents of r2 and contents of r3 and place result in r1
    DIV     divide contents of r2 by contents of r3 and place result in r1
    HALT    ignore operands and terminate the machine
    
  • Register-memory (RM) instructions are of the form
        opcode r1,offset(r2)
    
    Where the ri are legal registers and offset is an integer offset. offset may be negative. With the exception of the LDC instruction, the expression offset(r2) is used to compute the address of a memory at location:
        address = (contents of r2) + offset
    
  • There are four RM opcodes for memory manipulation:
    LDC    place the constant offset in r1; ignore r2
    LDA    place the address address in r1
    LD     place the contents of data memory location address in r1
    ST     place the contents of r1 to data memory location address
    
  • There are six RM opcodes for branching. If the value of r1 satisfies the opcode's condition, then branch to the instruction at memory location address.
    JEQ    equal to 0
    JNE    not equal to 0
    JLT    less than 0
    JLE    less than or equal to 0
    JGT    greater than 0
    JGE    greater than or equal to 0
    
  • All arithmetic is done with registers (not memory locations) and on integers. Floating-point numbers must be simulated in the run-time system.
  • There are no restrictions on the usage of registers. For example, the source and target registers for an operation can be the same.
  • The note above is true of the program counter, Register 7. For example,
    • To branch unconditionally to an instruction, a program can load the target address into the PC using an LDA instruction.
    • To branch unconditionally to an instruction whose address is stored in data memory, a program can load the target address into the PC using an LD instruction.
    • To branch conditionally to an instruction whose address is relative to the current position in the program, a program can use the PC as r2 in any of the jxx instructions.