Metamorphing Machine I rather be this walking metamorphosis
than having that old formed opinion about everything!

Let's build a transpiler! Part 1

Let's Build a Compiler is a tutorial on building a compiler from scratch, written by Jack Crenshaw between 1988 and 1995 in 16 installments.
Even though he worked on it for almost 7 years, it is incomplete. If memory serves me well, the reader is left with no working compiler at the end.
Following this noble tradition, I'll start a series of posts about building a transpiler.

(Just note that while Mr. Crenshaw explains everything in excruciating detail, my style is a lot more succinct.
I mention something and then drop the code. You are the one who must make the connection between the former and the latter...)

But wait! What is a transpiler?

A transpiler is software that reads source code and, instead of converting it to machine code, converts it to source code in another programming language.
That output can then be compiled to machine code.

I intend to build a transpiler from Visual Basic source code to another language like Java, C, or JavaScript (I have not made up my mind yet).
We'll have a lot of challenges to overcome.

Let's start with the basics

Source code is stored in files. We need to open those files and read them character by character.
Due to my plans for the transpiler, I can't restrict it to the letters A to Z. We must be able to read letters from other writing systems, like Chinese or Cyrillic.
So, our source code files should be Unicode. Now, Unicode can be encoded in several different ways, like UTF-8 or UTF-16.
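To make the difference between those encodings concrete, here is a minimal C sketch that encodes a single code point both ways. The function names encode_utf8 and encode_utf16le are mine, not from any library, and the code assumes it is handed a valid code point (no surrogates, nothing above U+10FFFF).

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8. Returns the number of bytes written (1 to 4).
   Assumption: cp is a valid code point; no validation is done here. */
size_t encode_utf8(uint32_t cp, uint8_t out[4]) {
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (uint8_t)(0xE0 | (cp >> 12));
        out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
}

/* Encodes one code point as little-endian UTF-16.
   Returns the number of bytes written (2 or 4). */
size_t encode_utf16le(uint32_t cp, uint8_t out[4]) {
    if (cp < 0x10000) {
        out[0] = (uint8_t)(cp & 0xFF);
        out[1] = (uint8_t)(cp >> 8);
        return 2;
    }
    /* Code points above U+FFFF are split into two 16-bit units. */
    uint32_t v = cp - 0x10000;
    uint16_t hi = (uint16_t)(0xD800 | (v >> 10));
    uint16_t lo = (uint16_t)(0xDC00 | (v & 0x3FF));
    out[0] = (uint8_t)(hi & 0xFF);
    out[1] = (uint8_t)(hi >> 8);
    out[2] = (uint8_t)(lo & 0xFF);
    out[3] = (uint8_t)(lo >> 8);
    return 4;
}
```

The Cyrillic letter Ж (U+0416), for instance, takes the two bytes D0 96 in UTF-8 and the two bytes 16 04 in little-endian UTF-16.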

Off the top of my head, we can deal with it in at least three ways:

With that out of the way, we now need to define how we can recognize those letters, as the Unicode standard has several of them.
We would usually need a big table to detect what is a letter and what is not, but due to Visual Basic's use of To in a Select Case, we have a more compact way to check that.
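For comparison, the "big table" approach could look like the C sketch below: a sorted array of inclusive ranges, searched with a binary search. The Range type, the function names, and the tiny table (just a handful of ranges borrowed from the Select Case shown later in this post) are illustrative assumptions; a real table would list every range.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One inclusive range of UTF-16 code units. */
typedef struct {
    uint16_t first;
    uint16_t last;
} Range;

/* Illustrative excerpt only; it must stay sorted for the search to work. */
static const Range kLetters[] = {
    {65, 90}, {97, 122}, {170, 170}, {181, 181}, {186, 186},
    {192, 214}, {216, 246}, {248, 705}
};

/* Binary search over sorted, non-overlapping ranges. */
bool in_ranges(uint16_t cp, const Range *table, size_t n) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < table[mid].first) {
            hi = mid;
        } else if (cp > table[mid].last) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

bool is_letter_sketch(uint16_t cp) {
    return in_ranges(cp, kLetters, sizeof kLetters / sizeof kLetters[0]);
}
```

The Select Case version says the same thing with far less ceremony, which is why I went with it.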

And now, let's start discussing identifiers.
There are programming constructs that need to be named, like variables, constants, functions, etc.
The most common rules to come up with these names are:

There are exceptions to those rules, but we'll deal with them later.
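Judging by the code later in this post, the rules boil down to: start with a letter, continue with letters, digits, or underscores, and stay within 255 characters. Here is a small C sketch of that check. It is ASCII-only for brevity (it relies on the default "C" locale), and the function name is made up for this example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

#define MAX_IDENTIFIER_LENGTH 255

/* Returns true if s looks like a valid identifier.
   Assumption: ASCII letters only; the real thing must accept Unicode letters. */
bool is_valid_identifier(const char *s) {
    size_t len = strlen(s);

    if (len == 0 || len > MAX_IDENTIFIER_LENGTH) return false;

    /* The first character must be a letter. */
    if (!isalpha((unsigned char)s[0])) return false;

    /* The rest may be letters, digits, or underscores. */
    for (size_t i = 1; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isalnum(c) && c != '_') return false;
    }
    return true;
}
```

So "Count" and "my_var2" pass, while "_x", "2x", and "a-b" do not.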

There's one missing piece before we can start coding. In which language will we write our transpiler?
For these posts, I'll be using VB itself.

Here is how we will open a file, read it character by character, and display only the valid identifiers.
Note: the comments below are redundant, but they are there for the sake of those who cannot "read" VB.

Rem This code goes inside a module.

Sub Main()
    Dim Cp As Integer
    Dim Fh As Integer
    Dim Id As String

    Rem Get an available file number.
    Fh = FreeFile
    Rem File path for the source code is passed as a command-line argument.
    Open Command$ For Binary As #Fh
    Rem Ensuring we close the file in case we have an error.
    On Error GoTo CloseIt

    Rem While we do not reach the end of file...
    While Not EOF(Fh)
        Rem ...read a codepoint from it.
        Get #Fh, , Cp

        If IsLetter(Cp) Then
            Id = ReadIdentifier(Fh, Cp)
            Debug.Print Id
        End If
    Wend

CloseIt:
    Close #Fh
    Rem This is equivalent to a Throw in a Catch.
    If Err.Number Then Err.Raise Err.Number
End Sub

Private Function ReadIdentifier(ByVal FileHandle As Integer, ByVal CodePoint As Integer) As String
    Dim Buffer As String * 255
    Dim Pos As Long
    Dim Cp As Integer
    Dim IsOK As Boolean
    Dim Count As Integer

    Count = 1
    Mid$(Buffer, Count, 1) = ToChar(CodePoint)

    Do While Not EOF(FileHandle)
        Get #FileHandle, , Cp

        Rem Underscores, digits, and letters may follow the first letter.
        IsOK = Cp = AscW("_")
        If Not IsOK Then IsOK = Cp >= AscW("0") And Cp <= AscW("9")
        If Not IsOK Then IsOK = IsLetter(Cp)

        If Not IsOK Then
            GoSub UngetChar
            Exit Do
        End If

        Count = Count + 1
        Rem Check the length before writing, as the buffer holds 255 characters only.
        If Count > 255 Then Err.Raise vbObjectError + 13, , "Identifier too long"
        Mid$(Buffer, Count, 1) = ToChar(Cp)
    Loop

    ReadIdentifier = Left$(Buffer, Count)
    Exit Function

UngetChar:
    Rem Step back one 2-byte character so the caller reads it again.
    Pos = Seek(FileHandle)
    Seek #FileHandle, Pos - 2
    Return
End Function

Private Function ToChar(ByVal CodePoint As Integer) As String
    Dim Bytes(0 To 1) As Byte

    Bytes(0) = CodePoint And &HFF
    Bytes(1) = ((CodePoint And &HFF00) \ &H100) And &HFF
    ToChar = Bytes
End Function

Function IsLetter(ByVal CodePoint As Integer) As Boolean
    Select Case CodePoint
        Case -32768 To -24645, -24576 To -23412, -22761 To -22758, -22528 To -22527, -22525 To -22523, _
             -22521 To -22518, -22516 To -22494, -22464 To -22413, -21504 To -10333, -1792 To -1491, _
             -1488 To -1430, -1424 To -1319, -1280 To -1274, -1261 To -1257, -1251, -1249 To -1240, _
             -1238 To -1226, -1224 To -1220, -1218, -1216, -1215, -1213, -1212, -1210 To -1103, _
             -1069, -1068 To -707, -688 To -625, -622 To -569, -528 To -517, -400 To -396, -394 To -260, _
             -223 To -198, -191 To -166, -154 To -66, -62 To -57, -54 To -49, -46 To -41, -38 To -36, _
             65 To 90, 97 To 122, 170, 181, 186, 192 To 214, 216 To 246, 248 To 705, 710 To 721, _
             736 To 740, 750, 890 To 893, 902, 904 To 906, 908, 910 To 929, 931 To 974, 976 To 1013, _
             1015 To 1153, 1162 To 1299, 1329 To 1366, 1369, 1377 To 1415, 1488 To 1514, 1520 To 1522, _
             1569 To 1594, 1600 To 1610, 1646, 1647, 1649 To 1747, 1749, 1765, 1766, 1774, 1775, _
             1786 To 1788, 1791, 1808, 1810 To 1839, 1869 To 1901, 1920 To 1957, 1969, 1994 To 2026, 2036, _
             2037, 2042
            IsLetter = True
    End Select
End Function
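
In case the negative numbers in IsLetter look odd: VB's Integer is a signed 16-bit type, so any 2-byte value at or above &H8000 reads back as a negative number. The range -21504 To -10333, for example, is U+AC00 to U+D7A3, the Hangul syllables. A minimal C sketch of the conversion between the two views (the helper names are hypothetical):

```c
#include <stdint.h>

/* Views a VB-style signed 16-bit value as the unsigned code unit it stands for.
   Converting to an unsigned type is well defined: it wraps modulo 65536. */
uint16_t to_code_unit(int16_t vb_integer) {
    return (uint16_t)vb_integer;
}

/* Opposite direction: the value a VB Integer would show for a code unit.
   The arithmetic is done in int so no out-of-range narrowing takes place. */
int16_t to_vb_integer(uint16_t code_unit) {
    int value = code_unit;
    if (value >= 0x8000) value -= 65536;
    return (int16_t)value;
}
```

With these, to_code_unit(-21504) gives 0xAC00 and to_vb_integer(0xD7A3) gives -10333.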


Now, we'll reject the identifier if it is a keyword.
Add the function below and change the line "Debug.Print Id" to "If Not IsKeyword(Id) Then Debug.Print Id".

Rem This line goes at the top of the module, before any procedure.
Option Compare Text

Private Function IsKeyword(ByVal Identifier As String) As Boolean
    Select Case Identifier
        Case "AddressOf", "And", "Any", "As", "Attribute", "Boolean", "ByRef", "ByVal", "Byte", "Call", "Case", "CDecl", "Circle", "Close", "Const", "Currency", "Date", "Debug", "Decimal", "Declare", "DefBool", "DefByte", "DefCur", "DefDate", "DefDbl", "DefDec", "DefInt", "DefLng", "DefObj", "DefSng", "DefStr", "DefVar", "Dim", "Do", "Double", "Each", "ElseIf", "Else", "Empty", "EndIf", "End", "Enum", "Eqv", "Erase", "Event", "Exit", "False", "For", "Friend", "Function", "Get", "Global", "GoSub", "GoTo", "If", "Imp", "Implements", "In", "Input", "Integer", "Is", "Lock", "Let", "Like", "Local", "Long", "Loop"
            IsKeyword = True
        Case "LSet", "Len", "Me", "Mod", "New", "Next", "Not", "Nothing", "Null", "On", "Open", "Option", "Optional", "Or", "ParamArray", "PSet", "Preserve", "Print", "Private", "Public", "Put", "RaiseEvent", "ReDim", "Rem", "Resume", "Return", "RSet", "Seek", "Select", "Set", "Scale", "Shared", "Single", "Static", "Spc", "Stop", "String", "Sub", "Tab", "Then", "To", "True", "Type", "TypeOf", "Unlock", "Until", "Variant", "Wend", "While", "With", "WithEvents", "Write", "Xor"
            IsKeyword = True
    End Select
End Function

Just for fun, this is the C version of the first part of the code above:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

bool IsLetter(int16_t cp) {
    if (
        cp >= -32768 && cp <= -24645 ||
        cp >= -24576 && cp <= -23412 ||
        cp >= -22761 && cp <= -22758 ||
        cp >= -22528 && cp <= -22527 ||
        cp >= -22525 && cp <= -22523 ||
        cp >= -22521 && cp <= -22518 ||
        cp >= -22516 && cp <= -22494 ||
        cp >= -22464 && cp <= -22413 ||
        cp >= -21504 && cp <= -10333 ||
        cp >= -1792 && cp <= -1491 ||
        cp >= -1488 && cp <= -1430 ||
        cp >= -1424 && cp <= -1319 ||
        cp >= -1280 && cp <= -1274 ||
        cp >= -1261 && cp <= -1257 ||
        cp == -1251 ||
        cp >= -1249 && cp <= -1240 ||
        cp >= -1238 && cp <= -1226 ||
        cp >= -1224 && cp <= -1220 ||
        cp == -1218 ||
        cp == -1216 ||
        cp == -1215 ||
        cp == -1213 ||
        cp == -1212 ||
        cp >= -1210 && cp <= -1103 ||
        cp == -1069 ||
        cp >= -1068 && cp <= -707 ||
        cp >= -688 && cp <= -625 ||
        cp >= -622 && cp <= -569 ||
        cp >= -528 && cp <= -517 ||
        cp >= -400 && cp <= -396 ||
        cp >= -394 && cp <= -260 ||
        cp >= -223 && cp <= -198 ||
        cp >= -191 && cp <= -166 ||
        cp >= -154 && cp <= -66 ||
        cp >= -62 && cp <= -57 ||
        cp >= -54 && cp <= -49 ||
        cp >= -46 && cp <= -41 ||
        cp >= -38 && cp <= -36 ||
        cp >= 65 && cp <= 90 ||
        cp >= 97 && cp <= 122 ||
        cp == 170 ||
        cp == 181 ||
        cp == 186 ||
        cp >= 192 && cp <= 214 ||
        cp >= 216 && cp <= 246 ||
        cp >= 248 && cp <= 705 ||
        cp >= 710 && cp <= 721 ||
        cp >= 736 && cp <= 740 ||
        cp == 750 ||
        cp >= 890 && cp <= 893 ||
        cp == 902 ||
        cp >= 904 && cp <= 906 ||
        cp == 908 ||
        cp >= 910 && cp <= 929 ||
        cp >= 931 && cp <= 974 ||
        cp >= 976 && cp <= 1013 ||
        cp >= 1015 && cp <= 1153 ||
        cp >= 1162 && cp <= 1299 ||
        cp >= 1329 && cp <= 1366 ||
        cp == 1369 ||
        cp >= 1377 && cp <= 1415 ||
        cp >= 1488 && cp <= 1514 ||
        cp >= 1520 && cp <= 1522 ||
        cp >= 1569 && cp <= 1594 ||
        cp >= 1600 && cp <= 1610 ||
        cp == 1646 ||
        cp == 1647 ||
        cp >= 1649 && cp <= 1747 ||
        cp == 1749 ||
        cp == 1765 ||
        cp == 1766 ||
        cp == 1774 ||
        cp == 1775 ||
        cp >= 1786 && cp <= 1788 ||
        cp == 1791 ||
        cp == 1808 ||
        cp >= 1810 && cp <= 1839 ||
        cp >= 1869 && cp <= 1901 ||
        cp >= 1920 && cp <= 1957 ||
        cp == 1969 ||
        cp >= 1994 && cp <= 2026 ||
        cp == 2036 ||
        cp == 2037 ||
        cp == 2042
    ) return true;
    return false;
}

void ReadIdentifier(FILE* fileHandle, int16_t codePoint, int16_t* buffer) {
    buffer[0] = codePoint;
    int16_t cp;
    int16_t count = 1;

    for (;;) {
        size_t read = fread(&cp, sizeof(int16_t), 1, fileHandle);
        if (read < 1) break;

        if (cp == '_' || (cp >= '0' && cp <= '9') || IsLetter(cp)) {
            /* Only complain when one more character would not fit. */
            if (count >= 255) {
                printf("Identifier too long");
                exit(1);
            }
            buffer[count++] = cp;
        } else {
            /* Unget the character. Cast before negating: sizeof is unsigned. */
            fseek(fileHandle, -(long)sizeof(int16_t), SEEK_CUR);
            break;
        }
    }

    buffer[count] = 0;
}

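int main(int argc, char *argv[])
{
    /* Room for 255 2-byte characters plus the terminator. */
    int16_t buffer[256];
    FILE* fh;
    int16_t cp;

    if (argc < 2) {
        printf("Please pass the source file path as an argument.");
        exit(1);
    }

    fh = fopen(argv[1], "rb");
    if (fh == NULL) {
        printf("Error opening file!");
        exit(1);
    }

    for (;;) {
        size_t read = fread(&cp, sizeof(int16_t), 1, fh);
        if (read < 1) break;

        if (IsLetter(cp)) {
            ReadIdentifier(fh, cp, buffer);
            /* %ls expects wchar_t, which is 16 bits wide on Windows;
               elsewhere the buffer would need converting first. */
            printf("%ls\n", (wchar_t*)buffer);
        }
    }

    fclose(fh);
    exit(0);
}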

As for the second part of the code, the one where we check whether the identifier is a keyword, it is a little harder to convert to C.
See, I cheated a little. I used Option Compare Text, a VB feature that makes string comparisons ignore case.
This way, "declare" is equal to "Declare". VB is case-insensitive, so we must compare identifiers this way.
To achieve that for Unicode characters in plain old C, the code is a little more involved.
We will need to revisit it later.
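
Until then, here is a stopgap sketch in C. Every keyword in the list above is plain ASCII, so folding only A to Z gets us most of the way. This is not what Option Compare Text does (it ignores Unicode case folding altogether), the function names are mine, and the keyword table is just a small excerpt for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Lowercases A to Z only; every other code unit is returned untouched. */
static uint16_t ascii_lower(uint16_t c) {
    return (c >= 'A' && c <= 'Z') ? (uint16_t)(c + ('a' - 'A')) : c;
}

/* Compares a zero-terminated UTF-16 identifier with an ASCII keyword,
   ignoring ASCII case only. */
static bool equals_ignore_ascii_case(const uint16_t *id, const char *keyword) {
    size_t i = 0;
    while (id[i] != 0 && keyword[i] != '\0') {
        uint16_t k = (uint16_t)(unsigned char)keyword[i];
        if (ascii_lower(id[i]) != ascii_lower(k)) return false;
        i++;
    }
    /* Equal only if both strings ended at the same position. */
    return id[i] == 0 && keyword[i] == '\0';
}

bool is_keyword_sketch(const uint16_t *id) {
    /* Excerpt of the VB keyword list, for illustration only. */
    static const char *const keywords[] = {
        "Dim", "Do", "Else", "End", "Function", "If", "Sub", "Then"
    };
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        if (equals_ignore_ascii_case(id, keywords[i])) return true;
    }
    return false;
}
```

A linear scan is fine for a sketch; a sorted table or a hash would be the obvious next step.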

Also, note that we are only halfway to dealing with Unicode correctly. Some characters take more than two bytes: in UTF-16 they are encoded as a pair of 2-byte units, which the code above reads as two separate characters.
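
For the curious, this is roughly what handling those longer characters involves in UTF-16. The C sketch below joins a pair of 2-byte units (a "surrogate pair") into a single code point. The function name and its return convention are my own choices for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one code point from a run of UTF-16 code units.
   Returns how many units were consumed (1 or 2), or 0 if the input
   is empty or malformed. */
size_t decode_utf16(const uint16_t *units, size_t n, uint32_t *cp) {
    if (n == 0) return 0;

    uint16_t hi = units[0];

    /* Anything outside D800..DFFF is a complete character by itself. */
    if (hi < 0xD800 || hi > 0xDFFF) {
        *cp = hi;
        return 1;
    }

    /* A low surrogate cannot come first. */
    if (hi >= 0xDC00) return 0;

    /* A high surrogate must be followed by a low surrogate. */
    if (n < 2) return 0;
    uint16_t lo = units[1];
    if (lo < 0xDC00 || lo > 0xDFFF) return 0;

    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

For example, the units D83D DE00 decode to the single code point U+1F600.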

What about those exceptions you mentioned before?

I'm glad you asked!

Names for Enum members can be anything, as long as they are between square brackets.
(See WTF VB6? Part 6 for more details.)

And names for Type members can be any identifier including keywords.

The first case is easy to deal with. We read a "[", and then keep reading and accumulating characters until finding a "]".
Unless, of course, we find a line break. Line breaks are not allowed in identifiers, even escaped ones.

The second one is also easy. We just don't check whether the identifier is a keyword or not.

Private Function ReadEscapedIdentifier(ByVal FileHandle As Integer) As String
    Dim Buffer As String * 255
    Dim Cp As Integer
    Dim Count As Integer

    Do While Not EOF(FileHandle)
        Get #FileHandle, , Cp
        If Cp = AscW("]") Then Exit Do
        If Cp = 10 Or Cp = 13 Then Err.Raise vbObjectError + 13, , "Invalid identifier"
        Count = Count + 1
        If Count > 255 Then Err.Raise vbObjectError + 13, , "Identifier too long"
        Mid$(Buffer, Count, 1) = ToChar(Cp)
    Loop

    ReadEscapedIdentifier = Left$(Buffer, Count)
End Function
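
And, again just for fun, a C sketch of the same idea. To keep it short, it reads from an in-memory array of 2-byte units instead of a file, and it reports errors by returning -1. Unlike the VB version, it also treats reaching the end of the input before the "]" as an error. All names here are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_IDENTIFIER_LENGTH 255

/* Reads an escaped identifier from src, starting at *pos, which must point
   just past the opening '['. Copies it, zero-terminated, into out, which
   needs room for MAX_IDENTIFIER_LENGTH + 1 units.
   Returns the identifier length, or -1 on a line break, on an identifier
   that is too long, or when the input ends before the closing ']'. */
long read_escaped_identifier(const uint16_t *src, size_t n, size_t *pos, uint16_t *out) {
    size_t count = 0;

    while (*pos < n) {
        uint16_t cp = src[(*pos)++];

        if (cp == ']') {
            out[count] = 0;
            return (long)count;
        }
        /* Line breaks are not allowed, even inside brackets. */
        if (cp == 10 || cp == 13) return -1;
        if (count >= MAX_IDENTIFIER_LENGTH) return -1;

        out[count++] = cp;
    }
    return -1;
}
```

After a successful call, *pos points at the unit right after the "]", ready for whatever comes next.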

Next time, literals.

Andrej Biasic
2020-07-15
Update:
Added Local keyword.

Andrej Biasic
2020-08-06
Update:
Removed InputB from keywords.

Andrej Biasic
2020-08-08