- # Model 100 Tokenizer
+ # Model 100 Tokenizer in C

- A tokenizer for TRS-80 Model 100 (AKA "M100") BASIC language. Converts
- `.DO` files to `.BA`.
+ An external "tokenizer" for TRS-80 Model 100 (AKA "M100") BASIC
+ language. Converts BASIC programs in ASCII text (`.DO`) to executable
+ BASIC (`.BA`) on a host machine. Useful for large programs which the
+ Model 100 cannot tokenize due to memory limitations.

-     tokenize FOO.DO FOO.BA
+     $ cat FOO.DO
+     10 ?"Hello!"

- Although, this documentation refers to the "Model 100", this program
- also works for the Tandy 102, Tandy 200, Kyocera Kyotronic-85, and
- Olivetti M10, which all have [identical
+     $ tokenize FOO.DO
+     Tokenizing 'FOO.DO' into 'FOO.BA'
+
+     $ hd FOO.BA
+     00000000  0f 80 0a 00 a3 22 48 65  6c 6c 6f 21 22 00        |....."Hello!".|
+     0000000e
+
+ ![A screenshot of running the FOO.BA program on a Tandy 102 (emulated
+ via Virtual T)](README.md.d/HELLO.gif "Running a .BA file on a Tandy 102 (Virtual T)")
+
+ This program creates an executable BASIC file that works on the Model
+ 100, the Tandy 102, Tandy 200, Kyocera Kyotronic-85, and Olivetti M10.
+ Those five machines have [identical
tokenization](http://fileformats.archiveteam.org/wiki/Tandy_200_BASIC_tokenized_file).
+ This does not (yet) work for the NEC PC-8201/8201A/8300 whose N82 BASIC
+ has a different tokenization.

- _This does not work for the NEC PC-8201/8201A/8300 whose N82 BASIC has
- a different tokenization._
+ Additionally, this project provides a decommenter and cruncher
+ (whitespace remover) to save bytes in the tokenized output. This
+ allows one to have both well-commented and easy-to-read source code
+ and a small executable size.

## Introduction

The Tandy/Radio-Shack Model 100 portable computer can save its BASIC
- files in ASCII (plain text) or in a "tokenized" format where the
+ files in ASCII (plain text) or in a _“tokenized”_ format where the
keywords — such as ` FOR ` , ` IF ` , ` PRINT ` , ` REM ` — are converted to a
- single byte. Not only is this more compact, but it loads much faster.
+ single byte. The Model 100 automatically tokenizes an ASCII program
+ when it is `LOAD`ed so that it can be `RUN` or `SAVE`d. The tokenized
+ format saves space and loads faster, but where it is tokenized can
+ matter.

### The problem

- Programs for the Model 100 are generally distributed in ASCII format,
- but that has two downsides: ① the user must LOAD and re-SAVE the file
- on their machine to tokenize it as only tokenized BASIC can be run and
- ② the machine may not have enough storage space to tokenize if the
- ASCII version is also in memory.
+ Programs for the Model 100 are generally distributed in ASCII format,
+ which is good for portability and easy transfer. However, ASCII files
+ have downsides:
+
+ 1. `RUN`ning an ASCII program is quite slow because the Model 100 must
+ tokenize it first.
+
+ 1. Large programs can run out of memory (`?OM Error`) when tokenizing
+ on the Model 100 because both the ASCII and the tokenized versions
+ must be in memory simultaneously.
+
+ ![A screenshot of an OM Error after attempting to tokenize on
+ a Tandy 102](README.md.d/OMerror.png "Out of Memory error when
+ attempting to tokenize on a Tandy 102 (Virtual T)")
+

### The solution

- This program solves that problem by tokenizing on a host computer
- before downloading to the Model 100. Additionally, this project
- provides a decommenter and cruncher (whitespace remover) to save bytes
- in the tokenized output at the expense of readability.
+ This program tokenizes on a host computer before downloading to the Model 100.

### File extension terminology

Tokenized BASIC files use the extension `.BA`. ASCII formatted BASIC
files should be given the extension `.DO` so that the Model 100 will
see them as text documents, although people often misuse `.BA` for
- ASCII.
+ ASCII BASIC.

## Programs in this project

- * **tokenize**: A shell script which ties together all the following.
+ * **tokenize**: A shell script which ties together all the following
+ tools. Most people will only run this program directly.

* **m100-tokenize**: Convert M100 BASIC program from ASCII (.DO)
to executable .BA file.
@@ -110,14 +138,51 @@ The **-d** option decomments before tokenizing.
The **-c** option decomments _and_ removes all optional
whitespace before tokenizing.

- #### Example
+ #### Example 1: Simplest usage: tokenize filename
+
+ #### Example 2: Overwrite or rename

```bash
$ tokenize PROG.DO
- Output file 'PROG.BA' already exists. Overwrite [yes/No/rename]? R
+ Output file 'PROG.BA' is newer than 'PROG.DO'.
+ Overwrite [yes/No/rename]? R
Old file renamed to 'PROG.BA~'
```

+ #### Example 3: Crunching to save space: tokenize -c
+
+ ```bash
+ $ wc -c M100LE.DO
+ 17630 M100LE.DO
+
+ $ tokenize M100LE.DO
+ Tokenizing 'M100LE.DO' into 'M100LE.BA'
+
+ $ wc -c M100LE.BA
+ 15667 M100LE.BA
+
+ $ tokenize -c M100LE.DO M100LE-crunched.BA
+ Decommenting, crunching, and tokenizing 'M100LE.DO' into 'M100LE-crunched.BA'
+
+ $ wc -c M100LE-crunched.BA
+ 6199 M100LE-crunched.BA
+ ```
+
+ In this case, using `tokenize -c` reduced the BASIC executable from
+ about 15 to 6 kibibytes, which is quite significant on a machine that
+ might have only 24K of RAM. However, this is an extreme example from a
+ well-commented program. Many Model 100 programs have already been
+ "decommented" and "crunched" by hand to save space.
+
+ <ul>
+
+ _Tip: When distributing a crunched program, it is a good idea to also
+ include the original source code to make it easy for people to learn
+ from, debug, and improve it._
+
+ </ul>
+
+
### Running m100-tokenize and friends manually

Certain programs should _usually_ be run to process the input before
@@ -247,33 +312,47 @@ If you find this to be a problem, please file an issue as it is
potentially correctable using `open_memstream()`, but hackerb9 does
not see the need.

- </details>
+ </details> <!-- Running manually -->


## Machine compatibility

- Across the eight Kyotronic-85 Sisters, there are actually only
- two different tokenized formats. The first, which I call "M100
- BASIC" is supported by this program. The second, which is known
- as "N82 BASIC", is not yet supported.
+ Across the eight Kyotronic-85 sisters, there are actually only two
+ different tokenized formats: "M100 BASIC" and "N82 BASIC".

- The TRS-80 Models 100 and 102 and the Tandy 200 all share the same
- tokenized BASIC. While less commonly seen, the Kyocera Kyotronic-85
- and Olivetti M10 also use that tokenization, so one .BA program can
- work for any of them. However, the NEC family of portables -- the
- PC-8201, PC-8201A, and PC-8300 -- run N82 BASIC which has a different
- tokenization format.
+ The three Radio-Shack portables (Models 100, 102 and 200) all share
+ the same tokenized BASIC. While less commonly seen, the Kyocera
+ Kyotronic-85 and Olivetti M10 also use that tokenization, so one .BA
+ program can work for any of them. However, the NEC family of portables
+ -- the PC-8201, PC-8201A, and PC-8300 -- run N82 BASIC which has a
+ different tokenization format.

### Checksum differences are not a compatibility problem

The .BA files generated by `tokenize` aim to be exactly the same, byte
- for byte, as the output from tokenizing on a Model 100. The Tandy 200
- stores the first BASIC program in RAM at a slightly different location
- (0xA000 instead of 0x8000). This has no affect on compatibility, but
- it does change the line number pointers in the .BA file. The pointers
- saved in the file are _never_ used as they are recalculated when the
- program is loaded into RAM.
-
+ for byte, as the output from tokenizing on a Model 100 using `LOAD`
+ and `SAVE`. There are some bytes, however, which can and do change but
+ do not matter.
+
+ An artifact of the `.BA` file format is that the saved file contains
+ the pointer locations of where the program happened to be in memory on
+ the computer from which it was saved. The pointers are _never_ used as
+ they are recalculated when the program is loaded into RAM.
+
+ Specifically, the output of this program is intended to be identical to:
+
+ * A Model 100
+ * that has been freshly reset
+ * with no other BASIC programs on it
+ * running `LOAD "COM:88N1"` and `SAVE "FOO"` while a host computer
+ sends the ASCII BASIC program over the serial port.
+
+ I (hackerb9) believe the Tandy 102, Kyotronic-85, and M10 also output
+ byte-identical files, but the Tandy 200 does not. The 200 has more ROM
+ than the other Model T computers, so it stores the first BASIC program
+ at a slightly different RAM location (0xA000 instead of 0x8000). This
+ has no effect on compatibility between machines, but it does change
+ the line number pointers in the .BA file.

## Why Lex?

@@ -285,9 +364,9 @@ and the corresponding byte they should emit. Flex handles special
cases, like quoted strings and REMarks, easily.

The downside is that one must have flex installed to _modify_ the
- tokenizer. Flex is _not_ necessary to compile on a machine as flex can
- generate portable C code. See the tokenize-cfiles.tar.gz in the github
- release or run `make cfiles`.
+ tokenizer. Flex is _not_ necessary to compile on a machine as flex
+ generates portable C code. See the tokenize-cfiles.tar.gz in the
+ GitHub release or run `make cfiles`.

## Abnormal code
@@ -442,10 +521,8 @@ has followed suit.
## Known Bugs

- * Currently no attempt is made to change lowercase variable names to
- UPPERCASE. The Model T computers cannot run programs with lowercase
- variables (Syntax Error). One can use EDIT on such a program and
- simply resave it to fix the case issue.
+ * None known. Reports are gratefully accepted.
+

## Alternatives