Commit 9e9b652

Major cleanup and update. Add screenshots, examples. Compatibility.
1 parent 31b4415 commit 9e9b652

1 file changed

+126
-49
lines changed

README.md

@@ -1,50 +1,78 @@
1-
# Model 100 Tokenizer
1+
# Model 100 Tokenizer in C
22

3-
A tokenizer for TRS-80 Model 100 (AKA "M100") BASIC language. Converts
4-
`.DO` files to `.BA`.
3+
An external "tokenizer" for TRS-80 Model 100 (AKA "M100") BASIC
4+
language. Converts BASIC programs in ASCII text (`.DO`) to executable
5+
BASIC (`.BA`) on a host machine. Useful for large programs which the
6+
Model 100 cannot tokenize due to memory limitations.
57

6-
tokenize FOO.DO FOO.BA
8+
$ cat FOO.DO
9+
10 ?"Hello!"
710

8-
Although, this documentation refers to the "Model 100", this program
9-
also works for the Tandy 102, Tandy 200, Kyocera Kyotronic-85, and
10-
Olivetti M10, which all have [identical
11+
$ tokenize FOO.DO
12+
Tokenizing 'FOO.DO' into 'FOO.BA'
13+
14+
$ hd FOO.BA
15+
00000000 0f 80 0a 00 a3 22 48 65 6c 6c 6f 21 22 00 |....."Hello!".|
16+
0000000e
17+
18+
![A screenshot of running the FOO.BA program on a Tandy 102 (emulated
19+
via Virtual T)](README.md.d/HELLO.gif "Running a .BA file on a Tandy 102 (Virtual T)")
20+
21+
This program creates an executable BASIC file that works on the Model
22+
100, the Tandy 102, Tandy 200, Kyocera Kyotronic-85, and Olivetti M10.
23+
Those five machines have [identical
1124
tokenization](http://fileformats.archiveteam.org/wiki/Tandy_200_BASIC_tokenized_file).
25+
This does not (yet) work for the NEC PC-8201/8201A/8300 whose N82 BASIC
26+
has a different tokenization.
1227

13-
_This does not work for the NEC PC-8201/8201A/8300 whose N82 BASIC has
14-
a different tokenization._
28+
Additionally, this project provides a decommenter and cruncher
29+
(whitespace remover) to save bytes in the tokenized output. This
30+
allows one to have both well-commented and easy-to-read source code
31+
and a small executable size.
1532

1633
## Introduction
1734

1835
The Tandy/Radio-Shack Model 100 portable computer can save its BASIC
19-
files in ASCII (plain text) or in a "tokenized" format where the
36+
files in ASCII (plain text) or in a _tokenized_ format where the
2037
keywords — such as `FOR`, `IF`, `PRINT`, `REM` — are converted to a
21-
single byte. Not only is this more compact, but it loads much faster.
38+
single byte. The Model 100 automatically tokenizes an ASCII program
39+
when it is `LOAD`ed so that it can be `RUN` or `SAVE`d. The tokenized
40+
format saves space and loads faster, but where it is tokenized can
41+
matter.
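
As a rough illustration of that format, the 14 bytes of `FOO.BA` shown in the hex dump above can be picked apart with a few lines of Python. This is only a sketch and not part of this project: the name `parse_lines` is made up, and the layout it assumes (a 16-bit little-endian pointer, a 16-bit line number, the tokenized text, then a NUL byte) is inferred from that one dump.

```python
import struct

# The exact bytes from the `hd FOO.BA` output shown earlier.
FOO_BA = bytes.fromhex("0f800a00a32248656c6c6f212200")


def parse_lines(data):
    """Split a tokenized image into (pointer, line_number, body) tuples.

    Assumed layout per line: 2-byte pointer, 2-byte line number
    (both little-endian), tokenized body, one terminating NUL.
    """
    lines = []
    pos = 0
    while pos + 4 <= len(data):
        pointer, lineno = struct.unpack_from("<HH", data, pos)
        end = data.index(0, pos + 4)  # NUL that ends this line's body
        lines.append((pointer, lineno, data[pos + 4:end]))
        pos = end + 1
    return lines


for pointer, lineno, body in parse_lines(FOO_BA):
    print(f"{pointer:#06x} {lineno} {body!r}")
# prints: 0x800f 10 b'\xa3"Hello!"'
```

The single byte `0xA3` is what replaced `?` (shorthand for `PRINT`), while the quoted string is stored unchanged.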
2242

2343
### The problem
2444

25-
Programs for the Model 100 are generally distributed in ASCII format,
26-
but that has two downsides: ① the user must LOAD and re-SAVE the file
27-
on their machine to tokenize it as only tokenized BASIC can be run and
28-
② the machine may not have enough storage space to tokenize if the
29-
ASCII version is also in memory.
45+
Programs for the Model 100 are generally distributed in ASCII format,
46+
which is good for portability and easy transfer. However, ASCII files
47+
have downsides:
48+
49+
1. `RUN`ning an ASCII program is quite slow because the Model 100 must
50+
tokenize it first.
51+
52+
1. Large programs can run out of memory (`?OM Error`) when tokenizing
53+
on the Model 100 because both the ASCII and the tokenized versions
54+
must be in memory simultaneously.
55+
56+
![A screenshot of an OM Error after attempting to tokenize on
57+
a Tandy 102](README.md.d/OMerror.png "Out of Memory error when
58+
attempting to tokenize on a Tandy 102 (Virtual T)")
59+
3060

3161
### The solution
3262

33-
This program solves that problem by tokenizing on a host computer
34-
before downloading to the Model 100. Additionally, this project
35-
provides a decommenter and cruncher (whitespace remover) to save bytes
36-
in the tokenized output at the expense of readability.
63+
This program tokenizes on a host computer before downloading to the Model 100.
3764

3865
### File extension terminology
3966

4067
Tokenized BASIC files use the extension `.BA`. ASCII formatted BASIC
4168
files should be given the extension `.DO` so that the Model 100 will
4269
see them as text documents, although people often misuse `.BA` for
43-
ASCII.
70+
ASCII BASIC.
4471

4572
## Programs in this project
4673

47-
* **tokenize**: A shell script which ties together all the following.
74+
* **tokenize**: A shell script which ties together all the following
75+
tools. Most people will only run this program directly.
4876

4977
* **m100-tokenize**: Convert M100 BASIC program from ASCII (.DO)
5078
to executable .BA file.
@@ -110,14 +138,51 @@ The **-d** option decomments before tokenizing.
110138
The **-c** option decomments _and_ removes all optional
111139
whitespace before tokenizing.
112140

113-
#### Example
141+
#### Example 1: Simplest usage: tokenize filename
142+
143+
#### Example 2: Overwrite or rename
114144

115145
``` bash
116146
$ tokenize PROG.DO
117-
Output file 'PROG.BA' already exists. Overwrite [yes/No/rename]? R
147+
Output file 'PROG.BA' is newer than 'PROG.DO'.
148+
Overwrite [yes/No/rename]? R
118149
Old file renamed to 'PROG.BA~'
119150
```
120151

152+
#### Example 3: Crunching to save space: tokenize -c
153+
154+
``` bash
155+
$ wc -c M100LE.DO
156+
17630 M100LE.DO
157+
158+
$ tokenize M100LE.DO
159+
Tokenizing 'M100LE.DO' into 'M100LE.BA'
160+
161+
$ wc -c M100LE.BA
162+
15667 M100LE.BA
163+
164+
$ tokenize -c M100LE.DO M100LE-crunched.BA
165+
Decommenting, crunching, and tokenizing 'M100LE.DO' into 'M100LE-crunched.BA'
166+
167+
$ wc -c M100LE-crunched.BA
168+
6199 M100LE-crunched.BA
169+
```
170+
171+
In this case, using `tokenize -c` reduced the BASIC executable from
172+
about 15 to 6 kibibytes, which is quite significant on a machine that might
173+
have only 24K of RAM. However, this is an extreme example from a
174+
well-commented program. Many Model 100 programs have already been
175+
"decommented" and "crunched" by hand to save space.
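
For anyone who wants to check the arithmetic, the byte counts reported by `wc -c` in Example 3 can be compared like this. It is just a back-of-the-envelope sketch; the helper names `kib` and `saving` are made up for illustration.

```python
# Byte counts copied from the `wc -c` output in Example 3.
ASCII_SIZE = 17630      # M100LE.DO
TOKENIZED_SIZE = 15667  # M100LE.BA
CRUNCHED_SIZE = 6199    # M100LE-crunched.BA


def kib(size_in_bytes):
    """Convert bytes to kibibytes (1 KiB = 1024 bytes)."""
    return size_in_bytes / 1024


def saving(before, after):
    """Percentage of bytes saved going from `before` to `after`."""
    return 100 * (before - after) / before


print(f"tokenized: {kib(TOKENIZED_SIZE):.1f} KiB")
print(f"crunched:  {kib(CRUNCHED_SIZE):.1f} KiB")
print(f"tokenizing alone saves {saving(ASCII_SIZE, TOKENIZED_SIZE):.0f}%")
print(f"crunching saves a further {saving(TOKENIZED_SIZE, CRUNCHED_SIZE):.0f}%")
```

Tokenizing alone saves roughly a tenth of the ASCII size, while decommenting and crunching removes about 60% of what remains.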
176+
177+
<ul>
178+
179+
_Tip: When distributing a crunched program, it is a good idea to also
180+
include the original source code to make it easy for people to learn
181+
from, debug, and improve it._
182+
183+
</ul>
184+
185+
121186
### Running m100-tokenize and friends manually
122187

123188
Certain programs should _usually_ be run to process the input before
@@ -247,33 +312,47 @@ If you find this to be a problem, please file an issue as it is
247312
potentially correctable using `open_memstream()`, but hackerb9 does
248313
not see the need.
249314

250-
</details>
315+
</details> <!-- Running manually -->
251316

252317

253318
## Machine compatibility
254319

255-
Across the eight Kyotronic-85 Sisters, there are actually only
256-
two different tokenized formats. The first, which I call "M100
257-
BASIC" is supported by this program. The second, which is known
258-
as "N82 BASIC", is not yet supported.
320+
Across the eight Kyotronic-85 sisters, there are actually only two
321+
different tokenized formats: "M100 BASIC" and "N82 BASIC".
259322

260-
The TRS-80 Models 100 and 102 and the Tandy 200 all share the same
261-
tokenized BASIC. While less commonly seen, the Kyocera Kyotronic-85
262-
and Olivetti M10 also use that tokenization, so one .BA program can
263-
work for any of them. However, the NEC family of portables -- the
264-
PC-8201, PC-8201A, and PC-8300 -- run N82 BASIC which has a different
265-
tokenization format.
323+
The three Radio-Shack portables (Models 100, 102 and 200) all share
324+
the same tokenized BASIC. While less commonly seen, the Kyocera
325+
Kyotronic-85 and Olivetti M10 also use that tokenization, so one .BA
326+
program can work for any of them. However, the NEC family of portables
327+
-- the PC-8201, PC-8201A, and PC-8300 -- run N82 BASIC which has a
328+
different tokenization format.
266329

267330
### Checksum differences are not a compatibility problem
268331

269332
The .BA files generated by `tokenize` aim to be exactly the same, byte
270-
for byte, as the output from tokenizing on a Model 100. The Tandy 200
271-
stores the first BASIC program in RAM at a slightly different location
272-
(0xA000 instead of 0x8000). This has no affect on compatibility, but
273-
it does change the line number pointers in the .BA file. The pointers
274-
saved in the file are _never_ used as they are recalculated when the
275-
program is loaded into RAM.
276-
333+
for byte, as the output from tokenizing on a Model 100 using `LOAD`
334+
and `SAVE`. There are some bytes, however, which can and do change but
335+
do not matter.
336+
337+
An artifact of the `.BA` file format is that the saved file contains
338+
the pointer locations of where the program happened to be in memory on
339+
the computer from which it was saved. The pointers are _never_ used as
340+
they are recalculated when the program is loaded into RAM.
341+
342+
Specifically, the output of this program is intended to be identical to:
343+
344+
* A Model 100
345+
* that has been freshly reset
346+
* with no other BASIC programs on it
347+
* running `LOAD "COM:88N1"` and `SAVE "FOO"` while a host computer
348+
sends the ASCII BASIC program over the serial port.
349+
350+
I (hackerb9) believe the Tandy 102, Kyotronic-85, and M10 also output
351+
byte-identical files, but the Tandy 200 does not. The 200 has more ROM
352+
than the other Model T computers, so it stores the first BASIC program
353+
at a slightly different RAM location (0xA000 instead of 0x8000). This
354+
has no effect on compatibility between machines, but it does change
355+
the line number pointers in the .BA file.
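
To make the point concrete, here is a small Python sketch (not part of this project) that rewrites the pointers in a `.BA` image for a different load address. The start addresses `0x8001` and `0xA001` are assumptions: the first is inferred from the `FOO.BA` dump near the top of this README, where the pointer `0x800F` lies 14 bytes past it, and the second simply applies the same offset to the Tandy 200 address mentioned above. `rebase` and `same_program` are hypothetical names.

```python
import struct

# The 14 bytes of FOO.BA from the hex dump near the top of this README.
FOO_BA = bytes.fromhex("0f800a00a32248656c6c6f212200")


def rebase(data, base):
    """Recompute every line pointer as if the program started at `base`.

    Assumed layout per line: 2-byte pointer, 2-byte line number,
    tokenized body, terminating NUL. Only the pointers are rewritten.
    """
    out = bytearray()
    pos = 0
    while pos + 4 <= len(data):
        end = data.index(0, pos + 4) + 1       # one past this line's NUL
        out += struct.pack("<H", base + end)   # address of the next line
        out += data[pos + 2:end]               # line number, body, NUL
        pos = end
    return bytes(out)


def same_program(a, b):
    """True if two images differ only in their line pointers."""
    return rebase(a, 0) == rebase(b, 0)


m100_style = rebase(FOO_BA, 0x8001)  # assumed Model 100 start address
t200_style = rebase(FOO_BA, 0xA001)  # assumed Tandy 200 start address

print(m100_style[:2].hex(), t200_style[:2].hex())  # prints: 0f80 0fa0
print(m100_style == t200_style)                    # prints: False
print(same_program(m100_style, t200_style))        # prints: True
```

The two images have different checksums, yet they describe the same program once the pointers are ignored.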
277356

278357
## Why Lex?
279358

@@ -285,9 +364,9 @@ and the corresponding byte they should emit. Flex handles special
285364
cases, like quoted strings and REMarks, easily.
286365

287366
The downside is that one must have flex installed to _modify_ the
288-
tokenizer. Flex is _not_ necessary to compile on a machine as flex can
289-
generate portable C code. See the tokenize-cfiles.tar.gz in the github
290-
release or run `make cfiles`.
367+
tokenizer. Flex is _not_ necessary to compile on a machine as flex
368+
generates portable C code. See the tokenize-cfiles.tar.gz in the
369+
github release or run `make cfiles`.
291370

292371
## Abnormal code
293372

@@ -442,10 +521,8 @@ has followed suit.
442521

443522
## Known Bugs
444523

445-
* Currently no attempt is made to change lowercase variable names to
446-
UPPERCASE. The Model T computers cannot run programs with lowercase
447-
variables (Syntax Error). One can use EDIT on such a program and
448-
simply resave it to fix the case issue.
524+
* None known. Reports are gratefully accepted.
525+
449526

450527
## Alternatives
451528
