CHAPTER 12

The SETL Database Facility

Chapter Contents.

  1. General Introduction to the Functionality and Behavior of SETL Databases

  2. A First Example of SETL Database Use.

  3. A Second Example. Value-semantics of SETL Databases, free form of records, databases as ordinary objects.

  4. More Examples: Databases which store Free-form Text.

  5. A First Sketch of the SETL Database Implementation concept: simplest form of the basic B-tree structure.

  6. B-trees Incorporating More General Cumulants.

  7. Mapping large strings to collections of disk pages.

  8. Examples of the use of Large-string Objects.

  9. Code for large-string management with reference counting.

  10. The full SETL database design - Introduction.

  11. Code Details for SETL Database objects.

  12. 'Hibernating' transactions - Introduction.

  13. 'Hibernating' transactions - An example application.

  14. 'Hibernating' transactions - additional implementation details.

  15. Guaranteed recovery after crash for programs which manipulate databases.

  16. Recovery mechanisms - code details.

  17. Formal verification of crash recovery code correctness; a 'brute force' approach.

  18. 'Versioned' B-trees.

  19. Atomicity and recovery of distributed transactions.

12.1 General Introduction to the Functionality and Behavior of SETL Databases. SETL includes a powerful facility for database design and implementation. Associated with this is an extended facility for prototyping and implementing complex commercial applications, based on the notion of 'hibernating transactions' which can be resident for long periods of time in the database itself. These facilities are fully transactional, in the sense explained below, and also facilitate the prototyping and implementation of large distributed databases.

SETL databases provide an alternative to standard SQL databases. Each SETL database object is a potentially very large (up to hundreds of gigabytes) abstract object stored on disk. Each such database object resembles a map from a set of system generated 4- or 5-byte record identifiers to the records which the database stores, each of which is a fully general object of the SETL language, but which is itself not excessively large (e.g. no more than a few dozen megabytes.) The database records are systematically indexed for efficient access, using the abstract technique described below. Once a record of a SETL database has been accessed, all the operations of the SETL language can be applied to it. Conversely, any standard SETL object can be stored as a database record.

The operations specific to SETL databases (other than standard SETL operations applicable to individual records) are as follows:

 x := db(id);	-- find a database record given its identifier
 db(id) := x;	-- store x as a database record, associating it with the specified identifier
 db with x	-- create a new record and put x into it. 
 		-- a pair of the form [modified_db,id_of_new_record] is returned.
 db.to_string;	-- database-associated mapping of records to 
 		-- reduced string variants ('summary strings')
			-- used for indexing; see comment below

 db.contains(wd) -- creates and returns a database iterator object, which iterates
		-- over all records whose summary string contains the given word

Iterations 'x in iter_obj' over these iterator objects return the ordered sequence of keys of all records containing the specified word (in a standard key-collating order). These iteration objects support the operations:

	#iter_obj (number of keys in an iteration group)
	arb(iter_obj) (first item in the iteration group, or OM if none)
	iter_obj + iter_obj (union iteration)
	iter_obj * iter_obj (iterate over intersection),
	iter_obj - iter_obj (iterate over difference)
Internally, the database 'records' are represented by long string sections, separated by special marks representing each record's id. The detailed layout of the 'id-and-record' file used to store this information is described below.
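
For example, assuming a database db whose summary strings are formed in the manner described under 'Indexing of SETL databases' below (so that a record like ["Jack",71] is indexed under the words 'N:Jack' and 'A:71'), a typical fragment using these operations might read:

	[db,id] := db with ["Jack",71];	-- create a new record; the modified db and the new record's id are returned
	print(db(id));			-- retrieve and print the record just stored
	for key in db.contains("N:Jack") * db.contains("A:71") loop
		print(db(key));		-- list every record whose summary contains both words
	end loop;
	db(id) := OM;			-- finally, delete the record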

The SQL view of database semantics has become pervasive. However, the alternative approach described in this chapter may have significant advantages. To begin to see how these two approaches compare, note that SQL database records, ordinarily depicted as

Name            Age   Depart.   Salary   Employed   Posn.
John Peter Doe  43    063       21000    6/7/79     Clerk
Mary M. Smith   25    063       45000    9/1/91     Mgr.

can be converted to the strings (used only for searching in the SETL version of the same database)

"John Peter Doe 43 D:063 $21000 6/7/79 Clerk" "Mary Smith 25 D:063 $45000 9/1/91 Mgr."

by a function f sophisticated enough to recognize middle names and treat them in a special way. Then the operations listed above would allow the crucial core of database searches, e.g. a search for the manager of the employee John Doe, to be written as follows:

	{name(y): x in contains("Doe"), y in contains(department(x)) | 
				posn(y) = "Mgr." & "John" in words_of(string_of(x))}.
More properly, this returns all the managers in any department containing an employee called John Doe, along with whatever 'accidental' matches might occasionally result from chance combinations of 'John' and 'Doe', as in a conceivable department that manufactured Doe Skin gloves and had an employee named Wallace John, or John Johnson. But this sort of thing should be quite rare, and can in any case be detected by forming and filtering or displaying the set of pairs
	{[x,y]: x in contains("Doe"), y in contains(department(x)) | 
				posn(y) = "Mgr." & "John" in words_of(string_of(x))}.
in an appropriate format.

This example shows the main advantages of the approach proposed: (a) it separates the optimization of database queries rather cleanly from their abstract definition; (b) no special statement of the fields to be indexed is required; (c) it allows arbitrary strings to appear in fields, and is prepared to search using concordances of the words in these fields, which is often an effective search technique. (E.g. we can just as well find the manager of "Peter Doe", or "J. Peter Doe", a task which might well vex an SQL database.)

Indexing of SETL databases. Each database has an administrator-defined 'to_string(record)' method which can be applied to each of its records and which yields a 'record summary string' defining those aspects of the record which are to be indexed to improve the efficiency of searches over the database. By cutting these summary strings into words (simply by cutting them at all included whitespaces) we generate an associated internal word_list(record). These word lists are automatically used to build a 'word index' associated with the database. Iterations over the db.contains(wd) iterator objects described above exploit this index.
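
For example, for a database whose records are simple [name, age] pairs, like the one used in the first example below, the summarization routine might take the following form (this is essentially the 'pair_to_stg' routine of the examples which follow), so that the record ["Jack",71] is indexed under the two words 'N:Jack' and 'A:71':

	procedure pair_to_stg(p); -- map a [name,age,...] record to the summary string "N:name A:age"
		[nm,age,-] := p;  -- ignore any components beyond the first two
		return "N:" + nm + " A:" + str(age);
	end pair_to_stg;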

Transaction handling in SETL databases - Basics. SETL databases, like all other SETL objects, have value semantics. For example, if db is a database object, then the sequence of statements

db2 := db; db := db with new_record;

changes db by adding a new record, but leaves db2 unchanged since no new assignment has been made to it. The SETL database implementation creates the 'logical copies' that this requires in an efficient way, whose cost is only weakly dependent on the size of the database objects involved.

This gives an easy way of obtaining transaction-like effects. To begin a transaction which may have to be aborted, simply write

db_copy := db;

Then edit db as desired. To 'commit' the transaction, simply erase the copy by writing db_copy := OM. To back out of the transaction instead, simply write db := db_copy; this will restore db to its original state.
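
Schematically, a transaction against a database db therefore has the following overall shape (a sketch only; 'edit_records' stands for whatever hypothetical sequence of record reads and updates the particular transaction performs, here assumed to report success or failure):

	db_copy := db;			-- begin the transaction: take a logical copy (cheap, given value semantics)
	[db,ok] := edit_records(db);	-- apply the transaction's edits to db
	if ok then
		db_copy := OM;		-- commit: simply discard the saved copy
	else
		db := db_copy;		-- back out: restore db to its state at the start of the transaction
	end if;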

More Advanced Transaction-related facilities.

Commercial 'transactional' database systems provide several important transaction-related capabilities in addition to the basic transaction-backout ability noted above. These are: (i) guaranteed recovery after system failure; (ii) distributed execution, with guaranteed coordination of multiple distributed files, and guaranteed recovery after failure of any of the components of a distributed system; (iii) parallel execution of multiple independent transactions, with guaranteed isolation of each transaction from the others (serializability.) These fundamental capabilities are also supported by the SETL database system.

Guaranteed recovery after system failure - introduction.

We model the system failures to be guarded against by assuming that:

  1. a processor's CPU can crash, e.g. by losing power or by freezing after a software fault.

  2. Such crashes can occur unpredictably, even in the middle of writing a disk record. However, when such a crash occurs, bad data can appear only in the disk area being written, but not outside it.

  3. Once written to disk, data never 'evaporates', i.e. can always be read.

If any of these assumptions are violated in an actual physical system, a lower supporting layer of physical and software redundancy will be needed. This must implement a 'virtual system' satisfying assumptions (1)-(3).

Recovery after system failure is guaranteed if we can always restart the system after a crash by reading the disk and recovering to a process state equivalent to one reached prior to, and not too distant from, the moment of crash. We call system states to which we could recover 'checkpoints', and can think of them as being marked by a statement

 	checkpoint();

Such checkpoint operations must reliably create something logically equivalent to a complete, self-standing copy of the database, but, since they must be reasonably frequent they must not be too expensive. Recovery is then possible if (i) we can always accept a crash event which destroys RAM and leaves a half-written record on the disk, as explained above; (ii) we can write a top-level 'handler' for this event which reads the state of the disk and brings us back to a system state fully equivalent to that reached immediately prior to execution of the last previous checkpoint() call.

SETL databases support this functionality. The technique used to accomplish this is described in a later section.
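
Schematically, a long-running program which edits a database and wishes to bound the work lost to a crash might be organized as follows (a sketch only; 'updates', 'apply_update', and the checkpointing interval are hypothetical application details, and checkpoint() is the operation discussed above):

	count := 0;
	for update in updates loop		-- process a stream of pending updates
		db := apply_update(db,update);	-- edit the database in whatever way the update requires
		count := count + 1;
		if count mod 100 = 0 then	-- after every hundred updates or so...
			checkpoint();		-- ...establish a state to which post-crash recovery can return
		end if;
	end loop;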

12.2 A First Example of SETL Database Use. The following example illustrates the fundamental database operations described above, and can be executed to test them. In it we create a very tiny 'database' containing just a few records and apply the main database operations to it. Note that the elements of the 'database' are simply SETL pairs [name, age], but that the database-associated mapping 'pair_to_stg' sends the pairs into summary strings of the form 'N:name A:age'.

 program test; -- a first test program for the SETL database class
	use SETL_database; -- use the class of SETL database objects
	
	db := SETL_database(pair_to_stg); -- create an empty database, to store test pairs

	[db,jack_id] := db with ["Jack",71]; -- add a first pair
	print("Jack record: ",db(jack_id)); 
	
	[db,diana_id] := db with ["Diana",43]; -- add a second pair 
	[db,rachel_id] := db with ["Rachel",43]; -- add a third pair 

	print("\n*** Initial Iteration test - records containing 'A:43' ***");
	for idx in db.contains("A:43") loop print(db(idx)); end loop;
	
	print("\n*** Testing empty arb - records containing 'Walter' ***");
	print(arb(db.contains("N:Walter"))); 
	
	print("\n*** Initial Union test - records containing 'Jack' or 'Diana' ***");
	for key in (jd_recs := (db.contains("N:Jack") + db.contains("N:Diana"))) loop 
		print(db(key)); 
	end loop;

	print("\n*** Iterate again ***");
	for key in jd_recs loop print(db(key)); end loop;

	print("testing arb for union: list a record containing 'Jack' or 'Diana': ",
		db(arb jd_recs)); 

	print("\n*** Intersection test - records containing '43' and 'Diana' ***");
	for key in (d43_recs := (db.contains("A:43") * db.contains("N:Diana"))) loop 
		print(db(key)); 
	end loop;
	
	print("\n*** Intersection test - records containing '43' and 'Rachel' ***");
	for key in (r43_recs := (db.contains("A:43") * db.contains("N:Rachel"))) loop 
		print(db(key)); 
	end loop;
	
	print("\n*** Difference test - records containing '43' and not 'Diana' ***");
	for key in (nd43_recs := (db.contains("A:43") - db.contains("N:Diana"))) loop 
		print(db(key)); 
	end loop;
	
	print("\n*** Difference test - records containing '43' and not 'Rachel' ***");
	for key in (nr43_recs := (db.contains("A:43") - db.contains("N:Rachel"))) loop 
		print(db(key)); 
	end loop;
	
	print("testing arb for union: list a record containing 'Jack' or 'Diana': "
		,db(arb jd_recs)); -- ****
	print("testing arb for Intersection: list a record containing '43' and 'Rachel': "
		,db(arb r43_recs));
	print("testing arb for Intersection: list a record containing '43' and 'Diana': "
		,db(arb d43_recs));
	print("testing arb for Intersection: list a record containing 'Rachel' and 'Diana': "
		,arb(db.contains("N:Rachel") * db.contains("N:Diana")));

	-- try various iterator-based retrievals
	print("testing arb: list a record containing 'Jack': "
			,db(jack_key := arb(db.contains("N:Jack"))));
	print("testing arb: list a record containing 'Diana': "
		,db(arb(db.contains("N:Diana"))));
	print("testing arb: list a record containing 'A:71': "
		,db(arb(db.contains("A:71"))));
	print("testing arb: list a record containing 'A:43': "
		,db(arb(db.contains("A:43"))));

	print("testing count: count records containing 'A:43': "
		,#db.contains("A:43")); 
	
	-- now try some elementary database edits
	print("testing change of record, set ['Jack',71] to ['Jack',72]");
	db(jack_key) := ["Jack",72]; -- another year gone by

	print("testing arb after change: list a record containing 'N:Jack': ",
		db(arb(db.contains("N:Jack"))));
	print("testing arb after change: list a record containing 'A:71': "
		,arb(db.contains("A:71")));
	print("testing arb after change: list a record containing 'A:71': "
		,if (a_rec := arb(db.contains("A:71"))) /= OM then db(a_rec) 	
	else "Non-existent ***" end if);
	print("testing arb after change: list a record containing 'A:72': "
		,if (a_rec := arb(db.contains("A:72"))) /= OM then db(a_rec) 
	else "Non-existent ***" end if);

	print("testing arb after change: list a record containing 'N:Diana': "
		,db(arb(db.contains("N:Diana"))));
	print("testing arb after change: list a record containing 'A:43': "
		,if (a_rec := arb(db.contains("A:43"))) /= OM then db(a_rec) 
	else "Non-existent ***" end if);
	
	print("\nadding ['Jack', 6]: "); [db,younger_id] := db with ["Jack", 6]; 
			-- add a younger jack
	
	print("testing the basic iterators again: list records containing '6'");
	for key in (jack_recs := db.contains("A:6")) loop 
		print(db(key)); 
	end loop;
	
	print("*** Print all records containing 'Jack' ***");
	for key in (jack_recs := db.contains("N:Jack")) loop 
		print(db(key)); 
	end loop;

	print("\n*** Deleting a record -- the 'Jack' with A:72 ***");
	db(jack_id) := OM;
	
	print("*** Print all records containing 'Jack' after deletion ***");
	for key in (jack_recs := db.contains("N:Jack")) loop 
		print(db(key)); 
	end loop;
	
	print("\n*** Restoration Test - Restoring ['Jack',71]");
	
	[db,jack_id2] := db with ["Jack",71]; -- the other Jack again
	
	print("*** Print all records containing 'Jack' after restoration ***");
	for key in (jack_recs := db.contains("N:Jack")) loop 
		print(db(key)); 
	end loop;
	
	print("*** Add large group of new records ***");
	for j in [1..20] loop -- add sixty new records
		[db,-] := db with ["Jack",j]; 
		[db,-] := db with ["Diana",j]; 
		[db,-] := db with ["Rachel",j];
	end loop;
	
	print("records with 'A:3'");
	print(str([db(x): x in (jnd_recs := db.contains("A:3"))])); 
	print("records with 'N:Diana'");
	print(str([db(x): x in (jnd_recs := db.contains("N:Diana"))])); 

	print("\n*** Union test - records containing 'Jack' or 'Diana' ***");
	print([db(key): key in (jd_recs := (db.contains("N:Jack") + db.contains("N:Diana")))]);
	
	print("\n*** Union test - records containing 'Rachel' or 'Diana' ***");
	print([db(key): key in (jd_recs := (db.contains("N:Rachel") + db.contains("N:Diana")))]);
	
	print("\n*** Intersection test - records containing '43' and 'Diana' ***");
	for key in (jd_recs := (db.contains("A:43") * db.contains("N:Diana"))) loop 
		print(db(key)); 
	end loop;
	
	print("\n*** Intersection test - records containing '43' and 'Rachel' ***");
	for key in (jd_recs := (db.contains("A:43") * db.contains("N:Rachel"))) loop 
		print(db(key)); 
	end loop;
	
	print("\n*** Difference test - records containing '43' and not 'Diana' ***");
	for key in (jd_recs := (db.contains("A:43") - db.contains("N:Diana"))) loop 
		print(db(key)); 
	end loop;
	
	print("\n*** Difference test - records containing '43' and not 'Rachel' ***");
	for key in (jd_recs := (db.contains("A:43") - db.contains("N:Rachel"))) loop 
		print(db(key)); 
	end loop;
	
	print("recs with Jack and '6': "); 
		-- the loop above added a second ["Jack",6] record, with a different id
	if (jack_rec6 := arb(db.contains("N:Jack") * db.contains("A:6"))) /= OM then 
		print(db(jack_rec6)); 
	end if;
	
	print("\n*** Difference test ***");
	print("recs with Jack but not '6'");
	print([db(key): key in (jn6_recs := db.contains("N:Jack") - db.contains("A:6"))]);
	
	print("\n*** Intersection test ***");
	print("recs with Jack and '6'");
	for x in (jnd_recs := db.contains("N:Jack") * db.contains("A:6")) loop 
		print(db(x)); 
	end loop;
	
	print("recs with Jack and '72': ");
	print(if (jack_rec72 := arb(db.contains("N:Jack") * db.contains("A:72"))) /= OM 
		then db(jack_rec72) 
	else "Nonexistent" end if); 
	
	print("\n*** Deleting a record -- the 'Jack' with A:10 ***");
	jack_rec10 := arb(db.contains("N:Jack") * db.contains("A:10")); 
	db(jack_rec10) := OM;
	
	print("remaining recs with Jack");
	print([db(key): key in (jack_recs := db.contains("N:Jack"))]);
	
	print("\n*** Restoration Test - Restoring ['Jack',10]");
	
	[db,jack10_id] := db with ["Jack",10]; -- the other Jack again

	print("Records with 'Jack' but not 'Diana'");
	print([db(key): key in (jnd_recs := db.contains("N:Jack") - db.contains("N:Diana"))]);
	
	print("\n*** Deleting another record -- one of the two 'Jack's with A:6 ***");
	print("Deleting ",db(lit_jack_rec := arb(db.contains("N:Jack") * db.contains("A:6"))));
	db(lit_jack_rec) := OM; -- remove this younger 'Jack'

	print("All remaining records with 'Jack'"); 
	print([db(key): key in (jack_recs := db.contains("N:Jack"))]);
	
	for j in [1..20] loop -- add sixty more records
		[db,-] := db with ["Jack",100 + j]; 
		[db,-] := db with ["Diana",100 + j]; 
		[db,-] := db with ["Rachel",100 + j];
	end loop;
	
	print("records with 'A:103'");
	print(str([db(x): x in (all3_recs := db.contains("A:103"))])); 
	
	print("records with 'N:Diana'");
	print(str([db(x): x in (jnd_recs := db.contains("N:Diana"))])); 
	print("records with 'A:3' but not 'N:Diana'");
	print(str([db(x): x in (jnd_recs := db.contains("A:3") - db.contains("N:Diana"))])); 
	print("records with 'N:Diana' but not 'A:3'");
	print(str([db(x): x in (jnd_recs := db.contains("N:Diana") - db.contains("A:3"))])); 
	print("records with 'N:Diana' and 'A:3'");
	print(str([db(x): x in (jnd_recs := db.contains("N:Diana") * db.contains("A:3"))])); 
	print("records with 'N:Diana' or 'A:3'");
	print(str([db(x): x in (jnd_recs := db.contains("N:Diana") + db.contains("A:3"))])); 
	
	print("\n ***END OF TESTS ***"); 
	
	procedure pair_to_stg(p); -- summarization routine: maps a record into a string like "N:Jack A:71"
	
		[nm,age,-] := p; return "N:" + nm + " A:" + str(age);
	
	end pair_to_stg; 

	end test;
The output produced by this program is as follows:
	Jack record: ["Jack", 71]
	test elimination of duplicates in union iteration
	[["Jack", 71]]

	*** Initial Iteration test - records containing 'A:43' ***
	["Diana", 43]
	["Rachel", 43]

	*** Testing empty arb - records containing 'Walter' ***
	<om>

	*** Initial Union test - records containing 'Jack' or 'Diana' ***
	["Jack", 71]
	["Diana", 43]

	*** Iterate again ***
	["Jack", 71]
	["Diana", 43]
	testing arb for union: list a record containing 'Jack' or 'Diana': ["Jack", 71]

	*** Initial Intersection test - records containing '43' and 'Diana' ***
	["Diana", 43]

	*** Initial Intersection test - records containing '43' and 'Rachel' ***
	["Rachel", 43]

	*** Initial Difference test - records containing '43' and not 'Diana' ***
	["Rachel", 43]

	*** Initial Difference test - records containing '43' and not 'Rachel' ***
	["Diana", 43]
	testing arb for union: list a record containing 'Jack' or 'Diana': ["Jack", 71]
	testing arb for Intersection: list a record containing '43' and 'Rachel': ["Rachel", 43]
	testing arb for Intersection: list a record containing '43' and 'Diana': ["Diana", 43]
	testing arb for Intersection: list a record containing 'Rachel' and 'Diana': <om>
	testing arb: list a record containing 'Jack': ["Jack", 71]
	testing arb: list a record containing 'Diana': ["Diana", 43]
	testing arb: list a record containing 'A:71': ["Jack", 71]
	testing arb: list a record containing 'A:43': ["Diana", 43]
	testing count: count records containing 'A:43': 2
	testing change of record, set ['Jack',71] to ['Jack',72]
	testing arb after change: list a record containing 'N:Jack': ["Jack", 72]
	testing arb after change: list a record containing 'A:71': <om>
	testing arb after change: list a record containing 'A:71': Non-existent ***
	testing arb after change: list a record containing 'A:72': ["Jack", 72]
	testing arb after change: list a record containing 'N:Diana': ["Diana", 43]
	testing arb after change: list a record containing 'A:43': ["Diana", 43]

	adding ['Jack', 6]: 
	testing the basic iterators again: list records containing '6'
	["Jack", 6]
	*** Print all records containing 'Jack' ***
	["Jack", 72]
	["Jack", 6]

	*** Deleting a record -- the 'Jack' with A:72 ***
	*** Print all records containing 'Jack' after deletion ***
	["Jack", 6]

	*** Restoration Test - Restoring ['Jack',71] ***
	*** Print all records containing 'Jack' after restoration ***
	["Jack", 6]
	["Jack", 71]
	*** Add large group of new records ***
	records with 'A:3'
	[["Jack", 3], ["Diana", 3], ["Rachel", 3]]
	records with 'N:Diana'
	[["Diana", 43], ["Diana", 1], ["Diana", 2], ["Diana", 3], 
	["Diana", 4], ["Diana", 5], ["Diana", 6], ["Diana", 7], ["Diana", 8], 
	["Diana", 9], ["Diana", 10], ["Diana", 11], ["Diana", 12], 
	["Diana", 13], ["Diana", 14], ["Diana", 15], ["Diana", 16], 
	["Diana", 17], ["Diana", 18], ["Diana", 19], ["Diana", 20]]

	*** Union test - records containing 'Jack' or 'Diana' ***
	[["Diana", 43], ["Jack", 6], ["Jack", 71], ["Jack", 1], ["Diana", 1], 
	["Jack", 2], ["Diana", 2], ["Jack", 3], ["Diana", 3], 
	["Jack", 4], ["Diana", 4], ["Jack", 5], ["Diana", 5], ["Jack", 6], 
	["Diana", 6], ["Jack", 7], ["Diana", 7], ["Jack", 8], 
	["Diana", 8], ["Jack", 9], ["Diana", 9], ["Jack", 10], ["Diana", 10], 
	["Jack", 11], ["Diana", 11], ["Jack", 12], ["Diana", 12], 
	["Jack", 13], ["Diana", 13], ["Jack", 14], ["Diana", 14], ["Jack", 15], 
	["Diana", 15], ["Jack", 16], ["Diana", 16], ["Jack", 17], 
	["Diana", 17], ["Jack", 18], ["Diana", 18], ["Jack", 19], ["Diana", 19], 
	["Jack", 20], ["Diana", 20]]

	*** Union test - records containing 'Rachel' or 'Diana' ***
	[["Diana", 43], ["Rachel", 43], ["Diana", 1], ["Rachel", 1], ["Diana", 2], 
	["Rachel", 2], ["Diana", 3], ["Rachel", 3], ["Diana", 4], 
	["Rachel", 4], ["Diana", 5], ["Rachel", 5], ["Diana", 6], ["Rachel", 6], 
	["Diana", 7], ["Rachel", 7], ["Diana", 8], ["Rachel", 8], 
	["Diana", 9], ["Rachel", 9], ["Diana", 10], ["Rachel", 10], ["Diana", 11], 
	["Rachel", 11], ["Diana", 12], ["Rachel", 12], ["Diana", 13], 
	["Rachel", 13], ["Diana", 14], ["Rachel", 14], ["Diana", 15], ["Rachel", 15], 
	["Diana", 16], ["Rachel", 16], ["Diana", 17], 
	["Rachel", 17], ["Diana", 18], ["Rachel", 18], ["Diana", 19], ["Rachel", 19], 
	["Diana", 20], ["Rachel", 20]]

	*** Intersection test - records containing '43' and 'Diana' ***
	["Diana", 43]

	*** Intersection test - records containing '43' and 'Rachel' ***
	["Rachel", 43]

	*** Difference test - records containing '43' and not 'Diana' ***
	["Rachel", 43]

	*** Difference test - records containing '43' and not 'Rachel' ***
	["Diana", 43]
	recs with Jack and '6': 
	["Jack", 6]

	*** Difference test ***
	recs with Jack but not '6'
	[["Jack", 71], ["Jack", 1], ["Jack", 2], ["Jack", 3], ["Jack", 4], 
	["Jack", 5], ["Jack", 7], ["Jack", 8], ["Jack", 9], 
	["Jack", 10], ["Jack", 11], ["Jack", 12], ["Jack", 13], ["Jack", 14], 
	["Jack", 15], ["Jack", 16], ["Jack", 17], ["Jack", 18], ["Jack", 19], 
		["Jack", 20]]

	*** Intersection test ***
	recs with Jack and '6'
	["Jack", 6]
	["Jack", 6]
	recs with Jack and '72': 
	Nonexistent

	*** Deleting a record -- the 'Jack' with A:10 ***
	remaining recs with Jack
	[["Jack", 6], ["Jack", 71], ["Jack", 1], ["Jack", 2], ["Jack", 3], 
	["Jack", 4], ["Jack", 5], ["Jack", 6], ["Jack", 7], 
	["Jack", 8], ["Jack", 9], ["Jack", 11], ["Jack", 12], ["Jack", 13], 
	["Jack", 14], ["Jack", 15], ["Jack", 16], ["Jack", 17], 
	["Jack", 18], ["Jack", 19], ["Jack", 20]]

	*** Restoration Test - Restoring ['Jack',10] ***
	[["Jack", 6], ["Jack", 71], ["Jack", 1], ["Jack", 2], ["Jack", 3], 
	["Jack", 4], ["Jack", 5], ["Jack", 6], ["Jack", 7], 
	["Jack", 8], ["Jack", 9], ["Jack", 11], ["Jack", 12], ["Jack", 13], 
	["Jack", 14], ["Jack", 15], ["Jack", 16], ["Jack", 17], 
	["Jack", 18], ["Jack", 19], ["Jack", 20], ["Jack", 10]]

	*** Deleting another record -- one of the two 'Jack's with A:6 ***
	Deleting ["Jack", 6]
	All remaining records with 'Jack'
	[["Jack", 71], ["Jack", 1], ["Jack", 2], ["Jack", 3], ["Jack", 4], 
	["Jack", 5], ["Jack", 6], ["Jack", 7], ["Jack", 8], 
	["Jack", 9], ["Jack", 11], ["Jack", 12], ["Jack", 13], ["Jack", 14], 
	["Jack", 15], ["Jack", 16], ["Jack", 17], ["Jack", 18], 
	["Jack", 19], ["Jack", 20], ["Jack", 10]]
	records with 'A:103'
	[["Jack", 103], ["Diana", 103], ["Rachel", 103]]
	records with 'N:Diana'
	[["Diana", 43], ["Diana", 1], ["Diana", 2], ["Diana", 3], ["Diana", 4], 
	["Diana", 5], ["Diana", 6], ["Diana", 7], 
	["Diana", 8], ["Diana", 9], ["Diana", 10], ["Diana", 11], ["Diana", 12], 
	["Diana", 13], ["Diana", 14], ["Diana", 15], 
	["Diana", 16], ["Diana", 17], ["Diana", 18], ["Diana", 19], ["Diana", 20], 
	["Diana", 101], ["Diana", 102], ["Diana", 103], 
	["Diana", 104], ["Diana", 105], ["Diana", 106], ["Diana", 107], 
	["Diana", 108], ["Diana", 109], ["Diana", 110], 
	["Diana", 111], ["Diana", 112], ["Diana", 113], ["Diana", 114], 
	["Diana", 115], ["Diana", 116], ["Diana", 117], 
	["Diana", 118], ["Diana", 119], ["Diana", 120]]
	records with 'A:3' but not 'N:Diana'
	[["Jack", 3], ["Rachel", 3]]
	records with 'N:Diana' but not 'A:3'
	[["Diana", 43], ["Diana", 1], ["Diana", 2], ["Diana", 4], ["Diana", 5], 
	["Diana", 6], ["Diana", 7], ["Diana", 8], 
	["Diana", 9], ["Diana", 10], ["Diana", 11], ["Diana", 12], ["Diana", 13], 
	["Diana", 14], ["Diana", 15], ["Diana", 16], 
	["Diana", 17], ["Diana", 18], ["Diana", 19], ["Diana", 20], ["Diana", 101], 
	["Diana", 102], ["Diana", 103], 
	["Diana", 104], ["Diana", 105], ["Diana", 106], ["Diana", 107], 
	["Diana", 108], ["Diana", 109], ["Diana", 110], 
	["Diana", 111], ["Diana", 112], ["Diana", 113], ["Diana", 114], 
	["Diana", 115], ["Diana", 116], ["Diana", 117], 
	["Diana", 118], ["Diana", 119], ["Diana", 120]]
	records with 'N:Diana' and 'A:3'
	[["Diana", 3]]
	records with 'N:Diana' or 'A:3'
	[["Diana", 43], ["Diana", 1], ["Diana", 2], ["Jack", 3], ["Diana", 3], 
	["Rachel", 3], ["Diana", 4], ["Diana", 5], 
	["Diana", 6], ["Diana", 7], ["Diana", 8], ["Diana", 9], ["Diana", 10], 
	["Diana", 11], ["Diana", 12], ["Diana", 13], 
	["Diana", 14], ["Diana", 15], ["Diana", 16], ["Diana", 17], ["Diana", 18], 
	["Diana", 19], ["Diana", 20], ["Diana", 101], 
	["Diana", 102], ["Diana", 103], ["Diana", 104], ["Diana", 105], 
	["Diana", 106], ["Diana", 107], ["Diana", 108], 
	["Diana", 109], ["Diana", 110], ["Diana", 111], ["Diana", 112], 
	["Diana", 113], ["Diana", 114], ["Diana", 115], 
	["Diana", 116], ["Diana", 117], ["Diana", 118], ["Diana", 119], 
	["Diana", 120]]

 	***END OF TESTS ***
The following comments will aid in understanding this output. We begin by creating a database, into which a first record '["Jack",71]' is inserted and then retrieved. (The summarization routine for the database simply maps such records into strings like "N:Jack A:71", ignoring all but the first two components of each record.) Next two additional records, '["Diana",43]' and '["Rachel",43]' are inserted, and then retrieved using the iterator 'db.contains("A:43")' (which iterates over both of these records, since both contain '43' as their second component.) The empty iterator 'db.contains("N:Walter")' is then used in combination with the 'arb' operation, which returns OM since no record contains the name 'Walter'.

Next we iterate over all records containing either of the names 'Jack' or 'Diana'. The union iterator 'db.contains("N:Jack") + db.contains("N:Diana")' is saved in a variable and then used a second time to show that this is possible, and a third time in combination with the 'arb' operation.

Next the corresponding intersection and difference iterators are invoked, either to support iterations or in connection with 'arb' or the count operator '#', and are seen to behave as expected. After this, we use the database record-change operation db(x) := y to change a record, which is then retrieved in changed form, using an iterator, while the previous record is seen to be absent, and other records in the database are seen to be unchanged. Then a second record '["Jack", 6]' containing the name "Jack" is added to the database and retrieved, first using 'db.contains("A:6")' (which retrieves only '["Jack", 6]') and then 'db.contains("N:Jack")' (which retrieves both '["Jack", 6]' and '["Jack", 72]'). The record '["Jack", 72]' is then deleted by using the record-change operation in the form db(x) := OM, its absence verified, and the original '["Jack",71]' record added back to the database. Following this, we add 60 records ["Jack",j], ["Diana",j], and ["Rachel",j], for j from 1 to 20, to the database, verify via various retrievals and iterations that this has been done correctly, delete a record and verify the correctness of this deletion, and then re-insert the deleted record and re-verify. Finally we insert another 60 records, now containing ["Jack",j], ["Diana",j], and ["Rachel",j] for j from 101 to 120, and re-verify.

12.3 A Second Example. Value-semantics of SETL Databases, free form of records, databases as ordinary objects.

The following short program illustrates several other significant properties of SETL databases, specifically that

  1. they obey value-semantics rather than pointer semantics rules;

  2. records in them need not have any fixed structure;

  3. databases behave like ordinary SETL objects, i.e. can be put into tuples and/or sets, can be passed as parameters to subroutines, etc.

The program is:

	program test; -- test of SETL database class
	use SETL_database;
	
	db := SETL_database(pair_to_stg); -- create an empty database, to store test record
	[db,jack_id] := db with ["Jack",71]; -- add a first record
	[db,diana_id] := db with ["Diana",43]; -- add a second record 
	[db,rachel_id] := db with ["Rachel",43]; -- add a third record 
	
	print("Database records");
	for key in (db.contains("A:43") + db.contains("A:71")) loop 
		print(db(key)); 
	end loop;

	db_save := db;  -- save the original database
	
	db(jack_id) := db(jack_id) with "a swell guy";
	db(diana_id) := ["Diana",44]; db(diana_id) := db(diana_id) with "a swell gal";
	db(rachel_id) := db(rachel_id) with "a folksinger";
	db(rachel_id) := db(rachel_id) with "remember upcoming club date";
	
	print("\nEdited database");
	for key in (db.contains("A:43") + db.contains("A:71")) loop 
		print(db(key)); 
	end loop;

	print("\nSaved database");
	for key in (db_save.contains("A:43") + db_save.contains("A:71")) loop 
		print(db_save(key)); 
	end loop;
	
	print("\nComparison of records in the two databases");
	print("db_save(jack_id): ",db_save(jack_id)); 
	print("db(jack_id): ",db(jack_id));
	print("db_save(rachel_id): ",db_save(rachel_id)); 
	print("db(rachel_id): ",db(rachel_id));
	print("db_save(diana_id): ",db_save(diana_id)); 
	print("db(diana_id): ",db(diana_id));

	database_list := [db,db_save];

	print("\nFirst database");
	for key in (database_list(1).contains("A:43") + database_list(1).contains("A:71")) loop 
		print(database_list(1)(key)); 
	end loop;

	print("\nSecond database");
	for key in (database_list(2).contains("A:43") + database_list(2).contains("A:71")) loop 
		print(database_list(2)(key)); 
	end loop;
	
	procedure pair_to_stg(p); -- convert name, age pair to string
	
		[nm,age,-] := p; return "N:" + nm + " A:" + str(age);
	
	end pair_to_stg; 

	end test;
The preceding program produces the following output.
	Database records
	["Jack", 71]
	["Diana", 43]
	["Rachel", 43]

	Edited database
	["Jack", 71, "a swell guy"]
	["Rachel", 43, "a folksinger", "remember upcoming club date"]

	Saved database
	["Jack", 71]
	["Diana", 43]
	["Rachel", 43]

	Comparison of records in the two databases
	db_save(jack_id): ["Jack", 71]
	db(jack_id): ["Jack", 71, "a swell guy"]
	db_save(rachel_id): ["Rachel", 43]
	db(rachel_id): ["Rachel", 43, "a folksinger", "remember upcoming club date"]
	db_save(diana_id): ["Diana", 43]
	db(diana_id): ["Diana", 44, "a swell gal"]

	First database
	["Jack", 71, "a swell guy"]
	["Rachel", 43, "a folksinger", "remember upcoming club date"]

	Second database
	["Jack", 71]
	["Diana", 43]
	["Rachel", 43]

The program seen above creates a database and inserts three records, each a simple pair, into it. The database is then copied to a second variable 'db_save', and the original database is modified by editing each of its three records, to which additional string components are freely added. (This kind of edit, which is a bit clumsy in a standard SQL database since it changes the number of SQL 'columns', illustrates one of the advantages of SETL databases: since records in them are entirely free-form, additional elements of any kind can be added to records in a completely dynamic way.) The edited records are retrieved from the edited database using an iterator, but then also from the saved database, which is seen to contain the original records, unchanged. This illustrates a second advantage of SETL databases, their value semantics, on which a 'transactional' capability can be built in the manner noted above. Finally we include both databases in a tuple, from which they are then recovered and used. This illustrates the fact that databases behave like ordinary SETL objects, and so can be put into tuples and/or sets, passed as parameters to subroutines, etc.

12.4 More Examples: Databases which store Free-form Text. Next we give a few examples with a somewhat more realistic application flavor. The first of these, seen below, reads a file of text divided into paragraphs by blank lines, and inserts these paragraphs into a database. Paragraphs are summarized by collecting their words, reducing these to lowercase, suppressing any words which belong to a file of 'excluded_wds', and depluralizing them by dropping any terminal 's'. After building the database, the code seen below executes a few sample queries, some of which illustrate the use of Boolean combinations of search keys.

	program test; -- test of SETL database class
	use SETL_database;
	use string_utility_pak,get_lines_pak; 

	const punc := "';:<>,.?/\\|][}{=-_+\t ~!@#$%^&*()_-+=\""; 
		-- punctuation and whitespace characters
	var excluded_wds;  	
		-- we set up a global collection of words excluded from summaries
	
	excluded_wds := breakup("" +/ [line + " ": line in get_lines("excluded_wds")],punc);	
	excluded_wds := {case_change(wd,"ul"): wd in excluded_wds | wd /= ""};

	db := SETL_database(record_to_summary_stg); 
		-- create an empty database, to store test record
	lines := get_lines("database_text.txt");  
		-- read the input text for the database

	paragraphs := current_paragraph := [];  
		-- we will collect list of paragraphs, delimited by empty lines
	
	for line in lines loop   
		--loop over all lines,collecting paragraphs

		span(line,"\t "); 
		if line /= "" then 
			current_paragraph with:= line; continue; 
		end if;

	if current_paragraph /= [] then  
		-- collect and rebuild the current_paragraph, adding it to the database
	[db,-] := db with ("" +/ [pline + " ": pline in current_paragraph]); 
	current_paragraph := [];
	end if;

	end loop;

	if current_paragraph /= [] then  
		-- collect the final paragraph
	[db,-] := db with ("" +/ [pline + " ": pline in current_paragraph]); 
	end if;
	        -- now some database queries
	print("\nParagraphs containing 'object' *** ");
	for x in db.contains("object") loop 
		print("\n",db(x)); 
	end loop;
	
	print("\nParagraphs containing 'object' and 'value' *** ");
	for x in db.contains("object") * db.contains("value") loop 
		print("\n",db(x)); 
	end loop;
	
	print("\nParagraphs containing 'insertion' and 'fast' *** ");
	for x in db.contains("insertion") * db.contains("fast") loop 
		print("\n",db(x)); 
	end loop;
	
	procedure record_to_summary_stg(stg); -- convert paragraph to summary string
	
	wd_list := breakup(stg,punc);  -- form list of all words in paragraph
	wd_set := {drop_plural(case_change(wd,"ul")): wd in wd_list | 
		#wd > 1 and wd notin excluded_wds and 
	 (wd(1) notin 		
		"0123456789#:=,./?;'[]=-)(*&^%$#@!~`\n\t\r\"\\|{}+\x3E\x3C")};
	 
	return ("" +/ [" " + wd: wd in wd_set])(2..);  
		-- return collection of words as a blank-delimited string
	
	end record_to_summary_stg; 

	procedure drop_plural(stg); 
		x := rmatch(stg,"s"); 
		return stg; 
	end drop_plural;  -- drop plural ending from word

	end test;

The sample text file which this code (and the variants of it which follow) reads into its database is simply:

The underlying structure used to realize the large objects supported by SETL (and most other databases) is the B-tree. This famous structure, can be thought of as a way of maintaining large tuples in a form which allows fast execution of all the standard tuple operations, i.e., component value access, component change, component insertion, slicing, and slice assignment. B-trees can be thought of as arising by recursive elaboration of the simple idea of dividing very long tuples T (e.g. tuples of length 1 million) into sections of some smaller length, e.g., length approximately 1000.

If this is done, then a tuple of length 1 million will be represented by a tuple R of tuples, each of length roughly 1000, the whole list R also being of length approximately 1000. To make an insertion into the middle of such a structure (regarded as a representation of the large tuple T that would be obtained by concatenating all the components of R) is then much faster than it would be in a flat representation of T. Insertion in the middle of a flat representation might require all following components to be moved, so if T is of length one million, 500,000 of its components might have to be moved. But in the alternate representation suggested above the insertion is made in a sub-tuple of R having length roughly 1000 that is easily located, and so should require no more than 500 components to be moved. This gives a speedup that can be as high as 1000.

Database objects have value, rather than pointer semantics. This rule is implemented in an efficient way.

As already stated, the B-tree structure arises by recursive elaboration of the idea just described. Rather than arbitrarily dividing long tuples like T into some fixed number of pieces in the two-level way sketched above, we divide it into a tuple of tuples of tuples ..., down to as many levels as needed, limiting each of the tuples ('nodes') that appear at each level of this nested tree structure to some convenient maximum length. It turns out to be most convenient to insist that (with the one exception of the tree root) each 'node', (i.e., tuple) occurring in the data structure have a length lying between some integer n and 2n - 1.

This ensures that insertions remain very fast while at the same time limiting the number of tree levels that must be searched to locate a given component of the large tuple T that the tree structure R represents. Specifically, this number of levels can be no more than log n L where L is the length of the tuple represented.

The process of insertion of a new component into the large tuple represented by the recursive tree structure R will force insertion of a component at the bottom tree level, and this may cause the tree node into which the insertion is made to overflow the maximum length 2n - 1 to which it is supposed to be limited. But when this happens we can simply divide that node into two nodes of length n, inserting one of these at the next level up in the tree. Even though the process of recursive insertion thereby triggered may propagate all the way up the tree to its root, it remains orderly at each recursive level, as the more detailed code shown below makes plain.

The values stored in B-trees appear in leaf nodes at their bottom level. These leaf nodes are accessed via pointers stored in nodes one level higher.

If text stored in a database like that appearing in the preceding example is divided into longer paragraphs or chapters, it may be useful to support searches by combinations of keywords constrained to lie near each other in the text being searched. The following example shows one way in which such functionality can be supported. Assuming that 'near' can be interpreted as meaning 'separated by no more than 4 intervening words', we divide the text into overlapping zones of length 10 words. A new zone is started every 5 words, so that nearby words are certain to belong to some common zone. For each such zone we insert a 'zone record' containing its words, and for each paragraph collect the ids of the zones in the paragraph. Each time a paragraph completes, we insert a 'paragraph record' for it, indexed by the list of ids of the zones in the paragraph; these ids are sanitized for use as index words by representing them as hex strings prefixed by the character 'z'. To search for paragraphs containing two words 'wd1' and 'wd2' near each other, we can then use the compound iterator

zone in contains(wd1) * contains(wd2), parag in contains(hex_of(zone)).

This is seen in the following code.

program test; -- test of SETL database class
	use SETL_database;
	use string_utility_pak,get_lines_pak; 

	const punc := "';:<>,.?/\\|][}{=-_+\t ~!@#$%^&*()_-+=\""; 
		-- punctuation and whitespace characters
	var excluded_wds;  
		-- we set up a global collection of words excluded from summaries
	var db;   -- make the database globally accessible, since the procedures below add records to it
	var zod_list;  -- list of zone ids, used by record summary routine
	
	excluded_wds := breakup("" +/ [line + " ": line in get_lines("excluded_wds")],punc);
	excluded_wds := {case_change(wd,"ul"): wd in excluded_wds | wd /= ""};

	db := SETL_database(record_to_summary_stg); 
		-- create an empty database, to store test record
	lines := get_lines("database_text.txt");  
		-- read the input text for the database

	paragraphs := current_paragraph := []; 
		-- we will collect list of paragraphs, delimited by empty lines
	
	for line in lines loop   
		-- loop over all lines,collecting paragraphs

	 span(line,"\t "); 
	 if line /= "" then current_paragraph with:= line; continue; end if;

	 if current_paragraph /= [] then  
	 	-- collect and restart the current_paragraph, adding it to the database
	 digest_paragraph(current_paragraph);
	 current_paragraph := []; 
	 end if;

	end loop;
	
	if current_paragraph /= [] then  -- collect the final paragraph
	 digest_paragraph(current_paragraph);
	end if;
	         -- now some database queries
	print("\nParagraphs containing 'value' and 'pointer' nearby *** ");
	id_seen := {};
	for zone in db.contains("value") * db.contains("pointer"),
		 parag in db.contains(hex_of(zone)) | parag notin id_seen loop 
	 id_seen with:= parag; print("\n",db(parag)); 
	end loop;

	print("\nParagraphs containing 'value' *** ");
	 id_seen := {};
	for zone in db.contains("value"),
		 parag in db.contains(hex_of(zone)) | parag notin id_seen loop 
		id_seen with:= parag; 
		print("\n",db(parag)); 
	end loop;
	
	print("\nParagraphs containing 'pointer' *** ");
	 id_seen := {};
	 for zone in db.contains("pointer")
	 	, parag in db.contains(hex_of(zone)) | parag notin id_seen loop 
	 	id_seen with:= parag; print("\n",db(parag)); 
	 end loop;

	procedure digest_paragraph(para);

	 zones_of_paragraph := zones_of(para_stg := "" +/ [" " + line: line in para]); 
	 zod_list := OM; zone_ids := []; 
	 for zone in zones_of_paragraph loop 
	 	[db,zone_id] := db with zone; 
	 	zone_ids with:= zone_id; 
	 end loop;
	 zod_list := join([hex_of(zod): zod in zone_ids]," ");
	 [db,-] := db with para_stg;

	end digest_paragraph;
	 
	procedure zones_of(stg); -- find list of zones in paragraph
	
	 wd_list :=[wd in breakup(stg,punc) | wd /= "" 
	 	and case_change(wd,"ul") notin excluded_wds];
	 			-- form list of all words in paragraph
	 zones_in_para := [];   -- collect list of all zones in paragraph
	 
	 for j in [1,6..nwl := #wd_list] loop -- loop over the zones
	  zone_wds := wd_list(j..(j + 9) min nwl);  -- the words in the zone
	  wd_set := {drop_plural(ccw := case_change(wd,"ul")): wd in zone_wds | #wd > 1 and 
	  not (wd(1) in " 0123456789#:=,./?;'[]=-)(*&^%$#@!~`\n\t\r\"\\|{}+\x3E\x3C")};
	  if (zone_stg := ("" +/ [" " + zw: zw in wd_set])) /= "" then 
			zones_in_para with:= zone_stg(2..); 
	  end if;
	 end loop;

	 return zones_in_para; 
	
	 end zones_of; 
	
	procedure record_to_summary_stg(stg); -- convert paragraph to summary string
	 		-- note that the database we are setting up contains records of two 
	 		-- different kinds, namely zone records representing short stretches of words,
	 		-- and paragraph records representing entire paragraphs. 
	 		-- zone records are added by the statement [db,zone_id] := db with zone
	 		-- seen above; paragraph records are added by the statement 
	 		-- [db,-] := db with para_stg;
	 		-- This summary string routine causes the summary of a zone to be the words
	 		-- within it, and the summary of a paragraph to be the zones within it. 
	 		-- This has the advantage that intersection-iterators like 
	 		
	 		-- for zone in db.contains("value") * db.contains("pointer")...
	 		
	 		-- can be used to iterate over nearby pairs and triples of words, 
	 		-- but the disadvantage that iterations over the paragraphs containing
	 		-- a single word must also be written using a compound iterator
	 		-- that iterates over zones first.
	 		 
	 if zod_list /= OM then return zod_list; end if;
	 
	 
	 wd_list := breakup(stg,punc);  -- form list of all words in paragraph
	 wd_set := {drop_plural(case_change(wd,"ul")): wd in wd_list | 
	 	#wd > 1 and wd notin excluded_wds and 
	 	not (wd(1) in 
		" 0123456789#:=,./?;'[]=-)(*&^%$#@!~`\n\t\r\"\\|{}+\x3E\x3C")};
	
	 return ("" +/ [" " + wd: wd in wd_set]);  
	 	-- return collection of words as a blank-delimited string
	 
	end record_to_summary_stg; 

	procedure drop_plural(stg);  -- drop plural ending from word
	 us := rmatch(stg,"us"); if us /= "" then return stg + us; end if;
	 us := rmatch(stg,"ss"); if us /= "" then return stg + us; end if;
	 rmatch(stg,"s"); return stg; 
	end drop_plural; 

	procedure hex_of(id); 
	 return "z" +/ [char((ac := abs(c)) / 16 + 97) 
		+ char((ac mod 16) + 97): c in id]; 
	end hex_of;
	
end test;

Given the sample text listed above as input, the program seen above produces the following output.
Paragraphs containing 'value' and 'pointer' nearby ***

Database objects have value, rather than pointer semantics. This rule is implemented in an efficient way.

Paragraphs containing 'value' ***

The underlying structure used to realize the large objects supported by SETL (and most other databases) is the B-tree. This famous structure, can be thought of as a way of maintaining large tuples in a form which allows fast execution of all the standard tuple operations, i.e., component value access, component change, component insertion, slicing, and slice assignment. B-trees can be thought of as arising by recursive elaboration of the simple idea of dividing very long tuples T (e.g. tuples of length 1 million) into sections of some smaller length, e.g., length approximately 1000.

Database objects have value, rather than pointer semantics. This rule is implemented in an efficient way.

The values stored in B-trees appear in leaf nodes at their bottom level. These leaf nodes are accessed via pointers stored in nodes one level higher.

Paragraphs containing 'pointer' ***

Database objects have value, rather than pointer semantics. This rule is implemented in an efficient way.

The values stored in B-trees appear in leaf nodes at their bottom level. These leaf nodes are accessed via pointers stored in nodes one level higher.

Note that the keywords 'value' and 'pointer' occur both in the paragraph

Database objects have value, rather than pointer semantics. This rule is implemented in an efficient way.
and in the paragraph
The values stored in B-trees appear in leaf nodes at their bottom level. These leaf nodes are accessed via pointers stored in nodes one level higher.
However, the code seen above retrieves only the first of these paragraphs, since only in this paragraph do the two words 'value' and 'pointer' occur within 5 words of each other.

In the code seen above, the 'digest_paragraph' procedure used to process each of the paragraphs occurring in the source text first calls the 'zones_of' procedure to return the list of zones in the paragraph. The 'zones_of' procedure builds this list by simply stepping over the words in the paragraph in steps of 5 and collecting the group of 10 words beginning at each starting point. These short text segments are then inserted into the database, and the ids of the resulting records collected into a globally accessible variable 'zod_list'. The paragraph itself is then entered into the database, causing the normal call to 'record_to_summary_stg', which however behaves unusually since it detects 'zod_list' and returns it as the summary of the paragraph passed to it. This causes the paragraph to be indexed by each of the zone ids in 'zod_list', allowing nested iterators like the

	zone in contains(wd1) * contains(wd2), parag in contains(hex_of(zone)).

seen in the code above to behave in the expected way.

The clauses 'parag notin id_seen' appearing in the queries seen above, and the other manipulations of the variable 'id_seen', simply serve to prevent records from appearing repeatedly in our iterations.

Two shortcomings of the code seen above can be noted:

  1. It only supports searches for pairs of words occurring nearby, but not for pairs of words occurring anywhere in a paragraph.

  2. Words occurring in the source text are uselessly stored three times, once in the paragraph records formed, then again in each of two overlapping zone records.

The variant code seen below cures these problems in the following way. Words (logically) inserted into zone records, but not words inserted into paragraph records, are prefixed with a '.' (i.e. a 'dot'). This allows proximity-based searches like those seen above to be written using iterators like

zone in db.contains(".value") * db.contains(".pointer"), parag in db.contains(hex_of(zone))

while searches ignoring proximity can be written in the normal way, e.g. as

parag in db.contains("value") * db.contains("pointer")

One can even use hybrid combinations like

zone in db.contains(".value") * db.contains(".pointer"), parag in db.contains(hex_of(zone)) * db.contains("nodes")

which finds all paragraphs containing the three words 'value', 'pointer', and 'node', constraining the first two of these words, but not the third, to lie near each other.

The code which follows is changed only slightly from its preceding version. Insertion of each successive zone into the database is preceded by a call 'munge(zone)', which extracts the words appearing in the zone, prefixes each of them by a '.', posts the words to 'zod_list' as a blank-delimited string, and returns a null string (which becomes the 'zone record') associated with the zone; however the changes also made to the summarization routine 'record_to_summary_stg' ensure that each zone is properly indexed by the words it contains. The paragraph-level processing collects the ids of all these zones, and posts them to 'zod_list' as a tuple. Then, when 'record_to_summary_stg' is called at the paragraph level, the ids in 'zod_list' are prefixed to the paragraph summary, causing the paragraph to be indexed both by the ids of the zones in it and by the non-excluded words which it contains.

program test; -- test of SETL database class
	use SETL_database;
	use string_utility_pak,get_lines_pak; 

	const punc := "';:<>,.?/\\|][}{=-_+\t ~!@#$%^&*()_-+=\""; 
			-- punctuation and whitespace characters
	var excluded_wds; 
		-- we set up a global collection of words excluded from summaries
	var db;   -- make the database globally accessible, since the procedures below add records to it
	var zod_list;  -- list of zone ids, used by record summary routine
	
	excluded_wds := breakup("" +/ [line + " ": line in get_lines("excluded_wds")],punc);
	excluded_wds := {case_change(wd,"ul"): wd in excluded_wds | wd /= ""};

	db := SETL_database(record_to_summary_stg);
		 -- create an empty database, to store test record
	lines := get_lines("database_text.txt");  
		-- read the input text for the database

	paragraphs := current_paragraph := [];  
		-- we will collect list of paragraphs, delimited by empty lines
	
	for line in lines loop   
		--loop over all lines,collecting paragraphs

	 span(line,"\t "); 
	 if line /= "" then 
	 	current_paragraph with:= line; 
	 	continue; 
	 end if;

	 if current_paragraph /= [] then  
	 	-- collect the current_paragraph, adding it to the database
	 digest_paragraph(current_paragraph);
	 current_paragraph := []; 
	 end if;

	end loop;
	
	if current_paragraph /= [] then  -- collect the final paragraph
	 digest_paragraph(current_paragraph);
	end if;
	         -- now some database queries
	print("\nParagraphs containing 'value' and 'pointer' nearby *** ");
	id_seen := {};
	for zone in db.contains(".value") * db.contains(".pointer"), 
		parag in db.contains(hex_of(zone)) | parag notin id_seen loop 
	 id_seen with:= parag; print("\n",db(parag)); 
	end loop;

	print("\nParagraphs containing 'value' and 'pointer' anywhere *** ");
	id_seen := {};
	for parag in db.contains("value") * db.contains("pointer") | parag notin id_seen loop 
	 id_seen with:= parag; print("\n",db(parag)); 
	end loop;

	print("\nParagraphs containing 'value' and 'pointer' nearby, together with 'rule' *** ");
	id_seen := {};
	for zone in db.contains(".value") * db.contains(".pointer"), 
		parag in (db.contains(hex_of(zone)) * db.contains("rule")) 
			| parag notin id_seen loop 
		id_seen with:= parag; print("\n",db(parag)); 
	end loop;

	procedure digest_paragraph(para);  
		-- finds zones in paragraph and inserts them into database

	 zones_of_paragraph := zones_of(para_stg := "" +/ [" " + line: line in para]); 
	 zone_ids := []; 
	 for zone in zones_of_paragraph loop 
	 [db,zone_id] := db with munge(zone); 
	 zone_ids with:= zone_id; end loop;
	 zod_list := [hex_of(zod): zod in zone_ids];  
	 	-- build the paragraph-level zod_list, leaving it as a tuple
	 [db,-] := db with para_stg;

	end digest_paragraph;
	
	procedure munge(zone_stg);  
		-- prefix the words of a zone with '.' and post them globally; return the nominal zone record
	 wds_list := breakup(zone_stg," ");  
	 	 -- break into words using blanks as delimiters
	 zod_list := join(["." + wd: wd in wds_list]," "); 
	 	-- prefix words in zone by '.' and rejoin into string, posting this globally
	 return "";      -- return empty string as nominal contents of zone
	end munge;
	 
	procedure zones_of(stg); 
		-- find list of zones in paragraph
	
	 wd_list :=[wd in breakup(stg,punc) | 
	 	wd /= "" and case_change(wd,"ul") notin excluded_wds];  
	 		 -- form list of all words in paragraph
	 zones_in_para := [];   -- collect list of all zones in paragraph
	 
	 for j in [1,6..nwl := #wd_list] loop 
	 	-- loop over the zones
	  zone_wds := wd_list(j..(j + 9) min nwl);  
	  	-- the words in the zone
	  wd_set := {drop_plural(ccw := case_change(wd,"ul")): wd in zone_wds | 
	  	#wd > 1 and (wd(1) notin 
	  	" 0123456789#:=,./?;'[]=-)(*&^%$#@!~`\n\t\r\"\\|{}+\x3E\x3C")};
 
	  if (zone_stg := ("" +/ [" " + zw: zw in wd_set])) /= "" then 
		zones_in_para with:= zone_stg(2..); 
	  end if;
	 end loop;

	 return zones_in_para; 
	
	 end zones_of; 
	
	procedure record_to_summary_stg(stg); -- convert paragraph to summary string
	 
	 if is_string(zod_list) then return zod_list; end if; 
	 	-- this clause processes zones
	 
	 wd_list := breakup(stg,punc);  
	 	-- paragraph processing continues here: form list of all words in paragraph
	 wd_set := {drop_plural(case_change(wd,"ul")): wd in wd_list | #wd > 1 
			and wd notin excluded_wds and 
	 not (wd(1) in 
			" 0123456789#:=,./?;'[]=-)(*&^%$#@!~`\n\t\r\"\\|{}+\x3E\x3C")};

	 return join(zod_list," ") +/ [" " + wd: wd in wd_set]; 
	 	-- return collection of words, prefixed by ids in zod_list, 
	 	-- as a blank-delimited string
	 
	end record_to_summary_stg; 

	procedure drop_plural(stg);  -- drop plural ending from word
	 us := rmatch(stg,"us"); if us /= "" then return stg + us; end if;
	 us := rmatch(stg,"ss"); if us /= "" then return stg + us; end if;
	 rmatch(stg,"s"); return stg; 
	end drop_plural; 

	procedure hex_of(id); 
		return "z" +/ [char((ac := abs(c)) / 16 + 97) 
		+ char((ac mod 16) + 97): c in id]; 
	end hex_of;
	
end test;

The code seen above produces the following output, which shows that each of the three sample iterations appearing in this example behaves properly.

Paragraphs containing 'value' and 'pointer' nearby ***

Database objects have value, rather than pointer semantics. This rule is implemented in an efficient way.

Paragraphs containing 'value' and 'pointer' anywhere ***

Database objects have value, rather than pointer semantics. This rule is implemented in an efficient way.

The values stored in B-trees appear in leaf nodes at their bottom level. These leaf nodes are accessed via pointers stored in nodes one level higher.

Paragraphs containing 'value' and 'pointer' nearby, together with 'rule' ***

Database objects have value, rather than pointer semantics. This rule is implemented in an efficient way.

The examples seen above hint at the way in which SETL databases can be used to organize proximity-based queries in geographic information applications.

12.2. A First Sketch of the SETL Database Implementation concept: simplest form of the basic B-tree structure.

The underlying structure used to realize the large objects supported by SETL (and most other databases) is the B-tree. This famous structure can be thought of as a way of maintaining large tuples in a form which allows fast execution of all the standard tuple operations, i.e., component access, component change, component insertion, slicing, and slice assignment. B-trees can be thought of as arising by recursive elaboration of the simple idea of dividing very long tuples T (e.g. tuples of length 1 million) into sections of some smaller length, e.g., length approximately 1000. If this is done, then a tuple of length 1 million will be represented by a tuple R of tuples, each of length roughly 1000, the whole list R also being of length approximately 1000. To make an insertion into the middle of such a structure (regarded as a representation of the large tuple T that would be obtained by concatenating all the components of R) is then much faster than it would be in a flat representation of T. Insertion in the middle of a flat representation might require all following components to be moved, so if T is of length one million, 500,000 of its components might have to be moved. But in the alternate representation suggested above the insertion is made in a sub-tuple of R having length roughly 1000 that is easily located, and so should require no more than 500 components to be moved. This gives a speedup that can be as high as 1000.

As already stated, the B-tree structure arises by recursive elaboration of the idea just described. Rather than arbitrarily dividing a long tuple like T into some fixed number of pieces in the two-level way sketched above, we divide it into a tuple of tuples of tuples ..., down to as many levels as needed, limiting each of the tuples ('nodes') that appear at each level of this nested tree structure to some convenient maximum length. It turns out to be most convenient to insist that (with the one exception of the tree root) each 'node' (i.e., tuple) occurring in the data structure have a length lying between some integer n and 2n - 1. This ensures that insertions remain very fast while at the same time limiting the number of tree levels that must be searched to locate a given component of the large tuple T that the tree structure R represents. Specifically, this number of levels can be no more than log_n L (the logarithm of L to the base n), where L is the length of the tuple represented.

The process of insertion of a new component into the large tuple represented by the recursive tree structure R will force insertion of a component at the bottom tree level, and this may cause the tree node into which the insertion is made to overflow the maximum length 2n - 1 to which it is supposed to be limited. But when this happens we can simply divide that node into two nodes of length n, inserting one of these at the next level up in the tree. Even though the process of recursive insertion thereby triggered may propagate all the way up the tree to its root, it remains orderly at each recursive level, as the more detailed code shown below makes plain.

Another crucial advantage of the recursive B-tree structure is that it allows value semantics to be implemented efficiently for large tuples. If, for example, we copy a tuple t1 into a tuple t2 by writing t2 := t1, and then change a designated component of t1 by writing t1(i) := x, we only need to copy and modify those tree nodes of the changed tree t1 which lie on the tree path down to the point at which the modification occurs. All other tree nodes can simply be shared between the two trees t1 and t2. Thus the amount of node copying and modification required by the rules of value semantics is at most the product (2n - 1) * log_n L, where L is the length of the tuple represented and 2n - 1 is the maximum allowed node size.

The generic B-tree code which follows reflects these considerations, some explicitly and others implicitly. Tree nodes are represented as objects, of class 'btup', containing two vectors 'cum' and 'tup'. The 'tup' instance variable held in each tree node stores the list of child nodes of that node, or, in the case of nodes of height 1, stores the components of the represented tuple that come under it. The j'th component of the 'cum' instance variable, whose value is a tuple identical in length to 'tup', stores the total number of leaves that come under any of the nodes tup(1)...tup(j). These 'cum' vectors allow the twig with a given index i within the large tuple T that the tree represents to be located by a fast, binary-search-like descent of the tree. This is seen in the procedure self(i) that appears in the following code, which is written as a single line of recursive search code. The same is true for the component change operation seen in the procedure self(i) := x that follows. It is worth thinking through the action of this latter routine in detail to understand how much node copying it implies when a component is changed, and where these copies occur in the code.

The 'cum' vector illustrates the important idea of storing 'cumulant' quantities, derived by some addition-like operation from the children of a node, at each of the nodes of a B-tree. This idea generalizes readily. Multiple such 'cumulants' can be stored in each node. Each such cumulant supports a corresponding form of fast binary search in the tree. This idea is explored more fully in the next section.

Since all of the twigs in a B-tree are required to have the same height, that is, to lie at the same distance from the tree root, it follows that every tree node lies at the same distance from all of the twigs that come under it. This height is the third item of data stored in each node.

In respect to all of these node data items, particularly 'cum' and 'tup', node objects have value semantics. That is, change of any component of these two tuple-valued instance variables causes a copy of the changed object to be made in such a way as to guarantee that no other object referenced directly or indirectly by an independent variable is affected. It should be clear that this implies that the large tuples represented by our system of tree nodes also have value semantics.

The two most revealing routines in the code seen below are those for the operations 'self(i)' and 'self(i) := x'. After these two most essential routines, the next most crucial is the concatenation routine that follows them. Concatenation is most easily understood in the special case that the two trees being concatenated are of the same height. In this case, it should be clear that they can be concatenated simply by concatenating their top-most nodes. If this generates a top node whose length does not exceed the limit 2n - 1 on which we insist, we have only to set the cumulant vector correctly; otherwise we split the children of the over-fat top node that results into two parts, making each of them the child of a separate parent, and introducing a new top node having these nodes as its children. But if the two trees being concatenated are not of the same height, then either the first or the second will be taller than the other. Suppose, to begin with, that the first tree t1 is taller than the second tree t2. Then we descend to the rightmost child t'1 of the first tree, and recursively concatenate t'1 with the second tree. If the resulting tree c'1 is of the same height as t'1, then it replaces t'1 as the final child of t1, thereby defining a new B-tree r1 whose internally stored cumulant we simply need to adjust. But if c'1 is taller than t'1 it will have two children, each of the same height as t'1. In this case these two children replace the single child t'1 of t1, resulting in a new B-tree r1. If this r1 does not have too many children, its stored cumulant is simply adjusted and it becomes the concatenation result. But if r1 has too many children, we split these children evenly between two new parent nodes, which then become the two children of a result tree one level higher than the original t1. This conditional splitting operation, like all others in the code which follows, is performed by the utility routine 'ifsplit()' seen below.

The case of concatenations in which t2 is taller than t1 is handled symmetrically to the case just described.

The concatenation procedure we have just described is used as a subroutine by various other of the important procedures in the set seen below, including end-slice extraction, general slice extraction, and slice assignment. End-slice extraction, which is written as t(i..) where t is a B-tree object and i an integer, is performed as follows. We first locate that child t' of the top node of t whose associated sub-tree contains the leaf with index i. Let c be the total number of leaves in children of t which precede t'. Then we form the B-tree t2 whose children are those children of t which follow t', form t'(i - c..) recursively, and concatenate t'(i - c..) and t2 to get the desired result. It should be clear that a prefix-slice extraction t(1..i) could be performed in a manner symmetrical to the procedure just described, and that a general slice extraction t(i..j) can be viewed as a combination of a prefix-slice and an end-slice operation. The detailed code for general slice extraction seen in what follows does essentially this, but in a slightly optimized way which we leave it to the reader to examine.

Slice assignment t(i..j) := t* is performed as if the result to be formed had been written as t(1..i - 1) + t* + t(j + 1..). The remaining B-tree operations in the code seen below all have easy expressions in terms of the basic operations we have just described.

The collection of routines seen below also includes the procedures used to manage the iterator objects associated with our btup objects, and to handle iterations over these and their Boolean combinations. This group of routines is discussed following the code itself.

The following code reflects all of the foregoing design considerations, which it expands in detail.

 class btup;    -- B-tree class
 class var debug := false;

 procedure create();  -- create blank btup
 procedure set(t);  -- set btup object from standard SETL tuple

 procedure check_consistency();  -- check consistency of tree

end btup;

class body btup;    -- B-tree class
 const minw := 5, maxw := 2 * minw - 1;   
 	 -- minimum number of descendants, except at top
 
 var h := 1,cum := [],tup := [];   -- height, cumulant tuple, top-level tuple;
  -- Note that B-trees have value semantics, but with sharing of sub-trees.
  -- The cumulant records the number of descendants of each node, 
  -- plus those of all siblings to its left.
  
 var iter_ptr,compno;      -- iteration pointer and count
 
 procedure create();  -- create empty B-tree top from tuple
  iter_ptr := newat(); compno := newat();   
  	-- set up iteration stack pointer and compno pointer
  return self;
 end create;

 procedure self(i);  -- component extraction; 
   -- descend iteratively to desired leaf using cumulant to find child containing it
  return if h = 1 then tup(i) elseif i <= cum(1) then tup(1)(i)
    elseif exists c = cum(j) | i <= c then tup(j)(i - cum(j - 1))
     else OM end if;
 end;

 procedure self(i) := x;  -- component change
   -- descend iteratively to desired leaf using cumulant to find child containing it
   -- then change the value at this node. Note that the cumulant need not be changed 
   -- in this case, since here it represents the number of leaves beneath each node,
   -- which this operation does not change.
  if h = 1 then tup(i) := x; elseif i <= cum(1) then tup(1)(i) := x;
    elseif exists c = cum(j) | i <= c then tup(j)(i - cum(j - 1)) := x; 
     else abort("index " + str(i) + " is out of range"); end if;
 end;
 
 procedure -self;  
 	-- convert to tuple by recursively converting and concatenating descendants
  return if h <= 1 then tup else [] +/ [-t: t in tup] end if;
 end;

 procedure #self;  -- total leaf length, given by cumulant
  return cum(#cum);
 end;

 procedure self + x;  -- concatenation

  if is_tuple(xx := x) then x := btup(); x := x.set(xx); end if;  
	-- force second argument to btuple
  if #cum = 0 then return x; end if; 
  if #x.cum = 0 then return self; end if;  -- null cases

  new := btup();   -- create an empty B-tuple

  if x.h = h then   -- concatenated btup has same height

   new.tup := tup + x.tup; new.h := h;  -- perform a top-level concatenation

   c1l := cum(#cum); new.cum := cum + [c + c1l: c in x.cum];
       -- get cumulant of concatenation, adjusting second part
   return new.ifsplit();   -- split if necessary 

  elseif x.h < h then  -- concatenated btup is shorter

   new.tup := tup; new.cum := cum; new.h := h;   
   	-- copy the top node of this tree

   end_elt := tup(nt := #tup);  -- the final descendant of this tree's root
   end_elt := end_elt + x;  -- recursively catenate x, 1 level down

   if end_elt.h < h then    -- the subconcatenation has not split
    (new.tup)(nt) := end_elt;   -- just install
    new.cum(nt) := new.cum(nt) + x.cum(#x.cum);  
    	-- adjust the cumulant to reflect the added descendants

    return new;  -- return the new tree
   end if;

     -- otherwise the subconcatenation has grown taller, i.e. has split
     -- the descendants of the new top level root 
     -- replace the prior last node at this level
     -- (which is the node to which x was concatenated)
   new.tup(nt..nt) := end_elt.tup; new.h := h;
       -- replace the last child of the original tree 
       -- with the children of the new split sub-tree
 
   c1ml := if nt = 1 then 0 else cum(nt - 1) end if;  
   	-- get cumulant prior to last node of original tree 
   new.cum(nt..nt) := [c + c1ml: c in end_elt.cum];  
   	-- add this to all subsequent cumulants

   return new.ifsplit();   -- split if necessary 
   
  else  -- otherwise concatenated element is taller

   new.tup := x.tup; new.cum := xc := x.cum; new.h := xh := x.h;  
   	-- copy the top node of x
   first_x_elt := x.tup(1);  -- the first element of x
   tot_cum := cum(#cum);   -- total cumulant of original tree
   
   first_x_elt := self + first_x_elt;  
   	-- recursively catenate this tree to first child of x
   if first_x_elt.h < xh then  -- the subconcatenation has not split
    (new.tup)(1) := first_x_elt;   -- it becomes first child of the new tree
    for j in [1..#xc] loop 
    	new.cum(j) +:= tot_cum;   -- adjust the later cumulants
    end loop;
    return new;  -- return the new tree
   end if;

     -- otherwise the subconcatenation has grown taller, i.e. has split
     -- the descendants of the new top level root replace 
     -- the prior first node at this level
     -- (which is the node of x to which our original tree was concatenated)
   new.tup(1..1) := first_x_elt.tup;  
       -- replace the first child of the original tree 
       -- with the children of the new split sub-tree

   new.cum(1..1) := first_x_elt.cum;  
   	-- likewise replace the cumulant of the first child of the original tree
   for j in [3..#xc + 1] loop   -- adjust all later cumulants
   	new.cum(j) +:= tot_cum; 
   end loop;
   
   return new.ifsplit();   -- split if necessary 
 
  end if;

 end;

 procedure ifsplit();  -- split into 2 nodes if overpopulated

  if (nt := #tup) <= maxw then return self; end if;
  	   -- needn't split if not above length limit

  t1 := tup(1..nto2 := nt/2); t2 := tup(nto2 + 1..); -- split the top node in half
  c1 := cum(1..nto2); c2 := cum(nto2 + 1..);   -- split its cumulant in half
  cum1 := cum(nto2); cum2 := cum(nt);    
  	-- get cumulants for the two new nodes that will be formed
  
    -- form a new root with just two descendants
  new1 := btup(); new1.h := h; new1.tup := t1; new1.cum := c1;  
  	-- initialize first child and its cumulant
  new2 := btup(); new2.h := h; new2.tup := t2; 
  new2.cum := [c - cum1: c in c2]; -- initialize second child and its cumulant
 
  newtop := btup();  -- create a new empty node
  newtop.tup := [new1,new2]; 
  newtop.cum := [cum1,cum2]; newtop.h := h + 1; 
  	-- attach children, cumulant, and set height

  return newtop;
  
 end ifsplit;

 procedure x + self;  -- concatenation with btup in second position
  if is_tuple(xx := x) then x := btup(); x := x.set(xx); 
  else abort("bad concatenation argument: " + str(x)); end if;  
       -- force first argument to btuple
  return x + self; -- then just concatenate
 end;

 procedure set(t);  -- set a B-tree from a tuple

  if (nt := #t) <= maxw then 
   h := 1; cum := [1..nt]; tup := t;
   return self;
  else  -- make two trees of half length, and concatenate them
   bt1 := btup(); bt1 := bt1.set(t(1..nt/2));  -- first half
   bt2 := btup(); bt2 := bt2.set(t(nt/2 + 1..));  -- second half
   return bt1 + bt2;  -- return concatenated version
  end if;
 end set;

 procedure self with x;  -- item concatenation
  return self + [x];  -- just concatenate singleton tuple
 end;

 procedure self(i..);  -- end slice extraction

  if i > (cncp1 := cum(nc := #cum) + 1) or i < 1 then 
	abort("end slice index out of range: " + str(i)); end if;
  if i = cncp1 then return btup(); end if;  -- empty result
  if i = 1 then return self; end if;    -- the whole shebang
  
  if h = 1 then    -- minimal height case; slice the tuple
   new := btup(); new.h := 1; new.tup := tup(i..); new.cum := [1..nc - i + 1]; 
   return new;
  end if;

  must := exists c = cum(j) | c >= i; 
  	-- find the child position to which the slice propagates
  cumbef := if j = 1 then 0 else cum(j - 1) end if;  
  	 -- get the cumulant prior to position of the sliced child
  tj := tup(j);    -- get the child to which the slice propagates
  
  if j = nc then return tj(i - cumbef..); end if;  
  	-- just return slice of the child if this is last
  
  tail := btup(); tail.h := h; tail.tup := tup(j + 1..);  
  	-- otherwise get siblings following the sliced child 
  cj := cum(j); tail.cum := [c - cj: c in cum(j + 1..)];  
  	-- adjust cumulant for this isolated 'tail' group of siblings
 
  return tj(i - cumbef..) + tail;  
  	-- catenate the 'tail' to the sliced child, and return

 end;

 procedure self(i..j);  -- general slice extraction

  if i > (cncp1 := (cnc := cum(nc := #cum)) + 1) or i < 1 then  
  	-- check that first index is in range
   abort("first slice index out of range: " + str(i)); 
  end if;
  if j < i - 1 or j > cnc then 
  	abort("second slice index out of range: " + str(i)); 
  end if;
  		-- check that second index is in range
  if j = i - 1 then return btup(); end if;  
  	-- in the self(i..i - 1) case, return an empty btuple

  if h = 1 then    -- minimal height case; just slice the tuple
   new := btup(); new.h := 1; new.tup := tup(i..j); new.cum := [1..j - i + 1];   
   	-- return a height 1 btuple
   return new;
  end if;

  must := exists c = cum(jloc) | c >= j; 
  	-- find the top-level child location of the index j

  if i = 1 then   -- we have a prefix slice extraction t(1..j)
 
   tj := tup(jloc); -- get child to which slice propagates
   
   if jloc = 1 then return tj(1..j); end if; 
   	-- if this is first child, just get and return slice of first component 
   
   pref := btup();   
   	-- otherwise will generate new prefix, consisting of all preceding siblings
   cumbef := cum(jloc - 1); subslice := tj(1..j - cumbef);  
   	-- this 'subslice' will be concatenated to the preceding siblings
   
   if jloc = 2 then   
   	-- the preceding sibling(s) would form a tree with #tup = 1; so descend 1 level
    pref.h := h - 1; pref.tup := (t1 := tup(1)).tup; pref.cum := t1.cum;  
    	-- in this case, prefix is identical with first node
   else     
   	-- taking prefix part won't produce tree with #tup = 1, 
   	-- so we form a new prefix tree
    pref.h := h; pref.tup := tup(1..jloc - 1); 
    pref.cum := cum(1..jloc - 1);
   end if;
   
   return pref + subslice;  
   	-- catenate tail to the appropriate slice of the child, and return

  end if;
    
    -- in the remaining case, two of the children will be sliced
  must := exists c = cum(iloc) | c >= i; 
  	-- find the top-level tup location of the first index i
  ind := tup(iloc);   -- this is the first child that will be sliced
  prior_cum := if iloc = 1 then 0 else cum(iloc - 1) end if; 
		-- get the cumulant prior to this child

  if iloc = jloc then  
  	-- if the two children being sliced are the same, 
  	-- then just do a subextraction from this child
    return ind(i - prior_cum..j - prior_cum);    -- return this subextraction
    end if;

    jnd := tup(jloc);      -- this is the second child that will be sliced
    jprior_cum := cum(jloc - 1);  -- get the cumulant prior to this second child

    ipart := ind(i - prior_cum..); jpart := jnd(1..j - jprior_cum);
        -- recursively extract the appropriate pieces of these two children
        -- one of these is a head extraction, the second is a tail extraction

    if iloc + 1 = jloc then return ipart + jpart; end if;
        -- if the two tup locations differ by 1, just catenate these two extracted pieces

    newmid := btup();   -- generate a new node for the 'middle' nodes
  
    if iloc + 2 < jloc then     -- the 'middle' part has at least 2 nodes
      newmid.h := h; newmid.tup := tup(iloc + 1..jloc - 1);     
      	-- attach these nodes to the new empty tree formed above 
      cumbef := cum(iloc); newmid.cum := [c - cumbef: c in cum(iloc + 1..jloc - 1)];     
		-- attach a properly adjusted cumulant
    else              -- middle has just 1 node; use this node as the middle tree
      newmid := tup(iloc + 1);
    end if;  
    
    return ipart + newmid + jpart;       
    	-- return concatenation of the front, middle, and end
    
  end;

  procedure self(i..j) := x;    -- slice assignment

    if is_tuple(xx := x) then       -- force x to btup if it is a tuple
      x := btup(); x := x.set(xx);
    elseif type(x) /= "BTUP" then    -- otherwise x must already be a btup
      abort("illegal slice-assignment right-hand side: " + str(x));
    end if;
    
    		-- check that first index is in range
    if i > (cncp1 := (cnc := if (nc := #cum) = 0 then 0 else cum(nc) end if) + 1) 
		or i < 1 then
      abort("first slice-assignment index out of range: " + str(i)); 
    end if;
    
         -- check that second index is in range
    if j < i - 1 or j > cnc then 
    	abort("second slice-assignment index out of range: " + str(i)); 
	end if;

    if i = 1 then       -- the over-written part of this tree is a prefix

      if j = cnc then  -- the whole initial tree is over-written; just copy x to self
        h := x.h; tup := x.tup; cum := x.cum; return x;    
        	-- modify this btup; return right-hand side
      end if; 

            -- otherwise only a prefix of the initial tree is over-written
      tail := self(j + 1..);       
      	-- extract the trailing B-tup section that is not over-written 
      new := btup(); new.h := x.h; new.tup := x.tup; new.cum := x.cum;     
      	-- make a copy of x

      new := new + tail;      -- catenate the assigned part plus the retained part
      h := new.h; tup := new.tup; cum := new.cum;    -- modify this btup
      return x;                      -- return right-hand side
      
    end if;
            -- in this final case, a middle portion of the original tree is over-written 
    pref := self(1..i - 1);       -- get the retained prefix
    new := btup(); 
    new.h := x.h; new.tup := x.tup; new.cum := x.cum;     
    	-- the assigned part
    
    if j = cnc then    -- over-written part is suffix
      new := pref + new;      -- catenate the retained part plus the assigned part 
    else         -- over-written part is middle
      tail := self(j + 1..);      -- extract the retained suffix
      new := pref + new + tail;      
      	-- catenate the retained prefix, the assigned part, and the retained suffix 
    end if; 

    h := new.h; tup := new.tup; cum := new.cum;      -- modify this btup
    return x;                      -- return right-hand side

  end;

  procedure self(i..) := x;    -- end slice assignment
    self(i..#self) := x; return x;
  end;

  procedure from self;    -- front extraction
    if tup = [] then return OM; end if;    -- null case
    x := self(1); y := self(2..);
    h := y.h; cum := y.cum; tup := y.tup; 
    return x;
  end;

  procedure frome self;    -- end extraction
    if tup = [] then return OM; end if;    -- null case

    x := self(ln := cum(#cum)); y := self(1..ln - 1);
    h := y.h; cum := y.cum; tup := y.tup; 
    return x;
  end;

  procedure fromb self;    -- front extraction
    if tup = [] then return OM; end if;    -- null case
    x := self(1); y := self(2..);
    h := y.h; cum := y.cum; tup := y.tup; 
    return x;
  end;
			-- **** DIAGNOSTIC ROUTINE FOR DEBUGGING ****
  procedure check_consistency();    
  	-- check consistency of tree (diagnostic routine for debugging)

    if h > 1 then -- the number of nodes must be in the correct range
 
      if exists nd = tup(j) | (nc := #(nd.tup)) < minw or nc > maxw then 
        abort("node count out of range: " + str(nc) + " " + str(nd)); 
      end if;

      if exists nd = tup(j) | nd.h /= h - 1 then 
        abort("node height inconsistency: " + str(j) + " " + str(nd.h) + " " + str(h)); 
      end if;

    end if;
    
          -- the cumulant differences at this level must be 
          -- the sum of those at the next lower level
    if h > 1 then 

      if exists nd = tup(j) | 
      	cum(j) /= if j = 1 then 0 else cum(j - 1) end if 
      		+ nd.cum(#nd.cum) then
        abort("bad cumulant: loc = " + str(j) + " cum at loc = " 
        	+ str(cum(j)) + " priorcum = " 
           + if j = 1 then "" else str(cum(j - 1)) end if 
           	+ " node cum = " + str(nd.cum(#nd.cum))); -- + " " + str(nd)
      end if; 

    elseif cum /= [1..#tup] then

      abort("bad bottom-level cumulant: " + str(cum) + " " + str(nd)); 

    end if;
    
    if h > 1 and (exists nd in tup | not nd.check_consistency()) then 
		abort("bad abort"); 
	end if;
    
    return true;      -- passed the consistency check
    
  end check_consistency;
  
			-- **** ROUTINES FOR ITERATING OVER B-TREES ****

  procedure iterator_start;     -- initialize simple iteration over btup
      -- sets up iterator stack as value referenced by iter_ptr. 
      -- This is a stack of pairs [tup,posn]
    stack := [];    -- to be built

    node := self;
    for j in [1..h] loop
      stack with:= [notup := node.tup,if j = h then 0 else 1 end if]; node := notup(1); 
    end loop;

    ^iter_ptr := stack;      -- attach stack to iteration pointer

  end iterator_start;
  
  procedure iterator_next();     -- step simple iteration over btup
      -- returns the next leaf value as a singleton tuple, or OM if terminating
      -- advances iterator stack referenced by iter_ptr
    stack := ^iter_ptr;    -- retrieve the iterator stack

    height := 1;
    
    for j in [ns := #stack,ns - 1..1] loop
      if (sj := stack(j))(2) = #sj(1) then 
        height +:= 1; removed frome stack;    -- remove any exhausted element
      else
        exit;
      end if;
    end loop;
    
    if height = 1 then 
      removed frome stack;  -- pop the top element; then advance it
      [tup,loc] := removed; 
      result := tup(loc +:= 1); stack(ns) := [tup,loc];
      ^iter_ptr := stack; return [result];    
      	-- return singleton tuple built from leaf element
    end if;
    
    if stack = [] then return OM; end if;    
    	-- iteration is exhausted

    removed frome stack;  
    	-- pop the top element, then advance it and use to rebuild rest of stack
    [tup,loc] := removed; node := tup(loc +:= 1); stack with:= [tup,loc];

    for j in [1..hm1 := height - 1] loop      
    	-- rebuild the stack, starting with the node that was advanced 
      stack with:= 
      	[notup := node.tup,if j = hm1 then 0 else 1 end if]; 
	  node := notup(1); 
    end loop;

    removed frome stack;  -- pop the top element; then advance it
    [tup,loc] := removed; result := tup(loc +:= 1); stack(ns) := [tup,loc];
    ^iter_ptr := stack; return [result];    
    	-- return singleton tuple built from leaf element
    
  end iterator_next;
  
  procedure set_iterator_start;     
  	-- initialize second-form iteration over btup (similar to iterator_start)
  	-- sets up iterator stack as value referenced by iter_ptr
    stack := [];    -- to be built
    ^compno := 0;
    
    node := self;
    for j in [1..h] loop
      stack with:= [notup := node.tup,if j = h then 0 else 1 end if]; node := notup(1); 
    end loop;

    ^iter_ptr := stack;      -- attach stack to iteration pointer

  end set_iterator_start;
  
  procedure set_iterator_next();     -- step second-form iteration over btup
    -- returns a singleton tuple containing a [component number, value] pair, or OM if terminating
    -- advances iterator stack referenced by iter_ptr
    stack := ^iter_ptr;    -- retrieve the iterator stack
    ^compno := cno := ^compno + 1;    -- advance the component number
    
    height := 1;
    
    for j in [ns := #stack,ns - 1..1] loop
      if (sj := stack(j))(2) = #sj(1) then 
        height +:= 1; removed frome stack;    -- remove any exhausted element
      else
        exit;
      end if;
    end loop;
    
    if height = 1 then 
      removed frome stack;  -- pop the top element; then advance it
      [tup,loc] := removed; 
      result := tup(loc +:= 1); stack(ns) := [tup,loc];
      ^iter_ptr := stack; return [[cno,result]];    
      	-- return singleton tuple built from leaf element
    end if;
    
    if stack = [] then return OM; end if;    -- iteration is exhausted

    removed frome stack;  
    	-- pop the top element, then advance it and use to rebuild rest of stack
    [tup,loc] := removed; node := tup(loc +:= 1); stack with:= [tup,loc];

    for j in [1..hm1 := height - 1] loop      
    	-- rebuild the stack, starting with the node that was advanced 
      stack with:= [notup := node.tup,if j = hm1 then 0 else 1 end if]; node := notup(1); 
    end loop;

    removed frome stack;  -- pop the top element; then advance it
    [tup,loc] := removed; result := tup(loc +:= 1); stack(ns) := [tup,loc];
    ^iter_ptr := stack; return [[cno,result]];    
    	-- return singleton tuple built from leaf element

  end set_iterator_next;
  
end btup;

The family of routines for iterating over B-trees seen at the end of the preceding code work by maintaining a stack of tree nodes representing the sequence of ancestors leading to the leaf x currently reached by an iteration. Each stack entry stores a tree node t, along with the position i of the next lower node in the chain among the children of t. If the leaf x is not the last child of the bottom-most node t' in this chain, the iteration is advanced simply by moving on from x to the next child of t'. Otherwise we must locate the bottom-most node t* in the sequence whose final child has not yet been reached by the iteration, advance to the next child t2 of t*, and then rebuild the part of the stack following this t2 by appending the chain of children leading to the first node in the sub-tree headed by t2. Details of the code just outlined are found in the 'iterator_next()' routine seen above. The closely related 'set_iterator_next()' routine adapts this same idea to realize iterations of the modified form 'for y = t(i) loop ...'.
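
For concreteness, here is a small, hypothetical usage fragment for these iteration routines; only the 'set', 'iterator_start' and 'iterator_next' entries of the class above are assumed.

	bt := btup(); bt := bt.set([10,20,30,40,50,60,70,80,90,100,110]);  -- a small sample B-tuple
	bt.iterator_start;      -- set up the iteration stack
	while (nxt := bt.iterator_next()) /= OM loop
	 print(nxt(1));         -- each step returns the next leaf wrapped in a singleton tuple
	end loop;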

The B-tree object class 'btup' implemented by the SETL class code just described is logically ancestral to most of the other codes needed to realize the database facility at which we aim, in the sense that most of the actual codes used are derived from this initial prototype by working additional functionality and features into it, without really changing its basic structure very much. We shall see that such 'conservative reworking' allows us to define both various useful kinds of large string and large tuple objects, and also all of the large maps used for indexing SETL databases in the manner described at the start of this Chapter. Another essential feature which can be worked into the same basic B-tree structure is the reference counting mechanism used to give B-trees the value semantics whose importance we have already explained. The fact that this latter mechanism allows us to implement database modifications using only very limited numbers of tree node changes is essential both to the efficiency of the whole database system, and to the crash-recovery capabilities which are detailed in a later section.

We begin our account of this chain of generalizations by explaining one of the simplest of them, the use of additional cumulants to realize other important data structures by generalized B-trees.

B-trees Incorporating More General Cumulants. The B-tree data structure seen in the preceding code can be adapted to realize other data objects than tuples. For example, we can represent very large strings, e.g., strings several gigabytes in length, by similar B-tree structures, simply by replacing the initial cumulant 'cum' stored in the tree nodes by one which cumulates the number of characters rather than the number of leaves. The idea here is to store very large strings by breaking them up into pieces, say of a few hundred characters each, which are referenced by an overlying B-tree (of a few million nodes). In this B-tree the cumulant stored represents the number of characters in all the leaves coming below some particular tree node and its left siblings. A variation on this idea would be to maintain strings divided into words by the occurrence of non-alphabetic characters, and into lines by the occurrence of end-of-line characters, in such a way as to allow very rapid access either to a particular character position, or to a particular word by its numerical position in the sequence of all words, or to a particular line by line number. For this, we could simply maintain three auxiliary cumulants, one for total number of characters, the second for total number of word breaks, the third for total number of line-end characters. It should be clear that the code seen above can be adapted to handle all such cases, and that it can retain all the advantages seen above. (Another approach is sketched later.) That is, all supported accesses and edit operations can be executed in time proportional to the logarithm of data object size, and value semantics, with the advantages that we have noted above, can be maintained in the presence of active edit operations.
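
As a small illustration of this multi-cumulant idea, the leaf-level contribution to the three cumulants just mentioned might be computed as follows (a sketch only; the helper name 'counts_of' is hypothetical). Parent cumulants are then obtained by summing these triples componentwise over a node's children.

	procedure counts_of(leaf_stg);  
		-- [characters, word breaks, line ends] contributed by one leaf string
	 nw := 0; nl := 0;
	 for i in [1..#leaf_stg] loop
	  c := leaf_stg(i);
	  if c = " " then nw +:= 1;          -- counting blanks as word breaks, for simplicity
	  elseif c = "\n" then nl +:= 1; end if;
	 end loop;
	 return [#leaf_stg,nw,nl];
	end counts_of;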

The cumulants described in the preceding section are all count-like, in that they are calculated for parent nodes by summing the corresponding values for their children. However, the same cumulant technique will work even if the operation used to calculate the cumulant of a parent node from those of its children is an associative operation different from simple summation. Only associativity, not even commutativity, is required. This observation opens many possibilities. For example, the maximum and minimum operations in any ordered set are associative. Hence, if the leaves of a B-tree store elements (such as strings or integers) belonging to such a set, we can keep a cumulant derived from the maximum or minimum of children at each of the B-tree's nodes, and keep these cumulants updated as the tree is edited, in much the way seen in the code displayed above, without losing the logarithmic efficiency of the operations which this code implements. Suppose then that we aim to keep the tree leaves in sorted order. The availability of a cumulant representing the maximum of all the leaves coming under or to the left of any of our tree nodes makes it possible to find the largest leaf smaller than a new leaf to be inserted in logarithmic time. Hence we can insert this new leaf at its proper position in log time, thus always keeping the sequence of leaves in their desired order.

Here is another, more sophisticated but less generally useful variant of this same idea. Suppose that we wish to store very long parenthesized strings in a manner allowing the closing parenthesis matching a given open parenthesis to be located rapidly, even though these two parentheses may be separated by millions of characters and many nested pairs of matching parentheses. This can be done as follows. In each node we keep two cumulants, one 'free_opens_in' representing the total number of open-parentheses coming under the node which are not matched by any close-parenthesis coming to their right, the second 'free_closes_in' representing the total number of close-parentheses coming under the node which are not matched by any open-parenthesis lying to their left. To combine these quantities as calculated for two strings when the two strings are concatenated we can use the following formula.

	if (foi1 := free_opens_in(stg1)) >= (fci2 := free_closes_in(stg2)) then 
		free_opens_in(stg1 + stg2) := foi1 - fci2 + free_opens_in(stg2); 
		free_closes_in(stg1 + stg2) := free_closes_in(stg1);
	else
		free_opens_in(stg1 + stg2) := free_opens_in(stg2); 
		free_closes_in(stg1 + stg2) := free_closes_in(stg1) + (fci2 - foi1);
	end if;
This can be thought of as an operation on pairs [num_free_opens_in,num_free_closes_in] which, because of its relationship to the obviously associative string-concatenation operation, is also associative. Hence these quantities can be kept together as cumulants in the nodes of a B-tree representing strings. It is not hard to see that, with these quantities available, the parenthesis which matches a given open- or close-parenthesis can be found by a logarithmically efficient recursive search.
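
A leaf-level version of this calculation, applied to an ordinary SETL string, might read as follows (a sketch only); the pairs it returns for individual leaves are then combined, level by level, using the rule displayed above.

	procedure paren_counts(stg);   -- [free_opens_in, free_closes_in] for a plain string
	 fo := 0; fc := 0;
	 for i in [1..#stg] loop
	  c := stg(i);
	  if c = "(" then fo +:= 1;
	  elseif c = ")" then
	   if fo > 0 then fo -:= 1; else fc +:= 1; end if;
	  end if;
	 end loop;
	 return [fo,fc];
	end paren_counts;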

Our first example of the use of additional cumulants to realize modified data structures by B-trees is the 'large string' object represented by the fragments of code seen below, which are excerpted from the full class code for this object type. We give excerpts rather than the full code, since this code derives so closely from that for the B-tuple class seen above. Basically it just: (i) keeps strings in the B-tree leaves of the B-tuple; (ii) adds one more cumulant, the number of characters; (iii) renames all the original Btup routines so as not to conflict with the new string-oriented top-level function syntax; (iv) implements all the string operations using the underlying tuple operations, in a straightforward way. The code excerpts seen below flesh out this sketch with additional details.

Pure cumulant vectors. We can regard a tuple represented by a B-tree containing additional cumulants as a vector supporting logarithmically fast operations giving the partial sums extended over arbitrary tuple slices, in a structure that can also be edited rapidly while always retaining fast searchability. For example, storing a cumulant for the square of the components of a tuple of numbers would allow us to find the standard deviation of an arbitrary tuple slice rapidly. This remark extends to associative operations other than sums. For example, storing a cumulant for the maximum of a tuple of numbers or strings, even if this is not sorted, allows us to find the maximum of an arbitrary tuple slice rapidly, and also to locate the maximum by a fast binary style of search.
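
A hedged sketch of the standard-deviation calculation just mentioned, written in the 'separated' prefix-sum style described below, might read as follows; the cumulant vectors sum_cum and sumsq_cum are hypothetical (their j'th entries holding the sums of the first j components and of their squares), and the tuple components are assumed to be floating-point values.

	m := j - i + 1;                                            -- length of the slice t(i..j)
	before := if i = 1 then 0.0 else sum_cum(i - 1) end if;
	before_sq := if i = 1 then 0.0 else sumsq_cum(i - 1) end if;
	s1 := sum_cum(j) - before; s2 := sumsq_cum(j) - before_sq; -- slice sums of x and of x squared
	variance := s2 / m - (s1 / m) ** 2;                        -- standard deviation is sqrt(variance)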

Various other cumulants allow representation of other kinds of disk-storable objects. One such example is the 'large tuple of long strings'. This can have billions of components, and individual components can be billions of bytes long. The crucial operations linking such objects to the SETL environment can be represented as

	stg := obj(n..[j,k]);     and      obj(n..[j,k]) := stg; 

The first of these operations extracts characters j thru k from the n'th component of the vector of strings and returns it as a standard SETL string (which should be no more than a few megabytes in length). The second assigns a standard SETL string to this same range of characters. We also have

	obj(n..m) := tup;     and     obj(n..m) := OM; 

where tup is a standard SETL tuple of strings. The first of these inserts a vector of strings into the stated range of the 'large tuple of long strings'; the second deletes a range of components.

To represent these objects we need only introduce a family of B-trees with two cumulants: total number of characters in a node's descendant leaves and total number of breaks between vector components in a node's descendant leaves.

Another important case is that of string-to-string mappings, which in abstract terms are sets of pairs of strings [key,val], the key being of no more than standard SETL size, but the string value being possibly very large. The abstract mapping from 'key' to 'val' can be many-to-many, but for simplicity in the present remarks we shall suppose that it is single-valued. Here the main operations can be represented as

	stg := obj(key..[j,k]);         and          obj(key..[j,k]) := stg; 

The first of these operations extracts characters j thru k from the val string located by the key 'key'. The second assigns a standard SETL string to this same range of characters. We also have

	obj(key) := stg;          and         obj(key) := OM; 

where stg is a standard SETL string. The first of these inserts a new [key,val] pair; the second deletes the pair with the stated key.

To represent these objects we can introduce a family of B-trees with two cumulants: total number of characters in a node's descendant leaves and largest key-hash in a node's descendant leaves. Here each record is assumed to consist of four abutting fields

	hash_of_key,length_of_key,key,val

hash_of_key is a 4- (or 5-)byte hash of a record's 'key' value. The key itself can be of an arbitrary length, given by the 4-byte length_of_key field. Records are stored in order of their hash_of_key field. Keys with identical hash_of_key values should occur only very rarely, but when they do occur, the colliding records are stored successively, in arbitrary order. Search for the record with a given key then proceeds by binary search on the key's hash, followed by key verification at the bottom level. In the rare case in which several keys have the same hash, a slightly more elaborate serial search at the bottom level is required.

These are the objects and operations central to the Berkeley database management library downloadable from http://www.sleepycat.com.
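
A sketch of the bottom-level key-verification step just described might read as follows; all of the field-accessor and successor names used here (hash_field, key_field, val_field, next_record) are hypothetical, since the actual records are byte strings laid out as just explained.

	   -- rec is assumed to be the first stored record whose hash field equals h
	while rec /= OM and hash_field(rec) = h loop
	 if key_field(rec) = key then return val_field(rec); end if;  -- key verified; return its value
	 rec := next_record(rec);     -- rare hash-collision case: step serially to the next record
	end loop;
	return OM;                    -- no record stores this key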

Separated versus integrated cumulants. Cumulants like those discussed above can either be integrated into a single B-tree structure carrying multiple cumulants, or organized into separate B-trees. For example, we can realise a vector of strings as a single tree carrying the two cumulants (i) total number of characters in a node's descendant leaves, and (ii) total number of breaks between vector components in a node's descendant leaves, as indicated above. But we can also represent it as a pair of objects, the first a simple string B-tuple, the second a cumulating vector of integers giving the end position of each of the sections into which this string falls if we regard it as being broken into a vector of strings. Though less efficient than an integrated representation, this second kind of representation is easier to construct since it involves addition of new, generally easy data structures rather than revision of existing structures. This second representation is also useful as a mental tool for thinking thru the design of novel B-tree based data structures.

The reader is invited to work out the code details needed to implement one of the B-tree based data structures sketched in the preceding paragraphs. Code for one of the simplest of these, a family of B-trees which can be used to represent very large strings, is given in a later section. But we postpone examination of this code until something is said about an additional layer of functionality which must be worked into it: the system of reference counts used to realize value semantics for B-trees too large to be held in RAM, whose nodes must therefore be stored as collections of pages held on disk (and perhaps occupying most of a disk or disks many gigabytes in size.)

12.3 Mapping large strings to collections of disk pages.

Reference-count management. For the reasons outlined in the first section of this chapter, we want to manage large objects represented by B-trees and stored on disk, be these simple strings or full databases, and we want them to have value semantics. The initial in_RAM prototypes of these objects, realized by the codes described earlier in this chapter, are implemented entirely by SETL objects, and so inherit their value semantics from that of their underlying SETL objects. In this setting the SETL interpreter supplies the implicit reference count management needed to support value semantics. For objects stored on disk this is no longer the case, and so the reference count management needed to give these objects their desired value semantics must be programmed explicitly. We will now explain what must be done, first in the relatively simple case of large disk-stored strings, and then, in a later section, for full databases.

We separate reference-count management into two parts, a first or 'bottom' part having to do with counting the references to disk records as these are set up and removed by the codes internal to the packages of routines (those described in the preceding section) which implement operations on disk-stored strings (and other like objects) represented by B-trees, and a second or 'top' part relating to the handling of reference counts in procedures, more at the application level, which use these lower-level routines. Note that routines of this second class are unaware of the very existence of internal B-tree nodes, and so never reference them directly. However, they do reference the root nodes of such objects, and so may modify the reference counts of such root nodes.

Reference counting can be implemented by making each disk record used for an object in which we are interested store a field which keeps track of the total current number of references to the record, either from variables in a program manipulating such records, or from other records. First consider reference-count management for (the limited family of) 'bottom' level routines, i.e. those which manipulate internal tree nodes. Whenever such a record, referenced by a variable obj, is about to be modified (e.g. by an operation like obj.set_cum(i,x) or obj.set_tup(i,x) which modifies a field within it), we check the refcount of the record. If this is 1, we can be sure that 'obj' is the only reference to the object, and so we can simply proceed with the change. However, if the refcount is > 1, then there are other references to the same record, and the rules of value semantics forbid the edit about to be performed from affecting the data value seen by these other references. To avoid this we first copy the record which we are about to edit (thereby replacing it with a brand new record whose reference count is certainly 1), and then modify the new record. When this is done the number of references to the old copy of the record diminishes by 1. The check-and-replace operation needed for this can be thought of as an assignment

	obj := obj.maycopy();

inserted at the start of edit operations like obj.set_cum(i,x). 'maycopy' just returns 'obj' if its count is 1, but otherwise returns a fresh copy and decrements the refcount of the old copy by 1.

Creation of the new record copy increments the number of references to all of the records to which the copied record refers. 'maycopy' can handle this supplementary action also.
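
A sketch of 'maycopy' along these lines, written in terms of hypothetical low-level helpers (refcount_of, new_record_copy, is_record) and the incref/decref operations described later in this section, might therefore read:

	procedure maycopy();   -- sketch only: copy the record about to be edited if it is shared
	 if refcount_of(self) = 1 then return self; end if;  -- sole reference: edit in place
	 new := new_record_copy(self);                       -- fresh record, with refcount 1
	 decref(self);                                       -- one fewer reference to the original
	 for child in new.tup | is_record(child) loop        -- the copy now also references the children
	  incref(child); 
	 end loop;
	 return new;
	end maycopy;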

This is all that we need do in the case of edit operations like obj.set_cum(i,x) which simply modify a numerical value. An additional step is required for operations like 'obj.set_tup(i,x)' which modify a value which points to another record. This must decrement the number of references to the object 'obj.tup(i)' which the field being modified originally referenced (unless this field was initially unused), and must increment the number of references to the object x (unless x is undefined.) This action can be thought of as an operation

	note_replace_ref(obj.tup(i),x);

inserted at the start of the 'obj.set_tup(i,x)' procedure but not at the start of operations like 'obj.set_cum(i,x)'. Similarly any statement x := obj.tup(i); decrements the number of references to the record value of x (if any) and increments the number of references to obj.tup(i). Hence a call note_replace_ref(x,obj.tup(i)) should be attached to each such operation.
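
The corresponding sketch of 'note_replace_ref' (again in terms of the incref/decref operations described later in this section) is simply:

	procedure note_replace_ref(old_val,new_val);   -- sketch only
	 if old_val /= OM then decref(old_val); end if;   -- the field no longer references old_val
	 if new_val /= OM then incref(new_val); end if;   -- it now references new_val
	end note_replace_ref;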

To create the new records required for some of the operations just described, we need a way of allocating the disk pages which will hold them. If disk space is not to be consumed progressively until no more remains, pages allocated to records that are no longer in use must be reclaimed. A record falls out of use whenever its reference count falls to zero, as may, for example, happen to the record referenced by 'obj.tup(i)' when we execute the operation 'obj.set_tup(i,x)'. The simplest (but not the most efficient) way of managing page reclamation is to maintain a free pages list, to the front of which records are linked when they are freed, and from which recycled pages can be recovered by delinking them as needed. This is a simple one-way list. Records put on the free list can retain any references to other records which they initially contain. However, when a record is delinked from the free-list in preparation for re-use, it implicitly loses all these references, so each of the other records that it references must have their refcount decremented by 1, perhaps causing them to be added to the free-list themselves. Alternatively, all the children of a record being freed can be processed immediately to decrement their reference counts and free them recursively if these counts fall to zero.
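
A minimal sketch of this free-list scheme, using hypothetical names (free_list, next_free, is_record, fresh_disk_record), might read as follows; here records keep their references while on the list, and only lose them when delinked for re-use, as just described.

	procedure free_record(r);    -- link r to the front of the one-way free list
	 r.next_free := free_list; free_list := r;
	end free_record;

	procedure alloc_record();    -- reuse a freed page if one is available
	 if free_list = OM then return fresh_disk_record(); end if;  -- otherwise take a brand new page
	 r := free_list; free_list := r.next_free;       -- delink the first free page
	 for child in r.tup | is_record(child) loop      -- the reused page loses its old references
	  decref(child);   -- decref is assumed to call free_record when a count falls to zero
	 end loop;
	 return r;
	end alloc_record;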

Whenever a simple assignment x := y; is used to replace the value of one variable by that of another, the record referenced by x (if any) loses one reference, and that referenced by y (if any) gains one. These actions can be implemented by a call adjust_counts(x,y);

To complete our account of the bottom-level reference-count manipulations needed to implement value semantics for disk-stored structures, it only remains to consider the treatment of procedure calls and returns. A call f(expn1,...,expnn) can be viewed as a group of assignments

param1 := expn1; ... paramn := expnn

of the call arguments to the parameter values of f, which are originally OM. Hence calls simply increment the reference-count values of those of their arguments which have references as values. This set of assignments is followed by a jump which has no effect on reference counts. All variables other than the parameter variables of called procedures should be given the value OM; recursive calls should be considered to introduce wholly new sets of parameter and internal variables.

The operation

return expn;

can be viewed as an assignment

retvar := expn;

of 'expn' to the return variable of the function call, followed by a jump which has no effect on reference counts. The return also deallocates all the internal and parameter variables of the procedure being returned from. This implicitly sets all these values to OM, and so should decrement the reference count of every variable whose immediate pre-return value is a record. This can be done by inserting a call

cleanup([v1,...,vm]);

immediately before each return statement, which should however first be rewritten in the form

retval := expn; return retval;

The 'retval := expn;' statement will have its normal effect on reference counts; the 'return retval;' statement which follows it will have none. Function calls which ignore their returned values should be treated as if they read

retval := f(expn1,...,expnn); retval := OM;

The 'cleanup' routine seen above simply examines each component of its argument tuple and decrements the number of references of each such component which is a record.
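
In sketch form (with a hypothetical is_record test) this is just:

	procedure cleanup(var_list);   -- decrement the count of every record-valued variable
	 for v in var_list | v /= OM and is_record(v) loop decref(v); end loop;
	end cleanup;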

Top-level considerations. Next we consider the 'top' part of the reference count problem, i.e. the handling of counts in (the open-ended family of) application level procedures, which only modify the reference counts of the root nodes of disk-stored structures represented by B-trees. Here our problem is to integrate the treatment of these objects with the ordinary actions of the SETL space-management and reference counting system. This can be done in various ways, depending on what space-recovery mechanisms are available. In the code which realizes large string objects given later, these objects appear as instances of an object class which keeps track of all the large strings which are visible at the application level. This class can then provide a 'hold(object_list)' method which application-level code can use to erase all string objects which do not appear in the 'object_list' parameter passed to 'hold', thereby recovering the storage space that these string objects would otherwise occupy.

Optimization of reference-count management.

Each time a B-tree node N becomes the child of some newly created node (as, e.g., when a node must be copied in consequence of a modification of some leaf of a tree to whose root there exist two independent references), the number of references to N must be incremented by 1. Each time N ceases to be the child of such a node (e.g., when a node referencing it is destroyed, or when some external variable referencing a tree root is cleared), the count of references to N must be decremented. These rules imply a large amount of reference-count maintenance activity, and so tell us that refcount-map implementation is the most efficiency-critical point in the SETL database design. The following numbers will give a sense of what is involved. Consider an implementation that will manage a 100GB disk, divided into 100M physical records, each 1000 bytes in size. Then child-node identifiers and 'cum' values can both be 4 or 5 bytes in size, allowing each node to have roughly 100 children. This implies a tree-branching factor of approximately 100, so if the whole disk is devoted to a single database, the depth of its representing B-tree will be roughly 4. Each B-tree edit forcing a logical tree copy to be made will therefore require roughly 400 refcount-map changes.

To accommodate this large amount of refcount-update activity it is important to keep the refcount entries in RAM and to accelerate access to them. The following technique is used. The refcount map is represented by two flat arrays, called refcount_disk and refcount_RAM. As their names suggest, the larger of these is held on disk, and the smaller in RAM. The refcount_disk array needs 4-byte fields, but the refcount_RAM array, which is an approximate representation of refcount_disk, can have one-byte fields. Thus, even for a database containing up to 100 million physical pages, refcount_RAM can be stored in a 100MB flat array.

The refcount_RAM and refcount_disk arrays are related in the following way. The in-RAM refcount array of single bytes and the refcount_disk array are equally long. The length of these arrays is equal to the total number of physical disk pages to be managed, and together they store a refcount for each page. The refcount for page j is always the sum refcount_RAM(j) + refcount_disk(j). Integer values in 'refcount_RAM' are arranged in two numerical ranges. Values from 0 to 139 are used when refcount_disk(j) is 0, and so represent 'full' refcount values. Values from 140 to 255 are stored as integers offset by 140 (and so represent integers in the range 0 to 115). These are used when refcount_disk(j) is not 0, and so represent 'partial' refcount values. refcount_disk values no larger than 139 (ordinarily, this will be all refcount_disk values) are simply copied to refcount_RAM when this latter array is initialized (and the corresponding refcount_disk entry is zeroed.)

The refcount incrementation and decrementation operations 'incref' and 'decref' normally increment and decrement refcount_RAM(j) only. If refcount_RAM(j) sinks to zero, the corresponding refcount really is 0, and the corresponding record is freed. If refcount_RAM(j) rises to 140, 70 references are unloaded from it to refcount_disk(j), but then refcount_RAM(j) is kept in 'offset' form, i.e. as an integer in the range 140 to 255. If a refcount_RAM(j) value held in this offset form sinks to 139, refcount_disk(j) is accessed and up to 60 reference units are subtracted from it, and added to refcount_RAM(j), either in offset or non-offset form, depending on whether anything remains. Similarly, if refcount_RAM(j) would rise to 256, 70 references are unloaded from it to refcount_disk(j).
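The following page-indexed sketch is a hypothetical simplification of this scheme (the real 'incref' and 'decref' of the 'buffered_disk' package, listed later, work with encoded record pointers and also interact with the buffering and crash-recovery machinery; 'free_record' is the recycling routine of that package, applied here by page number for simplicity):

 procedure incref_page(j);   -- simplified increment of the refcount of page j
  r := refcount_RAM(j) + 1;
  if r = 140 and refcount_disk(j) = 0 then  
  	-- a full-form count has reached the top of its range; 
  	-- unload 70 references to disk and switch to offset form
   refcount_disk(j) := 70; r := (140 - 70) + 140;   -- i.e. 210, representing 70
  elseif r = 256 then   -- an offset-form count has overflowed its byte
   refcount_disk(j) := refcount_disk(j) + 70; r := r - 70;
  end if;
  refcount_RAM(j) := r;
 end incref_page;

 procedure decref_page(j);   -- simplified decrement of the refcount of page j
  r := refcount_RAM(j) - 1;
  if r = 0 then   -- the true count has reached zero; recycle the page
   refcount_RAM(j) := 0; free_record(j); return;
  end if;
  if r = 139 and refcount_disk(j) > 0 then  
  	-- an offset-form count has been exhausted; 
  	-- reload up to 60 reference units from disk
   take := 60 min refcount_disk(j);
   refcount_disk(j) := refcount_disk(j) - take; r := r + take;
   if refcount_disk(j) = 0 then r := r - 140; end if;   -- revert to non-offset form
  end if;
  refcount_RAM(j) := r;
 end decref_page;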

For purposes of crash recovery, as detailed in a later section, refcount_RAM is supplemented with an auxiliary log file refcount_log to which in-RAM refcount values are written (in a fail-safe manner) during backup operations (as described later) to ensure that they are recoverable from disk.

Persistence. Large strings of the form we have been discussing must be persistent if they are to be useful, since it is not feasible to reconstruct them whenever they are accessed. (Think, e.g., of a 50GB string.) For these objects to be persistent, they must have persistent external names (like file names) under which they can be stored, and using which they can be accessed. In the code seen below, the string objects shown are thought of as residing within files (whose names then give part of the persistent string object name needed.) However, it is often appropriate for several such string objects to be resident in a single file (for example, when these are variants of a single such object stg_obj, obtained by editing stg_obj in various ways.) To give each of these string objects its own persistent name, the prototype system implemented by the code seen below provides two operations

open_string_set(file_name);           and           close_string_set(file_name,names_to_strings_map);

These two operations are used to store arbitrary maps from string-names to bigstrings (i.e. 'stgs_btup' objects) in a persistent manner on disk, and more specifically in a file on disk, whose file name is the first parameter of the open_string_set and close_string_set operations. The 'open_string_set(file_name)' call opens the specified file, and returns a previously stored mapping from names to the string objects stored under these names by a preceding 'close_string_set' operation. These string objects can then be accessed and manipulated by operations of the normal SETL string form, e.g. stg_obj(i..j), stg_obj(i..j) := x; (where x is another such string object) and stg_obj_1 + stg_obj_2 (denoting concatenation.) Once any desired objects of this kind have been built, they can be stored persistently, just by wrapping them into a map 'names_to_strings_map' from names to the string objects which are to be saved, and then making a call

close_string_set(file_name,names_to_strings_map);

This writes a logical copy of the string objects included in the names_to_strings_map to the designated file (which can be different from the file named in the original 'open_string_set(file_name)' call). (In the prototype code shown below, an actual physical copy is written, but in a production-strength implementation, designed to manage string objects which might be gigabytes in length, only a short file of header information, containing a reference to the actual location of the very large strings represented, would be written.) Since our string objects have value semantics in any case, the difference between these two implementations is not significant, except for the efficiency issues involved.

To convert a string object which is not too large into the corresponding SETL string, the operation str(stg_obj) is used. To convert a SETL string into a string object of the type we have been discussing, one uses the creator routine 'stgs_btup' supplied by the corresponding object class in a 'creation call' of the form 'stgs_btup(stg)', where 'stg' is the SETL string to be converted. To initiate a new, initially empty collection of string objects (which can then be edited, and subsequently saved using the 'close_string_set' call) one writes

open_string_set(OM);
This returns an empty mapping.

Examples of the use of Large-string Objects. The prototype code seen in a subsequent subsection implements large-string objects of the kind discussed above. That code is organized around one principal object class, 'stgs_btup', and several auxiliary packages, as detailed below. Of these, the most salient is 'string_sets', which provides the 'open_string_set' and 'close_string_set' procedures discussed above. As seen in our first example, 'stgs_btup' provides an alternative creation call of the form 'stgs_btup(0)' which can be used to create a void string object if one is wanted in order to invoke one of the utility methods of the 'stgs_btup' class, e.g. 'cleanup_space()', which provides a report on the number of items lost from the expected free space pool, if any. (After all big-string objects have been erased, this can be used to check for 'memory leaks' in an application using these objects.) The global variable 'dump_count', made public by the 'string_sets' package, indicates the number of records that have been swapped to disk during a run.

Our first example creates an initially empty set of strings, and an empty string object, and then reads a sample text file, by 10K sections, inserting them into the string object. Once this has been done, a section of the string object is accessed and printed, the string is saved in a string set under the name '"Chapter 10.1"', and the string set is closed, being written into a file 'Chapter_10_file' which keeps it persistent. This file is then re-opened immediately, and the saved string object is accessed. After this, a few additional text sections are extracted and printed, and the string object is again saved, in the same file, under the same name. (For your own tests you can substitute any available text file.)

program test; -- test of stgs_btup operations
 use stgs_btup,string_sets; 
 	-- use the bigstring object class and its supporting package

 setup_chapter_test; -- test of half meg chapter string setup

 stgs_btup(0).cleanup_space(); 
 	-- *** for debugging: consume entire remaining free list
 print("\nDone *** ",dump_count," pages evicted during testing. ");

 procedure setup_chapter_test; -- test of half meg chapter string setup
 
 open_string_set(OM); -- create an initially empty set of strings
 
 file_handle := open(":SETL BOOK - HTML:Chapters:Chapter_10.html","RANDOM"); 
 	-- open the text file to be used in our example
 
 bigstring := stgs_btup(""); -- start an empty bigstring
 
 for j in [1,10001..fs := fsize(file_handle)] loop  -- read a text file, by sections
 gets(file_handle,j,(9999 min (fs - j)),piece);
 bigstring := bigstring + stgs_btup(piece);   
 	-- insert each section read into the big string
 end loop;
 
 close(file_handle); -- close the source file
 
 print; print("Part of bigstring of length: ",#bigstring); 
 	-- show the length of the big string constructed
 print; print(piece := bigstring(1000..1500));  
 	-- extract and display a section of it
 
 close_string_set("Chapter_10_file",{["Chapter 10.1",bigstring]}); 
 	-- save and close the big string just constructed
 
 print("\nRe-opening:\n");   -- now re-open the big string 
 name_to_strings_map := open_string_set("Chapter_10_file");
 print("domain name_to_strings_map: ",domain(name_to_strings_map)); 
 	-- display the names used for the items saved
 
 chapter := name_to_strings_map("Chapter 10.1");  -- access one of these items
 
 print("\nCHAPTER PIECE:\n",chapter(1000..1500));  
 	-- extract and display several sections of it
 print("\nCHAPTER PIECE2:\n",chapter(400..500));
 close_string_set("Chapter_10_file",{["Chapter 10.1",chapter]}); 
 	-- save the setup, in the same file as before
 
 chapter.erase();    
 	-- erase the sole surviving object, to verify that all memory is recovered
 
 end setup_chapter_test;

end test;

Next we show code which re-opens this same string object, creates a second and a third copy of it within which designated sections are overwritten, saves all three copies to the file originally containing the one copy, and finally re-opens this file to verify that all three logical copies are present. Note that the very low-cost copy operation used, which the 'stgs_btup' class provides as a method, exploits the fact that our string objects have value-like semantics. (For brevity, we show the code as a procedure, which for testing can be wrapped into the program context seen in our first example.)

 procedure read_chapter_test;  -- test of re-opening and generation of modified versions

	name_to_strings_map := open_string_set("Chapter_10_file");			
		-- access the previously saved large-string object
	chapter := name_to_strings_map("Chapter 10.1");

	print("\nCHAPTER PIECE:\n",chapter(300000..300500)); 		
		-- display various substrings of it
	print("\nCHAPTER PIECE2:\n",chapter(400000..400500));
	print("\nCHAPTER PIECE3:\n",chapter(350000..350500));
	
	chap_copy := chapter.copy();     -- create a second copy
	chap_copy(350000..350500) := stgs_btup("XXXX");   
		-- in which 500 characters are X-ed out
	
	chap_copy2 := chap_copy.copy();     -- and a third copy 
	chap_copy2(350004..350500) := stgs_btup("ZZZZ"); 

	print("\nCHAPTER PIECE3:\n",chapter(350000..350500)); 
		-- display pieces from each of these three versions
	print("\nCHAPTER PIECE3.vers2:\n",chap_copy(350000..350500));
	print("\nCHAPTER PIECE3.vers3:\n",chap_copy2(350000..350500));

	close_string_set("Chapter_10_file", -- save the setup
		{["Chapter 10.1",chapter],["Chapter 10.1.vers2",chap_copy],
			["Chapter 10.1.vers3",chap_copy2]}); 

	print("\n*** AFTER RE-OPENING.......\n");   
		-- re-open the saved string-set
	
	name_to_strings_map := open_string_set("Chapter_10_file");
	print(domain(name_to_strings_map));   
		-- display the names of the collection of string objects saved

	vers1 := name_to_strings_map("Chapter 10.1");   
		-- access all three of these string objects
	vers2 := name_to_strings_map("Chapter 10.1.vers2");
	vers3 := name_to_strings_map("Chapter 10.1.vers3");

	print("\nCHAPTER PIECE3:\n",chapter(350000..350500)); 
		-- display pieces from each of these three versions
	print("\nCHAPTER PIECE3.vers2:\n",chap_copy(350000..350500));
	print("\nCHAPTER PIECE3.vers3:\n",chap_copy2(350000..350500));
	
	chapter.erase(); chap_copy.erase(); chap_copy2.erase();  
		-- erase these to get an undistorted space usage report
	
 end read_chapter_test;
The output produced by the preceding code is as follows. By correlating this output with the code that produces it, you can verify that three string versions of the long (e.g. half megabyte) string object are generated and stored efficiently. (A file containing a single version of this string is 788 KB; a file containing all three versions is not noticeably different in size).
	Opened string_set: Chapter_10_file

	CHAPTER PIECE:
	program to see how text line selection and keyboard focus relate. 
		

	<PRE>	<B>program</B> test;    -- SETL interactive interface example 1
		<B>use</B> tkw,string_utility_pak; -- use the main widget class
		
		<B>var</B> tline;    -- globalize for use in procedure below
		
		Tk := tkw(); Tk(OM) := "Selection <B>in</B> textline";   
			 -- create the Tk interpreter
		
		tline := Tk("entry",30); tline("side") := 

	CHAPTER PIECE2:
	horizontal
	 obj("place,x") := <B>str</B>(nax);     -- assign new x coordinate
	<B>else</B>     -- if dragging is only vertical
	 obj("place,y") := <B>str</B>(nay);     -- assign new y coordinate
	<B>end if</B>;

	<B>end loop</B>;

	<B>end lambda</B>;
	
	<B>end</B> attach_drag_noproc;

	<B>procedure</B> make_drop_sensitive(the_obj,drop_response); 

	CHAPTER PIECE3:
	</CENTER>

	<P>sets, a slider's value, repositioning it appropriately, 
	as shown by the following short program, which creates a slider 
	and two buttons which are bound to operations which set the slider value. 

	<PRE>	<B>program</B> test;	  -- SETL interactive interface example 1
			<B>use</B> tkw;		-- use the main widget class
		
			Tk := tkw();			-- create the Tk interpreter
			Tk(<B>OM</B>) := "Interactive interface example 1";
		
			slider := Tk("scale","0,100"); -- param

	CHAPTER PIECE3:
	</CENTER>

	<P>sets, a slider's value, repositioning it appropriately, 
	as shown by the following short program, which creates a slider 
	and two buttons which are bound to operations which set the slider value. 

	<PRE>	<B>program</B> test;	  -- SETL interactive interface example 1
		<B>use</B> tkw;		-- use the main widget class
		
		Tk := tkw();		-- create the Tk interpreter
		Tk(<B>OM</B>) := "Interactive interface example 1";
		
		slider := Tk("scale","0,100"); 
			-- parameters are lower and upper limit of slider range
		slider("side") := "top"; 
		slider("orient,length") := "horizontal"; 	-- can be 'vertical'
		slider("length,width") := "350,10";		-- physical size of slider
		
		but := Tk("button","Set slider value = 30"); but("side") := "top"; 
			-- create and place a first button
		but{<B>OM</B>} := 
			<B>lambda</B>(); slider(<B>OM</B>) := 30; <B>
			end lambda</B>;	

		but2 := Tk("button","Reduce slider value by 1"); 
		but2("side") := "top"; -- create and place a second button
		but2{<B>OM</B>} := <B>lambda</B>(); 
		slider(<B>OM</B>) := slider(<B>OM</B>) - 1; <B>
		end lambda</B>;	
		
		Tk.mainloop();		-- enter the Tk main loop
		
		<B>end</B> test;</PRE>
	<P>
	The expression 

	<P><CENTER>slid.get(ij)</CENTER>

	<P>returns the slider value corresponding to the pixel position ij
	 of the slider trough (relative to the toplevel window containing the slider). 
	 The related operation 

	<P><CENTER>slid.identify(ij)</CENTER>

	<P>returns one of the values "tr

	Closing string_set: Chapter_10_file

	*** AFTER RE-OPENING.......

	Opened string_set: Chapter_10_file
	{"Chapter 10.1", "Chapter 10.1.vers2", "Chapter 10.1.vers3"}

	CHAPTER PIECE3:
	</CENTER>

	<P>sets a slider's value, repositioning it appropriately, 
	as shown by the following short program, which creates a slider 
	and two buttons which are bound to operations which set the slider value. 

	<PRE>	<B>program</B> test;	  -- SETL interactive interface example 1
		<B>use</B> tkw;		-- use the main widget class
		
		Tk := tkw();				-- create the Tk interpreter
		Tk(<B>OM</B>) := "Interactive interface example 1";
		
		slider := Tk("scale","0,100"); 
			-- parameters are lower and upper limit of slider range
		slider("side") := "top"; 
		slider("orient,length") := "horizontal"; 	-- can be 'vertical'
		slider("length,width") := "350,10";		-- physical size of slider
		
		but := Tk("button","Set slider value = 30"); but("side") := "top"; 
			-- create and place a first button
		but{<B>OM</B>} := <B>lambda</B>(); slider(<B>OM</B>) := 30; <B>end lambda</B>;	

		but2 := Tk("button","Reduce slider value by 1"); 
		but2("side") := "top"; -- create and place a second button
		but2{<B>OM</B>} := <B>lambda</B>(); 
				slider(<B>OM</B>) := slider(<B>OM</B>) - 1; 
		<B>end lambda</B>;	
		
		Tk.mainloop();		-- enter the Tk main loop
		
		<B>end</B> test;</PRE>
	<P>
	The expression 

	<P><CENTER>slid.get(ij)</CENTER>

	<P>returns the slider value corresponding to the pixel position ij
	of the slider trough (relative to the toplevel window containing the slider). 
	The related operation 

	<P><CENTER>slid.identify(ij)</CENTER>

	<P>returns one of the values "tr
	after cleanup: 0 pages lost: 

	Done *** 0 pages evicted during testing. 

Code for large-string management with reference counting. We will now detail the code for the large-string objects described and used in the preceding section. This code is organized into the following class and packages:

  1. class stgs_btup; -- top-level big-string class

  2. package btup_for_stgs; -- package managing btuples for strings; works directly with pointers to physical addresses

  3. package string_sets; -- auxiliary package for managing string_sets consisting of btuples for strings

  4. package buffered_disk; -- buffered disk with provision for in_RAM refcount storage

The 'btup_for_stgs' package provides basic B-tree manipulation, adapted for trees whose leaves contain strings, and in particular handles the bulk of all cumulant and reference count maintenance. This package is to some extent a reworking, as a package, of the generic B-tree class described previously, differing principally by the addition of reference count maintenance instructions and by replacement of the recursive leaf-search sequences seen in the generic code with the somewhat modified sequences needed to work with character positions rather than leaf numbers. The 'stgs_btup' class superimposes a string-like syntax on the underlying B-tree capabilities which 'btup_for_stgs' provides, and also handles most leaf-level character manipulation, so that 'btup_for_stgs' can work with entire leaves. 'buffered_disk' supports 'btup_for_stgs' by providing record allocation/deallocation and primitive buffered read-record/write-record, record copying, and reference-count operations. It also provides a diagnostic call for assessing memory usage, and a group of auxiliary debug-oriented procedures designed for tracking down memory leaks. (A bit more is said below about this interesting kind of debugging.) The 'string_sets' package handles persistence by providing just the two calls 'open_string_set' and 'close_string_set' already described above.

The fundamental 'btup_for_stgs' package makes the following procedures available.

 procedure findchar(stg_ptr,i);  
	-- character location; returns [leaf_contents,charpos]
 procedure set_leaf(stg_ptr,i, x); 
 	-- leaf change
 procedure endslice(stg_ptr,i);  
 	-- end slice extraction; part from leaf containing character i to end
 procedure slice(stg_ptr,i,j);  
 	-- general slice extraction, from the leaf containing 
 	-- character i to the leaf containing character j
 procedure set_slice(stg_ptr,i,j,x); 
 	-- slice assignment; over-write the part from the 
 	-- leaf containing character i to the leaf containing character j
 procedure cat_with(stg_ptr,x);  -- concatenation

 procedure stg_of_tree(stg_ptr);  -- string from tree

 procedure h(stg_ptr);    -- retrieval of integer height of tree node
 procedure stg_cum(stg_ptr);   
 	-- retrieval of all 4-byte string stg_cumulant values
 procedure stg_cum1(stg_ptr,j);  
 	-- retrieval of specified 4-byte string stg_cumulant value. 
 procedure set_h(stg_ptr,val);  -- set height value of tree node
 procedure set_stg_cum(stg_ptr,val); 
 	-- set all stg_cumulant values of tree node to components of a tuple
 procedure set_stg_cum1(stg_ptr,j,val); 
 	-- set specified stg_cumulant value of tree node to string val. 
 procedure tup(stg_ptr);    
 	-- retrieval of vector of all children or leaf strings, 
 	-- as determined by h() value
 procedure tup1(stg_ptr,j);   
 	-- retrieval of specified string- or record-valued child value, 
 	-- as determined by h() value. 
 procedure set_tup(stg_ptr,val);  
 	-- set all string- or record-valued child values. 
 	-- val must be a tuple, and all values are set
 procedure set_tup1(stg_ptr,j,val); 
 	-- set specified string- or record-valued child value, 
 	-- val can be a string or btup.

 procedure set(stg_ptr,t);   -- set a B-tree from a tuple of strings

 procedure initialize(f_name);  -- various required initializations

These procedures fall into several groups. The first group, consisting of 'findchar', 'set_leaf', 'endslice', 'slice', 'set_slice', and 'cat_with', are the basic operations which the 'stgs_btup' class requires. Once the necessary adaptation code has been supplied, these operations are easily repackaged in the 'stgs_btup' class as obj(j), obj(j) := c, obj(j..), obj(i..j), obj(i..j) := x, and obj_1 + obj_2. The 'stg_of_tree' operation, which converts a tree into the SETL string that it represents, reappears in the 'stgs_btup' class as 'selfstr'. The next group of operations provided by 'btup_for_stgs' are simple field access and edit operations, one pair for each of the three types of tree-node fields that need to be maintained: child pointers, cumulants, and node heights. For the child pointers and cumulants these are provided in two versions, one of which accesses a single child or cumulant, while the second accesses the whole list of children or cumulants. 'set' builds a tree from a specified list of SETL strings to be inserted into its leaves, and is used by the 'stgs_btup' 'create' routine. 'initialize' simply supplies various required initializations.
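For concreteness, here is a small, hypothetical usage sketch of these repackaged operations (the literal strings are invented; as in the examples of the preceding section, a string set is opened first so that the supporting packages are initialized):

 open_string_set(OM);   -- start with an empty string set
 bs := stgs_btup("Hello, world!");   -- create a bigstring from a SETL string
 c := bs(8);   -- 'findchar', repackaged as obj(j)
 piece := bs(1..5);   -- 'slice', repackaged as obj(i..j)
 tail := bs(8..);   -- 'endslice', repackaged as obj(j..)
 bs(1..5) := stgs_btup("Howdy");   -- 'set_slice', repackaged as obj(i..j) := x
 bs := bs + stgs_btup(" Bye.");   -- 'cat_with', repackaged as obj_1 + obj_2
 print(str(bs));   -- 'stg_of_tree', available as str(obj)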

All the procedures of 'btup_for_stgs' work with 4-byte strings representing encoded 'physical' record addresses, and are also passed numerical parameters representing character positions. Using a character's position, it is easy to identify both the leaf containing it, and the character's position within the leaf. The objects of the 'stgs_btup' class make the 4-byte record locators of the top nodes of trees available, using an intermediate pointer atom to avoid unwanted object copying. 'stgs_btup' puts all character position parameters passed to 'btup_for_stgs' into forms that 'btup_for_stgs' can use to find the sub-trees and leaves that it must locate.

We begin our detailed presentation of these routines with 'findchar' and 'set_leaf'. These are

 procedure findchar(stg_ptr,i);  
	-- character location; given char no, returns [leaf_contents,charpos]

  if i < 0 then return OM; end if;   
  	-- test for index in range, 
  	-- returning final character position if index is too large
  if i > (nc := stg_cum1(stg_ptr,#(stg_cumm := stg_cum(stg_ptr)))) then 
    return nc; 
  end if; 

  must := exists sc = stg_cumm(j) | i <= sc;   
  	-- find the index of the node containing the character
  
  prior_sc := if j = 1 then 0 else stg_cumm(j - 1) end if;

  if h(stg_ptr) = 1 then  -- in this case we actually return a leaf pointer
  	return [tup1(stg_ptr,j),i - prior_sc]; 
  end if; 
  
  [ln,cn] := findchar(tj := tup1(stg_ptr,j),i - prior_sc);    
  	-- search recursively for the desired leaf
  decref(tj);        -- since tj is not used past this point

  return [ln,cn];  --find the desired [leaf_index,charpos] recursively

 end findchar;
and
 procedure set_leaf(stg_ptr,i, x);  -- leaf change
   -- descend iteratively to leaf containing character i 
   -- using cumulant to find child containing it
   -- then change the value at this node
  
  stg_ptr := may_copy_rec(stg_ptr);   -- prepare the current node for editing
 
  cumm := stg_cum(stg_ptr);     -- get the cumulant vector
  
  if h(stg_ptr) = 1 then   
  -- handle bottom-level case. we must always adjust 
  -- all the string cumulants of following nodes

   if i >= 1 and (exists c = cumm(j) | i <= c) then    
   	-- leaf to be changed lies under the jth node
 
    prev_stg_cum := if j = 1 then 0 
    	else stg_cum1(stg_ptr,j - 1) end if;  
    		-- compute change in leaf length
    prev_len := stg_cum1(stg_ptr,j) - prev_stg_cum; 
    len_delta := #x - prev_len;
    for k in [j..#cumm] loop 		
		 set_stg_cum1(stg_ptr,k,stg_cum1(stg_ptr,k) + len_delta); 
		end loop;
    set_tup1(stg_ptr,j,x);    
    	-- set the appropriate leaf; then adjust the cumulants; 
    	-- note that the 'sets' only work for refcount = 1
 
    return stg_ptr;

   end if;
   
   abort("index " + str(i) + " is out of range"); 		
   	-- note out-of-range case
   
  end if;   -- end of height 1 case. 
  		-- In the remaining cases, we proceed recursively.
  
  if i <= cumm(1) then    
  	-- leaf to be changed lies under the first node 

   ot1 := tup1(stg_ptr,1);     -- get the old first child of this node
   oscn := stg_cum1(stg_ptr,1); osc := stg_cum(stg_ptr);   -- and its old string cumulant
   t1 := set_leaf(ot1,i,x);     
   	-- and now we must adjust all the string cumulants of following nodes

   set_tup1(stg_ptr,1,t1);     -- put the new component in place

   scn := stg_cum1(t1,#stg_cum(t1));   -- get the new string cumulant of this node
   len_delta := scn - oscn;
              -- adjust all the cumulants
   for j in [1..#cumm] loop 
   	set_stg_cum1(stg_ptr,j,stg_cum1(stg_ptr,j) + len_delta); 
   end loop;

   decref(t1);   -- t1, which is the same as ot1, is dead past this point
   
  elseif exists c = cumm(j) | i <= c then    
  	-- leaf to be changed lies under the jth node

   otj := tup1(stg_ptr,j); 
   oscn := stg_cum1(stg_ptr,j) - stg_cum1(stg_ptr,j - 1);  
   	-- get the old string cumulant of this node
   tj := set_leaf(otj,i - cumm(j - 1),x);   
   	-- and now we must adjust all the string cumulants of following nodes
   scn := stg_cum1(tj,#stg_cum(tj));   
   	-- get the new string cumulant of this node
   len_delta := scn - oscn;
              -- adjust all the cumulants
   for k in [j..#cumm] loop
     set_stg_cum1(stg_ptr,k,stg_cum1(stg_ptr,k) + len_delta); 
   end loop;

   set_tup1(stg_ptr,j,tj);      -- put the new component in place
   decref(tj);   -- tj, which is the same as otj, is dead past this point
 
  else 
   abort("index " + str(i) + " is out of range"); 
  end if;

  return stg_ptr;
  
 end set_leaf;

Note that all the node and cumulant manipulation seen in the above routines is accomplished using the routines h(stg_ptr), tup(stg_ptr), tup1(stg_ptr,j), set_tup1, stg_cum1, etc. furnished by the package 'btup_for_stgs'. The same is true for the other bigstring-manipulation routines shown just below. We hold religiously to this constraint to ensure that the rules concerning reference counts described in the preceding section, whose application sometimes grows delicate, are properly observed. We aim to confine as many as we can of the 'incref' and 'decref' operations needed to handle reference counts within these low-level 'btup_for_stgs' routines, since this will ease implementation of higher level B-tree codes like those seen above and below. This design goal succeeds in part. For example, the code seen above contains only three occurrences of 'decref' and none of 'incref'; the more complex code seen below contains only eleven occurrences of 'decref', seven of 'decref_tup' and two of 'incref'. Much of the remainder of the work needed to manage reference counting is successfully hidden within set_tup and set_tup1. Note also that, solely for purposes of debugging, 'btup_for_stgs' supplies dummy versions of the reference count incrementation and decrementation routines 'incref' and 'decref', and also dummy 'allocate_record' and 'free_record' routines. The production versions of these routines are supplied by the 'buffered_disk' package, whose design and optimization have been described in the discussion of reference-count optimization above. A full list of the procedures supplied by 'buffered_disk' appears below.
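The 'decref_tup' operation just mentioned does not appear in either procedure list; in the prototype it can be thought of as the following trivial helper (a sketch; it is applied only to tuples of child pointers, never to tuples of leaf strings):

 procedure decref_tup(t);   -- drop one reference to each child pointer in t
  for c in t loop decref(c); end loop;
 end decref_tup;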

Next we consider the more complex 'cat_with' procedure, which is

 procedure cat_with(stg_ptr,x);  -- concatenation

  if is_tuple(xx := x) then   -- force second argument to btuple
  	x := stgs_btup(stg_ptr); x := set(stg_ptr,xx); 
  end if;
  if #stg_cum(stg_ptr) = 0 then 
  	if xx = x then incref(x); end if; 
  	return x; 
  end if; 

  if #(xc := stg_cum(x)) = 0 then 
  	if xx /= x then decref(x); end if; 
  	incref(stg_ptr); return stg_ptr; 
  end if;
     -- null cases. Increment the input returned, 
     -- since it becomes the return value

  if (xh := h(x)) = (hh := h(stg_ptr)) then   
  	-- concatenated btup has same height

   set_h(double_stg_ptr,hh);   -- perform a top-level concatenation
   tupz := (tupa := tup(stg_ptr)) + (tupb := tup(x)); 
   set_tup(double_stg_ptr,tupz); 

   if hh > 1 then decref_tup(tupa); decref_tup(tupb); end if;
      -- previously created references to components 
      -- of tupa and tupb must be removed

   c2l := stg_cum1(stg_ptr,#stg_cum(stg_ptr)); 
   set_stg_cum(double_stg_ptr,stg_cum(stg_ptr) + [c + c2l: c in stg_cum(x)]);
       -- get cumulant of concatenation, adjusting second part

   if xx /= x then decref(x); end if;    
   	-- erase x if it was created in this routine

   return ifsplit();   -- split if necessary 

  elseif xh < hh then  -- concatenated btup is shorter

   old_end_elt := tup1(stg_ptr,nt := #stg_cum(stg_ptr));  
   	-- get the final descendant of this tree's root
   end_elt := cat_with(old_end_elt,x);       
   	-- recursively catenate x to this final descendant
   decref(old_end_elt);  -- this element is now dead

   if h(end_elt) < hh then          
   	-- the subconcatenation has not split

    new := allocate_record();         -- allocate a new record
    set_h(new,hh);		-- copy the top node of this tree
    set_tup(new,tupp := tup(stg_ptr)); 
		set_stg_cum(new,stg_cum(stg_ptr));    

    set_tup1(new,nt,end_elt);        
    	-- just install the concatenation as the new final node of this tree
    set_stg_cum1(new,nt,stg_cum1(new,nt) + stg_cum1(x,#stg_cum(x)));  
    	-- adjust the cumulant to reflect the added descendants
    decref(end_elt); decref_tup(tupp);      -- these elements are now dead
    
    if xx /= x then decref(x); end if;    
    	-- erase x if it was created in this routine

    return new;  -- return the new tree

   end if;

     -- otherwise the subconcatenation has grown taller, i.e. has split
     -- the descendants of the new toplevel root replace 
     -- the prior last node at this level
     -- (which is the node to which x was concatenated)

   set_h(double_stg_ptr,h(stg_ptr)); 
   noot := orig_tup := tup(stg_ptr); 
   scyum := stg_cum(stg_ptr);
       -- start to copy the top node of this tree, 
       -- using an extra-size node record to avoid
       -- the possibility of overflow

   noot(nt..nt) := (lexpt := tup(end_elt)); 
   set_tup(double_stg_ptr,noot);
       -- replace the last child of the original tree 
       -- with the children of the new split sub-tree
 
   decref_tup(lexpt); decref_tup(orig_tup);  -- these elements are now dead
 
   c2ml := if nt = 1 then 0 else scyum(nt - 1) end if;  
   	-- get stg_cumulant prior to the last node of original tree 
   scyum(nt..nt) := [c + c2ml: c in stg_cum(end_elt)]; 
   set_stg_cum(double_stg_ptr,scyum);
            -- add this to all subsequent string cumulants
   decref(end_elt);  -- this element is now dead

   if xx /= x then decref(x); end if;    
   	-- erase x if it was created in this routine
   return ifsplit();   -- split if necessary 
   
  else  -- otherwise the concatenated element is taller

   first_x_elt := tup1(x,1);   -- get the first element of x
   tot_stg_cum := stg_cum1(stg_ptr,#stg_cum(stg_ptr));   
   	-- total string cumulant of original tree

   first_x_elt := cat_with(stg_ptr,oldfirst_x_elt := first_x_elt);  
   	-- recursively catenate the present tree (to the left) with the first child of x
   decref(oldfirst_x_elt);          -- this element is now dead

   if h(first_x_elt) < xh then         -- the subconcatenation has not split

    new := allocate_record();         -- allocate a new record
    set_h(new,h(x)); set_tup(new,tupx := tup(x)); 
    set_stg_cum(new,stg_cum(x));   -- copy the top node of x
       -- note that the first node, which may have changed 
       -- (pointer semantics here) will not be used

    set_tup1(new,1,first_x_elt);        
    	-- the new subconcatenation becomes first child 
    	-- of the new tree created above

    for j in [1..#xc] loop   -- adjust the later cumulants
    	set_stg_cum1(new,j,stg_cum1(new,j) + tot_stg_cum); 
    end loop;

    decref(first_x_elt); decref_tup(tupx);     -- these elements are now dead

    if xx /= x then decref(x); end if;
       -- erase x if it was created in this routine

    return new;            -- return the new tree

   end if;
     -- otherwise the subconcatenation has grown taller, i.e. has split
     -- the descendants of the new toplevel root 
     -- replace the prior first node at this level
     -- (which is the node of x to which our original tree was concatenated)

   set_h(double_stg_ptr,h(x)); tupp := orig_tup := tup(x); scyum := stg_cum(x);   
   	-- start to copy the top node of x

   tupp(1..1) := (fexpt := tup(first_x_elt)); set_tup(double_stg_ptr,tupp); 
       -- replace the first child of the original tree 
       -- with the two children of the new split sub-tree, 
       -- using a double-size record

   decref_tup(fexpt); decref_tup(orig_tup); -- these elements are now dead
   
   scyum(1..1) := stg_cum(first_x_elt); set_stg_cum(double_stg_ptr,scyum);

   for j in [3..#scyum] loop   -- adjust all later cumulants
   	set_stg_cum1(double_stg_ptr,j,stg_cum1(double_stg_ptr,j) + tot_stg_cum); 
   end loop;

   decref(first_x_elt);          -- this element is now dead
   
   if xx /= x then decref(x); end if;    
   	-- erase x if it was created in this routine
 
   return ifsplit();   -- split if necessary 
 
  end if;

 end cat_with;


The 'buffered_disk' package makes the following procedures available.

	 procedure start_disk();  
		-- initialize the buffered disk by reading disk_buf 

	 procedure open_buffered_disk(f_name);  
		-- initialize the buffered disk by reading disk_buf; 
		-- returns previously saved aux_stg 

	 procedure close_buffered_disk(fname,aux_stg);   
		-- close the buffered disk by adding the byte_buf values into disk_buf, 
		-- and writing everything to the backup file

	 procedure build_more_freelist();  
		-- get another block of memory from the system, 
		-- and set it up as an additional free_list 

	 procedure decref(stg_ptr);   
		-- decrement refcount of referenced record, possibly freeing it

	 procedure incref(stg_ptr);   
		-- increment the refcount of the record referenced by a stg_ptr

	 procedure refcount(stg_ptr);  
		-- show the refcount of a node

	 procedure set_refcount(stg_ptr,n);  
		-- set the refcount of a node

	 procedure allocate_record();   
		-- get a new record from the free_list

	 procedure free_record(stg_ptr);  
		-- recycle a record by linking it into the free_list

	 procedure note_replace_ref(old_stg_ptr,new_stg_ptr);   
		-- adjust the refcounts when one reference is being replaced by another

	 procedure may_copy_rec(stg_ptr);  
		-- copy the value of the stg_ptr to a new record if the refcount is > 1; 
		-- set the refcount of the new copy to 1; 

	 procedure may_copy_stgrec(stg_ptr); 
		-- copy the value of the stg_ptr to a new record if the refcount is > 1; 
		-- set the refcount of the new copy to 1; etc.

	 procedure readval(stg_ptr);   
		-- reads the specified record from the disk if needed, 
		-- returning its value

	 procedure setval(stg_ptr,stg);  
		-- writes the given value to the specified record, 
		-- first creating buffer space if needed
	 procedure may_dump();    
		-- dump a record if the buffering space is full

	 procedure cleanup_allocations();   
		-- *** for debugging: consume entire remaining free_list

	 procedure hold_rec(stg_ptr);    
		-- *** for debugging: note that an item is to be decrefed

	 procedure drophold_rec(stg_ptr);   
		-- *** for debugging: remove a previously noted hold on an item

	 procedure erase_look(label);    
		-- *** for debugging: eraseability check

	 procedure dump_refs_in_tree(stg_ptr);  
		-- *** debugging: show refs at all levels of tree, 
		-- given by and working with string 

12.10 The full SETL database design - Introduction. SETL databases of the type described in the first section of this chapter are implemented using underlying objects of two types (ignoring, for the moment, those other structures needed for crash recovery, which are described in a later section). These are the logical record list and the word index. The logical record list can be thought of as an ordered list of strings, each consisting of the binstr version of a SETL object (the 'record'), prefixed by a 4- (possibly 5-) byte identifier field (the 'record id', generated to be unique). This list is kept in lexicographic order of the record id fields within it.

In abstract logical terms, the word index is an ordered tuple of pairs [h,id], where h is the 5-byte hash h = Hfcn(w) of a word w, and id is the index of a database record R whose string summary contains a (blank-delimited) word having the hash h. Since the (system) hash function used makes it unlikely that two distinct words drawn from the collection appearing in such a string have the same hash, almost all pairs [h,id] in the word index will correspond to cases in which a desired word w appears in R's string summary. Abstractly speaking, the word index is kept in lexicographic order of the pairs [h,id] in it. However, since this can be expected to create long runs of pairs with identical first components, efficiency is improved by suppressing the first component of every pair in such a run, except for the very first pair in the run. This results in a tuple of elements

h, id1, id2, .., idn, h', id'1, id'2, .., id'n', h'', id''1, id''2, .., id''n'', ...

where the elements h appear in increasing order, and the elements id appear in increasing order between any two successive elements h but not overall. The cumulants kept for this tuple record not only the number of components descending from each node N in its representing B-tree but the largest hash h descending from N and any of its left siblings. This gives us fast access to the last tuple component of type h satisfying any specified inequality h <= H, and the next tuple component of type h' following h. In this way, we find the run id1, id2, ..,idn of components of type id which appears between h and h', and so we have what we need to implement database iterator objects and db.contains(wd), and the further iterator objects

		iter_obj + iter_obj (union iteration)
		iter_obj * iter_obj (iterate over intersection),
		iter_obj - iter_obj (iterate over difference)
deriving from them.
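
Returning to the layout of the word index itself, the following hypothetical sketch locates the run of record ids associated with a given hash by a simple linear scan over the flattened tuple. (The production structure instead uses the B-tree cumulants just described to find the run without scanning; here we rely only on the fact that hashes are 5 bytes long while record ids are 4 bytes, so that the two kinds of components can be told apart.)

 procedure ids_for_hash(word_index_tup,h);   -- illustrative linear-scan version
  run := OM;   -- becomes a tuple once the run for h has started
  for elt in word_index_tup loop
   if #elt = 5 then   -- a hash component: some run starts (or ends) here
    if run /= OM then return run; end if;   -- the run for h has just ended
    if elt = h then run := []; end if;   -- the run for h starts here
   elseif run /= OM then
    run with:= elt;   -- a record id belonging to the run for h
   end if;
  end loop;
  return if run = OM then [] else run end if;   -- [] if h never occurs
 end ids_for_hash;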

Logical record list object and word index objects must, like the bigstring objects described earlier and the SETL database objects which record list and word index objects serve to implement, have value semantics. Since they are stored on disk, they must be supplied with the internal reference-count facilities needed to implement value semantics for very large, disk-stored objects. Our earlier discussion of the implementation of value semantics for bigstrings shows what is needed. For database objects implemented using them to be crash-recoverable and/or distributed in the sense described later in this chapter, it is necessary and almost sufficient for the logical record list and word index objects used to be crash-recoverable and/or distributed.

Logical record list objects lro need to support the two operations:

 	lro{id}		-- retrieves the string addressed by the 4-byte id 'id'
 	lro{id} := x;	-- replaces the string addressed by 'id', if any, by x;
 			-- if there is no such string, inserts one
 
Word index objects wio need to support three operations:
	wio.insert(wd_hash,id)		-- registers the 4-byte record id 'id'
 			-- as one of those associated with the word-hash 'wd_hash' 
 	wio.delete(wd_hash,id)		-- unregisters the 4-byte record id 'id'
 			-- from the list of those associated with the word-hash 'wd_hash' 
 	wio{wd_hash}	-- returns an iterator object which iterates over 
 			-- all records associated with the word-hash 'wd_hash' 
 
The database code shown below (code for class SETL_database) makes it clear that once these operations are made available, programming of the SETL databases we have described is quite easy. Little more is needed than to create a kind of high-level 'wrapper' object for a pair of objects lro and wio.

A few additional details concerning the logical record list and word index classes will illuminate some of the major issues arising in their design. Full details concerning these structures are given later.

Logical record list objects. The records R stored within a database can vary considerably in size (from a few dozen bytes to dozens of megabytes). However, we may assume that no record is too large to be managed comfortably in RAM. (On current computers, this implies a maximum size no greater than several dozen megabytes). For reasons that have appeared in our earlier discussion of the realization of large string objects using pages on disk, we want to arrange these 'abstract logical' records in physical records P (we continue to call these 'pages') of fixed size, e.g. 1KB. To accomplish this, we allow any single logical record R to be cut up into a series of record fragments R1,R2,..,Rn, each short enough to be held in a single disk page P. Each of these fragments is prefixed with the 4-byte record id of the overall record R, allowing R to be reconstructed by

  1. locating all the record fragments which are prefixed by R's record id;

  2. stripping the prefixed id from all but the first of these and concatenating the resulting series of reduced fragments (see the sketch following this list).
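
In sketch form (hypothetical; 'fragments' is assumed to be the tuple of fragments located in step 1, in their original order), step 2 is simply

 procedure reassemble(fragments);   -- concatenate fragments, stripping repeated id prefixes
  r := fragments(1);   -- the first fragment keeps its record id prefix
  for k in [2..#fragments] loop
   frag := fragments(k);
   r := r + frag(5..);   -- drop the 4-byte record id prefix from the rest
  end loop;
  return r;   -- the full logical record string
 end reassemble;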

It follows that at this level of representation the logical record list can be regarded as a list of strings, none greater than 1 page in size, each prefixed by a 4- (or 5-) byte (non-unique) identifier field, arranged in increasing order of these identifier fields. The crucial operations on these objects (which we will call the 'chopped record list') are

  1. accessing the sublist of all record fragments which carry a given identifier field and concatenating them;

  2. deleting the sublist of all record fragments which carry a given identifier field;

  3. decomposing a record R with given identifier into a series of such fragments, and inserting these at their appropriate position.

When individual fragments of the chopped record list are small, multiple such fragments can be kept on a single physical page. To keep storage utilization and access efficiency high, we aim to keep pages at least 2/3 full. To this end, each page consists of a series of record fragments, prefixed in the manner just described, and each such page begins with an on-page index which gives the page location of the final byte of each of the fragments which it holds. (This index is maintained as a series of 2-byte fields, prefixed by a single byte which gives the overall length of the index.) Whenever manipulation of record fragments creates a page less than 2/3 full, a group of 3 successive pages containing this page is reprocessed by concatenating and then re-subdividing their contents into one, two, or three pages, each at least 2/3 full.
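
The following hypothetical sketch decodes a single page organized in this way into its record fragments. (We assume, as one plausible convention, that the initial byte counts the 2-byte index fields, that each field holds the big-endian page position of a fragment's final byte, and that abs applied to a one-character string yields its character code.)

 procedure fragments_of_page(page);   -- decode a page into its record fragments
  n := abs(page(1));   -- assumed: first byte counts the 2-byte index fields
  frags := []; prev_end := 2 * n + 1;   -- fragment data begins right after the index
  for k in [1..n] loop
   -- each 2-byte field gives the page position of a fragment's final byte
   frag_end := 256 * abs(page(2 * k)) + abs(page(2 * k + 1));
   frags with:= page(prev_end + 1..frag_end);
   prev_end := frag_end;
  end loop;
  return frags;   -- each fragment still carries its record id prefix
 end fragments_of_page;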

We represent the logical record list by a tuple of pages, each holding a string. These tuples are represented by B-trees like those described previously, but with an additional cumulant accelerating search for the last leaf LLNG(id) containing a record fragment whose identifier is not larger than id, where id is a given identifier. This same cumulant supports fast search for the last leaf LLSM(id) containing a record fragment smaller than id. Once these two leaves are located, the string corresponding to id is simply a terminal portion of LLSM(id) (or of its immediate successor node), plus all leaves between LLSM(id) and LLNG(id), plus an initial portion of LLNG(id), and so can be constructed rapidly using the B-tuple slicing operation. Also, deletion of the string corresponding to id is expressed by deletion of all leaves between LLSM(id) and LLNG(id), truncation of the leaves LLSM(id) and LLNG(id) (which may combine into a single leaf), and, if necessary to keep all leaves at least 2/3 full, repair of a group of 3 leaves containing these two leaves.

Word index objects. The representation of word index objects follows the same pattern: the flattened tuple of hash and record-id components described above is stored as the leaf sequence of a B-tree whose cumulants record, for each node, both the number of components descending from it and the largest hash descending from it and its left siblings, so that the run of record ids associated with a given word hash can be located without scanning. Full details are given later.

Code Details for SETL Database objects. Since crash recovery, and all needed B-tree operations, are implemented within the logical record list and word index classes provided, little remains to be done in the top-level database class. This simply needs to coordinate the logical record list and word index objects that it contains, and to give database operations their intended syntax. The iterations on 'contains' objects are provided by an auxiliary iterator object class.

The following slightly simplified code comes close to representing the full top-level logic of SETL databases.

class SETL_database;    -- SETL database class

 procedure create(itm_to_stg);  -- create an empty database

 procedure contains(wd);    
 	-- returns iterator over ids, including all those whose summary includes 'wd'
 
  -- databases support the following operations
  -- x := db(id);   -- find a database record given its identifier
  -- db(id) := x;   
  	-- store x as a database record, associating it with the specified identifier
  -- db with x    
  	-- create a new record and put x into it. 
  	-- a pair consisting of the modified db and the new id is returned.
  -- db.contains(wd)   
  	-- creates and returns a database iterator object, 
  	-- which iterates over all records whose summary string 
  	-- contains the given word
end SETL_database;

class body SETL_database;    -- SETL database class
 use btup,word_index,wd_index_iterator,database_iterator,btup_stgs_aux,string_utility_pak;
      -- use the 'btuples for id-prefixed strings' and the word_index classes

 var item_to_stg,id_generator;			
 	-- 'item_to_stg' is the user-supplied summarization routine for the database
 var logical_record_list,wd_index;		-- the two main database components

 procedure create(itm_to_stg);  -- create empty database

  item_to_stg := itm_to_stg;    
  	-- retain the original parameter, which maps 
  	-- database items to their summary strings
  logical_record_list := btup();   -- create the logical record list
  wd_index := word_index();    -- create the word index
  ^(id_generator := newat()) := 0;  -- the id_generator has pointer semantics

  return self;

 end create;

 procedure self(id);   -- database record retrieval 

  if type(id) /= "STRING" or #id /= 4 then 
  	abort("illegal record identifier in retrieval: " + str(id)); 
  end if;

  x := logical_record_list{id};	-- retrieve the desired record
 
  return if x = "" then OM else unbinstr(x) end if;

 end;

 procedure self(id) := x;  -- set database element. 
 	--If the id does not exist in the database, 
 	-- a new element with the given id is inserted
  
  if type(id) /= "STRING" or #id /= 4 then 
  	abort("illegal record identifier in assignment: " + str(id)); 	
  end if;
 
  record := logical_record_list{id};

  if x = OM then   
		-- delete the record, and all the wd_index items which refer to it
		-- do not insert any new record

	if record = "" then return OM; end if;  
		-- since record to be deleted does not exist
	
	logical_record_list{id} := OM;	-- first delete the record itself

	sum_words := [wd: wd in 
		breakup(summary := item_to_stg(unbinstr(record))," \t") | wd /= ""];
			-- calculate the string summary and its list of words

	for wd in sum_words loop 
		  -- delete all the [wd_hash,id] pairs deriving from the deleted record.
		  -- these are deleted from the word index.
		wd_hash := hash_stg(wd); 
	 	wd_index := wd_index.delete(wd_hash,id); 
	end loop;
 
	return x; 

  end if;

  logical_record_list{id} := binstr(x);   
  	-- modify the database record located by id 

  if record /= "" then   -- since record being modified did exist
		-- delete all the [wd_hash,id] pairs deriving from the record being changed.

	sum_words := [wd: wd in 
		breakup(summary := item_to_stg(unbinstr(record))," \t") | wd /= ""]; 
		-- calculate the string summary of the old record and its list of words

	for wd in sum_words loop	
		  -- delete all the [wd_hash,id] pairs deriving from the deleted record.
		  -- these are deleted from the word index.
		wd_hash := hash_stg(wd); 
		wd_index := wd_index.delete(wd_hash,id); 
	end loop; 
     
  end if;

  sum_words := [wd: wd in 
  	breakup(summary := item_to_stg(x)," \t") | wd /= ""]; 
  	-- calculate the string summary of the new record and its list of words

  for wd in sum_words loop
		  -- insert all the [wd_hash,id] pairs deriving from the new record.
		  -- these are inserted into the word index.
  	wd_hash := hash_stg(wd); 
  	wd_index := wd_index.insert(wd_hash,id); 
  end loop;
 
  return x; 

 end;

 procedure self with x;	-- create a new database element. 
 	-- a pair consisting of the modified db and the new id is returned.

	id := bytes_from_int(^id_generator := ^id_generator + 1);		
		-- advance the id generator
	logical_record_list{id} := binstr(x);						
		-- insert the new item into the database
	sum_words := [wd: wd in breakup(summary := item_to_stg(x)," \t") | wd /= ""];		
		-- calculate the string summary and its list of words

	for wd in sum_words loop	 
			-- insert all the [wd_hash,id] pairs deriving from the new record.
			-- these are inserted into the word index.
 	wd_hash := hash_stg(wd); 
 	wd_index := wd_index.insert(wd_hash,id); 
	end loop;

	return [self,id];

 end;

 procedure contains(wd);    
 	-- returns iterator over ids, including all those whose summary includes 'wd'

	res := database_iterator(wd_index_iterator(wi := wd_index{hash_stg(wd)}));
		-- wrap a wd_index_iterator built from the ids corresponding to 
		-- the specified hash into a database_iterator to allow compounding

	return res;	
		-- get the index-B-tree section addressed by the hash of the word, 
		-- and convert it to an iterator

 end contains;

end SETL_database;
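
As a quick illustration of the class in use, suppose 'db' is a SETL_database created with some summarization routine which includes a record's own words in its summary; then typical client code looks like the following (a hypothetical sketch; the record contents are invented):

 [db,id] := db with ["shirt","blue","order 1207"];   -- insert a record, obtaining its new id
 x := db(id);   -- retrieve it again
 db(id) := ["shirt","blue","order 1207","shipped"];   -- replace it
 for rid in db.contains("blue") loop print(db(rid)); end loop;  
 	-- iterate over all records whose summary contains 'blue'
 db(id) := OM;   -- finally, delete it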

Since database objects themselves are so simple, we can readily create specialized new kinds of databases by modifying the code seen above to extend its functionality. For example, one can create databases which store other databases within records, even hierarchically. Also the records held in generalized databases need not be SETL objects small enough to be held in RAM, but can include very large strings and tuples held on disk and edited only piecemeal. The B-trees used to realize these strings and/or tuples can support additional cumulants of the kind described earlier, so that they can be kept in some sorted order whenever insertions are made in them, or can support various kinds of dynamic summation or specialized access. The codes described in this chapter are not hard to extend to support such features if desired.

'Hibernating' transactions - Introduction.

The SETL database system supports families of 'hibernating' transactions which make it easy to represent multiple ongoing, but occasionally delayed streams of purpose into which applications can be structured. These are ideal for programming otherwise challenging business applications. Consider, for example, the problem of creating software for a large household goods wholesaler with many clients (retailers) and sources of supply (manufacturers). Orders and payments will be received from clients and sent to sources of supply; inventory and accounts receivable must be tracked. Goods returns must be accounted for. Orders received from clients will be filled from inventory as available, shipped, and invoiced, say on 30-day payment terms. But sometimes inventory on hand will only allow partial fulfillment of some orders, in which case additional shipments must be set up as inventory comes in. If manufacturer shipments are delayed for unexpectedly long periods, letters of apology and explanation must be sent to clients at scheduled times, and letters of inquiry to suppliers. Similarly, if payments are too long delayed, scheduled series of dunning letters must be sent to clients, and accounts occasionally suspended and reinstated. Refunds will need to be sent in cases of overpayment. The databases supporting all this activity must also support a variety of inquiries and allow generation of whatever activity summaries management finds convenient.

An application system of the kind described can be supported by a few databases, around which a buzzing cloud of immediate and deferred transactions will develop. Most of the difficulty of programming such a system lies in the management of its deferred transactions. Although these are hard to program if one thinks in the strictly serial terms which standard programming languages encourage, such applications become easier if one decomposes them into multiple streams of purpose by creating multiple virtual 'clerks', each responsible either for a single client or supplier account, for tracking a single client order (or possibly all orders from this client) from start to finish, for tracking the progress of resupply triggered by an order to a supplier, etc. These 'clerks' can then be created when their stream of purpose originates, and disappear when their purpose is fulfilled. When blocked by some temporary condition from proceeding, they must 'hibernate' until this condition changes, but then awaken and go on until they either reach their goal or need to hibernate again.

We represent transactions as objects (so that they can have internal state), which are required to support methods

	trans.wait([wd1,...,wdn]);
		-- suspend operation of a transaction until a subsequent 
		-- 'proceed' operation is invoked for it (see details below)

	trans.tickle(wd_hash,rec_id);
		-- notify transaction that a word of interest to it 
		-- has changed within the record with rec_id
and

	trans.proceed();	
	-- re-execute the transaction, which will use 
	-- its internally stored data to proceed

Transactions are initiated by a 'main input loop' which reads a data stream defining a series of requested transactions, uses the data packets thereby obtained to initialize corresponding transactions, and then calls each transaction's 'proceed' method to launch it. Transactions launched from this loop will return to it when they either complete or wait ('hibernate'), allowing new incoming transactions to be launched.

A transaction which wants to wait on a list of words should execute

[db,tid] := wait([wd1,...,wdn]);

and return immediately. The database code thereby invoked by the transaction detects the special form of this 'with' operation's second parameter, builds a special record containing the (blank-delimited) words wd1,...,wdn, and for this record issues a record id belonging to a special range which identifies it as a transaction-associated id (tid). Following this, the database code enrolls the record on the 'contains' lists L1,...,Ln associated with the words wd1,...,wdn. The transaction itself is associated with the id issued for it by executing a statement

transaction_of(tid) := transaction_object;

where transaction_of is a SETL map managed by the database code. When any of the record ids on such a 'contains' list Lj is subsequently removed (because some record whose summary contains the corresponding word wdj has been changed) all of the transaction-associated ids 'tid' on Lj are moved to a special 'contains' list associated with a special hash value 'active_hash'. At the same time the transactions corresponding to all these tids are 'tickled' by executing

transaction_of(tid).tickle(wdj,rec_id);

where as above wdj is the word with which the 'contains' list is associated, and rec_id identifies the changed record whose summary string contains wdj. The database code that does this is always invoked from within some transaction, to which it subsequently returns. The only recursive transaction method call possibly implied is to the 'tickle' routine, which should be designed to allow recursion (normally 'tickle' will simply store the pairs of parameters passed to it, for subsequent examination by the corresponding transaction when its 'proceed' method is called.)
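
A transaction class written to these conventions might therefore have the following minimal, hypothetical shape (the class name and all details of its business logic are invented; only the 'tickle' and 'proceed' conventions just described are illustrated):

 class order_tracker;   -- hypothetical transaction class
  procedure create(order_data);
  procedure tickle(wd,rec_id);   -- called by the database code described above
  procedure proceed();   -- called from the main input loop, or from
  	-- process_active_transactions (see below)
 end order_tracker;

 class body order_tracker;
  var my_data, tickles;   -- internal transaction state, and saved tickle pairs

  procedure create(order_data);
   my_data := order_data; tickles := []; return self;
  end create;

  procedure tickle(wd,rec_id);   -- simply store the pair for later examination
   tickles with:= [wd,rec_id];
  end tickle;

  procedure proceed();
   -- examine the saved tickle pairs; if none of them matters, discard them and
   -- return (remaining in hibernation); otherwise delete the waiting record
   -- (db(tid) := OM), carry the work one step further, and either terminate
   -- or execute another 'wait' before returning
   tickles := [];
  end proceed;

 end order_tracker;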

When a transaction whose 'proceed' method is invoked from the 'main input loop' returns (either because it terminates, or because it executes the 'wait' operation described above) the main input loop immediately calls a special database procedure 'process_active_transactions', which serves to ensure that all potentially reactivated transactions receive attention before any new transactions are read from the input stream. This has the pseudo-code representation

 procedure process_active_transactions();

  while (clih := contains_list_of_active_hash()) /= [] loop
   tid from clih; set_contains_list_of_active_hash(clih); 
     -- access and remove the first transaction on the contains list of 'active_hash'
   trans := transaction_of(tid); -- get the corresponding transaction
   trans.proceed(); 
   	-- re-execute the transaction, which will use its internally stored data
  end loop;  
  	 -- continue in this loop as long as the contains list 
  	 -- of 'active_hash' is non-empty
  
 end process_active_transactions;

A transaction whose 'proceed' call is invoked in this way should first examine the data transmitted to it by preceding 'tickle' calls, and use this to determine whether any of the words of interest in records of interest to it have been changed. If not, it should simply delete this tickle data, which is of no interest, and return. Otherwise, if the tickle data indicates that it can actually proceed, the transaction should immediately execute the statement

db(tid) := OM;

to delete its record from the database; this also deletes its tid from all the 'contains' lists on which it previously appeared. This command will also cause the statement

	transaction_of(tid) := OM; 
	-- remove 'tid' from the domain of 'transaction_of', 
	-- and the transaction from its range
to be executed.

To manage the operations we have just described, the database system requires just a few special capabilities. In particular, it must

  1. recognize 'with' operations of the special form

    db with [transaction,wd1,...,wdn];

    and treat them by issuing a special id and building a word record and summary both equal to the list of words appearing in the call.

  2. react to removals of record ids from 'contains' lists by adding all the tids on these lists to the 'contains' list of a special hash value 'active_hash'.

  3. provide the process_active_transactions() entry point described above.

  4. manage the 'transaction_of' mapping described above.

We program a top-level scheduler S which is aware of all the 'transaction classes' needed to support our application; each transaction class is represented by a procedure drawn from a library of such procedures. A transaction T is launched by issuing it a Tid and calling it. Hibernating transactions are represented by records

[Tid,transaction_state,T_class,w_1,w_2,...,w_k],
where [w_1,w_2,...,w_k] is the word_list parameter of the 'watch' call that brings T into hibernation. This record is initially indexed (in the 'hibernators' database) to each of the words w_1,w_2,...,w_k; this associates T with every such word, making it easy to locate the transactions sensitive to particular words and to wake them up. The iterator 'rid in db.hibernators.contains(wd)' locates all the hibernating transactions sensitive to the word wd. The system we envisage extends the object interface through which operations db(rid) := x pass, inserting extra code which changes the transaction record of each transaction located by this iteration to [Tid,transaction_state,T_class,"."]. This moves the transaction to the list of transactions db.hibernators.contains("."); these are the 'active' transactions, i.e. those which wish to be awakened from hibernation. This is easily done by executing
	for rid in db.hibernators.contains(".") loop
		[Tid,transaction_state,T_class,-] := db(rid);		
			-- get a transaction from the active list
		db(rid) := [transaction_state];		-- prepare it for execution
		fcn := library(T_class);	
			-- get the transaction's code from the code library
		fcn(Tid);			-- restart the transaction
	end loop;

In a system supporting crash-recovery, distributed execution, etc. the scheduler S can also include the codes, outlined above, which realize these capabilities.

	-- Whenever a transaction returns control to 
	-- the top-level system scheduler, typically by executing 

	--		return watch(Tid,word_list,transaction_state);

	-- the scheduler gets the first element rid := arb(db.hibernators.contains(".")); 
	-- of the active list, and removes this transaction from the active list 
	-- by executing 

	--	db.hibernators(rid) := db.hibernators(rid)(2); 
	-- and then passing control to the transaction 'rid' by executing 
	
	--	 do_transaction(rid); 

	-- This ensures that the transaction always sees 
	-- db(rid) = its_own_transaction_state when it is called.

The above ideas have been implemented in a prototype class 'active_database' with the following declaration:

class active_database;		-- 'active' database class, with hibernators

	procedure create(my_to_string,my_do_transaction,file_name);		
		-- create or reopen an active database
	procedure write_act();		-- write database to its file
	procedure contains(wd);
		-- returns the 'contains' iterator object for a given word
	procedure watch(Tid,word_list,transaction_state);
		-- 'watch' routine, which causes a hibernator to sleep, 
		-- watching a list of words

end active_database;

'Hibernating' transactions - An example application. To illustrate the preceding ideas in more detail, we return to the household goods wholesaler example with which we began. This can be organized around the following databases and function classes:

	clients		-- basic client data, history of client orders, 
			-- account balances by dates due
	suppliers	-- basic supplier data, history of orders from supplier, 
			-- payments due by dates
	active_invoices	-- unpaid invoices to clients, by due date
	active_resupply_orders	-- unfilled resupply_orders to suppliers, by date sent
	inventory		-- inventory of items stocked, 
			-- giving amounts on hand, suppliers, outstanding reorders
The following ordinary and hibernating transactions can be used.
	handle_an_order					
		-- track a single client order from receipt to fulfillment
	handle_client_orders			
			-- track all orders from a given client; 
			-- send letters of apology and explanation, as required 
			-- schedule shipments to client, as inventory is available or received 
	handle_client_billing			
		-- track total of bills outstanding to a given client
	handle_a_client_bill			
		-- track one client bill from date of issue to payment
	handle_payments_to_supplier		
		-- handle all payments to a given supplier
	incoming_payments_clerk			
		-- receive and post all incoming client payments
	stockroom_clerk					
		-- handle incoming shipments from suppliers
	shipping_clerk					
		-- handle shipments of orders to clients
	returns_clerk					
		-- handle goods returned from clients
Input to the application is a time-ordered list of customer-orders, payments, supplier orders, etc.

These items have the following form:

	[Day_time,"Order",client_n,Item,Amount,Item,Amount,...]
	[Day_time,"Pay",client_n,Amount]
	[Day_time,"Man_Ship",man_n,Item,Amount,Item,Amount,...]
	[Day_time,"Order",
	[Day_time,"Order",
	[Day_time,"Order",
The database records used have the following forms, and are indexed in the following ways:
	clients	database		
		-- [client_name,client_address, [prior_orders_list], [active_orders_list],
		-- [shipments_list], [active_invoices_list] ]
	Indexed by:
	suppliers database		
		-- [supplier_name,supplier_address, [prior_orders_list], [active_orders_list],
		-- [shipments_list], [payments_due_list] ]
	Indexed by:

	active_invoices	database	-- [client_name,invoice_number,amount,for_shipment]
	Indexed by:

	active_resupply_orders database		
		-- [supplier_name,order_number,date,[Items_and_amounts_list]]
	Indexed by:

	inventory database		
		-- [Item_name, amounts_on_hand, supplier, outstanding_reorder]
	Indexed by:
Many of the application's most crucial access functions have easy expressions as database iterators. In particular, we have:
	Supplier of item, amount on hand, outstanding reorders
			arb(inventory.contains("IS:n"))

	procedure handle_an_order(tid);					
		-- track a single client order from receipt to fulfillment

	end handle_an_order;

	procedure handle_client_orders(tid);			
		-- track all orders from a given client; 
		-- send letters of apology and explanation, as required 
		-- schedule shipments to client, as inventory is available or received 

	end handle_client_orders;

	procedure handle_client_billing(tid);			
		-- track total of bills outstanding to a given client

	end handle_client_billing;

	procedure handle_a_client_bill(tid);			
		-- track one client bill from date of issue to payment

	end handle_a_client_bill;

	procedure handle_payments_to_supplier(tid);		
		-- handle all payments to a given supplier

	end handle_payments_to_supplier;

	procedure incoming_payments_clerk(tid);			
		-- receive and post all incoming client payments

	end incoming_payments_clerk;

	procedure stockroom_clerk(tid);					
		-- handle incoming shipments from suppliers

	end stockroom_clerk;

	procedure shipping_clerk(tid);					
		-- handle shipments of orders to clients

	end shipping_clerk;

	procedure returns_clerk(tid);					
		-- handle goods returned from clients

	end returns_clerk;

'Hibernating' transactions - additional implementation details. 'Hibernating' transactions of this kind are organized by adding one simple function call,

watch(Tid,word_list,transaction_state)

to the SETL database facilities already described. Here, Tid is an ID for the transaction T executing the 'watch' statement, and word_list is a list of indexed words that can appear in other database records. Execution of this statement notifies the system that the transaction cannot proceed immediately, but wishes to be notified whenever an operation db(rid) := x is executed which changes a database record indexed (before or after the operation) by any word belonging to word_list. Upon executing this statement, T goes into hibernation, but the system schedules it for wake-up by moving it to an active list whenever an operation db(rid) := x to which it is sensitive is executed.

The creator routine for the new 'active_database' class incorporating these facilities is also modified to

procedure create(my_to_string,my_do_transaction,file_name); -- create or reopen an active database

This allows a specification of the transaction routine to be passed to the active database. In a more elaborate version, pieces of code representing arbitrarily many different kinds of transactions would be stored in the database itself.

No new bottom-level primitives are needed for this: the 'watch' operation can be programmed by putting a new object interface on the B-tree mechanisms which support the existing SETL database primitives. This can be done as follows. The underlying database class is modified to post the set of all index words affected by a primitive operation

db(rec_id) := r; (change of record) or db with:= r; (insertion of new record)

to a public variable. The active database class is then an easy 'wrapper' class written around the underlying 'simple' database class. Each active database class consists of two simple databases, the 'inner_database' and the set of 'hibernators' (temporarily suspended transactions.)
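
As a concrete, much-simplified illustration of this wrapper idea, the following Python sketch (the names ActiveDB and store are hypothetical, and to_string stands for the application-supplied summary routine) intercepts record stores, posts the set of affected index words, and marks every hibernator watching one of those words as active:

 # Illustrative Python sketch of the 'wrapper' idea; not the SETL implementation.
 class ActiveDB:
     def __init__(self, to_string):
         self.to_string = to_string    # application-supplied summary-string routine
         self.inner = {}               # the 'simple' database of ordinary records
         self.hibernators = {}         # rid -> [Tid, state, T_class, w_1,...,w_k]
         self.affected_words = set()   # the 'public variable' posted by each operation

     def _words(self, record):
         return set(self.to_string(record).split()) if record is not None else set()

     def store(self, rid, record):
         # db(rid) := record: post the index words affected before and after the
         # change, then mark every hibernator watching one of them as active (".")
         self.affected_words = self._words(self.inner.get(rid)) | self._words(record)
         self.inner[rid] = record
         for hid, hib in self.hibernators.items():
             tid, state, t_class, *watched = hib
             if self.affected_words & set(watched):
                 self.hibernators[hid] = [tid, state, t_class, "."]

A scheduler of the kind shown earlier can then scan for hibernator records ending in ".", restore each transaction's stored state, and call the corresponding procedure from the code library.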

12.4. Guaranteed recovery after crash for programs which manipulate databases. The following discussion of crash recovery rests on the assumptions concerning the stability of data written to disk explained earlier in this chapter. We repeat these assumptions for the reader's convenience:

  1. A processor's CPU can crash, e.g. by losing power or by freezing after a software fault.

  2. Such crashes can occur unpredictably, even in the middle of writing a disk record. However, when such a crash occurs, bad data can appear only in the disk area being written, but not outside it.

  3. Once written to disk, data never 'evaporates', i.e. can always be read.

To understand more fully what 'crash recovery' must additionally entail, we first examine the simple case of a program which manipulates one or more database objects represented by B-trees and stored on disk. (In what follows, we will refer to these as 'dob' objects. Note that each of these objects will be represented by some small number, typically 2, of B-trees.) We can think of such a program as being required to withstand the effect of sudden unanticipated stop statements, after which the program is somehow able to resume and continue. To consider this problem closely, we must be specific about the overall program context which is to be recovered. We begin with an easy case: that of a program having one of the following simple structures, and start by describing a somewhat limited recovery mechanism and its implementation. Following this we will describe a more flexible recovery mechanism, whose implementation is more demanding.

 program test;   -- a program that should be crash-recoverable
 
	initialize; 
	  -- this is assumed to access several dob values v1,..,vk to be processed, 
	  -- and to initialize one or more values 'x' and a value 'n'
	  -- v1,..,vk represent the databases that this program is using;
	  -- x represents its internal state.
	
	for j in [n,n-1..1] loop [x,v1,..,vk] := my_proc(x,v1,..,vk); end loop;
	   -- exit from this loop represents 'success'

 end test;

or

 program test;   -- a second program that should be crash-recoverable
 
	initialize;
	  -- this is assumed to access several dob values v1,..,vk to be processed, 
	  -- and to initialize one or more values 'x' 
	  -- v1,..,vk represent the databases that this program is using;
	  -- x represents its internal state.
	
	while C(x,v1,..,vk) loop [x,v1,..,vk] := my_proc(x,v1,..,vk); end loop;   	   	   
	   -- exit from this loop represents 'success'

 end test;

In these programs, the assumed top-level procedure 'my_proc' can be arbitrarily complex, and can call any number of other, potentially recursive or mutually recursive, procedures. We think of 'crash' as a forced program stop somewhere in the nest of calls starting at the top-level call to the procedure 'my_proc' seen above, possibly with loss of whatever data values are stored in variables internal to the subroutines called. To what state shall we aim to recover after a crash?

A provisional answer to this question, for the particularly simple case of the first program shown, is to aim for recovery to the situation at the start of one of the cycles of the main loop that it contains. Any changes made after this point should be lost cleanly, as if the next iteration never started. For this to be possible, the dobs v1,..,vk must change only at the end of such cycles, and we must always be able to recover their current values (from disk), and also to recover the current value of the iteration variable j, plus the values of the variables x. We must also be able to locate all the auxiliary B-tree objects which may have been created by the incomplete activity of my_proc, and erase these cleanly, so as not to lose disk space every time a crash occurs.

To accomplish this, we proceed as follows. First of all a special object called library is introduced to manage B-tree object persistence. In abstract terms, 'library' is simply a mapping from string names, which we use to manage B-tree objects in a manner providing persistency, to the root nodes of B-trees. This object is stored at a known position on disk, so that it can always be recovered cleanly after a crash. The convention to which we shall adhere is that only those B-trees referenced by the 'library' mechanism will survive crashes. That is, one logical effect of a crash (or more properly the procedure which we use to recover from crashes) will be to erase all B-trees, created before the crash, which have never been recorded in the 'library' object.

To make this possible, the root nodes of all the B-tree objects that we ever create are linked together in a doubly linked list called Btree_root_list which is stored on disk and whose first element is also held in a known position on disk so as to survive crashes. The first thing we do after a crash is scan Btree_root_list and library, and reduce the reference count of every B-tree root node to the number of library entries that reference it. Trees found to be unreferenced are erased in the normal way. This guarantees that only those B-trees referenced by the 'library' mechanism will survive, as required, all other space being cleanly retrieved. The scan we have just described is implemented by a library-associated method called recover_references.
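
The following short Python sketch may help make the scan concrete; 'library', 'root_list', 'refcount' and 'erase' are illustrative stand-ins for the corresponding on-disk structures and B-tree routines, not the actual SETL code.

 # Illustrative sketch only: reset every root node's reference count to the
 # number of library entries that reference it, and erase unreferenced trees.
 def recover_references(library, root_list, refcount, erase):
     # library: name -> root node;  root_list: all roots ever created (the
     # Btree_root_list);  refcount: root -> int;  erase(root): reclaim the tree's pages
     refs_from_library = {}
     for root in library.values():
         refs_from_library[root] = refs_from_library.get(root, 0) + 1
     for root in list(root_list):
         refcount[root] = refs_from_library.get(root, 0)
         if refcount[root] == 0:          # never recorded in the library
             erase(root)                  # erase it in the normal way
             root_list.remove(root)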

Continuing integrity of the 'library' and 'Btree_root_list' is essential if crash recovery is to be possible under all circumstances, and must therefore be guaranteed by use of the 'safe' writing techniques described later in this section. Changes to these objects, like changes to B-trees themselves, must always either complete cleanly or behave as if they had never been executed at all.

The other necessary methods of the library object are

 root_node := library("name");	-- return the root node of a tree stored in the library

 library([name1,...,namek]) := [root_node1,...,root_nodek];	
 	-- store a list of trees in the library, using the indicated list of names
 	-- This operation is performed safely and atomically, i.e. 
 	-- either all changes or none of them take effect.
 	-- Each tree that was previously stored under one of the indicated names 
 	-- loses one reference, but if root_nodej = OM, 
 	-- we have a pure erasure for namej.

 register([temp_name1,...,temp_namek]);		
 	-- register a set of library names which will be used to store temporary dob values
 	-- during an application program run. dobs stored under these names
 	-- should be erased once the program terminates or is abandoned.
 	
 domain(library);		-- returns set of all names used for B-trees in the library

 recover_references;
 	-- This procedure was described above. Note also that this routine
 	-- must detect and complete any dob operation that was in progress 
 	-- but incomplete at the moment of crash, thereby guaranteeing that 
 	-- the remainder of the code sees all such operations as atomic.
 	-- (The technique used to achieve this is described in the next subsection).
 

Using these operations, we can flesh out the first of the two skeleton 'crash-recoverable' programs seen above as follows.

 program test; -- a program that should be crash-recoverable
 
	recover_references;	-- erase any B-trees which may have been created pre-crash,
				-- but which were never recorded in the library.
				-- (This is a no-op if there are no such B-trees.)

	if (internal_state := library("application_state")) /= OM then 
		-- if this routine is called after a crash, an application state,
		-- representing the state of the application's internal data 
		-- as last successfully checkpointed, will be available
		-- in this library entry.

		[n,x] := internal_state;	
			-- unpack the cycle number and the rest of the internal state

		v1 := library(v1_temp_name);...vk := library(vk_temp_name);	
		-- use the temporary names established by the application 
		-- to get the major data structures being processed from the library
		
	else	-- not called after a crash, but initially
		
		n := initial_n();	-- initialize the number of iterations 
		x := initial_x(); -- and the internal state, in an application-specific way

		register([v1_temp_name,...,vk_temp_name,"application_state"]);
 			-- register the library names that will be used to store 
			-- the values of dob objects being processed at 'checkpoints' 
			--  to which recovery is possible 
	
		v1 := library(v1_name);...vk := library(vk_name);	
		-- using their known 'standard' names, 
		-- get the data objects to be processed from the library 
		
	end if;

	for j in [n,n-1..1] loop 			
		-- exit from this loop represents 'success'

		 -- call the application routine, which will process the temporary copies 
		 -- of the library data structures
		[x,v1,..,vk] := my_proc(x,v1,..,vk); 

		 -- move the modified data values back into the temporary library locations
		library([v1_temp_name,...,vk_temp_name,"application_state"]) 
					:= [v1,..,vk,[j - 1,x]];
		-- this is our 'checkpoint';
		-- each successful completion of this operation represents
		-- one finished iteration

	end loop;

	library([v1_name,...,vk_name,"application_state",v1_temp_name,...,vk_temp_name]) 
			:= [v1,..,vk,OM,...,OM];
		-- completion of this operation represents the end of an application run

	recover_references;	-- erase the B-trees just dropped from the library

 end test;

Note that since the B-trees we use to represent dobs have value semantics, none of the changes to the string values vj or library(vj_temp_name), made for any j by the program seen above or by any of the procedures it calls before its final step, can affect library(vj_name). If the program shown recovers after a crash, the obligatory recover_references call will begin by completing any dob operation that was in progress but incomplete at the moment of crash. As said above, this guarantees that the remainder of the code sees all such operations as atomic. After this low-level repair step, followed by erasure of all dob objects not recorded in the library, the program displayed will proceed exactly as if it had returned to the last preceding successful

	library([v1_temp_name,...,vk_temp_name,"application_state"]) 
					:= [v1,..,vk,[j - 1,x]];
operation (or to its initialization steps if even the first of these operations was not executed successfully). Moreover, after a crash the operating system (or human operator) has the option of abandoning the application run, since the operation
	register([v1_temp_name,...,vk_temp_name,"application_state"]);
makes the dob objects which must then be removed from the library known to the system. In this case the system can recover by executing
	library(["application_state",v1_temp_name,...,vk_temp_name) := [];
	recover_references;	-- erase the B-trees just dropped from the library
in which case the situation will revert to that preceding launch of the application, as if the application had never executed at all.

Now we must describe the technique used to guarantee that elementary operations on dob structures can always be either abandoned cleanly or completed cleanly even if a crash occurs within one of them, even at the most embarrassing moment possible.

Making edits of dob structures reliable and indivisible. Available disk space is divided into a large number of 'physical' disk records of a common size. These disk areas are accessed using their physical addresses A, and can be read and written independently of each other. Each tree node in a dob structure is assumed to occupy one physical record, and so the byte size of the physical record determines the maximum number 2n - 1 of children which any node is allowed to have. Nodes having fewer children than this are left partially empty. The physical records are buffered to RAM, and it is these RAM copies, rather than the corresponding disk copies, which are actually modified when dob edits take place. This is done by maintaining a map in_RAM_copy(record_addr) which maps (some of the) physical disk addresses to in_RAM copies of the disk records at these addresses. We also maintain a set 'dirtied' containing the addresses of those records whose in_RAM copies have been changed since they were last read in from disk. The number of physical records thereby buffered in RAM is not allowed to exceed an integer B. When an additional record A needs to be accessed and this limit B has been reached, one of the undirtied records currently in_RAM is selected for replacement, and replaced in the domain of the in_RAM_copy map by this new A. (Note that this operation requires no disk write, since the record being replaced is undirtied.) Dirtied records are never written out except in consequence of an explicit 'flush_all' operation, which undirties all the dirtied records, and which has the following design.
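
A minimal Python sketch of this buffering scheme follows; disk_read, disk_write and the limit B are parameters of the sketch rather than names taken from the SETL code.

 # Illustrative Python sketch of the in_RAM_copy / dirtied bookkeeping.
 class PageBuffer:
     def __init__(self, B, disk_read, disk_write):
         self.B = B                    # maximum number of pages buffered in RAM
         self.in_RAM_copy = {}         # disk address -> in-RAM page image
         self.dirtied = set()          # addresses changed since last read or flush
         self.disk_read, self.disk_write = disk_read, disk_write

     def page(self, addr):
         if addr not in self.in_RAM_copy:
             if len(self.in_RAM_copy) >= self.B:
                 # evict an undirtied page; no disk write is needed for it
                 # (B is assumed large enough that one always exists)
                 victim = next(a for a in self.in_RAM_copy if a not in self.dirtied)
                 del self.in_RAM_copy[victim]
             self.in_RAM_copy[addr] = self.disk_read(addr)
         return self.in_RAM_copy[addr]

     def modify(self, addr, new_image):
         self.page(addr)               # make sure the page is buffered
         self.in_RAM_copy[addr] = new_image
         self.dirtied.add(addr)

     def flush_all(self):
         # in the real system this step must be made fail-safe (see below);
         # here we show only the write-out and un-dirtying bookkeeping
         for addr in self.dirtied:
             self.disk_write(addr, self.in_RAM_copy[addr])
         self.dirtied.clear()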

(i) The memory management layer which handles all dob records works with a number of buffers large enough to hold all the records 'dirtied' by the data-write steps of any single allowed dob operation, e.g. B-tree modification or insertion. (Part of this buffer area can actually consist of any number of virtual RAM blocks maintained on disk. But we regard this as logical RAM, in that it is not protected against crashes: after crash, no attempt is made to recover the data that it contains.)

(ii) As noted above, 'dirtied' records are never written out, except in response to an explicit 'flush_all' operation. Such an operation is placed at the end of each dob operation such as 'dob(i) := x;' or 'dob(i..j) := x;'

(iii) The 'flush_all(...)' operation is properly 'fail safe', i.e. programmed in a way guaranteed to leave no logical effects unless it succeeds completely.

Then we can be sure that during recovery from a crash the system will always behave as if all the dob objects being manipulated are in the state attained by the last preceding successful 'flush_all(...)', i.e. as if the last atomic operation on a dob object had just completed. For (a) if the flush_all at the end of a dob operation is not reached, everything in RAM (i.e. our whole incomplete dob operation) will be thrown away, so recovery will behave as if we had just reached the state immediately after the last preceding flush_all. (b) If the flush_all is started but not completed, then by (iii), it will have no logical effects, so we will recover to exactly the same state as in (a). (c) If the flush_all is successfully completed, it will define the point to which we recover until some subsequent flush_all is successful.

So the approach we take reduces the problem of guaranteeing dob object integrity in the presence of crashes to that of guaranteeing that flush_all operations, once launched, will always complete successfully. The technique used to achieve this is detailed in the next few subsections. Note that the flush_all(...) operation must also guard the free_list of unused data records described in the preceding subsection, since if this is lost we will start to lose disk space.

Making writes of groups of records reliable and indivisible. To guard the flush_all operation properly we must guarantee that it will write out all the physical disk records for which it is responsible without error, or leave no logical traces on the disk (in which case it behaves as if it had never been launched). To ensure this, we adapt the familiar '2-phase commit' protocol. Specifically, we use two checksummed 'control' records, at least one of which is always valid. Together these function as a single auxiliary data value containing whatever indices and other variables we need to sequence safely through the steps of any multi-step process which we need to complete even in the presence of crashes which can occur unpredictably at the worst possible time. These two control records are used in the following careful way.

(i) Each control record holds an integer 'step number' in a known subfield. Both control records are checksummed, and (except at startup) at least one is always valid. If only one contains a valid check sum, we use the control values, including the step number, contained in that record. If both control record areas are valid, we use the control values contained in the area having the larger step number. This lets us advance through control states by writing successive control values alternately to the control areas. The records written contain steadily incrementing step numbers. If we crash during one of these writes, then after recovery we will simply return to the last preceding control step, repeat it, and then continue. As soon as the step number corresponding to a step has been written successfully, the step is logically complete, and we can go on to the next step.

(ii) Each of the individual steps of our recovery process must be idempotent, i.e. must have the same effects whether it is performed once or more than once. This implies that if a step is completed 'de facto', but the immediately subsequent control record write (which completes it 'de jure') fails, the step can simply be performed again after recovery.

(iii) When the end of the multi-step flush process is reached we want to restore both step number counters to the value zero. Suppose that this end occurs at step N, so that it is invoked when the larger of the two step number counters (if both are valid) is N. Then we

(a) write the control index 0 into that control area which is either invalid or is valid and contains a smaller step number than N. Until this step succeeds it will simply repeat.

(b) write the control index 0 into the control area CaN which contains the step number N. If this step fails without effect, it will simply repeat. As soon as it either succeeds or invalidates CaN, the logical step number will have successfully been set to 0.

(iv - An aside:) The procedure we have just described indicates how a fully recoverable, multi-step process can cycle from its last step back to its first. More generally, fully recoverable jumps from any step N to any other step M are possible. Suppose that step N, besides being idempotent, must end with a jump to step M. As above, we

(a') write the step number M into that control area which is either invalid or is valid and contains the smaller step number, which must be different from M. Until this step succeeds it will simply repeat.

(b') if M is greater than N (jump forward case) we are done as soon as operation (a') succeeds, since the next step begun will be that indexed by the larger of N and M. Otherwise (jump backward case) we must also write the control index M into the control area CaN which contains a step number different from M, to ensure that the next step executed is not N but M. If this step fails without effect, it will simply repeat. As soon as it either succeeds or invalidates CaN, the logical step number will have successfully been set to M.

All this will guarantee that individual dob operations are always properly transactional, i.e. that each such operation either succeeds completely or leaves no logical traces on the disk.
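
The control-record discipline of points (i)-(iv) can be summarized by the following Python sketch; here read_area and write_area stand for checksum-verified reads and writes of the two fixed control pages (read_area returning None when the checksum fails, write_area possibly crashing at any point), and the dictionary layout is illustrative only.

 # Illustrative sketch of the dual control-record stepping protocol.
 def current_step(read_area):
     c0, c1 = read_area(0), read_area(1)       # None means an invalid checksum
     if c0 is None and c1 is None:
         raise RuntimeError("no valid control record")
     if c1 is None or (c0 is not None and c0["step"] >= c1["step"]):
         return c0["step"], 1                   # area 1 is the writable one
     return c1["step"], 0                       # area 0 is the writable one

 def advance_to(read_area, write_area, new_step):
     # write the new step number into the writable area; if a crash interrupts
     # this, the whole operation simply repeats after recovery
     old_step, writable = current_step(read_area)
     write_area(writable, {"step": new_step})
     if new_step < old_step:                    # 'jump backward' (e.g. back to step 0):
         write_area(1 - writable, {"step": new_step})   # also overwrite the other area

With this discipline the logical step number changes exactly when the relevant control write completes, so a crash before that point simply leaves the process at its previous step.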

We call the data stored in the valid half of the record pair described above (or, if both are valid, in that half containing the larger step index) the recovery control area. It should be clear from the preceding discussion that data stored in the recovery control area can always be recovered successfully. But only a limited amount of data can be stored there.

Note also that it is always possible to conform to the condition that the separate steps of an extended transaction should be idempotent. One only has to have the transaction write all its changes to one or more private areas until the transaction runs to completion, at which point it simply remains to finalize the transaction by copying these private areas to their 'official' locations in a fail-safe way. Idempotency is then guaranteed, since copy operations are always idempotent.

In our flush_all procedure, a step number of 0 is used to indicate that no flush_all is in progress. Any crash that takes place in this situation will simply recover to the state as of the preceding flush_all; the application must then be capable of repeating its work from that point on. The flush_all operation proper starts as soon as the step number 1 is successfully written to a control area. Before this, all the records to be flushed are first written as a chained list to a supplementary disk area, each preceded by a header defining its true physical location. Successive copy operations (each an idempotent step, since copies are idempotent) move these records to their proper physical location, and verify that each has been written correctly. Once this has been done, we go on to write the refcount data (from the refcount log file) and the free list, as further detailed in the following paragraphs.

The code seen below makes these crucial flush_all completion steps part of a routine named 'read_control'. Calling this routine ensures that any flush_all in progress at the moment of an immediately preceding crash will be completed successfully. A call to read_control is therefore the first step of the higher-level 'recover_references' routine mentioned earlier.

Code details for crash recovery - a first version. The simplified crash recovery simulation seen just below will clarify the crash-recovery approach sketched above and supply additional details. It simulates the action of a nominal, highly simplified 'application' code which simply increments the integer contents of a pair of disk pages 40 times, writing each incremented value back to disk each time. However, it is assumed that each of these simulated disk writes (and also each simulated disk read) can fail or crash in one of several ways:

  1. Any such operation can simply 'fail' or 'crash' randomly. Such crashes lose the entire contents of RAM, and so require full restart from the current contents of disk to recover. Failed write operations can leave bad data on the disk page that they are attempting to write.

  2. Write operations can write bad data intermittently without visibly crashing in any other way.

  3. Read operations can misread data intermittently without visibly crashing in any other way.

  4. All in-RAM operations are assumed to perform correctly.

To exercise the recovery mechanism, the simulated rate of disk misreads/miswrites and of crashes is deliberately set very high. In the simulation seen below, every 20th disk operation (on the average) is assumed either to misperform or to crash. This means that a typical simulation run must recover from about 220 crashes (not counting misreads and/or miswrites) in order to complete.

Recovery mechanisms - code details. The recovery mechanisms are organized as a package 'disk_recovery' which makes the following collection of routines available to simulated applications:

 procedure access_disk(npages);    
 		-- 'disk' initialization routine; sets number of simulated pages 

 procedure close_disk();      
		-- release the simulated 'disk'

 procedure write(recno,newval);  
		-- simulated write; uses buffer. records written are dirtied

 procedure read_page(recno);  
		-- read a record from buffer, or from the disk, 
		-- accepting it only if its check-hash verifies
		-- but if the check-hash fails to verify, try repeatedly 
		-- to be sure that the check-hash failure is 'hard',
		-- rather than caused by an intermittent misread

 procedure read_weakly(recno);  
		-- read a record from buffer, or from the disk, 
		-- accepting it only if its check-hash verifies; 
		-- but good data might be seen as bad

 procedure flush_all();    
		-- commit to the new situation by writing out all buffered records

 procedure read_control();   
		-- find the contents of a valid control area. If this indicates 
		-- that a flush_all was in progress at the moment of crash,
		-- complete this flush-all operation

 procedure write_control(phaseno,extra_data); 
		-- writes control record to disk in a safe manner

 procedure read_actually(recno);  
		-- read the ground truth of the situation 
		-- (used for final report only)

This recovery package maintains an internal ('in_RAM') buffer of disk pages, with indication of which pages have been changed since they were last written out. The 'write' routine writes to this buffer (but not immediately to disk, since disk writes must be done in a carefully controlled manner), and the 'read_page' and 'read_weakly' routines read preferentially from this buffer, only fetching records from disk when they cannot be found in the buffer. The 'read_page' and 'read_weakly' routines differ only in the way that they treat records read from disk which appear to contain bad data. This works as follows. Simulated 'disk' pages are 128 bytes long. Their last 8 bytes are reserved for a check-hash calculated from their first 120 bytes. Good records will always contain valid hashes, so if the hash fails to verify immediately after the record is read back from disk, the record is probably corrupt and so should not be used. However, the read operation itself may have misperformed (this will happen intermittently, at a simulated rate of about 5%), and so the record on disk may actually be good, even though it appears on a first reading to be bad. For this reason, 'read_page' only concludes that a record is bad after 10 unsuccessful attempts have been made to read it. This reduces the probability of concluding that a good record is bad from .05 to roughly .05**10, i.e. about 10**-13, even for the very unreliable simulated 'disks' which we assume. 'read_weakly' bypasses this precaution.

Accordingly 'write' expects to be passed a string no longer than 120 bytes, which it pads to 120 bytes, and both 'read_page' and 'read_weakly' return either a string of this length, or OM if the record being read is (or appears to be) bad.
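
The page layout and the 'insistent' read just described can be sketched in Python as follows; hashlib's blake2b, truncated to 8 bytes, merely stands in for the recovery package's hash_8, and raw_read for the unreliable physical page read.

 # Illustrative Python sketch of the check-hashed 128-byte page scheme.
 import hashlib

 PAGE, DATA = 128, 120

 def seal(data):
     # pad the payload to 120 bytes and append an 8-byte check-hash
     data = data.ljust(DATA, b"\x00")[:DATA]
     return data + hashlib.blake2b(data, digest_size=8).digest()

 def unseal(page):
     # return the 120 payload bytes, or None if the check-hash does not verify
     data, tag = page[:DATA], page[DATA:PAGE]
     return data if hashlib.blake2b(data, digest_size=8).digest() == tag else None

 def read_page(raw_read, recno, attempts=10):
     # re-read up to 'attempts' times, so that an intermittent misread is not
     # mistaken for a genuinely corrupt ('hard' failure) page
     for _ in range(attempts):
         data = unseal(raw_read(recno))
         if data is not None:
             return data
     return None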

'Crashes' occurring within an operation are simulated by setting a global flag 'failed' and returning from the operation immediately. Any higher level routine receiving such a 'failure return' must also return immediately. Since SETL does not currently include any real 'exception' capability, these returns are handled in the simulation by placing a statement

if failed then return; end if;

immediately after each operation that might fail. The only exception to this is the invoking statement of the top level program, which should imitate the restart action of an operating system (or human operator) by re-entering the application code whenever it has received a failure return. To facilitate this, the body of the application code is written as a procedure. This procedure must be designed to read whatever information it needs for restart from the disk (in part by using the 'read_control' and 'write_control' operations provided by the recovery package; see below), and then to restart itself correctly. The main program of the example given below shows how to do this in a simple case.

The inner disk-read/write operations of the simulation, which are, so to speak, sources of random failure, contain calls to routines 'mayfail' and 'maywreck', the first of which generates simulated crashes at random with a specified probability, while the second generates simulated misreads/miswrites, but not crashes.

As should be understood from the discussion in the preceding subsection, 'flush_all()' is the heart of the crash recovery system. This is designed either to succeed completely or to fail in a manner leaving no logical trace on the simulated disk. 'flush_all()' begins by writing all 'dirtied' disk records, along with the target address for which this information is subsequently destined, to a standard spare disk area. If a failure occurs during this step, only the spare area, invisible outside of 'flush_all()', can be corrupted, so the application can restart from the disk information left in place by the last preceding successful flush_all, exactly as if crashes always occurred (and could only occur) immediately after some flush_all operation.

Once all 'dirtied' disk records have been moved to this spare disk area, flush_all enters its second phase, during which it moves the information just written from this spare disk area to the target locations for which the information is destined. Crashes occurring during this step are handled simply by re-entering flush_all and repeating the entire copy step. Once this whole disk-to-disk copy step succeeds, flush_all enters its third and final phase. During this phase it attempts to exit the flush_all processing and return to the application-level part of crash-recovery processing. As explained above, this is done by setting the flush_all phase indicator maintained safely on disk to 0. As soon as this is successfully accomplished, flush_all will no longer be called automatically, and application-level recovery can commence. Until flush_all's phase 3 succeeds, it will simply repeat.

'read_control()' and 'write_control(phaseno,extra_data)' are highly modified variants of flush_all, specialized for the writing of sequencing information. As outlined in the preceding subsection, they use a reserved pair of pages (the first two pages on the simulated 'disk'.) Each of these contains one control-record copy. Each of the control records is laid out as follows:

The first 3 control-record bytes are reserved for use by flush_all, so neither 'read_control' nor 'write_control' makes these visible outside the recovery package itself. (Of the remaining control-record bytes, our sample application uses only one, simply to store an application-level phase integer, which is 1 if the application is still initializing the records which it will then increment, and 2 once the incrementation step has begun.) The very first control-record byte is simply the flush_all phase as described above; 0 if no flush_all is in progress, otherwise 1 or 2. If this byte is not zero, any recovery step will re-enter flush-all immediately to finish incomplete work within it; otherwise application-level recovery can go forward. Bytes 2 and 3 of the control record store information that must be passed between flush_all phases 1 and 2, specifically the integer number of pages, held in the spare disk area written during flush_all phase 1, that must be moved to their final destination during phase 2. In our highly simplified 'application' this integer is always 2, but in more realistic applications it might vary up to several hundred.

On first accessing this pair of control records flush_all expects to find at least one of them in valid condition: this is the key invariant that must always be maintained for crash recovery to be possible. If both records are valid, the one containing the larger flush_all phase integer is taken as the valid one, and the other member of the control-record pair is considered to be 'writable'. (In the code that follows, this information is stored in the variable 'control_was_valid'). To advance its phase, flush_all writes a record containing the new, larger phase into the writable member of the control-record pair. If a crash occurs during this attempt, the immediately preceding flush_all phase will simply repeat, as explained above.

To exit flush_all processing successfully, in effect jumping backwards from its third step to its exit/restart 'step 0', the flush_all procedure must zero the lead byte in both control records. flush_all step 3 is responsible for accomplishing this, and has only this responsibility. The code of the 'write_control' procedure (which is designed for application-level use, and so is not used by flush_all itself) reflects similar design considerations. 'write_control' always attempts to write both control records, and so will accomplish jumps backward (in an application's control phase number) successfully, provided that such jumps are made the responsibility of an application phase which has only this responsibility. However, if 'write_control' succeeds in writing the first of the pair of control records, but then crashes while attempting to write the second control record, jumps forward (but not jumps backward) will take effect immediately anyhow. (Not all of these niceties play a role in the very simple 'application' which follows.)

'access_disk' and 'close_disk' merely begin and end the simulated disk usage. 'read_actually' is simply a disk read operation which, for debugging and reporting purposes, steps out of the simulation and reads records on the disk without error.

'write_safely' is a disk-write routine, internal to the recovery package, which adds a check hash to the data passed to it, writes the result to disk, and then immediately reads it back and verifies it, failing if the data read back differs from that written. Hence this routine will always either write good data or fail. However, it can fail 'unnecessarily' even if good data has been written, if this data is misread during the verification step.

Correctness conditions for applications designed to recover from crashes. Assuming that all of the routines of the recovery package work in the manner just explained, an application using the recovery package will be certain to recover correctly provided that

  1. The application would be correct if it never failed;

  2. When the application fails immediately after a successful flush_all operation and is re-entered, and assuming that it does not then fail, it will always execute a series of pure disk-read and variable-read/write operations which return it to exactly the program point following the flush_all, and which also reconstruct the entire memory state preceding this flush-all, with the exception of variables dead at the program point immediately following the flush_all.

These conditions can be verified either by careful code inspection and ordinary debugging, or by the more rigorous methods of formal program proof. To debug a program designed for crash recovery, and so containing calls to flush_all, one can proceed as follows.

  1. First debug the program assuming that it never fails. In this phase, 'write' and the read routines ('read_page', 'read_weakly') should simply be replaced by standard disk write and read operations, and 'flush_all' should be treated as a no-op.

  2. Once this has been done, debug each of the flush_alls individually in the following way. Insert a statement

    flushall_check([v1,...,vn]);

    immediately preceding the flush_all operation to be checked. Here, v1,...,vn should be the set of all program variables not dead immediately after the flush_all, and flushall_check should capture and store these variables, and should set flags forcing the first and then every odd numbered traversal of the immediately following flush_all operation to fail. On even numbered traversals, the values in the stored copies of v1,...,vn should be compared to their present values, to ensure that they are the same. Also, during restarts after failure on an odd numbered traversal, every attempt to execute a disk-write operation before the flushall_check operation should be treated as an error triggering a visible abort.

    One can then execute the program for as many cycles as needed to gain confidence that restarts after a failure immediately following the flush_all being debugged restore all the variables v1,...,vn to their pre-failure values. As noted above, this is the condition for correctness.
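
A minimal Python sketch of such a debugging harness follows; SimulatedCrash and the snapshot comparison are conventions of the sketch rather than of the recovery package, and for simplicity the crash is forced at the checkpoint itself instead of inside the flush_all which follows it.

 # Illustrative sketch of the flushall_check debugging discipline.
 class SimulatedCrash(Exception):
     pass

 class FlushallCheck:
     def __init__(self):
         self.traversal = 0
         self.snapshot = None

     def __call__(self, live_vars):
         # live_vars: dict of the variables v1,...,vn live after the flush_all
         self.traversal += 1
         if self.traversal % 2 == 1:
             self.snapshot = dict(live_vars)    # capture, then force a restart
             raise SimulatedCrash()
         assert live_vars == self.snapshot, \
             "restart did not restore the pre-crash values of v1,...,vn"

The top-level driver catches SimulatedCrash and re-enters the application, imitating the restart action of the operating system, for as many cycles as are needed to gain confidence in the restart path.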

The example. The sample code anticipated by the preceding paragraphs is:

package disk_recovery;   
	-- package for disk management allowing recovery from crashes
	var failures_count := 0;  -- used to count simulated crashes
	var failed := true; -- global flag set upon simulated crash

	procedure access_disk(npages);    
		-- 'disk' initialization routine; sets number of simulated pages 

	procedure close_disk();  -- release the simulated 'disk'

	procedure read_control();
		-- find the contents of a valid control area. If this indicates 
		-- that a flush_all was in progress at the moment of crash,
		-- complete this flush-all operation

	procedure write_control(phaseno,extra_data); 
		-- writes control record to disk in a safe manner

	procedure write(recno,newval);  
		-- simulated write; uses buffer. records written are dirtied

	procedure read_weakly(recno);  
		-- read a record from buffer, or from the disk, 
		-- accepting it only if its check-hash verifies

	procedure read_page(recno);  
		-- read a record from buffer, or from the disk, 
		-- accepting it only if its check-hash verifies
		-- but if the check-hash fails to verify, 
		-- try repeatedly to be sure that the check-hash failure is 'hard',
		-- rather than caused by an intermittent misread

	procedure flush_all();    
		-- commit to the new situation by writing out all buffered records

	procedure read_actually(recno);  
		-- read the ground truth of the situation (used for debugging/reporting only)

end disk_recovery;

package body disk_recovery;   
	-- package for disk management allowing recovery from crashes
	use random_pak; -- for simulation of randomly occurring crashes

	var rand_handle;    -- used to generate crashes randomly
	var control_was_valid;   
		-- used to indicate which of two control areas was valid, 
		-- so that other can be written
	var file_handle;    -- the 'disk' used for recovery
	var RAM_buffer := {};   
		-- maps disk records into their RAM-buffered images
	var dirtied := {};   -- set of disk records that have been written
 
    -- the 'disk' layout is : records 0,1 are reserved for control record; 
    -- then 2n records are used for spare space during flush-all,
    -- then the crash-protected records on the disk. 
    -- (In the present example, n = 2.) 
 
 const failure_prob := 0.05;          
 	-- probability that any operation will fail
 const small_prob := 0.01;          
 	-- reduced probability that operation will fail

  const smaller_prime := 18446744073709551557;  
  -- the largest prime no larger than 
  -- 18446744073709551616 = 256**8.
 const primroot :=  10376293541458047902;  
 	-- a primitive root modulo this prime, used for hashing
 
 procedure access_disk(npages);      
 	-- 'disk' initialization routine; sets number of simulated pages 

  close(open("simulated_disk","TEXT-OUT"));   
  	-- clear the simulated 'disk'
  file_handle := open("simulated_disk","RANDOM");  
  	 -- access the simulated 'disk' in "RANDOM"
  puts(file_handle,1,npages * 128 * "\x00");   
  	-- make the disk large enough 
  rand_handle := start_random(1.0,OM);  -- prepare to generate random crashes
  control_was_valid := 1;        -- since a value is required
  RAM_buffer := {}; dirtied := {};    -- restart all in-RAM quantities 

 end access_disk;

 procedure close_disk(); close(file_handle); end close_disk;  
 	-- release the simulated 'disk'

 procedure read_control();   
 	-- find the contents of a valid control area
  
    -- first read the control records in a mode which makes 
    -- it very unlikely that a good record on disk will be read as bad
     
  c0 := read_page(0); if failed then return; end if;
  	   -- read the first control record from the disk
  c1 := read_page(1); if failed then return; end if;
  	   -- read the second control record from the disk
  
  if c0 = OM and c1 = OM then   
  	-- Cannot find valid control value. Report failure, with possible message
 
   failed := true; 

   if random(rand_handle) < 0.001 then   
   	-- print message if problem is long continued
    print("Cannot find valid control value. " + 
    "Control value not properly initialized " + 
    "or severe system problem detected."); 
   end if; 
   
   return;  

  end if;

  if c0 = OM then     
  	-- fourth byte of control record contains application-level phase

   control_was_valid := 1; 
   if c1(1) /= "\x00" then 
   	flush_all(); 
   end if; 
   
   if failed then return; end if; 

   	-- we must recover from the middle of some flush_all operation 
   return c1(4..);   
   	-- note the control which was valid; 
   	-- the other should be written first. 
   	-- The first 3 bytes are reserved for flush_all

  elseif c1 = OM then 

   control_was_valid := 0; 

   if c0(1) /= "\x00" then flush_all(); end if; 
   if failed then return; end if; 

   	-- we must recover from the middle of some flush_all operation 
   return c0(4..);   
   	-- note the control which was valid; 
   	-- the other should be written first

  end if; 
  
  if c0(1) /= "\x00" or c1(1) /= "\x00" then flush_all(); end if; 
	if failed then return; end if;
		-- we must recover from the middle of some flush_all operation 
  if (ph1 := c0(4)) > (ph2 := c1(4)) then    
  	-- the first control was valid
   control_was_valid := 0; return c0(4..);   
   	-- note the control which was valid; 
   	-- the other should be written first
  else            -- the second control was valid
   control_was_valid := 1; return c1(4..);   
   	-- note the control which was valid; 
   	-- the other should be written first
  end if;

 end read_control;

 procedure write_control(phaseno,extra_data); 
 	-- writes control record to disk in a safe manner
 
  write_safely(1 - control_was_valid,"\x00\x00\x00" + char(phaseno) + extra_data);
		 -- write the phase number and extra_data to the control record area 
		 -- that was not most recently used
		 -- the zero lead bytes indicate that we are not in a 'flush_all'. 
		 -- Note that this procedure is never called from within a 'flush_all'.
		 -- Note that the preceding operation might either write bad data and crash, 
		 -- or write what it should, but then crash anyhow. 
		 -- Note that the first three bytes are reserved for flush_all.

  if failed then return; end if;
 
    -- having now written at least one valid control record, 
    -- we go on to write the other, to handle possible 'jump backwards' 
    -- cases correctly 

  write_safely(control_was_valid,"\x00\x00\x00" + char(phaseno) + extra_data); 
	if failed then return; end if;

 end write_control;

 procedure write(recno,newval);  
 	-- simulated write; uses buffer only. records written are dirtied

  RAM_buffer(recno) := newval; dirtied with:= recno;

 end write;

 procedure write_safely(recno,stg);  
 	-- write a string to the disk, adding an 8-byte check-hash, and verifying

  failed := false;    -- this operation may succeed
  
  stg := (stg + 120 * "\x00")(1..120);  -- pad to length 120
  stg := stg + hash_8(stg);     -- append the check-hash
  new_stg := maywreck(stg,failure_prob); 
  if mayfail(failure_prob) then return; end if;
        -- perhaps something wrong gets written, or nothing
  puts(file_handle,ix := recno * 128 + 1,new_stg); 
  if mayfail(failure_prob) then return; end if;  
  	-- write to disk
  gets(file_handle,ix,128,read_back); 
  if mayfail(failure_prob) then return; end if;     
  	-- read back

  if stg /= read_back then failed := true; end if;
  	-- fail if read is not verified; 
  	-- note that hash must be correct if read is verified.
  	-- but we may fail even if good data was really written

 end write_safely;

 procedure read_weakly(recno);  
 	-- read a record from the disk, 
 	-- accepting it only if its check-hash verifies
 	-- the buffer is used for records available there. 
 	-- records read from disk are undirtied

  failed := false;    -- this operation may succeed

  if (val:= RAM_buffer(recno)) /= OM then return val; end if;
  
  gets(file_handle,recno * 128 + 1,128,stg);   -- read from disk

  if mayfail(failure_prob) then return; end if;  

  stg := maywreck(stg,failure_prob); -- simulate possible read error

  if hash_8(part := stg(1..120)) = stg(121..) then   
  	-- return successfully, putting record in buffer
   RAM_buffer(recno) := part; dirtied less:= recno; return part; 
  end if;

  return OM;    -- return indication that read has failed 
  
 end read_weakly;

 procedure read_page(recno);  
 	-- read a record from the disk, 
 	-- accepting it only if its check-hash verifies
 	-- but if the check-hash fails to verify, 
 	-- try repeatedly, to be sure that the check-hash failure is 'hard', 
 	-- rather than caused by an intermittent misread

  failed := false;    -- this operation may succeed

  for j in [1..10] loop  
  	-- insist on 10 bad reads before 
  	-- concluding that the page contains bad data
   gets(file_handle,recno * 128 + 1,128,stg); 
   if mayfail(small_prob) then return; end if;  
   	-- Read from disk. Here we lower the crash probability 
   	-- (but not the misread probability) of the read operation, 
   	-- so as not to have gross delay in recognizing hard errors

   stg := maywreck(stg,failure_prob);     
   	-- simulate possible read error
   if hash_8(part := stg(1..120)) = stg(121..) then return part; end if;  
		-- return successfully
  end loop;   
  	-- otherwise we try again, to overcome intermittent read error

  return OM; -- exit from the above loop indicates hard failure
  
 end read_page;

 procedure read_actually(recno);  -- read the ground truth of the situation

  gets(file_handle,recno * 128 + 1,128,stg);   -- read from disk

  if hash_8(part := stg(1..120)) = stg(121..) then 
  	return part;   -- return successfully
  end if;

  return OM;

 end read_actually;

 procedure flush_all();    
 	-- commit to the new situation by writing out all buffered records
 	-- undirtied records need not be written

 	-- the control records are 0 and 1
 	-- find the flush_all step which is valid
  c0 := read_page(0); if failed then return; end if; 
  	-- read the two control records; 
  	-- we need to be sure of their state in any case
  c1 := read_page(1); if failed then return; end if;

  if c0 = OM and c1 = OM then 
  	print("flush_all cannot find valid control record;" + 
  		" severe systems failure"); 
		failed := true; 
		return; 
  end if;

  if c0 = OM then   
  	 -- if one record is invalid, use the other 
   control_was_valid := 1; c0 := c1; 
  elseif c1 = OM then 
   control_was_valid := 0; c1 := c0;
  end if;  

  s0 := abs(c0(1)); s1 := abs(c1(1));   
  	-- get the two step numbers

  if s0 = 0 and s1 = 0  then  
  	-- if both steps are 0, set the flush step to 1; 
  	-- but we do not write the control at this point,
   	-- since if we fail in flush step 1, 
   	-- recovery must be to the previous application-level step, 
   	-- which will recompute 'RAM_buffer' and 
   	-- its associated 'dirtied' set. Until step 1 succeeds, 
   	-- these exist only in RAM, i.e. are unrecoverable.
   
   flush_step := 1;    
   	-- having now set the flush step to 1, we will proceed accordingly below

   	 -- in the remaining cases we merely determine what flush step 
   	 -- is to be performed. 

  elseif s0 > s1 then     
	  -- otherwise we use the control record containing the largest valid 
		-- flush_all step
  
   flush_step := s0; control_was_valid := 0; good_c := c0;
   
  elseif s1 > s0 then     
	  -- otherwise we use the control record containing the largest valid 
		-- flush_all step
   
   flush_step := s1; control_was_valid := 1; good_c := c1; 
   
  	-- the other becomes the 'writeable control record'
  else    -- s1 = s0

   flush_step := s0; good_c := c0;   
   -- which is the same as s1; we leave control_was_valid, 
   -- which was set above, unchanged 
  
  end if;   
  	 -- at this point we have successfully read the control record, 
  	 -- determined which of its two copies we can safely try to write, 
  	 -- and determined the number of the flush-all step to be executed
  
  while flush_step /= 0 loop   -- until success, keep trying 

   case flush_step
 
    when 1 =>   
    	-- first non-control record-write step; 
    	-- move records to spare locations
    
    	-- write the first (application level) record to a spare location
     write_count := 1;

     for val = RAM_buffer(r) | r in dirtied loop 
     	-- we write a value.address pair 
     	-- for each dirtied record in the RAM buffer
      write_safely(write_count +:= 1,val); 
      if failed then return; end if; 
      write_safely(write_count +:= 1,str(r)); 
      if failed then return; end if; 
     end loop;

     dirtied := {};   
     	-- at this point the records are out on disk; 
     	-- phase 2 will put them into their intended places
     
     write_count := write_count / 2;   -- the number of records written
 
       -- advance the step number in the writeable control record to 2
     if control_was_valid = 1 then  
     	-- write record 0, which may fail
  
      c1(1..3) := "\x02" + char(write_count / 256) + char(write_count mod 256); 
      write_safely(0,good_c := c1); 
 
      if failed then return; end if;  

      control_was_valid := 0;	 -- we set the flush_all step to 2

     else   
     	-- record 0 was valid, and record 1 invalid, 
     	-- so write record 1, which may fail
  
      c0(1..3) := "\x02" + char(write_count / 256) + char(write_count mod 256); 
      write_safely(1,good_c := c0); 
 
      if failed then return; end if;  
      
      control_was_valid := 1;	 -- we set the flush_all step to 2
     end if;
     
     flush_step := 2;
     
    when 2 =>   
    	-- second non-control record-write step; 
    	-- move records to their standard locations
  
     records_to_move := abs(good_c(2)) * 256 + abs(good_c(3));   
     	-- find the number of records to move
     read_position := 1;  
     	-- initialize the position in the temporary disk area to be read

     for j in [1..records_to_move] loop

      r_mirror_val := read_page(read_position +:= 1); 
      if failed then return; end if;

      r_address := read_page(read_position +:= 1); 
      if failed then return; end if;
 
                  -- only hard failures should count as failures
      write_safely(unstr(r_address),r_mirror_val); 
      if failed then return; end if;
       -- copy each spare location to its standard position

     end loop;

       -- advance the step number in the writeable control record to 3, 
       -- which is the flush_all exit step
     if control_was_valid = 1 then  -- write record 0, which may fail
  
      c1(1..3) := "\x03\x00\x00"; write_safely(0,c1); 

      if failed then return; end if; 
 
      	-- we set the flush_all step to 3
      control_was_valid := 0;
  
     else   
     	-- record 0 held the valid control, so record 1 is the writeable one; 
     	-- write record 1, which may fail
  
      c0(1..3) := "\x03\x00\x00"; write_safely(1,c0); 
 
      if failed then return; end if;

      control_was_valid := 1;
  
     end if;
      
     flush_step := 3;

    when 3 =>   
    	-- now jump back to state 0 to exit the flush-all operation
   
    	-- set both flush_all steps fields to 0 
     if control_was_valid = 1 then  
     	-- write record 0 and then record 1, either of which may fail
  
      c1(1..3) := "\x00\x00\x00"; write_safely(0,c1); 

      if failed then return; end if;  
 
      	-- we set the flush_all step to 0
      write_safely(1,c1); 

      if failed then return; end if;  
      	-- both records must be written, or we will just keep trying
      control_was_valid := 0;
  
     else   
     	-- record 0 held the valid control, so record 1 is the writeable one; 
     	-- write record 1 first, then record 0, either of which may fail
  
      c0(1..3) := "\x00\x00\x00"; write_safely(1,c0); 

      if failed then return; end if;

      write_safely(0,c0); if failed then return; end if; 
      	-- both records must be written, or we will just keep trying
      control_was_valid := 1;
  
     end if;
     
     flush_step := 0;   -- this takes us out of the flush-all loop

   end case;

  end loop;
  
 end flush_all;

 procedure maywreck(stg,prob);  
 	-- change a string randomly, with a given probability

  if random(rand_handle) < prob then 
  	stg(floor(random(rand_handle) * 100.0) + 1) := "?"; end if; 
  		-- change a random character

  return stg;

 end maywreck;

 procedure mayfail(prob);  
 	-- fail randomly, with a given probability

  if random(rand_handle) < prob then 
  	failures_count +:= 1; return (failed := true); 
  end if;

  return (failed := false); 

 end mayfail;

  procedure hash_8(stg);   -- hash a string into an 8-byte string
  
  the_hash := 0;

  for c in stg loop 
  	the_hash := (the_hash * primroot + abs(c)) mod smaller_prime; 
  end loop;
  
  return bytes_from_int((the_hash * the_hash) mod smaller_prime);
  
 end hash_8;
 
 procedure bytes_from_int(int);  
 	-- converts integer to string of 8 bytes

   stg := char(int mod 256); 

   for j in [1..7] loop 
   	int /:= 256; stg := char(int mod 256) + stg; 
   end loop; 
   
   return stg;

 end bytes_from_int;

end disk_recovery;

program test;  -- test program for demonstration of crash-recovery technique ***
 use disk_recovery;	-- use the disk recovery package seen above
 
 const r1 := 6, r2 := 7;   
 	-- 'disk pages' used by program; page usage is as described above.
 const max_count := 40;
 
 for j in [1..num := 1] loop   
 	-- run specified number of simulations

  access_disk(8);      
  	-- disk initialization routine; sets number of simulated pages 
  print("\nStarting test ",j); 
 
  	-- set up an initial pair of control records, 
  	-- which show application phase = 1, no flush_all in progress
  failed := true; 
  
  while failed loop 
  	write_control(1,"");  -- pre-initialization; this must be done reliably
  end loop;

  failures_count := 0; failed := true; 

  while failed loop do_application; end loop;  
  	-- to recover, just keep re-entering the application until it succeeds 
  report_success;  
  	-- print the values from the two records being manipulated, 
  	-- and the number of failures/recoveries
  close_disk();

 end loop;
 
 print(num," recovery Tests Done **** ");
 
 procedure do_application;  -- record processing 'micro-application'
  
  while (phase := read_control()) = OM loop  
  	-- read application phase 
  	-- (this will automatically complete any outstanding flush_all)
   if failed then return; end if;
  end loop;
  
  loop     -- this loop exited by success or failure

  	-- first byte of control record (as returned) 
  	-- contains application-level phase
   if phase(1) = "\x01" then  
     -- must initialize; note that we ignore all but the first byte; 
     -- 'phase' could contain other parameters, but none are used here 
  
    record_with_zeroes := 120 * "0";  
    	---- an integer 0, encoded as a 120-byte string
    write(r1,record_with_zeroes); 
    		-- zero the first application record
 
    if failed then return; end if;   
 
    write(r2,record_with_zeroes); 
    	-- zero the second application record

    if failed then return; end if;   

    flush_all(); 	-- commit to the new situation
    if failed then return; end if;
    
    print("Done with initialization." + 
    " Values in records are: ",unstr(read_actually(r1)),"
    	 ",unstr(read_actually(r2)));

    phase := "\x02"; write_control(2,""); 
    if failed then return; end if;   
    	 -- set the control phase to 2, to begin next phase

   else     -- phase = 2; 
    
    loop  
    	-- keep trying to complete the record processing, 
    	-- in spite of continual crashes
    
     sval1 := read_page(r1); 
    	-- read the contents of the first application record, as a string
     if failed then return; end if;
 
     sval2 := read_page(r2); if failed then return; end if;
    	-- read the contents of the second application record, as a string
     val1 := unstr(sval1) + 1; 
     if val1 > max_count then return; end if; 
     	-- success!

     val2 := unstr(sval2) + 1;
     
     stg := 120 * "0" + str(val1); write(r1,sg1 := stg(#stg - 119..));
        -- write out the first result, zero-padded on left
     if failed then return; end if;
     stg := 120 * "0" + str(val2); write(r2,sg2 := stg(#stg - 119..));
        -- write out the second result, zero-padded on left
     if failed then return; end if;
 
     flush_all(); if failed then return; end if;
     	     -- commit to the new situation
     
    end loop;
     
   end if;
 
  end loop;     
  	-- end do_application work loop; 
  	-- this loop will be exited by a success or failure return
 
 end do_application;

 procedure report_success;  
 	-- print the values from the two records being manipulated, 
 	-- and the number of failures/recoveries

  print("have recovered from ",failures_count," crashes.");
  print("Final disk values: ",unstr((vr1 := read_actually(r1))?"666"),"
  	",unstr((vr2 := read_actually(r2))?"666"));

 end report_success;
 
end test;

Formal verification of crash recovery code correctness; a 'brute force' approach. Careful examination and extensive testing can build confidence in the correctness of the recovery code seen above, and in fact it has passed a battery of 1,250 simulation runs, involving successful recovery from 280,000 simulated random crashes. However, since large stakes (e.g. the integrity of the financial records of major banks, corporations, or government tax authorities) may rest on codes of this type, something closer to a formal proof of correctness may be desirable. In the simple case before us, this can be achieved as follows. We can view the recovery program as something that steps between disk states, any data held in RAM being merely an evanescent means to this continuing end. In examples as simple as that shown above, the collection of disk states that can possibly appear is finite, and, indeed, not very large. This is because only 8 records are involved, and

  1. Each record contains either bad data, which we can represent by an OM, or one of a small finite number of possible 'good' values.

  2. If not OM, each of the first two records contains a logical triple of the form [flush_phase,record_count,application_phase], where flush_phase varies from 0 to 3, record_count is 0 or 2 in our example, and application_phase is 1 or 2. Thus each of these records is in one of at most 17 possible states.

  3. If not OM, each of the records 2,4,6, and 7 contains an integer n ranging between 0 and 40, and these integers can differ by at most 2. Thus altogether these records are in one of a few hundred states.

  4. If not OM, each of the records 3 and 5 contains either 6 or 7.

Thus the total number of reachable states of our sample application lies in the thousands rather than in the millions. Accordingly, the semantics of the failure-generating operations 'mayfail' and 'maywreck', together with the method used to restart execution, can be modified in the following way to allow a full survey of the evolution and fate of all possible program states.

  1. We keep a collection of restart_states_reached. Initially restart_states_reached contains only the start state for our simulated application.

  2. The simulation is run repeatedly, once for each disk state that has passed into restart_states_reached. During these runs, operations do not fail, but every disk operation that might fail inserts one or more possibly new elements into the collection restart_states_reached. The states so inserted must be the disk state as of the moment that the operation is traversed, plus (for write operations) the state that the operation would have created if it wrote successfully, and also the state that the operation would have created if it wrote bad data.

  3. Runs continue as long as the set restart_states_reached is expanding, but the restart action is simulated only once for each of the elements of restart_states_reached.

  4. We must then verify that none of our simulations loop indefinitely (this can be done by keeping track of the states encountered at each loop head, and verifying that none repeat), and that all terminate in the expected success state. A sketch of the driver loop for such an exhaustive survey appears below.
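
The driver loop might look roughly as follows. It is only an illustration of the scheme just described: 'initial_disk_state', 'set_disk_state', and 'run_application_once' are assumed helper routines, and the modified 'mayfail'/'maywreck' operations are assumed to deposit the alternative disk states they could produce into the global set 'pending_states' instead of actually failing.

 restart_states_reached := {initial_disk_state()};  -- all restart states reached so far
 states_explored := {};    -- restart states whose fate has already been simulated

 while (next_state := arb(restart_states_reached - states_explored)) /= OM loop

  states_explored with:= next_state;
  set_disk_state(next_state);   -- put the simulated disk into this restart state

  pending_states := {};    -- collected by the modified mayfail/maywreck operations
  run_application_once();  -- run to completion; no operation actually fails

  restart_states_reached +:= pending_states;  -- newly reached states will be explored later

 end loop;    -- on exit, every reachable restart state has been surveyed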

A more flexible crash-recovery system design. The preceding discussion of crash recovery issues gives us the insight needed to design a more flexible recovery approach. This will make use of a single new primitive

		checkpoint;
which can be called from anywhere within a SETL program, even from within recursive nests of procedures. This primitive makes some of the crash-related primitives discussed above unnecessary, and so can replace them. Each 'checkpoint' successfully executed writes a copy of an application's entire state to disk, allowing recovery to begin at the program step immediately following the last successful 'checkpoint' operation.

This is accomplished as follows. When 'checkpoint' is called, a full copy of the current SETL environment, including the current value of all variables and any recursion stack that may have been generated by prior recursive calls, is collected and written safely to disk. This data file, which we will call 'dump', can therefore be recovered after a crash and examined to determine the root nodes of all the B-trees associated with db objects which the current program is manipulating. As in the recovery design described previously, the root nodes of all B-trees generated are held in a doubly linked Btree_root_list which is stored on disk and whose first element is also held in a known position on disk so as to survive crashes. During recovery from a crash one of our first steps is to scan Btree_root_list and the library, and reduce the reference count of every B-tree root node to the number of library entries that reference it. As previously, trees found to be unreferenced are erased in the normal way. This guarantees that B-trees created between checkpoints but no longer referenced will be erased and that the space they occupy will be retrieved.
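
The recovery-time reference count adjustment just described might be sketched as follows. This is only a sketch: 'library_references_to' and 'erase_tree' are assumed helpers, and the walk over the doubly linked Btree_root_list is shown as a simple iteration.

 procedure fix_refcounts_after_crash();
 	-- reset the reference count of every B-tree root in Btree_root_list to the
 	-- number of library entries which still reference it, erasing unreferenced trees

  for root in Btree_root_list loop

   refcount(root) := library_references_to(root);  -- recompute the true count

   if refcount(root) = 0 then 
    erase_tree(root);  -- erase in the normal way, returning its pages to the free_list
   end if;

  end loop;

 end fix_refcounts_after_crash;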

The SETL interpreter must also examine all the variables being abandoned whenever a procedure return takes place, decrementing the reference count of each such variable by 1 and erasing it logically if its count reaches 0.

The flush_all operation described above can be used to ensure that all of these operations appear to be atomic.

Using the 'checkpoint' primitive, we can rewrite the crash recovery example given earlier as

 program test; -- a program that should be crash-recoverable
 		
	n := initial_n();	-- initialize the number of iterations 
	x := initial_x(); -- and the internal state, in an application-specific way
	
	v1 := library(v1_name);...vk := library(vk_name);	
		-- using their known 'standard' names, 
		-- get the data objects to be processed from the library 

	for j in [n,n-1..1] loop 			
		-- exit from this loop represents 'success'

		 -- call the application routine, which will process the temporary copies 
		 -- of the library data structures
		[x,v1,..,vk] := my_proc(x,v1,..,vk); 

		checkpoint;		-- note the desired point of recovery
			-- each successful completion of this operation represents
			-- one finished iteration

	end loop;
	library([v1_name,...,vk_name]);
		-- completion of this operation represents the end of an application run
		-- the stop operation erases all the B-trees being abandoned

 end test;

Besides its much enhanced simplicity of use, the great advantage of this new program version is that calls to 'checkpoint' can be inserted anywhere, even within the user procedure 'my_proc' which appears above. In our previous approach, recovery to program states within procedure calls is considerably less convenient. Note that the 'register' operation is no longer used, and so can be eliminated.

12.6. 'Versioned' B-trees. A technique derived from that introduced by Richard Cole and developed by Tarjan, Sarnak et al (see Driscoll, J., Sarnak, N., Sleator, D., and Tarjan, R., "Making Data Structures Persistent," Proc. 18th ACM STOC, pp. 109-121, ACM, May 1986; Becker, B., Gschwind, S., Ohler, T., Seeger, B., and Widmayer, P., "An Asymptotically Optimal Multiversion B-Tree," VLDB Journal 5(4): 264-275 (1996); Ferragina, P. and Grossi, R., "The String B-Tree: A New Data Structure for String Search in External Memory and its Applications," Journal of the ACM 46(2) (1999)) can be used to improve the performance of the logical record list described above for databases consisting largely of records which are short relative to the size of the physical pages which hold them. This technique can also improve the performance of word index B-trees.

For this, we proceed by reserving enough space in each physical page (or in most such pages) to hold one more item than the page logically stores, plus a small amount of header information, and by associating a steadily incrementing version number with each logical copy of the logical record list created. These 'items' are records in the case of leaves, but in the case of intermediate tree nodes they consist of a single cumulant value and its associated child pointer. Each item written to a page represents a possible logical replacement for one of the items already present on the page; the change written is used if we want to read the page as it logically exists after, but not before, the indicated version came into existence.

Suppose, for example, that we have a page in a database B-tree which evolves through successive numbered versions, and that in all versions before version 999 the page contained the items 'a,b,c,d,e,..'. Suppose also that in version 999 item d was changed to d'. Versions are assumed to remain of interest for a while, but then to disappear logically; the set of all version numbers still of interest (i.e. potentially recoverable) is assumed to be globally available. Then, instead of copying the node in the manner described above, we can indicate the change made to this database version by writing

a,b,c,d,e,.. but after 999: d'

instead. The appended 'but after' notation should be read as indicating that for all versions of the file that are still recoverable but carry version numbers less than 999, the fourth item in the node has the value d, but that in versions from 999 on it must be changed to d'. In general a page will carry many items of data, along with a single additional ('single' can be generalized to 'several') 'but after' clause, and so will have the form

data,.. but after version_number: change_to_data.

Whenever a page is to be read, we ignore its 'but after' clause (if it has one) if we want the page version which applies before the 'version_number' which this 'but after' clause contains, but we apply the indicated data change if we want a later version. The rule applied when a page is written is a bit more complex. Suppose that we are writing version n of the page (since version numbers are assumed to increase steadily, this version will never have been written before.) If the page presently contains no 'but after' clause, we simply write a 'but after n' clause containing the data for the desired change. Next suppose that the page does contain a clause 'but after m', but that no version earlier than version m is recoverable. Then we apply the 'but after m' clause to the 'data..' in the node (thus freeing the space required to store this clause), and then write our new change as a 'but after n' clause, as before. Finally, if the page does contain a clause 'but after m', and version m is recoverable, then we copy the page, creating a new version in which both the old 'but after m' clause and the new change are absorbed into the 'data...', so that this new page has space for one additional 'but after' clause. When such a page copy operation is executed, it must be reflected in the parent page of the page being copied, which will either get a new 'but after n' clause of its own, or be copied itself, thereby possibly reflecting a whole series of copy operations all the way up to the root of the B-tree. However, in most cases it will only be necessary to write a single 'but after n' clause to the leaf containing some small record undergoing a change.
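
The page-read rule just stated can be sketched as follows. The representation assumed here (a pair [data_items,but_after], with but_after either OM or a triple [version_number,item_index,new_item]) and the procedure name are illustrative only.

 procedure read_versioned_page(page,wanted_version);
 	-- return the item sequence of 'page' as it appears in 'wanted_version'

  [data_items,but_after] := page;

  if but_after = OM then return data_items; end if;  -- no pending change

  [version_number,item_index,new_item] := but_after;

  if wanted_version < version_number then 
   return data_items;   -- the 'but after' change is not yet visible
  end if;

  data_items(item_index) := new_item;   -- apply the 'but after' change
  return data_items;

 end read_versioned_page;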

Reference count management in this environment differs little from reference count management for the B-tree structures described previously.

New versions of B-tree objects can be created by operations

new_obj := obj.version().

These are constant-time operations which simply increment a version counter and generate a new object header.
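
Assuming, purely for illustration, that a versioned B-tree object is represented by a small header pair [root_node,version_number] and that a global version counter current_version is available, such an operation might be sketched as

 procedure version(obj);  -- create a new logical version of a versioned B-tree

  root_node := obj(1);    -- the new version initially shares the old root

  current_version +:= 1;  -- advance the global version counter

  return [root_node,current_version];  -- constant-time: only a new header is built

 end version;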

(iv) Atomicity and recovery of distributed transactions.

In discussing distributed transactions, we assume that messages can be passed reliably between processors networked together, each of which has its own disk. Message segment numbering and handshaking techniques familiar from TCP/IP practice are used to ensure this. We also assume to begin with that even if one of the processors in a distributed cooperating group crashes, its disk will survive, and the processor will reboot, possibly after physical replacement, allowing transaction code running on it to be resurrected swiftly. We assume that messages sent are reliably written to disk before their arrival is acknowledged, so that a processor participating in a transaction will never forget a message whose arrival it has acknowledged. Given this assumption, our first task is to describe the implementation of a coordinated distributed write, i.e. one which takes place on each of a distributed family of cooperating processors, or fails on all of them, leaving no logical trace on any.

Suppose to begin with that all the records to be written are sent out by a 'lead' processor belonging to the family. This can send the appropriate records to each of the other family members, and request confirmation in return. If one of the processors fails to confirm, all of the others can be instructed to delete the messages that they have received. *** TO BE CONTINUED ***
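
A minimal sketch of the lead processor's side of this coordinated write, as far as the protocol has been described, follows; 'send_records', 'await_confirmation', and 'send_delete_request' are assumed messaging helpers built on the reliable message layer assumed above.

 procedure coordinated_write(family,records_for);
 	-- either every member of 'family' keeps the records sent to it, or none does

  for member in family loop
   send_records(member,records_for(member));  -- ship each member its records
  end loop;

  all_confirmed := true;
  for member in family loop
   if not await_confirmation(member) then all_confirmed := false; end if;
  end loop;

  if all_confirmed then return true; end if;  -- every member has acknowledged

  for member in family loop
   send_delete_request(member);  -- some member failed to confirm; undo everywhere
  end loop;

  return false;

 end coordinated_write;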

*** Text Past here is unedited ***

12.5. The full SETL database design - Old Introduction. The SETL database design reflects the foregoing considerations in the following way. Whenever the number of undirtied records held in the in_RAM_copy map falls below B/4, we perform a 'flush' operation (as explained in the next section, this is done in a manner guaranteeing its crash recoverability; 'flush' operations may also be performed at other times).

More about implementation issues.

The preceding codes implicitly assume that the node objects of the Btup class have value semantics, i.e., are copied automatically whenever necessary, and equally important, are released to an automatic space-recovery mechanism when no longer required. (Without such release, space would be consumed progressively as a system using this class ran, which ultimately would lead to a 'memory exhausted' crash.) For SETL objects like those appearing in the preceding code, value semantics are guaranteed and a space recovery mechanism is provided, both by the SETL interpreter. For this guarantee to extend to the much larger data objects that will appear in serious databases, similar mechanisms must be extended to large objects held on disk. Thus a crucial step in setting up the SETL database capability is to create a system of nodes which can be used to form very large trees like those seen in the code appearing in the previous section, which are hosted on disk and which are allocated and deallocated automatically. It is most appropriate to program these in a style which reaches down to the 'physical' level at which transfers of blocks of data between RAM and disk are seen explicitly, rather than relying on the capabilities of some underlying file system, which is another, and apparently simpler, option. The reason for this, spelled out in much greater detail in the next section of this chapter, is that the SETL database system is designed to be fully recoverable in the event of system crash. As said above, databases can be recovered even if power to the computer maintaining the database is suddenly interrupted, or after any other equally alarming and unpredictable physical or software event. Reliance on a pre-existing file system would imply that the database we created could be no more reliable than the file system used to realize it.

Of the many techniques available for building automatic space allocation and recovery mechanisms, we adopt the so-called reference count method. This requires maintenance, for each physical database node, of a reference count integer, always equal to the number of other nodes (and of primary variables) which reference this node. These reference counts, maintained in a manner described below, are used in conjunction with a free_list, which is simply a list of the addresses of all physical disk records currently unused, and hence, available for allocation. These lists are maintained in two parts, one complete copy of each always being held in a disk area reserved for it, and an additional part, including all more recently modified portions of either of these data structures, being held in an auxiliary in_RAM data structure and a supplementary disk area.
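
As an illustration of this reference count discipline, release of a node might be sketched as follows; 'children_of', which returns the addresses of the nodes referenced by a given disk node, is an assumed helper.

 procedure release_node(node_address);
 	-- decrement a node's reference count; when it reaches zero, release the
 	-- node's children recursively and return the node itself to the free_list

  if (refcount(node_address) -:= 1) > 0 then return; end if;

  for child in children_of(node_address) loop
   release_node(child);   -- this node's references to its children disappear with it
  end loop;

  free_list with:= node_address;  -- the disk record is again available for allocation

 end release_node;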

Since they are so central to the SETL database implementation, we will sketch the schemes used to manage the free_list and the reference count mapping in some detail. The free_list is represented by a disk area FL reserved for it and used as a circular buffer, plus a supplementary linear array maintained in RAM which represents that part of the free list currently in active use. A few pointers into the on-disk free_list area serve to manage it. These divide the free_list area into three parts, which in linear order are

  (i) the portion of the free_list previously moved into RAM, some nodes of which may be in use for storage of newly created in_RAM tree nodes;
  (ii) the portion of the on-disk free_list never moved into RAM;
  (iii) a free_list portion written out from RAM, e.g., when destruction of some large B-tree copy causes the in_RAM portion of the free_list to overflow its preassigned limit.

When more free space is required, entries are read from free_list portion (iii) (which need not be protected against crash) as long as any such are available. However, if portion (iii) is exhausted, a block of free_list entries is read from free_list portion (ii), and the boundary between portions (ii) and (iii) is shifted accordingly. Flush operations begin by writing the entire in_RAM free_list to the end of portion (iii) of the on-disk free_list. Then all dirtied in_RAM nodes are copied to their disk positions, and the indices determining the starts and ends of free_list portions are given their new values. All this is done in the fail-safe manner described in the next section.
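
A sketch of the allocation step just described follows. 'read_free_list_block', the block-read helper, and the variables naming the portion boundaries are assumptions made for this sketch only.

 procedure allocate_record();
 	-- return the address of a free disk record, refilling the in-RAM part of
 	-- the free_list from the on-disk portions when it runs dry

  if in_RAM_free_list = [] then

   if portion_iii_length > 0 then
    	-- prefer portion (iii), which was written out from RAM earlier
    in_RAM_free_list := read_free_list_block(portion_iii_start);
    portion_iii_start +:= #in_RAM_free_list; 
    portion_iii_length -:= #in_RAM_free_list;
   else
    	-- portion (iii) is exhausted; take a block from portion (ii) and
    	-- shift the boundary between portions (ii) and (iii) accordingly
    in_RAM_free_list := read_free_list_block(portion_ii_start);
    portion_ii_start +:= #in_RAM_free_list;
   end if;

  end if;

  free_rec := in_RAM_free_list(#in_RAM_free_list);  -- take the last entry
  in_RAM_free_list := in_RAM_free_list(1..#in_RAM_free_list - 1);

  return free_rec;

 end allocate_record;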

Implementation of crash-recovery.

We recover from crashes and simulated crashes in the following way.

flush_all(...) writes all dirtied records to disk and zeroes the 'dirtied' set. Between calls to flush_all(...), refcount and free_list exist as in-RAM copies of stable versions of this same data held on disk, with supplementary indications of change (refcount_delta, free_list_copy, and free_list_first_written; see below). flush_all(...) writes refcount_delta to a log file (in fail-safe fashion). When this log file grows large enough we write a complete version of the refcount tuple to disk (in fail-safe fashion) and throw away the log file.

Ordinary code always calls flush_all with the from_step parameter equal to zero. This begins the flush_all operation by writing all the elements of in-RAM data to be saved (changed records, current state of free_list, and refcount_log or current state of the refcount vector) to disk. Up to completion of this initial step the flush_all operation is not recoverable, i.e. will have no logical effect if it fails before finishing. In such case, we will recover to the state as of the last successful flush_all, and so will also need to be able to recover whatever information is necessary to pick up execution from that point. Beyond this point the flush_all operation is recoverable, i.e. if it fails before completing all its substeps, the step which has failed, but no preceding substeps, will be repeated until the whole flush_all eventually succeeds. To achieve this, the 'recover' procedure must first call flush_all with a from_step parameter > 0 that defines the next flush_all substep to execute, and then, when the flush_all completes, should cause the sequence of operations following the flush_all to execute.

In abstract terms, the free_list is simply a vector of pointers to all of the physical disk blocks available for allocation. The free_list is represented on disk by a complete copy of itself, grouped into a succession of 4-byte fields within physically successive data blocks, and by an integer held in the recovery control area which defines the length of this free-list copy. During in-RAM execution we maintain a pair of indices defining the length of the free list and the index of the first free-list entry modified since the free_list was last recovered from disk. During flush, and prior to setting the step index to 1, we write the section of the free list delimited by these two indices (or rather a slightly larger section which starts on a disk block boundary) to the area reserved for the free list image; but the first record of this section is initially written to a temporary area. We write the record number of the one displaced free_list record (the 'link record') to the recovery control area, along with an integer defining the length of the free_list. A second, fail-safe phase of the flush process copies this free_list 'link record' to its proper position.

Again in abstract terms, the refcount map is a map from the address of allocated B-tree records r to the number of other records which reference r. The refcount map is represented by the last full copy RMC of it written to disk, plus a log of refcount changes written since the last time this full copy was updated. To checkpoint the state of the refcount map when we flush, we ordinarily just execute a fail safe write of the refcount_delta map to the refcount log. This write is done just prior to setting the step index to 1. The refcount_delta map is a set of pairs [record_no,refcount]. These are grouped into 10-byte fields within successive data blocks of the log, which are written in checked fashion to a sufficiently large reserved area always available at the end of the refcount_delta log. An integer, refcount_log_length, written reliably to the recovery control area defines the number of entries in the refcount_delta log file. Each flush_all operation combines the refcount_delta map with the refcount tuple to construct an updated in-RAM image of the refcount tuple, and nulls refcount_delta. If the log file has grown to roughly the size of the refcount tuple, we write the refcount tuple to a temporary disk area besides writing the refcount_delta log data, after which we zero the recovery control data item refcount_log_length, thereby clearing the refcount_delta log file. All this is done just prior to setting the step index to 1, and will simply repeat if a crash occurs before the step index has become 1. If a new copy of RMC has just been written, then a flush step which will move this copy into the disk area normally occupied by RMC is performed.

No refcount log section can be larger than 5 times the full refcount map, i.e. larger than max_file_size. We reserve an area 6 times this for the refcount log, but as soon as the filled region reaches 5 * max_file_size we reconstruct the refcount map and empty the log.

The refcount_delta map is a set of pairs [record,refcount], grouped into 10-byte fields and written to the refcount log area. Except for a first record, each new refcount log section is written to its proper position at the end of the refcount log. The first record F needs to include parts of the last prior section written, and to over-write this last previous section. To avoid loss of information in case of a crash during this write, F is first written to a preliminary position, and then copied from this position to its correct refcount log position. Note: the refcount log record identifiers have the numbering appropriate to the refcount vector; i.e. the first disk record used for ordinary tree nodes has the number 1. Thus the numbering of these records begins just past the zone of records reserved for the storage of the two free list copies and the refcount log; i.e. they are offset from the physical recnos by the formula recno - refcount_log_end_int - 1. The refcount_delta map does not carry this offset, and so is addressed by the expression refcount_delta(rec).

The correct refcount image is reconstructed during recovery after each crash, at which time it is written to its reserved disk area. When this is done, the refcount_delta map is nulled. A new refcount log section is written to the end of the refcount log at each flush, at which time the refcount_delta map is absorbed into the refcount vector and then nulled. This does not change the disk copy of the full refcount vector. Finally, we combine the refcount log with the refcount vector and write a new version of the full refcount vector whenever the log grows to include a number of items roughly equal to the length of the full refcount vector.
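
The per-flush refcount bookkeeping described in the last few paragraphs might be sketched as follows; 'write_log_section' and 'write_full_refcount_copy' stand for the fail-safe disk writes described above, and are assumed names.

 procedure flush_refcounts();
 	-- absorb refcount_delta into the in-RAM refcount vector, and either append
 	-- it to the refcount log or, if the log has grown to roughly the size of
 	-- the full vector, write out a fresh copy of the full vector instead

  for count = refcount_delta(rec) loop 
   refcount(rec) := count;   -- apply each accumulated change
  end loop;

  if refcount_log_length + #refcount_delta < #refcount then

   write_log_section(refcount_delta);   -- the normal, cheap case
   refcount_log_length +:= #refcount_delta;

  else

   write_full_refcount_copy(refcount);  -- write the full vector to its temporary area
   refcount_log_length := 0;            -- ...and clear the refcount_delta log

  end if;

  refcount_delta := {};   -- the delta map has been absorbed

 end flush_refcounts;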

Aside from refcount and free_list information, the data items changed by a database operation 'db(x) := changed_record' or 'db with new_record' are (a) the physical disk records holding the database record R itself, plus all the tree nodes addressing this group of physical records (and so amounting to about as many physical disk records as are needed to hold R, plus a few more.) (b) the physical disk records holding changed word index tree and word-occurrence tree entries. This should amount to no more than 16 times the number of indexed words in the record R (or, rather, the reduced version RV of it used for indexing), and hence to about twice as many physical disk records as there are bytes in RV. Assuming that RV is no more than 10KB and that physical disk records are 1 KB, about 10 MB of space are ample for the in-RAM data representing the largest plausible database operation 'db(x) := changed_record' or 'db with new_record'.

(ii) Multistep transactions

Although some transactions may only change a single database record, coordinated changes to several records, or to records in several databases, are more typical. To handle these, some state data supplementary to the current condition of all the databases must be maintained. The following approach can be used.

(a) Each transaction T is structured as a procedure Tpr(Tid) with a single parameter: its transaction identifier Tid, which is a system-issued 4-byte string, which references the transaction's 'state record' in a supplementary database Tsdb, the 'transactions state database'. This database is not indexed, i.e. has a null indexing function. Tsdb treats the database operations 'db(x) := changed_record' and 'db with new_record' differently from all other databases, in that 'flush' is not called as part of their execution, but only at the next following execution of a like operation affecting another database. This makes changes to Tsdb atomic with subsequent changes to other databases: following a crash, the effect of a change to Tsdb will be seen if there has been some later change to another database, otherwise not.

(b) Before calling a transaction for the first time, the system (i.e. 'transaction monitor') writes a 0 to the Tsdb record identified by the transaction's assigned Tid, by executing Tsdb(tid) := 0 and then calling flush() explicitly.

(c) As soon as the transaction Tpr(Tid) is called, either initially or subsequent to recovery from a crash, it examines the state record my_state := Tsdb(Tid) corresponding to it. If this is 0, Tpr starts from its beginning. Otherwise Tpr uses my_state to initialize some of its internal variables appropriately, and jumps to whatever code location the value of my_state indicates.

(d) It is Tpr's responsibility to maintain its own state in such a way as to restart it correctly after any crash that may occur.

(e) After Tpr returns the system executes Tsdb(tid) := OM to clean up.
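
Seen from the transaction monitor's side, steps (b), (c), and (e) of this protocol might be sketched as follows (only a sketch; 'new_transaction_id' is an assumed monitor primitive, and Tpr stands for whatever transaction procedure is being run).

 procedure run_transaction();
 	-- monitor-side wrapper showing the first-time path of steps (b), (c), and (e);
 	-- after a crash the monitor simply re-enters Tpr with the same tid,
 	-- without re-initializing its state record

  tid := new_transaction_id();   -- system-issued 4-byte transaction identifier

  Tsdb(tid) := 0; flush();   -- step (b): write the initial state record reliably

  Tpr(tid);    -- step (c): Tpr reads Tsdb(tid) and either starts from its
  	       -- beginning or resumes from the recorded state

  Tsdb(tid) := OM;   -- step (e): clean up the transaction's state record

 end run_transaction;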

As an example, consider the execution of a simple 3-step debit/credit transaction which (i) reads a record R from a debit/credit queue DCq, (ii) debits an account A defined by R by a designated dollar amount D, and (iii) credits the same dollar amount to another account B, also defined by R. This transaction can be programmed as follows.

procedure debit_credit(tid);		-- debit/credit transaction example

	my_state := Tsdb(Tid);		-- read current transaction state.
	
	if my_state = 0 then		-- initial call

		step := "starting";		
		[to_debit,to_credit,amount] := DCq(1);		
			-- get the first debit/credit request posted

	else			-- resurrection after recovery
	
		[step,to_debit,to_credit,amount] := my_state;				
			-- recover the pre-crash state

	end if;												
																
			-- do all the detailed transaction work here
	debited_client_record := account_database(to_debit);		
		-- get the client record to be debited
	debited_client_record.account_total -:= amount;				
		-- post the debit
	credited_client_record := account_database(to_credit);		
		-- get the client record to be credited
	credited_client_record.account_total +:= amount;			
		-- post the credit

	if step = "starting" then

		Tsdb(Tid) := [step := "write_the_credit",to_debit,to_credit,amount];		
			-- save state in case of crash
		account_database(to_debit) := debited_client_record;						
			-- write to disk, and flush
		
	end if;

	if step = "write_the_credit" then
		
		Tsdb(Tid) := [step := "drop_completed_request",to_debit,to_credit,amount];	
			-- save state in case of crash
		account_database(to_credit) := credited_client_record;						
			-- write to disk, and flush

	end if;

	if step = "drop_completed_request" then
	
		Tsdb(Tid) := [step := "bye_bye"];	
			-- save state in case of crash
		DCq(1) := OM;				
			-- note that the request has now been processed

	end if;

	if step = "bye_bye" then
		Tsdb(Tid) := OM;			-- farewell, cruel world
	end if;
	
end debit_credit;
Note that the preceding code starts and ends with fully stereotyped sections, the first of which merely initializes or re-initializes the transaction state, while the second merely writes the transaction's final results out in fail-safe fashion. Thus, in a suitable linguistic environment, it could be expressed more simply as
transactional procedure debit_credit(tid);		
	-- debit/credit transaction example, hypothetically rewritten 
	
	transactional var to_debit,to_credit,amount;		
		-- these variables will always be restored after crash

	-- initially 
		[to_debit,to_credit,amount] := DCq(1);		
			-- get the first debit/credit request posted
	end initially;

	debited_client_record := account_database(to_debit);		
		-- get the client record to be debited
	debited_client_record.account_total -:= amount;				
		-- post the debit
	credited_client_record := account_database(to_credit);		
		-- get the client record to be credited
	credited_client_record.account_total +:= amount;			
		-- post the credit

	steps: -- write the transaction effects to disk
		step: account_database(to_debit) := debited_client_record;
		step: account_database(to_credit) := credited_client_record;
		step: DCq(1) := OM;
	end steps; 
	
end debit_credit;

This latter representation hints at the way in which the code previously shown can be optimized by moving fragments of the state-save and state recovery code into the various 'ifs' that appear in it. However, this optimization will have little effect on average, since crash and recovery should be rather infrequent events.

The preceding discussion shows how we can handle coordinated groups of database operations, possibly to several databases db_1, db_2,..,db_k, which are either all to succeed, or all to fail leaving no disk trace. This is easy. We simply form backups of all the databases involved by writing db_1_back := db_1; db_2_back := db_2;..,db_k_back := db_k; A record referencing the list of backup databases db_1_back, db_2_back,..,db_k_back is then written, in fail safe fashion, to a disk position available to a recovery routine. The steps of our higher-level transaction are then applied to db_1, db_2,..,db_k. Physical system failure will recover, in the manner outlined above, to the state as of the last prior elementary database modification. Our higher-level transaction must be written in a manner which allows the rest of the

(3) More on transaction recovery at levels above that of single database operations.

Next assume that a transaction like that seen above wants to abort at some point within it, leaving no trace. This can be done as follows. The preceding code assumes that the top level record identifier internal to the object 'account_database' seen there is supplied by the transaction monitor (e.g. as a pointer to a physical disk record, the 'official-database root', always kept at a fixed disk position known to the transaction monitor) and passed from that monitor to the transaction code.

A first implementation sketch. To show the feasibility of the approach proposed, we now outline additional details of an implementation of the database facilities described in the preceding paragraphs. The implementation sketched here is chosen to be as simple as possible subject to the constraint that very large databases be handled with acceptable efficiency. However, the more intricate implementation actually used increases its efficiency very greatly. The implementation uses two data object types. The first is simply a very large string type, with a B-tree representation, supporting all the standard string operations, particularly insertion and deletion of string sections. The second, which we call a string index object SI, is also a string, but is somewhat more complex. It consists of fixed-size fields representing integers. The 'B-tree with cumulants' data structure used to achieve this is described below. This structure supports insertion, deletion, and an efficient increment/decrement operation increment_index(SI,i,n) which logically adds an integer i to the n-th and all succeeding fields of SI.

Given these object classes, ICS is simply a large string divided into sections representing successive values n,x, and ICX is a vector of pointers to the start of these sections. The codes implementing insertion/deletion of an element n,x into/from ICS are then as follows:

The list LN is a large string divided into sections structured as follows: (i) a fixed-size integer defining the length of the word w; (ii) the word w; (iii) the series of values n associated with the word w. The associated LW is a vector of pointers to the start of these sections. The codes implementing insertion/deletion of an element n,x into/from the database are then as follows.

procedure insert(n,x); 			-- insert [n,x] into a database

	s := string_of(x);		-- calculate the string summary s of the SETL object x
	words := breakup(s,"\t\r\n ");			-- find all the words in s
		
				-- iterate over all these words, performing insertions into LN

	for w in words | w /= "" loop word_insert(w,n); end loop;

					-- find the proper ICS and ICX locations for n,x
	[n_place_ICS,n_place_ICX] := ICS_locate(n);
	nx_stg := binstr(bignum + n) + binstr(x);	-- package n,x as a string
	ICS(n_place_ICS..n_place_ICS - 1) := nx_stg;	-- insert this string into ICS
	
		-- now the index ICX must be updated. First insert the place information for n.
	ICX(n_place_ICX..n_place_ICX - 1) := binstr(bignum + n_place_ICS);		

			-- now increment the place information for all following n
	increment_index(ICX,#nx_stg,((n_place_ICX + num_stg_size - 1)/num_stg_size) + 1);
	
end insert; 
procedure delete(n,x); 			-- delete [n,x] from a database

	[n_place_ICS,n_place_ICX] := ICS_locate(n);
	if n_place_ICS > #ICS
		or ICS(n_place_ICS..n_place_ICS + num_stg_size - 1) /= binstr(bignum + n) then 
		return;				-- no n in the database
	end if;
	
	s := string_of(x);		
		-- calculate the string s corresponding to the SETL object x
	words := breakup(s,"\t\r\n ");		-- find all the words in s

		-- iterate over all these words, performing deletions from LN
	
	for w in words | w /= "" loop word_delete(w,n); end loop;

							-- get the length of n,x as a string
	nx_stg_len := (nbs := #binstr(bignum + n)) + #binstr(x);		
	ICS(n_place_ICS..n_place_ICS + nx_stg_len - 1) := "";	
		-- delete this string from ICS

		-- now the index ICX must be updated. 
		-- First delete the place information for n.
	ICX(n_place_ICX..n_place_ICX + nbs - 1) := "";		

			-- now decrement the place information for all following n
	increment_index(ICX,-nx_stg_len,((n_place_ICX + num_stg_size - 1)/num_stg_size));

end delete; 
procedure word_insert(w,n); 					-- inserts w,n into LN

		-- locate the start of the LN section corresponding to the word w
	[w_place_LN,w_place_LW] := LN_locate(w);
	
	nwwl := #(w_with_len := binstr(bignum + #w) + w);
		-- w with its length, which is the header for w's section if present
	
	if (w_last := w_place_LN + nwwl - 1) > #LN 
		or LN(w_place_LN..w_last) /= w_with_len then
			-- w not found; prepare to insert a section containing w and n. 

				-- insert w,n into LN
		LN(w_place_LN..w_place_LN - 1) := (w_with_len + binstr(bignum + n));

				-- insert location of w into LW
		LW(w_place_LW..w_place_LW - 1) := binstr(bignum + w_place_LN);

		-- All following LW components must be incremented 
		-- by the number of characters inserted into LN. 
		increment_index(LW,nwwl + num_stg_size,
			((w_place_LW + num_stg_size - 1)/num_stg_size) + 1);

	else		-- w found; search the corresponding LN section
			-- to find the proper insertion point for n
		
		zone_start := w_place_LN + nwwl;	-- first position past w's section header

		zone_end := 
			if (next_place_LW := w_place_LW + num_stg_size) > #LW then #LN 
			else 
			unbinstr(LW(next_place_LW..next_place_LW + num_stg_size - 1)) - bignum end if;

		loc_of_n := n_locate(n,zone_start,zone_end);
		LN(loc_of_n..loc_of_n - 1) := binstr(bignum + n);
		
		-- All following LW components must be incremented. 
		increment_index(LW,num_stg_size,
			((w_place_LW + num_stg_size - 1)/num_stg_size) + 1);

	end if;

end word_insert; 
procedure word_delete(w,n); 		-- deletes w,n from LN
		-- locate the start of the LN section corresponding to the word w
	[w_place_LN,w_place_LW] := LN_locate(w);
	
	nwwl := #(w_with_len := binstr(bignum + #w) + w);
		-- w with its length, which is the header for w's section if present
	
	if (w_last := w_place_LN + nwwl - 1) > #LN 
			or LN(w_place_LN..w_last) /= w_with_len then
		return;			-- w not found
	end if;
	
	-- if the n being deleted is the only n with this w, 
	-- the w itself must be deleted
	zone_start := w_place_LN + nwwl;	-- first position past w's section header

	zone_end := if (next_place_LW := w_place_LW + num_stg_size) > #LW then #LN 
		else unbinstr(LW(next_place_LW..next_place_LW + num_stg_size - 1)) - bignum 
			end if;

	loc_of_n := n_locate(n,zone_start,zone_end);
	LN(loc_of_n..loc_of_n + num_stg_size - 1) := "";	-- delete n from w's section

		-- All following LW components must be decremented accordingly
	increment_index(LW,-num_stg_size,
		((w_place_LW + num_stg_size - 1)/num_stg_size) + 1);

end word_delete; 
Here are the binary search routines used in the insertion and deletion procedures seen above.
procedure ICS_locate(n);	-- locates the ICS section containing a given identifier
end ICS_locate; 
procedure LN_locate(w);	-- locates the LN section containing a given identifier
end LN_locate; 
procedure n_locate(n,zone_start,zone_end);
		-- locates an identifier n within the list accompanying a given word w
end n_locate; 
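
The bodies of these routines are left open in this draft. One possible realization of ICS_locate, by binary search over the fixed-width pointer fields of ICX, is sketched below; it assumes, as in the insertion and deletion code above, that each ICX field is num_stg_size bytes long and points to an ICS section whose first num_stg_size bytes hold binstr(bignum + n).

 procedure ICS_locate_sketch(n);
 	-- locate, by binary search over ICX, the ICS and ICX positions at which
 	-- the identifier n occurs, or at which it should be inserted

  lo := 0; hi := #ICX / num_stg_size;   -- number of ICX entries

  while hi > lo loop

   mid := (lo + hi + 1) / 2;    -- index of the ICX entry probed
   place := (mid - 1) * num_stg_size + 1;
   ICS_ptr := unbinstr(ICX(place..place + num_stg_size - 1)) - bignum;
   n_there := unbinstr(ICS(ICS_ptr..ICS_ptr + num_stg_size - 1)) - bignum;

   if n_there < n then lo := mid; else hi := mid - 1; end if;

  end loop;    -- lo is now the number of entries whose identifiers precede n

  n_place_ICX := lo * num_stg_size + 1;
  n_place_ICS := if n_place_ICX > #ICX then #ICS + 1
  	else unbinstr(ICX(n_place_ICX..n_place_ICX + num_stg_size - 1)) - bignum end if;

  return [n_place_ICS,n_place_ICX];

 end ICS_locate_sketch;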

LN supports binary search for words w, allowing the object realizing the 'x in contains(w)' iteration, the 'x in contains(w)' boolean-valued test operation, and the '#contains(w)' operations to be coded as follows:
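
No code follows this point in the draft. A minimal sketch of the '#contains(w)' operation, built on LN_locate and the same zone computation used in word_insert above, might read as follows; the iteration and membership-test objects can then step through the identifiers in the same zone, num_stg_size bytes at a time.

 procedure count_contains(w);   -- sketch of the '#contains(w)' operation

  [w_place_LN,w_place_LW] := LN_locate(w);
  nwwl := #(w_with_len := binstr(bignum + #w) + w);

  if (w_last := w_place_LN + nwwl - 1) > #LN 
  	or LN(w_place_LN..w_last) /= w_with_len then
   return 0;    -- w does not occur in the database
  end if;

  	-- the zone of record identifiers for w runs from the end of w's header
  	-- to the start of the next word's section (or to the end of LN)
  zone_start := w_place_LN + nwwl;
  zone_end := if (next_place_LW := w_place_LW + num_stg_size) > #LW then #LN + 1
  	else unbinstr(LW(next_place_LW..next_place_LW + num_stg_size - 1)) - bignum end if;

  return (zone_end - zone_start) / num_stg_size;  -- each identifier occupies num_stg_size bytes

 end count_contains;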

The straightforward ideas sketched in the last few paragraphs can be elaborated to cover more general cases. First consider the case in which we wish to store nested tuples of tuples of tuples of.. of strings, down to some maximum limiting depth, e.g. 8, which should be ample for many applications. All the tuples stored should be allowed to have very large numbers of components, and all the strings to be very long. This can be done by storing 10 cumulants in place of the 3 used for simple vectors of strings, namely we could store number of leaves, total number of characters in a node's descendant leaves, and total number of breaks between records at each of our 8 hierarchical levels. But a better design is possible. Specifically, by slightly restricting the number of embedded subtuples allowed to appear at each level (no more than 2 billion), we can instead store only number of leaves, total number of characters, and a single offset number of breaks between records. For this, we divide the range of integers that can be held in our third cumulant into 8 equal 21-bit subranges, where the largest, containing integers of the form 7 * 2**21 + n, represents n breaks between top-level tuples, while integers of the form 6 * 2**21 + n represent n breaks between second-level tuples nested in these top-level tuples, etc. The operations supported have the form

obj_2 := obj(n);         and           obj(n) := obj_2;

where 'obj' is one of our very large nested tuple objects, n is the index of one of its components, and 'obj_2' is a like object in which the nesting is one level less deep. At the bottom level substring extraction operations like those shown above come into play.
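
The offset encoding described above can be captured by a pair of small helper routines; the names are illustrative only. Breaks between top-level (level 1) tuples are stored offset by 7 * 2**21, breaks between level-2 tuples by 6 * 2**21, and so on down to level 8, which carries no offset.

 procedure encode_breaks(level,nbreaks);
 	-- pack a break count and its nesting level into one cumulant field
  return (8 - level) * 2 ** 21 + nbreaks;
 end encode_breaks;

 procedure decode_breaks(field);
 	-- recover the pair [level,nbreaks] from a cumulant field
  return [8 - field / 2 ** 21, field mod 2 ** 21];
 end decode_breaks;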

To understand the technique used we can begin by thinking of the crude idea of storing every one of the nbreaks_cum1,...,nbreaks_cum8 mentioned above at each position. But instead of this we store only two quantities, namely nbreaks_cumj, where j is the first position in this sequence at which it differs from its immediate predecessor, and nbreaks_cumj, where j is the first position in this sequence at which it differs from its immediate successor. Each integer stored is offset in the manner explained above, and so explicitly tells us the nesting level to which it relates. For example, instead of storing the full sequence

1.1.1.1.1.1.1.1; 1.1.2.1.1.1.1.1; 1.2.1.1.1.1.1.1; 2.1.1.1.1.1.1.1; 3.1.1.1.1.1.1.1; ...

we would store the abbreviated sequence

1(1); 2(3); 2(2); 2(1); 3(1); ...

etc. Here the number shown in parentheses is the component-index j of the full sequence nbreaks_cum1,...,nbreaks_cum8 from which the integer appearing before it is taken. However, each node ends with the full sequence nbreaks_cum1,...,nbreaks_cum8 that would appear in it; but the abbreviations we have shown appear in every other position.

Examination of the abbreviated cumulant sequence shown just above tells us that all nodes stored