Other SETL native and utility packages
One of the most essential features of very high level languages is the ability to interface easily and efficiently with lower level languages. This is important for two reasons: (i) such interfaces make it possible to reuse the vast treasure of public domain code and libraries available on the Net; (ii) particularly critical routines can be brought to high efficiency by coding them in lower level languages.
When SETL2 was revised in 1990, this need was recognized by Kirk Snyder, who added an interim Call-out and Call-back capability to the language. The present system provides a considerably more mature and powerful version of this important capability. Using it, the SETL system has been supplied with a collection of native packages providing useful libraries of high-efficiency functions in various areas. Besides the extensive Tk package described in the preceding chapter, these now include operations for string manipulation and matching, analysis of strings using regular expressions, image processing, numerical computation, computation with modular integers, and others. Still other native packages make important internal SETL system functions like parsing, compilation, and library manipulation available for external use within the SETL language. This includes the interface to Unix described below.
An interface to the Python language is also provided, making it possible to access all of Python's facilities, including its Web access operations.
The high-efficiency native packages are supplemented by a family of utility packages written in SETL itself. These provide a variety of string, graph, I/O, and integer procedures.
In this report we will first describe this collection of libraries in a manner helpful to their would-be user. Subsequently we descend to the 'C' level, to give the layer of detail needed by would-be implementors of additional native packages. This will make it necessary to describe parts of the internal structure of the SETL2 system and the way in which native packages are implemented.
Every native package must have a SETL specification, which must be compiled into a SETL library to make the native package available to the system.
An example of a SETL specification for a native package is:
native package string_utility_pak;

   procedure breakup(obj,chars);
      -- break string into tuple at any occurrence of a c in chars.
      -- works recursively on tuples of strings, etc.
   procedure join(tuple,chars);
      -- joins tuple of strings into string by inserting 'chars' between
   procedure single_out(obj,chars);
      -- break string into tuple by separating occurrence of a c in chars
      -- into singletons. works recursively on tuples of strings, etc.
   procedure segregate(obj,chars);
      -- break string into tuple by separating runs of c's in chars
      -- from nonruns. works recursively on tuples of strings, etc.
   procedure keep_chars(string,chars);
      -- suppress chars of string that are not in 'chars'
   procedure suppress_chars(string,chars);
      -- suppress chars of string that are 'chars'
   procedure case_change(string,mode);
      -- capitalize if mode is "lu"; decapitalize if mode is "ul";
   -- and more ...

end string_utility_pak;
Although SETL performs no type checking at compilation time (and no type declarations appear in a native package specification), the specification does document the functions provided by the compiled package.
As the above example illustrates, a native package specification is identical to the specification part of a standard SETL package, except that the keyword native appears in the package header. Any variable name can be used as a placeholder for a parameter in the native package specification, but it is important that the number of arguments match the number expected by the package implementation.
Once a native package is included by some SETL code that uses it, the functions provided by the native package become available to the user exactly as if they were built-in functions.
For example, a user of the breakup operation supplied by the native package whose specifier is seen above would write:
program test_breakup;
   use string_utility_pak;
   print(breakup({"a,b,c,d",["d,e","f"]},","));
end;

to use it. When the above program is compiled, the system will verify statically that the breakup function is called correctly, i.e. with two parameters, but of course will not check the types of the arguments statically, since SETL treats these dynamically.
When program execution begins, the SETL runtime system will realize that the program seen above requires a native package, in this case 'string_utility_pak', and will search the appropriate directories for the presence of the DLL containing its code. If no such DLL can be found, the program will abort with an error message announcing that certain native functions cannot be resolved. Otherwise, the DLL is loaded, and the system checks that all the functions declared in the native package specification are defined in the DLL. The native packages currently provided are summarized in the following table.
string_utility_pak | Provides various commonly used string manipulation primitives, in high-performance form. These include string split and join operations, character suppression operations, and upper/lowercasing. A library of high-performance operations on strings represented as flat byte-arrays is also provided. |
file_pak | Provides various file and directory manipulation operations, including creation, deletion, listing, and navigation. |
rx_pak | High-performance regular-expressions package derived from the Gnu library. |
stringm_pak | High-performance exact and approximate string matching operations, including the Knuth-Morris-Pratt and Boyer-Moore matching algorithms, the Karp-Rabin probabilistic match, Aho-Corasick matching of a list of patterns to a string, suffix-tree matching, edit-distance based approximate string matching and alignment, and detection of the longest common subsequence of a pair of strings. |
tk_pak | Interface to the Tk widget set. This is the basis for SETL's interactive Graphical Interface, described in Chapter 10. |
parser_pak | Interface to SETL's parse and compile functions. 'parse' analyzes a SETL source file and returns its parse tree as a nested collection of tuples, making the standard SETL error messages available if there are parse errors. A 'compile' function, which takes a string representing a SETL source file as input and compiles it into the currently active library, is also provided. |
ide_pak | This package gives access to the string contents of the currently active SETL edit window and makes it possible to create the interactive source-code analysis and manipulation utilities described below. Its functions also allow the currently selected area of an edit window to be detected and manipulated. A second group of functions in this package allow the library list currently in effect to be read and modified, packages to be reloaded after compilation, and programs to be rerun. |
setldb_pak | Extensive collection of SETL database operations, described in Chapter 12. |
libnr_pak | Large library of high-performance numerical routines, derived from the collection described in the well-known treatise 'Numerical Recipes in C.' |
libntl_pak | Package of integer, modular arithmetic, and modular polynomial operations derived from Victor Shoup's well-known 'NTL' library. |
unix_pak | Makes the facilities of Unix available to SETL programs on systems allowing this. |
python_pak | Interface to the Python language and all of its many libraries. In particular, this gives access to Python's internet routines, including its URL and socket routines. |
grlib_pak | Extensive library of image-manipulation functions. Multiplane images are handled in both byte and floating formats. |
zip_pak | A Lempel-Ziv based package of compression and Zip-File access utilities. |
sound_pak | Procedures for manipulation and analysis of sound in AIFF and related formats. |
midi_pak | Procedures for manipulation of music in MIDI format. |
movie_pak | Procedures for manipulation and display of video image sequences. |
native package string_utility_pak;

   -- ************ general string utilities ************

   procedure breakup(obj,chars);
      -- break string into tuple at any occurrence of a c in chars.
      -- works recursively on tuples of strings, etc.
   procedure join(tuple,chars);
      -- joins tuple of strings into string by inserting 'chars' between
   procedure single_out(obj,chars);
      -- break string into tuple by separating occurrence of a c in chars
      -- into singletons. works recursively on tuples of strings, etc.
   procedure segregate(obj,chars);
      -- break string into tuple by separating runs of c's in chars
      -- from nonruns. works recursively on tuples of strings, etc.
   procedure keep_chars(string,chars);
      -- suppress chars of string that are not in 'chars'
   procedure suppress_chars(string,chars);
      -- suppress chars of string that are 'chars'
   procedure case_change(string,mode);
      -- capitalize if mode is "lu"; decapitalize if mode is "ul";
   procedure hexify(string);
      -- converts string to its hex representation

   -- ************ operations on flat strings ************

   procedure flat_create(length);
      -- creates a flat-string object, as a flat C-type buffer of specified length
   procedure flat_len(obj);
      -- returns the length (in characters) of a flat-string object
   procedure flat_to_setl(obj);
      -- converts a flat-string object to the SETL string with the same characters
   procedure flat_from_setl(string);
      -- converts a SETL string to the flat-string object with the same characters
   procedure flat_clone(obj);
      -- returns copy of a flat-string object (as flat-string object)
   procedure flat_add(obj1,obj2);
      -- returns concatenation of two flat-string objects (as flat-string object)
   procedure flat_reverse(obj);
      -- returns reverse of a flat-string object (as flat-string object)
   procedure flat_slice(obj,i,j);
      -- returns indicated slice of a flat-string object
      -- (as flat-string object)
   procedure flat_slice_end(obj,i);
      -- returns indicated tail slice of a flat-string object
      -- (as flat-string object)
   procedure flat_set_slice(obj,i,obj2);
      -- sets indicated slice of a flat-string object
      -- (from another flat-string object, in place)
   procedure flat_translate(obj,map);
      -- translates characters of flat-string object, returns result
   procedure flat_reverse_translate(obj,map);
      -- translates and reverses characters of flat-string object, returns result
   procedure flat_set_char(obj,i,c);
      -- sets indicated character of a flat-string object
      -- (from SETL char, in place)
   procedure flat_get_char(obj,i);
      -- gets indicated character of a flat-string object (as SETL char)
   procedure flat_translate_all(obj,off,minlen,codestr,rar);
      --
   procedure flat_file_get(filename,start_char,end_char);
      -- reads flat string from specified section of a file
   procedure flat_file_put(filename,start_char,obj);
      -- write flat string to specified location in a file
   procedure flat_match_scores(obj1,obj2,off,repeats);
      --
   procedure flat_toto_prepare(flats);
      --
   procedure flat_toto_print(obj);
      --
   procedure flat_toto_clear(obj);
      --
   procedure flat_toto_match(obj,stg,maxnumber);
      --

end string_utility_pak;
General string utilities. The first parameter of 'breakup' can be a string, a tuple or set of strings, a tuple of tuples of strings, etc. Its second parameter must be a string of characters. The same is true of 'single_out' and 'segregate'. If the first parameter of 'breakup' is simply a string 'stg', 'breakup' breaks 'stg' into a tuple by cutting it at any occurrence of any c in its second parameter, 'chars'. (If two characters of 'chars' follow in immediate succession, two cuts are made, and an empty string between them results.) The characters belonging to 'chars' are removed.
'single_out' behaves in almost exactly the same way, but retains the characters belonging to 'chars', which appear as single characters in the result tuple. 'segregate' behaves similarly, but makes only one cut for each run of characters belonging to 'chars'; these runs are left in the result tuple, in which they alternate with runs of characters not belonging to 'chars'.
The small example
program test;   -- string_utility test
   use string_utility_pak;
   print(breakup("a,b.c,.d",",."));
   print(single_out("a,b.c,.d",",."));
   print(segregate("a,b.c,.d",",."));
end test;
illustrates these rules. Its output is
["a", "b", "c", "", "d"] ["a", ",", "b", ".", "c", ",", ".", "d"] ["a", ",", "b", ".", "c", ",.", "d"]
The first parameter of 'breakup', 'single_out', or 'segregate' can also be a tuple or set of strings, tuple of tuples of strings, etc., in which case 'breakup', 'single_out', or 'segregate' is applied recursively to the elements or components of the set or tuple until strings to which it can be applied directly are reached. This is illustrated by the example
program test;   -- string_utility test 2
   use string_utility_pak;
   print(breakup({"a,b.c,.d","A,B.C,.D"},",."));
   print(single_out(["a,b.c,.d","A,B.C,.D"],",."));
   print(segregate({"a,b.c,.d","A,B.C,.D"},",."));
end test;
which produces the output
{["A", "B", "C", "", "D"], ["a", "b", "c", "", "d"]} [["a", ",", "b", ".", "c", ",", ".", "d"], ["A", ",", "B", ".", "C", ",", ".", "D"]] {["A", ",", "B", ".", "C", ",.", "D"], ["a", ",", "b", ".", "c", ",.", "d"]}
The recursive action of these three operations makes it easy to deal with arrays of objects represented by punctuated strings with multiple layers of punctuation marks. For example, if a 3-dimensional tensor of integers is represented in the string form
stg := "1,2,3;11,12,13;21,22,23:101,102,103;111,112,113;" + "121,122,123:201,202,203;211,212,213;221,222,223";
the following lines of code will read it and produce the output seen:
print(unstr_rec(breakup(breakup(breakup(stg,":"),";"),",")));

procedure unstr_rec(obj);
   return if is_set(obj) then {unstr_rec(x): x in obj}
      elseif is_tuple(obj) then [unstr_rec(x): x in obj]
      else unstr(obj) end if;
end unstr_rec;
[[[1, 2, 3], [11, 12, 13], [21, 22, 23]], [[101, 102, 103], [111, 112, 113], [121, 122, 123]], [[201, 202, 203], [211, 212, 213], [221, 222, 223]]]
The 'join' operation is a kind of inverse of 'breakup'. join(tuple,chars) takes a tuple of strings as its first argument and joins this into a string by inserting the characters of 'chars' between its components. An example is
join(breakup("a,bb,ccc",","),"@@@")
which produces the output
a@@@bb@@@ccc
Another example: to replace all the runs of (possibly multiple) whitespace characters in a string by single spaces, one can simply use
join([x: x in segregate(stg," \t") | x(1) notin " \t"]," ")
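For instance, a small sketch of ours (the input string and the expected output shown in the comments are illustrative, not taken from the text):

program test;   -- whitespace normalization sketch
   use string_utility_pak;
   stg := "This   is\t\ta  test";   -- hypothetical input containing runs of blanks and tabs
   print(join([x: x in segregate(stg," \t") | x(1) notin " \t"]," "));
      -- keeps only the non-whitespace runs and rejoins them with single spaces;
      -- should print: This is a test
end test;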
'case_change' is a simple capitalization/decapitalization utility which operates in the two modes "lu" and "ul". It capitalizes its first string argument if its mode parameter (second parameter) is "lu", and decapitalizes it if the mode parameter is "ul". The following line of code
print(case_change("a.b.c","lu")," ",case_change("A.B.C","ul"));
which produces the output
A.B.C a.b.c
shows this.
'keep_chars' and 'suppress_chars' are simple character-filtering utilities. 'keep_chars' suppresses all characters of its first string argument that are not found in its second argument 'chars'; 'suppress_chars' suppresses all characters of its first string argument that are found in its second argument 'chars'. This is shown by the example
print(keep_chars("a.b.c.A.B.C","abc")," ", suppress_chars("a.b.c.A.B.C","."));
which produces the output
abc abcABC

Operations on flat strings. The standard SETL string objects are designed to handle string modifications, including concatenation and insertion operations, rapidly for strings of moderate length, but operations on them tend to bog down as the strings involved grow longer. For example, the double loop
for k in [1..50] loop
   stg := 1000 * " ";
   for j in [1..#stg] loop stg(j..j) := "ab"; end loop;
end loop;
performs its 50,000 insertions in strings of length averaging 1200 in 4 seconds (450 MHz Mac processor), but the code
for k in [1..5] loop
   stg := 10000 * " ";
   for j in [1..#stg] loop stg(j..j) := "ab"; end loop;
end loop;
requires 43 seconds to perform the same number of insertions in strings of length averaging 12000. The reason is that the internal SETL string representation is as a linked list of short string sections. For work with longer strings, and provided that the operations involved do not modify the length of the strings involved, a C-like 'flat' string representation, which allows high-efficiency machine-level addressing of individual characters, will be much more efficient. A library of operations on such objects, which we shall call 'flat strings', is provided by string_utility_pak. The mutually inverse operations
flat_from_setl(stg) and flat_to_setl(obj)

convert standard SETL strings to and from 'flat string' objects. This is shown in the following example,
print(flat_to_setl(obj := flat_from_setl("Hello World"))," ",obj," ",flat_len(obj));
which produces the output
Hello World11
This example also shows the use of the 'flat_len' operation, which returns the length (in characters) of its flat-string argument.
flat_set_char(obj,i,c) and flat_get_char(obj,i)
provide access to individual characters of flat strings, the first of these setting a designated character and the second retrieving a single character. Examples are
obj := obj2 := flat_from_setl("Hello World");
print(flat_get_char(obj,1));
flat_set_char(obj,1,"J");
print(flat_to_setl(obj)," ",flat_to_setl(obj2));
producing the result
H
Jello World Jello World
Note that this operation (characteristically for native objects, but uncharacteristically for SETL) has 'pointer' semantics; it changes the object on which it acts 'in place'. The same is true for the 'flat_set_slice' operation, as seen in the example
obj := obj2 := flat_from_setl("Hello World");
print(flat_to_setl(flat_slice(obj,2,5))," ",flat_to_setl(flat_slice_end(obj,2)));
flat_xs := flat_from_setl("XXXX");
flat_set_slice(obj,2,flat_xs);
print(flat_to_setl(obj)," ",flat_to_setl(obj2));
This produces the output
ello ello World
HXXXX World HXXXX World
The operation flat_slice(obj,i,j) returns the indicated slice of a flat-string object (as a flat-string object). The operation flat_slice_end(obj,i) returns the indicated tail slice of a flat-string object, from the indicated character to the end (as a flat-string object). The operation flat_set_slice(obj,i,obj2) sets the indicated slice of a flat-string object, starting at the indicated position and extending for as many characters as there are in obj2, in place (as the preceding example shows).
flat_clone(obj) creates a copy of the indicated flat string. flat_add(obj1,obj2) returns the concatenation of two flat-string objects (as a flat-string object). These appear in the following small example, whose output is 'Hello World!Hello World!'.
program test;   -- flat-string cloning and adding
   use string_utility_pak;
   flat_stg := flat_from_SETL("Hello World!");
   print(flat_to_SETL(flat_add(flat_stg,flat_clone(flat_stg))));
end test;
flat_translate(obj,map) and flat_reverse_translate(obj,map) translate the characters of a flat-string object according to a 256-character translation map, the second also reversing the result, as the following example shows.
program test;   -- flat_translate and flat_reverse_translate
   use string_utility_pak;
   flat_stg := flat_from_SETL("Hello World");
   map_stg := "" +/ [char(j): j in [0..255]];   -- all 256 characters map to themselves
   map_stg(abs("H") + 1) := "J";                -- change two of the character mappings
   map_stg(abs("o") + 1) := "A";
   map := flat_from_SETL(map_stg);              -- convert to flat string
   print(flat_to_SETL(flat_translate(flat_stg,map)));          -- print translated
   print(flat_to_SETL(flat_reverse_translate(flat_stg,map)));  -- also translated and reversed
end test;

The output produced is
JellA WArld
dlrAW AlleJ
native package rx;   -- derived from the GNU regular expression package

   procedure regcomp(rw rx,pattern,flags);
      -- sets rx to the compiled form of the regular expression 'pattern'
      -- and returns 0 or an error code.
      -- flags is an integer whose bits allow for special options, including
      -- extended = (treat as extended expression, see GNU documentation)
      -- case = (ignore case); nosub = (don't return matching parts info)
      -- line = (match list of substrings delimited by newlines separately)
      -- for additional options, see the detailed documentation on the
      -- Free Software Foundation website
   procedure regexec(rx,s,flags,rw pos);
      -- runs compiled rx against string, returning 0 or an error code
      -- if the match succeeds, pos is returned as a tuple [] defining
      -- the way in which the whole expression and its subparts matched
      -- flags is an integer whose two low order bits allow for
      -- the following special options:
      -- NOT_BEGLINE (value = 1): don't regard start of string as line start
      -- NOT_ENDLINE (value = 2): don't regard end of string as line end
   procedure regerror(err,rx);
      -- returns detailed error string if rx failed and
      -- returned 'err' as its error code

end rx;

In the next few paragraphs, we give a brief syntactic and semantic summary of regular expressions and their use. Regular expressions (which are the chocolate truffles of programming: small but amazingly rich) are patterns which can be matched efficiently to strings. Each such pattern is itself represented by a string, which must conform to the simple, very condensed syntax described in the next few paragraphs. This syntax treats a small number of characters, namely

   .   *   [   ]   ^   $
in a special way, as syntactic/semantic markers, designating such things as subpattern repetition and alternation. (Each of these characters can be used 'as itself' within a pattern by preceding it with the backslash character '\'.) As a 'gotcha' for the unwary, a few other important characters must be preceded with a backslash to function as syntactic/semantic markers; if unescaped they represent themselves. These are the characters

   (   )   +   ?   |   {   }
Yet a further nuisance arises from the fact that to be written within SETL quotes the backslash character '\' must be doubled to '\\'.
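For example (an illustrative fragment of ours), the basic regular expression 'c\(a\+\)' used in Example 9a below must appear in SETL source text with every backslash doubled:

pattern := "c\\(a\\+\\)";   -- SETL string constant denoting the pattern c\(a\+\)
regcomp(regx,pattern,0);    -- equivalent to regcomp(regx,"c\\(a\\+\\)",0);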
The regular expression matching process works left-to-right through the 'target' string being matched, and tries to find the leftmost-longest substring of the target which matches the pattern. More specifically, the normal matching process begins at the first character of the target string; if no match can be found there, it tries again starting at the second, third, ... characters of the target, until a match is found. The match found is then the longest substring, starting at the given position, which matches the pattern. For simple matches, the match process returns a pair [i,j], where i is the starting position of the match and j is the position of the first subsequent character which cannot be matched, i.e. 1 past the position of the last matched character. (When matches involving patterns with sub-patterns are performed, a more complex tuple is returned, as explained below.) If no match can be found, an empty tuple is returned.
The 'flags' parameters appearing in 'regcomp' and in 'regexec' allow regular expression compilation and matching to proceed in a variety of modes, e.g. case-sensitive and case-insensitive matching. Details concerning these flags and their use are given at the end of the present section.
For the rest, one must understand which strings are matched by each possible pattern string. The rules determining this are illustrated by the following examples. The program
program test;   -- regular expressions example 1
   use rx;
   regcomp(regx,"Bc",0);              -- compiles "Bc" into 'regx'
   regexec(regx,"abcdaBcd",0,pos);    -- matches compiled "Bc" to "abcdaBcd"
   print(pos);
end test;
produces the output
[[6, 8]]
which indicates that the matched section of the target runs from its 6th to its 7th character.
program test;   -- regular expressions example 2
   use rx;
   regcomp(regx,"b..a",0);            -- compiles "b..a" into 'regx'
   regexec(regx,"abcdaBcd",0,pos);    -- matches compiled "b..a" to "abcdaBcd"
   print(pos);
end test;
matches the substring 'bcda' of the target and so produces the output
[[2, 6]]
program test;   -- regular expressions example 3
   use rx;
   regcomp(regx,"ab[a-df-z!][a-df-z!]d",0);
      -- compiles "ab[a-df-z!][a-df-z!]d" into 'regx'
   regexec(regx,"Aab!cdzBcd",0,pos);
      -- matches compiled "ab[a-df-z!][a-df-z!]d" to "Aab!cdzBcd"
   print(pos);
end test;
produces the output
[[2, 7]]
since it matches the substring 'ab!cd' of the target.
program test;   -- regular expressions example 4
   use rx;
   regcomp(regx,"[^a-d][a-df-z!]d",0);
      -- compiles "[^a-d][a-df-z!]d" into 'regx'
   regexec(regx,"aab!cdzBcd",0,pos);
      -- matches compiled "[^a-d][a-df-z!]d" to "aab!cdzBcd"
   print(pos);
end test;
produces the output
[[4, 7]]
since it matches the substring '!cd' of the target.
program test;   -- regular expressions example 5
   use rx;
   regcomp(regx,"[[:punct:]]c[[:digit:][:punct:]][[:digit:][:punct:]]",0);
   regexec(regx,"aab!c9,Bcd",0,pos);
   print(pos);
end test;
matches the substring '!c9,' of the target, and so produces the output [[4, 8]].
program test;   -- regular expressions example 6
   use rx;
   regcomp(regx,"[[:digit:][:punct:]]\\+",0);
   regexec(regx,"aab!9,99!!888Bcd",0,pos);
   print(pos);
end test;
which matches the substring '!9,99!!888' of the target, and so produces [[4, 14]].
program test;   -- regular expressions example 7
   use rx;
   regcomp(regx,"[[:digit:][:punct:]]\\{2,9\\}",0);
   regexec(regx,"aab!9,99!!888Bcd",0,pos);
   print(pos);
end test;
which produces the output [[4, 13]] since it matches the 9 characters '!9,99!!88' of the target.
program test;   -- regular expressions matching example 8
   use rx;
   print(regcomp(regx,"\\(fe\\)\\?d",0));
   regexec(regx,"fed",0,pos); print(pos);
   regexec(regx,"d",0,pos); print(pos);
end test;
is seen to match successfully in both cases since the 'optional' group '\(fe\)\?' matches both the 'fe' in 'fed' and the empty string in 'd'.
A first example of the use of grouping parentheses '\(...\)' within a pattern is
program test;   -- regular expressions example 9a
   use rx;
   regcomp(regx,"c\\(a\\+\\)",0);     -- compiles 'c\(a\+\)' into 'regx'
   regexec(regx,"caaaabb",0,pos);     -- matches compiled 'c\(a\+\)' to "caaaabb"
   print(pos);
end test;
which we shall compare to
program test;   -- regular expressions example 9b
   use rx;
   regcomp(regx,"ca\\+",0);           -- compiles 'ca\+' into 'regx'
   regexec(regx,"caaaabb",0,pos);     -- matches compiled 'ca\+' to "caaaabb"
   print(pos);
end test;
The difference between these two examples is that in the second form the regular expression used is simply 'ca\+', which matches 'c' followed by indefinitely many a's, while in the first version the regular expression is the more sophisticated 'c\(a\+\)', in which the same repeated a's are matched, but as a group. The second example generates the expected match tuple [[1, 6]], but the first example generates the less expected [[1, 6], [2, 6]]. This is because, besides the first, widest pair representing the entire range of characters matched, each subgroup in a regular expression generates an additional pair showing the range matched by the subgroup. In the example shown, this is the run of characters 'aaaa' in the second through fifth positions. However, if the group as a whole is repeated in the regular expression, only the last of its matches generates such a pair. The effect of this can be seen in the example
program test;   -- regular expressions example 9c
   use rx;
   regcomp(regx,"c\\(a\\)\\+",0);     -- compiles 'c\(a\)\+' into 'regx'
   regexec(regx,"caaaabb",0,pos);     -- matches compiled 'c\(a\)\+' to "caaaabb"
   print(pos);
end test;
Here, since we match the repeated group '\(a\)\+', only its final match generates a pair in the resulting tuple, which is [[1, 6], [5, 6]], indicating that the last match of the repeated group is to the character 'a' appearing 5th in the target string. (Compare this to the [[1, 6], [2, 6]] returned when 'c\(a\+\)' is matched instead.)
The example
program test;   -- regular expressions example 9d
   use rx;
   regcomp(regx,"c\\(a\\)\\+\\(b\\+\\)",0);
      -- compiles 'c\(a\)\+\(b\+\)' into 'regx'
   regexec(regx,"caaaabb",0,pos);
      -- matches compiled 'c\(a\)\+\(b\+\)' to "caaaabb"
   print(pos);
end test;
which returns [[1, 8], [5, 6], [6, 8]], makes another rule apparent. This is the fact that a series of successively matched subexpressions generates a series of successive pairs into the result returned. Each subexpression generates a pair indicating the range of characters that it has matched, or, if it is repeated, the range of characters matched by the last of its repetitions.
The general rule for construction of the tuple returned upon successful match of a regular expression involving (possibly nested) subexpressions and repetitions can now be stated as follows. Successful match generates a series of pairs corresponding to character ranges. The first such pair defines the total range covered by the match. This is followed by a sequence of pairs, grouped into successive subsequences corresponding to the series of top-level subgroups of the regular expression. For each such subgroup, the run of characters matched by the subgroup (or, if it is repeated, by its final repetition) is indicated. If the subgroup contains nested sub-subgroups, the same rule is applied recursively, the pairs generated by any single subgroup always following one another in the overall list of pairs returned.
All this is shown in the example
program test;   -- regular expressions example 10
   use rx;
   regcomp(regx,"\\(\\(a\\+b\\+\\)\\+\\(c\\+\\)\\)\\(d\\+\\)",0);
   regexec(regx,"aaaabbaabbbcccddd",0,pos);
   print(pos);
end test;
which returns the list of pairs
[[1, 18], [1, 15], [7, 12], [12, 15], [15, 18]]

and repays careful study. The value returned reflects the following facts: the whole pattern matches the entire 17-character target ([1, 18]); the outer group '\(\(a\+b\+\)\+\(c\+\)\)' matches 'aaaabbaabbbccc' ([1, 15]); the final repetition of its nested subgroup '\(a\+b\+\)' matches 'aabbb' ([7, 12]); its subgroup '\(c\+\)' matches 'ccc' ([12, 15]); and the final group '\(d\+\)' matches 'ddd' ([15, 18]).
program test;   -- regular expressions matching example 11
   use rx;
   regcomp(regx,"\\(\\(ab\\)\\+\\|\\(cd\\)\\+\\)\\+",0);
   regexec(regx,"ababcdcdcd",0,pos);
   print(pos);
end test;
returns [[1, 11], [5, 11], [5, 5], [9, 11]] since the alternation '\(\(ab\)\+\|\(cd\)\+\)\+' matches either runs of 'ab' or runs of 'cd', and so the whole target string "ababcdcdcd" is matched.
program test;   -- regular expressions matching example 12
   use rx;
   regcomp(regx,"\\(cd\\+\\)\\+\\(ab\\+\\)\\+z\\1\\(\\2\\)",0);
   regexec(regx,"cdcddababbzcddabb",0,pos);
   print(pos);
end test;
This example returns [[1, 18], [3, 6], [8, 11], [15, 18]] since its first part '\(cd\+\)\+\(ab\+\)\+' matches 'cdcddababb' and associates the two subexpressions it contains with 'cdd' and 'abb' (the last character ranges these repeated subexpressions match); then '\1' and '\(\2\)' match the second 'cdd' and 'abb' respectively. The latter is a subgroup match, and so generates the pair [15, 18] into the result tuple. This example also shows that back-reference constructs of the form '\n' can themselves appear within grouped subexpressions.
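A simpler back-reference illustration (a sketch of ours; the expected output shown in the comment is deduced from the rules stated above rather than taken from the text) is

program test;   -- simple back-reference sketch
   use rx;
   regcomp(regx,"\\(ab\\)\\1",0);   -- the pattern \(ab\)\1: 'ab' followed by a copy of what the group matched
   regexec(regx,"xababy",0,pos);    -- the group matches the 'ab' in positions 2-3; '\1' then matches the second 'ab'
   print(pos);                      -- expected output: [[2, 6], [2, 4]]
end test;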
program test;   -- regular expressions example 13
   use rx;
   regcomp(regx,"car$",0);            -- compiles 'car$' into 'regx'
   regexec(regx,"carcar",0,pos);      -- matches compiled 'car$' to "carcar"
   print(pos);
end test;
returns [[4, 7]] since it can only match the second occurrence of 'car' in 'carcar'. With the '$' removed it would instead match the first occurrence, and so return [[1, 4]]. Similarly
program test;   -- regular expressions example 14
   use rx;
   regcomp(regx,"^car",0);
   regexec(regx,"acar",0,pos);
   print(pos);
end test;
fails to match, since it can only match at the start of the target string. Without the '^' it would return [[2, 5]], since a match starting at the second character would be found.
program test;   -- regular expressions example 15
   use rx;
   regcomp(regx,"cat\\>",0);
   regexec(regx,"thecatcat catcatcat",0,pos); print(pos);
   regcomp(regx,"\\<cat",0);
   regexec(regx,"thecatcat catcatcat",0,pos); print(pos);
   regcomp(regx,"\\Bcat\\B",0);
   regexec(regx,"the catcat catcatcat",0,pos); print(pos);
   regcomp(regx,"\\bcat\\B",0);
   regexec(regx,"thecatcat catcatcat",0,pos); print(pos);
end test;
This example returns
[[7, 10]]
[[11, 14]]
[[15, 18]]
[[11, 14]]
reflecting the fact that '\>' and '\<' match the empty string at the end and at the beginning of a word respectively, while '\b' matches the empty string at any word boundary and '\B' matches the empty string only at a position which is not a word boundary.
The 'flags' parameters appearing in 'regcomp' and in 'regexec' allow regular expression compilation and matching to proceed in a variety of modes, e.g. case-sensitive and case-insensitive matching. To set these flags, one passes appropriate integer 'flags' parameter values to 'regcomp' and 'regexec' respectively. Each flag corresponds to a bit position in this integer, which must be 1 for the corresponding flag to be set. The following table lists the 'regcomp' flags.
EXTENDED (value = 1) | Treat the pattern as an extended regular expression, rather than as a basic regular expression. Basic regular expressions are regular expressions having the syntax detailed in the preceding paragraphs. Extended regular expressions differ from these in that instances of the characters '(', ')', '+', '?', '|', '{', and '}' do not need to be preceded by the backslash character '\'. That is, we can write '(', ')', '+', '?', '|', '{', and '}' instead of '\(', '\)', '\+', '\?', '\|', '\{', and '\}' respectively, a welcome relief. In extended regular expressions, we can also use {m} as an abbreviation for {m,m}, and can use {m,} to mean 'at least m occurrences'. Examples are given below. |
ICASE (value = 2) | Ignore case when matching letters. |
NEWLINE (value = 4) | Don't treat the end-of-line character as an ordinary character; instead, treat each line as a target to be matched. This allows '^' to match at the start of any line, and '$' to match at the end of any line, but prevents '.' from matching the end-of-line character. Examples are given below. |
Here are examples of the use of these flags.
program test;   -- regular expressions example 9a.1
   use rx;
   regcomp(regx,"c(a+)",1);           -- compiles 'c(a+)' into 'regx'
   regexec(regx,"caaaabb",0,pos);     -- matches compiled 'c(a+)' to "caaaabb"
   print(pos);
end test;
is the extended regular expression version of Example 9a, and returns the same result. To use an ordinary regular expression we would have to write "c\\(a\\+\\)", as in Example 9a, instead. Similarly
program test;   -- regular expressions matching example 11.1
   use rx;
   regcomp(regx,"((ab)+|(cd)+)+",1);
   regexec(regx,"ababcdcdcd",0,pos);
   print(pos);
end test;
is the extended regular expression version of Example 11, and returns the same result. Plainly "((ab)+|(cd)+)+" is much more readable than "\\(\\(ab\\)\\+\\|\\(cd\\)\\+\\)\\+".
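The interval abbreviations available in extended syntax can be used in the same way. The following sketch (ours; the expected output in the comment follows from the leftmost-longest rule) matches a run of at least two digits:

program test;   -- extended regular expression interval sketch
   use rx;
   regcomp(regx,"[[:digit:]]{2,}",1);   -- EXTENDED syntax: two or more consecutive digits
   regexec(regx,"ab123cd4",0,pos);
   print(pos);                          -- expected output: [[3, 6]], i.e. the run '123'
end test;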
The example
program test;   -- regular expressions example 16
   use rx;
   regcomp(regx,"\\(abc\\)\\+",2);
   regexec(regx,"ABCabc",0,pos);
   print(pos);
end test;
which returns [[1, 7], [4, 7]], or still better its extended regular expression version
program test;   -- regular expressions example 16b
   use rx;
   regcomp(regx,"(abc)+",3);
   regexec(regx,"ABCabc",0,pos);
   print(pos);
end test;
shows the use of case-insensitive matching.
The example
program test;   -- regular expressions matching example 17
   use rx;
   regcomp(regx,"e$",4);
   regexec(regx,"Oh, Jack be nimble\nJack be quick",0,pos); print(pos);
   regcomp(regx,"^J",4);
   regexec(regx,"Oh, Jack be nimble\nJack be quick",0,pos); print(pos);
   regcomp(regx,"e$.+^J",7);
   regexec(regx,"Oh, Jack be nimble\nJack be quick",0,pos); print(pos);
end test;
which returns
[[18, 19]]
[[20, 21]]
[]
shows the alternative treatment of end-of-line characters, and the failure of the third match in this example shows also that if this option is used matches can never extend over more than one line.
It is often useful to simplify the use of a native package by representing the raw functions it provides in some more convenient object syntax. In this section we present such a 'wrapper class', called 'rxp', for the regular expressions native package described in the preceding section. This allows us to write the regular expression compilation and matching operations of the preceding section in the form 'pattern_object * target', where the 'pattern_object' is created from a regular expression string by writing

   pattern_object := rxp(pattern,flags);
Here 'pattern' is a regular expression string of the kind described in the preceding section, and 'flags' is an integer composed of bitflags, as described there. If the matching operation pattern_object * target succeeds it returns exactly the kind of position tuple described in the preceding section.
This object class also allows 'pattern_object ** target' to be used to match a regular expression repeatedly to a string.
Note that the class seen below interprets the error codes returned by 'regcomp' and so explains their meaning.
The class code is as follows.
class rxp;   -- SETL regular-expression class

   procedure create(pattern,flags);   -- compile pattern into regular expression

end rxp;

class body rxp;   -- SETL regular-expression class

   use rx;

   var native_rx;   -- the native compiled form of the regular expression

   procedure create(pattern,flags);   -- compile pattern into regular expression

      const REG_NOSUB := 2;
      flags ?:= REG_NOSUB;   -- default to NOSUB if no flags are given

      if (errc := regcomp(native_rx,pattern,flags)) = 0 then return; end if;

      message_string := case errc   -- otherwise there is an error
         when 1 => "Bad brackets \\{...\\}"
         when 2 => "Bad regular expression syntax"
         when 3 => "Bad use of repetition operator, e.g. *,?,+ \\{...\\}"
         when 4 => "Bad collating element"
         when 5 => "Bad character class name"
         when 6 => "Terminal occurrence of \\"
         when 7 => "Bad digit in digit construct"
         when 8 => "Unbalanced square brackets"
         when 9 => "Unbalanced parentheses \\(...\\)"
         when 10 => "Unbalanced braces \\{...\\}"
         when 11 => "Bad endpoint in a range expression"
         when 12 => "Ran out of memory during compilation"
      end case;

      abort("**** REGULAR EXPRESSION COMPILATION ERROR: " + message_string);

   end create;

   procedure self * stg;   -- match regular expression to string, returning [] or match-tuple

      if (errc := regexec(native_rx,stg,0,pos_tuple)) > 1 then   -- ran out of memory
         abort("**** REGULAR EXPRESSION EXECUTION ERROR: " + "Ran out of memory");
      end if;

      return if errc = 0 then pos_tuple else [] end if;

   end;

   procedure self ** stg;   -- match regular expression repeatedly to string,
                            -- returning [] or match-tuple

      matches := [];        -- these will be collected
      match_offset := 0;    -- offset to apply to match tuple

      while stg /= "" loop

         if (errc := regexec(native_rx,stg,0,pos_tuple)) > 1 then   -- ran out of memory
            abort("**** REGULAR EXPRESSION EXECUTION ERROR: " + "Ran out of memory");
         end if;

         if errc > 0 or (nexc := pos_tuple(1)(2)) = 1 then
            return matches;   -- nothing matched
         end if;

         matches with:= [[x + match_offset,y + match_offset]: [x,y] in pos_tuple];
                                 -- apply the match offset to the tuple returned
         match_offset +:= (nexc - 1);
                                 -- advance the match offset to the last matched character
         stg := stg(nexc..);     -- continue with the remaining part of the string

      end loop;

      return matches;

   end;

end rxp;
The following small test program illustrates the use of this object class.
program test;   -- regular expressions class example
   use rxp;
   print(rxp("ab\\+",OM) * "aabbabbabbb");
   print(rxp("ab\\+",OM) ** "aabbabbabbb");
end test;
It returns
[[2, 5]]
[[[2, 5]], [[5, 8]], [[8, 12]]],
the first pair corresponding to the three characters 'abb' matched by the first, uniterated, match operation, and the list of three pairs corresponding to the threefold match iteration triggered by the second match operation.
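Since the first pair of each inner tuple covers the whole of the corresponding match, the matched substrings themselves can be recovered by ordinary string slicing, as in the following small sketch (ours; the expected output follows from the positions shown above):

program test;   -- extracting the matched substrings (sketch)
   use rxp;
   stg := "aabbabbabbb";
   matches := (rxp("ab\\+",OM) ** stg);   -- [[[2, 5]], [[5, 8]], [[8, 12]]], as above
   print([stg(i..j - 1): [i,j] in [m(1): m in matches]]);
      -- expected output: ["abb", "abb", "abbb"]
end test;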
The full regular expression for this entire lexical analyzer would be a totally incomprehensible coded string of several hundred characters. To have a chance of writing this correctly (which will certainly involve some amount of debugging) we must find some way of decomposing it into the string-sum of meaningful submodules. A technique generally applicable to a large regular expression is to think of it as a (perhaps recursive) alternation of concatenated subparts which can be written separately, debugged, and then combined. To make an alternation out of its separate alternatives we have then only to write
"("alternative1 + "|" + alternative2 + "|" + ... + alternativek")"To make a concatenation out of its separate pieces we write
"piece1 + piece2 + ... + piecekThis is done to construct the following lexical analyzer. Note that the analyzer is written as an extended regular expression to reduce clutter. At the top level, it sees the string to be analyzed as a concatenation of whitespace, variable names, quoted strings, based numerals, ordinary numerals, operation words, comments, and end-line characters. If a line ends with a comment, the following end-line character is included in the comment. Otherwise the end-line character which terminates a line counts as whitespace.
The lexical analyzer appears as the routine 'get_tokens' in the following package, which is supplied with the SETL system.
package extended_SETL_lexer;

   const extra_opsigns := "~!@$%^&";

   procedure get_tokens(stg);   -- decompose a string into generalized SETL tokens

end extended_SETL_lexer;

package body extended_SETL_lexer;

   use rx,string_utility_pak;

   var lexer := OM;

   -- the 'escaped' characters allowed within SETL strings
   -- are \\, \0, \n, \r, \f, \t, \", \xdd

   const D_hex := "\\\\x[0-9a-f][0-9a-f]";
      -- regular expression for hex strings
   const D_escaped := "\\\\f|\\\\0|\\\\t|\\\\n|\\\\r|\\\\\\\\|\\\\\\\"";
      -- regular expression for escaped special characters
   const D_opsigns_and_parens := "`~!@#$%^&*()+={}[|':/?><,.[`;";
      -- regular expression for generalized opsigns/parens
   const D_opsigns := "([~!@#$%^&*+=:(){}[/?><.]|])";
      -- regular expression for generalized opsigns
   const quote := "\"";                        -- quote character
   const quote_eol := "[\"\n\r]";              -- quote and endline characters
   const not_quote_eol_char := "[^\"\n\r]";    -- characters other than quotes and endlines
   const escaped_quote := "\\\\\\\"";
      -- regular expression for quote mark preceded by backslash
   const start_quoted_string := quote + "(" + not_quote_eol_char + "|" + escaped_quote + ")+";
      -- regular expression for start of complete or incomplete
      -- quoted string, without terminating quote or endline character
   const D_quoted := start_quoted_string + quote_eol;
      -- regular expression for complete and incomplete quoted strings
   const D_name := "[a-zA-Z]([[:alnum:]]|_)*";
      -- regular expression for SETL variable names
   const D_white := "( |\t|\r|\n)*";
      -- regular expression for whitespace
   const D_num := "[0-9][_0-9]*(.[0-9][_0-9]*((E|e|e-|e-)[0-9]+)?)?";
      -- regular expression for real and floating numerals
   const D_based_num := "[0-9]+#[0-9a-z][_0-9a-z]*(.[0-9a-z]" + "[_0-9a-z]*#((E|e|e-|e-)[0-9]+)?)?";
      -- regular expression for real and floating numerals
      -- with non-decimal bases
   const D_not_monadic_end := "([~!@#$%^&*+=-?><.:/]*[*+=:?><./])";
   const D_opword := "([~!@#$%^&*+=(){},;[:/?><.]|]|" + D_not_monadic_end + "|[a-zA-Z][[:alnum:]]*)";

   const ppe := {"procedure","program","class","package","use","var","const","end"};
      -- declaration headers and tails
   const icl := {"if","case","loop"};
      -- control structure headers (omitting headers for iterative loops,
      -- which, as tokens, do not differ from variable names)

   procedure get_tokens(stg);   -- decompose a string into generalized SETL tokens

      if #stg = 0 then return []; end if;                  -- null string case
      if stg(#stg) notin "\n\r" then stg +:= "\n"; end if; -- add a terminating endline if needed

      if lexer = OM then   -- we must compile the regular expression for the lexer
         regcomp(lexer,"(" + D_white + "|" + D_name + "|" + D_quoted + "|" + D_based_num
                     + "|" + D_num + "|" + D_opword + ")",1);
            -- use extended regular expression syntax
      end if;

      tokens := [];   -- will collect

      while stg /= "" loop

         if #stg > 1 and stg(1..2) = "--" then        -- we have a comment
            comment := break(stg,"\n\r");              -- this runs to the line end
            comment := comment + span(stg,"\n\r \t");  -- and includes the line-end character
            tokens with := comment;                    -- collect the comment string as a token
         elseif regexec(lexer,stg,1,pos) = 0 then      -- the regular expression finds a token
            [strt,endd] := pos(1);                     -- get token start and end
            tokens with := stg(strt..(endd - 1) min #stg);   -- collect the token
            if endd = 1 then endd := 2; end if;        -- if the token is a null string then take 1 char
            stg := stg(endd min (#stg + 1)..);         -- continue with remainder of input string
         else                                          -- if no token can be found, then exit
            exit;
         end if;

      end loop;

      return tokens;   -- return the list of tokens

   end get_tokens;

end extended_SETL_lexer;
The specifier of the string-matching package STRINGM_PAK is
native package stringm;   -- string matching package

   ------------------------------------
   -- Exact String Matching
   ------------------------------------

   procedure kmp(string,pattern);
      -- Knuth-Morris-Pratt match of a pattern to a string
      -- returns tuple of all starting points of matches
   procedure kmp_compile(pattern);
      -- optimized Knuth-Morris-Pratt match; returns compiled pattern
   procedure kmp_exec(string,pat);
      -- optimized Knuth-Morris-Pratt match using pre-compiled pattern
      -- returns tuple of all starting points of matches
   procedure bm(string,pattern);
      -- Boyer-Moore match of a pattern to a string
      -- returns tuple of all starting points of matches
   procedure bm_compile(pattern);
      -- optimized Boyer-Moore match; returns compiled pattern
   procedure bm_exec(string,pat);
      -- optimized Boyer-Moore match using pre-compiled pattern
      -- returns tuple of all starting points of matches
   procedure kr(string,pattern);
      -- Karp-Rabin probabilistic match of a pattern to a string
      -- returns tuple of all starting points of matches

   ------------------------------------
   -- Exact Matching to a Set of Strings
   ------------------------------------

   procedure ac_compile(tuple);
      -- Aho-Corasick match of list of patterns to a string; compiles pattern
   procedure ac_init(pattern,text);
      -- Aho-Corasick match of list of patterns to a string; binds pattern to text
   procedure ac_next_match(pattern);
      -- Aho-Corasick match of list of patterns to a string;
      -- returns next match
      -- (as a triple [tuple_index_of_pattern_matched,loc,len]), or OM

   ------------------------------------
   -- Exact Matching using Suffix Trees
   ------------------------------------

   procedure st_compile(tuple);
      -- suffix-tree match of list of patterns to a string; compiles pattern
   procedure st_match(pattern,text);
      -- suffix-tree match of list of patterns to a string
      -- returns tuple of all match points

   ------------------------------------
   -- Edit Distance
   ------------------------------------

   procedure edist(s1,s2);
      -- edit distance between two strings
   procedure exedist(s1,s2,w);
      -- weighted edit distance between two strings
      -- the weights w should be supplied as an integer quadruple:
      -- [insert_weight,delete_weight,substitute_weight,match_weight]
   procedure etrans(s1,s2);
      -- best edit sequence between two strings
      -- returns a descriptor consisting of characters M (matching character),
      -- S (replace character), D (delete), I (insert), followed by integer distance
   procedure exetrans(s1,s2,w);
      -- best edit sequence between two strings, using weights
      -- returns a descriptor consisting of characters M (matching character),
      -- S (replace character), D (delete), I (insert), followed by integer distance
      -- the weights w should be supplied as an integer quadruple:
      -- [insert_weight,delete_weight,substitute_weight,match_weight]

   ------------------------------------
   -- Longest matching subsequence
   ------------------------------------

   procedure lcseq(s1,s2);
      -- finds longest matching subsequence of two strings; returns a pair
      -- [common_string,length]. Note that this is an increasing sequence,
      -- not necessarily contiguous

   ------------------------------------
   -- Approximate matching operations
   ------------------------------------

   procedure pwscores(a);
      -- compile a list of character-score triples into the form
      -- expected by simil and similt
   procedure simil(s1,s2,pws);
      -- finds the 'optimal alignment distance' between two strings
   procedure similt(s1,s2,pws);
      -- finds the 'optimal alignment' of two strings, representing this
      -- as a descriptor of the same kind that 'exetrans' returns

end stringm;
Exact String Matching. Three exact matching procedures for matching to single strings, namely the Knuth-Morris-Pratt, the roughly equivalent Boyer-Moore procedure, and the potentially more efficient Karp-Rabin probabilistic algorithm constitute the first group of operations provided. The first two of these are made available in two forms, one of which matches a pattern string directly to a target, while the second begins by compiling the pattern into a set of tables and then matches this compiled form of the pattern to the target string. This second form is intended for use in situations in which a single pattern is to be matched repeatedly to the members of a collection of target strings.
A first example of the use of these routines is
program test;   -- string matching example 1
   use stringm;
   print(kmp("ababaaba","aba"));
   print(bm("ababaaba","aba"));
   print(kr("ababaaba","aba"));
end test;
which returns
[1, 3, 6]
[1, 3, 6]
[1, 3, 6]
in each case the starting characters of the three match positions. Note that the position '3' is found by all these procedures, even though it lies within the zone covered by the match to its left.
The 'precompiled' form of this same example is
program test;   -- string matching example 1, precompiled form
   use stringm;
   print(kmp_exec("ababaaba",kmp_compile("aba")));
   print(bm_exec("ababaaba",bm_compile("aba")));
end test;
which returns the same result.
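The compiled form becomes worthwhile when a single pattern must be matched against many targets, as in the following sketch (ours; the list of target strings is invented for illustration):

program test;   -- reusing a compiled pattern over several targets (sketch)
   use stringm;
   pat := kmp_compile("aba");                 -- compile the pattern once
   for stg in ["ababaaba","aba","bbb"] loop   -- hypothetical collection of target strings
      print(stg," ",kmp_exec(stg,pat));       -- all starting points of matches in each target
   end loop;                                  -- for "ababaaba" the tuple is [1, 3, 6], as above
end test;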
Exact Matching to a Set of Strings. Two methods, the Aho-Corasick method and a method using suffix trees, are provided to match a list of patterns to a target string. As seen in the following example, the Aho-Corasick scheme is set up as a triple of procedures. The first of these compiles a list of strings into a form suitable for efficient matching, the second binds the resulting pattern object to the target string to be matched, and the third acts something like a SETL iterator, returning a new match each time it is called, until a final call returns OM, indicating that no more matches exist.
program test;   -- exact string matching to a list of patterns
   use stringm;
   compiled_pat := ac_compile(["abc","aa","cde"]);   -- compile list of patterns
   ac_init(compiled_pat,"abaabcde");                 -- bind to target string
   print(ac_next_match(compiled_pat));               -- extract successive matches
   print(ac_next_match(compiled_pat));
   print(ac_next_match(compiled_pat));
   print(ac_next_match(compiled_pat));
end test;
Since each successful call to 'ac_next_match' returns a triple of the form [index_of_matching_pattern,match_start,match_length], this example generates the output
[2, 3, 2]
[1, 4, 3]
[3, 6, 3]
<om>
Note that, even in the presence of overlapping matches, the algorithm reports every point at which one of the strings in the pattern list can be matched to the target string.
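The suffix-tree alternative follows the same compile-then-match pattern. The following sketch (ours) applies it to the same pattern list; per the package comments, st_match returns a tuple of all match points, though the exact layout of that tuple is not illustrated in this text:

program test;   -- suffix-tree matching of a list of patterns (sketch)
   use stringm;
   compiled_pat := st_compile(["abc","aa","cde"]);   -- compile the list of patterns
   print(st_match(compiled_pat,"abaabcde"));         -- tuple of all match points
end test;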
Approximate Matching.
This group of functions can be used to compute edit distances, longest common subsequences and string similarity.
Edit Distance. The edit distance between two strings is defined as the minimum number of edit operations needed to transform the first string into the second.
The edit operations considered in our implementation are insertions, deletions and substitutions. Two versions, 'edist' and 'exedist', of the edit distance primitive are provided. The first of these computes edit distance using standard weights assigned to the edit operations. The second, extended version allows explicit assignment of a weight to each edit operation. These weights must be passed as a tuple of four integers having the form

   [insert_weight, delete_weight, substitute_weight, match_weight].
The default weights used in the standard version are simply [1,1,1,0].
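Thus 'edist', and 'exedist' called with the default weights made explicit, should agree. For the two strings aligned in the next example, whose edit distance turns out to be 3, a small sketch (ours) is:

program test;   -- edit distance, default vs. explicit weights (sketch)
   use stringm;
   print(edist("aaababab","aababaaa"));               -- expected: 3
   print(exedist("aaababab","aababaaa",[1,1,1,0]));   -- same weights given explicitly; expected: 3
end test;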
The two routines 'etrans' and 'exetrans' work with the same edit distance as 'edist' and 'exedist', but also generate an edit transcript. This is a sequence of one-character edit operation codes in which D, I, S, and M designate Deletion, Insertion, Replacement, and Match respectively. The following example illustrates the use of 'etrans' and the way in which the edit transcript that it returns can be used to set up an alignment between two similar but not identical strings.
program test;   -- approximate string matching
   use stringm;

   print([transcript,dist] := etrans(s1 := "aaababab",s2 := "aababaaa"));
   print;   -- get edit distance and transcript

   loc_in_s1 := 0; revised_s1 := "";
   loc_in_s2 := 0; revised_s2 := "";
   indicator_stg := "";

   for c in transcript loop

      case c

         when "M" =>
            revised_s1 +:= s1(loc_in_s1 +:= 1);
            revised_s2 +:= s2(loc_in_s2 +:= 1);
            indicator_stg +:= ".";

         when "S" =>
            revised_s1 +:= s1(loc_in_s1 +:= 1);
            revised_s2 +:= s2(loc_in_s2 +:= 1);
            indicator_stg +:= "|";

         when "D" =>
            revised_s1 +:= s1(loc_in_s1 +:= 1);
            revised_s2 +:= "-";
            indicator_stg +:= "|";

         when "I" =>
            revised_s1 +:= "-";
            revised_s2 +:= s2(loc_in_s2 +:= 1);
            indicator_stg +:= "|";

      end case;

   end loop;

   print(revised_s1);
   print(indicator_stg);
   print(revised_s2);

end test;
This produces the output
["MMIMMMSMD", 3] aa-ababab ..|...|.| aababaaa-
The first line is the standard pair [edit_transcript,edit_dist] returned by 'etrans'. The three remaining lines are (i) the first string, with insertions marked by dashes; (ii) an indicator string, showing where the aligned first and second strings differ; (iii) the second string, with insertions marked by dashes.
By increasing the cost of insertions and deletions, as in the following variant, we can of course force a different alignment.
program test;   -- approximate string matching
   use stringm;

   print([transcript,dist] := exetrans(s1 := "aaabababb",s2 := "aababaaa",[3,3,1,0]));
      -- triple the cost of insertions and deletions
   print;   -- get edit distance and transcript

   loc_in_s1 := 0; revised_s1 := "";
   loc_in_s2 := 0; revised_s2 := "";
   indicator_stg := "";

   for c in transcript loop

      case c

         when "M" =>
            revised_s1 +:= s1(loc_in_s1 +:= 1);
            revised_s2 +:= s2(loc_in_s2 +:= 1);
            indicator_stg +:= ".";

         when "S" =>
            revised_s1 +:= s1(loc_in_s1 +:= 1);
            revised_s2 +:= s2(loc_in_s2 +:= 1);
            indicator_stg +:= "|";

         when "D" =>
            revised_s1 +:= s1(loc_in_s1 +:= 1);
            revised_s2 +:= "-";
            indicator_stg +:= "|";

         when "I" =>
            revised_s1 +:= "-";
            revised_s2 +:= s2(loc_in_s2 +:= 1);
            indicator_stg +:= "|";

      end case;

   end loop;

   print(revised_s1);
   print(indicator_stg);
   print(revised_s2);

end test;
This generates the alignment
["MMDMMMMSS", 5] aaabababb ..|....|| aa-babaaaString comparison by maximum similarity after alignment. A more flexible way of measuring the relatedness of two strings is to the two strings must be align them in all possible ways, possibly inserting spaces. Then, given a similarity score for each possible pair of characters (including the nominal character '0', indicating a blank), we can use the sum of the scores for the alignment as the value of the alignment. The alignment of maximum score then measures the similarity of the two strings.
This is the approach embodied in the group of three procedures 'pwscores', 'simil', and 'similt'. pwscores(score_mat) accepts a scoring matrix as parameter and returns a compiled form of this matrix as an opaque 'score matrix object'. The scoring matrix should be supplied as a list of triples [char_1,char_2,similarity_value], where scores for the nominal character '0' indicating a blank are also required. As seen in the next example, the procedures 'simil' and 'similt' then use this 'score matrix object' to find the optimal alignment of two strings containing the characters for which similarity values have been assigned.
program test;   -- approximate string matching using a similarity matrix
   use stringm;

   pws_obj := pwscores([["a","b",1],["a","a",2],
                        ["b","b",2],[0,"a",0],[0,"b",0]]);
      -- compile character similarity matrix

   print([transcript,score] := similt(s1 := "aaabababb",s2 := "aababaaa",pws_obj));
      -- use this to align strings
   print;

   loc_in_s1 := 0; revised_s1 := "";
   loc_in_s2 := 0; revised_s2 := "";
   indicator_stg := "";

   for c in transcript loop

      case c

         when "A" =>
            revised_s1 +:= (c1 := s1(loc_in_s1 +:= 1));
            revised_s2 +:= (c2 := s2(loc_in_s2 +:= 1));
            indicator_stg +:= if c1 = c2 then "." else "|" end if;

         when "D" =>
            revised_s1 +:= s1(loc_in_s1 +:= 1);
            revised_s2 +:= "-";
            indicator_stg +:= "|";

         when "I" =>
            revised_s1 +:= "-";
            revised_s2 +:= s2(loc_in_s2 +:= 1);
            indicator_stg +:= "|";

      end case;

   end loop;

   print(revised_s1);
   print(indicator_stg);
   print(revised_s2);

end test;
The code seen here generates the alignment
["AADAAAAAA", 14] aaabababb ..|....|| aa-babaaa
Note that 'simil' returns only an optimal similarity score; 'similt', like 'exetrans', returns a pair [transcript,score].
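Thus, for the strings and scoring matrix of the preceding example, where 'similt' returned the pair ["AADAAAAAA", 14], a call to 'simil' should return just the score 14 (a sketch of ours):

program test;   -- 'simil' returns only the optimal similarity score (sketch)
   use stringm;
   pws_obj := pwscores([["a","b",1],["a","a",2],
                        ["b","b",2],[0,"a",0],[0,"b",0]]);   -- same scoring matrix as above
   print(simil("aaabababb","aababaaa",pws_obj));             -- expected: 14
end test;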
The 'similt' alignment technique is commonly used in computational genomics to align sequences of characters representing the amino acids out of which proteins are built. The standard characters for the 23 amino acids occurring in ordinary proteins are "ABCDEFGHIKLMNPQRSTVWXYZ". A commonly used protein-alignment scoring matrix, the so-called 'Blosum-62' matrix derived by collecting statistics from large numbers of related proteins, is (in the same order of amino acids, and with very low scores understood for pairs containing blanks)
4, -2, 4, 0, -3, 9, -2, 4, -3, 6, -1, 1, -4, 2, 5, -2, -3, -2, -3, -3, 6, 0, -1, -3, -1, -2, -3, 6, -2, 0, -3, -1, 0, -1, -2, 8, -1, -3, -1, -3, -3, 0, -4, -3, 4, -1, 0, -3, -1, 1, -3, -2, -1, -3, 5, -1, -4, -1, -4, -3, 0, -4, -3, 2, -2, 4, -1, -3, -1, -3, -2, 0, -3, -2, 1, -1, 2, 5, -2, 3, -3, 1, 0, -3, 0, 1, -3, 0, -3, -2, 6, -1, -2, -3, -1, -1, -4, -2, -2, -3, -1, -3, -2, -2, 7, -1, 0, -3, 0, 2, -3, -2, 0, -3, 1, -2, 0, 0, -1, 5, -1, -1, -3, -2, 0, -3, -2, 0, -3, 2, -2, -1, 0, -2, 1, 5, 1, 0, -1, 0, 0, -2, 0, -1, -2, 0, -2, -1, 1, -1, 0, -1, 4, 0, -1, -1, -1, -1, -2, -2, -2, -1, -1, -1, -1, 0, -1, -1, -1, 1, 5, 0, -3, -1, -3, -2, -1, -3, -3, 3, -2, 1, 1, -3, -2, -2, -3, -2, 0, 4, -3, -4, -2, -4, -3, 1, -2, -2, -3, -3, -2, -1, -4, -4, -2, -3, -3, -2, -3, 11, 0, -1, -2, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -2, -1, -1, 0, 0, -1, -2, -1, -2, -3, -2, -3, -2, 3, -3, 2, -1, -2, -1, -1, -2, -3, -1, -2, -2, -2, -1, 2, -1, 7, -1, 1, -3, 1, 4, -3, -2, 0, -3, 1, -3, -1, 0, -1, 3, 0, 0, -1, -2, -3, -1, -2, 4
Longest common subsequence. The final native procedure supplied by STRINGM_PAK is lcseq(s1,s2), which finds the longest common subsequence of two strings, returning a pair [common_string,length]. Note that the sequence found is an ordered sequence of characters, not necessarily contiguous in either of the two strings being compared. That is, as the following example shows, the procedure finds a longest common subsequence, not the longest common run.
program test; -- longest common subsequence
use stringm;
print(lcseq("Now is the time for all good men","What hath God wrought?"));
end test;
The pair ["ht a od ", 8] is returned.
String alignment using 'similt'.
program test; -- string alignment using 'similt'
use stringm;

scores_list := [["a","b",1],["a","c",1],["a","c",1],["a",0,-2],
                ["b","a",1],["b","c",1],["b","c",1],["b",0,-2],
                ["c","a",1],["c","b",1],["c","c",1],["c",0,-2],
                ["d","a",1],["d","b",1],["d","c",1],["d",0,-2]];

scores_obj := pwscores(scores_list);
print(similt("aabbccdd","aabbccdd",scores_obj));

end test;
PARSER_PAK provides a small but important collection of functions which give access to SETL's internal parse operations. Its specifier is
native package parser; -- package of SETL parse functions

procedure parse(str); -- parse a string, returning its parse tree as a nested set of tuples
                      -- this handles all SETL 'compilation units'
procedure parse_expr(str); -- parse a string, returning its parse tree as a nested set of tuples
                           -- this handles any string that is a valid 'program body'
procedure compile(str); -- compile a string, writing result to current library
procedure setl_num_errors(); -- read number of errors generated during last previous
                             -- 'parse' or 'compile' operation
procedure setl_err_string(err_num); -- get text of indicated error message generated during
                                    -- last previous 'parse' or 'compile' operation

end parser;

The 'compile' function can be used to compile dynamically any SETL string that could otherwise be compiled as a SETL program, package, or class. If compilation is successful, the byte code produced by the compilation is written to the current library in the standard manner, as explained in Section XXX.
The 'parse' and 'parse_expr' functions. The 'parse' and 'parse_expr' functions give more intimate access to the internal structure of the parse trees generated by the SETL system as a compilation intermediate, and so can be used to support variant forms of SETL and other quite different systems which can make use of the SETL syntax. When a compilation succeeds, the 'parse' function returns the syntax tree of the code it has compiled. This is returned as a nested set of tuples, structured in the manner explained below. 'parse' can handle any SETL 'compilation unit', i.e. anything that the ordinary SETL compiler can handle. 'parse_expr' is a variant form of 'parse', designed to make it a bit easier to work with isolated expressions. 'parse_expr' can handle sequences of expressions and statements, each of which must end in a semicolon. It works as follows: the string submitted as argument to 'parse_expr' is 'wrapped' by prefixing it with an initial line 'program test;' and appending a final line 'end;'. The resulting string is then parsed as a SETL program. If this operation succeeds, the relevant part of the resulting parse tree is returned. Otherwise OM is returned, in which case the number of error messages generated is available via the procedure 'setl_num_errors', and the individual error messages via 'setl_err_string'. The following small example shows the kind of constructs that can be parsed successfully:
program test; -- examples of successful parses
use parser;
print(parse_expr("x + 1; y + 1;")); -- expressions can be parsed
print(parse_expr("x + 1; x := y + 1;")); -- statements can be parsed
print(parse_expr("for j in [1..3] loop x + 1; x := {w * w: w in y}; end loop;")); -- loops can be parsed
print(parse_expr("x + 1; y + 1; procedure f(z); return {z}; end f;"));
    -- any 'program body' (hence package body and class body) can be parsed
end test;

The output produced by the preceding code is seen below. Details concerning the manner in which this output represents parse trees are explained a bit later.
["ast_list", ["ast_add", "X", "1"], ["ast_add", "Y", "1"]] ["ast_list", ["ast_add", "X", "1"], ["ast_assign", "X", ["ast_add", "Y", "1"]]] ["ast_list", ["ast_for", ["ast_iter_list", ["ast_in", "J", ["ast_arith_tup", ["ast_list", "1"], "3"]]], ["ast_null"], ["ast_list", ["ast_add", "X", "1"], ["ast_assign", "X", ["ast_genset", ["ast_mult", "W", "W"], ["ast_iter_list", ["ast_in", "W", "Y"]], ["ast_null"]]]]]] ["ast_list", ["ast_add", "X", "1"], ["ast_add", "Y", "1"]]'parse' handles error messages in the same way as 'parse_expr' and 'compile'. Here is an example showing how error messages generated by failed parses can be recovered.
program test; -- example of failed parse, and error message recovery
use parser;
print(parse_expr("for j in [1..3] loop" + " x +, 1; x := {w * w: w inn y}; endloop;"));
print(ne := setl_num_errors());
for j in [0..ne] loop print(setl_err_string(j)); end loop;
end test;

The output produced by this example is
3
[2:26] *ERROR* => syntax error at ,
[2:29] *ERROR* => syntax error at ;
[2:46] *ERROR* => syntax error at INN
Parse tree structure. The parse trees returned by the 'parse' function reflect the nested structure of the inputs analyzed.
The overall list of statements is represented by a top-level list node of the form ["_list", statement_1, ..., statement_n].
Constructs involving built-in SETL operators and syntactic forms are represented by special nodes.
Variables and calls to functions other than built-in infix and prefix operators are represented using nodes of the form "_of", followed by the name of the function and its list of arguments. Examples are:
b(x,y)  is represented by  ["_of", "B", ["_list", "X", "Y"]]
b()     is represented by  ["_of", "B"]
b;      is represented by  the list element "B"
As these examples show, in parse trees all names are capitalized; this gives SETL its case independence.
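These representations, and those listed below, can be inspected directly by handing the corresponding source fragments to 'parse_expr'. The following is a small sketch; the trees actually printed use the parser's full node spellings (as in the outputs shown earlier), with all names capitalized.

program test; -- sketch: inspecting the representation of calls and other forms
use parser;
print(parse_expr("b(x,y);")); -- call with an argument list
print(parse_expr("b();")); -- call with no arguments
print(parse_expr("[x,y];")); -- enumerated tuple
print(parse_expr("b(x..y);")); -- slice
print(parse_expr("b(x,y) := z;")); -- assignment to an indexed position
end test;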
The node-types used are:
[x,y]      is represented by  ["_enum_tup", "X", "Y"]
{x,y}      is represented by  ["_enum_set", "X", "Y"]
[x]        is represented by  ["_enum_tup", "X"], etc.
[ ]        is represented by  ["_nulltup"], etc.
[x..y]     is represented by  ["_arith_tup", ["_list", "X"], "Y"], etc.
[x,y,z..w] is represented by  ["_arith_tup", ["_list", "X", "Y", "Z"], "W"], etc.
                              -- (note that the parser handles this general case)
b(x..y)    is represented by  ["_slice", "B", "X", "Y"]
b(x..)     is represented by  ["_end", "B", "X"]
b := y     is represented by  ["_assign", "B", "Y"]
    -- all the other assignment forms are derived from this. For example,
b(x..) := y   is represented by  ["_assign", ["_end", "B", "X"], "Y"]
b(x..y) := z  is represented by  ["_assign", ["_slice", "B", "X", "Y"], "Z"]
b(x) := z     is represented by  ["_assign", ["_of", "B", ["_list", "X"]], "Z"]
b(x,y)        is represented by  ["_of", "B", ["_list", "X", "Y"]]
b(x,y) := z   is represented by  ["_assign", ["_of", "B", ["_list", "X", "Y"]], "Z"]

Superfluous parentheses are elided in the parse tree, simplifying it but creating some problems when parse trees must be back-correlated with the token tuples from which they originate.

The SETL binary operators ** * / mod ? + - max min = /= < <= > >= in notin subset incs and or from fromb frome npow are represented by the following nodes:
"_EXPON" "_MULT" "_DIV" "_MOD" "_QUESTION" "_ADD" "_SUB" "_MAX" "_MIN" "_EQ" "_NE" "_LT" "_LE" "_GT" "_GE" "_IN" "_NOTIN" "_SUBSET" "_INCS" "_AND" "_OR" "_FROM" "_FROMB" "_FROME" "_NPOW"The SETL monadic operators 'not - # arb domain range pow' are represented by the following nodes:
"_UMINUS" "_NELT" "_ARB" "_DOMAIN" "_RANGE" "_POW"Set and tuple formers like [x: y in s | condition] are represented by structures like
["_gentup", "X", ["_iter_list", ["_in", "Y", "S"], "CONDITION"], etc.If the CONDITION is omitted in the source text it is omitted in the parse tree also. If the leading expression is omitted, "_gentup" and "_genset" are replaced by "_gentup_noexp" and "_genset_noexp" respectively.
All such iterative constructs involve an '_iter_list' node, whose elements are the successive iterators. These are all expressions; e.g. [y: y = f(x) | condition] is represented by
["_gentup", "Y", ["_iter_list", ["_eq", "Y", ["_of", "F", ["_list", "X"]]]], "CONDITION"].Conversely the parser will accept arbitrary expressions in this position, e.g. [y: u + v = f/x | condition] is represented by
["_list", ["_gentup", "Y", ["_iter_list", ["_eq", ["_add", "U", "V"], ["_div", "F", "X"]]], "CONDITION"]]In the full SETL compiler such constructs are rejected during a post-parse phase; but programs which use the parser directly may find use for them.
Similar remarks apply to other allowed iterators, e.g. 'exists y = f(x) | condition' is represented by
["_exists", ["_ex_iter", ["_eq", "Y", ["_of", "F", ["_list", "X"]]]], "CONDITION"]and likewise for 'forall y = f(x) | condition'. The 'for' iterator construct
for y = f(x) | condition loop a; b; c; end loop;

is represented by
["_for", ["_iter_list", ["_eq", "Y", ["_of", "F", ["_list", "X"]]]], "CONDITION", ["_list", "A", "B", "C"]],the body of the loop being represented by a final component of "_list" type. 'while' and 'until' iterators like
while condition loop a; b; c; end loop;

are similarly represented by
["_while", "CONDITION", ["_list", "A", "B", "C"]], etc.Conditionals like
if c1 then a; b; c; elseif c2 then d; else e; end if;

are represented by
["_if_stmt", "C1", ["_list", "A", "B", "C"], ["_if_stmt", "C2", ["_list", "D"], ["_list", "E"]]]and the corresponding conditional expressions, e.g.
if c1 then a elseif c2 then d else e end if;

by
["_if_expr", "C1", "A", ["_if_expr", "C2", "D", "E"]].A 'case' statement like
case a when b,b2 => c; otherwise => d; end case;

is represented by
["_case_stmt", "A", ["_list", ["_when", ["_list", "B", "B2"], ["_list", "C"]]], ["_list", "D"]]If the 'otherwise' clause is omitted so is the final component of the subtree representing the case statement, e.g
case a when b,b2 => c; when b3 => d; end case;

is represented by
["_case_stmt", "A", ["_list", ["_when", ["_list", "B", "B2"], ["_list", "C"]], ["_when", ["_list", "B3"], ["_list", "D"]]]]similarly case expressions like
case a when b,b2 => c otherwise => d end case

are represented by trees like
["_case_expr", "A", ["_list", ["_when", ["_list", "B", "B2"], "C"]], "D"]Alternative case statement forms like case when b => c; otherwise => d; end case; are represented by trees like
["_guard_stmt", ["_list", ["_when", "B", ["_list", "C"]]], ["_list", "D"]];Again, omission of the 'otherwise' clause simply causes omission of the final component of the corresponding subtree.
Alternative case expression forms like
case when b => c otherwise => d end case;

are represented by trees like
["_guard_expr", ["_list", ["_when", "B", "C"]], "D"]The 'return' statement is treated like a monadic operator, so
return x  is represented by  ["_return", "X"]
return    is represented by  ["_return"]

Similarly,
exit      is represented by  ["_exit"]
continue  is represented by  ["_continue"]
stop      is represented by  ["_stop"]

Assignment targets like [a,-,b,-,-] containing placeholders are represented by nodes like
["A", _placeholder, "B", _placeholder, _placeholder]Since no semantic checks are performed by the parser, constructs of this kind are allowed in any position.
Quoted strings within SETL expressions are represented by themselves, with their enclosing double quotes.
Since no semantic checks are performed by the parser, it can handle a somewhat wider range of constructs than the SETL language proper allows, including various expression forms useful in a logic system.
This includes iterators without restrictions in set formers and quantifiers, e.g. {x + y: x, y in z | x > 0}, unrestricted existentials and universals like 'exists x, y in z | x > 0' and 'forall x, y in z | x > 0', and even more general constructs like 'forall OM | expn(x,y,z,u,v)', which can be used to represent universal quantification over all the variables in a general expression expn(x,y,z,u,v). These are seen in the following example.
program test; -- parsing of generalized SETL-like expressions
use parser;
print(parse_expr("{x + y: x, y in z | x > 0};")); -- set former with unrestricted 'iterator'
print(parse_expr("exists x, y in z | x > 0;")); -- existential with unrestricted 'iterator'
print(parse_expr("forall x, y in z | x > 0;")); -- universal with unrestricted 'iterator'
print(parse_expr("forall OM | expn(x,y,z,u,v);")); -- universal over 'all free variables'
end test;

The parse trees produced are
["ast_list", ["ast_genset", ["ast_add", "X", "Y"], ["ast_iter_list", "X", ["ast_in", "Y", "Z"]], ["ast_gt", "X", "0"]]] ["ast_list", ["ast_exists", ["ast_ex_iter", "X", ["ast_in", "Y", "Z"]], ["ast_gt", "X", "0"]]] ["ast_list", ["ast_forall", ["ast_iter_list", "X", ["ast_in", "Y", "Z"]], ["ast_gt", "X", "0"]]] ["ast_list", ["ast_forall", ["ast_iter_list", "OM"], ["ast_of", "EXPN", ["ast_list", "X", "Y", "Z", "U", "V"]]]]
The 'compile' function. If compilation fails, error messages can be recovered using 'setl_num_errors' and 'setl_err_string' in the manner already explained in connection with our discussion of the 'parse' operation.
Here is an example of the use of 'compile' to generate and execute a series of entirely different procedures, all called 'changes_dynamically'.
package test_pak;
procedure changes_dynamically(x); -- declare a procedure which will change dynamically
end test_pak;

program test; -- dynamic compilation and reloading of a procedure
use parser,ide_pak;

-- define header and trailer lines for packages and procedures to be compiled

var changes_dynamically; -- declare global name for the procedure to be created dynamically

header_line := "package body test_pak;\nprocedure changes_dynamically(x);\n";
trailer_line := "end changes_dynamically;\nend test_pak;";

proc_bodies := ["return x + x;\n", "return str(x) + str(x);\n", "return cos(float(x));\n"];
    -- these three strings will be used as procedure bodies

for pbody in proc_bodies loop
  res := compile(header_line + pbody + trailer_line);
      -- wrap the string containing the procedure body, and compile it
  pak := reload_library_package("TEST_PAK");
      -- force load of the newly compiled package body
  changes_dynamically := pak("CHANGES_DYNAMICALLY");
      -- extract the 'changes_dynamically' routine
  print(my_procedure(2)); -- call a static routine that uses 'changes_dynamically'
end loop;

procedure my_procedure(x); -- auxiliary static routine
  return changes_dynamically(x);
end my_procedure;

end test;

The three entirely different results produced by the three quite different routines ultimately called are:
4
22
-0.41614683655

Note the following facts about the preceding example: (i) Both the name "TEST_PAK" of the package with which we work and the name "CHANGES_DYNAMICALLY" of the procedure retrieved from it must be given as uppercase strings, since (a) this is the form in which 'reload_library_package' expects the package name, and (b) 'reload_library_package' produces a map from capitalized procedure names to the corresponding procedure objects. (ii) In order for 'library_package' or 'reload_library_package' to find a package, the package's specifier must have been compiled previously.
The operations provided by IDE_PAK allow the IDE edit window to be used as an input console, and are crucial to the programming of SETL Help Scripts of the kind described in Section XXX.
The IDE_PAK specifier is
native package ide_pak;

-- procedures useful only in programs invoked by Help scripts

procedure get_buffer(); -- get contents of edit window
procedure get_length(); -- get length of string in edit window
procedure set_buffer(from_pos,to_pos,string); -- insert string into edit window
procedure get_selection(); -- get boundaries of current selection in edit window
procedure set_selection(from_pos,to_pos); -- set boundaries of current selection in edit window
procedure get_position();
procedure get_parameter(); -- Help Scripts, especially those of the form <SCRIPT=Personal4>,
                           -- can be preceded by a text line which serves as a personalizing
                           -- parameter. 'get_parameter()' returns this line

-- procedures of general utility

procedure ide_rerun(commandline);
procedure get_library_list(); -- get current library list, as a comma-separated string
                              -- of file names
procedure set_library_list(liblist); -- set current library list, from a comma-separated string
                                     -- of file names
procedure reload_library_package(pak); -- force reload of the named package, which may have been
                                       -- newly compiled

end ide_pak;
We now give a variety of examples illustrating the use of these operations. Since the operations of the first group all refer implicitly to the contents of a SETL IDE edit window, they must be executed as SETL Help Scripts, by dragging a Help item into such a window, in the manner explained in Section XXX. For this it is convenient to use one of the four special Help items available under the Help Panel tools tab as 'Personal1'-'Personal4'. Each of these is set up so that when dropped on a SETL IDE edit window it executes a program of the same name (which must have been precompiled). For example, the 'Personal1' Help Script is set up as
<Personal1> ANY PARAMETER STRING YOU LIKE <SCRIPT=Personal1>
so that when dropped on an edit window it executes the program named 'Personal1' (which must have been precompiled). Suppose that this has been supplied as
program Personal1; -- 13:59:58
use ide_pak;
stg := get_buffer(); -- read edit window contents
print(get_length()," ",get_selection()); -- read length of edit window contents, plus selection boundaries
print(get_position()," ",get_parameter()); -- read insert cursor position, plus Help script parameter
                                           -- (from Help script)
set_buffer(28,35,time()); -- write time to edit window
end Personal1;
and that the 'Personal1' Help item is dropped on an Edit window containing the text of this small program when characters 87-97 of the Edit window have been selected. Then the output printed will be
461 [87, 97]
462 ANY PARAMETER STRING YOU LIKE
and the current time will replace the original '13:59:58' in the edit window.
As a second example we give the script of the 'capitalize' Help tool item, which is simply
program test;
use ide_pak,string_utility_pak;
[ix1,ix2] := get_selection(); -- read edit window selection boundaries
if abs(ix2 - ix1) /= (ix2 - ix1) then [ix1,ix2] := [1,get_length()]; end if;
    -- if no selection, take whole buffer
buffer_part := get_buffer()(ix1..ix2); -- read selected part of edit window
set_buffer(ix1,ix2,case_change(buffer_part,"lu")); -- replace by capitalized version
end test;
The following example shows the use of 'set_library_list' and 'get_library_list'. In order to minimize the risk of including a nonexistent library file in the library list (which would disable SETL) we set the library list to the appropriately comma-separated but redundant 'setl2.lib,setl2.lib'. The output printed by 'print(get_library_list());' is then of course 'setl2.lib,setl2.lib'.
program test; -- library-list manipulation
use ide_pak;
set_library_list("setl2.lib,setl2.lib");
print(get_library_list());
end test;

An example of the use of reload_library_package is given in the preceding section.
The small package described next typifies the small interfaces that can be used to give SETL access to other fully dynamic languages, i.e. languages which support dynamic execution of statements written in them. Unless very high efficiency is needed, the interface can consist of just one command, which must be added to the external language, and two SETL routines. The command C added to the external language can simply pass a string back to SETL. One of the two SETL interface procedures then passes command strings from SETL to the external language; the second captures the string passed back by the command C added to the external language.
Note that to use this scheme, one must prepare a slightly modified version of the external language, embodying the added command C mentioned above.
python_pak is just such an interface, for the increasingly popular Python language. The modified version of the Python interpreter that is needed is supplied with the SETL system as the binary module 'PythonCore'. This must be placed in the same folder as the SETL interpreter.
Note that when running in default mode, Python opens an annoying console window that SETL does not use and that the SETL-associated version of Python will only use when it needs to write error messages. To prevent this window from appearing unnecessarily you should invoke the 'EditPythonPrefs' utility that comes with Python and check the 'Delay Console Window Until Needed' option offered under the 'Default Startup Options' section of the Python preferences dialog.
The python_pak specifier is simply
native package python; -- SETL interface to Python language
procedure python_exec(command); -- pass command string to Python for execution
procedure python_getstring(); -- gets string returned to SETL by setl.pass_string(stg)
end python;
The command added to the modified 'PythonCore' Python interpreter supplied with SETL is
setl.pass_string(stg)

Here, 'stg' can be any string-valued expression in the Python language. This command converts it to a SETL string, which is returned to SETL. Note that this command implicitly uses Python's object syntax: 'setl.pass_string' is a normal Python method call, addressed to the object 'setl', which is built in to 'PythonCore'.
Using these primitives it is easy to set up SETL procedures which give convenient access to Python, and in particular to its extensive built-in libraries. Two such procedures are
procedure python_calc(expn); -- evaluate a Python expression, returning the value as a string
  -- request Python to execute 'result = expn', passing 'result' to SETL
  python_exec("result = " + expn + "\nsetl.pass_string(str(result))");
  return python_getstring(); -- return value of Python expression 'str(result)' to SETL
end python_calc;

procedure python_val(vname); -- get the value of the indicated Python variable
  python_exec("setl.pass_string(str(" + vname + "))");
  return python_getstring(); -- return value of Python expression 'str(vname)' to SETL
end python_val;
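A simple round trip through this interface might then look as follows. This is a minimal sketch: it assumes the modified 'PythonCore' is installed as described above, and the value printed is only indicative.

program test; -- sketch: a round trip through the Python interface
use python;
python_exec("import math"); -- execute an arbitrary Python statement
python_exec("setl.pass_string(str(math.sqrt(2.0)))"); -- ask Python to pass a result back to SETL
print(python_getstring()); -- prints a string such as "1.4142135623731"
end test;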
An extensive collection of Internet access and file/folder manipulation utilities, built on the corresponding Python operations, is then available in the package net_utilities supplied with the SETL system. The specifier for this package is
package net_utilities; -- package of Python-derived internet utilities

var telnet_prompt := "%"; -- the telnet prompt character

procedure net_init(); -- obligatory initialization
procedure python_calc(expn); -- evaluate a Python expression, returning the value as a string
procedure python_val(vname); -- get the value of the indicated Python variable
procedure http_get(url); -- transmit http request in 'get' form
procedure http_post(url,stg); -- transmit http request in 'post' form ('old' version)
procedure http_post_multi(url,map); -- transmit http request in 'post' form ('multi' version)
procedure ftp_read(url); -- transmit request for ftp file read by blocks
procedure ftp_read_lines(url); -- transmit request for ftp file read by lines
procedure ftp_list(url); -- transmit request for ftp file-listing of directory
procedure mail_send(host,who_from,whoall_to,subj,msg); -- transmit request for mail send
procedure mail_summary(mail_host_user_pwd,nlines); -- transmit request for summary of mailbox
procedure read_gzip(file_name); -- read a Gzipped file

-- ******** telnet routines ********

procedure telnet_open(host); -- open connection to telnet host; name can be qualified by port
procedure telnet_close(host); -- close connection to telnet host
procedure telnet_read_until(host,stg); -- read telnet stream up to indicated marker
procedure telnet_expect(host,regex_list,secs); -- read telnet stream up to regexp, or timeout
procedure telnet_read_timeout(host,stg,secs); -- read telnet stream up to marker, or timeout
procedure telnet_write(host,stg); -- write telnet command
procedure telnet_response(host,stg); -- write telnet command and get response
procedure telnet_put_file(host,file_name,dest_file_name); -- send a text file via telnet
procedure telnet_put_stg(host,dest_file_name,stg); -- send a string to start a file via telnet
procedure telnet_add_stg(host,dest_file_name,stg); -- send a string to start adding to a file via telnet
procedure telnet_end_stg(host); -- end a string sent via telnet
procedure telnet_continue_stg(host,dest_file_name,stg); -- add a string to the end of a file via telnet
procedure telnet_get_file(host,source_file_name,dest_file_name); -- get a text file via telnet
procedure login(host,who,pwd); -- telnet login procedure
-- procedure sanitize(stg); -- sanitize a string containing quotes and backslashes

-- ******** file utilities ********

procedure file_chdir(path); -- change current working directory to "path"
procedure file_getcwd(); -- get current working directory
procedure file_listdir(path); -- get contents of indicated directory
procedure file_mkdir(path); -- make a new directory
procedure file_remove(path); -- remove a file
procedure file_rmdir(path); -- remove a directory
procedure file_rename(src,dst); -- rename a file or directory
procedure file_stat(path); -- file status tuple, in order st_mode, ino, dev,
                           -- nlink, uid, gid, size, atime, mtime, ctime
procedure file_kind(path); -- returns 1 for a directory, 0 for an ordinary file
procedure file_set_times(path,creation,modif); -- set file creation and modification times
procedure file_set_creator_type(path,creator,ftype); -- set file creator and type 4-char Macintosh fields
procedure python_ctime(floating_secs); -- convert python floating point time to standard rep

end net_utilities;

(The commented-out lines in this list document a few procedures available in the body of package net_utilities, but not declared for external use.) As seen, these procedures fall into roughly a half-dozen categories.
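As one quick illustration of the general calling pattern before the categories are examined in turn, the following minimal sketch sends a mail message. The host name and addresses are hypothetical, and the obligatory call of 'net_init' follows the comment in the specifier above.

program test; -- sketch: sending a mail message through net_utilities
use python, net_utilities;
net_init(); -- obligatory initialization
mail_send("smtp.example.com","alice@example.com","bob@example.com",
          "greetings from SETL","This message was sent by a SETL program.");
end test;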
The 'http' routines. The 'http_get' and 'http_post' procedures retrieve Internet files identified by URLs, the first by the 'get' method and the second by the 'post' method. (The 'post' method is often associated with retrieval of files generated dynamically on the server supplying them.) The strings that these operations retrieve are essentially identical to the text files that would appear if the same URL were examined using the 'view text' option of a standard browser.
It is generally not hard to examine HTML pages visually and unravel their structure. Done successfully, this makes many Web information services available for use as a kind of subprocedure in programs you may wish to construct. For example, the well-known 'CNNfn' web site makes frequently updated stock- and bond-market quotes available in four forms: Dow-Jones average, NASDAQ average, Standard & Poor average, and Bond average. Suppose we want to access this data for continual use within a program (perhaps a Web application) that we are writing. As of Feb. 2001, CNN puts this data into an HTML table with an HTML comment of the form '' near the start of the table (as can be seen by inspecting the HTML source text of the page). Hence we can begin to locate our desired data by taking the page text between the first occurrence of '' and the next following occurrence of ''. These occurrences are located by the code lines
stg := http_get("www.cnnfn.com"); -- download cnnfn home page dowlocs := kmp(stg,""); -- get occurrence locations of '' etlocs := kmp(stg,""); -- get occurrence locations of ''If we print the variables 'dowlocs' and 'etlocs' the result is
[14844, 19200]
[2043, 3186, 3270, 3913, 12619, 14589, 20521, 26362, 30704, 30722, 30950, 33395, 33417, 34608, 35112, 35181, 42182, 42206, 51780, 53415, 54100, 56451, 56474, 56513, 56535, 57498]

This makes it clear that '' has only two occurrences, both within the HTML table containing the data we are interested in. Hence
exists loc in etlocs | loc > dowlocs(2)

will locate the end of this table. Breaking the range of text from the first occurrence of '' to the end of the table into lines and examining the result, we easily see that the desired data is found in lines 4, 8, 12, 16, 27, 28, 30, 31, 33, 34, 36, and 37 of the table. The following program assembles these observations into code that extracts the stock-market data we want from the 'CNNfn' home page.
program test; -- getting market quotes from cnnfn website
use python;
use net_utilities,stringm,string_utility_pak;

const market_names := ["DOW","NASDAQ","S&P","BOND"];

stg := http_get("www.cnnfn.com"); -- download cnnfn home page
dowlocs := kmp(stg,""); -- get occurrence locations of ''
etlocs := kmp(stg,""); -- get occurrence locations of ''

if #dowlocs /= 2 or not (exists loc in etlocs | loc > dowlocs(2)) then
    -- assumed cnnfn page format has changed
  print(" ********** CNNFN page format has changed **********");
  stop; -- give warning and stop
end if;

tup := breakup(stg(dowlocs(1)..loc),"\n\r");
    -- break page section between '' and '' into lines
tup := [tup(j): j in [4,8,12,16,27,28,30,31,33,34,36,37]];
    -- get lines containing desired data
data := [back3(line): line in tup]; -- extract numerical data from right end of lines

for j in [1..4] | data(2 * j + 4)(1) = "-" loop
  data(j) := "-" + data(j); -- capture signs for first 4 data items
end loop;

for mname = market_names(j) loop
  j2 := 2 * j + 3;
  print(mname," ",data(j2)," ",data(j)," ",data(j2 + 1));
end loop;

procedure back3(stg); -- extract numerical data from right end of lines;
                      -- remove other HTML clutter
  for j in [1..3] loop
    rbreak(stg,"<"); -- back over 3 HTML tags
    rmatch(stg,"<");
  end loop;
  data := rbreak(stg,">"); -- extract numerical data, back to next tag on right
  data := suppress_chars(data,"nbsp;"); -- suppress 'nbsp;'
  return "" +/ [if c = "&" then " " else c end if: c in data];
      -- change remaining '&' signs to blanks
end back3;

end test;

Note that the data tuple as first constructed has the form
[" 43.45", " 61.94", " 11.51", " 1/8", "10903.32", "-0.40%", "2427.72", "-2.49%", "1318.80", "-0.87%", "99 9/32", " 5.42%"]The first 4 items in this tuple lack the '-' sign that they need if the market is falling. (This is because these signs appear in cnnfn's page as icons). But the necessary signs are readily extracted from the corresponding percentage changes in data items 6, 8, 10,and 12. Doing this, our data tuple reads
["- 43.45", "- 61.94", "- 11.51", " 1/8", "10903.32", "-0.40%", "2427.72", "-2.49%", "1318.80", "-0.87%", "99 9/32", " 5.42%"]As of the early morning of Feb. 14, 2001 the output of the above program is
DOW 10903.32 - 43.45 -0.40%
NASDAQ 2427.72 - 61.94 -2.49%
S&P 1318.80 - 11.51 -0.87%
BOND 99 7/32 3/16 5.42%

As of 10:13 AM of Feb. 14 it produces
DOW 10855.29 - 48.03 -0.44%
NASDAQ 2421.29 - 6.43 -0.26%
S&P 1312.06 - 6.74 -0.51%
BOND 99 5/16 3/32 5.42%

(This illustrates the remark allegedly made by J.P. Morgan when asked to predict the future course of the stock market: "The market will fluctuate.")
Here is another example. The 'cnnfn' page we have been discussing also provides a stock quotation service. One can enter comma-separated 3-letter codes for several stocks (e.g. IBM,CNN) into a small text box and click on a 'Go' button. An HTML page containing the requested data is then generated and returned. If this is done in a browser it will be seen that the URL being requested is simply
That is, the parameter for the request (the part of this URL that follows the separating '?' character) is simply 'symbols=ibm%2Ccnn', essentially the two stock symbols separated by the 'Internet sanitized' form '%2C' of the comma character. This easy visual inspection makes it plain that the 'http_get' call
should retrieve this same data, as indeed it does. It is generally easy to read 'get'-generated URLs in just this way and so to write small routines which make equivalent Web services available as SETL procedures built using 'http_get'.
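Such a wrapper might look like the following sketch. The URL shown is only a hypothetical stand-in (the real quote-service URL is the one read off from the browser as just described), and 'join' is the string_utility_pak procedure, assumed to be available via a 'use' clause in the enclosing program.

procedure stock_quotes(symbol_tup); -- sketch: wrapping a 'get'-style Web service as a SETL procedure
  base_url := "www.example.com/quote?symbols="; -- hypothetical stand-in for the real service URL
  return http_get(base_url + join(symbol_tup,"%2C")); -- join the symbols with the sanitized comma '%2C'
end stock_quotes;

A call like stock_quotes(["ibm","cnn"]) then returns the same HTML text that the interactive page would display.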
The 'http_post' routine. The http_post operation has the form
It is generally used when a relatively large 'stg' parameter needs to be transmitted along with a Web URL request. In interactive pages, this data is commonly collected from For example, the cnnfn page we have been discussing