MUMPS

MUMPS, or simply M, is a programming language dedicated to building and managing databases. Whereas in most systems the database is the first-class citizen and a language is added on top, under M this is inverted: the language itself is the primary object, and the database is a "side effect" of one of the features of the language.

M contrasts strongly with most database systems because it is much "lower level". For instance, whereas most database systems include a command to find all the records matching a particular pattern, on an M system you would have to write a program to do this search and collect up the results. As you might imagine, this makes even trivial tasks much more difficult, and has led to a number of M-based programs that act as database management systems and provide these features.

For people used to traditional database applications or database management systems, M can be a bit difficult to understand at first. However, its overwhelming speed and flexibility in dealing with many tasks that would cause problems under the relational database model lead many to claim that M is the best-kept secret in the IT industry. Much of this secrecy seems self-imposed, however: finding good introductory information on M is difficult, and the commercial side of the M market is fractured.

History

MUMPS started life as the Massachusetts General Hospital Utility Multi-Programming System, developed in Octo Barnett's animal lab at Massachusetts General Hospital in Boston in 1966/7. Based on the then-common hierarchical database model, MUMPS added an interpreted language to standardize interaction with the database. The MUMPS team deliberately chose to write a new language with portability in mind. Multitasking, a feature not widely supported in operating systems of the era, was also built into MUMPS.

The original MUMPS system was built on a spare DEC PDP-7, but it was soon ported to a PDP-15, where it lived for some time. Developed on a government grant, MUMPS was required to be released in the public domain (no longer a requirement for grants), and was soon in use in a number of other organizations and ported to a number of other systems, including the popular PDP-8 and Data General Nova minicomputers. Word of MUMPS spread mostly through the medical community, and by the early 1970s it was in widespread use.

In 1972 various MUMPS users gathered to standardize the now-fractured language, creating the MUMPS Users Group and the MUMPS Development Committee. These efforts proved successful; a standard was complete by 1974, and by 1977 they had turned it into an ANSI standard. The group was later responsible for changing the name from MUMPS to M in 1990, after repeatedly seeing people reject the product out of hand because of its name. Over a decade later, however, most people still refer to the product as MUMPS.

The Veterans Administration (today known as the United States Department of Veterans Affairs) officially adopted MUMPS in the early 1980s as the programming language for a patient admission, tracking and discharge system. The original version, the Decentralized Hospital Computer Program (DHCP), was delivered early and under budget. DHCP has been continuously extended in the years since, and is available at no cost in source code. In order to implement DHCP, today known as VistA, the VA also wrote an intermediate layer in MUMPS known as FileMan to act as a database management system. Nearly the entire VA hospital system in the United States, the Indian Health Service, and major parts of the Department of Defense hospital system (whose Composite Health Care System (CHCS) differs from the VA's for historical reasons) all still run the system for clinical data tracking.

M also gained a following in the financial sector, in this case due to its much higher performance compared to traditional SQL-based systems. Given similar hardware, multidimensional databases like M are typically about six times faster than SQL for transaction processing, making them ideal for online systems like banking. They also range from as good to hundreds of times faster on queries, with more complex queries always favouring the multidimensional approach. They are also particularly good at looking up related information in other data stores "for free", a task that requires an expensive JOIN operation in SQL systems.

DEC became interested in the widespread use of MUMPS and decided to create its own standardized version in the 1980s, known as DSM (DEC Standard MUMPS). It quickly became the de facto standard on DEC machines, and was later ported to DEC Alpha-based systems running both VMS and Unix. In 1990 DSM was purchased by InterSystems, which released it on a number of platforms as OpenM (although nothing about it was "open") and/or ISM (InterSystems...). Through the 1990s InterSystems started buying up other MUMPS vendors, including the other "standard" from the IBM mainframe world, MSM (Micronetics Standard MUMPS). Since then InterSystems has increasingly distanced itself from its M history, referring to its product as Caché and removing any mention of M from its literature.

General use of M appears to be slowly disappearing. This appears to be a result of the industry's failure to provide a clear and compelling message comparing M with traditional SQL systems. There appears to be no "M for SQL Programmers"-type introductory information available on the Internet, and the small number of books on M are difficult to find. To be truly usable, M systems require a "higher level" layer to act as a database manager, and while M is now well standardized, there is no such standard for these higher levels, adding to the confusion.

A recent release of the industrial-quality GT.M under the GPL may help address this to some degree by providing a single, free target for the M community. Several database management layers are available for GT.M, and with a little effort a single suggested platform could evolve.

Description

M is typically an interpreted language, and shares basic syntax with common 1960s data processing languages, most notably COBOL. Commands are listed one to a line, with whitespace being significant, and grouped into procedures (subroutines) in a fashion similar to most structured programming systems. Procedures are simply strings, so they can be easily stored in the underlying datastore, meaning that there is no need for a separate "stored procedure" concept as there is in SQL: anything can be stored.

A typical M procedure consists of several "blocks", each block separated by a label (known as a tag in M-speak) in the first column. Calling into the procedure with no tag results in the entire procedure being run, whereas calling with a tag skips to that point. This allows programmers to place interactive commands at the top of the procedure and then tag the actual start of the code itself, allowing the procedure to be used both interactively and as a function to be called from other code (see the GRASS article for similar examples).

One main difference between M and most other languages is that M has only a single data type, the string, which it invisibly converts into common data types such as numbers or dates. As you might expect, M includes a complete and powerful set of string manipulation commands, grouped into libraries. Automated conversion of this sort is common to many scripting languages, but is generally considered a bad thing for most languages because it can all too easily lead to mistakes that are difficult to debug. In the case of M this would be hard to avoid, however, as the underlying datastore would grow in complexity if it had to deal with different types.
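
The string-to-number conversion can be modelled roughly as "use the leading numeric part of the string, or zero if there is none". The sketch below is Python rather than M, chosen only so the rule can be run and checked; the function name numeric_value is hypothetical, and exponent notation and repeated leading signs are deliberately left out.

```python
import re
from decimal import Decimal

# Optional sign, then digits with an optional fraction, or a bare fraction.
_LEADING_NUMBER = re.compile(r"[+-]?(?:\d+(?:\.\d+)?|\.\d+)")


def numeric_value(text: str) -> Decimal:
    """Simplified model of reading a string as a number.

    Only the numeric prefix counts; a string with no numeric prefix is zero.
    """
    match = _LEADING_NUMBER.match(text)
    return Decimal(match.group(0)) if match else Decimal(0)


print(numeric_value("3 apples") + 2)   # 5
print(numeric_value("abc"))            # 0
```

A value such as "3 apples" silently taking part in arithmetic is exactly the kind of conversion that makes mistakes hard to spot.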

The key to the M system is that all variables are automatically multi-dimensional. For instance, this command:

SET A="abc"

creates the variable A and sets its value to the string. The same variable can then be used to hold additional information:

SET A(1,2)="def"

This places the string def into "slot" (1,2). Slots can also be designated with strings:

SET A("first_name")="Bob" SET A("last_name")="Dobbs"

making the variables useful data stores on their own. Note that this example also demonstrates another feature of M, that assignments into variables do not erase other information already there. This makes it easy to "build up" a complex variable, using several assignments.
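
One way to picture such a variable is as a sparse map from subscript lists to strings, where the unsubscripted value is simply the entry with no subscripts. The following is a Python model, not M code; the class MVariable and its methods are hypothetical names, and returning an empty string for an unset slot is a simplification.

```python
class MVariable:
    """Toy model of one M variable: a sparse map from subscripts to strings."""

    def __init__(self):
        self._nodes = {}

    def set(self, value, *subscripts):
        # Every value is stored as a string, keyed by its subscript tuple.
        self._nodes[subscripts] = str(value)

    def get(self, *subscripts):
        # Simplification: an unset slot reads as the empty string.
        return self._nodes.get(subscripts, "")

    def defined(self, *subscripts):
        return subscripts in self._nodes


a = MVariable()
a.set("abc")                  # models SET A="abc"
a.set("def", 1, 2)            # models SET A(1,2)="def"
a.set("Bob", "first_name")    # models SET A("first_name")="Bob"
a.set("Dobbs", "last_name")   # models SET A("last_name")="Dobbs"
```

After these four assignments every value is still present, and slot (1) itself remains undefined even though (1,2) holds data.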

M variables work much like those in other programming languages, in that when the program exits, their values are lost. M comes into its own with its concept of globals, variables which are automatically and invisibly stored to the datastore. Globals appear as normal variables with the caret character in front of the name. Modifying the earlier example thus:

SET ^A("first_name")="Bob" SET ^A("last_name")="Dobbs"

will result in a new record being created and inserted in the datastore.

One difference between M and the traditional SQL model of a database is the "level" of the commands in the language. M is a general purpose language with a datastore, whereas SQL is a language dedicated to database functions. This might sound like a minor distinction, but it is rather important to understand it. For instance, SQL includes a search function:

SELECT * FROM USER WHERE first_name LIKE 'Bob%'

which returns a list of matching records. M has no equivalent "high level" command like SELECT; instead the programmer must construct a small routine to collect the matching records returned from its lower-level functions. It should also be noted that M does not include any transaction controls: all changes to globals happen instantly, and there is no logging in the basic system. Nor does M include any sort of user-based security.
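
The shape of such a hand-written search routine can be sketched as a loop that walks the stored records in key order and keeps the ones that match. This is a Python model rather than M code; the users dictionary, its sample contents and the function name are all made up for illustration.

```python
# Made-up sample data standing in for a USER global keyed by record number.
users = {
    1: {"first_name": "Bob", "last_name": "Dobbs"},
    2: {"first_name": "Alice", "last_name": "Smith"},
    3: {"first_name": "Bobby", "last_name": "Jones"},
}


def find_by_first_name_prefix(store, prefix):
    """Walk every record in key order and collect the matching record numbers."""
    matches = []
    for record_id in sorted(store):
        first_name = store[record_id].get("first_name", "")
        if first_name.startswith(prefix):
            matches.append(record_id)
    return matches


print(find_by_first_name_prefix(users, "Bob"))   # [1, 3]
```

Everything SELECT does implicitly (the iteration, the test and the collection of results) is spelled out by the programmer.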

For all of these reasons one of the most common M programs is a database management system, providing all of the classic ACID properties on top of a generic M implementation. FileMan is one such example. In the 1990s many of these layers were adapted to supporting SQL, turning most M systems into SQL systems. Although the user might be "fooled" (to some degree) into seeing the system as a SQL database, the system nevertheless retains the speed advantages of the M datastore (see below).

A side effect of the way M evolved is that the M system includes fairly complete support for multi-tasking, multi-user, multi-machine programming. The first two features are now commonplace on most operating systems, but the latter is still not cleanly supported by most systems. To demonstrate the ease of multi-machine support, consider:

SET ^|"DENVER"|A("first_name")="Bob" SET ^|"DENVER"|A("last_name")="Dobbs"

which sets up A as before, but this time on the remote machine called "DENVER". M programs are thus trivial to distribute over many machines, a feature that is still difficult on most SQL systems. This support also made it easy to expose the same sorts of distribution in the SQL (and other) layers, and it's not uncommon for M systems to be a better distributed SQL solution than a "real" SQL system.

Another use of M in more recent times has been to create object databases. By "flattening" objects into a string representation, like XML, M systems can be used to store objects. An M program then converts back and forth between the "real" objects and the string representations under it, and is able to do so much faster than similar object-relational mapping systems running over relational databases. This should be expected: if M can be used to make a SQL that's faster than SQL, making an object database that doesn't require conversion to and from SQL in the middle is bound to be even faster.
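
The flattening step can be illustrated with a small round trip between an object and an XML string. This is a Python sketch of the general idea only, not of any particular M object database; the Person class and the flatten and inflate helpers are hypothetical, and a plain dictionary stands in for the datastore.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, fields


@dataclass
class Person:
    first_name: str
    last_name: str


def flatten(obj) -> str:
    """Turn a dataclass instance into a single XML string."""
    root = ET.Element(type(obj).__name__)
    for field in fields(obj):
        ET.SubElement(root, field.name).text = str(getattr(obj, field.name))
    return ET.tostring(root, encoding="unicode")


def inflate(text: str) -> Person:
    """Rebuild a Person from the string produced by flatten()."""
    root = ET.fromstring(text)
    return Person(**{child.tag: child.text or "" for child in root})


# A dictionary stands in for the datastore: one string per stored object.
store = {("PERSON", 1): flatten(Person("Bob", "Dobbs"))}
restored = inflate(store[("PERSON", 1)])
```

Because the datastore only ever sees a string, no mapping of fields onto table columns is needed in either direction.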

The MUMPS Datastore

In the relational model, datastores consist of a number of tables, each one holding records for some particular object (or "entity"). For instance, an address book application would typically contain tables for PEOPLE, ADDRESSES and PHONE_NUMBERS. The tables consist of a number of fixed-width columns holding one basic piece of data (like "first_name"), and each record is a row.

In this example any row in ADDRESSES is "for" a particular row in PEOPLE. SQL does not understand the concept of "ownership", however, and requires the user to collect this information back up. SQL supports this by convention, using the concept of a foreign key: copying some unique bit of data found in the PEOPLE table into the ADDRESSES table.

To re-create a single "real" record for the address book, the user must instruct the database to collect up the row in PEOPLE they are interested in, extract the key, and then search the ADDRESSES and PHONE_NUMBERS tables for all the rows containing this key. SQL offers a simple way to do this however:

SELECT * FROM PEOPLE p, ADDRESSES a, PHONE_NUMBERS n
  WHERE p.id = a.person_id
    AND p.id = n.person_id
    AND p.first_name = 'Bob'

In this example the WHERE looks for all the a's and n's (addresses and phone numbers) that have the person's ID tag stored inside them, but only for p's named Bob.

This trivial example already requires three lookups in different tables to return the data, which, as you might expect, is very slow. In order to improve performance, the database administrator will place an index on heavily-searched columns, in this example the person_id columns. An index consists of a column containing the data to be found, and the record number of the matching row in the table. This is the reason that tables are fixed width, so that the database can easily figure out the physical location of the record given the location of the start of the table, the length of any row, and the number of rows to skip. Without this simplification, performance of the relational model would be unusable.
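
The address arithmetic that fixed-width rows make possible is simple enough to show directly: the position of a row is the start of the table plus the row number times the row length. The sketch below is Python, with a 16-byte row length and sample names chosen purely for illustration.

```python
ROW_LENGTH = 16  # assumed fixed width of every row, in bytes


def row_offset(table_start, row_number, row_length=ROW_LENGTH):
    """Physical position of a row: start of table plus rows skipped times row length."""
    return table_start + row_number * row_length


def read_row(table, row_number, row_length=ROW_LENGTH):
    """Slice one row out of the table and strip its padding."""
    start = row_offset(0, row_number, row_length)
    return table[start:start + row_length].decode("ascii").rstrip()


# Build a tiny table: each name is padded with spaces to the fixed width.
names = ["Bob", "Connie", "Ivan"]
table = b"".join(name.ljust(ROW_LENGTH).encode("ascii") for name in names)

print(row_offset(1000, 3))   # 1048
print(read_row(table, 1))    # Connie
```

The padding is the price of this shortcut: the three short names above occupy 48 bytes regardless of their actual length.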

M's datastore stores only the physical locations. This means that records can be of any length, placed anywhere, and contain anything. Searching is not needed to find any record; a pointer directly to that record is easily retrieved and followed to the data in question. The physical data in an M system is typically stored in a "blob" of strings, one after the other. This provides another advantage over the relational model, as empty cells do not take up room as they do in the fixed-length relational table. M databases are therefore smaller than relational ones, which is another reason for their increased performance (fewer disk operations).

So why doesn't the relational model do the same thing? Historical accident. At the time the difference in speed between storage and processor was much smaller than it is today, and having the CPU follow a pointer was expensive compared to the simple arithmetic needed for an index. Today CPUs have grown many times faster than storage, so this cost is effectively zero. This is the main reason why multidimensional datastores outperform relational ones today, something that was not true in the 1970s when the two models were in competition.

M globals are, in fact, indexes. Each node in the global contains a pointer to the data, just as an index does in the relational model. Unlike the relational model, where indexes are a special-purpose object included as a necessary evil used in some lookups, under M indexes are first-class citizens that are used for all data access. This is yet another reason for M's performance.

This makes M systems particularly well suited to looking up related data, as in the example above. Expressed in the same SQL-like notation, the equivalent M lookup would be something more akin to:

SELECT * FROM PEOPLE p, ADDRESSES a, PHONE_NUMBERS n
  WHERE p.first_name = 'Bob'

Related information can be stored directly in the index, in p.addresses for example. In this case no lookup is needed; PEOPLE can point directly to the addresses and phone numbers.
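
A nested structure gives a feel for what "stored directly in the index" means: the addresses and phone numbers hang off the person's own entry, so finding the person also finds them. This is a Python model with made-up sample data and hypothetical names, not a description of how any M implementation lays out its storage.

```python
# Made-up sample data: related records live under the person they belong to.
people = {
    17: {
        "first_name": "Bob",
        "last_name": "Dobbs",
        "addresses": {1: "1 Main St", 2: "PO Box 100"},
        "phone_numbers": {1: "555-0100"},
    },
    42: {
        "first_name": "Connie",
        "last_name": "Dobbs",
        "addresses": {},
        "phone_numbers": {},
    },
}


def records_named(store, first_name):
    """Return complete records (addresses and phones included) for a first name."""
    return [
        record
        for _person_id, record in sorted(store.items())
        if record["first_name"] == first_name
    ]


bob = records_named(people, "Bob")[0]
```

Once bob has been found, bob["addresses"] and bob["phone_numbers"] are already in hand; there is no person_id to match against other tables.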

The biggest consequence of this internal representation is that database operations are economical (in both disk space and execution time). M is extremely well suited to real-world data, which is often "sparse" (i.e. has missing fields). There is no penalty in storage space if a defined data value is not present. This is an extremely helpful feature in a clinical context.

M includes almost no operating-system-specific command syntax, very few file system interface commands, and no machine-specific commands. It is thus quite portable. Additionally, database manipulation code is extremely brief. An M routine implementing a complex database interaction might be a page or two of code. The equivalent in a lower-level language (C, Pascal, Fortran, ...) is likely to be an order of magnitude larger. M is a highly cost-effective application programming tool.

All Wikipedia text is available under the terms of the GNU Free Documentation License
