Medical Informatics/Case Studies of Computer Based Medical Records

A feature-based evolution of GP computerized medical records
Prior to the widespread introduction of PC-based computer medical records, mainly running on the Windows operating system, general practice medical records in Australia used the RACGP standard medical record forms for patient registration, summary, and progress notes.

Using a computer-based medical record system increased script legibility, so that medication errors arising from written miscommunication between GP and pharmacist were, in theory, reduced; it also eliminated one source of business inefficiency, where pharmacists had to phone GPs to clarify illegible scripts. It became apparent that referrals to specialists could be made faster when an integrated address book and word-processor functionality were added. The medical justification was that specialists could also get summary lists of medical conditions and prescribed medications automatically included by the form-letter printing facility, thus improving the patient handover process, but this was rarely relied upon, except perhaps by surgeons.

Pathology and radiology reports could also be downloaded by Australian medical record systems, usually via a proprietary download client outsourced to the pathology provider, which placed result files in a file directory for GP record systems to parse, given some simple line-oriented format such as PIT. This was in the context of internet access developing from dial-up to always-on ADSL/cable, and the requirement to develop secure transfers over the internet using standard encryption and authentication protocols.

In the Australian context, the early adoption of one vendor's solution meant that the database technology used to store these small-business records remained fairly stable for over a decade; in fact, the DBF file format of dBase/FoxPro became the foundation of most general practice electronic medical records for many years.

Since general practice operated mostly at small-business scale, there was little need to change the simple but elegant database technology of the DBF format, and the opportunistic file locking of the Windows network filesystem sufficed even in multi-doctor practices with concurrent access by 10 or more users. The B-tree based dBase index files provided the scalability as records accumulated years of patient progress notes, actions, vaccinations, referrals, scanned reply letters, and downloaded text results. But the file structure of dBase index files was not openly documented, and the inability to rapidly find all the associated records for a given patient without detailed know-how about dBase IDX/MDX files was one major hurdle to customizing the functionality of the GP computer-based medical record.

For a period, this meant that it was well within the intellectual resources of any general practitioner to become the master of his own record system, as the DBF file format was well known: it is a logical heap file format, where records are simply appended to the end of the file storing a particular table's data. But the lack of easily available and understandable information about the dBase *index* file structure remained a major hurdle.

This case study will try to illustrate how the simple, logical, transparent organization of DBF files can be combined with a basic understanding of one particular computer algorithm, Linear Hashing, to provide the means for extending functionality. With the move to vendor-specific implementations of SQL database systems, this era has sadly ended, and future-proofing and protection from vendor market withdrawal are much harder.

Background computer science
Modern database systems are usually fronted by a SQL declarative programming interface, as SQL is comparatively easy to learn for non-programmers, but it needs quite a lot of infrastructure to support it, and it puts quite a distance between the language grammar of SQL and the underlying file formats and algorithms used for retrieval. Most database systems still rely on indices to indicate where a record is stored given a key value for that record, and the algorithms behind indices are based either on variants of B-trees or on hashing. The main advantage of B-tree indexes is that index keys are stored sorted, so that it is possible to retrieve a range of records given a lower and an upper key value. For example, all records whose patient.surname starts with Smi could be retrieved by using Smia and Smiz as the lower and upper range values.
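
The sorted-key range retrieval described above can be demonstrated with Java's TreeMap (a red-black tree rather than a B-tree, but the same sorted-index idea applies). The surnames and record IDs here are illustrative.

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class RangeLookup {
    public static void main(String[] args) {
        // Sorted index keyed on surname; values stand in for record IDs.
        TreeMap<String, Integer> surnameIndex = new TreeMap<>();
        surnameIndex.put("SMART", 1);
        surnameIndex.put("SMITH", 2);
        surnameIndex.put("SMYTHE", 3);
        surnameIndex.put("JONES", 4);
        // Range scan: every surname from "SMI" (inclusive) up to "SMJ" (exclusive).
        SortedMap<String, Integer> smis = surnameIndex.subMap("SMI", "SMJ");
        System.out.println(smis.keySet());  // only SMITH falls in the range
    }
}
```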

Hashing algorithms are used in indices mainly because of their relative speed; roughly speaking, it takes longer for a monkey to traverse the branches of a tree to get to a banana than to read a sign pointing to the banana. Thus hash indices are generally faster when implementing a *join* operation. As an example of a join, a patient's medical record might consist of entries in a demographic table, referenced by an ID field, called the primary key. The patient's medications might be stored in a medications table, which also has a DEMOGRAPHIC_ID field, termed a foreign key, so that with a join between the demographic table and the medications table we can, by identifying a patient with an ID value, determine what medications the patient is on.
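
A minimal sketch of such a hash join follows; the table and field names are illustrative stand-ins, not the vendor's schema. The demographic table is hashed on its primary key (the build phase), then each medication row probes that hash with its foreign key.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashJoin {
    // Hypothetical row types for the two tables.
    record Demographic(int id, String surname) {}
    record Medication(int demographicId, String drug) {}

    public static void main(String[] args) {
        List<Demographic> patients = List.of(
                new Demographic(1, "SMITH"), new Demographic(2, "JONES"));
        List<Medication> meds = List.of(
                new Medication(1, "metformin"),
                new Medication(1, "perindopril"),
                new Medication(2, "aspirin"));

        // Build phase: hash the demographic table on its primary key.
        Map<Integer, Demographic> byId = new HashMap<>();
        for (Demographic d : patients) byId.put(d.id(), d);

        // Probe phase: each medication row looks up its foreign key in O(1).
        for (Medication m : meds) {
            Demographic d = byId.get(m.demographicId());
            System.out.println(d.surname() + " -> " + m.drug());
        }
    }
}
```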

What is hashing? Hashing is where some data value is translated into an integer. Generally, a good hashing function will generate a roughly random integer for the domain of data values to be translated. The reason a good hash function is required is that the hash value produced is used as an index into a storage location, which in this context is a file location within a hash index file. Therefore, for space to be used efficiently, the hash values derived from the population of input data should be roughly evenly distributed across the storage addressing range.

Another useful concept is the collision, where two different values hash to the same integer. To handle collisions, extra space can be allocated at each location, giving a fixed number of slots; this is called a bucket (of slots). What if the number of keys hashed to one location exceeds the number of slots in a bucket? Then another bucket can be referenced by the full bucket, and this is called an overflow bucket (or block, or space).

Because of the smaller scale of general practice databases, it could be argued that a standard hash algorithm could be used to implement external index files, using very large index file spaces with a bucket for each possible hash value, and overflow buckets appended to the end of the index file if ever required.

A more elegant algorithm is Linear Hashing, prominent enough to be mentioned in almost any decent database fundamentals textbook, along with Extendible Hashing, as the hashing algorithms of choice for the large, file-based index requirements of database systems.

The main limitation of file-based indices is the nature of files: they are easy to create and easy to append to, but it is expensive to insert a block of data in the middle of a file, as the rest of the file after the insertion point must be copied to a temporary file, the original file truncated at the insertion point, the new block appended to the truncated file, and the temporary file's contents appended after that. B-trees expand a node/block at a time, and a new node can simply be appended to the end of the file. Extendible and linear hashes can also expand a block at a time, although extendible hashing probably requires either an estimate of the maximal size of the directory structure reserved at the start of the index file, or a separate directory file, because its directory periodically doubles in size.

B-tree indices use an algorithm where keys are associated with pointers to file addresses marking the start of other blocks of file data representing other nodes in the tree. The B-tree is self-balancing because overflow of entries causes a node to split, and an extra key demarcating the two new nodes is passed up to the parent node, which slows the growth of the height of the tree; this eventually leads to a split of the root node. New nodes are simply appended to the end of the index file, and file offset values (node "pointers") allow random file access to traverse the blocks from the root downwards.

Extendible and linear hashes are kin to radix searches, because the address space is increased by increasing the number of bits used in a hash value, each extra bit doubling the address space. The Wikipedia articles give an adequate description of extendible and linear hashing.

However, since Linear Hashing is used in the code below, a brief recap is in order. Linear hashing works by using a conditional hash function that depends on an iteration variable, split, which on the first round ranges over hash values 0 to N-1, where N is the initial number of buckets. The conditional hash function works as follows: h' = h mod M, where M = N x 2^level; if h' < split, then h' = h mod M', where M' = N x 2^(level+1).
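
The conditional hash function can be written down directly; N, level, and split here are assumed state variables of the hash structure, with illustrative values.

```java
public class LinearHashFunction {
    static final int N = 4;   // initial bucket count
    static int level = 0;     // doubling rounds completed so far
    static int split = 0;     // next bucket due to be split

    // h' = h mod (N * 2^level); buckets already split this round
    // are re-addressed with h mod (N * 2^(level+1)).
    static int address(int h) {
        int M = N << level;
        int a = Math.floorMod(h, M);           // floorMod keeps the result non-negative
        if (a < split) a = Math.floorMod(h, M << 1);
        return a;
    }

    public static void main(String[] args) {
        split = 1;  // suppose bucket 0 has been split, so bucket N (= 4) now exists
        System.out.println(address(8));   // 8 mod 4 = 0 < split, so 8 mod 8 = 0
        System.out.println(address(12));  // 12 mod 4 = 0 < split, so 12 mod 8 = 4
        System.out.println(address(5));   // 5 mod 4 = 1, not < split, stays 1
    }
}
```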

Whenever any bucket overflows, more space is needed. This is handled temporarily by creating an overflow bucket attached to the bucket that overflowed, but it is also handled by appending a bucket to the current list of buckets; so when the split variable is zero, the appended bucket is bucket N. Once this is done, the split variable is incremented: thereafter the conditional hash function will use mod 2N instead of mod N for any value originally hashed to zero, so that records will hash to either 0 or N. That is, if y = x mod R and y' = x mod 2R, then y' = y or y' = y + R for any non-negative integer x. This conveniently hashes the values in buckets already traversed by split into either the original bucket n, or bucket n + N x 2^level. This is linear hashing, because the number of buckets grows linearly. Whenever the split variable reaches N x 2^level (or N left-shifted level times), level is incremented and split is reset to 0, so that the conditional hash function can address the next round of added buckets, while ranging round-robin across all existing buckets and splitting them before they accumulate too many records: while split ranges over 0 to N-1, buckets N to 2N-1 become accessible; while it ranges over 0 to 2N-1, buckets 2N to 4N-1 become accessible, and so on.
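
The whole split cycle can be sketched as a small in-memory linear hash table of integers (no disk blocks yet); the initial bucket count and bucket capacity are illustrative, and for simplicity an overflowing bucket just grows its list until the round-robin split catches up with it.

```java
import java.util.ArrayList;
import java.util.List;

public class LinearHashTable {
    static final int N = 2;          // initial number of buckets
    static final int CAPACITY = 2;   // slots per bucket before a split is triggered
    int level = 0, split = 0;
    List<List<Integer>> buckets = new ArrayList<>();

    LinearHashTable() { for (int i = 0; i < N; i++) buckets.add(new ArrayList<>()); }

    // The conditional hash function described in the text.
    int address(int h) {
        int M = N << level;
        int a = Math.floorMod(h, M);
        if (a < split) a = Math.floorMod(h, M << 1);
        return a;
    }

    void insert(int key) {
        List<Integer> b = buckets.get(address(key));
        b.add(key);                   // overflow tolerated until the split reaches this bucket
        if (b.size() > CAPACITY) grow();
    }

    // Split the bucket at index `split`: its keys re-hash to either
    // bucket `split` or the newly appended bucket split + N*2^level.
    void grow() {
        buckets.add(new ArrayList<>());
        List<Integer> old = buckets.get(split);
        List<Integer> keep = new ArrayList<>();
        split++;                                 // advance before re-addressing
        for (int k : old) {
            int a = address(k);
            if (a == split - 1) keep.add(k); else buckets.get(a).add(k);
        }
        buckets.set(split - 1, keep);
        if (split == N << level) { level++; split = 0; }  // round complete: double again
    }

    boolean contains(int key) { return buckets.get(address(key)).contains(key); }

    public static void main(String[] args) {
        LinearHashTable t = new LinearHashTable();
        for (int k = 0; k < 20; k++) t.insert(k);
        System.out.println(t.contains(13) && t.contains(7) && !t.contains(99));
    }
}
```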

Case study account: allergy checker, and OS directory-based file logs of consultations
In this case study, the initial "itch to scratch" was a user requirement that users be monitored for their entry of allergies into medical records: although doctors almost automatically ask whether any allergies exist before prescribing, the expectation in Australian general practice was that allergies be recorded for all patients, regardless of whether the currency of the allergy list was checked at each visit. This led to the implementation of an external program, written in Java, to inspect the DBF tables related to allergies. Initially it was found that the in-memory Map classes (HashMap, TreeMap) were up to the task of indexing these tables on startup of the program and maintaining a temporary in-memory index to facilitate lookup. A method was found, by polling the operating system registry of program-associated attributes, to detect when a patient record was opened, so that a lookup of the allergy tables could be triggered and a reminder window flashed up if allergy recording had not yet been done; this check occurred periodically using a Java Timer class.

The next user requirement was that consultation statistics be recorded, to help with reviewing previous consultations, so that the clinician could review his practice if he so wished, and easily find patients he had recently seen, according to detail not provided by the medical records program. This required lookups in a table whose size exceeded the capacity of the in-memory Java Map classes, so it was proposed that a Linear Hashing implementation be used to scalably index such tables.

Unfortunately, the initial implementation, which mixed in block management as file entities, did not work and was difficult to debug, so the second implementation concentrated on proving the algorithm worked in memory first.

Once the bugs were ironed out (mainly due to Java's lack of unsigned integer types introducing a bug into the dynamic hash function), the in-memory Linear Hashing implementation was used as a cache structure for a file-based block manager. A block was basically saved as a file whose filename was the block number.
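
The text does not spell out the signedness bug, but the usual form it takes, assumed here, is that a Java hash value can be negative, and the % operator then yields a negative remainder, producing an invalid bucket address. Either Math.floorMod or masking off the sign bit avoids it (the two agree when the bucket count is a power of two).

```java
public class NegativeModulo {
    public static void main(String[] args) {
        int h = -7;        // Java has no unsigned int; hash values can be negative
        int buckets = 8;
        System.out.println(h % buckets);                // -7 : not a valid bucket address
        System.out.println(Math.floorMod(h, buckets));  // 1  : always in 0..buckets-1
        System.out.println((h & 0x7fffffff) % buckets); // 1  : mask the sign bit first
    }
}
```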

Because Java's built-in LinkedList type can store null values at a given index location, the null value was used as the marker of a cache miss, requiring the file block manager to load in the block.

Whenever a block was loaded, a check was also made comparing the runtime free memory with the size of the block list multiplied by the average block size; if free memory fell below a certain multiple of this, a number of non-empty blocks would be chosen by traversing the list and written to disk. Once a block was written, its data was made available to the garbage collector by clearing any references to it (e.g. calling map.clear), and then the garbage collector was invoked (System.gc). This would result in an increase in available Java memory for data object allocation.
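
A hypothetical sketch of that memory-pressure check follows; the class and method names, the safety multiple, and the forced threshold in main are all illustrative, not the original program's code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BlockCache {
    // A cached block; null entries in `blocks` mark blocks not currently in memory.
    static class Block { Map<Integer, List<Integer>> entries = new HashMap<>(); }

    List<Block> blocks = new ArrayList<>();
    long avgBlockBytes = 4096;       // illustrative estimate of average block size

    void maybeEvict() {
        long free = Runtime.getRuntime().freeMemory();
        long estimated = (long) blocks.size() * avgBlockBytes;
        if (free < 3 * estimated) {          // illustrative safety multiple
            for (int i = 0; i < blocks.size(); i++) {
                Block b = blocks.get(i);
                if (b != null && !b.entries.isEmpty()) {
                    writeToDisk(i, b);       // persist before dropping
                    b.entries.clear();       // release references for the GC
                    blocks.set(i, null);     // null now marks a future cache miss
                }
            }
            System.gc();                     // request a collection, as the text describes
        }
    }

    void writeToDisk(int blockNo, Block b) { /* serialization omitted in this sketch */ }

    public static void main(String[] args) {
        BlockCache c = new BlockCache();
        Block b = new Block();
        b.entries.put(1, List.of(10, 11));
        c.blocks.add(b);
        c.avgBlockBytes = 1L << 40;          // absurdly large, to force eviction here
        c.maybeEvict();
        System.out.println(c.blocks.get(0) == null);  // evicted: null = cache miss
    }
}
```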

Because the loading and saving of blocks already worked for the caching, it was found that these could also be used to load a previously created linear hash structure, or to save a newly constructed one. This was needed because constructing a persistable index structure from a large number of records took more than 10 minutes, an unacceptable startup time for such a monitoring program.

Below is the Map interface plugin, along with a file manager class with static public methods for saving and loading the index. It illustrates how a basic understanding of algorithms can provide transparency and extensibility in computer-based medical record systems. This doesn't provide the DBF file access logic, but that is relatively straightforward to implement using easily available descriptions of the DBF file structure and basic types. These classes can be used as drop-ins for the Java associative-array Map classes, as programmatic representations of database indexes, which can be built by walking the list of records stored in a DBF table file. Once an index is built and saved to disk, it can be reloaded on future startups of the monitoring program. Before any index is used, the number of entries in the index can be compared with the current number of records recorded in the header of the DBF table file; if it is less, then records can be read from position L, the last indexed record count, to N, the current record count, and added to the index object. Thus, once an index is built, it is periodically updated by checking against the previous number of records in the file, stored as part of the index header on disc whenever the index is saved. A shutdown hook is used to ensure the index is updated on normal shutdown.
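
That catch-up logic can be sketched as follows. The helpers dbfRecordCount and recordKey are hypothetical stand-ins for the DBF access code (a real version would parse the DBF header, where the record count is stored, and the record area).

```java
import java.util.HashMap;
import java.util.Map;

public class IndexCatchUp {
    // Hypothetical stand-ins for DBF access code.
    static int dbfRecordCount() { return 5; }        // N: count from the DBF file header
    static int recordKey(int recordNo) { return recordNo * 10; }

    public static void main(String[] args) {
        Map<Integer, Integer> index = new HashMap<>();
        int indexedUpTo = 3;                         // L: count stored in the index header
        int current = dbfRecordCount();
        // Heap files are append-only, so only the new records need to be read.
        for (int r = indexedUpTo; r < current; r++)
            index.put(recordKey(r), r);
        System.out.println(index.size());            // two new records were indexed
    }
}
```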

The specific use of the linear hash map implementation was to create a progress entry index clustered on patient identity keys. The list of progress notes associated with a given patient was stored as a linked list object against the patient ID as a key. The patient ID was passed through the dynamic hash function, and the key-value pair of patient ID to linked list of progress note IDs was stored as an entry in a Block object of limited size. Given this representation, the reader can infer that either the bucket sizes had to be variable, or a lot of space would be wasted accommodating maximal value-list sizes. Because the blocks were in fact of variable size, it was decided that each block would be a file, with a whole filesystem directory used to hold the linear map's data. This adds the kernel overhead of opening and closing a file when dealing with a block, so the implementation may have been slower because of this simplifying decision to represent a block as a single file instead of storing fixed-size blocks appended to a single index file. Nevertheless, performance was brisk enough when a fully built index was used to retrieve a particular patient's records in order to record consultation review information.
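The block-per-file scheme can be sketched with Java serialization; the directory location and the block contents are illustrative, not the original program's layout.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;

public class BlockFileStore {
    // One directory holds the whole index; each block is one file named by its number.
    static final Path DIR = Paths.get(System.getProperty("java.io.tmpdir"), "lh-index");

    static void saveBlock(int blockNo, HashMap<Integer, ?> block) throws IOException {
        Files.createDirectories(DIR);
        try (ObjectOutputStream out = new ObjectOutputStream(
                Files.newOutputStream(DIR.resolve(Integer.toString(blockNo))))) {
            out.writeObject(block);
        }
    }

    @SuppressWarnings("unchecked")
    static HashMap<Integer, ?> loadBlock(int blockNo) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                Files.newInputStream(DIR.resolve(Integer.toString(blockNo))))) {
            return (HashMap<Integer, ?>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        HashMap<Integer, String> block = new HashMap<>();
        block.put(42, "progress-ids");   // patient ID -> serialized list of note IDs
        saveBlock(7, block);             // written to the file named "7"
        System.out.println(loadBlock(7).get(42));
    }
}
```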

This approach of using the linear hashing map to build indices for joining related tables trades the hidden complexity of SQL join queries for explicit review of table schemas, and for the arguably equal complexity of learning the Java language and using its flexibility to output custom reports. There seems no obstacle to simply translating the methods used into a different language, Python for instance.

What is missing from this simplistic implementation of database indexing? ACID is the shopping-list acronym that comes to mind: atomicity, consistency, isolation and durability, in the face of concurrent access and random system crashes during operation. So far, for single-user machine use as a local allergy checker and consultation statistics recorder, there has been no problem in 3 months of usage, but the desktop machine has never suffered a failure in the middle of an index file update. Such a failure is rare, but because it is not specifically accounted for, it would likely require the index files to be deleted, and a prolonged fresh initial indexing at the next operating system startup (the programs run via the Startup menu of the Windows OS being used).

The easiest way to deal with concurrency is to wrap the Map interface with Java's Collections.synchronizedMap wrapper, and the standard way of dealing with recoverability is write-ahead logging with checkpointing. Some DBMSs use copy-on-write and keep previous versions of written data values, but what is the minimum that will suffice? The main writing operations are overwriting the value of an existing key-value pair, inserting a new key-value pair, and splitting a block. One recovery scheme is called shadow paging, where a copy of the page directory (block list) is kept at each checkpoint, and blocks written to storage are given version numbers, so block x would be stored under the name 'x.v'. This could be implemented by adding a small serializable map attribute that maps a block number to a block.version file name. If a crash happens after the last checkpoint, the shadow block list and block name map are made the current block list and name map. If required, periodic garbage collection can be done by deleting any files whose names contain earlier version numbers than those in the current shadow block name map.
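
The versioned-filename bookkeeping of that shadow scheme can be sketched as below; only the name maps are modelled (actual file I/O omitted), and all names are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class ShadowNames {
    // Maps block number -> current "blockNo.version" file name.
    Map<Integer, String> current = new HashMap<>();
    Map<Integer, String> shadow = new HashMap<>();    // copy taken at the last checkpoint
    Map<Integer, Integer> version = new HashMap<>();

    // Writing block x creates a file "x.v" with the next version; the old file survives.
    String write(int blockNo) {
        int v = version.merge(blockNo, 1, Integer::sum);
        String name = blockNo + "." + v;
        current.put(blockNo, name);
        return name;
    }

    void checkpoint() { shadow = new HashMap<>(current); }

    // After a crash, discard uncommitted names and fall back to the shadow map.
    void recover() { current = new HashMap<>(shadow); }

    public static void main(String[] args) {
        ShadowNames s = new ShadowNames();
        s.write(3);          // block 3 written as file "3.1"
        s.checkpoint();
        s.write(3);          // rewritten as "3.2", but not yet checkpointed
        s.recover();         // simulate a crash before the next checkpoint
        System.out.println(s.current.get(3));  // back to the committed version
    }
}
```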

The "other" useful algorithm - B-trees
Linear hashing is a hash algorithm which is very good for key-based joins, where equality between keys is required. However, a range-searching index is required to answer certain queries, such as names beginning with SM (up to SMZ) to find all the SMITHs, finding all diabetics greater than 50 years old, or finding all patients older than 50 seen by Dr Z in the last 12 months who haven't had a glucose test done.

The algorithm used is usually B-trees, and that holds from popular desktop databases of the 1980s, such as the dBase family (FoxPro, Paradox, etc.), to newer databases such as PostgreSQL, Firebird, Derby DB and others. B-trees were introduced as an indexing algorithm around 1970, and no better algorithm (as practised in implementation) seems to be around when range-searchable indexes are required. It has been said that if a choice between hashing and B-trees is required, the latter provides superset functionality, at the cost of losing the one-or-two disc accesses per lookup that hashing offers.

Linear hashing is easier to understand and more straightforward to map to disc storage, but B-trees are conceptually almost as easy to understand, just a little harder to debug.

What are the important characteristics of B-trees? A B-tree is basically like a binary tree; in fact, a binary tree is really a B-tree where a node has only one entry, whereas a B-tree node may have n entries, where n is around 10-100, and is often quoted as being a size where n x (size of entry) equals the size of a physical disc block. The idea is to load one block into memory and search it; if the search key isn't found, but it falls between two keys, or to the left or right of all the keys, and there exists a pointer to a child block in that position, then follow the pointer, until the search succeeds or there is no block left to follow.

The reason B-trees are used is that when a block becomes full during a key insert, it can be split into three parts: a block with all entries to the left of a median key, an entry with only the median key, and a block with all entries to the right of the median key. The entry with the median key is then passed up to the parent block on return from the insert; the parent inserts that entry into its own entries, and also splits if needed. If the root splits, then the median entry becomes a single-entry block which becomes the new root. Over a series of inserts, this is a self-balancing strategy, and the tree grows in width rather than depth, so that future searches mainly consist of loading a block and doing a fast in-memory width-wise search, with less frequent, slower vertical steps loading a non-cached child block from disc.
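
The three-way split on its own can be illustrated on a node's sorted key list; the node capacity and keys are made up for the example, and child pointers are omitted.

```java
import java.util.ArrayList;
import java.util.List;

public class NodeSplit {
    public static void main(String[] args) {
        // A full node's sorted keys (capacity 4 assumed, so 5 keys means overflow).
        List<Integer> keys = new ArrayList<>(List.of(10, 20, 30, 40, 50));
        int mid = keys.size() / 2;
        List<Integer> left = keys.subList(0, mid);                 // new left sibling
        int median = keys.get(mid);                                // passed up to the parent
        List<Integer> right = keys.subList(mid + 1, keys.size());  // new right sibling
        System.out.println(left + " <" + median + "> " + right);
    }
}
```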

Below is a Java B-tree implementation, in-memory, which can be adapted for disc-based block management much like the linear hashing above, except with a little more tracking work, e.g. using explicit reference IDs for blocks instead of object references, and storing blocks in a map, with null values representing non-cached blocks.


Example application of health data querying: finding patients overdue for HbA1c by more than 6 months
Using the above (linear hashing with block persistence) data structure for permanent-storage-assisted indexes*1, this example program was applied to the above clinical use. Interestingly, the query identified patients who had diabetes listed as a condition, along with co-morbid diagnoses, and atomized pathology HbA1c data when present; but often the patients listed under each practitioner were non-regular patients or had questionable diabetes diagnoses, and the query didn't examine whether patients were currently being managed elsewhere, at another clinic or by an endocrinologist.

An exercise might be to reformulate the query as: group under each practitioner those patients who have a diabetes-related diagnosis and existing pathology atoms for HbA1c, but whose last HbA1c was more than 6 months ago. This would at least select patients with diagnosed diabetes who probably were, at some stage, being monitored by the consulting practitioner.
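
The reformulated query can be expressed as plain Map operations over the indexes; the table and field names here (PathAtom-like rows, a set of diabetic patient IDs) are illustrative stand-ins for the real schema, and the dates are fixed so the example is repeatable.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OverdueHbA1c {
    // Hypothetical atomized pathology row.
    record PathAtom(int patientId, String test, LocalDate date) {}

    public static void main(String[] args) {
        LocalDate today = LocalDate.of(2010, 6, 1);   // fixed "today" for the example
        Set<Integer> diabetics = Set.of(1, 2, 3);     // patients with a diabetes diagnosis
        List<PathAtom> atoms = List.of(
                new PathAtom(1, "HbA1c", LocalDate.of(2010, 4, 1)),   // recent
                new PathAtom(2, "HbA1c", LocalDate.of(2009, 9, 1)));  // overdue; patient 3 has none

        // Index: patient ID -> most recent HbA1c date (the role the hash index plays).
        Map<Integer, LocalDate> lastHbA1c = new HashMap<>();
        for (PathAtom a : atoms)
            if (a.test().equals("HbA1c"))
                lastHbA1c.merge(a.patientId(), a.date(), (x, y) -> x.isAfter(y) ? x : y);

        // Overdue = diabetes diagnosis AND existing HbA1c atoms AND latest one > 6 months old.
        List<Integer> overdue = new ArrayList<>();
        for (int id : diabetics) {
            LocalDate last = lastHbA1c.get(id);
            if (last != null && last.isBefore(today.minusMonths(6))) overdue.add(id);
        }
        System.out.println(overdue);
    }
}
```

Patient 3 is excluded because there are no HbA1c atoms at all, matching the reformulation's requirement that the patient was at some stage actually monitored.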

 * 1 BTreeMap was used as an alternative to the ordinary red-black TreeMap class, as proof of concept, and didn't actually overflow memory in the dataset used. Linear Hashing was used to implement the ur_no indexing of tables via the MapperImpl1 class. This is required for tables with a large number of rows, such as PathAtom, and certainly Progress, as the indexes are too large to fit in the Java virtual machine environment running on an ordinary 2 GB Windows desktop office machine.

Conclusion
This case study illustrates that the computer science aphorism of "tradeoff" also applies to increasing the sophistication of the technology used to represent computer-based medical records. One trades the marketed benefits of the high-level technology features of enterprise-level SQL DBMSs for vendor lock-in, lack of software extensibility, and the risk of expensive data extraction if a vendor decides to close shop. Arguably, this would not occur if medical record system vendors did not remove all the features of the underlying database management system and make access to the data tables and table schemas opaque. It shows that scalable database management is just a few pages of code, a good algorithm, and an open computer language system away; in people terms, one could simply employ a local computer science graduate to provide open source services indefinitely, and another graduate as quality assurance to ensure source code readability, in case the current programmer decides not to offer services any more.

It might be argued that maintaining the small-scale, cottage-industry status of general practice computer medical records undermines the need for standardization in medical informatics, and the portability of electronic medical records between primary care and hospital medical care. But by maintaining open, readable and extensible architectures based on well-known, easily understood but scalable algorithms, the counter-argument is that justified simplification assists adaptation to the future needs of medical codification, because the software is not opaque to all but a few dominant vendors.