dbm
The dbm library was a simple database engine, originally written by Ken Thompson and released by AT&T in 1979. The name is a three letter acronym for DataBase Manager, and can also refer to the family of database engines with APIs and features derived from the original dbm.
The dbm library stores arbitrary data by use of a single key (a primary key) in fixed-size buckets and uses hashing techniques to enable fast retrieval of the data by key.
The hashing scheme used is a form of extendible hashing, so that the hashing scheme expands as new buckets are added to the database, meaning that, when nearly empty, the database starts with one bucket, which is then split when it becomes full. The two resulting child buckets will themselves split when they become full, so the database grows as keys are added.
The dbm library and its derivatives are pre-relational databases – they manage associative arrays, implemented as on-disk hash tables. In practice, they can offer a more practical solution for high-speed storage accessed by key, as they do not require the overhead of connecting and preparing queries. This is balanced by the fact that they can generally only be opened for writing by a single process at a time. An agent daemon can handle requests from multiple processes, but introduces IPC overhead.
Joint Database Technology - DBM and Flat File Databases working in tandem - an ISAM-like, NoSQL implementation
Where external, binary, persistent, SDBM database files (tied to program hash tables - such as in PERL application programs) can really be made useful is in using the key/value pairs for random access indexing into a huge relational "text" flat file database composed of many flat files (with fixed-length records) exhibiting parent/child (1-to-many record) relationships. The key would be composed of a: single field, single partial field, or a compound key of multiple single and/or partial fields concatenated together (perhaps with a delimiter character between them such as a pipe "|"). The value in the key/value pair would be the location offset (in bytes) to seek to (i.e. position the file pointer) in a flat file at the start of a specific record wished to be random accessed for: READ, READ/WRITE, or APPEND access. Multiple SDBM files can be setup as alternate indexes into each of the Flat File database text files, each SDBM file containing a different key. An alternate key with duplicates can be created in the SDBM files by making as part of the key, an incremented number perhaps in the range 1-99.
ALT Key (with DUPS) example: BirthDate|LastNameFirst4Chars|FirstNameInitial|IncNbr "1959-12-19|Will|K|1" ... "1959-12-19|Will|K|74".
Note: This might be a useful Key for record look ups if someone did not remember their Social Security Nbr (Primary Key) #-- CODE SNIPPET: @Offsets=(); #-- we will build an array of Flat File record "byte offsets" to random access #-- for all records matching this ALT KEY with DUPS for ($i=1; $i<=99; $i++) { $KEY=$BirthDate . "|" . $LastNameFirst4Chars . "|" . $FirstNameInitial . "|" . $i; if (exists $Hash{$KEY}) { push @Offsets, $Hash{$KEY}; #-- add another $Hash VALUE(record byte offset) to the end of the array else { last; #-- exit out of the for loop (assuming you have not deleted any key/val pairs in the sequence 1-99) } } #-- Now what do you do with your array of record byte offsets? You could random access these records and load them into #-- a ListView GUI widget for your end-user to scroll and pick the particular record for viewing/editing having their #-- corresponding Social Security Nbr.
When EDITING Flat File records, the changes are saved "in place", overwriting existing data in the Flat File record. You may wish to employ a manual record locking strategy whereby your DB user-interface program places a lock on a particular record (to be edited) by writing that end-user's login name (used as the hash VALUE) and the record byte offset (used as the hash KEY), to an SDBM lookup file (of KEY/VALUE pairs), used specifically for record locks - which is accessible to all end-users. When the record is SAVED, or CANCEL is selected, then the record lock is released so that the record becomes immediately available for editing again by the first end-user initiating a new record lock on that particular record.
A DELETE flag indicator field can be employed to mark records in both the Flat Files and SDBM files (for later BATCH deletion Server-side during off hours) for an end-user DB interface application program to recognize as a BYPASS indicator until the records are physically removed. When this BATCH deletion is performed, a file reorganization is performed: (A.) removing discarded records (i.e. those marked for deletion), (B.) physical resorting of the Flat File records in a logical sequence, (C). rebuilding of the SDBM indexing files occurs so that the Flat File record byte offsets are refreshed to reflect the new positions/locations of the remaining records which have not been deleted.
For ADDING/APPENDING records in a concurrent multi-user environment, to avoid race conditions, it is recommended that you employ a semaphore file for locking during edits instead of placing a lock directly on the file to be edited. This strategy is discussed in depth by an expert/author on the subject here
If the user-interface was designed well, child records (in a 1-to-many, parent/child relationship) would not be directly editable, but only editable whenever the corresponding parent record was locked for edit. To be clear, fixed-length Child records would be housed in a Flat File separate from the fixed-length Parent records Flat File. The SDBM binary file used for Child Record look ups would have an ALT KEY with DUPS. Example: If the Parent records represented bank loans, and the Child records represented the individual items of collateral securing those bank loans, then the KEY for the Child record look ups would be the LOAN NBR of the Parent record. A sequence nbr (perhaps the range 1-9) added to the LOAN NBR would make the KEY unique for each separate item of collateral. LOAN_NBR|1 ... LOAN_NBR|9.
This is a very stable/safe database system. The binary SDBM files can easily be rebuilt from the Flat File database records. This is more desirable, then let's say, a MS-Access database, where the text data and indexes are stored together in a binary file which can become corrupted making it sometimes difficult to rescue your important textual data. In MS-Access, the database Data and Objects: back-end Tables/Indexes/Data, and front-end Reports/Forms/Macros/etc. are often mistakenly stored in one file (in binary format) - although MS-Access does provide for the means to separate the back-end and front-end into separate files allowing for a much more stable DB system.
One advantage joint/tandem/dual technology Flat File/SDBM databases have over MS-Access (for example) is that they require no MDAC (Microsoft Data Access Components) be installed to each client. ODBC-enabled MS-Access databases without the use of the MS-Access front-end software can be designed to create a huge database (perhaps to 1 Terabyte/5 Billion rows in practicality - depends on whether it is a READ ONLY Data Warehouse or a READ/WRITE Database) where each MDB file is used as a: single table, group of tables, or partial table (common to all the MDB files, and where the data is logically kept segregated for ease of random access to 1, or perhaps 2, MDB files - each MDB file containing as many as 10 million rows).
FLAT FILE/SDBM, or MDB, relational database systems can employ file naming convention to make it easy for a DB application user-interface to determine which file(s) to look in. Example: A flat file named US_CENSUS_2010_TX_A.txt (or .mdb for MS-Access) would be one way to identify a file logically segregated to contain only data associated with Texas citizens whose last name began with the letter "A". A business would need to determine what logical segregation of data made the most sense for their operational needs.
Server-side batch EDIT operations (and heavy reporting) could be performed during non-business hours. This could be to (daily or periodically) refresh a read-only database that does not allow (nor provide a DB user-interface for) concurrent multi-user edits. Additionally, for common data statistics, a statistics summary table could be maintained (Server-side during off hours) which answered most user questions, which would be an aggregate of the data across the entire database system (as in: Stats for the entire U.S., and Stats for each individual State of the 50 States - from the example given directly above). End-user self-initiated (client-side) queries and reporting may best be limited to accessing these precompiled summary statistics, and accessing a limited number of parent records with their corresponding child records. These limitations would be enforced by the design of the DB user-interface.
One useful user-interface design enforcing such limitations is to employ a combination TreeView/ListView format. A TreeView GUI widget is loaded on the left-side of the screen with summary information to give the end-user the "big picture" of what's available in the database. When a particular TreeView parent node is selected/expanded, more specific information becomes available to select from. When the end-user selects/clicks on any of these child nodes, then the specific/limited data is displayed in the ListView GUI widget located on the right-side of the screen. EXAMPLE: A database of KJV Bible verses is available. A TreeView widget is loaded with the 66 Book Names (parent nodes). When a particular Book (parent node) is selected/expanded, the individual Chapters (child nodes) are displayed. An end-user then selects/clicks on a particular Chapter of Verses for viewing. The Verses for the selected Chapter within the selected Book are then loaded into the ListView widget for scrolling/viewing/printing.
A Perl user's group discussion of this topic (Joint Database Technology) can be found here.
A primitive attempt to index the records within a variable-length record Flat File can be found here. The problem with this primitive method/technique is that it only shows you how to randomly access a particular Flat File, variable-length record by the sequential record number. That is only useful if you know the sequential record number to go to. This does not give you the benefit of arbitrarily seeking to a particular record(s) based upon information contained within the fields of a record (such as a combination of: {BirthDate, LastNameFirst4Chars, FirstNameInitial} - from the example given above).
#-- This Perl program retrieves 5 verses of King James Version Bible text #-- from a large Flat File (with fixed-length, "text" records) by random access lookup. #-- The Flat File contains 180 complete copies of the KJV Bible, with a bogus #-- translation number (tr) assigned to each Bible copy (tr = 1 to 180) #-- to make a unique key: {translation_nbr + book_nbr + chapter_nbr + verse_nbr}. #-- Record offsets (in bytes) are persistently stored in a binary Perl SDBM database file, #-- of key/value pairs, tied to a program hash table. The value is the offset. #-- The key is {tr + bk + chp + ver} numbers combined/concatenated. #-- If $offset is a negative value, seek from BOTTOM/END of file. #-- If $offset is a positive value, seek from BEGIN/TOP of file. #-- Each Bible contains 31102 verses of text, of max length 528 characters each. #-- But with the compound index {tr + bk + chp + ver} added to the Bible text, #-- for the purpose of proving the random access is working, and #-- MIMEbase64 encoding applied (to hide the Bible text), the fixed #-- length records have become 760 characters each. Decoding will occur as records are read. #-- The Flat File is just under 4 GIG. The SDBM file just under 1 GIG. #-- There are over 5 Million records each, in both the Flat File and SDBM file. #-- [180 copies of the Bible times 31102 verses per Bible] #-- Flat File, random access, record lookup, is instantaneous. #-- You can use Perl Portable Code: sysopen, syswrite, sysseek, sysread. #-- But the below example is Windows O/S specific Perl Code. #-- This example is a batch application process (no user front-end), having 5 hard-coded lookup keys. #-- You can build a user-interface to instead accept the lookup keys from user input: either typed in, #-- or selected from a GUI widget of preloaded values {tr, bk, chp, ver}. #-- A RANGE of values could even be selected to print Bible verses for an entire Book (ex. tr="134", bk="01" i.e. Genesis) #-- NOTE: No promoting of any religion is done here. The KJV Bible was selected because it is in the public domain, and because #-- it is logically segregated making it a good test file which can easily be copied over-and-over to fill up as many Flat Files #-- as is desired for testing. Dozens of Flat Files, each with 180 complete copies of the Bible, could be preloaded, and your #-- application program designed to access any one of them based upon the different set of distinct bogus translation numbers (tr) #-- contained within each Flat File (and its associated SDBM binary file tied to a hash table). #-- Not shown here, was the initial code used to load the Flat File with records, and load the SDBM file with key/val pairs. #-- To be clear: This "random access" technique requires no reading in of the Flat File records sequentially, nor does it #-- require reading in rows/records sequentially from the binary SDBM file of key/val pairs - to load them into the Hash table. #-- As soon as the below code is launched, the Bible verses are randomly accessed and printed to the screen in a split second.
use Win32API::File 0.08 qw( :ALL );
use Win32;
use SDBM_File;
use Fcntl;
use MIME::Base64 qw(decode_base64);
$PWD=Win32::GetCwd(); #-- working directory
#--- tie the external binary SDBM file contents (of key/value pairs) to a program hash table.
tie( %BibleVersesIDX, "SDBM_File", '.\BibleFlatFile_760_31102_180_IDX', O_RDONLY, 0444 );
if (tied %BibleVersesIDX) { print "BibleVersesIDX Hash now tied to external SDBM file\n\n"; } else { print "Could not tie BibleVersesIDX Hash with external SDBM file - Aborting\n\n"; die; } #-- create a file handle to: open, and random access read from, the Flat File of Bible verses. #-- the flat file already exists, and was preloaded with 5 million plus records. #-- if you are unfamiliar with the next line of code, it would be deceiving to you. $hFILE = createFile("$PWD\\BibleFlatFile_760_31102_180.dat", "r"); #-- $hFile is a native Windows file handle #-- tr bk chp ver foreach $key ("00101001001", "09066022021", "09101001001", "18001001001", "18066022021") { $offset=$BibleVersesIDX{$key}; if ($offset < 0) { $pos=SetFilePointer( $hFILE, $offset, [], FILE_END); #-- moves the file pointer to a specific record at $offset } else { $pos=SetFilePointer( $hFILE, $offset, [], FILE_BEGIN); #-- moves the file pointer to a specific record at $offset } #-- FYI: Don't rely on $pos, because if location is past 2 GIG mark, $pos (the return value) is wrong. #-- $offset will be an integer value to seek up to 2 GIG bytes from Top, or Bottom, of a 4 GIG file. ReadFile( $hFILE, $Buf, 760, [], [] ); #-- $Buf contains the 760 characters read in from the Flat File $decoded_Buf=decode_base64($Buf); #-- MIMEBASE64 decoded to length 570 from 760 $decoded_Buf=~s/ *$//; #-- remove trailing spaces print $decoded_Buf . "\n\n"; #-- print to the screen the decoded Bible verse fetched from the Flat File } exit; END { CloseHandle( $hFILE ); untie( %BibleVersesIDX ); sleep 5; }
Successors
The dbm library has had many successors, such as:
- Ndbm: In 1986 Berkeley produced ndbm (standing for New Database Manager). This added support for having multiple databases open concurrently.
- Sdbm: Some versions of Unix excluded ndbm due to licensing issues, so in 1987 Ozan Yigit produced this public-domain clone.[1]
- BDB: 1991 successor to ndbm by Sleepycat Software (now Oracle) created to get around the AT&T Unix copyright on BSD.
- GDBM (GNU dbm): A Free/Libre version written by Philip A. Nelson for the GNU project. It added support for arbitrary-length data in the database: previously all data had a fixed maximum length.[2][3] The latest version was released on 11 March 2017.[4]
- QDBM (Quick Database Manager): "an embedded database library compatible with GDBM and NDBM. It features hash database and B+ tree database."[5]
- tdb (trivial database library): developed and used internally within the Samba suite, implements an API inspired by GDBM but also supports multiple writers, released under the LGPL license.[6]
- tdbm: a version of ndbm with atomic transactions, in-memory databases, and other extensions, released under a BSD-style open source license.[7]
- MDBM: Ndbm work-alike hashed database library based on sdbm which is based on Per-Aake Larson's Dynamic Hashing algorithms.[8][9]
- Tokyo Cabinet and Kyoto Cabinet: C and C++ implementations employing hash table, B+ tree, or fixed-length array structures by FAL Labs.
- LMDB: copy-on-write memory-mapped B+ tree implementation in C with a dbm-style API.
See also
References
- ↑ Seltzer & Yigit. "A New Hashing Package for Unix" (PDF).
- ↑ "Gdbm - Free Software Directory".
- ↑ "GDBM".
- ↑ « gdbm-1.13 released », lists.gnu.org, 11 March 2017.
- ↑ "QDBM - Free Software Directory".
- ↑ "tdb: Main Page".
- ↑ "tdbm - a simple, high-performance database with nested atomic transactions".
- ↑ "yahoo/mdbm".
- ↑ "MDBM Documentation — MDBM Documentation".
General
- Matthew, Neil; Stones, Richard (2008), "Databases", Beginning Linux Programming, Wiley
- Brachman & Neufeld. "TDBM: A DBM Library With Atomic Transactions" (PDF).
- Olson, Bostic & Seltzer. "Berkeley DB" (PDF).