DBH - Disk Based Hashtables


Library to create and manage 64-bit disk based hash tables

What is DBH?

DBH (acronym for Disk Based Hashtable) is a convenient way to associate unsigned character keys with data records. Any kind of digital information can go into the data record, such as text, graphic information, database structures, you name it. The idea behind using a DBH is to build the key index directly into a multidimensional file format.

If you are a technical guy and want to know how the guts of DBH are arranged around quantified numbers, take a look at this white paper, which is a first draft submitted recently to the online JCC.

Why DBH?

DBH (Disk Based HashTables) technology is relevant when the data challenges being faced push the limits of what can be addressed by traditional hashtables or database systems. By adopting DBH, organizations can realize substantial cost savings while solving challenges that seemed intractable using other data management technologies.

DBH adopters usually have very large datasets, very fast access requirements, or very complex datasets. Another reason to turn to DBH is that the software allows users to easily share data across a wide variety of computational platforms using applications written in standard C90. The license is GNU GPL-3.

Similar to XML documents, DBH files allow users to specify complex data structures. In contrast to XML documents, DBH files can contain binary data (in many representations) and allow direct access to parts of the file without first parsing the entire contents.

DBH, not surprisingly, allows data objects to be expressed in a very natural manner, in contrast to the tables of a relational database. Whereas relational databases support tables, DBH supports n-dimensional datasets, and each element in the dataset may itself be a complex object. Relational databases offer excellent support for queries based on field matching, but are not well-suited for sequentially processing all records in the database or for subsetting the data based on coordinate-style lookup.

In-house data formats are often developed by individuals or teams to meet the specific needs of their project. While the initial time to develop and deploy such a solution may be quite low, the results are often not portable, not extensible, and not high-performance. In many cases, the time devoted to extending and maintaining the data management portion of the code takes an increasingly large percentage of the total development effort - in effect reducing the time available for the primary objectives of the project. DBH offers a flexible format and powerful API backed by over 10 years of development history.

Projects can leverage DBH capabilities and still define their own data objects and project-specific API to those objects.

DBH is free software distributed at no cost. Potential users can evaluate DBH without any financial investment. Projects that adopt DBH are assured that the technology they rely on to manage their data is not dependent upon a proprietary format and binary-only software that a company may dramatically increase the price of, or decide to stop supporting altogether.

Bigger is not always better. In DBH, emphasis is on functionality, speed and size.

In a world of software dominated by huge corporations, the natural way to evolve is by becoming small and fast: a clever and daring rodent scurrying among the feet of dinosaurs.

What is the difference with a GHashTable?

GHashTables (from the GLib library) work fine for in-memory hashtables. But there is a size limitation determined by the amount of available RAM. Also, GHashTables are only available during program execution, while DBH files are disk based and can be shared amongst different computer platforms.

Why use DBH? 

If you need to have large quantities of information available online and, even with obsolete equipment, you require better performance than that available with expensive proprietary databases, then DBH is for you.

Why not use DBH? 

If you wish to use a large database problem as a means to justify the purchase of expensive hardware and proprietary software, stay clear of DBH. Do not even mention it.

How does DBH work? 

DBH extends the concept of binary trees into an n-tuple dimensional space. The way DBH creates trees is in the fashion we actually see trees in nature, and nature is a wise thing.
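One way to picture a binary tree extended to n branches is sketched below in C90. This is only an illustration and makes no claim about DBH's actual on-disk layout or API: the names KEY_LENGTH, node, tree_put, tree_get and tree_free are hypothetical, the tree lives in memory instead of on disk, and the branching rule (follow the branch numbered by the first key position that differs) is an assumption chosen to show how a node can have one branch per key position instead of just two.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LENGTH 4   /* hypothetical fixed key length, i.e. the tree dimension */

typedef struct node {
    unsigned char key[KEY_LENGTH];
    int value;
    struct node *branch[KEY_LENGTH];   /* one branch per key position */
} node;

/* Index of the first byte where a and b differ, or KEY_LENGTH if they are equal. */
static int first_difference(const unsigned char *a, const unsigned char *b)
{
    int i;
    for (i = 0; i < KEY_LENGTH; i++)
        if (a[i] != b[i])
            return i;
    return KEY_LENGTH;
}

/* Insert a record, or overwrite the value if the key exists. Returns the root. */
static node *tree_put(node *root, const unsigned char *key, int value)
{
    node **slot = &root;
    int i;

    while (*slot != NULL) {
        i = first_difference((*slot)->key, key);
        if (i == KEY_LENGTH) {          /* same key: update in place */
            (*slot)->value = value;
            return root;
        }
        slot = &(*slot)->branch[i];     /* descend along the differing position */
    }
    *slot = (node *)malloc(sizeof(node));
    assert(*slot != NULL);
    memcpy((*slot)->key, key, KEY_LENGTH);
    (*slot)->value = value;
    for (i = 0; i < KEY_LENGTH; i++)
        (*slot)->branch[i] = NULL;
    return root;
}

/* Return a pointer to the stored value, or NULL if the key is absent. */
static const int *tree_get(const node *root, const unsigned char *key)
{
    int i;
    while (root != NULL) {
        i = first_difference(root->key, key);
        if (i == KEY_LENGTH)
            return &root->value;
        root = root->branch[i];
    }
    return NULL;
}

/* Release every node reachable from n. */
static void tree_free(node *n)
{
    int i;
    if (n == NULL)
        return;
    for (i = 0; i < KEY_LENGTH; i++)
        tree_free(n->branch[i]);
    free(n);
}
```

In this sketch, inserting the key "abcd" and then "abce" hangs the second record from branch 3 of the first, because the two keys first differ at position 3. A lookup retraces the same comparisons, so it visits only the nodes along one path.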

On what platforms does DBH run? 

Linux, FreeBSD, AIX, Solaris, SunOS, Digital Unix, Irix, and in general any POSIX compliant OS. A Gentoo-Linux ebuild, Ubuntu packages and FreeBSD ports are available at the download site. Windows is now supported as well through mingw-w64 (a Windows installer is available at the download site).

How does DBH grow balanced trees? 

This is done using some concepts borrowed from mathematics. The first one is the concept of natural numbers, {1, 2, 3, ...}. Natural numbers are how we count: they are infinite, countable and have a certain order. But we must not think that they are the only representation. Any other set which is countable, infinite and has a certain order is the exact same thing, even though it may seem strange or weird. Well constructed DBH tables make use of quantified numbers, which are a different way of expressing the natural numbers, with a different ordering rule. Assuming that the computation bottleneck is the transfer of data from memory to CPU registers, quantified numbers are the way to index records to guarantee the fastest possible retrieval speed.

In order to minimize the access time to any particular record, the dimension of the DBH tree should equal the natural logarithm of the number of records in the DBH file. If the records represented data of the entire population of the Earth, the dimension (or key length) of the DBH file would be 22. Considering that the human genome contains more than 3.4 billion base pairs, associating a data structure to each base pair would require a dimension of 21. 64-bit DBH guarantees that such a DBH file can grow up to 9.2E+18 bytes in size, enough for an average record size of 2.7 GBytes for each base pair. Intractable problem?
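The arithmetic above can be checked with a few lines of C90. This is only a sketch: the function names are made up for illustration and are not part of the DBH API, the world population of 7 billion used in the note below is an assumed figure, and truncating the logarithm to a whole number is an assumption that happens to reproduce the 22 and 21 quoted above. The loop computes the truncated natural logarithm by repeated multiplication; log() from math.h would do the same job when linking with the math library.

```c
#include <assert.h>

/* e, the base of the natural logarithm */
#define SKETCH_E 2.718281828459045

/* Largest whole d with e^d <= record_count, i.e. the natural logarithm
 * of record_count truncated to a whole number. */
static int suggested_dimension(double record_count)
{
    double power = 1.0;
    int dimension = 0;

    while (power * SKETCH_E <= record_count) {
        power *= SKETCH_E;
        dimension++;
    }
    return dimension;
}

/* Largest size addressable with a signed 64-bit offset: 2^63 bytes,
 * which is about 9.2E+18. */
static double max_file_bytes(void)
{
    return 9223372036854775808.0;
}

/* Average number of bytes available per record if the file is filled
 * to its maximum size. */
static double average_record_budget(double record_count)
{
    assert(record_count > 0.0);
    return max_file_bytes() / record_count;
}
```

With these helpers, suggested_dimension(7.0e9) gives 22 for the assumed world population, suggested_dimension(3.4e9) gives 21 for the base pairs, and average_record_budget(3.4e9) comes out at roughly 2.7E+9 bytes per base pair.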

More technical information on quantified numbers can be found in a soon-to-be-published article on the subject.

How to start using DBH?

First of all you have to download it. The recommended method is building the library from source, and for this purpose a Gentoo-Linux ebuild and a FreeBSD port are provided. On systems other than Gentoo-Linux or FreeBSD, just download the source tarball, unpack with «tar -xzf dbh-X.X.X», and follow the generic installation instructions listed in dbh-X.X.X/INSTALL. Ubuntu packages are also available for both amd64 and i386 architectures.

HTML documentation and man pages for all function calls will be installed. In the examples subdirectory there is a simple hash example (simple_hash.c) and a more complex example (filesystem.c). Make sure you have a look at them before you start.

Where do I get DBH? 

The SourceForge download site is located here.

Is there any support? 

The sourceforge mailing list has been disabled due to excessive spam attacks. We will soon see if the sourceforge blog is more robust.

Are old 32-bit DBH files supported?

Old 32-bit DBH files can be accessed with dbh-4.5.0. 64-bit dbh functions in dbh-4.6.0 and later are namespaced differently so that both libraries may coexist in the same executable. Old 32-bit DBH files cannot be accessed with the 64-bit library (dbh-4.6.0 and later).


Edscott Wilson García, who can be reached at <edscott at xfce dot org> for comments, suggestions and questions.
