Bug Pattern In Java - Saboteur Data

Overview

When a program crashes due to corrupted data, the reason can be elusive. Often a program can crash dead in its tracks while manipulating its own internal data, even after working flawlessly for long periods. You've probably already encountered the Saboteur Data bug pattern. We discuss the syntactic and semantic reasons for it and offer potential methods for eliminating it.

Being an industrious developer, you've deployed one of your well-written and well-tested applications for several of your clients who needed better access to their massive stores of complex data.

For each client, the on-site testing period went off without a hitch. You're on your way to the bank, only barely thinking about the six-month software checkup, when your pager goes off. One of your clients ran a report using your software and bombed the system.

You rush to the site and run a random test. It works fine. You run another. No problems. You run hundreds more. Still no problem. You check on other clients who have been running this application full tilt for six months. You get no complaints.

Then, you repeat the report that caused the problem. Crash! What's going on?

You've just met the Saboteur Data, a bug pattern that can be the culprit behind this sort of crash. We explain why it exists and offer several methods to eliminate it—before and after it occurs. Here is the summary of the Saboteur Data bug pattern:

  • Pattern: Saboteur Data.
  • Symptoms: A program that stores and manipulates complex input data crashes unexpectedly while performing a task similar to other tasks that cause no problem.
  • Cause: Either the syntax or the semantics of some of the internal data is corrupted.
  • Cures and Preventions: Perform as many integrity checks on input data as possible, as early as possible. For persistent data that is already corrupt, walk over it and check for integrity. 

About This Bug Pattern

Many programs need to intensively access and manipulate internally stored data to perform various complex tasks. This data might be retrieved from a large data structure in memory, from a database, or over a network.

This type of program is highly susceptible to a crash caused by corrupt internal data. I call this bug pattern the Saboteur Data pattern because such data can stay in the system indefinitely, much like a Cold War sleeper spy, causing no trouble until the particularly troublesome bit of data is accessed. The corrupt data then explodes like a bomb.

The Symptoms

A program that stores and manipulates complex input data crashes unexpectedly while performing a task similar to other tasks that cause no problem.

A Syntactic Cause

Suppose we have a JDBC application that stores a database table called Mapping that maps String names to sets of elements. Each element of each set refers to a key stored in another table, Properties, containing various known properties of these elements. (JDBC serves up a common API for connecting to and eliciting services from databases on a variety of platforms.)

Let's say that both Mapping and Properties are initially read from a text file developed by an outside source (outside meaning any data source not generated by our JDBC application itself) in which each line starts with a name and is followed by a representation of the corresponding set, as follows:

Data from an Outside Source Text File.

In the Mapping file:
apples {macintosh, gala, golden-delicious}
trees {elm, beech, maple, pine, birch}
rocks {quartz, limestone, marble, diamond}

In the Properties file:
macintosh {color: red, taste: sour}
gala {color: red, taste: sweet}
diamond {color: clear, rigidity: hard, value: high}

The Mapping and Properties table entries could be parsed and passed to a method that inserts them into a database. But there are potential pitfalls in this approach. For example, let's suppose that we have written a class that handles a JDBC-compliant database. Following the JDBC API, we could define a PreparedStatement object and use it to pass information into the database, like so:

Defining PreparedStatement Object for Passing Data.

PreparedStatement insertionStmt =
  con.prepareStatement("INSERT INTO MAPPING VALUES(?,?)");

// Inserts one (name, set) row into the MAPPING table.
public void insertEntry(String domain, String range) throws SQLException
{
  insertionStmt.setString(1, domain);
  insertionStmt.setString(2, range);
  insertionStmt.executeUpdate();
}

Inserting two Strings this way may or may not be all right, depending on how the Strings are obtained from the text file. Suppose, for example, that a simple regular expression-matching tool is used to split each line into two Strings:

  • One String contains all the characters before the first space.
  • One String contains all the characters after the first space.

Such a rudimentary parse of the text file would not catch minor corruption in the data. For example, consider what would happen if one of the lines were in the following form:

trees {elm, beech, maple, pine birch}

The comma between pine and birch is missing. An error such as this can easily result from a bug in the tool that generates the file or from manual editing of the file.

At any rate, the data would enter the database in its corrupt form, waiting silently to be accessed. If the method used to access data expects entries to be separated by commas and spaces, it will crash when reading this entry.

If the program simply distinguishes the elements of the set by commas alone, an even more serious error can occur. The system could interpret pine birch as a single type of tree (a single entry of data) and propagate the bug further into the computation.
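To make the failure concrete, here is a minimal sketch of the rudimentary approach described above. The class and method names (NaiveMappingReader, splitAtFirstSpace, elementsByCommaOnly) are hypothetical and chosen only for illustration; the original application's code is not shown in this article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the naive reading strategy; not the real application code.
final class NaiveMappingReader {

    // Splits a line at the first space: the name before it, the set text after it.
    // Assumes the line contains at least one space and performs no other validation.
    static String[] splitAtFirstSpace(String line) {
        int i = line.indexOf(' ');
        return new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    // Distinguishes elements by commas alone, trimming surrounding whitespace.
    // Assumes the set text starts with '{' and ends with '}'.
    static List<String> elementsByCommaOnly(String setText) {
        String inner = setText.substring(1, setText.length() - 1);
        List<String> elements = new ArrayList<>();
        for (String piece : inner.split(",")) {
            elements.add(piece.trim());
        }
        return elements;
    }
}
```

Given the corrupt line, splitAtFirstSpace accepts it without complaint, and elementsByCommaOnly later yields four elements, the last being the single string "pine birch". Nothing fails at the point where the corruption entered.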

A Semantic Cause

Our example is one in which a simple, syntactic constraint of the data was violated. Of course, that's not the only way in which the data might be corrupted. Semantic-level constraints can be violated as well. In our example, one expectation of the data in the Mapping table is that every element in each set is a domain entry in the Properties table. If this invariant were violated, we might end up trying to read an element in the Properties table that wasn't there, causing an exception to be thrown.
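The invariant just described can be checked explicitly. The following is a sketch only: it assumes the two tables have already been loaded into in-memory collections, and the names ReferenceCheck and danglingElements are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: checks that every element of every Mapping set
// is also a key in the Properties table.
final class ReferenceCheck {

    // Returns one "name -> element" string for each element with no Properties entry.
    // An empty result means the invariant holds for the data supplied.
    static List<String> danglingElements(Map<String, Set<String>> mapping,
                                         Set<String> propertyKeys) {
        List<String> dangling = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : mapping.entrySet()) {
            for (String element : entry.getValue()) {
                if (!propertyKeys.contains(element)) {
                    dangling.add(entry.getKey() + " -> " + element);
                }
            }
        }
        return dangling;
    }
}
```

If the Properties table held only the three lines listed earlier (macintosh, gala, diamond), this check would flag golden-delicious in the apples set at load time rather than leaving the gap to be discovered by a report months later.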

In this article we use database entries as examples, but a Saboteur Data bug can come at you in a variety of ways, as many ways as there are data-input avenues. When data is read by a program, whether it is from a file, a keyboard, a microphone, a network port, or a digital camera, the potential for a saboteur exists.

Cures and Preventions

The best defense against the Saboteur Data bug is one that is universally employed by compiler and interpreter developers. Because the input data to these programs is so complex, developers have no choice but to perform as thorough an integrity check as possible when first reading the input, rather than upon later access.

Let's look at several elimination methods.

Parsing as an Elimination Method

The very practice of parsing input is a way to eliminate most saboteurs. Unfortunately, programmers who would never think of writing a compiler without a parser fail to write adequate parsing methods for simpler data. The parsing of simpler data is easy, but that's no excuse for not parsing it at all.

Any program that reads data—no matter how simple—should parse it. After all, such a program can be viewed as a compiler or an interpreter over the "language" defined by its set of valid inputs. Take it from someone who has been there. I plead guilty to having manipulated data without proper parsing in my young and reckless days, and I suffered the consequences—rampant saboteurs. I don't recommend the experience.
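As an illustration of how little code a real parse can take, here is a sketch of a strict parser for the Mapping lines from our example. The grammar is an assumption made for this sketch (a name, exactly one space, then a non-empty brace-enclosed list separated by a comma and a space, with names made of letters and hyphens); the class name MappingLineParser is hypothetical.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical strict parser for lines such as: trees {elm, beech, maple}
final class MappingLineParser {

    // Assumed grammar: names consist of letters and hyphens only.
    private static final String NAME = "[A-Za-z][A-Za-z-]*";

    // name, one space, '{', one or more names separated by ", ", '}'
    private static final Pattern LINE = Pattern.compile(
        "(" + NAME + ") \\{(" + NAME + "(?:, " + NAME + ")*)\\}");

    // Returns the name and its elements, or throws if the whole line
    // does not match the grammar. Nothing malformed gets past this point.
    static Map.Entry<String, List<String>> parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Malformed mapping line: " + line);
        }
        List<String> elements = Arrays.asList(m.group(2).split(", "));
        return new AbstractMap.SimpleImmutableEntry<>(m.group(1), elements);
    }
}
```

With this in place, the line with the missing comma between pine and birch is rejected when the file is read, with an error message that names the offending line, instead of entering the database.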

Type Checking as an Elimination Method

Another common form of checking performed by compilers for many languages (including the Java language) is type checking. Type checking is an example of a semantic-level check on the integrity of a program.

Provided that the type system is sound (as the Java type system is), this integrity check literally guarantees that a huge class of errors can never occur at runtime. Like parsing, this example from compiler writers can be applied to other programs, which often stipulate semantic-level invariants over their input data (as in our example). These invariants are often not explicit, but they can be made explicit by putting in the corresponding checks.
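One way to make such an invariant explicit is to give validated data its own type, so that the compiler helps enforce the check. The sketch below is an assumption about how that might look for our example, not a prescribed design: ElementKey is a hypothetical class whose only factory method verifies that the name is a known key in Properties.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical wrapper type: holding an ElementKey means the name was
// checked against the Properties keys at the time it was created.
final class ElementKey {
    private final String value;

    // Private constructor: callers must go through of(), which performs the check.
    private ElementKey(String value) {
        this.value = value;
    }

    // Returns a key only if the name is present in the supplied Properties keys.
    static Optional<ElementKey> of(String name, Set<String> propertyKeys) {
        return propertyKeys.contains(name)
            ? Optional.of(new ElementKey(name))
            : Optional.empty();
    }

    String value() {
        return value;
    }
}
```

Methods that look up properties can then accept an ElementKey rather than a raw String, so unchecked names cannot reach them without a compile error. Note that the check reflects the Properties keys as they were when the key was created; if that table can change afterward, the guarantee weakens accordingly.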

Iteration as an Elimination Method

Of course, if you suspect an occurrence of this bug pattern with data that has already been read in and stored, it would be prudent to iterate over the data, accessing each bit of data as it would be accessed in the deployed application and ensuring that everything works as expected. In the process, you might be able to correct simple errors as well.
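Such a walk over stored data might be sketched as follows. This is an illustration under stated assumptions: the rows are assumed to have been fetched into a map from name to stored set text (in a JDBC application they would come from a query over the table), and the accessor stands in for whatever code path the deployed application uses to read a row. The names StoredDataAudit and audit are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical offline audit: runs the application's own access logic
// over every stored row and records the rows that make it fail.
final class StoredDataAudit {

    // Returns a map from row name to failure message, in iteration order.
    // An empty result means every row was accessed without a runtime exception.
    static Map<String, String> audit(Map<String, String> storedRows,
                                     Consumer<String> accessor) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (Map.Entry<String, String> row : storedRows.entrySet()) {
            try {
                accessor.accept(row.getValue());
            } catch (RuntimeException e) {
                // Record the saboteur and keep going so one bad row does not hide others.
                failures.put(row.getKey(), String.valueOf(e.getMessage()));
            }
        }
        return failures;
    }
}
```

Collecting every failure in a single pass, rather than stopping at the first, gives a complete list of suspect rows to repair.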

In cases where your data is stored in an immutable database or other immutable finite store, such an offline integrity check can also serve as a performance optimization. If you check over all of the data offline and it all passes, there is no need to check it again when it's used online. You might as well conserve the processor cycles.

But this optimization should be done only when the data is truly immutable, and only when there is no chance that the data will be corrupted while reading it from storage. If there is even a remote chance that new data will be entered or if the connection to this data is any less reliable than a connection to the local filesystem, check it again while reading it. After all, these integrity checks rarely cause significant performance degradation; the data retrieval process itself will almost always be the bottleneck.

But even a small risk of saboteur data is too much risk. Just one case will easily outweigh any advantages of not doing the checks—both from the perspective of your customer when the software makes a catastrophic nosedive and from your perspective when you try to diagnose what happened. Because the symptoms are far removed from the cause, saboteur data can be a bear to diagnose.

A Caveat on Elimination Methods

I don't mean to imply that it is always possible to perform enough checks to eliminate each piece of saboteur data from a program. If that were the case, this would be a much less problematic bug pattern.

There are many reasons why a saboteur might be undetectable before it starts wreaking havoc:

  • The data necessary to perform all the checks is not available until after the saboteurs are stored away, and they are not all accessible offline.
  • The function checking the complete set of constraints is not even computable (as is the case for many programming languages).
  • The set of constraints is computable, but the resources required to check them are beyond what's available to the program. 

In such cases, the best we can do is eliminate as many forms of saboteurs as possible.

What We've Learned

In this article on the Saboteur Data bug pattern we've learned the following:

  • Saboteur Data is often responsible when either the syntactic or the semantic constraints of complex input data or legacy data are violated.
  • The unpredictable nature of this bug rests in the fact that some actions call up the bits of corrupt data while other actions don't.
  • This saboteur data can stay in the system indefinitely, much like a sleeper spy, causing no trouble until the particularly troublesome bit of data is accessed. 
  • The bug can come at you in a variety of ways, as many ways as there are data-input avenues.
  • A syntactic data error can occur by manual editing or automated generation of a file. 
  • With a syntactic error, a simple parse—like splitting a text line into two Strings, one each before and after the first space—wouldn't catch a minor data corruption such as a missing comma separator between entries.
  • The results of the syntactic error above: if the program expects entries to be separated with a space and a comma, the program may crash; if not, the program may accept the two entries as one and propagate the error.
  • In a semantic error, expectations of elements can be violated. If an expectation of the data in one table is that every element in each set is a domain entry in another table, and we violate this expectation, we may throw an exception when we try to read an element in the second table that isn't there.
  • The best defense against this pattern is one that is universally employed by compiler and interpreter developers. Because the input data to these programs is so complex, developers perform as thorough an integrity check as possible when first reading the input rather than later. 
  • The practice of parsing input is a way of eliminating bugs. In fact, any program that reads data—no matter how simple—should parse it.
  • Type checking, another elimination method, is an example of a semantic-level check on the integrity of a program.
  • If you suspect an occurrence of this bug with data that has already been stored, you can iterate over the data, accessing each bit as it would be accessed in the deployed application and ensuring that everything works as expected. When the data is stored in an immutable database or finite store, such an offline integrity check can also serve as a performance optimization.
  • One caveat: offline checking of data integrity should replace online checks only when the data is immutable and only when there is no chance that the data will be corrupted while reading it from storage.
  • A saboteur might be undetectable because the data necessary to perform all the checks won't be available until after the saboteurs are stored away, and not all of them are accessible offline.
  • A saboteur might be undetectable because the complete set of constraints is not even computable (as is the case for many programming languages).
  • A saboteur might be undetectable because the constraints are computable, but the resources required to check them are beyond the access of the program. 

The Golden Rule for eliminating data saboteurs: Any program that reads data should parse the data. Good luck in stamping them out!
