Batching Select Statements in JDBC
by Jeanne Boyarsky

In networking, one of the most expensive things you can do is make a server roundtrip. In JDBC, this corresponds to a database call. If you are doing inserts, updates or deletes, you can use the executeBatch() signature to cut down the number of trips to the server. Unfortunately, there is no such built-in mechanism for select queries.

Suppose you want to get the names for a given list of ids. Logically, we would want to do something like:

PreparedStatement stmt = conn.prepareStatement(
    "select id, name from users where id in (?)");
stmt.setString("1,2,3");

However, this will not work. JDBC only allows you to substitute a single literal value in the "?". You can't substitute things that are meant to become part of the SQL statement. This is necessary because if the SQL itself can change, the driver can't precompile the statement. It also has the nice side effect of preventing SQL injection attacks.

Instead, you have four options for the implementation:

  1. Run the query separately for each id
  2. Build one query to do everything
  3. Use a stored procedure
  4. Select batching

Option 1: Run the query separately for each id

If you have 100 ids, this option results in 100 database calls.

PreparedStatement stmt = conn.prepareStatement(
    "select id, name from users where id = ?");
for ( int i=0; i < 3; i++ ) {
  stmt.setInt(i);  // or whatever values you are trying to query by
  // execute statement and get result
}

This option is the easiest to code, but results in the slowest performance.

Option 2: Build one query to do everything

At runtime, you use a loop to build a statement like this:

PreparedStatement stmt = conn.prepareStatement(
    "select id, name from users where id in (?, ?, ?)");
stmt.setInt(1);
stmt.setInt(2);
stmt.setInt(3);

This solution is next easiest to code (compared to running the query separately) and solves the server roundtrip problem. Unfortunately, the prepared statement must be recompiled each time you call it with a different number of parameters. This is due to the fact that the string literal of the SQL is matched against. So if I call it with ten ids, then three ids, then 100 ids, there will be three prepared statements in the cache. Aside from the added time from recompiling the prepared statement, other queries may be removed from the pool of available statements causing them to be recompiled again too. Finally, this solution can take longer for very large queries if memory is exceeded and paging to disk occurs.

Another variant of this approach is to hard code the values in the statement:

PreparedStatement stmt = conn.prepareStatement(
    "select id, name from users where id in (1, 2, 3)");

This approach is even worse as there is almost no chance of the statement being reused. At least with the "?", the statement will be reused for queries with same number of parameters.

Still another variant of this approach is to send multiple statements:

PreparedStatement stmt = conn.prepareStatement(
   "select id, name from users where id in (?) ; "   
   + "select id, name from users where id in (?); "
   + "select id, name from users where id in (?)");
stmt.setInt(1);
stmt.setInt(2);
stmt.setInt(3);

This approach has the advantage of the database seeing the statement as being the same each time. The database will not need to figure out the execution path each time. However, the SQL is different each time from the driver's point of view and the prepared statement must still be prepared and stored in the cache each time. Also, note that not all database support sending multiple statements separated with semicolons.

Option 3: Use a stored procedure

A stored procedure executes on the database server. Therefore, it can make many queries without any network overhead. The stored procedure can collect all the results and return them in one trip. This has the advantage of being the fastest solution. However, it ties you to one database and requires you to split logic between the application server and database server. If your architecture already uses stored procedures, this is the typically the best solution.

Option 4: Select batching

Select batching is a compromise between options one and two. The ideas is to predetermine a few numbers of parameters that you want to run in each query. Then you build a bunch of queries with those numbers of parameters. Since there are a finite number of queries involved, you get the advantages of prepared statements (pre-compiling and not bumping other prepared statements from the cache.) The batching of multiple values in the same query retains the benefit of minimizing the number of server roundtrips. Finally, you avoid the memory problems of large queries by controlling the upper limit of the batch sizes. This option is the best approach if you have a critical query in terms of performance that needs batching and are not using stored procedures. Now, we walk through an example and note the important things to consider.

Example

public static final int SINGLE_BATCH = 1;
public static final int SMALL_BATCH = 4;
public static final int MEDIUM_BATCH = 11;
public static final int LARGE_BATCH = 51;

The first thing to do is pick how many batches you want and what sizes they should be. [Note that in real code, these values should be in a property file, not hard coded. That way, you can experiment and change the batch sizes at runtime.] Regardless, of the actual batch sizes, you need a single batch - a batch of size 1. This is so you can do queries if someone only asks for one value or the "left over" values in a big query. For the batch sizes, relatively prime numbers work well. In other words, the sizes should not cleanly divide into each other or be cleanly divisible by the same number. This maximizes the number of requests that will have the fewest server roundtrips. The number of batch sizes and exact sizes vary based on your configuration. Note that the large batch size should not be too big or you will run into memory issues. Also note that the smallest batch size should be very small. You will be using this one a lot since most queries eventually get down to this point.

while ( totalNumberOfValuesLeftToBatch > 0 ) {

Everything that follows is done repeatedly until we run out of values.

int batchSize = SINGLE_BATCH;
if ( totalNumberOfValuesLeftToBatch >= LARGE_BATCH ) {
  batchSize = LARGE_BATCH;
} else if ( totalNumberOfValuesLeftToBatch >= MEDIUM_BATCH ) {
  batchSize = MEDIUM_BATCH;
} else if ( totalNumberOfValuesLeftToBatch >= SMALL_BATCH ) {
  batchSize = SMALL_BATCH;
}
totalNumberOfValuesLeftToBatch -= batchSize; 

The idea here is to find the largest batch size possible that is bigger than the number of values we are trying to query. If none of the batches are big enough, we use the single batch size of 1. As an example of how this works, suppose I do a query for 75 values. First I do a query for 51 elements. Now there are 24 values left to query. Then I do a query for 11 elements. Now there are 13 values left to query. Since there are still more than 11 elements left, I do another query for 11 elements. Now there only 2 values left to query. This is less than my smallest batch size of 4, so I do two single queries for 1 value each. So I have made a total of 5 roundtrips and used three prepared statements in the cache. A significant improvement over the 75 queries of doing this individually! Finally, we need to keep track of how many values are left to query.

StringBuilder inClause = new StringBuilder();
boolean firstValue = true;
for (int i=0; i < batchSize; i++) {
  inClause.append('?');
  if ( firstValue ) {
    firstValue = false;
  } else {
    inClause.append(',');
  }
}
PreparedStatement stmt = conn.prepareStatement(
    "select id, name from users where id in (" + inClause.toString() + ')');

Now we build the actual prepared statement. Since we always build the query in the same way, the driver will notice the SQL is identical on later executions of the same batch size. [Note: If you are not on Java 5 yet, replace StringBuilder with StringBuffer so this compiles.] It is important to return the id we are querying on so that we can later determine which name goes with which id.

for (int i=0; i < batchSize; i++) {
  stmt.setInt(i);  // or whatever values you are trying to query by
}

Set the proper number of values to query by. It is OK to include other search criteria in the query. Just put these parameters before or after the in clause parameters. In this case you will want to keep track of the current index.

From this point, you just execute the query and get the result normally. On the first try, you should expect to see a performance improvement over options 1 and 2. It will take some tweaking to optimize the batch sizes to the best ones for your situation.

Disclaimer: As noted in the famous quote "premature optimization is the root of all evil", select batching should be used to solve a performance problem. This is especially important here as you need the original numbers to confirm that you are actually getting a benefit.