Loading Data Faster: Testing Different Options

I will try to explain here at some of the Oracle features and techniques that can be employed to speed up data loads from flat files into Oracle tables. To demonstrate and compare data loading methods, I loaded a sample data set using each of the techniques that Oracle provides. I compared elapsed time, CPU time, and caveats of each method, but please note that timings can vary from one run to the next and all results depends on DB Activity at that time.

The Sample Data Set

To demonstrate various techniques for loading flat file data into Oracle tables, I worked with a flat file that contains 100,000 records of data into a call center table. The file is approximately 6 Mb in size, and a few sample records in the file look like this:

        82302284384,04-18-2003:13:18:58,5001,COMPLAINT,SYS TICKET 183487923
82302284385,04-18-2003:13:18:59,3352,INFO-REQUEST,PAYMENT ADDRESS
82302284386,04-18-2003:13:19:01,3142,DOC-REQUEST,2002 YE CONSOL STMT

The CALLS table looks like this:

        Name         Null?     Type          Comment
------------ --------- ------------- -----------------
CALL_ID NOT NULL NUMBER Primary key
CALL_DATE NOT NULL DATE Non-unique index
EMP_ID NOT NULL NUMBER
CALL_TYPE NOT NULL VARCHAR2(12)
DETAILS VARCHAR2(25)

1- Single Row Inserts

About the simplest way for an application to load data into Oracle is to insert one row at a time using the INSERT statement with the VALUES clause. This technique is convenient for loading small amounts of data, such as a SQL*Plus script that creates seed data at the time of application installation. However, this approach is quite slow when compared to more sophisticated techniques, and therefore is not appropriate for loading large volumes of data into Oracle.
Depending on the application and data volume involved, this most basic method for loading data into Oracle might give acceptable performance. If that is the case, then there is no need to over-engineer an unnecessarily complicated load program. However, if single row inserts are not fast enough, then there are more sophisticated data load methods to consider.
The chief advantages of the single row insert method are simplicity and flexibility. The key disadvantage of this technique is that it is the slowest and most CPU-resource intensive.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
Single row inserts 172 52


2-Single Row Inserts Without Indexes

When a row is inserted into a table, Oracle needs to update all indexes on the table at the same time. It is typically faster for Oracle to build an index from scratch after the table is populated rather than repeatedly update an index as rows are inserted. It is often true that the index will be more compact and efficient as well. These rules of thumb are especially true when bitmap indexes are involved.
From this, it follows that one way to speed up data loads would be to drop the indexes before loading the data and recreate the indexes after the load is complete. For primary and unique key constraints, you can get the same effect by disabling or dropping the constraint. (Be careful of the impact this might have on foreign key constraints in other tables.)
This technique might not be appropriate if you are loading data into a table that already has a lot of rows in it. Consider a table with 99 million rows and you wish to load 1 million more rows. The time saved in loading the 1 million new rows will be lost when Oracle has to rebuild an index on the 99 million existing rows plus the 1 million new rows.
Speeding up index creation is another issue to consider. Briefly, you can reduce the amount of sorting that must take place on disk by increasing the sort_area_size in your session (if you are not using Oracle 9i PGA aggregate targets). This allows your session to perform more of the sorting during index creation in memory. Also, you can use the NOLOGGING keyword to reduce the amount of redo Oracle generates during the index build. NOLOGGING has significant impacts on recoverability and standby databases, so do your homework before using the NOLOGGING keyword.
I modified the application used in the last section to disable the primary key on the CALLS table and drop its one non-unique index before loading the data, putting both back after the load was complete. In this example, the CALLS table was empty before the data load. Factoring in the amount of time required to recreate the two indexes, elapsed time for the load dropped from 172 seconds to 130 seconds. CPU time used by the database server process dropped from 52 seconds to 35 seconds.
Dropping and rebuilding indexes before and after a data load can speed up the load and yield more efficient indexes. Some drawbacks include the added complexity and potential embedding of schema design information into the application code. (When you add another index to the table being loaded, will you have to update your application code?) Dropping indexes before a load could also have significant performance impacts if users need to be able to query the target table while the load is taking place. Finally, dropping or disabling primary or unique key constraints could cause difficulties if foreign key constraints reference them.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
Single row inserts without indexes
(time includes index rebuild using
NOLOGGING and large sort area)
130 35


3-Bulk Inserts Without Indexes

I modified the application used in the last example to load call center records into arrays in memory and perform bulk inserts 100 rows at a time instead of one row at a time. I retained the logic that dropped the indexes and rebuilt them after the data load. Elapsed time to load the data dropped from 130 seconds to 14 seconds, while CPU time used by the database server process dropped from 35 seconds to 7 seconds. Of course, the amount of time required to rebuild the indexes was not affected by the use of bulk inserts. The actual time to load the data (setting aside the index rebuild time) dropped by about 95%.
A clear advantage of the bulk insert technique is that load time drops dramatically, as does CPU usage. One disadvantage of this technique is slightly increased complexity in application coding and a greater opportunity for bugs. This disadvantage is very modest.
We apply this technique for First Data and Bank of America.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
Bulk row inserts 100 rows at a time
without indexes (time includes
index rebuild using NOLOGGING
and large sort area)
14 7


4-CREATE as SELECT Using an Oracle 9i External Table

An external table looks somewhat like a regular table in that it has columns with standard Oracle data types and you can query it just like any other table. However, the table data is not stored inside the database. Instead, the external table is associated with a flat file on the database server. Whenever you query the external table, Oracle parses the flat file and returns the rows to you as if the data were stored in Oracle.
We apply this technique for Bank of America and BankLink.
You can join external tables to other tables in your queries, but you cannot index external tables nor can you insert, update, or delete rows. If the flat file on the database server changes or is deleted, this will affect the outcome of future queries against the external table.
External tables open the door to new strategies for loading data into Oracle. We can put the flat file on the database server and create an external table in the database. At that time we can instantly query the data as if it had been loaded into a conventional Oracle table. If the flat file gets removed or overwritten, the old contents will no longer be accessible in the database. Probably we can implement this for issuedata and images files?
However, we can (if needed) use an external table as a vehicle to copy data quickly from a flat file into a conventional Oracle table.

I wrote a SQL*Plus script to drop the CALLS table, recreate it from the CALLS_EXTERNAL table as shown above, and add the primary key and non-unique index. Populating the CALLS table in this way took 15 seconds, using 8 seconds of CPU time in the database server process. This case was slightly slower than the bulk insert method discussed in the last section.
One advantage of this technique is that it achieves very good performance with very little coding effort. It is also a method that works entirely within the database, so we can even schedule data loads using the Oracle dbms_job scheduler.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
CREATE as SELECT from an
external table with NOLOGGING
(time includes index build using
NOLOGGING and large sort area)
15 8


5- SQL*Loader Conventional and Direct Path

SQL*Loader is a utility provided by Oracle specifically for the purpose of loading large volumes of data from flat files into Oracle tables. It is very powerful, flexible, and fast. However, we must launch it from the operating system. (Invoking SQL*Loader within a dbms_job or a PL/SQL stored procedure is tricky but we can do it.).
SQL*Loader still seems to be about the fastest and most efficient way to get flat file data into Oracle. By default, SQL*Loader uses what it calls "conventional path" loading–bulk inserts, basically. The performance is not phenomenal, and there are faster alternatives. However, with a simple "direct=true" on the command line, we can invoke "direct path" loading. In a direct path load, SQL*Loader writes rows directly into new data blocks above the table's high water mark. This is like inserting with the Append hint as seen in the last section, but it's even better. In a direct path load, SQL*Loader also updates indexes using a very efficient bulk method.
Loading the call center data using SQL*Loader with conventional path load took about 81 seconds. The database server process used about 12 seconds of CPU time. This was with indexes in place and SQL*Loader bulk inserting rows 100 at a time. With a direct path load, SQL*Loader got the data into the database in just 9 seconds, with only 3 CPU seconds used by the database server process. This was with indexes in place.
Direct path SQL*Loader is fast and efficient. It is even fast when indexes on the target table are not dropped before the load. As for disadvantages, direct path loading has implications for recovery akin to the NOLOGGING keyword. Also, indexes on the target table are unusable for a period during the load. This can impact users trying to access the target table while a load is in progress. Also, indexes can be left in an unusable state if the SQL*Loader session were to crash midway through the load.

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
SQL*Loader conventional path
(indexes in place and rows=100)
81 12
SQL*Loader direct path (indexes
in place)
9 3


6 - Merge
The modern equivalent of the Updateable Join View. Gaining in popularity due to its combination of brevity and performance, it is primarily used to INSERT and UPDATE in a single statement. We are using the update-only version here. Note that I have included a FIRST_ROWS hint to force an indexed nested loops plan. This is to keep the playing field level when comparing to the other methods, which also perform primary key lookups on the target table. A Hash join may or may not be faster, that's not the point - I could increase the size of the target TEST table to 500M rows and Hash would be slower for sure.

MERGE /*+ FIRST_ROWS*/ INTO test
USING test2 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk, fill = new.fill;

7 - Parallel DML Merge
This is the MERGE example on steroids. It uses Oracle's Parallel DML capability to spread the load over multiple slave threads

ALTER SESSION ENABLE PARALLEL DML;

MERGE /*+ first_rows parallel(test) parallel(test2) */ INTO test
USING test5 new ON (test.pk = new.pk)
WHEN MATCHED THEN UPDATE SET
        fk = new.fk, fill = new.fill;


Conclusion

There are many different ways to load data into Oracle. Each technique offers its own balance between speed, simplicity, scalability, recoverability, and data availability.To recap, here are all of the timing figures in one place:

Data Loading Method Elapsed Seconds Database Server
CPU Seconds
Single row inserts 172 52
Single row inserts without indexes
(time includes index rebuild using
NOLOGGING and large sort area)
130 35
Bulk row inserts 100 rows at a time
without indexes (time includes
index rebuild using NOLOGGING
and large sort area)
14 7
CREATE as SELECT from an
external table with NOLOGGING
(time includes index build using
NOLOGGING and large sort area)
15 8
SQL*Loader conventional path
(indexes in place and rows=100)
81 12
SQL*Loader direct path (indexes
in place)
9 3

Please keep in mind that I did not even touch on the subject of parallelism in data loads. (Inserts with the Append hint can use parallelism in the Enterprise Edition of the Oracle software).

Other options with a table containing 900M rows:
create table aaaa nologging parallel 4 as select * from bbbb;     --> 1.2 min
insert /* append */ into aaaa select * from bbbb;                     --> 8 min