The Sample Data Set
To demonstrate various techniques for loading flat file data into Oracle tables, I worked with a flat file that contains 100,000 records of data into a call center table. The file is approximately 6 Mb in size, and a few sample records in the file look like this:
82302284384,04-18-2003:13:18:58,5001,COMPLAINT,SYS TICKET 183487923
82302284385,04-18-2003:13:18:59,3352,INFO-REQUEST,PAYMENT ADDRESS
82302284386,04-18-2003:13:19:01,3142,DOC-REQUEST,2002 YE CONSOL STMT
The CALLS table looks like this:
Name Null? Type Comment
------------ --------- ------------- -----------------
CALL_ID NOT NULL NUMBER Primary key
CALL_DATE NOT NULL DATE Non-unique index
EMP_ID NOT NULL NUMBER
CALL_TYPE NOT NULL VARCHAR2(12)
DETAILS VARCHAR2(25)
1- Single Row Inserts
About the simplest way for an application to load data into Oracle
is to insert one row at a time using the INSERT statement with the
VALUES clause. This technique is convenient for loading small amounts
of data, such as a SQL*Plus script that creates seed data at the time
of application installation. However, this approach is quite slow when
compared to more sophisticated techniques, and therefore is not
appropriate for loading large volumes of data into Oracle.
Depending on the application and data volume involved, this most basic
method for loading data into Oracle might give acceptable performance.
If that is the case, then there is no need to over-engineer an
unnecessarily complicated load program. However, if single row inserts
are not fast enough, then there are more sophisticated data load
methods to consider.
The chief advantages of the single row insert method are simplicity and
flexibility. The key disadvantage of this technique is that it is the
slowest and most CPU-resource intensive.
Data Loading Method | Elapsed Seconds | Database Server CPU Seconds |
---|---|---|
Single row inserts | 172 | 52 |
2-Single Row Inserts Without
Indexes
When a row is inserted into a table, Oracle needs to update all
indexes on the table at the same time. It is typically faster for
Oracle to build an index from scratch after the table is populated
rather than repeatedly update an index as rows are inserted. It is
often true that the index will be more compact and efficient as well.
These rules of thumb are especially true when bitmap indexes are
involved.
From this, it follows that one way to speed up data loads would be to
drop the indexes before loading the data and recreate the indexes after
the load is complete. For primary and unique key constraints, you can
get the same effect by disabling or dropping the constraint. (Be
careful of the impact this might have on foreign key constraints in
other tables.)
This technique might not be appropriate if you are loading data into a
table that already has a lot of rows in it. Consider a table with 99
million rows and you wish to load 1 million more rows. The time saved
in loading the 1 million new rows will be lost when Oracle has to
rebuild an index on the 99 million existing rows plus the 1 million new
rows.
Speeding up index creation is another issue to consider. Briefly, you
can reduce the amount of sorting that must take place on disk by
increasing the sort_area_size in your session (if you are not using
Oracle 9i PGA aggregate targets). This allows your session to perform
more of the sorting during index creation in memory. Also, you can use
the NOLOGGING keyword to reduce the amount of redo Oracle generates
during the index build. NOLOGGING has significant impacts on
recoverability and standby databases, so do your homework before using
the NOLOGGING keyword.
I modified the application used in the last section to disable the
primary key on the CALLS table and drop its one non-unique index before
loading the data, putting both back after the load was complete. In
this example, the CALLS table was empty before the data load. Factoring
in the amount of time required to recreate the two indexes, elapsed
time for the load dropped from 172 seconds to 130 seconds. CPU time
used by the database server process dropped from 52 seconds to 35
seconds.
Dropping and rebuilding indexes before and after a data load can speed
up the load and yield more efficient indexes. Some drawbacks include
the added complexity and potential embedding of schema design
information into the application code. (When you add another index to
the table being loaded, will you have to update your application code?)
Dropping indexes before a load could also have significant performance
impacts if users need to be able to query the target table while the
load is taking place. Finally, dropping or disabling primary or unique
key constraints could cause difficulties if foreign key constraints
reference them.
Data Loading Method | Elapsed Seconds | Database Server CPU Seconds |
---|---|---|
Single row inserts without indexes (time includes index rebuild using NOLOGGING and large sort area) |
130 | 35 |
3-Bulk Inserts Without Indexes
Data Loading Method | Elapsed Seconds | Database Server CPU Seconds |
---|---|---|
Bulk row inserts 100 rows at a time without indexes (time includes index rebuild using NOLOGGING and large sort area) |
14 | 7 |
4-CREATE as SELECT Using an
Oracle 9i External Table
An external table looks somewhat like a regular table in that it has
columns with standard Oracle data types and you can query it just like
any other table. However, the table data is not stored inside the
database. Instead, the external table is associated with a flat file on
the database server. Whenever you query the external table, Oracle
parses the flat file and returns the rows to you as if the data were
stored in Oracle.
We apply this technique for Bank of America and BankLink.
You can join external tables to other tables in your queries, but you
cannot index external tables nor can you insert, update, or delete
rows. If the flat file on the database server changes or is deleted,
this will affect the outcome of future queries against the external
table.
External tables open the door to new strategies for loading data into
Oracle. We can put the flat file on the database server and create an
external table in the database. At that time we can instantly query the
data as if it had been loaded into a conventional Oracle table. If the
flat file gets removed or overwritten, the old contents will no longer
be accessible in the database. Probably we can implement this for
issuedata and images files?
However, we can (if needed) use an external table as a vehicle to copy
data quickly from a flat file into a conventional Oracle table.
I wrote a SQL*Plus script to drop the CALLS table, recreate it from
the CALLS_EXTERNAL table as shown above, and add the primary key and
non-unique index. Populating the CALLS table in this way took 15
seconds, using 8 seconds of CPU time in the database server process.
This case was slightly slower than the bulk insert method discussed in
the last section.
One advantage of this technique is that it achieves very good
performance with very little coding effort. It is also a method that
works entirely within the database, so we can even schedule data loads
using the Oracle dbms_job scheduler.
Data Loading Method | Elapsed Seconds | Database Server CPU Seconds |
---|---|---|
CREATE as SELECT from an external table with NOLOGGING (time includes index build using NOLOGGING and large sort area) |
15 | 8 |
5- SQL*Loader Conventional
and Direct Path
SQL*Loader is a utility provided by Oracle specifically for the
purpose of loading large volumes of data from flat files into Oracle
tables. It is very powerful, flexible, and fast. However, we must
launch it from the operating system. (Invoking SQL*Loader within a
dbms_job or a PL/SQL stored procedure is tricky but we can do it.).
SQL*Loader still seems to be about the fastest and most efficient way
to get flat file data into Oracle. By default, SQL*Loader uses what it
calls "conventional path" loading–bulk inserts, basically. The
performance is not phenomenal, and there are faster alternatives.
However, with a simple "direct=true" on the command line, we can invoke
"direct path" loading. In a direct path load, SQL*Loader writes rows
directly into new data blocks above the table's high water mark. This
is like inserting with the Append hint as seen in the last section, but
it's even better. In a direct path load, SQL*Loader also updates
indexes using a very efficient bulk method.
Loading the call center data using SQL*Loader with conventional path
load took about 81 seconds. The database server process used about 12
seconds of CPU time. This was with indexes in place and SQL*Loader bulk
inserting rows 100 at a time. With a direct path load, SQL*Loader got
the data into the database in just 9 seconds, with only 3 CPU seconds
used by the database server process. This was with indexes in place.
Direct path SQL*Loader is fast and efficient. It is even fast when
indexes on the target table are not dropped before the load. As for
disadvantages, direct path loading has implications for recovery akin
to the NOLOGGING keyword. Also, indexes on the target table are
unusable for a period during the load. This can impact users trying to
access the target table while a load is in progress. Also, indexes can
be left in an unusable state if the SQL*Loader session were to crash
midway through the load.
Data Loading Method | Elapsed Seconds | Database Server CPU Seconds |
---|---|---|
SQL*Loader conventional path (indexes in place and rows=100) |
81 | 12 |
SQL*Loader direct path (indexes in place) |
9 | 3 |
6 - Merge
The modern equivalent of the Updateable Join View.
Gaining in popularity due to its combination of brevity and
performance, it is primarily used to INSERT and UPDATE in a single
statement. We are using the update-only version here. Note that I have
included a FIRST_ROWS hint to force an indexed nested loops plan. This
is to keep the playing field level when comparing to the other methods,
which also perform primary key lookups on the target table. A Hash join
may or may not be faster, that's not the point - I could increase the
size of the target TEST table to 500M rows and Hash would be slower for
sure.
MERGE /*+ FIRST_ROWS*/ INTO test
USING test2 new ON (test.pk =
new.pk)
WHEN MATCHED THEN UPDATE SET
fk = new.fk, fill = new.fill;
7 - Parallel DML Merge
This is the MERGE example on steroids. It uses
Oracle's Parallel DML capability to spread the load over multiple slave
threads
ALTER SESSION ENABLE PARALLEL
DML;
MERGE /*+ first_rows
parallel(test) parallel(test2) */ INTO test
USING test5 new ON (test.pk =
new.pk)
WHEN MATCHED THEN UPDATE SET
fk = new.fk, fill = new.fill;
Conclusion
There are many different ways to load data into Oracle. Each technique offers its own balance between speed, simplicity, scalability, recoverability, and data availability.To recap, here are all of the timing figures in one place:
Data Loading Method | Elapsed Seconds | Database Server CPU Seconds |
---|---|---|
Single row inserts | 172 | 52 |
Single row inserts without indexes (time includes index rebuild using NOLOGGING and large sort area) |
130 | 35 |
Bulk row inserts 100 rows at a time without indexes (time includes index rebuild using NOLOGGING and large sort area) |
14 | 7 |
CREATE as SELECT from an external table with NOLOGGING (time includes index build using NOLOGGING and large sort area) |
15 | 8 |
SQL*Loader conventional path (indexes in place and rows=100) |
81 | 12 |
SQL*Loader direct path (indexes in place) |
9 | 3 |
Please keep in mind that I did not even touch on the subject of
parallelism in data loads. (Inserts with the Append hint can use
parallelism in the Enterprise Edition of the Oracle software).
Other
options with a table containing 900M rows:
create table aaaa nologging parallel 4 as select * from
bbbb; --> 1.2 min
insert /* append */ into aaaa select * from bbbb;
--> 8 min