Last post Nov 04, 2010 08:31 AM by swpulitzer
Nov 03, 2010 04:17 PM|swpulitzer|LINK
I'm hoping to get some insight on best practices. I thought it would be easy to find existing info in various forums but so far haven't found what I need.
I have a project in which I need to process all the rows of data in a large dataset one by one. The dataset is too large to load the whole thing into memory. I'm using C# and MySQL to do this. The challenge is that I need to iterate through the records,
load the data from each record into some local variables, calculate some stuff, and then update the record in the db. For the most part I'm populating empty columns in the record although I do occassionally change existing values.
Pseudocode would look something like this:
Connect to DB
Load table (using a memory-friendly fetch size)
Calculate the missing columns
Update the row
This code was originally written in Java and I was able to use the updateRow() method of the JDBC's ResultSet class. From what I've learned of ADO so far, it looks like there is no equivalent functionality.
Because the datasets are large, performance is a consideration.
Nov 03, 2010 08:42 PM|Lannie|LINK
Loops and database updates with periodic commits to avoid memory overload is PAINFULLY SLOOOWWWWWWW.
I would rename the old table, re-create a new table using NOLOGGING and compression
doing the math in SQL statement leveraged by SQL Analytics and functions with joins to any lookup data as required,
and this will run dramatically faster.
My small tables are 1 million rows and larger tables hundreds of million rows.
I gave up doing your method years ago. Too slow.
Nov 04, 2010 08:31 AM|swpulitzer|LINK
I agree with your points. For this application, though, the logic is pretty complex and the advantages of using an OOP approach (i.e. not implementing the logic in SQL) easily outweigh the performance limitations. The original java implementation I have
runs plenty fast enough. It seems like there should be a simple way to replicate that functionality in ADO but I can't find it.
One thing I've explored is building a stored proc for the update statements (which would effectively replace the java updateRow() method). But I'm not sure how to execute given that I'm using the connection to read the data. One (admittedly) crazy idea I've
had is to have separate read and update connections but I don't know enough about how this is implemented on the SQL side to tell if it will work.
Another crazy idea (I'm full of them) is to dump the modificed rows to a tmp text file and then reimport.
Before investing time in either of these I thought it might be good to see how other people have dealt with the issue.