I am currently working on a project that traverses an Excel document and inserts data into a database using C#.
The relevant data for this project is:
- The excel sheet has 14 rows at the top that I do not care about. (sometimes 15, see Russia/Siberia below)
- The data is grouped by name into 2 columns (date and value), such as:
  USA               China             Russia
  Date     Value    Date     Value    Siberia
  1/1/09   4.3654   1/1/09   2.7456   Date     Value
  1/2/09   3.5545   1/3/09   9.3214   2/5/09   0.2454
  1/3/09   3.2322   1/21/09  5.2234   2/6/09   0.5557
- The name I need to acquire is whichever is listed directly above “Date”.
- I only care about data from dates we do not have in the database. Before each column set is parsed, I will acquire the max date for any given name from the database, and skip anything at or before it.
- There is no guarantee that the columns will be in a constant order or have constant spacing.
- I do not want data for all names, rather only those in a list I put together before the file is acquired.
My current plan is this:
- For each column, if the date field is at row 16, save the name as the value in row 15 above it, check the database for the last date for that name, only insert data where the date is greater than the acquired date.
- If the date field is at row 17, do the same thing, but start the for loop through each row at 18.
- If the name is not in the list, skip the column. If it is, make sure to grab the column next to it for the necessary values.
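A minimal sketch of this plan, assuming the sheet has already been loaded into an in-memory grid of strings (how to load it is the open question). The names ColumnScanner, Scan, wanted, and lastDates are hypothetical; lastDates stands in for the max-date lookup against the database. Rows are 0-based here, so Excel row 16 is index 15.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class ColumnScanner
{
    // Returns (name, date, value) for every data point newer than the stored max date.
    public static List<(string Name, DateTime Date, double Value)> Scan(
        string[,] grid,
        ISet<string> wanted,
        IDictionary<string, DateTime> lastDates)
    {
        var results = new List<(string, DateTime, double)>();
        var culture = CultureInfo.GetCultureInfo("en-US"); // assumption: dates look like 1/21/09
        int rows = grid.GetLength(0), cols = grid.GetLength(1);

        // Stop one column early: the value column is always c + 1.
        for (int c = 0; c < cols - 1; c++)
        {
            // "Date" sits at Excel row 16 or 17, i.e. index 15 or 16.
            for (int header = 15; header <= 16 && header < rows; header++)
            {
                if (grid[header, c] != "Date") continue;

                // The name is whatever sits directly above "Date".
                string name = grid[header - 1, c];
                if (name == null || !wanted.Contains(name)) break;

                DateTime last = lastDates.TryGetValue(name, out var d) ? d : DateTime.MinValue;

                for (int r = header + 1; r < rows; r++)
                {
                    if (!DateTime.TryParse(grid[r, c], culture, DateTimeStyles.None, out var date))
                        continue; // blank or non-date cell
                    if (date <= last) continue; // already in the database
                    if (double.TryParse(grid[r, c + 1], NumberStyles.Float,
                                        CultureInfo.InvariantCulture, out var value))
                        results.Add((name, date, value));
                }
                break; // header found for this column, move on to the next one
            }
        }
        return results;
    }
}
```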
My problem is:
- I am currently trying to use the ExcelDataReader from Codeplex (http://www.codeplex.com/ExcelDataReader). It only handles CSV-like sheets, which this one is not.
- I do not know of any alternative Excel readers.
- To the best of my knowledge, a straight FileStream traversal of this file can only go row-by-row, rather than column-by-column.
To anyone still reading, thank you for your time. Any recommendations on how to proceed? Please ensure that solutions can traverse each column, not each row.
Also, please don’t worry about the database stuff, or the list of names that precedes the traversal.
Addendum: What I’d really like to end up with is some type of table that I can just traverse with a nested loop, making column-centric traversal much, much easier. Because there is so much garbage near the top of the sheet (14+ rows), most simple solutions are not feasible.
If you want to read from Excel in C#, I’ve used this library with great success; it gives you the flexibility to parse columns and rows however you’d like:
- http://sourceforge.net/projects/koogra/ (read-only)
Other open source libraries I haven’t used but that could be good:
- http://nexcel.sourceforge.net/ (read-only)
- http://npoi.codeplex.com/ (can read and write)
- http://developer.novell.com/wiki/index.php/Poi.Net (this project is dead)
Alternatively, you can use one of the many good Java libraries, and convert it into a C# assembly using IKVM:
- http://poi.apache.org/ (this one’s the grand-daddy of Java XLS libraries)
I’ve covered how to do the IKVM Java -> C# conversion here (it’s really not as horrible an option as you think):
I highly recommend saving this Excel document in CSV format before doing anything else with it. You can do so using this code
After you have a CSV, you can either parse it using that library, or write your own parser for it.
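If you go the write-your-own route, a rough sketch might look like the following. It is deliberately naive: it assumes no quoted fields or embedded commas, and the names CsvGrid, Parse, and Column are made up for illustration. Once the rows are in memory, walking a single column from any starting row is straightforward.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class CsvGrid
{
    // Naive parser: one array of cells per line, split on commas.
    // Assumption: no quoted fields, no commas inside values.
    public static string[][] Parse(TextReader reader)
    {
        var rows = new List<string[]>();
        string line;
        while ((line = reader.ReadLine()) != null)
            rows.Add(line.Split(','));
        return rows.ToArray();
    }

    // Yields the cells of one column from top to bottom, starting at firstRow.
    // Rows that are too short to contain the column yield null.
    public static IEnumerable<string> Column(string[][] grid, int column, int firstRow = 0)
    {
        for (int r = firstRow; r < grid.Length; r++)
            yield return column < grid[r].Length ? grid[r][column] : null;
    }
}
```

For the sheet in the question, something like CsvGrid.Column(grid, c, 14) would skip the 14 junk rows for column c.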
Not a straight answer to your question but an alternative idea:
Your data looks like a pivot-ish table. I’d recommend “unpivoting” it into a simple table, going from this:
        Russia  USA
  Q1    123     323
  Q2    456     321
  Q3    567     843
to this:
  Quarter  Country  Value
  Q1       Russia   123
  Q1       USA      323
  Q2       Russia   456
  ....
If that is the case (I’m not sure I understood your question correctly), then processing the data using an OleDB driver or some CSV-based approach should become much less painful.
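To make the idea concrete, here is a small sketch of the unpivot step on an in-memory grid. The layout (an empty corner cell, country names across the top, quarter labels down the side) follows the example table; the names Unpivot and ToLong are invented for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class Unpivot
{
    // wide[0] is the header row: an empty corner cell followed by country names.
    // Every other row is a quarter label followed by one value per country.
    public static List<(string Quarter, string Country, int Value)> ToLong(string[][] wide)
    {
        var rows = new List<(string, string, int)>();
        string[] header = wide[0];
        for (int r = 1; r < wide.Length; r++)
        {
            for (int c = 1; c < header.Length; c++)
            {
                int value = int.Parse(wide[r][c], CultureInfo.InvariantCulture);
                rows.Add((wide[r][0], header[c], value)); // one output row per cell
            }
        }
        return rows;
    }
}
```

Each cell of the wide table becomes one (Quarter, Country, Value) row, which is the shape a plain SQL insert or CSV export expects.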
You can access Excel directly using ADO.NET via the ODBC driver. See http://www.davidhayden.com/blog/dave/archive/2006/05/26/2973.aspx or Google for more info on how to do that. You may wish to try HDR=No in your connection string, since your first row isn’t really proper headers by the looks of it.
I haven’t done this for a while, but I remember that it is a bit “temperamental” and takes some playing around with to get the column names right, but it should work. Try
SELECT * FROM [Sheet1$] and see what you get.
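A rough sketch of how that query might be run from C#. Note that it uses the OLE DB classes in System.Data.OleDb rather than ODBC, and it assumes the Jet 4.0 provider is installed, the file is an .xls workbook, and the sheet is called Sheet1; ExcelLoader, LoadSheet, and Column are names made up for this example. With HDR=No every row comes back as data (columns are named F1, F2, ...), and IMEX=1 asks the driver to treat mixed-type columns as text.

```csharp
using System.Collections.Generic;
using System.Data;
using System.Data.OleDb;

public static class ExcelLoader
{
    // Builds a Jet connection string for an .xls file with no header row.
    public static string BuildConnectionString(string path)
    {
        return "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + path +
               ";Extended Properties=\"Excel 8.0;HDR=No;IMEX=1\"";
    }

    // Loads the whole sheet into a DataTable.
    public static DataTable LoadSheet(string path, string sheetName = "Sheet1")
    {
        var table = new DataTable();
        using (var connection = new OleDbConnection(BuildConnectionString(path)))
        using (var adapter = new OleDbDataAdapter(
                   "SELECT * FROM [" + sheetName + "$]", connection))
        {
            adapter.Fill(table); // Fill opens and closes the connection itself
        }
        return table;
    }

    // Column-centric traversal: yields one column's cells from firstRow down.
    // Empty cells come back as DBNull.Value.
    public static IEnumerable<object> Column(DataTable table, int column, int firstRow)
    {
        for (int r = firstRow; r < table.Rows.Count; r++)
            yield return table.Rows[r][column];
    }
}
```

Once the table is loaded, a nested loop over table.Columns and Column(table, c, 14) gives column-by-column access past the junk rows. Row indices here assume the driver returns the sheet starting at its first row, which is worth checking against the real file.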
As I have done before, I prefer to use an OLEDB connection to connect to an Excel document.
By the way, you can take a look at the following article for more information:
SpreadsheetGear for .NET can load workbooks and access any cells on any sheet in any order. You can get the formatted text of the cell (such as “1/1/09”) or the underlying value (“1/1/09” is stored as the double 39814.0 in Excel or SpreadsheetGear).
Disclaimer: I own SpreadsheetGear LLC
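As a side note on the underlying double mentioned above: whichever reader you use, if it hands back a date serial such as 39814.0, the .NET framework can convert it with DateTime.FromOADate. The sketch below assumes the workbook uses Excel's default 1900 date system and dates after February 1900; ExcelDates, FromSerial, and ToSerial are illustrative names only.

```csharp
using System;

public static class ExcelDates
{
    // Converts an Excel/OLE Automation date serial to a DateTime.
    public static DateTime FromSerial(double serial)
    {
        return DateTime.FromOADate(serial);
    }

    // Converts a DateTime back to the serial form.
    public static double ToSerial(DateTime date)
    {
        return date.ToOADate();
    }
}
```

ExcelDates.FromSerial(39814.0) gives 1 January 2009, which matches the “1/1/09” cell in the question.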