
html – Ruby 2.0 CSV reader treating Microsoft Excel generated CSV files differently and not stripping control characters

Posted by: admin May 14, 2020

Questions:

PROBLEM: Ruby 2.0 CSV reader on Mac Mavericks treats Microsoft Excel generated CSV files that have embedded HTML differently. Works fine on Ruby 1.8 with FasterCSV.

I just upgraded my Mac to Mavericks (OS X 10.9.4) and also upgraded Ruby to 2.0.0p451. (I used to use Ruby 1.8+ with the FasterCSV gem, but now use Ruby 2.0+ with its native CSV.)

Ruby Version:

ruby -v
ruby 2.0.0p451 (2014-02-24 revision 45167) [universal.x86_64-darwin13]

The CSV file is generated from Office 2011, saved from an original “.xlsx” file.

The following HTML is contained in a single cell of the Microsoft .xlsx file BEFORE it is saved as CSV…

<h1 style="text-align:center; font: bold 1.5em Arial;">This is the Title</h1>
<p style="text-align:center;"><img style="width:300px; height:100px" src="./IMAGES/MAIN/image1.png" alt="Image 1"/></p>
<p style="text-align:center;">This is a sentence.</p>

There are other cells, that also have HTML code embedded.

To reproduce…

  1. Open an Excel worksheet
  2. Copy the above HTML into cell A1, making sure there is a Mac carriage return (control+command+return) between HTML constructs (e.g. between the end of the “h1” construct and the start of a new “p” construct), so that every complete HTML construct sits on its own line inside the Excel cell.
  3. Copy the contents of cell A1 to cell A2, directly below it, to ensure multiple CSV rows (your file will have two formal CSV rows).
  4. First save the file as an xlsx file (e.g. “file.xlsx”)
  5. Then save the worksheet as a CSV file (e.g. “file.csv”).

You will now have an Excel-generated CSV file with two formal CSV rows, where each row contains multiple HTML constructs separated by line breaks.

Reading the CSV File…

I use the following code to read the CSV file and print the contents of each cell, both before and after I try to strip control characters…

require 'csv'

arrayOfHtmlConstructs = CSV.read("file.csv")
arrayOfHtmlConstructs.each_with_index do | construct, i|
  output = "" << construct.to_s
  puts "BEFORE: " << output
  output = output.gsub(/\r/, "") # Replace Microsoft carriage returns FAILS!
  output = output.gsub(/\\"/, "\"") # Replace escaped quotes with quotes WORKS FINE!
  output = output.gsub(/\[\"/, "") # Remove prefix [" WORKS FINE!
  output = output.gsub(/\"\]/, "") # Remove suffix "]  WORKS FINE!
  puts "AFTER: " << output
end

Before trying to strip code, the CSV string “output” looks as follows…

BEFORE: ["<h1 style=\"text-align:center; font: bold 1.5em Arial;\">This is the Title</h1>\r<p style=\"text-align:center;\"><img style=\"width:300px; height:100px\" src=\"./IMAGES/MAIN/image1.png\" alt=\"Image 1\"/></p>\r<p style=\"text-align:center;\">This is a sentence.</p>"]

You’ll notice that it includes [" at the beginning and "] at the end, along with escaped quotes and embedded carriage returns (\r).

PROBLEM: All of the gsub statements work except for the one that tries to replace all carriage returns with blanks.

After running the Ruby script, the string “output” looks as follows, where everything gets substituted properly, except for the carriage returns…

AFTER: <h1 style="text-align:center; font: bold 1.5em Arial;">This is the Title</h1>\r<p style="text-align:center;"><img style="width:300px; height:100px" src="./IMAGES/MAIN/image1.png" alt="Image 1"/></p>\r<p style="text-align:center;">This is a sentence.</p>

For some reason, the carriage returns are NOT being replaced/substituted.

Also, before I upgraded to Ruby 2.0, I used to use FasterCSV and none of the substitution statements were needed. Everything just worked.

Any thoughts as to why this is all happening and how to properly handle it? Any assistance is greatly appreciated.

Answers:

The scope of my answer has changed so I’ve edited down to just the RegEx as that seems to be more on topic.

I’ve updated my expression to cover all of your substitutions; simply replace your loop with this block of code:

arrayOfHtmlConstructs.each_with_index do | construct, i|
  output = "" << construct.to_s
  puts "BEFORE: " << output
  output = output.gsub(/\\"/, "\"") # Replace escaped quotes with quotes WORKS FINE!
  output = output.gsub(/(\\r|\[|\])/, "") # Remove the literal \r text and the surrounding brackets
  puts "AFTER: " << output
end
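
A possible explanation for the behaviour in the question, offered as a sketch rather than a confirmed diagnosis: since Ruby 1.9, Array#to_s returns the same text as Array#inspect, so construct.to_s produces a string containing a backslash followed by the letter r, not a carriage-return character, and /\r/ has nothing to match. In Ruby 1.8, Array#to_s joined the elements instead, which would explain why no substitutions were needed with FasterCSV. The example below works on each cell string directly. The inline sample_csv string is a hypothetical stand-in for file.csv, and the explicit row_sep is an assumption made so the sample parses predictably.

```ruby
require 'csv'

# Hypothetical stand-in for file.csv: two rows, one quoted cell each,
# with a bare carriage return inside every cell.
sample_csv = %Q{"<h1>Title</h1>\r<p class=""x"">One</p>"\n"<h1>Title</h1>\r<p>Two</p>"\n}

# row_sep is given explicitly so the carriage returns inside the cells
# are not mistaken for row separators by auto-detection.
rows = CSV.parse(sample_csv, row_sep: "\n")

# Array#to_s is the inspect form: it holds a backslash and an "r",
# not a carriage return, so /\r/ matches nothing here.
inspected = rows.first.to_s
unchanged = inspected.gsub(/\r/, "")

# On the cell strings themselves the carriage returns are real characters.
cleaned = rows.map { |row| row.map { |cell| cell.to_s.gsub(/\r/, "") } }

puts cleaned.first.first
```

With this approach the bracket and escaped-quote substitutions are not needed, because they only undo what to_s added.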

Answer:

Try this:

@csv = CSV.read(params[:file].path, headers: true, skip_blanks: true, encoding:'windows-1256:utf-8')

You need to specify the encoding that Microsoft Excel used when it wrote the CSV file.
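
As a rough illustration of what the encoding option does, the sketch below writes a small file in a single-byte Windows code page and reads it back as UTF-8. The "external:internal" form asks Ruby to transcode while reading. The sample data, the temporary file, and the choice of windows-1252 are all assumptions made for the example; the code page that actually applies depends on how Excel saved your file.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical sample: a header row and one data row, stored as
# Windows-1252 bytes ("\xE9" is "é" in that code page).
bytes = "name,note\ncaf\xE9,ok\n".b

table = nil
Tempfile.create(['sample', '.csv']) do |f|
  f.binmode
  f.write(bytes)
  f.flush
  # Read as windows-1252 on disk and transcode to UTF-8 in memory.
  table = CSV.read(f.path, headers: true, skip_blanks: true,
                   encoding: 'windows-1252:utf-8')
end

first_name = table[0]['name']
puts first_name
puts first_name.encoding
```

If the declared external encoding does not match the file, the text may come back garbled, so it is worth checking a cell that contains a non-ASCII character.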