I just noticed a strange problem when reading a csv file with pandas read_csv.
When I open my file in an editor the header looks like this (there are a lot of columns so I skip most of them by adding
When I now do a
df = pd.read_csv("/path/to/csv-file.csv")
and then I check the columns like this:
I suddenly get this output:
['tag_identifier', 'a', 'article', 'aside', 'b', 'body', 'br', 'button' ..., 'title', 'tspan', 'ul', 'use', 'a.1', 'article.1', 'aside.1', 'b.1', 'body.1', 'br.1', 'button.1', ... ]
As you can see, the column names that correspond to an HTML tag are duplicated, with .1 appended to the copy.
For example, the body tag is copied and the copy is named body.1, so I now end up with two columns, body and body.1, both of which I can access.
Even stranger, this only happens to the HTML-tag column names. All other column names are unaffected.
Does anybody have an idea what could cause this issue?
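This means your file has duplicate column names. When a header name repeats, read_csv keeps the first occurrence unchanged and appends .1, .2, and so on to each repeat so that every column label stays unique. Rename the columns in the file, or, if they really are duplicates, remove them from the data.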
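To confirm that the header really contains repeats, you can read just the header row and count the names. This is only a minimal sketch: the in-memory CSV below is a made-up miniature that stands in for the real file, and the variable names are my own.

```python
import csv
import io
from collections import Counter

import pandas as pd

# Hypothetical miniature of the file: "a" and "body" each appear twice.
raw = "tag_identifier,a,body,a,body\nx,1,2,3,4\n"

# Read only the header row, exactly as it is written in the file.
header = next(csv.reader(io.StringIO(raw)))

# Names that occur more than once are the ones pandas will rename.
duplicates = [name for name, count in Counter(header).items() if count > 1]
print(duplicates)

# pandas keeps the first occurrence and appends .1 to the repeat.
df = pd.read_csv(io.StringIO(raw))
print(list(df.columns))
```

For a file on disk, the same check should work by opening the file and passing the file object to csv.reader instead of the StringIO wrapper.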
Alternatively, you can filter them out after loading, using pandas itself.
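One possible way to do that filtering is sketched below. It assumes you want to keep only the first occurrence of each header name and that the repeated columns carry no data you need. The sample CSV is again made up for illustration.

```python
import csv
import io

import pandas as pd

# Hypothetical miniature of the file with repeated "a" and "body" headers.
raw = "tag_identifier,a,body,a,body\nx,1,2,3,4\ny,5,6,7,8\n"

# Take the header as written in the file, before pandas renames anything.
header = next(csv.reader(io.StringIO(raw)))

# Record the position of the first occurrence of every column name.
seen = set()
keep = []
for position, name in enumerate(header):
    if name not in seen:
        seen.add(name)
        keep.append(position)

# Load everything, then select the first-occurrence columns by position.
df = pd.read_csv(io.StringIO(raw))
deduped = df.iloc[:, keep]
print(list(deduped.columns))
```

Selecting by position rather than by a ".1" suffix avoids dropping a column whose real name happens to end in a dot and a digit. If the repeated columns hold different values, renaming them is the safer choice than dropping them.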