I just noticed a strange problem when reading a csv file with pandas read_csv.

When I open my file in an editor the header looks like this (there are a lot of columns so I skip most of them by adding ... here):


When I load it with df = pd.read_csv("/path/to/csv-file.csv")

and then check the columns with print(df.columns.tolist())

I suddenly get this output:

['tag_identifier', 'a', 'article', 'aside', 'b', 'body', 'br', 'button' ..., 'title', 'tspan', 'ul', 'use', 'a.1', 'article.1', 'aside.1', 'b.1', 'body.1', 'br.1', 'button.1', ... ]

As you can see, the column names that correspond to an HTML tag appear twice, the second time with .1 appended.

For example, the body tag shows up again as body.1.

So I end up with two columns, body and body.1, which I can access via df["body"] and df["body.1"].

Even stranger, this only happens to the HTML-tag column names. All other column names are unaffected.

Does anybody have an idea what could cause this?

Answers:

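This means the header of your CSV file contains duplicate column names. read_csv makes repeated header names unique by keeping the first occurrence as-is and appending .1, .2, and so on to each repeat. That is presumably also why only the HTML-tag columns are affected: they are the names that occur more than once in the header. Rename them in the file, or, if the columns really are duplicates, remove them from the data.
Alternatively, you can filter them out after loading with pandas, for example by keeping only the first occurrence of each header name.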
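
To check whether this is what is happening, something like the following might help. It is only a sketch: the inline CSV text is a made-up miniature standing in for your file (your real header is not shown in the question), and the variable names are arbitrary. It reads the raw header with the csv module to find the repeated names, then keeps only the first occurrence of each column.

```python
import csv
import io
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the real file: "body" and "br" occur twice.
raw = "tag_identifier,body,br,title,body,br\nx,1,2,3,4,5\n"

# read_csv renames the repeated header names by appending .1
df = pd.read_csv(io.StringIO(raw))
print(df.columns.tolist())

# Read the raw header line directly to see which names are repeated.
header = next(csv.reader(io.StringIO(raw)))
duplicates = [name for name, count in Counter(header).items() if count > 1]
print(duplicates)

# Positions of the first occurrence of each header name.
first_positions = [i for i, name in enumerate(header) if name not in header[:i]]

# Keep only those columns; the renamed repeats (body.1, br.1) are dropped.
deduped = df.iloc[:, first_positions]
print(deduped.columns.tolist())
```

For a real file you would replace io.StringIO(raw) with the opened file or its path. Whether dropping the repeats is right depends on whether the two columns actually hold the same data, so it is worth comparing, say, df["body"] and df["body.1"] before discarding anything.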