excel – Counting duplicate data which satisfy conditions and remove data

Posted by: admin, April 23, 2020


The sample file has been uploaded to MediaFile.

Background Information

Section 1: In the sample file, “Sheet1”

a.  Values in “Column A” are the original names. For example, Cell A1 contains:
    “>hg19_refGene_NM_000392_0 range=chr10:101542463-101542634 5'pad=0 3'pad=0 strand=+ repeatMasking=none”

b.  Each value in “Column B” corresponds to the value in Column A of the same row. For example,  
    Cell B1, which corresponds to Cell A1, contains “ABCC2”.

Section 2: In the sample file, “Sheet2”

a.  In Sheet2, the values from Sheet1 have been split into separate columns, because  
    in Sheet1 everything is packed into one cell.

b.  Column A represents “GENE”, which refers to the value in Column B in Sheet1, for example,  
    “ABCC2” from Section 1 of this article.

c.  Column B represents “refGENE”. An example is “NM000392”, which comes from the  
    original name in “Sheet1” (“NM_000392” with the underscore removed).

d.  Column C represents “CHROMOSOME”, another value derived from the values in  
    Column A of Sheet1, for example “chr10”.

e.  Similarly, “EXON START” comes from the original name in Column A of Sheet1, for  
    example “101542463”.

f.  “EXON END” also comes from the original name in Column A of Sheet1, for example “101542634”.
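
The split described above can be sketched in Python. This is a minimal sketch, not the method used to build Sheet2: the regular expression assumes every original name follows the layout of the Cell A1 example, and the names `Exon`, `HEADER_RE` and `parse_original_name` are hypothetical.

```python
import re
from typing import NamedTuple


class Exon(NamedTuple):
    """Fields that Sheet2 derives from one original name (hypothetical type)."""
    ref_gene: str
    chromosome: str
    exon_start: int
    exon_end: int


# Assumed layout: ">hg19_refGene_<accession>_<index> range=<chr>:<start>-<end> ..."
HEADER_RE = re.compile(
    r"_refGene_(?P<acc>[A-Z]+_\d+)_\d+\s+"
    r"range=(?P<chrom>chr\w+):(?P<start>\d+)-(?P<end>\d+)"
)


def parse_original_name(name: str) -> Exon:
    """Extract refGENE, CHROMOSOME, EXON START and EXON END from a Sheet1 name."""
    match = HEADER_RE.search(name)
    if match is None:
        raise ValueError(f"unrecognised original name: {name!r}")
    return Exon(
        ref_gene=match["acc"].replace("_", ""),  # "NM_000392" -> "NM000392"
        chromosome=match["chrom"],
        exon_start=int(match["start"]),
        exon_end=int(match["end"]),
    )


sample = (
    ">hg19_refGene_NM_000392_0 range=chr10:101542463-101542634 "
    "5'pad=0 3'pad=0 strand=+ repeatMasking=none"
)
exon = parse_original_name(sample)
print(exon)
```

Names that do not match the assumed layout raise a `ValueError` rather than being skipped silently, so unexpected rows are noticed early.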

The challenge is to develop a program that meets the following requirements:

Requirement 1: For each gene, count the number of times each refGene is observed, e.g.:

Table example:

refGENE     COUNT
NM000927    29
NM00078     32
NM00042     32
...         ...

Note: I currently do this with SUMPRODUCT in Excel, but I don’t know how to put everything into a simple table.
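
Outside Excel, the count in Requirement 1 could be sketched with `collections.Counter`. The rows below are hypothetical stand-ins for the GENE and refGENE columns of Sheet2 (only the ABCC2 / NM000392 pair comes from the question), and `count_ref_genes` and `format_table` are illustrative names.

```python
from collections import Counter

# Hypothetical (GENE, refGENE) pairs standing in for Sheet2 columns A and B.
pairs = [
    ("ABCC2", "NM000392"),
    ("ABCC2", "NM000392"),
    ("GENE_X", "NM000001"),
    ("ABCC2", "NM000392"),
    ("GENE_X", "NM000001"),
]


def count_ref_genes(pairs):
    """Count how often each (GENE, refGENE) combination occurs."""
    return Counter(pairs)


def format_table(counts):
    """Render the counts as a simple fixed-width text table."""
    lines = [f"{'GENE':<10}{'refGENE':<12}COUNT"]
    for (gene, ref_gene), n in sorted(counts.items()):
        lines.append(f"{gene:<10}{ref_gene:<12}{n}")
    return "\n".join(lines)


counts = count_ref_genes(pairs)
print(format_table(counts))
```

Counting on the (GENE, refGENE) pair rather than on refGENE alone keeps the result grouped per gene, as the requirement asks.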

Requirement 2:
This requires comparing values in two different rows. Please note that it must use the original name from “Sheet1”; please don’t use the separated values from “Sheet2”.
Basically, for each row: if GENE, CHROMOSOME, EXON START and EXON END are the same as in another row, remove the rows with the least frequent refGene. I explain further below.

In “Sheet1”, there are two columns: “Original Name” and “GENE”.

Step 1: Compare the values in Column B. For example, rows 1 and 2 both contain ABCC2. This satisfies the condition, so proceed to Step 2; otherwise, continue comparing GENE across other rows.

Step 2: Compare the “chr” values of the two rows. In the same example, row 1 has chr10 and row 2 has chr10. As they are the same, continue to the next step; otherwise, move on.

Step 3: Compare “exon start”. In row 1 the number is 101542463 and in row 2 it is 101544365. These are not the same, so leave the file as is and move on. If the numbers were the same, continue to compare “exon end”, which is Step 4.

Step 4: Assuming the “exon start” values of the two rows are the same, compare “exon end”. In row 1 the number is 101542634 and in row 2 it is 101544538. The same condition applies as above: if they are different, leave the file alone and continue comparing the next GENE.

Here is the part that requires attention. If they are the same, then “GENE”, “chr”, “exon start” and “exon end” all match, which means there is a duplicated row, and duplicated rows must be deleted. Which row gets deleted depends on the counts from Requirement 1. Recall that the number of occurrences was counted for every refGENE, for example 29 times for NM000927 and 32 times for NM00078. The rows to be removed are the ones containing NM000927, the less frequent refGENE.

Please also keep a record of all the deleted data and all the remaining data, preferably as a table.
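
Requirement 2 could be sketched in Python as follows. This is a hedged sketch under several assumptions: original names follow the layout of the Cell A1 example, refGENE frequency is counted per gene, and within a group of matching rows only the row with the most frequent refGENE is kept (the first such row on a tie). The sample rows and the name `split_duplicates` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed layout of the original name, based on the Cell A1 example.
HEADER_RE = re.compile(r"_refGene_([A-Z]+_\d+)_\d+\s+range=(chr\w+):(\d+)-(\d+)")


def split_duplicates(rows):
    """Split (original_name, gene) rows into (kept, deleted) lists."""
    parsed = []
    for name, gene in rows:
        match = HEADER_RE.search(name)
        if match is None:
            raise ValueError(f"unrecognised original name: {name!r}")
        acc, chrom, start, end = match.groups()
        key = (gene, chrom, int(start), int(end))  # Steps 1 to 4 in one key
        parsed.append((gene, acc.replace("_", ""), key))

    # Requirement 1: how often each refGENE occurs for each gene.
    freq = Counter((gene, ref) for gene, ref, _ in parsed)

    # Group row indices that share GENE, chr, exon start and exon end.
    groups = defaultdict(list)
    for idx, (_, _, key) in enumerate(parsed):
        groups[key].append(idx)

    drop = set()
    for members in groups.values():
        if len(members) > 1:
            # Keep the row whose refGENE is most frequent; drop the others.
            best = max(members, key=lambda i: freq[parsed[i][:2]])
            drop.update(i for i in members if i != best)

    kept = [row for i, row in enumerate(rows) if i not in drop]
    deleted = [row for i, row in enumerate(rows) if i in drop]
    return kept, deleted


# Hypothetical rows: the first and third share gene, chr, start and end.
example_rows = [
    (">hg19_refGene_NM_000001_0 range=chr1:100-200 strand=+", "GENE_X"),
    (">hg19_refGene_NM_000001_1 range=chr1:300-400 strand=+", "GENE_X"),
    (">hg19_refGene_NM_000002_0 range=chr1:100-200 strand=+", "GENE_X"),
]
kept, deleted = split_duplicates(example_rows)
print("kept:", kept)
print("deleted:", deleted)
```

Returning both lists keeps a record of the remaining and the deleted data, which could then be written to two separate sheets or files.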

How to & Answers:

I agree with @Siddharth on counting the instances, i.e. use a PivotTable with Row Labels = GENE and Σ Values = Count of refGene.

A possible ‘duplicates’ solution (at least to start with): insert a row at the top, select Column A, then choose Sort & Filter / Advanced / Copy to another location = (say) C1 / tick Unique records only / OK. That should give you a list that is 35 rows shorter than the one you started with.

To identify which rows are duplicates, copy Column A to another column (say D), use Replace to remove the “>” character (replace it with nothing), then enter =COUNTIF(D:D,D2) in E2 and double-click the bottom right-hand corner of the cell to fill the formula down. A result of 1 means the row is unique; any other value is the number of instances.
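
For comparison, the same two steps (unique records, then a per-row instance count like COUNTIF) could be sketched in Python. The input names are hypothetical placeholders, and `unique_records` and `count_instances` are illustrative helpers, not part of any Excel API.

```python
from collections import Counter


def unique_records(names):
    """Return the names without repeats, keeping first-seen order."""
    return list(dict.fromkeys(names))


def count_instances(names):
    """Pair each name (with '>' removed) with how often it occurs overall."""
    cleaned = [name.replace(">", "") for name in names]
    tally = Counter(cleaned)
    return [(name, tally[name]) for name in cleaned]


# Hypothetical stand-ins for the values in Column A.
column_a = [">name_1", ">name_2", ">name_1"]
print(unique_records(column_a))
print(count_instances(column_a))
```

As with the COUNTIF formula, a count of 1 marks a unique row and any larger value is the number of instances.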