The sample file has been uploaded to MediaFile.
Section 1: In the sample file, “Sheet1”
a. Values in “Column A” are the original names. For example, Cell A1 contains: “>hg19_refGene_NM_000392_0 range=chr10:101542463-101542634 5'pad=0 3'pad=0 strand=+ repeatMasking=none”
b. Values in “Column B” correspond to the values in Column A. For example, Cell B1, which corresponds to Cell A1, contains: “ABCC2”
Section 2: In the sample file, “Sheet2”
a. In Sheet2, the values from Sheet1 have been separated to clarify the data, because in Sheet1 everything is packed into one cell.
b. Column A represents “GENE”, which refers to the value in Column B of Sheet1, for example “ABCC2” from Section 1 of this article.
c. Column B represents “refGENE”. An example of a refGENE is “NM000392”, which comes from the original name in “Sheet1”.
d. Column C represents “CHROMOSOME”, another value derived from Column A of Sheet1, for example “chr10”.
e. Similarly, “EXON START” comes from the original name in Column A of Sheet1, for example “101542463”.
f. “EXON END” also comes from the original name in Column A of Sheet1, for example “101542634”.
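The split described in Section 2 could be sketched in Python roughly as follows. This is only a sketch under assumptions: the regular expression is inferred from the single example in Cell A1, the function name `parse_header` and the dictionary keys are made up for illustration, and the underscore is dropped from the accession only so that the result matches the “NM000392” spelling used in Sheet2.

```python
import re

# Pattern inferred from the one example header in Cell A1 (an assumption);
# headers with a different layout will not match.
HEADER_RE = re.compile(
    r"^>?(?P<build>[^_]+)_refGene_(?P<refgene>[A-Z]+_\d+)_(?P<index>\d+)\s+"
    r"range=(?P<chrom>chr[^:]+):(?P<start>\d+)-(?P<end>\d+)"
)


def parse_header(original_name, gene):
    """Split one Sheet1 row (Column A, Column B) into the Sheet2 fields."""
    match = HEADER_RE.match(original_name.strip())
    if match is None:
        raise ValueError(f"unrecognised original name: {original_name!r}")
    return {
        "gene": gene,
        # Sheet2 writes the accession without the underscore, e.g. NM000392.
        "refgene": match["refgene"].replace("_", ""),
        "chrom": match["chrom"],
        "exon_start": int(match["start"]),
        "exon_end": int(match["end"]),
    }


a1 = (">hg19_refGene_NM_000392_0 range=chr10:101542463-101542634 "
      "5'pad=0 3'pad=0 strand=+ repeatMasking=none")
print(parse_header(a1, "ABCC2"))
```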
The challenge is to develop a program that meets the following requirements:
Requirement 1: For each gene, count the number of times each refGene is observed, e.g.:
Note: I currently do this with SUMPRODUCT in Excel, but I don’t know how to put everything into a simple table.
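As a rough illustration of Requirement 1 outside Excel, the count per (GENE, refGene) pair could be built with `collections.Counter`. This is a sketch, not a definitive implementation: the regular expression is an assumption based on the Cell A1 example, `count_refgenes` is a hypothetical name, and only the first row below comes from the sample file; the second row borrows the row 2 coordinates with an assumed refGene, and the third row is entirely invented.

```python
import re
from collections import Counter

# Assumed layout: "..._refGene_<accession>_<index> range=..." as in Cell A1.
REFGENE_RE = re.compile(r"_refGene_([A-Z]+_\d+)_\d+")


def count_refgenes(rows):
    """Count occurrences of each (gene, refGene) pair.

    rows: iterable of (original_name, gene) pairs, i.e. Sheet1 columns A and B.
    """
    counts = Counter()
    for original_name, gene in rows:
        match = REFGENE_RE.search(original_name)
        if match:
            counts[(gene, match.group(1))] += 1
    return counts


sample = [
    # Cell A1 / B1 from the sample file.
    (">hg19_refGene_NM_000392_0 range=chr10:101542463-101542634 "
     "5'pad=0 3'pad=0 strand=+ repeatMasking=none", "ABCC2"),
    # Illustrative rows: the refGene values here are assumptions.
    (">hg19_refGene_NM_000392_1 range=chr10:101544365-101544538 "
     "5'pad=0 3'pad=0 strand=+ repeatMasking=none", "ABCC2"),
    (">hg19_refGene_NM_999999_0 range=chr1:100-200 "
     "5'pad=0 3'pad=0 strand=+ repeatMasking=none", "GENEX"),
]

# One line per (GENE, refGene) pair: a simple three-column table.
for (gene, refgene), n in sorted(count_refgenes(sample).items()):
    print(f"{gene}\t{refgene}\t{n}")
```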
This requires comparing values in two different rows. Please note that it must use the original name from “Sheet1”; please don’t use the separated values from “Sheet2”.
Basically, query each row: if GENE, CHROMOSOME, EXON START and EXON END are the same, remove the rows with the least frequent refGene. I will explain further below.
In “Sheet1”, there are “Original Name” and “GENE”.
Step 1: Compare whether the values in Column B are the same. For example, when comparing row 1 and row 2, both contain ABCC2. This satisfies the condition, so proceed to Step 2; otherwise, continue comparing GENE from different rows.
Step 2: Compare the “chr” values from the two rows. In the same example as the previous step, row 1 has chr10 and row 2 has chr10. As they are the same, continue to the next step; otherwise, move on.
Step 3: Now compare “exon start”. The number in row 1 is 101542463 and the number in row 2 is 101544365. They are not the same, so save the file and move on. If the numbers were the same, continue to compare “exon end”, which is Step 4.
Step 4: Assuming the “exon start” values from the two rows are the same, compare “exon end”. The number in row 1 is 101542634 and the “exon end” number in row 2 is 101544538. The same condition applies as above: if they are different, leave the file alone and continue comparing the next GENE.
Here is the part that requires attention. If they are the same, then “GENE”, “chr”, “exon start” and “exon end” all match, which means there is a duplicated row. The duplicated rows will now be deleted, but under what condition? This links back to the challenge solved in Requirement 1. Remember that the number of occurrences has been counted for every refGENE? Recall 29 times for NM000927 and 32 times for Nm00078. The rows of “GENE” to be removed are the ones containing
Please also keep a record of all the deleted data and all the remaining data, preferably in a table.
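Steps 1 to 4 and the deletion rule might be sketched in Python as below. It works from the original names only, as requested, and returns both the remaining and the deleted rows. Several things are assumptions rather than part of the question: the regular expression (inferred from Cell A1), the name `split_duplicates`, the choice to keep exactly one row per duplicated group, and the tie-break (the earliest row wins when counts are equal). The sample rows use invented accessions and coordinates.

```python
import re
from collections import Counter, defaultdict

# Assumed header layout, inferred from the Cell A1 example.
HEADER_RE = re.compile(
    r"_refGene_(?P<refgene>[A-Z]+_\d+)_\d+\s+"
    r"range=(?P<chrom>chr[^:]+):(?P<start>\d+)-(?P<end>\d+)"
)


def split_duplicates(rows):
    """Return (kept, removed) lists of (original_name, gene) rows."""
    rows = list(rows)
    parsed = []
    for original_name, gene in rows:
        m = HEADER_RE.search(original_name)
        if m is None:
            raise ValueError(f"unrecognised original name: {original_name!r}")
        # Steps 1-4 compare gene, chr, exon start and exon end: use them as a key.
        key = (gene, m["chrom"], m["start"], m["end"])
        parsed.append((gene, m["refgene"], key))

    # Requirement 1: how often each refGene occurs for each gene.
    freq = Counter((gene, refgene) for gene, refgene, _ in parsed)

    groups = defaultdict(list)
    for index, (_, _, key) in enumerate(parsed):
        groups[key].append(index)

    keep = set()
    for indices in groups.values():
        # Keep the row whose refGene is most frequent; max() returns the
        # first maximal item, so ties keep the earliest row (an assumption).
        keep.add(max(indices, key=lambda i: freq[parsed[i][:2]]))

    kept = [row for i, row in enumerate(rows) if i in keep]
    removed = [row for i, row in enumerate(rows) if i not in keep]
    return kept, removed


# Invented rows: NM_000001 occurs twice and NM_000002 once for GENEX.
rows = [
    (">hg19_refGene_NM_000001_0 range=chr1:100-200 5'pad=0", "GENEX"),
    (">hg19_refGene_NM_000001_1 range=chr1:300-400 5'pad=0", "GENEX"),
    (">hg19_refGene_NM_000002_0 range=chr1:100-200 5'pad=0", "GENEX"),
]
kept, removed = split_duplicates(rows)
print("kept:", len(kept), "removed:", len(removed))
```

Both lists can then be written out as two tables (for example with the `csv` module), which gives the requested record of deleted and remaining data.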
I agree with @Siddharth for the count of instances, i.e. a PivotTable with Row Labels = GENE, Σ Values = Count of
Possibly the ‘duplicates’ solution would be (at least to start with): insert a row at the top, select Column A, then Sort & Filter / Advanced / Copy to another location = (say) C1 / tick Unique records only / OK. That should give you a list that is 35 rows shorter than the one you started with.
To identify which rows are duplicates, copy Column A to another column (say D), Replace > (with nothing), then enter =COUNTIF(D:D,D2) in E2 and double-click the bottom right-hand corner of the cell to fill the formula down.
1 = unique, anything else is the number of instances.
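For anyone who would rather check this outside Excel, the same idea (strip the >, list the unique names, then count instances per row) could look roughly like this in Python. The function names are made up for illustration, and there is one behavioural difference to be aware of: COUNTIF compares text case-insensitively, whereas this sketch compares exactly.

```python
from collections import Counter


def unique_names(names):
    """Unique records only, in first-seen order, after removing '>'."""
    cleaned = [name.replace(">", "") for name in names]
    return list(dict.fromkeys(cleaned))


def instance_counts(names):
    """Pair each cleaned name with its number of instances (like column E)."""
    cleaned = [name.replace(">", "") for name in names]
    totals = Counter(cleaned)
    return [(name, totals[name]) for name in cleaned]


# Placeholder values standing in for Column A.
column_a = [">name_one", ">name_two", ">name_one"]
for name, n in instance_counts(column_a):
    print(name, n)  # 1 = unique, anything else is the number of instances
```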