Apologies if this has already been asked and answered but I couldn’t find a satisfactory answer.
I have a list of chemical formulas including, in this order: C, H, N and O. And I would like to pull the number after each of these letters. The problem is that not all the formulas contain an N. All contain a C, H and O however. And the number can be either single, double or (in the case of H only) triple digit.
Thus the data looks like this:
I’d like each element number for the list in separate columns. So in the first example it would be:
20 37 1 5
I’ve been trying:
to separate out the C#. However, after this I get stuck as the H# is flanked by either an O or N.
Is there an excel formula or VBA that can do this?
Use Regular Expressions
This is a good task for regular expressions (regex). Because VBA doesn’t support regular expressions out of the box we need to reference a Windows library first.
Add this function to a module
Option Explicit Public Function ChemRegex(ChemFormula As String, Element As String) As Long Dim strPattern As String strPattern = "([CNHO])([0-9]*)" 'this pattern is limited to the elements C, N, H and O only. Dim regEx As New RegExp Dim Matches As MatchCollection, m As Match If strPattern <> "" Then With regEx .Global = True .MultiLine = True .IgnoreCase = False .Pattern = strPattern End With Set Matches = regEx.Execute(ChemFormula) For Each m In Matches If m.SubMatches(0) = Element Then ChemRegex = IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1) 'this IIF ensures that in CH4O the C and O are count as 1 Exit For End If Next m End If End Function
Use the function like this in a cell formula
Recognize also chemical formulas with multiple occurrences of elements like
Note that the code above cannot count something like
CH3OH where elements occur more than once. Then only the first
H3 is count the last is omitted.
If you need also to recognize formulas in the format like
CH2COOH (and summarize the occurrences of the elements) then you need to change the code to recognize these too …
If m.SubMatches(0) = Element Then ChemRegex = ChemRegex + IIf(Not m.SubMatches(1) = vbNullString, m.SubMatches(1), 1) 'Exit For needs to be removed. End If
Recognize also chemical formulas with 2 letter elements like
In addition to the change above for multiple occurrences of elements use this pattern:
strPattern = "([A-Z][a-z]?)([0-9]*)" 'https://regex101.com/r/nNv8W6/2
- Note that they need to be in the correct upper/lower letter case.
CaCl2works but not
Note that this doesn’t proof if these letter combinations are existing elements of the periodic table. So this will also recognize eg.
Xx2Zz5Qas fictive elements
Xx = 2,
Zz = 5and
Q = 1.
To accept only combinations that exist in the periodic table use the following pattern:
strPattern = "([A][cglmrstu]|[B][aehikr]?|[C][adeflmnorsu]?|[D][bsy]|[E][rsu]|[F][elmr]?|[G][ade]|[H][efgos]?|[I][nr]?|[K][r]?|[L][airuv]|[M][cdgnot]|[N][abdehiop]?|[O][gs]?|[P][abdmortu]?|[R][abefghnu]|[S][bcegimnr]?|[T][abcehilms]|[U]|[V]|[W]|[X][e]|[Y][b]?|[Z][nr])([0-9]*)" 'https://regex101.com/r/Hlzta2/3 'This pattern includes all 118 elements up to today. 'If new elements are found/generated by scientist they need to be added to the pattern.
This seems to work just fine:
B2 is below. Drag across and down
Or a shorter array formula, which must be entered with ctrl+shift+enter
If you wanted to keep the VBA super simple, something like this works as well:
Public Function ElementCount(str As String, element As String) As Long Dim i As Integer Dim s As String For i = 1 To 3 s = Mid(str, InStr(str, element) + 1, i) On Error Resume Next ElementCount = CLng(s) On Error GoTo 0 Next i End Function
Use it like so:
I did this in VBA, using regular expressions. You can probably do it like Vityata suggests by looping through the string too, but I suspect that this is slightly faster and easier to read.
Option Explicit Function find_associated_number(chemical_formula As Range, element As String) As Variant Dim regex As Object: Set regex = CreateObject("VBScript.RegExp") Dim pattern As String Dim matches As Object If Len(element) > 1 Or chemical_formula.CountLarge <> 1 Then find_associated_number = CVErr(xlErrName) Else pattern = element + "(\d+)\D" With regex .pattern = pattern .ignorecase = True If .test(chemical_formula) Then Set matches = .Execute(chemical_formula) find_associated_number = matches(0).submatches(0) Else find_associated_number = CVErr(xlErrNA) End If End With End If End Function
Then you use the formula in your sheet like normal:
Column C contains the number of carbon atoms, column D the number of nitrogen atoms. Just expand on this by copying the formula and changing the element it searches for.
With VBA this is an easy task – you have to loop through the chars and check the values for being numeric.
With Excel, the solution includes some redundancy. But it is doable. E.g.,
C20H37NO5 will return 20375 if you apply the following formula:
=IF(ISNUMBER(1*MID(A1,1,1)),MID(A1,1,1),"")& IF(ISNUMBER(1*MID(A1,2,1)),MID(A1,2,1),"")& IF(ISNUMBER(1*MID(A1,3,1)),MID(A1,3,1),"")& IF(ISNUMBER(1*MID(A1,4,1)),MID(A1,4,1),"")& IF(ISNUMBER(1*MID(A1,5,1)),MID(A1,5,1),"")& IF(ISNUMBER(1*MID(A1,6,1)),MID(A1,6,1),"")& IF(ISNUMBER(1*MID(A1,7,1)),MID(A1,7,1),"")& IF(ISNUMBER(1*MID(A1,8,1)),MID(A1,8,1),"")& IF(ISNUMBER(1*MID(A1,9,1)),MID(A1,9,1),"")
Currently, it checks the first 9 characters for being numeric. If you want to include more than 9, then simply add a few lines to the formula.
There is a small trick in the formula – the
1*. It converts a text character to a numeric, if it is possible. Thus, a
5 as a text, multiplied by
1 becomes a numeric character.
Use split and like method.
Sub test() Dim vDB As Variant, vR() As Variant Dim s As String Dim vSplit As Variant Dim i As Long, n As Long, j As Integer vDB = Range("a2", Range("a" & Rows.Count).End(xlUp)) n = UBound(vDB, 1) ReDim vR(1 To n, 1 To 4) For i = 1 To n s = vDB(i, 1) For j = 1 To Len(s) If Mid(s, j, 1) Like "[A-Z]" Then s = Replace(s, Mid(s, j, 1), " ") End If Next j vSplit = Split(s, " ") For j = 1 To UBound(vSplit) vR(i, j) = vSplit(j) Next j Next i Range("b2").Resize(n, 4) = vR End Sub
If you want a vba solution to extract all numbers my preferred solution is to use Regular Expressions. The following code will extract all numbers from a string
Sub GetMolecularFormulaNumbers() Dim rng As Range Dim RegExp As Object Dim match, matches Dim j As Long Set rng = Range(Cells(1, 1), Cells(Cells(Rows.Count, 1).End(xlUp).Row, 1)) Set RegExp = CreateObject("vbscript.regexp") With RegExp .Pattern = "\d+" .IgnoreCase = True .Global = True For Each c In rng j = 0 Set matches = .Execute(c) If matches.Count > 0 Then For Each match In matches j = j + 1 c.Offset(0, j) = CInt(match) Next match End If Next c End With End Sub