Split string into a list, but keeping the split pattern

Posted by: admin December 28, 2017

Questions:

Currently I am splitting a string by a pattern, like this:

outcome_array=the_text.split(pattern_to_split_by)

The problem is that the pattern I split by is always omitted from the result.

How do I get the result to include the split pattern itself?

Answers:

Thanks to Mark Wilkins for the inspiration, but here’s a shorter bit of code for doing it:

irb(main):015:0> s = "split on the word on okay?"
=> "split on the word on okay?"
irb(main):016:0> b=[]; s.split(/(on)/).each_slice(2) { |s| b << s.join }; b
=> ["split on", " the word on", " okay?"]

or:

s.split(/(on)/).each_slice(2).map(&:join)

See below the fold for an explanation.


Here’s how this works. First, we split on “on”, but wrap it in parentheses to make it a capture group. When the regular expression passed to split contains a capture group, Ruby includes the text matched by that group in the output:

s.split(/(on)/)
# => ["split ", "on", " the word ", "on", " okay?"]

Now we want to join each instance of “on” with the preceding string. each_slice(2) helps by passing two elements at a time to its block. Let’s invoke each_slice(2) on its own to see what it produces. Since each_slice returns an Enumerator when invoked without a block, we’ll call to_a on the Enumerator to see what it will enumerate over:

s.split(/(on)/).each_slice(2).to_a
# => [["split ", "on"], [" the word ", "on"], [" okay?"]]

We’re getting close. Now all we have to do is join the words together. And that gets us to the full solution above. I’ll unwrap it into individual lines to make it easier to follow:

b = []
s.split(/(on)/).each_slice(2) do |s|
  b << s.join
end
b
# => ["split on", " the word on", " okay?"]

But there’s a nifty way to eliminate the temporary b and shorten the code considerably:

s.split(/(on)/).each_slice(2).map do |a|
  a.join
end

map passes each element of its input array to the block; the result of the block becomes the new element at that position in the output array. In MRI >= 1.8.7, you can shorten it even more, to the equivalent:

s.split(/(on)/).each_slice(2).map(&:join)
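
For reuse, the same split, each_slice(2), join chain can be wrapped in a small method. This is only a minimal sketch: the name split_keeping_delimiter is hypothetical (it is not part of Ruby), and it assumes the pattern has no capture groups of its own, since extra groups would add extra elements to the array returned by split and throw off the pairing done by each_slice(2).

```ruby
# Hypothetical helper (not a built-in): splits text on pattern and keeps each
# match attached to the end of the piece that precedes it.
# Assumes pattern is a String or a Regexp without capture groups of its own.
def split_keeping_delimiter(text, pattern)
  # Regexp.union escapes a String and passes a single Regexp through;
  # the outer parentheses make split return the matches as well.
  captured = /(#{Regexp.union(pattern)})/
  text.split(captured).each_slice(2).map(&:join)
end

p split_keeping_delimiter("split on the word on okay?", "on")
# => ["split on", " the word on", " okay?"]
p split_keeping_delimiter("a.b.c", ".")
# => ["a.", "b.", "c"]
```

Passing the pattern through Regexp.union means a plain string such as "." is treated literally rather than as a regular expression metacharacter.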

Answers:

You could use a regular expression assertion to locate the split point without consuming any of the input. Below uses a positive look-behind assertion to split just after ‘on’:

s = "split on the word on okay?"
s.split(/(?<=on)/)
=> ["split on", " the word on", " okay?"]

Or a positive look-ahead to split just before ‘on’:

s = "split on the word on okay?"
s.split(/(?=on)/)
=> ["split ", "on the word ", "on okay?"]

With something like this, you might want to make sure ‘on’ was not part of a larger word (like ‘assertion’), and also remove white-space at the split:

"don't split on assertion".split(/(?<=\bon\b)\s*/)
=> ["don't split on", "assertion"]
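
To make the effect of the word boundaries concrete, here is a small side-by-side sketch of the same input split with and without \b. The variable names are only for illustration.

```ruby
text = "don't split on assertion"

# Without word boundaries, every "on" is a split point, including the one
# inside "don't". The "on" ending "assertion" matches too, but it only
# produces a trailing empty string, which split drops.
naive = text.split(/(?<=on)/)

# With \b on both sides, only the standalone word "on" qualifies.
# The trailing \s* consumes the whitespace that follows it.
bounded = text.split(/(?<=\bon\b)\s*/)

p naive   # => ["don", "'t split on", " assertion"]
p bounded # => ["don't split on", "assertion"]
```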

Answers:

If you use a pattern with groups, it will return the pattern in the results as well:

irb(main):007:0> "split it here and here okay".split(/ (here) /)
=> ["split it", "here", "and", "here", "okay"]

Edit: The additional information indicates that the goal is to keep each item that was split on attached to one of the neighboring pieces. I would think there is a simple way to do that, but I don’t know it and haven’t had time today to play with it. So in the absence of a clever solution, the following is one way to brute-force it. Use the split method as described above to include the split items in the array. Then iterate through the array and append every second entry (the one at an odd index, which by construction is the split value) to the entry before it.

s = "split on the word on and include on with previous"
a = s.split(/(on)/)

# iterate through and combine adjacent items together and store
# results in a second array
b = []
a.each_index { |i|
  b << a[i] if i.even?
  b[-1] += a[i] if i.odd?
}

print b

Results in this:

["split on", " the word on", " and include on", " with previous"]