Home » Mysql » MySQL: Use REGEX to extract string (select REGEX)

MySQL: Use REGEX to extract string (select REGEX)

Posted by: admin November 1, 2017 Leave a comment

Questions:

I would like to have a mysql query like this:

select <second word in text> word, count(*) from table group by word;

All the regex examples in mysql are used to query if the text matches the expression, but not to extract text out of an expression. Is there such a syntax?

Answers:

The following is a proposed solution for the OP’s specific problem (extracting the 2nd word of a string), but it should be noted that, as mc0e’s answer states, actually extracting regex matches is not supported out-of-the-box in MySQL. If you really need this, then your choices are basically to 1) do it in post-processing on the client, or 2) install a MySQL extension to support it.


BenWells has it very almost correct. Working from his code, here’s a slightly adjusted version:

SUBSTRING(
  sentence,
  LOCATE(' ', sentence) + CHAR_LENGTH(' '),
  LOCATE(' ', sentence,
  ( LOCATE(' ', sentence) + 1 ) - ( LOCATE(' ', sentence) + CHAR_LENGTH(' ') )
)

As a working example, I used:

SELECT SUBSTRING(
  sentence,
  LOCATE(' ', sentence) + CHAR_LENGTH(' '),
  LOCATE(' ', sentence,
  ( LOCATE(' ', sentence) + 1 ) - ( LOCATE(' ', sentence) + CHAR_LENGTH(' ') )
) as string
FROM (SELECT 'THIS IS A TEST' AS sentence) temp

This successfully extracts the word IS

Questions:
Answers:

Shorter option to extract the second word in a sentence:

SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('THIS IS A TEST', ' ',  2), ' ', -1) as FoundText

MySQL docs for SUBSTRING_INDEX

Questions:
Answers:

According to http://dev.mysql.com/ the SUBSTRING function uses start position then the length so surely the function for the second word would be:

SUBSTRING(sentence,LOCATE(' ',sentence),(LOCATE(' ',LOCATE(' ',sentence))-LOCATE(' ',sentence)))

Questions:
Answers:

No, there isn’t a syntax for extracting text using regular expressions. You have to use the ordinary string manipulation functions.

Alternatively select the entire value from the database (or the first n characters if you are worried about too much data transfer) and then use a regular expression on the client.

Questions:
Answers:

As others have said, mysql does not provide regex tools for extracting sub-strings. That’s not to say you can’t have them though if you’re prepared to extend mysql with user-defined functions:

https://github.com/mysqludf/lib_mysqludf_preg

That may not be much help if you want to distribute your software, being an impediment to installing your software, but for an in-house solution it may be appropriate.

Questions:
Answers:

I used Brendan Bullen’s answer as a starting point for a similar issue I had which was to retrive the value of a specific field in a JSON string. However, like I commented on his answer, it is not entirely accurate. If your left boundary isn’t just a space like in the original question, then the discrepancy increases.

Corrected solution:

SUBSTRING(
    sentence,
    LOCATE(' ', sentence) + 1,
    LOCATE(' ', sentence, (LOCATE(' ', sentence) + 1)) - LOCATE(' ', sentence) - 1
)

The two differences are the +1 in the SUBSTRING index parameter and the -1 in the length parameter.

For a more general solution to “find the first occurence of a string between two provided boundaries”:

SUBSTRING(
    haystack,
    LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>'),
    LOCATE(
        '<rightBoundary>',
        haystack,
        LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>')
    )
    - (LOCATE('<leftBoundary>', haystack) + CHAR_LENGTH('<leftBoundary>'))
)

Questions:
Answers:

I don’t think such a thing is possible. You can use substring function to extract the part you want.

Questions:
Answers:

The field’s value is:

 "- DE-HEB 20% - DTopTen 1.2%"
SELECT ....
SUBSTRING_INDEX(SUBSTRING_INDEX(DesctosAplicados, 'DE-HEB ',  -1), '-', 1) DE-HEB ,
SUBSTRING_INDEX(SUBSTRING_INDEX(DesctosAplicados, 'DTopTen ',  -1), '-', 1) DTopTen ,

FROM TABLA 

Result is:

  DE-HEB       DTopTEn
    20%          1.2%