regex – Extract email message itself from all its prior messages and meta data (Sendgrid Parse API/PHP)?

Posted by: admin July 12, 2020

Questions:

I’m using SendGrid and their Parse API to send and receive email. The Parse API lets a web app receive an email as a $_POST request, but the $_POST contains the whole thread: I want to extract the new message itself from the prior messages and metadata that get chained along with it.

To show you what I mean: in the following picture, I’d just like to capture the text “trying sending from 12373 to 12373 from GMAIL” and not all the junk below it. If that is not possible, does anyone have suggestions on how to parse the email body ($_POST['text']) so that I can separate out the message itself?

The problem I see is that, depending on the email client (Gmail, Outlook, etc.), it’s not clear to me that the date information, in this case “On Wed, Jan 23, 2013…”, will always follow the message itself. If all email clients put the date beneath the message, then it seems I could design a fancy regex to look for a line break followed by a date or something. Thoughts?

**Entire** Message body containing prior messages

Answer:

You have a couple of options:

1) Insert a token that splits the emails

You could do something like --- reply above this line --- and then cut out everything below that token.

2) Use an email reply parsing library

There is a really good one done by GitHub, but it’s in Ruby. There’s a PHP port, though, that might be good for what you need:

Fully working code:

<?php
  require_once 'application/third_party/EmailReplyParser-master/src/autoload.php';

  // Parse the plain-text body delivered by the SendGrid Parse API.
  $email = new \EmailReplyParser\Email();
  $reply = $email->read($_POST['text']);

  // The first fragment holds the newest message.
  $message = $reply[0]->getContent();

  // Needed for some email clients (e.g., some university e-mail systems, but not
  // Gmail or Hotmail) to get rid of a leftover "On Jan 23...wrote:" line and
  // everything after it. This failure to remove "On Jan 23...wrote:" is a known
  // issue and is documented in their README.
  $message = preg_replace('~On(.*?)wrote:(.*?)$~si', '', $message);
?>
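Option 1 above (the token) can be sketched roughly as follows. This is a minimal sketch, assuming every outgoing message starts with the marker line; the constant REPLY_MARKER and the function name extractReplyAboveMarker are hypothetical names chosen for illustration, not part of SendGrid or any library.

```php
<?php
// Hypothetical marker text; any string unlikely to occur in a real message works.
const REPLY_MARKER = '--- reply above this line ---';

/**
 * Return the part of $body that precedes the first occurrence of the marker.
 * If the marker is absent, the (trimmed) body is returned as-is.
 */
function extractReplyAboveMarker(string $body, string $marker = REPLY_MARKER): string
{
    $pos = strpos($body, $marker);
    if ($pos === false) {
        return trim($body);
    }

    $above = substr($body, 0, $pos);

    // Clients usually quote the marker line as "> --- reply above ...",
    // so strip the trailing quote prefix and whitespace left before it.
    return trim(rtrim($above, "> \t\r\n"));
}
```

Note that an attribution line such as "On Wed, Jan 23, 2013 ... wrote:" sits above the quoted marker, so it survives this cut and still needs separate handling.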

Answer:

There’s simply no guaranteed way to parse quoted message threads from an email message, so you won’t find a regex or any other code that will work in all cases. There’s no standard to define formatting of replies, and as you’ve already observed different mail clients use different conventions. Many, in fact, will allow the user to edit the quoted text. Also, users can paste in unrelated messages, with or without headers, resulting in a mix-and-match of formats.

If you can record and keep the history of all messages as they are sent and received, then you can (usually, but not always) use the In-Reply-To header (see RFC 5322) to locate the previous message by matching its Message-ID header, then diff the bodies and remove duplicate text runs. It’s apparent that some email systems do this to improve their presentation, but I’m not aware of any available open source code.
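As a rough illustration of that idea, here is a simplified sketch. It assumes the application keeps a map of Message-ID to plain-text body (the $store array below is hypothetical), and it replaces a real diff with a plain line-by-line comparison, so it is only an approximation of what the answer describes.

```php
<?php
/**
 * Normalise a line for comparison: strip leading ">" quote markers
 * and surrounding whitespace.
 */
function normaliseQuotedLine(string $line): string
{
    return trim((string) preg_replace('/^(\s*>)+/', '', $line));
}

/**
 * Remove from $replyBody every non-empty line that also appears in the
 * parent message identified by $inReplyTo.
 * $store is a hypothetical map of Message-ID => plain-text body.
 */
function stripLinesQuotedFromParent(array $store, string $inReplyTo, string $replyBody): string
{
    if (!isset($store[$inReplyTo])) {
        return $replyBody; // parent unknown: nothing to compare against
    }

    // Index the parent's lines for quick lookup.
    $parentLines = [];
    foreach (preg_split('/\R/', $store[$inReplyTo]) as $line) {
        $key = normaliseQuotedLine($line);
        if ($key !== '') {
            $parentLines[$key] = true;
        }
    }

    // Keep only lines that were not copied from the parent.
    $kept = [];
    foreach (preg_split('/\R/', $replyBody) as $line) {
        $key = normaliseQuotedLine($line);
        if ($key !== '' && isset($parentLines[$key])) {
            continue;
        }
        $kept[] = $line;
    }

    return trim(implode("\n", $kept));
}
```

Because the comparison is per line, a reply that happens to repeat a line of the parent verbatim loses that line too, and quoted text that the client re-wrapped is not detected. The attribution line is also left in place.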

Answer:

// cut quoted text, https://regex101.com/r/xO8nI1/5

    $message = preg_replace('/(On\s.*<\n){0,1}(.*\n(\n){0,1}((^>+\s?.*$)+\n?)+)/mi', '', $message);
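To see what this pattern removes, here is a small usage sketch. The wrapper name cutQuotedText and the sample body are made up for illustration; the pattern itself is copied unchanged from the line above.

```php
<?php
/**
 * Apply the quoted-text pattern from the answer above.
 * Returns the input unchanged if preg_replace() reports an error.
 */
function cutQuotedText(string $message): string
{
    $pattern = '/(On\s.*<\n){0,1}(.*\n(\n){0,1}((^>+\s?.*$)+\n?)+)/mi';
    return preg_replace($pattern, '', $message) ?? $message;
}

// Hypothetical sample body, not taken from a real client.
$sample = "Hello there.\n\n"
        . "On Wed, Jan 23, 2013, Bob wrote:\n"
        . "> quoted one\n"
        . "> quoted two\n";

// The attribution line and both ">" lines are removed.
$result = cutQuotedText($sample);
```

One thing to keep in mind: the pattern always removes the line directly before a block of ">" lines, whatever it contains, on the assumption that it is an attribution line. If the user's own text ends right above the quoted block, that last line is cut as well.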

Answer:

How about replies in languages other than English? We came up with the solution of adding a marker, but instead of translating it for every email (based on the user’s language) we put some invisible characters into it (zero-width space, U+200B, to be precise). Relying on an “On…” regexp is error-prone; it can easily cut some of the email content.
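A minimal sketch of such an invisible marker might look like this. The marker length (three zero-width spaces), the constant name, and both function names are assumptions made for illustration; the answer does not say how the marker was actually built. It also assumes the replying client preserves the characters when quoting.

```php
<?php
// Hypothetical invisible marker: a run of three zero-width spaces (U+200B).
const INVISIBLE_MARKER = "\u{200B}\u{200B}\u{200B}";

/** Put the invisible marker at the very start of an outgoing message body. */
function prependInvisibleMarker(string $outgoingBody): string
{
    return INVISIBLE_MARKER . $outgoingBody;
}

/**
 * Return the text of an incoming reply that precedes the quoted marker.
 * If the marker is not found, the (trimmed) body is returned as-is.
 */
function extractReplyBeforeInvisibleMarker(string $incomingBody): string
{
    $pos = strpos($incomingBody, INVISIBLE_MARKER);
    if ($pos === false) {
        return trim($incomingBody);
    }

    // Drop the quote prefix ("> ") that the client put in front of the marker.
    return trim(rtrim(substr($incomingBody, 0, $pos), "> \t\r\n"));
}
```

Since the marker carries no visible text, the same code works regardless of the language the client uses for its attribution line, although that line itself still remains above the cut.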