Mail extractor — LLM applied

Consider this situation: your boss wants to review historical customer complaints and needs you to share some insight. You have hundreds of unstructured mails with customers. What can you do?

Option 1: Export the mails to a CSV file and start checking them one by one; maybe a few weeks later you'll have finished all the mail checking.

Option 2: Why not use RPA or Python to crawl the mail titles and contents?

By asking Copilot "How can I connect to my mailbox and get the title and content of mails with the title 'Customer complaint'?", you can get code like the below:

import imaplib
import email
from email.header import decode_header

# your email credentials
username = 'myusername'
password = 'mypassword'

# create an IMAP4 connection with SSL
mail = imaplib.IMAP4_SSL("imap.example.com")

# authenticate
mail.login(username, password)

# select the mailbox you want to read
# if you want SPAM, use "INBOX.SPAM"
mail.select("inbox")

# search for specific mail by subject
resp, items = mail.search(None, '(SUBJECT "Customer complaint")')
items = items[0].split()  # the mail ids

for emailid in items[::-1]:
    resp, data = mail.fetch(emailid, "(BODY[HEADER.FIELDS (SUBJECT)])")
    raw_email = data[0][1].decode("utf-8")  # converts byte literal to string, removing b''
    email_message = email.message_from_string(raw_email)
    print('Subject:', email_message['Subject'])

    resp, data = mail.fetch(emailid, "(BODY[TEXT])")
    raw_email = data[0][1].decode("utf-8")
    email_message = email.message_from_string(raw_email)
    print('Body:', email_message.get_payload())

This is much more efficient! But there is still room to improve.

  • An event will not contain only one mail, but several mails back and forth. The title will also change slightly (e.g. a "RE:" prefix gets added).
  • Decide what data you want to collect; in this case, I want to collect the customer complaint content only first.
    Assume each of my events looks like "RE: Issue Notify Eventid: XXXXXXX This is a bad tool…"
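Before extracting anything, the "RE:" prefix issue above can be handled by normalizing the subject. A minimal sketch, assuming only standard reply/forward prefixes (the function name is mine, not from the original):

```python
import re

def normalize_subject(subject: str) -> str:
    """Strip any leading RE:/FW:/FWD: prefixes so replies and forwards
    map back to the same original subject line."""
    return re.sub(r'^(\s*(re|fw|fwd)\s*:\s*)+', '', subject, flags=re.IGNORECASE)

print(normalize_subject("RE: RE: Issue Notify Eventid: XXXXXXX"))  # -> Issue Notify Eventid: XXXXXXX
```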

We can use Python to split on "Eventid:" and the following space to get just the event id.

def get_event_id_by_title_split(title: str) -> str:
    # Write a separate function here for easier debugging,
    # because each case is a different situation; good to isolate.
    return title.upper().split('EVENTID:')[1].strip().split(' ')[0]
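For comparison, a slightly more robust extraction (a hypothetical helper, not from the original) uses a regex, so extra whitespace around the colon doesn't matter:

```python
import re

def get_event_id_by_regex(title: str) -> str:
    # Match "EVENTID:" followed by optional spaces and capture the next token.
    match = re.search(r'EVENTID:\s*(\S+)', title.upper())
    return match.group(1) if match else ''

# Hypothetical subject following the "Eventid: XXXXXXX" pattern from the text
print(get_event_id_by_regex("RE: Issue Notify Eventid: 7654321 This is a bad tool"))  # -> 7654321
```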

Once we can get the unique event id, we can build a hash map: the key is the eventid, the value is the mail body. We check mails from newest to oldest, so when a duplicate mail appears, we ignore it.

complaint_hash = {}
for emailid in items[::-1]:  # newest first
    resp, data = mail.fetch(emailid, "(BODY[HEADER.FIELDS (SUBJECT)])")
    raw_email = data[0][1].decode("utf-8")
    email_message = email.message_from_string(raw_email)
    eventid = get_event_id_by_title_split(email_message['Subject'])
    if eventid not in complaint_hash:
        # not seen yet, add to the hash
        resp, data = mail.fetch(emailid, "(BODY[TEXT])")
        raw_email = data[0][1].decode("utf-8")
        email_message = email.message_from_string(raw_email)
        complaint_hash[eventid] = email_message.get_payload()

With the script above we get a hash map of eventid to content, which we can save to a CSV file and then review. Still not very efficient, but if the quantity is not high, it might be acceptable.
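The "save to a CSV file" step can be sketched with the standard library; `complaint_hash` is the eventid-to-body dict built above, and the filename is arbitrary:

```python
import csv

def save_complaints_csv(complaint_hash: dict, path: str = "complaints.csv") -> None:
    # One row per event: the eventid and the deduplicated mail body.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Eventid", "Body"])
        for eventid, body in complaint_hash.items():
            writer.writerow([eventid, body])

save_complaints_csv({"1234567": "This is a bad tool..."})
```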

Option 3: If we're still lazy, why not call an LLM API to help us summarize the mails?

Create an LLM API function:

import requests
import json

# LLM API endpoint; this depends heavily on your application
url = "http://llmapi.example.com/summarize"

# pack the call into a function
def get_LLM_reply(content: str) -> str:
    # data to be sent to the API
    data = {'text': content}
    # send a POST request
    response = requests.post(url, data=json.dumps(data))
    return response.text

Get a summary of each mail:

complaint_raw_summary = {}
for i in list(complaint_hash.keys()):
    customer_complaint = f'''
    Below is a mail from a customer. Help to summarize what the customer's complaint is.
    {complaint_hash[i]}
    '''
    final_solution = f'''
    Below is a mail from a customer. Help to summarize what the final solution was.
    {complaint_hash[i]}
    '''
    complaint_raw_summary[i] = [get_LLM_reply(customer_complaint), get_LLM_reply(final_solution)]

Now we have a summary of each mail, split into two categories: customer complaint and final solution. But this is still not structured data. We want to group by category; one option is, if we have an experienced employee, he or she can help generate some categories for better structure. Or…

Option 4: Use the LLM to categorize the content. We can collect all the summarized data, ask the LLM to propose some groups, then ask the LLM to write a function to categorize the data, and finally ask the LLM to judge each summary against the categories. This might seem a little redundant, since we could have asked the LLM to categorize at the beginning, but that way we would be unable to decide the categories (in neural-network terms, the nodes are unknown). This way, we can adjust the categories ourselves (control the nodes).

# get categories from the summaries
category = f'''
From the many summarized reasons below, help me derive some categories for the high-level complaint types.
{",".join([x[0] for x in complaint_raw_summary.values()])}
'''

# Once you have the categories, domain knowledge decides whether the grouping is reasonable or not
def get_grouping(x: str) -> str:
    # Ask the LLM to write a function grouping the reply; will be used later
    ......

complaint_summary_cate = {}
# ask the LLM again to categorize each summary
for i in list(complaint_raw_summary.keys()):
    customer_complaint_cate = f'''
    The summary below is a comment from a customer. Use the following categories to decide; keep the reply short.
    Quality_issue: Customer unsatisfied about product quality.
    ......
    {complaint_raw_summary[i][0]}
    '''
    final_solution_cate = f'''
    The summary below is a comment from a customer. Use the following categories to decide; keep the reply short.
    ......
    {complaint_raw_summary[i][1]}
    '''
    complaint_summary_cate[i] = [get_LLM_reply(customer_complaint_cate), get_LLM_reply(final_solution_cate)]

Now you can export the categorized data to a CSV file.

import pandas as pd

df = pd.DataFrame.from_dict(complaint_summary_cate, orient='index', columns=['Complaint', 'Decision']).reset_index()
df.rename(columns={'index': 'Eventid'}, inplace=True)
df.to_csv('complaint_summary.csv', index=False)

Option 5 and beyond… once you've performed this ETL from mail to CSV, you can feed the data to Tableau for visualization. If the data is not enough, you can extract more information from the mails, or look for another database with data keyed on event id that you can link. This is an open question for further digging.

Summary: this gives an example of how to extract mail into structured data with Python and an LLM. Think of the LLM as a module like pandas' DataFrame, but focused on language. It saves employees' energy.
