Monday 5 September 2011

Playing with Python: The hunt for email addresses


Working in both customer services and as an SEO often presents a complex mix of tasks and challenges.

One such challenge was to extract the To: addresses from over 30,000 emails. Not wanting to interrupt the dev team, and it being an interesting task, I decided to tackle it myself.


I haven't programmed properly in a few years (well, nearly 10), so I turned to Google and a language I have an interest in: Python. After a bit of searching I hit upon the following code by Tuomas Rasila. It provided a great starting point as it covers the basics really well, namely reading a file and extracting email addresses.


http://rasilagarage.com/2009/06/extracting-email-addresses-from-any-text-file-with-python/


#!/usr/bin/env python
# coding: utf-8

import os
import re
import sys

def grab_email(file):
    """Try and grab all emails addresses found within a given file."""
    email_pattern = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b',re.IGNORECASE)
    found = set()
    if os.path.isfile(file):
        for line in open(file, 'r'):
            found.update(email_pattern.findall(line))
    for email_address in found:
        print email_address

if __name__ == '__main__':
    grab_email(sys.argv[1])

The trouble was I did not really understand what it was doing, and it was not quite what I needed: I had 30,000+ files, not just one! So this is where the real work began.


Step 1 - Write the email addresses out to a file instead of the screen. This, it turns out, is fairly simple using Python's built-in open() function and the file object it returns:

  • FILE = open(filename,'w') which opens / creates a file based on the variable "filename" in write mode.
  • FILE.write() - writes data to the file
  • FILE.close() - does what it says and closes a file so the data is written to disk
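As a rough sketch of those three calls working together (the function name write_addresses and the sample addresses are made up for illustration), a with block can stand in for the explicit close(), since it closes the file when the block ends:

```python
import os
import tempfile


def write_addresses(addresses, filename):
    """Write one address per line to filename and return how many were written."""
    # 'w' creates the file if it is missing and truncates it if it exists.
    with open(filename, "w") as out_file:
        for address in addresses:
            out_file.write(address + "\n")
    # Leaving the with-block closes the file, which flushes the data to disk.
    return len(addresses)


if __name__ == "__main__":
    # Hypothetical sample data, written to a temporary folder.
    target = os.path.join(tempfile.mkdtemp(), "emails.txt")
    write_addresses(["alice@example.com", "bob@example.com"], target)
    with open(target) as in_file:
        print(in_file.read())
```

Either style works; the explicit open/write/close version is just easier to get wrong if something fails halfway through.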
Step 2 - Select only the To: field in each email:
  • The original regex was nearly spot on; I just made the following tweak: re.compile(r'(To:\s+\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)',re.IGNORECASE) The addition of To:\s+ looks for To: in the document followed by whitespace; \s is shorthand for "any whitespace character" and + means "one or more".
The next bit caused the most headaches. I had to take in a directory as a parameter, loop through its contents, check that each entry was a file and, if it was, read it and spit out any email addresses. Job done. So with only Google as my friend I set to work.
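One thing worth noting about this tweak: because the parentheses wrap the whole pattern, findall() returns the "To: " prefix along with each address. A small sketch of an alternative, assuming the prefix is not wanted in the output, moves the parentheses so only the address is captured. The names with_prefix and address_only and the sample header lines are invented for illustration.

```python
import re

# Parentheses around the whole pattern: findall() returns "To: " plus the address.
with_prefix = re.compile(
    r'(To:\s+\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)', re.IGNORECASE)

# Parentheses around the address only: findall() returns just the address.
address_only = re.compile(
    r'To:\s+\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b', re.IGNORECASE)

# Made-up header lines for illustration.
sample = "From: sender@example.org\nTo: customer@example.com\n"

print(with_prefix.findall(sample))   # ['To: customer@example.com']
print(address_only.findall(sample))  # ['customer@example.com']
```

Which form is better depends on what the output is for; if the addresses are going straight into a query, the bare address is probably the one to keep.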

Step 3 - Grab a folder
  • Python has a handy function within the os module, listdir(), so I was able to pass a folder rather than a file into my program: os.listdir(dirname)
Step 4 - Check to see if each entry is a file.
  • Again fairly straightforward: os.path.isfile()
With the above, things were looking good, but I had come across a couple of issues. For some reason my code was not getting past the isfile() check. This, I found, was because os.listdir() only returns file names, so the full path was not being passed in. A quick update fixed it: os.path.isfile(os.path.join(dirname, files)). I could now check each file (turns out the Python os module is really quite useful). The next issue was that my program was going on and on and on. I had a looping issue so bad that the file I was creating just kept getting bigger, slowly eating all my disk space. Not good.
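The path problem is easy to reproduce: os.listdir() hands back bare names, so os.path.isfile() checks them against the current working directory unless each one is joined to the folder first. A minimal sketch (the helper name list_files and the throwaway folder are assumptions for the example):

```python
import os
import tempfile


def list_files(dirname):
    """Return the full paths of the regular files directly inside dirname."""
    paths = []
    for name in os.listdir(dirname):
        # listdir() gives bare names, so join each one to the folder first.
        full_path = os.path.join(dirname, name)
        if os.path.isfile(full_path):
            paths.append(full_path)
    return sorted(paths)


if __name__ == "__main__":
    # Build a throwaway folder with one file and one sub-folder.
    folder = tempfile.mkdtemp()
    open(os.path.join(folder, "mail1.txt"), "w").close()
    os.mkdir(os.path.join(folder, "subfolder"))
    print(list_files(folder))  # only the path to mail1.txt is listed
```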

After doing a lot of reading I found out that the set() I was writing all the data to was amazing. A set is an unordered collection with no duplicates! (Wow, no duplicates: that solved an issue I had not even thought of!) The looping issue was caused by indenting the write loop in the wrong place. I was writing all the email addresses out on each pass through each file, so as the set got bigger, more and more data was being written out over and over again. Turns out indents in Python are very important.

So putting it all together I ended up with the following:
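To illustrate both points, here is a small sketch with made-up data (lists of strings stand in for the email files, and collect_then_write is a hypothetical name). The set drops the duplicate address, and the output loop sits outside the file loop, so each address is emitted once at the end rather than on every pass.

```python
def collect_then_write(files):
    """Collect unique addresses from every file, then build the output once."""
    found = set()
    for lines in files:
        for line in lines:
            found.add(line.strip())
    # This loop is dedented out of the file loop above. Indented one level
    # deeper, it would re-emit the whole growing set after every file.
    output = []
    for address in sorted(found):
        output.append(address)
    return output


# Two hypothetical "files"; one address appears in both.
files = [
    ["alice@example.com", "bob@example.com"],
    ["bob@example.com", "carol@example.com"],
]
print(collect_then_write(files))
# ['alice@example.com', 'bob@example.com', 'carol@example.com']
```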


=================== PYTHON SCRIPT EMAILS ==================
# June 13th, 2009 by Tuomas Rasila - with updates Matthew Brookes 2011
#!/usr/bin/env python
# coding: utf-8

import os
import re
import sys

def grab_email(dirname):
    #creates a file
    filename = "emails.txt"
    #Try and grab all emails addresses found within a given file.
    email_pattern = re.compile(r'(To:\s+\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)',re.IGNORECASE)
    #A set is an unordered collection with no duplicate elements
    found = set()
    #opens file in write mode
    FILE = open(filename,'w')
    #Get a directory list
    for Xfiles in os.listdir(dirname):
        #Check if its a file
        if os.path.isfile(os.path.join(dirname, Xfiles)):
            #creates a path to the file so it can be read
            emails = os.path.join(dirname, Xfiles)
            # loop through each of the files and match email addresses, write these to the set.
            for line in open(emails, 'r'):
                found.update(email_pattern.findall(line))
    # read the set of email addresses and write these out to the file created earlier.
    for email_address in found:
        FILE.write("update [table] set [column] = 0 where [column_value] like '"+email_address+"'\n")
    #Closes the file so data can be written
    FILE.close()

if __name__ == '__main__':
    grab_email(sys.argv[1])


As you can see, I even managed to write out the SQL I needed with each row in the file! The other thing is that this is reusable and I can adapt it later, so a bit of up-front work has hopefully saved me hours in the future.


Hopefully the above will help someone else out in the future as well.


Useful resources I used were:
Extracting email addresses from any text file with python
Python Documentation
An SEOS Guide To Regex
Not forgetting Google!