Monday, 5 September 2011

Playing with Python: The hunt for email addresses

Working in both customer services and as an SEO often presents a complex mix of tasks to do and challenges.

One such challenge was to extract the To: addresses from over 30,000 emails. Not wanting to interrupt the dev team, and it being an interesting task, I decided to tackle it.

I haven't programmed properly in a few years, well nearly 10, so I turned to Google and a language I have an interest in: Python. After a bit of searching I hit upon the following code by Tuomas Rasila. This provided a great starting point as it covers the basics really well, namely reading a file and extracting email addresses.

#!/usr/bin/env python
# coding: utf-8

import os
import re
import sys

def grab_email(file):
    """Try and grab all emails addresses found within a given file."""
    email_pattern = re.compile(r'\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b',re.IGNORECASE)
    found = set()
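    if os.path.isfile(file):
        for line in open(file, 'r'):
            # collect every address on this line; the set drops duplicates
            found.update(email_pattern.findall(line))
    for email_address in found:
        print(email_address)

if __name__ == '__main__':
    grab_email(sys.argv[1])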

The trouble was I did not really understand what it was doing, and it was not quite what I needed: I had 30,000+ files, not just one! So this is where the real work began.

Step 1 - Write the email addresses out to a file instead of the screen. This, it turns out, is fairly simple using Python's built-in open() function and the file object it returns (which I called FILE):

  • FILE = open(filename,'w') which opens / creates a file based on the variable "filename" in write mode.
  • FILE.write() - writes data to the file
  • FILE.close() - does what it says and closes a file so the data is written to disk
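As a minimal sketch of Step 1, the three calls fit together roughly like this. The function name write_lines, the sample addresses and the temporary file location are made up for illustration and are not part of the original script.

```python
import os
import tempfile

def write_lines(filename, lines):
    """Write each item in lines to filename, one per line."""
    # 'w' creates the file, or empties it if it already exists.
    out = open(filename, 'w')
    for line in lines:
        out.write(line + '\n')
    # Closing the file flushes any buffered data to disk.
    out.close()

# Example use with a throwaway file in a temporary directory.
example_path = os.path.join(tempfile.mkdtemp(), 'emails.txt')
write_lines(example_path, ['a@example.com', 'b@example.com'])
```

In more recent Python versions a "with open(...) as out:" block does the close for you, which is worth considering if you adapt this.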
Step 2 - Select only the To: field in each email:
  • The original regex was nearly spot on, I just made the following tweak: re.compile(r'(To:\s+\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)',re.IGNORECASE). The addition of To:\s+ basically looks for To: in the document; \s is shorthand for "any white space" and the + means one or more of it.
The next bit caused the most headaches. I had to take in a directory as a parameter, loop through its contents, check each item to make sure it was a file and, if it was, read the contents, then spit out any email addresses. Job done. So with only Google as my friend I set to work.
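To see what the tweaked pattern actually returns, here is a small sketch run against a made-up header line. One thing worth knowing: because the brackets wrap the whole expression, findall() hands back the "To:" prefix along with the address. The second pattern, address_only, is a hypothetical variant (not from the original script) that moves the brackets so only the address is captured.

```python
import re

# The tweaked pattern from the post: the group wraps the whole match.
whole = re.compile(
    r'(To:\s+\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)', re.IGNORECASE)

# Hypothetical variant: the group wraps only the address part.
address_only = re.compile(
    r'To:\s+\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b', re.IGNORECASE)

sample = 'To: jane.doe@example.com'
whole_matches = whole.findall(sample)           # includes the "To: " prefix
address_matches = address_only.findall(sample)  # just the address
```

Note that either pattern will also pick up a header such as "Reply-To:", since it contains the text "To:", so it may be worth checking a few real emails before trusting the output.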

Step 3 - Grab a folder
  • Python has a handy function within the os module, listdir(), so I was able to pass the "file", now "folder", into my program: os.listdir(dirname)
Step 4 - Check to see if each item is a file.
  • Again fairly straightforward: os.path.isfile()
With the above, things were looking good, but I had come across a couple of issues. For some reason my code was not getting past the isfile() check. This, I found, was because os.listdir() only returns the file names, not the full path, so isfile() was looking in the wrong place. A quick update fixed it: os.path.isfile(os.path.join(dirname, files)). I could now check each file (turns out the Python os module is really quite useful). The next issue was that my program was going on and on and on. I had a looping issue so bad that the file I was creating just kept getting bigger, slowly eating all my disk space. Not good.
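Steps 3 and 4 together, including the path fix, can be sketched as follows. The function name list_files and the throwaway demo folder are my own for illustration.

```python
import os
import tempfile

def list_files(dirname):
    """Return the full paths of the regular files directly inside dirname."""
    paths = []
    for name in os.listdir(dirname):
        # os.listdir() returns bare names, so join each one to the directory first.
        full_path = os.path.join(dirname, name)
        if os.path.isfile(full_path):
            paths.append(full_path)
    return paths

# Throwaway directory holding one file and one sub-folder.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, 'mail1.txt'), 'w').close()
os.mkdir(os.path.join(demo_dir, 'subfolder'))
demo_files = list_files(demo_dir)
```

Only mail1.txt comes back; the sub-folder is skipped because isfile() is False for directories.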

After doing a lot of reading I found out that the set() I was writing all the data to was amazing. A set is an unordered collection with no duplicates! (Wow, no duplicates: that solved an issue I had not even thought of!) The looping issue was caused because I had indented the write loop in the wrong place. I was writing all the email addresses out on each pass through each file, so as the set got bigger, more data was being written out each time, over and over again. Turns out indents in Python are very important.

So putting it all together I ended up with the following:
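The indentation problem is easier to see side by side. This is a simplified sketch with made-up data: each inner list stands in for the addresses found in one email file, and a plain list stands in for the output file.

```python
def write_inside_loop(files):
    """The bug: the write loop is indented under the file loop."""
    found, out = set(), []
    for addresses in files:
        found.update(addresses)
        for address in sorted(found):   # runs again after every file
            out.append(address)
    return out

def write_after_loop(files):
    """The fix: the write loop runs once, after every file has been read."""
    found, out = set(), []
    for addresses in files:
        found.update(addresses)
    for address in sorted(found):       # dedented: runs a single time
        out.append(address)
    return out

files = [['a@example.com'],
         ['a@example.com', 'b@example.com'],
         ['c@example.com']]
buggy = write_inside_loop(files)
fixed = write_after_loop(files)
```

With only three files the buggy version already writes six lines instead of three, and the gap widens with every extra file, which would explain an output file that keeps growing across 30,000 emails.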

=================== PYTHON SCRIPT EMAILS ==================
# June 13th, 2009 by Tuomas Rasila - with updates Matthew Brookes 2011
#!/usr/bin/env python
# coding: utf-8

import os
import re
import sys

def grab_email(dirname):
    # creates a file
    filename = "emails.txt"
    # Try and grab all emails addresses found within a given file.
    email_pattern = re.compile(r'(To:\s+\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b)',re.IGNORECASE)
    # A set is an unordered collection with no duplicate elements
    found = set()
    # opens file in write mode
    FILE = open(filename,'w')
    # Get a directory list
    for Xfiles in os.listdir(dirname):
        # Check if its a file
        if os.path.isfile(os.path.join(dirname, Xfiles)):
            # creates a path to the file so it can be read
            emails = os.path.join(dirname, Xfiles)
            # loop through each of the files and match email addresses, write these to the set.
            for line in open(emails, 'r'):
                found.update(email_pattern.findall(line))
    # read the set of email addresses and write these out to the file created earlier.
    for email_address in found:
        FILE.write("update [table] set [column] = 0 where [column_value] like '"+email_address+"'\n")
    # Closes the file so data can be written
    FILE.close()

if __name__ == '__main__':
    grab_email(sys.argv[1])

As you can see I even managed to write out the SQL I needed with each row in the file! The other thing is this is reusable and I can adapt it in the future, so a bit of up front work has hopefully saved me hours later on.

Hopefully the above will help someone else out in the future as well.

Useful resources I used were:
Extracting email addresses from any text file with python
Python Documentation
An SEOS Guide To Regex
Not forgetting Google!

Wednesday, 18 May 2011

Social and Search?

This was originally written as a guest post, but it never made it live, so here it is:

I have been using the web since about 1995 and since then a lot has changed. I created my first website in 1999 while at university and it was found by about 2 people apart from the people I told, as I didn't understand about search engines. Between 2000 and 2007 search engines played the biggest part in the discovery of new content, that and email. But since 2007 social has slowly been building up and up until last year, when it exploded. I remember searching Google on the first day of the F1 championships, only to see a Twitter stream right there in the centre of the results providing real time updates from around the web on what was happening. Or organising a holiday over Facebook with friends from across the country. The social web allows interaction and discovery on-line in ways not possible 10 or even 5 years ago.

So what does this mean, and what are the possible relationships between search and social? Should you think of them as being married, walking hand in hand?

Well, to start with, having the best social strategy in the world and a poor site is going to get you nowhere, as you need to be able to be found via multiple channels, and the best way to ensure independence is to have your own site. Search engines are not going to be disappearing overnight and still provide the majority of traffic to most sites. If your analytics package is showing people are not engaged with your website content, are people going to be flocking to your Twitter / Facebook / LinkedIn page? Consider doing some basic Search Engine Optimisation (SEO) or Search Engine Marketing (SEM) and try to get an understanding of who visits your site, where from and when. Once armed with some insight you can begin your social campaign!

Along with social comes brand. This is becoming more important with search, and a lot of the recent changes in Google’s algorithms have favoured strong brands. How do search engines get a good idea of brand awareness? Social signals: both Bing and Google have stated that they use signals such as Twitter updates and Facebook likes / shares to gain understanding. Recently Google has furthered its attempt at social by introducing the +1 button, which at present allows someone to +1 a search result and will eventually let them +1 a web page via a button in the same way as a like / tweet. This will further enhance the social data Google presents to people and make “gaming” the rankings a little harder.

So do I think search and social are closely linked? The short answer is yes. The sites you like and share both on-line and off plant seeds in other people's minds, who then search for them. Social networks allow viral content to spread as well as provide targeted discovery of content through the interests of friends, business contacts and acquaintances. This in turn is used by search engines to help build a better picture of the web and the content to return in a search query.

Do you need to be on every social platform? No. You need to pick the right ones for your audience and ensure the content you share adds to your brand and existing on-line portfolio. You also need to be prepared to communicate with the people that follow you. Social is about understanding your audience and working with them to enhance knowledge and experience on and offline.

Having the right social strategy will provide you with a great opportunity not just in the social networks but also in the search results, and is something I would make sure was on my digital marketing list of activities.


Monday, 18 April 2011

What's the Waze? Social Navigation

With the advent of social networking and smart phones with Global Positioning System (GPS) capabilities, a unique opportunity opened up in the rather boring world of mapping and navigation.

Which_way by Matthew Brookes

I have always enjoyed hiking and camping, so from a very early age I have been able to read an OS or road map. This is a very useful skill, as technology does not always have the answer. Still, for day to day stuff life should be easier.

Back in 2007 my phone at the time, a Nokia N95, had not only a great camera for a phone but also a GPS which, when combined with the rather poor mapping software, could sort of tell me where I was heading. It also had some software for tracking when I went out walking or cycling. This was the first time I had used any personal GPS software, other than when my Dad used his TomTom in the car, and I thought it was great even if a bit clunky.

The appearance of on-line tools like Google Maps also meant you no longer had to think about how long it should take to get from A to B or which route to take. A quick search and there was the result (now with traffic info), which could be printed out and used on your next journey. I sort of feel sorry for the AA route planner, which up until Google Maps gave the best directions you could get (some would argue it still does).

Fast forward a couple of years and, armed with a nice new iPhone 3G with a proper web browser and Apps!, a whole new ball game was under way. First I used the built in Google Maps software. This was fun and a massive improvement on the Nokia: I could even plan a route and the GPS could track me along the road, ace!

The real problem I had though was no voice commands, and as you are not allowed to be driving while fiddling with your phone, voice was a major feature missing for me. It was also around this time I found out about the Open Street Map project. This is a crowd sourced mapping project which, looking back at it today, is fantastic, providing an easy way to include mapping information in your apps or teach people about mapping. The project introduced me to crowd sourcing and the social aspect of mapping: here were hundreds if not thousands of people around the world contributing on a daily basis to improve everyone's understanding of the places they lived in, as well as providing useful mapping information.

After using Open Street Map for a while and realising their API could be used for a mapping application, I stumbled across Skobbler. This app did exactly what I was after: it provided routes to destinations with voice commands, bingo! After using the app for a while I started thinking: with all this mapping data at Open Street Map, and Skobbler using it to provide me with directions, wouldn't it be good if, while using the App, I could provide data back to them so as to improve things?

Well as luck would have it one of my friends (@AlexBuchta) brought to my attention Waze.

Waze is the social way to user generated navigation. Unlike Skobbler, when using Waze you are building the map as you travel! In some ways this is a little scary, as you would think a navigation app was supposed to already know the way. However, the more you travel the better the maps get, the more people that use the app the better the app gets, and the best bit is the gaming aspect.

Waze allows you to set up an account with them and link it to Facebook and/or Twitter; if your friends are using it you can see how many points they have achieved. Points are awarded for all sorts of things: mapping new roads, using the app on multiple days, fixing mapping issues and so on. However, what I liked best were the cup cakes, yum yum. As you drive using the app you unlock certain achievements:

As you build up more points you get to customise your character a little as well, so you can pick a different type of vehicle or select a mood based on how you are feeling that day (generally hungry, I find). If you start using the App in an area not known by Waze, no problem: you can plough the roads as you drive. It's fun when you suddenly find out someone else is using the App and all of a sudden two parts of the map join up. It's at this point the App gets more useful, as the work you have been doing helps all those other Wazers out there.

As you can see from the screen shots, Waze is international as well, so you can take it on your travels, but don't forget your international phone tariff!

So once you have your local area mapped out and a few long journeys under your belt, what else does the app do?

It can give you voice directions on your journey. If you travel certain routes regularly it learns this and offers up guidance on your journey; the first time the app did this was a little scary and I was a bit disappointed at being so predictable! As it's a community based App you have the ability to report various different things on your travels, be this traffic jams, accidents, police etc. It even allows you to take a photo and upload it. All this info goes back to some super type computer, I guess, and with the help of an algorithm it means Waze can alert other travellers to potential problems, which gets you from A to B in the quickest time!

They provide different goodies depending on the time of year and on certain special occasions such as Easter, Valentine's or Christmas, which keeps things interesting as you travel around. It's not too often you spot an Easter egg or present sat in your path, and these give you bonus points.

If you live in an area where there are lots of people using Waze you can join groups. This means you can get updates from the people around you, which also means the traffic info is really relevant and up to date.

So far Waze has been the best free App I have found for car navigation. It might not have all the roads pre-mapped, but that's part of the fun. The simple points system keeps things interesting, but the best bit is the community and the way that the more people use it, the better the maps and directions get.

A couple of things I would like to see in the App are:
  • Points of interest - by this I mean services on the road. It would be great to get a heads-up on petrol stations or the next place I can find a toilet.
  • In App ads - a little odd, but often when travelling it would be good to be able to get a deal on something. Not too many, mind you!!
I would also like to see a time lapsed video of the UK Waze map forming. I think this would be really interesting, as you could try and spot which parts of the country were early adopters.

Have fun Wazing and check out the Waze site:

Monday, 7 March 2011

QR Codes In the Recruitment Industry

A little bit of background before I get into ideas on how to use these efficiently. QR codes are two-dimensional bar codes and, as I have just found out via Wikipedia, they were invented way back in 1994 in Japan by a Toyota subsidiary as "quick response codes". A QR code holds data (over 7,000 numeric characters). This can be anything from a vCard to text or a URL, and the best bit is they can be read by a mobile phone!

So for those that have not seen one below is an example:

QR code

If you have a mobile phone such as an Android or iPhone you should be able to point the phone's camera at it and be taken to the homepage of this blog! Neat. As an aside, on my iPhone 3G I had to install an app to make it work (my recommendation is mobiletag).

I first became aware of QR codes back when I had my N95, so at least 2 years ago, and remember pointing my camera at the back of a Pepsi can while sat in a cafe one lunch time, with my girlfriend asking what on earth I was up to. So began the explanation that I was trying to enter a competition via this weird image on the side of the can, which should take me to a web page where I could fill out a form and win! This did not happen, as my phone did not have the right software, but I thought it was great advertising: the can was just going to be recycled once I finished, so Pepsi had a very small window of opportunity to get me to enter, and the QR code made it simple and fun to do.

Since then I have seen QR codes used in many different ways for marketing products and services, and even on TV for Waitrose:

Since I initially became aware of QR codes I have slowly seen more and more information about them, and obviously from the couple of examples above mainstream companies are already utilising them in many different ways.

So why do I think the recruitment industry should take note and possibly use them in certain instances? QR codes provide a way to bridge the gap between off-line and on-line campaigns. As the ads run by Waitrose and Pepsi both demonstrate, when working with mediums where it is difficult to engage someone in a short period of time or encourage them to make a split-second decision, a QR code is a simple and easy solution to implement.

Possible recruitment uses are, in my opinion, focused around off-line. Newspapers and magazines are both used to display job adverts, but what if you also posted QR codes linking back to your website? Now that potential candidate on a train or in a waiting room who picked up a random paper / magazine can quickly be viewing the live vacancy and possibly even apply via a suitably enabled mobile website. Best of all, you now have their attention, so you could convert them to Jobs by email, provide further details on other similar vacancies or call them back.

Advertise jobs on TV? Add a QR code to your advert and all those people sat at home with their mobiles could be viewing your site during the rest of the break without even having to get up to switch the computer on.

You could think about using QR codes in window displays along with the job advert. Essentially, what I am getting at is that anywhere you currently use off-line advertising is a potential opportunity to add a QR code to enhance the current information. They help to reduce a person's thinking, as you no longer have to remember the long URL or that strange random shortened URL, making life a little more convenient.

If you would like to see who's using them, take a look at this Mashable write up. It has one of those handy info-graphics: a 1200% increase in use in less than 6 months is not bad going!

Where is it heading? Until you can point your mobile camera at something and get information as well as meta data about it on screen, expect to see an increase in QR code use. However, I do think QR codes are only providing a bridge and that technology such as Google Goggles and Layar is where things are headed.

If you have any interesting examples of QR use let me know in the comments.

Further reading links:
  • A company specialising in helping companies effectively use QR codes
  • Software for reading bar codes and QR information
  • Background information
  • Siemens recruitment example

Saturday, 12 February 2011

Authority And The Web Of Influence

Trend Influence

Finding great content on-line is hard. Even with Google, Bing or Blekko to hand, identifying the best resources on the web is not that straightforward!

With the rise of social networks you were suddenly able to find and follow people of interest, leading to the discovery of new websites or content which you may not have picked up on by just reading industry news sites or blogs. I know that through the sites and networks I belong to I have been able to find and learn about new technologies faster than I ever was with just the blogs I follow, and this has been excellent in terms of expanding my knowledge and the resources I use.

One of the best things about this is the fact that you are not necessarily reliant on following those famous people everyone knows about or journalists. It has opened up a market of amateur enthusiasts, industry professionals and people keen to share their knowledge and made them easy to find.

Now that there is all this sharing going on, how do you know who the best people to follow are? Or even out of the people you interact with who are the most influential and in what areas?

Question mark in Esbjerg

Two companies I have become aware of that are trying to answer this question are PeerIndex and Klout. Both services are similar in that once signed in they give you a score out of 100, with 100 being the best. Klout looks at your Twitter and Facebook profiles, whereas PeerIndex covers Facebook, Twitter, LinkedIn, and blogs or sites of your choosing.

At present I think PeerIndex is slightly better because of the wider range of services and sites covered, as I feel this gives a more rounded take on your on-line life. I also like their interface a little better, but you should take a look at both before trusting me.

So what does it mean having a score out of 100 for your social and on-line behaviour? Well, so far not much. Both systems sort of give you information on how likely people are to act on what you say, pass on the message etc. If you're into trying to understand how effective your Tweets / Facebook messages / blog posts are, then they give you a different perspective to something like URL shortener stats or your favourite analytics package.

Where things start to get interesting, however, is when other services make use of their APIs. Datasift, which I talked about the other week, allows you to filter data based on a person's Klout or PeerIndex score, so all of a sudden one can filter data for the most "influential" people talking about BMWs, for instance. Advertisers could then use this to target messages or approach people to promote products. People could use it to look at brands, products, schools or political parties and then start ranking them.

Although I am not strictly sure this would be seen as OK, imagine an HR or recruitment system integrated with this sort of information: you could suddenly find your worth being decided by a random number some web service has assigned to you. I can just see reports of people not only combing Twitter and Facebook for details on potential new employees but then taking those handles and plugging them back into services like these for more insight into whether they are the right fit for a business.

That being said, I do think both services are useful; although not perfect yet, you can see that what they are trying to achieve could be helpful in the future. If you think about some of the things search engines are starting to take into account when ranking content, such as Twitter and Facebook data, you can understand that there is value in what you say and share on these services and that it can impact the wider world. Being able to identify people of authority on different subjects, or who have the ability to influence, is important not only for search results but also in helping people connect and learn about new subjects or make decisions.

To find out more and see my score:

API pages:
PeerIndex Developers
Klout Developers

Sunday, 23 January 2011

Data and there's a lot of it?

Over recent years data has exploded, driven especially by the real time web and media. With this comes a number of problems for many businesses and people: tracking, monitoring, understanding, engaging with it and filtering noise. Having Google Analytics or some other on site monitoring tool installed on your site, and ensuring you have registered with the various search engines and their webmaster tools, goes some way towards understanding what's happening on your site and identifying problems. But what this does not tackle is what is being said on the wider web: keeping an eye on what your competitors are up to, tackling outbursts by clients or customers, or any of the many other things I am sure you can think of that happen every day on the web.

Trying to solve these problems is not straightforward, especially as the data keeps on growing. One company offering an abstracted layer on top of this to help simplify the problem is Mediasift. I first became aware of Mediasift (under the name it went by then) about 3 years ago while at a British Computer Society event on Pro-Blogging. One of the attendees was a man called Nick Halstead, and during the talk he mentioned his company, which curated channels based around RSS feeds from blogs, making it easy for anyone to find content they were interested in and to follow as well as comment on it.

This was a great idea as it solved the "what is RSS?" question that most people without a technical understanding of the web have, as well as explaining that odd orange icon in the address bar that my parents ask about. With that I signed up to use it as soon as it came out, and it provided me with some new resources to follow and helped me to discover new content quickly. However, around the same time another service was gaining traction on the internet: Twitter (I think Nick was even using it during the talk!).

So out of Twitter and their great API came TweetMeme. As far as I can tell this was a great success and still is, with sites such as Mashable using their retweet button. The following is taken straight from the TweetMeme site but explains their offering much better than I could:

"TweetMeme is a service which aggregates all the popular links on Twitter to determine which links are popular. TweetMeme categorises these links into Categories, Subcategories and Channels, making it easy to filter out the noise to find what you're interested in."

So this leads nicely on to Datasift, a new web service (SaaS) going through Alpha testing at the moment. I was lucky enough to get an Alpha invite and before Christmas was playing with the service, which is fantastic! So the problem I opened with (all that data, many different APIs to learn, and it being difficult to get started) suddenly starts to get a little easier.

Datasift pulls in data from all over the web: Twitter, Myspace, Digg, Wordpress, Buzz, Six Apart and Facebook were listed last time I looked. As you can see they cover some of the top social destinations on the web, and I am sure the number of sources will continue to grow. The key thing to think about here is that Datasift now provide a one stop shop for all that public data, so that's one API to learn and integrate with (*time saver).

One concern I had was that it was going to be difficult to get started; this turned out to be misjudged. If you are an Excel "guru" or have a basic understanding of SQL, using Curated Stream Definition Language (CSDL) is nice and straightforward (it was FSDL last time I logged in, so things are moving along quite quickly). I had a great stream up and running in less than an hour doing something basic: pulling job references from multiple sources.

Once you have something simple in place it's time to read the documentation, as they allow you to do all sorts of great things with the data, such as play with geo information or look at influence metrics provided by sources such as PeerIndex and Klout. Streams can also be plugged together (using a unique ID called a Definition), which means you can build one and plug another on to it to quickly iterate on ideas. The software also has a published list of Streams people have shared, which anyone can build on or use.

Here is a quick example so you get a small idea of what it's like to create a Stream:

((twitter.text contains_word "SEO" or twitter.text contains_word "SEM") and not twitter.text contains "guru") and language.tag == "en"

That would produce a list of all the English tweets containing either SEO or SEM and nothing to do with the word guru. Now you have that, how about only those people classed as influencers? Easy:

(((twitter.text contains_word "SEO" or twitter.text contains_word "SEM") and not twitter.text contains "guru") and language.tag == "en") and (peerindex.score >60 or klout.score >60)
As you can see it's really easy to build streams that are focused around a subject you're interested in, as well as to easily filter out all the stuff you don't want to see.
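For anyone more at home in Python than in a filter language, the logic of the two streams above can be mimicked on plain dictionaries. This is only a rough sketch and not Datasift's API: the field names are made up, and I am assuming contains_word means a whole-word match and contains means a simple substring match, both ignoring case.

```python
import re

def contains_word(text, word):
    """True if word appears in text as a whole word (case-insensitive; an assumption)."""
    return re.search(r'\b' + re.escape(word) + r'\b', text, re.IGNORECASE) is not None

def matches_stream(item, min_score=None):
    """Mimic the two filters above on a plain dict (hypothetical field names)."""
    text = item['text']
    wanted = contains_word(text, 'SEO') or contains_word(text, 'SEM')
    no_guru = 'guru' not in text.lower()
    english = item['language'] == 'en'
    if not (wanted and no_guru and english):
        return False
    if min_score is None:
        return True
    # Second filter: either influence score must be above the threshold.
    return item['peerindex'] > min_score or item['klout'] > min_score

# Made-up sample data standing in for incoming tweets.
items = [
    {'text': 'New SEO tips for 2011', 'language': 'en', 'peerindex': 72, 'klout': 40},
    {'text': 'I am an SEO guru', 'language': 'en', 'peerindex': 90, 'klout': 90},
    {'text': 'SEM basics', 'language': 'en', 'peerindex': 10, 'klout': 20},
    {'text': 'Conseils SEO', 'language': 'fr', 'peerindex': 80, 'klout': 80},
]
first = [item['text'] for item in items if matches_stream(item)]
second = [item['text'] for item in items if matches_stream(item, min_score=60)]
```

The first list keeps the two English, guru-free items; the second keeps only the one whose PeerIndex or Klout score is over 60. The attraction of the service is that you write the one-line filter and they do this work across the whole stream for you.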

Some ideas I think Datasift is going to be good for:

  • Research - using their API you can slice and dice data a whole host of ways, meaning you can quickly build up data sets, get a snap shot of the public's perception and understand what's going on in real time. Identify trends from multiple sources.
  • Dashboards - I expect to see lots of people use this service to add meaning to existing data. Imagine adding a client to Salesforce and being able to pull in tweets, identify influential people on the web associated with the business, gather the public's view of the business, as well as identify blog posts mentioning them or their products.
  • Vertical / Niche content services.
  • Mashups - See here for an example:
Benefits of the service:

  • Easy to get going. 
  • They handle all the masses of data; you only get the stuff you are interested in.
  • Real time data. 
  • Multiple input sources. 
  • API to integrate with your software
  • Great customer service!
As you can see I am pretty excited about the service and the opportunities it opens up, now all I need to do is come up with something great. 


Tuesday, 11 January 2011

Location Location and ... Jobs

I recently started a conversation on LinkedIn with regard to location based search and jobs, in particular a new feature Total jobs have added (see here for an example; http://). There were lots of positive comments and some good ideas suggested, mainly about all the additional information that including mapping gives. This got me thinking how powerful mapping information can be, as the visual representation goes so much further than just having the word displayed and provides the end user with a richer, more engaging experience.

Some of the ideas highlighted were:
  • Travel details such as train stations, underground stations, buses etc.
  • Additional location specific details such as highlighting parking or local amenities.
Recruitment sites showing this rich information to candidates set themselves apart from the other sites because not only do they potentially help you find a job but also provide you a depth of knowledge that you would have to research yourself otherwise. It also helps to speed up the decision making process meaning companies should get more relevant or engaged candidates applying. 

The types of business I see this benefiting the most are direct employers, as these are the ones most likely to disclose highly targeted location information, and this could lead to reduced spending on external employment agencies. This was hinted at in the video posted on the Guardian website, where a number of employers commented that they would be looking to hire direct more this year.

Something else corporate or direct employers can already take advantage of when thinking about location based information is rich snippets and structured data. The key area to think about here is search engines' support of structured business data such as address details; using the right type of mark up within a web page is what makes the difference.

If for instance you were to encode your business's address details using the hCard microformats mark up, you could include geo data such as the longitude and latitude of the location. Now when search spiders crawl the web page indexing the data, they can use this information to pin the business to a point on a map or target search results within local listings. Relating this back to the above, you can now post business address details with each vacancy (on your site) including the Rich Snippet information, giving your vacancy the maximum opportunity to appear in location based searches by candidates; not only that, your vacancy could also appear in map results.

Location information could also be posted in Tweets to add great context to them. It also means that when the information is processed by third party software using the Twitter stream, your message is more highly targeted, and when combined with a Twitter user's location you can only start to imagine what the possibilities are, not only for the recruitment sector but for any business.
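To give a feel for what that mark up looks like, here is a small Python sketch that builds an hCard snippet with geo data for one address. The class names (vcard, fn, org, adr, street-address, locality, postal-code, geo, latitude, longitude) are the ones defined by the hCard microformat; the function name render_hcard, the business details and the coordinates are made up for illustration.

```python
from html import escape

def render_hcard(name, street, locality, postcode, latitude, longitude):
    """Return an hCard snippet with geo data for a business address."""
    return (
        '<div class="vcard">\n'
        '  <span class="fn org">{name}</span>\n'
        '  <div class="adr">\n'
        '    <span class="street-address">{street}</span>\n'
        '    <span class="locality">{locality}</span>\n'
        '    <span class="postal-code">{postcode}</span>\n'
        '  </div>\n'
        '  <span class="geo">\n'
        '    <span class="latitude">{lat}</span>,\n'
        '    <span class="longitude">{lon}</span>\n'
        '  </span>\n'
        '</div>'
    ).format(
        # Escape the text values so characters like & stay valid HTML.
        name=escape(name), street=escape(street),
        locality=escape(locality), postcode=escape(postcode),
        lat=latitude, lon=longitude)

# Made-up example business.
snippet = render_hcard('Example Ltd', '1 High Street', 'Reading',
                       'RG1 1AA', 51.4543, -0.9781)
```

A block like this could sit alongside each vacancy on your own site, so the address and coordinates travel with the job details when the page is crawled.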

Links: - Google webmaster guide to Rich Snippets and HCard