Sunday, 2 April 2017

Word Anagrams Project - Python

Objective : 
To find Anagrams words in English dictionary. Two words are said to be anagrams if the letters can be rearranged to turn one word into another.

Example stop and pots are anagrams to each other.


Steps:
word = open ('words' , 'r')  -- Opens the word file

wordlist =  word.readlines() -- Reads the lines and put them in list

len(wordlist) -- Count the elements in the list

-- Get rid of newline character and lowercase the words

wordclean = [wordlist.strip().lower() for word in wordlist]

-- There still may be duplicates. Use set to remove duplicates
-- Set can have only one instance of any given object
-- and then convert back to a list

wordunique = list(set(wordclean)) -- Duplicates will be removed now, but sorting would be lost

wordunique.sort() -- sort the list

-- The above data transformation could have been done using list comprehension in easier way

wordclean = sorted(list(set([word.strip().lower() for word in open('words' , 'r')])))

-- Finding Anagrams

-- for anagrams , use the fact that sorted(pots) ==  sorted(stop)

def  signature(word):
    return ' '.join(sorted(word))

def anagram(myword):
    return [word for word in wordclean if signature(word) == signature(myword)]

-- Test it

anagram('dictionary')

-- This solution however is very costly, check with %timeit to see how long it takes

-- Need to find faster way to get anagrams for all the words in the file.
-- Creating signature is expensive if we keep repeating for every word in the file

-- A good idea would be to create a python dictionary of all the words indexed by their signature
-- Then getting an anagram of a given word would be as simple as looking inside the dictionary

-- Every item in dictionary would have signature as the key and a list of words as values
-- Loop through the list of words and append to the right item with signature as the key.

words_bysig = { }

for word in wordclean:
    words_bysig[signature(word)].append(word)

-- You may get an error when it tries to add an item 'a' that there is no such item in the dictionary

-- We make use of collections module where there is a default dict object
-- It provides a default value if we try to get a key which doesn't exist.
-- The default value what we want is empty list

import collections

words_bysig = collections.defaultdict(list) -- yields an empty list as the default value


for word in wordclean:
    words_bysig[signature(word)].append(word)

-- Now find anagrams by simple dictionary lookup

def anagram_fast(myword):
    return words_bysig[signature(myword)]

-- Test it


anagram_fast('dictionary')

-- get all the anagrams in the dictionary excluding the trivial ones where the word is an anagram of itself

-- build this using dictionary comprehension

anagram_all = {word : anagram_fast(word) for word in wordlist if len(anagram_fast(word) > 1}

-- Test it with %timeit, it should be relatively very fast













No comments:

Post a Comment