Objective :
To find Anagrams words in English dictionary. Two words are said to be anagrams if the letters can be rearranged to turn one word into another.
Example stop and pots are anagrams to each other.
Steps:
word = open ('words' , 'r') -- Opens the word file
wordlist = word.readlines() -- Reads the lines and put them in list
len(wordlist) -- Count the elements in the list
-- Get rid of newline character and lowercase the words
wordclean = [wordlist.strip().lower() for word in wordlist]
-- There still may be duplicates. Use set to remove duplicates
-- Set can have only one instance of any given object
-- and then convert back to a list
wordunique = list(set(wordclean)) -- Duplicates will be removed now, but sorting would be lost
wordunique.sort() -- sort the list
-- The above data transformation could have been done using list comprehension in easier way
wordclean = sorted(list(set([word.strip().lower() for word in open('words' , 'r')])))
-- Finding Anagrams
-- for anagrams , use the fact that sorted(pots) == sorted(stop)
def signature(word):
return ' '.join(sorted(word))
def anagram(myword):
return [word for word in wordclean if signature(word) == signature(myword)]
-- Test it
anagram('dictionary')
-- This solution however is very costly, check with %timeit to see how long it takes
-- Need to find faster way to get anagrams for all the words in the file.
-- Creating signature is expensive if we keep repeating for every word in the file
-- A good idea would be to create a python dictionary of all the words indexed by their signature
-- Then getting an anagram of a given word would be as simple as looking inside the dictionary
-- Every item in dictionary would have signature as the key and a list of words as values
-- Loop through the list of words and append to the right item with signature as the key.
words_bysig = { }
for word in wordclean:
words_bysig[signature(word)].append(word)
-- You may get an error when it tries to add an item 'a' that there is no such item in the dictionary
-- We make use of collections module where there is a default dict object
-- It provides a default value if we try to get a key which doesn't exist.
-- The default value what we want is empty list
import collections
words_bysig = collections.defaultdict(list) -- yields an empty list as the default value
for word in wordclean:
words_bysig[signature(word)].append(word)
-- Now find anagrams by simple dictionary lookup
def anagram_fast(myword):
return words_bysig[signature(myword)]
-- Test it
anagram_fast('dictionary')
-- get all the anagrams in the dictionary excluding the trivial ones where the word is an anagram of itself
-- build this using dictionary comprehension
anagram_all = {word : anagram_fast(word) for word in wordlist if len(anagram_fast(word) > 1}
-- Test it with %timeit, it should be relatively very fast
To find Anagrams words in English dictionary. Two words are said to be anagrams if the letters can be rearranged to turn one word into another.
Example stop and pots are anagrams to each other.
Steps:
word = open ('words' , 'r') -- Opens the word file
wordlist = word.readlines() -- Reads the lines and put them in list
len(wordlist) -- Count the elements in the list
-- Get rid of newline character and lowercase the words
wordclean = [wordlist.strip().lower() for word in wordlist]
-- There still may be duplicates. Use set to remove duplicates
-- Set can have only one instance of any given object
-- and then convert back to a list
wordunique = list(set(wordclean)) -- Duplicates will be removed now, but sorting would be lost
wordunique.sort() -- sort the list
-- The above data transformation could have been done using list comprehension in easier way
wordclean = sorted(list(set([word.strip().lower() for word in open('words' , 'r')])))
-- Finding Anagrams
-- for anagrams , use the fact that sorted(pots) == sorted(stop)
def signature(word):
return ' '.join(sorted(word))
def anagram(myword):
return [word for word in wordclean if signature(word) == signature(myword)]
-- Test it
anagram('dictionary')
-- This solution however is very costly, check with %timeit to see how long it takes
-- Need to find faster way to get anagrams for all the words in the file.
-- Creating signature is expensive if we keep repeating for every word in the file
-- A good idea would be to create a python dictionary of all the words indexed by their signature
-- Then getting an anagram of a given word would be as simple as looking inside the dictionary
-- Every item in dictionary would have signature as the key and a list of words as values
-- Loop through the list of words and append to the right item with signature as the key.
words_bysig = { }
for word in wordclean:
words_bysig[signature(word)].append(word)
-- You may get an error when it tries to add an item 'a' that there is no such item in the dictionary
-- We make use of collections module where there is a default dict object
-- It provides a default value if we try to get a key which doesn't exist.
-- The default value what we want is empty list
import collections
words_bysig = collections.defaultdict(list) -- yields an empty list as the default value
for word in wordclean:
words_bysig[signature(word)].append(word)
-- Now find anagrams by simple dictionary lookup
def anagram_fast(myword):
return words_bysig[signature(myword)]
-- Test it
anagram_fast('dictionary')
-- get all the anagrams in the dictionary excluding the trivial ones where the word is an anagram of itself
-- build this using dictionary comprehension
anagram_all = {word : anagram_fast(word) for word in wordlist if len(anagram_fast(word) > 1}
-- Test it with %timeit, it should be relatively very fast
No comments:
Post a Comment