Good and Bad Passwords How-To
Review of the Conclusions and Dictionaries Used in
a Password Cracking Study
How much do we know about how users create passwords? There is a lot
of anecdotal evidence but, until relatively recently, not much quantitative
evidence regarding real user passwords. The post-2000 lists of common
passwords, though often quantitative, are not academic studies. Given their
sources and methodologies, the collected passwords may not be representative
of what one would find in Windows and Unix OS user accounts, as opposed to
the web accounts that seem to be the source of most revealed passwords.
Various lists of the top n passwords show clear trends, but specific passwords
change position, sometimes significantly, on different lists. The largest
list, the 10,000, combines all similar mixed case passwords into one all
lower case password.
In the standard paper on UNIX password security,
"Password Security: A Case History," Robert
Morris and Ken Thompson describe the characteristics of most
of 3289 passwords they had collected over a period of time.
551 were one to three characters, 477 were four letters,
706 were five single case characters and 605 were six lower
case characters. "An additional 492 passwords appeared in"
various dictionaries and lists. They also said, "There was,
of course, considerable overlap between the dictionary
results and the character string searches. The dictionary
search ... produced about one third of the passwords." They
did not describe the remaining 14%. This was 1979, and the first
line starts with "Password security on the UNIX ...
time-sharing system." At the time, and for more than a decade,
UNIX passwords were limited to 8 characters; anything
longer was simply truncated.
The way I read this, the additional 492 passwords were
seven or eight character words, or reversed words, or words
with the first letter capitalized. These were the specific
transformations that were applied to the dictionary words.
About 450 (one third of the 2831 identified passwords, less 492)
dictionary words appeared in the one through six character groups.
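The arithmetic behind this reading can be checked with a short sketch. The counts are the ones quoted above from Morris and Thompson; taking "about one third" to mean one third of the identified passwords, rather than of all 3289, is an assumption, and the variable names are only for illustration.

```python
# Counts quoted above from Morris and Thompson's 1979 survey.
total_collected = 3289
by_length = {
    "one to three characters": 551,
    "four letters": 477,
    "five single-case characters": 706,
    "six lower-case characters": 605,
}
dictionary_only = 492  # found only through dictionaries and lists

identified = sum(by_length.values()) + dictionary_only
undescribed = total_collected - identified
undescribed_pct = round(100 * undescribed / total_collected)

# Assumption: "about one third" is one third of the identified passwords.
dictionary_hits = identified / 3
overlap_with_short_groups = dictionary_hits - dictionary_only

print(identified, undescribed, undescribed_pct)
print(round(overlap_with_short_groups))
```

If the third is instead taken of all 3289 collected passwords, the same subtraction gives roughly 604, so the "about 450" figure depends on which base is meant.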
In 1991, Daniel V. Klein did the only comprehensive password
analysis I've seen. The results were published as
"Foiling the Cracker: A Survey of, and Improvements to, Password
Security" [4]. He obtained
a database of 13,797 user accounts from a variety of sources, and
successfully cracked 3340, or 24%, of them. The most
computationally efficient approach was 130 variations of the
account name, user name and other personal information taken
directly from the passwd file. This yielded 368 passwords, or
2.7%, at a cost/benefit ratio of 2.83 (368 passwords divided by
the 130 variations tried).
Several name lists were used as dictionaries. In aggregate they
provided 1043 passwords or 7.6%. The cost/benefit ratio varied
dramatically with "common names" being the most productive. The
best single source was the dictionary provided in
/usr/dict/words. This yielded 1027 passwords or 7.4%. The list
"Phrases and Patterns" got 253 or 1.8% with a very good
efficiency. This included a somewhat diverse collection compiled
by Daniel Klein and others. Examples are 123abc, 4.2bsd, "get
lost", gotohell, ibmpc, itty-bitty, xyz.
"Machine Names" also found a significant number, 132 or 1% but
with a very low efficiency. This list was created from an
/etc/hosts file. It's worth noting that it has a significant number
of ordinary words and names in it, as well as hundreds of demo9999
names. It would be interesting to know how many should have been
in another dictionary. Several of the lists were compiled by Dan
Klein and associates. Some are surprisingly small. The "Movies
and Actors" list is very small (118 entries, nearly always one
word per line) and eclectic, but resulted in 12 passwords. One
can only speculate, but it seems very likely that larger, more
comprehensive, and higher quality lists would yield
significantly more found passwords, perhaps at a poorer cost/benefit
ratio.
The words from the lists were each manipulated using 14 to 17
methods similar to those described in the previous
don't list. Additional
capitalization variations were performed.
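A rough sketch of this kind of word manipulation follows. The particular transformations are illustrative assumptions, not Klein's actual 14 to 17 methods; the truncation to 8 characters reflects the UNIX limit mentioned earlier, and the function name `variations` is hypothetical.

```python
def variations(word, max_len=8):
    """Yield a few illustrative variations of one dictionary word.

    These are NOT Klein's exact methods, only examples of the same kind.
    Candidates are cut to max_len characters because classic UNIX
    passwords ignored anything past the eighth character.
    """
    base = word.lower()
    candidates = [
        base,                      # the word as is
        base.upper(),              # all upper case
        base.capitalize(),         # first letter capitalized
        base[::-1],                # reversed
        base[::-1].capitalize(),   # reversed, first letter capitalized
        base + "s",                # naive plural
        base.replace("o", "0"),    # letter o to digit zero
        base.replace("l", "1"),    # letter l to digit one
        base + "1",                # digit appended
    ]
    seen = set()
    for candidate in candidates:
        candidate = candidate[:max_len]
        if candidate not in seen:  # skip duplicates created by truncation
            seen.add(candidate)
            yield candidate


print(list(variations("hello")))
print(list(variations("password")))
```

For an eight letter word such as "password", the appended characters are lost to truncation, so several candidates collapse into the base word and are skipped.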
An Analysis of Daniel Klein's Dictionaries
It's worth looking at the dictionaries used by Daniel Klein in some
detail. There were two general word dictionaries. One was
/usr/dict/words, the standard UNIX dictionary used for spell
checking. This was a small general purpose dictionary. Thus, most
of the words in it will be common, compared to some of the words
found in a collegiate dictionary, or most in an unabridged
dictionary. The other was the "junk" dictionary: 3212 miscellaneous
words that did not appear in the other dictionaries. Some
of these were more obscure words, but others are
character sequences that do not appear to be words from any
language I've ever seen; the comment in the file admits this list
contains many junk words.
The 19,683-word standard dictionary led to 1027 passwords at a
cost/benefit ratio of 0.052; the miscellaneous words resulted in
54 passwords at a cost/benefit ratio of 0.017. The results
support three conclusions that are consistent with common sense
and observation of common password lists. Many people use
ordinary words as the basis for their passwords. Of these most
choose words that quickly come to mind, i.e., common words. A
smaller group tries to find "obscure" words on which to base
their password. It would be interesting to know what results
an unabridged dictionary would have produced if used against
the same account and password database.
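The cost/benefit figures quoted so far are consistent with a simple definition: passwords found divided by candidate words tried. The sketch below recomputes them from the counts given in the text; the function name `cost_benefit` and the table layout are only illustrative, and the definition itself is inferred from the numbers rather than quoted from Klein.

```python
TOTAL_ACCOUNTS = 13797  # accounts in Klein's database, as quoted earlier


def cost_benefit(found, tried):
    """Passwords found per candidate tried (inferred definition)."""
    return found / tried


# (source, candidates tried, passwords found), all taken from the text above.
sources = [
    ("Account and user name variations", 130, 368),
    ("/usr/dict/words", 19683, 1027),
    ("Miscellaneous words", 3212, 54),
]

for name, tried, found in sources:
    ratio = cost_benefit(found, tried)
    share = 100 * found / TOTAL_ACCOUNTS
    print(f"{name:34s} ratio={ratio:.3f}  share of accounts={share:.1f}%")
```

A ratio above 1 is possible because each candidate is tried against every account, which is why the 130 account name variations score so much higher than any word list.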
In Daniel Klein's paper, "Common names" were identified as the
second most productive dictionary with 2239 names yielding 548
passwords at a 0.245 cost/benefit ratio. This was the fourth
best cost/benefit ratio of 27 dictionaries used. It's by far the
largest of the "high yield" dictionaries, but still less than one eighth
the size of the /usr/dict/words list, which was the only single
list to yield more passwords. In short, common first names are,
by a significant amount, the most frequent basis for passwords.
The actual contents of this dictionary are very interesting.
There is a comment at the beginning of the dictionary: "First
names garnered from a number of password files. We get a good hit
rate from these. Probably could be culled somewhat. By Daniel
Klein." Reviewing these, though the list certainly contains
many, perhaps mostly common names, I see a significant number of
names I don't recognize. The "could be culled" comment supports this.
The Census Bureau created three exceptional quality common name
lists. There is one for female first names, one for male first
names and one for last names. Each list is ordered by the
frequency with which the name occurred in the U.S. population in 1990. Each
list includes as many names as necessary so that 90% of the U.S.
population has their name listed. There are 1219 male names,
4275 female names and 88,799 last names. 3.3% of the men in the
U.S. are named James and nearly as many are named John. 2.6% of the
women are named Mary, but this is more than two and a half times
as many as Patricia, the second most common female name. 1% of
the population is named Smith and 0.8% Johnson. The Census Bureau
did a new last name list from the 2000 census, but no first name
list.
Many of Daniel Klein's "Common names" do not appear in any Census
Bureau common name list. It's unlikely that the odd names in the
common names list got good results, if any. He stated that he
requested password lists from "around the United States and Great
Britain." So it's not likely that his password list had a strong local
bias, the way a single Unix system in one of the four states bordering
Mexico might have a higher than average number of Hispanic names.
Though they were based on the
1990 census, these name lists were not created until 1996, so there
is no way that Daniel Klein could have used them. Similar
lists from other countries would most likely have a very high
return in the country of origin.
Actually it's fair to say many of Daniel Klein's common names are
not at all common. The Census Bureau lists each cover 90% of the U.S.
population (in 1990). His common name list
had 2273 names in it, but only 676 were in the Census female
names list and 543 in the male names list. Since 211 were common to both
lists, only 1008 unique names (676 + 543 - 211) out of 2273 match a Census list.
It is very surprising to study the Census lists and see how many women
have men's names and vice versa. I just obtained (Jan. 30, 2014)
the Social Security Administration's lists of baby names from
1880 through 2012. They include every name which was used for each
sex five or more times in a given year. These contained 63,246 girls' names
and 38,015 boys' names. I dropped the less used names and used
29,952 girls' names and 19,685 boys'. I had no idea how these would
match up with Daniel Klein's "Common names." They matched 1036 girls'
names and 1267 boys' names, but 845 were in both lists, so only
1458 were unique. 815 were unmatched even when compared against
almost 50,000 statistically common to fairly infrequent names
(used 50 or more times in 133 years).
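The selection and matching described above can be sketched as follows. This is a minimal sketch under stated assumptions: each SSA yearly line is taken to have the form Name,Sex,Count, the 50 use threshold is applied per sex across all years, and the function names and sample rows are made up for illustration.

```python
from collections import Counter


def aggregate_ssa(lines, min_total=50):
    """Sum yearly counts per (name, sex); keep names used min_total+ times.

    Assumes each line looks like "Mary,F,7065" (name, sex, count).
    """
    totals = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, sex, count = line.split(",")
        totals[(name.lower(), sex)] += int(count)
    girls = {n for (n, s), c in totals.items() if s == "F" and c >= min_total}
    boys = {n for (n, s), c in totals.items() if s == "M" and c >= min_total}
    return girls, boys


def overlap_report(candidates, girls, boys):
    """Count how many candidate names appear in each reference list."""
    cand = {c.lower() for c in candidates}
    in_girls, in_boys = cand & girls, cand & boys
    matched = in_girls | in_boys  # a name in both lists counts once
    return {
        "girls": len(in_girls),
        "boys": len(in_boys),
        "both": len(in_girls & in_boys),
        "matched": len(matched),
        "unmatched": len(cand - matched),
    }


# Made-up sample rows, standing in for several yearly files read in turn.
sample = ["Mary,F,40", "Mary,F,30", "Leslie,F,60",
          "Leslie,M,55", "John,M,80", "Zyx,M,7"]
girls, boys = aggregate_ssa(sample)
print(overlap_report(["mary", "leslie", "john", "qwerty", "Zyx"], girls, boys))
```

The "matched" count is the girls' matches plus the boys' matches less those in both, which is the same subtraction that turns 1036 and 1267 with 845 shared into 1458 unique names, leaving 2273 - 1458 = 815 unmatched.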
I have little doubt that if the top 800 male and 1500 female names were
taken from either the Census or SSA lists and run against Daniel
Klein's password list, either would get significantly better
results than his common name list. If the full Census lists or my
50,000 name selection from
the SSA lists were used, I'm sure many more passwords would have
been found than the 548 found by the "Common name" list. Even with
over 800 quite unusual names, this list still had the fourth highest
"Cost/Benefit Ratio." Using the full Census list I would not be
surprised if the efficiency was lower. Using 50,000 names from
the SSA lists, the efficiency would surely be much lower.
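Building such a "top N" name dictionary might look like the sketch below. It assumes each Census list line holds a name, its percentage, the cumulative percentage and the rank, separated by whitespace; that layout, the function names and the sample rows are assumptions for illustration, not data taken from the actual files.

```python
def top_names(lines, limit):
    """Return the `limit` highest ranked names, lower-cased."""
    ranked = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        ranked.append((int(fields[3]), fields[0].lower()))
    ranked.sort()  # rank 1 first
    return [name for _, name in ranked[:limit]]


def build_dictionary(male_lines, female_lines, n_male=800, n_female=1500):
    """Merge the top male and female names, dropping names in both lists."""
    merged, seen = [], set()
    for name in top_names(male_lines, n_male) + top_names(female_lines, n_female):
        if name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


# Made-up sample rows in the assumed layout: name, %, cumulative %, rank.
male_sample = ["JAMES 3.3 3.3 1", "JOHN 3.2 6.5 2", "LESLIE 0.1 6.6 3"]
female_sample = ["MARY 2.6 2.6 1", "LESLIE 0.2 2.8 2", "PATRICIA 1.0 3.8 3"]

print(build_dictionary(male_sample, female_sample, n_male=3, n_female=2))
```

Because many names appear in both the male and female lists, the merged dictionary will be somewhat shorter than the sum of the two limits.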
Daniel Klein's 62,727 word dictionary was a good first step in
building a password cracking dictionary. Today, larger, more
comprehensive, and more consistent quality lists are
available. Some of the specific dictionaries that he created
should be considered for inclusion in any cracking dictionary
that is to be built. The character sequences and
number sequences contained in the similarly named dictionaries probably
belong in any cracking dictionary, perhaps in expanded form.
Daniel Klein's dictionaries do not seem to be available at their
original or previous location but are now available
here. Matching the physical
file names to the "Types of Password" which Daniel Klein used
was not always obvious. You need to be a Unix administrator to
realize that "Machine names" and etc-hosts are a match. After
matching the obvious file names to password type, I used counts
to help match the remaining five. The only exact match on counts
was Mnemonics and abbr, and I have no idea what the file has to
do with either. The counts in the rest were within 10, usually 5, and I
checked all those I thought were obvious, to be sure I had not
made a mistake. The table below shows the password type matched
to the file name. The counts are from Daniel Klein's paper, not
the file counts.