[ga] FC: What's so bad about Total Information Awareness? by Ben Brunk
All assembly members,
This might be of some interest with respect to the
Whois and privacy concerns related to Homeland Security
Department...
Very good questions posed here...
================ From Politech follows ================
---
Date: Mon, 09 Dec 2002 22:34:13 -0500
From: Ben Brunk <brunkb@ils.unc.edu>
To: declan@well.com
Subject: Debunking TIA
Declan,

I'm in the middle of writing a dissertation relating to online privacy,
but I have been completely sidetracked by the recent discussion over the
Total Information Awareness program authorized by the Homeland Security
bill that just passed into law. All I've seen so far are a lot of
reactionary editorials written by people who haven't put an ounce of
effort into analyzing the proposed system. They seem infatuated with the
TIA logo, its slogan, and Poindexter. I have read, with avid
fascination, all the dire predictions and scary stories about a new Big
Brother system spearheaded by a felon who managed to avoid
accountability. What I have yet to see is a rational analysis of the
idea itself from someone who knows something about computers, databases
and statistics. I hope to fill in that gap as best I can, though I'm
sure there are experts out there with even better background in the
appropriate research fields.

From what I have been able to find out about the TIA program, it is
supposed to be a massive computerized dragnet that culls information
from dozens of different sources and is intended to locate potential
terrorists so that government agents can scrutinize them more closely.
This system will draw data from sources such as credit reports, bank
records, airline reservation systems, police records, gun purchase
records, and many others. Many of these sources of information are
private databases owned and maintained by the corporations that rely on
them. Even if they were all implemented in, say, Oracle, it would be
difficult to match up records to any reliable degree. Who knows if the
John Poindexter in one database is the same as Jon Pointdexter in
another? The social security number, which is apparently the holy grail
of database keys, is not necessarily going to help since many of these
companies did not collect it or use it as a key. Name and address might
make a good cross-referencing key, but people move all the time, and I
get three catalogs from a company that I purchased items from three
times; even their internal database is not sophisticated enough to
detect slight differences in spacing or my apartment number using a '#'
instead of 'apt' or 'apartment'. This is just inside one organization;
we're not even trying to connect any dots yet. It will be easier to
match records kept by the government, especially if they include SSNs
and fingerprints. However, errors in government databases are well
documented (although not readily admitted to). Those systems contain
large numbers of errors, and even when errors are located and fixed,
they have a nasty tendency to recur when data is shared or re-shared. If
you fix an error in your Experian credit report but not in your TRW
report, the Experian error will often reappear. Many people play this
sort of "whack-a-mole" game for years.
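
To make the record-matching problem above concrete, here is a minimal
Python sketch. It is only an illustration under stated assumptions: the
normalize_address and name_similarity helpers, the sample addresses, and
the "Joan Poindexter" record are all hypothetical, and the name
comparison uses the standard library's difflib rather than anything a
real record-linkage system is known to use.

```python
import re
from difflib import SequenceMatcher


def normalize_address(addr: str) -> str:
    """Hypothetical normalizer: lowercase, unify apartment markers, drop punctuation."""
    a = addr.lower()
    # Treat "apartment", "apt", "apt." and "#" as the same marker.
    a = re.sub(r"\b(?:apartment|apt)\b\.?\s*|#\s*", "apt ", a)
    # Replace any remaining punctuation with spaces, then collapse whitespace.
    a = re.sub(r"[^\w\s]", " ", a)
    return " ".join(a.split())


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Three spellings of one (made-up) address, as in the catalog example.
addresses = ["12 Main St #4", "12 Main St. Apt 4", "12 main st  apartment 4"]
normalized = {normalize_address(a) for a in addresses}

# A misspelled variant of one person versus a genuinely different person.
typo_score = name_similarity("John Poindexter", "Jon Pointdexter")
other_person_score = name_similarity("John Poindexter", "Joan Poindexter")

print(normalized)
print(f"typo variant: {typo_score:.2f}, different person: {other_person_score:.2f}")
```

Cleaning up the address variants is the easy part. The harder part is
that a misspelling of one person and the name of a different person can
both score close to 1.0, so any threshold loose enough to catch typos
will also merge records that belong to distinct people.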

Another matter that no journalist has touched on, and the one I think is
the biggest nail in TIA's coffin, is that database errors are several
orders of magnitude more numerous than the terrorists in the world. All
databases contain errors. Data culled from multiple, heterogeneous
sources is going to have lots of errors. I don't have current estimates
on the average expected error rate in a database, but let's suppose it
is 5%. That means that in any given database, 95% of the data is right
and 5% of it is junk. Garbage in, garbage out. The junk consists of
errors such as misspellings, flipped bits, transposed numbers, and
transaction entries that never took place or were unintentionally
duplicated or omitted. Five percent isn't a big deal until you look at
it on the scale of what TIA is proposing. There are approximately 300
million people in the United States. Those 300 million people are very
busy consumers, and their paper trail is enormous. There are trillions
of transaction records, log entries, and records that TIA would have to
amass, standardize, and then examine. Even if the government buys all
the necessary computing power and the very best staff, the government
can't do anything about randomness. The 5% expected error rate is the
monkey wrench in the works. 5% of 300 million is 15,000,000, and that
assumes only one data point per person. Say TIA looks at 500 data points
for each person. Now we are looking at 300 million times 500, or
150,000,000,000 data points. 5% of that number leaves us with
7,500,000,000. That is seven and one half billion erroneous data points
if they want to look at every American. Worse, this is not a one-time
scan. For any hope of success, they would have to look longitudinally.
That is, every year, month, day, hour, whatever. Some indications of
terrorism are very subtle: People who plan terror don't just run out and
buy their entire list of bomb making ingredients in one day and then
book a flight. Terrorists are slow and methodical. They plan over months
and years. So what we're looking at here is 150 billion data points, 7.5
billion of them wrong, examined day in and day out for years and years.
With a 5% error rate, the number of false positives is outrageous, no
matter what analysis technique is used (and any analysis technique will
have its own error rate). There is not enough manpower in the entire
federal government to possibly track down every lead generated, even if
much of that work is automated. With each passing day, homeland security
will drown a little more in a hopeless pile of randomly generated false
leads that grow even on weekends and holidays.
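
The arithmetic in the paragraph above can be checked with a few lines of
Python. This is just a sketch of the back-of-the-envelope estimate: the
population, the 500 data points per person, and the 5% error rate are
the supposed figures from the text, not measurements, and the function
name is made up for illustration.

```python
POPULATION = 300_000_000        # approximate US population used in the text
DATA_POINTS_PER_PERSON = 500    # supposed number of data points per person
ERROR_PERCENT = 5               # supposed database error rate, in percent


def expected_bad_points(population: int, points_per_person: int, error_percent: int):
    """Return (total data points, expected erroneous data points)."""
    total = population * points_per_person
    # Integer arithmetic keeps the result exact.
    bad = total * error_percent // 100
    return total, bad


total_points, bad_points = expected_bad_points(
    POPULATION, DATA_POINTS_PER_PERSON, ERROR_PERCENT
)

print(f"total data points:     {total_points:,}")
print(f"erroneous data points: {bad_points:,}")
```

Changing any of the three constants shows how quickly the pile of bad
data grows or shrinks with the assumptions.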

Let's suppose there are 1,000 terrorists hiding out in the USA, waiting
to strike, which I personally think is a greatly exaggerated number. We
know from the actions taken on 9/11 that these people are fairly
cunning. They know how to hide from the system and how to hide in plain
sight. They pay in cash, or they buy what they need by proxy, and they
don't act any different than anyone else. Like the millions of illegal
immigrants in the US, terrorist operatives are good at using social
networks to "fly below the radar" and subvert the system. One thousand
people is a lot, but 1,000 out of 300 million is 3.33 * 10^-6, or
0.000333%. In other words, TIA would be looking for a minuscule fraction
of 1% of the population in their database, the exact people who are
going out of their way to escape detection. With an error rate of even
1%, detecting such a tiny fraction would be impossible. You would not be
able to separate the signal from the noise, no matter what techniques
were used. Pollsters run into this problem every election season when
the 'margin of error' rises to a level greater than the projected
differential between the candidates. A 3% margin of error in a race
where the candidates differ by 1% is "too close to call." The same
problem exists for scanning all airport baggage, but that is fodder for
another day. The only way TIA would work is if some high percentage of
Americans were terrorists: 20%, 50%, whatever. Only then could there be
enough comparison data in both sets to draw testable conclusions from
and be assured that those conclusions were not just random error
phenomena.
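
The signal-to-noise argument above can be sketched as a simple base-rate
calculation. The 1,000 terrorists, the 300 million population, and the
1% error rate come from the text. The 99% detection rate is an
assumption added here purely for illustration, and it is deliberately
generous to the system; the function name is hypothetical as well.

```python
def flagged_breakdown(population, terrorists, true_positive_rate, false_positive_rate):
    """Return (true hits, false hits, share of flagged people who are real hits)."""
    innocents = population - terrorists
    true_hits = terrorists * true_positive_rate
    false_hits = innocents * false_positive_rate
    precision = true_hits / (true_hits + false_hits)
    return true_hits, false_hits, precision


# Assumed detector: catches 99% of terrorists, wrongly flags 1% of everyone else.
true_hits, false_hits, precision = flagged_breakdown(
    population=300_000_000,
    terrorists=1_000,
    true_positive_rate=0.99,
    false_positive_rate=0.01,
)

print(f"real terrorists flagged: {true_hits:,.0f}")
print(f"innocent people flagged: {false_hits:,.0f}")
print(f"share of flagged people who are terrorists: {precision:.4%}")
```

Under these assumptions roughly three million innocent people are
flagged alongside fewer than a thousand real targets, so only about one
flagged person in 3,000 is an actual terrorist.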

Let's look at this on a much smaller scale: Suppose the system worked
well enough each day to render a list of 10,000 people, one (1) of whom
is an actual terrorist (unbelievably good odds for the government). The
government has a 0.01% probability of successfully picking the terrorist
each day (using this system alone). Could the FBI/CIA/NSA/whatever even
investigate 10,000 people with other techniques carefully enough each
day to locate the one terrorist? Could they do it in a month or a year?
I suppose the government could err on the side of caution and detain
large numbers of people, place them in custody, and hold them
indefinitely without due process until certain that they weren't
terrorists. But that action presents nightmarish logistical and
humanitarian prospects. The US prison population is bursting at the
seams with an all time high of two million. There would have to be
enormous concentration camps for the millions of suspected terrorists
who would be detained until their innocence is proven. That raises the
question: Is it even possible to prove you are innocent in the current
legal climate? The Red Scare (and the more recent FBI watch lists) has
already taught us the folly of black lists and unsubstantiated
accusations.

Lastly, data mining as a useful technique has been thoroughly debunked.
It never lived up to its promises. This is why you don't hear much about
data mining in the CS and IS literature these days; what of it that is
left has morphed into the more esoteric "knowledge management" or KD.
Like AI, it turned out to be quite a bit more difficult to do than
expected and has been largely abandoned. Had anyone in the government
actually bothered to read any of the literature, they would already know
this.

All in all, I can't see how TIA will do anything except harm innocent
people and create new jobs for bureaucrats. Any numerate person who
spends five minutes thinking about what is proposed will come to the
same conclusion. If our system is going to become this arbitrary, there
are going to be an awful lot of lives ruined in this country. I fail to
see how the TIA approach could do anything positive for the war on
terror or for America in general. It will eat up resources better spent
on more proven and acceptable approaches. In fact, such a data-driven
approach might actually be more successful if it simply took a random
sampling of the population each day.

My hope is that this editorial will awaken those who are even more
skilled in computer science, statistics, game theory, etc. and that they
find the courage to speak up so we can put the brakes on the wasteful
and destructive blind alley called TIA.

Benjamin Brunk
-------------------------------------------------------------------------
POLITECH -- Declan McCullagh's politics and technology mailing list
You may redistribute this message freely if you include this notice.
To subscribe to Politech: http://www.politechbot.com/info/subscribe.html
This message is archived at http://www.politechbot.com/
Declan McCullagh's photographs are at http://www.mccullagh.org/
-------------------------------------------------------------------------
Like Politech? Make a donation here: http://www.politechbot.com/donate/
Recent CNET News.com articles: http://news.search.com/search?q=declan
-------------------------------------------------------------------------
--
This message was passed to you via the ga@dnso.org list.
Send mail to majordomo@dnso.org to unsubscribe
("unsubscribe ga" in the body of the message).
Archives at http://www.dnso.org/archives.html