Monday, March 20, 2006

Data Mining to find terrorists is bad for US

Remember the TIA (Total Information Awareness) program? For some reason I thought I wrote about it in the past but I can't find a post. Anyway, TIA was a program proposed by the govt. that would collect as much data as it could on every citizen of the country and then mine that data for patterns that would indicate terrorist activity. This is called data mining. TIA was killed but it still lives on in various forms in the govt.

Not only does data mining infringe on our privacy, it also doesn't work. Essentially it will create so much static (in the form of false alarms) that the system will be overwhelmed with waste:

When it comes to terrorism, however, trillions of connections exist between people and events -- things that the data-mining system will have to "look at" -- and very few plots. This rarity makes even accurate identification systems useless.

Let's look at some numbers. We'll be optimistic -- we'll assume the system has a one in 100 false-positive rate (99 percent accurate), and a one in 1,000 false-negative rate (99.9 percent accurate). Assume 1 trillion possible indicators to sift through: that's about 10 events -- e-mails, phone calls, purchases, web destinations, whatever -- per person in the United States per day. Also assume that 10 of them are actually terrorists plotting.

This unrealistically accurate system will generate 1 billion false alarms for every real terrorist plot it uncovers. Every day of every year, the police will have to investigate 27 million potential plots in order to find the one real terrorist plot per month. Raise that false-positive accuracy to an absurd 99.9999 percent and you're still chasing 2,750 false alarms per day -- but that will inevitably raise your false negatives, and you're going to miss some of those 10 real plots.

This isn't anything new. In statistics, it's called the "base rate fallacy," and it applies in other domains as well. For example, even highly accurate medical tests are useless as diagnostic tools if the incidence of the disease is rare in the general population. Terrorist attacks are also rare, any "test" is going to result in an endless stream of false alarms.

This is exactly the sort of thing we saw with the NSA's eavesdropping program: the New York Times reported that the computers spat out thousands of tips per month. Every one of them turned out to be a false alarm.


(From a Wired article)
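
If you want to check the arithmetic in the quote yourself, here is a rough sketch in Python. The function name and its parameters are my own, not anything from the article, and it assumes the 1 trillion indicators and 10 plots are yearly totals (which is how the per-day figures in the quote work out). The inputs are the article's hypothetical numbers, not real data.

```python
def false_alarm_stats(indicators_per_year, real_plots,
                      false_positive_rate, false_negative_rate):
    """Rough base-rate arithmetic for a hypothetical screening system."""
    # Everything that is not a real plot can still trigger a false alarm.
    innocent = indicators_per_year - real_plots
    false_alarms = innocent * false_positive_rate
    # Real plots the system actually flags (the rest are false negatives).
    detected = real_plots * (1 - false_negative_rate)
    return {
        "false_alarms_per_day": false_alarms / 365,
        "false_alarms_per_detected_plot": false_alarms / detected,
        # Probability that any single alarm points at a real plot.
        "chance_an_alarm_is_real": detected / (detected + false_alarms),
    }


# The article's "optimistic" case: 1 in 100 false positives,
# 1 in 1,000 false negatives, 1 trillion indicators, 10 real plots.
optimistic = false_alarm_stats(10**12, 10, 0.01, 0.001)

# The "absurd" case: 99.9999 percent accuracy on false positives.
absurd = false_alarm_stats(10**12, 10, 0.000001, 0.001)

for name, stats in (("optimistic", optimistic), ("absurd", absurd)):
    print(name)
    for key, value in stats.items():
        print(f"  {key}: {value:,.9f}")
```

The optimistic case comes out to roughly 27 million false alarms per day and about a billion per detected plot, which matches the quote. The absurd case gives a bit over 2,700 per day, close to the article's rounded 2,750. In both cases the chance that any given alarm is real is tiny, and that is the base rate problem in a nutshell.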

1 comment:

Anonymous said...

I didn't have time to read the article, I'm studying for finals, but the statistics you quoted seem very "iffy" to me (I have a statistics minor). I'm not saying they're wrong, but you have to be very careful with statistics; they are just as dangerous when trying to disprove something as they are at proving it (I think that makes sense).

As far as data mining goes, I don't feel as strongly as you do about protecting privacy. I understand the issues you have, but the problem lies in how the information is used, i.e. who has the information.