User:Penf0ld

From Wikipedia, the free encyclopedia

Statistical Issues

Qualitative Illustration of the Statistics of Civilian Wiretapping

One fundamental weakness of domestic surveillance is the sheer volume of data that must be processed in order to find important intelligence. This low signal-to-noise ratio necessitates the use of statistical algorithms (most of which employ Bayes' theorem) to separate the unimportant communications from the important ones. The shortcoming of this approach (false identification of civilian communications as "suspicious") can be easily understood graphically:

The top row shows a series of figures relating to international communications. Figure 1A represents a population of telephone communications that either involve a terrorist on one or both sides (red) or do not (green). An eavesdropper cannot know in advance which group a given communication belongs to, so they apply a computerized data mining algorithm to "flag" messages that have suspicious traits (shown with cross-hatching). The traits might be that the messages come from certain countries, involve certain people, or contain suspicious words. Figure 1B represents the application of the data mining algorithm and shows it doing a very good job of separating the signal from the noise: almost all communications involving a terrorist are "flagged", and most of the innocent communications are not. Inevitably, however, some innocent communications will be flagged. This might be because they are made by Osama Bin Laden to an innocent third party, include the word "Jihad", or meet any other criterion that the data mining algorithm has concluded occurs more often in terrorist communications than in innocent ones. All flagged messages are presumably examined directly by a human being for actionable intelligence. Figure 1C represents the "flagged" content that is reviewed by human beings (in the NSA, for example). It shows that most of the reviewed messages involved a terrorist.

The bottom row of figures illustrates the limitations of this technique in actual practice. Figure 2A again shows a population of communications, but this time the division between the two groups is drawn closer to scale. This more accurately reflects the true situation: millions of innocent messages are sent for every terrorist communication that is transmitted. Figure 2B shows the same data mining algorithm as Figure 1B applied to a situation that more accurately reflects the real world. The data mining algorithm still "flags" almost all (>99%) of the terrorist transmissions, and does not "flag" most (>99%) of the innocent transmissions. Figure 2C shows all "flagged" messages, which must then be reviewed by a human spy. The proportional sizes of the terrorist and non-terrorist components reflect the cost of domestic surveillance that must be paid in lost personal privacy of American citizens.

These figures are not meant to be quantitative. Rather, they demonstrate the qualitative concept that even a very advanced data mining algorithm that flags a very small percentage of innocent civilian communications can still end up identifying more innocent communications than terrorist communications, simply because innocent communications outnumber terrorist communications in the first place.
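The same qualitative point can be written down with Bayes' theorem. As a sketch, let T be the event that a communication involves a terrorist and F the event that the algorithm flags it (both symbols are introduced here for illustration and do not appear in the figures). The fraction of flagged communications that actually involve a terrorist is then:

```latex
P(T \mid F) = \frac{P(F \mid T)\,P(T)}{P(F \mid T)\,P(T) + P(F \mid \neg T)\,P(\neg T)}
```

When P(T) is very small, the term P(F | ¬T) P(¬T) in the denominator can exceed P(F | T) P(T) even if the false-flag rate P(F | ¬T) is itself small, and in that case P(T | F) falls below one half.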

Quantitative Illustration of the Statistics of Civilian Wiretapping

A quantitative estimate can be made using conservative estimates for each piece of the diagram shown above. For the purposes of the estimate, let us assume that:

  • The "communications" in question are those affected by Bush's domestic spying program: they involve one party who is outside the borders of the United States (the "foreign participant") and one party who is inside the borders of the United States (the "native participant").
  • Communications in which the foreign participant is a terrorist and the native participant is innocent are unimportant (and need not be flagged), because a terrorist is not going to disclose details of an operation to an innocent person. Therefore, only communications in which the native participant is a terrorist are of interest.
  • There are fewer than 10,000 terrorists in the US. No statistics are available to guide this estimate, so an extremely conservative assumption is best.
  • There are 300,000,000 individuals in the United States.
  • Terrorists in the United States make the same number of international calls as the average non-terrorist in the US. This seems reasonable given that they have an incentive to minimize the number of communications they make, and given the huge number of international calls that are made from the US for business purposes.
  • The data mining algorithm catches every communication in which the native participant is a terrorist.
  • The data mining algorithm flags only 0.1% of communications in which the native participant is a non-terrorist.

With these assumptions, and counting one communication per individual, the relative area of the red wedge in figure 2A would be 10,000 communications. The relative area of the green circle would be 299,990,000 communications. The data mining algorithm will therefore flag the following:

Communications (count)                  | Non-terrorist | Terrorist | Total
Communication not flagged as suspicious | 299,690,010   | 0         | 299,690,010
Communication flagged as suspicious     | 299,990       | 10,000    | 309,990
Total                                   | 299,990,000   | 10,000    | 300,000,000
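The counts in the table follow directly from the listed assumptions. The following Python sketch reproduces them; the function and variable names are illustrative, and it assumes (as the table does) one communication per individual. Integer arithmetic is used so the counts come out exact.

```python
# Assumptions taken from the list above; all names here are illustrative.
POPULATION = 300_000_000        # individuals in the US, one communication each
TERRORISTS = 10_000             # upper bound assumed in the text
FALSE_FLAGS_PER_THOUSAND = 1    # 0.1% of non-terrorist communications are flagged


def flag_counts(population, terrorists, false_flags_per_thousand):
    """Return the four inner cells of the table as exact integers."""
    innocent = population - terrorists
    flagged_terrorist = terrorists  # the algorithm is assumed to catch every one
    flagged_innocent = innocent * false_flags_per_thousand // 1000
    return {
        "unflagged_innocent": innocent - flagged_innocent,
        "unflagged_terrorist": terrorists - flagged_terrorist,
        "flagged_innocent": flagged_innocent,
        "flagged_terrorist": flagged_terrorist,
    }


counts = flag_counts(POPULATION, TERRORISTS, FALSE_FLAGS_PER_THOUSAND)
for name, value in counts.items():
    print(f"{name:>20}: {value:>13,}")
```

Changing any of the three constants shows how sensitive the table is to each assumption.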

Since human spies must subsequently review all "flagged" messages, they deal with only 309,990 of the original messages. Of these flagged communications, 299,990 were made by non-terrorists. That means that, using the assumptions provided above, about 97% (299,990 of 309,990) of the messages being reviewed by human spies are made by non-terrorists. Stated another way, using what seem like conservative assumptions, thirty messages made by non-terrorist individuals in the US must be reviewed extrajudicially by the Federal Government without a warrant in order to find even a single terrorist communication.
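The two summary figures in this paragraph can be checked with a short calculation. The Python sketch below is only an illustration of the arithmetic under the stated assumptions; the function name review_burden and its parameters are hypothetical, and the detection rate defaults to 1.0 because the algorithm is assumed to catch every terrorist communication.

```python
def review_burden(population, terrorists, false_positive_rate, detection_rate=1.0):
    """Return (share of flagged messages that are innocent,
    innocent messages reviewed per terrorist message found)."""
    innocent = population - terrorists
    flagged_innocent = innocent * false_positive_rate
    flagged_terrorist = terrorists * detection_rate
    flagged_total = flagged_innocent + flagged_terrorist
    innocent_share = flagged_innocent / flagged_total
    innocent_per_hit = flagged_innocent / flagged_terrorist
    return innocent_share, innocent_per_hit


# Values from the assumptions in the text: 300 million people,
# 10,000 terrorists, 0.1% of innocent communications flagged.
share, per_hit = review_burden(300_000_000, 10_000, 0.001)
print(f"{share:.1%} of flagged messages come from non-terrorists")
print(f"{per_hit:.1f} innocent messages reviewed per terrorist message")
```

With the values above, the share works out to 299,990 / 309,990 and the ratio to 299,990 / 10,000, matching the figures quoted in the paragraph.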