Dick Thornburgh and Herbert S. Lin, Editors

Committee to Study Tools and Strategies for Protecting Kids from Pornography and Their Applicability to Other Inappropriate Internet Content

Computer Science and Telecommunications Board

National Research Council



C. Selected Technology Issues




C.1 INFORMATION RETRIEVAL TECHNOLOGIES



Information retrieval, a function that supports people who are actively seeking or searching for information, typically assumes a static or relatively static database of digital objects against which people search. These digital objects may be documents, images, sound recordings, and so on.

The general information retrieval problem is one of making a decision about which objects in the database to show to the user. Systems for information retrieval seek to maximize the material that a person sees that is likely to be relevant to his or her information problem and minimize the material that is not relevant to the problem.

An information retrieval system works by representing the objects in a database in some well-specified manner, and then representing the user's information problem ("information need") in a similar fashion. The retrieval techniques then compare needs with objects.

A key aspect of object representation is the language used to describe the objects in question. Typically, this language (more precisely, a "meta-language") consists of a vocabulary (and sometimes precise grammar) that can characterize the objects in a form suitable for automated comparison to the user's needs.

The representation of information objects also requires interpretation by a human indexer, machine algorithm, or other entity. When people are involved, the interpretation of a particular text by one person is likely to differ from that of another, and may even differ for the same person at different times. As a person's state of knowledge changes, his or her understanding of a text changes. Everyone has experienced finding a document not relevant at one point but quite relevant later on, perhaps for a different problem or perhaps because the reader has changed. Such subjectivity means that decisions about representation will be inconsistent. An extensive literature on inter-indexer consistency indicates that when people are asked to represent an information object, even if they are highly trained in using the same meta-language (indexing language), they might achieve 60 to 70 percent consistency at most in tasks like assigning descriptors.1

Machine-executable algorithms for representing information objects or information problems do give consistent representations. But any one such algorithm gives only one interpretation of the object, out of a great variety of possible representations, depending on the interpreter. The most typical way to accomplish such representations is through statistically based automatic indexing. Techniques of this sort index (represent) documents by the words that appear in them, deleting the most common (articles, prepositions, and other function words), conflating different forms of a word into one (e.g., making plural and singular forms the same), and weighting the resulting terms according to some function of how often they appear in the document (term frequency--the more occurrences, the greater the weight) and how many documents they appear in (document frequency--the more documents, the lower the weight).
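
To make the weighting concrete, here is a minimal Python sketch of statistically based automatic indexing over a toy collection. The stop word list, the crude plural-stripping rule, and the exact tf-idf formula are illustrative assumptions, not the method of any particular system.

    import math
    from collections import Counter

    STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for"}  # illustrative stop list

    def tokenize(text):
        """Lowercase, strip punctuation, drop stop words, and crudely conflate plural/singular forms."""
        words = []
        for raw in text.lower().split():
            word = raw.strip(".,;:!?\"'()")
            if not word or word in STOP_WORDS:
                continue
            if word.endswith("s") and len(word) > 3:   # crude conflation of word forms
                word = word[:-1]
            words.append(word)
        return words

    def index_collection(documents):
        """Weight each term by term frequency times an inverse document-frequency factor."""
        tokenized = [tokenize(doc) for doc in documents]
        doc_freq = Counter()
        for tokens in tokenized:
            doc_freq.update(set(tokens))      # document frequency: how many documents contain the term
        n_docs = len(documents)
        weights = []
        for tokens in tokenized:
            tf = Counter(tokens)              # term frequency within this document
            weights.append({term: count * math.log(n_docs / doc_freq[term])
                            for term, count in tf.items()})
        return weights

    docs = ["The cats sat on the mat.", "Dogs and cats are pets.", "The mat is on the floor."]
    for doc_weights in index_collection(docs):
        print(doc_weights)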

A consistent result in information retrieval experimentation is that automatic indexing of the sort just described gives results that are at least as good as human indexing, and usually better. Performance is evaluated by a pair of measures, recall and precision.2 This result is often stated as meaning that automatic indexing is better than manual indexing, because manual indexing is so much more expensive than automatic. But it is important to keep in mind another consistent information retrieval result, that the sets (or lists) of documents that are retrieved using one technique are different from those retrieved by another technique (both relevant and non-relevant documents).

Having constructed representations of both the information objects and the person's information problem, the information retrieval system is in a position to compare these representations to one another, in order to identify those objects that match the information problem description. Technologies for accomplishing this are called search or retrieval techniques. Because people in general cannot specify precisely what it is that they do not know (that for which they are searching), because people cannot represent accurately and completely what an information object is about, and because relevance judgments are inherently subjective, it is clear that such techniques must be probabilistic, rather than deterministic. That is, any decision as to whether to retrieve (or block) an information object is always a guess, more or less well informed. Thus, in information retrieval, we cannot ever say with confidence that an information object is (or is not) relevant to an information need, but can only make judgments of probability (or belief) of (non)relevance.

It can be easily seen that many of the problems that information retrieval faces are also problems that information filtering must face. Although filtering aims to not retrieve certain information objects, it still must do so on the basis of some representation of those objects, and some representation of the person for whom the filtering is done. Filtering works by prespecifying some information object descriptions that are to be excluded from viewing by some person or class of persons. Thus, it depends crucially on having an accurate and complete representation of what is to be excluded (this is analogous to the query or information problem representation), and also accurate and complete representations of the information objects that are to be tested for exclusion. Just as in information retrieval, making the exclusion decision in information filtering is an inherently uncertain process.


C.1.1 Text Categorization and Representation

Automatic text categorization is the primary language-based retrieval technology used in content screening. Text categorization is the sorting of text into groups, such as pornography, hate speech, and violence. A text categorizer examines a text-based information object and decides the category into which it falls. Applications of text categorization include the filtering of e-mail, chat, or Web access, as well as indexing and data mining.

The principal problem with text categorization is that text is ambiguous in many ways: polysemy, synonymy, and so on. For example, a "bank" can be either a financial institution or the side of a river (polysemy). Context matters a great deal in interpretation.

Efficient automatic text categorization requires an automated categorization decision that identifies, on the basis of some categorization rules, the category into which an object falls. (Note that if the rules are separated from the decision maker, the behavior of the decision maker can be changed merely by changing the rules, rather than requiring the rewriting of the software underlying the decision maker every time.) The decision-making software examines the text in the information object, and on the basis of these rules, categorizes the object. The simplest decision rules might base the decision on the mere presence of certain words (e.g., if the word "breast" appears, the object is pornographic). More sophisticated rules might search for combinations of words (e.g., if the word "breast" appears without the word "cancer," the object is pornographic), or even weighted combinations of different words.
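
A minimal sketch of the separation between rules and decision maker described above; the single hypothetical rule is the word-combination example from the text, not a rule drawn from any deployed filter.

    # Each rule is a (predicate, category) pair; the decision maker applies whatever rule set
    # it is given, so its behavior changes by swapping rules, not by rewriting the software.
    def contains_without(word, excluded):
        return lambda words: word in words and excluded not in words

    RULES = [
        (contains_without("breast", "cancer"), "pornography"),   # hypothetical rule from the text
    ]

    def categorize(text, rules=RULES, default="acceptable"):
        words = set(text.lower().split())
        for predicate, category in rules:
            if predicate(words):
                return category
        return default

    print(categorize("breast cancer awareness month"))   # -> acceptable
    print(categorize("pictures of her breast"))          # -> pornography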

Decision rules are developed by modeling the kinds of decisions that responsible human beings make. Thus, the first step in automated rule writing (also known as "supervised learning") is to ask a responsible individual to identify, for example, which of 500 digital text objects constitute pornography and which do not. The resulting categorization thus provides a training set of 500 sample decisions to be mimicked by the decision rules.
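
A sketch of how such a training set drives supervised learning, here using the scikit-learn library; the tiny training set, its labels, and the choice of a naive Bayes learner are all illustrative assumptions, not the approach of any particular product.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Sample decisions made by a responsible human judge (labels: 1 = objectionable, 0 = acceptable).
    training_texts = [
        "hot naked pics click here",            # 1
        "breast cancer screening guidelines",   # 0
        "xxx adult videos free",                # 1
        "grilled chicken breast recipe",        # 0
    ]
    training_labels = [1, 0, 1, 0]

    vectorizer = CountVectorizer()                        # bag-of-words representation
    features = vectorizer.fit_transform(training_texts)
    classifier = MultinomialNB().fit(features, training_labels)

    # The learned model now mimics the human judgments embodied in the training set.
    new_texts = ["breast cancer support group", "free xxx pics"]
    print(classifier.predict(vectorizer.transform(new_texts)))   # likely [0 1]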

Of course, the selection of the persons who provide the samples is crucial, because whatever they do becomes the gold standard, which the decision rules then mimic. Everything depends on the particular persons and their judgments, but the technology does not provide guidance on how to define the community or whom to select as representatives of that community.

Research indicates that supervised learning is at least as good as expert human rule writing.3 The effectiveness of these methods is far from perfect--error rates remain high--but performance sometimes approaches human levels of agreement. Still, the results differ from category to category, and it is not clear how directly they apply to, for example, pornography. As discussed below, there is an inevitable trade-off between false positives and false negatives (i.e., assigning an item to a category to which it does not belong; failing to assign an item to a category to which it does belong), and categories vary widely in difficulty. Substantially improved methods are not expected in the next 10 to 20 years.

It is not clear which text categorization techniques are most effective. The best techniques are not yet used commercially, so there may be incremental improvements. Nor is it clear how effective semiautomated categorization is, or whether the categories that are difficult for automated methods are the same as those that most perplex people.

The simplest machine representation of text is what is known as the "bag-of-words" model, in which all of the words in an object are treated as an unstructured list. If the object's text is, "Dick Armey chooses Bob Shaffer to lead committee," then a representative list would be: Armey, Bob, chooses, committee, Dick, lead, Shaffer. (A slightly more sophisticated version associates a count of the number of times each word occurs with the word itself.) Note that in such a representation, the structure and context of the text are completely lost.
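
Rendered as a minimal Python sketch, using the example sentence above (the one-entry stop set stands in for the function word dropped in the text):

    from collections import Counter

    text = "Dick Armey chooses Bob Shaffer to lead committee"
    words = [w for w in text.split() if w.lower() not in {"to"}]   # drop function words
    bag = sorted(words, key=str.lower)           # unstructured list: structure and context are lost
    counts = Counter(w.lower() for w in words)   # the slightly more sophisticated counted version

    print(bag)      # ['Armey', 'Bob', 'chooses', 'committee', 'Dick', 'lead', 'Shaffer']
    print(counts)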

Thus, one weakness in the representation is due to the ambiguity of language, which is resolved in human discourse through context. For example, the word "beaver" has a hunter's meaning and a pornographic meaning. Other words, such as "breast" and "blow," may be less ambiguous but can be used pornographically or otherwise. When context is important in determining meaning, the bag-of-words representation is inadequate.

The bag-of-words representation is most useful when two conditions are met: when there are unambiguous words indicating relevant content, and when there are relatively few of these indicators. Pornographic text has these properties; probably about 40 or 50 words, most of them unambiguous, indicate pornography. Thus, the bag-of-words representation is reasonably well suited for this application, especially if a high rate of false positives is acceptable. However, in many other areas, such as violence and hate speech, the bag-of-words representation is less useful. (For example, while a pornographic text such as the text on an adult Web page often can be identified by viewing the first few words, one must often read four or five sentences of a text before identifying it as hate speech.)

When fidelity of representation becomes important, a number of techniques can go beyond the bag-of-words model: morphological analysis, part-of-speech tagging, translation, disambiguation, genre analysis, information extraction, syntactic analysis, and parsing. For example, a technique more robust than the bag-of-words approach is to consider adjacent words, as search engines do when they give higher weight to information objects that match the query and have certain words in the same sentence. However, even with these technologies, true machine-aided text understanding will not be available in the near term, and there always will be a significant error rate with any automated method. Approaches that go beyond the bag-of-words representation improve accuracy, which may be important in contexts apart from pornography.
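
As a small illustration of the adjacency idea, here is a toy scorer that rewards documents in which two query words appear next to each other; the scoring weights are arbitrary illustrative choices, not those of any real search engine.

    def bigrams(words):
        """Adjacent word pairs, a small step beyond the bag-of-words representation."""
        return set(zip(words, words[1:]))

    def score(document, query):
        """One point per shared word, plus a bonus when two query words appear adjacent."""
        doc_words = document.lower().split()
        query_words = query.lower().split()
        shared = len(set(doc_words) & set(query_words))
        adjacency_bonus = len(bigrams(doc_words) & bigrams(query_words))
        return shared + 2 * adjacency_bonus

    print(score("the white house press briefing", "white house"))   # adjacency rewarded -> 4
    print(score("a white fence near the house", "white house"))     # same words, no adjacency -> 2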


C.1.2 Image Representation and Retrieval

Images can be ambiguous in at least as many ways as text can be. Furthermore, there is no universal meta-language for describing images. People who are interested in images for advertising purposes have different ways to talk and think about them than do art historians, even though they may be searching for the same images. The lack of a common meta-language for images means that special meta-languages must be developed for images for use in different problem domains.

The process of determining whether a given image is pornographic involves object recognition, which is very difficult for a number of reasons. First, it is difficult to know what an object is; things look different from different angles and in different lights. When color and texture change, things look different. People can change their appearance by moving their heads around. We do not look different to one another when we do this, but we certainly look different in pictures.

Today, it is difficult for computer programs to find people, though finding faces can be done with reasonably high confidence. It is often possible to tell whether a picture has nearly naked people in it, but there is no program that reliably determines whether there are people wearing clothing in a picture.

To find naked people, image recognition programs exploit the fact that virtually everyone's skin looks about the same in a picture, as long as one is careful about intensity issues. Skin is very easy to detect reliably in pictures, so an image recognition program searching for naked people first searches for skin. So, one might simply assume that any big blob of skin must be a naked person.

However, images of the California desert, apple pies, and all sorts of other things are rendered in approximately the same way as skin. A more refined algorithm would then examine how the skin/desert/pie coloring is arranged in an image. For example, if the coloring is arranged in patterns that are long and thin, that pattern might represent an arm, a leg, or a torso. Then, because the general arrangement of body parts is known, the location of an arm, for example, provides some guidance about where to search for a leg. Assembling enough of such pieces together can provide enough information for recognizing a person.
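
A toy illustration of the first step only (finding skin-colored pixels), assuming the Pillow imaging library; the color test, the threshold, and the file name are deliberately crude, hypothetical stand-ins for the more careful color models and body-part assembly described above.

    from PIL import Image   # Pillow imaging library (assumed available)

    def looks_like_skin(r, g, b):
        """Very crude skin-tone test on RGB values; real systems use careful color and texture models."""
        return r > 95 and g > 40 and b > 20 and r > g and r > b and (max(r, g, b) - min(r, g, b)) > 15

    def skin_fraction(path):
        """Fraction of pixels in the image that pass the crude skin test."""
        image = Image.open(path).convert("RGB")
        pixels = list(image.getdata())
        skin = sum(1 for (r, g, b) in pixels if looks_like_skin(r, g, b))
        return skin / len(pixels)

    # Naive first pass: flag the image for closer analysis if a large share of it is skin-colored.
    # "example.jpg" is a hypothetical file name; the 0.4 threshold is an arbitrary choice.
    if skin_fraction("example.jpg") > 0.4:
        print("large skin-colored region: candidate for further analysis")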

A number of factors help to identify certain pornographic pictures, such as those that one might find on an adult Web site. For example, in such pictures, the people tend to be big, and there is not much other than people in these pictures. Exploiting other information, such as the source of the image or the words and links on the Web page from which the image is drawn, can increase the probability of reliable identification of such an image.

But it is essentially impossible for current computer programs to distinguish between hard-core and soft-core pornography, because the distinction lies in the eyes of the viewer rather than in the image itself. Consider also whether the photographs of Jock Sturges, many of which depict naked children, constitute pornography. Furthermore, in the absence of additional information, it is quite impossible for computer programs to distinguish between images of naked people in what might be pornographic poses that are nonetheless considered "high art" (e.g., paintings by Rubens) and images that someone might consider truly pornographic.

What computer programs can do with reasonable reliability today (and higher reliability in the future) is to determine whether there might be naked people in a picture. But any of the contextual issues raised above will remain beyond the purview of automated recognition for the foreseeable future.


C.2 SEARCH ENGINES AND OTHER OPERATIONAL INFORMATION RETRIEVAL SYSTEMS



Information retrieval systems consist of a database of information objects, techniques for representing those objects and queries put to the database, and techniques for comparing query representations to information object representations. The typical technique for representing information objects is indexing them, according to words that appear in the documents, or words that are assigned to the documents by humans or by automatic techniques. An information retrieval system then takes the words that represent the user's query (or filter), and compares them to the "inverted index" of the system, in which all of the words used to index the objects in the collection are listed and linked to the documents that are indexed by them. Surrogates (e.g., titles) for those objects that most closely (or exactly) match (i.e., are indexed by) the words in the query are then retrieved and displayed to the user of the system. It is then up to the user to decide whether one or more of the retrieved objects is relevant, or worth looking at.
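
A minimal sketch of an inverted index and the lookup step just described; normalization here is limited to lowercasing, and document identifiers stand in for surrogates such as titles.

    from collections import defaultdict

    documents = {
        "doc1": "Information retrieval systems compare query representations to document representations",
        "doc2": "Search engines index pages from the Web",
        "doc3": "An inverted index links each word to the documents it indexes",
    }

    # Build the inverted index: every word is linked to the documents it indexes.
    inverted_index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            inverted_index[word].add(doc_id)

    def retrieve(query):
        """Return identifiers (surrogates) for documents indexed by any word in the query."""
        hits = set()
        for word in query.lower().split():
            hits |= inverted_index.get(word, set())
        return sorted(hits)

    print(retrieve("inverted index"))   # -> ['doc2', 'doc3'] (any document indexed by either word)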

From the above description, it is easy to see that "search engines" are a type of information retrieval system, in which the database is some collection of pages from the World Wide Web that have been indexed by the system, and in which the retrieved results are links to the pages indexed by the words in the user's query.

The basic algorithm in search engines is based on the "bag-of-words" model for handling data described above. However, they also use some kind of recognition of document structure to improve the effectiveness of a search. For example, search engines often treat titles differently from the body of a Web page; titles can often indicate the topic of a page. If the system can extract structure from documents, then this often can be used as an indicator for refining the retrieval process.

Many search engines also normalize the data, a process that involves stripping out capitalization and most of the other orthographic differences that distinguish words. Some systems do not throw this information away automatically, but rather attempt to identify things such as sequences of capitalized words possibly indicating a place name or person's name. The search engine often removes stop words, a list of words that it chooses not to index--typically quite common words like "and" and "the."4 In addition, the search engine may apply natural language processing to identify known phrases or chunks of text that properly belong together and indicate certain types of content.

What remains after such processing is a collection of words that need to be matched against documents represented in the database. The simplest strategy is the Boolean operator model. Simple Boolean logic says either "this word AND that word occur," or "this word OR that word occurs," and, therefore, the documents that have those words should be retrieved. Boolean matching is simple and easy to implement. Because of the volume of data on the Internet, almost all search engines today include an automatic default setting that, in effect, uses the AND operator with all terms provided to the search engine.

All Boolean combinations of words in a query can be characterized in a simple logic model that says, either this word occurs in the document, or it does not. If it does occur, then certain documents match; if not, then other documents match. Any combination of three words, for example, can be specified, such that the document has this word and not the other two, or all three together, or one and not the other of two. However, if the user does not specify a word exactly as it is stored in the index, the word will not be matched; in particular, the system will not recognize a synonym (unless the user supplies it), an alternate phrasing, or a euphemism.
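
A sketch of Boolean matching over a small, made-up inverted index, showing the implicit AND default alongside OR; the index contents are hypothetical.

    # A tiny hypothetical inverted index (word -> set of document identifiers).
    inverted_index = {
        "inverted": {"doc3"},
        "index": {"doc2", "doc3"},
        "web": {"doc2"},
    }

    def boolean_and(query, index):
        """Retrieve only documents containing every query term (the implicit AND default)."""
        postings = [index.get(word, set()) for word in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    def boolean_or(query, index):
        """Retrieve documents containing any query term."""
        return set().union(*(index.get(w, set()) for w in query.lower().split()))

    print(boolean_and("inverted index", inverted_index))  # -> {'doc3'}
    print(boolean_or("inverted index", inverted_index))   # -> {'doc2', 'doc3'}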

Another strategy is the vector space model. A document is represented as an N-dimensional vector, in which N is the number of distinct words in the indexing vocabulary, and the component of the vector along a given word's dimension is simply the number of times that word appears in the document. The measure of similarity between two documents (or, more importantly, between a document and a search query represented the same way) is then given by the cosine of the angle between the two vectors. The value of this measure ranges from zero to 1.0, and the closer it is to 1.0, the more similar the document is to the query or to the other document. In this model, a perfect match (i.e., one with all the words present) is not necessary for a document to be retrieved. Instead, what is retrieved is a "best match" for the query (and, of course, less good matches can also be displayed).
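
A minimal sketch of the vector space model using raw word counts and cosine similarity; real systems typically apply the term weighting described earlier, but the geometry is the same.

    import math
    from collections import Counter

    def cosine_similarity(text_a, text_b):
        """Cosine of the angle between two word-count vectors; 1.0 means identical word distributions."""
        vec_a, vec_b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(vec_a[w] * vec_b[w] for w in vec_a)
        norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
        norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = "naked people pictures"
    documents = ["pictures of naked people on a beach", "pictures of the national academy building"]

    # Best-match retrieval: rank documents by similarity to the query; no exact match is required.
    for doc in sorted(documents, key=lambda d: cosine_similarity(query, d), reverse=True):
        print(round(cosine_similarity(query, doc), 3), doc)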

Most Web search engines use versions of the vector space model and also offer some sort of Boolean search. Some use natural language processing to improve search outcomes. Other engines (e.g., Google) use methods that weight pages depending on things like the number of links to a page. If there is only one link to a given page, then that page receives a lower ranking than a page with the same words but many links to it.
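
A toy sketch of combining a text-match score with a link-based boost; the logarithmic formula and its weights are illustrative assumptions, not the actual ranking function of Google or any other engine.

    import math

    def ranked_score(text_score, inlink_count):
        """Combine a text-match score with a popularity boost based on how many pages link here."""
        return text_score * (1.0 + math.log(1 + inlink_count))

    # Two pages with the same words (same text score) but different numbers of incoming links:
    print(ranked_score(text_score=0.8, inlink_count=1))      # lightly linked page ranks lower
    print(ranked_score(text_score=0.8, inlink_count=5000))   # heavily linked page ranks higher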

The preceding discussion assumes that the documents in question are text documents. Searching for images is much more difficult, because a similarity metric of images is very difficult to compute or even to conceptualize. For example, consider the contrast between the meaning of a picture of the Pope kissing a baby versus a picture of a politician kissing a baby. These pictures are the same in some ways, and very different in other ways.

More typically, image searches are performed by looking for text that is associated with images. A search engine will search for an image link tag within the HTML and the sentences that surround the image on either side--an approach with clear limitations. For example, the words, "Oh, look at the cute bunnies," mean one thing on a children's Web site and something entirely different on Playboy's site. Thus, the words alone may not indicate what those images are about.
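
A rough sketch of the idea: pull each image tag's alt text out of the HTML and index the image under those words. The regular expression and the toy page are illustrative only; a real engine would also use surrounding sentences, file names, and link text.

    import re

    html = '''<p>Oh, look at the cute bunnies.</p>
    <img src="bunnies.jpg" alt="two rabbits in the garden">'''

    # Index each image under the words in its alt attribute (a very rough stand-in for the
    # surrounding-text analysis a real engine would perform).
    image_index = {}
    for match in re.finditer(r'<img[^>]*\bsrc="([^"]+)"[^>]*\balt="([^"]*)"', html):
        src, alt = match.group(1), match.group(2)
        image_index[src] = set(alt.lower().split())

    def find_images(query):
        words = set(query.lower().split())
        return [src for src, terms in image_index.items() if words & terms]

    print(find_images("rabbits"))   # -> ['bunnies.jpg']
    # Note: the indexed words alone say nothing about what the image actually depicts.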


C.3 LOCATION VERIFICATION



Today, the Internet is designed and structured in such a way that the physical location of a user has no significance for the functionality he or she expects from the Internet or any resources to which he or she is connected. This fact raises the question of the extent to which an Internet user's location can in fact be established through technical means alone.

Every individual using the Internet at a given moment in time is associated with what is known as an IP address, and that IP address is usually associated with some fixed geographical location. However, because IP addresses are allocated hierarchically by a number of different administrative entities, knowing the geographical location of one of these entities does not automatically provide information about the locations associated with the IP addresses that it allocates. For example, the National Academy of Sciences is based in Washington, D.C., and it allocates IP addresses to computers tied to its network. However, the Academy also has employees in California whose computers are tied to the Academy network. The Academy knows which of its IP addresses are in California and which are in Washington, D.C., but someone who knew only that an IP address was associated with the Academy would not know where that address was located.5
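
A small illustration of the point, using Python's standard ipaddress module; the address blocks, locations, and organization name in the tables are made up for the example.

    import ipaddress

    # Hypothetical allocation table known only to the allocating organization: which of its
    # sub-blocks serve which office. Outsiders typically see only the parent block.
    internal_allocations = {
        ipaddress.ip_network("192.0.2.0/25"): "Washington, D.C. headquarters",
        ipaddress.ip_network("192.0.2.128/25"): "California office",
    }
    publicly_visible = {ipaddress.ip_network("192.0.2.0/24"): "Example Organization"}

    address = ipaddress.ip_address("192.0.2.200")

    # What an outsider can infer: only which organization holds the parent block.
    print([org for net, org in publicly_visible.items() if address in net])

    # What the organization itself knows: which office the sub-block actually serves.
    print([loc for net, loc in internal_allocations.items() if address in net])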

Under some circumstances, it can be virtually impossible to determine the precise physical location of an Internet user. Consider, for example, the case of an individual connecting to the Internet through a dial-up modem. It is not an unreasonable assumption that the user is most likely in the region in which calls to the dial-up number are local, simply because it would be unnecessary for most people to incur long-distance calling costs for such connections. Furthermore, the exchange serving dial-up modem access numbers can, in principle, employ caller-ID technology. However, the exchange associated with the telephone from which the dial-up call originates may not be technologically capable of providing caller-ID information; this would be the case in some areas in the United States and in much of the world. Or the user might simply suppress caller-ID information before making the dial-up modem call. In these instances, the number through which the individual connects to the Internet does not necessarily say anything about his location at that time.

Internet access routed through satellites can be difficult to localize as well. The reason is that a satellite's transmission footprint can be quite large (hundreds of square miles) and, more importantly, may be moving quite rapidly. Localization (and then only to within the footprint) can be accomplished only by working with a detailed knowledge of the orbital movements of an entire constellation of satellites.

However, those connecting to the Internet through a broadband connection can be localized much more effectively, though with some effort. For example, while a cable Internet ISP may assign IP addresses to users dynamically, any given address must be mappable to a specific cable modem that can be identified with its media access control address. While such mapping is usually done for billing and customer care reasons, it provides a ready guide to geographical addresses at the end user's level. Those who gain access through DSL connections can be located because the virtual circuit from the digital subscriber line access multiplexer is mapped to a specific twisted pair of copper wires going into an individual's residence. Also, wireless connections made through cell phones (and their data-oriented equivalents) are now subject to a regulation that requires the network client to provide location information for E-911 (enhanced emergency 911) reasons. This information is passed through the signaling network and would be available to a wireless ISP as well.

In principle, the information needed to ascertain the location of any IP address is known collectively by a number of administrative entities, and could be aggregated automatically. But there is no protocol in place to pass this information to relevant parties, and thus such aggregation is not done today. The result is that in practice, recovering location information is a complex and time-consuming process.

To bypass these difficulties, technical proposals have been made for location-based authentication.6 However, the implementation of such proposals generally requires the installation of additional hardware at the location of each access point, and thus cannot be regarded as a general-purpose solution that can localize all (or even a large fraction of) Internet users.

The bottom line is that determining the physical location of most Internet users is a challenging task today, though this task will become easier as broadband connections become more common.


C.4 USER INTERFACES



The history of information technology suggests that increasingly realistic and human-like forms of human-computer interaction will develop. The immediately obvious trends in the near-term future call for greater fidelity and "realism" in presentation. For example, faster graphics processors will enable more realistic portrayals of moving images, which soon will approach the quality of broadcast television. Larger screens in which the displayed image subtends a larger angle in the eye will increase the sense of being immersed in or surrounded by the image portrayed. Goggles with built-in displays do the same, but also offer the opportunity for three-dimensional images to be seen by the user. Virtual reality displays carry this a step further, in that the view seen by the user is adjusted for changes in perspective (e.g., as one turns one's head, the view changes).

Speech and audio input/output are growing more common. Today, computers can provide output in the form of sound or speech that is either a reproduction of human speech or speech that is computer-synthesized. The latter kind of speech is not particularly realistic today but is expected to become more realistic with more research and over time. Speech recognition is still in its infancy as a useful tool for practical applications, even after many years of research, but it, too, is expected to improve in quality (e.g., the ability to recognize larger vocabularies, a broader range of voices, a lower error rate) over time.

Another dimension of the user interface is touch and feel. The "joystick" used in many computer-based video games provides the user with a kinesthetic channel for input. Some joysticks also provide force feedback that, for example, increases the resistance felt by the user as the stick is moved farther to one side or another. Such "haptic" interfaces can also--in principle--be built into gloves and suits that apply pressure in varying amounts to different parts of the body in contact with them.

Finally, gesture recognition is an active field of research. Humans often specify things by pointing with their hands. Computer-based efforts to recognize gestures can rely on visual processing in which a human's gestures are viewed optically through cameras connected to the computer, and the motions analyzed. A second approach is based on what is known as a dataglove, which can sense finger and wrist motion and transmit information on these motions to a computer.7

Product vendors of these technologies promise a user experience of extraordinarily high fidelity. For example, it is easy to see how these technologies could be used to enhance perceived awareness of others--one might be alone at home, but through one's goggles and headphones, hear and see others sharing the same "virtual" space. (In one of the simplest cases, one might just see others with goggles and headphones as well, but the digital retouching technologies that allow images to be modified might allow a more natural (though perhaps less realistic) depiction.)


Notes

1 L.E. Lawrence, 1977, Inter-indexer Consistency Studies 1954-1975: A Review of the Literature and Summary of Study Results, University of Illinois, Graduate School of Library Science, Urbana-Champaign; K. Markey, 1984, "Interindexer Consistency Tests: A Literature Review and Report of a Test of Consistency in Indexing Visual Materials," Library and Information Research 6(2): 155-177; L. Mai-Chan, 1989, "Inter-indexer Consistency in Subject Cataloging," Information Technology and Libraries 8(4): 349-357.

2 "Recall" measures how complete the search results are, and is the proportion of relevant documents in the whole collection that have actually been retrieved. "Precision" measures how precise the search results are, and is the proportion of retrieved documents that are relevant.

3 Fabrizio Sebastiani, 2002, "Machine Learning in Automated Text Categorization," ACM Computing Surveys 34(1): 1-47.

4 Note that the stop list is a likely place to put a filter. For example, if "bitch" were included in the stop list, no Web site, including that of the American Kennel Club, would be found in searches that included the word "bitch."

5 While location information is not provided automatically by the IP addresses an administrative entity allocates, under some circumstances some location information can be inferred. For example, if the administrative entity is an ISP, and the ISP is, say, a French ISP, it is likely--though not certain--that most of its subscribers are located in France. Of course, a large French company using this ISP might well have branch offices in London, so the geographical correspondence between the French ISP and the Internet user will not be valid in that case, though as a rule of thumb it may not be a bad working assumption.

6 See, for example, Dorothy E. Denning and Peter F. MacDoran, 1996, "Location-Based Authentication: Grounding Cyberspace for Better Security," Computer Fraud & Security, February (Elsevier Science Ltd.). A commercial enterprise now sells authentication systems that draw heavily on the technology described in this paper. See <http://www.cyberlocator.com/works.html>.

7 See, for example, <http://www.ireality.com/Wireless_announce.html> for a 1997 product announcement by the General Reality Company.











Copyright 2002 by the National Academy of Sciences