Suggested Citation:"2 Text Categorization and Analysis." National Research Council and Institute of Medicine. 2002. Technical, Business, and Legal Dimensions of Protecting Children from Pornography on the Internet: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/10324.

2
Text Categorization and Analysis

David Lewis and Hinrich Schütze

2.1 TEXT CATEGORIZATION

Automatic text categorization is the primary language technology used in content filtering for children. Text categorization is the sorting of text into groups, such as pornography, hate speech, violence, and unobjectionable content. A text categorizer looks at a Web page and decides into which of these groups its text should fall. Applications of text categorization include filtering of e-mail, chat, or Web access; text indexing; and data mining.

Why is content filtering a categorization task? One way to frame the problem is to say that the categories are actions, such as “allow,” “allow but warn,” or “block.” We either want to allow access to a Web page, allow access but also give a warning, or block access. Another way to frame the problem is to say that the categories are different types of content, such as news, sex education, pornography, or home pages. Depending on which category we put the page in, we will take different actions. For example, we want to block pornography and give access to news.

The automation of text categorization requires some input from people; the idea is to mimic what people do. Two parts of the task need to be automated. One is the categorization decision itself: what we should do, for example, with a given Web page. The other is rule creation: determining automatically the rules to apply.

Automation of the categorization decision requires a piece of software that applies rules to text. This is the best architecture because we can then change the behavior by changing the rules rather than rewriting the software every time. This automatic categorizer applies two types of rules. The first type is extensional rules, which explicitly list all sites that cannot be accessed (i.e., "blacklisted" sites) or, alternatively, all sites that can be accessed (e.g., kid-safe zones or "whitelisted" sites). The second type, which is technically more complicated, is intensional rules, or keyword blocking: we look at the content of the page and, if certain words occur, take certain actions, such as blocking access to that page. The rules can be more complicated than a single word. For example, they can be logic based, using AND and OR operators, or they can be a weighted combination of different types of words.
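
A minimal sketch of how the two rule types might combine in code; the site name, word list, weights, and threshold below are hypothetical illustrations, not any product's actual rules:

```python
# Hypothetical rule sets; real filters use far larger lists.
BLACKLIST = {"blocked-site.example"}                   # extensional rule: explicit site list
KEYWORD_WEIGHTS = {"badword": 2.0, "riskyword": 1.0}   # intensional rule: weighted keywords
THRESHOLD = 2.5

def decide(site: str, page_text: str) -> str:
    """Apply the extensional rule first, then the weighted intensional rule."""
    if site in BLACKLIST:
        return "block"
    # Sum the weights of any listed keywords appearing on the page.
    score = sum(KEYWORD_WEIGHTS.get(word, 0.0) for word in page_text.lower().split())
    return "block" if score >= THRESHOLD else "allow"
```

Note that the weighted sum subsumes single-keyword blocking (give one word a weight above the threshold) and approximates AND-style logic (require several words whose weights only jointly exceed it).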

Automated rule writing is called supervised learning. One or more persons are needed to provide samples of the types of decisions we wish to make. For example, we could ask a librarian to identify which of 500 texts or Web pages are pornography and which are not. This provides a training set of 500 sample decisions, and the rule-writing software attempts to produce rules that mimic those categorization decisions. The selection of the persons who provide the samples is fundamental, because whatever they do becomes the gold standard that the machine tries to mimic. Everything depends on the particular persons and their judgments.
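
The presentation does not name a particular learning algorithm. As one illustration only, a minimal naive Bayes learner (a standard supervised-learning method for text) can be trained from labeled samples like those a librarian might provide:

```python
import math
from collections import Counter

def train(samples):
    """samples: iterable of (text, label) pairs judged by a person.
    Returns per-label word counts and how often each label occurs."""
    word_counts, label_counts = {}, Counter()
    for text, label in samples:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        label_counts[label] += 1
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Naive Bayes with add-one smoothing: pick the label under which
    the text's words are most probable."""
    vocab = {w for wc in word_counts.values() for w in wc}
    total = sum(label_counts.values())
    def log_score(label):
        wc = word_counts[label]
        size = sum(wc.values())
        s = math.log(label_counts[label] / total)        # label prior
        for w in text.lower().split():
            s += math.log((wc[w] + 1) / (size + len(vocab)))
        return s
    return max(word_counts, key=log_score)
```

The learned "rules" here are just word statistics: whatever words the human judges associated with each label drive every later decision, which is why the choice of judges is the gold standard.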

Research shows that supervised learning is at least as good as expert human rule writing. (Supervised learning is also very flexible. For example, foreign-language content is not a problem, as long as the content involves text rather than images.) The effectiveness of these methods is far from perfect, as there is always some error rate, but it sometimes approaches human levels of agreement. Still, the results differ from category to category, and it is not clear how directly they apply to, for example, pornography. As discussed in the next presentation, there is an inevitable trade-off between false positives and false negatives, and categories vary widely in difficulty. Substantially improved methods are not expected in the next 10 to 20 years.
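
The trade-off between false positives and false negatives can be made concrete by measuring both error rates on a set of judged decisions. A hypothetical sketch, using "block"/"allow" as the two outcomes:

```python
def error_rates(decisions):
    """decisions: list of (predicted, actual) pairs, each 'block' or 'allow'.
    Returns (over_blocking, under_blocking): the fraction of genuinely
    acceptable pages wrongly blocked (false positives) and of genuinely
    objectionable pages wrongly allowed (false negatives)."""
    good = [(p, a) for p, a in decisions if a == "allow"]
    bad = [(p, a) for p, a in decisions if a == "block"]
    over = sum(1 for p, _ in good if p == "block") / len(good)
    under = sum(1 for p, _ in bad if p == "allow") / len(bad)
    return over, under
```

Tightening a filter's threshold lowers one of these rates only by raising the other; a single aggregate "accuracy" number hides which error a filter makes.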

It is not clear which text categorization techniques are most effective. Some recently developed techniques are not yet used commercially, so there may be incremental improvements. Nor is it clear how effective semiautomated categorization is, or whether the categories that are difficult for automated methods are the same ones that perplex people. Spam e-mail, for example, can be filtered, but filters can be circumvented; there is no foolproof way to stop it. The question is whether the error rate is acceptable.

This all comes back to community standards. We can train the classifier to predict the probability that a person would find an item inappropriate, and training can give equal weight to any number of community volunteers. In other words, we can build a machine that mimics a community standard: we take some people from the community, get their judgments about what they find objectionable, and then build a machine that creates rules mimicking that behavior. But this does not solve the political questions of how to define the community, whom to select as representatives of that community, and where in that community to apply the filter. The technological capability does not solve the application issues in practice.

2.2 ADVANCED TEXT TECHNOLOGY

True text understanding will not happen for at least 20 or 30 years, and maybe never. Therein lies the problem, because to filter content with absolute accuracy we would need text understanding. As a result, there will always be an error rate; the question is how high it is.

The text categorization methods discussed above use the "bag-of-words" model, a simplistic machine representation of text. It takes all the words on a page and treats them as an unstructured list. If the text is "Dick Armey chooses Bob Shaffer to lead committee," then a representative list would be: Armey, Bob, chooses, committee, Dick, lead, Shaffer. The structure and context of the text are completely lost. This impoverished representation is the basis of the text classification methods in existing content filters.
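
The example above can be reproduced in a few lines. The stopword list is a hypothetical illustration; real systems vary in which function words (like "to") they discard:

```python
STOPWORDS = {"to", "the", "a", "an", "of"}  # hypothetical list of function words to drop

def bag_of_words(text: str) -> list:
    """Reduce text to an unordered, alphabetized list of its words,
    discarding all structure and context."""
    return sorted(w.lower() for w in text.split() if w.lower() not in STOPWORDS)
```

Applied to the sentence in the text, this yields exactly the list shown: nothing in the output records who chose whom, or that the sentence was about a committee appointment at all.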

There are problems with this type of representation. It fails, in many cases, because of ambiguous words, for which context is important. A word such as "beaver" has both a hunter's meaning and a graphic meaning, and with the bag-of-words model alone, you cannot tell which meaning is relevant. Other words, such as "breast" and "blow," are not ambiguous in the same way but can be used pornographically; again, a bag-of-words model loses the context needed to deal with them properly. When context counts, the bag-of-words model fails.

The problem cannot be resolved fully by looking for adjacent words, as search engines do when they give higher weight to information objects that match the query and have certain words in the same sentence. There is a distinction between search engines and classification. Search engines compute a ranking of pages, and end users look at the top 10 or maybe the top 100 ranked pages. Because users look only at the pages in which the signal is strongest, and because the engine is making a relative judgment, this methodology works very well; the highest-rated pages are probably very relevant to the query.[1] But in classification, we have to make a decision about one page by itself. This is a much more difficult problem. By looking at the words that lie nearby, we cannot always make a decent statistical guess as to whether a situation is innocuous or not.
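
The distinction can be sketched as two different uses of the same page scores (the scores are hypothetical; real engines score pages in far more complex ways):

```python
def rank_top_k(scored_pages, k):
    """Search-engine style: a relative judgment. Return only the k
    highest-scoring pages; borderline pages simply never surface."""
    return sorted(scored_pages, key=lambda page: page[1], reverse=True)[:k]

def classify_page(score, threshold):
    """Filter style: an absolute yes/no decision on a single page,
    where even a borderline score forces an explicit choice."""
    return "block" if score >= threshold else "allow"
```

The ranker is judged only on its strongest results, while the classifier must commit on every page, including those near the threshold, which is where its errors concentrate.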

When context is important and the bag-of-words model fails, pornography filters and content filters make errors. Surprisingly, however, the bag-of-words model is effective in many applications, so it is not a hopeless basis for pornography filters despite its error rate. It always comes down to what error rate is acceptable.[2] To go beyond the bag-of-words model, a number of technologies are currently available: morphological analysis, part-of-speech tagging, translation, disambiguation, genre analysis, information extraction, and syntactic analysis (parsing). Even using these technologies, thorough text understanding will remain in the distant future; a 100-percent-accurate categorization decision cannot be made today. But these advanced text technologies can increase the accuracy of content filters, and this increased accuracy may be significant in some areas.

The first area relates to over-broad filters that block material that should not be blocked, raising free speech issues. It is relatively easy to build an over-broad filter, which blocks pornography very well but also blocks a lot of good content, like Dick Armey's home page. Such filters may suffice in many circumstances. For example, some parents would say, "As long as not a single pornographic page comes through, or it almost never happens, it is OK if my child cannot see a lot of good content." But over-broad filters are problematic in many other settings, such as libraries, where blocking a lot of good content raises free speech concerns. Here advanced technology can really make a difference: by increasing the accuracy of the filter, less good content is blocked.

[1] Milo Medin said that various search engine companies have come up with a number of techniques to filter adult content, so that you have to turn on the capability to see certain types of references. Most of it is ranking based, but there are some other obvious things as well. Part of the challenge is that many adult sites are trying to get people to visit, so they fill their headers with all kinds of information that make it obvious what is going on. The question is, how practical is that?

[2] Milo Medin said that the people who run search engines have an economic interest in making their results as accurate as possible, to satisfy their subscribers. Normal large search engines want the adult-content filter to be as accurate as possible. If the filter is turned on, we basically want to eliminate adult content. The Google folks, as an example, have devoted a lot of energy to these issues, but it is not aimed directly at pornography. They focus on a broader set of issues to which pornography is a business input.

The second area is pornography versus other objectionable content, such as violence and hate speech. The bag-of-words model is most successful under two conditions: (1) when there are unambiguous words indicating relevant content, and (2) when only a few such indicators are needed. Pornography has these properties; probably about 40 or 50 words, most of them unambiguous, indicate pornography. Thus, the bag-of-words model is actually not so bad for this application, especially if over-broad filtering is acceptable. However, in many other areas, such as violence and hate speech, the bag-of-words model is less effective. Often you must read four or five sentences of a text before identifying it as hate speech. Accuracy becomes important in such applications, and advanced technology can be helpful here.

The third area is automated blacklisting. Recall the distinction between extensional and intensional rules: extensional rules are lists of sites that you want to block. This is an effective content-filtering technique, mostly driven by human editors now, and it is a promising area for automation. Accuracy is important because blocking one site can block thousands of pages; you want to be sure of doing the right thing. Advanced text technology also can play a role here.
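
One hypothetical way to automate blacklisting is to aggregate per-page classifier scores into a site-level decision; because blocking a site blocks all its pages, the evidence bar is set deliberately high (all names and thresholds below are illustrative assumptions):

```python
def blacklist_site(page_scores, min_pages=10, site_threshold=0.9):
    """page_scores: classifier probabilities for sampled pages of one site.
    Add the site to the blacklist only when the evidence is overwhelming."""
    if len(page_scores) < min_pages:
        return False  # too little evidence to condemn a whole site
    # Fraction of sampled pages the page-level classifier would block.
    flagged = sum(1 for s in page_scores if s >= 0.5) / len(page_scores)
    return flagged >= site_threshold
```

The conservative `min_pages` and `site_threshold` settings reflect the asymmetry the text describes: one wrong site-level decision costs thousands of pages, so a borderline site is left to page-level filtering instead.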

A potential problem with these text technologies is their lack of robustness: they can be circumvented by changing a page's wording while preserving its meaning. A pornographer who knows a filter and can test against it will be able to get through it; it is simply a question of effort. But pornographers are not economically motivated to expend a lot of effort to get through these filters. I may be wrong, but my sense is that, because children do not pay for pornography, this is probably not a problem.

In summary, true machine-aided text understanding will not be available in the near term, and that means there always will be a significant error rate with any automated method. The advanced text technologies improve accuracy, which may be important in contexts such as free speech in libraries, identification of violence and hate speech, and automated blacklisting.

The extent of the improvement from these technologies depends on many parameters, and tests must be run.[3] The latest numbers I know of are from Consumer Reports,[4] but they are aggregated and not broken down by area. There is probably a big difference in accuracy between pornography and the other objectionable areas. There is also a trade-off between false positives and false negatives, and the extent to which advanced techniques make a difference depends on where in the trade-off you start out. If I had to give a number, I would expect a 20 to 30 percent improvement in accuracy over the bag-of-words model, assuming you want to let all good content through (that is, no over-blocking).

[3] Milo Medin said that it is difficult to do good experiments and that sloppy experimentation is rewarded in a strange way. First, you run a very large collection of text through your filter and determine how much of the material identified as pornographic was, in fact, not. Second, you find out how much of the material identified as not pornographic was, in fact, a problem. If you do that analysis badly or carelessly, your filter looks better.

[4] Consumer Reports, March 2001.

In response to a mandate from Congress in conjunction with the Protection of Children from Sexual Predators Act of 1998, the Computer Science and Telecommunications Board (CSTB) and the Board on Children, Youth, and Families of the National Research Council (NRC) and the Institute of Medicine established the Committee to Study Tools and Strategies for Protecting Kids from Pornography and Their Applicability to Other Inappropriate Internet Content.

To collect input and to disseminate useful information to the nation on this question, the committee held two public workshops. On December 13, 2000, in Washington, D.C., the committee convened a workshop to focus on nontechnical strategies that could be effective in a broad range of settings (e.g., home, school, libraries) in which young people might be online. This workshop brought together researchers, educators, policy makers, and other key stakeholders to consider and discuss these approaches and to identify some of the benefits and limitations of various nontechnical strategies. The December workshop is summarized in Nontechnical Strategies to Reduce Children's Exposure to Inappropriate Material on the Internet: Summary of a Workshop. The second workshop was held on March 7, 2001, in Redwood City, California. This second workshop focused on some of the technical, business, and legal factors that affect how one might choose to protect kids from pornography on the Internet. The present report provides, in the form of edited transcripts, the presentations at that workshop.
