<<

Em 76 ■ ■ ■ KEY CONCEPTS n o i at m r o f n i By NigelShadbolt By Berners-Lee andTim forming forming society. trans are that theft, identity social to networking virtual emergent properties, from creating is links and pages Web in rise relentless The conveying trust. as securing privacy and such issues major solve can work more made; be to ning Important advances are begin benefitsociety. to check in held or harnessed be can they how and arise traits Web how discover to aims science, Web discipline, new A

SCIENTIFIC AMERICAN Studying the Web will reveal better revolutionize industry and manage our ever growing online lives ways to exploit information, — The The Editors prevent identity theft,

y g o l o n h c e t

-

erg - S built first, and followed, followed, science computer and first, built were computers history: fits timing The sues. science of branch new A humankind. for mean might this of any what or coming be may nomena phe new what them, harness might we how happened, actually have properties emergent tips. parenting to news concert from as such ating online communities that portals share everything YouTube. user-generated And content to tagging with labels is cre led have which net social Napster, as such sites to to led file-sharing ments led has which of docu transfer The as Facebook. works such messaging, instant to led E-mail society. transforming are that arisen sum of its pages. Vast emergent properties have however, is the fact that the Web is more to than run their the countries with it. Little appreciated, how even are considering it. governments And ing and health care are being revolutionized by people’s jobs depend on the Web. bank Media, all aspects of modern life. Today more and more But few investigators are howstudying such — than 15 than pages that billion touch almost more to exploded has it mid-1990s, the ince the blossomed in Web science © 2008 SCIENTIFIC AMERICAN, INC. AMERICAN, © SCIENTIFIC 2008 — aims to address such is such address to aims

es

------structure, structure, articulate the architectural principles versities have since expanded on that effort. 16from of world’sthe researchers ing top uni ning of a Web Lead Science Initiative. Research a page is determined by the importance of the ofthe importance the by determined is page a of this definitionis recursive:the importance of part that was difficulty The it. to linking pages the of importance and number the of terms in page a the prioritize results. to needed they realized Brin, Page The and Sergey founders of Larry Google, content. irrelevant more and more returning was pages of number mounting the among search words key for looking by information for ing progressed, 1990s the As work. such of earlier research has revealed the potential value new, is discipline a as science Web Although and arise, can that be altered? points tipping do How out? burn they Could have patterns tionary driven the Web’s growth? evolu What questions: fundamental answer to aims pursuit here. the Ultimately, presented some insights, crucial generated already has Yet Webscience reveal. might endeavor scent and more. gy, economics, law,science, political sociolo ecology, , science, computer Web physics, draw will on science mathematics, tellectual-property rights. To achieve these ends, net complex the issues such as privacy protection and that in ensure can settle and that to productively grow work continues principles the cidate driven are by and can change social conventions. It will elu interactions human online how cover that have fueled its and phenomenal growth, dis Insights Already This new discipline will model the Web’s the model will discipline new This Their big insight was that the importance of of importance the that was insight big Their na this what predict cannot we course, Of — how relevant it is it relevant how England announced the begin the announced England University of Southampton in in Southampton of University the and Technology of In stitute Massachusetts the at leagues col of our two and us the when 2006, formal November a in as discipline launched was science Web significantly. computing improved subsequently which — was best understood October 2008 ------

Cary Wolinsky; Matthew Hurst Microsoft Live Labs (http://datamining.typepad.com/data_mining) (blogosphere visualization) pages linking to it, whose importance is deter- Another early discovery, which came from mined by the importance of the pages linking to graph theory, is that the Web’s connectivity fol- them. Page and Brin figured out an elegant math- lows a so-called power-law degree distribution. ematical way to represent that property and de- In many networks, nodes have about the same veloped an algorithm they called PageRank to number of links to them. But on the Web a few exploit the recursiveness, thus returning pages pages have a huge number of other pages link- ranked from most relevant to least. ing to them, and a very large number of pages Google’s success shows that the Web needs to have only a few pages linking to them. North- be understood and that it needs to be engineered. eastern University’s Albert-László Barabási and Web science serves both purposes. The Web is an his colleagues coined the term “scale-free” to infrastructure of languages and protocols—a characterize such networks [see “Scale- piece of engineering. The philosophy of content Free Networks,” by Albert-László Ba- linking underlies the emergent properties, how- rabási and Eric Bonabeau; Sci- ever. Some of these properties are desirable and entific American, May therefore should be engineered in. For example, 2003]. Many people were ensuring that any page can link to any other page surprised because they makes the Web powerful both locally and glob- assumed Web pages ally. Other properties are undesirable and if pos- would have an aver- sible should be engineered out, such as the ability age number of links to build a site with thousands of artificial links to and from them. generated by software robots for the sole inten- In scale-free tion of improving that site’s search rankings —so- networks, even called link farms. if a majority

www.SciAm.com © 2008 SCIENTIFIC AMERICAN, INC. SCIENTIFIC AMERICAN 77 1.6

3.9 1.6 work. In 1967 Harvard University psychologist Stanley Milgram asked 8.1 3.3 residents in Omaha, Neb., and Wichita, Kan., to attempt to send a 1.6 package to an individual described only by his name, some general features and the fact that he lived in Boston. The resi- dents were to send the package to an intermedi- ary who they thought might know more about 1.6 how to reach the individual and who could then send it on to another intermediary. Eventually 38.4 64 of the almost 300 packages made it to the des- ignated recipients. On average the number of in- 1.6 termediaries needed was six—the basis of the catchphrase “six degrees of separation.” More recently, however, Watts, now at Co- algorithm that prioritizes 3.9 lumbia University, tried to repeat the experiment search results for Google— which makes the service so with an e-mail message to be forwarded on the valuable—was an early example Web and experienced failures in path finding. In of how Web science can pay off. particular, if individuals had no incentive to for- In this representation of the algorithm, called ward the note the paths broke down. Yet only PageRank, the weighting (out of 100) of the very slight incentives improved matters. importance of links among 11 search results The lesson is that network structure alone is determines which pages are most relevant. not everything; networks thrive only in the light 34.3 of the actions, strategies and perceptions of the of nodes are removed, a path from one of the re- individuals embedded in them. To realistically maining nodes to any other is still likely to ex- know why the Web has a beneficial structure of ist. Removing a relatively small number of the short paths, we need to know why people who highly connected nodes, or hubs, however, leads contribute content link it to other material. So- to significant disintegration. This analysis has cial drivers—goals, desires, interests and atti- been crucial for the companies and organiza- tudes—are fundamental aspects of how links are tions—be they telecommunications providers or made. Understanding the Web requires insights research laboratories—that design how infor- from sociology and psychology every bit as much mation is routed on the Web, allowing them to as from mathematics and computer science. build in substantial redundancy that balances traffic and makes the network more resistant From Micro to Macro to attack. Sexual One major area of Web science will explore how Thorough understanding of scale-free net- Orientation a small technical innovation can launch a large works, gleaned by analyzing the Web, has social phenomenon. A striking example is the prompted experts to analyze other network sys- Revealed emergence of the . Although early tems. They have since found power-law degree Carter Jernigan and Behram Web browsers did not provide a handy way for distributions in areas as far-flung as scientific ci- Mistree, both students at the the average person to “publish” his or her ideas, Massachusetts Institute of tations and business alliances. The work has by 1999 programs had made self-publish- Technology, recently analyzed helped the U.S. Centers for Disease Control and Facebook communities. They ing much easier. Blogging subsequently caught Prevention improve its models of sexual disease found that the structure identifies fire because as people got issues off their chest, transmission and has helped biologists better the likely sexual orientation of they also found others with similar views who understand protein interactions. individuals who have not could readily assemble into a like-minded explicitly stated their preference. Scientific analysis has also characterized the community. The reason is that people who Web as having short paths and small worlds. have made a declaration link It is difficult to estimate the size of the blogo- While at Cornell University in the 1990s, Dun- more often to others of a similar sphere accurately. David Sifry’s leading blog can J. Watts and Steven H. Strogatz showed that orientation; a kind of triangula- search engine, called Technorati, was tracking tion emerges. This sort of result

even though the Web was huge, a user could get more than 112 million worldwide in May k raises ethical and privacy c

from one page to any other page in at most 14 of this year, a number that may include only a etse

questions; more research about R

clicks. Yet to fully understand these traits, we the Web’s structure and users’ mere fraction of the 72 million blogs purport- g e

need to appreciate that the Web is a social net- behavior could provide answers. edly in China. Whatever the size, the explosive Geor

78 SCIENTIFIC AMERICAN © 2008 SCIENTIFIC AMERICAN, INC. October 2008 [THE AUTHORS] growth demands an explanation. Arguably, the cars for sale in western Massachusetts under introduction of very simple mechanisms, espe- $8,000” returns more than 2,000 general Web cially TrackBack, facilitated the growth. If a pages. Once capabilities are add- blogger writes an entry commenting on or refer- ed, a person will instead receive detailed infor- ring to an entry at another blog, TrackBack no- mation on seven or eight specific cars, including tifies the original blog with a “.” This noti- their price, color, mileage, condition and own- fication enables the original blog to display sum- er, and how to buy them. maries of all the comments and links to them. (left) is professor Engineers have devised powerful founda- In this way, conversations arise spanning sever- of artificial intelligence at the tions for the Semantic Web, notably the prima- University of Southampton in al blogs and rapidly form networks of individu- ry language—the Resource Description Frame- England, chief technology officer als interested in particular themes. And here of the Semantic Web company work (RDF)—which is layered on top of the ba- again large portions of the blog structure be- Garlik Ltd., and a past president sic HTML and other protocols that form Web come linked via short paths —not only the blogs of the . pages. RDF gives meaning to data through sets and bloggers themselves but also the topics and Tim Berners-Lee (right) invented of “triples.” Each triple resembles the subject, the World Wide Web and leads the entries made. verb and object of a sentence. For example, a tri- World Wide Web Consortium, As blogging blossomed, researchers quickly based at the Massachusetts Insti- ple can assert that “person X” [subject] “is a sis- created interesting tools, measurement tech- tute of Technology. ter of” [verb] “person Y” [object]. A series of tri- niques and data sets to try to track the dissemi- ples can determine that [car X] [is brand] [To­ nation of a topic through blogspace. Social-me- yota]; that [car X] [condition is] [used]; that [car dia analyst Matthew Hurst of Microsoft Live X] [costs] [$7,500]; that [car X] [is located in] Labs collected link data for six weeks and pro- [Lenox]; and that [Lenox] [is located in] [west- duced a plot of the most active and intercon- ern Massachusetts]. Together these triples can nected parts of the blogosphere [see illustration conclude that car X is indeed a proper answer on next page]. It showed that a number of blogs to our query. This simple triple structure turns are massively popular, seen by 500,000 differ- out to be a natural way to describe a large ma- ent individuals a day. A link or mention of an- jority of the data processed by machines. The other blog by one of these superblogs guaran- subjects, verbs and objects are each identified by tees a lot of traffic to the site referenced. The a Universal Resource Identifier (URI)—an ad- plot also shows isolated groups of dedicated en- dress just like that used for Web pages. Thus, thusiasts who are very much in touch with one anyone can define a new concept, or a new verb, another but barely connect to other bloggers. by defining a URI for it on the Web. )

orks,” If exploited correctly, the blogosphere can be w a powerful medium for spreading an idea or [work in progress] gauging the impact of a political initiative or the plexnet m o scale-free network scale-free c likely success of a product launch. The much an- Building a More Secure Web e of e ticipated release of the Apple iPhone generated c Understanding how the Web is linked can reveal ways to better engineer it. Many networks 27, 2000 ( 1.4 percent of all new postings on its launch day.

ULY are fairly homogeneous (an “exponential” structure): nodes, even the busiest (orange) and One challenge is to understand how this dissem- k tolerank their immediate neighbors (blue), have roughly the same number of links to and from them. c ination might change our view of

ol. 406; J But analysis at the University of Notre Dame showed that the Web is a scale-free network: V atta d and commentary. What mechanisms can assure a few nodes (Web sites) have many links coming in, and many nodes have only a few links. blog readers that the facts quoted are trustwor- Nature, rroran thy? Web science can provide ways to check this e: “ E e: c so-called provenance of information, while of- ási et al., in b );sour fering practical rules about conditions sur- rounding its reuse. Daniel Weitzner’s Transpar- ent Accountable Datamining Initiative at M.I.T. Berners-Lee ászló Bara L is doing just that. ert- b l , A g Rise of Semantics Jeon g

); Donna Coveney ( One emerging phenomenon that is benefiting

oon from concerted research is the rise of the Seman- w a Shadbolt tic Web—a network of data on the Web. Among ert, H b kner (

l many payoffs, the Semantic Web promises to c give much more targeted answers to our ques- éka A Exponential scale-free y R b Jason Bu tions. Today searching Google for “Toyota used

www.SciAm.com © 2008 SCIENTIFIC AMERICAN, INC. SCIENTIFIC AMERICAN 79 blogosphere has certain patterns of power. Matthew Hurst tracked how blogs link to one another. A visualization of the result (left) displays each blog as a white dot; the few large dots are massively popular sites. Blogs that share numerous cross citations form distinct communi- ties (purple). Isolated groups that communicate frequently among themselves but rarely with oth- ers appear as straight lines along the outer edges.

in Moscow or the names of all the mayors of towns in the U.S. that are at an altitude greater than 1,000 meters [see box below] and get back an exact answer. Naturally, we would like a similar tool for

the entire Web, but developing one would re- ) quire that more and more data on the Web were represented as linked sets of RDF. Meanwhile it is becoming apparent that DBpedia’s link struc- ture obeys the same power laws that have been blogospherevisualization

found for the Web. Just as some pages have a ) ( higher rank in the Web of documents, so it will be for data on the Semantic Web. At the same time, research by Oded Nov of the Polytechnic As these definitions grow and interlink, spe- Institute of New York University is beginning to cialists and enthusiasts will define taxonomies ascertain why Wikipedians post entries and and ontologies: data sets that describe classes of what motivates their activity; the psychological objects and relations among them. These sets drivers that are revealed will help us understand

will help computers everywhere to find, under- how to encourage people to contribute to the Se- http://datamining.typepad.com/data_mining ( stand and present targeted information. mantic Web. Numerous groups are already building Se- mantic Web frameworks, especially in biology Future Challenges and health care [see “The Semantic Web in Ac- It seems sensible to say that Web science can MicrosoftLiveLabs urst tion,” by Lee Feigenbaum; Scientific Ameri- help us engineer a better Web. Of course, we do H can, December 2007]. More than 1,000 people not fully know what Web science is, so part of w

attended the Semantic Technology conference the new discipline should be to find the most Matthe in San Jose, Calif., this past May. Web science powerful concepts that offers the prospect of creating more powerful will help the science ways to define, link and interpret data. The wiki world offers a good example of how [case study] useful such exploitation of can be. As of May, Wikipedia, the online encyclopedia Tennis, Anyone? generated by people everywhere, had more than Web science is generating tools that can understand online 2.3 million articles in English. The articles con- data (known as the Semantic Web) and can therefore pro- tain regular text, along with infobox tem- vide highly targeted search results. One effort, DBpedia, plates—sets of facts. More than 700,000 Eng- can assess the information in infoboxes on Wikipedia pag- lish infobox templates now exist, and program- es. For example, to find all tennis players from Moscow, a mers are looking for ways to mine them. One user fills in a simple form (below) and gets a short, effort is the DBpedia project, begun by Chris Bi- exact list as a result (right). zer and his colleagues at the Free University of Berlin and the University of Leipzig in Germa- ny. They have devised a tool by the same name (available at http://wikipedia.aksw.org) that uses Semantic Web techniques to query the in- foboxes. It can ask for all tennis players who live

80 SCIENTIFIC AMERICAN © 2008 SCIENTIFIC AMERICAN, INC. October 2008 itself grow. Perhaps insights will come from the second life and other virtual worlds are constructed by many enthusiasts who each work’s interdisciplinary nature. For example, contribute small pieces, such as stores and products that constitute a virtual mall biological concepts such as plasticity might where real people can profit from online avatars that shop there. Such creations prove useful. The brain and nervous system raise complicated questions about who owns what intellectual property—questions grow and adapt over our lifetimes by forming Web science hopes to answer. and deleting connections between neurons—the brain cells that act as nodes in our neural net- Sociology is another field to tap. Research is work. Changes in the connections occur in needed, for instance, to provide Web users with response to activity in the network, including better ways of determining whether material on learning, disuse and aging. a site can be trusted. How can we determine Similarly, Web connections decay and grow. whether we can trust the material emanating Web science could also explore the possibility of from a site? The Web was originally conceived protocols that disconnect Web nodes if there is as a tool for researchers who trusted one anoth- no inbound or outbound activity. Would such a er implicitly; strong models of security were not network function more effectively? built in. We have been living with the conse- Concepts such as population dynamics, food ➥ m ore to quences ever since. chains, and consumers and producers all have explore As a result, substantial research should be counterparts on the Web. Perhaps methods and devoted to engineering layers of trust and prov- models devised for ecology can help us under- Exploring Complex Networks. enance into Web interactions. The coming to- Steven H. Strogatz in Nature, Vol. 410, stand the Web’s digital ecosystem, which could pages 268–276; March 8, 2001. gether of our digital and physical personas pres- be prone to damage by single major events (anal- ents opportunities for progress, such as the in- ogous to hurricanes) or subtle but steady ero- The Semantic Web Revisited. Nigel tegration of financial, medical, social and sions (like invasive species). Shadbolt, Tim Berners-Lee and educational services for each of us. But it is also We will also need to examine a range of legal Wendy T. Hall in IEEE Intelligent an opportunity for identity theft, cyberstalking Systems, Vol. 21, No. 3, pages issues. Laws relating to intellectual property 96–101; May/June 2006. and cyberbullying, and digital espionage. Web and copyright for digital material are already science can help enhance the good and amelio- being debated. Fascinating issues have arisen in Creating a Science of the Web. rate the bad. virtual environments such as Second Life; for Tim Berners-Lee, Wendy T. Hall, Various other questions need to be tackled example, do laws and entitlements transfer to James W. Hendler, Nigel Shadbolt and before the rich potential of the Web can be Daniel Weitzner in Science, Vol. 313, digital worlds, where millions of people contrib- pages 769–771; August 11, 2006. mined to its fullest. How do social norms influ- ute tiny additions to existing content? Another ence emerging capabilities? How can online pri- issue is whether we can build rules of use into Google’s PageRank and Beyond: vacy protection, intellectual-property rights content itself. An example of such a framework, The Science of Search Engine and security be implemented? What trends called Creative Commons, lets authors, scien- Rankings. Amy N. Langville and could fragment the Web? Carl D. Meyer. Princeton University tists, artists and educators easily mark their cre- Press, 2006. Many people are working on parts of these ative work with the freedoms and restrictions questions. Web science can bring their efforts they want it to carry. Crucially, the mark also Web Science: An Interdisciplinary together and compound the insights. We need provides RDF data that describe the license, Approach to Understanding the to train a cadre of researchers, developers, prac- making it easy to automatically locate works Web. , Nigel Shadbolt, titioners and users in a broad range of skills and Wendy T. Hall, Tim Berners-Lee and b and understand their conditions of use. Web sci- Daniel Weitzner in Communications of subjects. They will help us fully understand the

en la ence could determine whether commons-style the ACM, Vol. 51, No. 1, pages 60–69; Web and discover how to engineer it for the 21st d

lin licenses affect the spread of the information. July 2008. century and beyond. n

www.SciAm.com © 2008 SCIENTIFIC AMERICAN, INC. SCIENTIFIC AMERICAN 81