DSpace Repository

What is the invisible web? A crawler perspective


dc.creator Arroyo, Natalia
dc.date 2008-05-16T09:22:32Z
dc.date 2004
dc.date.accessioned 2017-01-31T01:20:45Z
dc.date.available 2017-01-31T01:20:45Z
dc.identifier AoIR-ASIST 2004 Workshop on Web Science Research Methods, Brighton (UK)
dc.identifier http://hdl.handle.net/10261/4297
dc.identifier.uri http://dspace.mediu.edu.my:8181/xmlui/handle/10261/4297
dc.description The invisible Web, also known as the deep Web or dark matter, is an important problem for Webometrics because of difficulties of conceptualization and measurement. The invisible Web has been defined as the part of the Web that cannot be indexed by search engines, including databases and dynamically generated pages. Some authors have recognized that this is a rather subjective concept that depends on the point of view of the observer: what is visible to one observer may be invisible to others. In the generally accepted definition of the invisible Web, only the point of view of search engines has been taken into account. Search engines are considered to be the eyes of the Web, both for measuring and for searching. In addition to commercial search engines, other tools have also been used for quantitative studies of the Web, such as commercial and academic crawlers. Commercial crawlers are programs developed by software companies for purposes other than Webometrics, such as Web site management, but they can also be used for crawling Web sites and reporting on their characteristics (size, hypertext structure, embedded resources, etc.). Academic crawlers are programs developed by academic institutions to measure Web sites for Webometric purposes. In this paper, Sherman and Price’s “truly invisible Web” is studied from the point of view of crawlers. The truly invisible Web consists of pages that cannot be indexed for technical reasons. Crawler parameters differ significantly from those of search engines, because different design purposes result in different technical specifications. In addition, previous investigations have demonstrated large differences among crawlers in their coverage of the Web.
Both aspects are clarified through an experiment in which different Web sites, including diverse file formats and built with different types of Web programming, are analyzed, on a set date, with seven commercial crawlers (Astra SiteManager, COAST WebMaster, Microsoft Site Analyst, Microsoft Content Analyzer, WebKing, Web Trends and Xenu) and an academic crawler (SocSciBot). Each Web site had previously been copied to a hard disk, using a file-retrieving tool, in order to compare the copies with the data obtained by the crawlers. The results are reported and analyzed in detail to produce a definition and classification of the invisible Web for commercial and academic crawlers.
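The crawler's-eye view of invisibility sketched in the abstract can be illustrated with a minimal, self-contained example (the site here is a toy in-memory dictionary, and the page names are hypothetical): a link-following crawler discovers only pages reachable through static anchor links, so a database-backed results page that sits behind a form submission is never found, however many times the site is crawled.

```python
import re
from collections import deque

# Toy site: each URL maps to its HTML. The search-results page is
# reachable only by submitting a form, not through any <a href> link.
SITE = {
    "/index.html": '<a href="/about.html">About</a> <a href="/catalog.html">Catalog</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/catalog.html": '<form action="/search?q=term"></form>',
    "/search?q=term": "dynamically generated results from a database",
}

def crawl(start):
    """Breadth-first crawl that follows only static <a href> links."""
    seen, queue = set(), deque([start])
    while queue:
        url = queue.popleft()
        if url in seen or url not in SITE:
            continue
        seen.add(url)
        # Only anchor hrefs are followed; form actions and script-built
        # URLs are ignored, which is what leaves database-backed pages
        # in the "truly invisible" Web from this crawler's perspective.
        for link in re.findall(r'<a href="([^"]+)"', SITE[url]):
            queue.append(link)
    return seen

visible = crawl("/index.html")
invisible = set(SITE) - visible  # {"/search?q=term"}
```

A real crawler adds politeness delays, robots.txt handling, and HTML parsing, but the reachability logic — and hence what it can and cannot see — is the same.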
dc.description Peer reviewed
dc.format 33761 bytes
dc.format application/pdf
dc.language eng
dc.rights openAccess
dc.subject Invisible web
dc.subject Crawlers
dc.subject Cybermetrics
dc.subject Web invisible
dc.subject Cybermetría
dc.title What is the invisible web? A crawler perspective
dc.type Presentation


Files in this item


There are no files associated with this item.
