Hawking, David. "Web Search Engines: Part 1." Computer 39, no. 6 (2006): 86-88.
- large search engines operate out of geographically distributed data centers for redundancy
- there are hundreds of thousands of servers at these centers
- within each data center groups of servers can be dedicated to specific functions, such as web crawling
- large scale replication is necessary
- the simplest crawling algorithm uses a queue of URLs and a mechanism to determine whether it has already seen a URL (see the sketch after this list)
- crawling algorithms must address speed, politeness, excluded content, duplicate content, continuous crawling and spam rejection
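The queue-plus-seen-set idea above can be sketched in a few lines of Python. This is a minimal illustration rather than anything Hawking specifies: the seed URL, page limit, and regex-based link extraction are assumptions, and politeness, robots.txt handling, duplicate-content detection, and spam rejection are all omitted.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_url, max_pages=50):
    """Breadth-first crawl: a queue of frontier URLs plus a 'seen' set
    so each URL is fetched at most once."""
    frontier = deque([seed_url])
    seen = {seed_url}
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip unreachable or non-text resources
        pages[url] = html

        # Naive link extraction; a real crawler would use an HTML parser
        # and honour robots.txt, crawl delays, and duplicate-content checks.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# Example (hypothetical seed URL):
# pages = crawl("http://example.com/", max_pages=10)
```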
Hawking, David. "Web Search Engines: Part 2." Computer 39, no. 8 (2006): 88-90.
- search engines use an inverted file to rapidly identify indexing terms
- an inverted file is a concatenation of the posting lists for each term
- indexers create inverted files in two phases
- scanning - indexer scans the text of each input document
- inversion - indexer sorts the files into term number order
- real indexers have to deal with scaling, term lookup, compression, searching phrases, anchor texts, link popularity scores, and query-independent scoring
- query processing algorithms
- the query processor looks up each query term and locates its posting list (a sketch of the two-phase indexing and the posting-list lookup follows this list)
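The following sketch ties the two phases together with the query-time lookup. The toy document set, whitespace tokenization, in-memory sort, and AND-style intersection are assumptions made for illustration; real indexers compress posting lists and also handle phrases, anchor text, link popularity, and query-independent scores.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Two-phase indexing.
    Phase 1 (scanning): read each document and emit (term, doc_id) postings.
    Phase 2 (inversion): sort the postings into term order so each term's
    posting list is stored contiguously, as in an inverted file."""
    # Phase 1: scanning
    postings = []
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            postings.append((term, doc_id))

    # Phase 2: inversion
    postings.sort()
    inverted = defaultdict(list)
    for term, doc_id in postings:
        if not inverted[term] or inverted[term][-1] != doc_id:
            inverted[term].append(doc_id)
    return inverted

def process_query(inverted, query):
    """Look up each query term's posting list and intersect them (AND query)."""
    lists = [set(inverted.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*lists)) if lists else []

docs = ["web search engines crawl the web",
        "inverted files store posting lists",
        "search engines use inverted files"]
index = build_inverted_file(docs)
print(process_query(index, "inverted files"))   # -> [1, 2]
print(process_query(index, "search engines"))   # -> [0, 2]
```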
Shreeves, Sarah, Thomas G. Habing, Kat Hagedorn, and Jeffrey A. Young. "Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting." Library Trends 53, no. 4 (2005): 576-589.
- the OAI Protocol for Metadata Harvesting (OAI-PMH) is a tool developed to facilitate interoperability between different collections of metadata based on common standards
- the OAI world is divided into data providers (repositories) and service providers (harvesters); see the harvesting sketch after this list
- OAI requires data providers to expose metadata in at least unqualified Dublin Core
- the Protocol can provide access to parts of the "invisible Web" that are not easily accessible to search engines
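To make the data-provider/service-provider split concrete, here is a minimal harvesting sketch in Python. The ListRecords verb, the oai_dc metadata prefix, and resumptionToken paging are part of OAI-PMH itself; the repository base URL is hypothetical, and error handling and rate limiting are omitted.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def harvest_titles(base_url, metadata_prefix="oai_dc"):
    """Act as a service provider: issue the ListRecords verb against a data
    provider's OAI-PMH base URL and collect Dublin Core titles, following
    resumptionTokens until the full result set has been retrieved."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    titles = []
    while True:
        xml = urlopen(base_url + "?" + urlencode(params), timeout=30).read()
        root = ET.fromstring(xml)
        for record in root.iter(OAI_NS + "record"):
            title = record.find(".//" + DC_NS + "title")
            if title is not None and title.text:
                titles.append(title.text)
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # Subsequent requests carry only the verb and the resumptionToken.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
    return titles

# Example (hypothetical repository base URL):
# print(harvest_titles("https://repository.example.edu/oai"))
```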
Bergman, Michael K. "The Deep Web: Surfing Hidden Value." Journal of Electronic Publishing 7, no. 1 (2001).
- deep web sources store their content in searchable databases that only produce results dynamically in response to a direct request (illustrated in the sketch after this list)
- deep web is much larger than the "surface" web
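A small sketch of why such content stays hidden: the result page exists only as the dynamic answer to a query submitted to the database, so there is no static, link-reachable URL for a conventional crawler to discover. The endpoint and form fields below are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The page behind this URL is generated on demand in response to the query;
# a link-following crawler never issues such a request, so it never sees it.
query = urlencode({"q": "renewable energy", "results_per_page": "20"})
request = Request("https://catalog.example.org/search?" + query)
print(request.full_url)
```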