Hawking, David. "Web Search Engines: Part 1." Computer 39, no. 6 (2006): 86-88.
- large search engines operate out of geographically distributed data centers for redundancy
- there are hundreds of thousands of servers at these centers
- within each data center groups of servers can be dedicated to specific functions, such as web crawling
- large scale replication is necessary
- the simplest crawling algorithm uses a queue of URLs and a mechanism to determine whether it has already seen a URL (see the sketch after this list)
- crawling algorithms must address speed, politeness, excluded content, duplicate content, continuous crawling and spam rejection
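The queue-plus-seen-set idea above can be sketched in a few lines of Python. This is a minimal illustration rather than anything Hawking specifies: the seed URL, page limit, and regex-based link extraction are assumptions, and politeness, robots.txt handling, duplicate-content detection, and spam rejection are all omitted.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_url, max_pages=50):
    """Breadth-first crawl: a queue of frontier URLs plus a 'seen' set
    so each URL is fetched at most once."""
    frontier = deque([seed_url])
    seen = {seed_url}
    pages = {}

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # skip unreachable or non-text resources
        pages[url] = html

        # Naive link extraction; a real crawler would use an HTML parser
        # and honour robots.txt, crawl delays, and duplicate-content checks.
        for href in re.findall(r'href="([^"#]+)"', html):
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages

# Example (hypothetical seed URL):
# pages = crawl("http://example.com/", max_pages=10)
```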
Hawking, David. "Web Search Engines: Part 2." Computer 39, no. 8 (2006): 88-90.
- search engines use an inverted file to rapidly identify indexing terms
- an inverted file is a concatenation of the posting lists for each term
- indexers create inverted files in two phases
- scanning - indexer scans the text of each input document
- inversion - indexer sorts the files into term number order
- real indexers have to deal with scaling, term lookup, compression, searching phrases, anchor texts, link popularity scores, and query-independent scoring
- query processing algorithms
- the query processor looks up each query term and locates its posting list (a sketch of the two-phase indexing and the posting-list lookup follows this list)
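The following sketch ties the two phases together with the query-time lookup. The toy document set, whitespace tokenization, in-memory sort, and AND-style intersection are assumptions made for illustration; real indexers compress posting lists and also handle phrases, anchor text, link popularity, and query-independent scores.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Two-phase indexing.
    Phase 1 (scanning): read each document and emit (term, doc_id) postings.
    Phase 2 (inversion): sort the postings into term order so each term's
    posting list is stored contiguously, as in an inverted file."""
    # Phase 1: scanning
    postings = []
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            postings.append((term, doc_id))

    # Phase 2: inversion
    postings.sort()
    inverted = defaultdict(list)
    for term, doc_id in postings:
        if not inverted[term] or inverted[term][-1] != doc_id:
            inverted[term].append(doc_id)
    return inverted

def process_query(inverted, query):
    """Look up each query term's posting list and intersect them (AND query)."""
    lists = [set(inverted.get(term, [])) for term in query.lower().split()]
    return sorted(set.intersection(*lists)) if lists else []

docs = ["web search engines crawl the web",
        "inverted files store posting lists",
        "search engines use inverted files"]
index = build_inverted_file(docs)
print(process_query(index, "inverted files"))   # -> [1, 2]
print(process_query(index, "search engines"))   # -> [0, 2]
```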
Shreeves, Sarah, Thomas G. Habing, Kat Hagedorn, and Jeffrey A. Young. "Current Developments and Future Trends for the OAI Protocol for Metadata Harvesting." Library Trends 53, no. 4 (2005): 576-589.
- the OAI Protocol for Metadata Harvesting (OAI-PMH) is a tool developed to facilitate interoperability between different collections of metadata based on common standards
- the OAI world is divided into data providers (repositories) and service providers (harvesters); see the harvesting sketch after this list
- OAI requires data providers to expose metadata in at least unqualified Dublin Core
- the Protocol can provide access to parts of the "invisible Web" that are not easily accessible to search engines
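To make the data-provider/service-provider split concrete, here is a minimal harvesting sketch in Python. The ListRecords verb, the oai_dc metadata prefix, and resumptionToken paging are part of OAI-PMH itself; the repository base URL is hypothetical, and error handling and rate limiting are omitted.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"

def harvest_titles(base_url, metadata_prefix="oai_dc"):
    """Act as a service provider: issue the ListRecords verb against a data
    provider's OAI-PMH base URL and collect Dublin Core titles, following
    resumptionTokens until the full result set has been retrieved."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    titles = []
    while True:
        xml = urlopen(base_url + "?" + urlencode(params), timeout=30).read()
        root = ET.fromstring(xml)
        for record in root.iter(OAI_NS + "record"):
            title = record.find(".//" + DC_NS + "title")
            if title is not None and title.text:
                titles.append(title.text)
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # Subsequent requests carry only the verb and the resumptionToken.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}
    return titles

# Example (hypothetical repository base URL):
# print(harvest_titles("https://repository.example.edu/oai"))
```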
Bergman, Michael K. "The Deep Web: Surfing Hidden Value." Journal of Electronic Publishing 7, no. 1 (2001).
- deep web sources store their content in searchable databases that only produce results dynamically in response to a direct request (illustrated in the sketch after this list)
- deep web is much larger than the "surface" web
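A small sketch of why such content stays hidden: the result page exists only as the dynamic answer to a query submitted to the database, so there is no static, link-reachable URL for a conventional crawler to discover. The endpoint and form fields below are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# The page behind this URL is generated on demand in response to the query;
# a link-following crawler never issues such a request, so it never sees it.
query = urlencode({"q": "renewable energy", "results_per_page": "20"})
request = Request("https://catalog.example.org/search?" + query)
print(request.full_url)
```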