2011-11-07
80legs - Custom Web Crawlers, Powerful Web Crawling, and Data Extraction
80legs offers powerful web crawling. Extract data from web pages, images, and any other online content. Start crawling websites now faster, easier, and with unlimited reach.
2011-11-07
CommonCrawl
Common Crawl produces and maintains a repository of web crawl data that is openly accessible to everyone. The crawl currently covers 5 billion pages and the repository includes valuable metadata. The crawl data is stored by Amazon’s S3 service, allowing it to be bulk downloaded as well as directly accessed for map-reduce processing in EC2. This makes wholesale extraction, transformation, and analysis of web data cheap and easy. Small startups or even individuals can now access high quality crawl data that was previously only available to large search engine corporations.
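Because the crawl lives in a public S3 bucket, a few lines of Python are enough to start pulling it down for local or EC2-side processing. The sketch below uses boto3 with anonymous access; the bucket name and key prefix are assumptions about the current layout rather than details from the description above, so check the Common Crawl site for the paths of a specific crawl.

```python
# Minimal sketch: browse and fetch Common Crawl data from its public S3 bucket.
# Bucket name ("commoncrawl") and the "crawl-data/" prefix are assumptions;
# the actual layout varies by crawl release.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous access: the data is public, so no AWS credentials are needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# List a handful of objects under one (hypothetical) crawl prefix.
resp = s3.list_objects_v2(Bucket="commoncrawl", Prefix="crawl-data/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Bulk download a single archive segment for local extraction/analysis, e.g.:
# s3.download_file("commoncrawl", "crawl-data/<crawl>/segments/<segment>.warc.gz",
#                  "segment.warc.gz")
```

The same keys can be read directly from EC2 jobs (for example Hadoop/map-reduce over s3:// paths), which is what keeps wholesale processing of the corpus cheap.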
2010-11-22
Introduction to Information Retrieval
The book aims to provide a modern approach to information retrieval from a computer science perspective. It is based on a course we have been teaching in various forms at Stanford University and at the University of Stuttgart.
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze
2010-11-12
A Survey of Eigenvector Methods for Web Information Retrieval
Web information retrieval is significantly more challenging than traditional information retrieval over small, well-controlled document collections. One main difference between traditional information retrieval and Web information retrieval is the Web's hyperlink structure. This structure has been exploited by several of today's leading Web search engines, particularly Google and Teoma. In this survey paper, we focus on Web information retrieval methods that use eigenvector computations, presenting the three popular methods of HITS, PageRank, and SALSA.
Amy N. Langville and Carl D. Meyer
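For a concrete sense of what an eigenvector method looks like here, the sketch below is a small power-iteration version of PageRank: the score vector it converges to is the dominant eigenvector of the link-derived transition matrix. The damping factor, tolerance, and toy graph are illustrative choices, not values taken from the survey.

```python
# Minimal power-iteration sketch of PageRank (one of the eigenvector methods
# the survey discusses, alongside HITS and SALSA). Parameters are illustrative.
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=100):
    """adj[i][j] = 1 if page i links to page j."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1)[:, None]
    # Row-stochastic transition matrix; dangling pages spread rank uniformly.
    P = np.where(out > 0, A / np.where(out == 0, 1.0, out), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # One power-iteration step on the damped ("Google") matrix.
        r_next = damping * (P.T @ r) + (1 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Tiny 3-page example: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
print(pagerank([[0, 1, 0],
                [0, 0, 1],
                [1, 1, 0]]))
```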