World Wide Web: A Digital Library
By: Sanasam Ranbir Singh *
Table 1: Estimated coverage of each search engine (average over 575 queries performed from 15 to 17 December 1997) [10].

Because its growth is decentralized, the Web has been widely believed to lack structure and organization as a whole. Analysis of the Web’s network of hyperlinks has nevertheless revealed an intricate structure that is proving valuable for organizing information, improving search methods and understanding the Web in a broader technological and social context. HITS [2] and SALSA [3] are two widely used algorithms based on hyperlink structure, and they have been found to give surprisingly good results. PHITS [4] is a probabilistic follow-up to HITS, and PageRank [1] is another widely used link-based ranking method. Studies such as spectral filtering [7] and HyCon [6] find that combining the content and the link structure of the Web gives better results. Our recent study on confidence-based web search [5] obtains surprisingly good results by monitoring users’ web page access records.

Precise classification of web pages into categories is a key factor in web indexing, which plays a major role in improving search performance and returning quality pages. LSA (latent semantic analysis) [9], PLSA (probabilistic LSA) and k-nearest neighbor are some of the algorithms used to classify web pages.

A recent study [8] indicates that the Web contains a large, strongly connected core in which every page can reach every other by a path of hyperlinks. This core contains most of the prominent sites on the Web. The remaining pages can be characterized by their relation to the core: upstream nodes can reach the core but cannot be reached from it, downstream nodes can be reached from the core but cannot reach it, and “tendrils” contain nodes that can neither reach the core nor be reached from it.

Searching this huge library for pages relevant to a specific topic is complex because the Web is unstructured and dynamic.
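The core, upstream, downstream and tendril categories described above can be illustrated on a small graph. The following Python sketch is only an illustration under stated assumptions: the function names `reachable` and `classify_pages` and the toy link list are hypothetical and do not come from the study in [8], and the core is found with a simple quadratic reachability check rather than the methods used on real Web crawls.

```python
def reachable(start_nodes, adjacency):
    """Return all nodes reachable from start_nodes, including the starts."""
    seen = set(start_nodes)
    stack = list(start_nodes)
    while stack:
        node = stack.pop()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def classify_pages(links):
    """Label each page relative to the largest strongly connected core.

    links is an iterable of (source, target) hyperlink pairs.
    """
    forward, backward, pages = {}, {}, set()
    for source, target in links:
        forward.setdefault(source, set()).add(target)
        backward.setdefault(target, set()).add(source)
        pages.update((source, target))

    # Find the largest strongly connected component. A page's component is
    # the set of pages it can reach that can also reach it. This is a simple
    # quadratic approach, fine for a toy graph but not for the real Web.
    core, assigned = set(), set()
    for page in sorted(pages):
        if page in assigned:
            continue
        component = reachable([page], forward) & reachable([page], backward)
        assigned |= component
        if len(component) > len(core):
            core = component

    from_core = reachable(core, forward)   # core plus everything it reaches
    to_core = reachable(core, backward)    # core plus everything reaching it

    labels = {}
    for page in pages:
        if page in core:
            labels[page] = "core"
        elif page in to_core:
            labels[page] = "upstream"
        elif page in from_core:
            labels[page] = "downstream"
        else:
            labels[page] = "tendril"
    return labels


# Hypothetical toy link graph, not real Web data.
toy_links = [
    ("a", "b"), ("b", "c"), ("c", "a"),  # a, b, c link in a cycle
    ("u", "a"),                           # u links into the cycle
    ("c", "d"),                           # the cycle links out to d
    ("u", "t"), ("x", "d"),               # t and x touch only non-core pages
]
labels = classify_pages(toy_links)
for page in sorted(labels):
    print(page, labels[page])
```

In this toy graph, pages a, b and c form the core, u is upstream, d is downstream, and t and x fall into the tendril category because they neither reach the core nor are reached from it.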
Available search engines can be broadly classified, by the mechanism they use, into two groups: content-based search engines and citation-based search engines. Content-based search engines use only document content, which they match against keywords. The user’s query topic serves as the keyword, and the engine finds the documents on the Web that contain it. Some full-text content-based search engines are Lycos (www.lycos.com), Alta Vista (www.altavista.digital.com), Excite (www.excite.com), HotBot (www.hotbot.com), Infoseek (www.infoseek.com) and Northern Light (www.nlsearch.com). Several studies find that content-based search engines often return poor-quality results. Search engines based on hyperlink structure use citations as well as page content and give better results than content-based search engines. Google (www.google.com) and Clever are examples of hyperlink-based search engines.

Better web mining requires work on many issues: extracting features from pages, the organizational structure of the Web, identifying communities of pages, crawling the Web, large-scale search engines and their architecture, web structure, personalized web search, page ranking methods, optimizing web structure, web indexing and so on. Because ample resources are available for Web research, many researchers are attracted to this area.

References:
[1] Sergey Brin, Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. In Proc. 7th Int. World Wide Web Conf., 1998.
[2] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46, 1999.
[3] R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. In 9th Int. WWW Conference, Amsterdam, Netherlands, May 2000.
[4] D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. Preprint, 2000.
[5] D. Mukhopadhyay, D. Giri, S. R. Singh. A Confidence Based Methodology to Deduce User Oriented Page Ranking in Searching the Web. Preprint, 2003.
[6] D. Mukhopadhyay, S. R. Singh. HyCon: A Hyperlink and Content based Topic Search Technique. Preprint, 2003.
[7] S. Chakrabarti, B. Dom, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins. Spectral filtering for resource discovery. Preprint, 1998.
[8] Jon Kleinberg, Steve Lawrence. The structure of the Web. Science, Vol. 294, 30 November 2001.
[9] S. Deerwester et al. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990.
[10] Steve Lawrence and C. Lee Giles. Searching the World Wide Web. Science, Vol. 280, 3 April 1998. www.science.com.

Sanasam Ranbir Singh is a scholar at the Dept. of Comp. Sc. & Engg., Haldia Institute of Technology, West Bengal, India.